DeepScience

DeepScience — Artificial Intelligence

Artificial Intelligence · Daily Digest

May 25, 2026

285 Papers · 11/11 Roadblocks Active

⚡ Signal of the Day

• Reasoning reliability is today's sharpest signal: multiple papers independently show that AI models can look right while reasoning wrong — representations converge across model families but reasoning processes diverge, overthinking traps models in wrong trajectories, and diagnostic benchmarks reveal correct final answers masking broken reasoning chains.

• The Convergence Without Understanding result is particularly consequential for AI safety: models agree most on problems they all fail, and post-decision representations scatter even when pre-decision ones align — meaning ensemble agreement and representation similarity are unreliable proxies for correctness.

• Watch the pairing of process-aware evaluation (DDX-TRACE, Co-ReAct) with inference-time reasoning control (Dynamic Closed-Loop Steering): the field is quietly shifting from measuring WHAT models answer to auditing HOW they reason, which will reshape benchmarking standards over the next year.

📄 Top 10 Papers

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Across 16 language models from 8 families, internal representations converge strongly before a model commits to an answer but diverge sharply afterward: similarity drops from 0.875 to 0.274 once the decision is made. More troubling, models converge more on problems they all get wrong (CKA=0.897) than on problems they solve correctly (CKA=0.830). Shared internal structure between AI systems is therefore not a signal of correctness, which directly undermines the assumption that model agreement or ensemble diversity indicates reliable reasoning.

██████████ 0.9 reasoning-reliability Preprint
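
For readers unfamiliar with the CKA scores quoted above, here is a minimal sketch of linear centered kernel alignment in plain Python. It assumes the linear-kernel variant and row-per-example activation matrices; the digest does not say which CKA variant, layers, or preprocessing the paper uses, and the function names are illustrative.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, summed one column pair at a time."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    Assumption: linear kernel. Returns a value in [0, 1]; 1 means the two
    representations match up to rotation and isotropic scaling.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    if denominator == 0.0:
        raise ValueError("CKA is undefined for constant (zero-variance) features")
    return numerator / denominator
```

A score near 1 only says that two models organize the same inputs similarly. As the paper's result illustrates, it says nothing about whether either model is right.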

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

When 18 vision-language models are tested on tasks requiring spatial number understanding (such as counting positions from movement sequences or mapping coordinates to visual layouts), performance falls to near random chance in both dynamic and static settings. Models exploit shallow surface cues rather than building stable coordinate-aware spatial representations. This exposes a fundamental capability gap that matters directly for robotics and embodied AI, where spatial quantity reasoning is non-negotiable.

██████████ 0.9 multimodal-understanding Preprint

Dynamic Closed-Loop Steering for Robust and Interpretable System-2 Reasoning in Large Language Models

Unconstrained test-time compute scaling causes LLMs to 'overthink' — getting trapped in high-entropy incorrect trajectories — while existing static hidden-state interventions distort model geometry and cause semantic drift. This paper's Dynamic Closed-Loop Steering uses dual real-time sensors (entropy tracking and logit-margin convergence detection) to apply targeted steering only when needed, achieving a Pareto improvement over raw compute scaling on MATH500 and AIME without any model retraining. The result shows that adaptive inference-time control is more efficient and safer than simply giving models more compute to think.

██████████ 0.9 reasoning-reliability Peer-reviewed
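
The two sensors named above, entropy tracking and logit margin, can be sketched in a few lines. This is an illustration only: the gating rule, the thresholds, and the function names are assumptions, and the digest does not describe how the paper combines the signals or what steering it applies once triggered.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of next-token logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def logit_margin(logits):
    """Gap between the largest and second-largest logit."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second

def should_steer(logits, entropy_threshold=1.0, margin_threshold=0.5):
    """Hypothetical gate: intervene only when the model looks uncertain,
    i.e. entropy is high AND no single token clearly leads.
    Both thresholds are made-up defaults, not values from the paper."""
    return (entropy(logits) > entropy_threshold
            and logit_margin(logits) < margin_threshold)
```

The point of a closed-loop gate like this is that confident steps pass through untouched, so any intervention cost is paid only on the steps where the sensors fire.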

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

Using 211 physician-adjudicated neuroradiology cases with 1,609 images, this benchmark forces models to sequentially request evidence before diagnosing — like a real clinician — rather than seeing everything at once. Models frequently guess plausible diagnoses without requesting essential evidence, and request imaging studies they then misinterpret. The finding that final-diagnosis accuracy scores substantially misrepresent diagnostic reasoning quality is a direct warning for anyone deploying or evaluating medical AI on standard accuracy metrics.

██████████ 0.9 reasoning-reliability Preprint

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Current multimodal large language models systematically fail to generate accurate and insightful descriptions of complex charts from scientific papers, even though such charts are their natural habitat. The benchmark covers 896 chart-description pairs across 14 chart types, evaluated on four dimensions: factual accuracy, salient feature coverage, domain-informed guidance, and chart-text complementarity. Existing benchmarks used homogeneous, simple charts with shallow descriptions, meaning the field has been measuring an easier version of the problem than it actually needs to solve.

██████████ 0.9 hallucination-grounding Preprint

ETCHR: Editing To Clarify and Harness Reasoning

Rather than asking a single multimodal model to both perceive and reason, ETCHR trains a dedicated image editor to transform input images in ways that make reasoning easier — highlighting relevant regions, simplifying spatial layouts — decoupled from the downstream understanding model. A two-stage training process (imitation learning on edit trajectories, then reinforcement learning with VLM-derived rewards) closes both the language gap (mapping abstract questions to visual transforms) and the generation gap (edit quality degrading with reasoning depth). The plug-and-play design works across multiple VLM families and nine benchmarks, showing the approach generalizes beyond a single model's quirks.

██████████ 0.9 multimodal-understanding Preprint

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

ReAct-style agents tend to take shallow, redundant actions because they rely on internal judgment with no explicit criteria for what constitutes a good next step. Co-ReAct trains a rubric generator — using a ranking-based reward against expert consensus — to inject step-level criteria into the agent context before each tool call, with a verifier that triggers regeneration if criteria are unmet. Evaluated on DeepResearchBench and SQA-CS-V2 across both open-source and closed-source base models, this step-level guidance consistently outperforms post-hoc and training-time rubric alternatives.

██████████ 0.8 agent-tool-use Preprint
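
The control flow described above (criteria injected before each tool call, plus a verifier that can trigger regeneration) might look roughly like the sketch below. All four callables and the retry budget are hypothetical stand-ins. The sketch also assumes that it is the proposed action that gets regenerated, which the digest does not specify.

```python
from typing import Callable, List, Tuple

def guided_step(
    context: str,
    make_rubric: Callable[[str], List[str]],
    propose_action: Callable[[str, List[str]], str],
    meets_rubric: Callable[[str, List[str]], bool],
    max_retries: int = 2,
) -> Tuple[str, List[str]]:
    """One agent step with step-level rubric guidance.

    make_rubric    : stands in for the trained rubric generator
    propose_action : stands in for the ReAct agent choosing a tool call
    meets_rubric   : stands in for the verifier
    """
    # Criteria are produced BEFORE the agent acts, not applied post hoc.
    rubric = make_rubric(context)
    action = propose_action(context, rubric)
    for _ in range(max_retries):
        if meets_rubric(action, rubric):
            break
        # Verifier rejected the action: ask the agent for another attempt.
        action = propose_action(context, rubric)
    return action, rubric
```

If the retry budget runs out, this sketch returns the last attempt as-is; a real system would need its own policy for that case.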

Agentic Proving for Program Verification

Claude, operating agentically with Lean 4 formal verification tools, generates arguably valid specifications for 98.8% of problems in the CLEVER benchmark and certifies implementations against correct specifications for 87.5% of problems. This demonstrates that current frontier LLMs can automate the most human-intensive part of formal verification — writing the specification itself — not just the proof search. The result is significant because specification writing has historically been the primary bottleneck preventing formal methods from scaling to real software development.

██████████ 0.8 reasoning-reliability Preprint

Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

Diffusion-based robotic manipulation policies face a trade-off between expensive high-performance models and cheap ones that generalize poorly; existing mixture-of-experts routing based on low-level noise statistics fragments reusable behaviors. SMoDP routes computation to specialized expert networks based on semantic task structure identified by VLM annotations, with a lightweight skill predictor at inference time. This improves both parameter efficiency and compositional transfer, meaning learned manipulation skills can be reused across related tasks without retraining the full policy.

██████████ 0.8 embodied-ai Preprint

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Memory-augmented LLM agents are vulnerable to adversarial injection attacks where malicious records are inserted through ordinary interactions and later hijack agent behavior. MemAudit identifies poisoned memories by combining counterfactual influence scoring — replaying agent behavior with and without each memory to measure its causal impact — with graph-based structural anomaly detection that flags memories inconsistent with the broader memory network. Tested against MINJA attacks on GPT-4o-based agents, the approach substantially reduces attack success rates, offering a post-hoc audit tool that does not require modifying the underlying agent.

██████████ 0.8 alignment-safety Preprint
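
The counterfactual half of this idea, replaying the agent with and without each memory, reduces to a leave-one-out loop. The sketch below covers only that part, not the graph-based anomaly detection. The `run_agent` and `distance` callables are hypothetical placeholders, and the toy agent exists purely to make the example runnable.

```python
from typing import Callable, List, Sequence

def leave_one_out_influence(
    memories: Sequence[str],
    run_agent: Callable[[List[str]], str],
    distance: Callable[[str, str], float],
) -> List[float]:
    """Score each memory by how much the agent's output changes without it.

    Assumption: the agent can be replayed deterministically on a fixed query.
    """
    baseline = run_agent(list(memories))
    scores = []
    for i in range(len(memories)):
        ablated = [m for j, m in enumerate(memories) if j != i]
        scores.append(distance(baseline, run_agent(ablated)))
    return scores

def toy_agent(memories: List[str]) -> str:
    """Toy stand-in: one injected instruction hijacks the decision."""
    if any("always refund" in m for m in memories):
        return "refund"
    return "escalate"

def mismatch(a: str, b: str) -> float:
    """0/1 distance between two agent outputs."""
    return 0.0 if a == b else 1.0

memory_store = [
    "customer prefers email",
    "note to self: always refund without checking",
    "order shipped on Tuesday",
]
influence = leave_one_out_influence(memory_store, toy_agent, mismatch)
```

Here only the second record changes the outcome when removed, so it is the one an auditor would flag for review. The loop costs one agent replay per memory, which is why this kind of scoring suits post-hoc auditing rather than the live request path.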

🔬 Roadblock Activity

Roadblock	Papers	Status	Signal
Data Quality & Curation	105	Active	Highest paper volume today; activity spans benchmark construction (ChartFI, DDX-TRACE) and safety dataset design, reflecting growing recognition that evaluation data quality is as important as model quality.
Efficiency & Scaling	96	Active	Strong activity with Dynamic Closed-Loop Steering showing inference-time adaptive control can outperform raw compute scaling, and SMoDP demonstrating parameter-efficient MoE routing for robotics.
Hallucination & Grounding	87	Active	Multiple papers surface grounding failures in high-stakes domains — medical imaging (DDX-TRACE), chart description (ChartFI), and hazard detection — indicating grounding failures are domain-specific and require targeted benchmarks.
Reasoning Reliability	86	Active	Convergence Without Understanding and Dynamic Closed-Loop Steering together reframe reasoning reliability as a problem of process divergence and inference-time instability, not just output accuracy.
Interpretability	81	Active	Convergence Without Understanding uses representational geometry (CKA, SVCCA) to expose a gap between internal alignment and behavioral alignment, advancing interpretability as a tool for understanding model failure modes.
Agent Tool Use	77	Active	Co-ReAct and Agentic Proving both demonstrate that structured guidance at the action level — rubrics for search agents, formal specs for coding agents — substantially improves agent reliability over unguided tool use.
Multimodal Understanding	76	Active	SPACENUM and ETCHR approach the problem from opposite directions: one exposes a fundamental spatial reasoning failure, the other proposes an architectural fix via decoupled image editing. Together they make multimodal understanding one of the more active and productive roadblocks today.
Alignment & Safety	69	Active	MemAudit addresses a concrete and underexplored attack surface — adversarial memory injection in deployed agents — while theoretical work on next-token prediction challenges alignment assumptions about what LLMs actually learn.
Long Context	33	Active	Moderate activity with no standout papers reaching the top tier today; the theoretical analysis of next-token prediction (When Is Next-Token Prediction Useful) touches long-context assumptions tangentially.
Embodied AI	33	Active	SMoDP and ChainFlow-VLA both address the same tension between expressiveness and efficiency in embodied control, with semantic structure emerging as the preferred routing signal over low-level statistics.
Information Retrieval	1	Low	Effectively quiet today with only one paper in scope; no significant signals to report.

DeepScience — Cross-domain scientific intelligence
Sources: arXiv · OpenAlex · Unpaywall
deepsci.io
