DeepScience — Artificial Intelligence

AI Overconfidence, Better Agents, and Broken Visual Memory

Today's AI research reveals a system that can't say 'I don't know,' one that learns from its own failed attempts, and one that forgets what it saw.

May 16, 2026

Three papers today, and they fit together more neatly than usual — which I always find satisfying. One exposes a dangerous habit in medical AI, one shows how open-source agents are closing the gap with proprietary systems, and one reveals a surprisingly basic thing AI still can't do with images. Let's dig in.

Today's stories

01 / 03

AI Systems in Medicine Say 'I Know' When They Should Say 'I Don't'

Between 53% and 82% of the time, leading AI systems give a confident medical answer even when the correct answer has been deliberately removed from the options.

Picture a doctor who, no matter what, always writes a prescription before you leave the room, even when the right move is to say 'I need more tests first.' That's the habit this paper documents in five major AI systems, and it has a name: premature closure. A team of researchers tested models including GPT, DeepSeek R1, and Grok 3 on two medical question datasets, MedQA and AfriMed-QA, with a twist: they removed the correct answer from the multiple-choice options. There was no right answer to pick. A good system should recognize that and say so.

Instead, models picked a wrong answer between 53% and 82% of the time, depending on the model and dataset. The average across all five models was around 70%. The researchers also tested the models on open-ended medical questions. On 78% of adversarial questions written by actual physicians (questions designed to be underspecified or tricky), models gave an inappropriate answer rather than asking for clarification or flagging uncertainty.

The good news: adding a simple instruction to the prompt ('it is okay to say you don't know, or ask for more information') reduced the false-action rate by roughly 20 percentage points on average. The bad news: it didn't fix the problem, and one model, Grok 3, became so cautious that it also started refusing questions it should have answered.

The catch here is scope. This study measured one specific failure mode on specific benchmarks. It doesn't tell us how these systems perform in real clinical workflows, and 'inappropriate answer' was scored against rubrics, which introduces judgment calls. But the direction is clear and worth watching: confident wrongness is measurable, common, and only partially fixable with prompting alone.

Glossary

premature closure — When an AI (or a human clinician) settles on an answer too early, before considering that the question might not have a good answer yet.

false-action rate — The percentage of times a model picks or commits to an answer when the correct response would have been to abstain or ask for more information.

NOTA item — A 'None of the Above' question constructed by removing the correct answer, used to test whether a model knows when it doesn't know.
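
To make the last two glossary terms concrete, here is a minimal Python sketch of how a NOTA item could be built and how a false-action rate could be computed. It is an illustration, not the paper's evaluation code: the `Item` class, the `ABSTAIN` marker, the function names, and the toy responses are all assumptions invented for this example.

```python
from dataclasses import dataclass

# Hypothetical marker for "none of these" or "I need more information".
ABSTAIN = "ABSTAIN"


@dataclass
class Item:
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # label of the correct response


def make_nota_item(item: Item) -> Item:
    """Drop the correct option, so abstaining becomes the only right response."""
    remaining = {k: v for k, v in item.options.items() if k != item.answer}
    return Item(item.question, remaining, ABSTAIN)


def false_action_rate(responses: list[str]) -> float:
    """Share of responses to NOTA items that commit to an option instead of abstaining."""
    if not responses:
        raise ValueError("need at least one response")
    committed = sum(1 for r in responses if r != ABSTAIN)
    return committed / len(responses)


# Toy data, invented for illustration only.
original = Item(
    question="Toy question with placeholder options",
    options={"A": "first", "B": "second", "C": "third", "D": "fourth"},
    answer="C",
)
nota = make_nota_item(original)

# Ten made-up model responses to NOTA items: eight commit, two abstain.
responses = ["A", "B", ABSTAIN, "D", "A", ABSTAIN, "B", "A", "D", "B"]
rate = false_action_rate(responses)
print(sorted(nota.options), rate)
```

In this toy run, eight of ten responses commit to an option that cannot be correct, so the false-action rate is 0.8. That sits inside the 53% to 82% range reported above, but the number comes from invented data rather than from the study.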

Source: Quantifying and Mitigating Premature Closure in Frontier LLMs

02 / 03

Open-Source AI Agents Can Now Write Code and Browse the Web Competitively

An open-source framework is now competitive with proprietary AI agents at coding, web browsing, and personal assistance, and its training recipe says something interesting about how to teach them.

Training an apprentice chef by showing them only perfect dishes misses half the lesson. The useful part is often in the batches that didn't work: which steps were fine, and which ones ruined the whole thing. That's the core intuition behind Orchard, a new open-source framework for training AI agents released by a team at MiniMax. Orchard packages three agents: one for software engineering (Orchard-SWE), one for navigating graphical interfaces like websites and apps (Orchard-GUI), and one for personal assistant tasks (Orchard-Claw).

Here are the headline numbers. Orchard-SWE scores 67.5% on SWE-bench Verified, a standard test where AI systems try to fix real bugs in real GitHub repositories, which the authors claim is the best result among open-source models of comparable size. Orchard-GUI hits 74.1% on WebVoyager, a benchmark for navigating actual websites.

What makes the training approach worth noting is 'credit-assignment SFT.' Rather than throwing away trajectories where the agent failed to complete a task, the method identifies and learns from the useful segments inside those failed runs. Think of it as salvaging the good cuts from a dish that didn't plate well. On top of that, a reinforcement learning step called Balanced Adaptive Rollout pushes performance further.

The catch: benchmark scores are not the same as real-world reliability. SWE-bench and WebVoyager are well-constructed tests, but production environments are messier. The paper also openly uses data distilled from larger proprietary models (MiniMax-M2.5, Qwen3.5-397B) to bootstrap training, so 'open-source' here means the training recipe and model weights are available, not that the whole pipeline is independent of closed systems. That's worth knowing before you celebrate the openness.

Glossary

SWE-bench Verified — A benchmark where AI agents are given real bug reports from GitHub and asked to fix the underlying code — judged against actual test suites.

supervised fine-tuning (SFT) — A training step where a model learns from labeled examples of correct behavior, before or instead of trial-and-error reinforcement.

reinforcement learning — A training method where a model is rewarded for good outcomes and penalized for bad ones, so it learns from experience rather than just examples.

data distillation — Using a large, capable model to generate training data that a smaller model then learns from.
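
The idea of salvaging useful segments from failed runs can be sketched in a few lines of Python. This is only a toy contrast between outcome-only filtering and step-level selection, not Orchard's actual credit-assignment SFT: the `Step` class, the `useful` flag, and both function names are hypothetical, and how the real method decides which steps were useful is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    useful: bool  # hypothetical label: did this step move the task forward?


def outcome_only_filter(trajectory, succeeded):
    """Baseline: keep a run as training data only if the whole task succeeded."""
    return list(trajectory) if succeeded else []


def step_level_filter(trajectory, succeeded):
    """Keep all steps of a successful run, and only the useful steps of a failed one."""
    if succeeded:
        return list(trajectory)
    return [step for step in trajectory if step.useful]


# A made-up failed run: the first two steps were fine, the last two were not.
failed_run = [
    Step("open the failing test", useful=True),
    Step("locate the buggy function", useful=True),
    Step("rewrite an unrelated module", useful=False),
    Step("submit a patch that breaks the build", useful=False),
]

discarded = outcome_only_filter(failed_run, succeeded=False)
salvaged = step_level_filter(failed_run, succeeded=False)
print(len(discarded), [step.action for step in salvaged])
```

With outcome-only filtering the failed run contributes nothing to training. With step-level selection, its first two steps survive as examples.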

Source: Orchard: An Open-Source Agentic Modeling Framework

03 / 03

AI Can't Really Remember What It Saw — Here's the Proof

If an AI 'remembers' a photo by converting it into a text description, how much of what it saw is actually gone?

Imagine you're trying to answer detailed questions about a film you saw last month, but all you have are your own text-message summaries from that night. You can answer the plot questions fine. But a question like 'what colour was the third character's coat in the marketplace scene?' is beyond you. You never wrote it down. That gap is exactly what the team behind MemEye set out to measure in AI systems. They built a benchmark of 742 question-and-answer pairs covering eight everyday life scenarios (think reviewing a day's worth of photos, or tracking objects around a home) and tested thirteen memory systems across four vision-language model backbones.

The core finding is uncomfortable: most multimodal AI memory systems quietly convert images into text or captions as quickly as possible, then discard the original visual information. When you ask a fine-grained question later, such as 'has the plant on the kitchen counter wilted since Tuesday?', the system is working from its notes, not its memory. The benchmark shows a much larger performance gap between 'caption only' and 'full visual access' than previous benchmarks did, which means earlier evaluations were effectively letting systems cheat by asking questions that text alone could answer. Three specific problems emerge: systems struggle to route the right piece of evidence to the right question, they fail to track how things change over time across multiple sessions, and they can't reliably extract fine-grained visual details even when they have access to them.

The catch is scale. MemEye has 742 questions across eight tasks: a careful, well-constructed benchmark, but not an enormous one. And 'LLM-as-a-Judge' scoring (where another AI grades the answers) introduces its own error margin. Still, the finding that visual memory is fundamentally under-tested in existing work is hard to argue with.

Glossary

multimodal — Able to work with more than one type of input — in this case, both images and text.

vision-language model (VLM) — An AI system that can process and reason about both images and language together.

LLM-as-a-Judge — Using a large language model to score the quality of another model's answers, as a substitute for human evaluation.

caption-to-multimodal gap — The difference in performance between a system that only has text descriptions of images versus one that has the actual images — a measure of how much visual detail matters.
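
As a rough illustration of the last glossary entry, the caption-to-multimodal gap is simply a difference between two accuracies measured on the same questions. The Python sketch below uses invented per-question results and hypothetical function names. It is not MemEye's scoring code, which relies on LLM-as-a-Judge grading rather than simple right-or-wrong flags.

```python
def accuracy(correct_flags):
    """Fraction of questions answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one graded question")
    return sum(correct_flags) / len(correct_flags)


def caption_to_multimodal_gap(caption_only, full_visual):
    """Gap in percentage points between full-image access and caption-only access."""
    if len(caption_only) != len(full_visual):
        raise ValueError("both settings must be graded on the same questions")
    return 100.0 * (accuracy(full_visual) - accuracy(caption_only))


# Invented results for eight questions (True means the answer was graded correct).
caption_only = [True, False, False, True, False, False, True, False]  # 3 of 8
full_visual = [True, True, False, True, True, False, True, True]      # 6 of 8

gap = caption_to_multimodal_gap(caption_only, full_visual)
print(gap)  # 37.5
```

A benchmark whose questions can mostly be answered from captions will show a gap near zero, whatever the quality of the visual memory behind it. That is the sense in which a larger gap indicates the benchmark is actually testing what the system saw.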

Source: MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

The bigger picture

Read these three together and a tension becomes visible. On one hand, open-source AI agents are genuinely getting more capable: Orchard shows that with real benchmark numbers and a training method that's worth understanding. On the other hand, two of today's papers are essentially documenting how shaky the ground beneath that progress is. AI systems in medical contexts still commit to wrong answers at alarming rates rather than admitting uncertainty. And the visual memory underpinning multimodal agents is far shallower than we've been testing for. Here's the position I'd take: capability and brittleness are advancing in parallel right now, not sequentially. We're not fixing the foundations before building higher. We're discovering the cracks as the walls go up. That's not necessarily fatal, but it does mean anyone deploying these agents in consequential settings (medicine, legal research, personal assistance) should be treating the benchmark scores as a ceiling of best-case performance, not a floor of reliable behaviour.

What to watch next

The medical AI overconfidence finding is likely to get attention from people working on clinical deployment guidelines — watch for responses from groups at Mayo Clinic or similar institutions who are already running AI in diagnostic workflows. On the agent capability side, SWE-bench Verified scores have been climbing fast this year; I'd watch whether any open-source model crosses 70% in the next month or two, which would have been considered out of reach at the start of 2025. The open question I most want answered: does better visual memory (the MemEye gap) actually improve agent performance on real tasks, or is the text-summary shortcut good enough in practice?