DeepScience · Artificial Intelligence · Daily Digest

AI That Can't Say 'I Don't Know' Is a Problem

Today's digest is about what AI systems still fundamentally get wrong — and why it matters more than the wins.

May 15, 2026

Happy Friday. Today I spent the morning with three papers that share an uncomfortable theme: AI systems confidently producing answers when the honest response would be silence, uncertainty, or 'I need more context.' Let me walk you through three stories that, taken together, say something important about where the gaps still are.

Today's stories

01 / 03

AI Doctors Keep Guessing When They Should Stay Quiet

Remove the right answer from a medical exam and most top AI systems still confidently pick a wrong one, roughly 70% of the time.

Picture a game-show contestant who, when the host removes the correct answer from the board, still confidently picks one of the remaining wrong options instead of saying 'I don't know.' That is what five of today's most capable AI systems tend to do when asked medical questions that have no right answer.

Researchers tested GPT-5.4, Claude Opus 4.7, Gemini 2.5 Pro, DeepSeek R1, and Grok 3 on modified versions of two established medical question benchmarks, MedQA and AfriMed-QA. They removed the correct answer from roughly half the questions in each set, leaving only wrong options on the table. The safe move is to say 'none of these' or 'I can't safely advise.' Instead, the models gave a confident wrong answer between 55% and 82% of the time, about 70% on average across all five. The researchers call this 'premature closure': an AI's tendency to commit to an answer even when pausing, asking for clarification, or declining would be the right call.

The stakes are real. These models are increasingly embedded in health apps, telemedicine tools, and clinical decision support. A system that can't say 'I don't know' is more dangerous than one that's merely occasionally wrong.

Some good news: a simple tweak to the model's instructions, a safety-oriented prompt, dropped false answers from around 70% to around 48%. DeepSeek R1 showed the biggest improvement without sacrificing accuracy on questions that did have a right answer. That's a real lever.

The catch: this covers five specific models on two specific benchmarks. We don't know how this plays out in real clinical settings, with real patients, under real time pressure. It's an early signal, not a verdict.
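For readers who want to see the mechanics, here is a minimal sketch of the test described above: delete the correct option from a question, then count how often a model still commits to one of the remaining options. Everything in it is an assumption for illustration, including the `Item` class, the `ABSTAIN` label, the function names, and the toy data. It is not the researchers' code or scoring rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ABSTAIN = "ABSTAIN"  # hypothetical label for "none of these" / "I can't safely advise"


@dataclass(frozen=True)
class Item:
    question: str
    options: Dict[str, str]   # option letter -> option text
    answer: Optional[str]     # correct letter, or None if no listed option is right


def remove_correct_option(item: Item) -> Item:
    """Return a copy of the item with its correct option deleted."""
    remaining = {k: v for k, v in item.options.items() if k != item.answer}
    return Item(item.question, remaining, None)


def false_answer_rate(items: List[Item], responses: List[str]) -> float:
    """Share of no-right-answer items where the model still picked a listed option."""
    unanswerable = [(it, r) for it, r in zip(items, responses) if it.answer is None]
    if not unanswerable:
        raise ValueError("no unanswerable items to score")
    committed = sum(1 for it, r in unanswerable if r in it.options)
    return committed / len(unanswerable)


# Toy placeholder data, not drawn from MedQA or AfriMed-QA.
original = Item("Toy question?", {"A": "first", "B": "second", "C": "third", "D": "fourth"}, "B")
modified = remove_correct_option(original)

items = [modified, modified, modified, original]
responses = ["A", ABSTAIN, "C", "B"]   # made-up model outputs
rate = false_answer_rate(items, responses)
print(f"false answer rate: {rate:.2f}")
```

In this toy run, two of the three unanswerable items receive a committed answer, so the rate is two thirds. The answerable item is ignored by this metric; accuracy on those would be tracked separately.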

Glossary

premature closure — When an AI commits to an answer in situations where the correct response is to pause, ask for clarification, or say it doesn't know.

NOTA item — A 'none of the above' question: the correct option has been removed, leaving only wrong ones, so the right response is to reject them all. Used here to test whether an AI knows when to abstain.

Source: Quantifying and Mitigating Premature Closure in Frontier LLMs

02 / 03

This AI Learned to Use Apps by Watching Millions of Tutorials

Imagine learning to use Photoshop by watching 4.2 million YouTube tutorials — that is roughly what this AI pipeline just did.

Think about how you learned to use a piece of software you'd never touched before. You probably found a screen-recording tutorial, watched someone click through the steps, and copied what they did. A research team has now automated that process at a scale that's hard to picture.

The project, called Video2GUI, built a pipeline that watches instructional screen-recording videos from the internet and automatically extracts the actions the person performed: which button they clicked, in what sequence, on what kind of application. No human had to label any of it. The pipeline sifted through metadata from 500 million videos, filtered down to 4.2 million high-quality tutorials (about 300,000 hours of footage), and used Gemini 2.5 Pro to extract structured interaction records. The result is a dataset called WildGUI: 12.7 million interaction sequences covering more than 1,500 apps and websites across desktop, mobile, and web.

When the team used WildGUI to pre-train two existing AI models, Qwen2.5-VL and Mimo-VL, those models improved by 5 to 20 percent across a range of tests measuring how well they navigate graphical interfaces.

Why does this matter? AI agents that can reliably operate software are one of the nearer-term useful applications of this technology. An assistant that can book a flight, fill out a form, or navigate a government website the way a human assistant would is commercially and practically valuable. WildGUI is now the largest reported open-source dataset of its kind.

The catch: tutorial videos vary wildly in quality, and the 5–20% improvement figures come from benchmarks the team chose themselves. Independent replication hasn't happened yet, and 'better on benchmarks' and 'reliably useful in practice' are still two different things.
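To get a feel for the scale, a few lines of back-of-the-envelope arithmetic on the figures quoted above help. The inputs are the rounded numbers from this digest, so the outputs are rough, and the variable names are mine rather than anything from the paper.

```python
# Approximate, rounded figures as quoted in this digest.
videos_scanned = 500_000_000
videos_kept = 4_200_000
hours_kept = 300_000
sequences = 12_700_000

keep_rate = videos_kept / videos_scanned            # share that survived filtering
minutes_per_video = hours_kept * 60 / videos_kept   # average length of a kept tutorial
sequences_per_video = sequences / videos_kept       # interaction sequences per tutorial

print(f"kept {keep_rate:.2%} of scanned videos")
print(f"about {minutes_per_video:.1f} minutes per kept video")
print(f"about {sequences_per_video:.1f} sequences per kept video")
```

In other words, fewer than one in a hundred scanned videos made the cut, the typical kept tutorial runs a few minutes, and each one yields about three interaction sequences on average.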

Glossary

GUI agent — An AI that can read a screen and click, type, or scroll to complete tasks — like a digital assistant that operates your computer.

pre-training — An initial phase of AI training on a large dataset before the model is fine-tuned for a specific task, like warming up before a race.

interaction trajectory — A recorded sequence of actions — clicks, keystrokes, scrolls — that together complete a task on a screen.

Source: Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

03 / 03

AI Reads Ancient Artifacts Like They Were Made Yesterday

The best AI vision system tested on historical artifacts scored 58.7%: better than the 25% random guessing would earn, but still wrong on about four questions in ten.

If you handed someone a photograph of a Bronze Age cooking pot and asked them what it was made of and when, you'd expect them to think about the era, the available materials, the culture. A team evaluating ten AI vision systems found that most of them don't do this. They interpret historical objects through a modern lens, as if every artifact were produced recently.

The researchers built a benchmark called TAB-VLM: 600 multiple-choice questions drawn from 1,600 carefully chosen Indian cultural artifacts spanning prehistoric times through to the modern period. The questions tested things like: which of these objects couldn't have existed in this era? What technique was used to craft this? Which item doesn't belong in this group?

The best model tested, GPT-5.2, answered correctly 58.7% of the time. On a four-option multiple-choice test, random guessing gives you 25%. So GPT-5.2 is doing clearly better than chance, but it still gets roughly four questions in ten wrong. The other nine models did no better. The researchers call this 'cultural anachronism': the models project today's categories onto yesterday's objects.

Why does this happen? Because these systems are trained overwhelmingly on contemporary internet images. Their visual vocabulary is anchored in the present. Ask them to reason backwards in time and they reach for the nearest familiar concept, regardless of whether it fits the era.

The catch: the benchmark focuses specifically on Indian artifacts, and the expert curation process is not fully documented in the paper. Whether other cultural traditions show the same patterns, or different ones, remains untested. This is a first probe, not a sweeping verdict.
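One way to put the 58.7% figure in context is to rescale it so that random guessing maps to 0 and a perfect score maps to 1. This is a common normalisation, not something the paper is reported to use, and the function name below is my own. It assumes every question has four options, as described above.

```python
def chance_corrected(accuracy: float, n_options: int) -> float:
    """Rescale accuracy so that random guessing scores 0 and perfect scores 1."""
    chance = 1.0 / n_options
    return (accuracy - chance) / (1.0 - chance)


score = chance_corrected(0.587, 4)
print(round(score, 3))  # 0.449
```

On that scale the best model covers a little under half of the distance between guessing and a perfect score.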

Glossary

vision-language model (VLM) — An AI that can both look at images and read or generate text, allowing it to answer questions about what it sees.

cultural anachronism — Interpreting something old using the ideas and categories of a different, usually more recent, time period.

TAB-VLM — The benchmark the researchers built: Temporal Anachronism Benchmark for Vision-Language Models, containing 600 questions on historical Indian artifacts.

Source: On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

The bigger picture

These three papers are describing the same underlying problem from three different angles. AI systems learn from whatever humans have put on the internet — videos, text, images — and they absorb the biases, gaps, and blind spots baked into that material without always recognising what's missing. Video2GUI shows that learning-from-what's-out-there can work well when the task is concrete, the data is rich, and the goal is narrow. The cultural anachronism work shows where the same strategy breaks down: when a task demands stepping outside the present moment. And the premature closure research shows the sharpest version of the problem — a system so trained to always produce an answer that it cannot recognise when no answer is the right answer. What connects all three is a single uncomfortable question: how do you build a system that knows the limits of what it knows? That question doesn't have a clean answer yet, and it's the one I'd want the field to focus on.

What to watch next

The premature closure findings are likely to draw attention from clinical AI researchers and health regulators — watch for responses from groups building tools used in actual healthcare settings, and whether any of the model providers publish their own analyses. On the cultural reasoning front, the natural next step is a benchmark that spans multiple cultural traditions beyond South Asian artifacts, which would tell us whether this is a universal pattern or one shaped by specific training data distributions. The open question I'd most want answered: does safety prompting in medical AI reduce dangerous overconfidence without making models so cautious they become useless in real practice?