DeepScience · Artificial Intelligence · Daily Digest

AI Sees, Guesses, and Defaults to Male: Three Blind Spots

Today's AI research shows that confident-sounding answers and reliable answers are very different things.

June 01, 2026

Today's batch is dense with papers, but three of them are pointing at the same underlying problem from three different angles — and I think you should know about all three. I spent the morning reading studies on how AI systems handle the limits of their own vision, and the short version is: not well, and often without admitting it. Let me walk you through what that looks like in practice.

Today's stories

01 / 03

AI Models Say 'Male' Even When They Know Better Inside

A model can internally register 'female' and then output 'male', and audits that only check outputs would never catch the mismatch.

Researchers studying four vision-language models (VLMs: AI systems that process both images and text) ran a simple test: show the model a deliberately ambiguous photo of a person, with no strong gender cues, and ask it to classify the person as male or female. The models defaulted to male. Almost every time. Even for occupations like babysitter or florist, which carry strong cultural associations with women. So far, that sounds like a standard bias story. But the team went further. Using a technique they developed called LALS (think of it like taking a thermometer reading midway through cooking rather than only tasting the finished dish), they measured what the models were computing internally, layer by layer, before the final answer came out. What they found was surprising: internally, the models were encoding female associations for those ambiguous images. The inside was reading 'female.' But by the time the answer arrived, it had flipped to male. The researchers call this gap between internal state and output a decoupled regime. It matters for a specific reason: most AI fairness auditing tools only check outputs. If an audit judges a model only by what it says, the model can look clean while still running a systematic male default under the hood. The catch: this study used roughly 900 AI-generated images, verified by a single human annotator, across four models. That is a genuine finding, but a modest one. You would want many more images, multiple human reviewers, and a wider range of models before declaring this universal. What it does establish clearly is that output-level auditing alone is not enough.
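To make the idea of a decoupled regime concrete, here is a minimal Python sketch. It is not the researchers' LALS implementation: the `ProbeResult` record, the per-layer scores, and the sign convention (positive leans 'female', negative leans 'male') are all assumptions made for illustration, and the four example records are made-up values. It only shows how comparing an internal reading with the final answer surfaces cases that an output-only audit would miss.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProbeResult:
    """One image's hypothetical probe reading plus the model's final answer."""
    image_id: str
    layer_scores: List[float]  # assumed convention: > 0 leans 'female', < 0 leans 'male'
    output_label: str          # what the model actually said: 'male' or 'female'


def internal_leaning(result: ProbeResult) -> str:
    """Summarise the internal state by the sign of the mean per-layer score.

    A mean of exactly zero is treated as 'male' here; that tie-break is arbitrary.
    """
    mean_score = sum(result.layer_scores) / len(result.layer_scores)
    return "female" if mean_score > 0 else "male"


def is_decoupled(result: ProbeResult) -> bool:
    """True when the internal leaning and the final output disagree."""
    return internal_leaning(result) != result.output_label


def audit(results: List[ProbeResult]) -> Dict[str, float]:
    """Report an output-only statistic next to an internal-vs-output statistic."""
    n = len(results)
    return {
        "output_male_rate": sum(r.output_label == "male" for r in results) / n,
        "decoupled_rate": sum(is_decoupled(r) for r in results) / n,
    }


# Made-up readings, for illustration only (not data from the study).
toy_results = [
    ProbeResult("img1", [0.2, 0.4, 0.3], "male"),     # inside leans female, says male
    ProbeResult("img2", [-0.1, -0.3, -0.2], "male"),  # inside and output agree
    ProbeResult("img3", [0.5, 0.1, 0.3], "female"),   # inside and output agree
    ProbeResult("img4", [0.1, 0.3, 0.2], "male"),     # inside leans female, says male
]
print(audit(toy_results))  # {'output_male_rate': 0.75, 'decoupled_rate': 0.5}
```

An output-only audit would see just the first number. The second number is the one that needs access to the model's internals.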

Glossary

vision-language model (VLM) — An AI system trained to process both images and text at the same time, so it can answer questions about pictures.

LALS (Latent Association Leaning Score) — A technique that reads the AI's internal numerical activations mid-computation and converts them into a score for how strongly the model is associating an image with a concept like 'female' or 'male,' before the final answer is produced.

decoupled regime — A situation where the model's internal computations point one way but its final output points the other way — the inside and the outside disagree.

Source: Vision-Language Models Suppress Female Representations Under Ambiguous Input

02 / 03

AI Confidently Answers Spatial Questions It Cannot Actually See

Block half a fruit bowl with a vase and ask someone to count the apples — a trustworthy friend admits they can't see them all. AI does not.

A team of researchers built a controlled benchmark called SpatialUncertain using 3D simulated environments, then tested eight different AI vision models — including GPT-4o, GPT-5, Gemini-2.5-Flash, and several open-source alternatives — on a deceptively simple question: do you know when you cannot see? They introduced two types of visual problems. First, occlusion: an object was placed between the camera and the target, so part of the scene was genuinely hidden. Second, perspective ambiguity: the camera was shifted to an angle where depth becomes physically impossible to judge from a single viewpoint. Then they asked the models spatial questions and measured whether the models would admit uncertainty or just answer anyway. Under occlusion, average accuracy dropped to around 30%. Under perspective ambiguity it dropped below 10%. The models kept answering confidently regardless. When the researchers went further and asked the models which camera angle would help them see better — basically, 'what information are you missing?' — the models performed near random chance. They could not identify what they could not see. This matters beyond a lab setting. Vision-language models are being deployed in warehouses, hospitals, and vehicles to interpret physical environments. A system that cannot flag its own blind spots does not just underperform — it provides confident misinformation at exactly the moment when caution is most important. The catch: this is a simulated benchmark, not real-world deployment. Real environments have more visual richness and often multiple camera angles. The researchers did find that fine-tuning on diverse ambiguity scenarios improved abstention — so there is a path forward, but nobody has walked it yet at scale.
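One way to picture what 'admitting uncertainty' means as a measurement is sketched below in Python. This is an assumed scoring scheme, not the SpatialUncertain protocol: the `Item` record, its field names, and the convention that a `None` prediction means the model abstained are hypothetical, and the example items are made up. The sketch separates accuracy on questions that can be answered from the abstention rate on questions that cannot.

```python
from typing import Dict, List, NamedTuple, Optional


class Item(NamedTuple):
    """One benchmark question (hypothetical schema)."""
    answerable: bool           # False when occlusion or viewpoint hides the answer
    gold: Optional[str]        # correct answer, or None if the question is unanswerable
    prediction: Optional[str]  # model's answer; None means the model abstained


def score(items: List[Item]) -> Dict[str, Optional[float]]:
    """Accuracy on answerable items, abstention rate on unanswerable ones."""
    answerable = [i for i in items if i.answerable]
    unanswerable = [i for i in items if not i.answerable]

    # Abstaining on an answerable question counts as a miss (None != gold).
    accuracy = (
        sum(i.prediction == i.gold for i in answerable) / len(answerable)
        if answerable else None
    )
    # On unanswerable questions the only correct move is to abstain.
    abstention_rate = (
        sum(i.prediction is None for i in unanswerable) / len(unanswerable)
        if unanswerable else None
    )
    return {"accuracy": accuracy, "abstention_rate": abstention_rate}


# Made-up items, for illustration only.
toy_items = [
    Item(answerable=True, gold="left", prediction="left"),
    Item(answerable=True, gold="right", prediction="left"),
    Item(answerable=False, gold=None, prediction="left"),  # answered anyway
    Item(answerable=False, gold=None, prediction=None),    # correctly abstained
]
print(score(toy_items))  # {'accuracy': 0.5, 'abstention_rate': 0.5}
```

Keeping the two numbers apart matters: a model that always answers can still post a decent accuracy on the answerable half while scoring zero on abstention.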

Glossary

occlusion — When an object physically blocks the camera's line of sight to something else, so part of the scene is hidden.

perspective ambiguity — When a camera angle makes it impossible to judge depth or distance — like a flat photo of a hallway where you cannot tell if two objects are close together or far apart.

abstention — When an AI system correctly refuses to answer because it recognises it does not have enough information, rather than guessing.

Source: Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

03 / 03

Sports Video Reveals a Steep Cliff in How Well AI Truly Reasons

At recognising what just happened in sports footage, AI scores a reasonable 73%. At reasoning about strategy, it scores 5%. That 68-point drop tells you something important.

A research team built a benchmark called SVI-Bench from roughly 35,000 hours of basketball, soccer, and hockey broadcast footage, complete with annotated actions, expert commentary, game reports, and statistical records. The point was not to build a sports trivia bot. It was to create a world with real complexity, explicit rules, and verifiable answers, so you could test AI reasoning honestly at different levels of difficulty. Think of the task levels like this: reading one sentence is easy. Summarising a paragraph is harder. Writing an investigative report that draws on a hundred sources is something else entirely. SVI-Bench organised tasks across four levels, from simple recognition up to full agentic reasoning, where the model had to autonomously search through 1.8 million video clips, gather its own evidence, and integrate it to answer a strategic question. At the lowest level ('what action just happened?') the best models scored around 73%, reasonable if imperfect. At the top level, the best model scored 5%. Five percent. Every model tested showed a sharp drop at each step upward. The researchers call this pattern a capability cliff, and it is consistent across every model family they evaluated. Why does this matter outside sports? Because those top-level tasks (autonomous evidence gathering, multi-step integration, strategic synthesis) are precisely what companies are deploying AI agents for in business and scientific contexts. If these capabilities collapse even in a domain where the rules are written down and the outcomes are unambiguous, that is a meaningful signal about where real deployments will hit limits. The catch: ground truth for some reasoning tasks came from broadcast commentator transcripts, which are not always technically precise. And benchmarks are not deployments. Nobody knows yet how these exact failure modes translate to production systems.
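The size of the cliff can be stated two ways, and the short Python sketch below works through both using only the two scores reported above (about 73% at the lowest level, 5% at the top). The function names are illustrative, and the scores for the two middle levels are not given here, so they are left out.

```python
def point_drop(easy_pct: float, hard_pct: float) -> float:
    """Absolute drop in percentage points."""
    return easy_pct - hard_pct


def relative_drop(easy_pct: float, hard_pct: float) -> float:
    """Share of the easy-level score that is lost at the hard level (0 to 1)."""
    if easy_pct <= 0:
        raise ValueError("easy_pct must be positive")
    return (easy_pct - hard_pct) / easy_pct


lowest_level = 73.0  # best score at the recognition level, as reported (approximate)
top_level = 5.0      # best score at the agentic level, as reported

print(point_drop(lowest_level, top_level))                  # 68.0
print(round(relative_drop(lowest_level, top_level) * 100))  # 93
```

Read the second number with care: the two scores may not come from the same model, so it describes the gap between the best results at each level, where the top-level best retains only about 7% of the lowest-level best.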

Glossary

agentic task — A task where the AI must take a sequence of actions autonomously — searching, retrieving, deciding — rather than just answering a single question with information already in front of it.

capability cliff — A sharp drop in performance when a task moves from one cognitive level to the next, rather than a gradual decline.

Source: SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

The bigger picture

What do these three stories share? Each catches an AI system performing reasonably on an easy version of a task, then failing on a version that demands something harder: admitting what you cannot see, translating internal understanding into honest output, or sustaining reasoning across many steps. That pattern matters. A lot of AI evaluation focuses on average accuracy. These papers suggest average accuracy is the wrong frame. The real question is what happens at the edges: when visual evidence is incomplete, when the input is ambiguous, when a task requires chaining many inferences together. That is where current models consistently break. None of this means AI is useless. It means the gap between 'impressive demo' and 'trustworthy system' runs specifically through these edge cases. And right now, most of those edges are poorly mapped and poorly tested before deployment.

What to watch next

On the spatial reasoning front, the SpatialUncertain team flagged fine-tuning on diverse ambiguity conditions as a partial fix — watch for follow-up work testing whether that holds when models are put into physical environments, not simulated ones. On the gender bias finding, LALS-style internal probing could become a standard auditing tool; it would be worth tracking whether any major lab announces they are running something like it on production models. And on the sports reasoning benchmark: Google DeepMind and OpenAI have both named video understanding as a 2026 priority — SVI-Bench is now a public target they will have to address.