DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Fails at Clicks, Novel Ideas, and Knowing Its Limits

Today's papers draw a sharper map of where AI can actually be trusted — and where it confidently isn't.

May 19, 2026

Three stories today, and they happen to rhyme. Each one comes from a different corner of AI research — robotic labs, academic writing, website design — and each one lands on the same uncomfortable finding: AI is remarkably capable at executing tasks, and remarkably bad at judging them. Let me walk you through each one.

Today's stories

01 / 03

GPT Can't Predict Where You'll Click on a Website

Imagine hiring someone to predict where your customers will click, and finding out that about half the time their predictions don't match what customers actually do.

When you redesign a website, a standard trick is to run what UX designers call a first-click test: show real people an interface, ask a question, see where they click first. It takes time and costs money. So some teams have started outsourcing this to GPT instead.

A research team compared GPT's click predictions against 3,431 real human participants across 12 first-click studies drawn from real design practice. The finding: in 53% of tasks, GPT's predicted click pattern was statistically different from what real users did. In practice, that makes a synthetic result about as dependable as a coin flip.

They tried every tweak you would hope might help: asking GPT to reason step-by-step before answering (chain-of-thought prompting), giving it fake 'user personas' ('act as a 35-year-old online shopper'), adjusting its randomness settings. None of it moved the needle. The researchers tested both GPT-4.1 and GPT-5.2, and both performed about as poorly, suggesting this is not a model-version problem you can wait your way out of.

Why? GPT reads words and their meanings. It clicks where language sounds relevant. Real humans respond to visual layout, button size, colour contrast, and spatial proximity, things that are not in the text at all. The result is output that reads as thoughtful and plausible but is systematically disconnected from real visual behaviour.

The catch: this study covered first-click tests specifically. Some structured UX tasks, such as generating question lists, summarising results, or writing survey copy, may still benefit from AI assistance. But if someone is selling you 'AI-powered synthetic user testing at scale,' hand them this paper first.
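To make "statistically different" concrete, here is a minimal sketch of how one task's click pattern could be compared between real users and a model. It is an illustration only: the digest does not say which statistical test the paper used, so the chi-square test of homogeneity below is an assumption, and the region names and click counts are invented for the example.

```python
from collections import Counter

# Upper 5% critical values of the chi-square distribution, keyed by degrees
# of freedom. The table only covers tasks with up to six clickable regions.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_homogeneity(human_clicks, model_clicks):
    """Return (statistic, degrees_of_freedom) for a 2 x k homogeneity test."""
    human = Counter(human_clicks)
    model = Counter(model_clicks)
    regions = sorted(set(human) | set(model))
    n_human, n_model = len(human_clicks), len(model_clicks)
    total = n_human + n_model
    stat = 0.0
    for region in regions:
        column_total = human[region] + model[region]
        for observed, row_total in ((human[region], n_human), (model[region], n_model)):
            # Expected count if both groups clicked with the same probabilities.
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(regions) - 1


def differs(human_clicks, model_clicks):
    """True if the two click patterns differ at the 5% significance level."""
    stat, dof = chi_square_homogeneity(human_clicks, model_clicks)
    if dof == 0:
        return False  # every click landed on the same region in both groups
    return stat > CHI2_CRIT_05[dof]


# Hypothetical data for one task: first clicks from 100 people and 100 model runs.
human = ["search_icon"] * 60 + ["nav_menu"] * 30 + ["footer_link"] * 10
model = ["search_icon"] * 20 + ["nav_menu"] * 70 + ["footer_link"] * 10

stat, dof = chi_square_homogeneity(human, model)
print(f"chi-square = {stat:.1f}, dof = {dof}, differs = {differs(human, model)}")
# prints: chi-square = 36.0, dof = 2, differs = True
```

Running a check like this on every task and counting how many come back True is one way to arrive at a figure of the "53% of tasks" kind. The chi-square approximation is usually considered trustworthy only when expected counts are not too small (a common rule of thumb is at least five per cell), so sparse click regions would need to be merged or tested differently.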

Glossary

first-click test — A usability method where participants are shown a web interface and asked to click where they would go to complete a specific task — revealing how intuitively the layout works.

chain-of-thought prompting — A technique where you ask an AI to write out its reasoning step-by-step before giving an answer, in the hope that explicit reasoning improves accuracy.

Source: What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience

02 / 03

AI Can Write a Research Paper for $15. It Still Fakes the Results.

You can now generate a research paper for roughly the price of a sandwich — which sounds alarming until you find out what is actually inside it.

A team of researchers released a detailed map of everything AI can currently do across the full lifecycle of a scientific paper: finding ideas, writing drafts, checking results, submitting to journals. It is a survey, meaning they synthesised existing systems and papers rather than running their own experiments, which is worth knowing up front. The headline number is striking: fully automated systems can generate a research paper for as little as $15. But the authors are careful about what that means, and we should be too.

The map they draw has two very different zones. In the first zone (structured, tool-mediated tasks like literature search, reference formatting, code scaffolding, and summarising existing findings) AI genuinely helps and is largely reliable. Think of it like a diligent kitchen assistant who can chop vegetables perfectly but has never tasted the dish.

In the second zone (judging whether an idea is genuinely new, catching hidden experimental errors, deciding if results are scientifically valid) frontier large language models, the most powerful AI systems available today, reliably fabricate results, miss subtle problems, and fail to judge novelty under real scientific pressure. The authors are direct about this. No fully automated system has consistently gotten papers accepted at major research venues. And, critically, increased automation can hide failure modes rather than eliminate them: a paper can look complete and credible while containing invented findings.

This is one of those surveys where the candid framing is more useful than any single result. The message is: use AI inside the first zone, distrust it inside the second.

Glossary

large language model — An AI system trained on vast amounts of text to generate and reason about language — the kind of system behind tools like ChatGPT or Claude.

fabricate results — In AI research, this means an AI system confidently produces false information — numbers, citations, findings — that it presents as real.

Source: AI for Auto-Research: Roadmap & User Guide

03 / 03

A Robot AI Just Made Graphene and Tiny Transistors All by Itself

Graphene, one of the thinnest materials ever made, just got its first AI-only manufacturer, and the same system also built transistors a few atoms thick.

In a custom robotic minilab, a multi-agent AI system called Qumus just completed two tasks that previously required trained human hands: creating graphene, a sheet of carbon only one atom thick, and assembling atomically thin transistors, the tiny switches at the heart of all modern electronics, by stacking those atomic sheets on top of one another (a technique known as van der Waals stacking).

Qumus works a bit like a small team in a kitchen, where each person has one job. One AI agent acts as project manager, assigning tasks and tracking goals. Another supervises the lab workflow. A third handles microscopy and physical assembly. These agents communicate with robotic arms, precision motion stages, and a computer vision system based on YOLO (a type of AI trained to recognise objects in images). When something goes wrong, say a flake of graphene lands in the wrong position, the system detects the problem and corrects its own actions, closing the loop without human intervention.

The catch is significant, and the team is honest about it: this is a proof-of-concept demonstration. The paper does not compare Qumus's speed or accuracy against a human expert working the same equipment. There is no statistical analysis across many repeated trials. What the researchers have shown is that this pipeline can work end-to-end in a physical environment, not that it works reliably or at scale. Think of it as the first time someone baked bread in an automated kitchen: it proves the kitchen can function, not that you should close your bakery yet. Still, building a device that is literally a few atoms thick, without human hands touching it, is not nothing.
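The "detect the problem, correct, check again" behaviour described above is a closed control loop, and a toy version can be sketched in a few lines. This is not Qumus's code: the class and function names, the tolerance, and the simulated stage that only achieves 80% of each commanded move are all assumptions made for illustration, with a simple offset calculation standing in for the vision step.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulatedStage:
    """Hypothetical stand-in for a motion stage carrying a flake (micrometres)."""
    x: float
    y: float
    gain: float = 0.8  # fraction of each commanded move that is actually achieved

    def move_by(self, dx, dy):
        self.x += self.gain * dx
        self.y += self.gain * dy


def detect_offset(stage, target):
    """Stand-in for the vision step: how far the flake is from the target."""
    return target[0] - stage.x, target[1] - stage.y


def align(stage, target, tolerance_um=1.0, max_moves=20):
    """Measure, correct, and re-measure until the flake is within tolerance.

    Returns the number of corrective moves used, or None if the loop gives up.
    """
    moves = 0
    while True:
        dx, dy = detect_offset(stage, target)
        if math.hypot(dx, dy) <= tolerance_um:
            return moves
        if moves == max_moves:
            return None  # escalate instead of looping forever
        stage.move_by(dx, dy)
        moves += 1


stage = SimulatedStage(x=50.0, y=-30.0)
print(align(stage, target=(0.0, 0.0)))  # prints: 3
```

The point of the sketch is the shape of the loop rather than the numbers: because each move is imperfect, the system has to look again after acting, and it needs a stopping rule for the case where correction is not converging.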

Glossary

graphene — A material made of a single layer of carbon atoms arranged in a hexagonal grid — extraordinarily thin, strong, and electrically conductive.

transistor — A tiny electronic switch that controls the flow of electricity; billions of them are packed into every modern computer chip.

van der Waals stacking — A technique for assembling electronic devices by layering atomically thin materials on top of each other, held together by weak natural attractions between the layers.

multi-agent system — An AI setup where multiple specialised AI programs coordinate with each other, each handling a different part of a larger task.

Source: Qumus: Realization of An Embodied AI Quantum Material Experimentalist

The bigger picture

Three papers today, three different domains, one pattern: AI is much better at doing than judging. Qumus can physically fabricate a transistor a few atoms thick; that is executing. But it cannot tell you whether that transistor design was the right choice. The auto-research survey shows AI can generate an entire paper for about $15; that is executing. But it reliably fails when asked to evaluate whether an idea is genuinely new. And the UX study shows GPT can produce plausible-sounding predictions about human behaviour; that is also executing. But those predictions diverged from real users on about half of the tasks, because anticipating a click takes something closer to judgment than retrieval. What we are watching, slowly, is the map of what AI can and cannot be trusted with getting sharper and more specific. That map is more useful to you than either the hype or the panic. Today's papers are three more data points drawn on it.

What to watch next

The Qumus team describes its real-world robotic trials as only 'preliminary'. A follow-up study quantifying success rates against human expert benchmarks would tell us whether this becomes a real lab tool or stays a demonstration. On the UX side, the finding that GPT's predicted clicks differed significantly from real users' on 53% of first-click tasks will likely provoke responses from the growing AI-in-design industry; counter-studies are worth watching for this summer. The open question I want answered: at what point does AI-assisted academic writing start degrading the quality of the scientific literature that future AI systems will train on? Nobody has a credible answer yet, and it is becoming urgent.