DeepScience — Artificial Intelligence

DeepScience · Artificial Intelligence · Daily Digest

AI Solves 56-Year-Old Math Problems. Mostly Struggles Otherwise.

Today's AI research shows real wins in structured tasks — and real fragility the moment conditions get messy.

May 23, 2026

Three papers today, and they tell a surprisingly coherent story once you lay them side by side. An AI system cracked open mathematical problems that humans had been stuck on for decades. A different team showed that AI vision falls apart in rain or blur worse than we thought. And a third group doubled an AI's spreadsheet performance just by letting it practice. Let me walk you through all three — because together they say something useful about where AI is actually strong and where it quietly isn't.

Today's stories

01 / 03

An AI Cracked Math Problems That Had Been Open for Decades

Two of the math problems this AI just solved had been sitting unanswered since 1969.

Think of a list of puzzles so hard that professional mathematicians spent decades staring at them and quietly moved on. Paul Erdős, a prolific Hungarian mathematician who died in 1996, left behind 353 such open problems: written conjectures nobody had yet proved true or false. This week, a system called AlphaProof Nexus, built around Google's Gemini 3.1 Pro combined with Lean (a proof-checking program that acts like a ruthlessly strict editor), solved 9 of them.

How? The language model drafts candidate proof sketches, and each draft goes to Lean for verification. Think of a composer improvising melodies while a strict music teacher immediately marks every wrong note. The agent runs this loop thousands of times, evolving better drafts.

Two of the nine problems had been open for 56 years. The team also proved 44 of 492 conjectures from the Online Encyclopedia of Integer Sequences (OEIS), a public database of number patterns, and resolved a 15-year-old open problem in a branch of mathematics called algebraic geometry. Cost at inference time: a few hundred dollars per problem.

Here is the catch. Nine out of 353 is about 2.5%. The other 97.5% remain unsolved, so this is a useful research collaborator, not a math-problem vending machine. The comparison between different versions of the agent was done after the fact, not through a pre-registered experiment, so it is harder to know exactly which design choices drove the wins. And this is formal mathematics, a domain with clear right-or-wrong answers. Whether the same approach transfers to fields where truth is murkier remains an open question.
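To make the draft-then-verify loop described above concrete, here is a minimal toy sketch in Python. It is not AlphaProof Nexus and does not call Lean or Gemini. The "proof" is just a nontrivial divisor of a number, the proposer guesses in random order, and the names `verify_factor`, `propose_candidates`, and `search` are hypothetical. It also leaves out the evolutionary step and shows only the core rule that nothing counts as solved until the strict checker accepts it.

```python
import random
from typing import List, Optional


def verify_factor(n: int, a: int) -> bool:
    """Strict checker (toy stand-in for Lean): accept only an exact nontrivial divisor."""
    return 1 < a < n and n % a == 0


def propose_candidates(n: int, rng: random.Random) -> List[int]:
    """Drafting step (toy stand-in for the language model): guesses in random order."""
    guesses = list(range(2, n))
    rng.shuffle(guesses)
    return guesses


def search(n: int, budget: int, seed: int = 0) -> Optional[int]:
    """Propose, verify, repeat. Return the first verified guess, or None if the budget runs out."""
    rng = random.Random(seed)
    for attempts, guess in enumerate(propose_candidates(n, rng), start=1):
        if attempts > budget:
            break  # inference budget exhausted: the problem stays open
        if verify_factor(n, guess):
            return guess  # only checker-approved answers are ever returned
    return None


# 221 = 13 * 17, so any verified answer must be one of those two divisors.
print(search(221, budget=1000))
# 13 is prime, so every draft is rejected and the search reports failure.
print(search(13, budget=1000))
```

The design point is the asymmetry between the two roles: the proposer is allowed to be wrong as often as it likes, because the checker is cheap to run and never accepts a bad answer.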

Glossary

Lean — A computer program that checks mathematical proofs line by line and rejects anything logically invalid — like a spell-checker for mathematical logic.

Erdős problems — A collection of mathematical conjectures posed by Hungarian mathematician Paul Erdős, many of which remain unsolved decades after his death.

OEIS conjecture — A proposed but unproved pattern in number sequences, listed in a free online database called the Online Encyclopedia of Integer Sequences.

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search

02 / 03

Blurry Photos Make AI Spatial Reasoning Fall Apart — By 21 Points

Ask an AI where the chair is — fine on a clear photo, off a cliff in rain.

Point any AI vision system at a blurry photo (motion shake from a moving bus, rain on a windshield, a dimly lit corridor) and ask it where objects are or how far apart they sit. How much does it struggle? The SpaceDG team decided to actually measure this, and the answer is: far more than earlier testing had revealed. They built a simulation engine on top of about 1,000 indoor scenes, generating roughly one million question-and-answer pairs across nine types of visual degradation, including motion blur, low light, rain, fog, lens distortion, and compression artefacts. Think of it as building a driving-test course with foggy goggles, a wet windshield, and a flickering dashboard, instead of only testing drivers on a sunny day. They then ran 25 AI models through it, including GPT-5.4 and Gemini 3.1 Pro.

The drop is stark: models lost an average of 20.9 percentage points in spatial reasoning accuracy on degraded images compared with clean ones. For context, humans also drop, from 80.4% accuracy on clean images to 59.5% in degraded conditions, so this is not purely an AI failure. That human drop is also 20.9 points, so on these figures models and people lose a similar amount of accuracy in absolute terms. A smaller model that the team fine-tuned specifically on degraded data, at 8 billion parameters, reached 66.1% on this benchmark, outperforming GPT-5.4 (46.2%), Gemini 3.1 Pro (53.3%), and several other large systems.

The catch: the scenes are all indoors, generated from a research dataset called ScanNet++. The degradations are simulated, not captured in actual rain. Whether the same gap shows up with real cameras on real robots or self-driving cars still needs to be tested.
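The accuracy figures above mix two ways of describing a drop, so a quick arithmetic check helps. The sketch below uses only numbers quoted in this story. The helper names `point_drop` and `relative_drop` are made up for illustration. A percentage-point drop is a plain subtraction, while a relative drop divides that by the starting score.

```python
def point_drop(clean: float, degraded: float) -> float:
    """Absolute drop in percentage points, rounded to one decimal."""
    return round(clean - degraded, 1)


def relative_drop(clean: float, degraded: float) -> float:
    """Drop as a percentage of the clean score, rounded to one decimal."""
    return round(100.0 * (clean - degraded) / clean, 1)


# Human accuracy quoted in the story: 80.4% clean, 59.5% degraded.
human_clean, human_degraded = 80.4, 59.5
print(point_drop(human_clean, human_degraded))     # 20.9
print(relative_drop(human_clean, human_degraded))  # 26.0

# Benchmark scores quoted in the story.
scores = {"fine-tuned 8B": 66.1, "Gemini 3.1 Pro": 53.3, "GPT-5.4": 46.2}
lead = {
    name: round(scores["fine-tuned 8B"] - score, 1)
    for name, score in scores.items()
    if name != "fine-tuned 8B"
}
print(lead)  # {'Gemini 3.1 Pro': 12.8, 'GPT-5.4': 19.9}
```

The same point drop means a larger relative loss for a system that starts from a lower clean score, so it matters which of the two measures a headline number is quoting.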

Glossary

spatial reasoning — An AI's ability to understand the physical arrangement of objects in a scene — distances, positions, and orientations.

visual degradation — Any factor that reduces image quality, including blur, low light, weather, lens distortion, or compression.

fine-tuning — Additional training applied to an already-trained model, using a specific dataset to improve performance on a particular task.

Source: SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

03 / 03

Letting AI Practice in a Spreadsheet Sandbox Doubled Its Score

A small AI model just edged past Microsoft Copilot at Excel — by doing homework first.

If you have ever watched an AI assistant fumble through a multi-step spreadsheet formula (misreading cell ranges, losing track of what it already changed, giving up halfway), you have seen the problem this paper is trying to fix. A team working with Alibaba's Qwen3 model trained a small AI specifically on spreadsheet tasks using reinforcement learning (RL), a training method where the model earns a reward when it gets things right and gradually adjusts its behaviour through repetition. The training happened inside a Microsoft Excel Python sandbox: a virtual workspace where the model could try things, break things, and try again, the way a student does practice problems before sitting a real exam. The training data was scraped automatically by a separate tool from online forums where real people had posted spreadsheet problems and solutions.

The result: on SpreadsheetBench, a set of 912 expert-verified tasks, the model's pass rate went from 12.0% to 23.4% after training. On a separate finance-and-supply-chain dataset the team built themselves, it went from 8.4% to 17.2%, roughly doubling in both cases.

Here is the reality check: ChatGPT Agent scores 45.5% on the same SpreadsheetBench. Microsoft Copilot scores 20.0%. So the RL-trained small model just edged past Copilot, but sits at roughly half of ChatGPT Agent's level. Doubling performance through targeted practice on a 4-billion-parameter model is a genuine result, and it tells us that structured practice in the right environment matters a lot. But 'better than Copilot' and 'ready for your finance team' are two very different claims, and this paper only makes the first one.
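A verifiable reward of the kind described above can be sketched in a few lines of Python. This is an assumed, simplified scheme and not the paper's actual reward: the model's output cells are compared against the expected cells, a task scores 1.0 only if all of them match, and the pass rate is the average over tasks. The names `cell_match_reward`, `pass_rate`, and `improvement_ratio` are hypothetical. The last helper checks the "roughly doubling" arithmetic from the story.

```python
from typing import Dict, List

Cells = Dict[str, object]  # cell reference -> value, e.g. {"B2": 42}


def cell_match_reward(produced: Cells, expected: Cells) -> float:
    """Binary reward: 1.0 only if every expected cell holds the expected value."""
    ok = all(produced.get(ref) == value for ref, value in expected.items())
    return 1.0 if ok else 0.0


def pass_rate(rewards: List[float]) -> float:
    """Share of tasks solved, in percent (0.0 for an empty list)."""
    return 100.0 * sum(rewards) / len(rewards) if rewards else 0.0


def improvement_ratio(before: float, after: float) -> float:
    """How many times higher the score is after training, to two decimals."""
    return round(after / before, 2)


# Two toy tasks: the first output matches, the second has a wrong total.
rewards = [
    cell_match_reward({"A1": 10, "B1": 20}, {"B1": 20}),
    cell_match_reward({"C3": 99}, {"C3": 100}),
]
print(pass_rate(rewards))             # 50.0

# Figures quoted in the story.
print(improvement_ratio(12.0, 23.4))  # 1.95 (SpreadsheetBench)
print(improvement_ratio(8.4, 17.2))   # 2.05 (finance and supply-chain set)
print(round(23.4 / 45.5, 2))          # 0.51 (share of ChatGPT Agent's score)
```

An all-or-nothing check like this is what makes spreadsheets a convenient training ground: the sandbox can grade each attempt automatically, with no human in the loop.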

Glossary

reinforcement learning (RL) — A training method where an AI earns rewards for correct actions and penalties for wrong ones, gradually learning better behaviour through trial and error.

Pass@1 — The probability that a model gets the right answer on its very first attempt, without multiple tries.

SpreadsheetBench — A benchmark of 912 expert-verified spreadsheet tasks used to measure how well AI agents handle real-world Excel workflows.

Source: Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

The bigger picture

Lay these three stories side by side and a pattern comes into focus. AI systems perform impressively when the feedback is clear and the rules are unambiguous: a formal proof is either valid or not, and a spreadsheet answer is either right or wrong. Give the system a clean signal and enough practice reps, and it improves substantially. But take away that clean signal (introduce blur, rain, low light, anything messy) and performance drops hard, by 20.9 points on average in the SpaceDG benchmark. That asymmetry is worth holding onto. The near-term wins in AI are clustering in domains with crisp, verifiable rules: mathematics, spreadsheets, code. The harder battles are in real-world perception and judgment, where the signal is noisy, the ground truth is ambiguous, and practice in a simulator does not fully transfer. Nobody is close to bridging that gap yet.

What to watch next

The AlphaProof Nexus paper points to the remaining 344 unsolved Erdős problems as an ongoing target — worth watching whether future agent versions push past the current 2.5% solve rate, or whether the remaining problems resist the same approach. On the vision side, the SpaceDG team's next logical step is testing their degradation benchmark on outdoor scenes and real-captured (not simulated) bad-weather images. That would tell us whether the 20.9-point drop holds in the wild.