DeepScience · Artificial Intelligence · Daily Digest

AI watches long videos, fails audio, rewrites medical rules

Three papers today show AI getting sharper at some hard tasks — and embarrassingly bad at others.

June 08, 2026

Good morning. Today's batch leans heavily on video and audio — two areas where AI has been promising a lot and delivering unevenly. I picked three stories: one genuine step forward in understanding long video, one honest report card showing audio AI is basically failing, and one clever trick for improving medical triage without retraining anything. Let's dig in.

Today's stories

01 / 03

AI watches hour-long videos almost as well as humans now

What if, instead of reading a whole novel every time someone asked you a question about it, you had a brilliant research assistant who had already taken structured notes?

That is essentially what the MemDreamer system does. Built by a team whose paper landed on arXiv this week, it separates the job of watching a video from the job of answering questions about it, two tasks that current AI systems try to do at the same time, and do badly.

Here is how it works. A first module watches the video and builds a layered set of notes, like a family tree for events. At the top are broad chapters ('the argument happens in the kitchen'). In the middle, scenes. At the bottom, specific details: who was holding what, who caused what. These notes are organized as a graph: a web of connected facts, not a flat list. When you ask a question, a second module, the reasoning agent, navigates that graph and pulls only the relevant sections.

It ends up reading roughly 2% of all the information captured, instead of the entire video transcript. Yet it answers better: the system narrows the gap with human expert performance on a benchmark called LVBench to just 3.7 percentage points, while cutting the amount of text the AI processes by a factor of 41 to 124 compared with current approaches.

The real-world stakes are obvious. Long video is everywhere: surgery recordings, security footage, multi-hour meeting archives. Systems that require holding everything in memory at once are expensive and error-prone.

The catch: this was tested on benchmark videos, not live footage from messy real-world sources. The paper also does not say how the system handles ambiguous or poorly filmed scenes. And 'human expert performance' here means people who watched specifically to answer questions, a higher bar than casual viewing. A 3.7-point gap still exists. Not there yet, but noticeably closer.
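For readers who think in code, the chapters-scenes-details idea can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's implementation: the `Node` class, the word-overlap score, and the sample notes are all hypothetical, and a real reasoning agent would use a language model where this sketch uses word matching.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One note in the memory: the whole video, a chapter, a scene, or a detail."""
    level: str
    text: str
    children: list["Node"] = field(default_factory=list)


def count_nodes(node: Node) -> int:
    """Total number of notes stored under (and including) this node."""
    return 1 + sum(count_nodes(child) for child in node.children)


def overlap(question: str, text: str) -> int:
    """Toy relevance score (an assumption): number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(root: Node, question: str) -> tuple[list[Node], int]:
    """Descend from the root, following only the best-matching child per level.

    Returns the path taken and how many notes were inspected along the way
    (the root plus every child that had to be scored).
    """
    path, inspected, current = [root], 1, root
    while current.children:
        inspected += len(current.children)
        current = max(current.children, key=lambda c: overlap(question, c.text))
        path.append(current)
    return path, inspected


# Hypothetical notes for a short video.
video = Node("video", "full recording", [
    Node("chapter", "argument in the kitchen", [
        Node("scene", "two people argue near the stove", [
            Node("detail", "Ana holds a red mug"),
            Node("detail", "Ben slams the door"),
        ]),
        Node("scene", "someone cleans the kitchen table", [
            Node("detail", "a towel is left on the chair"),
        ]),
    ]),
    Node("chapter", "walk in the garden", [
        Node("scene", "a dog runs across the lawn", [
            Node("detail", "the dog carries a stick"),
        ]),
    ]),
])

path, inspected = retrieve(video, "who holds the red mug in the kitchen near the stove")
print(" > ".join(node.text for node in path))
print(f"inspected {inspected} of {count_nodes(video)} notes")
```

On a tree this small the saving is modest (7 of 10 notes inspected), but the garden chapter is never opened at all. The point of the sketch is only that the share of notes read shrinks as the tree grows, which is the intuition behind the roughly 2% figure reported above.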

Glossary

LVBench — A benchmark dataset of hours-long videos used to test how well AI systems answer questions about long video content.

graph memory — A way of storing information as a web of connected nodes — like a mind map — rather than a simple list, making it easier to navigate to relevant facts.

Source: MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

02 / 03

Every AI audio editing model tested scored below 5%

Imagine a cooking competition where every contestant's dish is, by the judges' standards, inedible — and the most complex recipes score a perfect zero.

That is essentially what a new benchmark called MMAE found when it put today's best audio editing AI through its paces. A team built a rigorous test suite: 2,000 tasks spanning everything from removing background noise to blending multiple sound types and adjusting rhythm and pitch simultaneously.

Think of it like a driving exam with a detailed marking rubric: did the volume change at the right moment? Did the instrument enter on beat? Did the ambient noise actually disappear? Instead of asking 'does this sound good?', which is subjective and hard to measure, each task was broken into roughly nine specific pass-or-fail checklist items, adding up to 17,741 verifiable criteria in total. The primary score, called Exact Match Rate, counts a task as correct only if the AI nails every item on its checklist.

The result: every one of the five AI models tested scored below 5% on the full benchmark. On complex mixed-modality tasks, which require combining multiple types of edits at once, every model scored exactly 0%.

Why does this matter? Audio editing tools are being embedded into content creation pipelines, accessibility software, and automated transcription. If these systems cannot reliably execute structured instructions, trusting them in production is risky.

The honest caveats: the benchmark is synthetic, meaning tasks were designed by researchers, not drawn from real workflows. A 0% score on the most complex tasks may partly reflect deliberately hard design choices. And this benchmark is brand new, so we do not yet know how quickly models will improve once they are trained specifically against it. For now, though, the score is the score.
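To see why an all-or-nothing score can sit near zero even when a model gets many individual items right, here is a minimal Python sketch of the scoring rule as described above. The function names and the four sample tasks are hypothetical, and the per-item pass rate is my own point of comparison, not a metric taken from the paper.

```python
def exact_match_rate(results: list[list[bool]]) -> float:
    """Fraction of tasks whose checklist items ALL passed (no partial credit)."""
    if not results:
        return 0.0
    # An empty checklist is treated as a failure in this sketch (an assumption).
    passed = sum(1 for checklist in results if checklist and all(checklist))
    return passed / len(results)


def item_pass_rate(results: list[list[bool]]) -> float:
    """Fraction of individual checklist items passed (the partial-credit view)."""
    items = [item for checklist in results for item in checklist]
    return sum(items) / len(items) if items else 0.0


# Hypothetical outcomes: one inner list per task, one boolean per checklist item.
tasks = [
    [True, True, True],      # every item passed: counts toward Exact Match Rate
    [True, False, True],     # one miss: the whole task counts as wrong
    [True, True, False],
    [False, False, False],
]

emr = exact_match_rate(tasks)
ipr = item_pass_rate(tasks)
print(f"Exact Match Rate: {emr:.2f}")   # 0.25
print(f"Item pass rate:   {ipr:.2f}")   # 0.58
```

In this toy case the model passes 7 of 12 individual items, yet only 1 of 4 tasks counts as correct. With roughly nine items per task, as in the benchmark, a single slip anywhere is enough to fail the whole task.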

Glossary

Exact Match Rate (EMR) — A strict scoring method that only counts a task as correct if every single criterion on the checklist is satisfied — partial credit is not enough.

mixed-modality tasks — Audio editing tasks that combine multiple types of changes at once, such as adjusting volume while also modifying pitch and removing background noise.

Source: MMAE: A Massive Multitask Audio Editing Benchmark

03 / 03

AI rewrites its own medical triage rules and catches nearly all emergencies

What if you could breed a better medical decision rule the same way farmers breed better crops — selecting winners, discarding losers, repeating?

That is the core idea behind a paper published this week by a team using the GigaEvo framework. They started with a basic software program that sorted patients by urgency, the kind of rule-based triage logic used in emergency departments, and applied something borrowed from evolutionary biology. A large AI model (built on GPT) was used as a mutation engine: it rewrote the decision program hundreds of times, scored each version against real patient data, kept the best performers, and discarded the rest. No additional training of the underlying AI was needed. The program just got iteratively better.

The results on urgency triage are striking. On the Semigran benchmark, a standard set of medical vignettes, overall accuracy jumped from 77.3% to 87.1%. More importantly, emergency recall, the rate at which the system correctly flags true emergencies, rose from 0.60 to 0.97. That first number means 4 in 10 emergencies were missed before. The second means nearly all are caught after. The improved programs also transferred to a held-out hospital dataset and to a separate clinical dataset the system had never seen, which is a meaningful sign of robustness.

The real-world stakes need no explanation: missing an emergency in triage has direct consequences for patient survival. The catch is significant, though. These results come from benchmark datasets and retrospective hospital records, not from a live emergency room. Nobody has deployed this in a real clinical setting yet. The mutation engine (the frontier AI model doing the rewriting) is also expensive to run, so this is not a free upgrade. And benchmark performance and real-world safety are two very different things. A promising step. Not a solved problem.
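The mutate, score, keep-the-best loop is easy to sketch in Python. Everything below is a simplified stand-in and should be read as an assumption: the "program" is a single severity threshold rather than real triage logic, the mutation is a random nudge rather than an AI model rewriting code, the ten patients are invented, and the search is plain keep-the-best hill climbing rather than whatever archive the paper maintains.

```python
import random

# Hypothetical toy data: (severity score from 0 to 10, is it a true emergency?)
PATIENTS = [(9, True), (8, True), (7, True), (6, True), (5, True),
            (6, False), (4, False), (3, False), (2, False), (1, False)]


def recall(threshold: float) -> float:
    """Share of true emergencies that the rule 'score >= threshold' flags."""
    emergencies = [score for score, is_emergency in PATIENTS if is_emergency]
    return sum(score >= threshold for score in emergencies) / len(emergencies)


def accuracy(threshold: float) -> float:
    """Share of all patients the rule classifies correctly."""
    correct = sum((score >= threshold) == is_emergency
                  for score, is_emergency in PATIENTS)
    return correct / len(PATIENTS)


def fitness(threshold: float) -> float:
    """Toy objective (an assumption): reward recall and accuracy equally."""
    return recall(threshold) + accuracy(threshold)


def evolve(start: float, generations: int = 50, children: int = 5,
           seed: int = 0) -> float:
    """Each generation: mutate the current best, score all, keep the winner."""
    rng = random.Random(seed)
    best = start
    for _ in range(generations):
        # The parent is listed first, so it survives unless a child is strictly better.
        candidates = [best] + [best + rng.uniform(-1.0, 1.0)
                               for _ in range(children)]
        best = max(candidates, key=fitness)
    return best


baseline = 7.0
evolved = evolve(baseline)
print(f"baseline recall: {recall(baseline):.2f}, missed: {1 - recall(baseline):.2f}")
print(f"evolved  recall: {recall(evolved):.2f}, accuracy: {accuracy(evolved):.2f}")
```

The baseline threshold is chosen so that recall starts at 0.60, which matches the arithmetic in the story: 1 − 0.60 = 0.40, or 4 in 10 emergencies missed. Because the parent always competes against its own mutations, the score can never get worse from one generation to the next; how far it improves depends on the data and the random seed.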

Glossary

MAP-Elites — An evolutionary search algorithm that maintains a diverse archive of candidate solutions, selecting for both quality and variety rather than just optimizing a single score.

recall — In medical testing, the proportion of true cases (here: real emergencies) that the system correctly identifies — a high recall means few are missed.

Semigran benchmark — A published set of clinical vignettes used to test how accurately AI systems assign urgency levels to patient cases.

Source: LLM-Guided Evolution for Medical Decision Pipelines

The bigger picture

Put these three stories side by side and a pattern emerges that I find more useful than any individual result. MemDreamer shows that when you give AI a smarter memory architecture, structured notes instead of brute-force recall, performance on complex tasks jumps substantially. The medical evolution paper shows the same logic applied differently: instead of a smarter architecture, you run a smarter search over possible decision rules, and the output improves dramatically. Both stories say the same thing: raw model size is not the only lever. Structure and search matter enormously.

Then MMAE comes along and reminds you that in audio, a domain nobody talks about as much as video or text, every model tested scores below 5% under strict evaluation. The gap between 'AI can do this in a demo' and 'AI can reliably execute this under formal evaluation' is still enormous.

The lesson I take from today: the wins are real, but they are narrow and domain-specific. Do not generalise from one modality to another.

What to watch next

For MemDreamer, the interesting next question is whether the graph-memory approach holds up on noisy, real-world video rather than benchmark clips — watch for independent replication attempts in the next month or two. For the medical evolution work, the critical step is a prospective clinical trial; that is the bar between 'interesting result' and 'deployable tool,' and it has not been cleared yet. For MMAE, the benchmark is now public, so watch whether any team posts scores above 10% within the quarter — that would signal rapid progress, or reveal that the benchmark itself needs recalibration.