DeepScience · Mental Health · Daily Digest

AI Mental Health Tools Are Leaking Secrets — Including Yours

Three new papers ask a question that matters: what are these mental health AI tools actually learning?

June 09, 2026

Good morning. Today's digest is dense in a good way — three papers that independently arrived at the same uncomfortable place. The AI tools being built to help detect and treat mental health conditions are doing things their designers didn't fully plan for. Let me walk you through each one.

Today's stories

01 / 03

Your Voice Leaks Your Gender — This Tool Tries to Hide It

Every time you speak to a mental health app, your voice is quietly sending a second message you never wrote.

When you talk, your voice carries far more than your words. It carries signals about your gender, your age, and your accent: demographic fingerprints baked into the rhythm and texture of how you speak. Now imagine a mental health app that screens you for depression by listening to your voice. It may be doing two jobs at once: assessing your mood and quietly building a demographic profile of you.

That's the problem a team of researchers tried to fix with a framework they call InfoShield. Think of it like a filter on a kitchen tap. You want clean water (the clinically relevant signal, depression risk), but the tap is also letting through sediment you don't want anyone to collect: your gender, your age. InfoShield is designed to catch that sediment before it goes anywhere.

In their tests, the system cut gender-guessing accuracy from 92.6 percent to 55.5 percent, close to a coin flip. Age-guessing accuracy dropped from 55.7 percent to 30.3 percent. Critically, the depression-detection score did not suffer. It actually improved slightly, from an F1 of 0.723 to 0.784. F1 is a combined accuracy measure that penalises both false positives and false negatives; higher is better.

The catch, and it is a significant one: every single test was run on one dataset, called the Androids Corpus. One dataset is a proof of concept, not a deployment. Whether this holds across different languages, recording conditions, or clinical settings is completely unknown. Also, 55.5 percent gender accuracy is still above the 50 percent a pure coin flip would give: the leak is smaller, not sealed.

Glossary

F1 score — A single number summarising a model's accuracy that combines its rate of correct positives and its rate of missed cases — higher is better, 1.0 is perfect.

mutual information minimisation — A mathematical technique for training an AI to ignore a specific type of information — here, demographic signals — while keeping everything else.
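For readers who want to see the arithmetic behind an F1 score, here is a minimal Python sketch. The counts are made up for illustration and have nothing to do with the InfoShield results; the function name `f1_score` is our own.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: the harmonic mean of precision and recall.

    tp = true positives  (flagged, and really depressed)
    fp = false positives (flagged, but not depressed)
    fn = false negatives (depressed, but missed)
    """
    if tp == 0:
        # No correct positives at all: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)   # of everyone flagged, how many were right
    recall = tp / (tp + fn)      # of everyone affected, how many were caught
    return 2 * precision * recall / (precision + recall)


# Hypothetical screening run: 40 correct flags, 10 false alarms, 15 misses.
print(round(f1_score(tp=40, fp=10, fn=15), 3))  # 0.762
```

Because it is a harmonic mean, F1 stays low if either precision or recall is low, which is why it is harder to game than plain accuracy.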

Source: InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

02 / 03

Brain-Scan AI Is Mostly Just Recognising You, Not Your Mood

The AI isn't reading your brain state — it might be reading your ID badge.

EEG is a way to measure the electrical activity of your brain using electrodes placed on your scalp. The headset looks like a shower cap covered in sensors. The ambition in mental health research is compelling: use EEG to catch signatures of depression or anxiety directly from brain activity, no questionnaire needed. To do this at scale, researchers have been building what are called foundation models: large AI systems pre-trained on huge amounts of EEG data, designed to learn general patterns they can then apply to clinical tasks.

A team of researchers just ran a diagnostic audit on three of these models (LaBraM, CBraMod, and REVE) across four public datasets. Their tool is called FMScope. What they found is alarming. When they froze the models and looked at what the AI had actually learned, individual identity (who you are) dominated the signal, at 13 to 89 times the level you'd expect from random noise. The models had essentially memorised each person's unique neural fingerprint. And when the team fine-tuned the models on clinical data, the problem got worse, not better. Think of training a wine sommelier who, instead of learning about grape varieties and regions, has memorised the exact glass each wine was served in.

The good news is the team also found you can subtract the identity signal mathematically, and when you do, the clinically meaningful signal improves by 6 to 27 percentage points depending on the task.

The catch: this is a diagnostic audit. No fix is deployed. And it raises hard questions about every EEG accuracy number published without subject-disjoint testing.
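To make "identity dominates the signal" concrete, here is a toy Python sketch. It is not FMScope and not the paper's method: it uses a single made-up number per recording where real models produce high-dimensional embeddings, and the names `between_subject_share` and `remove_subject_mean` are our own. It only shows the general idea of measuring how much variation is explained by who the person is, and what subtracting each person's average does to that share.

```python
from statistics import fmean


def between_subject_share(values_by_subject: dict[str, list[float]]) -> float:
    """Fraction of total variance explained by subject identity (0 to 1)."""
    all_values = [v for vals in values_by_subject.values() for v in vals]
    grand_mean = fmean(all_values)
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    between_ss = sum(
        len(vals) * (fmean(vals) - grand_mean) ** 2
        for vals in values_by_subject.values()
    )
    return between_ss / total_ss if total_ss else 0.0


def remove_subject_mean(values_by_subject: dict[str, list[float]]) -> dict[str, list[float]]:
    """Subtract each subject's own average, leaving only within-person variation."""
    centered = {}
    for subject, vals in values_by_subject.items():
        subject_mean = fmean(vals)  # computed once per subject
        centered[subject] = [v - subject_mean for v in vals]
    return centered


# Hypothetical feature values: two people, three recordings each.
toy = {"s1": [1.0, 1.2, 0.8], "s2": [5.0, 5.2, 4.8]}

share_before = between_subject_share(toy)                      # close to 1
share_after = between_subject_share(remove_subject_mean(toy))  # close to 0
print(share_before > 0.99, share_after < 1e-9)
```

In this toy data almost all the variation is "which person is this", and it disappears once each person's average is removed. Whether what remains is clinically useful is a separate question that only real data can answer.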

Glossary

EEG (electroencephalography) — A method of recording the brain's electrical activity through sensors placed on the scalp — non-invasive and relatively cheap compared to brain scans.

foundation model — A large AI system trained on broad data first and then adapted to specific tasks — the same basic idea behind GPT, applied here to brain signals.

subject-disjoint testing — A testing rule where the people in the test set are entirely different individuals from those used in training — the only honest way to check if an AI has learned something general rather than memorised specific people.

variance decomposition — A statistical technique for measuring how much of the variation in a dataset comes from each possible source — here, used to show how much is individual identity versus brain state.
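Subject-disjoint testing is simple to state and easy to get wrong in practice, so here is a minimal Python sketch of the idea. The record layout (a `subject_id` field on each recording) and the function name are assumptions made for illustration, not anything taken from the paper.

```python
import random


def subject_disjoint_split(records, test_fraction=0.25, seed=0):
    """Split recordings so that no person appears in both train and test.

    The split is made over people, not over recordings: every recording
    from a held-out person goes to the test set.
    """
    subjects = sorted({r["subject_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_subjects]
    test = [r for r in records if r["subject_id"] in test_subjects]
    return train, test


# Hypothetical dataset: 8 people, 3 recordings each.
records = [
    {"subject_id": f"p{i}", "recording": j}
    for i in range(8)
    for j in range(3)
]
train, test = subject_disjoint_split(records)
train_ids = {r["subject_id"] for r in train}
test_ids = {r["subject_id"] for r in test}
print(train_ids.isdisjoint(test_ids))  # True
```

A naive random split over recordings would scatter each person's sessions across both sets, which lets a model score well simply by recognising people it has already seen.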

Source: The Identity Trap in EEG Foundation Models: A Diagnostic Audit

03 / 03

AI Psychiatric Screening Ranges From Coin-Flip to Decent — and Is Biased Against Women

Five AI models read 555 real clinical interviews and tried to spot depression, anxiety, and PTSD — here is how badly they stumbled.

A research team gave five AI language models (LLaMA 3, DeepSeek, GPT-4o Mini, GPT-4.1 Mini, and GPT-5 Mini) a reading test. Each model received a transcript of a semi-structured clinical interview with a real person, alongside that person's formal diagnosis from a trained clinician using standardised tools. The task: spot depression, anxiety, PTSD, or any mental health condition. No special training. Just reading.

The results span a wide range. Accuracy ran from 0.49, essentially the same as guessing, up to 0.86. The MCC score, which measures whether the model is doing something genuinely meaningful beyond random chance, topped out at 0.38 across all conditions. That is modest. Coin-flip at the low end, cautiously useful at the high end. GPT-4.1 Mini and GPT-5 Mini were the most consistent performers.

But here is where it gets uncomfortable. For depression specifically, the models were more accurate for male participants than for female participants, and this held across all five models tested. Nobody knows exactly why yet, but the team dug into the written reasoning GPT-4.1 Mini produced. A pattern emerged: when a transcript mentioned that someone was coping well, had good social support, or was still functioning at work, the model often leaned away from a positive diagnosis, even when clear symptoms were also present. The model was weighting protective context against symptoms in ways a human clinician might not.

The catch: this was zero-shot testing, with no clinical training and no specialisation. Clinicians bring years of contextual knowledge. The gender finding is real but needs replication before drawing firm conclusions.

Glossary

zero-shot testing — Testing an AI on a task it received no specific training for — the model uses only its general knowledge from pre-training.

MCC (Matthews correlation coefficient) — A single number measuring whether an AI's classifications are genuinely predictive or just lucky — ranges from -1 (always wrong) to +1 (perfect), with 0 meaning random.

SCID (Structured Clinical Interview for DSM) — A standardised interview tool used by trained clinicians to produce formal psychiatric diagnoses — used here as the gold-standard reference.
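To see why a paper would report MCC alongside accuracy, here is a minimal Python sketch using the standard MCC formula on made-up counts. The numbers are hypothetical and are not drawn from the study; the function names are our own.

```python
from math import sqrt


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: undefined cases (e.g. one class never predicted) score 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical sample: 10 people with a condition, 90 without.
# Model A says "no condition" for everyone.
print(accuracy(tp=0, tn=90, fp=0, fn=10), mcc(tp=0, tn=90, fp=0, fn=10))  # 0.9 0.0

# Model B catches 2 of the 10 cases and raises 2 false alarms.
print(accuracy(tp=2, tn=88, fp=2, fn=8), round(mcc(tp=2, tn=88, fp=2, fn=8), 2))  # 0.9 0.27
```

Both hypothetical models are "90 percent accurate", but MCC shows the first one has learned nothing and the second only a little. That is the sense in which an MCC ceiling of 0.38 reads as modest even next to a headline accuracy of 0.86.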

Source: When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

The bigger picture

Read these three together and a single uncomfortable pattern emerges. The AI tools being built to help with mental health are doing things their designers did not fully intend. Your voice leaks demographic data you never agreed to share. Brain-scan models may be silently memorising who you are instead of what your brain is doing. And the most capable language models have modest accuracy and measurable blind spots around gender. None of this means these tools should be abandoned. It means they are at the stage where honest auditing matters more than confident deployment. The field is not moving too slowly — if anything, the pressure to ship is outrunning the pressure to check. What today's papers collectively argue, without quite saying it in those words, is that mental health AI needs a culture of adversarial scrutiny built in from the start. That culture is not yet standard.

What to watch next

The EEG identity trap finding raises immediate questions for any lab publishing depression-detection accuracy numbers from EEG. Watch for responses from the LaBraM, CBraMod, and REVE teams, and for follow-up work testing the identity-subtraction fix (the subject-axis erasure method) in real clinical cohorts. On the LLM screening side, the gender accuracy gap is a finding that deserves independent replication; if it holds across different datasets and languages, it becomes a serious barrier to any real-world deployment. The open question I'd most want answered: does the same protective-context bias show up when the AI is given richer clinical context, rather than just a single interview transcript?