Blog
AI in Research: 6 Recent Studies, 5-Minute Journal Club
⏱️ Reading time: ~5 minutes
Six studies from 2025–2026 examine how AI performs across different clinical tasks: pattern recognition in imaging, diagnostic reasoning, record analysis, and evidence summarization. What the studies show varies considerably by task and setting.
🩺 AI identified pre-diagnostic pancreatic cancer signatures 16 months before clinical detection
Mukherjee et al. | Gut | 2026
Study design: Retrospective multi-institutional cohort (US)
Participants: 1,462 patients; 156 with pre-diagnostic cancer scans, 1,306 controls
Task: Detect pre-diagnostic radiomic signatures on routine CT scans before clinical suspicion arose
In this retrospective cohort, the REDMOD framework achieved 73% sensitivity on an independent test set, nearly twice the 38.9% detected by radiologists reviewing the same images. The median lead time was approximately 475 days (~16 months) before clinical diagnosis.
Key takeaway
Pancreatic cancer is rarely caught before it is advanced. Retrospective detection performance is encouraging; prospective validation in high-risk cohorts is required before this approach could inform screening practice.
🧠 An LLM outperformed physicians on real diagnostic cases in a prospective comparison
Brodeur et al. | Science | 2026 | doi:10.1126/science.adz4433
Study design: Multibenchmark evaluation
Participants: Hundreds of clinicians across specialties, including emergency physicians
Task: Diagnostic reasoning on cases drawn from actual clinical encounters, including real ER patients
OpenAI’s o1 model matched or outperformed practicing physicians across diagnostic cases drawn from real clinical settings, among other benchmarks. The authors describe this as the first such comparison at scale using real-world patient cases rather than standardized question banks.
Key takeaway
Performance on real clinical cases, as described by the authors, is a different benchmark from standardized testing. The study does not, however, measure what happens to patients as a result.
🔬 An LLM identified symptom patterns in GP notes that preceded ovarian cancer diagnosis
Funston et al. | British Journal of General Practice | 2026 | doi:10.3399/BJGP.2025.0366
Study design: Retrospective LLM analysis of routine GP consultation records (UK)
Participants: Patients subsequently diagnosed with ovarian cancer
Task: Identify symptom patterns in existing notes that preceded a confirmed cancer diagnosis
In this retrospective analysis, an LLM surfaced associations between symptoms documented across multiple GP consultations — patterns that had been recorded individually but not connected at the time of care. The associations were identifiable in hindsight from existing records.
Key takeaway
Retrospective detectability is not the same as prospective clinical utility. Neither the gap between the two nor the effect on patient outcomes is addressed by this study.
💡 Providing AI reasoning alongside AI answers improved radiologist accuracy in a randomized trial
Spitzer et al. | npj Digital Medicine | 2026 | doi:10.1038/s41746-026-02619-0
Study design: Randomized controlled trial
Participants: 101 radiologists
Task: Radiology cases under three conditions — no AI, AI answer only, AI answer with step-by-step reasoning
In this trial, radiologists given both the AI recommendation and a step-by-step explanation of how the AI reached its conclusion improved diagnostic accuracy by 12.2 percentage points compared to those without AI support (P = 0.001). Radiologists given only the AI’s answer without reasoning did not significantly outperform the no-AI control group (P = 0.150).
Key takeaway
The trial found that the format of AI output, not just whether AI was present, affected accuracy. Showing the answer alone produced no significant benefit; the gain required the reasoning.
⚠️ Differential diagnosis was the weakest domain across 21 LLMs in a standardized evaluation
Rao AS, Esmail KP et al. | JAMA Network Open | 2026
Study design: Cross-sectional evaluation of 21 frontier LLMs on 29 standardized clinical vignettes
Participants: 21 AI models from 5 companies — OpenAI (GPT-4o), Anthropic (Claude), Google (Gemini), DeepSeek, and xAI (Grok)
Task: Five clinical reasoning domains assessed sequentially: differential diagnosis (DDx), diagnostic testing, final diagnosis, management, miscellaneous
In this evaluation:
DDx failure rates exceeded 80% across all 21 models
Final diagnosis accuracy exceeded 90% in the same models
The authors describe a pattern of “premature closure”: models converged on a single answer rather than generating a range of possibilities. The study used standardized vignettes; performance on real patient presentations may differ.
Key takeaway
High accuracy on final diagnosis did not predict high accuracy on differential diagnosis in this evaluation. These are distinct tasks, and they were handled very differently across the models tested.
📄 LLM summaries consistently overgeneralized medical paper findings in a controlled study
Peters & Chin-Yee | Royal Society Open Science | 2025 | doi:10.1098/rsos.241776
Study design: Experimental; 10 LLMs tested on summarization of 200 abstracts and 100 full medical articles
Participants: Models including: GPT-3.5 Turbo, GPT-4 Turbo, ChatGPT-4o, ChatGPT-4.5, LLaMA 2 70B, LLaMA 3.3 70B, Claude 2, Claude 3.5 Sonnet, Claude 3.7 Sonnet, and DeepSeek; 4,900 summaries generated
Task: Summarize medical papers; compare outputs to original texts and to human expert summaries from NEJM Journal Watch
In this study:
LLM summaries were nearly 5× more likely to contain broad generalizations than summaries produced by human expert reviewers from the same papers (odds ratio = 4.85, 95% CI [3.06, 7.70])
Among the most affected models (DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B), overgeneralization occurred in 26–73% of cases
Claude models were notable exceptions, with the lowest overgeneralization rates across all models tested
Several newer models showed higher overgeneralization rates than older ones, though the pattern was not uniform across all models, and the study does not explain the mechanism
Prompting models to “be accurate” increased rather than reduced overgeneralizations
Setting temperature to 0 (a sampling setting that controls how random a model's output is, with 0 being the most deterministic) made overgeneralizations 76% less likely (relative reduction)
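For readers curious what "temperature" actually does: it rescales the model's scores for candidate tokens before one is sampled. The sketch below is a toy illustration of that mechanism with made-up scores, not any vendor's actual implementation:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Pick a candidate index from raw scores (logits), scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: always return the single highest-scoring option.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
# Temperature 0 is deterministic: the top-scoring token wins every time.
print({sample_with_temperature(logits, 0, random.Random(s)) for s in range(20)})  # {0}
# A higher temperature flattens the distribution, so picks can vary across runs.
print(sample_with_temperature(logits, 2.0, random.Random(0)))
```

This is why temperature 0 reads as "conservative": the model stops exploring lower-probability phrasings, which plausibly includes the broader, looser generalizations the study measured.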
The authors note that not every generalization is unwarranted; the study identifies a tendency toward scope inflation, not a uniform error in every output.
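For readers less familiar with odds ratios, here is a toy calculation showing how a figure like OR = 4.85 arises. The counts are hypothetical, chosen only to produce a similar ratio; they are not the study's data:

```python
# HYPOTHETICAL counts, for illustration only (not taken from the paper):
# suppose 90 of 200 LLM summaries overgeneralize vs. 29 of 200 expert summaries.
llm_over, llm_not = 90, 200 - 90
expert_over, expert_not = 29, 200 - 29

odds_llm = llm_over / llm_not          # odds of overgeneralizing, LLM summaries
odds_expert = expert_over / expert_not # odds of overgeneralizing, expert summaries
odds_ratio = odds_llm / odds_expert    # how many times higher the LLM odds are

print(round(odds_ratio, 2))  # 4.82
```

An odds ratio near 5 means the odds (not the raw probability) of an overgeneralized summary were about five times higher for the LLMs than for the human experts.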
Key takeaway
In this study, AI summaries tended to expand the scope of what papers claimed. Readers of AI-generated medical summaries should check whether the original paper actually supports the breadth of the conclusion presented.
🔧 What This Means in Practice (SwissMed AI)
Based on these findings, consider the following tips:
Prioritize reasoning over answers. AI explanations appear to matter more than the recommendation itself; structure workflows around understanding why, not just accepting what.
Use AI to surface patterns across records. It can flag symptom clusters or connections you might miss when information is scattered across multiple visits.
Verify AI-summarized papers against originals. LLMs tend to overgeneralize findings. A quick check protects your clinical decisions.
How you ask questions shapes the answer. Specificity and framing of prompts measurably change output. Worth testing what works for your questions.
Keep differential diagnosis as human work. AI still struggles here; use it to inform, not anchor, your thinking.
These remain exploratory signals, not practice directives, until we see patient outcomes improve.
📚 Sources
1. Mukherjee et al. | Gut | 2026
Next-generation AI for visually occult pancreatic cancer detection in a low-prevalence setting with longitudinal stability and multi-institutional generalisability
2. Brodeur et al. | Science | 2026
Performance of a large language model on the reasoning tasks of a physician
3. Funston et al. | British Journal of General Practice | 2026
Using large language models to identify pre-diagnostic clinical features of ovarian cancer from healthcare records: a population-based case-control study
4. Spitzer et al. | npj Digital Medicine | 2026
The effect of medical explanations from large language models on diagnostic accuracy in radiology
5. Rao AS, Esmail KP et al. | JAMA Network Open | 2026
Large Language Model Performance and Clinical Reasoning Tasks
6. Peters & Chin-Yee | Royal Society Open Science | 2025
Generalization bias in large language model summarization of scientific research