AI in Research: Head-to-Head Comparison of 3 AI Tools in Rheumatology Diagnosis
Does EU certification or a subscription price tag guarantee better AI diagnostic performance? A head-to-head study of three tools says: not necessarily.
Diagnosing rare rheumatic diseases presents significant clinical challenges. A 2026 study in Rheumatology International compared three AI diagnostic tools to determine whether design, certification, or cost affects diagnostic performance.
Tools Evaluated
- Prof. Valmed – Subscription-based, EU-certified medical device built on a retrieval-augmented generation (RAG) architecture
- ChatGPT-5 Thinking – General-purpose large language model requiring a paid subscription
- OpenEvidence – Free, RAG-based healthcare tool
Methodology
Researchers entered 60 clinical vignettes of rare rheumatic diseases into each system using identical prompts. Each tool returned five diagnostic suggestions with probability scores, which three blinded rheumatologists rated against the reference diagnosis as identical, plausible, or diagnostically different.
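The study's analysis code is not published in this summary, so the sketch below is a hypothetical reconstruction of the scoring step, assuming each tool's output for a vignette arrives as five ranked suggestions with probability scores and that each suggestion carries a consensus rating from the three blinded reviewers. All names (`CaseResult`, `top1_accuracy`, `top5_hit_rate`) are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    """One tool's output for one vignette (hypothetical structure)."""
    diagnoses: list[str]        # five suggestions, ranked by stated probability
    probabilities: list[float]  # the tool's probability score per suggestion
    ratings: list[str]          # reviewers' consensus per suggestion:
                                # "identical", "plausible", or "different"

def top1_accuracy(results: list[CaseResult]) -> float:
    """Share of cases whose first-ranked suggestion was rated identical."""
    return sum(r.ratings[0] == "identical" for r in results) / len(results)

def top5_hit_rate(results: list[CaseResult]) -> float:
    """Share of cases with an identical or plausible match in the top five."""
    return sum(
        any(lab in ("identical", "plausible") for lab in r.ratings)
        for r in results
    ) / len(results)
```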
Results
| Metric | Prof. Valmed | ChatGPT-5 Thinking | OpenEvidence |
| --- | --- | --- | --- |
| Top-1 accuracy (correct primary diagnosis) | 23.3% | 26.7% | 35.0% |
| Top-5 performance (identical or plausible among five) | 51.7% | 58.3% | 56.7% |
| Total diagnostic score | 212 | 226 | 221 |
| Average processing time | 20 s | 36 s | 31 s |
Differences between systems were statistically non-significant.
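The summary does not name the statistical test the authors used. As a rough plausibility check, the top-1 percentages imply roughly 14, 16, and 21 correct primary diagnoses out of 60 vignettes; a chi-square test of homogeneity on those implied counts (one possible analysis, not necessarily the authors') shows why no difference reaches significance:

```python
from scipy.stats import chi2_contingency

# Top-1 counts implied by the reported percentages over 60 vignettes:
# 23.3% ~ 14/60, 26.7% ~ 16/60, 35.0% = 21/60 (derived, not paper data).
correct = [14, 16, 21]            # Prof. Valmed, ChatGPT-5 Thinking, OpenEvidence
incorrect = [60 - c for c in correct]

# 2x3 contingency table: rows = correct/incorrect, columns = tools.
chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # p ~ 0.34, well above 0.05
```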
Key Findings
- No single clear winner emerged
- EU certification did not guarantee superior performance
- General-purpose AI remained competitive
- All systems demonstrated reasonable confidence calibration (one way to check this is sketched after this list)
- Processing speed was not a limiting factor
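The criteria behind "reasonable confidence calibration" are not spelled out here. One standard check, assuming each top suggestion comes with a probability score and a correctness label, is to compare mean stated confidence against observed accuracy within probability bins and to report a Brier score. The sketch below uses fabricated illustrative numbers, not study data.

```python
import numpy as np

def calibration_report(probs, correct, n_bins=5):
    """Compare stated confidence to observed accuracy in equal-width bins,
    and report the Brier score (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    brier = np.mean((probs - correct) ** 2)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is included.
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            print(f"[{lo:.1f}, {hi:.1f}): "
                  f"mean confidence {probs[mask].mean():.2f}, "
                  f"observed accuracy {correct[mask].mean():.2f}, "
                  f"n = {mask.sum()}")
    print(f"Brier score: {brier:.3f}")

# Illustrative inputs: top-1 probability and whether that diagnosis was correct.
calibration_report(
    probs=[0.9, 0.7, 0.4, 0.8, 0.3, 0.6, 0.55, 0.2],
    correct=[1, 1, 0, 1, 0, 0, 1, 0],
)
```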
Clinical Significance
Even the best-performing tool ranked the correct diagnosis first in only 35% of cases. These systems therefore function best as aids for generating differential diagnoses under physician oversight, not as standalone diagnostic engines.
Limitations
- Clinical vignettes lack real-world complexity
- Possible data leakage from published case sources
- Single prompt structure tested
- Small sample size (60 cases)
- Results represent a specific moment in model evolution
Conclusion
AI tools can broaden a differential in seconds, but for rare rheumatological conditions they remain unreliable as independent diagnostic instruments; clinician judgement must guide the final decision.