AI in Research: Head-to-Head Comparison of 3 AI Tools in Rheumatology Diagnosis
Does EU certification or a subscription price tag guarantee better AI diagnostic performance? A head-to-head study of three tools says: not necessarily.
Diagnosing rare rheumatic diseases presents significant clinical challenges. A 2026 study in Rheumatology International compared three AI diagnostic tools to determine whether design, certification, or cost affects diagnostic performance.
Tools Evaluated
- Prof. Valmed – Subscription-based, EU-certified medical device built on a retrieval-augmented generation (RAG) architecture
- ChatGPT-5 Thinking – General-purpose large language model requiring a paid subscription
- OpenEvidence – Free, RAG-based healthcare tool
Methodology
Researchers entered 60 clinical vignettes of rare rheumatic diseases into each system using identical prompts. Each tool returned five diagnostic suggestions with probability scores, which three blinded rheumatologists rated against the reference diagnosis as identical, plausible, or diagnostically different.
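The study's analysis code is not published in this summary, so the sketch below is a hypothetical reconstruction of the scoring step, assuming each tool's output for a vignette arrives as five ranked suggestions with probability scores and that each suggestion carries a consensus rating from the three blinded reviewers. All names (`CaseResult`, `top1_accuracy`, `top5_hit_rate`) are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    """One tool's output for one vignette (hypothetical structure)."""
    diagnoses: list[str]        # five suggestions, ranked by stated probability
    probabilities: list[float]  # the tool's probability score per suggestion
    ratings: list[str]          # reviewers' consensus per suggestion:
                                # "identical", "plausible", or "different"

def top1_accuracy(results: list[CaseResult]) -> float:
    """Share of cases whose first-ranked suggestion was rated identical."""
    return sum(r.ratings[0] == "identical" for r in results) / len(results)

def top5_hit_rate(results: list[CaseResult]) -> float:
    """Share of cases with an identical or plausible match in the top five."""
    return sum(
        any(lab in ("identical", "plausible") for lab in r.ratings)
        for r in results
    ) / len(results)
```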
Results
| Metric | Prof. Valmed | ChatGPT-5 Thinking | OpenEvidence |
| --- | --- | --- | --- |
| Top-1 accuracy (correct primary diagnosis) | 23.3% | 26.7% | 35.0% |
| Top-5 performance (identical or plausible among five) | 51.7% | 58.3% | 56.7% |
| Total diagnostic score | 212 | 226 | 221 |
| Average processing time | 20 s | 36 s | 31 s |
Differences between systems were statistically non-significant.
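The summary does not name the statistical test the authors used. As a rough plausibility check, the top-1 percentages imply roughly 14, 16, and 21 correct primary diagnoses out of 60 vignettes; a chi-square test of homogeneity on those implied counts (one possible analysis, not necessarily the authors') shows why no difference reaches significance:

```python
from scipy.stats import chi2_contingency

# Top-1 counts implied by the reported percentages over 60 vignettes:
# 23.3% ~ 14/60, 26.7% ~ 16/60, 35.0% = 21/60 (derived, not paper data).
correct = [14, 16, 21]            # Prof. Valmed, ChatGPT-5 Thinking, OpenEvidence
incorrect = [60 - c for c in correct]

# 2x3 contingency table: rows = correct/incorrect, columns = tools.
chi2, p, dof, _ = chi2_contingency([correct, incorrect])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # p ~ 0.34, well above 0.05
```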
Key Findings
- No single clear winner emerged
- EU certification did not guarantee superior performance
- General-purpose AI remained competitive
- All systems demonstrated reasonable confidence calibration (one way to check this is sketched after this list)
- Processing speed was not a limiting factor
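The criteria behind "reasonable confidence calibration" are not spelled out here. One standard check, assuming each top suggestion comes with a probability score and a correctness label, is to compare mean stated confidence against observed accuracy within probability bins and to report a Brier score. The sketch below uses fabricated illustrative numbers, not study data.

```python
import numpy as np

def calibration_report(probs, correct, n_bins=5):
    """Compare stated confidence to observed accuracy in equal-width bins,
    and report the Brier score (lower is better)."""
    probs = np.asarray(probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    brier = np.mean((probs - correct) ** 2)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is included.
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            print(f"[{lo:.1f}, {hi:.1f}): "
                  f"mean confidence {probs[mask].mean():.2f}, "
                  f"observed accuracy {correct[mask].mean():.2f}, "
                  f"n = {mask.sum()}")
    print(f"Brier score: {brier:.3f}")

# Illustrative inputs: top-1 probability and whether that diagnosis was correct.
calibration_report(
    probs=[0.9, 0.7, 0.4, 0.8, 0.3, 0.6, 0.55, 0.2],
    correct=[1, 1, 0, 1, 0, 0, 1, 0],
)
```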
Clinical Significance
Even the best-performing tool ranked the correct diagnosis first in only 35% of cases. These systems therefore function best as aids for generating differential diagnoses under physician oversight, not as standalone diagnostic engines.
Limitations
- Clinical vignettes lack real-world complexity
- Possible data leakage from published case sources
- Single prompt structure tested
- Small sample size (60 cases)
- Results represent a specific moment in model evolution
Conclusion
AI tools can broaden a differential in seconds, but for rare rheumatological conditions they remain unreliable as independent diagnostic instruments; clinician judgement must guide the final decision.