Testing LLM Diagnostics in Endodontics: The Impact of Linguistic Variation on Unseen Cases

Research output: Contribution to journal › Article › peer-review

Abstract

Aim: To assess the diagnostic performance of two language models, GPT-5 Plus and Gemini 2.5 Flash, on a curated benchmark dataset of unseen clinical case scenarios in endodontics and restorative dentistry, and on controlled linguistic variations of that dataset. Additionally, a descriptive qualitative analysis was performed on a subset of cases to evaluate the quality of the reasoning generated by both models.
Methodology: One hundred single-best-answer MCQs were generated using standardised resources, constituting a benchmark dataset. Three controlled linguistic variations of the original dataset were introduced: paraphrasing (sentence/clause rewording), perturbation (token-level substitutions), and permutation (answer-order shuffle). The case scenarios were presented to both models using a standardised prompt, and performance metrics (accuracy/recall, F1 score) were computed. Agreement between and within models was analysed using Cohen's κ, while paired differences were evaluated using McNemar's test, with p < 0.05 considered significant. Qualitative analysis was performed on a subset of the total sample, with responses rated on a 3-point Likert scale.
Results: GPT-5 Plus achieved 80% accuracy on the benchmark dataset compared with 66% for Gemini 2.5 Flash (McNemar's p = 0.0066). When linguistic variations were introduced, the performance of GPT-5 Plus declined, with perturbation having the most pronounced effect (McNemar's p = 0.003). Gemini 2.5 Flash, despite its lower initial performance on the benchmark dataset, maintained uniform decision patterns across all transformations, with no further significant drop. In the descriptive qualitative analysis, a higher proportion of responses was rated as good for Gemini 2.5 Flash (8/10, 80% for the original dataset; 7/10, 70% for linguistic variations) than for GPT-5 Plus.
Conclusion: GPT-5 Plus outperformed Gemini 2.5 Flash on the benchmark dataset; however, it was sensitive to linguistic variations. Perturbation negatively influenced the performance of GPT-5 Plus, emphasising the need to further investigate the linguistic phenomena that may have contributed to the model's degradation. Additionally, the descriptive qualitative analysis showed relatively higher performance for Gemini 2.5 Flash than for GPT-5 Plus on the original dataset and across linguistic variations. However, owing to the descriptive nature of the findings and the limited sample size, the results should be interpreted with caution.
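
The statistical comparison described in the Methodology (accuracy, McNemar's test on paired differences, Cohen's κ for agreement) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code. The abstract does not say whether an exact or chi-square form of McNemar's test was used, so the exact binomial form here is an assumption, and the ten-item answer key and model outputs are hypothetical toy values that do not come from the study.

```python
from math import comb


def accuracy(correct):
    """Fraction of items answered correctly (list of booleans)."""
    return sum(correct) / len(correct)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Only discordant pairs (one model right, the other wrong) carry
    information; under the null hypothesis they split 50/50.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(cat) / n) * (labels_b.count(cat) / n)
        for cat in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical toy data (NOT from the study): answer key and two models'
# chosen options for ten single-best-answer MCQs.
answer_key = list("ABCDABCDAB")
model_a = list("ABCDABCDAA")
model_b = list("ABCDBBCAAA")

correct_a = [pred == key for pred, key in zip(model_a, answer_key)]
correct_b = [pred == key for pred, key in zip(model_b, answer_key)]

print("accuracy A:", accuracy(correct_a))
print("accuracy B:", accuracy(correct_b))
print("McNemar exact p:", mcnemar_exact(correct_a, correct_b))
print("Cohen's kappa (A vs B):", round(cohen_kappa(model_a, model_b), 3))
```

The same functions apply to a within-model comparison: passing one model's answers on the original items and on a transformed version (for example, the perturbed set) gives the paired test and agreement for that transformation. For the permutation variant, answers would first need to be mapped back to the original option labels.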

Original language: English (US)
Journal: International Endodontic Journal
Publication status: Accepted/In press - 2026

Keywords

  • artificial intelligence
  • benchmarking dataset
  • dentistry
  • linguistic variations
  • natural language processing
  • paraphrasing
  • permutation
  • perturbations
