Links for fulltext (may require subscription):
- Publisher Website: DOI 10.3390/dj13070279
- Scopus: eid_2-s2.0-105011507292
Citations:
- Scopus: 0
Article: Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry
| Title | Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry |
|---|---|
| Authors | Chau, Reinhard Chun Wang; Thu, Khaing Myat; Yu, Ollie Yiru; Hsung, Richard Tai Chiu; Wang, Denny Chon Pei; Man, Manuel Wing Ho; Wang, John Junwen; Lam, Walter Yu Hang |
| Keywords | artificial intelligence; deep learning; machine learning; mouth rehabilitation; natural language processing |
| Issue Date | 21-Jun-2025 |
| Publisher | MDPI |
| Citation | Dentistry Journal, 2025, v. 13, n. 7 |
| Abstract | Background/Objectives: This study aims to evaluate the response accuracy and quality of three AI chatbots—GPT-4.0, Claude-2, and Llama-2—in answering multiple-choice questions in prosthodontic and restorative dentistry. Methods: A total of 191 text-based multiple-choice questions were selected from the prosthodontic and restorative dentistry sections of the United States Integrated National Board Dental Examination (INBDE) (n = 80) and the United Kingdom Overseas Registration Examination (ORE) (n = 111). These questions were inputted into the chatbots, and the AI-generated answers were compared with the official answer keys to determine their accuracy. Additionally, two dental specialists independently evaluated the rationales accompanying each chatbot response for accuracy, relevance, and comprehensiveness, categorizing them into four distinct ratings. Chi-square and post hoc Z-tests with Bonferroni adjustment were used to analyze the responses. The inter-rater reliability for evaluating the quality of the rationale ratings among specialists was assessed using Cohen’s Kappa (κ). Results: GPT-4.0 (65.4%; n = 125/191) demonstrated a significantly higher proportion of correctly answered multiple-choice questions when compared to Claude-2 (41.9%; n = 80/191) (p < 0.017) and Llama-2 (26.2%; n = 50/191) (p < 0.017). Significant differences were observed in the answer accuracy among all of the chatbots (p < 0.001). In terms of the rationale quality, GPT-4.0 (58.1%; n = 111/191) had a significantly higher proportion of “Correct Answer, Correct Rationale” than Claude-2 (37.2%; n = 71/191) (p < 0.017) and Llama-2 (24.1%; n = 46/191) (p < 0.017). Significant differences were observed in the rationale quality among all of the chatbots (p < 0.001). The inter-rater reliability was very high (κ = 0.83). Conclusions: GPT-4.0 demonstrated the highest accuracy and quality of reasoning in responding to prosthodontic and restorative dentistry questions. This underscores the varying efficacy of AI chatbots within specialized dental contexts. |
| Persistent Identifier | http://hdl.handle.net/10722/366477 |
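The abstract names three statistical procedures: a chi-square test of answer accuracy across the three chatbots, post hoc pairwise Z-tests of proportions with a Bonferroni-adjusted threshold (0.05 / 3 ≈ 0.017), and Cohen's kappa for agreement between the two specialist raters. The sketch below illustrates how such an analysis could be run in Python using the correct-answer counts reported in the abstract; it is not the authors' analysis code, and the rater vectors used for the kappa example are placeholders rather than the study's data.

```python
# Minimal sketch of the tests described in the abstract, using the reported
# counts of correct answers out of 191 questions per chatbot. Illustrative only.
from itertools import combinations

from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest
from sklearn.metrics import cohen_kappa_score

TOTAL = 191
correct = {"GPT-4.0": 125, "Claude-2": 80, "Llama-2": 50}  # counts from the abstract

# Overall chi-square test: does answer accuracy differ among the three chatbots?
table = [[c, TOTAL - c] for c in correct.values()]  # rows: chatbot; cols: correct / incorrect
chi2, p_overall, dof, _ = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, df = {dof}, p = {p_overall:.4f}")

# Post hoc pairwise Z-tests of proportions, judged against the
# Bonferroni-adjusted threshold of 0.05 / 3 comparisons (~0.017).
alpha_adj = 0.05 / 3
for (name_a, a), (name_b, b) in combinations(correct.items(), 2):
    z, p = proportions_ztest([a, b], [TOTAL, TOTAL])
    flag = "significant" if p < alpha_adj else "not significant"
    print(f"{name_a} vs {name_b}: z = {z:.2f}, p = {p:.4f} ({flag} at p < 0.017)")

# Cohen's kappa for inter-rater agreement on the four rationale-quality ratings.
# The rating vectors below are placeholders, not the study's data.
rater1 = [0, 1, 1, 2, 3, 0, 1]
rater2 = [0, 1, 2, 2, 3, 0, 1]
print(f"Cohen's kappa = {cohen_kappa_score(rater1, rater2):.2f}")
```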
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Chau, Reinhard Chun Wang | - |
| dc.contributor.author | Thu, Khaing Myat | - |
| dc.contributor.author | Yu, Ollie Yiru | - |
| dc.contributor.author | Hsung, Richard Tai Chiu | - |
| dc.contributor.author | Wang, Denny Chon Pei | - |
| dc.contributor.author | Man, Manuel Wing Ho | - |
| dc.contributor.author | Wang, John Junwen | - |
| dc.contributor.author | Lam, Walter Yu Hang | - |
| dc.date.accessioned | 2025-11-25T04:19:37Z | - |
| dc.date.available | 2025-11-25T04:19:37Z | - |
| dc.date.issued | 2025-06-21 | - |
| dc.identifier.citation | Dentistry Journal, 2025, v. 13, n. 7 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/366477 | - |
| dc.description.abstract | Background/Objectives: This study aims to evaluate the response accuracy and quality of three AI chatbots—GPT-4.0, Claude-2, and Llama-2—in answering multiple-choice questions in prosthodontic and restorative dentistry. Methods: A total of 191 text-based multiple-choice questions were selected from the prosthodontic and restorative dentistry sections of the United States Integrated National Board Dental Examination (INBDE) (n = 80) and the United Kingdom Overseas Registration Examination (ORE) (n = 111). These questions were inputted into the chatbots, and the AI-generated answers were compared with the official answer keys to determine their accuracy. Additionally, two dental specialists independently evaluated the rationales accompanying each chatbot response for accuracy, relevance, and comprehensiveness, categorizing them into four distinct ratings. Chi-square and post hoc Z-tests with Bonferroni adjustment were used to analyze the responses. The inter-rater reliability for evaluating the quality of the rationale ratings among specialists was assessed using Cohen’s Kappa (κ). Results: GPT-4.0 (65.4%; n = 125/191) demonstrated a significantly higher proportion of correctly answered multiple-choice questions when compared to Claude-2 (41.9%; n = 80/191) (p < 0.017) and Llama-2 (26.2%; n = 50/191) (p < 0.017). Significant differences were observed in the answer accuracy among all of the chatbots (p < 0.001). In terms of the rationale quality, GPT-4.0 (58.1%; n = 111/191) had a significantly higher proportion of “Correct Answer, Correct Rationale” than Claude-2 (37.2%; n = 71/191) (p < 0.017) and Llama-2 (24.1%; n = 46/191) (p < 0.017). Significant differences were observed in the rationale quality among all of the chatbots (p < 0.001). The inter-rater reliability was very high (κ = 0.83). Conclusions: GPT-4.0 demonstrated the highest accuracy and quality of reasoning in responding to prosthodontic and restorative dentistry questions. This underscores the varying efficacy of AI chatbots within specialized dental contexts. | - |
| dc.language | eng | - |
| dc.publisher | MDPI | - |
| dc.relation.ispartof | Dentistry Journal | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | artificial intelligence | - |
| dc.subject | deep learning | - |
| dc.subject | machine learning | - |
| dc.subject | mouth rehabilitation | - |
| dc.subject | natural language processing | - |
| dc.title | Evaluation of Chatbot Responses to Text-Based Multiple-Choice Questions in Prosthodontic and Restorative Dentistry | - |
| dc.type | Article | - |
| dc.description.nature | published_or_final_version | - |
| dc.identifier.doi | 10.3390/dj13070279 | - |
| dc.identifier.scopus | eid_2-s2.0-105011507292 | - |
| dc.identifier.volume | 13 | - |
| dc.identifier.issue | 7 | - |
| dc.identifier.eissn | 2304-6767 | - |
| dc.identifier.issnl | 2304-6767 | - |
