Appears in Collections: Conference Paper
| Title | Evaluation of large language models on phimosis - A readability, and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation] |
|---|---|
| Authors | Lee, ECN; Fung, ACH; Wong, CWY; Yeung, F; Chan, IHY; Chung, PHY; Wong, KKY |
| Issue Date | 24-May-2025 |
| Abstract | Aim: Phimosis is a very common paediatric surgical condition. Given the convenience brought by artificial intelligence, much medical information is now available through large language models (LLMs). In anticipation of more parents using LLMs to search for medical information in the near future, a readability, quality, and accuracy evaluation of the information provided by LLMs on phimosis was carried out. Methods: Six case questions on the definition, diagnosis, treatment options, and potential complications of phimosis were input into each LLM chatbot (ChatGPT-3.5, Gemini-1.5, and Claude-3). All responses were reformatted, shuffled, and recombined into three passages for single-blind accuracy scoring by three paediatric surgeons. The responses were also assessed for readability using the Gunning-Fog score, SMOG index, and Dale-Chall score, and for quality using the DISCERN score. Results: Comprehension of the information was considered difficult, with mean Gunning-Fog, SMOG, and Dale-Chall scores of 14.2+/-2.6, 11.3+/-2.2, and 9.9+/-0.9 respectively. The mean summed DISCERN score was 39.6+/-2 out of 80; Claude-3-sonnet was rated poor in quality while the others were fair. The mean accuracy score of the LLM chatbots was 4.4+/-0.4 out of 5. The three LLMs performed similarly on readability and accuracy. They generally scored well in describing treatment options and their mechanisms, but lacked references and information on treatment benefits, risks of alternative treatments, and impact on quality of life. Conclusion: The three LLM chatbots performed similarly and satisfactorily in terms of accuracy, yet their responses were of suboptimal quality and difficult for the public to comprehend. Combining them with other reliable data sources is recommended to improve clinical practice and decision-making. |
| Persistent Identifier | http://hdl.handle.net/10722/356636 |
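The abstract scores responses with the Gunning-Fog index and SMOG index. As a rough illustration of how such scores are computed, here is a minimal Python sketch of both, assuming the standard published formulas and a naive vowel-group syllable heuristic; the function names are my own, and production tools (and presumably the study) use more careful syllable and sentence detection.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Naive syllable estimate: count contiguous vowel groups (min 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (avg words per sentence + % complex words).
    A 'complex' word is taken here as one with three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def smog_index(text: str) -> float:
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

Both indices approximate the years of schooling needed to understand a passage on first reading, which is why the reported means of 14.2 (Gunning-Fog) and 11.3 (SMOG) are interpreted as "difficult" for a general audience.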
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Lee, ECN | - |
| dc.contributor.author | Fung, ACH | - |
| dc.contributor.author | Wong, CWY | - |
| dc.contributor.author | Yeung, F | - |
| dc.contributor.author | Chan, IHY | - |
| dc.contributor.author | Chung, PHY | - |
| dc.contributor.author | Wong, KKY | - |
| dc.date.accessioned | 2025-06-06T00:35:09Z | - |
| dc.date.available | 2025-06-06T00:35:09Z | - |
| dc.date.issued | 2025-05-24 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/356636 | - |
| dc.description.abstract | <p><b>Aim: </b>Phimosis is a very common paediatric surgical condition. Given the convenience brought by artificial intelligence, much medical information is now available through large language models (LLMs). In anticipation of more parents using LLMs to search for medical information in the near future, a readability, quality, and accuracy evaluation of the information provided by LLMs on phimosis was carried out.</p><p><b>Methods: </b>Six case questions on the definition, diagnosis, treatment options, and potential complications of phimosis were input into each LLM chatbot (ChatGPT-3.5, Gemini-1.5, and Claude-3). All responses were reformatted, shuffled, and recombined into three passages for single-blind accuracy scoring by three paediatric surgeons. The responses were also assessed for readability using the Gunning-Fog score, SMOG index, and Dale-Chall score, and for quality using the DISCERN score.</p><p><b>Results: </b>Comprehension of the information was considered difficult, with mean Gunning-Fog, SMOG, and Dale-Chall scores of 14.2+/-2.6, 11.3+/-2.2, and 9.9+/-0.9 respectively. The mean summed DISCERN score was 39.6+/-2 out of 80; Claude-3-sonnet was rated poor in quality while the others were fair. The mean accuracy score of the LLM chatbots was 4.4+/-0.4 out of 5. The three LLMs performed similarly on readability and accuracy. They generally scored well in describing treatment options and their mechanisms, but lacked references and information on treatment benefits, risks of alternative treatments, and impact on quality of life.</p><p><b>Conclusion: </b>The three LLM chatbots performed similarly and satisfactorily in terms of accuracy, yet their responses were of suboptimal quality and difficult for the public to comprehend. Combining them with other reliable data sources is recommended to improve clinical practice and decision-making.</p> | - |
| dc.language | eng | - |
| dc.relation.ispartof | 26th European Paediatric Surgeons' Association Congress (EUPSA2025) (21/05/2025-24/05/2025, Dubrovnik, Croatia) | - |
| dc.title | Evaluation of large language models on phimosis - A readability, and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation] | - |
| dc.type | Conference_Paper | - |
