
Conference Paper: Evaluation of large language models on phimosis - A readability and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation]

Title: Evaluation of large language models on phimosis - A readability and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation]
Authors: Lee, ECN; Fung, ACH; Wong, CWY; Yeung, F; Chan, IHY; Chung, PHY; Wong, KKY
Issue Date: 24-May-2025
Abstract

Aim: Phimosis is a very common paediatric surgical condition. Given the convenience brought by artificial intelligence, much medical information is now available through large language models (LLMs). In anticipation of more parents using LLMs to search for medical information in the near future, an evaluation of the readability, quality, and accuracy of the information LLMs provide on phimosis was carried out.

Methods: Six case questions on the definition, diagnosis, treatment options, and potential complications of phimosis were input into each LLM chatbot (ChatGPT-3.5, Gemini-1.5, and Claude-3). All responses were reformatted, shuffled, and recombined into three passages for single-blind accuracy scoring by three paediatric surgeons. The responses were also assessed for readability using the Gunning-Fog score, SMOG index, and Dale-Chall score, and for quality using the DISCERN score.
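
For reference, the three readability indices named above are standard formulas over word, sentence, and syllable counts. Below is a minimal sketch of how they could be computed on a chatbot response; the `textstat` Python library and the sample text are assumptions (the abstract does not name the tool used), but the underlying formulas are the published ones.

```python
# Sketch only: computing the three readability indices from the Methods
# section on a sample chatbot-style response. `textstat` is an assumed
# tool choice; the formulas in the comments are the standard ones.
import textstat

response = (
    "Phimosis is a condition in which the foreskin cannot be fully "
    "retracted over the glans penis. It is common in young boys and often "
    "resolves without treatment. Topical corticosteroid cream is a typical "
    "first-line option, with circumcision reserved for refractory cases."
)

# Gunning-Fog: 0.4 * (words/sentences + 100 * complex_words/words)
print("Gunning-Fog:", textstat.gunning_fog(response))

# SMOG: 1.0430 * sqrt(polysyllables * 30/sentences) + 3.1291
print("SMOG index:", textstat.smog_index(response))

# Dale-Chall: 0.1579 * pct_difficult_words + 0.0496 * words/sentences,
# plus 3.6365 when over 5% of words are off the Dale-Chall easy-word list
print("Dale-Chall:", textstat.dale_chall_readability_score(response))
```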

Results: Comprehension of the information was considered difficult, with mean Gunning-Fog, SMOG index, and Dale-Chall scores of 14.2 ± 2.6, 11.3 ± 2.2, and 9.9 ± 0.9 respectively. The mean summed DISCERN score was 39.6 ± 2 out of 80; Claude-3-sonnet was considered of poor quality, while the others were fair. The mean accuracy score of the LLM chatbots was 4.4 ± 0.4 out of 5. The three LLMs performed similarly on readability and accuracy. They generally scored well in presenting treatment options and their mechanisms, but lacked references and information on treatment benefits, the risks of alternative treatments, and the impact on quality of life.
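
For context, the quality labels above ("poor", "fair") follow the banding convention commonly applied to summed DISCERN scores (16 items rated 1-5, giving a total of 16-80). The cut-offs in this sketch are that common convention, not values stated in the abstract.

```python
# Sketch of the commonly used DISCERN banding (assumed cut-offs; the
# abstract reports only the summed score out of 80).
def discern_band(total: float) -> str:
    """Map a summed DISCERN score (16 items x 1-5 = 16-80) to a quality band."""
    if total >= 63:
        return "excellent"
    if total >= 51:
        return "good"
    if total >= 39:
        return "fair"
    if total >= 27:
        return "poor"
    return "very poor"

print(discern_band(39.6))  # "fair" -- matches the reported mean summed score
```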

Conclusion: The three LLM chatbots performed similarly and satisfactorily in terms of accuracy, yet their responses were of suboptimal quality and difficult for the public to comprehend. It is recommended that LLM output be combined with other reliable data sources to improve clinical practice and decision-making.


Persistent Identifier: http://hdl.handle.net/10722/356636

DC Field | Value | Language
dc.contributor.author | Lee, ECN | -
dc.contributor.author | Fung, ACH | -
dc.contributor.author | Wong, CWY | -
dc.contributor.author | Yeung, F | -
dc.contributor.author | Chan, IHY | -
dc.contributor.author | Chung, PHY | -
dc.contributor.author | Wong, KKY | -
dc.date.accessioned | 2025-06-06T00:35:09Z | -
dc.date.available | 2025-06-06T00:35:09Z | -
dc.date.issued | 2025-05-24 | -
dc.identifier.uri | http://hdl.handle.net/10722/356636 | -
dc.description.abstract | (abstract as above) | -
dc.language | eng | -
dc.relation.ispartof | 26th European Paediatric Surgeons' Association Congress (EUPSA2025) (21/05/2025-24/05/2025, Dubrovnik, Croatia) | -
dc.title | Evaluation of large language models on phimosis - A readability and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation] | -
dc.type | Conference_Paper | -
