Appears in Collections: Conference Paper
| Title | Evaluation of large language models on phimosis - A readability, and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation] |
|---|---|
| Authors | Lee, ECN; Fung, ACH; Wong, CWY; Yeung, F; Chan, IHY; Chung, PHY; Wong, KKY |
| Issue Date | 24-May-2025 |
| Abstract | Aim: Phimosis is a very common paediatric surgical condition. Given the convenience brought by artificial intelligence, much medical information is now available through large language models (LLMs). In anticipation of more parents using LLMs to search for medical information in the near future, a readability, quality, and accuracy evaluation of the information provided by LLMs on phimosis was carried out. Methods: Six case questions on the definition, diagnosis, treatment options, and potential complications of phimosis were input into each LLM chatbot (ChatGPT-3.5, Gemini-1.5, and Claude-3). All responses were reformatted, shuffled, and recombined into three passages for single-blind accuracy scoring by three paediatric surgeons. The responses were also assessed for readability using the Gunning-Fog score, SMOG index, and Dale-Chall score, and for quality using the DISCERN score. Results: Comprehension of the information was considered difficult, with mean Gunning-Fog, SMOG, and Dale-Chall scores of 14.2+/-2.6, 11.3+/-2.2, and 9.9+/-0.9 respectively. The mean summed DISCERN score was 39.6+/-2 out of 80; Claude-3-sonnet was rated poor in quality while the others were fair. The mean accuracy score of the LLM chatbots was 4.4+/-0.4 out of 5. The three LLMs performed similarly on readability and accuracy. They generally scored well in describing treatment options and their mechanisms, but lacked references and information on treatment benefits, risks of alternative treatments, and impact on quality of life. Conclusion: The three LLM chatbots performed similarly and satisfactorily in terms of accuracy, yet their responses were of suboptimal quality and difficult for the public to comprehend. Combining them with other reliable data sources is recommended to improve clinical practice and decision-making. |
| Persistent Identifier | http://hdl.handle.net/10722/356636 |
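The abstract scores responses with the Gunning-Fog index and SMOG index. As a rough illustration of how such scores are computed, here is a minimal Python sketch of both, assuming the standard published formulas and a naive vowel-group syllable heuristic; the function names are my own, and production tools (and presumably the study) use more careful syllable and sentence detection.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Naive syllable estimate: count contiguous vowel groups (min 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (avg words per sentence + % complex words).
    A 'complex' word is taken here as one with three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def smog_index(text: str) -> float:
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

Both indices approximate the years of schooling needed to understand a passage on first reading, which is why the reported means of 14.2 (Gunning-Fog) and 11.3 (SMOG) are interpreted as "difficult" for a general audience.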
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Lee, ECN | - |
| dc.contributor.author | Fung, ACH | - |
| dc.contributor.author | Wong, CWY | - |
| dc.contributor.author | Yeung, F | - |
| dc.contributor.author | Chan, IHY | - |
| dc.contributor.author | Chung, PHY | - |
| dc.contributor.author | Wong, KKY | - |
| dc.date.accessioned | 2025-06-06T00:35:09Z | - |
| dc.date.available | 2025-06-06T00:35:09Z | - |
| dc.date.issued | 2025-05-24 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/356636 | - |
| dc.description.abstract | <p><b>Aim: </b>Phimosis is a very common paediatric surgical condition. Given the convenience brought by artificial intelligence, much medical information is now available through large language models (LLMs). In anticipation of more parents using LLMs to search for medical information in the near future, a readability, quality, and accuracy evaluation of the information provided by LLMs on phimosis was carried out.</p><p><b>Methods: </b>Six case questions on the definition, diagnosis, treatment options, and potential complications of phimosis were input into each LLM chatbot (ChatGPT-3.5, Gemini-1.5, and Claude-3). All responses were reformatted, shuffled, and recombined into three passages for single-blind accuracy scoring by three paediatric surgeons. The responses were also assessed for readability using the Gunning-Fog score, SMOG index, and Dale-Chall score, and for quality using the DISCERN score.</p><p><b>Results: </b>Comprehension of the information was considered difficult, with mean Gunning-Fog, SMOG, and Dale-Chall scores of 14.2+/-2.6, 11.3+/-2.2, and 9.9+/-0.9 respectively. The mean summed DISCERN score was 39.6+/-2 out of 80; Claude-3-sonnet was rated poor in quality while the others were fair. The mean accuracy score of the LLM chatbots was 4.4+/-0.4 out of 5. The three LLMs performed similarly on readability and accuracy. They generally scored well in describing treatment options and their mechanisms, but lacked references and information on treatment benefits, risks of alternative treatments, and impact on quality of life.</p><p><b>Conclusion: </b>The three LLM chatbots performed similarly and satisfactorily in terms of accuracy, yet their responses were of suboptimal quality and difficult for the public to comprehend. Combining them with other reliable data sources is recommended to improve clinical practice and decision-making.</p> | - |
| dc.language | eng | - |
| dc.relation.ispartof | 26th European Paediatric Surgeons' Association Congress (EUPSA2025) (21/05/2025-24/05/2025, Dubrovnik, Croatia) | - |
| dc.title | Evaluation of large language models on phimosis - A readability, and accuracy comparative analysis based on CHATGPT-3.5, GEMINI-1.5, and CLAUDE-3 [Poster presentation] | - |
| dc.type | Conference_Paper | - |
