Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1017/dap.2025.8
- Scopus: eid_2-s2.0-105001650180
Article: A large language model based data generation framework to improve mild cognitive impairment detection sensitivity
| Title | A large language model based data generation framework to improve mild cognitive impairment detection sensitivity |
|---|---|
| Authors | Han, Yang; Lam, Jacqueline CK; Li, Victor OK; Cheung, Lawrence YL |
| Keywords | biases reduction; counterfactual data generation; large language models; linguistic markers-based early-stage disease detection; mild cognitive impairment |
| Issue Date | 26-Mar-2025 |
| Publisher | Cambridge University Press |
| Citation | Data & Policy, 2025, v. 7 |
| Abstract | Recent studies utilizing AI-driven speech-based Alzheimer's disease (AD) detection have achieved remarkable success in detecting AD dementia through the analysis of audio and text data. However, detecting AD at the early stage of mild cognitive impairment (MCI) remains a challenging task, owing to insufficient training data and imbalanced diagnostic labels. Motivated by recent advances in Generative AI (GAI) and Large Language Models (LLMs), we propose an LLM-based data generation framework that leverages prior knowledge encoded in LLMs to generate new data samples. The framework introduces two novel data generation strategies, cross-lingual and counterfactual data generation, which facilitate out-of-distribution learning over new data samples and reduce the biases in MCI label prediction caused by the systematic underrepresentation of MCI subjects in the AD speech dataset. The results demonstrate that the proposed framework significantly improves MCI detection sensitivity and F1-score, on average, by up to 38% and 31%, respectively. Furthermore, key speech markers for predicting MCI before and after LLM-based data generation are identified to clarify how the data generation approach contributes to reducing MCI label prediction biases, shedding new light on speech-based MCI detection under low-resource data constraints. The proposed methodology offers a generalized data generation framework for improving downstream prediction tasks where limited and/or imbalanced data pose significant challenges to AI-driven health decision-making. Future studies can focus on incorporating more datasets and exploiting more acoustic features for speech-based MCI detection. |
| Persistent Identifier | http://hdl.handle.net/10722/366402 |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Han, Yang | - |
| dc.contributor.author | Lam, Jacqueline CK | - |
| dc.contributor.author | Li, Victor OK | - |
| dc.contributor.author | Cheung, Lawrence YL | - |
| dc.date.accessioned | 2025-11-25T04:19:13Z | - |
| dc.date.available | 2025-11-25T04:19:13Z | - |
| dc.date.issued | 2025-03-26 | - |
| dc.identifier.citation | Data & Policy, 2025, v. 7 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/366402 | - |
| dc.description.abstract | Recent studies utilizing AI-driven speech-based Alzheimer's disease (AD) detection have achieved remarkable success in detecting AD dementia through the analysis of audio and text data. However, detecting AD at the early stage of mild cognitive impairment (MCI) remains a challenging task, owing to insufficient training data and imbalanced diagnostic labels. Motivated by recent advances in Generative AI (GAI) and Large Language Models (LLMs), we propose an LLM-based data generation framework that leverages prior knowledge encoded in LLMs to generate new data samples. The framework introduces two novel data generation strategies, cross-lingual and counterfactual data generation, which facilitate out-of-distribution learning over new data samples and reduce the biases in MCI label prediction caused by the systematic underrepresentation of MCI subjects in the AD speech dataset. The results demonstrate that the proposed framework significantly improves MCI detection sensitivity and F1-score, on average, by up to 38% and 31%, respectively. Furthermore, key speech markers for predicting MCI before and after LLM-based data generation are identified to clarify how the data generation approach contributes to reducing MCI label prediction biases, shedding new light on speech-based MCI detection under low-resource data constraints. The proposed methodology offers a generalized data generation framework for improving downstream prediction tasks where limited and/or imbalanced data pose significant challenges to AI-driven health decision-making. Future studies can focus on incorporating more datasets and exploiting more acoustic features for speech-based MCI detection. | - |
| dc.language | eng | - |
| dc.publisher | Cambridge University Press | - |
| dc.relation.ispartof | Data & Policy | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject | biases reduction | - |
| dc.subject | counterfactual data generation | - |
| dc.subject | large language models | - |
| dc.subject | linguistic markers-based early-stage disease detection | - |
| dc.subject | mild cognitive impairment | - |
| dc.title | A large language model based data generation framework to improve mild cognitive impairment detection sensitivity | - |
| dc.type | Article | - |
| dc.identifier.doi | 10.1017/dap.2025.8 | - |
| dc.identifier.scopus | eid_2-s2.0-105001650180 | - |
| dc.identifier.volume | 7 | - |
| dc.identifier.eissn | 2632-3249 | - |
| dc.identifier.issnl | 2632-3249 | - |
