Article: A large language model based data generation framework to improve mild cognitive impairment detection sensitivity

Title: A large language model based data generation framework to improve mild cognitive impairment detection sensitivity
Authors: Han, Yang; Lam, Jacqueline CK; Li, Victor OK; Cheung, Lawrence YL
Keywords: biases reduction; counterfactual data generation; large language models; linguistic markers-based early-stage disease detection; mild cognitive impairment
Issue Date: 26-Mar-2025
Publisher: Cambridge University Press
Citation: Data & Policy, 2025, v. 7
Abstract: Recent studies utilizing AI-driven speech-based Alzheimer's disease (AD) detection have achieved remarkable success in detecting AD dementia through the analysis of audio and text data. However, detecting AD at the early stage of mild cognitive impairment (MCI) remains a challenging task due to the lack of sufficient training data and imbalanced diagnostic labels. Motivated by recent advances in Generative AI (GAI) and Large Language Models (LLMs), we propose an LLM-based data generation framework that leverages prior knowledge encoded in LLMs to generate new data samples. Our framework introduces two novel data generation strategies, cross-lingual and counterfactual data generation, which facilitate out-of-distribution learning over new data samples to reduce biases in MCI label prediction caused by the systematic underrepresentation of MCI subjects in the AD speech dataset. The results demonstrate that our proposed framework significantly improves MCI detection sensitivity and F1-score, on average by a maximum of 38% and 31%, respectively. Furthermore, key speech markers in predicting MCI before and after LLM-based data generation have been identified to enhance our understanding of how the novel data generation approach contributes to the reduction of MCI label prediction biases, shedding new light on speech-based MCI detection under low-data-resource constraints. Our proposed methodology offers a generalized data generation framework for improving downstream prediction tasks in cases where limited and/or imbalanced data have presented significant challenges to AI-driven health decision-making. Future studies can focus on incorporating more datasets and exploiting more acoustic features for speech-based MCI detection.
Persistent Identifier: http://hdl.handle.net/10722/366402

 

DC Field | Value | Language
dc.contributor.author | Han, Yang | -
dc.contributor.author | Lam, Jacqueline CK | -
dc.contributor.author | Li, Victor OK | -
dc.contributor.author | Cheung, Lawrence YL | -
dc.date.accessioned | 2025-11-25T04:19:13Z | -
dc.date.available | 2025-11-25T04:19:13Z | -
dc.date.issued | 2025-03-26 | -
dc.identifier.citation | Data & Policy, 2025, v. 7 | -
dc.identifier.uri | http://hdl.handle.net/10722/366402 | -
dc.description.abstract | Identical to the abstract given above | -
dc.language | eng | -
dc.publisher | Cambridge University Press | -
dc.relation.ispartof | Data & Policy | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject | biases reduction | -
dc.subject | counterfactual data generation | -
dc.subject | large language models | -
dc.subject | linguistic markers-based early-stage disease detection | -
dc.subject | mild cognitive impairment | -
dc.title | A large language model based data generation framework to improve mild cognitive impairment detection sensitivity | -
dc.type | Article | -
dc.identifier.doi | 10.1017/dap.2025.8 | -
dc.identifier.scopus | eid_2-s2.0-105001650180 | -
dc.identifier.volume | 7 | -
dc.identifier.eissn | 2632-3249 | -
dc.identifier.issnl | 2632-3249 | -
