
Postgraduate thesis: Toward self-improving and sustainable large language models

Title: Toward self-improving and sustainable large language models
Authors: Li, Qintong (李沁桐)
Advisors: Kong, L; Kao, CM
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP) by training on vast amounts of human-generated data, enabling them to replicate many human capabilities with impressive competence. However, LLMs still fall short of achieving human-level or superhuman performance in critical domains such as mathematics, coding, and science. As the effectiveness of pretraining on existing human-generated data approaches its limits and the complexity of downstream applications continues to grow, there is an increasing need to rethink evaluation and data-generation processes so that they can evolve with the growing capabilities of language models. This dissertation discusses my research on developing versatile benchmarks and designing generalizable data algorithms to address these challenges.

First, I describe our efforts to examine the robustness of LLMs' mathematical reasoning capabilities by testing a wide range of question variations. I introduce the adversarial grade school math dataset (GSMPlus), where each math problem is augmented with eight mathematical perturbations. Extensive experiments show that while LLMs demonstrate varying levels of math reasoning ability, their performance lacks robustness. Even for problems they successfully solve, LLMs often fail when new statements are added or question targets are altered. The research also explores an iterative method that generates and verifies intermediate steps by composing existing prompting methods; however, this approach still fails to close the performance gap between standard benchmarks and our adversarial dataset. GSMPlus offers a nuanced interpretation of LLM performance that accounts for these behavioral inconsistencies.

Next, I discuss my research on whether LLMs can function as reliable alternatives to human evaluation for various benchmarks, focusing on their understanding of evaluation tasks and ability to meet diverse criteria. I examine three NLP tasks and 252 daily instruction-following tasks, each with distinct evaluation requirements. The analysis reveals that (1) LLM evaluators may introduce unnecessary criteria or omit critical ones, potentially biasing subsequent sample-wise evaluations, and (2) while LLM evaluators excel at language-level criteria, they struggle with reasoning-intensive criteria. Furthermore, incorporating human-in-the-loop evaluation can mitigate LLM biases and reduce human outliers, resulting in more objective assessments.

Finally, I describe an LLM self-play approach, ReverseGen, which automatically generates training samples that expose the weaknesses of LLMs based on their own evaluations. Unlike prior methods that rely on human-annotated data or predefined task templates, ReverseGen employs an optimizable generator trained to craft instructions that cause target models to fail. These challenging instructions serve as valuable training data, enabling models to address their shortcomings. At its core, ReverseGen establishes an iterative evaluation-improvement loop that enhances the model's overall performance. Flexible and adaptable to models of various scales, ReverseGen consistently produces fine-tuned models with improved performance, offering a novel approach to task-specific data synthesis for advancing LLM capabilities.

In conclusion, this thesis offers a deeper exploration of the capabilities of LLMs across diverse reasoning scenarios and uses these insights to inform the development of more effective data optimization strategies. It envisions the future of evaluation methods and data foundations for next-generation LLMs, emphasizing advancements in efficiency, interactivity, and robustness.
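
To make the perturbation idea above concrete, here is a minimal Python sketch (an illustration, not the thesis's GSMPlus pipeline) of expanding one seed grade-school problem into adversarial variants such as a changed number, an added distractor statement, or an altered question target. The perturbation names, the toy problem, and the hard-coded variants are assumptions for illustration; the dissertation itself defines eight perturbation types.

    # Minimal sketch of adversarial perturbation of a seed math problem.
    # Variants are hard-coded for this particular toy seed.
    from dataclasses import dataclass

    @dataclass
    class Problem:
        question: str
        answer: float

    SEED = Problem("Alice has 3 apples and buys 5 more. How many apples does she have now?", 8.0)

    def numerical_substitution(p: Problem) -> Problem:
        # Change one number in the question and update the reference answer (3 -> 4, so 4 + 5 = 9).
        return Problem(p.question.replace("3 apples", "4 apples"), 9.0)

    def distractor_insertion(p: Problem) -> Problem:
        # Append an irrelevant statement; the correct answer must stay the same.
        return Problem(p.question + " Her brother has 2 oranges.", p.answer)

    def question_alteration(p: Problem) -> Problem:
        # Ask for a different target quantity in the same scenario (8 - 3 = 5).
        return Problem("Alice has 3 apples and buys some more, ending up with 8 apples. How many did she buy?", 5.0)

    if __name__ == "__main__":
        for perturb in (numerical_substitution, distractor_insertion, question_alteration):
            variant = perturb(SEED)
            print(f"{perturb.__name__}: {variant.question} (answer: {variant.answer})")

A model that solves the seed but fails the distractor or altered-target variant illustrates the kind of behavioral inconsistency the abstract describes.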
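The meta-evaluation of LLM judges can be pictured in a similarly small sketch: comparing the criteria an LLM evaluator proposes for a task against a human-written reference checklist to surface omitted or extraneous criteria before sample-wise scoring. The function and the example criteria below are assumptions for illustration, not the thesis's actual analysis code.

    # Sketch (assumed) of a criteria-coverage check for an LLM judge.
    def criteria_gap(llm_criteria: set[str], reference_criteria: set[str]) -> dict:
        return {
            "omitted": reference_criteria - llm_criteria,      # critical criteria the judge missed
            "extraneous": llm_criteria - reference_criteria,   # unnecessary criteria it introduced
            "matched": llm_criteria & reference_criteria,
        }

    if __name__ == "__main__":
        reference = {"fluency", "factual accuracy", "instruction following", "reasoning correctness"}
        llm_proposed = {"fluency", "instruction following", "politeness"}
        print(criteria_gap(llm_proposed, reference))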
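Finally, the ReverseGen evaluation-improvement loop can be summarized schematically. The sketch below captures only an assumed control flow, not the thesis's implementation: a generator proposes instructions, the target model attempts them, an evaluator flags failures, and the failure-inducing instructions both drive the generator's update and feed the target model's fine-tuning. All component functions here are placeholder stubs.

    # Schematic sketch of a ReverseGen-style evaluation-improvement loop.
    import random
    from typing import Callable, List

    def reversegen_loop(
        generator: Callable[[int], List[str]],
        target_model: Callable[[str], str],
        evaluate: Callable[[str, str], bool],
        update_generator: Callable[[List[str]], None],
        finetune_target: Callable[[List[str]], None],
        rounds: int = 3,
        batch: int = 8,
    ) -> None:
        for _ in range(rounds):
            # 1. The generator proposes candidate instructions.
            instructions = generator(batch)
            # 2. The target model attempts each instruction; the evaluator flags failures.
            failures = [ins for ins in instructions if not evaluate(ins, target_model(ins))]
            # 3. Failure-inducing instructions go to the generator's update step
            #    (in the thesis, the generator is optimized to produce such instructions) ...
            update_generator(failures)
            # 4. ... and serve as training data for fine-tuning the target model.
            finetune_target(failures)

    if __name__ == "__main__":
        # Toy stand-ins so the sketch runs end to end.
        pool = [f"question {i}" for i in range(100)]
        reversegen_loop(
            generator=lambda n: random.sample(pool, n),
            target_model=lambda ins: "answer to " + ins,
            evaluate=lambda ins, out: random.random() > 0.3,   # pretend roughly 30% of attempts fail
            update_generator=lambda fails: None,
            finetune_target=lambda fails: None,
        )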
Degree: Doctor of Philosophy
Subject: Natural language processing (Computer science)
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/367417

 

DC Field: Value
dc.contributor.advisor: Kong, L
dc.contributor.advisor: Kao, CM
dc.contributor.author: Li, Qintong
dc.contributor.author: 李沁桐
dc.date.accessioned: 2025-12-11T06:41:50Z
dc.date.available: 2025-12-11T06:41:50Z
dc.date.issued: 2025
dc.identifier.citation: Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/367417
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Natural language processing (Computer science)
dc.title: Toward self-improving and sustainable large language models
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991045147147203414
