postgraduate thesis: Toward self-improving and sustainable large language models
| Title | Toward self-improving and sustainable large language models |
|---|---|
| Authors | Li, Qintong (李沁桐) |
| Advisors | Kong, L; Kao, CM |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Large language models (LLMs) have revolutionized natural language processing (NLP) by training on vast amounts of human-generated data, enabling them to replicate many human capabilities with impressive competence. However, LLMs still fall short of achieving human-level or superhuman performance in critical domains such as mathematics, coding, and science. As the effectiveness of pretraining on existing human-generated data approaches its limits and the complexity of downstream applications continues to grow, there is an increasing need to rethink the processes of evaluation and data generation that can evolve with the growing capabilities of language models. This dissertation discusses my research on developing versatile benchmarks and designing generalizable data algorithms to address these challenges.
First, I describe our efforts to examine the robustness of LLMs' mathematical reasoning capabilities by testing a wide range of question variations. I introduce the adversarial grade school math dataset (GSMPlus), where each math problem is augmented with eight mathematical perturbations. Extensive experiments show that while LLMs demonstrate varying levels of math reasoning ability, their performance lacks robustness. Even for problems they successfully solve, LLMs often fail when new statements are added or question targets are altered. The research also explores an iterative method that generates and verifies intermediate steps by composing existing prompting methods. However, this approach still fails to close the performance gap between standard benchmarks and our adversarial dataset. GSMPlus offers a nuanced interpretation of LLM performance that accounts for these behavioral inconsistencies.
Next, I discuss my research on whether LLMs can function as reliable alternatives to human evaluation for various benchmarks, focusing on their understanding of evaluation tasks and ability to meet diverse criteria. I examine three NLP tasks and 252 daily instruction-following tasks, each with distinct evaluation requirements. The analysis reveals that (1) LLM evaluators may introduce unnecessary criteria or omit critical ones, potentially biasing subsequent sample-wise evaluations, and (2) while LLM evaluators excel at language-level criteria, they struggle with reasoning-intensive criteria. Furthermore, it is found that incorporating human-in-the-loop evaluation can mitigate LLM biases and reduce human outliers, resulting in more objective assessments.
Finally, I describe an LLM self-play approach, ReverseGen, which automatically generates training samples that expose the weaknesses of LLMs based on their own evaluations. Unlike prior methods that rely on human-annotated data or predefined task templates, ReverseGen employs an optimizable generator trained to craft instructions that cause target models to fail. These challenging instructions serve as valuable training data, enabling models to address their shortcomings. At its core, ReverseGen establishes an iterative evaluation-improvement loop that enhances the model's overall performance. Flexible and adaptable to models of various scales, ReverseGen consistently produces fine-tuned models with improved performance, offering a novel approach to task-specific data synthesis for advancing LLM capabilities.
In conclusion, this thesis offers a deeper exploration of the capabilities of LLMs across diverse reasoning scenarios and utilizes these insights to inform the development of more effective data optimization strategies. It envisions the future of evaluation methods and data foundations for next-generation LLMs, emphasizing advancements in efficiency, interactivity, and robustness. |
| Degree | Doctor of Philosophy |
| Subject | Natural language processing (Computer science) |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/367417 |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Kong, L | - |
| dc.contributor.advisor | Kao, CM | - |
| dc.contributor.author | Li, Qintong | - |
| dc.contributor.author | 李沁桐 | - |
| dc.date.accessioned | 2025-12-11T06:41:50Z | - |
| dc.date.available | 2025-12-11T06:41:50Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367417 | - |
| dc.description.abstract | (abstract as given above) | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | Toward self-improving and sustainable large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147147203414 | - |
