postgraduate thesis: Toward self-improving and sustainable large language models
| Title | Toward self-improving and sustainable large language models |
|---|---|
| Authors | Li, Qintong (李沁桐) |
| Advisors | Kong, L; Kao, CM |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Large language models (LLMs) have revolutionized natural language processing (NLP) by training on vast amounts of human-generated data, enabling them to replicate many human capabilities with impressive competence. However, LLMs still fall short of achieving human-level or superhuman performance in critical domains such as mathematics, coding, and science. As the effectiveness of pretraining on existing human-generated data approaches its limits and the complexity of downstream applications continues to grow, there is an increasing need to rethink the processes of evaluation and data generation that can evolve with the growing capabilities of language models. This dissertation discusses my research on developing versatile benchmarks and designing generalizable data algorithms to address these challenges.
First, I describe our efforts to examine the robustness of LLMs' mathematical reasoning capabilities by testing a wide range of question variations. I introduce the adversarial grade school math dataset (GSMPlus), where each math problem is augmented with eight mathematical perturbations. Extensive experiments show that while LLMs demonstrate varying levels of math reasoning ability, their performance lacks robustness. Even for problems they successfully solve, LLMs often fail when new statements are added or question targets are altered. The research also explores an iterative method that generates and verifies intermediate steps by composing existing prompting methods. However, this approach still fails to close the performance gap between standard benchmarks and our adversarial dataset. GSMPlus offers a nuanced interpretation of LLM performance that accounts for these behavioral inconsistencies.
Next, I discuss my research on whether LLMs can function as reliable alternatives to human evaluation for various benchmarks, focusing on their understanding of evaluation tasks and ability to meet diverse criteria. I examine three NLP tasks and 252 daily instruction-following tasks, each with distinct evaluation requirements. The analysis reveals that (1) LLM evaluators may introduce unnecessary criteria or omit critical ones, potentially biasing subsequent sample-wise evaluations, and (2) while LLM evaluators excel at language-level criteria, they struggle with reasoning-intensive criteria. Furthermore, it is found that incorporating human-in-the-loop evaluation can mitigate LLM biases and reduce human outliers, resulting in more objective assessments.
Finally, I describe an LLM self-play approach, ReverseGen, which automatically generates training samples that expose the weaknesses of LLMs based on their own evaluations. Unlike prior methods that rely on human-annotated data or predefined task templates, ReverseGen employs an optimizable generator trained to craft instructions that cause target models to fail. These challenging instructions serve as valuable training data, enabling models to address their shortcomings. At its core, ReverseGen establishes an iterative evaluation-improvement loop that enhances the model's overall performance. Flexible and adaptable to models of various scales, ReverseGen consistently produces fine-tuned models with improved performance, offering a novel approach to task-specific data synthesis for advancing LLM capabilities.
In conclusion, this thesis offers a deeper exploration of the capabilities of LLMs across diverse reasoning scenarios and utilizes these insights to inform the development of more effective data optimization strategies. It envisions the future of evaluation methods and data foundations for next-generation LLMs, emphasizing advancements in efficiency, interactivity, and robustness. |
| Degree | Doctor of Philosophy |
| Subject | Natural language processing (Computer science) |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/367417 |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Kong, L | - |
| dc.contributor.advisor | Kao, CM | - |
| dc.contributor.author | Li, Qintong | - |
| dc.contributor.author | 李沁桐 | - |
| dc.date.accessioned | 2025-12-11T06:41:50Z | - |
| dc.date.available | 2025-12-11T06:41:50Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Li, Q. [李沁桐]. (2025). Toward self-improving and sustainable large language models. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/367417 | - |
| dc.description.abstract | (abstract as given above) | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Natural language processing (Computer science) | - |
| dc.title | Toward self-improving and sustainable large language models | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045147147203414 | - |
