Postgraduate thesis: Visual prediction and generation via diffusion models and beyond

Title: Visual prediction and generation via diffusion models and beyond
Authors: Chen, Shoufa (陳守法)
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Deep learning has revolutionized computer vision, enabling advances in both visual recognition and generation across applications such as autonomous systems, entertainment, and healthcare. However, unlike natural language processing (NLP), where unified frameworks have emerged, computer vision remains fragmented, with specialized models for tasks like image classification, object detection, and video generation. This thesis seeks to bridge this fragmentation by proposing methods that unify visual recognition and generation tasks under a single framework. The first contribution is DiffusionDet, a generative approach to object detection that treats bounding box refinement as a noise-to-box process using denoising diffusion models. This framework eliminates the need for heuristic priors, improving flexibility and scalability in detection tasks. In the realm of generative models, we introduce GenTron, a transformer-based architecture that integrates diffusion models for scalable, high-quality image and video generation. By leveraging free-form text captions as conditioning inputs, GenTron outperforms traditional CNN-based models and extends to video generation. We expand upon this with Goku, a flow-based video generation model that incorporates transformers for enhanced scalability and computational efficiency in large-scale settings. Additionally, we explore LlamaGen, an autoregressive image generation model based on the next-token prediction paradigm of large language models (LLMs). By focusing on scalable training and efficient tokenization, LlamaGen achieves competitive performance without relying on explicit inductive biases, setting a new standard for fast and flexible image generation. Finally, we address scalability challenges in dense prediction tasks with AdaptFormer and CycleMLP. AdaptFormer enhances the adaptability of Vision Transformers through lightweight modules, optimizing transfer learning for resource-constrained environments. CycleMLP tackles the computational burden of dense prediction by introducing a local window mechanism, making it efficient for both high-resolution and temporal data. Collectively, these contributions advance the unification of vision tasks, offering innovative frameworks and architectures that address key challenges and pave the way for scalable, adaptive solutions across a wide range of applications.
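To make the noise-to-box idea above concrete, here is a minimal sketch of one deterministic (DDIM-style) denoising step applied to box coordinates. It is an illustration of the general technique only, not the thesis's implementation; the stand-in for a learned detection head (here a toy shrinkage `pred = boxes * 0.5`) and the toy noise schedule are assumptions.

```python
# Minimal sketch: one DDIM-style denoising step over bounding boxes,
# illustrating a "noise-to-box" detector. Not the thesis code.
import numpy as np

def ddim_step(noisy_boxes, pred_boxes, alpha_t, alpha_prev):
    """Move noisy boxes one step toward the predicted clean boxes.

    noisy_boxes: (N, 4) boxes carried over from the previous step
    pred_boxes:  (N, 4) clean boxes predicted by the detection head
    alpha_t, alpha_prev: cumulative noise-schedule terms at steps t and t-1
    """
    # Recover the noise implied by the current boxes and the prediction
    # (from x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps).
    eps = (noisy_boxes - np.sqrt(alpha_t) * pred_boxes) / np.sqrt(1.0 - alpha_t)
    # Re-noise the predicted boxes at the earlier, less noisy level.
    return np.sqrt(alpha_prev) * pred_boxes + np.sqrt(1.0 - alpha_prev) * eps

# Toy usage: start from pure Gaussian "boxes" and refine over a few steps.
rng = np.random.default_rng(0)
boxes = rng.normal(size=(100, 4))          # random (cx, cy, w, h) proposals
alphas = np.linspace(0.05, 0.95, 8)        # toy schedule: high noise -> low noise
for t in range(len(alphas) - 1):
    pred = boxes * 0.5                     # stand-in for box_head(features, boxes, t)
    boxes = ddim_step(boxes, pred, alphas[t], alphas[t + 1])
```

In an actual detector the prediction at each step would come from a network conditioned on image features and the timestep; the sketch only shows how random boxes are iteratively pulled toward predictions without anchor or proposal priors.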
Degree: Doctor of Philosophy
Subjects: Computer vision; Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/364036
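The abstract also describes AdaptFormer's lightweight adapter modules for Vision Transformers. Below is a minimal PyTorch sketch of a parallel bottleneck adapter in that spirit; the class and parameter names (`ParallelAdapter`, `bottleneck`, `scale`) are illustrative assumptions, not the thesis code.

```python
# Minimal sketch: an adapter-style bottleneck branch added in parallel to a
# frozen transformer MLP block, so only the small adapter is trained.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # d -> r down-projection
        self.up = nn.Linear(bottleneck, dim)     # r -> d up-projection
        self.act = nn.ReLU()
        self.scale = scale                       # scaling factor on the adapter branch
        nn.init.zeros_(self.up.weight)           # start as an identity-preserving branch
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor, frozen_mlp: nn.Module) -> torch.Tensor:
        # Frozen pretrained path plus lightweight trainable path, summed.
        return frozen_mlp(x) + self.scale * self.up(self.act(self.down(x)))

# Toy usage with a stand-in for a pretrained (frozen) transformer MLP.
mlp = nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
for p in mlp.parameters():
    p.requires_grad = False                      # freeze the backbone weights
adapter = ParallelAdapter(dim=384)
tokens = torch.randn(2, 197, 384)                # (batch, tokens, dim), ViT-style
out = adapter(tokens, mlp)                       # same shape, adapted features
```

Because the up-projection starts at zero, the adapted model initially behaves exactly like the frozen backbone, and the trainable parameter count stays small (roughly 2 * dim * bottleneck per block), which is what makes this style of transfer learning attractive in resource-constrained settings.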


DC Field | Value | Language
dc.contributor.author | Chen, Shoufa | -
dc.contributor.author | 陳守法 | -
dc.date.accessioned | 2025-10-20T02:56:41Z | -
dc.date.available | 2025-10-20T02:56:41Z | -
dc.date.issued | 2025 | -
dc.identifier.citation | Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/364036 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Computer vision | -
dc.subject.lcsh | Machine learning | -
dc.title | Visual prediction and generation via diffusion models and beyond | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2025 | -
dc.identifier.mmsid | 991045117251603414 | -
