Appears in Collections: postgraduate thesis: Visual prediction and generation via diffusion models and beyond
| Title | Visual prediction and generation via diffusion models and beyond |
|---|---|
| Authors | Chen, Shoufa [陳守法] |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Deep learning has revolutionized computer vision, enabling advances in both visual recognition and generation across various applications such as autonomous systems, entertainment, and healthcare. However, unlike natural language processing (NLP), where unified frameworks have emerged, computer vision remains fragmented, with specialized models for tasks like image classification, object detection, and video generation. This thesis seeks to bridge this fragmentation by proposing methods that unify visual recognition and generation tasks under a single framework. The first contribution of this work is DiffusionDet, a generative approach to object detection that treats bounding box refinement as a noise-to-box process using denoising diffusion models. This novel framework eliminates the need for heuristic priors, improving flexibility and scalability in detection tasks. In the realm of generative models, we introduce GenTron, a transformer-based architecture that integrates diffusion models for scalable and high-quality image and video generation. By leveraging free-form text captions as conditioning inputs, GenTron outperforms traditional CNN-based models and extends its capabilities to video generation. We further expand upon this with Goku, a flow-based video generation model, which incorporates transformers for enhanced scalability and computational efficiency in large-scale settings. Additionally, we explore LlamaGen, an autoregressive image generation model based on the next-token prediction paradigm from large language models (LLMs). By focusing on scalable training and efficient tokenization, LlamaGen achieves competitive performance without relying on explicit inductive biases, setting a new standard for fast and flexible image generation. Finally, we address scalability challenges in dense prediction tasks with AdaptFormer and CycleMLP. AdaptFormer enhances the adaptability of Vision Transformers through lightweight modules, optimizing transfer learning for resource-constrained environments. CycleMLP tackles the computational burden of dense predictions by introducing a local window mechanism, making it efficient for both high-resolution and temporal data. These contributions advance the unification of vision tasks, offering scalable and efficient models for a wide range of applications. Collectively, these contributions advance the field of computer vision by offering innovative frameworks and architectures that address key challenges, paving the way for unified, adaptive, and scalable solutions across a range of applications. |
| Degree | Doctor of Philosophy |
| Subject | Computer vision; Machine learning |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/364036 |
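
The sketches below illustrate, in highly simplified form, the mechanisms the abstract names; all are toy stand-ins rather than the authors' implementations. First, the "noise-to-box" formulation of DiffusionDet: boxes start as pure Gaussian noise and are iteratively refined by a denoising model. `ToyBoxDenoiser` is a hypothetical stand-in for the paper's detection decoder, and the update rule is a simplified denoising step, not the published schedule.

```python
import torch
import torch.nn as nn

class ToyBoxDenoiser(nn.Module):
    """Hypothetical stand-in for a detection decoder: maps noisy boxes
    plus a timestep to a prediction of the clean boxes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, noisy_boxes: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # noisy_boxes: (N, 4) unconstrained coordinates; t: (1,) timestep in (0, 1]
        t_feat = t.expand(noisy_boxes.shape[0], 1)
        return self.net(torch.cat([noisy_boxes, t_feat], dim=-1))

@torch.no_grad()
def noise_to_box(model: nn.Module, num_boxes: int = 16, steps: int = 4) -> torch.Tensor:
    """Sample boxes from pure Gaussian noise, refining over a few steps."""
    boxes = torch.randn(num_boxes, 4)                # no anchors or heuristic priors
    for i in reversed(range(steps)):
        t = torch.tensor([(i + 1) / steps])
        x0_pred = model(boxes, t)                    # predict the "clean" boxes
        boxes = boxes + (x0_pred - boxes) / (i + 1)  # simplified denoising move
    return boxes.sigmoid()                           # squash to normalized (cx, cy, w, h)

print(noise_to_box(ToyBoxDenoiser()).shape)          # torch.Size([16, 4])
```

The key property the abstract highlights is visible even in the toy: nothing about the initial boxes is hand-designed, so the number and placement of proposals can change at inference time without retraining heuristics.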
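For the flow-based generation the abstract attributes to Goku, one common formulation (assumed here, since the abstract does not spell out the objective) is rectified flow: a network regresses the constant velocity along a straight noise-to-data path, and sampling integrates that velocity with Euler steps. The tiny MLP stands in for the transformer backbone, and a video latent is flattened to a vector purely for illustration.

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Hypothetical stand-in for the transformer: predicts velocity v(x, t)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (B, dim) flattened latent; t: (B, 1) time in [0, 1]
        return self.net(torch.cat([x, t], dim=-1))

def rectified_flow_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity (x1 - x0) along x_t = (1 - t) x0 + t x1."""
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)                   # random time per sample
    xt = (1 - t) * x0 + t * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(model: nn.Module, dim: int = 8, steps: int = 8) -> torch.Tensor:
    """Euler-integrate the learned velocity from noise (t=0) to data (t=1)."""
    x = torch.randn(1, dim)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + model(x, t) / steps                  # x += v(x, t) * dt
    return x

model = ToyVelocityNet()
print(rectified_flow_loss(model, torch.randn(4, 8)).item())
print(sample(model).shape)                           # torch.Size([1, 8])
```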
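LlamaGen's next-token paradigm, as the abstract describes it, amounts to treating an image as a sequence of discrete tokenizer codes and sampling them one at a time with a causal language model. The sketch below assumes a toy causal transformer and omits the image tokenizer/de-tokenizer entirely; names like `ToyImageLM` are illustrative, not LlamaGen's actual API.

```python
import torch
import torch.nn as nn

class ToyImageLM(nn.Module):
    """Hypothetical causal LM over discrete image tokens (vocab = codebook size)."""
    def __init__(self, vocab: int = 1024, dim: int = 64, seq_len: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        B, L = ids.shape
        h = self.tok(ids) + self.pos(torch.arange(L))
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(h, mask=causal)              # each token attends only to its past
        return self.head(h)                          # (B, L, vocab) next-token logits

@torch.no_grad()
def generate(model: ToyImageLM, num_tokens: int = 16) -> torch.Tensor:
    ids = torch.zeros(1, 1, dtype=torch.long)        # a start token (index 0)
    for _ in range(num_tokens):
        logits = model(ids)[:, -1]                   # distribution over the next token
        nxt = torch.multinomial(logits.softmax(-1), 1)
        ids = torch.cat([ids, nxt], dim=1)           # append the sampled image token
    return ids[:, 1:]                                # codebook indices for de-tokenizing

print(generate(ToyImageLM()).shape)                  # torch.Size([1, 16])
```

Note how nothing here is image-specific: the "no explicit inductive biases" point in the abstract corresponds to reusing the plain LLM recipe, with the image structure carried entirely by the tokenizer.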
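Finally, the "lightweight modules" credited to AdaptFormer can be pictured as a small bottleneck branch attached to a frozen transformer MLP, so only a tiny fraction of parameters train during transfer. The placement, dimensions, and scaling factor below are illustrative simplifications; CycleMLP's windowed dense-prediction mechanism is not sketched here.

```python
import torch
import torch.nn as nn

class AdapterMLP(nn.Module):
    """Frozen MLP plus a trainable down-project/up-project bottleneck."""
    def __init__(self, dim: int = 192, bottleneck: int = 16, scale: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.mlp.parameters():
            p.requires_grad = False                  # pretrained backbone stays frozen
        self.down = nn.Linear(dim, bottleneck)       # trainable adapter branch
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # parallel design: the scaled adapter output is added to the frozen MLP path
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))

m = AdapterMLP()
trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for p in m.parameters())
print(f"trainable params: {trainable}/{total}")      # a small fraction of the block
```
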
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Chen, Shoufa | - |
| dc.contributor.author | 陳守法 | - |
| dc.date.accessioned | 2025-10-20T02:56:41Z | - |
| dc.date.available | 2025-10-20T02:56:41Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/364036 | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Computer vision | - |
| dc.subject.lcsh | Machine learning | - |
| dc.title | Visual prediction and generation via diffusion models and beyond | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045117251603414 | - |
