Appears in Collections: postgraduate thesis: Visual prediction and generation via diffusion models and beyond
| Title | Visual prediction and generation via diffusion models and beyond |
|---|---|
| Authors | Chen, Shoufa [陳守法] |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Deep learning has revolutionized computer vision, enabling advances in both visual recognition and generation across various applications such as autonomous systems, entertainment, and healthcare. However, unlike natural language processing (NLP), where unified frameworks have emerged, computer vision remains fragmented, with specialized models for tasks like image classification, object detection, and video generation. This thesis seeks to bridge this fragmentation by proposing methods that unify visual recognition and generation tasks under a single framework. The first contribution of this work is DiffusionDet, a generative approach to object detection that treats bounding box refinement as a noise-to-box process using denoising diffusion models. This novel framework eliminates the need for heuristic priors, improving flexibility and scalability in detection tasks. In the realm of generative models, we introduce GenTron, a transformer-based architecture that integrates diffusion models for scalable and high-quality image and video generation. By leveraging free-form text captions as conditioning inputs, GenTron outperforms traditional CNN-based models and extends its capabilities to video generation. We further expand upon this with Goku, a flow-based video generation model, which incorporates transformers for enhanced scalability and computational efficiency in large-scale settings. Additionally, we explore LlamaGen, an autoregressive image generation model based on the next-token prediction paradigm from large language models (LLMs). By focusing on scalable training and efficient tokenization, LlamaGen achieves competitive performance without relying on explicit inductive biases, setting a new standard for fast and flexible image generation. Finally, we address scalability challenges in dense prediction tasks with AdaptFormer and CycleMLP. AdaptFormer enhances the adaptability of Vision Transformers through lightweight modules, optimizing transfer learning for resource-constrained environments. CycleMLP tackles the computational burden of dense predictions by introducing a local window mechanism, making it efficient for both high-resolution and temporal data. These contributions advance the unification of vision tasks, offering scalable and efficient models for a wide range of applications. Collectively, these contributions advance the field of computer vision by offering innovative frameworks and architectures that address key challenges, paving the way for unified, adaptive, and scalable solutions across a range of applications. |
| Degree | Doctor of Philosophy |
| Subject | Computer vision; Machine learning |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/364036 |
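
The sketches below illustrate, in highly simplified form, the mechanisms the abstract names; all are toy stand-ins rather than the authors' implementations. First, the "noise-to-box" formulation of DiffusionDet: boxes start as pure Gaussian noise and are iteratively refined by a denoising model. `ToyBoxDenoiser` is a hypothetical stand-in for the paper's detection decoder, and the update rule is a simplified denoising step, not the published schedule.

```python
import torch
import torch.nn as nn

class ToyBoxDenoiser(nn.Module):
    """Hypothetical stand-in for a detection decoder: maps noisy boxes
    plus a timestep to a prediction of the clean boxes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, noisy_boxes: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # noisy_boxes: (N, 4) unconstrained coordinates; t: (1,) timestep in (0, 1]
        t_feat = t.expand(noisy_boxes.shape[0], 1)
        return self.net(torch.cat([noisy_boxes, t_feat], dim=-1))

@torch.no_grad()
def noise_to_box(model: nn.Module, num_boxes: int = 16, steps: int = 4) -> torch.Tensor:
    """Sample boxes from pure Gaussian noise, refining over a few steps."""
    boxes = torch.randn(num_boxes, 4)                # no anchors or heuristic priors
    for i in reversed(range(steps)):
        t = torch.tensor([(i + 1) / steps])
        x0_pred = model(boxes, t)                    # predict the "clean" boxes
        boxes = boxes + (x0_pred - boxes) / (i + 1)  # simplified denoising move
    return boxes.sigmoid()                           # squash to normalized (cx, cy, w, h)

print(noise_to_box(ToyBoxDenoiser()).shape)          # torch.Size([16, 4])
```

The key property the abstract highlights is visible even in the toy: nothing about the initial boxes is hand-designed, so the number and placement of proposals can change at inference time without retraining heuristics.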
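For the flow-based generation the abstract attributes to Goku, one common formulation (assumed here, since the abstract does not spell out the objective) is rectified flow: a network regresses the constant velocity along a straight noise-to-data path, and sampling integrates that velocity with Euler steps. The tiny MLP stands in for the transformer backbone, and a video latent is flattened to a vector purely for illustration.

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Hypothetical stand-in for the transformer: predicts velocity v(x, t)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x: (B, dim) flattened latent; t: (B, 1) time in [0, 1]
        return self.net(torch.cat([x, t], dim=-1))

def rectified_flow_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity (x1 - x0) along x_t = (1 - t) x0 + t x1."""
    x0 = torch.randn_like(x1)                        # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)                   # random time per sample
    xt = (1 - t) * x0 + t * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(model: nn.Module, dim: int = 8, steps: int = 8) -> torch.Tensor:
    """Euler-integrate the learned velocity from noise (t=0) to data (t=1)."""
    x = torch.randn(1, dim)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + model(x, t) / steps                  # x += v(x, t) * dt
    return x

model = ToyVelocityNet()
print(rectified_flow_loss(model, torch.randn(4, 8)).item())
print(sample(model).shape)                           # torch.Size([1, 8])
```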
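LlamaGen's next-token paradigm, as the abstract describes it, amounts to treating an image as a sequence of discrete tokenizer codes and sampling them one at a time with a causal language model. The sketch below assumes a toy causal transformer and omits the image tokenizer/de-tokenizer entirely; names like `ToyImageLM` are illustrative, not LlamaGen's actual API.

```python
import torch
import torch.nn as nn

class ToyImageLM(nn.Module):
    """Hypothetical causal LM over discrete image tokens (vocab = codebook size)."""
    def __init__(self, vocab: int = 1024, dim: int = 64, seq_len: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(seq_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        B, L = ids.shape
        h = self.tok(ids) + self.pos(torch.arange(L))
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(h, mask=causal)              # each token attends only to its past
        return self.head(h)                          # (B, L, vocab) next-token logits

@torch.no_grad()
def generate(model: ToyImageLM, num_tokens: int = 16) -> torch.Tensor:
    ids = torch.zeros(1, 1, dtype=torch.long)        # a start token (index 0)
    for _ in range(num_tokens):
        logits = model(ids)[:, -1]                   # distribution over the next token
        nxt = torch.multinomial(logits.softmax(-1), 1)
        ids = torch.cat([ids, nxt], dim=1)           # append the sampled image token
    return ids[:, 1:]                                # codebook indices for de-tokenizing

print(generate(ToyImageLM()).shape)                  # torch.Size([1, 16])
```

Note how nothing here is image-specific: the "no explicit inductive biases" point in the abstract corresponds to reusing the plain LLM recipe, with the image structure carried entirely by the tokenizer.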
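Finally, the "lightweight modules" credited to AdaptFormer can be pictured as a small bottleneck branch attached to a frozen transformer MLP, so only a tiny fraction of parameters train during transfer. The placement, dimensions, and scaling factor below are illustrative simplifications; CycleMLP's windowed dense-prediction mechanism is not sketched here.

```python
import torch
import torch.nn as nn

class AdapterMLP(nn.Module):
    """Frozen MLP plus a trainable down-project/up-project bottleneck."""
    def __init__(self, dim: int = 192, bottleneck: int = 16, scale: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.mlp.parameters():
            p.requires_grad = False                  # pretrained backbone stays frozen
        self.down = nn.Linear(dim, bottleneck)       # trainable adapter branch
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # parallel design: the scaled adapter output is added to the frozen MLP path
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))

m = AdapterMLP()
trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for p in m.parameters())
print(f"trainable params: {trainable}/{total}")      # a small fraction of the block
```
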
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Chen, Shoufa | - |
| dc.contributor.author | 陳守法 | - |
| dc.date.accessioned | 2025-10-20T02:56:41Z | - |
| dc.date.available | 2025-10-20T02:56:41Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Chen, S. [陳守法]. (2025). Visual prediction and generation via diffusion models and beyond. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/364036 | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Computer vision | - |
| dc.subject.lcsh | Machine learning | - |
| dc.title | Visual prediction and generation via diffusion models and beyond | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991045117251603414 | - |
