Appears in Collections: postgraduate thesis: Enhancing scene understanding with deep learning
| Title | Enhancing scene understanding with deep learning |
|---|---|
| Authors | Ge, Chongjian (葛崇剑) |
| Advisors | Luo, P; Wang, WP |
| Issue Date | 2024 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Ge, C. [葛崇剑]. (2024). Enhancing scene understanding with deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Scene understanding is a critical component of computer vision, enabling machines to interpret and interact with complex environments. Despite significant progress in previous research, considerable challenges persist due to increasingly complex tasks and insufficient labeled data for model training. This dissertation aims to address these challenges and enhance scene understanding by introducing systematic innovations in both network architectures and learning algorithms, covering both 2D and 3D scenarios. Specifically, this dissertation explores three research topics that advance scene understanding through deep learning techniques: (i) designing foundational visual backbones for 2D perception, (ii) introducing self-supervised learning for visual modeling, and (iii) enhancing robust 3D perception for real-world autonomous driving. For foundational visual backbones, this dissertation introduces GroupMixFormer, which is based on Vision Transformers. It incorporates GroupMixAttention (GMA) to simultaneously capture token-to-token, token-to-group, and group-to-group correlations, serving as an advanced replacement for traditional self-attention mechanisms. Owing to its comprehensive visual modeling, GroupMixFormer facilitates various visual tasks in scene understanding. For self-supervised learning (SSL) algorithms, this dissertation first presents a double-stream framework for contrastive learning, CARE, which revitalizes Convolutional Neural Networks (CNNs) with feature attention. Specifically, CARE comprises a CNN stream (C-stream) and a Transformer stream (T-stream), utilizing the T-stream to supervise the C-stream in encoding network attention. This helps CNNs produce more generalizable representations for various visual tasks. Additionally, this dissertation identifies that the widely used binary instance comparison is insufficient for measuring sample correlations in contrastive learning. To address this limitation, it introduces another SSL framework, SNCLR, which employs soft weights to measure sample correlations during training, achieving consistent improvements across various visual tasks. For 3D scene understanding systems, this dissertation reveals that sensor failures (LiDAR or camera failure) pose significant risks to the safety of autonomous systems in 3D perception. To address these challenges, it introduces MetaBEV, a robust 3D understanding framework capable of handling extreme real-world conditions, including sensor corruption and sensor absence. Specifically, modality-specific encoders and Mixture-of-Experts (MoE) structures are proposed in MetaBEV to enhance robust 3D scene understanding. Finally, this dissertation discusses future work towards developing foundation vision models and multi-modality models for more advanced scene understanding and even scene reasoning. |
| Degree | Doctor of Philosophy |
| Subject | Computer vision; Deep learning (Machine learning) |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/355582 |
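The GroupMixAttention idea summarised in the abstract — attending over token-level and group-level correlations jointly — can be illustrated with a toy, single-head, weight-free sketch. This is not GroupMixFormer's actual implementation: the average-pooled group proxies and the plain scaled dot-product below are simplifying assumptions standing in for the paper's learned group aggregators.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_mix_attention(x, group_size=2):
    """Toy sketch: augment token-level attention with group proxies.

    x: (n_tokens, dim); n_tokens must be divisible by group_size.
    Consecutive tokens are average-pooled into group proxies, and
    attention runs over the concatenation of tokens and proxies, so
    each output mixes token-to-token and token-to-group correlations.
    """
    n, d = x.shape
    # Group proxies: average-pool consecutive tokens (a stand-in for
    # the learned group aggregators in the actual GMA design).
    groups = x.reshape(n // group_size, group_size, d).mean(axis=1)
    keys = np.concatenate([x, groups], axis=0)        # tokens + groups
    attn = softmax(x @ keys.T / np.sqrt(d), axis=-1)  # (n, n + n/g)
    return attn @ keys
```

With identical tokens the attention is uniform and the output reproduces the input, which makes the sketch easy to sanity-check.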
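The abstract's contrast between binary instance comparison and SNCLR's soft weighting can be sketched numerically. The loss below is an illustrative simplification, not SNCLR's published formulation: soft weights are taken directly from a softmax over cross-view similarities (standing in for the paper's attention-based positiveness scores), with the paired view kept as a hard positive.

```python
import numpy as np

def soft_contrastive_loss(z1, z2, tau=0.1):
    """Toy soft-weighted contrastive loss over two augmented views.

    z1, z2: (n, d) L2-normalised embeddings; row i of z1 and z2 are
    views of the same instance. Instead of treating every non-matching
    sample as a hard negative (binary comparison), each candidate
    contributes to the target with a soft, similarity-based weight.
    """
    n = z1.shape[0]
    logits = z1 @ z2.T / tau                   # cross-view similarities
    m = logits.max(axis=1, keepdims=True)      # stable log-softmax
    logp = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    w = np.exp(logits - m)
    w /= w.sum(axis=1, keepdims=True)          # soft correlation scores
    w[np.arange(n), np.arange(n)] = 1.0        # paired view: hard positive
    w /= w.sum(axis=1, keepdims=True)
    return float(-(w * logp).sum(axis=1).mean())
```

Aligned view pairs should score a lower loss than mismatched ones, which is the behaviour the soft-weighted target is meant to preserve from standard contrastive learning.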
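The robustness property the abstract attributes to MetaBEV — tolerating sensor corruption and absence via modality-specific encoders and MoE structures — can be illustrated with a minimal gated-fusion sketch. The dict-based interface and scalar gate logits are hypothetical; MetaBEV's actual MoE operates on BEV feature maps with learned gating.

```python
import numpy as np

def gated_fusion(feats, gate_logits):
    """Toy sketch: gate-weighted fusion over modality-specific features.

    feats: modality name -> feature vector, or None if the sensor is
    absent. Missing modalities are masked out of the gate before the
    softmax, so the fused feature degrades gracefully instead of
    averaging in zeros or corrupted inputs.
    """
    present = [k for k, v in feats.items() if v is not None]
    logits = np.array([gate_logits[k] for k in present])
    w = np.exp(logits - logits.max())
    w /= w.sum()                       # softmax over present modalities
    return sum(wi * feats[k] for wi, k in zip(w, present))
```

With equal gates and both sensors present the result is a plain average; dropping a modality renormalises the gate over whatever remains, mimicking the sensor-absence case.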
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Luo, P | - |
| dc.contributor.advisor | Wang, WP | - |
| dc.contributor.author | Ge, Chongjian | - |
| dc.contributor.author | 葛崇剑 | - |
| dc.date.accessioned | 2025-04-23T01:31:12Z | - |
| dc.date.available | 2025-04-23T01:31:12Z | - |
| dc.date.issued | 2024 | - |
| dc.identifier.citation | Ge, C. [葛崇剑]. (2024). Enhancing scene understanding with deep learning. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/355582 | - |
| dc.description.abstract | Scene understanding is a critical component of computer vision, enabling machines to interpret and interact with complex environments. Despite significant progress in previous research, considerable challenges persist due to increasingly complex tasks and insufficient labeled data for model training. This dissertation aims to address these challenges and enhance scene understanding by introducing systematic innovations in both network architectures and learning algorithms, covering both 2D and 3D scenarios. Specifically, this dissertation explores three research topics that advance scene understanding through deep learning techniques: (i) designing foundational visual backbones for 2D perception, (ii) introducing self-supervised learning for visual modeling, and (iii) enhancing robust 3D perception for real-world autonomous driving. For foundational visual backbones, this dissertation introduces GroupMixFormer, which is based on Vision Transformers. It incorporates GroupMixAttention (GMA) to simultaneously capture token-to-token, token-to-group, and group-to-group correlations, serving as an advanced replacement for traditional self-attention mechanisms. Owing to its comprehensive visual modeling, GroupMixFormer facilitates various visual tasks in scene understanding. For self-supervised learning (SSL) algorithms, this dissertation first presents a double-stream framework for contrastive learning, CARE, which revitalizes Convolutional Neural Networks (CNNs) with feature attention. Specifically, CARE comprises a CNN stream (C-stream) and a Transformer stream (T-stream), utilizing the T-stream to supervise the C-stream in encoding network attention. This helps CNNs produce more generalizable representations for various visual tasks. Additionally, this dissertation identifies that the widely used binary instance comparison is insufficient for measuring sample correlations in contrastive learning. To address this limitation, it introduces another SSL framework, SNCLR, which employs soft weights to measure sample correlations during training, achieving consistent improvements across various visual tasks. For 3D scene understanding systems, this dissertation reveals that sensor failures (LiDAR or camera failure) pose significant risks to the safety of autonomous systems in 3D perception. To address these challenges, it introduces MetaBEV, a robust 3D understanding framework capable of handling extreme real-world conditions, including sensor corruption and sensor absence. Specifically, modality-specific encoders and Mixture-of-Experts (MoE) structures are proposed in MetaBEV to enhance robust 3D scene understanding. Finally, this dissertation discusses future work towards developing foundation vision models and multi-modality models for more advanced scene understanding and even scene reasoning. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Computer vision | - |
| dc.subject.lcsh | Deep learning (Machine learning) | - |
| dc.title | Enhancing scene understanding with deep learning | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991044955307703414 | - |
