Appears in Collections: postgraduate thesis: Towards efficient and open-world 3D scene understanding
Title | Towards efficient and open-world 3D scene understanding |
---|---|
Authors | Ding, Runyu (丁润语) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The field of 3D scene understanding has witnessed remarkable advancements in recent years, driven by the proliferation of 3D scanning technologies and their applications in areas such as autonomous driving, robotic manipulation, and virtual reality. This thesis, titled "Towards Efficient and Open-world 3D Scene Understanding", addresses three critical aspects of this domain: efficient point cloud analysis, label-efficient 3D scene understanding via domain adaptation, and open-world 3D scene understanding.
First, we explore efficient point cloud analysis, tackling the challenges posed by the sparse, irregular, and unordered structure of point clouds. While deep learning approaches have made significant strides in this area, existing methods often struggle with computational efficiency or flexibility in modeling spatial variations. We introduce Position Adaptive Convolution (PAConv), a novel convolutional operation that dynamically assembles kernels from a weight bank based on relative point positions. This approach offers both efficiency and flexibility in processing 3D point cloud data, outperforming existing methods on various benchmarks.
Second, to address the challenge of label scarcity in 3D semantic segmentation, we propose a sim-to-real unsupervised domain adaptation (UDA) approach that leverages simulated data for real-world scene understanding. Our method, DODA (Data-Oriented Domain Adaptation for 3D indoor semantic segmentation), introduces two key components: Virtual Scan Simulation (VSS) to mimic real-world scanning artifacts, and Tail-aware Cuboid Mixing (TACM) to bridge the context gap between simulated and real scenes. These innovations significantly improve 3D semantic segmentation performance in real-world scenarios without requiring additional labeled data.
Finally, we tackle the problem of open-world 3D scene understanding, where models must recognize and localize objects beyond their initial training categories. This capability is crucial for real-world applications but is hindered by the high annotation costs of 3D datasets. Our approach uses advanced vision-language models to establish associations between 3D points and language at various levels of detail, and combines language information from multiple sources to learn language-aware 3D representations. This method enables efficient open-world understanding of 3D scenes, including recognizing objects and their semantics and locating specific object instances, all with minimal labeled training data, making our approach both powerful and practical for real-world 3D scene understanding tasks.
Throughout this thesis, we demonstrate the effectiveness of our proposed methods through extensive experiments on challenging benchmarks, consistently achieving state-of-the-art performance across various 3D scene understanding tasks. Our work contributes to advancing the field of 3D scene understanding by addressing key challenges in efficiency, label scarcity, and open-world generalization, paving the way for more robust and versatile 3D perception systems in real-world applications. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Pattern recognition systems; Visual perception |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352701 |
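The abstract above describes PAConv as assembling convolution kernels on the fly from a weight bank, with mixing coefficients predicted from relative point positions. Below is a minimal PyTorch sketch of that idea; the class name, the small score-predicting MLP, the bank size, and the max-pooling over neighbours are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of position-adaptive convolution: per-neighbour kernels are
# built as convex combinations of a shared weight bank, with the combination
# scores predicted from each neighbour's relative position. All names and
# shapes here are assumptions for illustration.
import torch
import torch.nn as nn

class PAConvSketch(nn.Module):
    def __init__(self, in_dim, out_dim, num_weights=8):
        super().__init__()
        # Weight bank: num_weights candidate weight matrices.
        self.weight_bank = nn.Parameter(
            torch.randn(num_weights, in_dim, out_dim) * 0.02)
        # Score network: maps a relative position (dx, dy, dz) to mixing scores.
        self.score_net = nn.Sequential(
            nn.Linear(3, 16), nn.ReLU(),
            nn.Linear(16, num_weights), nn.Softmax(dim=-1))

    def forward(self, feats, rel_pos):
        # feats:   (N, K, in_dim)  features of K neighbours per point
        # rel_pos: (N, K, 3)       neighbour position minus centre position
        scores = self.score_net(rel_pos)                      # (N, K, M)
        # Assemble a per-pair kernel as a weighted sum over the bank.
        kernels = torch.einsum('nkm,mio->nkio', scores, self.weight_bank)
        out = torch.einsum('nki,nkio->nko', feats, kernels)   # (N, K, out_dim)
        return out.max(dim=1).values                          # pool over neighbours
```

For example, with 1,024 points and 16 neighbours each, `PAConvSketch(64, 128)(torch.randn(1024, 16, 64), torch.randn(1024, 16, 3))` returns a `(1024, 128)` tensor of per-point features.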
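DODA's Tail-aware Cuboid Mixing, as described in the abstract, blends simulated and real scenes at the cuboid level to close the context gap. The sketch below shows only the basic mixing mechanics under stated assumptions: a fixed ground-plane grid and a simple even/odd interleaving rule stand in for the tail-aware sampling of the actual method.

```python
# Hedged sketch of cuboid mixing: two scenes are cut into cuboids on a
# ground-plane grid and the pieces are interleaved, so simulated and real
# context get blended in one training scene. The grid size and interleaving
# rule are illustrative assumptions, not the exact thesis recipe.
import numpy as np

def mix_cuboids(points_a, points_b, grid=2):
    """points_*: (N, 4+) arrays of [x, y, z, label, ...]; returns a mixed scene."""
    def cell(points):
        # Assign each point to a grid cell over its own xy bounding box.
        xy = points[:, :2]
        lo, hi = xy.min(0), xy.max(0)
        idx = np.clip(((xy - lo) / (hi - lo + 1e-9) * grid).astype(int),
                      0, grid - 1)
        return idx[:, 0] * grid + idx[:, 1]
    ca, cb = cell(points_a), cell(points_b)
    # Keep even-indexed cells from scene A and odd-indexed cells from scene B.
    return np.concatenate([points_a[ca % 2 == 0],
                           points_b[cb % 2 == 1]], axis=0)
```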
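For the open-world contribution, the abstract describes associating 3D points with language so that novel categories can be queried by text. A minimal sketch of that query step follows, assuming per-point features have already been trained into the same embedding space as a CLIP-style text encoder; the function name and shapes are hypothetical.

```python
# Hedged sketch of open-vocabulary 3D labelling: language-aligned point
# embeddings are matched against embeddings of free-form category prompts
# by cosine similarity, so the category list can change at test time.
import torch
import torch.nn.functional as F

def open_vocab_labels(point_feats, text_feats):
    """point_feats: (N, D) language-aligned point embeddings;
    text_feats:  (C, D) embeddings of free-form category prompts.
    Returns (N,) predicted category indices."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.T          # (N, C) cosine similarities
    return logits.argmax(-1)  # each point takes its closest category
```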
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ding, Runyu | - |
dc.contributor.author | 丁润语 | - |
dc.date.accessioned | 2024-12-19T09:27:26Z | - |
dc.date.available | 2024-12-19T09:27:26Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352701 | - |
dc.description.abstract | The field of 3D scene understanding has witnessed remarkable advancements in recent years, driven by the proliferation of 3D scanning technologies and their applications in areas such as autonomous driving, robotic manipulation, and virtual reality. This thesis, titled "Towards Efficient and Open-world 3D Scene Understanding", addresses three critical aspects of this domain: efficient point cloud analysis, label-efficient 3D scene understanding via domain adaptation, and open-world 3D scene understanding. First, we explore efficient point cloud analysis, tackling the challenges posed by the sparse, irregular, and unordered structure of point clouds. While deep learning approaches have made significant strides in this area, existing methods often struggle with computational efficiency or flexibility in modeling spatial variations. We introduce Position Adaptive Convolution (PAConv), a novel convolutional operation that dynamically assembles kernels from a weight bank based on relative point positions. This approach offers both efficiency and flexibility in processing 3D point cloud data, outperforming existing methods on various benchmarks. Second, to address the challenge of label scarcity in 3D semantic segmentation, we propose a sim-to-real unsupervised domain adaptation (UDA) approach that leverages simulated data for real-world scene understanding. Our method, DODA (Data-Oriented Domain Adaptation for 3D indoor semantic segmentation), introduces two key components: Virtual Scan Simulation (VSS) to mimic real-world scanning artifacts, and Tail-aware Cuboid Mixing (TACM) to bridge the context gap between simulated and real scenes. These innovations significantly improve 3D semantic segmentation performance in real-world scenarios without requiring additional labeled data. Finally, we tackle the problem of open-world 3D scene understanding, where models must recognize and localize objects beyond their initial training categories. This capability is crucial for real-world applications but is hindered by the high annotation costs of 3D datasets. Our approach uses advanced vision-language models to establish associations between 3D points and language at various levels of detail, and combines language information from multiple sources to learn language-aware 3D representations. This method enables efficient open-world understanding of 3D scenes, including recognizing objects and their semantics and locating specific object instances, all with minimal labeled training data, making our approach both powerful and practical for real-world 3D scene understanding tasks. Throughout this thesis, we demonstrate the effectiveness of our proposed methods through extensive experiments on challenging benchmarks, consistently achieving state-of-the-art performance across various 3D scene understanding tasks. Our work contributes to advancing the field of 3D scene understanding by addressing key challenges in efficiency, label scarcity, and open-world generalization, paving the way for more robust and versatile 3D perception systems in real-world applications. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Pattern recognition systems | - |
dc.subject.lcsh | Visual perception | - |
dc.title | Towards efficient and open-world 3D scene understanding | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891405503414 | - |