Appears in Collections: postgraduate thesis: Towards efficient and open-world 3D scene understanding
Title | Towards efficient and open-world 3D scene understanding |
---|---|
Authors | Ding, Runyu (丁润语) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The field of 3D scene understanding has witnessed remarkable advancements in recent years, driven by the proliferation of 3D scanning technologies and their applications in areas such as autonomous driving, robotic manipulation, and virtual reality. This thesis, titled "Towards Efficient and Open-world 3D Scene Understanding", addresses three critical aspects of this domain: efficient point cloud analysis, label-efficient 3D scene understanding via domain adaptation, and open-world 3D scene understanding.
First, we explore efficient point cloud analysis, tackling the challenges posed by the sparse, irregular, and unordered structure of point clouds. While deep learning approaches have made significant strides in this area, existing methods often struggle with computational efficiency or flexibility in modeling spatial variations. We introduce Position Adaptive Convolution (PAConv), a novel convolutional operation that dynamically assembles kernels from a weight bank based on relative point positions. This approach offers both efficiency and flexibility in processing 3D point cloud data, outperforming existing methods on various benchmarks.
Second, to address the challenge of label scarcity in 3D semantic segmentation, we propose a sim-to-real unsupervised domain adaptation (UDA) approach that leverages simulated data for real-world scene understanding. Our method, DODA (Data-Oriented Domain Adaptation for 3D indoor semantic segmentation), introduces two key components: Virtual Scan Simulation (VSS) to mimic real-world scanning artifacts, and Tail-aware Cuboid Mixing (TACM) to bridge the context gap between simulated and real scenes. These innovations significantly improve 3D semantic segmentation performance in real-world scenarios without requiring additional labeled data.
Finally, we tackle the problem of open-world 3D scene understanding, where models must recognize and localize objects beyond their initial training categories. This capability is crucial for real-world applications but is hindered by the high annotation costs of 3D datasets. Our approach uses advanced vision-language models to establish associations between 3D points and language at various levels of detail, and combines language information from multiple sources to learn language-aware 3D representations. This method enables efficient open-world understanding of 3D scenes, including recognizing objects and their semantics and locating specific object instances, all with minimal labeled training data, making our approach both powerful and practical for real-world 3D scene understanding tasks.
Throughout this thesis, we demonstrate the effectiveness of our proposed methods through extensive experiments on challenging benchmarks, consistently achieving state-of-the-art performance across various 3D scene understanding tasks. Our work contributes to advancing the field of 3D scene understanding by addressing key challenges in efficiency, label scarcity, and open-world generalization, paving the way for more robust and versatile 3D perception systems in real-world applications. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Pattern recognition systems; Visual perception |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352701 |
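The abstract above describes PAConv as assembling convolution kernels on the fly from a weight bank, with mixing coefficients predicted from relative point positions. Below is a minimal PyTorch sketch of that idea; the class name, the small score-predicting MLP, the bank size, and the max-pooling over neighbours are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of position-adaptive convolution: per-neighbour kernels are
# built as convex combinations of a shared weight bank, with the combination
# scores predicted from each neighbour's relative position. All names and
# shapes here are assumptions for illustration.
import torch
import torch.nn as nn

class PAConvSketch(nn.Module):
    def __init__(self, in_dim, out_dim, num_weights=8):
        super().__init__()
        # Weight bank: num_weights candidate weight matrices.
        self.weight_bank = nn.Parameter(
            torch.randn(num_weights, in_dim, out_dim) * 0.02)
        # Score network: maps a relative position (dx, dy, dz) to mixing scores.
        self.score_net = nn.Sequential(
            nn.Linear(3, 16), nn.ReLU(),
            nn.Linear(16, num_weights), nn.Softmax(dim=-1))

    def forward(self, feats, rel_pos):
        # feats:   (N, K, in_dim)  features of K neighbours per point
        # rel_pos: (N, K, 3)       neighbour position minus centre position
        scores = self.score_net(rel_pos)                      # (N, K, M)
        # Assemble a per-pair kernel as a weighted sum over the bank.
        kernels = torch.einsum('nkm,mio->nkio', scores, self.weight_bank)
        out = torch.einsum('nki,nkio->nko', feats, kernels)   # (N, K, out_dim)
        return out.max(dim=1).values                          # pool over neighbours
```

For example, with 1,024 points and 16 neighbours each, `PAConvSketch(64, 128)(torch.randn(1024, 16, 64), torch.randn(1024, 16, 3))` returns a `(1024, 128)` tensor of per-point features.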
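DODA's Tail-aware Cuboid Mixing, as described in the abstract, blends simulated and real scenes at the cuboid level to close the context gap. The sketch below shows only the basic mixing mechanics under stated assumptions: a fixed ground-plane grid and a simple even/odd interleaving rule stand in for the tail-aware sampling of the actual method.

```python
# Hedged sketch of cuboid mixing: two scenes are cut into cuboids on a
# ground-plane grid and the pieces are interleaved, so simulated and real
# context get blended in one training scene. The grid size and interleaving
# rule are illustrative assumptions, not the exact thesis recipe.
import numpy as np

def mix_cuboids(points_a, points_b, grid=2):
    """points_*: (N, 4+) arrays of [x, y, z, label, ...]; returns a mixed scene."""
    def cell(points):
        # Assign each point to a grid cell over its own xy bounding box.
        xy = points[:, :2]
        lo, hi = xy.min(0), xy.max(0)
        idx = np.clip(((xy - lo) / (hi - lo + 1e-9) * grid).astype(int),
                      0, grid - 1)
        return idx[:, 0] * grid + idx[:, 1]
    ca, cb = cell(points_a), cell(points_b)
    # Keep even-indexed cells from scene A and odd-indexed cells from scene B.
    return np.concatenate([points_a[ca % 2 == 0],
                           points_b[cb % 2 == 1]], axis=0)
```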
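For the open-world contribution, the abstract describes associating 3D points with language so that novel categories can be queried by text. A minimal sketch of that query step follows, assuming per-point features have already been trained into the same embedding space as a CLIP-style text encoder; the function name and shapes are hypothetical.

```python
# Hedged sketch of open-vocabulary 3D labelling: language-aligned point
# embeddings are matched against embeddings of free-form category prompts
# by cosine similarity, so the category list can change at test time.
import torch
import torch.nn.functional as F

def open_vocab_labels(point_feats, text_feats):
    """point_feats: (N, D) language-aligned point embeddings;
    text_feats:  (C, D) embeddings of free-form category prompts.
    Returns (N,) predicted category indices."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.T          # (N, C) cosine similarities
    return logits.argmax(-1)  # each point takes its closest category
```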
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ding, Runyu | - |
dc.contributor.author | 丁润语 | - |
dc.date.accessioned | 2024-12-19T09:27:26Z | - |
dc.date.available | 2024-12-19T09:27:26Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352701 | - |
dc.description.abstract | The field of 3D scene understanding has witnessed remarkable advancements in recent years, driven by the proliferation of 3D scanning technologies and their applications in areas such as autonomous driving, robotic manipulation, and virtual reality. This thesis, titled "Towards Efficient and Open-world 3D Scene Understanding", addresses three critical aspects of this domain: efficient point cloud analysis, label-efficient 3D scene understanding via domain adaptation, and open-world 3D scene understanding. First, we explore efficient point cloud analysis, tackling the challenges posed by the sparse, irregular, and unordered structure of point clouds. While deep learning approaches have made significant strides in this area, existing methods often struggle with computational efficiency or flexibility in modeling spatial variations. We introduce Position Adaptive Convolution (PAConv), a novel convolutional operation that dynamically assembles kernels from a weight bank based on relative point positions. This approach offers both efficiency and flexibility in processing 3D point cloud data, outperforming existing methods on various benchmarks. Second, to address the challenge of label scarcity in 3D semantic segmentation, we propose a sim-to-real unsupervised domain adaptation (UDA) approach that leverages simulated data for real-world scene understanding. Our method, DODA (Data-Oriented Domain Adaptation for 3D indoor semantic segmentation), introduces two key components: Virtual Scan Simulation (VSS) to mimic real-world scanning artifacts, and Tail-aware Cuboid Mixing (TACM) to bridge the context gap between simulated and real scenes. These innovations significantly improve 3D semantic segmentation performance in real-world scenarios without requiring additional labeled data. Finally, we tackle the problem of open-world 3D scene understanding, where models must recognize and localize objects beyond their initial training categories. This capability is crucial for real-world applications but is hindered by the high annotation costs of 3D datasets. Our approach uses advanced vision-language models to establish associations between 3D points and language at various levels of detail, and combines language information from multiple sources to learn language-aware 3D representations. This method enables efficient open-world understanding of 3D scenes, including recognizing objects and their semantics and locating specific object instances, all with minimal labeled training data, making our approach both powerful and practical for real-world 3D scene understanding tasks. Throughout this thesis, we demonstrate the effectiveness of our proposed methods through extensive experiments on challenging benchmarks, consistently achieving state-of-the-art performance across various 3D scene understanding tasks. Our work contributes to advancing the field of 3D scene understanding by addressing key challenges in efficiency, label scarcity, and open-world generalization, paving the way for more robust and versatile 3D perception systems in real-world applications. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Pattern recognition systems | - |
dc.subject.lcsh | Visual perception | - |
dc.title | Towards efficient and open-world 3D scene understanding | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891405503414 | - |