
Postgraduate thesis: Towards efficient and open-world 3D scene understanding

Title: Towards efficient and open-world 3D scene understanding
Authors: Ding, Runyu [丁润语]
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: The field of 3D scene understanding has seen remarkable advances in recent years, driven by the proliferation of 3D scanning technologies and their applications in areas such as autonomous driving, robotic manipulation, and virtual reality. This thesis, "Towards Efficient and Open-world 3D Scene Understanding", addresses three critical aspects of this domain: efficient point cloud analysis, label-efficient 3D scene understanding via domain adaptation, and open-world 3D scene understanding.

First, we explore efficient point cloud analysis, tackling the challenges posed by the sparse, irregular, and unordered structure of point clouds. While deep learning approaches have made significant strides in this area, existing methods often sacrifice either computational efficiency or the flexibility to model spatial variations. We introduce Position Adaptive Convolution (PAConv), a novel convolutional operation that dynamically assembles its kernels from a weight bank according to the relative positions of neighboring points. This approach offers both efficiency and flexibility in processing 3D point cloud data and outperforms existing methods on various benchmarks.

Second, to address label scarcity in 3D semantic segmentation, we propose a sim-to-real unsupervised domain adaptation (UDA) approach that leverages simulated data for real-world scene understanding. Our method, DODA (Data-Oriented Domain Adaptation for 3D indoor semantic segmentation), introduces two key components: Virtual Scan Simulation (VSS), which mimics real-world scanning artifacts, and Tail-aware Cuboid Mixing (TACM), which bridges the context gap between simulated and real scenes. These innovations significantly improve 3D semantic segmentation in real-world scenarios without requiring additional labeled data.

Finally, we tackle open-world 3D scene understanding, where models must recognize and localize objects beyond their initial training categories. This capability is crucial for real-world applications but is hindered by the high annotation cost of 3D datasets. Our approach uses advanced vision-language models to establish associations between 3D points and language at multiple levels of detail, and combines language information from multiple sources to learn language-aware 3D representations. The resulting method supports efficient open-world understanding of 3D scenes, including recognizing object semantics and localizing specific object instances, while requiring minimal labeled training data, making it both powerful and practical for real-world 3D scene understanding tasks.

Throughout this thesis, we demonstrate the effectiveness of the proposed methods through extensive experiments on challenging benchmarks, consistently achieving state-of-the-art performance across various 3D scene understanding tasks. This work advances the field of 3D scene understanding by addressing key challenges in efficiency, label scarcity, and open-world generalization, paving the way for more robust and versatile 3D perception systems in real-world applications.
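To make the kernel-assembly idea behind PAConv more concrete, the sketch below shows one way a position-adaptive convolution can be written in PyTorch. It is an illustrative sketch only, not the thesis implementation: the score-network architecture, the softmax normalization, the weight-bank size, and the max-pooling aggregation are all assumptions made for this example.

```python
import torch
import torch.nn as nn

class PAConvSketch(nn.Module):
    """Minimal sketch of position-adaptive kernel assembly (illustrative only)."""

    def __init__(self, in_dim, out_dim, num_kernels=8):
        super().__init__()
        # Weight bank: a set of candidate weight matrices shared by all points.
        self.weight_bank = nn.Parameter(torch.randn(num_kernels, in_dim, out_dim) * 0.02)
        # Assumed score network: maps a relative position to mixing coefficients.
        self.score_net = nn.Sequential(
            nn.Linear(3, 16), nn.ReLU(),
            nn.Linear(16, num_kernels), nn.Softmax(dim=-1),
        )

    def forward(self, feats, rel_pos):
        # feats:   (B, N, K, in_dim)  features of K neighbors per point
        # rel_pos: (B, N, K, 3)       neighbor positions relative to the center point
        scores = self.score_net(rel_pos)                         # (B, N, K, num_kernels)
        # Assemble one kernel per neighbor as a score-weighted sum over the weight bank.
        kernels = torch.einsum('bnkm,mio->bnkio', scores, self.weight_bank)
        out = torch.einsum('bnki,bnkio->bnko', feats, kernels)   # apply assembled kernels
        return out.max(dim=2).values                             # aggregate over neighbors
```

For example, with feats of shape (2, 1024, 16, 32) and rel_pos of shape (2, 1024, 16, 3), PAConvSketch(32, 64) returns per-point features of shape (2, 1024, 64).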
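The cuboid-mixing idea behind TACM can likewise be illustrated with a toy NumPy sketch that partitions two scenes into a spatial grid and interleaves their cells. This is only a rough illustration under assumed simplifications; the grid layout, the alternating cell assignment, and the omission of tail-class oversampling and of Virtual Scan Simulation are choices made for brevity, not the DODA method itself.

```python
import numpy as np

def cuboid_mix(points_a, labels_a, points_b, labels_b, grid=(2, 2, 1)):
    """Toy cuboid mixing: split space into a grid of cuboids and take each cell
    alternately from scene A or scene B (illustrative only)."""
    def cell_ids(points):
        # Normalize xyz into [0, 1) per scene, then index a grid cell per point.
        mins, maxs = points.min(0), points.max(0)
        norm = (points - mins) / (maxs - mins + 1e-6)
        idx = np.floor(norm * np.array(grid)).astype(int)
        idx = np.clip(idx, 0, np.array(grid) - 1)
        return idx[:, 0] * grid[1] * grid[2] + idx[:, 1] * grid[2] + idx[:, 2]

    ids_a, ids_b = cell_ids(points_a), cell_ids(points_b)
    keep_a = ids_a % 2 == 0          # even cells come from scene A
    keep_b = ids_b % 2 == 1          # odd cells come from scene B
    mixed_pts = np.concatenate([points_a[keep_a], points_b[keep_b]])
    mixed_lbl = np.concatenate([labels_a[keep_a], labels_b[keep_b]])
    return mixed_pts, mixed_lbl
```

The intent is only to show how spatial sub-regions from two domains can be recombined into a single training scene so that a segmentation network sees mixed-domain context.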
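Finally, the point-language association used for open-world recognition can be illustrated as follows: assuming per-point features have been trained to align with a CLIP-style text embedding space (as vision-language-driven methods typically do), each point can be classified against an arbitrary list of category names by cosine similarity. The function name and tensor shapes are illustrative assumptions, not the thesis API.

```python
import torch
import torch.nn.functional as F

def open_vocab_point_labels(point_feats, text_feats, category_names):
    """Assign each 3D point the best-matching category from an arbitrary name list,
    assuming point features share an embedding space with the text features
    (illustrative sketch only)."""
    p = F.normalize(point_feats, dim=-1)   # (N, D) per-point embeddings
    t = F.normalize(text_feats, dim=-1)    # (C, D) one embedding per category name
    sim = p @ t.T                          # (N, C) cosine similarities
    best = sim.argmax(dim=-1)              # closest category per point
    return [category_names[i] for i in best.tolist()]
```

Because the category list is supplied as text at inference time, nothing restricts it to the categories seen during training, which is what makes the setting open-world.
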
Degree: Doctor of Philosophy
Subject: Computer vision; Pattern recognition systems; Visual perception
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/352701

 

DC Field / Value:
dc.contributor.author: Ding, Runyu
dc.contributor.author: 丁润语
dc.date.accessioned: 2024-12-19T09:27:26Z
dc.date.available: 2024-12-19T09:27:26Z
dc.date.issued: 2024
dc.identifier.citation: Ding, R. [丁润语]. (2024). Towards efficient and open-world 3D scene understanding. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352701
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Pattern recognition systems
dc.subject.lcsh: Visual perception
dc.title: Towards efficient and open-world 3D scene understanding
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891405503414
