Appears in Collections: postgraduate thesis: Towards transferable and efficient embodied visual intelligence
Title | Towards transferable and efficient embodied visual intelligence |
---|---|
Authors | Yang, Jihan (楊霽晗) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Open-world embodied 3D visual intelligence is pivotal in building agents that can interact with the 3D physical world. It enhances the perception and understanding of autonomous agents within sensor-rich 3D environments.
To understand complex and ever-changing real-world scenarios, embodied visual intelligence should be (i) **transferable** to unseen data distributions without costly human annotations; (ii) **computation-efficient**, understanding its surroundings in real time; (iii) **capable of recognizing** novel semantic categories in an **open world**; and (iv) **practically applicable** for embodied agents performing daily tasks across realistic, globe-scale environments.
In this thesis, we examine each of these aspects, aiming to advance the effectiveness and adaptability of autonomous embodied systems in real-world settings.
First, to enhance transferability in 3D visual understanding, we design ST3D, an unsupervised domain adaptation framework that transfers 3D object detectors to novel domains without target-domain labels.
ST3D pre-trains the 3D detector on the source domain using a random object scaling strategy to reduce domain bias, then refines it on the target domain through iterative pseudo-label updates (a sketch of this self-training loop follows the table below).
We further develop ST3D++, which extends ST3D to multi-category scenarios by simultaneously reducing noise in the pseudo labels and mitigating the remaining noise's impact on optimization.
ST3D and ST3D++ set a new state of the art in all evaluated settings and even surpass fully supervised results on the KITTI benchmark.
To advance efficient embodied vision systems, we investigate model compression and knowledge distillation (KD) techniques for 3D object detectors.
Our study summarizes the key design factors for efficient 3D detectors and observes that naively combining existing KD techniques yields little synergy. We therefore propose SparseKD, a simple and effective knowledge distillation pipeline that transfers knowledge between teacher and student models only at sparse key positions (see the distillation sketch below).
With SparseKD, our efficient detector outperforms its teacher while requiring only 44% of the teacher's computation.
To endow embodied visual intelligence with open-vocabulary capability, we propose RegionPLC, a lightweight and scalable framework for open-world 3D scene understanding.
RegionPLC introduces a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from various 2D vision-language models, yielding high-quality regional language descriptions. We also devise a region-aware, point-discriminative contrastive learning objective (see the sketch below) that enables robust and effective 3D learning from this regional language supervision. Extensive experiments demonstrate the effectiveness, scalability, and efficiency of RegionPLC.
Finally, evaluating embodied visual intelligence requires a realistic, globally distributed, and interactive testing environment, one not limited by the physical constraints of actual hardware and control systems. To this end, we present V-IRL, a scalable platform that grounds vision systems into embodied agents and evaluates them on a variety of practical, interactive tasks.
Moreover, the V-IRL platform establishes three worldwide benchmarks that assess fundamental language and vision models both in isolation, on the platform's real-world data, and as part of embodied agents on challenging tasks. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Three-dimensional imaging - Data processing; Artificial intelligence |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352658 |
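The ST3D recipe in the abstract reduces to a two-stage loop: augmented source pre-training, then alternating pseudo-labeling and retraining on the unlabeled target domain. The Python sketch below illustrates only that structure; `Box3D`, the `Detector` protocol, the scaling range, and the confidence threshold are hypothetical stand-ins, not the thesis implementation.

```python
# Illustrative sketch of ST3D-style self-training for 3D detection.
# `Box3D` and `Detector` are hypothetical stand-ins; only the two-stage
# structure (augmented pre-training, then pseudo-label loops) follows the abstract.
from dataclasses import dataclass
from typing import Protocol
import random

@dataclass
class Box3D:
    center: tuple[float, float, float]
    size: tuple[float, float, float]   # (length, width, height)
    score: float = 1.0

class Detector(Protocol):
    def predict(self, scene) -> list[Box3D]: ...
    def train_on(self, scenes, labels) -> None: ...

def random_object_scaling(boxes: list[Box3D], lo=0.8, hi=1.2) -> list[Box3D]:
    """Pre-training augmentation: randomly rescale ground-truth objects so the
    detector does not overfit source-domain object sizes (a major domain bias)."""
    return [Box3D(b.center,
                  tuple(d * random.uniform(lo, hi) for d in b.size),
                  b.score)
            for b in boxes]

def st3d_self_training(detector: Detector, target_scenes: dict,
                       rounds: int = 3, score_thresh: float = 0.6) -> Detector:
    """Stage 2: alternate between pseudo-labeling the unlabeled target domain
    and retraining on those labels; only confident boxes are kept as labels."""
    for _ in range(rounds):
        pseudo = {sid: [b for b in detector.predict(scene) if b.score >= score_thresh]
                  for sid, scene in target_scenes.items()}
        detector.train_on(target_scenes, pseudo)
    return detector
```

The confidence threshold is the lever that trades pseudo-label noise against coverage; denoising those labels and softening their effect on optimization is exactly the gap ST3D++ addresses.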
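SparseKD's core idea, as stated above, is to distill knowledge only at sparse key positions rather than densely over the whole feature map. Below is a hedged PyTorch sketch of one such position-sparse feature-distillation loss; the BEV tensor layout, the teacher-confidence ranking rule, and `k` are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of position-sparse feature distillation in the spirit of SparseKD:
# match the student to the teacher only at the top-k positions where the teacher
# responds most strongly. Shapes and the ranking rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def sparse_kd_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor,
                   teacher_cls: torch.Tensor, k: int = 128) -> torch.Tensor:
    """student_feat, teacher_feat: (B, C, H, W) BEV feature maps.
    teacher_cls: (B, num_classes, H, W) teacher classification logits."""
    B, C, H, W = teacher_feat.shape
    # Rank spatial positions by the teacher's peak per-class confidence.
    heat = teacher_cls.sigmoid().amax(dim=1).flatten(1)   # (B, H*W)
    idx = heat.topk(k, dim=1).indices                     # (B, k) key positions
    gather_idx = idx.unsqueeze(1).expand(-1, C, -1)       # (B, C, k)
    s = student_feat.flatten(2).gather(2, gather_idx)     # student at key positions
    t = teacher_feat.flatten(2).gather(2, gather_idx)     # teacher at key positions
    # Distill features only at those sparse positions, ignoring background cells.
    return F.mse_loss(s, t)
```

Restricting the loss to a few hundred positions keeps the distillation signal on foreground responses instead of the overwhelmingly empty BEV background.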
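RegionPLC's point-discriminative contrastive objective pairs individual points with region-level captions. The sketch below shows one plausible form of such an objective under assumed interfaces; `point2caption`, the feature shapes, and the temperature `tau` are illustrative assumptions rather than the thesis formulation.

```python
# Hedged sketch of a point-to-caption contrastive objective in the spirit of
# RegionPLC's region-aware, point-discriminative learning. Interfaces assumed.
import torch
import torch.nn.functional as F

def point_caption_contrastive(point_feats: torch.Tensor, text_feats: torch.Tensor,
                              point2caption: torch.Tensor, tau: float = 0.07):
    """point_feats: (N, D) per-point features from a 3D backbone.
    text_feats: (M, D) embeddings of region-level captions.
    point2caption: (N,) long tensor; index of the caption covering each point.
    Each point is its own training sample (point-discriminative): it is pulled
    toward its region's caption and pushed away from all other captions."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.T / tau                    # (N, M) point-caption similarities
    return F.cross_entropy(logits, point2caption)
```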
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, Jihan | - |
dc.contributor.author | 楊霽晗 | - |
dc.date.accessioned | 2024-12-19T09:27:02Z | - |
dc.date.available | 2024-12-19T09:27:02Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352658 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Three-dimensional imaging - Data processing | - |
dc.subject.lcsh | Artificial intelligence | - |
dc.title | Towards transferable and efficient embodied visual intelligence | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891405303414 | - |