Appears in Collections: postgraduate thesis: Towards transferable and efficient embodied visual intelligence
Title | Towards transferable and efficient embodied visual intelligence |
---|---|
Authors | Yang, Jihan (楊霽晗) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Open-world embodied 3D visual intelligence is pivotal in building agents that can interact with the 3D physical world. It enhances the perception and understanding of autonomous agents within sensor-rich 3D environments.
To understand complex and ever-changing real-world scenarios, embodied visual intelligence should be (i) **transferable** to unseen data distributions without costly human annotations; (ii) **computation-efficient**, understanding its surroundings in real time; (iii) **capable of recognizing** novel semantic categories in an **open world**; and (iv) **practically applicable** for embodied agents performing daily tasks across realistic, globe-scale environments.
In this thesis, we examine each of these aspects, aiming to advance the effectiveness and adaptability of autonomous embodied systems in real-world settings.
First, to enhance transferability in 3D visual understanding, we design ST3D, an unsupervised domain adaptation framework that transfers 3D object detectors to novel domains without target-domain labels.
ST3D pre-trains the 3D detector on the source domain using a random object scaling strategy to reduce domain bias, then refines it on the target domain through iterative pseudo-label updates (a sketch of this self-training loop follows the table below).
We further develop ST3D++, which extends ST3D to multi-category scenarios by simultaneously reducing noise in the pseudo labels and mitigating the remaining noise's impact on optimization.
ST3D and ST3D++ set a new state of the art in all evaluated settings and even surpass fully supervised results on the KITTI benchmark.
To advance efficient embodied vision systems, we investigate model compression and knowledge distillation (KD) techniques for 3D object detectors.
Our study summarizes the key design factors for efficient 3D detectors and observes that naively combining existing KD techniques yields little synergy. We therefore propose SparseKD, a simple and effective knowledge distillation pipeline that transfers knowledge between teacher and student models only at sparse key positions (see the distillation sketch below).
With SparseKD, our efficient detector outperforms its teacher while requiring only 44% of the teacher's computation.
To endow embodied visual intelligence with open-vocabulary capability, we propose RegionPLC, a lightweight and scalable framework for open-world 3D scene understanding.
RegionPLC introduces a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from various 2D vision-language models, yielding high-quality regional language descriptions. We also devise a region-aware, point-discriminative contrastive learning objective (see the sketch below) that enables robust and effective 3D learning from this regional language supervision. Extensive experiments demonstrate the effectiveness, scalability, and efficiency of RegionPLC.
Finally, evaluating embodied visual intelligence requires a realistic, globally distributed, and interactive testing environment, one not limited by the physical constraints of actual hardware and control systems. To this end, we present V-IRL, a scalable platform that grounds vision systems into embodied agents and evaluates them on a variety of practical, interactive tasks.
Moreover, the V-IRL platform establishes three worldwide benchmarks that assess fundamental language and vision models both in isolation, on the platform's real-world data, and as part of embodied agents on challenging tasks. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Three-dimensional imaging - Data processing; Artificial intelligence |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352658 |
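The ST3D recipe in the abstract reduces to a two-stage loop: augmented source pre-training, then alternating pseudo-labeling and retraining on the unlabeled target domain. The Python sketch below illustrates only that structure; `Box3D`, the `Detector` protocol, the scaling range, and the confidence threshold are hypothetical stand-ins, not the thesis implementation.

```python
# Illustrative sketch of ST3D-style self-training for 3D detection.
# `Box3D` and `Detector` are hypothetical stand-ins; only the two-stage
# structure (augmented pre-training, then pseudo-label loops) follows the abstract.
from dataclasses import dataclass
from typing import Protocol
import random

@dataclass
class Box3D:
    center: tuple[float, float, float]
    size: tuple[float, float, float]   # (length, width, height)
    score: float = 1.0

class Detector(Protocol):
    def predict(self, scene) -> list[Box3D]: ...
    def train_on(self, scenes, labels) -> None: ...

def random_object_scaling(boxes: list[Box3D], lo=0.8, hi=1.2) -> list[Box3D]:
    """Pre-training augmentation: randomly rescale ground-truth objects so the
    detector does not overfit source-domain object sizes (a major domain bias)."""
    return [Box3D(b.center,
                  tuple(d * random.uniform(lo, hi) for d in b.size),
                  b.score)
            for b in boxes]

def st3d_self_training(detector: Detector, target_scenes: dict,
                       rounds: int = 3, score_thresh: float = 0.6) -> Detector:
    """Stage 2: alternate between pseudo-labeling the unlabeled target domain
    and retraining on those labels; only confident boxes are kept as labels."""
    for _ in range(rounds):
        pseudo = {sid: [b for b in detector.predict(scene) if b.score >= score_thresh]
                  for sid, scene in target_scenes.items()}
        detector.train_on(target_scenes, pseudo)
    return detector
```

The confidence threshold is the lever that trades pseudo-label noise against coverage; denoising those labels and softening their effect on optimization is exactly the gap ST3D++ addresses.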
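SparseKD's core idea, as stated above, is to distill knowledge only at sparse key positions rather than densely over the whole feature map. Below is a hedged PyTorch sketch of one such position-sparse feature-distillation loss; the BEV tensor layout, the teacher-confidence ranking rule, and `k` are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of position-sparse feature distillation in the spirit of SparseKD:
# match the student to the teacher only at the top-k positions where the teacher
# responds most strongly. Shapes and the ranking rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def sparse_kd_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor,
                   teacher_cls: torch.Tensor, k: int = 128) -> torch.Tensor:
    """student_feat, teacher_feat: (B, C, H, W) BEV feature maps.
    teacher_cls: (B, num_classes, H, W) teacher classification logits."""
    B, C, H, W = teacher_feat.shape
    # Rank spatial positions by the teacher's peak per-class confidence.
    heat = teacher_cls.sigmoid().amax(dim=1).flatten(1)   # (B, H*W)
    idx = heat.topk(k, dim=1).indices                     # (B, k) key positions
    gather_idx = idx.unsqueeze(1).expand(-1, C, -1)       # (B, C, k)
    s = student_feat.flatten(2).gather(2, gather_idx)     # student at key positions
    t = teacher_feat.flatten(2).gather(2, gather_idx)     # teacher at key positions
    # Distill features only at those sparse positions, ignoring background cells.
    return F.mse_loss(s, t)
```

Restricting the loss to a few hundred positions keeps the distillation signal on foreground responses instead of the overwhelmingly empty BEV background.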
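RegionPLC's point-discriminative contrastive objective pairs individual points with region-level captions. The sketch below shows one plausible form of such an objective under assumed interfaces; `point2caption`, the feature shapes, and the temperature `tau` are illustrative assumptions rather than the thesis formulation.

```python
# Hedged sketch of a point-to-caption contrastive objective in the spirit of
# RegionPLC's region-aware, point-discriminative learning. Interfaces assumed.
import torch
import torch.nn.functional as F

def point_caption_contrastive(point_feats: torch.Tensor, text_feats: torch.Tensor,
                              point2caption: torch.Tensor, tau: float = 0.07):
    """point_feats: (N, D) per-point features from a 3D backbone.
    text_feats: (M, D) embeddings of region-level captions.
    point2caption: (N,) long tensor; index of the caption covering each point.
    Each point is its own training sample (point-discriminative): it is pulled
    toward its region's caption and pushed away from all other captions."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.T / tau                    # (N, M) point-caption similarities
    return F.cross_entropy(logits, point2caption)
```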
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, Jihan | - |
dc.contributor.author | 楊霽晗 | - |
dc.date.accessioned | 2024-12-19T09:27:02Z | - |
dc.date.available | 2024-12-19T09:27:02Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352658 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Three-dimensional imaging - Data processing | - |
dc.subject.lcsh | Artificial intelligence | - |
dc.title | Towards transferable and efficient embodied visual intelligence | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891405303414 | - |