Postgraduate thesis: Towards transferable and efficient embodied visual intelligence

Title: Towards transferable and efficient embodied visual intelligence
Authors: Yang, Jihan (楊霽晗)
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Open-world embodied 3D visual intelligence is pivotal to building agents that can interact with the 3D physical world, enhancing the perception and understanding of autonomous agents in sensor-rich 3D environments. To understand complex and ever-changing real-world scenarios, embodied visual intelligence should be (i) transferable to unseen data distributions without costly human annotations; (ii) computation-efficient, so that agents can understand their surroundings in real time; (iii) capable of recognizing novel semantic categories in an open world; and (iv) practically applicable, allowing embodied agents to perform daily tasks in realistic, globe-scale environments. This thesis addresses each of these requirements, aiming to advance the effectiveness and adaptability of autonomous embodied systems in real-world settings.

First, to enhance transferability in 3D visual understanding, we design ST3D, an unsupervised domain adaptation framework for 3D object detection that transfers models to novel domains without target-domain labels. ST3D pre-trains the 3D detector on the source domain with a random object scaling strategy to reduce domain bias, then refines the detector on the target domain through iterative pseudo-label updates. We further develop ST3D++, which enhances the ST3D framework and offers superior transferability in multi-category scenarios by simultaneously reducing noise in pseudo labels and mitigating their negative impact on optimization. ST3D and ST3D++ set a new state of the art in all evaluated settings and even surpass fully supervised results on the KITTI benchmark.

Second, to advance efficient embodied vision systems, we investigate model compression and knowledge distillation (KD) techniques for 3D object detectors. Our study summarizes the key design factors for efficient 3D detectors and shows that existing KD techniques yield limited synergy when combined. We therefore propose SparseKD, a simple and effective knowledge distillation pipeline that focuses knowledge transfer on a sparse set of key positions shared between teacher and student models. Our efficient detector trained with SparseKD outperforms its teacher while requiring only 44% of the teacher's computational cost.

Third, to equip embodied visual intelligence with open-vocabulary capability, we propose RegionPLC, a lightweight and scalable framework for open-world 3D scene understanding. RegionPLC introduces a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from various 2D vision-language (VL) models, yielding high-quality language descriptions, and devises a region-aware point-discriminative contrastive learning objective that enables robust and effective 3D learning from regional language supervision. Extensive experiments demonstrate the effectiveness, scalability, and efficiency of RegionPLC.

Finally, evaluating embodied visual intelligence calls for a realistic, globally distributed, and interactive testing environment that is not limited by the physical constraints of actual hardware and control systems. To this end, we present V-IRL, a scalable platform that grounds vision systems in embodied agents and evaluates them on a variety of practical, interactive tasks. The V-IRL platform also establishes three worldwide benchmarks that assess foundational vision and language models both in isolation, on the platform's real-world data, and as components of embodied agents on challenging tasks.
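To make the abstract's random object scaling strategy concrete, here is a minimal NumPy sketch, not the thesis implementation: it rescales one ground-truth box and the LiDAR points inside it about the box center, leaving the heading unchanged. The function name, box layout, and scale range are illustrative assumptions.

```python
import numpy as np

def random_object_scaling(points, box, scale_range=(0.75, 1.25), rng=None):
    # Illustrative sketch of random object scaling in the spirit of ST3D;
    # the real implementation may differ. `points` (N, 3) are the LiDAR
    # points inside the box; `box` is [cx, cy, cz, dx, dy, dz, heading],
    # with heading a rotation about the z-axis.
    if rng is None:
        rng = np.random.default_rng()
    s = rng.uniform(*scale_range)  # one random scale factor per object

    center, dims, heading = box[:3], box[3:6], box[6]
    c, si = np.cos(heading), np.sin(heading)
    R = np.array([[c, -si, 0.0],
                  [si,  c, 0.0],
                  [0.0, 0.0, 1.0]])  # rotation about z by `heading`

    # Rotate points into the box's canonical frame, scale about the box
    # center, then rotate back so the heading stays unchanged.
    local = (points - center) @ R        # world -> canonical (rotate by -heading)
    new_points = (local * s) @ R.T + center

    new_box = np.concatenate([center, dims * s, [heading]])
    return new_points, new_box
```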
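The sparse key-position idea behind SparseKD can be sketched similarly. The PyTorch function below is an assumption-laden illustration, not the thesis code: it assumes bird's-eye-view (BEV) feature maps of matching channel width and picks key positions by the teacher's peak classification response.

```python
import torch
import torch.nn.functional as F

def sparse_feature_kd_loss(student_feat, teacher_feat, teacher_cls, top_k=512):
    # Illustrative sketch: distill features only at sparse key positions,
    # chosen here as the locations where the teacher responds most strongly.
    #
    # student_feat, teacher_feat: (B, C, H, W) BEV feature maps, assumed
    #   already projected to the same channel width C.
    # teacher_cls: (B, A, H, W) teacher classification logits.
    B, C, H, W = teacher_feat.shape

    # Teacher confidence per spatial location: max over anchors/classes.
    conf = teacher_cls.sigmoid().amax(dim=1).flatten(1)   # (B, H*W)
    idx = conf.topk(min(top_k, H * W), dim=1).indices     # (B, K)

    s = student_feat.flatten(2).transpose(1, 2)           # (B, H*W, C)
    t = teacher_feat.flatten(2).transpose(1, 2)
    gather = idx.unsqueeze(-1).expand(-1, -1, C)
    s_key = s.gather(1, gather)                           # (B, K, C)
    t_key = t.gather(1, gather).detach()                  # stop teacher grads

    # Match student to teacher only at the K key positions.
    return F.mse_loss(s_key, t_key)
```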
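Finally, the region-aware point-discriminative contrastive objective in RegionPLC can be read as a per-point InfoNCE-style loss against caption embeddings, so supervision stays point-wise rather than being pooled over a region. Everything in this sketch (names, shapes, the temperature) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def point_discriminative_contrastive_loss(point_feats, text_feats,
                                          point2caption, tau=0.07):
    # Illustrative sketch, not the thesis code: each point is contrasted
    # against all caption embeddings individually (no pooling over the
    # region), keeping the supervision point-wise.
    #
    # point_feats:   (N, D) per-point embeddings from the 3D backbone.
    # text_feats:    (M, D) caption embeddings from a frozen text encoder.
    # point2caption: (N,) long tensor; index of the caption supervising
    #                each point, or -1 for points with no supervision.
    mask = point2caption >= 0
    pf = F.normalize(point_feats[mask], dim=-1)
    tf = F.normalize(text_feats, dim=-1)

    logits = pf @ tf.t() / tau   # (N', M) point-to-caption similarities
    return F.cross_entropy(logits, point2caption[mask])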
Degree: Doctor of Philosophy
Subject: Computer vision
Three-dimensional imaging - Data processing
Artificial intelligence
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/352658

DC Field: Value
dc.contributor.author: Yang, Jihan
dc.contributor.author: 楊霽晗
dc.date.accessioned: 2024-12-19T09:27:02Z
dc.date.available: 2024-12-19T09:27:02Z
dc.date.issued: 2024
dc.identifier.citation: Yang, J. [楊霽晗]. (2024). Towards transferable and efficient embodied visual intelligence. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352658
dc.description.abstract: (see Abstract above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Three-dimensional imaging - Data processing
dc.subject.lcsh: Artificial intelligence
dc.title: Towards transferable and efficient embodied visual intelligence
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891405303414
