Appears in Collections: postgraduate thesis: Towards generalizable embodied AI system
| Title | Towards generalizable embodied AI system |
|---|---|
| Authors | Mu, Yao (穆尧) |
| Advisors | Luo, P; Wang, WP |
| Issue Date | 2025 |
| Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
| Citation | Mu, Y. [穆尧]. (2025). Towards generalizable embodied AI system. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
| Abstract | Embodied AI represents a crucial frontier in artificial intelligence research, aiming to create systems that can perceive, reason, and act within physical environments. Unlike traditional AI systems that operate purely in digital domains, embodied AI agents must navigate the complexities of real-world interactions, understanding spatial relationships, physical constraints, and multi-modal sensory inputs. In this dissertation, we study the problem of building generalizable embodied AI systems: developing effective embodied perception and cognition, efficient policy learning, and generalization in the real world. We tackle fundamental challenges in embodied AI system design under an integrated framework, including vision-language pre-training, policy learning, and sim-to-real transfer. We divide the dissertation into three parts. In Part I, we explore embodied perception and reasoning through three complementary approaches. In Chapter 2, we develop EmbodiedGPT, which decomposes complex instructions into executable atomic skills through vision-language pre-training with embodied chain-of-thought capabilities. In Chapter 3, we introduce RoboCodeX, a multimodal code-generation framework that translates semantic understanding into robotic control code. In Chapter 4, we present Emergent Communication for Embodied Control (EC²), which bridges visual demonstrations and symbolic language via emergent communication to establish a more natural connection between perceptual experiences and symbolic representations. In Part II, we focus on efficient and transferable policy learning approaches that enable robots to acquire skills with limited data while transferring knowledge between tasks. In Chapter 5, we introduce IDM (Imagining from Derived Memory), which improves sample efficiency and policy robustness through imagination-based training with derived memory. Unlike previous approaches that rely solely on real experiences, IDM constructs a "memory prosthesis" to enrich the diversity of imagination without requiring additional environment interactions. In Chapter 6, we develop CtrlFormer, which learns transferable state representations through a transformer-based architecture. By simultaneously learning visual features and policy representations across multiple tasks with an attention mechanism, CtrlFormer enables effective knowledge transfer while preventing catastrophic forgetting. In Part III, we advance sim-to-real transfer to bridge the gap between simulation and real-world deployment, encompassing both context learning for dynamics generalization and digital-twin frameworks for reliable deployment. In Chapter 7, we propose DOMINO (DecOmposed Mutual INformation Optimization), a framework that improves generalization to unseen environments through decomposed mutual information optimization. By learning disentangled context vectors that capture different aspects of environmental variation, DOMINO enables more effective adaptation across diverse scenarios. In Chapter 8, we introduce RoboTwin, a comprehensive framework that advances sim-to-real transfer through generative digital twins and spatially aware code generation. Starting from 2D images, RoboTwin employs foundation models to generate diverse 3D assets, incorporates spatial annotations for precise manipulation, and leverages large language models for task decomposition and code generation. Together, these works address the fundamentals of general embodied AI systems from three perspectives, forming an integrated framework for improved perception and cognition, efficient policy learning, and generalizable real-world deployment. |
| Degree | Doctor of Philosophy |
| Subject | Artificial intelligence |
| Dept/Program | Computer Science |
| Persistent Identifier | http://hdl.handle.net/10722/356573 |
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Luo, P | - |
| dc.contributor.advisor | Wang, WP | - |
| dc.contributor.author | Mu, Yao | - |
| dc.contributor.author | 穆尧 | - |
| dc.date.accessioned | 2025-06-05T09:31:11Z | - |
| dc.date.available | 2025-06-05T09:31:11Z | - |
| dc.date.issued | 2025 | - |
| dc.identifier.citation | Mu, Y. [穆尧]. (2025). Towards generalizable embodied AI system. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
| dc.identifier.uri | http://hdl.handle.net/10722/356573 | - |
| dc.description.abstract | Embodied AI represents a crucial frontier in artificial intelligence research, aiming to create systems that can perceive, reason, and act within physical environments. Unlike traditional AI systems that operate purely in digital domains, embodied AI agents must navigate the complexities of real-world interactions, understanding spatial relationships, physical constraints, and multi-modal sensory inputs. In this dissertation, we study the problem of building generalizable embodied AI systems: developing effective embodied perception and cognition, efficient policy learning, and generalization in the real world. We tackle fundamental challenges in embodied AI system design under an integrated framework, including vision-language pre-training, policy learning, and sim-to-real transfer. We divide the dissertation into three parts. In Part I, we explore embodied perception and reasoning through three complementary approaches. In Chapter 2, we develop EmbodiedGPT, which decomposes complex instructions into executable atomic skills through vision-language pre-training with embodied chain-of-thought capabilities. In Chapter 3, we introduce RoboCodeX, a multimodal code-generation framework that translates semantic understanding into robotic control code. In Chapter 4, we present Emergent Communication for Embodied Control (EC²), which bridges visual demonstrations and symbolic language via emergent communication to establish a more natural connection between perceptual experiences and symbolic representations. In Part II, we focus on efficient and transferable policy learning approaches that enable robots to acquire skills with limited data while transferring knowledge between tasks. In Chapter 5, we introduce IDM (Imagining from Derived Memory), which improves sample efficiency and policy robustness through imagination-based training with derived memory. Unlike previous approaches that rely solely on real experiences, IDM constructs a "memory prosthesis" to enrich the diversity of imagination without requiring additional environment interactions. In Chapter 6, we develop CtrlFormer, which learns transferable state representations through a transformer-based architecture. By simultaneously learning visual features and policy representations across multiple tasks with an attention mechanism, CtrlFormer enables effective knowledge transfer while preventing catastrophic forgetting. In Part III, we advance sim-to-real transfer to bridge the gap between simulation and real-world deployment, encompassing both context learning for dynamics generalization and digital-twin frameworks for reliable deployment. In Chapter 7, we propose DOMINO (DecOmposed Mutual INformation Optimization), a framework that improves generalization to unseen environments through decomposed mutual information optimization. By learning disentangled context vectors that capture different aspects of environmental variation, DOMINO enables more effective adaptation across diverse scenarios. In Chapter 8, we introduce RoboTwin, a comprehensive framework that advances sim-to-real transfer through generative digital twins and spatially aware code generation. Starting from 2D images, RoboTwin employs foundation models to generate diverse 3D assets, incorporates spatial annotations for precise manipulation, and leverages large language models for task decomposition and code generation. Together, these works address the fundamentals of general embodied AI systems from three perspectives, forming an integrated framework for improved perception and cognition, efficient policy learning, and generalizable real-world deployment. | - |
| dc.language | eng | - |
| dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
| dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
| dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
| dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
| dc.subject.lcsh | Artificial intelligence | - |
| dc.title | Towards generalizable embodied AI system | - |
| dc.type | PG_Thesis | - |
| dc.description.thesisname | Doctor of Philosophy | - |
| dc.description.thesislevel | Doctoral | - |
| dc.description.thesisdiscipline | Computer Science | - |
| dc.description.nature | published_or_final_version | - |
| dc.date.hkucongregation | 2025 | - |
| dc.identifier.mmsid | 991044970874803414 | - |
