Appears in Collections: postgraduate thesis: Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos
Title | Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos |
---|---|
Authors | Wu, Xiuzhe (吴秀哲) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wu, X. [吴秀哲]. (2024). Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | 3D reconstruction, synthesis, and animation are closely related key challenges in 3D computer vision. The process begins with perceiving 3D information, such as depth and motion, from images. This perception enhances model training for 3D scene reconstruction. Once perception and reconstruction are achieved, static and motion data can be disentangled, allowing models to learn the relationship between motion signals and object movements and enabling object animation at test time. However, acquiring 3D data for training often requires expensive equipment or specialized design expertise. In contrast, video data, captured by widely available cameras, offers a more accessible and information-rich source of 3D information. This thesis focuses on achieving accurate, continuous, and efficient learning from video data for the aforementioned tasks. The first challenge addressed is depth and 3D motion estimation from monocular videos. The lack of ground-truth data limits simple supervised methods, and while self-supervised approaches have improved, they still struggle with dynamic real-world scenes. To overcome this, a system with a depth estimation module and a Decomposed Object-wise 3D Motion (DO3D) module is proposed. These modules work together to effectively model scene geometry and dynamics, significantly improving estimation accuracy. After achieving single-frame perception, multiple viewpoints can be used to learn consistent appearance and geometry for a complete scene understanding. Neural Radiance Fields (NeRF) excel at generating high-fidelity images, but adapting them to dynamic real-world changes is time-consuming. The second challenge is efficiently adapting NeRF to real-world changes using short videos, addressed by CL-NeRF. This approach includes a lightweight expert adaptor, a conflict-aware knowledge distillation scheme to retain the knowledge of unchanged scene parts, and a new benchmark for evaluating continuous learning capabilities. Once the model's perception and reconstruction capabilities are developed, additional applications, such as avatar animation, are explored. In this task, the model learns the static appearance and geometry of an avatar and how to use motion signals to drive changes. My focus is on learning avatar animation from short videos or pre-trained models. Three subtasks are studied, categorized by the modality of the driving signal: video-driven, audio-driven, and text-driven. In the video-driven subtask, NeRF is leveraged for 3D avatar animation. PortraitNeRF generates high-fidelity videos by synchronizing head movements, facial expressions, lip movements, and torso movements with the driving video. In the audio-driven subtask, previous methods struggled with inaccurate lip shapes and poor image quality because only limited facial regions are driven by speech. Speech2Lip addresses this using a decomposition-synthesis-composition approach, employing a speech-driven lip model and a geometry-aware mapping module to create natural-looking videos, further enhanced by BlendNet and a contrastive sync loss. Finally, in the text-driven subtask, talking avatar videos are generated from a text description of the avatar's appearance and are driven to speak user-specified text content. The lack of paired text and avatar video data is tackled through a zero-shot approach, integrating a 3D-aware static avatar generation model, an explicit parametric model (the 3D Morphable Model, 3DMM), a self-supervised InpaintNet, and novel consistency losses to ensure quality and temporal coherence. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Image processing - Digital techniques; Three-dimensional imaging |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352664 |
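
The first contribution summarized in the abstract pairs a depth-estimation module with a Decomposed Object-wise 3D Motion (DO3D) module and trains them by self-supervision on monocular video. A common way to realize such supervision is view synthesis: back-project target pixels with the predicted depth, displace them by the predicted 3D motion, project them into a neighbouring frame, and penalize the photometric difference. The PyTorch sketch below illustrates only that generic objective; the function names, tensor layouts, the single combined motion field, and the plain L1 term are assumptions for illustration, not the thesis's actual DO3D implementation.

```python
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift every pixel to a 3D point in the camera frame.

    depth: (B, 1, H, W) predicted depth; K_inv: (B, 3, 3) inverse intrinsics.
    """
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1)
    rays = K_inv @ pix.expand(B, -1, -1)           # (B, 3, H*W) viewing rays
    return (rays * depth.view(B, 1, -1)).view(B, 3, H, W)


def photometric_loss(src_img, tgt_img, depth, motion_3d, K, K_inv):
    """Warp src_img into the target view and compare it with tgt_img.

    motion_3d: (B, 3, H, W) per-pixel 3D displacement, e.g. the sum of an
    ego-motion flow and object-wise motion fields (hypothetical decomposition).
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse.
    """
    B, _, H, W = depth.shape
    pts = backproject(depth, K_inv) + motion_3d    # displaced 3D points
    cam = K @ pts.view(B, 3, -1)                   # project into the source view
    xy = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)
    gx = 2.0 * xy[:, 0] / (W - 1) - 1.0            # normalize to [-1, 1]
    gy = 2.0 * xy[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True, padding_mode="border")
    return (warped - tgt_img).abs().mean()         # simple L1 photometric term
```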
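
CL-NeRF, the second contribution, adapts a trained NeRF to scene changes with a lightweight expert adaptor while a conflict-aware knowledge-distillation scheme preserves unchanged regions. The sketch below shows one minimal way such an adaptor-plus-distillation step could look: a frozen base field acts as the teacher, a small residual branch fits the new observations, and a distillation term keeps the combined prediction close to the teacher wherever samples are believed to be unchanged. The network sizes, the residual formulation, the per-sample regression stand-in for volume rendering, and the `unchanged_mask` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNeRF(nn.Module):
    """Toy radiance field: 3D point -> (RGB, density), standing in for the full model."""

    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz):
        return self.mlp(xyz)


class ExpertAdaptor(nn.Module):
    """Lightweight residual branch; the only part trained on the new observations."""

    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz):
        return self.mlp(xyz)


def continual_step(base, adaptor, xyz, target, unchanged_mask, lam=1.0):
    """One training step: fit the new data, distill old predictions elsewhere.

    base is frozen (teacher); only the adaptor receives gradients.
    unchanged_mask marks samples believed to come from unchanged scene parts.
    """
    with torch.no_grad():
        old_pred = base(xyz)                   # pre-change field as the teacher
    new_pred = old_pred + adaptor(xyz)         # base + residual expert

    recon = F.mse_loss(new_pred, target)       # fit observations from the short video
    distill = F.mse_loss(new_pred[unchanged_mask], old_pred[unchanged_mask])
    return recon + lam * distill               # distill only where the scene is unchanged
```

In the actual method the supervision and distillation would act on volume-rendered rays rather than raw per-point outputs; the per-sample regression above only keeps the sketch self-contained.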
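
Speech2Lip's contrastive sync loss, mentioned for the audio-driven subtask, encourages lip-region features and the corresponding speech features to agree while mismatched pairs are pushed apart. The InfoNCE-style sketch below is one standard way to write such an objective; the embedding shapes, the temperature, and the symmetric formulation are assumptions and may differ from the loss actually used in the thesis.

```python
import torch
import torch.nn.functional as F


def contrastive_sync_loss(lip_emb, audio_emb, temperature=0.07):
    """InfoNCE-style sync loss over a batch of (lip, audio) pairs.

    lip_emb, audio_emb: (B, D) embeddings from (hypothetical) lip and speech
    encoders; row i of each tensor comes from the same time window.
    """
    lip = F.normalize(lip_emb, dim=-1)
    aud = F.normalize(audio_emb, dim=-1)
    logits = lip @ aud.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(lip.size(0), device=lip.device)
    # Symmetric loss: match lips to audio and audio to lips.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```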
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wu, Xiuzhe | - |
dc.contributor.author | 吴秀哲 | - |
dc.date.accessioned | 2024-12-19T09:27:05Z | - |
dc.date.available | 2024-12-19T09:27:05Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Wu, X. [吴秀哲]. (2024). Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352664 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Image processing - Digital techniques | - |
dc.subject.lcsh | Three-dimensional imaging | - |
dc.title | Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891404103414 | - |