Appears in Collections: postgraduate thesis: Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos
Title | Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos |
---|---|
Authors | Wu, Xiuzhe (吴秀哲) |
Issue Date | 2024 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wu, X. [吴秀哲]. (2024). Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | 3D reconstruction, synthesis, and animation are closely related key challenges in 3D computer vision. The process begins with perceiving 3D information, such as depth and motion, from images. This perception enhances model training for 3D scene reconstruction. Once perception and reconstruction are achieved, static and motion data can be disentangled, allowing models to learn the relationship between motion signals and object movements and enabling object animation at test time. However, acquiring 3D data for training often requires expensive equipment or specialized design expertise. In contrast, video data, captured by widely available cameras, offers a more accessible and information-rich source of 3D information. This thesis focuses on achieving accurate, continuous, and efficient learning from video data for the aforementioned tasks. The first challenge addressed is depth and 3D motion estimation from monocular videos. The lack of ground-truth data limits simple supervised methods, and while self-supervised approaches have improved, they still struggle with dynamic real-world scenes. To overcome this, a system with a depth estimation module and a Decomposed Object-wise 3D Motion (DO3D) module is proposed. These modules work together to effectively model scene geometry and dynamics, significantly improving estimation accuracy. After achieving single-frame perception, multiple viewpoints can be used to learn consistent appearance and geometry for a complete scene understanding. Neural Radiance Fields (NeRF) excel at generating high-fidelity images, but adapting them to dynamic real-world changes is time-consuming. The second challenge is efficiently adapting NeRF to real-world changes using short videos, addressed by CL-NeRF. This approach includes a lightweight expert adaptor, a conflict-aware knowledge distillation scheme to retain the knowledge of unchanged scene parts, and a new benchmark for evaluating continuous learning capabilities. Once the model's perception and reconstruction capabilities are developed, additional applications, such as avatar animation, are explored. In this task, the model learns the static appearance and geometry of an avatar and how to use motion signals to drive changes. My focus is on learning avatar animation from short videos or pre-trained models. Three subtasks are studied, categorized by the modality of the driving signal: video-driven, audio-driven, and text-driven. In the video-driven subtask, NeRF is leveraged for 3D avatar animation. PortraitNeRF generates high-fidelity videos by synchronizing head movements, facial expressions, lip movements, and torso movements with the driving video. In the audio-driven subtask, previous methods struggled with inaccurate lip shapes and poor image quality because only limited facial regions are driven by speech. Speech2Lip addresses this using a decomposition-synthesis-composition approach, employing a speech-driven lip model and a geometry-aware mapping module to create natural-looking videos, further enhanced by BlendNet and a contrastive sync loss. Finally, in the text-driven subtask, talking avatar videos are generated from a text description of the avatar's appearance and are driven to speak user-specified text content. The lack of paired text and avatar video data is tackled through a zero-shot approach, integrating a 3D-aware static avatar generation model, an explicit parametric model (the 3D Morphable Model, 3DMM), a self-supervised InpaintNet, and novel consistency losses to ensure quality and temporal coherence. |
Degree | Doctor of Philosophy |
Subject | Computer vision; Image processing - Digital techniques; Three-dimensional imaging |
Dept/Program | Electrical and Electronic Engineering |
Persistent Identifier | http://hdl.handle.net/10722/352664 |
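
The first contribution summarized in the abstract pairs a depth-estimation module with a Decomposed Object-wise 3D Motion (DO3D) module and trains them by self-supervision on monocular video. A common way to realize such supervision is view synthesis: back-project target pixels with the predicted depth, displace them by the predicted 3D motion, project them into a neighbouring frame, and penalize the photometric difference. The PyTorch sketch below illustrates only that generic objective; the function names, tensor layouts, the single combined motion field, and the plain L1 term are assumptions for illustration, not the thesis's actual DO3D implementation.

```python
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift every pixel to a 3D point in the camera frame.

    depth: (B, 1, H, W) predicted depth; K_inv: (B, 3, 3) inverse intrinsics.
    """
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1)
    rays = K_inv @ pix.expand(B, -1, -1)           # (B, 3, H*W) viewing rays
    return (rays * depth.view(B, 1, -1)).view(B, 3, H, W)


def photometric_loss(src_img, tgt_img, depth, motion_3d, K, K_inv):
    """Warp src_img into the target view and compare it with tgt_img.

    motion_3d: (B, 3, H, W) per-pixel 3D displacement, e.g. the sum of an
    ego-motion flow and object-wise motion fields (hypothetical decomposition).
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse.
    """
    B, _, H, W = depth.shape
    pts = backproject(depth, K_inv) + motion_3d    # displaced 3D points
    cam = K @ pts.view(B, 3, -1)                   # project into the source view
    xy = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)
    gx = 2.0 * xy[:, 0] / (W - 1) - 1.0            # normalize to [-1, 1]
    gy = 2.0 * xy[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True, padding_mode="border")
    return (warped - tgt_img).abs().mean()         # simple L1 photometric term
```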
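
CL-NeRF, the second contribution, adapts a trained NeRF to scene changes with a lightweight expert adaptor while a conflict-aware knowledge-distillation scheme preserves unchanged regions. The sketch below shows one minimal way such an adaptor-plus-distillation step could look: a frozen base field acts as the teacher, a small residual branch fits the new observations, and a distillation term keeps the combined prediction close to the teacher wherever samples are believed to be unchanged. The network sizes, the residual formulation, the per-sample regression stand-in for volume rendering, and the `unchanged_mask` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNeRF(nn.Module):
    """Toy radiance field: 3D point -> (RGB, density), standing in for the full model."""

    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz):
        return self.mlp(xyz)


class ExpertAdaptor(nn.Module):
    """Lightweight residual branch; the only part trained on the new observations."""

    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz):
        return self.mlp(xyz)


def continual_step(base, adaptor, xyz, target, unchanged_mask, lam=1.0):
    """One training step: fit the new data, distill old predictions elsewhere.

    base is frozen (teacher); only the adaptor receives gradients.
    unchanged_mask marks samples believed to come from unchanged scene parts.
    """
    with torch.no_grad():
        old_pred = base(xyz)                   # pre-change field as the teacher
    new_pred = old_pred + adaptor(xyz)         # base + residual expert

    recon = F.mse_loss(new_pred, target)       # fit observations from the short video
    distill = F.mse_loss(new_pred[unchanged_mask], old_pred[unchanged_mask])
    return recon + lam * distill               # distill only where the scene is unchanged
```

In the actual method the supervision and distillation would act on volume-rendered rays rather than raw per-point outputs; the per-sample regression above only keeps the sketch self-contained.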
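
Speech2Lip's contrastive sync loss, mentioned for the audio-driven subtask, encourages lip-region features and the corresponding speech features to agree while mismatched pairs are pushed apart. The InfoNCE-style sketch below is one standard way to write such an objective; the embedding shapes, the temperature, and the symmetric formulation are assumptions and may differ from the loss actually used in the thesis.

```python
import torch
import torch.nn.functional as F


def contrastive_sync_loss(lip_emb, audio_emb, temperature=0.07):
    """InfoNCE-style sync loss over a batch of (lip, audio) pairs.

    lip_emb, audio_emb: (B, D) embeddings from (hypothetical) lip and speech
    encoders; row i of each tensor comes from the same time window.
    """
    lip = F.normalize(lip_emb, dim=-1)
    aud = F.normalize(audio_emb, dim=-1)
    logits = lip @ aud.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(lip.size(0), device=lip.device)
    # Symmetric loss: match lips to audio and audio to lips.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```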
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wu, Xiuzhe | - |
dc.contributor.author | 吴秀哲 | - |
dc.date.accessioned | 2024-12-19T09:27:05Z | - |
dc.date.available | 2024-12-19T09:27:05Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | Wu, X. [吴秀哲]. (2024). Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/352664 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Computer vision | - |
dc.subject.lcsh | Image processing - Digital techniques | - |
dc.subject.lcsh | Three-dimensional imaging | - |
dc.title | Continuous and efficient learning of 3D reconstruction, synthesis, and animation from videos | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Electrical and Electronic Engineering | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2024 | - |
dc.identifier.mmsid | 991044891404103414 | - |