
Postgraduate thesis: Deep learning methods for 3D visual data processing

Title: Deep learning methods for 3D visual data processing
Authors: Liu, Chang [刘畅]
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Liu, C. [刘畅]. (2024). Deep learning methods for 3D visual data processing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: Over the past decade, deep neural networks (DNNs) have made significant strides in artificial intelligence. In computer vision, 3D vision is essential for AI to understand the physical world, yet many technical challenges persist. This thesis presents a roadmap for advancing 3D neural networks, covering data-specific network design for event cameras, autolabeling of 3D point clouds, and a general, efficient 3D backbone network. First, we explore DNNs for event cameras, which capture optical changes at high temporal resolution; processing such data raises issues such as misalignment and motion blur. To address this, we propose ACE-BET, a novel architecture that converts event data into a dense tensor using heuristic methods and 3D spatio-temporal convolutions. ACE-BET processes this tensor in two branches, one for static and one for dynamic tasks, achieving state-of-the-art accuracy and the fastest inference speed across all seven tasks in our experiments. Recognizing that event data can be treated as a type of 3D point cloud, we then turn to general 3D point clouds. A major bottleneck in training on 3D point clouds is annotation, which is time-consuming and difficult because the clouds are sparse. To alleviate this, we introduce MAP-Gen and MTrans, autolabelers that generate 3D annotations from multimodal inputs of images and point clouds. By leveraging dense image information, they estimate the 3D coordinates of image pixels, thereby densifying the sparse point cloud and significantly improving annotation quality. MAP-Gen and MTrans outperform all previous autolabelers, achieving 94-98% of human annotation performance. Finally, we investigate a general network for understanding 3D point clouds. Throughout the work on ACE-BET, MAP-Gen, and MTrans, one major challenge was the scale of point clouds: for transformer architectures in particular, whose cost grows quadratically with sequence length, the cloud's resolution and size must be constrained, leading to information loss. We therefore propose ISAPnet, which uses a space tree to partition 3D space according to the point cloud's distribution, and variable-length sequences to match the varying levels of detail within a cloud. Experiments demonstrate ISAPnet's efficiency: it achieves comparable or 0.4% higher accuracy while using only 7% of the FLOPs of previous state-of-the-art methods.
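The pipeline descriptions in the abstract can be made more concrete with two simplified Python sketches. Neither is taken from the thesis; the function names, array layouts, bin count, and camera intrinsics below are illustrative assumptions only.

First, a minimal sketch of the general idea of converting an asynchronous event stream into a dense spatio-temporal tensor that 3D convolutions can consume (the abstract describes ACE-BET as doing this with heuristic methods, whose details are not reproduced here):

import numpy as np

def events_to_voxel_grid(x, y, t, p, height, width, num_bins=8):
    """Accumulate events (pixel x, pixel y, timestamp, polarity) into a
    dense (num_bins, height, width) tensor. Purely illustrative."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # Normalise timestamps to [0, 1] and assign each event to a temporal bin.
    t_norm = (t - t.min()) / max(float(t.max() - t.min()), 1e-9)
    bins = np.minimum((t_norm * num_bins).astype(int), num_bins - 1)
    # Positive-polarity events add intensity; negative-polarity events subtract it.
    np.add.at(grid, (bins, y.astype(int), x.astype(int)), np.where(p > 0, 1.0, -1.0))
    return grid

Second, the densification idea behind the autolabelers can be illustrated by back-projecting image pixels into 3D once a per-pixel depth estimate is available. In practice such depths would come from a learned model conditioned on the image and the sparse points; the pinhole intrinsics (fx, fy, cx, cy) here are assumptions:

def pixels_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H*W, 3) array of
    camera-frame 3D points using the standard pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel column/row indices
    z = depth.reshape(-1)
    x3d = (u.reshape(-1) - cx) * z / fx
    y3d = (v.reshape(-1) - cy) * z / fy
    return np.stack([x3d, y3d, z], axis=1)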
Degree: Doctor of Philosophy
Subject: Computer vision
Three-dimensional imaging
Deep learning (Machine learning)
Dept/Program: Electrical and Electronic Engineering
Persistent Identifier: http://hdl.handle.net/10722/352688

 

DC Field / Value
dc.contributor.author: Liu, Chang
dc.contributor.author: 刘畅
dc.date.accessioned: 2024-12-19T09:27:21Z
dc.date.available: 2024-12-19T09:27:21Z
dc.date.issued: 2024
dc.identifier.citation: Liu, C. [刘畅]. (2024). Deep learning methods for 3D visual data processing. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/352688
dc.description.abstract: Over the past decade, deep neural networks (DNNs) have made significant strides in artificial intelligence. In computer vision, 3D vision is essential for AI to understand the physical world, yet many technical challenges persist. This thesis presents a roadmap for advancing 3D neural networks, covering data-specific network design for event cameras, autolabeling of 3D point clouds, and a general, efficient 3D backbone network. First, we explore DNNs for event cameras, which capture optical changes at high temporal resolution; processing such data raises issues such as misalignment and motion blur. To address this, we propose ACE-BET, a novel architecture that converts event data into a dense tensor using heuristic methods and 3D spatio-temporal convolutions. ACE-BET processes this tensor in two branches, one for static and one for dynamic tasks, achieving state-of-the-art accuracy and the fastest inference speed across all seven tasks in our experiments. Recognizing that event data can be treated as a type of 3D point cloud, we then turn to general 3D point clouds. A major bottleneck in training on 3D point clouds is annotation, which is time-consuming and difficult because the clouds are sparse. To alleviate this, we introduce MAP-Gen and MTrans, autolabelers that generate 3D annotations from multimodal inputs of images and point clouds. By leveraging dense image information, they estimate the 3D coordinates of image pixels, thereby densifying the sparse point cloud and significantly improving annotation quality. MAP-Gen and MTrans outperform all previous autolabelers, achieving 94-98% of human annotation performance. Finally, we investigate a general network for understanding 3D point clouds. Throughout the work on ACE-BET, MAP-Gen, and MTrans, one major challenge was the scale of point clouds: for transformer architectures in particular, whose cost grows quadratically with sequence length, the cloud's resolution and size must be constrained, leading to information loss. We therefore propose ISAPnet, which uses a space tree to partition 3D space according to the point cloud's distribution, and variable-length sequences to match the varying levels of detail within a cloud. Experiments demonstrate ISAPnet's efficiency: it achieves comparable or 0.4% higher accuracy while using only 7% of the FLOPs of previous state-of-the-art methods.
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Three-dimensional imaging
dc.subject.lcsh: Deep learning (Machine learning)
dc.title: Deep learning methods for 3D visual data processing
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Electrical and Electronic Engineering
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2024
dc.identifier.mmsid: 991044891408603414
