
Postgraduate thesis: Towards generic human-centric visual analysis

Title: Towards generic human-centric visual analysis
Authors: Jin, Sheng [金晟]
Issue Date: 2024
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Jin, S. [金晟]. (2024). Towards generic human-centric visual analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.

Abstract: Human-centric visual analysis, also referred to as human-centric perception (HCP), encompasses several essential research areas, such as pedestrian detection, segmentation, pose estimation, and attribute analysis. HCP is integral to numerous industrial applications, including sports analysis, virtual reality (VR), and augmented reality (AR). Existing specialist models for HCP often cater to specific tasks, limiting their broader applicability. This dissertation endeavors to construct a "generic" HCP system characterized by its accuracy, efficiency, versatility, and openness, to support a broad spectrum of applications. Three primary challenges are addressed: unifying the end-to-end pipeline, unifying different body parts, and unifying different visual tasks.

The research first delves into multi-person pose estimation, a critical aspect of HCP. Previous algorithms have predominantly relied on multi-stage pipelines, particularly top-down or bottom-up methods. This research focuses on bottom-up pose estimation, which is more efficient but treats keypoint grouping as an independent post-processing step, resulting in sub-optimal models that cannot be trained end-to-end. To address this, a differentiable hierarchical graph grouping method is introduced. This method casts keypoint grouping as a graph clustering problem, enabling end-to-end training and hierarchical supervision for improved accuracy, as validated by extensive experiments on the COCO and OCHuman datasets.

Further enhancing versatility involves unifying the visual analysis of distinct body parts, such as the human body, face, and hands. Rather than treating these analyses independently as in previous studies, the research integrates them into a single task: 2D human whole-body pose estimation, which localizes dense landmarks across the entire human body. Due to the absence of adequate datasets, prior models had to be cobbled together from separate datasets, leading to bias and complexity issues. This research presents the COCO-WholeBody dataset, a new benchmark with extensive body-part annotations, and introduces ZoomNet, a novel model that leverages the human body's hierarchical structure for enhanced performance.

Expanding beyond pose estimation, the study explores the unification of different visual tasks for enhanced "versatility". By proposing a single-stage framework for multi-person, multi-task HCP, covering classification, detection, segmentation, and pose estimation, the research underscores the correlation among diverse visual tasks. Central to this approach is the "Human Query" concept, which encapsulates individual instance-level features to manage complex multi-person environments. The introduced COCO-UniHuman benchmark dataset facilitates exhaustive model development and evaluation, showcasing the superiority of the proposed method over existing multi-task frameworks.

To address the challenge of instance perception in the "open" category within the few-shot learning context, the study introduces UniFS, a novel approach that overcomes the impediments of conventional few-shot learning by framing various instance perception tasks within a dynamic point representation learning framework. By incorporating structure-aware point learning, the generalist model enhances representation learning by capturing higher-order structures among points, yielding competitive outcomes against specialized models.

In conclusion, this research offers comprehensive solutions with far-reaching implications for future research and industrial applications. By establishing benchmarks and methodologies for human-centric visual analysis technology, it paves the way for advancements in the field.

Degree: Doctor of Philosophy
Subjects: Computer vision; Pattern recognition systems
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/353386

 

DC Field: Value
dc.contributor.author: Jin, Sheng
dc.contributor.author: 金晟
dc.date.accessioned: 2025-01-17T09:46:13Z
dc.date.available: 2025-01-17T09:46:13Z
dc.date.issued: 2024
dc.identifier.citation: Jin, S. [金晟]. (2024). Towards generic human-centric visual analysis. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/353386
dc.description.abstract: (identical to the abstract given above)
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Pattern recognition systems
dc.title: Towards generic human-centric visual analysis
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991044897478403414

Export: available via the OAI-PMH interface (XML formats) or in other non-XML formats.
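As an illustrative sketch, a Dublin Core (oai_dc) record such as an OAI-PMH XML export of this item can be parsed with Python's standard library. Note that the XML sample below is hand-assembled from the fields shown in this record, not an actual server response, and the real export may contain additional fields.

```python
# Parse a minimal oai_dc Dublin Core record into a Python dict.
# SAMPLE is a hypothetical fragment built from this record's fields.
import xml.etree.ElementTree as ET

SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Towards generic human-centric visual analysis</dc:title>
  <dc:creator>Jin, Sheng</dc:creator>
  <dc:date>2024</dc:date>
  <dc:identifier>http://hdl.handle.net/10722/353386</dc:identifier>
</oai_dc:dc>"""

root = ET.fromstring(SAMPLE)
# Element tags carry the namespace as "{uri}localname"; strip the URI part
# to get plain field names like "title" and "creator".
record = {child.tag.split("}")[1]: child.text for child in root}

print(record["title"])       # Towards generic human-centric visual analysis
print(record["identifier"])  # http://hdl.handle.net/10722/353386
```

A real harvest would fetch this XML from the repository's OAI-PMH endpoint (e.g. with `urllib.request`) and handle repeated fields (this record has two `dc.contributor.author` values) by collecting lists instead of overwriting dict keys.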