File Download
There are no files associated with this item.
Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/CVPR52688.2022.01021
- Scopus: eid_2-s2.0-85135544090
Citations:
- Scopus: 0
Appears in Collections:
Conference Paper: Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
Title | Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation |
---|---|
Authors | Liu, Xian; Wu, Qianyi; Zhou, Hang; Xu, Yinghao; Qian, Rui; Lin, Xinyi; Zhou, Xiaowei; Wu, Wayne; Dai, Bo; Zhou, Bolei |
Keywords | Face and gestures; Vision + X |
Issue Date | 2022 |
Citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, v. 2022-June, p. 10452-10462 |
Abstract | Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally decomposed into multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G. |
Persistent Identifier | http://hdl.handle.net/10722/352302 |
ISSN | 1063-6919 |
2023 SCImago Journal Rankings | 10.331 |
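To make the coarse-to-fine idea in the abstract concrete, the sketch below outlines a minimal hierarchical audio-to-gesture pipeline. It is an illustration only, not the authors' HA2G implementation: the mel-spectrogram input, module names, feature sizes, and joint counts are assumptions made for the example.

```python
# Illustrative sketch only, not the authors' HA2G code: a minimal coarse-to-fine
# audio-to-gesture pipeline. The mel-spectrogram input, module names, feature
# sizes, and joint counts below are assumptions made for the example.
import torch
import torch.nn as nn


class HierarchicalAudioLearner(nn.Module):
    """Extracts audio features at several semantic granularities (stacked GRUs)."""

    def __init__(self, n_mels=80, dim=256, levels=3):
        super().__init__()
        self.rnns = nn.ModuleList([
            nn.GRU(n_mels if i == 0 else dim, dim, batch_first=True)
            for i in range(levels)
        ])

    def forward(self, mel):                      # mel: (B, T, n_mels)
        feats, x = [], mel
        for rnn in self.rnns:
            x, _ = rnn(x)                        # (B, T, dim), increasingly high-level
            feats.append(x)
        return feats                             # one feature sequence per level


class HierarchicalPoseInferer(nn.Module):
    """Renders the pose coarse-to-fine: each level conditions on the previous one."""

    def __init__(self, dim=256, joints_per_level=(8, 16, 42)):
        super().__init__()
        self.heads, prev = nn.ModuleList(), 0
        for n_joints in joints_per_level:
            self.heads.append(nn.Linear(dim + prev * 3, n_joints * 3))
            prev = n_joints

    def forward(self, audio_feats):              # list of (B, T, dim) tensors
        pose = None
        for feat, head in zip(audio_feats, self.heads):
            inp = feat if pose is None else torch.cat([feat, pose], dim=-1)
            pose = head(inp)                     # (B, T, n_joints * 3) at this level
        return pose                              # finest-level pose sequence


if __name__ == "__main__":
    mel = torch.randn(2, 100, 80)                # 2 clips, 100 frames of mel features
    feats = HierarchicalAudioLearner()(mel)
    pose = HierarchicalPoseInferer()(feats)
    print(pose.shape)                            # torch.Size([2, 100, 126])
```

In HA2G terms, the first module stands in for the Hierarchical Audio Learner and the second for the Hierarchical Pose Inferer; the key point, as in the abstract, is that finer joints are predicted conditioned on the audio features and on the coarser joints already generated.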
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Liu, Xian | - |
dc.contributor.author | Wu, Qianyi | - |
dc.contributor.author | Zhou, Hang | - |
dc.contributor.author | Xu, Yinghao | - |
dc.contributor.author | Qian, Rui | - |
dc.contributor.author | Lin, Xinyi | - |
dc.contributor.author | Zhou, Xiaowei | - |
dc.contributor.author | Wu, Wayne | - |
dc.contributor.author | Dai, Bo | - |
dc.contributor.author | Zhou, Bolei | - |
dc.date.accessioned | 2024-12-16T03:57:57Z | - |
dc.date.available | 2024-12-16T03:57:57Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, v. 2022-June, p. 10452-10462 | - |
dc.identifier.issn | 1063-6919 | - |
dc.identifier.uri | http://hdl.handle.net/10722/352302 | - |
dc.description.abstract | Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally decomposed into multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G. | -
dc.language | eng | - |
dc.relation.ispartof | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | - |
dc.subject | Face and gestures | - |
dc.subject | Vision + X | - |
dc.title | Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/CVPR52688.2022.01021 | - |
dc.identifier.scopus | eid_2-s2.0-85135544090 | - |
dc.identifier.volume | 2022-June | - |
dc.identifier.spage | 10452 | - |
dc.identifier.epage | 10462 | - |
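The abstract also mentions a contrastive learning strategy based on audio-text alignment. The sketch below shows one common way such an objective is written (a symmetric InfoNCE loss over paired audio and text embeddings); it is an assumption for illustration, not the paper's actual loss, and the embedding size and temperature are placeholders.

```python
# Illustrative sketch only: an InfoNCE-style contrastive objective for aligning
# audio and text embeddings. Projection size and temperature are assumptions,
# not the paper's settings.
import torch
import torch.nn.functional as F


def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (B, D) clip-level embeddings of paired audio and text."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matched audio/text pairs (the diagonal) are pulled together; mismatched
    # pairs are pushed apart, symmetrically over both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    a, t = torch.randn(4, 256), torch.randn(4, 256)
    print(audio_text_contrastive_loss(a, t).item())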