Not all tokens are equal: Human-centric visual analysis via token clustering transformer

Zeng, W; Jin, S; Liu, W; Qian, C; Luo, P; Ouyang, W; Wang, X

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Not all tokens are equal: Human-centric visual analysis via token clustering transformer

Title	Not all tokens are equal: Human-centric visual analysis via token clustering transformer
Authors	Zeng, W; Jin, S; Liu, W; Qian, C; Luo, P; Ouyang, W; Wang, X
Issue Date	2022
Publisher	IEEE Computer Society.
Citation	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Virtual), New Orleans, Louisiana, USA, 19-24 June, 2022. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, p. 11101-11111
Abstract	Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW.
Description	Oral
Persistent Identifier	http://hdl.handle.net/10722/315678

DC Field	Value	Language
dc.contributor.author	Zeng, W	-
dc.contributor.author	Jin, S	-
dc.contributor.author	Liu, W	-
dc.contributor.author	Qian, C	-
dc.contributor.author	Luo, P	-
dc.contributor.author	Ouyang, W	-
dc.contributor.author	Wang, X	-
dc.date.accessioned	2022-08-19T09:02:24Z	-
dc.date.available	2022-08-19T09:02:24Z	-
dc.date.issued	2022	-
dc.identifier.citation	IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Virtual), New Orleans, Louisiana, USA, 19-24 June, 2022. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, p. 11101-11111	-
dc.identifier.uri	http://hdl.handle.net/10722/315678	-
dc.description	Oral	-
dc.description.abstract	Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW.	-
dc.language	eng	-
dc.publisher	IEEE Computer Society.	-
dc.relation.ispartof	Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022	-
dc.rights	Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Copyright © IEEE Computer Society.	-
dc.title	Not all tokens are equal: Human-centric visual analysis via token clustering transformer	-
dc.type	Conference_Paper	-
dc.identifier.email	Luo, P: pluo@hku.hk	-
dc.identifier.authority	Luo, P=rp02575	-
dc.identifier.hkuros	335587	-
dc.identifier.spage	11101	-
dc.identifier.epage	11111	-
dc.publisher.place	United States	-
