Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/TPAMI.2024.3425768
- Scopus: eid_2-s2.0-85198376786
Article: TCFormer: Visual Recognition via Token Clustering Transformer
Title | TCFormer: Visual Recognition via Token Clustering Transformer |
---|---|
Authors | Zeng, Wang; Jin, Sheng; Xu, Lumin; Liu, Wentao; Qian, Chen; Ouyang, Wanli; Luo, Ping; Wang, Xiaogang |
Keywords | dynamic token; human pose estimation; image classification; object detection; semantic segmentation; Semantics; Shape; Task analysis; Transformers; Vision transformer |
Issue Date | 1-Jan-2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-16 |
Abstract | Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) Representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and represent them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer. |
Persistent Identifier | http://hdl.handle.net/10722/348565 |
ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
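The core idea described in the abstract — replacing the fixed grid of vision tokens with dynamic tokens, where image regions with similar semantics (even non-adjacent ones) are merged into a single token — can be illustrated with a toy clustering sketch. Note this is only an assumption-laden illustration: TCFormer's actual token-merging module is learned and integrated into the transformer, not plain k-means, and the function name `merge_tokens` and all parameters here are hypothetical.

```python
import numpy as np

def merge_tokens(tokens, k, n_iter=10, seed=0):
    """Toy sketch of clustering-based token merging: group token feature
    vectors with a simple k-means and represent each cluster by the mean
    of its member features, yielding k dynamic tokens from n grid tokens.
    tokens: (n, d) array of token features; returns (k, d) merged tokens
    plus the (n,) cluster assignment of each original token."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # initialize cluster centers from randomly chosen tokens
    centers = tokens[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each token to its nearest center (squared L2 distance)
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # recompute each center as the mean feature of its members;
        # non-adjacent tokens with similar features end up in one cluster
        for c in range(k):
            members = tokens[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return centers, labels
```

In this sketch, spatial position plays no role: only feature similarity decides which tokens merge, which mirrors the abstract's point that one dynamic token may cover semantically similar but non-adjacent regions.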
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zeng, Wang | - |
dc.contributor.author | Jin, Sheng | - |
dc.contributor.author | Xu, Lumin | - |
dc.contributor.author | Liu, Wentao | - |
dc.contributor.author | Qian, Chen | - |
dc.contributor.author | Ouyang, Wanli | - |
dc.contributor.author | Luo, Ping | - |
dc.contributor.author | Wang, Xiaogang | - |
dc.date.accessioned | 2024-10-10T00:31:37Z | - |
dc.date.available | 2024-10-10T00:31:37Z | - |
dc.date.issued | 2024-01-01 | - |
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-16 | - |
dc.identifier.issn | 0162-8828 | - |
dc.identifier.uri | http://hdl.handle.net/10722/348565 | - |
dc.description.abstract | Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) Representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and represent them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer. | - |
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | dynamic token | - |
dc.subject | human pose estimation | - |
dc.subject | Image classification | - |
dc.subject | image classification | - |
dc.subject | object detection | - |
dc.subject | Object detection | - |
dc.subject | semantic segmentation | - |
dc.subject | Semantic segmentation | - |
dc.subject | Semantics | - |
dc.subject | Shape | - |
dc.subject | Task analysis | - |
dc.subject | Transformers | - |
dc.subject | Vision transformer | - |
dc.title | TCFormer: Visual Recognition via Token Clustering Transformer | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/TPAMI.2024.3425768 | - |
dc.identifier.scopus | eid_2-s2.0-85198376786 | - |
dc.identifier.spage | 1 | - |
dc.identifier.epage | 16 | - |
dc.identifier.eissn | 1939-3539 | - |
dc.identifier.issnl | 0162-8828 | - |