Links for fulltext (may require subscription):
- Publisher Website (DOI): 10.1109/TPAMI.2024.3425768
- Scopus: eid_2-s2.0-85198376786
Article: TCFormer: Visual Recognition via Token Clustering Transformer
Title | TCFormer: Visual Recognition via Token Clustering Transformer |
---|---|
Authors | Zeng, Wang; Jin, Sheng; Xu, Lumin; Liu, Wentao; Qian, Chen; Ouyang, Wanli; Luo, Ping; Wang, Xiaogang |
Keywords | dynamic token; human pose estimation; image classification; object detection; semantic segmentation; Semantics; Shape; Task analysis; Transformers; Vision transformer |
Issue Date | 1-Jan-2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-16 |
Abstract | Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) Representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and represent them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer. |
Persistent Identifier | http://hdl.handle.net/10722/348565 |
ISSN | 0162-8828 (2023 Impact Factor: 20.8; 2023 SCImago Journal Rankings: 6.158) |
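The core idea described in the abstract — replacing the fixed grid of vision tokens with dynamic tokens, where image regions with similar semantics (even non-adjacent ones) are merged into a single token — can be illustrated with a toy clustering sketch. Note this is only an assumption-laden illustration: TCFormer's actual token-merging module is learned and integrated into the transformer, not plain k-means, and the function name `merge_tokens` and all parameters here are hypothetical.

```python
import numpy as np

def merge_tokens(tokens, k, n_iter=10, seed=0):
    """Toy sketch of clustering-based token merging: group token feature
    vectors with a simple k-means and represent each cluster by the mean
    of its member features, yielding k dynamic tokens from n grid tokens.
    tokens: (n, d) array of token features; returns (k, d) merged tokens
    plus the (n,) cluster assignment of each original token."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # initialize cluster centers from randomly chosen tokens
    centers = tokens[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each token to its nearest center (squared L2 distance)
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # recompute each center as the mean feature of its members;
        # non-adjacent tokens with similar features end up in one cluster
        for c in range(k):
            members = tokens[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return centers, labels
```

In this sketch, spatial position plays no role: only feature similarity decides which tokens merge, which mirrors the abstract's point that one dynamic token may cover semantically similar but non-adjacent regions.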
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zeng, Wang | - |
dc.contributor.author | Jin, Sheng | - |
dc.contributor.author | Xu, Lumin | - |
dc.contributor.author | Liu, Wentao | - |
dc.contributor.author | Qian, Chen | - |
dc.contributor.author | Ouyang, Wanli | - |
dc.contributor.author | Luo, Ping | - |
dc.contributor.author | Wang, Xiaogang | - |
dc.date.accessioned | 2024-10-10T00:31:37Z | - |
dc.date.available | 2024-10-10T00:31:37Z | - |
dc.date.issued | 2024-01-01 | - |
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, p. 1-16 | - |
dc.identifier.issn | 0162-8828 | - |
dc.identifier.uri | http://hdl.handle.net/10722/348565 | - |
dc.description.abstract | Transformers are widely used in computer vision areas and have achieved remarkable success. Most state-of-the-art approaches split images into regular grids and represent each grid region with a vision token. However, fixed token distribution disregards the semantic meaning of different image regions, resulting in sub-optimal performance. To address this issue, we propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning. Our dynamic tokens possess two crucial characteristics: (1) Representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and represent them using fine tokens. Through extensive experimentation across various applications, including image classification, human pose estimation, semantic segmentation, and object detection, we demonstrate the effectiveness of our TCFormer. The code and models for this work are available at https://github.com/zengwang430521/TCFormer. | - |
dc.language | eng | - |
dc.publisher | Institute of Electrical and Electronics Engineers | - |
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject | dynamic token | - |
dc.subject | human pose estimation | - |
dc.subject | Image classification | - |
dc.subject | image classification | - |
dc.subject | object detection | - |
dc.subject | Object detection | - |
dc.subject | semantic segmentation | - |
dc.subject | Semantic segmentation | - |
dc.subject | Semantics | - |
dc.subject | Shape | - |
dc.subject | Task analysis | - |
dc.subject | Transformers | - |
dc.subject | Vision transformer | - |
dc.title | TCFormer: Visual Recognition via Token Clustering Transformer | - |
dc.type | Article | - |
dc.identifier.doi | 10.1109/TPAMI.2024.3425768 | - |
dc.identifier.scopus | eid_2-s2.0-85198376786 | - |
dc.identifier.spage | 1 | - |
dc.identifier.epage | 16 | - |
dc.identifier.eissn | 1939-3539 | - |
dc.identifier.issnl | 0162-8828 | - |