DaViT: Dual attention vision transformers

Ding, M; Xiao, B; Codella, N; Luo, P; Yuan, L

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: DaViT: Dual attention vision transformers

Title	DaViT: Dual attention vision transformers
Authors	Ding, M Xiao, B Codella, N Luo, P Yuan, L
Issue Date	2022
Publisher	Ortra Ltd.
Citation	European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022
Abstract	In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both 'spatial tokens' and 'channel tokens'. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K.
Description	Poster no. 812
Persistent Identifier	http://hdl.handle.net/10722/315794
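
The two token definitions in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a feature map flattened to P positions by C channels, uses Q = K = V with no learned projections, heads, normalization, or feed-forward layers, and groups tokens along a 1D sequence rather than in 2D windows. All function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def grouped_attention(tokens, size):
    """Attend only within consecutive groups of `size` tokens."""
    out = []
    for start in range(0, len(tokens), size):
        out.extend(self_attention(tokens[start:start + size]))
    return out

def spatial_window_attention(x, window):
    # Spatial tokens: each row of x (one position) is a token with C features.
    return grouped_attention(x, window)

def channel_group_attention(x, group):
    # Channel tokens: each column of x (one channel) is a token whose
    # features are the values at all P spatial positions.
    return transpose(grouped_attention(transpose(x), group))

# Toy feature map with P = 6 positions and C = 4 channels.
x = [[0.1 * (p + 1) * (c + 1) for c in range(4)] for p in range(6)]
y = channel_group_attention(spatial_window_attention(x, window=3), group=2)
```

In this sketch, perturbing one position changes the spatial-attention output only inside that position's window, while the channel-attention output can change at every position, because each channel token spans the whole flattened map.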

DC Field	Value	Language
dc.contributor.author	Ding, M	-
dc.contributor.author	Xiao, B	-
dc.contributor.author	Codella, N	-
dc.contributor.author	Luo, P	-
dc.contributor.author	Yuan, L	-
dc.date.accessioned	2022-08-19T09:04:33Z	-
dc.date.available	2022-08-19T09:04:33Z	-
dc.date.issued	2022	-
dc.identifier.citation	European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022	-
dc.identifier.uri	http://hdl.handle.net/10722/315794	-
dc.description	Poster no. 812	-
dc.description.abstract	In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both 'spatial tokens' and 'channel tokens'. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K.	-
dc.language	eng	-
dc.publisher	Ortra Ltd.	-
dc.title	DaViT: Dual attention vision transformers	-
dc.type	Conference_Paper	-
dc.identifier.email	Luo, P: pluo@hku.hk	-
dc.identifier.authority	Luo, P=rp02575	-
dc.identifier.hkuros	335565	-
dc.publisher.place	Israel	-
