Article: Vicinity Vision Transformer

Title: Vicinity Vision Transformer
Authors: Sun, Weixuan; Qin, Zhen; Deng, Hui; Wang, Jianyuan; Zhang, Yi; Zhang, Kaihao; Barnes, Nick; Birchfield, Stan; Kong, Lingpeng; Zhong, Yiran
Keywords: 2D vicinity; image classification; linear transformer; semantic segmentation; vision transformer
Issue Date: 1-Oct-2023
Publisher: Institute of Electrical and Electronics Engineers
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, v. 45, n. 10, p. 12635-12649
Abstract

Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic in the sequence length. Linear attention, introduced in natural language processing (NLP), reorders the self-attention mechanism to mitigate a similar issue, but directly applying existing linear attention methods to vision may not yield satisfactory results. We investigate this problem and point out that existing linear attention methods ignore an inductive bias in vision tasks, i.e., 2D locality. In this article, we propose Vicinity Attention, a type of linear attention that integrates 2D locality. Specifically, for each image patch, we adjust its attention weights based on its 2D Manhattan distance from neighbouring patches, so that neighbouring patches receive stronger attention than distant ones while linear complexity is retained. In addition, we propose a novel Vicinity Attention Block, comprising Feature Reduction Attention (FRA) and a Feature Preserving Connection (FPC), to address a computational bottleneck shared by linear attention approaches, including our Vicinity Attention: their complexity grows quadratically with the feature dimension. The Vicinity Attention Block computes attention in a compressed feature space, with an extra skip connection to recover the original feature distribution. We experimentally validate that the block further reduces computation without degrading accuracy. Finally, to validate the proposed methods, we build a linear vision transformer backbone named Vicinity Vision Transformer (VVT). Targeting general vision tasks, we build VVT in a pyramid structure with progressively reduced sequence length.
We perform extensive experiments on the CIFAR-100, ImageNet-1K, and ADE20K datasets to validate the effectiveness of our method. As the input resolution increases, the computational overhead of our method grows more slowly than that of previous transformer-based and convolution-based networks. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous approaches.
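The two ideas the abstract leans on — reordering attention so the N x N matrix is never formed, and a locality prior that decays with 2D Manhattan distance between patches — can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feature map `phi`, the decay constant `alpha`, and the reciprocal decay in `manhattan_weights` are placeholder choices, and the paper decomposes the locality weighting differently so that linear complexity is preserved end-to-end.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention computed as phi(Q) @ (phi(K).T @ V).

    By associativity this equals row-normalized (phi(Q) @ phi(K).T) @ V,
    but costs O(N * d^2) instead of O(N^2 * d): the N x N matrix is
    never materialized.
    """
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d_v): one summary of keys/values
    z = Qp @ Kp.sum(axis=0)          # (N,): per-query normalizer
    return (Qp @ kv) / z[:, None]

def manhattan_weights(h, w, alpha=1.0):
    """Illustrative 2D locality prior on an h x w patch grid.

    The weight for a patch pair decays with their Manhattan distance,
    so nearby patches would be attended to more strongly than far ones.
    """
    ys, xs = np.divmod(np.arange(h * w), w)
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return 1.0 / (1.0 + alpha * dist)

rng = np.random.default_rng(0)
N, d = 16, 8                         # a 4 x 4 patch grid, feature dim 8
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)      # shape (16, 8), no 16 x 16 matrix built
W = manhattan_weights(4, 4)          # W[i, j] shrinks as patches i, j separate
```

Note that naively multiplying `W` into the attention scores would reintroduce the N x N matrix; the contribution of Vicinity Attention is precisely to obtain this distance-based reweighting without giving up the linear-complexity ordering shown in `linear_attention`.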


Persistent Identifier: http://hdl.handle.net/10722/365840
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158


DC Field: Value
dc.contributor.author: Sun, Weixuan
dc.contributor.author: Qin, Zhen
dc.contributor.author: Deng, Hui
dc.contributor.author: Wang, Jianyuan
dc.contributor.author: Zhang, Yi
dc.contributor.author: Zhang, Kaihao
dc.contributor.author: Barnes, Nick
dc.contributor.author: Birchfield, Stan
dc.contributor.author: Kong, Lingpeng
dc.contributor.author: Zhong, Yiran
dc.date.accessioned: 2025-11-12T00:35:58Z
dc.date.available: 2025-11-12T00:35:58Z
dc.date.issued: 2023-10-01
dc.identifier.citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, v. 45, n. 10, p. 12635-12649
dc.identifier.issn: 0162-8828
dc.identifier.uri: http://hdl.handle.net/10722/365840
dc.description.abstract: Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling up to high-resolution images, because both its computational complexity and its memory footprint are quadratic in the sequence length. Linear attention, introduced in natural language processing (NLP), reorders the self-attention mechanism to mitigate a similar issue, but directly applying existing linear attention methods to vision may not yield satisfactory results. We investigate this problem and point out that existing linear attention methods ignore an inductive bias in vision tasks, i.e., 2D locality. In this article, we propose Vicinity Attention, a type of linear attention that integrates 2D locality. Specifically, for each image patch, we adjust its attention weights based on its 2D Manhattan distance from neighbouring patches, so that neighbouring patches receive stronger attention than distant ones while linear complexity is retained. In addition, we propose a novel Vicinity Attention Block, comprising Feature Reduction Attention (FRA) and a Feature Preserving Connection (FPC), to address a computational bottleneck shared by linear attention approaches, including our Vicinity Attention: their complexity grows quadratically with the feature dimension. The Vicinity Attention Block computes attention in a compressed feature space, with an extra skip connection to recover the original feature distribution. We experimentally validate that the block further reduces computation without degrading accuracy. Finally, to validate the proposed methods, we build a linear vision transformer backbone named Vicinity Vision Transformer (VVT). Targeting general vision tasks, we build VVT in a pyramid structure with progressively reduced sequence length. We perform extensive experiments on the CIFAR-100, ImageNet-1K, and ADE20K datasets to validate the effectiveness of our method. As the input resolution increases, the computational overhead of our method grows more slowly than that of previous transformer-based and convolution-based networks. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous approaches.
dc.language: eng
dc.publisher: Institute of Electrical and Electronics Engineers
dc.relation.ispartof: IEEE Transactions on Pattern Analysis and Machine Intelligence
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: 2D vicinity
dc.subject: image classification
dc.subject: linear transformer
dc.subject: semantic segmentation
dc.subject: vision transformer
dc.title: Vicinity Vision Transformer
dc.type: Article
dc.identifier.doi: 10.1109/TPAMI.2023.3285569
dc.identifier.pmid: 37310842
dc.identifier.scopus: eid_2-s2.0-85162654278
dc.identifier.volume: 45
dc.identifier.issue: 10
dc.identifier.spage: 12635
dc.identifier.epage: 12649
dc.identifier.eissn: 1939-3539
dc.identifier.issnl: 0162-8828
