Conference Paper: Dynamic token normalization improves vision transformer

Title: Dynamic token normalization improves vision transformer
Authors: Shao, W; Ge, Y; Zhang, Z; Xu, X; Wang, X; Shan, Y; Luo, P
Keywords: Classification; Normalization; Transformer
Issue Date: 2022
Publisher: ICLR
Citation: 10th International Conference on Learning Representations (ICLR) (Virtual), April 25-29, 2022
Abstract: Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we find that ordinary LN makes tokens at different positions similar in magnitude because it normalizes the embedding within each token, making it difficult for Transformers to capture inductive biases such as the positional context in an image. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), where normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. Firstly, it is built on a unified formulation and can thus represent various existing normalization methods. Secondly, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both global contextual information and local positional context. Thirdly, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird, and Reformer. Extensive experiments show that transformers equipped with DTN consistently outperform their baselines with minimal extra parameters and computational overhead. For example, DTN outperforms LN by - top-1 accuracy on ImageNet, by - box AP in object detection on the COCO benchmark, by - mCE in robustness experiments on ImageNet-C, and by - accuracy in Long ListOps on Long-Range Arena. Code will be made public at https://github.com/wqshao126/DTN
Description: Poster
Persistent Identifier: http://hdl.handle.net/10722/315683
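The abstract contrasts intra-token normalization (LN-style, over each token's embedding) with inter-token normalization (across tokens at each channel). Below is a minimal illustrative sketch of those two statistics and a simple blend; it is not the authors' implementation (see https://github.com/wqshao126/DTN for that). The function names, the fixed blending weight `alpha`, and the [batch, tokens, channels] tensor layout are assumptions made here for illustration.

```python
import torch

def intra_token_norm(x, eps=1e-5):
    # LayerNorm-style statistics: normalize each token over its embedding dim,
    # so every token ends up with a similar magnitude.
    # x: [batch, num_tokens, channels]
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

def inter_token_norm(x, eps=1e-5):
    # Normalize each channel across tokens instead, which preserves the
    # relative differences between token positions within a channel.
    mean = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

def blended_norm(x, alpha=0.5):
    # Hypothetical fixed-weight blend of the two statistics; DTN itself learns
    # how to combine intra- and inter-token normalization (see the paper/code).
    return alpha * intra_token_norm(x) + (1.0 - alpha) * inter_token_norm(x)

if __name__ == "__main__":
    x = torch.randn(2, 197, 384)   # e.g. ViT-S/16: 196 patch tokens + 1 class token
    print(blended_norm(x).shape)   # torch.Size([2, 197, 384])
```

The sketch only shows the basic intra/inter decomposition; the learned, data-dependent weighting that gives DTN its name is described in the paper.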

 

DC Field: Value
dc.contributor.author: Shao, W
dc.contributor.author: Ge, Y
dc.contributor.author: Zhang, Z
dc.contributor.author: Xu, X
dc.contributor.author: Wang, X
dc.contributor.author: Shan, Y
dc.contributor.author: Luo, P
dc.date.accessioned: 2022-08-19T09:02:29Z
dc.date.available: 2022-08-19T09:02:29Z
dc.date.issued: 2022
dc.identifier.citation: 10th International Conference on Learning Representations (ICLR) (Virtual), April 25-29, 2022
dc.identifier.uri: http://hdl.handle.net/10722/315683
dc.description: Poster
dc.description.abstract: Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we find that ordinary LN makes tokens at different positions similar in magnitude because it normalizes the embedding within each token, making it difficult for Transformers to capture inductive biases such as the positional context in an image. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), where normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. Firstly, it is built on a unified formulation and can thus represent various existing normalization methods. Secondly, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both global contextual information and local positional context. Thirdly, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird, and Reformer. Extensive experiments show that transformers equipped with DTN consistently outperform their baselines with minimal extra parameters and computational overhead. For example, DTN outperforms LN by - top-1 accuracy on ImageNet, by - box AP in object detection on the COCO benchmark, by - mCE in robustness experiments on ImageNet-C, and by - accuracy in Long ListOps on Long-Range Arena. Code will be made public at https://github.com/wqshao126/DTN
dc.language: eng
dc.publisher: ICLR
dc.relation.ispartof: International Conference on Learning Representations (ICLR)
dc.subject: Classification
dc.subject: Normalization
dc.subject: Transformer
dc.title: Dynamic token normalization improves vision transformer
dc.type: Conference_Paper
dc.identifier.email: Luo, P: pluo@hku.hk
dc.identifier.authority: Luo, P=rp02575
dc.identifier.hkuros: 335599
dc.publisher.place: United States
