Links for fulltext (may require subscription):
- Publisher Website: 10.1109/CVPR46437.2021.00681
- Scopus: eid_2-s2.0-85117131558
- WOS: WOS:000739917307010
Conference Paper: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Title | Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers |
---|---|
Authors | Zheng, Sixiao; Lu, Jiachen; Zhao, Hengshuang; Zhu, Xiatian; Luo, Zekun; Wang, Yabiao; Fu, Yanwei; Feng, Jianfeng; Xiang, Tao; Torr, Philip H.S.; Zhang, Li |
Issue Date | 2021 |
Citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021, p. 6877-6886 |
Abstract | Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission. |
Persistent Identifier | http://hdl.handle.net/10722/333514 |
ISSN | 1063-6919 (2023 SCImago Journal Rankings: 10.331) |
ISI Accession Number ID | WOS:000739917307010 |
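The abstract describes encoding an image as a sequence of patches that a pure transformer then processes with global attention at every layer. The patchify step can be sketched as follows; this is a minimal illustration of the general ViT-style tokenization idea, not the authors' actual implementation, and the image and patch sizes are illustrative assumptions:

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    the token sequence a transformer encoder would embed and attend over,
    with no convolution and no spatial-resolution reduction.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Cut the image into a grid of non-overlapping patches...
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    # ...group the two grid axes together, then flatten each patch to one vector.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

# Example: a 480x480 RGB image with 16x16 patches becomes a sequence of
# 30 * 30 = 900 tokens, each a vector of 16 * 16 * 3 = 768 values.
seq = image_to_patch_sequence(np.zeros((480, 480, 3)), patch_size=16)
print(seq.shape)  # (900, 768)
```

A segmentation decoder then only needs to reshape the encoder's output tokens back into a 2D grid and upsample to per-pixel predictions, which is why the abstract pairs this encoder with a "simple decoder."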
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zheng, Sixiao | - |
dc.contributor.author | Lu, Jiachen | - |
dc.contributor.author | Zhao, Hengshuang | - |
dc.contributor.author | Zhu, Xiatian | - |
dc.contributor.author | Luo, Zekun | - |
dc.contributor.author | Wang, Yabiao | - |
dc.contributor.author | Fu, Yanwei | - |
dc.contributor.author | Feng, Jianfeng | - |
dc.contributor.author | Xiang, Tao | - |
dc.contributor.author | Torr, Philip H.S. | - |
dc.contributor.author | Zhang, Li | - |
dc.date.accessioned | 2023-10-06T05:20:05Z | - |
dc.date.available | 2023-10-06T05:20:05Z | - |
dc.date.issued | 2021 | - |
dc.identifier.citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021, p. 6877-6886 | - |
dc.identifier.issn | 1063-6919 | - |
dc.identifier.uri | http://hdl.handle.net/10722/333514 | - |
dc.description.abstract | Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission. | - |
dc.language | eng | - |
dc.relation.ispartof | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | - |
dc.title | Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/CVPR46437.2021.00681 | - |
dc.identifier.scopus | eid_2-s2.0-85117131558 | - |
dc.identifier.spage | 6877 | - |
dc.identifier.epage | 6886 | - |
dc.identifier.isi | WOS:000739917307010 | - |