File Download
There are no files associated with this item.
Links for fulltext (may require subscription)
- Publisher Website: 10.1109/CVPR52688.2022.01762
- Scopus: eid_2-s2.0-85128285110
- WOS: WOS:000870783003094
Conference Paper: LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Title | LAVT: Language-Aware Vision Transformer for Referring Image Segmentation |
---|---|
Authors | Yang, Zhao; Wang, Jiaqi; Tang, Yansong; Chen, Kai; Zhao, Hengshuang; Torr, Philip H.S. |
Keywords | grouping and shape analysis; Segmentation; Vision + language |
Issue Date | 2022 |
Citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, v. 2022-June, p. 18134-18144 |
Abstract | Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ('cross-modal') decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins. |
Persistent Identifier | http://hdl.handle.net/10722/333534 |
ISSN | 1063-6919 (2023 SCImago Journal Rankings: 10.331) |
ISI Accession Number ID | WOS:000870783003094 |
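The abstract above describes early fusion: linguistic features are injected into intermediate stages of a vision Transformer encoder, so the encoder itself produces language-aware visual features that a lightweight mask predictor can decode. The snippet below is a minimal, hypothetical sketch of that idea in PyTorch; the module name `CrossModalFusion` and all parameter names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of the early-fusion idea described in the
# abstract: word features are fused into a visual feature map inside the
# encoder, before a lightweight mask predictor. Names are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses word features into the visual tokens of one encoder stage."""

    def __init__(self, vis_dim: int, lang_dim: int):
        super().__init__()
        self.query = nn.Linear(vis_dim, vis_dim)   # visual tokens act as queries
        self.key = nn.Linear(lang_dim, vis_dim)    # words act as keys
        self.value = nn.Linear(lang_dim, vis_dim)  # words act as values
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Tanh())

    def forward(self, vis: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, C_v) flattened visual tokens; words: (B, L, C_l) word features.
        q, k, v = self.query(vis), self.key(words), self.value(words)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        fused = attn @ v  # language-attended visual features, shape (B, N, C_v)
        # Gated residual: the next encoder stage sees language-conditioned features.
        return vis + self.gate(fused) * fused
```

In such a design, each encoder stage's output would pass through a fusion module of this kind before feeding the next stage, and the resulting multi-scale, language-aware feature maps would be decoded by a lightweight mask predictor rather than a heavy cross-modal decoder.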
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, Zhao | - |
dc.contributor.author | Wang, Jiaqi | - |
dc.contributor.author | Tang, Yansong | - |
dc.contributor.author | Chen, Kai | - |
dc.contributor.author | Zhao, Hengshuang | - |
dc.contributor.author | Torr, Philip H.S. | - |
dc.date.accessioned | 2023-10-06T05:20:15Z | - |
dc.date.available | 2023-10-06T05:20:15Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, v. 2022-June, p. 18134-18144 | - |
dc.identifier.issn | 1063-6919 | - |
dc.identifier.uri | http://hdl.handle.net/10722/333534 | - |
dc.description.abstract | Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ('cross-modal') decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins. | -
dc.language | eng | - |
dc.relation.ispartof | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition | - |
dc.subject | grouping and shape analysis | - |
dc.subject | Segmentation | - |
dc.subject | Vision + language | - |
dc.title | LAVT: Language-Aware Vision Transformer for Referring Image Segmentation | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1109/CVPR52688.2022.01762 | - |
dc.identifier.scopus | eid_2-s2.0-85128285110 | - |
dc.identifier.volume | 2022-June | - |
dc.identifier.spage | 18134 | - |
dc.identifier.epage | 18144 | - |
dc.identifier.isi | WOS:000870783003094 | - |