Conference Paper: Cross-Modal Relationship Inference for Grounding Referring Expressions
Title | Cross-Modal Relationship Inference for Grounding Referring Expressions |
---|---|
Authors | Yang, S; Li, G; Yu, Y |
Keywords | Vision + Language; Recognition; Detection; Categorization; Retrieval |
Issue Date | 2019 |
Publisher | IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147 |
Citation | Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16-20 June 2019, p. 4140-4149 |
Abstract | Grounding referring expressions is a fundamental yet challenging task facilitating human-machine communication in the physical world. It locates the target object in an image on the basis of the comprehension of the relationships between referring natural language expressions and the image. A feasible solution for grounding referring expressions not only needs to extract all the necessary information (i.e. objects and the relationships among them) in both the image and referring expressions, but also compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot extract multi-order relationships from the referring expressions accurately and the contexts they obtain have discrepancies with the contexts described by referring expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships, that have connections with a given expression, with a cross-modal attention mechanism, and represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experiments on various common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods. |
Description | Language & Reasoning: Paper ID 1735; Poster no. 204 |
Persistent Identifier | http://hdl.handle.net/10722/271322 |
ISSN | 1063-6919; 2023 SCImago Journal Rankings: 10.331 |
ISI Accession Number ID | WOS:000529484004033 |
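As an orientation aid alongside the abstract above, the following is a minimal, hypothetical sketch of one gated message-passing step of the kind the paper's Gated Graph Convolutional Network performs over the language-guided relation graph. The module structure, tensor names, and gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedGraphConvLayer(nn.Module):
    """One gated message-passing step over a language-guided relation graph.

    Illustrative sketch only: node features are assumed to be fused
    visual/linguistic vectors, and each edge carries a cross-modal
    attention gate in [0, 1] indicating how relevant that relationship
    is to the referring expression.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # combine sender and receiver features
        self.update = nn.GRUCell(dim, dim)      # gated update of node states

    def forward(self, nodes, edge_index, edge_gate):
        # nodes:      (N, dim) fused multimodal node features
        # edge_index: (2, E) sender/receiver indices of relation edges
        # edge_gate:  (E, 1) language-conditioned relevance of each edge
        send, recv = edge_index
        msg = torch.relu(self.message(torch.cat([nodes[send], nodes[recv]], dim=-1)))
        msg = edge_gate * msg                                   # suppress edges unrelated to the expression
        agg = torch.zeros_like(nodes).index_add_(0, recv, msg)  # sum gated messages per receiver
        return self.update(agg, nodes)                          # context-enriched node representations
```

Stacking several such steps would let multi-order relationships mentioned in the expression propagate through the graph, which is the role the abstract attributes to the GGCN.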
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, S | - |
dc.contributor.author | Li, G | - |
dc.contributor.author | Yu, Y | - |
dc.date.accessioned | 2019-06-24T01:07:36Z | - |
dc.date.available | 2019-06-24T01:07:36Z | - |
dc.date.issued | 2019 | - |
dc.identifier.citation | Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16-20 June 2019, p. 4140-4149 | - |
dc.identifier.issn | 1063-6919 | - |
dc.identifier.uri | http://hdl.handle.net/10722/271322 | - |
dc.description | Language & Reasoning: Paper ID 1735; Poster no. 204 | -
dc.description.abstract | Grounding referring expressions is a fundamental yet challenging task facilitating human-machine communication in the physical world. It locates the target object in an image on the basis of the comprehension of the relationships between referring natural language expressions and the image. A feasible solution for grounding referring expressions not only needs to extract all the necessary information (i.e. objects and the relationships among them) in both the image and referring expressions, but also compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot extract multi-order relationships from the referring expressions accurately and the contexts they obtain have discrepancies with the contexts described by referring expressions. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships, that have connections with a given expression, with a cross-modal attention mechanism, and represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experiments on various common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods. | - |
dc.language | eng | - |
dc.publisher | IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147 | - |
dc.relation.ispartof | IEEE Conference on Computer Vision and Pattern Recognition. Proceedings | - |
dc.rights | IEEE Conference on Computer Vision and Pattern Recognition. Proceedings. Copyright © IEEE Computer Society. | - |
dc.rights | ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | - |
dc.subject | Vision + Language | - |
dc.subject | Recognition | - |
dc.subject | Detection | - |
dc.subject | Categorization | - |
dc.subject | Retrieval | - |
dc.title | Cross-Modal Relationship Inference for Grounding Referring Expressions | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Yu, Y: yzyu@cs.hku.hk | - |
dc.identifier.authority | Yu, Y=rp01415 | - |
dc.description.nature | postprint | - |
dc.identifier.doi | 10.1109/CVPR.2019.00427 | - |
dc.identifier.scopus | eid_2-s2.0-85078745850 | - |
dc.identifier.hkuros | 297946 | - |
dc.identifier.spage | 4140 | - |
dc.identifier.epage | 4149 | - |
dc.identifier.isi | WOS:000529484004033 | - |
dc.publisher.place | United States | - |