Conference Paper: Cross-Modal Relationship Inference for Grounding Referring Expressions

Title: Cross-Modal Relationship Inference for Grounding Referring Expressions
Authors: Yang, S; Li, G; Yu, Y
Keywords: Vision + Language
Recognition
Detection
Categorization
Retrieval
Issue Date: 2019
Publisher: IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147
Citation: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16-20 June 2019, p. 4140-4149
Abstract: Grounding referring expressions is a fundamental yet challenging task that facilitates human-machine communication in the physical world. It locates the target object in an image by comprehending the relationships between a referring natural language expression and the image. A feasible solution not only needs to extract all the necessary information (i.e., the objects and the relationships among them) from both the image and the referring expression, but also to compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot accurately extract multi-order relationships from the referring expression, and the contexts it obtains differ from the contexts described by the referring expression. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that uses a cross-modal attention mechanism to adaptively highlight the objects and relationships connected to a given expression, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) that computes multimodal semantic contexts by fusing information from different modalities and propagating the fused information through the structured relation graph. Experiments on several common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods.
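The abstract describes two components: cross-modal attention that weights visual objects and relationships by their relevance to the expression, and a gated graph convolution that propagates the fused features over the resulting relation graph. The sketch below is a minimal illustration of these two ideas only, not the authors' implementation; all layer names, dimensions, and the toy adjacency construction are assumptions.

import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Scores each visual node/edge feature against an expression embedding."""

    def __init__(self, vis_dim: int, lang_dim: int, hid_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.lang_proj = nn.Linear(lang_dim, hid_dim)

    def forward(self, vis_feats: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feats: (N, vis_dim); lang_feat: (lang_dim,)
        scores = (self.vis_proj(vis_feats) * self.lang_proj(lang_feat)).sum(-1)
        return torch.softmax(scores, dim=0)  # relevance weight per object


class GatedGraphConv(nn.Module):
    """One gated message-passing step over a weighted relation graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, dim); adj: (N, N) language-guided edge weights
        messages = adj @ self.msg(node_feats)  # aggregate neighbour features
        gate = torch.sigmoid(self.gate(torch.cat([node_feats, messages], dim=-1)))
        return node_feats + gate * messages    # gated residual update


# Toy usage (hypothetical sizes): 5 detected objects, 1024-d visual features,
# 512-d expression feature; the adjacency here is a crude stand-in for the
# language-guided relation graph described in the abstract.
vis = torch.randn(5, 1024)
expr = torch.randn(512)
weights = CrossModalAttention(1024, 512)(vis, expr)   # per-object relevance
adj = weights.unsqueeze(0).expand(5, 5)               # illustrative edge weights
updated = GatedGraphConv(1024)(vis, adj)
print(weights.shape, updated.shape)                   # torch.Size([5]) torch.Size([5, 1024])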
Description: Language & Reasoning: Paper ID 1735; Poster no. 204
Persistent Identifier: http://hdl.handle.net/10722/271322
ISSN: 1063-6919
2023 SCImago Journal Rankings: 10.331
ISI Accession Number ID: WOS:000529484004033

 

DC Field: Value
dc.contributor.author: Yang, S
dc.contributor.author: Li, G
dc.contributor.author: Yu, Y
dc.date.accessioned: 2019-06-24T01:07:36Z
dc.date.available: 2019-06-24T01:07:36Z
dc.date.issued: 2019
dc.identifier.citation: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16-20 June 2019, p. 4140-4149
dc.identifier.issn: 1063-6919
dc.identifier.uri: http://hdl.handle.net/10722/271322
dc.description: Language & Reasoning: Paper ID 1735; Poster no. 204
dc.description.abstract: Grounding referring expressions is a fundamental yet challenging task that facilitates human-machine communication in the physical world. It locates the target object in an image by comprehending the relationships between a referring natural language expression and the image. A feasible solution not only needs to extract all the necessary information (i.e., the objects and the relationships among them) from both the image and the referring expression, but also to compute and represent multimodal contexts from the extracted information. Unfortunately, existing work on grounding referring expressions cannot accurately extract multi-order relationships from the referring expression, and the contexts it obtains differ from the contexts described by the referring expression. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) that uses a cross-modal attention mechanism to adaptively highlight the objects and relationships connected to a given expression, and represents the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) that computes multimodal semantic contexts by fusing information from different modalities and propagating the fused information through the structured relation graph. Experiments on several common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, outperforms all existing state-of-the-art methods.
dc.language: eng
dc.publisher: IEEE Computer Society. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000147
dc.relation.ispartof: IEEE Conference on Computer Vision and Pattern Recognition. Proceedings
dc.rights: IEEE Conference on Computer Vision and Pattern Recognition. Proceedings. Copyright © IEEE Computer Society.
dc.rights: ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
dc.subject: Vision + Language
dc.subject: Recognition
dc.subject: Detection
dc.subject: Categorization
dc.subject: Retrieval
dc.title: Cross-Modal Relationship Inference for Grounding Referring Expressions
dc.type: Conference_Paper
dc.identifier.email: Yu, Y: yzyu@cs.hku.hk
dc.identifier.authority: Yu, Y=rp01415
dc.description.nature: postprint
dc.identifier.doi: 10.1109/CVPR.2019.00427
dc.identifier.scopus: eid_2-s2.0-85078745850
dc.identifier.hkuros: 297946
dc.identifier.spage: 4140
dc.identifier.epage: 4149
dc.identifier.isi: WOS:000529484004033
dc.publisher.place: United States
