File Download

There are no files associated with this item.

Article: Relationship-Embedded Representation Learning for Grounding Referring Expressions

Title: Relationship-Embedded Representation Learning for Grounding Referring Expressions
Authors: YANG, S; LI, G; Yu, Y
Keywords: Referring Expressions; Cross-Modal Relationship Extractor; Gated Graph Convolutional Network
Issue Date: 2020
Publisher: IEEE. The Journal's web site is located at http://www.computer.org/tpami
Citation: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, Epub 2020-02-14
Abstract: Grounding referring expressions in images aims to locate the object instance in an image described by a referring expression. It involves a joint understanding of natural language and image content and is essential for a range of visual tasks related to human-computer interaction. As a language-to-vision matching task, the core of this problem is not only to extract all the necessary information in both the image and referring expressions, but also to make full use of context information to achieve alignment of cross-modal semantic concepts in the extracted information. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships related to the given expression, with a cross-modal attention mechanism, and represent the extracted information as language-guided visual relation graphs. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic context by fusing information from different modes and propagating multimodal information in the structured relation graphs. Experimental results on three common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, greatly surpasses all existing state-of-the-art methods.
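The abstract describes language-guided relation graphs updated by a gated graph convolution. The following minimal PyTorch sketch illustrates one such language-gated propagation step; the class name, tensor shapes, and the specific attention/gating forms are assumptions for exposition, not the authors' released implementation.

# Illustrative sketch only: one language-gated graph convolution step,
# loosely following the high-level description of CMRE + GGCN above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGatedGraphConv(nn.Module):
    def __init__(self, node_dim: int, lang_dim: int):
        super().__init__()
        self.attn = nn.Linear(node_dim + lang_dim, 1)          # cross-modal attention score per node
        self.gate = nn.Linear(node_dim + lang_dim, node_dim)   # language-conditioned gate
        self.msg = nn.Linear(node_dim, node_dim)                # message transform

    def forward(self, nodes, lang, adj):
        # nodes: (N, node_dim) visual object features
        # lang:  (lang_dim,)   referring-expression embedding
        # adj:   (N, N) float adjacency matrix of the relation graph
        N = nodes.size(0)
        lang_exp = lang.unsqueeze(0).expand(N, -1)

        # Highlight expression-relevant objects via cross-modal attention.
        scores = self.attn(torch.cat([nodes, lang_exp], dim=-1)).squeeze(-1)   # (N,)
        weights = torch.softmax(scores, dim=0)

        # Gate each node's outgoing message by the language context.
        gates = torch.sigmoid(self.gate(torch.cat([nodes, lang_exp], dim=-1))) # (N, node_dim)
        messages = gates * self.msg(nodes) * weights.unsqueeze(-1)

        # Propagate messages along graph edges and update node states.
        agg = adj @ messages                                                   # (N, node_dim)
        return F.relu(nodes + agg)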
Persistent Identifier: http://hdl.handle.net/10722/289197
ISSN: 0162-8828
2023 Impact Factor: 20.8
2023 SCImago Journal Rankings: 6.158
ISI Accession Number ID: WOS:000670578800018

 

DC Field | Value | Language
dc.contributor.author | YANG, S | -
dc.contributor.author | LI, G | -
dc.contributor.author | Yu, Y | -
dc.date.accessioned | 2020-10-22T08:09:14Z | -
dc.date.available | 2020-10-22T08:09:14Z | -
dc.date.issued | 2020 | -
dc.identifier.citation | IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, Epub 2020-02-14 | -
dc.identifier.issn | 0162-8828 | -
dc.identifier.uri | http://hdl.handle.net/10722/289197 | -
dc.description.abstract | Grounding referring expressions in images aims to locate the object instance in an image described by a referring expression. It involves a joint understanding of natural language and image content and is essential for a range of visual tasks related to human-computer interaction. As a language-to-vision matching task, the core of this problem is not only to extract all the necessary information in both the image and referring expressions, but also to make full use of context information to achieve alignment of cross-modal semantic concepts in the extracted information. In this paper, we propose a Cross-Modal Relationship Extractor (CMRE) to adaptively highlight objects and relationships related to the given expression, with a cross-modal attention mechanism, and represent the extracted information as language-guided visual relation graphs. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic context by fusing information from different modes and propagating multimodal information in the structured relation graphs. Experimental results on three common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, greatly surpasses all existing state-of-the-art methods. | -
dc.language | eng | -
dc.publisher | IEEE. The Journal's web site is located at http://www.computer.org/tpami | -
dc.relation.ispartof | IEEE Transactions on Pattern Analysis and Machine Intelligence | -
dc.rights | IEEE Transactions on Pattern Analysis and Machine Intelligence. Copyright © IEEE. | -
dc.rights | ©20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | -
dc.subject | Referring Expressions | -
dc.subject | Cross-Modal Relationship Extractor | -
dc.subject | Gated Graph Convolutional Network | -
dc.title | Relationship-Embedded Representation Learning for Grounding Referring Expressions | -
dc.type | Article | -
dc.identifier.email | Yu, Y: yzyu@cs.hku.hk | -
dc.identifier.authority | Yu, Y=rp01415 | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1109/TPAMI.2020.2973983 | -
dc.identifier.pmid | 32078531 | -
dc.identifier.scopus | eid_2-s2.0-85111789902 | -
dc.identifier.hkuros | 317121 | -
dc.identifier.volume | Epub 2020-02-14 | -
dc.identifier.isi | WOS:000670578800018 | -
dc.publisher.place | United States | -
dc.identifier.issnl | 0162-8828 | -
