Links for fulltext (may require subscription):
- Publisher Website (DOI): https://doi.org/10.1016/j.neucom.2022.06.062
- Scopus: eid_2-s2.0-85133231302
- WOS: WOS:000829601500010
Article: Relation constraint self-attention for image captioning
Title | Relation constraint self-attention for image captioning |
---|---|
Authors | Ji, Junzhong; Wang, Mingzhan; Zhang, Xiaodan; Lei, Minglong; Qu, Liangqiong |
Keywords | Image captioning; Relation constraint self-attention; Scene graph; Transformer |
Issue Date | 2022 |
Citation | Neurocomputing, 2022, v. 501, p. 778-789 |
Abstract | The self-attention-based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, which leads to a dilution of the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information to self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of generated semantic words. Intuitively, RCSA-E endows the model with the ability to figure out which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. |
Persistent Identifier | http://hdl.handle.net/10722/325566 |
ISSN | 0925-2312 (2023 Impact Factor: 5.5; 2023 SCImago Journal Rankings: 1.815) |
ISI Accession Number ID | WOS:000829601500010 |
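The abstract describes RCSA-E as constraining encoder self-attention with a visual relation prior taken from a scene graph, so that attention weights between unrelated regions are suppressed and the attention map becomes sparse. The following is a minimal, hypothetical sketch of that idea in PyTorch; it is not the authors' released code, and names such as `relation_mask` and the tensor shapes are illustrative assumptions.

```python
# Sketch of relation-constrained self-attention over image regions (RCSA-E style):
# attention weights between region pairs that have no prior scene-graph relation
# are masked out, yielding a sparse attention map. Illustrative only.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationConstrainedSelfAttention(nn.Module):
    """Scaled dot-product self-attention with a prior relation mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, regions: torch.Tensor, relation_mask: torch.Tensor) -> torch.Tensor:
        # regions:       (batch, n_regions, d_model) region features
        # relation_mask: (batch, n_regions, n_regions); 1 where a scene-graph relation
        #                links two regions, 0 otherwise (keep the diagonal at 1 so every
        #                region attends at least to itself and no softmax row is empty)
        b, n, _ = regions.shape
        q, k, v = self.qkv(regions).chunk(3, dim=-1)
        # reshape to (batch, heads, n_regions, d_head)
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Constrain attention: pairs without a prior relation get -inf, so the softmax
        # assigns them zero weight and the resulting attention map is sparse.
        scores = scores.masked_fill(relation_mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)
```

In a full captioning model this module would replace (or complement) the encoder's standard self-attention, with `relation_mask` built once per image from the detected scene graph.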
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ji, Junzhong | - |
dc.contributor.author | Wang, Mingzhan | - |
dc.contributor.author | Zhang, Xiaodan | - |
dc.contributor.author | Lei, Minglong | - |
dc.contributor.author | Qu, Liangqiong | - |
dc.date.accessioned | 2023-02-27T07:34:21Z | - |
dc.date.available | 2023-02-27T07:34:21Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Neurocomputing, 2022, v. 501, p. 778-789 | - |
dc.identifier.issn | 0925-2312 | - |
dc.identifier.uri | http://hdl.handle.net/10722/325566 | - |
dc.description.abstract | The self-attention-based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, which leads to a dilution of the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information to self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of generated semantic words. Intuitively, RCSA-E endows the model with the ability to figure out which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. | -
dc.language | eng | - |
dc.relation.ispartof | Neurocomputing | - |
dc.subject | Image captioning | - |
dc.subject | Relation constraint self-attention | - |
dc.subject | Scene graph | - |
dc.subject | Transformer | - |
dc.title | Relation constraint self-attention for image captioning | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1016/j.neucom.2022.06.062 | - |
dc.identifier.scopus | eid_2-s2.0-85133231302 | - |
dc.identifier.volume | 501 | - |
dc.identifier.spage | 778 | - |
dc.identifier.epage | 789 | - |
dc.identifier.eissn | 1872-8286 | - |
dc.identifier.isi | WOS:000829601500010 | - |
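The abstract also describes the decoder-side module, RCSA-D, as extending the keys and values of self-attention with semantic relation information from the scene graph. The sketch below illustrates that idea under the same caveats: it is a hypothetical reconstruction, and names such as `semantic_relations` and the absence of a causal mask are simplifying assumptions for illustration.

```python
# Sketch of key/value extension with semantic relation embeddings (RCSA-D style):
# decoder queries attend over both the generated-word history and relation words
# derived from the scene graph. Illustrative only, not the authors' implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def relation_extended_attention(words: torch.Tensor,
                                semantic_relations: torch.Tensor,
                                w_q: nn.Linear, w_k: nn.Linear, w_v: nn.Linear) -> torch.Tensor:
    # words:              (batch, n_words, d_model) embeddings of generated words
    # semantic_relations: (batch, n_rel, d_model) embeddings of relation words from the scene graph
    q = w_q(words)
    # Keys and values are built from the words concatenated with the relation embeddings,
    # so each query can attend to both the caption history and the prior semantic relations.
    kv_input = torch.cat([words, semantic_relations], dim=1)
    k, v = w_k(kv_input), w_v(kv_input)

    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, n_words, n_words + n_rel)
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                            # (batch, n_words, d_model)
```

In an actual decoder the word-to-word part of the score matrix would additionally carry a causal mask, while the relation positions would typically remain visible at every generation step.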