Links for fulltext (may require subscription):
- Publisher Website (DOI): https://doi.org/10.1016/j.neucom.2022.06.062
- Scopus: eid_2-s2.0-85133231302
- WOS: WOS:000829601500010
Article: Relation constraint self-attention for image captioning
Title | Relation constraint self-attention for image captioning |
---|---|
Authors | Ji, Junzhong; Wang, Mingzhan; Zhang, Xiaodan; Lei, Minglong; Qu, Liangqiong |
Keywords | Image captioning; Relation constraint self-attention; Scene graph; Transformer |
Issue Date | 2022 |
Citation | Neurocomputing, 2022, v. 501, p. 778-789 |
Abstract | The self-attention-based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, which leads to a dilution of the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information to self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of generated semantic words. Intuitively, RCSA-E endows the model with the ability to figure out which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. |
Persistent Identifier | http://hdl.handle.net/10722/325566 |
ISSN | 0925-2312 (2023 Impact Factor: 5.5; 2023 SCImago Journal Rankings: 1.815) |
ISI Accession Number ID | WOS:000829601500010 |
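The abstract describes RCSA-E as constraining encoder self-attention with a visual relation prior taken from a scene graph, so that attention weights between unrelated regions are suppressed and the attention map becomes sparse. The following is a minimal, hypothetical sketch of that idea in PyTorch; it is not the authors' released code, and names such as `relation_mask` and the tensor shapes are illustrative assumptions.

```python
# Sketch of relation-constrained self-attention over image regions (RCSA-E style):
# attention weights between region pairs that have no prior scene-graph relation
# are masked out, yielding a sparse attention map. Illustrative only.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationConstrainedSelfAttention(nn.Module):
    """Scaled dot-product self-attention with a prior relation mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, regions: torch.Tensor, relation_mask: torch.Tensor) -> torch.Tensor:
        # regions:       (batch, n_regions, d_model) region features
        # relation_mask: (batch, n_regions, n_regions); 1 where a scene-graph relation
        #                links two regions, 0 otherwise (keep the diagonal at 1 so every
        #                region attends at least to itself and no softmax row is empty)
        b, n, _ = regions.shape
        q, k, v = self.qkv(regions).chunk(3, dim=-1)
        # reshape to (batch, heads, n_regions, d_head)
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Constrain attention: pairs without a prior relation get -inf, so the softmax
        # assigns them zero weight and the resulting attention map is sparse.
        scores = scores.masked_fill(relation_mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)
```

In a full captioning model this module would replace (or complement) the encoder's standard self-attention, with `relation_mask` built once per image from the detected scene graph.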
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ji, Junzhong | - |
dc.contributor.author | Wang, Mingzhan | - |
dc.contributor.author | Zhang, Xiaodan | - |
dc.contributor.author | Lei, Minglong | - |
dc.contributor.author | Qu, Liangqiong | - |
dc.date.accessioned | 2023-02-27T07:34:21Z | - |
dc.date.available | 2023-02-27T07:34:21Z | - |
dc.date.issued | 2022 | - |
dc.identifier.citation | Neurocomputing, 2022, v. 501, p. 778-789 | - |
dc.identifier.issn | 0925-2312 | - |
dc.identifier.uri | http://hdl.handle.net/10722/325566 | - |
dc.description.abstract | The self-attention-based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. Thus, the relation learning of self-attention in image captioning is biased, which leads to a dilution of the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model to enhance the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information to self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of generated semantic words. Intuitively, RCSA-E endows the model with the ability to figure out which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. | -
dc.language | eng | - |
dc.relation.ispartof | Neurocomputing | - |
dc.subject | Image captioning | - |
dc.subject | Relation constraint self-attention | - |
dc.subject | Scene graph | - |
dc.subject | Transformer | - |
dc.title | Relation constraint self-attention for image captioning | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1016/j.neucom.2022.06.062 | - |
dc.identifier.scopus | eid_2-s2.0-85133231302 | - |
dc.identifier.volume | 501 | - |
dc.identifier.spage | 778 | - |
dc.identifier.epage | 789 | - |
dc.identifier.eissn | 1872-8286 | - |
dc.identifier.isi | WOS:000829601500010 | - |
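The abstract also describes the decoder-side module, RCSA-D, as extending the keys and values of self-attention with semantic relation information from the scene graph. The sketch below illustrates that idea under the same caveats: it is a hypothetical reconstruction, and names such as `semantic_relations` and the absence of a causal mask are simplifying assumptions for illustration.

```python
# Sketch of key/value extension with semantic relation embeddings (RCSA-D style):
# decoder queries attend over both the generated-word history and relation words
# derived from the scene graph. Illustrative only, not the authors' implementation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def relation_extended_attention(words: torch.Tensor,
                                semantic_relations: torch.Tensor,
                                w_q: nn.Linear, w_k: nn.Linear, w_v: nn.Linear) -> torch.Tensor:
    # words:              (batch, n_words, d_model) embeddings of generated words
    # semantic_relations: (batch, n_rel, d_model) embeddings of relation words from the scene graph
    q = w_q(words)
    # Keys and values are built from the words concatenated with the relation embeddings,
    # so each query can attend to both the caption history and the prior semantic relations.
    kv_input = torch.cat([words, semantic_relations], dim=1)
    k, v = w_k(kv_input), w_v(kv_input)

    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, n_words, n_words + n_rel)
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                            # (batch, n_words, d_model)
```

In an actual decoder the word-to-word part of the score matrix would additionally carry a causal mask, while the relation positions would typically remain visible at every generation step.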