File Download
There are no files associated with this item.
Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1007/978-3-030-01228-1_18
- Scopus: eid_2-s2.0-85055121996
- Find via
Supplementary
-
Citations:
- Scopus: 0
- Appears in Collections:
Conference Paper: Rethinking the Form of Latent States in Image Captioning
Title | Rethinking the Form of Latent States in Image Captioning |
---|---|
Authors | |
Issue Date | 2018 |
Citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 294-310 How to Cite? |
Abstract | RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by the curiosity about a question: how the spatial structures in the latent states affect the resultant captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between input visual domain and output linguistic domain. |
Persistent Identifier | http://hdl.handle.net/10722/352471 |
ISSN | 2023 SCImago Journal Rankings: 0.606 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Dai, Bo | - |
dc.contributor.author | Ye, Deming | - |
dc.contributor.author | Lin, Dahua | - |
dc.date.accessioned | 2024-12-16T03:59:16Z | - |
dc.date.available | 2024-12-16T03:59:16Z | - |
dc.date.issued | 2018 | - |
dc.identifier.citation | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 294-310 | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | http://hdl.handle.net/10722/352471 | - |
dc.description.abstract | RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by the curiosity about a question: how the spatial structures in the latent states affect the resultant captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between input visual domain and output linguistic domain. | - |
dc.language | eng | - |
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
dc.title | Rethinking the Form of Latent States in Image Captioning | - |
dc.type | Conference_Paper | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/978-3-030-01228-1_18 | - |
dc.identifier.scopus | eid_2-s2.0-85055121996 | - |
dc.identifier.volume | 11209 LNCS | - |
dc.identifier.spage | 294 | - |
dc.identifier.epage | 310 | - |
dc.identifier.eissn | 1611-3349 | - |