Rethinking the Form of Latent States in Image Captioning

Dai, Bo; Ye, Deming; Lin, Dahua

Publisher Website: 10.1007/978-3-030-01228-1_18
Scopus: eid_2-s2.0-85055121996

Citations:
- Scopus: 0
Appears in Collections:
- HKU Musketeers Foundation Institute of Data Science: Conference papers

Conference Paper: Rethinking the Form of Latent States in Image Captioning

Title	Rethinking the Form of Latent States in Image Captioning
Authors	Dai, Bo; Ye, Deming; Lin, Dahua
Issue Date	2018
Citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 294-310. DOI: http://dx.doi.org/10.1007/978-3-030-01228-1_18
Abstract	RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by curiosity about a question: how do the spatial structures in the latent states affect the resulting captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between the input visual domain and the output linguistic domain.
Persistent Identifier	http://hdl.handle.net/10722/352471
ISSN	0302-9743 (2023 SCImago Journal Rankings: 0.606)

DC Field	Value	Language
dc.contributor.author	Dai, Bo	-
dc.contributor.author	Ye, Deming	-
dc.contributor.author	Lin, Dahua	-
dc.date.accessioned	2024-12-16T03:59:16Z	-
dc.date.available	2024-12-16T03:59:16Z	-
dc.date.issued	2018	-
dc.identifier.citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11209 LNCS, p. 294-310	-
dc.identifier.issn	0302-9743	-
dc.identifier.uri	http://hdl.handle.net/10722/352471	-
dc.description.abstract	RNNs and their variants have been widely adopted for image captioning. In RNNs, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. We rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states. This is motivated by curiosity about a question: how do the spatial structures in the latent states affect the resulting captions? Our study on MSCOCO and Flickr30k leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we visually reveal the internal dynamics in the process of caption generation, as well as the connections between the input visual domain and the output linguistic domain.	-
dc.language	eng	-
dc.relation.ispartof	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	-
dc.title	Rethinking the Form of Latent States in Image Captioning	-
dc.type	Conference_Paper	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1007/978-3-030-01228-1_18	-
dc.identifier.scopus	eid_2-s2.0-85055121996	-
dc.identifier.volume	11209 LNCS	-
dc.identifier.spage	294	-
dc.identifier.epage	310	-
dc.identifier.eissn	1611-3349	-
