Move Forward and Tell: A Progressive Generator of Video Descriptions

Xiong, Yilei; Dai, Bo; Lin, Dahua

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1007/978-3-030-01252-6_29
Scopus: eid_2-s2.0-85055133234

Citations:
- Scopus: 22
Appears in Collections:
- HKU Musketeers Foundation Institute of Data Science: Conference papers

Conference Paper: Move Forward and Tell: A Progressive Generator of Video Descriptions

Title	Move Forward and Tell: A Progressive Generator of Video Descriptions
Authors	Xiong, Yilei; Dai, Bo; Lin, Dahua
Keywords	Move forward and tell; Recurrent network; Reinforcement learning; Repetition evaluation; Video captioning
Issue Date	2018
Citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11215 LNCS, p. 489-505. DOI: http://dx.doi.org/10.1007/978-3-030-01252-6_29
Abstract	We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption conditioned on a single embedding. On the contrary, we consider videos with rich temporal structures and aim to generate paragraph descriptions that can preserve the story flow while being coherent and concise. Towards this goal, we propose a new approach, which produces a descriptive paragraph by assembling temporally localized descriptions. Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner. Particularly, the selection of clips and the production of sentences are done jointly and progressively driven by a recurrent network – what to describe next depends on what has been said before. Here, the recurrent network is learned via self-critical sequence training with both sentence-level and paragraph-level rewards. On the ActivityNet Captions dataset, our method demonstrated the capability of generating high-quality paragraph descriptions for videos. Compared to those by other methods, the descriptions produced by our method are often more relevant, more coherent, and more concise.
Persistent Identifier	http://hdl.handle.net/10722/352472
ISSN	0302-9743 (2023 SCImago Journal Rankings: 0.606)
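
The abstract mentions self-critical sequence training with both sentence-level and paragraph-level rewards. The following is a minimal sketch of that general idea, not the authors' implementation: the function names `combined_reward` and `scst_loss`, the `weight` parameter, and the weighted-average blend of the two reward levels are all assumptions made for illustration. The record does not state how the paper combines its rewards.

```python
from typing import List


def combined_reward(sentence_rewards: List[float],
                    paragraph_reward: float,
                    weight: float = 0.5) -> float:
    """Blend sentence-level and paragraph-level rewards.

    ASSUMPTION: a weighted average of the mean sentence reward and the
    paragraph reward. This blend is hypothetical, chosen for illustration.
    """
    mean_sentence = sum(sentence_rewards) / len(sentence_rewards)
    return (1.0 - weight) * mean_sentence + weight * paragraph_reward


def scst_loss(sample_logprobs: List[float],
              sample_reward: float,
              greedy_reward: float) -> float:
    """Self-critical policy-gradient loss for one sampled sequence.

    The reward of the greedy decode serves as the baseline, so a sample
    is reinforced only when it scores higher than the greedy output.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Hypothetical numbers: a sampled paragraph versus a greedy paragraph.
sample_r = combined_reward([0.5, 1.0], paragraph_reward=0.25)
greedy_r = combined_reward([0.25, 0.25], paragraph_reward=0.25)
loss = scst_loss([-0.5, -1.0], sample_r, greedy_r)
```

In an actual training loop the log-probabilities would come from the recurrent network and the loss would be backpropagated through them; plain floats are used here only to keep the arithmetic visible.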

DC Field	Value	Language
dc.contributor.author	Xiong, Yilei	-
dc.contributor.author	Dai, Bo	-
dc.contributor.author	Lin, Dahua	-
dc.date.accessioned	2024-12-16T03:59:16Z	-
dc.date.available	2024-12-16T03:59:16Z	-
dc.date.issued	2018	-
dc.identifier.citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, v. 11215 LNCS, p. 489-505	-
dc.identifier.issn	0302-9743	-
dc.identifier.uri	http://hdl.handle.net/10722/352472	-
dc.description.abstract	We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption conditioned on a single embedding. On the contrary, we consider videos with rich temporal structures and aim to generate paragraph descriptions that can preserve the story flow while being coherent and concise. Towards this goal, we propose a new approach, which produces a descriptive paragraph by assembling temporally localized descriptions. Given a video, it selects a sequence of distinctive clips and generates sentences thereon in a coherent manner. Particularly, the selection of clips and the production of sentences are done jointly and progressively driven by a recurrent network – what to describe next depends on what has been said before. Here, the recurrent network is learned via self-critical sequence training with both sentence-level and paragraph-level rewards. On the ActivityNet Captions dataset, our method demonstrated the capability of generating high-quality paragraph descriptions for videos. Compared to those by other methods, the descriptions produced by our method are often more relevant, more coherent, and more concise.	-
dc.language	eng	-
dc.relation.ispartof	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	-
dc.subject	Move forward and tell	-
dc.subject	Recurrent network	-
dc.subject	Reinforcement learning	-
dc.subject	Repetition evaluation	-
dc.subject	Video captioning	-
dc.title	Move Forward and Tell: A Progressive Generator of Video Descriptions	-
dc.type	Conference_Paper	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1007/978-3-030-01252-6_29	-
dc.identifier.scopus	eid_2-s2.0-85055133234	-
dc.identifier.volume	11215 LNCS	-
dc.identifier.spage	489	-
dc.identifier.epage	505	-
dc.identifier.eissn	1611-3349	-