Conference Paper: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Title: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Authors: Chen, Z; Ma, L; Luo, W; Wong, KKY
Issue Date: 2019
Publisher: Association for Computational Linguistics
Citation: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July - 2 August 2019, p. 1884–1894
Abstract: In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatiotemporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches. Our code and the constructed VID-sentence dataset are available at: https://github.com/JeffCHEN2017/WSSTG.git.
Description: Session 3E: Vision, Robotics, Multimodal, Grounding and Speech
Persistent Identifier: http://hdl.handle.net/10722/272014
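
The abstract describes two training signals for matching candidate spatio-temporal tubes ("instances") against a sentence: a ranking loss that prefers the paired video over unpaired ones, and a diversity loss that strengthens the matching behaviors of reliable instance-sentence pairs while penalizing unreliable ones. The following is a minimal sketch of how such an objective could look, not the authors' implementation (their actual code is at the repository linked above): the function name, the max-pooling of instance scores into a video-level score, the entropy form of the diversity term, and the margin/weight values are all assumptions made here for illustration.

    import torch
    import torch.nn.functional as F

    def grounding_loss(pos_scores, neg_scores, margin=1.0, lam=0.1):
        """Sketch of a ranking + diversity objective (assumed form).

        pos_scores: (N,) matching scores between the sentence and the N
            candidate tubes extracted from its paired video.
        neg_scores: (M,) scores between the same sentence and tubes from
            a non-matching video (or a non-matching sentence).
        """
        # Video-level score: let the best-matching instance represent the video.
        pos = pos_scores.max()
        neg = neg_scores.max()

        # Ranking loss: the paired video should outscore the unpaired one
        # by at least `margin`.
        ranking = F.relu(margin - pos + neg)

        # Diversity loss (assumed entropy form): sharpen the distribution
        # over instances so that a few reliable tubes dominate the matching.
        p = F.softmax(pos_scores, dim=0)
        diversity = -(p * (p + 1e-8).log()).sum()

        return ranking + lam * diversity

    # Usage with random scores standing in for attentive-interactor outputs:
    pos = torch.randn(30, requires_grad=True)
    neg = torch.randn(30, requires_grad=True)
    grounding_loss(pos, neg).backward()

Under this formulation, inference would simply return the candidate tube with the highest matching score for the given sentence.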


DC Field: Value
dc.contributor.author: Chen, Z
dc.contributor.author: Ma, L
dc.contributor.author: Luo, W
dc.contributor.author: Wong, KKY
dc.date.accessioned: 2019-07-20T10:33:59Z
dc.date.available: 2019-07-20T10:33:59Z
dc.date.issued: 2019
dc.identifier.citation: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July - 2 August 2019, p. 1884–1894
dc.identifier.uri: http://hdl.handle.net/10722/272014
dc.description: Session 3E: Vision, Robotics, Multimodal, Grounding and Speech
dc.description.abstract: In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatiotemporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches. Our code and the constructed VID-sentence dataset are available at: https://github.com/JeffCHEN2017/WSSTG.git.
dc.language: eng
dc.publisher: Association for Computational Linguistics
dc.relation.ispartof: Annual Meeting of the Association for Computational Linguistics
dc.title: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
dc.type: Conference_Paper
dc.identifier.email: Wong, KKY: kykwong@cs.hku.hk
dc.identifier.authority: Wong, KKY=rp01393
dc.identifier.hkuros: 299481
dc.identifier.spage: 1884
dc.identifier.epage: 1894
dc.publisher.place: Florence, Italy
