Conference Paper: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

Title: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
Authors: Chen, Z; Ma, L; Luo, W; Wong, KKY
Issue Date: 2019
Publisher: Association for Computational Linguistics
Citation: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July - 2 August 2019, p. 1884–1894
Abstract: In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatiotemporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches. Our code and the constructed VID-sentence dataset are available at: https://github.com/JeffCHEN2017/WSSTG.git.
Description: Session 3E: Vision, Robotics, Multimodal, Grounding and Speech
Persistent Identifier: http://hdl.handle.net/10722/272014
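
The abstract describes two training signals for matching candidate spatio-temporal tubes ("instances") against a sentence: a ranking loss that prefers the paired video over unpaired ones, and a diversity loss that strengthens the matching behaviors of reliable instance-sentence pairs while penalizing unreliable ones. The following is a minimal sketch of how such an objective could look, not the authors' implementation (their actual code is at the repository linked above): the function name, the max-pooling of instance scores into a video-level score, the entropy form of the diversity term, and the margin/weight values are all assumptions made here for illustration.

    import torch
    import torch.nn.functional as F

    def grounding_loss(pos_scores, neg_scores, margin=1.0, lam=0.1):
        """Sketch of a ranking + diversity objective (assumed form).

        pos_scores: (N,) matching scores between the sentence and the N
            candidate tubes extracted from its paired video.
        neg_scores: (M,) scores between the same sentence and tubes from
            a non-matching video (or a non-matching sentence).
        """
        # Video-level score: let the best-matching instance represent the video.
        pos = pos_scores.max()
        neg = neg_scores.max()

        # Ranking loss: the paired video should outscore the unpaired one
        # by at least `margin`.
        ranking = F.relu(margin - pos + neg)

        # Diversity loss (assumed entropy form): sharpen the distribution
        # over instances so that a few reliable tubes dominate the matching.
        p = F.softmax(pos_scores, dim=0)
        diversity = -(p * (p + 1e-8).log()).sum()

        return ranking + lam * diversity

    # Usage with random scores standing in for attentive-interactor outputs:
    pos = torch.randn(30, requires_grad=True)
    neg = torch.randn(30, requires_grad=True)
    grounding_loss(pos, neg).backward()

Under this formulation, inference would simply return the candidate tube with the highest matching score for the given sentence.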


DC Field: Value
dc.contributor.author: Chen, Z
dc.contributor.author: Ma, L
dc.contributor.author: Luo, W
dc.contributor.author: Wong, KKY
dc.date.accessioned: 2019-07-20T10:33:59Z
dc.date.available: 2019-07-20T10:33:59Z
dc.date.issued: 2019
dc.identifier.citation: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July - 2 August 2019, p. 1884–1894
dc.identifier.uri: http://hdl.handle.net/10722/272014
dc.description: Session 3E: Vision, Robotics, Multimodal, Grounding and Speech
dc.description.abstract: In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatiotemporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches. Our code and the constructed VID-sentence dataset are available at: https://github.com/JeffCHEN2017/WSSTG.git.
dc.language: eng
dc.publisher: Association for Computational Linguistics
dc.relation.ispartof: Annual Meeting of the Association for Computational Linguistics
dc.title: Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
dc.type: Conference_Paper
dc.identifier.email: Wong, KKY: kykwong@cs.hku.hk
dc.identifier.authority: Wong, KKY=rp01393
dc.identifier.hkuros: 299481
dc.identifier.spage: 1884
dc.identifier.epage: 1894
dc.publisher.place: Florence, Italy
