
Conference Paper: MILES: Visual BERT pre-training with injected language semantics for video-text retrieval

Title: MILES: Visual BERT pre-training with injected language semantics for video-text retrieval
Authors: Ge, Y; Ge, Y; Liu, X; Wang, JA; Wu, J; Shan, Y; Qie, X; Luo, P
Keywords: Computer Vision
Issue Date: 2022
Publisher: Ortra Ltd.
Citation: European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022
Abstract: Dominant pre-training work for video-text retrieval mainly adopts the 'dual-encoder' architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to address this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the 'dual-encoder' architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving 'tokenizer' to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover the text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also surpasses the baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval.
Persistent Identifier: http://hdl.handle.net/10722/315553
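
The abstract describes masked visual modeling in which a snapshot copy of the video encoder acts as an evolving 'tokenizer' that supplies reconstruction targets for masked video patches. The following is a minimal, illustrative PyTorch sketch of that general scheme, not the authors' released code: the encoder class, masking ratio, smooth-L1 loss, and EMA momentum are assumptions made purely for the example.

```python
# Illustrative sketch only (not the MILES implementation): one training step of
# masked visual modeling where a frozen "snapshot" copy of the video encoder
# supplies reconstruction targets for the masked patches. Encoder internals,
# masking ratio, and the smooth-L1 loss are assumptions for this example.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoEncoder(nn.Module):
    """Stand-in patch-level video encoder: maps (B, N, D_in) patch tokens
    to (B, N, D_out) features. A real model would be a video transformer."""

    def __init__(self, dim_in=768, dim_out=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, dim_out),
                                  nn.GELU(),
                                  nn.Linear(dim_out, dim_out))

    def forward(self, patch_tokens):
        return self.proj(patch_tokens)


def masked_modeling_step(online_enc, snapshot_enc, patch_tokens,
                         mask_token, mask_ratio=0.6):
    """Predict snapshot-encoder features of masked patches from a corrupted view."""
    B, N, D = patch_tokens.shape

    # Randomly mask a fraction of the space-time patch tokens.
    mask = torch.rand(B, N, device=patch_tokens.device) < mask_ratio  # (B, N)

    # Corrupted input: masked positions are replaced by a learnable mask token.
    corrupted = torch.where(mask.unsqueeze(-1),
                            mask_token.expand(B, N, D),
                            patch_tokens)

    # Targets come from the snapshot ("tokenizer") encoder on the intact video;
    # it receives no gradients and is updated only via EMA.
    with torch.no_grad():
        targets = snapshot_enc(patch_tokens)

    # The online encoder reasons over the visible patches to recover the
    # features of the masked ones.
    preds = online_enc(corrupted)

    # The loss is computed on masked positions only.
    return F.smooth_l1_loss(preds[mask], targets[mask])


@torch.no_grad()
def update_snapshot(online_enc, snapshot_enc, momentum=0.99):
    """Move the snapshot encoder toward the online encoder (EMA), so the
    reconstruction targets evolve as pre-training progresses."""
    for p_o, p_s in zip(online_enc.parameters(), snapshot_enc.parameters()):
        p_s.mul_(momentum).add_(p_o.detach(), alpha=1.0 - momentum)


if __name__ == "__main__":
    online = TinyVideoEncoder()
    snapshot = copy.deepcopy(online)
    for p in snapshot.parameters():
        p.requires_grad_(False)

    mask_token = nn.Parameter(torch.zeros(1, 1, 768))
    optimizer = torch.optim.AdamW(list(online.parameters()) + [mask_token], lr=1e-4)

    # Fake batch: 2 clips, 4 frames x 49 patches = 196 tokens of dim 768.
    patches = torch.randn(2, 196, 768)

    loss = masked_modeling_step(online, snapshot, patches, mask_token)
    loss.backward()
    optimizer.step()
    update_snapshot(online, snapshot)
    print(f"masked-modeling loss: {loss.item():.4f}")
```

In this sketch the snapshot encoder is updated only by an exponential moving average of the online encoder's weights, so the reconstruction targets change as pre-training progresses, mirroring the "evolving tokenizer" idea in the abstract; how the targets are made text-aligned (e.g., through the contrastive video-text objective) is left out of the example.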

 

DC Field | Value | Language
dc.contributor.author | Ge, Y | -
dc.contributor.author | Ge, Y | -
dc.contributor.author | Liu, X | -
dc.contributor.author | Wang, JA | -
dc.contributor.author | Wu, J | -
dc.contributor.author | Shan, Y | -
dc.contributor.author | Qie, X | -
dc.contributor.author | Luo, P | -
dc.date.accessioned | 2022-08-19T09:00:02Z | -
dc.date.available | 2022-08-19T09:00:02Z | -
dc.date.issued | 2022 | -
dc.identifier.citation | European Conference on Computer Vision (ECCV), Tel Aviv, Israel, October 23-27, 2022 | -
dc.identifier.uri | http://hdl.handle.net/10722/315553 | -
dc.description.abstract | Dominant pre-training work for video-text retrieval mainly adopts the 'dual-encoder' architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to address this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the 'dual-encoder' architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving 'tokenizer' to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover the text-aligned features of the masked patches by reasoning over the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols. Our approach also surpasses the baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval. | -
dc.language | eng | -
dc.publisher | Ortra Ltd. | -
dc.subject | Computer Vision | -
dc.title | MILES: Visual BERT pre-training with injected language semantics for video-text retrieval | -
dc.type | Conference_Paper | -
dc.identifier.email | Ge, Y: geyixiao831@gmail.com | -
dc.identifier.email | Ge, Y: geyixiao831@gmail.com | -
dc.identifier.email | Luo, P: pluo@hku.hk | -
dc.identifier.authority | Luo, P=rp02575 | -
dc.identifier.hkuros | 335567 | -
dc.publisher.place | Israel | -
