Conference Paper: mCLIP: Multilingual CLIP via Cross-lingual Transfer
Field | Value |
---|---|
Title | mCLIP: Multilingual CLIP via Cross-lingual Transfer |
Authors | Chen, Guanhua; Hou, Lu; Chen, Yun; Dai, Wenliang; Shang, Lifeng; Jiang, Xin; Liu, Qun; Pan, Jia; Wang, Wenping |
Issue Date | 1-Jul-2023 |
Abstract | Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable performance on various downstream cross-modal tasks. However, they are usually biased towards English due to the lack of sufficient non-English image-text pairs. Existing multilingual VLP methods often learn retrieval-inefficient single-stream models using translation-augmented non-English image-text pairs. In this paper, we introduce mCLIP, a retrieval-efficient dual-stream multilingual VLP model, trained by aligning the CLIP model and a Multilingual Text Encoder (MTE) through a novel Triangle Cross-modal Knowledge Distillation (TriKD) method. It is parameter-efficient, as only two light projectors on top of the two models are updated during distillation. Furthermore, to enhance the token- and sentence-level multilingual representations of the MTE, we propose to train it jointly with machine translation and contrastive learning before TriKD to provide a better initialization. Empirical results show that mCLIP achieves new state-of-the-art performance on both zero-shot and finetuned multilingual image-text retrieval tasks. |
Persistent Identifier | http://hdl.handle.net/10722/333844 |
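The abstract describes a parameter-efficient setup: the CLIP encoders and the MTE stay frozen, and only two lightweight projectors on top of them are trained to align the three encoders pairwise (the "triangle"). A minimal sketch of that idea, in NumPy with hypothetical embedding dimensions and a simple cosine-alignment loss standing in for the paper's actual distillation objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dimensions; the real sizes come from CLIP and the MTE.
clip_dim, mte_dim, shared_dim = 512, 768, 512

# Only these two light projection matrices would be updated during
# distillation; the CLIP encoders and the MTE itself stay frozen.
W_clip = rng.normal(size=(clip_dim, shared_dim)) * 0.02
W_mte = rng.normal(size=(mte_dim, shared_dim)) * 0.02

def l2norm(x):
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triangle_kd_loss(img_emb, clip_txt_emb, mte_emb):
    """Sketch of a triangle alignment: pull the projected MTE sentence
    embedding toward both the frozen CLIP image embedding and the
    projected CLIP text embedding (cosine distance, averaged over
    the batch). The paper's exact TriKD objectives may differ."""
    z_img = l2norm(img_emb)                  # already in the shared space
    z_txt = l2norm(clip_txt_emb @ W_clip)    # CLIP text -> shared space
    z_mte = l2norm(mte_emb @ W_mte)          # MTE -> shared space
    return float(((1.0 - (z_img * z_mte).sum(-1))
                  + (1.0 - (z_txt * z_mte).sum(-1))).mean())

# Toy batch of 4 random "embeddings" in place of real encoder outputs.
loss = triangle_kd_loss(rng.normal(size=(4, shared_dim)),
                        rng.normal(size=(4, clip_dim)),
                        rng.normal(size=(4, mte_dim)))
```

Because only `W_clip` and `W_mte` carry gradients in this scheme, the trainable parameter count is tiny compared with the frozen encoders, which is what makes the distillation parameter-efficient.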
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chen, Guanhua | - |
dc.contributor.author | Hou, Lu | - |
dc.contributor.author | Chen, Yun | - |
dc.contributor.author | Dai, Wenliang | - |
dc.contributor.author | Shang, Lifeng | - |
dc.contributor.author | Jiang, Xin | - |
dc.contributor.author | Liu, Qun | - |
dc.contributor.author | Pan, Jia | - |
dc.contributor.author | Wang, Wenping | - |
dc.date.accessioned | 2023-10-06T08:39:33Z | - |
dc.date.available | 2023-10-06T08:39:33Z | - |
dc.date.issued | 2023-07-01 | - |
dc.identifier.uri | http://hdl.handle.net/10722/333844 | - |
dc.description.abstract | <p>Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable performance on various downstream cross-modal tasks. However, they are usually biased towards English due to the lack of sufficient non-English image-text pairs. Existing multilingual VLP methods often learn retrieval-inefficient single-stream models using translation-augmented non-English image-text pairs. In this paper, we introduce mCLIP, a retrieval-efficient dual-stream multilingual VLP model, trained by aligning the CLIP model and a Multilingual Text Encoder (MTE) through a novel Triangle Cross-modal Knowledge Distillation (TriKD) method. It is parameter-efficient, as only two light projectors on top of the two models are updated during distillation. Furthermore, to enhance the token- and sentence-level multilingual representations of the MTE, we propose to train it jointly with machine translation and contrastive learning before TriKD to provide a better initialization. Empirical results show that mCLIP achieves new state-of-the-art performance on both zero-shot and finetuned multilingual image-text retrieval tasks.</p> | - |
dc.language | eng | - |
dc.relation.ispartof | Annual Meeting of the Association for Computational Linguistics (ACL 2023) (11/07/2023-18/07/2023) | - |
dc.title | mCLIP: Multilingual CLIP via Cross-lingual Transfer | - |
dc.type | Conference_Paper | - |
dc.identifier.doi | 10.18653/v1/2023.acl-long.728 | - |