Article: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

Title: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Authors: Gao, Peng; Lin, Ziyi; Zhang, Renrui; Fang, Rongyao; Li, Hongyang; Li, Hongsheng; Qiao, Yu
Keywords: Feature mimicking; Image classification; Masked autoencoders; Representation learning
Issue Date: 2024
Citation: International Journal of Computer Vision, 2024, v. 132, n. 5, p. 1546-1556
Abstract: Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE only reconstructs low-level RGB signals after the decoder and provides no supervision on high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE applies a mimic loss to the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit MAE's reconstruction loss to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies the high-level and low-level targets to different token partitions, the learning conflict between them is naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Pretrained checkpoints are released at https://github.com/Alpha-VL/ConvMAE. (A hedged code sketch of the two-branch objective follows the record fields below.)
Persistent Identifier: http://hdl.handle.net/10722/351486
ISSN: 0920-5691
2023 Impact Factor: 11.6
2023 SCImago Journal Rankings: 6.668
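
The abstract describes a two-branch objective: a feature-mimicking loss on the 25% visible tokens after the encoder (targets taken from a frozen CLIP or DINO teacher) and an MAE-style pixel-reconstruction loss on the 75% masked tokens after the decoder. Below is a minimal, self-contained sketch of how such a split could be wired up; the module names (toy_encoder, toy_decoder, frozen_teacher), shapes, and plain linear placeholders are illustrative assumptions, not the released MR-MAE/ConvMAE code.

```python
# Hedged sketch of the MR-MAE two-branch objective described in the abstract.
# All modules below are toy placeholders (plain linear layers), not the real
# ViT encoder/decoder or the frozen CLIP/DINO teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, num_tokens, dim = 2, 196, 768   # 14x14 patches of a 224x224 image; 16*16*3 = 768 pixel values per patch
mask_ratio = 0.75                      # 75% of tokens masked, 25% left visible

toy_encoder = nn.Linear(dim, dim)                            # stands in for the ViT encoder
toy_decoder = nn.Linear(dim, dim)                            # stands in for the lightweight MAE decoder
frozen_teacher = nn.Linear(dim, dim).requires_grad_(False)   # stands in for frozen CLIP/DINO features

tokens = torch.randn(batch, num_tokens, dim)         # patch embeddings for one batch
pixel_targets = torch.randn(batch, num_tokens, dim)  # flattened per-patch RGB targets

# Random masking: split token indices into 25% visible and 75% masked.
num_visible = int(num_tokens * (1 - mask_ratio))
perm = torch.rand(batch, num_tokens).argsort(dim=1)
visible_idx, masked_idx = perm[:, :num_visible], perm[:, num_visible:]
gather = lambda x, idx: x.gather(1, idx.unsqueeze(-1).expand(-1, -1, dim))

# High-level branch ("mimic"): regress encoder features of the visible tokens
# toward the teacher's features for the same patches.
visible_feat = toy_encoder(gather(tokens, visible_idx))
teacher_feat = frozen_teacher(gather(tokens, visible_idx))
mimic_loss = F.mse_loss(visible_feat, teacher_feat)

# Low-level branch ("reconstruct"): predict RGB values of the masked tokens
# after the decoder, as in vanilla MAE (mask tokens are zeros in this toy version).
masked_pred = toy_decoder(torch.zeros(batch, num_tokens - num_visible, dim))
recon_loss = F.mse_loss(masked_pred, gather(pixel_targets, masked_idx))

# The two targets act on disjoint token partitions, so they do not interfere.
loss = mimic_loss + recon_loss
print(float(mimic_loss), float(recon_loss), float(loss))
```

In the actual method the mimic targets come from pre-trained CLIP/DINO features and the decoder follows the MAE design; this sketch only illustrates how the two losses are applied to disjoint token sets.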

 

DC Field | Value | Language
dc.contributor.author | Gao, Peng | -
dc.contributor.author | Lin, Ziyi | -
dc.contributor.author | Zhang, Renrui | -
dc.contributor.author | Fang, Rongyao | -
dc.contributor.author | Li, Hongyang | -
dc.contributor.author | Li, Hongsheng | -
dc.contributor.author | Qiao, Yu | -
dc.date.accessioned | 2024-11-20T03:56:39Z | -
dc.date.available | 2024-11-20T03:56:39Z | -
dc.date.issued | 2024 | -
dc.identifier.citation | International Journal of Computer Vision, 2024, v. 132, n. 5, p. 1546-1556 | -
dc.identifier.issn | 0920-5691 | -
dc.identifier.uri | http://hdl.handle.net/10722/351486 | -
dc.description.abstract | Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE only reconstructs low-level RGB signals after the decoder and provides no supervision on high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE applies a mimic loss to the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit MAE's reconstruction loss to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies the high-level and low-level targets to different token partitions, the learning conflict between them is naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Pretrained checkpoints are released at https://github.com/Alpha-VL/ConvMAE. | -
dc.language | eng | -
dc.relation.ispartof | International Journal of Computer Vision | -
dc.subject | Feature mimicking | -
dc.subject | Image classification | -
dc.subject | Masked autoencoders | -
dc.subject | Representation learning | -
dc.title | Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking | -
dc.type | Article | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1007/s11263-023-01898-4 | -
dc.identifier.scopus | eid_2-s2.0-85178284406 | -
dc.identifier.volume | 132 | -
dc.identifier.issue | 5 | -
dc.identifier.spage | 1546 | -
dc.identifier.epage | 1556 | -
dc.identifier.eissn | 1573-1405 | -
