Article: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
Title | Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
---|---
Authors | Gao, Peng; Lin, Ziyi; Zhang, Renrui; Fang, Rongyao; Li, Hongyang; Li, Hongsheng; Qiao, Yu
Keywords | Feature mimicking; Image classification; Masked autoencoders; Representation learning
Issue Date | 2024
Citation | International Journal of Computer Vision, 2024, v. 132, n. 5, p. 1546-1556
Abstract | Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE solely reconstructs low-level RGB signals after the decoder and lacks supervision on high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets to different partitions of tokens, the learning conflicts between them are naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Pre-trained checkpoints are released at https://github.com/Alpha-VL/ConvMAE.
Persistent Identifier | http://hdl.handle.net/10722/351486
ISSN | 0920-5691 (2023 Impact Factor: 11.6; 2023 SCImago Journal Rankings: 6.668)
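The abstract describes two objectives applied to disjoint partitions of tokens: a mimic loss that pushes the encoder's 25% visible tokens toward frozen CLIP/DINO teacher features, plus the original MAE pixel-reconstruction loss on the 75% masked tokens after the decoder. The following is a minimal PyTorch-style sketch of how such a combined objective could be assembled; the `patchify` helper, the `encoder`/`decoder`/`mimic_head`/`teacher` callables, and their signatures are illustrative assumptions, not the authors' released implementation (see the ConvMAE repository linked in the abstract for the official code).

```python
import torch
import torch.nn.functional as F


def patchify(images, patch=16):
    # (B, C, H, W) -> (B, N, C*patch*patch): flatten non-overlapping patches.
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)


def mr_mae_loss(images, encoder, decoder, mimic_head, teacher, mask_ratio=0.75):
    """Sum of a feature-mimicking loss on visible tokens and an MAE-style
    pixel-reconstruction loss on masked tokens (conceptual sketch)."""
    patches = patchify(images)
    num_tokens = patches.shape[1]
    num_visible = int(num_tokens * (1 - mask_ratio))

    # Randomly split token indices into visible (25%) and masked (75%) sets.
    perm = torch.randperm(num_tokens, device=patches.device)
    visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

    # Encode only the visible tokens, as in standard MAE.
    visible_tokens = encoder(patches[:, visible_idx])            # (B, N_vis, D)

    # High-level target: mimic frozen CLIP/DINO features at the visible positions.
    with torch.no_grad():
        teacher_feats = teacher(images)                          # (B, N, D_teacher)
    loss_mimic = F.mse_loss(mimic_head(visible_tokens),
                            teacher_feats[:, visible_idx])

    # Low-level target: the decoder reconstructs raw pixels of the masked tokens.
    pixel_pred = decoder(visible_tokens, masked_idx)             # (B, N_mask, C*p*p)
    loss_rec = F.mse_loss(pixel_pred, patches[:, masked_idx])

    # The two objectives act on disjoint token partitions, so they are simply summed.
    return loss_mimic + loss_rec
```

Because the mimic loss attaches to the encoder's visible tokens while the reconstruction loss attaches to the decoder's masked tokens, the two gradients never compete for the same outputs, which is the non-interference property the abstract emphasizes.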
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Gao, Peng | - |
dc.contributor.author | Lin, Ziyi | - |
dc.contributor.author | Zhang, Renrui | - |
dc.contributor.author | Fang, Rongyao | - |
dc.contributor.author | Li, Hongyang | - |
dc.contributor.author | Li, Hongsheng | - |
dc.contributor.author | Qiao, Yu | - |
dc.date.accessioned | 2024-11-20T03:56:39Z | - |
dc.date.available | 2024-11-20T03:56:39Z | - |
dc.date.issued | 2024 | - |
dc.identifier.citation | International Journal of Computer Vision, 2024, v. 132, n. 5, p. 1546-1556 | - |
dc.identifier.issn | 0920-5691 | - |
dc.identifier.uri | http://hdl.handle.net/10722/351486 | - |
dc.description.abstract | Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE solely reconstructs low-level RGB signals after the decoder and lacks supervision on high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets to different partitions of tokens, the learning conflicts between them are naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Pre-trained checkpoints are released at https://github.com/Alpha-VL/ConvMAE. | -
dc.language | eng | - |
dc.relation.ispartof | International Journal of Computer Vision | - |
dc.subject | Feature mimicking | - |
dc.subject | Image classification | - |
dc.subject | Masked autoencoders | - |
dc.subject | Representation learning | - |
dc.title | Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking | - |
dc.type | Article | - |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1007/s11263-023-01898-4 | - |
dc.identifier.scopus | eid_2-s2.0-85178284406 | - |
dc.identifier.volume | 132 | - |
dc.identifier.issue | 5 | - |
dc.identifier.spage | 1546 | - |
dc.identifier.epage | 1556 | - |
dc.identifier.eissn | 1573-1405 | - |