WACV 2024 Accepted Paper
Method
Reduce the number of decoder layers
- To improve the efficiency of our model, we use a one-layer decoder as the default setting.
- This choice is motivated by the observation that an eight-layer decoder provides only marginal gains yet accounts for over 50% of the total FLOPs of the MAE model.
- In practice, switching to a one-layer decoder accelerates pre-training by over 60% compared to the eight-layer decoder (a configurable-depth decoder sketch follows below).
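Below is a minimal, self-contained PyTorch sketch (not the paper's code) of an MAE-style decoder whose depth is a single argument; the class name `TinyMAEDecoder` and its defaults are illustrative assumptions, though the reference facebookresearch/mae implementation exposes a comparable `decoder_depth` knob.

```python
import torch
import torch.nn as nn

class TinyMAEDecoder(nn.Module):
    """Illustrative MAE-style decoder with configurable depth."""
    def __init__(self, embed_dim=512, depth=1, num_heads=16, patch_size=16, in_chans=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=num_heads,
                dim_feedforward=4 * embed_dim,
                batch_first=True, norm_first=True,  # pre-norm, ViT-style
            )
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        # MAE's decoder predicts raw pixel values for every patch
        self.pred = nn.Linear(embed_dim, patch_size * patch_size * in_chans)

    def forward(self, x):  # x: (batch, num_tokens, embed_dim)
        for blk in self.blocks:
            x = blk(x)
        return self.pred(self.norm(x))

if __name__ == "__main__":
    tokens = torch.randn(2, 197, 512)  # 196 patch tokens + 1 cls token (ViT-B/16, 224 px)
    for depth in (1, 8):
        dec = TinyMAEDecoder(depth=depth)
        params_m = sum(p.numel() for p in dec.parameters()) / 1e6
        print(f"decoder depth {depth}: {params_m:.1f}M params, output shape {tuple(dec(tokens).shape)}")
```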
Reduce the number of pre-training epochs
- Reducing the number of pre-training epochs from 1600 to 100 and increasing the masking ratio from 75% to 90% further speeds up MAE pre-training by 23×.
- However, the final fine-tuning accuracy drops by 1.4% relative to the original MAE (80.4% vs. 81.8%); a back-of-the-envelope cost sketch follows below.
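To make the two recipes concrete, here is a rough sketch contrasting them and estimating the cost ratio from the numbers above; the field names and the cost proxy are illustrative assumptions, not the paper's code.

```python
# Two pre-training recipes as plain config dicts.
ORIGINAL_RECIPE = dict(epochs=1600, mask_ratio=0.75, batch_size=4096)
REDUCED_RECIPE  = dict(epochs=100,  mask_ratio=0.90, batch_size=4096)

def relative_encoder_cost(cfg):
    # MAE's encoder only processes visible (unmasked) patches, so a crude proxy for
    # pre-training cost is epochs x fraction of visible tokens.
    return cfg["epochs"] * (1.0 - cfg["mask_ratio"])

speedup = relative_encoder_cost(ORIGINAL_RECIPE) / relative_encoder_cost(REDUCED_RECIPE)
# ~40x by this proxy; the measured end-to-end speedup quoted above is 23x, since the
# decoder, data loading, and fixed overheads are not reduced by masking.
print(f"rough encoder-side speedup: {speedup:.0f}x")
```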
Reduce the pre-training batch size
- The conventional MAE pre-training batch size is 4096.
- Reducing the pre-training batch size from 4096 to 1024 improves performance by 1.2% (a learning-rate scaling sketch follows below).
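One plausible reading (an assumption, not stated explicitly here) is that at only 100 epochs a smaller batch provides 4× more optimizer steps per epoch; the sketch below applies the MAE paper's linear learning-rate scaling rule to show what else changes when the batch shrinks.

```python
# MAE pre-training scales the learning rate linearly with batch size:
# lr = base_lr * batch_size / 256, with base_lr = 1.5e-4 by default.
BASE_LR = 1.5e-4
IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k training set size

def effective_lr(batch_size, base_lr=BASE_LR):
    return base_lr * batch_size / 256

for bs in (4096, 1024):
    steps_per_epoch = IMAGENET_TRAIN_IMAGES // bs
    print(f"batch {bs}: lr = {effective_lr(bs):.2e}, ~{steps_per_epoch} optimizer steps/epoch")
```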
Layer-wise learning rate decay
- The original MAE assumes that lower-level features are already well learned during pre-training, so they need only small updates during fine-tuning.
- However, this assumption can break down in the reduced version of MAE.
- Accordingly, raising the layer-wise learning-rate decay (LLRD) rate, so that lower layers receive relatively larger updates, improves fine-tuning performance.
- At a 90% mask ratio, this brings a 1.2% gain (see the LLRD grouping sketch below).
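A minimal sketch of how LLRD parameter groups can be built for a ViT-style backbone follows; the name parsing is a simplified assumption (the MAE fine-tuning code does the same thing with extra handling for weight decay), and the `decay_rate` in the usage comment is an illustrative value, not the paper's optimum.

```python
def llrd_param_groups(model, base_lr, decay_rate, num_layers):
    """Group parameters so layer l gets base_lr * decay_rate ** (num_layers + 1 - l)."""
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        layer_id = assign_layer_id(name, num_layers)
        if layer_id not in groups:
            scale = decay_rate ** (num_layers + 1 - layer_id)
            groups[layer_id] = {"params": [], "lr": base_lr * scale}
        groups[layer_id]["params"].append(param)
    return list(groups.values())

def assign_layer_id(name, num_layers):
    # patch embedding / cls token / positional embedding -> layer 0,
    # transformer block i -> layer i + 1, everything else (final norm, head) -> top.
    if name.startswith(("patch_embed", "cls_token", "pos_embed")):
        return 0
    if name.startswith("blocks."):
        return int(name.split(".")[1]) + 1
    return num_layers + 1

# Usage with a timm ViT-B/16 (decay_rate value is illustrative):
#   import timm, torch
#   model = timm.create_model("vit_base_patch16_224")
#   optimizer = torch.optim.AdamW(
#       llrd_param_groups(model, base_lr=1e-3, decay_rate=0.85, num_layers=12))
```

Raising `decay_rate` toward 1.0 narrows the gap between the learning rates of the lowest and highest blocks, which is exactly the knob the bullets above tune.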
Optimal LLRD of different pre-train recipes
- With a training recipe built from these optimal values, the gap to the original 8-layer-decoder MAE is only 0.1%.
- Training, however, is about 50× faster.
Low-cost parameter searching
- Final performance varies substantially depending on how the recipe is configured, which motivates searching the recipe hyperparameters at low cost (a minimal search sketch follows below).
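The post does not detail how the search is performed, so the following is only a hedged sketch of one cheap way it could look: enumerate candidate recipes and score each with a short proxy run (`score_fn` is a placeholder to be supplied by the reader; the search-space values are illustrative).

```python
import itertools

# Illustrative search space; the concrete values are placeholders, not the paper's.
SEARCH_SPACE = {
    "mask_ratio": [0.75, 0.90],
    "llrd_rate":  [0.65, 0.75, 0.85],
    "base_lr":    [1e-3, 2e-3, 4e-3],
}

def search_recipe(score_fn):
    """Score every candidate recipe with `score_fn` and return the best one."""
    best_cfg, best_score = None, float("-inf")
    keys = list(SEARCH_SPACE)
    for values in itertools.product(*(SEARCH_SPACE[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)  # e.g. validation accuracy after a short proxy run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```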
Experiment
