
Summary

  • Background:
    • foundation models for geospatial and satellite remote sensing applications are commonly trained on large optical RGB or multi-spectral datasets
  • Limitations:
    • Data from a wide variety of heterogeneous sensors are available in the remote sensing domain, yet pre-training rarely covers them.
    • This leads to significant discrepancies between pre-training and downstream target data distributions for many important applications.
    • Fine-tuning large foundation models to bridge that gap incurs high computational cost and can be infeasible when target datasets are small.
  • Goal:
    • Address the question of how large, pre-trained foundational transformer models can be efficiently adapted to downstream remote sensing tasks involving different data modalities or limited dataset size.

Introduction or Motivation

  • Computer vision approaches for remote sensing data are highly fragmented into specialized sub-fields defined by the different modalities or the application of interest
    • e.g., RGB, NIR, hyperspectral data, or SAR
  • Current models lack zero- or few-shot capabilities on modalities other than optical data, which forces the re-training of large foundation models for datasets involving new modalities.
    • Expensive fine-tuning protocols have to be employed.
    • This requires large amounts of labeled samples to adapt the model and comes with high computational cost.

Method

  • Scaled Low-Rank (SLR) adapters
    • introduce a small number of parameters to add new data modalities to a pre-trained foundation model.
    • these additional parameters allow the model to adapt to the characteristics of the new data modality, while the pre-trained parameters are kept fixed.
    • helps to generalize remote sensing foundation models beyond their pre-training data modalities while fully leveraging their existing capabilities.
  • The MSA, MLP, and LN operations in the $b^{th}$ block of the MAE follow:
    • $z'_b=\text{MSA}_b(\text{LN}_b^1(z_b))+z_b$
    • $z_{b+1}=\text{MLP}_b(\text{LN}_b^2(z'_b))+z'_b$
  • The original linear transform, with the input scale $s_i^1$ applied, is:
    • $(s_i^1 \odot z_i) W_i$
  • LoRA:
    • two low-rank matrices $W_i^1\in\mathbb{R}^{D \times r}$, $W_i^2\in\mathbb{R}^{r \times D}$ with $r \ll D$
    • low-rank update of the scaled input: $((s_i^1\odot z_i)W_i^1)W_i^2$
  • Applying this low-rank path to the MAE's linear layers, together with the output scale $s_i^2$, gives the SLR-adapted transform:
    • $f_{\text{ada}} = s_i^2[(s_i^1\odot z_i)W_i+((s_i^1\odot z_i)W_i^1)W_i^2]$
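The SLR-adapted transform $f_{\text{ada}}$ above can be sketched in NumPy (a minimal illustration; the function and variable names are mine, and batching/multi-head details of the real model are omitted):

```python
import numpy as np

def slr_linear(z, W, W1, W2, s1, s2):
    """Sketch of the SLR-adapted linear transform f_ada.

    W  : (D, D)  frozen pre-trained weight
    W1 : (D, r), W2 : (r, D)  trainable low-rank update, r << D
    s1, s2 : (D,)  trainable input / output scale vectors

    Only W1, W2, s1, s2 are trained: 2*D*r + 2*D parameters
    instead of the D*D needed for full fine-tuning of W.
    """
    zs = s1 * z  # s^1 ⊙ z, elementwise input scaling
    # s^2 [ (s^1 ⊙ z) W + ((s^1 ⊙ z) W^1) W^2 ]
    return s2 * (zs @ W + (zs @ W1) @ W2)
```

With $W_i^1$ (or $W_i^2$) initialized to zero and both scales at one, the adapted layer reproduces the frozen transform exactly, so adaptation starts from the pre-trained behavior.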

Experiment

  • The number of training steps is fixed for each dataset.
  • The test performance of the checkpoint with the lowest validation loss is reported.
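The model-selection step can be sketched as follows (a hypothetical helper; the paper does not give code for this protocol):

```python
def select_checkpoint(checkpoints):
    """Return the test metric of the checkpoint with the lowest
    validation loss.

    checkpoints: list of (val_loss, test_metric) pairs, one per
    checkpoint saved during the fixed training-step budget.
    """
    val_loss, test_metric = min(checkpoints, key=lambda c: c[0])
    return test_metric
```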
