ECCV 2024 Accepted Paper
- Uses multi-modal contrastive learning in the style of methods such as SimCLR
Summary
- Satellite Metadata-Image Pretraining (SatMIP)
- A new approach for harnessing metadata in the pretraining phase through a flexible and unified multimodal learning objective.
- Represents metadata as textual captions and aligns images with metadata in a shared embedding space by solving a metadata-image contrastive task.
- SatMIPS
- Combines image self-supervision and metadata supervision
- Improves over its image-image pretraining baseline, SimCLR, and accelerates convergence.
- Encodes pairs of images and metadata as separate modalities and aligns them in a deep embedding space via a contrastive task
Introduction or Motivation
- Within satellite imagery, metadata such as time and location often hold significant semantic information that improves scene understanding.
- We aim to learn a visual encoder that embeds metadata information, and their latent semantic characteristics, into image features.
- By co-solving an image-image and a metadata-image contrastive task with an efficient “coupled” architecture, SatMIPS benefits from both sources of supervision, and improves over its SimCLR baseline, yielding better representations while converging faster.
Method
SatMIP
- Contrastive Loss (a minimal code sketch follows this subsection)
- $\mathcal{L}^{\text{clr}}(a_i, b_i) = - \log \left( \frac{\exp(s(a_i, b_i)/\tau)}{\sum_{j=1}^{K} \exp(s(a_i, b_j)/\tau)} \right)$
- $\mathcal{L}_{i}^{\text{MI}}(\mathbf{z}_i^{\mathcal{I}}, \mathbf{z}_i^{\mathcal{M}}) = \frac{1}{2} \left( \mathcal{L}^{\text{clr}}(\mathbf{z}_i^{\mathcal{I}}, \mathbf{z}_i^{\mathcal{M}}) + \mathcal{L}^{\text{clr}}(\mathbf{z}_i^{\mathcal{M}}, \mathbf{z}_i^{\mathcal{I}}) \right)$
- $z^{\mathcal{I}}$: Image embedding feature
- $z^{\mathcal{M}}$: Metadata embedding feature
- Note that there does not exist a simple 1:1 mapping between images and metadata, because metadata can match many image variations and vice versa
- e.g., due to the non-deterministic nature of weather.
- This prevents the model from simply overfitting the pretext task. In addition, data augmentation applied to the images further regularizes the task.
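A minimal PyTorch sketch of the metadata-image contrastive loss above, assuming cosine similarity for $s(\cdot, \cdot)$, in-batch negatives, and a single temperature $\tau$; the tensor names and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """L^clr(a_i, b_i): contrast each a_i against all b_j in the batch."""
    a = F.normalize(a, dim=-1)  # cosine similarity assumed for s(., .)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau  # (K, K) matrix of s(a_i, b_j) / tau
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)  # mean over i of -log softmax of the positive pair


def metadata_image_loss(z_img: torch.Tensor, z_meta: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric L^MI: average of the image-to-metadata and metadata-to-image terms."""
    return 0.5 * (info_nce(z_img, z_meta, tau) + info_nce(z_meta, z_img, tau))


# Usage with dummy embeddings (batch K = 8, embedding dim 128):
z_img, z_meta = torch.randn(8, 128), torch.randn(8, 128)
loss = metadata_image_loss(z_img, z_meta)
```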
SatMIPS
- SimCLR-style contrastive losses, image-image & image-metadata (combined objective sketched below)
- $\mathcal{L}_{i}^{\text{Sim}}(\mathbf{z}_i, \mathbf{z}_i') = \frac{1}{2} \left( \mathcal{L}^{\text{clr}}(\mathbf{z}_i, \mathbf{z}_i') + \mathcal{L}^{\text{clr}}(\mathbf{z}_i', \mathbf{z}_i) \right)$
- $\mathcal{L}_{i}^{\text{MI+Sim}}(\mathbf{z}_i^{\mathcal{I}}, \mathbf{z}_i^{\mathcal{M}}, \mathbf{z}_i, \mathbf{z}_i') = \mathcal{L}_{i}^{\text{MI}} + \lambda \mathcal{L}_{i}^{\text{Sim}}$
- $\lambda$: set to 1 by default
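A sketch of the combined SatMIPS objective under the same assumptions as the previous snippet; `z_view1` and `z_view2` stand for two augmented views of the same images passed through a shared image encoder, and pairing the first view with the metadata embedding is an assumption here, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(a.size(0), device=a.device))


def satmips_loss(z_view1, z_view2, z_meta, lam=1.0, tau=0.07):
    # L^MI: metadata-image term between one image view and its metadata embedding.
    l_mi = 0.5 * (info_nce(z_view1, z_meta, tau) + info_nce(z_meta, z_view1, tau))
    # L^Sim: SimCLR-style image-image term between the two augmented views.
    l_sim = 0.5 * (info_nce(z_view1, z_view2, tau) + info_nce(z_view2, z_view1, tau))
    return l_mi + lam * l_sim  # lambda defaults to 1
```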
Experiment
Dataset
- Pretraining on fMoW-RGB
- The metadata comprises a diverse set of fields:
- GSD, timestamp, location, location-derived information such as UTM zone and country, cloud cover, and various imaging angles
- Global batch size of 1024
- 1 epoch of linear warmup
- Encoders:
- Visual Encoder: ViT-S/16 from MoCo V3
- Meta Encoder:
- Textual metadata encoder:
- BERT-style Transformer encoder with 3 layers, width 512, 8 attention heads, FFN factor of 4 (see the sketch after this list)
- Tabular metadata encoder
- Converts metadata into atomic numerical or categorical fields
- FT-Transformer with 3 layers, width 192, 8 attention heads, FFN factor of 4/3
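A hypothetical sketch of the textual metadata path: a metadata record serialized into a caption (the exact caption format and tokenizer are not given in these notes and are made up here) and a BERT-style encoder matching the 3-layer / width-512 / 8-head / FFN-4 configuration, built from plain `torch.nn` modules rather than the paper's code.

```python
import torch
import torch.nn as nn

# Illustrative only: one way a metadata record might be serialized into a caption.
caption = "gsd 1.6; timestamp 2016-05-14; country FRA; utm zone 31; cloud cover 12; off-nadir angle 18"

# BERT-style Transformer encoder: 3 layers, width 512, 8 heads, FFN factor 4.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=4 * 512, batch_first=True
)
text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Placeholder token embeddings standing in for a real tokenizer + embedding table
# applied to `caption` (batch 1, sequence length 32, width 512).
tokens = torch.randn(1, 32, 512)
z_meta = text_encoder(tokens).mean(dim=1)  # pooled metadata embedding (pooling choice assumed)
```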
Experimental Results
- Using location as the metadata field yields a larger performance gain than the other fields.