Link: https://arxiv.org/pdf/2409.13637
- Rather than analyzing a specific problem and solving it, the paper proposes several modules with the aim of decoupling the existing context text and making more diverse use of the text.
Introduction or Motivation
One of the key challenges of the RRSIS task is learning discriminative multi-modal features via text-image alignment.
Limitations of the previous approach
- Linguistic representation is directly fused with the visual features by leveraging pixel-level attention.
- This is a concise and direct method, but it neglects the intrinsic information within the referring expression and the fine-grained relationship between the image and the textual description.
The original referring sentence is regarded as a context expression and is then parsed into a ground object and a spatial position.
Propose:
- Fine-grained Image-text Alignment Module:
- simultaneously leverages the features of the input image and the corresponding text to learn more discriminative representations across modalities
- Text-aware Multi-scale Enhancement Module:
- adaptively performs cross-scale fusion and interaction under text guidance
Method
Decompose the context into two fragments describing ground objects and spatial positions (Fig. 1); a minimal text-encoding sketch follows the list below.
- $F_C$: original context text feature
- $F_G$: ground objects text feature
- $F_S$: spatial position text feature
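Below is a minimal sketch of how the three text features could be obtained, assuming the sentence has already been parsed into its ground-object and spatial-position fragments (the parsing step itself is not covered in these notes). It uses HuggingFace BERT; the example sentence and its split are hypothetical, and only the $F_C$/$F_G$/$F_S$ notation comes from above.

```python
# Minimal sketch: encode the context expression and its two parsed fragments with BERT.
# The sentence and its split are hypothetical; only the F_C / F_G / F_S names follow the notes.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

context = "the large airplane on the left side of the runway"   # original context expression
ground_object = "the large airplane"                             # parsed ground object
spatial_position = "on the left side of the runway"              # parsed spatial position

def encode(text: str) -> torch.Tensor:
    """Return per-token BERT features of shape (num_tokens, hidden_dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.squeeze(0)

F_C, F_G, F_S = encode(context), encode(ground_object), encode(spatial_position)
```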
Fine-Grained Image-Text Alignment
- Object-Position Alignment Block (OPAB):
- performs the interaction of the ground-object and spatial-position features with the visual representation
- Ground Object Branch:
- multi-fusion between the textual features of ground objects and the visual features
- enhance the discriminative ability of the model on the referent target
- Spatial Position Branch:
- captures a spatial prior guided by the original visual feature and the textual features of the positional description
- so that it can be better integrated with the ground-object features
- Context Alignment with Visual Features:
- Since the original context text presumably carries more contextual information, it is fused with the image features via Pixel-Word Attention.
- Implemented with a Pixel-Word Attention Module.
- Channel Modulation:
- Readjusts the extracted multi-modal features, which further enhances the discriminative ability of the proposed method (a rough sketch of the whole alignment block follows this list).
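A rough sketch of how such an alignment block could be wired up, assuming plain cross-attention as the image-text fusion operator, a simple multiplicative merge of the two branches, and a sigmoid gate for the channel modulation; the paper's exact formulation (fusion operators, normalization, branch merging) may differ. Text features are assumed to be projected to the visual channel dimension beforehand.

```python
# Rough sketch of an object-position alignment block with channel modulation.
# Cross-attention, the branch merge, and the gating are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class OPABSketch(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Ground object branch: visual queries attend to ground-object words (F_G).
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Spatial position branch: visual queries attend to position words (F_S).
        self.pos_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context alignment: pixel-word attention with the full context sentence (F_C).
        self.ctx_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Channel modulation: re-weight the channels of the fused feature.
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, vis, F_G, F_S, F_C):
        # vis: (B, HW, C); F_G / F_S / F_C: (B, L, C), projected to the visual dim.
        obj, _ = self.obj_attn(vis, F_G, F_G)   # object-aware visual feature
        pos, _ = self.pos_attn(vis, F_S, F_S)   # spatial prior from the position text
        fused = vis + obj * pos                 # one simple way to combine the two branches
        ctx, _ = self.ctx_attn(fused, F_C, F_C) # context alignment (pixel-word attention)
        fused = fused + ctx
        gate = self.channel_gate(fused.mean(dim=1, keepdim=True))  # channel-wise modulation
        return fused * gate
```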
Text-Aware Multi-Scale Enhancement
- All image features are downsampled to match the spatial size of the last (coarsest) image feature.
- The downsampled features are then concatenated, and 'text-aware multi-scale attention' is performed with the text feature (see the sketch below).
- The text feature used here is the context text feature $F_C$.
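A minimal sketch of this enhancement step, assuming adaptive average pooling for the downsampling and a single cross-attention layer with $F_C$ as the "text-aware" part; the attention design in the paper may be more elaborate.

```python
# Minimal sketch: downsample all scales to the last feature map, concatenate along channels,
# then attend to the context text feature F_C. Pooling and single-layer attention are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAwareMultiScaleSketch(nn.Module):
    def __init__(self, dims, out_dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        # Project the channel-concatenated multi-scale feature to a common dimension.
        self.proj = nn.Conv2d(sum(dims), out_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(out_dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, feats, F_C):
        # feats: list of (B, C_i, H_i, W_i); the last (coarsest) scale sets the target size.
        h, w = feats[-1].shape[-2:]
        pooled = [F.adaptive_avg_pool2d(x, (h, w)) for x in feats]
        fused = self.proj(torch.cat(pooled, dim=1))   # (B, out_dim, h, w)
        tokens = fused.flatten(2).transpose(1, 2)     # (B, h*w, out_dim)
        out, _ = self.attn(tokens, F_C, F_C)          # text-aware attention with F_C
        return (tokens + out).transpose(1, 2).reshape(fused.shape)
```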
Experiment
- Model: Swin Transformer pretrained on ImageNet-22K + BERT
- Dataset:
- RefSegRS: 60 epochs, lr 5e-5
- RRSIS-D: 40 epochs, lr 3e-5
- weight decay: 0.1
- HW: a single 4080 GPU with batch size 8 (a minimal optimizer setup sketch follows below)
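A minimal sketch of the training configuration above; the optimizer choice (AdamW) is an assumption, only the epochs, learning rates, and weight decay come from these notes.

```python
# Sketch of the per-dataset training setup; AdamW is assumed, not stated in the notes.
import torch

def build_optimizer(model: torch.nn.Module, dataset: str):
    cfg = {
        "RefSegRS": {"epochs": 60, "lr": 5e-5},
        "RRSIS-D":  {"epochs": 40, "lr": 3e-5},
    }[dataset]
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"], weight_decay=0.1)
    return optimizer, cfg["epochs"]
```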