Link: https://arxiv.org/pdf/2304.06446
tl;dr: naive idea, good experiments; interesting, but no real contribution
Summary
- Hypothesize that both spectral layers and multi-headed attention play a major role.
- Combining spectral and multi-headed attention layers provides a better transformer architecture.
Introduction or Motivation
- The Fourier domain plays a major role in frequency-based analysis of image information
- Hypothesize that, for the image domain, both spectral layers and multi-headed self-attention play an important role.
- Motivated by prior work on spectral transformers and hierarchical transformers
- SpectFormer
- uses spectral layers implemented with Fourier Transform to capture relevant features in the initial layers of the architecture.
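A minimal sketch of this layer arrangement, in PyTorch-style pseudostructure; `spectformer_blocks`, `num_spectral`, and the factory arguments are my names for illustration, not the paper's code:

```python
import torch.nn as nn

def spectformer_blocks(depth, num_spectral, make_spectral, make_attention):
    # Hypothetical helper: spectral mixing in the first `num_spectral`
    # blocks, multi-headed self-attention in the remaining ones.
    blocks = [make_spectral() if i < num_spectral else make_attention()
              for i in range(depth)]
    return nn.Sequential(*blocks)
```

For example, depth=12 with num_spectral=4 would give four spectral blocks followed by eight attention blocks.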
Method
Spectral Block
- Objective: capture the different frequency components of the image to comprehend localized frequencies.
- Can be achieved with a spectral gating network that comprises a Fast Fourier Transform (FFT) layer
- The spectral layer converts physical space into the spectral space using FFT
- Use learnable weights for each frequency component to accurately capture image lines and edges
- Use FFT and iFFT: physical space → spectral space → physical space (see the sketch after this list)
- Wavelet and inverse wavelet transforms could be considered as an alternative
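Below is a minimal sketch of this spectral gating idea in PyTorch. It reflects my reading of the description above, not the authors' code; the token layout (batch, height, width, channels), the use of a real-valued FFT, and the weight initialization are all assumptions:

```python
import torch
import torch.nn as nn

class SpectralGating(nn.Module):
    """Sketch: FFT -> learnable per-frequency weights -> iFFT."""

    def __init__(self, h, w, dim):
        super().__init__()
        # rfft2 keeps w // 2 + 1 frequencies along the width axis; the
        # trailing dimension of 2 stores the real/imaginary parts of the
        # learnable complex weight for each frequency component.
        self.weight = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):
        # x: (batch, h, w, dim) tokens in physical space
        freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")  # -> spectral space
        freq = freq * torch.view_as_complex(self.weight)     # gate each frequency
        return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2),
                                norm="ortho")                # -> physical space

# Example: a 14x14 token grid with 384 channels keeps its shape.
gate = SpectralGating(h=14, w=14, dim=384)
out = gate(torch.randn(2, 14, 14, 384))  # (2, 14, 14, 384)
```

The learnable weights act as a per-frequency filter, which is what lets the layer emphasize components such as lines and edges.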
Experiment
Overall experiments
- Describes in detail where and how many Spectral Blocks should be used
- The ablation over block ordering is also done well
- The comparisons are not strictly one-to-one, but many are provided.
- However, the overall idea and the proposed method are thin
- In the first figure, an elephant and a zebra appear together; framed as an image classification problem, the model has to decide between zebra and elephant, so the fact that SpectFormer assigns high probability to the zebra is not intuitive.
- The figure therefore seems poorly chosen.