Link: https://arxiv.org/pdf/2304.06446

 

tl;dr: naive idea, good experiments, interesting but no contribution

Summary

  • Hypothesizes that both spectral layers and multi-headed attention play major roles.
  • Combining spectral and multi-headed attention layers provides a better transformer architecture.

Introduction or Motivation

  • The Fourier domain plays a major role in frequency-based analysis of image information.
  • Hypothesizes that, for the image domain, both spectral layers and multi-headed self-attention play important roles.
  • Motivated by prior work on spectral and hierarchical transformers.
  • SpectFormer
    • uses spectral layers implemented with the Fourier transform to capture relevant features in the initial layers of the architecture (see the sketch after this list).
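
A minimal sketch of the layer arrangement this describes, assuming a stack of depth blocks whose first alpha layers are spectral and whose remainder are multi-headed attention. The function name, the single alpha split point, and the spectral_block/attention_block factories are illustrative assumptions, not the paper's code:

```python
import torch.nn as nn

def build_spectformer_stack(depth: int, alpha: int,
                            spectral_block, attention_block) -> nn.Sequential:
    # Hypothetical layout: spectral blocks occupy the initial `alpha`
    # layers; standard multi-headed attention blocks fill the rest.
    layers = [spectral_block() for _ in range(alpha)]
    layers += [attention_block() for _ in range(depth - alpha)]
    return nn.Sequential(*layers)
```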

Method

Spectral Block

  • Objective: capture the different frequency components of the image to comprehend localized frequencies.
  • Achieved with a spectral gating network comprising a Fast Fourier Transform (FFT) layer.
  • The spectral layer converts physical space into the spectral space using FFT
  • Use learnable weights for each frequency component to accurately capture image lines and edges
  • Uses FFT and iFFT: physical space → spectral space → physical space (see the sketch below).
  • A wavelet transform and its inverse could be considered as an alternative.
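
The FFT → learnable gate → iFFT pipeline can be sketched as follows. This is a minimal hypothetical PyTorch implementation assuming ViT-style token inputs of shape (B, N, C) and a real FFT over the token axis; the class name and shapes are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class SpectralGatingBlock(nn.Module):
    """Sketch of a spectral gating layer: FFT -> learnable
    per-frequency complex weights -> inverse FFT."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # rfft over N tokens yields num_tokens // 2 + 1 frequency bins;
        # one complex weight (stored as two reals) per bin and channel.
        w = torch.zeros(num_tokens // 2 + 1, dim, 2)
        w[..., 0] = 1.0  # identity gate at init: real part 1, imaginary 0
        self.weight = nn.Parameter(w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_tokens = x.shape[1]
        # Physical space -> spectral space.
        x_freq = torch.fft.rfft(x, dim=1, norm="ortho")
        # Gate each frequency component, letting the network learn to
        # emphasize e.g. the high frequencies that carry lines and edges.
        x_freq = x_freq * torch.view_as_complex(self.weight)
        # Spectral space -> physical space.
        return torch.fft.irfft(x_freq, n=n_tokens, dim=1, norm="ortho")
```

For example, with 14×14 patch tokens, SpectralGatingBlock(num_tokens=196, dim=64) maps a (2, 196, 64) input back to (2, 196, 64); at initialization the block acts as an identity, and training shapes the per-frequency gate.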

Experiment

Overall Experiments

  • Describes in detail where and how many spectral blocks should be used.
  • The ablation over the ordering of spectral and attention layers is also thorough.
  • The comparisons are not strictly one-to-one, but many are provided.
  • However, the overall idea and the proposed method are thin.

  • In the first figure, an elephant and a zebra appear together; framed as an image classification problem, the model must choose between the zebra and the elephant, so it is not intuitive that SpectFormer assigns high probability to the zebra.
  • This seems to be a poorly chosen figure.
