Link: https://arxiv.org/pdf/2304.06446

 

tl;dr: naive idea, good experiments, interesting but no contribution

Summary

  • Hypothesizes that both spectral layers and multi-headed attention play major roles.
  • Combining spectral and multi-headed attention layers provides a better transformer architecture.

Introduction or Motivation

  • The Fourier domain plays a major role in frequency-based analysis of image information.
  • Hypothesizes that, for the image domain, both spectral layers and multi-headed self-attention play important roles.
  • Motivated by prior work on spectral and hierarchical transformers.
  • SpectFormer
    • uses spectral layers implemented with the Fourier transform to capture relevant features in the initial layers of the architecture (see the sketch after this list).
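
A minimal sketch of the layer arrangement this describes, assuming a stack of depth blocks whose first alpha layers are spectral and whose remainder are multi-headed attention. The function name, the single alpha split point, and the spectral_block/attention_block factories are illustrative assumptions, not the paper's code:

```python
import torch.nn as nn

def build_spectformer_stack(depth: int, alpha: int,
                            spectral_block, attention_block) -> nn.Sequential:
    # Hypothetical layout: spectral blocks occupy the initial `alpha`
    # layers; standard multi-headed attention blocks fill the rest.
    layers = [spectral_block() for _ in range(alpha)]
    layers += [attention_block() for _ in range(depth - alpha)]
    return nn.Sequential(*layers)
```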

Method

Spectral Block

  • Objective: capture the different frequency components of the image to comprehend localized frequencies.
  • Achieved with a spectral gating network comprising a Fast Fourier Transform (FFT) layer.
  • The spectral layer converts physical space into the spectral space using FFT
  • Use learnable weights for each frequency component to accurately capture image lines and edges
  • Uses FFT and iFFT: physical space → spectral space → physical space (see the sketch below).
  • A wavelet transform and its inverse could be considered as an alternative.
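
The FFT → learnable gate → iFFT pipeline can be sketched as follows. This is a minimal hypothetical PyTorch implementation assuming ViT-style token inputs of shape (B, N, C) and a real FFT over the token axis; the class name and shapes are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class SpectralGatingBlock(nn.Module):
    """Sketch of a spectral gating layer: FFT -> learnable
    per-frequency complex weights -> inverse FFT."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # rfft over N tokens yields num_tokens // 2 + 1 frequency bins;
        # one complex weight (stored as two reals) per bin and channel.
        w = torch.zeros(num_tokens // 2 + 1, dim, 2)
        w[..., 0] = 1.0  # identity gate at init: real part 1, imaginary 0
        self.weight = nn.Parameter(w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_tokens = x.shape[1]
        # Physical space -> spectral space.
        x_freq = torch.fft.rfft(x, dim=1, norm="ortho")
        # Gate each frequency component, letting the network learn to
        # emphasize e.g. the high frequencies that carry lines and edges.
        x_freq = x_freq * torch.view_as_complex(self.weight)
        # Spectral space -> physical space.
        return torch.fft.irfft(x_freq, n=n_tokens, dim=1, norm="ortho")
```

For example, with 14×14 patch tokens, SpectralGatingBlock(num_tokens=196, dim=64) maps a (2, 196, 64) input back to (2, 196, 64); at initialization the block acts as an identity, and training shapes the per-frequency gate.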

Experiment

Overall Experiments

  • Describes in detail where and how many spectral blocks should be used.
  • The ablation over the ordering of spectral and attention layers is also thorough.
  • The comparisons are not strictly one-to-one, but many are provided.
  • However, the overall idea and the proposed method are thin.

  • In the first figure, an elephant and a zebra appear together; framed as an image classification problem, the model must choose between the zebra and the elephant, so it is not intuitive that SpectFormer assigns high probability to the zebra.
  • This seems to be a poorly chosen figure.
