
Scattering Vision Transformer: Spectral Mixing Matters

category AI 2024. 11. 6. 03:24

NeurIPS 2023 Accepted Paper

Link: https://github.com/badripatro/svt

 

Summary

  • Many Vision Transformer variants are being actively studied.
  • One challenge for vision transformers, however, is that the computational complexity of the self-attention module increases as the sequence length or image resolution grows.

Introduction or Motivation

  • Fourier-Based Transformers
    • Purpose: Minimize the loss of information by using Fourier Transform.
    • Examples:
      • FourierFormer
      • FNet
      • GFNet
      • AFNO
    • Inherent Problem:
      • Difficulty in separating low and high-frequency components.
  • Proposed Solution: Scattering Vision Transformer (SVT)
    • Components:
      • Spectral Scattering Network:
        • Addresses attention complexity.
      • Dual-Tree Complex Wavelet Transform (DTCWT):
        • Captures fine-grained information.
        • Performs spectral decomposition into low-frequency and high-frequency components of an image.
  • Frequency Component Handling in SVT
    • High-Frequency Component:
      • Captures fine-grained information from the scattering network using DTCWT.
      • Method: Einstein Blending Method (EBM).
    • Low-Frequency Component:
      • Represents the energy component of the signal.
      • Method: Tensor Blending Method (TBM).
  • Spectral Gating Network (SGN)
    • Function: Captures effective features in both low and high-frequency components.
  • Contributions of SVT
    • Utilizes TBM for low-frequency components.
    • Utilizes EBM for high-frequency components.
  • Characteristics of Frequency Components
    • Low-Frequency Components:
      • Contain the energy component of the signal.
      • Require all frequency components to provide energy compaction.
    • High-Frequency Components:
      • Can be represented by only a few components.
      • Achieved using Einstein multiplication.

Method

Discrete Wavelet Transform (DWT)

  • $x(t) = \sum_{n=-\infty}^{\infty} c(n) \varphi(t - n) + \sum_{j=0}^{\infty} \sum_{n=-\infty}^{\infty} d(j,n) 2^{j/2} \psi(2^j t - n)$
    • $\varphi(t)$: low-pass scaling function
    • $\psi(t)$: band-pass wavelet function (used in shifted and dilated versions)
  • $c(n) = \int_{-\infty}^{\infty} x(t) \varphi(t - n) \, dt, \quad d(j,n) = 2^{j/2} \int_{-\infty}^{\infty} x(t) \psi(2^j t - n) \, dt$
    • $x(t)$: input
    • $c(n)$: scaling coefficient
    • $d(j,n)$: wavelet coefficient
  • Weaknesses:
    • oscillations
    • shift variance
    • aliasing
    • lack of directionality
  • Complex Wavelet Transform (CWT)
    • Addresses weaknesses of the DWT by using complex-valued scaling and wavelet functions.
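
The shift-variance weakness listed above can be seen in a few lines. This is a minimal NumPy sketch that assumes a single-level orthonormal Haar wavelet, which the post does not specify; `haar_dwt` and `haar_idwt` are illustrative helper names, not functions from the SVT repository.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT for an even-length 1-D signal."""
    x = np.asarray(x, dtype=float)
    c = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # scaling (low-pass) coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # wavelet (high-pass) coefficients
    return c, d

def haar_idwt(c, d):
    """Invert haar_dwt by rebuilding the even and odd samples."""
    x = np.empty(2 * len(c))
    x[0::2] = (c + d) / np.sqrt(2.0)
    x[1::2] = (c - d) / np.sqrt(2.0)
    return x

# A step edge, and the same edge moved by one sample.
step = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
shifted = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)

c_step, d_step = haar_dwt(step)
c_shift, d_shift = haar_dwt(shifted)

# Energy in the wavelet coefficients changes although only the position moved.
print("detail energy (edge on pair boundary):", np.sum(d_step ** 2))
print("detail energy (edge inside a pair):   ", np.sum(d_shift ** 2))
```

A one-sample shift moves the edge from a pair boundary to the inside of a pair, so the detail coefficients go from all zeros to a single nonzero value. This sensitivity to position is what the complex-valued transforms aim to reduce.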

Dual-Tree Complex Wavelet Transform (DTCWT)

  • Has properties very similar to those of the Fourier Transform.
    • Smooth and non-oscillating magnitude
    • Nearly shift-invariant magnitude with a simple near-linear phase encoding of signal shifts
    • Substantially reduced aliasing
    • Better directional selectivity in higher dimensions
  • Real Tree
    • The first wavelet tree; it performs an ordinary real-valued wavelet transform.
  • Imaginary Tree
    • The second wavelet tree; it uses filters with a slight phase offset from those of the Real Tree so that the two trees together produce complex wavelets.

Equations

  • $g_0(n) \approx h_0(n - 0.5)$
    • Phase relationship between the filters.
    • The low-pass filter $g_0(n)$ of the Imaginary Tree is approximately the low-pass filter $h_0(n)$ of the Real Tree delayed by half a sample.
    • This keeps the two trees 90 degrees apart in phase, so that their outputs together form complex wavelet coefficients.
  • $\psi_g(t) \approx \mathcal{H}\{\psi_h(t)\}$
    • The wavelet function $\psi_g(t)$ of the Imaginary Tree is approximately the Hilbert transform $\mathcal{H}\{\psi_h(t)\}$ of the wavelet function $\psi_h(t)$ of the Real Tree.
  • $\psi_h(t) = \sqrt{2} \sum_n h_1(n) \varphi_h(2t - n)$
    • The wavelet function $\psi_h(t)$ of the Real Tree is built from the high-pass filter $h_1(n)$ and the scaling function $\varphi_h(t)$.
  • $\varphi_h(t) = \sqrt{2} \sum_n h_0(n) \varphi_h(2t - n)$
    • The scaling function $\varphi_h(t)$ of the Real Tree is built from the low-pass filter $h_0(n)$ and time-compressed (downsampled), shifted copies of itself.
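
The Hilbert-pair relation can be checked numerically. This is a rough NumPy sketch, not the DTCWT filter bank itself: it assumes a toy Gaussian-windowed cosine as the real wavelet $\psi_h$, and `hilbert_transform` is an illustrative helper built on the FFT.

```python
import numpy as np

def hilbert_transform(x):
    """Hilbert transform of a real 1-D signal via the FFT (-j * sign(f) multiplier)."""
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x))
    return np.real(np.fft.ifft(-1j * np.sign(freqs) * spectrum))

t = np.linspace(-8.0, 8.0, 1024, endpoint=False)
envelope = np.exp(-t ** 2 / 2.0)
psi_h = envelope * np.cos(5.0 * t)   # toy real band-pass wavelet (assumption)
psi_g = hilbert_transform(psi_h)     # its Hilbert partner
psi_c = psi_h + 1j * psi_g           # complex wavelet formed from the pair

# The complex wavelet has (almost) no energy at negative frequencies.
freqs = np.fft.fftfreq(len(t))
power = np.abs(np.fft.fft(psi_c)) ** 2
neg_energy = power[freqs < 0].sum()
pos_energy = power[freqs > 0].sum()
print("negative / positive frequency energy:", neg_energy / pos_energy)

# Its magnitude follows the smooth envelope instead of oscillating.
print("max |magnitude - envelope|:", np.max(np.abs(np.abs(psi_c) - envelope)))
```

The real part oscillates, but the magnitude of the complex wavelet tracks the envelope. This matches the smooth, non-oscillating magnitude listed among the DTCWT properties.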

Scattering Transformation

  • Via DTCWT
  • Fine-grain information:
    • Consists of texture, patterns, and small features.
    • Encoded by the high-frequency components of the spectral transform.
  • Global information:
    • Consists of overall brightness, contrast, edges, and contours.
    • Encoded by the low-frequency components of the spectral transform.
  • Frequency representations: $\mathbf{X_F} = \mathcal{F}_{\text{scatter}}(\mathbf{X}) = \mathrm{DTCWT}(\mathbf{X})$
  • $\mathbf{X_F}(u, v) = \mathbf{X_\varphi}(u, v) + \mathbf{X_\psi}(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} c_{M,h,w} \varphi_{M,h,w} + \sum_{m=0}^{M-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \sum_{k=1}^{6} d_{m,h,w}^{k} \psi_{m,h,w}^{k}$
    • $\mathbf{X_\varphi}(u, v)$: signal reconstructed from the low-frequency components (approximation coefficients)
    • $\mathbf{X_\psi}(u, v)$: signal reconstructed from the high-frequency components (detail coefficients)
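
The split $\mathbf{X_F} = \mathbf{X_\varphi} + \mathbf{X_\psi}$ can be illustrated with a much simpler transform. The sketch below deliberately swaps the DTCWT for a single-level 2-D Haar transform (three detail bands instead of six oriented complex ones), only to show that the low-frequency and high-frequency reconstructions add back up to the input; `haar2d` and `ihaar2d` are hypothetical helper names.

```python
import numpy as np

def haar2d(x):
    """One level of the 2-D orthonormal Haar transform on an even-sized image."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0                  # approximation coefficients
    high = np.stack([(a - b + c - d) / 2.0,      # column-difference detail
                     (a + b - c - d) / 2.0,      # row-difference detail
                     (a - b - c + d) / 2.0])     # diagonal detail
    return low, high

def ihaar2d(low, high):
    """Invert haar2d."""
    h1, h2, h3 = high
    x = np.empty((2 * low.shape[0], 2 * low.shape[1]))
    x[0::2, 0::2] = (low + h1 + h2 + h3) / 2.0
    x[0::2, 1::2] = (low - h1 + h2 - h3) / 2.0
    x[1::2, 0::2] = (low + h1 - h2 - h3) / 2.0
    x[1::2, 1::2] = (low - h1 - h2 + h3) / 2.0
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
low, high = haar2d(x)

x_phi = ihaar2d(low, np.zeros_like(high))   # rebuilt from approximation only
x_psi = ihaar2d(np.zeros_like(low), high)   # rebuilt from details only

print("max |x - (x_phi + x_psi)|:", np.max(np.abs(x - (x_phi + x_psi))))
```

Here `x_phi` is a blocky, smoothed copy of the image (each 2x2 block replaced by its mean), while `x_psi` holds what is left: the local variations within each block.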

Spectral Gating Network

  • Extracts spectral features from both the low- and high-frequency components of the scattering transform.
  • Uses learnable weight parameters to blend each frequency component.
  • Tensor Blending Method (TBM) for low-frequency
    • Blends the low-frequency component $\mathbf{X_\varphi}$ with a learnable weight $\mathbf{W_\varphi}$ using element-wise tensor multiplication (Hadamard multiplication).
    • Captures the global information of the image (e.g., brightness, contrast, edges).
    • $\mathcal{M_\varphi} = [\mathbf{X_\varphi} \odot \mathbf{W_\varphi}], \quad \text{where } (\mathbf{X_\varphi}, \mathbf{W_\varphi}) \in \mathbb{R}^{C \times H \times W}, \text{ and } \mathcal{M_\varphi} \in \mathbb{R}^{C \times H \times W}$
  • Einstein Blending Method (EBM) for high-frequency
    • Captures the fine-grained information of the image (e.g., textures and small details).
    • Keeps the parameter count and computational cost under control.
    • EBM steps:
      • Reshape the tensor $A$ from $\mathbb{R}^{H \times W \times C}$ to $\mathbb{R}^{H \times W \times C_b \times C_d}$.
        • Here $C = C_b \times C_d$ and $C_b \gg C_d$.
    • Define the weight matrix:
      • Define a weight matrix $W$ of size $\mathbb{R}^{C_b \times C_d \times C_d}$.
    • Perform Einstein multiplication:
      • Multiply $A$ and $W$ along the last two dimensions to produce the blended feature tensor $Y$.
      • The result $Y$ has shape $\mathbb{R}^{H \times W \times C_b \times C_d}$.
    • EBM formula:
      • $\mathbf{Y}^{H \times W \times C_b \times C_d} = \mathbf{A}^{H \times W \times C_b \times C_d} \ast \mathbf{W}^{C_b \times C_d \times C_d}$
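
The two blending rules above can be sketched with NumPy. This is an illustration under assumptions, not the authors' implementation: the sizes are made up, the weights are random rather than learned, and real values stand in for the complex high-frequency coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4
C_b, C_d = 8, 2            # hypothetical block count and block size
C = C_b * C_d

# TBM: element-wise (Hadamard) blend; both tensors live in R^{C x H x W}.
x_phi = rng.standard_normal((C, H, W))
w_phi = rng.standard_normal((C, H, W))
m_phi = x_phi * w_phi

# EBM: reshape A from (H, W, C) to (H, W, C_b, C_d), then mix inside each block.
a = rng.standard_normal((H, W, C)).reshape(H, W, C_b, C_d)
w = rng.standard_normal((C_b, C_d, C_d))
y = np.einsum("hwbd,bde->hwbe", a, w)   # contract the C_d axis, block by block

print("TBM output shape:", m_phi.shape)
print("EBM output shape:", y.shape)
print("EBM weights:", w.size, "vs dense C x C mixing:", C * C)
```

Each of the $C_b$ blocks gets its own small $C_d \times C_d$ matrix, so the weight count is $C_b \cdot C_d^2 = C \cdot C_d$ rather than the $C^2$ of a dense channel-mixing matrix. This is where the control over parameter count comes from when $C_d$ is small.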

Spectral Channel and Token Mixing

  • Spectral Channel Mixing applies EBM along the spectral channels.
    • $\mathbf{S_{\psi_c}}^{2k \times H \times W \times C_b \times C_d} = \mathbf{X_{\psi}}^{2k \times H \times W \times C_b \times C_d} \ast \mathbf{W_{\psi_c}}^{C_b \times C_d \times C_d} + b_{\psi_c}$
  • Spectral Token Mixing applies EBM at the token level.
    • $\mathbf{S_{\psi_t}}^{2k \times C \times W \times H} = \mathbf{S_{\psi_c}}^{2k \times C \times W \times H} \ast \mathbf{W_{\psi_t}}^{W \times H \times H} + b_{\psi_t}$
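
Chaining the two mixing steps might look like the following NumPy sketch. The tensor shapes follow the formulas above; the bias shapes, the random weights, the axis rearrangement between the two steps, and the use of real instead of complex values are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 6, 4, 4          # K stands for the leading 2k axis; value is illustrative
C_b, C_d = 4, 2
C = C_b * C_d

x_psi = rng.standard_normal((K, H, W, C_b, C_d))

# Spectral channel mixing: contract the C_d axis within each channel block.
w_c = rng.standard_normal((C_b, C_d, C_d))
b_c = rng.standard_normal((C_b, C_d))       # assumed bias shape
s_c = np.einsum("khwbd,bde->khwbe", x_psi, w_c) + b_c

# Rearrange (K, H, W, C_b, C_d) -> (K, C, W, H) for token mixing.
s_c_tok = s_c.reshape(K, H, W, C).transpose(0, 3, 2, 1)

# Spectral token mixing: contract the H axis, one H x H matrix per W position.
w_t = rng.standard_normal((W, H, H))
b_t = rng.standard_normal((W, H))           # assumed bias shape
s_t = np.einsum("kcwh,whg->kcwg", s_c_tok, w_t) + b_t

print("after channel mixing:", s_c.shape)
print("after token mixing:  ", s_t.shape)
```

Channel mixing combines features across channels at each spatial position, while token mixing combines values across spatial positions for each channel. Both reuse the same blocked Einstein multiplication.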

Experiment

