
ECCV 2024 Accepted Paper

Link: https://arxiv.org/pdf/2407.04538

 

Summary

PDiscoFormer is a new unsupervised part discovery approach built on a vision transformer, focused on relaxing the strict geometric assumptions made by prior methods.

The main goal of the paper is to automatically identify discriminative parts that help the fine-grained image classification task.

Method

 

  • Obtain patch token features $\mathbf{z}$ by dropping the CLS token and register tokens from the ViT output
    • $\mathbf{z} = h_{\theta}(\mathbf{x}) \in \mathbb{R}^{D \times H \times W}$
  • Attention Maps: $\mathbf{A} \in [0,1]^{(K+1) \times H \times W}$
    • where the last channel in the first dimension is used to represent the background region in the image.
    • computed using the negative squared Euclidean distance between patch token features $\mathbf{z}_{ij} \in \mathbb{R}^D$, $i \in \{1,...,H\}$, $j \in \{1,...,W\}$, and each of the learnable prototypes $\mathbf{p}^{k} \in \mathbb{R}^D$, with $k \in \{1,...,K+1\}$
    • $a_{ij}^{k} = \frac{\exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{k} \|^2 + \gamma_k \right)}{\sum_{l=1}^{K+1} \exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{l} \|^2 + \gamma_l \right)}$
  • Compute part embedding vectors $v_k \in \mathbb{R}^D$
    • $\mathbf{v}^{k} = \frac{\sum_i \sum_j a_{ij}^{k} \mathbf{z}_{ij}}{HW}$
  • layer normalization
    • $\mathbf{v}_m^{k} = \frac{\mathbf{v}^{k} - \mathbb{E}[\mathbf{v}^{k}]}{\sqrt{\text{Var}[\mathbf{v}^{k}]} + \epsilon} \odot \mathbf{w}_m^{k} + \mathbf{b}_m^{k}$
  • obtain a vector of class scores conditioned on the part embedding (see the sketch after this list):
    • $y^k=W_c \cdot v^k_m$
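
Below is a minimal PyTorch sketch of the pipeline above, written only to make the tensor shapes concrete; it is not the authors' implementation. It assumes the ViT patch features, with CLS and register tokens already dropped, are given as a tensor of shape (B, D, H, W); names such as `PartDiscoveryHead`, `num_parts`, and `gamma` are assumptions made for this example.

```python
# Illustrative sketch of the part-assignment head described above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartDiscoveryHead(nn.Module):
    def __init__(self, dim: int, num_parts: int, num_classes: int):
        super().__init__()
        k = num_parts + 1  # K foreground parts + 1 background slot
        self.prototypes = nn.Parameter(torch.randn(k, dim))        # p^k
        self.gamma = nn.Parameter(torch.zeros(k))                  # per-part offset
        self.ln_weight = nn.Parameter(torch.ones(k, dim))          # w_m^k
        self.ln_bias = nn.Parameter(torch.zeros(k, dim))           # b_m^k
        self.classifier = nn.Linear(dim, num_classes, bias=False)  # W_c

    def forward(self, z: torch.Tensor):
        # z: (B, D, H, W) patch token features
        B, D, H, W = z.shape
        z_flat = z.flatten(2).transpose(1, 2)                   # (B, HW, D)
        # Attention maps from negative squared distances to the prototypes
        d2 = torch.cdist(z_flat, self.prototypes[None].expand(B, -1, -1)) ** 2
        a = (-d2 + self.gamma).softmax(dim=-1)                  # (B, HW, K+1)
        # Part embeddings v^k = sum_ij a_ij^k z_ij / (HW)
        v = torch.einsum("bnk,bnd->bkd", a, z_flat) / (H * W)   # (B, K+1, D)
        # Layer normalization with per-part scale and shift
        v_m = F.layer_norm(v, (D,)) * self.ln_weight + self.ln_bias
        y = self.classifier(v_m)                                 # (B, K+1, C)
        attn = a.transpose(1, 2).reshape(B, -1, H, W)            # A: (B, K+1, H, W)
        return attn, v_m, y
```

Using negative squared distances to learnable prototypes rather than dot products turns the softmax into a soft nearest-prototype assignment, so each patch competes only over which part it belongs to, with the last slot free to absorb the background.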

Losses

  • Orthogonality Loss:
    • encourages the learned part embedding vectors to be decorrelated from one another (Python sketches of the losses follow this list).
    • $\mathcal{L}_{\perp} = \sum_{k=1}^{K+1} \sum_{l \neq k} \frac{\mathbf{v}_m^k \cdot \mathbf{v}_m^l}{\|\mathbf{v}_m^k\| \cdot \|\mathbf{v}_m^l\|}$
  • Equivariance Loss:
    • We want to detect the same parts even if the image is translated, rotated, or scaled.
    • encourages the learned part attention maps to be equivariant to rigid transformations.
    • $\mathcal{L}_{\text{eq}} = 1 - \frac{1}{K} \sum_k \frac{\| A^k(\mathbf{x}) \cdot T^{-1}(A^k(T(\mathbf{x}))) \|}{\| A^k(\mathbf{x}) \| \cdot \| A^k(T(\mathbf{x})) \|}$
  • Presence Loss:
    • Encourages every discovered foreground part to be present in at least some images of the training dataset.
    • The background, the $(K+1)^{th}$ part, is instead expected to be present in all images of the dataset.
    • $\mathcal{L}_{p_1} = 1 - \frac{1}{K} \sum_k \max_{b,i,j} \bar{a}_{ij}^{k}(\mathbf{x}_b)$ where $\bar{A}^k(\mathbf{x}_b) = \text{avgpool}(A^k(\mathbf{x}_b))$
  • Stricter Presence Loss
    • Ensures the presence of the background part.
    • The background is expected in every image and, at the same time, is more likely to appear near the boundaries of the image.
    • $\mathcal{L}_{p_0} = -\frac{1}{B} \sum_b \log \left( \max_{i,j} m_{ij} \bar{a}_{ij}^{K+1}(\mathbf{x}_b) \right)$
    • $m_{ij} = 2 \left( \frac{i - 1}{H - 1} - \frac{1}{2} \right)^2+ 2 \left( \frac{j - 1}{W - 1} - \frac{1}{2} \right)^2$
      • where $M = [m_{ij}]^{H \times W}$ is a soft mask with $m_{ij} \in [0, 1]$ that privileges entries placed farther from the image center
  • Entropy Loss:
    • to ensure that each patch token is assigned to a unique part
    • $\mathcal{L}_{\text{ent}} = \frac{-1}{K+1} \sum_{k=1}^{K+1} \sum_{ij} a_{ij}^{k} \log (a_{ij}^{k})$
  • Total Variation Loss
    • Encourages each discovered part to be composed of one or a few connected components.
    • $\mathcal{L}_{\text{tv}} = \frac{1}{HW} \sum_{k=1}^{K+1} \sum_{ij} |\nabla a_{ij}^{k}|$ where $\nabla a_{ij}^{k}$ is the spatial image gradient of part map $A^{k}$ at location $ij$.
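
A hedged sketch of the orthogonality and equivariance losses. Here `v_m` is the (B, K+1, D) stack of part embeddings from the head above, and `attn_fn` is an assumed callable mapping an image batch to its (B, K+1, H, W) attention maps; a 90-degree rotation stands in for the rigid transform $T$ because its inverse is exact, while the paper allows more general rigid transformations.

```python
# Illustrative orthogonality and equivariance losses (not the authors' code).
import torch
import torch.nn.functional as F


def orthogonality_loss(v_m: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every pair of distinct part embeddings
    v = F.normalize(v_m, dim=-1)                     # (B, K+1, D)
    sim = v @ v.transpose(1, 2)                      # (B, K+1, K+1)
    off_diag = sim - torch.eye(v.size(1), device=v.device)  # drop self-similarity
    return off_diag.sum(dim=(1, 2)).mean()


def equivariance_loss(x: torch.Tensor, attn_fn) -> torch.Tensor:
    # T = rotate by 90 degrees; T^{-1} = rotate back
    a = attn_fn(x)                                   # A(x): (B, K+1, H, W)
    a_t = attn_fn(torch.rot90(x, 1, dims=(2, 3)))    # A(T(x))
    a_t_inv = torch.rot90(a_t, -1, dims=(2, 3))      # T^{-1}(A(T(x)))
    fg, fg_t = a[:, :-1], a_t_inv[:, :-1]            # foreground parts only
    cos = F.cosine_similarity(fg.flatten(2), fg_t.flatten(2), dim=-1)
    return 1.0 - cos.mean()
```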
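
The two presence losses can be sketched as follows. The average-pooling window size and the exact reductions are assumptions; the last attention channel is taken to be the background, as defined above, and the soft mask $m_{ij}$ follows the formula given in the list.

```python
# Illustrative presence losses over attention maps attn of shape (B, K+1, H, W).
import torch
import torch.nn.functional as F


def presence_loss(attn: torch.Tensor) -> torch.Tensor:
    # L_{p1}: every foreground part should reach a high (pooled) attention
    # value somewhere in the batch.
    pooled = F.avg_pool2d(attn[:, :-1], kernel_size=3, stride=1, padding=1)
    max_per_part = pooled.amax(dim=(0, 2, 3))        # max over batch and space
    return 1.0 - max_per_part.mean()


def background_presence_loss(attn: torch.Tensor) -> torch.Tensor:
    # L_{p0}: the background must appear in *every* image, preferably near the
    # boundary (soft mask m_ij: 0 at the image center, 1 at the corners).
    B, _, H, W = attn.shape
    i = torch.linspace(0.0, 1.0, H, device=attn.device).view(H, 1)
    j = torch.linspace(0.0, 1.0, W, device=attn.device).view(1, W)
    m = 2 * (i - 0.5) ** 2 + 2 * (j - 0.5) ** 2       # (H, W), values in [0, 1]
    pooled_bg = F.avg_pool2d(attn[:, -1:], 3, stride=1, padding=1)[:, 0]
    return -(torch.log((m * pooled_bg).amax(dim=(1, 2)))).mean()
```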
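
Finally, the entropy and total-variation terms reduce to a few tensor operations; the normalization constants below are illustrative and differ slightly from the per-part sums written above.

```python
# Illustrative entropy and total-variation losses over attn: (B, K+1, H, W).
import torch


def entropy_loss(attn: torch.Tensor) -> torch.Tensor:
    # Pushes every patch toward a hard assignment to a single part.
    eps = 1e-8
    ent = -(attn * torch.log(attn + eps)).sum(dim=1)   # per-pixel entropy
    return ent.mean()


def total_variation_loss(attn: torch.Tensor) -> torch.Tensor:
    # Penalizes spatial gradients so each part forms a few connected blobs.
    dh = (attn[:, :, 1:, :] - attn[:, :, :-1, :]).abs().mean()
    dw = (attn[:, :, :, 1:] - attn[:, :, :, :-1]).abs().mean()
    return dh + dw
```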

Experimental Results

 

