ECCV2024 Accepted Paper
Link: https://arxiv.org/pdf/2407.04538
Summary
PDiscoFormer is a new unsupervised part-discovery approach built on vision transformers, focused on relaxing the rigid geometric priors assumed by prior work.
The main goal of the paper is to automatically identify discriminative parts that are useful for the fine-grained image classification task.
Method

- Obtain patch token features $\mathbf{z}$ by dropping the CLS token and register tokens from the ViT output
- $\mathbf{z} = h_{\theta}(\mathbf{x}) \in \mathbb{R}^{D \times H \times W}$
- Attention Maps: $\mathbf{A} \in [0,1]^{(K+1) \times H \times W}$
- where the last channel in the first dimension is used to represent the background region in the image.
- computed using the negative squared Euclidean distance between patch token features $\mathbf{z}_{ij} \in \mathbb{R}^D$, $i \in \{1,...,H\}$, $j \in \{1,...,W\}$ and each of the learnable prototypes $\mathbf{p}^k \in \mathbb{R}^D$, with $k \in \{1,...,K+1\}$
- $a_{ij}^{k} = \frac{\exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{k} \|^2 + \gamma_k \right)}{\sum_{l=1}^{K+1} \exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{l} \|^2 + \gamma_l \right)}$
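The attention-map computation above is just a softmax over negative squared prototype distances. A minimal NumPy sketch (the function name and the learnable-parameter shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def part_attention(z, prototypes, gamma):
    """Softmax over negative squared Euclidean distances to prototypes.

    z:          (D, H, W) patch token features
    prototypes: (K+1, D) learnable part prototypes (last one = background)
    gamma:      (K+1,) learnable per-part bias
    Returns A:  (K+1, H, W) attention maps that sum to 1 over the parts axis.
    """
    D, H, W = z.shape
    zf = z.reshape(D, H * W).T                                      # (HW, D)
    # squared distance between every patch token and every prototype
    d2 = ((zf[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)   # (HW, K+1)
    logits = -d2 + gamma
    logits -= logits.max(axis=1, keepdims=True)                     # numerical stability
    e = np.exp(logits)
    a = e / e.sum(axis=1, keepdims=True)
    return a.T.reshape(-1, H, W)
```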
- Compute part embedding vectors $v_k \in \mathbb{R}^D$
- $\mathbf{v}^{k} = \frac{\sum_i \sum_j a_{ij}^{k} \mathbf{z}_{ij}}{HW}$
- layer normalization
- $\mathbf{v}_m^{k} = \frac{\mathbf{v}^{k} - \mathbb{E}[\mathbf{v}^{k}]}{\sqrt{\text{Var}[\mathbf{v}^{k}] + \epsilon}} \odot \mathbf{w}_m^{k} + \mathbf{b}_m^{k}$
- obtain a vector of class scores conditioned on the part embedding:
- $y^k=W_c \cdot v^k_m$
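The three steps above (attention-weighted pooling, per-part layer norm, linear classifier) can be sketched in NumPy; the function names and the exact parameter shapes are assumptions for illustration:

```python
import numpy as np

def part_embeddings(z, A):
    # v^k = (1 / HW) * sum_ij a_ij^k * z_ij
    D, H, W = z.shape
    return np.einsum('khw,dhw->kd', A, z) / (H * W)       # (K+1, D)

def layer_norm(v, w, b, eps=1e-5):
    # normalize each part embedding over its feature dimension,
    # then apply the learnable affine parameters w, b
    mu = v.mean(-1, keepdims=True)
    var = v.var(-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps) * w + b

def class_scores(v_m, W_c):
    # one class-score vector per part embedding: y^k = W_c · v_m^k
    return v_m @ W_c.T                                     # (K+1, C)
```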
Losses
- Orthogonality Loss:
- encourages the learned part embedding vectors to be decorrelated from one another.
- $\mathcal{L}_{\perp} = \sum_{k=1}^{K+1} \sum_{l \neq k} \frac{\mathbf{v}_m^k \cdot \mathbf{v}_m^l}{\|\mathbf{v}_m^k\| \cdot \|\mathbf{v}_m^l\|}$
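This is the sum of pairwise cosine similarities over distinct part embeddings; a minimal sketch (function name assumed):

```python
import numpy as np

def orthogonality_loss(V):
    # V: (K+1, D) part embeddings; penalize off-diagonal cosine similarity
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    C = Vn @ Vn.T                      # cosine-similarity matrix
    return C.sum() - np.trace(C)       # drop the k == l diagonal terms
```

For perfectly orthogonal embeddings the loss is zero, so minimizing it pushes parts toward decorrelated representations.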
- Equivariance Loss:
- want to detect the same parts even if the image is translated, rotated, or scaled.
- encourages the learned part attention maps to be equivariant to rigid transformations.
- $\mathcal{L}_{\text{eq}} = 1 - \frac{1}{K} \sum_k \frac{\langle A^k(\mathbf{x}),\ T^{-1}(A^k(T(\mathbf{x}))) \rangle}{\| A^k(\mathbf{x}) \| \cdot \| T^{-1}(A^k(T(\mathbf{x}))) \|}$
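In other words: transform the image, run the model, undo the transform on the attention maps, and ask for high cosine similarity with the original maps. A sketch using a horizontal flip as the rigid transform $T$ (the transform choice and function names here are illustrative assumptions):

```python
import numpy as np

def equivariance_loss(A, A_t, inverse_transform):
    """A:   (K, H, W) foreground attention maps of the original image
    A_t:   (K, H, W) attention maps of the transformed image
    inverse_transform: applies T^{-1} to a single (H, W) map."""
    K = A.shape[0]
    s = 0.0
    for k in range(K):
        a = A[k].ravel()
        b = np.ravel(inverse_transform(A_t[k]))
        s += float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - s / K
```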
- Presence Loss:
- Encourages every discovered foreground part to be present in at least some images of the training dataset,
- while the background, the $(K+1)^{th}$ part, is expected to be present in all images in the dataset.
- $\mathcal{L}_{p_1} = 1 - \frac{1}{K} \sum_k \max_{b,i,j} \bar{a}_{ij}^{k}(\mathbf{x}_b)$ where $\bar{A}^k(\mathbf{x}_b) = \text{avgpool}(A^k(\mathbf{x}_b))$
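Concretely: average-pool each foreground map, take the max over the batch and all spatial locations, and penalize parts whose best activation anywhere is low. A sketch, assuming a 2×2 average-pool kernel (the kernel size and function name are assumptions, not taken from the paper):

```python
import numpy as np

def presence_loss(A_fg, pool=2):
    # A_fg: (B, K, H, W) foreground attention maps over a batch;
    # H and W are assumed divisible by the pool size
    B, K, H, W = A_fg.shape
    ap = A_fg.reshape(B, K, H // pool, pool, W // pool, pool).mean(axis=(3, 5))
    # max over batch and spatial dims, averaged over the K foreground parts
    return 1.0 - ap.max(axis=(0, 2, 3)).mean()
```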
- Stricter Presence Loss
- To ensure the presence of the background
- the background is at the same time expected in every image and is more likely to appear near the boundaries of the image.
- $\mathcal{L}_{p_0} = -\frac{1}{B} \sum_b \log \left( \max_{i,j} m_{ij}\, \bar{a}_{ij}^{K+1}(\mathbf{x}_b) \right)$
- $m_{ij} = 2 \left( \frac{i - 1}{H - 1} - \frac{1}{2} \right)^2+ 2 \left( \frac{j - 1}{W - 1} - \frac{1}{2} \right)^2$
- where $M = [m_{ij}]^{H \times W}$ is a soft mask with $m_{ij} \in [0, 1]$ that gives higher weight to entries farther from the image center
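The mask $M$ is zero at the center and reaches 1 at the corners, so the log-max objective rewards strong background activation near the image boundary. A sketch using 0-based indices (the equation's $i-1$, $j-1$ assume 1-based indexing); function names are assumptions:

```python
import numpy as np

def border_mask(H, W):
    # m_ij = 2*(i/(H-1) - 1/2)^2 + 2*(j/(W-1) - 1/2)^2, 0-based indices
    i = np.arange(H)[:, None] / (H - 1)
    j = np.arange(W)[None, :] / (W - 1)
    return 2 * (i - 0.5) ** 2 + 2 * (j - 0.5) ** 2

def background_presence_loss(A_bg, eps=1e-8):
    # A_bg: (B, H, W) pooled background attention maps (the (K+1)-th part)
    B, H, W = A_bg.shape
    m = border_mask(H, W)
    # -log of the best masked background activation, averaged over the batch
    return -np.log((m * A_bg).reshape(B, -1).max(axis=1) + eps).mean()
```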
- Entropy Loss:
- to ensure that each patch token is assigned to a unique part
- $\mathcal{L}_{\text{ent}} = -\frac{1}{K+1} \sum_{k=1}^{K+1} \sum_{ij} a_{ij}^{k} \log a_{ij}^{k}$
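Minimizing the entropy of the per-patch part distribution pushes each patch's assignment toward one-hot. A minimal sketch (the small `eps` for numerical safety is an implementation assumption):

```python
import numpy as np

def entropy_loss(A, eps=1e-8):
    # A: (K+1, H, W) part-assignment probabilities per patch location
    return float(-(A * np.log(A + eps)).sum()) / A.shape[0]
```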
- Total Variation Loss
- Encourages each discovered part to be composed of one or a few connected components.
- $\mathcal{L}_{\text{tv}} = \frac{1}{HW} \sum_{k=1}^{K+1} \sum_{ij} |\nabla a_{ij}^{k}|$ where $\nabla a_{ij}^{k}$ is the spatial image gradient of part map $A^{k}$ at location $ij$.
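A common way to realize the spatial gradient is forward differences along both axes; the exact discretization below is an assumption, as the paper does not pin it down in this summary:

```python
import numpy as np

def total_variation_loss(A):
    # A: (K+1, H, W); |∇a| approximated by forward differences in i and j
    K1, H, W = A.shape
    tv = np.abs(np.diff(A, axis=1)).sum() + np.abs(np.diff(A, axis=2)).sum()
    return tv / (H * W)
```

Spatially constant maps incur zero cost, while fragmented, noisy maps with many boundaries are penalized.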
Experimental Results


