ECCV2024 Accepted Paper
Link: https://arxiv.org/pdf/2407.04538
Summary
PDiscoFormer is a new unsupervised part-discovery approach built on vision transformers, focused on relaxing the rigid geometric priors assumed by prior work.
The main goal of the paper is to automatically identify discriminative parts that are useful for the fine-grained image classification task.
Method

- Obtain patch token features $\mathbf{z}$ by dropping the CLS token and register tokens from the ViT output
- $\mathbf{z} = h_{\theta}(\mathbf{x}) \in \mathbb{R}^{D \times H \times W}$
- Attention Maps: $\mathbf{A} \in [0,1]^{(K+1) \times H \times W}$
- where the last channel in the first dimension is used to represent the background region in the image.
- computed using the negative squared Euclidean distance between patch token features $\mathbf{z}_{ij} \in \mathbb{R}^D$, $i \in \{1,...,H\}$, $j \in \{1,...,W\}$ and each of the learnable prototypes $\mathbf{p}^k \in \mathbb{R}^D$, with $k \in \{1,...,K+1\}$
- $a_{ij}^{k} = \frac{\exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{k} \|^2 + \gamma_k \right)}{\sum_{l=1}^{K+1} \exp \left( -\| \mathbf{z}_{ij} - \mathbf{p}^{l} \|^2 + \gamma_l \right)}$
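The attention-map computation above is just a softmax over negative squared prototype distances. A minimal NumPy sketch (the function name and the learnable-parameter shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def part_attention(z, prototypes, gamma):
    """Softmax over negative squared Euclidean distances to prototypes.

    z:          (D, H, W) patch token features
    prototypes: (K+1, D) learnable part prototypes (last one = background)
    gamma:      (K+1,) learnable per-part bias
    Returns A:  (K+1, H, W) attention maps that sum to 1 over the parts axis.
    """
    D, H, W = z.shape
    zf = z.reshape(D, H * W).T                                      # (HW, D)
    # squared distance between every patch token and every prototype
    d2 = ((zf[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)   # (HW, K+1)
    logits = -d2 + gamma
    logits -= logits.max(axis=1, keepdims=True)                     # numerical stability
    e = np.exp(logits)
    a = e / e.sum(axis=1, keepdims=True)
    return a.T.reshape(-1, H, W)
```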
- Compute part embedding vectors $v_k \in \mathbb{R}^D$
- $\mathbf{v}^{k} = \frac{\sum_i \sum_j a_{ij}^{k} \mathbf{z}_{ij}}{HW}$
- layer normalization
- $\mathbf{v}_m^{k} = \frac{\mathbf{v}^{k} - \mathbb{E}[\mathbf{v}^{k}]}{\sqrt{\text{Var}[\mathbf{v}^{k}] + \epsilon}} \odot \mathbf{w}_m^{k} + \mathbf{b}_m^{k}$
- obtain a vector of class scores conditioned on the part embedding:
- $y^k=W_c \cdot v^k_m$
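The three steps above (attention-weighted pooling, per-part layer norm, linear classifier) can be sketched in NumPy; the function names and the exact parameter shapes are assumptions for illustration:

```python
import numpy as np

def part_embeddings(z, A):
    # v^k = (1 / HW) * sum_ij a_ij^k * z_ij
    D, H, W = z.shape
    return np.einsum('khw,dhw->kd', A, z) / (H * W)       # (K+1, D)

def layer_norm(v, w, b, eps=1e-5):
    # normalize each part embedding over its feature dimension,
    # then apply the learnable affine parameters w, b
    mu = v.mean(-1, keepdims=True)
    var = v.var(-1, keepdims=True)
    return (v - mu) / np.sqrt(var + eps) * w + b

def class_scores(v_m, W_c):
    # one class-score vector per part embedding: y^k = W_c · v_m^k
    return v_m @ W_c.T                                     # (K+1, C)
```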
Losses
- Orthogonality Loss:
- encourages the learned part embedding vectors to be decorrelated from one another.
- $\mathcal{L}_{\perp} = \sum_{k=1}^{K+1} \sum_{l \neq k} \frac{\mathbf{v}_m^k \cdot \mathbf{v}_m^l}{\|\mathbf{v}_m^k\| \cdot \|\mathbf{v}_m^l\|}$
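This is the sum of pairwise cosine similarities over distinct part embeddings; a minimal sketch (function name assumed):

```python
import numpy as np

def orthogonality_loss(V):
    # V: (K+1, D) part embeddings; penalize off-diagonal cosine similarity
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    C = Vn @ Vn.T                      # cosine-similarity matrix
    return C.sum() - np.trace(C)       # drop the k == l diagonal terms
```

For perfectly orthogonal embeddings the loss is zero, so minimizing it pushes parts toward decorrelated representations.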
- Equivariance Loss:
- want to detect the same parts even if the image is translated, rotated, or scaled.
- encourages the learned part attention maps to be equivariant to rigid transformations.
- $\mathcal{L}_{\text{eq}} = 1 - \frac{1}{K} \sum_k \frac{\langle A^k(\mathbf{x}),\ T^{-1}(A^k(T(\mathbf{x}))) \rangle}{\| A^k(\mathbf{x}) \| \cdot \| T^{-1}(A^k(T(\mathbf{x}))) \|}$
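In other words: transform the image, run the model, undo the transform on the attention maps, and ask for high cosine similarity with the original maps. A sketch using a horizontal flip as the rigid transform $T$ (the transform choice and function names here are illustrative assumptions):

```python
import numpy as np

def equivariance_loss(A, A_t, inverse_transform):
    """A:   (K, H, W) foreground attention maps of the original image
    A_t:   (K, H, W) attention maps of the transformed image
    inverse_transform: applies T^{-1} to a single (H, W) map."""
    K = A.shape[0]
    s = 0.0
    for k in range(K):
        a = A[k].ravel()
        b = np.ravel(inverse_transform(A_t[k]))
        s += float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - s / K
```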
- Presence Loss:
- Encourages every discovered foreground part to be present in at least some images of the training dataset,
- while the background, the $(K+1)^{th}$ part, is expected to be present in all images in the dataset.
- $\mathcal{L}_{p_1} = 1 - \frac{1}{K} \sum_k \max_{b,i,j} \bar{a}_{ij}^{k}(\mathbf{x}_b)$ where $\bar{A}^k(\mathbf{x}_b) = \text{avgpool}(A^k(\mathbf{x}_b))$
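Concretely: average-pool each foreground map, take the max over the batch and all spatial locations, and penalize parts whose best activation anywhere is low. A sketch, assuming a 2×2 average-pool kernel (the kernel size and function name are assumptions, not taken from the paper):

```python
import numpy as np

def presence_loss(A_fg, pool=2):
    # A_fg: (B, K, H, W) foreground attention maps over a batch;
    # H and W are assumed divisible by the pool size
    B, K, H, W = A_fg.shape
    ap = A_fg.reshape(B, K, H // pool, pool, W // pool, pool).mean(axis=(3, 5))
    # max over batch and spatial dims, averaged over the K foreground parts
    return 1.0 - ap.max(axis=(0, 2, 3)).mean()
```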
- Stricter Presence Loss
- To ensure the presence of the background
- the background is at the same time expected in every image and is more likely to appear near the boundaries of the image.
- $\mathcal{L}_{p_0} = -\frac{1}{B} \sum_b \log \left( \max_{i,j} m_{ij}\, \bar{a}_{ij}^{K+1}(\mathbf{x}_b) \right)$
- $m_{ij} = 2 \left( \frac{i - 1}{H - 1} - \frac{1}{2} \right)^2+ 2 \left( \frac{j - 1}{W - 1} - \frac{1}{2} \right)^2$
- where $M = [m_{ij}]^{H \times W}$ is a soft mask with $m_{ij} \in [0, 1]$ that gives higher weight to entries farther from the image center
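The mask $M$ is zero at the center and reaches 1 at the corners, so the log-max objective rewards strong background activation near the image boundary. A sketch using 0-based indices (the equation's $i-1$, $j-1$ assume 1-based indexing); function names are assumptions:

```python
import numpy as np

def border_mask(H, W):
    # m_ij = 2*(i/(H-1) - 1/2)^2 + 2*(j/(W-1) - 1/2)^2, 0-based indices
    i = np.arange(H)[:, None] / (H - 1)
    j = np.arange(W)[None, :] / (W - 1)
    return 2 * (i - 0.5) ** 2 + 2 * (j - 0.5) ** 2

def background_presence_loss(A_bg, eps=1e-8):
    # A_bg: (B, H, W) pooled background attention maps (the (K+1)-th part)
    B, H, W = A_bg.shape
    m = border_mask(H, W)
    # -log of the best masked background activation, averaged over the batch
    return -np.log((m * A_bg).reshape(B, -1).max(axis=1) + eps).mean()
```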
- Entropy Loss:
- to ensure that each patch token is assigned to a unique part
- $\mathcal{L}_{\text{ent}} = -\frac{1}{K+1} \sum_{k=1}^{K+1} \sum_{ij} a_{ij}^{k} \log a_{ij}^{k}$
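Minimizing the entropy of the per-patch part distribution pushes each patch's assignment toward one-hot. A minimal sketch (the small `eps` for numerical safety is an implementation assumption):

```python
import numpy as np

def entropy_loss(A, eps=1e-8):
    # A: (K+1, H, W) part-assignment probabilities per patch location
    return float(-(A * np.log(A + eps)).sum()) / A.shape[0]
```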
- Total Variation Loss
- Encourages each discovered part to be composed of one or a few connected components.
- $\mathcal{L}_{\text{tv}} = \frac{1}{HW} \sum_{k=1}^{K+1} \sum_{ij} |\nabla a_{ij}^{k}|$ where $\nabla a_{ij}^{k}$ is the spatial image gradient of part map $A^{k}$ at location $ij$.
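A common way to realize the spatial gradient is forward differences along both axes; the exact discretization below is an assumption, as the paper does not pin it down in this summary:

```python
import numpy as np

def total_variation_loss(A):
    # A: (K+1, H, W); |∇a| approximated by forward differences in i and j
    K1, H, W = A.shape
    tv = np.abs(np.diff(A, axis=1)).sum() + np.abs(np.diff(A, axis=2)).sum()
    return tv / (H * W)
```

Spatially constant maps incur zero cost, while fragmented, noisy maps with many boundaries are penalized.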
Experimental Results


