Link: https://arxiv.org/pdf/2410.16707v1
Introduction or Motivation
- Inspiration
- will the performance imbalance at the beginning layer of the transformer decoder constrain the upper bound of the final performance?
- to answer this, experiments are conducted to validate the negative impact of the detection-segmentation imbalance on model performance
- Core Idea: improve the final performance by alleviating the detection-segmentation imbalance.
- De-Imbalance (DI):
- generate balance-aware query
- Balance-Aware Tokens Optimization (BATO):
- guide the optimization of the initial feature tokens via the balance-aware query

- Multi-Task Training Impact
- Multi-task training can sometimes degrade the performance of individual tasks.
- Crux of the Issue
- Imbalance Between Object Detection and Instance Segmentation
- A key factor is the imbalance between object detection and instance segmentation tasks.
- Observed Phenomenon
- A performance imbalance exists between object detection and instance segmentation.
- Detection-Segmentation Imbalance
- The imbalance at the initial layers hinders the effective cooperation between object detection and instance segmentation.
- Reasons for Imbalance
- Individual Characteristics of Tasks
- Segmentation
- Nature: Pixel-level grouping and classification.
- Focus: Local detailed information is crucial.
- Detection
- Nature: Region-level task involving localization and regression of object bounding boxes.
- Focus: Requires global information, emphasizing the complete object.
- Supervision Methods
- Segmentation Supervision
- Method: Densely supervised using all pixels of the ground truth (GT) mask.
- Impact: Provides richer and stronger information during optimization.
- Detection Supervision
- Method: Sparsely supervised using a 4D vector (x, y, w, h) of the GT bounding box.
- Impact: Carries far less information than the dense mask supervision.
- Optimization Dynamics
- The dense supervision in segmentation leads to faster optimization than the sparse supervision in detection.
- This asynchrony in optimization speed contributes to the overall imbalance issue (see the sketch below).
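A minimal sketch (my own, not from the paper) to make the dense-vs-sparse contrast concrete: for a single instance, the mask loss back-propagates through $H \times W$ supervision terms, while the box loss back-propagates through only 4 numbers.

```python
# Contrast of supervision density for one instance (illustrative only):
# the mask loss receives H*W per-pixel labels, the box loss only (x, y, w, h).
import torch
import torch.nn.functional as F

H, W = 64, 64
pred_mask_logits = torch.randn(1, H, W, requires_grad=True)   # per-pixel logits
gt_mask = (torch.rand(1, H, W) > 0.5).float()                 # dense GT: H*W labels

pred_box = torch.rand(4, requires_grad=True)                  # predicted (x, y, w, h)
gt_box = torch.rand(4)                                        # sparse GT: 4 numbers

mask_loss = F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)
box_loss = F.l1_loss(pred_box, gt_box)

print(f"mask supervision terms: {gt_mask.numel()}, box supervision terms: {gt_box.numel()}")
# -> 4096 vs. 4: one intuition for segmentation optimizing faster than detection
```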
Method

Feature Token Extractor
- a conventional DETR-like encoder produces the initial feature tokens $T_i$
De-Imbalance
- alleviate the imbalance instead of directly providing $T_i$ to the transformer decoder as MaskDINO does.
- detection-segmentation imbalance:
- the performance of object detection lags behind that of instance segmentation at the beginning layer of the transformer decoder.
- residual double-selection:
- select Top-$k_1$ ranked feature tokens in $T_i$ based on their category classification scores:
- $T_{s1}=\mathcal{S}(T_i,k_1)$
- most background information is filtered out, focusing on the objects
- Token interaction:
- $T_{s1}^{sa}=\text{MHSA}(T_{s1})$
- enables the detection task to learn the interaction relations between patches: tokens representing patches that belong to the same object can interact with each other to learn the global geometric, contextual, and semantic patch-to-patch relations, benefiting the perception of object bounding boxes.
- Second selection:
- $T_{s2}=\mathcal{S}(T_{s1}^{sa},k_2)$
- the residual connection is necessary compensation for the double-selection, since information loss occurs in the selection procedures (see the sketch after this list)
- $\mathcal{Q}_{bal}=\text{MHCA}(T_{s2}, T_i)$
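A hedged PyTorch sketch of the DI pipeline above, assuming shapes of (batch, tokens, dim) and per-token classification scores for ranking; `k1`, `k2`, the shared `score_head`, and all hyperparameters are my assumptions, not the paper's implementation.

```python
# Sketch of De-Imbalance (DI): residual double-selection producing Q_bal.
import torch
import torch.nn as nn

class DeImbalance(nn.Module):
    def __init__(self, dim=256, num_heads=8, k1=300, k2=100, num_classes=80):
        super().__init__()
        self.k1, self.k2 = k1, k2
        self.score_head = nn.Linear(dim, num_classes)  # category classification scores
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def select(self, tokens, k):
        # S(T, k): keep the Top-k tokens ranked by their max class score
        scores = self.score_head(tokens).max(dim=-1).values          # (B, N)
        idx = scores.topk(k, dim=-1).indices                         # (B, k)
        return torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

    def forward(self, T_i):                       # T_i: (B, N, dim)
        T_s1 = self.select(T_i, self.k1)          # first selection: filter background
        T_s1_sa, _ = self.mhsa(T_s1, T_s1, T_s1)  # patch-to-patch interaction
        T_s2 = self.select(T_s1_sa, self.k2)      # second selection
        # cross-attending back to the full T_i compensates the information
        # lost in the two selections: Q_bal = MHCA(T_s2, T_i)
        Q_bal, _ = self.mhca(T_s2, T_i, T_i)
        return Q_bal                              # balance-aware query: (B, k2, dim)
```

Usage would be `Q_bal = DeImbalance()(T_i)` with `T_i` of shape `(B, N, 256)` from the feature token extractor.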
Balance-Aware Tokens Optimization
- How can $\mathcal{Q}_{bal}$ guide the optimization of $T_i$?
- $T_i$ contains a large number of tokens conveying detailed local information for both background and foreground
- $\mathcal{Q}_{bal}$ consists of a small number of high-confidence tokens mainly focusing on foregrounds.
- Also, $\mathcal{Q}_{bal}$ has learned rich semantic and contextual interaction relations
- Therefore, $\mathcal{Q}_{bal}$ can guide the optimization of $T_i$
- Generate the guiding mask tokens
- $T_g^{mask}=\mathcal{N}_{mask}(\mathcal{Q}_{bal})$, $T_g^{box}=\mathcal{N}_{box}(\mathcal{Q}_{bal})$
- $\mathcal{N}(\cdot)$: implemented as an MLP
- $T_g=T_g^{mask}+T_g^{box}$
- $T_i$ interacts with $T_g$ (sketched below):
- the tokens $T_i$ that belong to the same object/instance will be aggregated, enhancing the foreground information.
- $T_{bal}=\text{MHCA}(T_{i}, T_g)$
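A hedged sketch of BATO, continuing the shapes from the DI sketch above; the MLP widths and depths are my assumptions.

```python
# Sketch of Balance-Aware Tokens Optimization (BATO): Q_bal guides T_i.
import torch
import torch.nn as nn

class BATO(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # N_mask, N_box: MLPs producing the guiding mask/box tokens
        self.n_mask = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.n_box = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, T_i, Q_bal):        # (B, N, dim), (B, k2, dim)
        T_g_mask = self.n_mask(Q_bal)     # T_g^mask = N_mask(Q_bal)
        T_g_box = self.n_box(Q_bal)       # T_g^box  = N_box(Q_bal)
        T_g = T_g_mask + T_g_box          # guiding tokens
        # T_i attends to T_g: tokens belonging to the same object/instance
        # aggregate, enhancing foreground information
        T_bal, _ = self.mhca(T_i, T_g, T_g)
        return T_bal                      # balance-optimized tokens: (B, N, dim)
```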
Transformer Decoder
- $\mathcal{Q}_{ref}=\mathcal{N}_{decoder}(\mathcal{Q}_{bal}, T_{bal})$
- $\{c,b\}=\mathcal{N}_{det}(\mathcal{Q}_{ref})$
- $m=\mathcal{N}_{seg}(\mathcal{Q}_{ref}, T_i, F_{cnn})$
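A hedged sketch of the final prediction step matching the three equations above. The decoder is a plain `nn.TransformerDecoder` stand-in for $\mathcal{N}_{decoder}$, and the mask head uses the common dot-product-with-pixel-features formulation as a simplification; the paper's $\mathcal{N}_{seg}$ also takes $T_i$, which I fold into $F_{cnn}$ here.

```python
# Sketch of the transformer decoder and the detection/segmentation heads.
import torch
import torch.nn as nn

class DecoderHeads(nn.Module):
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)  # N_decoder
        self.cls_head = nn.Linear(dim, num_classes)                # c
        self.box_head = nn.Linear(dim, 4)                          # b = (x, y, w, h)
        self.mask_embed = nn.Linear(dim, dim)                      # part of N_seg

    def forward(self, Q_bal, T_bal, F_cnn):    # F_cnn: (B, dim, H, W)
        Q_ref = self.decoder(Q_bal, T_bal)     # Q_ref = N_decoder(Q_bal, T_bal)
        c = self.cls_head(Q_ref)               # class logits:     (B, k2, num_classes)
        b = self.box_head(Q_ref).sigmoid()     # normalized boxes: (B, k2, 4)
        # per-query masks via dot product between query embeddings
        # and high-resolution pixel features
        m = torch.einsum("bqd,bdhw->bqhw", self.mask_embed(Q_ref), F_cnn)
        return c, b, m
```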
Experiment

