Can jointly learn from detection and segmentation data toward an open-vocabulary model for both tasks.
Locates the discrepancies between the two tasks/datasets and proposes separate techniques, including a shared semantic space, decoupled decoding, and conditioned mask assistance, to mitigate them.
Method
Largely similar to MaskDINO.
Like DINO, the model follows a two-stage design.
Visual backbone
Encoder → features
Encoder's feature selection (two-stage manner)
Decoder → mask head, bbox head
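The pipeline above can be sketched as a single forward pass. This is a minimal illustration under assumed module names (`backbone`, `encoder`, `decoder`, `mask_head`, `bbox_head` are placeholders, and the token-scoring rule is a stand-in), not the actual implementation.

```python
import torch

def two_stage_forward(backbone, encoder, decoder, mask_head, bbox_head,
                      image, k=100):
    """Sketch of the MaskDINO-style two-stage flow (names are placeholders)."""
    feats = encoder(backbone(image))            # (N, D) encoder tokens
    # Stage 1: score every encoder token and keep the top-k as decoder queries.
    scores = feats.norm(dim=-1)                 # stand-in scoring rule
    queries = feats[scores.topk(k).indices]     # (k, D)
    # Stage 2: the decoder refines the queries; separate heads predict
    # masks and boxes from the refined query embeddings.
    hidden = decoder(queries, feats)
    return mask_head(hidden), bbox_head(hidden)
```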
The key difference is that MaskDINO has no language component.
Language-guided foreground query selection
The decoder contains a limited number of foreground queries (typically a few hundred),
making it hard to cover all possible concepts in an image.
The top-k encoder features are selected as two-stage queries based on the similarity score between encoder features and text features.
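A minimal sketch of this selection step, assuming cosine similarity and a max-over-concepts score (the function and argument names are illustrative; the paper's exact scoring may differ):

```python
import torch

def language_guided_query_selection(enc_feats, text_feats, k=300):
    """Pick top-k encoder tokens by image-text similarity (sketch).

    enc_feats:  (N, D) flattened encoder features for one image
    text_feats: (C, D) text embeddings of the concept/category names
    k: number of foreground queries to keep (a few hundred in practice)
    """
    # Cosine similarity between every encoder token and every text embedding.
    enc = torch.nn.functional.normalize(enc_feats, dim=-1)
    txt = torch.nn.functional.normalize(text_feats, dim=-1)
    sim = enc @ txt.t()                      # (N, C)
    # Score each token by its best-matching concept, keep the top-k tokens.
    scores, _ = sim.max(dim=-1)              # (N,)
    topk_idx = scores.topk(k).indices        # (k,)
    return enc_feats[topk_idx], topk_idx
```

The selected features then serve as the initial decoder queries, so the query budget is spent on regions that match some text concept.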
Bridge Data Gap: Conditioned Mask Decoding
The ultimate goal is to bridge the data gap so that multiple tasks can be trained with a single loss function.
Detection datasets contain only coarse location (bbox) and class information.
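One way to read "conditioned mask decoding" is that, on detection-only data, the ground-truth box and class embedding are themselves encoded into a query from which a mask is decoded. The sketch below illustrates that idea only; the class name, layer sizes, and query construction are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConditionedMaskDecoding(nn.Module):
    """Sketch: turn GT boxes + class embeddings into mask-decoding queries."""

    def __init__(self, d_model=256):
        super().__init__()
        self.box_embed = nn.Linear(4, d_model)       # embed (cx, cy, w, h)
        self.mask_embed = nn.Linear(d_model, d_model)

    def forward(self, gt_boxes, class_embeds, pixel_feats):
        # gt_boxes:     (M, 4) normalized GT boxes from a detection dataset
        # class_embeds: (M, D) text embeddings of the GT class names
        # pixel_feats:  (D, H, W) per-pixel features
        queries = self.box_embed(gt_boxes) + class_embeds        # (M, D)
        mask_feats = self.mask_embed(queries)                    # (M, D)
        # Dot product with pixel features yields one mask logit map per box.
        masks = torch.einsum("md,dhw->mhw", mask_feats, pixel_feats)
        return masks
```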