Abstract: With its powerful vision-language alignment capability, CLIP performs well on zero-shot and few-shot learning tasks. However, our experiments show that CLIP’s logits suffer from serious ...
One-shot semantic segmentation aims to segment the object regions of unseen categories using only one annotated example as supervision. Existing methods often adopt the multimodal pre-trained model ...