Abstract: Vision-Language Models (VLMs), such as CLIP, excel in zero-shot image-level visual understanding but struggle with object-based tasks requiring precise localization and recognition. Visual ...
Abstract: Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against ...