Attentive Mask CLIP
Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, removing a large portion of image tokens may discard the semantic content associated with a given text description. The paper proposes an attentive token removal approach for CLIP training, which retains the tokens with a high semantic correlation to the text description....
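
To make the idea of attentive token removal concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the excerpt above does not say how the correlation between image tokens and text is scored, so this sketch assumes cosine similarity between each token embedding and a single text embedding. The names `cosine`, `attentive_keep_indices`, and `keep_ratio` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attentive_keep_indices(token_embs, text_emb, keep_ratio=0.5):
    """Return the sorted indices of the image tokens to keep.

    Assumption: relevance is scored as cosine similarity to the text
    embedding; the paper's actual scoring rule may differ.
    """
    # Always keep at least one token, even for very small keep ratios.
    num_keep = max(1, int(len(token_embs) * keep_ratio))
    scores = [cosine(tok, text_emb) for tok in token_embs]
    # Rank token indices from most to least text-relevant.
    ranked = sorted(range(len(token_embs)), key=lambda i: scores[i], reverse=True)
    # Restore the original spatial order of the kept tokens.
    return sorted(ranked[:num_keep])


# Toy example: four 2-d "patch" embeddings and one text embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text = [1.0, 0.0]
kept = attentive_keep_indices(tokens, text, keep_ratio=0.5)
```

In this toy example the two tokens most aligned with the text embedding (indices 0 and 2) are kept, while a random mask could just as easily have dropped them. Only the kept tokens would then be passed to the image encoder, which is where the compute savings come from.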