How can I use CLIP ViT as the backbone? Which layer of the CLIP ViT corresponds to the `feature_layer`?

Thank you for your interest in our work! The model is designed for CLIP versions that use a ResNet backbone, so substantial changes would be needed to run it with Vision Transformers. If you want to use the CLIP ViT as the backbone, you would likely need to use the output features of the last transformer layer.
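To make that suggestion concrete, here is a minimal sketch of how one might capture the last-layer token features from a CLIP ViT with a forward hook, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`). The hook point, the `example.jpg` path, and the reshape into a spatial grid are illustrative assumptions, not this repo's actual `feature_layer` wiring:

```python
import torch
import clip  # openai/CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Capture the output of the last transformer block via a forward hook.
captured = {}

def save_output(module, inputs, output):
    # OpenAI's CLIP ViT runs its transformer in (seq_len, batch, dim) order.
    captured["tokens"] = output

handle = model.visual.transformer.resblocks[-1].register_forward_hook(save_output)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    model.encode_image(image)
handle.remove()

tokens = captured["tokens"].permute(1, 0, 2)  # (batch, seq_len, dim)
cls_token = tokens[:, 0]      # global image token
patch_tokens = tokens[:, 1:]  # per-patch features, roughly analogous to a ResNet feature map

# Optionally reshape the patch tokens back into a 2D grid (7x7 for ViT-B/32 at 224px).
n, l, d = patch_tokens.shape
grid = int(l ** 0.5)
feature_map = patch_tokens.transpose(1, 2).reshape(n, d, grid, grid)
```

Note that the hooked output is taken before CLIP's final `ln_post` and projection, so it lives in the ViT's internal width rather than the shared embedding space; whether it is a drop-in replacement for the ResNet `feature_layer` would need to be checked against this repo's code.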