Suggest your favorite papers to add! #1
Comments
Florence: https://arxiv.org/abs/2111.11432
Would it be possible to explicitly target the same API OpenAI created for their CLIP? That way it could be used as a drop-in replacement in e.g. CLIP-guidance notebooks (and anywhere else CLIP is used, which is a lot of places). I think this would basically amount to using the same function signatures for clip.load(), encode_image, encode_text, etc. Not sure how limiting that could be in practice.
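For context, the interface of the OpenAI clip package being referred to is roughly the following (the image path is just a placeholder); matching these signatures is what would make a drop-in replacement possible:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # "photo.jpg" is a placeholder
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # (1, 512) for ViT-B/32
    text_features = model.encode_text(text)     # (3, 512)
```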
sure! but i'm also thinking of extending this to any number of modalities (audio, biosequences, etc.)
LiT: Zero-Shot Transfer with Locked-image Text Tuning: https://arxiv.org/abs/2111.07991. In particular, I think it would be interesting to be able to transfer the weights of existing models (CLIP image and text encoders, but also other pretrained encoders) into this implementation and then continue training.
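A minimal sketch of what that locked-image setup could look like; the choice of pretrained towers below is purely illustrative (LiT itself uses larger image models), but the pattern is simply "freeze the image encoder, train the text side against the contrastive loss":

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights
from transformers import AutoModel

# Illustrative choices of pretrained towers; any image/text encoders would do.
image_encoder = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

# LiT-style "locked image" tuning: the image tower is frozen entirely...
for p in image_encoder.parameters():
    p.requires_grad = False
image_encoder.eval()

# ...and only the text tower (plus any projection heads) receives gradients
# from the image-text contrastive loss.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
```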
MURAL: Multimodal, Multitask Retrieval Across Languages: https://arxiv.org/abs/2109.05125 |
Combined Scaling for Zero-shot Transfer Learning
yup, i think it'll end up something like

```python
clip = CLIP(
    vision_model = vit_transformer,
    text_model = text_transformer,
    ...
)
```
CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations: https://arxiv.org/pdf/2112.07133.pdf
RegionCLIP: https://arxiv.org/abs/2112.09106v1. They encourage region-level representations by using the released CLIP both to detect objects and to generate region-level captions for objects in a scene, which then becomes the dataset for finetuning an object detection task. Still reading, but I believe it's a Microsoft paper.
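A toy sketch of that pseudo-labeling idea as I read it (not the paper's actual pipeline): score each region crop against a pool of concept prompts with the released CLIP and keep the best match as the region's pseudo caption. The concept list below is a made-up placeholder; RegionCLIP uses a much larger concept vocabulary filled into prompt templates.

```python
import torch
import clip  # the released OpenAI CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Toy concept pool (placeholder).
concepts = ["a photo of a dog", "a photo of a car", "a photo of a person"]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(concepts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def pseudo_caption(image: Image.Image, box) -> str:
    # box = (left, upper, right, lower) from any off-the-shelf region proposal method
    crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
    with torch.no_grad():
        region_features = model.encode_image(crop)
        region_features = region_features / region_features.norm(dim=-1, keepdim=True)
    best = (region_features @ text_features.T).argmax(dim=-1).item()
    return concepts[best]  # the region + pseudo caption becomes a training pair for finetuning
```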
Hi, I would just like to ask if it is possible to make your models scriptable? It looks like the lambda functions make that problematic for a normal user. The good thing about TorchScript is that it can then be exported to ONNX, TensorRT, etc.
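As a generic illustration of the lambda issue (not this repo's actual code): TorchScript cannot compile Python lambdas, but a small nn.Module with the same behavior scripts fine and can then be exported onward.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Not scriptable: TorchScript can't compile a lambda hidden inside a module.
# norm = lambda t: F.normalize(t, dim=-1)

# Scriptable replacement: the same operation as an explicit nn.Module.
class L2Norm(nn.Module):
    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return F.normalize(t, dim=-1)

model = nn.Sequential(nn.Linear(512, 512), L2Norm())
scripted = torch.jit.script(model)  # succeeds once all submodules avoid Python-only constructs
scripted.save("model.pt")           # the scripted module can then be taken to ONNX / TensorRT
```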
https://github.com/facebookresearch/SLIP: they combine the losses of CLIP (vision + language) and SimCLR (vision only) and get better zero-shot accuracy training on a 15M-image dataset than CLIP trained on the same data.
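Conceptually the SLIP objective is just a weighted sum of the two losses; a rough sketch (the SimCLR term here is simplified and the scale is a placeholder, see their repo for the real training code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embed, text_embed, temperature=0.07):
    # Standard symmetric InfoNCE between matched image/text embeddings.
    image_embed, text_embed = F.normalize(image_embed, dim=-1), F.normalize(text_embed, dim=-1)
    logits = image_embed @ text_embed.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def simclr_loss(view1_embed, view2_embed, temperature=0.1):
    # Simplified NT-Xent over two augmented views of the same images
    # (the full version also contrasts against the other in-batch views).
    view1_embed, view2_embed = F.normalize(view1_embed, dim=-1), F.normalize(view2_embed, dim=-1)
    logits = view1_embed @ view2_embed.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def slip_loss(image_embed, text_embed, view1_embed, view2_embed, ssl_scale=1.0):
    # SLIP-style objective: image-text contrastive loss plus a scaled image-only SSL loss.
    return clip_contrastive_loss(image_embed, text_embed) + ssl_scale * simclr_loss(view1_embed, view2_embed)
```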
https://github.com/FreddeFrallan/Multilingual-CLIP works pretty well, even though they used very few resources. Here's one example showing it works: searching for "blue dress" in Korean, with CLIP vs. with mCLIP (screenshots in the original comment; many other examples can be tried on that UI). I think we may be able to learn something from their approach. Edit: in practice I believe we already have what we need in the code here, namely the ability to plug in some text encoder.
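For the "plug in your own text encoder" point, a rough sketch of wrapping a pretrained multilingual transformer so it can stand in as the text tower; the model name, pooling choice, and output dimension below are only illustrative (Multilingual-CLIP itself uses other backbones):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class MultilingualTextEncoder(nn.Module):
    """Wraps a pretrained multilingual transformer to produce one embedding per sentence."""

    def __init__(self, name="xlm-roberta-base", out_dim=512):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.backbone = AutoModel.from_pretrained(name)
        self.project = nn.Linear(self.backbone.config.hidden_size, out_dim)

    def forward(self, texts):
        tokens = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = self.backbone(**tokens).last_hidden_state  # (batch, seq, hidden)
        pooled = hidden[:, 0]                                # CLS-style pooling; mean pooling also common
        return self.project(pooled)                          # (batch, out_dim) text embeddings
```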
https://arxiv.org/abs/2112.09133 Is there any plan to implement MaskFeat? @lucidrains
@haofanwang ohh nope, this doesn't look like it is related to contrastive learning. i could add it to https://github.com/lucidrains/vit-pytorch, but i'd have to understand HOGs better
this is a great paper :) but it also already came with code! |
Hi @lucidrains, I hope you are doing fine? This could be very interesting for x-clip. However, the official code seems to be on the way too: facebookresearch/mmf#1219 (comment) & https://github.com/facebookresearch/multimodal. All the best,
@MicPie hey Michael! miss you too ❤️ thanks for the share, i'll give it a read later tonight after i finish some code |
Looks interesting: “Unlike standard decoder transformers, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the rest of the decoder layers, cross-attending to the image encoder for multimodal image-text representations.” |
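A skeletal sketch of that layer layout, purely for illustration: standard PyTorch layers stand in for CoCa's actual blocks (which are causal and use attentional pooling on the image side), but the split is the same, self-attention only in the first half, cross-attention to image tokens in the second half.

```python
import torch
from torch import nn

class CoCaStyleDecoder(nn.Module):
    """Illustrative only: first half unimodal (text self-attention),
    second half multimodal (additionally cross-attends to image encoder tokens)."""

    def __init__(self, dim=512, depth=12, heads=8):
        super().__init__()
        half = depth // 2
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(half)]
        )
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True) for _ in range(depth - half)]
        )

    def forward(self, text_tokens, image_tokens):
        for layer in self.unimodal:
            text_tokens = layer(text_tokens)                        # self-attention only
        unimodal_text = text_tokens                                 # feeds the contrastive objective
        for layer in self.multimodal:
            text_tokens = layer(text_tokens, memory=image_tokens)   # cross-attention to image tokens
        return unimodal_text, text_tokens                           # multimodal output feeds the captioning loss
```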
Please refer to our UniCL repo for the core algorithm used in Florence: https://github.com/microsoft/UniCL
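A rough sketch of the UniCL-style objective as I understand it (see the repo above for the real thing): pairs that share a label count as positives, so the contrastive target becomes a normalized multi-positive matrix rather than the identity used by vanilla CLIP.

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(image_embed, text_embed, labels, temperature=0.07):
    # Sketch of a unified image-text-label contrastive loss.
    image_embed = F.normalize(image_embed, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    logits = image_embed @ text_embed.t() / temperature           # (batch, batch) similarity matrix

    positives = (labels[:, None] == labels[None, :]).float()      # entries sharing a label are positives
    targets = positives / positives.sum(dim=-1, keepdim=True)     # normalize each row to a distribution

    loss_i2t = torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2i = torch.sum(-targets.t() * F.log_softmax(logits.t(), dim=-1), dim=-1).mean()
    return (loss_i2t + loss_t2i) / 2
```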
will start with