Adding support for HuggingFace vision Transformers #2

NielsRogge · 2022-10-19T07:38:42Z

Hi,

Thanks for this great work!

In 🤗 Transformers, we support the Vision Transformer (ViT) - among many other models like MAE, BEiT, ConvNeXt, Swin Transformer, Swin Transformer v2, etc. Recent additions also include Transformer-based video models, like VideoMAE and X-CLIP.

As can be seen on the hub, the 2 most popular ViT models have +500k and +300k downloads respectively the last month. Would be great if people can leverage this speed up in performance! An increase in throughtput would be very beneficial for people putting these algorithms in production.

As models in the Transformers library are implemented very independently (we duplicate code rather than inheriting, for the sake of readability + independence among the models), we could add ToMe as a separate model in the library.

Let me know your thoughts :)

Best,

Niels
ML Engineer @ HuggingFace

dbolya · 2022-10-19T19:50:52Z

Sounds like a great idea!

What do you think would be best? I could make a patch for a couple of the most popular implementations like I did for timm and swag. Or ToMe could be a separate model in the library.

Since there is code duplication, making a patch has the possibility of support more models, since I'm assuming a lot of the implementations share the same stem.

NielsRogge · 2022-10-20T20:23:10Z

What do you think would be best? I could make a patch for a couple of the most popular implementations like I did for timm and swag. Or ToMe could be a separate model in the library.

I'd say both! Regarding a patch of existing implementations, those could be handled in this repo (if you want so, of course) and then we could add ToMe as a separate model (with weights) as well in the library.

FrancescoSaverioZuppichini · 2023-03-05T14:02:44Z

@NielsRogge see #21 , it looks like not implementing attention from scratch but using the torch built-in attention results in better performances. So, assuming I am correct, using torch.nn.MultiHeadAttention should have been the right choice in transformers even if it hurts "readability". Thus, it would make more sense to integrate that directly and/or find a way to make ToMe work with it.

Not valid for training since the fast path of nn.MultiHeatAttention is used only during inference.

@dbolya I am currently looking at ways to get bost of both worlds (Hannah Montana throwbacks 😃 )

Eager to know your thoughts and happy to help to make these models faster for the community

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding support for HuggingFace vision Transformers #2

Adding support for HuggingFace vision Transformers #2

NielsRogge commented Oct 19, 2022 •

edited

Loading

dbolya commented Oct 19, 2022

NielsRogge commented Oct 20, 2022

FrancescoSaverioZuppichini commented Mar 5, 2023

Adding support for HuggingFace vision Transformers #2

Adding support for HuggingFace vision Transformers #2

Comments

NielsRogge commented Oct 19, 2022 • edited Loading

dbolya commented Oct 19, 2022

NielsRogge commented Oct 20, 2022

FrancescoSaverioZuppichini commented Mar 5, 2023

NielsRogge commented Oct 19, 2022 •

edited

Loading