[ViT] Vision Transformer (ViT) backbone, layers, and image classifier #1989
base: master
Conversation
amazing work! one nit comment.
Also, can you please add a demo colab? and a colab to show the numerics is verified. Basically the validate output block from your conversion script.
        dtype=None,
        **kwargs,
    ):
        data_format = standardize_data_format(data_format)
Add a section comment here: === Layers ===
        dropout_rate=0.0,
        attention_dropout=0.0,
        layer_norm_epsilon=1e-6,
        use_mha_bias=True,
These args are missing from get_config.
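To illustrate the fix the reviewer is asking for: every `__init__` argument should be serialized in `get_config` so the layer round-trips through `from_config`. This is a self-contained sketch, not the actual keras-hub code; the `Layer` stand-in replaces `keras.layers.Layer` to keep the example runnable, and the class name is illustrative.

```python
class Layer:  # minimal stand-in for keras.layers.Layer, for illustration only
    def __init__(self, **kwargs):
        pass

    def get_config(self):
        return {}

    @classmethod
    def from_config(cls, config):
        # Rebuild the layer from its serialized config.
        return cls(**config)


class ViTEncoderBlock(Layer):  # hypothetical name, mirrors the args in the diff
    def __init__(
        self,
        dropout_rate=0.0,
        attention_dropout=0.0,
        layer_norm_epsilon=1e-6,
        use_mha_bias=True,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.dropout_rate = dropout_rate
        self.attention_dropout = attention_dropout
        self.layer_norm_epsilon = layer_norm_epsilon
        self.use_mha_bias = use_mha_bias

    def get_config(self):
        # Include every constructor argument so that
        # from_config(get_config()) reproduces the layer exactly.
        config = super().get_config()
        config.update(
            {
                "dropout_rate": self.dropout_rate,
                "attention_dropout": self.attention_dropout,
                "layer_norm_epsilon": self.layer_norm_epsilon,
                "use_mha_bias": self.use_mha_bias,
            }
        )
        return config
```

With this in place, saving and reloading a model no longer silently resets these four arguments to their defaults.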
Weights-transfer Colab gist for the 4 variants: https://colab.research.google.com/gist/sineeli/10a7884bef6114eade3b237b63d7f2bd/-keras-hub-vit-weights-transfer.ipynb
This looks great! Very nice work. Just a couple comments.
"image resolution of 384x384 " | ||
), | ||
"params": 86090496, | ||
"official_name": "ViT", |
We no longer need the official name or model card; we've reduced what we show on keras.io to make this simpler. Our Kaggle page will act as the new model card.
)


def convert_weights(keras_hub_model, hf_model):
Could we write this as an in-library converter? It seems very doable, and then we'd expose this to anyone wanting to convert a ViT checkpoint.
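The core of such an in-library converter is a name-mapping pass from Hugging Face ViT weight keys to keras-hub variable names. The sketch below shows only that mapping idea; the key patterns and target names are hypothetical examples, not the actual keras-hub converter API or its real weight names.

```python
import re


def map_hf_key(hf_key):
    """Translate one HF ViT weight key to a hypothetical keras-hub name.

    Returns None for keys this sketch does not cover.
    """
    # (HF key pattern, keras-hub-style replacement) pairs; illustrative only.
    rules = [
        (r"^vit\.embeddings\.cls_token$", "class_token"),
        (r"^vit\.embeddings\.position_embeddings$", "position_embedding"),
        (
            r"^vit\.encoder\.layer\.(\d+)\.attention\.attention\.query\.weight$",
            r"transformer_layer_\1/attention/query/kernel",
        ),
    ]
    for pattern, replacement in rules:
        if re.match(pattern, hf_key):
            return re.sub(pattern, replacement, hf_key)
    return None  # unmapped keys would be handled by further rules
```

A full converter would extend the rule table to cover every weight, transpose kernels where the frameworks disagree on layout, and then assign each mapped tensor onto the keras-hub model.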
@@ -137,7 +139,10 @@ def __init__(
        # === Functional Model ===
        inputs = self.backbone.input
        x = self.backbone(inputs)
        x = self.pooler(x)
        if pooling == "token":  # used for Vision Transformer (ViT)
"token" feels like a bit a weird name here, especially when compared to "avg"
or "max"
. Maybe "first"
?
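For context on the naming question, this is the dispatch being discussed: "token" (or "first", as suggested) selects the CLS token at sequence position 0, while "avg" and "max" reduce over all patch features. A minimal NumPy sketch, standing in for the backbone's tensors:

```python
import numpy as np


def pool_features(x, pooling):
    """x: (batch, sequence, hidden) features, CLS token at index 0."""
    if pooling == "token":  # a.k.a. "first": take the CLS token
        return x[:, 0, :]
    if pooling == "avg":  # mean over the sequence axis
        return x.mean(axis=1)
    if pooling == "max":  # max over the sequence axis
        return x.max(axis=1)
    raise ValueError(f"Unknown pooling: {pooling!r}")
```

Seen this way, "first" does read more consistently alongside "avg" and "max", since all three then describe the reduction rather than what the reduced element represents.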
This PR introduces a Vision Transformer (ViT) implementation