Did you get to explore more on this and have any further insights?
My guess is that these video models usually use a patch embedding that downsamples temporally (tubelet size of 2), so frame-level features are effectively "lost".
What could be interesting is to repeat each frame once more and then try the visualization again.
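A minimal sketch of that frame-repetition idea, assuming a tubelet size of 2 and a `(T, H, W, C)` video array (the function name and layout are illustrative assumptions, not V-JEPA's actual API; NumPy is used only to keep the example self-contained):

```python
import numpy as np

def repeat_frames(video: np.ndarray, tubelet_size: int = 2) -> np.ndarray:
    """Repeat each frame so every original frame fills one whole tubelet.

    video: (T, H, W, C) -> (T * tubelet_size, H, W, C).
    With a tubelet size of 2, the patch embedding collapses pairs of frames
    into one token time step; after repetition, each token time step then
    corresponds to exactly one original frame.
    """
    return np.repeat(video, tubelet_size, axis=0)

video = np.random.randn(8, 32, 32, 3)   # 8 input frames (toy resolution)
expanded = repeat_frames(video)          # 16 frames -> 8 token time steps
```

With this, the encoder's temporal token grid lines up one-to-one with the original frames, which should make the per-frame PCA maps interpretable.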
Hi,
Thank you for this amazing project.
I have been exploring the feature maps produced by the pre-trained V-JEPA, using PCA component visualization.
However, the feature maps look very random, so I tried doing the same thing without the pre-trained weights.
Are the feature maps from V-JEPA pre-training supposed to look like this, or what did I miss when loading the pre-trained weights?
Here is the code I used to do the feature visualization.
I used lol.csv, which I downloaded from https://www.kaggle.com/datasets/ipythonx/ssv2test?resource=download
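For reference, this kind of PCA component visualization can be sketched as below (not the original script; `pca_rgb` and the `(t, h, w, dim)` token-grid shape are assumptions standing in for the real encoder output):

```python
import numpy as np

def pca_rgb(features: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project per-patch features onto their top 3 principal components
    and normalize them to [0, 1] so they can be displayed as an RGB map.

    features: (t, h, w, dim) -> (t, h, w, 3).
    """
    t, h, w, d = features.shape
    x = features.reshape(-1, d)
    x = x - x.mean(axis=0, keepdims=True)          # center each feature dim
    # Principal directions via SVD of the centered token matrix.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:n_components].T                 # (N, n_components) scores
    # Min-max normalize each component into [0, 1] for display.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)
    return proj.reshape(t, h, w, n_components)

# Example with random stand-in "features" instead of real encoder outputs:
rgb = pca_rgb(np.random.randn(8, 14, 14, 768))
```

On properly loaded pre-trained weights, such maps usually show spatially coherent structure; if they look like noise, the checkpoint may not be loading correctly.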