This repository has been archived by the owner on Jan 1, 2025. It is now read-only.

About using ToMe in ImageEncoderViT within segment anything #42

Open
yangzijia opened this issue Jul 5, 2024 · 2 comments

Comments

@yangzijia

Hello Author,

Thank you for your contributions. I am currently looking to optimize the ImageEncoderViT module from “Segment Anything” using your token merging method, but I have run into two issues:

  1. The Block in ImageEncoderViT uses windowed attention, and the tokens are shaped (B, H, W, C), e.g. (1, 64, 64, 1280) for vit_h. Bipartite soft matching expects tokens shaped (B, N, C), so it cannot consume this layout directly. I am considering flattening H and W into a single token dimension (see the sketch after this list). Do you have a better suggestion?
  2. My ToMe integration can reduce the number of tokens by about 98%, which changes the final feature shape. ImageEncoderViT ends with two Conv2d operations, and once the token count changes, the features can no longer be reshaped into the spatial grid those convolutions expect. Would adding a shape-restoring (expanding) operation at that point be feasible?
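
For concreteness, here is a minimal sketch of the flattening I have in mind. It calls ToMe's `bipartite_soft_matching` directly; `r = 512` is just a placeholder, and I use the tokens themselves as the matching metric purely for illustration (ToMe normally uses the attention keys):

```python
import torch
from tome.merge import bipartite_soft_matching

# SAM's vit_h blocks see tokens shaped (B, H, W, C), e.g. (1, 64, 64, 1280).
B, H, W, C = 1, 64, 64, 1280
x = torch.randn(B, H, W, C)

# Flatten the spatial grid so bipartite soft matching sees (B, N, C).
x_flat = x.reshape(B, H * W, C)

# Placeholder metric and r: the tokens stand in for the attention keys here.
merge, _ = bipartite_soft_matching(metric=x_flat, r=512, class_token=False)
x_merged = merge(x_flat, mode="mean")  # (B, H*W - 512, C)
```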

Thank you for your help.

@dbolya
Contributor

dbolya commented Jul 6, 2024

Hello, in general those are both open research questions, but I can get you started with some simple strategies.

  1. One option for applying ToMe to a model that uses window attention is to merge the same number of tokens within each window. That way the number of windows stays the same while the number of tokens per window shrinks. You can implement this fairly easily by permuting the windows into the batch dimension when running ToMe (see the first sketch after this list).
  2. I described a simple way of doing "unmerging" in the paper Token Merging for Fast Stable Diffusion: duplicate each merged token back to its original positions. You should be able to do the same thing here by unmerging at the end of the encoder, before the downstream heads (see the second sketch after this list). I'd also suggest adding an additional position embedding at the end if you're training the model.
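
Here is a minimal sketch of the window permutation for the first point, assuming SAM's `window_partition` helper and ToMe's `bipartite_soft_matching`; the window size and `r` below are placeholder values:

```python
import torch
from tome.merge import bipartite_soft_matching
from segment_anything.modeling.image_encoder import window_partition

B, H, W, C = 1, 64, 64, 1280  # vit_h token grid
win = 14                      # SAM's default window size
x = torch.randn(B, H, W, C)

# (B, H, W, C) -> (B * num_windows, win, win, C); pad_hw records any padding.
windows, pad_hw = window_partition(x, win)

# Fold each window's grid into a token axis. The windows now act as the
# batch, so ToMe removes the same number of tokens from every window.
tokens = windows.reshape(-1, win * win, C)

r = 49  # tokens to remove per window (placeholder)
merge, _ = bipartite_soft_matching(metric=tokens, r=r, class_token=False)
tokens = merge(tokens, mode="mean")  # (B * num_windows, win*win - r, C)
```

Note that after merging, the tokens in each window no longer form a square grid, so the attention in that block has to operate on the flattened token axis, and you would need to unmerge before `window_unpartition` can restore the spatial layout.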
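
And for the second point, a sketch of the unmerge step: `bipartite_soft_matching` already returns an `unmerge` callable that duplicates each merged token back to its original positions, which restores the token count the Conv2d neck expects (shapes follow the vit_h example above):

```python
import torch
from tome.merge import bipartite_soft_matching

B, N, C = 1, 4096, 1280  # a flattened 64x64 vit_h token grid
x = torch.randn(B, N, C)

merge, unmerge = bipartite_soft_matching(metric=x, r=2048, class_token=False)

x_small = merge(x, mode="mean")  # (B, 2048, C): run the expensive blocks here
x_restored = unmerge(x_small)    # (B, 4096, C): merged tokens duplicated back

# With the original token count restored, the features can be reshaped to
# (B, H, W, C); the encoder then permutes to NCHW for its Conv2d neck.
x_grid = x_restored.reshape(B, 64, 64, C)
```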

Though like I said, both of those questions are open points of research. Feel free to use what I suggested as your baseline and let me know if you can find anything that works better!

@ranpin

ranpin commented Aug 2, 2024


May I ask whether you have made any progress? I'd like to try this but don't know how to go about it.
