Questions about the TensorRT deployment of the video inference model and video inference flows #434

liyihao76 opened this issue Nov 4, 2024 · 2 comments

[Attached diagram: sam2 drawio]

I'm currently trying to deploy a SAM2 video inference model using TensorRT + C++. Following the idea in https://github.com/Aimol-l/OrtInference, I split it into four models: image encoder, image decoder, memory encoder, and memory attention, first converting them to ONNX files and then generating TensorRT engine files from those. I have completed the deployment of inference for frame 0 (image encoder + image decoder), modeled after the deployment process of SAM1. However, the inference process for subsequent frames seems to be quite complex, especially the storage and update of obj_ptr and mask_mem. I'm a beginner; are there any detailed explanatory articles/videos for this part of the source code, or an existing C++ deployment project? Much appreciated.
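For context, the per-frame flow I'm picturing looks roughly like this (just a sketch of my own understanding; the engine wrappers and their signatures are placeholders, not the real ONNX graph I/O):

```python
def track_frame(frame, engines, memory_bank, obj_ptrs, prompts=None):
    """One tracking step. `engines` is a dict of callables wrapping the four
    TensorRT engines; names and signatures here are placeholders."""
    vision_feats = engines["image_encoder"](frame)
    if memory_bank:  # every frame after the first prompted frame
        vision_feats = engines["memory_attention"](vision_feats, memory_bank, obj_ptrs)
    mask, obj_ptr = engines["image_decoder"](vision_feats, prompts)
    mem_feats = engines["memory_encoder"](vision_feats, mask)
    memory_bank.append(mem_feats)  # how/when to store and evict these two
    obj_ptrs.append(obj_ptr)       # is exactly the part I'm unsure about
    return mask
```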
I have a few specific questions:

  1. After I have completed inference for frame 0, do the input prompts need to be updated when predicting subsequent frames (e.g., using the box from the previous frame's prediction as the prompt for the new frame)? What should the prompt input be for frames without prompts?
  2. For obj_ptr storage, should it hold the contents of frame 0 (the prompted frame, i.e. the "conditioning" frame in the paper) plus the contents of the 15 most recent frames? If I add a new prompt at frame 20, should it save the contents of frame 20 or the contents of frame 0 (+ the contents of the 15 most recent frames)?
  3. https://github.com/Aimol-l/OrtInference adds a temporal encoding (the dark green block, [7,1,64,64]). I don't know whether it exists in the original source code; what is its significance?
  4. Some of my objects may only exist in certain frames. If I want to run inference starting from a frame in the middle of the video, shouldn't the C++ implementation handle this by running inference backwards + forwards from that frame?

heyoeyo commented Nov 4, 2024

I've also been trying to make sense of the video processing sequence, so I can try to answer some of these (though I may still have some parts wrong):

What should the prompt input be for frames without prompts?

Having no inputs seems to work fine, but the model actually uses a single point with label -1 (i.e. a padding point) when no inputs are given. It also uses a box input of None, which ends up adding a second padding point due to the way the prompt encoder is set up, so the final result is a prompt made of 2 padding-point embeddings.
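In terms of raw decoder inputs, a no-prompt frame ends up looking something like this (just a sketch; the coords/labels arrays assume a SAM-style decoder export, not any specific ONNX graph):

```python
import numpy as np

# No-prompt frame: a single padding point (label -1) and no box. With the box
# set to None, the prompt encoder pads with one more (0, 0) / -1 point, so the
# sparse prompt becomes 2 padding-point embeddings.
point_coords = np.zeros((1, 1, 2), dtype=np.float32)  # dummy (x, y); ignored for label -1
point_labels = np.full((1, 1), -1, dtype=np.float32)  # -1 marks a padding point
```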

If I add a new prompt at frame 20, should it save the contents of frame 20 or the contents of frame 0 (+ the contents of the 15 most recent frames)?

With the default settings, the model will use the pointers from all prompted frames at or before the current frame. So at frame 19, only the frame 0 pointer would be used (+ the 15 recent frame pointers), and then at frames 20, 21, 22, etc. both the frame 0 & 20 pointers would be used.
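A rough sketch of that selection logic, as I understand it (variable names are mine, and I'm assuming the default budget of 16 pointers total; the exact bookkeeping in the source may differ):

```python
def select_obj_ptrs(frame_idx, prompted_ptrs, recent_ptrs, max_ptrs=16):
    """prompted_ptrs / recent_ptrs: dicts mapping frame index -> obj_ptr tensor.
    Returns pointers from every prompted frame at or before frame_idx, plus the
    most recent non-prompted frames up to the remaining budget."""
    selected = {f: p for f, p in prompted_ptrs.items() if f <= frame_idx}
    for f in sorted(recent_ptrs, reverse=True):  # newest first
        if len(selected) >= max_ptrs:
            break
        if f < frame_idx and f not in selected:
            selected[f] = recent_ptrs[f]
    return selected
```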

adds a temporal encoding (the dark green block, [7,1,64,64]). I don't know whether it exists in the original source code; what is its significance?

I can't read the label in the diagram, but it's most likely the maskmem_tpos_enc, which acts like a time-based position encoding: one encoding for each of the 6 most recent frame memory encodings, plus 1 for the prompted-frame memory encodings.
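Roughly, each stored memory gets one of 7 learned embeddings added (based on its temporal slot) before going into memory attention; something like this sketch, with shapes assumed from the default config (num_maskmem = 7, mem_dim = 64) and purely illustrative indexing:

```python
import torch

num_maskmem, mem_dim = 7, 64
maskmem_tpos_enc = torch.zeros(num_maskmem, 1, 1, mem_dim)  # learned parameter in the real model

# One slot per "frames ago" (1..6) plus one slot shared by prompted-frame
# memories; the embedding is simply added to that frame's flattened memory.
mem_feats = torch.randn(64 * 64, 1, mem_dim)  # (HW, B, C) memory features for one past frame
slot = 3                                      # e.g. a memory from 3 frames ago
mem_with_tpos = mem_feats + maskmem_tpos_enc[slot]
```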

if i want to reason from a certain frame in the middle of the video, shouldn't the cpp implementation do it by inference backwards + forwards from that frame?

Yes, that makes sense, especially if the object isn't easily visible when it first appears. The original propagation code has a reverse flag to help with this sort of thing.
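For reference, with the original Python predictor the backwards + forwards pass looks something like this (config/checkpoint names and the point are just examples):

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Example config/checkpoint names; substitute your own.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path="./video_frames")

# Prompt on a middle frame where the object is clearly visible.
predictor.add_new_points(state, frame_idx=20, obj_id=1,
                         points=np.array([[320, 240]], dtype=np.float32),
                         labels=np.array([1], dtype=np.int32))

# Track backwards from the prompted frame, then forwards from it.
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state, start_frame_idx=20, reverse=True):
    pass  # collect/save masks for frames 20, 19, ..., 0
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state, start_frame_idx=20, reverse=False):
    pass  # collect/save masks for frames 20, 21, ...
```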


okideal commented Nov 18, 2024

Has anyone encountered precision mismatch issues when converting models to TensorRT? In my case, all outputs from the image_encoder (except for vision_pos_embed) fail to match the ONNX outputs.
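For anyone wanting to reproduce the comparison, this is roughly the kind of check I mean (file names are placeholders; TensorRT outputs dumped to .npy from the C++ side, diffed against an FP32 ONNX Runtime run):

```python
import numpy as np
import onnxruntime as ort

# Run the exported image_encoder in FP32 with ONNX Runtime as the reference.
sess = ort.InferenceSession("image_encoder.onnx", providers=["CPUExecutionProvider"])
image = np.load("preprocessed_input.npy")  # same preprocessed input fed to the TRT engine
onnx_outs = sess.run(None, {sess.get_inputs()[0].name: image})

# Compare each output against the tensor dumped from the TensorRT side.
for out_info, onnx_out in zip(sess.get_outputs(), onnx_outs):
    trt_out = np.load(f"trt_{out_info.name}.npy")
    diff = np.abs(onnx_out - trt_out)
    print(f"{out_info.name}: max abs diff {diff.max():.6f}, mean {diff.mean():.6f}")
```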
