Large video and loading in batches #545
Most of the important memory functionality is handled inside the prepare_memory_conditioned_features function. More specifically, it's the cond_frame_outputs and non_cond_frame_outputs that the function pulls the memory data from. If you wanted to split things into batches but carry over the tracking memory, you'd need to manually store/restore this memory data, while potentially spoofing the frame indexing to make it appear as though that memory occurred in the past, even though each batch would see itself as starting from frame 0.
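A hedged sketch of what that store/restore could look like, assuming the memory lives in a per-session dictionary with cond_frame_outputs / non_cond_frame_outputs entries keyed by frame index (names taken from one reading of the SAM2 source; the exact structure may differ between versions, and the helper functions here are hypothetical):

```python
# Hypothetical sketch, not part of the SAM2 API: copy tracking memory out of one
# batch's inference state and re-insert it into the next batch's state under
# shifted (negative) frame indices, so the new batch treats it as past frames.
def export_memory(inference_state):
    out = inference_state["output_dict"]  # name assumed from the SAM2 source
    return {
        "cond": dict(out["cond_frame_outputs"]),
        "non_cond": dict(out["non_cond_frame_outputs"]),
    }

def import_memory(inference_state, saved, frames_seen_so_far):
    # Shift indices so e.g. the last frame of the previous batch appears as
    # frame -1 relative to frame 0 of the new batch.
    out = inference_state["output_dict"]
    for idx, entry in saved["cond"].items():
        out["cond_frame_outputs"][idx - frames_seen_so_far] = entry
    for idx, entry in saved["non_cond"].items():
        out["non_cond_frame_outputs"][idx - frames_seen_so_far] = entry
```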
Thanks for the thorough reply @heyoeyo! I wonder more generally what exactly the bottleneck is for out-of-memory errors with (very) large video files:
Mostly just thinking out loud here...
Ya that could be complicated if using batches. Maybe the simplest approach would be to manage the reversing entirely in the batching code, instead of using the built-in support in the SAM model. That way it's just a matter of providing the frames/batches in reverse order while the SAM model always sees it as going forward, so no changes are needed to the model code.
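As a rough illustration of that idea (purely a sketch; load_frame and run_tracking are stand-ins for whatever loading/tracking code you already have), the batching loop can walk the video backwards while each batch is handed to the model as an ordinary forward sequence:

```python
def iter_reversed_batches(num_frames, batch_size):
    """Yield lists of frame indices that walk the video backwards.

    Each yielded list is in the order the model should consume it, so the
    SAM model always 'sees' a forward-running video.
    """
    start = num_frames
    while start > 0:
        stop = max(0, start - batch_size)
        yield list(range(start - 1, stop - 1, -1))
        start = stop

# Example: a 10,000-frame video processed in reverse, 100 frames at a time.
# load_frame / run_tracking are hypothetical placeholders.
# for frame_indices in iter_reversed_batches(10_000, 100):
#     frames = [load_frame(i) for i in frame_indices]
#     run_tracking(frames)
```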
There are 3 main sources of memory use:
Is this only to make processing faster (have the frames ready...) or is there a benefit for the SAM model in encoding the info of all frames (as opposed to a subset only)?
This is great. I actually did not find that info (6 / 15-16) in the .yaml configs for each model. Is that baked into the SAM code somewhere? This is super useful to know. Thanks, @heyoeyo!
I'd guess it's originally there because it helps speed up training (i.e. iteratively re-running the same input with modified prompts). It should also help speed up interactions in their demo, for example when jumping back-and-forth around the video, by not having to repeatedly encode the same frame every time it's visited. Other than that, it shouldn't have any effect on the functionality of the model (e.g. caching vs. not caching will produce the same segmentation results).
The object pointer setting is part of the model init (and used in that memory processing function). It doesn't look like it's part of the yaml configs, but could be added to allow it to be modified (the setting has very little effect on the output, pointers can be fully disabled without any obvious consequence).

The '6 previous frames' behavior is indirectly set by the model init, but it controls the size of a learned embedding, so it can't be changed without re-training the model. However, there's a stride setting that can be adjusted, so that the model will use 'every other' or 'every third' etc. previous 6 frames, rather than the most recent 6 consecutive frames. It's also possible to manually modify the code that does the reading of the previous frames to have it do more or less than 6, though going above 6 requires some extra checks to avoid indexing incorrectly into that learned embedding.
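For reference, here's roughly how that stride could be adjusted at build time. In the copy of the SAM2 code I've looked at, the relevant init argument is called memory_temporal_stride_for_eval and build_sam2_video_predictor accepts extra hydra overrides; treat both names (and the config/checkpoint paths) as assumptions to check against your installed version:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config/checkpoint names assumed from the SAM 2.1 release; adjust to your setup.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml",
    "checkpoints/sam2.1_hiera_tiny.pt",
    device="cuda" if torch.cuda.is_available() else "cpu",
    hydra_overrides_extra=[
        # Use every 2nd past frame for the 6-frame memory, instead of consecutive frames
        "++model.memory_temporal_stride_for_eval=2",
    ],
)
```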
Excellent, thank you for these helpful explanations!
That is what I was wondering about. Great. Regarding the object pointer and 6-previous-frame behavior: I can now envision a scenario where I do some lazy loading of frames to avoid loading all frames from the start, and use the pruning method that you shared earlier (and in #196 (comment)) to minimize memory overhead. I am thinking about writing the extracted polygons/masks to disk so they don't have to be recalculated if no further input has been given, i.e. on-the-fly loading of both video frames and already-calculated masks.
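A minimal sketch of that mask cache (assuming masks come back as boolean numpy arrays keyed by frame index and object id; the file layout and helper names here are just illustrative):

```python
import numpy as np
from pathlib import Path

CACHE_DIR = Path("mask_cache")
CACHE_DIR.mkdir(exist_ok=True)

def mask_path(frame_idx, obj_id):
    return CACHE_DIR / f"frame{frame_idx:06d}_obj{obj_id}.npy"

def save_mask(frame_idx, obj_id, mask):
    # Store the mask as a compact boolean array on disk
    np.save(mask_path(frame_idx, obj_id), mask.astype(bool))

def load_mask_if_cached(frame_idx, obj_id):
    # Returns None when the mask hasn't been computed (or was invalidated)
    p = mask_path(frame_idx, obj_id)
    return np.load(p) if p.exists() else None
```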
From what I've seen, changing the history size doesn't have much of an effect on the segmentation quality. Instead it has an effect on whether the tracking will stay on or jump off objects as they change appearance. There's also a noticeable performance hit when using longer histories. It seems the defaults are a good balance. Here are some examples running the tiny model with different history counts on a video I got off the MedSAM2 demo (seems to be permanently down though):

- 1 frame history: breaks at the end
- 32 frame history: slows down as history accumulates

The default (6) works well in this case (using 3 also works fine). Using too little (0, 1 or 2) causes problems, using a lot more (32) doesn't help, but I'm sure this is all video dependent.
I'd agree that's a better approach than batching (as long as you don't need to re-use the information within the batches).
Very informative... and makes intuitive sense.
I want to segment very large video files and have come across #264 for possible solutions.
Running with vos_optimized=True gives me some headaches (#501). I wonder if there is an easier way to handle these things that is also more general. For example, is it possible to load the video in batches and somehow make use of the memory created for one tracked batch in the next, etc.?
What would have to be done to do that?
If I initialize with only, say, the first 100 frames of one video (let's say it has 10,000 frames total), what are the issues for loading and predicting in batches?