
Large video and loading in batches #545

Open
horsto opened this issue Jan 19, 2025 · 8 comments

horsto commented Jan 19, 2025

I want to segment very large video files - and have come across #264 for possible solutions.
Running with vos_optimized=True gives me some headaches (#501).

I wonder if there is an easier way to handle these things that is also more general. For example, is it possible to load the video in batches and somehow make use of the memory created for one tracked batch in the next, etc.?
What would have to be done to do that?
If I initialize only, say, the first 100 frames of one video (let's say it has 10,000 frames total), what are the issues for loading and predicting in batches?

heyoeyo commented Jan 20, 2025

Most of the important memory functionality is handled inside the prepare_memory_conditioned_features function. More specifically, it's the cond_frame_outputs and non_cond_frame_outputs. The cond_frame_outputs is the memory associated with prompts that you've given the model, while the non_cond_frame_outputs is the sort of short-term running memory that isn't the result of prompting (it helps the model keep track of an object that has changed appearance compared to when it was prompted).

If you wanted to split things in batches but carry over the tracking memory, you'd need to manually store/restore this memory data, while potentially spoofing the frame indexing to make it appear as though that memory occurred in the past, even though each batch would see itself starting from frame 0.

That function pulls all of the memory data out of an output_dict variable, which is actually taken from the inference_state under the key output_dict_per_obj along with a per-object indexing key (i.e. each tracked object has its own memory that would need to be recorded/restored separately). So most likely you'd want to pull the data out of the inference_state variable at the end of each batch and then re-inject it into the state for the next batch.
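
For illustration, the carry-over could look roughly like this (completely untested; the keys/structure are the ones described above and may differ between SAM2 versions, so double-check against init_state() in sam2_video_predictor.py in your install):

```python
# Untested sketch of "store memory at the end of one batch, re-inject into the next".
# Assumes inference_state["output_dict_per_obj"] holds, per object, dicts of
# "cond_frame_outputs" and "non_cond_frame_outputs" keyed by frame index.

def extract_memory(inference_state):
    """Copy the per-object memory dicts at the end of a batch."""
    return {
        obj_idx: {
            "cond_frame_outputs": dict(obj_out["cond_frame_outputs"]),
            "non_cond_frame_outputs": dict(obj_out["non_cond_frame_outputs"]),
        }
        for obj_idx, obj_out in inference_state["output_dict_per_obj"].items()
    }


def inject_memory(inference_state, saved_memory, frame_offset):
    """Put saved memory into a fresh state for the next batch, shifting the frame
    indices by frame_offset (the number of frames already processed) so the memory
    appears to lie before frame 0 of the new batch."""
    for obj_idx, mem in saved_memory.items():
        obj_out = inference_state["output_dict_per_obj"].setdefault(
            obj_idx, {"cond_frame_outputs": {}, "non_cond_frame_outputs": {}}
        )
        for key in ("cond_frame_outputs", "non_cond_frame_outputs"):
            for frame_idx, out in mem[key].items():
                obj_out[key][frame_idx - frame_offset] = out
```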

horsto commented Jan 20, 2025

Thanks for the thorough reply @heyoeyo !
I see. I think batch processing will actually give me a lot of headaches if, for example, it should be both reverse as well as forward compatible: as I understand it, if I refine a mask (-> cond_frame_outputs) in a later batch, then reverse tracking should backwards-refine everything that happened before the refinement(?).

I wonder more generally what exactly the bottleneck is for out-of-memory errors for (very) large video files:

  • In SAM2VideoPredictor -> init_state, all frames are loaded into the dictionary. That seems excessive. Maybe a random subset of all frames of the long video could be used?
  • For the non_cond_frame_outputs memory refinement: is that limited, or can it be limited so as not to include the history of "all" frames in the video?

Mostly just thinking out loud here ...

heyoeyo commented Jan 21, 2025

it should be both reverse as well as forward compatible

Ya that could be complicated if using batches. Maybe the simplest approach would be to manage the reversing entirely in the batching code, instead of using the built-in support in the SAM model. That way it's just a matter of providing the frames/batches in reverse order while the SAM model always sees it as going forward, so no changes are needed to the model code.
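
For example, the batching code could just hand the model a reversed list of frames and map the indices back afterwards. A rough sketch (assuming frames have already been extracted to image files; how you actually load frames will differ):

```python
import os

def reversed_batch(frame_dir, start, stop):
    """Build a 'reverse tracking' batch: frame paths in reverse chronological order,
    plus a map from the model's frame index back to the original video frame index."""
    frame_files = sorted(os.listdir(frame_dir))
    batch = list(reversed(frame_files[start:stop]))
    index_map = {i: stop - 1 - i for i in range(len(batch))}  # model index -> video index
    return [os.path.join(frame_dir, f) for f in batch], index_map
```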

what exactly the bottleneck is for out-of-memory errors

There are 3 main sources of memory use:

  1. The model itself. This is on the order of 1GB and isn't really avoidable.

  2. Like you mentioned, the encoded frames are all computed & cached in advance of any processing by default. It's a few MB per frame, so long videos can easily take dozens (or even hundreds) of GB of VRAM (e.g. a 10,000-frame video at ~4 MB per frame is already ~40 GB). Batching would limit the number of frames, so processing can run without using up all the memory. Alternatively, the async changes mentioned in that linked issue only load/process frames as they're used (without caching), so that also prevents excessive memory use (at the expense of having to re-compute frames if they're revisited).

  3. Storage for both cond_frame_outputs and non_cond_frame_outputs is unlimited. Some people have posted approaches where they use a detector model to re-prompt the SAM model on every frame, which leads to a large amount of memory use due to the build-up of cond_frame_outputs. Likewise, the non_cond_frame_outputs is cached for every frame, which will eventually use up all the VRAM on long-running videos. Batching can again prevent this build-up, though it needs to be done carefully if you carry the memory forward. With the default config, the model only uses the last 6 memory encodings and 15 or 16 'object pointers', so the older cached non_cond_frame_outputs can be cleared to prevent memory build-up (see issue #196, "Are there any method for reducing gpu memory overhead?", though the changes there might not account for reversing properly!). A rough sketch of this pruning idea follows below.
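
Untested and just illustrative: the inference_state keys are the ones described above, and keep_last is sized as a guess to cover the 6 memory encodings plus the ~16 object pointers. It also doesn't handle reverse tracking or a non-default memory stride.

```python
# After tracking a frame, drop cached non_cond_frame_outputs that are too old to be
# read again. Key names may differ between SAM2 versions.
def prune_old_memory(inference_state, current_frame_idx, keep_last=16):
    for obj_out in inference_state["output_dict_per_obj"].values():
        non_cond = obj_out["non_cond_frame_outputs"]
        for frame_idx in list(non_cond.keys()):
            if frame_idx < current_frame_idx - keep_last:
                non_cond.pop(frame_idx)
```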

horsto commented Jan 21, 2025

the encoded frames are all computed & cached in advance of any processing by default

Is this only to make processing faster (have the frames ready...) or is there a benefit for the SAM model to encode the info of all frames (as opposed to only a subset)?

With the default config, the model only uses the last 6 memory encodings and 15 or 16 'object pointers', so the older cached non_cond_frame_outputs can be cleared to prevent memory build-up

This is great. I actually did not find that info (6 / 15-16) in the .yaml configs for each model. Is that baked into the SAM code somewhere? This is super useful to know.

Thanks, @heyoeyo!

heyoeyo commented Jan 21, 2025

is there a benefit for the SAM model to encode the info of all frames

I'd guess it's originally there because it helps speed up training (i.e. iteratively re-running the same input with modified prompts). It should also help speed up interactions in their demo, for example when jumping back-and-forth around the video, by not having to repeatedly encode the same frame every time it's visited. Other than that, it shouldn't have any effect on the functionality of the model (e.g. caching vs. not caching will produce the same segmentation results).

Is that baked into the SAM code somewhere

The object pointer setting is part of the model init (and used in that memory processing function). It doesn't look like it's part of the yaml configs, but could be added to allow it to be modified (the setting has very little effect on the output, pointers can be fully disabled without any obvious consequence).

The '6 previous frames' behavior is indirectly set by the model init, but it controls the size of a learned embedding, so it can't be changed without re-training the model. However, there's a stride setting that can be adjusted, so that the 6 memory frames are taken at 'every other' or 'every third' (etc.) spacing, rather than being the 6 most recent consecutive frames. It's also possible to manually modify the code that reads the previous frames to use more or fewer than 6, though going above 6 requires some extra checks to avoid indexing incorrectly into that learned embedding.
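
If you want to poke at these settings, something like the following should show where they live. The attribute names (num_maskmem, max_obj_ptrs_in_encoder, memory_temporal_stride_for_eval) are how I remember them from the model init in the sam2 repo, so treat them as assumptions and double-check modeling/sam2_base.py in your install; the config/checkpoint paths are just placeholders.

```python
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_t.yaml",  # placeholder config path
    "checkpoints/sam2.1_hiera_tiny.pt",    # placeholder checkpoint path
)

print(predictor.num_maskmem)                   # 7 = 1 conditioning frame + 6 previous frames (assumed name)
print(predictor.max_obj_ptrs_in_encoder)       # number of object pointers, ~16 (assumed name)
predictor.memory_temporal_stride_for_eval = 2  # e.g. use every 2nd previous frame instead of consecutive ones (assumed name)
```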

horsto commented Jan 21, 2025

Excellent, thank you for these helpful explanations!

e.g. caching vs. not caching will produce the same segmentation results

That is what I was wondering about. Great.

For object pointer and 6 previous frame behavior:
I guess their baked-in settings are a good way to proceed (I wonder what changes when you set all of these to, let's say, three times their original size - will prediction get better? But I guess they settled on some kind of optimum here).

I can now envision a scenario where I do some lazy loading of frames to circumvent loading all frames from the start, and I do use the pruning method that you shared earlier (and in #196 (comment)) to minimize overhang. I am thinking about writing the extracted polygons/masks to disk so they don't have to be recalculated if no further input has been given, i.e. on-the-fly loading of both video frames and already-calculated masks.
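
Something like this for the mask-caching part (just a sketch; compute_fn stands in for however the mask actually gets produced by SAM2):

```python
import os
import numpy as np

def get_mask(cache_dir, frame_idx, compute_fn):
    """Return the mask for a frame, loading it from disk if it was already computed,
    otherwise computing it (e.g. by running SAM2 on that frame) and caching it."""
    path = os.path.join(cache_dir, f"mask_{frame_idx:06d}.npz")
    if os.path.exists(path):
        return np.load(path)["mask"]
    mask = compute_fn(frame_idx)  # placeholder for the SAM2 call
    np.savez_compressed(path, mask=mask)
    return mask
```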

heyoeyo commented Jan 21, 2025

I wonder what changes when you set all these to, let's say, three times their original size - will prediction get better?

From what I've seen, changing the history size doesn't have much of an effect on the segmentation quality. Instead it has an effect on whether the tracking will stay on or jump off objects as they change appearance. There's also a noticeable performance hit when using longer histories. It seems the defaults are a good balance. Here are some examples running the tiny model with different history counts on a video I got off the MedSAM2 demo (seems to be permanently down though):

1 frame history (breaks at the end): [animation omitted]

6 frame history (the default): [animation omitted]

32 frame history (slows down as history accumulates): [animation omitted]

The default (6) works well in this case (using 3 also works fine). Using too little (0, 1 or 2) causes problems, using a lot more (32) doesn't help, but I'm sure this is all video dependent.

lazy loading of frames to circumvent loading all frames from the start and I do use the pruning method that you shared earlier to minimize overhang

I'd agree that's a better approach than batching (as long as you don't need to re-use the information within the batches).

horsto commented Jan 21, 2025

The default (6) works well in this case (using 3 also works fine). Using too little (0, 1 or 2) causes problems, using a lot more (32) doesn't help

very informative... and makes intuitive sense.
