Core APIs do not demux, and the stream_index parameter has (almost) no effect #476

Closed
NicolasHug opened this issue Jan 24, 2025 · 2 comments · Fixed by #509

Alternative title: The C++ and core ops work fine as long as we add only one stream. They break if we add more than one stream.

Example 1:

from torchcodec.decoders import _core as core

# This video has stream 0 with dimensions torch.Size([3, 180, 320]) and stream 3 with dimensions torch.Size([3, 270, 480])
decoder = core.create_from_file("test/resources/nasa_13013.mp4")
core.add_video_stream(decoder, stream_index=0)
core.add_video_stream(decoder, stream_index=3)

for frame_index in range(100):
    frame, _, _ = core.get_frame_at_index(decoder, stream_index=0, frame_index=frame_index)
    print(frame.shape)  # torch.Size([3, 270, 480]). This is stream 3, not stream 0.

Example 2:

from torchcodec.decoders import _core as core

decoder = core.create_from_file("test/resources/nasa_13013.mp4")
core.add_video_stream(decoder, stream_index=0)

frame, _, _ = core.get_frame_at_index(decoder, stream_index=3, frame_index=5)  # This should error but doesn't
print(frame.shape)  # torch.Size([3, 180, 320]). This is stream 0, not stream 3.

None of the core APIs or C++ APIs actually do demuxing. That is, the stream_index parameter is never used to filter and select frames; it is only ever used to seek.

This may be clearer by looking at the call stack of our decoding entry points.

[Image: call stack of the decoding entry points]

All but one rely on getFrameAtIndexInternal, which will use the streamIndex to set the cursor:

// The streamIndex is only used here, to seek: streamInfo is the info for streamIndex.
setCursorPtsInSeconds(ptsToSeconds(pts, streamInfo.timeBase));
// The next decoded frame is returned no matter which stream it came from.
return getNextFrameNoDemuxInternal(preAllocatedOutputTensor);

but then immediately return whatever frame getNextFrameNoDemuxInternal() produces, and that function doesn't demux anything.


scotts commented Jan 25, 2025

While digging into this, I actually wrote PR #481 to get my head around the logic of the core decoding loop. The core decoding loop in getAVFrameUsingFilterFunction() basically does the following in a while (true) (a sketch follows the list):

  1. Iterate through all active streams, calling avcodec_receive_frame() on each. The first time the call succeeds, we record the stream index it succeeded on.
  2. Pass that stream index and the returned AVFrame to our filter function. If the function returns true, we break out of the outer while loop: we have the frame we return to our caller.
  3. Otherwise, we don't have the frame we're looking for, so we read more packets with av_read_frame() and send them to the decoder with avcodec_send_packet().
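
To make that concrete, here is a minimal sketch of that loop shape under FFmpeg's send/receive decode API. The name getFrameUsingFilter, the codecContexts map, and the filter signature are assumptions for illustration, not torchcodec's actual code:

// Minimal sketch of the loop described above -- illustrative only, not
// torchcodec's actual getAVFrameUsingFilterFunction(). codecContexts is
// an assumed map from stream index to an open decoder; filter is the
// caller-supplied predicate.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
}
#include <functional>
#include <map>

int getFrameUsingFilter(
    AVFormatContext* formatContext,
    std::map<int, AVCodecContext*>& codecContexts,
    const std::function<bool(int, AVFrame*)>& filter,
    AVFrame* avFrame,
    AVPacket* packet) {
  while (true) {
    // Step 1: try to drain an already-decoded frame from any active
    // stream. This is *expected* to fail with EAGAIN on the first
    // iteration, before any packet has been sent.
    int gotStreamIndex = -1;
    int status = AVERROR(EAGAIN);
    for (auto& [streamIndex, codecContext] : codecContexts) {
      status = avcodec_receive_frame(codecContext, avFrame);
      if (status == 0) {
        gotStreamIndex = streamIndex;
        break;
      }
    }
    // Step 2: if we got a frame, ask the filter whether it's the one
    // we're looking for.
    if (status == 0 && filter(gotStreamIndex, avFrame)) {
      return 0;  // avFrame now holds the frame for the caller.
    }
    // Step 3: no luck yet; read another packet, send it to the decoder
    // for its stream, and loop back to step 1.
    if (av_read_frame(formatContext, packet) < 0) {
      return AVERROR_EOF;  // No more packets.
    }
    auto it = codecContexts.find(packet->stream_index);
    if (it != codecContexts.end()) {
      avcodec_send_packet(it->second, packet);
    }
    av_packet_unref(packet);
  }
}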

What took me some time to realize is that we expect steps 1 and 2 to fail the first time through the loop. Most examples start with reading packets, decoding them, and then getting out a frame. Certainly that's the order in which things must happen, but we invert that logic.

Some thoughts:

  1. We could do demuxing in the filter function, by just comparing stream indices (see the sketch after this list).
  2. But if we know which stream we want to read from, it seems wasteful to bother reading any other streams at all.
  3. This logic does seem like it would be efficient for reading synced audio and video, but I don't know if that's something we want to do.
  4. I want to explore whether we can just make demuxing always happen, with everything taking a stream index. We might need to change some core APIs. And we'd lose some of the generality we currently have, but we're not taking advantage of it, and I don't know if we ever will.
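
To make thought 1 concrete, such a filter could be as small as the sketch below. desiredStreamIndex and the (int, AVFrame*) signature are assumptions carried over from the loop sketch above, not torchcodec's actual filter type:

int desiredStreamIndex = 3;  // assumed: the stream we want frames from
auto demuxFilter = [desiredStreamIndex](int frameStreamIndex, AVFrame* /*avFrame*/) {
  // Keep only frames decoded from the desired stream. Note that frames
  // from other streams are still fully decoded before being rejected here.
  return frameStreamIndex == desiredStreamIndex;
};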


NicolasHug commented Jan 27, 2025

Most examples start with reading packets, decoding them, and then getting out a frame. Certainly that's the order in which things must happen, but we invert that logic.

I noticed that too. When I asked @ahmadsharif1 why it was done in this order, he said it's because the frame we want may be in a packet that we already sent to the decoder. In that case, we don't want to send a new packet, because that would be wasteful. I was surprised at first too because this isn't how the examples I've seen are written, but it makes sense to me.

We could do demuxing in the filter function, by just comparing stream indices.

We could, but that wouldn't be efficient. We call the filter function on an AVFrame, but the stream index is known as soon as we get the packet. If we were to demux within the filter function, that means we would have to decode all the frames, including those that aren't from the stream we want. I think we'll want to demux at the packet level instead, so that we can avoid decoding the frames that aren't from the targeted stream (sketched below).
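
To illustrate, here is step 3 of the loop sketch above with packet-level demuxing added; desiredStreamIndex is again an assumed parameter rather than anything in the current code:

// Step 3 of the loop sketch, now demuxing at the packet level.
if (av_read_frame(formatContext, packet) < 0) {
  return AVERROR_EOF;  // No more packets.
}
if (packet->stream_index != desiredStreamIndex) {
  // The packet belongs to another stream: discard it before decoding.
  // This is the saving over frame-level filtering, which can only
  // reject a frame after it has already been decoded.
  av_packet_unref(packet);
  continue;  // back to the top of the while (true)
}
avcodec_send_packet(codecContexts[packet->stream_index], packet);
av_packet_unref(packet);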

I want to explore whether we can just make demuxing always happen, with everything taking a stream index. We might need to change some core APIs. And we'd lose some of the generality we currently have, but we're not taking advantage of it, and I don't know if we ever will.

I think we can, even in a BC way: most of our APIs are [supposed to be] stream-specific, and those that aren't are still wrong anyway, in the sense that they won't be returning frames from any active stream. So it would be a bugfix in all cases. In terms of implementation, I'm hoping this should be as simple as filtering the AVPacket by stream index in the main decoding loop.

That being said, before we start doing this, I think we should first think hard about what we want to support for audio, and whether we want to support a mode where we decode both audio and video, because I suspect that any change we make in either direction will influence the other.
