Core APIs do not demux, and the stream_index parameter has (almost) no effect #476

Closed
NicolasHug opened this issue Jan 24, 2025 · 2 comments · Fixed by #509

Alternative title: The C++ and core ops work fine as long as we add only one stream. They break if we add more than one stream.

Example 1:

from torchcodec.decoders import _core as core

# This video has stream 0 with dimensions torch.Size([3, 180, 320]) and stream 3 with dimensions torch.Size([3, 270, 480])
decoder = core.create_from_file("test/resources/nasa_13013.mp4")
core.add_video_stream(decoder, stream_index=0)
core.add_video_stream(decoder, stream_index=3)

for frame_index in range(100):
    frame, _, _ = core.get_frame_at_index(decoder, stream_index=0, frame_index=frame_index)
    print(frame.shape)  # torch.Size([3, 270, 480]). This is stream 3, not stream 0.

Example 2:

from torchcodec.decoders import _core as core

decoder = core.create_from_file("test/resources/nasa_13013.mp4")
core.add_video_stream(decoder, stream_index=0)

frame, _, _ = core.get_frame_at_index(decoder, stream_index=3, frame_index=5)  # This should error but doesn't
print(frame.shape)  # torch.Size([3, 180, 320]). This is stream 0, not stream 3.

None of the core APIs or C++ APIs actually do demuxing. That is, the stream_index parameter is never used to filter and select frames; it is only ever used to seek.

This may be clearer by looking at the call stack of our decoding entry points.

[Image: call stack of the decoding entry points]

All but one rely on getFrameAtIndexInternal, which will use the streamIndex to set the cursor:

// The streamIndex is only used here, to seek: streamInfo is the info for streamIndex.
setCursorPtsInSeconds(ptsToSeconds(pts, streamInfo.timeBase));
// The next decoded frame is returned no matter which stream it came from.
return getNextFrameNoDemuxInternal(preAllocatedOutputTensor);

but then immediately return whatever frame getNextFrameNoDemuxInternal() produces, and that function doesn't demux anything.


scotts commented Jan 25, 2025

While digging into this, I actually wrote PR #481 to get my head around the logic of the core decoding loop. The core decoding loop in getAVFrameUsingFilterFunction() basically does the following in a while (true) (a sketch follows the list):

  1. Iterate through all active streams, calling avcodec_receive_frame() on each. The first time the call succeeds, we record the stream index it succeeded on.
  2. Pass that stream index and the returned AVFrame to our filter function. If the function returns true, we break out of the outer while loop: we have the frame we return to our caller.
  3. Otherwise, we don't have the frame we're looking for, so we read more packets with av_read_frame() and send them to the decoder with avcodec_send_packet().
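
To make that concrete, here is a minimal sketch of that loop shape under FFmpeg's send/receive decode API. The name getFrameUsingFilter, the codecContexts map, and the filter signature are assumptions for illustration, not torchcodec's actual code:

// Minimal sketch of the loop described above -- illustrative only, not
// torchcodec's actual getAVFrameUsingFilterFunction(). codecContexts is
// an assumed map from stream index to an open decoder; filter is the
// caller-supplied predicate.
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
}
#include <functional>
#include <map>

int getFrameUsingFilter(
    AVFormatContext* formatContext,
    std::map<int, AVCodecContext*>& codecContexts,
    const std::function<bool(int, AVFrame*)>& filter,
    AVFrame* avFrame,
    AVPacket* packet) {
  while (true) {
    // Step 1: try to drain an already-decoded frame from any active
    // stream. This is *expected* to fail with EAGAIN on the first
    // iteration, before any packet has been sent.
    int gotStreamIndex = -1;
    int status = AVERROR(EAGAIN);
    for (auto& [streamIndex, codecContext] : codecContexts) {
      status = avcodec_receive_frame(codecContext, avFrame);
      if (status == 0) {
        gotStreamIndex = streamIndex;
        break;
      }
    }
    // Step 2: if we got a frame, ask the filter whether it's the one
    // we're looking for.
    if (status == 0 && filter(gotStreamIndex, avFrame)) {
      return 0;  // avFrame now holds the frame for the caller.
    }
    // Step 3: no luck yet; read another packet, send it to the decoder
    // for its stream, and loop back to step 1.
    if (av_read_frame(formatContext, packet) < 0) {
      return AVERROR_EOF;  // No more packets.
    }
    auto it = codecContexts.find(packet->stream_index);
    if (it != codecContexts.end()) {
      avcodec_send_packet(it->second, packet);
    }
    av_packet_unref(packet);
  }
}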

What took me some time to realize is that we expect steps 1 and 2 to fail the first time through the loop. Most examples start with reading packets, decoding them, and then getting out a frame. Certainly that's the order in which things must happen, but we invert that logic.

Some thoughts:

  1. We could do demuxing in the filter function, by just comparing stream indices (see the sketch after this list).
  2. But if we know which stream we want to read from, it seems wasteful to bother reading any other streams at all.
  3. This logic does seem like it would be efficient for reading synced audio and video, but I don't know if that's something we want to do.
  4. I want to explore whether we can just make demuxing always happen, with everything taking a stream index. We might need to change some core APIs. And we'd lose some of the generality we currently have, but we're not taking advantage of it, and I don't know if we ever will.
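
To make thought 1 concrete, such a filter could be as small as the sketch below. desiredStreamIndex and the (int, AVFrame*) signature are assumptions carried over from the loop sketch above, not torchcodec's actual filter type:

int desiredStreamIndex = 3;  // assumed: the stream we want frames from
auto demuxFilter = [desiredStreamIndex](int frameStreamIndex, AVFrame* /*avFrame*/) {
  // Keep only frames decoded from the desired stream. Note that frames
  // from other streams are still fully decoded before being rejected here.
  return frameStreamIndex == desiredStreamIndex;
};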


NicolasHug commented Jan 27, 2025

Most examples start with reading packets, decoding them, and then getting out a frame. Certainly that's the order in which things must happen, but we invert that logic.

I noticed that too. When I asked @ahmadsharif1 why it was done in this order, he said it's because the frame we want may be in a packet that we already sent to the decoder. In that case, we don't want to send a new packet, because that would be wasteful. I was surprised at first too because this isn't how the examples I've seen are written, but it makes sense to me.

We could do demuxing in the filter function, by just comparing stream indices.

We could, but that wouldn't be efficient. We call the filter function on an AVFrame, but the stream index is known as soon as we get the packet. If we were to demux within the filter function, that means we would have to decode all the frames, including those that aren't from the stream we want. I think we'll want to demux at the packet level instead, so that we can avoid decoding the frames that aren't from the targeted stream (sketched below).
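
To illustrate, here is step 3 of the loop sketch above with packet-level demuxing added; desiredStreamIndex is again an assumed parameter rather than anything in the current code:

// Step 3 of the loop sketch, now demuxing at the packet level.
if (av_read_frame(formatContext, packet) < 0) {
  return AVERROR_EOF;  // No more packets.
}
if (packet->stream_index != desiredStreamIndex) {
  // The packet belongs to another stream: discard it before decoding.
  // This is the saving over frame-level filtering, which can only
  // reject a frame after it has already been decoded.
  av_packet_unref(packet);
  continue;  // back to the top of the while (true)
}
avcodec_send_packet(codecContexts[packet->stream_index], packet);
av_packet_unref(packet);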

I want to explore whether we can just make demuxing always happen, with everything taking a stream index. We might need to change some core APIs. And we'd lose some of the generality we currently have, but we're not taking advantage of it, and I don't know if we ever will.

I think we can, even in a BC way: most of our APIs are [supposed to be] stream-specific, and those that aren't are still wrong anyway, in the sense that they won't be returning frames from any active stream. So it would be a bugfix in all cases. In terms of implementation, I'm hoping this should be as simple as filtering the AVPacket by stream index in the main decoding loop.

That being said, before we start doing this, I think we should first think hard about what we want to support for audio, and whether we want to support a mode where we decode both audio and video, because I suspect that any change we make in either direction will influence the other.
