Replies: 4 comments
-
This is codec dependent. E.g., VP9 has a concept called superframes, with multiple frames in one chunk; see section 5.26 in the spec. AFAIK there isn't a similar concept for H.264. In the codec registry we say:

> Generally not for decoders, but encoders may use the one on

It's a sanity check -- many hardware decoders don't properly support other types. Chrome only checks the first frame.

You're right that there are reconstruction mechanisms other than IDR, but per the registry, a chunk with `type` `key` must be an IDR frame.
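To illustrate the superframe concept mentioned above: a VP9 superframe carries an index in its last bytes, bracketed by a marker byte of the form `0b110xxxxx` that encodes the frame count and the width of the size fields. A minimal sketch (the function name is made up, and a real demuxer would validate more):

```javascript
// Sketch: read the sub-frame sizes from a VP9 superframe index, if present.
// The marker byte appears both at the start and the end of the index.
function parseSuperframeIndex(buf) {
  const marker = buf[buf.length - 1];
  if ((marker & 0xe0) !== 0xc0) return null; // not a superframe
  const frames = (marker & 0x07) + 1;        // 1..8 sub-frames
  const sizeBytes = ((marker >> 3) & 0x03) + 1; // 1..4 bytes per size
  const indexSize = 2 + frames * sizeBytes;
  if (buf.length < indexSize || buf[buf.length - indexSize] !== marker) {
    return null; // leading marker must match the trailing one
  }
  // Each sub-frame size is little-endian.
  const sizes = [];
  let p = buf.length - indexSize + 1;
  for (let f = 0; f < frames; f++) {
    let size = 0;
    for (let b = 0; b < sizeBytes; b++) size |= buf[p++] << (8 * b);
    sizes.push(size);
  }
  return sizes;
}
```

A chunk whose last byte doesn't match the marker pattern is a single frame, which is why the H.264 case has no equivalent ambiguity.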
-
Thanks, I appreciate the answers and the links to the documents; it seems I had still missed quite a few. Regarding the keyframe, I read up on the spec a bit more ;). It seems that an I-frame with a "recovery point SEI message" (D.2.8 in the H.264 spec) should be allowed as a start of decoding as well. The streams I work with (from a camcorder) have I-frames with recovery point SEI messages every 12 frames, but IDR frames only once every 600-ish frames (in addition to the (out-of-spec) fact that they start with a non-IDR I-frame). For random access/seeking in a video player, having to decode up to 599 frames (worst case) would be a considerable lag for the user.
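For what it's worth, finding those candidate random-access points in an Annex B stream (IDR slices, plus SEI NAL units whose first message is a recovery point) can be sketched like this. The names are made up, and start-code emulation prevention and multi-message SEI NALs are ignored for brevity:

```javascript
// Locate NAL units by their 00 00 01 start codes and record the
// nal_unit_type from the header byte (low 5 bits).
function findNalUnits(buf) {
  const nals = [];
  for (let i = 0; i + 3 < buf.length; i++) {
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) {
      const offset = i + 3; // first byte after the start code = NAL header
      nals.push({ offset, type: buf[offset] & 0x1f });
      i += 2;
    }
  }
  return nals;
}

// SEI payloadType is coded as a run of 0xFF bytes plus one final byte;
// payloadType 6 is recovery_point (D.2.8).
function isRecoveryPointSei(buf, nalOffset) {
  let i = nalOffset + 1, payloadType = 0;
  while (buf[i] === 0xff) { payloadType += 255; i++; }
  payloadType += buf[i];
  return payloadType === 6;
}

// nal_unit_type 5 = IDR slice, 6 = SEI.
function randomAccessCandidates(buf) {
  return findNalUnits(buf).filter(
    (n) => n.type === 5 || (n.type === 6 && isRecoveryPointSei(buf, n.offset))
  );
}
```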
-
One clarification on
-
Dan said: "VP9 has a concept called super frames with multiple frames in it... AFAIK there isn't a similar concept for H.264/AVC. In the codec registry we say..."

[BA] VP9 and AV1 support spatial scalability, whereas H.264/AVC does not (only temporal). One concern I have is whether the WebCodecs and Encoded Transform specifications align on this issue. For example, RTCEncodedVideoFrameMetadata includes support for both
-
I have some questions about the `EncodedVideoChunk` that (afaict) are not addressed in the spec. Maybe I missed some additional information somewhere; if so, please point me in the right direction.

The specific case I'm referring to is decoding an (Annex B) H.264 stream.
**Should an `EncodedVideoChunk` always contain one frame?**

Given the `type` field (indicating whether it's a key frame) and the `timestamp` field, it feels like a chunk points to specifically one frame. In my experience (in Chrome 110), if you split your frame into two `EncodedVideoChunk`s and call `videoDecoder.decode()` twice, either the first call will get you a `VideoFrame` with half the data and the second call will fail, or the first call will fail (i.e. the decoder seems to expect that one call to `decode()` carries (at least) one full `VideoFrame` of data). Likewise, if you feed two full frames of data into one `EncodedVideoChunk`, the decoder seems to decode the first `VideoFrame` and ignore the second. Issue #38 suggests that the spec says it should always be one frame; I would be grateful if someone could point me to the right spot (so I can ask MDN to add it to their documentation).
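Until someone points me at the normative text, here is roughly how I pre-split an Annex B buffer so each chunk holds exactly one frame. This is a sketch with made-up names; it assumes one slice per frame (common for camcorder/webcam streams, but NOT guaranteed by H.264), and it attaches non-VCL NAL units (SPS/PPS/SEI) to the slice that follows them:

```javascript
// Split an Annex B buffer into per-frame chunks, one VCL NAL per frame.
function splitFrames(buf) {
  // Locate every NAL unit: [startOfStartCode, offsetOfNalHeaderByte].
  const bounds = [];
  for (let i = 0; i + 3 < buf.length; i++) {
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) {
      // Fold the leading zero of a 4-byte start code into the unit.
      bounds.push([i > 0 && buf[i - 1] === 0 ? i - 1 : i, i + 3]);
      i += 2;
    }
  }

  const frames = [];
  let auStart = 0;
  for (let k = 0; k < bounds.length; k++) {
    const nalType = buf[bounds[k][1]] & 0x1f; // low 5 bits of NAL header
    if (nalType === 1 || nalType === 5) {
      // A VCL NAL ends the access unit (one-slice-per-frame assumption).
      const end = k + 1 < bounds.length ? bounds[k + 1][0] : buf.length;
      frames.push({ data: buf.subarray(auStart, end), key: nalType === 5 });
      auStart = end;
    }
  }
  return frames;
}
```

Each entry can then be wrapped as `new EncodedVideoChunk({ type: f.key ? 'key' : 'delta', timestamp, data: f.data })` -- at least, that's my reading of the API.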
**Does the provided `timestamp` do anything?**

In my experience (as also described in #565), the `timestamp` field seems to be unused while decoding. Is there actually a use case where the timestamp is useful in this context? (Otherwise, wouldn't it make more sense to make the field optional for decoding -- or to document that it can contain anything?)
**Does the provided `type` do anything (useful)?**

In my experience (in Chrome 110), the `type` field seems to be completely ignored. As far as I can tell this is not entirely per spec (`VideoDecoder.decode()` states "If chunk.type is not key, throw a DataError."), but that is what it does. It feels to me that `type` is not really doing anything useful in this context, since the decoder has to check anyway whether the content is actually a key frame. It does make the load on the webapp higher, because I have to manually parse my bytestream to see whether the next frame is a keyframe or not.

**(Finally, related but not entirely on the subject of `EncodedVideoChunk`:) what is a key chunk anyway?**
The spec says: "An encoded chunk that does not depend on any other frames for decoding". This sounds like an I-frame to me.
However, my Chrome implementation seems to demand an IDR frame, which (afaik) is an I-frame with the additional property that no subsequent frame will refer to any frame preceding it; I could also see that making sense.
Is this something the spec (or people here) has an opinion on?
(I ask because I have some camcorder files whose first IDR frame appears only after about 500 frames, including some 20 I-frames, and VLC plays those first 500 frames fine. I don't know the H.264 spec well enough to tell whether this is per spec or a violation.)
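In case it helps the discussion: telling an IDR apart from a non-IDR I-frame takes the NAL unit type plus the `slice_type` field from the slice header (read with Exp-Golomb coding, clause 9.1 of the H.264 spec). A minimal sketch with made-up names; emulation-prevention bytes are ignored, which is usually harmless for the first few header bits but not strictly correct:

```javascript
// Bit reader over a byte buffer, MSB first.
class BitReader {
  constructor(buf, byteOffset) { this.buf = buf; this.pos = byteOffset * 8; }
  bit() {
    if (this.pos >> 3 >= this.buf.length) throw new RangeError('out of bits');
    const b = (this.buf[this.pos >> 3] >> (7 - (this.pos & 7))) & 1;
    this.pos++;
    return b;
  }
  ue() { // unsigned Exp-Golomb: count leading zeros, then read that many bits
    let zeros = 0;
    while (this.bit() === 0) zeros++;
    let v = 1;
    for (let i = 0; i < zeros; i++) v = (v << 1) | this.bit();
    return v - 1;
  }
}

// nal_unit_type 5 is always an IDR slice; for type 1, slice_type 2 or 7
// (i.e. slice_type % 5 === 2) marks an I slice that is NOT an IDR.
function classifySlice(buf, nalHeaderOffset) {
  const nalType = buf[nalHeaderOffset] & 0x1f;
  if (nalType === 5) return 'IDR';
  if (nalType !== 1) return 'not a slice';
  const r = new BitReader(buf, nalHeaderOffset + 1);
  r.ue(); // first_mb_in_slice (unused here)
  const sliceType = r.ue();
  return sliceType % 5 === 2 ? 'I (non-IDR)' : 'P/B';
}
```

So a stream like mine has many `'I (non-IDR)'` slices between rare `'IDR'` ones, which is exactly the distinction the spec's "key chunk" wording leaves ambiguous.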
Finally, thanks for all the work on this, I do appreciate your efforts! I hope this place is the correct spot to put these questions; if you prefer to have them as issues or something else, let me know!