Re: Stateless Encoding uAPI Discussion and Proposal

From: Hsia-Jun Li
Date: Tue Aug 22 2023 - 04:30:56 EST




On 8/21/23 23:13, Nicolas Dufresne wrote:


Hello again,

I was away last week.

On Friday, August 11, 2023 at 22:08 +0200, Paul Kocialkowski wrote:
Hi Nicolas,

On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
On Thursday, August 10, 2023 at 15:44 +0200, Paul Kocialkowski wrote:
Hi folks,

On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
I am now working on a H.264 encoder driver for Allwinner platforms (currently
focusing on the V3/V3s), which already provides some usable bitstream and will
be published soon.

So I wanted to share an update on my side since I've been making progress on
the H.264 encoding work for Allwinner platforms. At this point the code supports
IDR, I and P frames, with a single reference. It also supports GOP (both closed
and open with IDR or I frame interval and explicit keyframe request) but uses
QP controls and does not yet provide rate control. I hope to be able to
implement rate-control before we can make a first public release of the code.

Just a reminder that we will code review the API first; the supporting
implementation will just be a companion. So in this context, the sooner the
better for an RFC here.

I definitely want to have some proposal that is (even vaguely) agreed upon
before proposing patches for mainline, even at the stage of RFC.

While I already have working results at this point, the API that is used is
very basic and just reuses controls from stateful encoders, with no extra
addition. Various assumptions are made in the kernel and there is no real
reference management, since the previous frame is always expected to be used
as the only reference.

One thing we are looking at these days, which isn't currently controllable in
the stateful interface, is RTP RPSI (reference picture selection indication).
This is feedback that a remote decoder sends when a reference picture has been
decoded. In short, even if only one reference is used, we'd like the reference
to change only when we have received the acknowledgement that the new one has
been reconstructed on the other side.

I'm not super keen on having to modify the Linux kernel specially for this
feature, especially since similar APIs offer it at a lower level (VA, D3D12,
and probably future APIs).


We plan to make a public release at some point in the near future which shows
these working results, but it will not be a base for our discussion here yet.

One of the main topics of concern now is how reference frames should be managed
and how it should interact with kernel-side GOP management and rate control.

Maybe we need to have a discussion about kernel-side GOP management first?
While I think kernel-side rate control is unavoidable, I don't think stateless
encoders should have kernel-side GOP management.

I don't have strong opinions about this. The rationale for my proposal is that
kernel-side rate control will be quite difficult to operate without knowledge
of the period at which intra/inter frames are produced. Maybe there are known
methods to handle this, but I have the impression that most rate control
implementations use the GOP size as a parameter.

More generally I think an expectation behind rate control is to be able to
decide at which time a specific frame type is produced. This is not possible if
the decision is entirely up to userspace.

In television (and YouTube) streaming, the GOP size is just fixed, and you deal
with it. In fact, I have never seen the GOP or picture pattern being modified by
the rate control. In general, high-end rate controls will follow an HRD
specification. The rate controls will require information that represents
constraints, and this is not limited to the rate. In H.264/HEVC, the level and
profile will play a role. But you could also add the VBV size and probably more.
I have never read the HRD specification completely.

In cable streaming notably, the RC's job is to monitor the amount of bits over a
period of time (the window). This window is defined by the streaming hardware's
buffering capabilities. The best thing at this point is to start reading through
HRD specifications and open-source rate control implementations (notably x264).
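The window-based accounting described here could be sketched as follows. This is a toy illustration with made-up names, not an actual rate control implementation: the bits of the last N encoded frames go into a ring buffer and the running sum is compared against the budget implied by the downstream buffering.

```c
#include <stddef.h>

/* Illustrative window over the last RC_WINDOW_FRAMES encoded frames. */
#define RC_WINDOW_FRAMES 8

struct rc_window {
	unsigned long frame_bits[RC_WINDOW_FRAMES]; /* ring buffer of frame sizes */
	size_t head;                                /* next slot to overwrite */
	unsigned long total_bits;                   /* running sum of the window */
};

/* Record one encoded frame; returns the bits currently in the window. */
static unsigned long rc_window_push(struct rc_window *w, unsigned long bits)
{
	w->total_bits -= w->frame_bits[w->head];
	w->frame_bits[w->head] = bits;
	w->total_bits += bits;
	w->head = (w->head + 1) % RC_WINDOW_FRAMES;
	return w->total_bits;
}

/* A rate control would typically raise QP when the window exceeds its budget. */
static int rc_window_over_budget(const struct rc_window *w, unsigned long budget)
{
	return w->total_bits > budget;
}
```

A real implementation would of course derive the window size and budget from the HRD/VBV parameters rather than hardcode them.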

I think overall we can live with adding hints where needed, and if the GOP
information is an appropriate hint, then we can just reuse the existing control.

Why do we still care about GOP here? The hardware has no idea about GOP at all. Although in codecs like HEVC the NALU headers of IDR and intra pictures differ, there is no difference in the hardware coding configuration. The NALU header is usually generated by userspace.

Whether future encoding regards the current encoded picture as an IDR is completely decided by userspace.

Leaving GOP management to the kernel-side implies having it decide which frame
should be IDR, I or P (and B for encoders that can support it), while keeping
the possibility to request a keyframe (IDR) and configure GOP size. Now it seems
to me that this is already a good balance between giving userspace a decent
level of control while not having to specify the frame type explicitly for each
frame or maintain a GOP in userspace.
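As a rough illustration of the balance described above, a kernel-side GOP tracker could be as small as this. The names here are invented for the sketch; only the two stateful controls mentioned, V4L2_CID_MPEG_VIDEO_GOP_SIZE and V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME, exist in the uAPI today.

```c
/* Sketch of a kernel-side GOP decision: emit an IDR at the start of each
 * GOP or on an explicit keyframe request, P frames otherwise. */
enum gop_frame_type { GOP_FRAME_IDR, GOP_FRAME_P };

struct gop_state {
	unsigned int gop_size;    /* e.g. from V4L2_CID_MPEG_VIDEO_GOP_SIZE */
	unsigned int frame_index; /* position within the current GOP */
	int force_keyframe;       /* e.g. V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME */
};

static enum gop_frame_type gop_next_frame_type(struct gop_state *s)
{
	if (s->force_keyframe || s->frame_index == 0 ||
	    s->frame_index >= s->gop_size) {
		s->force_keyframe = 0;
		s->frame_index = 1; /* the IDR starts a new GOP */
		return GOP_FRAME_IDR;
	}
	s->frame_index++;
	return GOP_FRAME_P;
}
```

With gop_size = 4 this yields the pattern IDR P P P IDR ...; extending it to open GOPs (plain I instead of IDR) or B frames would add state but keep the same shape.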

My expectation for a stateless encoder is to have to specify the frame type and
the associated references if the type requires it.

Ack. For us, this is also why we would require requests (unlike stateful
encoders), as we have per-frame information to carry, and requests explicitly
attach the information to the frame.



Requesting the frame type explicitly seems more fragile, as many situations will
be invalid (e.g. requesting a P frame at the beginning of the stream), and it
generally requires userspace to know a lot about the codec's assumptions.
Also, for B frames, the decision would need to be consistent with the fact
that a following frame (in display order) would need to be submitted earlier
than the current frame, and to inform the kernel so that the picture order count
(display order indication) can be maintained. This is not impossible or out of
reach, but it brings a lot of complexity for little advantage.

We have had much more consistent results over the last decade with stateless
hardware codecs, in contrast to stateful ones where we end up with wide
variation in behaviour. This applies to Chromium, GStreamer and any active users
of VA encoders, really. I'm strongly in favour of a stateless reference API out
of the Linux kernel.

Okay, I understand that the lower level of control makes it possible to get much
better results than opaque firmware-driven encoders, and it would be a shame not
to leverage this possibility with an API that is too restrictive.

However I do think it should be possible to operate the encoder without a lot
of codec-specific supporting code from userspace. This is also why I like having
kernel-side rate control (among other reasons).

Ack. We need a compromise here.


[...]


The next topic of interest is reference management. It seems pretty clear that
the decision of whether a frame should be a reference or not always needs to be
taken when encoding that frame. In H.264 the nal_ref_idc slice header element
indicates whether a frame is marked as reference or not. IDR frames can
additionally be marked as long-term reference (if I understood correctly, the
frame will stay in the reference picture list until the next IDR frame).

This is incorrect. Any frame can be marked as a long-term reference; it does not
matter what type it is. From what I recall, marking a long-term reference in the
bitstream uses an explicit index, so there are no specific rules on which one
gets evicted. Long-term references are of course limited, as they occupy space
in the DPB. Also, each codec has different DPB semantics. For H.264, the DPB can
run in two modes. The first is a simple FIFO: any frame you encode and want
to keep as a reference is pushed into the DPB (which has a fixed size, minus the
long-term references). If full, the oldest frame is removed. It is not bound to
IDR or GOP. Though, an IDR will implicitly cause the decoder to evict everything
(including long-term references).
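The simple FIFO (sliding window) mode described above boils down to this kind of bookkeeping. This is an illustrative sketch with invented names, not driver code:

```c
#include <stddef.h>

/* Short-term references fill a fixed-size window; when it is full, the
 * oldest one is evicted. An IDR flushes everything. */
#define DPB_MAX_SHORT_TERM 4

struct dpb {
	int frame_num[DPB_MAX_SHORT_TERM]; /* short-term refs, oldest first */
	size_t count;
};

/* Push a newly reconstructed reference; evict the oldest if full. */
static void dpb_push_short_term(struct dpb *d, int frame_num)
{
	if (d->count == DPB_MAX_SHORT_TERM) {
		for (size_t i = 1; i < d->count; i++) /* drop the oldest */
			d->frame_num[i - 1] = d->frame_num[i];
		d->count--;
	}
	d->frame_num[d->count++] = frame_num;
}

/* An IDR implicitly evicts every reference. */
static void dpb_flush(struct dpb *d)
{
	d->count = 0;
}
```

The attraction of this mode is exactly what is visible here: encoder and decoder stay in sync with no explicit signalling beyond the frame numbers themselves.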

The second mode uses the memory management commands. These are a series of
instructions that the encoder can send to the decoder. The specification is
quite complex; it is a common source of bugs in decoders and a place where
stateless hardware codecs perform more consistently in general. Through the
commands, the encoder ensures that the decoder's DPB representation stays in
sync.
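For reference, the commands in question are the memory_management_control_operation codes from the H.264 specification (Table 7-9), carried in the dec_ref_pic_marking() slice header syntax. The enum names below are invented for readability; the values come from the spec:

```c
/* H.264 memory_management_control_operation codes (spec Table 7-9). */
enum h264_mmco {
	MMCO_END = 0,                    /* end of the command list */
	MMCO_UNMARK_SHORT_TERM = 1,      /* mark a short-term ref as unused */
	MMCO_UNMARK_LONG_TERM = 2,       /* mark a long-term ref as unused */
	MMCO_SHORT_TO_LONG_TERM = 3,     /* give a short-term ref a long-term index */
	MMCO_SET_MAX_LONG_TERM_IDX = 4,  /* set the max long-term frame index */
	MMCO_UNMARK_ALL = 5,             /* mark all refs as unused (full reset) */
	MMCO_MARK_CURRENT_LONG_TERM = 6, /* mark the current frame as long-term */
};
```

Operations 1 and 2 are where a uAPI could express arbitrary evictions, and 3/6 cover the long-term marking discussed earlier in the thread.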

This is also what I understand from repeated reading of the spec and thanks for
the summary write-up!

My assumption was that it would be preferable to operate in the simple fifo
mode since the memory management commands need to be added to the bitstream
headers and require coordination from the kernel. Like you said it seems complex
and error-prone.

But maybe this mechanism could be used to allow any particular reference frame
configuration, opening the way for userspace to fully decide what the reference
buffer lists are? Also it would be good to know if such mechanisms are generally
present in codecs or if most of them have an implicit reference list that cannot
be modified.

Of course, the subject is much more relevant when there are encoders with more
than one reference. But you are correct: what the commands do is allow changing,
adding or removing any reference from the list (random modification), as long as
the result fits in the codec constraints (like the DPB size, notably). This is
the only way one can implement temporal SVC reference patterns, robust reference
trees or RTP RPSI. Note that long-term references also exist, and are less
complex than these commands.


If userspace could manage the lifetime of the reconstruction buffers (assignment, reference), we wouldn't need a command here.

It is just a question of how to design another request API control structure to select which buffers would be used for list0 and list1.
This raises a big question, and I never checked how this works with, let's say,
VA. Shall we let the driver resolve the changes into commands? (VP8 has
something similar, while VP9 and AV1 use refresh flags, which are trivial to
compute.) I believe I'll have to investigate this further.


[...]

Additional information gathered:
- It seems likely that the Allwinner Video Engine only supports one reference
frame. There's a register for specifying the rec buffer of a second one but
I have never seen the proprietary blob use it. It might be as easy as
specifying a non-zero address there but it might also be ignored or require
some undocumented bit to use more than one reference. I haven't made any
attempt at using it yet.

There is something in that fact that makes me think of the Hantro H1. The Hantro
H1 also has a second reference, but no one ever uses it. It is on our todo list
to actually give this a look.

Having looked at both register layouts, I would tend to think both designs
are distinct. It's still unclear where Allwinner's video engine comes from:
perhaps they made it in-house, perhaps some obscure Chinese design house made it
for them or it could be known hardware with a modified register layout.

Ack,

I would also be interested to know if the H1 can do more than one reference!

From what we have in our pretty thin documentation, references are being
"searched" for fuzzy match and motion. So when you pass two references to the
encoder, the encoder will search equally in both. I suspect it does a lot
more than that, and saves some information in the auxiliary buffers that exist
per reference, but this isn't documented and I'm not specialized enough, really.

From a usage perspective, all you have to do is give it access to the reference
picture data (reconstructed image and auxiliary data). The result is compressed
macroblock data that may refer to these. We don't really know if a reference is
used, but we assume it is and place it in the reference list. This is of course
the normal thing to do, especially when using a reference FIFO.

In theory, you could implement multiple references with hardware that only
supports one. A technique could be to compress the image multiple times, and
keep the "best" result for the current configuration. Though, a proper
multi-pass encoder would avoid the bandwidth overhead of compressing and writing
the temporary results.


- Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
do not support VP8 encode (despite Allwinner's proprietary blob having an
API for it). The only platform that advertises it is the A80 and this might
actually be a VP8-only Hantro H1. It seems that the API they developed in the
library stuck around even if no other platform can use it.

Thanks for letting us know. Our assumption is that a second hardware design is
unlikely, as Google was giving it away for free to any hardware maker that
wanted it.


Sorry for the long email again, I'm trying to be a bit more explanatory than
just giving some bare conclusions that I drew on my own.

What do you think about these ideas?

In general, we diverge on the direction we want the interface to take. What you
seem to describe now is just a normal stateful encoder interface, with
everything needed to drive the stateless hardware implemented in the Linux
kernel. There is no parsing or other unsafety in encoders, so I don't have a
strict no-go argument against that, but for me it means much more complex
drivers and less flexibility. The VA model has been working great for us in the
past, giving us the ability to implement new features, or even slightly off-spec
features, while the Linux kernel might not be the right place for such
experimental methods.

VA seems too low-level for our case here, as it seems to expect full control
over more or less each bitstream parameter that will be produced.

I think we have to find some middle-ground that is not as limiting as stateful
encoders but not as low-level as VA.

Personally, I would rather discuss around your uAPI RFC though, I think a lot of
other devs here would like to see what you have drafted.

Hehe, I wish I had some advanced proposal here, but my implementation is quite
simplified compared to what we have to plan for mainline.

No worries, let's do that later then. On our side, we have similar limitations,
since we need to have something working before we can spend more time turning
it into something upstream. So we have "something" for VP8, and we'll do
"something" for H.264; from there we should be able to iterate. But having the
opportunity to iterate over more capable hardware would clearly help understand
the bigger picture.

cheers,
Nicolas

--
Hsia-Jun(Randy) Li