Re: [PATCH v2 5/9] drm/vkms: Re-introduce line-per-line composition algorithm

From: Louis Chauvet
Date: Wed Mar 06 2024 - 12:30:27 EST


Le 05/03/24 - 12:10, Pekka Paalanen a écrit :
> On Mon, 4 Mar 2024 16:28:33 +0100
> Louis Chauvet <louis.chauvet@xxxxxxxxxxx> wrote:
>
> > Le 29/02/24 - 12:21, Pekka Paalanen a écrit :
> > > On Tue, 27 Feb 2024 16:02:09 +0100
> > > Louis Chauvet <louis.chauvet@xxxxxxxxxxx> wrote:
> > >
> > > > [...]
> > > >
> > > > > > -static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
> > > > > > - struct line_buffer *stage_buffer,
> > > > > > - struct line_buffer *output_buffer)
> > > > > > +static void pre_mul_alpha_blend(
> > > > > > + struct line_buffer *stage_buffer,
> > > > > > + struct line_buffer *output_buffer,
> > > > > > + int x_start,
> > > > > > + int pixel_count)
> > > > > > {
> > > > > > - int x_dst = frame_info->dst.x1;
> > > > > > - struct pixel_argb_u16 *out = output_buffer->pixels + x_dst;
> > > > > > - struct pixel_argb_u16 *in = stage_buffer->pixels;
> > > > > > - int x_limit = min_t(size_t, drm_rect_width(&frame_info->dst),
> > > > > > - stage_buffer->n_pixels);
> > > > > > -
> > > > > > - for (int x = 0; x < x_limit; x++) {
> > > > > > - out[x].a = (u16)0xffff;
> > > > > > - out[x].r = pre_mul_blend_channel(in[x].r, out[x].r, in[x].a);
> > > > > > - out[x].g = pre_mul_blend_channel(in[x].g, out[x].g, in[x].a);
> > > > > > - out[x].b = pre_mul_blend_channel(in[x].b, out[x].b, in[x].a);
> > > > > > + struct pixel_argb_u16 *out = &output_buffer->pixels[x_start];
> > > > > > + struct pixel_argb_u16 *in = &stage_buffer->pixels[x_start];
> > > > >
> > > > > Input buffers and pointers should be const.
> > > >
> > > > They will be const in v4.
> > > >
> > > > > > +
> > > > > > + for (int i = 0; i < pixel_count; i++) {
> > > > > > + out[i].a = (u16)0xffff;
> > > > > > + out[i].r = pre_mul_blend_channel(in[i].r, out[i].r, in[i].a);
> > > > > > + out[i].g = pre_mul_blend_channel(in[i].g, out[i].g, in[i].a);
> > > > > > + out[i].b = pre_mul_blend_channel(in[i].b, out[i].b, in[i].a);
> > > > > > }
> > > > > > }
> > > > >
> > > > > Somehow the hunk above does not feel like it is part of "re-introduce
> > > > > line-per-line composition algorithm". This function was already running
> > > > > line-by-line. Would it be easy enough to collect this and directly
> > > > > related changes into a separate patch?
> > > >
> > > > It is not directly related to the reintroduction of line-by-line
> > > > algorithm, but in the simplification and maintenability effort, I
> > > > changed a bit the function to avoid having multiple place computing the
> > > > x_start/pixel_count values. I don't see an interrest to extract it, it
> > > > will be just a translation of the few lines into the calling place.
> > >
> > > It does make review more difficult, because it makes the patch bigger
> > > and is not explained in the commit message. It is a surprise to a
> > > reviewer, who then needs to think what this means and does it belong
> > > here.
> > >
> > > If you explain it in the commit message and note it in the commit
> > > summary line, I think it would become fairly obvious that this patch is
> > > doing two things rather than one.
> > >
> > > Therefore, *if* it is easy to extract as a separate patch, then it
> > > would be nice to do so. However, if doing so would require you to write
> > > a bunch of temporary code that the next patch would just rewrite again,
> > > then doing so would be counter-productive.
> > >
> > > Patch split is about finding a good trade-off to make things easy for
> > > reviewers:
> > >
> > > - Smaller patches are better as long as they are self-standing and
> > > understandable in isolation, and of course do not regress anything.
> > >
> > > - Rewriting the same thing multiple times in the same series is extra
> > > work for a reviewer and therefore best avoided.
> > >
> > > - The simpler the semantic change, the bigger a patch can be and still
> > > be easy to review.
> > >
> > > And all the patch writing rules specific to the kernel project that I
> > > don't know about.
> >
> > I will extract it in "drm/vkms: Avoid computing blending limits inside the
> > blend function". It's not very relevant by itself, but it make the main
> > patch easier to read.
>
> Thank you.
>
>
> > > > [...]
> > > >
> > > > > > +/**
> > > > > > + * direction_for_rotation() - Helper to get the correct reading direction for a specific rotation
> > > > > > + *
> > > > > > + * @rotation: rotation to analyze
> > > > >
> > > > > This is KMS plane rotation property, right?
> > > > >
> > > > > So the KMS plane has been rotated by this, and what we want to find is
> > > > > the read direction on the attached FB so that reading returns pixels in
> > > > > the CRTC line/scanout order, right?
> > > > >
> > > > > Maybe extend the doc to explain that.
> > > >
> > > > Is it better?
> > > >
> > > > * direction_for_rotation() - Get the correct reading direction for a given rotation
> > > > *
> > > > * This function will use the @rotation parameter to compute the correct reading direction to read
> > > > * a line from the source buffer.
> > > > * For example, if the buffer is reflected on X axis, the pixel must be read from right to left.
> > > > * @rotation: Rotation to analyze. It correspond the the field @frame_info.rotation.
> > >
> > > I think it is important to define what determines the correct result.
> > > In this case, we want the reading to produce pixels in the CRTC scanout
> > > line order, I believe. If you don't say "CRTC", the reader does not
> > > know what "the correct reading direction" should match to.
> >
> > Is this a better explanation?
> >
> > * This function will use the @rotation setting of a source plane to compute the reading
> > * direction in this plane which correspond to a left to right writing in the CRTC.
> > * For example, if the buffer is reflected on X axis, the pixel must be read from right to left
> > * to be written from left to right on the CRTC.
>
> Perfect!
>
>
> >
> > > > > > + */
> > > > > > +enum pixel_read_direction direction_for_rotation(unsigned int rotation)
> > > > > > +{
> > > > > > + if (rotation & DRM_MODE_ROTATE_0) {
> > > > > > + if (rotation & DRM_MODE_REFLECT_X)
> > > > > > + return READ_LEFT;
> > > > > > + else
> > > > > > + return READ_RIGHT;
> > > > > > + } else if (rotation & DRM_MODE_ROTATE_90) {
> > > > > > + if (rotation & DRM_MODE_REFLECT_Y)
> > > > > > + return READ_UP;
> > > > > > + else
> > > > > > + return READ_DOWN;
> > > > > > + } else if (rotation & DRM_MODE_ROTATE_180) {
> > > > > > + if (rotation & DRM_MODE_REFLECT_X)
> > > > > > + return READ_RIGHT;
> > > > > > + else
> > > > > > + return READ_LEFT;
> > > > > > + } else if (rotation & DRM_MODE_ROTATE_270) {
> > > > > > + if (rotation & DRM_MODE_REFLECT_Y)
> > > > > > + return READ_DOWN;
> > > > > > + else
> > > > > > + return READ_UP;
> > > > > > + }
> > > > > > + return READ_RIGHT;
> > > > > > +}
> > > > > > +
> > > > > > /**
> > > > > > * blend - blend the pixels from all planes and compute crc
> > > > > > * @wb: The writeback frame buffer metadata
> > > > > > @@ -183,11 +187,11 @@ static void blend(struct vkms_writeback_job *wb,
> > > > > > {
> > > > > > struct vkms_plane_state **plane = crtc_state->active_planes;
> > > > > > u32 n_active_planes = crtc_state->num_active_planes;
> > > > > > - int y_pos;
> > > > > >
> > > > > > const struct pixel_argb_u16 background_color = { .a = 0xffff };
> > > > > >
> > > > > > size_t crtc_y_limit = crtc_state->base.crtc->mode.vdisplay;
> > > > > > + size_t crtc_x_limit = crtc_state->base.crtc->mode.hdisplay;
> > > > >
> > > > > Wonder why these were size_t, causing needs to cast below...
> > > >
> > > > For crtc_x_limit I just copied the crtc_y_limit. I will change both to u16
> > > > (the type of h/vdisplay).
> > >
> > > Don't go unsigned, that can cause unexpected results when mixed in
> > > computations with signed variables.
> >
> > I will replace them with int.
> >
> > > Oh, the cast was probably not about size but signedness. Indeed, size_t
> > > is unsigned.
> > >
> > > I don't see a reason to use a 16-bit size either, it just exposes the
> > > computations to under/overflows that would then be needed to check for.
> > > s32 should be as fast as any, and perhaps enough bits to never
> > > under/overflow in these computations, but please verify that.
> >
> > I just suggested u16 because it's the type of vdisplay/hdisplay. It was
> > not for performance reason.
>
> Right. It's not uncommon store a value in a storage efficient type that
> may also disallow illegal values, and then use a different type while
> actually computing with it in order to not provoke too obscure C
> language rules most people never heard of, to avoid over/underflows, or
> to just avoid undefined behaviour.
>
> ...
>
> > > > > > +static int get_step_1x1(struct drm_framebuffer *fb, enum pixel_read_direction direction,
> > > > > > + int plane_index)
> > > > > > {
> > > > > > - int x_src = frame_info->src.x1 >> 16;
> > > > > > - int y_src = y - frame_info->rotated.y1 + (frame_info->src.y1 >> 16);
> > > > > > -
> > > > > > - return packed_pixels_addr(frame_info, x_src, y_src);
> > > > > > + switch (direction) {
> > > > > > + default:
> > > > > > + DRM_ERROR("Invalid direction for pixel reading: %d\n", direction);
> > > > > > + return 0;
> > > > >
> > > > > What I'd do here is move the default: section outside of the switch
> > > > > completely. Then the compiler can warn if any enum value is not handled
> > > > > here. Since every case in the switch is a return statement, falling out
> > > > > of the switch block is the default case.
> > > >
> > > > Hoo, I did not know that gcc can warn when using enums, I will definitly
> > > > do it for the v4.
> > > >
> > > > > Maybe the enum variable containing an illegal value could be handled
> > > > > more harshly so that callers could rely on this function always
> > > > > returning a good value?
> > > > >
> > > > > Just like passing in fb=NULL is handled by the kernel as an OOPS.
> > > >
> > > > I don't think it's a good idea to OOPS inside a driver.
> > >
> > > Everyone already do that. Most functions that do not expect to be called
> > > with NULL never check the arguments for NULL. They just OOPS on
> > > dereference if someone passes in NULL. And for a good reason: adding
> > > all those checks is both code churn and it casts doubt: "maybe it is
> > > legal and expected to call this function with NULL sometimes, what good
> > > does that do?".
> >
> > I agree that adding something like
> >
> > if (direction_is_valid) pr_err("Invalid direction")
> >
> > is useless, but as I already have the switch, it cost nothing to warn if
> > something gone wrong. I will just replace this simple DRM_ERROR with a
> > WARN_ONCE to be more verbose about "it is a bug".
>
> Sounds good to me, and I hope kernel maintainers would agree.
>
>
> > > > An error here is
> > > > maybe dangerous, but is not fatal to the kernel. Maybe you know how to do
> > > > a "local" OOPS to break only this driver and not the whole kernel?
> > >
> > > I don't know what the best practices are in the kernel.
> > >
> > > > For the v4 I will keep a DRM_ERROR and return 0.
> > >
> > > Does that require the caller to check for 0? Could the 0 cause
> > > something else to end up in an endless loop? If it does return 0, how
> > > should a caller handle this case that "cannot" ever happen? Why have
> > > code for something that cannot happen?
> >
> > I have to return something, otherwise the compiler will complain about it.
> >
> > To avoid for future developers surprise, I added this information in the
> > comment. This way the user don't have to read the code to understand how
> > much he can rely on this value.
> >
> > If the caller can trust his direction, he don't have to worry about this.
> > If he can't trust his direction, he know that the returned value can be
> > zero, and thus can't be used for a loop variant.
>
> There should not be "untrusted" values to begin with at this point,
> anything that comes from outside of the kernel should have already been
> sanitised. This is about kernel bugs though. Bugs cannot be predicted,
> nor can anyone guarantee to write bug-free code. Hence, the direction
> value is always "somewhat untrusted". We're being paranoid about bugs
> that might happen and trying to ensure the kernel can limp along
> regardless, while also trying to minimise the amount of code that
> "cannot" ever be reached.
>
> > The zero is also nice because it does not interfere with the normal
> > behavior of this function. If the returned value is not zero, it's the
> > correct step to use from one pixel to an other.
>
> If you expect the caller needing to check for the "cannot happen" case,
> returning a unique error value is fine. If you expect the caller to
> never need to think of the "cannot happen" case, you should return a
> value that is "safe", if such value exists. "Safe" here means using it
> will not result in grave bugs like bad memory access, but it also won't
> produce expected results unless by accident.

That my issue, on my initial draft, I had a `return 1` (so I can use it as
a loop variant), but after thinking, if the start pixel is the last
of the plane, it will access outside the buffer.

> This getting perhaps a bit too philosophical, so don't mind about this
> too much if it feels strange.

Maybe yes, I was a bit paranoid, I can just return 0 and remove the comment.

>
> > > Of course it's a trade-off between correctness and limping along
> > > injured, and the kernel tends to strongly lean toward the latter for the
> > > obvious reasons.
> > >
> > > > > > + case READ_RIGHT:
> > > > > > + return fb->format->char_per_block[plane_index];
> > > > > > + case READ_LEFT:
> > > > > > + return -fb->format->char_per_block[plane_index];
> > > > > > + case READ_DOWN:
> > > > > > + return (int)fb->pitches[plane_index];
> > > > > > + case READ_UP:
> > > > > > + return -(int)fb->pitches[plane_index];
> > > > > > + }
> > > > > > }
> > > > > >
> > > > > > -static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
> > > > > > -{
> > > > > > - if (frame_info->rotation & (DRM_MODE_REFLECT_X | DRM_MODE_ROTATE_270))
> > > > > > - return limit - x - 1;
> > > > > > - return x;
> > > > > > -}
> > > > > >
> > > > > > /*
> > > > > > - * The following functions take pixel data from the buffer and convert them to the format
> > > > > > + * The following functions take pixel data (a, r, g, b, pixel, ...), convert them to the format
> > > > > > * ARGB16161616 in out_pixel.
> > > > > > *
> > > > > > - * They are used in the `vkms_compose_row` function to handle multiple formats.
> > > > > > + * They are used in the `read_line`s functions to avoid duplicate work for some pixel formats.
> > > > > > */
> > > > > >
> > > > > > -static void ARGB8888_to_argb_u16(u8 *src_pixels, struct pixel_argb_u16 *out_pixel)
> > > > > > +static void ARGB8888_to_argb_u16(struct pixel_argb_u16 *out_pixel, int a, int r, int g, int b)
> > > > >
> > > > > The function name ARGB8888_to_argb_u16() is confusing. It's not taking
> > > > > in ARGB8888 pixels but separate a,r,g,b ints. The only assumption it
> > > > > needs from the pixel format is the 8888 part.
> > > >
> > > > I don't realy know how to name it. What I like with ARGB8888 is that it's
> > > > clear that the values are 8 bits and in argb format.
> > >
> > > I could even propose
> > >
> > > static struct pixel_argb_u16
> > > argb_u16_from_u8888(int a, int r, int g, int b)
> > >
> > > perhaps. Yes, returning a struct by value. I think it would fit, and
> > > these are supposed to get fully inlined anyway, too.
> > >
> > > c.f argb_u16_from_u2101010().
> >
> > I don't find this method, but I got and like the idea, I will change the
> > callback to this in the v4.
>
> I mean, there is no support for 10-bpc formats in VKMS yet IIRC, but
> there should be one day, so thinking about how that would fit in the
> naming scheme is nice.
>
> > > Not a big deal though, I think I'm getting a little bit too involved to
> > > see what would be the most intuitively understandable naming scheme for
> > > someone not familiar with the code.
> > >
> > > > Do you think that `argb_u8_to_argb_u16`, with a new structure
> > > > pixel_argb_u8 will be better? (like PATCH 6/9 with pixel_yuv_u8).
> > > >
> > > > If so, I will introduce the argb_u8 structure in an other commit.
> > >
> > > How would you handle 10-bpc formats? Is there a need for
> > > proliferation of bit-depth-specific struct types?
> >
> > No, I don't think it's good to multiply things. I will patch Arthur's
> > patches to avoid the pixel_yuv_u8 structure.
> >
> > > > [...]
> > > >
> > > > > > + * The following functions are read_line function for each pixel format supported by VKMS.
> > > > > > *
> > > > > > - * This function composes a single row of a plane. It gets the source pixels
> > > > > > - * through the y coordinate (see get_packed_src_addr()) and goes linearly
> > > > > > - * through the source pixel, reading the pixels and converting it to
> > > > > > - * ARGB16161616 (see the pixel_read() callback). For rotate-90 and rotate-270,
> > > > > > - * the source pixels are not traversed linearly. The source pixels are queried
> > > > > > - * on each iteration in order to traverse the pixels vertically.
> > > > > > + * They read a line starting at the point @x_start,@y_start following the @direction. The result
> > > > > > + * is stored in @out_pixel and in the format ARGB16161616.
> > > > > > + *
> > > > > > + * Those function are very similar, but it is required for performance reason. In the past, some
> > > > > > + * experiment were done, and with a generic loop the performance are very reduced [1].
> > > > > > + *
> > > > > > + * [1]: https://lore.kernel.org/dri-devel/d258c8dc-78e9-4509-9037-a98f7f33b3a3@xxxxxxxxxx/
> > > > > > */
> > > > > > -void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
> > > > > > +
> > > > > > +static void ARGB8888_read_line(struct vkms_frame_info *frame_info, int x_start, int y_start,
> > > > > > + enum pixel_read_direction direction, int count,
> > > > > > + struct pixel_argb_u16 out_pixel[])
> > > > > > +{
> > > > > > + u8 *src_pixels = packed_pixels_addr(frame_info, x_start, y_start, 0);
> > > > > > +
> > > > > > + int step = get_step_1x1(frame_info->fb, direction, 0);
> > > > > > +
> > > > > > + while (count) {
> > > > > > + u8 *px = (u8 *)src_pixels;
> > > > > > +
> > > > > > + ARGB8888_to_argb_u16(out_pixel, px[3], px[2], px[1], px[0]);
> > > > > > + out_pixel += 1;
> > > > > > + src_pixels += step;
> > > > > > + count--;
> > > > >
> > > > > btw. you could eliminate decrementing 'count' if you computed end
> > > > > address and used while (out_pixel < end).
> > > >
> > > > Yes, you are right, but after thinking about it, neither out_pixel < end
> > > > and while (count) are conveying "this loop will copy `count` pixels. I
> > > > think a for-loop here is more understandable. There is no ambiguity in the
> > > > number of pixels written and less error-prone. I will replace
> > > > while (count)
> > > > by
> > > > for(int i = 0; i < count; i++)
> > >
> > > I agree that a for-loop is the most obvious way of saying it, but I
> > > also think while (out_pixel < end) is very close too, and so is while (count).
> > > None of those would make me think twice.
> > >
> > > However, I'm thinking of performance here. After all, this is the
> > > hottest code path there is in VKMS. Is the compiler smart enough to
> > > eliminate count-- or i to reduce the number of CPU cycles?
> >
> > You are proably right, I will change it to out_pixel < end.
>
> Don't trust my word without benchmarking it. ;-)

I did not notice a change with kms_fb_stress. There is maybe a small
improvment, but completly hidden in the DRM overhead.

Kind regards,
Louis Chauvet

>
> Thanks,
> pq



--
Louis Chauvet, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com