Re: regression with mainline kernel

From: Daniel Vetter
Date: Mon Nov 15 2021 - 09:25:38 EST


On Sat, Nov 13, 2021 at 09:06:57AM -0800, Linus Torvalds wrote:
> [ Hmm. This email was marked as spam for me. I see no obvious reason
> for it being marked as spam, but it happens.. ]
>
> On Thu, Nov 11, 2021 at 12:52 PM Sudip Mukherjee
> <sudipm.mukherjee@xxxxxxxxx> wrote:
> >
> > # first bad commit: [cd7f5ca33585918febe5e2f6dc090a21cfa775b0]
> > drm/virtio: implement context init: add virtio_gpu_fence_event
>
> Hmm. Judging from your automated screenshots, the login never appears.

Greg also reported a regression, plus I looked at the commit and had
questions too.

> > And, indeed reverting cd7f5ca33585 on top of debe436e77c7 has fixed
> > the problem I was seeing on my qemu test of x86_64. The qemu image is
> > based on Ubuntu.
>
> Presumably either that commit is somehow buggy in itself - or it does
> exactly what it means to do, and the new poll() semantics just
> confuses the heck out of the X server (or wayland or whatever).
>
> And honestly, if I read that thing correctly, the patch is entirely
> broken. The new poll function (virtio_gpu_poll()) will unconditionally
> remove the first event from the event list, and then report "Yeah, I
> had events".
>
> This is completely bogus for a few reasons:
>
> - poll() really should be idempotent, because the poll function gets
> called multiple times
>
> - it doesn't even seem to check that the event that it removes is the
> new VIRTGPU_EVENT_FENCE_SIGNALED_INTERNAL kind of event, so it will
> unconditionally just remove random events.
>
> - it does seem to check the "vfpriv->ring_idx_mask" and do the old
> thing if that is zero, but I see absolutely no reason for that (and
> that check itself has caused problems, see below)
>
> Honestly, my reaction to this all is that that commit is fundamentally
> broken and probably should be reverted regardless as "this commit does
> bad things".
>
> HOWEVER - it has had a fix for a NULL pointer dereference in the
> meantime - can you check whether the current top of tree happens to
> work for you? Maybe your problem isn't due to "that commit does
> unnatural things", but simply due to the bug fixed in d89c0c8322ec
> ("drm/virtio: Fix NULL dereference error in virtio_gpu_poll").
>
> And if it's still broken with that commit, I'll happily revert it and
> people need to go back to the drawing board.
>
> In fact, I would really suggest that people look at that
> virtio_gpu_poll() function regardless. That odd "let's unconditionally
> just drop events in the poll function is really REALLY broken
> behavior.

Tbh I haven't looked at the poll implementation, but it's fishy on
principles as in drm drivers shouldn't reinvent them. The commit message
cites vmwgfx as example for a private driver drm_event, but that works
perfectly fine with standard drm_poll (and is meant to work perfectly fine
with standard drm_poll).

So if it's buggy on top of questionable I think revert might be the right
choice irrespective of whether there's some fixes in-flight.

So if you end up pushing that revert:

References: https://lore.kernel.org/dri-devel/YZJrutLaiwozLfSw@phenom.ffwll.local/
Acked-by: Daniel Vetter <daniel.vetter@xxxxxxxx>

Plus credit Greg too and all that ofc.

But lets wait a bit for virtio folks to chime in, it's only Monday :-)

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch