Re: [REGRESSION] Too-low frequency limit for AMD GPU PCI-passed-through to Windows VM

From: Alex Williamson
Date: Fri Mar 18 2022 - 10:47:51 EST


On Fri, 18 Mar 2022 08:01:31 +0100
Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> wrote:

> On 18.03.22 06:43, Paul Menzel wrote:
> >
> > Am 17.03.22 um 13:54 schrieb Thorsten Leemhuis:
> >> On 13.03.22 19:33, James Turner wrote:
> >>>
> >>>> My understanding at this point is that the root problem is probably
> >>>> not in the Linux kernel but rather something else (e.g. the machine
> >>>> firmware or AMD Windows driver) and that the change in f9b7f3703ff9
> >>>> ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)") simply
> >>>> exposed the underlying problem.
> >>
> >> FWIW: that in the end is irrelevant when it comes to the Linux kernel's
> >> 'no regressions' rule. For details see:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/admin-guide/reporting-regressions.rst
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/process/handling-regressions.rst
> >>
> >>
> >> That being said: sometimes for the greater good it's better to not
> >> insist on that. And I guess that might be the case here.
> >
> > But who decides that?
>
> In the end afaics: Linus. But he can't watch each and every discussion,
> so it partly falls down to people discussing a regression, as they can
> always decide to get him involved in case they are unhappy with how a
> regression is handled. That obviously includes me in this case. I simply
> use my best judgement in such situations. I'm still undecided if that
> path is appropriate here, that's why I wrote above to see what James
> would say, as he afaics was the only one that reported this regression.
>
> > Running stuff in a virtual machine is not that uncommon.
>
> No, it's about passing through a GPU to a VM, which is a lot less common
> -- and afaics an area where blacklisting GPUs on the host to pass them
> through is not uncommon (a quick internet search confirmed that, but I
> might be wrong there).

Right, interference from host drivers and pre-boot environments is
always a concern with GPU assignment in particular. AMD GPUs have a
long history of poor behavior relative to things like PCI secondary bus
resets which we use to try to get devices to clean, reusable states for
assignment. Here a device is being bound to a host driver that
initiates some sort of power control, unbound from that driver and
exposed to new drivers far beyond the scope of the kernel's regression
policy. Perhaps it's possible to undo such power control when
unbinding the device, but it's not necessarily a given that such a
thing is possible for this device without a cold reset.

IMO, it's not fair to restrict the kernel from such advancements. If
the use case is within a VM, don't bind host drivers. It's difficult
to make promises when dynamically switching between host and userspace
drivers for devices that don't have functional reset mechanisms.
Thanks,

Alex