Re: [PATCH] PCI/LINK: Account for BW notification in vector calculation

From: Alex Williamson
Date: Tue Apr 23 2019 - 11:34:16 EST


On Tue, 23 Apr 2019 09:33:53 -0500
Alex G <mr.nuke.me@xxxxxxxxx> wrote:

> On 4/22/19 7:33 PM, Alex Williamson wrote:
> > On Mon, 22 Apr 2019 19:05:57 -0500
> > Alex G <mr.nuke.me@xxxxxxxxx> wrote:
> >> echo 0000:07:00.0:pcie010 |
> >> sudo tee /sys/bus/pci_express/drivers/pcie_bw_notification/unbind
> >
> > That's a bad solution for users, this is meaningless tracking of a
> > device whose driver is actively managing the link bandwidth for power
> > purposes.
>
> 0.5W savings on a 100+W GPU? I agree it's meaningless.

Evidence? Regardless, I don't have control of the driver that's making
these changes, but the claim seems unfounded and irrelevant.

> > There is nothing wrong happening here that needs to fill
> > logs. I thought maybe if I enabled notification of autonomous
> > bandwidth changes that it might categorize these as something we could
> > ignore, but it doesn't.
> > How can we identify only cases where this is
> > an erroneous/noteworthy situation? Thanks,
>
> You don't. Ethernet doesn't. USB doesn't. This logging behavior is
> consistent with every other subsystem that deals with multi-speed links.
> I realize some people are very resistant to change (and use very ancient
> kernels). I do not, however, agree that this is a sufficient argument to
> dis-unify behavior.

Sorry, I don't see how any of this is relevant either. Clearly I'm
using a recent kernel or I wouldn't be seeing this new bandwidth
notification driver. I'm assigning a device to a VM whose driver is
power managing the device via link speed changes. The result is that
we now see irrelevant spam in the host dmesg for every inconsequential
link downgrade directed by the device. I can see why we might want to
be notified of degraded links due to signal issues, but what I'm
reporting is that there are also entirely normal and benign reasons
that a link might be reduced. We can't seem to tell the difference
between a fault and this normal dynamic scaling, and the assumption of
a fault is spamming dmesg. So, I don't think what we have here is well
cooked. Do drivers have a mechanism to opt out of this error
reporting? Can drivers register an anticipated link change to avoid
the spam? What instructions can we *reasonably* give to users as to
when these messages mean something, when they don't, and how they can
be turned off? Thanks,

Alex