Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver

From: Jason Gunthorpe
Date: Wed Nov 29 2023 - 13:07:50 EST


On Wed, Nov 29, 2023 at 03:41:54PM +0000, Greg Kroah-Hartman wrote:

> Ugh, you are right, I was wrong, there is no reference count here, using
> a miscdevice _requires_ you to have a separate reference count, like you
> all did. My fault.

I think we did not do a good job conveying that several experienced
people have looked at this internally already. I see you noticed a few
things we missed, but this is not unreviewed garbage code. We have
different opionions on how some of these things should work but none
of them are what I'd consider unlinuxy or wrong.

> > > Just make it simple and use a normal mutex please.
> >
> > A normal mutex would make the entire ioctl interface single threaded,
> > this is not desirable.
>
> Why not? It's an ioctl for a single device, surely this isn't
> performance criticial.

The RPCs to the FW can take a long time to execute and the "single
device" is really a multi-threaded CPU complex on the other
side. Being able to submit any commands for (say) 0.5s while one
thread chews on something is not desirable.

> And then only grab it when needed, on read/write/ioctl path it
> shouldn't be needed at all due to the proper reference counting of
> the structures. Only on open/close, right?

No - this lock is protecting the mcdev->mdev from being freed. If the
read side is held then mcdev->mdev cannot be freed. This is not a
fully refcounted object because the mdev is actually the device
underlying the aux device and it will be logically destroyed after the
aux device_driver remove() function returns.

The purpose of the rwlock is to ensure that the remove() function does
not return until all threads have completed using the mdev. Once
remove proceeds the read side of the lock will observe mdev==NULL and
fail all operations until the FD is closed. The refcount keeps the
memory underpinning the basic operation of the FD alive after remove()
until the FD finally closes.

At least for the purposes of this driver it is how it has to be. I
agree the mdev API surface that requires some of this may not be the
best it can be.

> And again, for a rw semaphore, benchmarks matter, often, if not almost
> always, a normal mutex is faster for stuff like this. If not, then a
> benchmark will show it.

We are not talking about micro differences in lock/unlock cost here.
This is obviously better because of how much time the lock will stay
locked while executing RPCs and sleeping waiting for interrupts back
from FW. We are talking >> 100's of ms of time spent locked while
executing FW RPCs.

> > We've been over this already, the devlink discussion is about some
> > configuration stuff.
>
> It was? I see device-specific diagonostic data for the mlx5 driver
> being exported through devlink today, that's not configuration. Why not
> just add more?

Today mlx5 has a bunch of debuggables:
- counters, even some device specific counters. Both in RDMA and
netdev objects exposed through netlink and sysfs
- A "core dump" mechanism through devlink
- Some debugfs stuff, but we are now aware of serious problems with it
- Some "devlink parameters" which configure a limited set of things

These are not "interactive". Yes we can add more counters, more
parameters and more stuff to the core dump, but that isn't what I mean
by debugging.

This is a complex device, it has processors on the other side running
a lot of software. A primary purpose of this design is to allow
bi-directional communication with debug agents in the FW.

Consider, (I'm trying to draw a picture here without disclosing trade
secrets about how the device operates), that one of the debug agents
is a GDB stub. There are many CPUs inside these devices, field
engineers may wish to enter a breakpoint and inspect something. With
this scheme we would put the GDB stub protocol packets into RPC
messages and execute them to the FW.

In the very worst nightmare debug scenario we might build a custom FW
with a custom debug agent targetting the specific problem and use
these generic FW RPCs to communicate however that very special one-off
agent requires.

Notice what such a thing requires: a future compatible channel to
exchange data bi-directionally with the SW components in the FW.

The design here of processing *generic RPCs* is not some dirty nod
toward enterprise distros - it is a sane and logical design to put
things in userspace that the kernel has no value-add to being part
of. This is standard straight-from-Linus design values. It is how
things like nvme built their very similar "vendor commands" system
(see for instance the intel & wd commands in nvmecli).

So, can we run generic opaque RPCs over some TBD devlink interface,
including with DMA? Technologically yes, of course. Practically?
No. Jakub has clearly stated his position on philisophical limits in
netdev and those include devlink. It excludes soemthing so open-ended
and ill-defined. FWIW, I agree with this, devlink should be much more
constrained. It is why we never suggested devlink for this kind of
debugging.

Genericness is also the source of alot of the cross discussion in this
thread.

A generic RPC interface can do *alot*. Actually, it can do *almost
anything* to the device. That is the entire point. What it can't do is
violate the integrity of the kernel - hence why it is a FW RPC
interface and not direct access to device registers.

So when Jakub says we can send configuration values that could also be
sent over devlink parameters, or read counters that could also be read
over netlink, he is not wrong. But we already do all those duplicative
things through the direct PCI access and have for at least 10 years.

I disagree that this duplication is a problem. Re-usable, cross vendor
generic interfaces have merit on their own. mlx5 has long been
implementing a wide selection of them as they are invented. We've been
doing this even though those interfaces duplicate what the raw PCI
tools already did.

This new driver isn't going to change that. We will still participate
in netdev and still implement these other things. We are a big part of
the community, I don't need someone holding a stick over the mlx5
team's head beating them to do the "right" thing.

Jason