Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

From: Jakub Kicinski
Date: Fri Mar 01 2019 - 15:04:10 EST


On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> Use case:
> ---------
> A user wants to create/delete hardware linked sub devices without
> using SR-IOV.
> These devices for a pci device can be netdev (optional rdma device)
> or other devices. Such sub devices share some of the PCI device
> resources and also have their own dedicated resources.
>
> Few examples are:
> 1. netdev having its own txq(s), rq(s) and/or hw offload parameters.
> 2. netdev with switchdev mode using netdev representor
> 3. rdma device with IB link layer and IPoIB netdev
> 4. rdma/RoCE device and a netdev
> 5. rdma device with multiple ports
>
> Requirements for above use cases:
> --------------------------------
> 1. We need a generic user interface & core APIs to create sub devices
> from a parent pci device but should be generic enough for other parent
> devices
> 2. Interface should be vendor agnostic
> 3. User should be able to set device params at creation time
> 4. In future if needed, tool should be able to create passthrough
> device to map to a virtual machine

Like a mediated device?

https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
https://www.dpdk.org/wp-content/uploads/sites/35/2018/06/Mediated-Devices-Better-Userland-IO.pdf

Other than pass-through it is entirely unclear to me why you'd need
a bus. (Or should I say VM pass-through or DPDK?) Could you clarify
why a bus is needed?

My thinking is that we should allow spawning subports in devlink, and
if the user specifies "passthrough" the device spawned would be an mdev.
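
Roughly, as a toy Python model rather than kernel code. Every name
below (Subport, spawn_subport, the "kind" strings) is hypothetical, and
the idea that a non-passthrough subport is a plain host netdev is an
assumption made only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subport:
    """Hypothetical record for a spawned subport; not a real devlink object."""
    parent: str   # devlink handle of the parent, e.g. "pci/0000:05:00.0"
    index: int    # subport index chosen by the caller in this toy model
    kind: str     # "mdev" when passthrough was requested, else "netdev"


def spawn_subport(parent, index, passthrough=False):
    """Spawn a subport; only a passthrough request yields an mdev."""
    kind = "mdev" if passthrough else "netdev"  # assumed default: host netdev
    return Subport(parent=parent, index=index, kind=kind)


if __name__ == "__main__":
    print(spawn_subport("pci/0000:05:00.0", 0))
    print(spawn_subport("pci/0000:05:00.0", 1, passthrough=True))
```

The point of the sketch is only that the passthrough flag selects the
kind of device, so no separate bus shows up in the non-passthrough path.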

> 5. A device can have multiple ports

What does this mean, in practice? You want to spawn a subdev which can
access both ports? That'd be for RDMA use cases, more than Ethernet,
right? (Just clarifying :))

> 6. An orchestration software wants to know how many such sub devices
> can be created from a parent device so that it can manage them in global
> cluster resources.
>
> So how is it done?
> ------------------
> (a) user in control
> To address above requirements, a generic tool iproute2/devlink is
> extended for sub device's life cycle.
> However a devlink tool and its kernel counterpart are not sufficient
> to create protocol agnostic devices on an existing PCI bus.

"Protocol agnostic"?... What does that mean?

> (b) subdev bus
> A given bus defines well defined addressing scheme. Creating sub devices
> on existing PCI bus with a different naming scheme is just weird.
> So, creating well named devices on appropriate bus is desired.

What addressing scheme are you referring to? You seem to just assign
IDs in sequence.

> Hence a new 'subdev' bus is created.
> User adds/removes new sub devices on this bus via a devlink tool.
> devlink tool instructs hardware driver to create/remove/configure
> such devices. Hardware vendor driver places devices on the bus.
> Another or same vendor driver matches based on vendor-id, device-id
> scheme and run through classic device driver model.
>
> Given that, these are user created devices for a given hardware and in
> absence of a central entity like PCISIG to assign vendor and device ids,
> A unique vendor and device id are maintained as enum in
> include/linux/subdev_ids.h.

Why do we need IDs? Isn't the sysfs hierarchy sufficient? Do we need
a driver to match on those again? Is it going to be a different driver?

> subdev bus device names follow default device naming scheme of Linux
> kernel. It is done as 'subdev<instance_id>' such as, subdev0, subdev3.
>
> subdev device inherits its parent's DMA parameters.
> subdev will follow the rich power management infrastructure of the core
> kernel, so that every vendor driver doesn't have to iterate over its
> child devices and invent a locking and device anchoring scheme.
>
> Patchset summary:
> -----------------
> Patch-1, 2 introduces a subdev bus and interface for subdev life cycle.
> Patch-3 extends modpost tool for module device id table.
> Patch-4,5,6 implements a devlink vendor driver to add/remove devices.
> Patch-7 mlx5 driver implements subdev devices and places them on subdev
> bus.
> Patch-8 match against the subdev for mlx5 vendor, device id and creates
> fake netdevice.
>
> All patches are only a reference implementation to see RFC in works
> at devlink, sysfs and device model level. Once RFC looks good, more
> solid upstreamable version of the implementation will be done.
> All patches are functional except the last two patches, which just
> create fake subdev devices and fake netdevice.
>
> System example view:
> --------------------
>
> $ devlink dev show
> pci/0000:05:00.0
>
> $ devlink dev add pci/0000:05:00.0

That does not look great.

Also you have to return the id of the spawned device, otherwise this
is very racy.
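
To spell out the race, here is a toy Python model, not kernel or
devlink code. SubdevAllocator, add() and newest() are hypothetical
names, and the interleaving of two users is written out by hand instead
of using real threads:

```python
import itertools


class SubdevAllocator:
    """Toy stand-in for the parent device; all names are hypothetical."""

    def __init__(self):
        self._next_id = itertools.count()
        self.devices = []

    def add(self):
        """Create a sub device and hand its id back to the caller."""
        dev_id = next(self._next_id)
        self.devices.append(dev_id)
        return dev_id

    def newest(self):
        """What a caller falls back to when the add result is not reported."""
        return max(self.devices)


def guess_after_interleaved_adds():
    """Users A and B both add, then each guesses its device from a listing."""
    alloc = SubdevAllocator()
    alloc.add()  # user A's request
    alloc.add()  # user B's request lands before A gets to look
    return alloc.newest(), alloc.newest()  # (A's guess, B's guess)


def ids_from_return_value():
    """Same interleaving, but each user keeps the id that add() returned."""
    alloc = SubdevAllocator()
    a = alloc.add()
    b = alloc.add()
    return a, b


if __name__ == "__main__":
    print("guessed: ", guess_after_interleaved_adds())
    print("returned:", ids_from_return_value())
```

With guessing, both users end up believing they own the same device;
with the id returned from the add itself, there is nothing to guess.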

> $ devlink dev show
> pci/0000:05:00.0
> subdev/subdev0

Please don't spawn devlink instances. A devlink instance is supposed to
represent an ASIC. If we start spawning them willy-nilly for whatever
software construct we want to model, the clarity of the ontology will
suffer a lot.

Please see the discussion on my recent patchset. I think Jiri CCed you.

> sysfs view with subdev:
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0
> [..]
> drwxr-xr-x 3 root root 0 Feb 13 15:57 infiniband
> -rw-r--r-- 1 root root 4096 Feb 13 15:57 msi_bus
> drwxr-xr-x 3 root root 0 Feb 13 15:57 net
> drwxr-xr-x 2 root root 0 Feb 13 15:57 power
> drwxr-xr-x 3 root root 0 Feb 13 15:57 ptp
> drwxr-xr-x 4 root root 0 Feb 13 15:57 subdev0
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0
> lrwxrwxrwx 1 root root 0 Feb 13 15:58 driver -> ../../../../../bus/subdev/drivers/mlx5_core
> drwxr-xr-x 3 root root 0 Feb 13 15:58 net
> drwxr-xr-x 2 root root 0 Feb 13 15:58 power
> lrwxrwxrwx 1 root root 0 Feb 13 15:58 subsystem -> ../../../../../bus/subdev
> -rw-r--r-- 1 root root 4096 Feb 13 15:58 uevent
>
> $ ls -l /sys/bus/pci/devices/0000:05:00.0/subdev0/net/
> drwxr-xr-x 5 root root 0 Feb 13 15:58 eth0
>
> Software view:
> -------------
> For those of you who prefer to see it as a picture, the diagram below
> tries to show the software modules in the bus/device hierarchy.
>
> devlink user (iproute2/devlink)
> ------------------------------
>                |
>                |
>       +----------------+
>       | devlink module |
>       | doit()         |     +------------------+
>       |            |   |     | vendor driver    |
>       +------------|---+     | (mlx5)           |
>                    ----------+-> subdev_ops()   |
>                              +|-----------------+
>                               |
>                     +---------|--+  +-----------+  +------------------+
>                     | subdev bus |  | core      |  | subdev device    |
>                     | driver     |  | kernel    |  | drivers          |
>                     | (add/del)  |  | dev model |  | (netdev, rdma)   |
>                     |        ----------------------> probe/remove()   |
>                     +------------+  +-----------+  +------------------+
>
> Alternatives considered:
> ------------------------
> Will discuss separately if needed to keep this RFC short.

Please do discuss.

The key thing for me on the netdev side is what the forwarding model
to this new entity is. Is this basically VMDQ? Should we just go ahead
and mandate "switchdev mode" here?

Thanks for working on a common architecture and suffering through
people's reviews rather than adding a debugfs interface that does
this like a different vendor did :)