Re: [PATCH V4 0/5] mlx5 ConnectX control misc driver

From: Jason Gunthorpe
Date: Fri Feb 16 2024 - 10:06:05 EST


On Thu, Feb 15, 2024 at 05:00:46PM -0800, Jakub Kicinski wrote:

> But this is a bit of a vicious cycle, vendors have little incentive
> to interoperate, and primarily focus on adding secret sauce outside of
> the standard. In fact you're lucky if the vendor didn't bake some
> extension which requires custom switches into the NICs :(

This may all seem shocking if you come from the netdev world, but this
has been normal for HPC networking for the last 30 years at least.

My counter perspective would be that we are currently in a pretty good
moment for the HPC industry because we actually have open source
implementations for most of it. In fact most actual deployments are
running something quite close to the mainline open source stack.

The main holdout right now is Cray/HPE's Slingshot networking family,
which is apparently based on ethernet but is less open source.

I would say the HPC community has a very different community goal post
than netdev land. Make your thing, whatever it is. Come with an open
kernel driver, an open rdma-core, an open libfabric/ucx and plug into
the open dpdk/nccl/ucx/libfabric layer and demonstrate your thing
works with openmpi/etc applications.

Supporting that open stack is broadly my north star from the kernel
perspective, as Mesa is to DRM.

Several points of this chain are open industry standards driven by
technical working group communities.

This is what standardization and interoperability look like
here. It is probably totally foreign from a netdev viewpoint, with far
less focus on the wire protocol, devices and kernel. Here the focus is
on application and software interoperability. Still, it is open in
a pretty solid way.

Jason