Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

From: Greg Kroah-Hartman
Date: Wed Sep 16 2020 - 04:22:03 EST


On Wed, Sep 16, 2020 at 11:02:39AM +0300, Oded Gabbay wrote:
> On Wed, Sep 16, 2020 at 10:41 AM Greg Kroah-Hartman
> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Sep 16, 2020 at 09:36:23AM +0300, Oded Gabbay wrote:
> > > On Wed, Sep 16, 2020 at 9:25 AM Greg Kroah-Hartman
> > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Sep 15, 2020 at 11:49:12PM +0300, Oded Gabbay wrote:
> > > > > On Tue, Sep 15, 2020 at 11:42 PM David Miller <davem@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Oded Gabbay <oded.gabbay@xxxxxxxxx>
> > > > > > Date: Tue, 15 Sep 2020 20:10:08 +0300
> > > > > >
> > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code
> > > > > > > into the habanalabs driver.
> > > > > > >
> > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details
> > > > > > > are in that patch's commit message.
> > > > > > >
> > > > > > > Link to v2 cover letter:
> > > > > > > https://lkml.org/lkml/2020/9/12/201
> > > > > >
> > > > > > > I agree with Jakub, this driver definitely can't go in as it is currently
> > > > > > structured and designed.
> > > > > Why is that ?
> > > > > > Can you please point to the things that bother you or are not working correctly?
> > > > > I can't really fix the driver if I don't know what's wrong.
> > > > >
> > > > > In addition, please read my reply to Jakub with the explanation of why
> > > > > we designed this driver as is.
> > > > >
> > > > > > > And because of the RDMA'ness of it, the RDMA
> > > > > > folks have to be CC:'d and have a chance to review this.
> > > > > As I said to Jakub, the driver doesn't use the RDMA infrastructure in
> > > > > the kernel and we can't connect to it due to the lack of H/W support
> > > > > > we have.
> > > > > Therefore, I don't see why we need to CC linux-rdma.
> > > > > I understood why Greg asked me to CC you because we do connect to the
> > > > > netdev and standard eth infrastructure, but regarding the RDMA, it's
> > > > > not really the same.
> > > >
> > > > Ok, to do this "right" it needs to be split up into separate drivers,
> > > > hopefully using the "virtual bus" code that some day Intel will resubmit
> > > > again that will solve this issue.
> > > Hi Greg,
> > > Can I suggest an alternative for the short/medium term ?
> > >
> > > In an earlier email, Jakub said:
> > > "Is it not possible to move the files and still build them into a single
> > > module?"
> > >
> > > I thought maybe that's a good way to progress here ?
> >
> > Cross-directory builds of a single module are crazy. Yes, they work,
> > but really, that's a mess, and I would never suggest doing that.
> >
> > > First, split the content to Ethernet and RDMA.
> > > Then move the Ethernet part to drivers/net but build it as part of
> > > habanalabs.ko.
> > > Regarding the RDMA code, upstream/review it in a different patch-set
> > > (maybe they will want me to put the files elsewhere).
> > >
> > > What do you think ?
> >
> > I think you are asking for more work there than just splitting out into
> > separate modules :)
> >
> > thanks,
> >
> > greg k-h
> Hi Greg,
>
> If cross-directory building is out of the question, what about
> splitting into separate modules ? And use cross-module notifiers/calls
> ? I did that with amdkfd and amdgpu/radeon a couple of years back. It
> worked (that's the best thing I can say about it).

That's fine with me.

> The main problem with this "virtual bus" thing is that I'm not
> familiar with it at all and from my experience I imagine it would take
> a considerable time and effort to upstream this infrastructure work.

It shouldn't be taking that long, but for some unknown reason, the
original author of that code is sitting on it and not resending it. Go
poke them through internal Intel channels to find out what the problem
is, as I have no clue why a 200-300 line bus module is taking so long to
get "right" :(

I'm _ALMOST_ at the point where I would just do that work myself, but
due to my current status with Intel, I'll let them do it as I have
enough other things on my plate...

> This could delay the NIC code for a couple of years, by which time
> it won't be relevant at all.

Why wouldn't this code be relevant in a year? It's going to be 2+ years
before any of this shows up in an "enterprise distro" based on their
release cycles anyway :)

thanks,

greg k-h