Re: [PATCH 00/15] Habana Labs kernel driver

From: Olof Johansson
Date: Wed Jan 23 2019 - 18:40:41 EST


On Wed, Jan 23, 2019 at 3:20 PM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>
> On Wed, Jan 23, 2019 at 03:04:33PM -0800, Olof Johansson wrote:
> > On Wed, Jan 23, 2019 at 2:45 PM Dave Airlie <airlied@xxxxxxxxx> wrote:
> > >
> > > On Thu, 24 Jan 2019 at 08:32, Oded Gabbay <oded.gabbay@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Jan 24, 2019 at 12:02 AM Dave Airlie <airlied@xxxxxxxxx> wrote:
> > > > >
> > > > > Adding Daniel as well.
> > > > >
> > > > > Dave.
> > > > >
> > > > > On Thu, 24 Jan 2019 at 07:57, Dave Airlie <airlied@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, 23 Jan 2019 at 10:01, Oded Gabbay <oded.gabbay@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > For those who don't know me, my name is Oded Gabbay (Kernel Maintainer
> > > > > > > for AMD's amdkfd driver, worked at RedHat's Desktop group) and I work at
> > > > > > > Habana Labs since its inception two and a half years ago.
> > > > > >
> > > > > > Hey Oded,
> > > > > >
> > > > > > So this creates a driver with a userspace facing API via ioctls.
> > > > > > > Although this isn't a "GPU" driver, we have a rule in the graphics
> > > > > > > drivers area for accelerators that we don't merge userspace API
> > > > > > > without an appropriate userspace user.
> > > > > >
> > > > > > https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements
> > > > > >
> > > > > > I see nothing in these accelerator drivers that make me think we
> > > > > > should be treating them different.
> > > > > >
> > > > > > > Having large closed userspaces that we have no insight into means we
> > > > > > > get suboptimal, locked-forever uAPIs. If someone in the future creates
> > > > > > an open source userspace, we will end up in a place where they get
> > > > > > suboptimal behaviour because they are locked into a uAPI that we can't
> > > > > > change.
> > > > > >
> > > > > > Dave.
> > > >
> > > > Hi Dave,
> > > > While I always appreciate your opinion and am happy to hear it, I
> > > > totally disagree with you on this point.
> > > >
> > > > First of all, as you said, this device is NOT a GPU. Hence, I wasn't
> > > > aware that this rule might apply to this driver or to any other driver
> > > > outside of drm. Has this rule been applied to all the current drivers
> > > > in the kernel tree with userspace facing API via IOCTLs, which are not
> > > > in the drm subsystem ? I see the logic for GPUs as they drive the
> > > > display of the entire machine, but this is an accelerator for a
> > > > specific purpose, not something generic as GPU. I just don't see how
> > > > one can treat them in the same way.
> > >
> > > The logic isn't there for GPUs for those reasons, that we have an
> > > established library or that GPUs are in laptops. They are just where
> > > we learned the lessons of merging things whose primary reason for
> > > being in the kernel is to execute stuff from misc userspace stacks,
> > > where the uAPI has to remain stable indefinitely.
> > >
> > > a) security - without knowledge of what the accelerator can do how can
> > > we know if the API you expose isn't just a giant root hole?
> > >
> > > b) uAPI stability. Without a userspace for this, there is no way for
> > > anyone, even if in possession of the hardware, to validate that the
> > > uAPI you provide and are asking the kernel to commit to supporting
> > > indefinitely is optimal or secure. If an open source userspace
> > > appears, is it to be limited to the API the closed userspace has
> > > created? It limits the future unnecessarily.
> > >
> > > > There is no way that "someone" will create a userspace
> > > > for our H/W without the intimate knowledge of the H/W or without the
> > > > ISA of our programmable cores. Maybe for large companies this request
> > > > is valid, but for startups complying to this request is not realistic.
> > >
> > > So what benefit does the Linux kernel get from having support for this
> > > feature upstream?
> > >
> > > If users can't access the necessary code to use it, why does this
> > > need to be maintained in the kernel?
> > >
> > > > To conclude, I think this approach discourages other companies from
> > > > open sourcing their drivers and is counter-productive. I'm not sure
> > > > you are aware of how difficult it is to convince startup management to
> > > > opensource the code...
> > >
> > > Oh I am, but I'm also more aware how quickly startups go away and
> > > leave the kernel holding a lot of code we don't know how to validate
> > > or use.
> > >
> > > I'm open to being convinced but I think defining new userspace
> > > facing APIs is a task that we should take a lot more seriously going
> > > forward to avoid mistakes of the past.
> >
> > I think the most important thing here is to know that things are
> > likely to change quite a bit over the next couple of years, and that
> > we don't know yet what we actually need. If we hold off picking up
> > support for hardware while all of this is ironed out, we'll miss out
> > on being exposed to it, and will have a very tall hill to climb once
> > we try to convince vendors to come into the fold. It's also not been a
> > requirement for the other two drivers we have merged, as far as I can
> > tell (CAPI and OpenCAPI) so the cat's already out of the bag.
> >
> > I'd rather not get stuck in a stand-off needing the longterm solution
> > to pick up the short term contribution. That way we can move over to a
> > _new_ API once there's been a better chance of finding common ground
> > and once things settle down a bit, instead of trying to bring some
> > larger legacy codebase for devices that people might no longer care
> > much about over to the newer APIs.
> >
> > It's better to be exposed to the HW and drivers now, than having
> > people build large elaborate out-of-tree software stacks for this.
> > It's also better to get them to come and collaborate now, instead of
> > pushing them away until things are perfect.
> >
> > Having a way to validate and exercise the userspace API is important,
> > including ability to change it if needed. Would it be possible to open
> > up the lowest userspace pieces (driver interactions), even if some
> > other layers might not yet be, to exercise the device/kernel/userspace
> > interfaces without "live" workload, etc?
>
> Yes, and to exercise the userspace API you need at the very least to
> know the ISA so that you can write programs for the accelerator.
> You also need to know the set of commands the hardware has. The
> ioctl and how to create a userspace that interacts with the kernel
> is the easy part, the hard part is the compiler.
>
> So if we want any kind of freedom to play with the UAPI, enhance
> it or change it in any way, we must be free to build programs for the
> device ourselves.
>
> I believe that the GPU sub-system requirements are a good guideline
> to follow and the only exception within drivers/ that I am aware of
> is the fpga. Everything else in drivers/ either has an open source
> userspace, exposes a common API (like network) or is so simple that
> anyone can write a userspace for it.

Once we have a common framework I agree that we need enough tools to
exercise everything needed. I don't agree that this includes full
sources to everything. We don't expect this for most PCIe cards today
either.

If the GPU subsystem's requirements are to be followed, I fear that we
will end up with Nvidia-equivalent vendors from day 1, where they will
just build a bigger and bigger software stack on the side instead of
joining in, and someone will need to best-effort bridge the gap by
reverse engineering. I don't want that situation long-term, which is
why I think it's reasonable to be more relaxed during the early days,
with clear, upfront expectations for the longer term that
hardware/kernel interfaces need to be exercisable.

> For any complex device that executes programs we should really enforce
> the open source userspace so that we can properly audit the driver,
> as otherwise we only have half of the story with no idea what the
> other half might imply.

What you're demanding is open userspace _and_ firmware, since without
firmware sources you can't audit any on-chip behavior either (in
reality, most commands passed down are likely parsed by said
firmware).


-Olof