Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

From: Liam R. Howlett
Date: Tue Jun 27 2023 - 12:12:32 EST


* Danilo Krummrich <dakr@xxxxxxxxxx> [230627 10:58]:
> Hi Liam,
>
> On 6/27/23 03:58, Liam R. Howlett wrote:
> > * Danilo Krummrich <dakr@xxxxxxxxxx> [230626 14:37]:
> > > On 6/26/23 16:52, Matthew Wilcox wrote:
> > > > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
> > > > > On 6/26/23 15:19, Matthew Wilcox wrote:
> > > > > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
> > > > > > > On the other hand, unless I miss something (and if so, please let me know),
> > > > > > > something is bogus with the API then.
> > > > > > >
> > > > > > > While the documentation of the Advanced API of the maple tree explicitly
> > > > > > > claims that the user of the API is responsible for locking, this should be
> > > > > > > limited to the bounds set by the maple tree implementation. Which means, the
> > > > > > > user must decide for either the internal (spin-) lock or an external lock
> > > > > > > (which possibly goes away in the future) and acquire and release it
> > > > > > > according to the rules maple tree enforces through lockdep checks.
> > > > > > >
> > > > > > > Let's say one picks the internal lock. How is one supposed to ensure the
> > > > > > > tree isn't modified using the internal lock with mas_preallocate()?
> > > > > > >
> > > > > > > Besides that, I think the documentation should definitely mention this
> > > > > > > limitation and give some guidance for the locking.
> > > > > > >
> > > > > > > Currently, from an API perspective, I can't see how anyone not familiar with
> > > > > > > the implementation details would be able to recognize this limitation.
> > > > > > >
> > > > > > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop
> > > > > > > the maple tree and go back to using a rb-tree, since it seems there is no
> > > > > > > sane way doing a worst-case pre-allocation that does not suffer from this
> > > > > > > limitation.
> > > > > >
> > > > > > I haven't been paying much attention here (too many other things going
> > > > > > on), but something's wrong.
> > > > > >
> > > > > > First, you shouldn't need to preallocate. Preallocation is only there
> > > > >
> > > > > Unfortunately, I think we really have a case where we have to. Typically GPU
> > > > > mappings are created in a dma-fence signalling critical path and that is
> > > > > where such mappings need to be added to the maple tree. Hence, we can't do
> > > > > any sleeping allocations there.
> > > >
> > > > OK, so there are various ways to hadle this, depending on what's
> > > > appropriate for your case.
> > > >
> > > > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM
> > > > layer "This is too hard, let me tap into the emergency reserves". It's
> > > > mildly frowned upon, so let's see if we can do better.
> > > >
> > > > If you know where the allocation needs to be stored, but want it to act as
> > > > NULL until the time is right, you can store a ZERO entry. That will read
> > > > as NULL until you store to it. A pure overwriting store will not cause
> > > > any memory allocation since all the implementation has to do is change
> > > > a pointer. The XArray wraps this up nicely behind an xa_reserve() API.
> > > > As you're discovering, the Maple Tree API isn't fully baked yet.
> > > >
> > >
> > > Unfortunately, GFP_ATOMIC seems the be the only option. I think storing
> > > entries in advance would not work. Typically userspace submits a job to the
> > > kernel issuing one or multiple requests to map and unmap memory in an ioctl.
> > > Such a job is then put into a queue and processed asynchronously in a
> > > dma-fence signalling critical section. Hence, at the we'd store entries in
> > > advance we could have an arbitrary amount of pending jobs potentially still
> > > messing with the same address space region.
> >
> > What I think you are saying is that you have a number of requests
> > flooding in, which may overwrite the same areas, but are queued up to be
> > written after they are queued. These operations look to be kept in
> > order according to the code in nouveau_job_submit[1]. Is this correct?
>
> That's all correct.
>
> (Although Nouveau isn't a good example in this case. Some aspects of it do
> and some aspects of it do not apply to the problem we're discussing here.)
>
> >
> > So then, your issue isn't that you don't know where they will land, but
> > don't know if the area that you reserved is already split into other
> > areas? For instance, before the range 5-10 is backed by whatever
> > happens in the fence, it may have already become 5-6 & 8-10 by something
> > that came after (from userspace) but hasn't been processed by the
> > kernel that will live at 7? So you can't write 5-10 right away because
> > you can't be sure 5-10 is going to exist once you reach the kernel fence
> > code that stores the entry?
> >
> > Is my understanding of your issue correct?
>
> Yes, it is.
>
> However, the problem already starts while trying to reserve an area. In
> order to satisfy a user request, such a request is broken down into
> operations such as unmap mappings which are in the way entirely, remap
> mappings which intersect with the requested mapping and finally map the
> requested mapping. The execution of such a sequence must appear atomic and
> hence be locked accordingly. When trying to reserve an area we'd need to
> take that lock. But since this lock would be used in the dma-fence
> signalling critical path as well we'd not be allowed to do sleeping
> allocations while holding this lock.
>
> Potentially, this could be solved with a retry loop though. Drop the lock
> while allocating, take it again and check whether we still got enough nodes
> allocated. Analogous to what the maple tree does in mas_store_gfp(), I
> guess.
>
> >
> > Oh, and I guess the queued requests would have to remain ordered between
> > threads or whatever is on the other side? I mean, you can't have two
> > threads firing different things into the kernel at the same region
> > because I would think the results would be unpredictable?
>
> Once a job is queued up in the kernel they remain ordered.
>
> However, user threads could concurrently push jobs to the kernel altering
> the same region of the address space - it just would not make any sense for
> userspace to do that.
>
> In general userspace is responsible for the semantics of the address space.
> The kernel basically just takes any (valid) request and make it happen. It
> also assures waiting and signalling of fences which might be bound to
> certain jobs and obviously keeps track of the VA space to be able to clean
> things up once a client disappears.
>
> >
> > Can these overlapping entries partially overlap one region and another?
> > That is, can you have three in-flight writes that does something like:
> > store 1-10, store 10-20, store 5-15?
>
> Absolutely, yes.
>
> >
> > How stable of an output is needed? Does each kernel write need to be
> > 100% correct or is there a point where the userspace updates stop and
> > only then it is needed to be stable?
>
> It needs to be 100% correct all the time. The reason is that, as mentioned
> above, every job can carry in- and out-fences, such that userspace can order
> these jobs against the execution of shaders.

But each job is split into parts, so the fences surround these groups of
operations?

Since ordering is kept, you must reach a point before entering the
fences which could call the mas_preallocate() to ensure enough nodes
exist to install the new mapping, and then no other operations will be
happening. I guess what you are saying is each fence has more than one
tree operation?

As long as you are not mapping more than a range, then this should be possible
in a single write and thus a single preallocation. You can do this by
not actually writing unmaps/remaps to the tree within the fence. Once
the out-fence is reached, then the operation looks atomic.

Reading your patch, it is not clear this is accurate for VM_BIND of
asynchronous syncobjs. Is the fence spanning multiple syncobjs with
various ranges to map? Or are these things the split-up tasks of
unmap/remap, etc that will eventually boil down to what appears to be a
single write?

>
> This is also why there could be jobs queued up, where all of them apply
> changes to the same region within the VA space, since there might be shader
> executions (or just memory copies) ordered right between them.
>
> - Danilo
>
> >
> > >
> > > So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC
> > > directly in the fence signalling critical path. I guess mas_store_gfp() does
> > > not BUG_ON() if it can't get atomic pages?
> > >
> > > Also, I just saw that the tree is limited in it's height (MAPLE_HEIGHT_MAX).
> > > Do you think it could be a sane alternative to pre-allocate with
> > > MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise
> > > of pre-allocating just a couple of nodes and then rely on atomic pages for
> > > the rest?
> > >
> > > FYI, we're talking about a magnitude of hundreds of thousands of entries to
> > > be stored in the tree.
> > >
> >
> > Since you are not tracking gaps, you will get 16 entries per node. The
> > maximum height is 31, so that would be 16^31, assuming a gap between
> > each entry (the worst case), you can cut that in 1/2. To assure you can
> > successfully allocate storage for a new entries, you'd need to allocate
> > 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful
> > as almost all of these would be freed, and sometimes all of them.
> >
> > You estimate less than 1M entries, that would never go over 6 levels (8.3M
> > entries with the worst-case). 5 levels would get you 500K in the worst
> > case, but realistically you'll be in the 5 levels almost always. So,
> > 5*3+1 = 17 nodes, or 2 pages (1 node over 1 page).. assuming 4k pages.
> >
> > [1] https://lore.kernel.org/linux-mm/20230620004217.4700-8-dakr@xxxxxxxxxx/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
> >
>