Re: [RFC][PATCH 2/3] dma-buf: system_heap: Add pagepool support to system heap

From: John Stultz
Date: Wed Feb 03 2021 - 00:57:21 EST


On Tue, Feb 2, 2021 at 6:04 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Fri, Jan 22, 2021 at 05:28:32PM -0800, John Stultz wrote:
> > On Mon, Dec 21, 2020 at 2:09 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >
> > > On Fri, Dec 18, 2020 at 05:16:56PM -0800, John Stultz wrote:
> > > > On Fri, Dec 18, 2020 at 6:36 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > > > > On Thu, Dec 17, 2020 at 11:06:11PM +0000, John Stultz wrote:
> > > > > > Reuse/abuse the pagepool code from the network code to speed
> > > > > > up allocation performance.
> > > > > >
> > > > > > This is similar to the ION pagepool usage, but tries to
> > > > > > utilize generic code instead of a custom implementation.
> > > > >
> > > > > We also have one of these in ttm. I think we should have at most one of
> > > > > these for the gpu ecosystem overall, maybe as a helper that can be plugged
> > > > > into all the places.
> > > > >
> > > > > Or I'm kinda missing something, which could be since I only glanced at
> > > > > yours for a bit. But it's also called page pool for buffer allocations,
> > > > > and I don't think there's that many ways to implement that really :-)
> > > >
> > > > Yea, when I was looking around the ttm one didn't seem quite as
> > > > generic as the networking one, which more easily fit in here.
> > >
> > > Oops, I didn't look that closely and didn't realize you're reusing the one
> > > from net/core/.
> > >
> > > > The main benefit for the system heap is not so much the pool itself
> > > > (the normal page allocator is pretty good), as it being able to defer
> > > > the free and zero the pages in a background thread, so the pool is
> > > > effectively filled with pre-zeroed pages.
> > > >
> > > > But I'll take another look at the ttm implementation and see if it can
> > > > be re-used or the shared code refactored and pulled out somehow.
> > >
> > > I think moving the page_pool from net into lib and using it in ttm might
> > > also be an option. Lack of shrinker in the networking one might be a bit a
> > > problem.
> >
> > Yea. I've been looking at this, to see how to abstract out a generic
> > pool implementation, but each pool implementation has been tweaked for
> > the specific use cases, so a general abstraction is a bit tough right
> > off.
> >
> > For example the ttm pool's handling allocations both from alloc_pages
> > and dma_alloc in a pool, where the net page pool only uses alloc_pages
> > (but can pre map via dma_map_attr).
> >
> > And as you mentioned, the networking page pool is statically sized
> > where the ttm pool is dynamic and shrinker controlled.
> >
> > Further, as the ttm pool is utilized for keeping pools of pages set
> > for specific cache types, it makes it difficult to abstract that out
> > as we have to be able to reset the caching (set_pages_wb()) when
> > shrinking, so that would also have to be pushed down into the pool
> > attributes as well.
> >
> > So far, in my attempts to share an abstraction for both the net
> > page_pool and the ttm page pool, it seems to make the code complexity
> > worse on both sides - so while I'm interested in continuing to try to
> > find a way to share code here, I'm not sure it makes sense to hold up
> > this series (which is already re-using an existing implementation and
> > provide a performance bump in microbenchmarks) for the
> > grand-unified-page-pool. Efforts to refactor the ttm pool and net page
> > pool can continue on indepently, and I'd be happy to move the system
> > heap to whatever that ends up being.
>
> The thing is, I'm not sure sharing code with net/core is a really good
> idea, at least it seems like we have some impendence mismatch with the ttm
> pool. And going forward I expect sooner or later we need alignment between
> the pools/caches under drm with dma-buf heap pools a lot more than between
> dma-buf and net/core.

I mean... I don't think you're wrong here, but it was your suggestion.

> So this feels like a bit code sharing for code sharing's sake and not
> where it makes sense. Expecting net/core and gpu stacks to have the exact
> same needs for a page pool allocator has good chances to bite us in the
> long run.

Again, I agree with you at the high level here (dmabuf system heap and
ttm page pooling are conceptually more likely to align, and
duplication of buffer pools is non-optimal), but there's still the
practical aspect of the ttm pool being pretty tied to the ttm code
(utilizing ttm contexts, fixed MAX_ORDER*TTM_NUM_CACHING_TYPES
subpools per pool + 4 global sub-pools for only x86).

So... I guess I'll go for another pass at trying to pull something
generic out of the ttm_pool, but the cynic in me suspects folks will
just object to any inefficiencies added in order to do so (the
code-sharing for its own sake argument above) and I'll be back to
where I am now. But we'll see.

thanks
-john