Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4

From: John Stultz
Date: Mon Feb 06 2023 - 23:37:59 EST


On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <jaewon31.kim@xxxxxxxxxxx> wrote:
> Hello John Stultz, sorry for late reply.
> I had to manage other urgent things and this test also took some time to finish.
> Any I hope you to be happy with following my test results.
>
>
> 1. system heap modification
>
> To avoid effect of allocation from the pool, all the freed dma
> buffer were passed to buddy without keeping them in the pool.
> Some trace_printk and order counting logic were added.
>
> 2. the test tool
>
> To test the dma-buf system heap allocation speed, I prepared
> a userspace test program which requests a specified size to a heap.
> With the program, I tried to request 16 times of 10 MB size and
> added 1 sleep between each request. Each memory was not freed
> until the total 16 times total memory was allocated.

Oof. I really appreciate all your effort that I'm sure went in to
generate these numbers, but this wasn't quite what I was asking for.
I know you've been focused on allocation performance under memory
pressure, but I was hoping to see the impact of __using__ order 0
pages over order 4 pages in real world conditions (for camera or video
recording or other use cases that use large allocations). These
results seem to be still just focused on the difference in allocation
performance between order 0 and order 4 with and without contention.

That said, re-reading my email, I probably could have been more clear
on this aspect.


> 3. the test device
>
> The test device has arm64 CPU cores and v5.15 based kernel.
> To get stable results, the CPU clock was fixed not to be changed
> in run time, and the test tool was set to some specific CPU cores
> running in the same CPU clock.
>
> 4. test results
>
> As we expected if order 4 exist in the buddy, the order 8, 4, 0
> allocation was 1 to 4 times faster than the order 8, 0, 0. But
> the order 8, 0, 0 also looks fast enough.
>
> Here's time diff, and number of each order.
>
> order 8, 4, 0 in the enough order 4 case
>
> diff 8 4 0
> 665 usec 0 160 0
> 1,148 usec 0 160 0
> 1,089 usec 0 160 0
> 1,154 usec 0 160 0
> 1,264 usec 0 160 0
> 1,414 usec 0 160 0
> 873 usec 0 160 0
> 1,148 usec 0 160 0
> 1,158 usec 0 160 0
> 1,139 usec 0 160 0
> 1,169 usec 0 160 0
> 1,174 usec 0 160 0
> 1,210 usec 0 160 0
> 995 usec 0 160 0
> 1,151 usec 0 160 0
> 977 usec 0 160 0
>
> order 8, 0, 0 in the enough order 4 case
>
> diff 8 4 0
> 441 usec 10 0 0
> 747 usec 10 0 0
> 2,330 usec 2 0 2048
> 2,469 usec 0 0 2560
> 2,518 usec 0 0 2560
> 1,176 usec 0 0 2560
> 1,487 usec 0 0 2560
> 1,402 usec 0 0 2560
> 1,449 usec 0 0 2560
> 1,330 usec 0 0 2560
> 1,089 usec 0 0 2560
> 1,481 usec 0 0 2560
> 1,326 usec 0 0 2560
> 3,057 usec 0 0 2560
> 2,758 usec 0 0 2560
> 3,271 usec 0 0 2560
>
> From the perspective of responsiveness, the deterministic
> memory allocation speed, I think, is quite important. So I
> tested other case where the free memory are not enough.
>
> On this test, I ran the 16 times allocation sets twice
> consecutively. Then it showed the first set order 8, 4, 0
> became very slow and varied, but the second set became
> faster because of the already created the high order.
>
> order 8, 4, 0 in low memory
>
> diff 8 4 0
> 584 usec 0 160 0
> 28,428 usec 0 160 0
> 100,701 usec 0 160 0
> 76,645 usec 0 160 0
> 25,522 usec 0 160 0
> 38,798 usec 0 160 0
> 89,012 usec 0 160 0
> 23,015 usec 0 160 0
> 73,360 usec 0 160 0
> 76,953 usec 0 160 0
> 31,492 usec 0 160 0
> 75,889 usec 0 160 0
> 84,551 usec 0 160 0
> 84,352 usec 0 160 0
> 57,103 usec 0 160 0
> 93,452 usec 0 160 0
>
> diff 8 4 0
> 808 usec 10 0 0
> 778 usec 4 96 0
> 829 usec 0 160 0
> 700 usec 0 160 0
> 937 usec 0 160 0
> 651 usec 0 160 0
> 636 usec 0 160 0
> 811 usec 0 160 0
> 622 usec 0 160 0
> 674 usec 0 160 0
> 677 usec 0 160 0
> 738 usec 0 160 0
> 1,130 usec 0 160 0
> 677 usec 0 160 0
> 553 usec 0 160 0
> 1,048 usec 0 160 0
>
>
> order 8, 0, 0 in low memory
>
> diff 8 4 0
> 1,699 usec 2 0 2048
> 2,082 usec 0 0 2560
> 840 usec 0 0 2560
> 875 usec 0 0 2560
> 845 usec 0 0 2560
> 1,706 usec 0 0 2560
> 967 usec 0 0 2560
> 1,000 usec 0 0 2560
> 1,905 usec 0 0 2560
> 2,451 usec 0 0 2560
> 3,384 usec 0 0 2560
> 2,397 usec 0 0 2560
> 3,171 usec 0 0 2560
> 2,376 usec 0 0 2560
> 3,347 usec 0 0 2560
> 2,554 usec 0 0 2560
>
> diff 8 4 0
> 1,409 usec 2 0 2048
> 1,438 usec 0 0 2560
> 1,035 usec 0 0 2560
> 1,108 usec 0 0 2560
> 825 usec 0 0 2560
> 927 usec 0 0 2560
> 1,931 usec 0 0 2560
> 2,024 usec 0 0 2560
> 1,884 usec 0 0 2560
> 1,769 usec 0 0 2560
> 2,136 usec 0 0 2560
> 1,738 usec 0 0 2560
> 1,328 usec 0 0 2560
> 1,438 usec 0 0 2560
> 1,972 usec 0 0 2560
> 2,963 usec 0 0 2560

So, thank you for generating all of this. I think this all looks as
expected, showing the benefit of your change to allocation under
contention and showing the potential downside in the non-contention
case.

I still worry about the performance impact outside of allocation time
of using the smaller order pages (via map and unmap through iommu to
devices, etc), so it would still be nice to have some confidence this
won't introduce other regressions, but I do agree the worse case
impact is very bad.

> Finally if we change order 4 to use HIGH_ORDER_GFP,
> I expect that we could avoid the very slow cases.
>

Yeah. Again, this all aligns with the upside of your changes, which
I'm eager for.
I just want to have a strong sense of any regressions it might also cause.

I don't mean to discourage you, especially after all the effort here.
Do you think evaluating the before and after impact to buffer usage
(not just allocation) would be doable in the near term?

If you don't think so, given the benefit to allocation under pressure
is large (and I don't mean to give you hurdles to jump), I'm willing
to ack your change to get it merged, but if we later see performance
trouble, I'll be quick to advocate for reverting it. Is that ok?

thanks
-john