Re: [PATCH] mm: page_alloc: consume available CMA space first

From: Roman Gushchin
Date: Wed Jul 26 2023 - 19:38:24 EST

Next message: Damien Le Moal: "Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep"
Previous message: Jason Gunthorpe: "Re: [PATCH v8 2/4] iommufd: Add iommufd_access_replace() API"
In reply to: Andrew Morton: "Re: [PATCH v2] mm: page_alloc: consume available CMA space first"
Next in thread: Vlastimil Babka: "Re: [PATCH] mm: page_alloc: consume available CMA space first"

On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> On a memcache setup with heavy anon usage and no swap, we routinely
> see premature OOM kills with multiple gigabytes of free space left:
>
> Node 0 Normal free:4978632kB [...] free_cma:4893276kB
>
> This free space turns out to be CMA. We set CMA regions aside for
> potential hugetlb users on all of our machines, figuring that even if
> there aren't any, the memory is available to userspace allocations.
>
> When the OOMs trigger, it's from unmovable and reclaimable allocations
> that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> dominated by the anon pages.
>
> Because we have more options for CMA pages, change the policy to
> always fill up CMA first. This reduces the risk of premature OOMs.

I suspect it might cause regressions on small(er) devices, where a relatively
small CMA area (megabytes) is often reserved for use by various device drivers
that can't handle allocation failures well (even transient ones). Startup time
can regress too: migrating pages out of CMA takes time.

And given the pace of kernel upgrades on such devices, we won't learn about
it for the next couple of years.
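
To make the trade-off concrete, here is a toy model of the two placement
policies. It is not kernel code: the struct, the enum and all function names
are hypothetical, and it only does free-page bookkeeping (no fragmentation,
no reclaim, no watermarks). It assumes movable pages may land anywhere,
unmovable pages only outside CMA, and that a driver claiming part of the CMA
area has to migrate out whatever movable pages are in its way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placement policies for movable pages. */
enum policy { CMA_LAST, CMA_FIRST };

/* Hypothetical per-zone page counters; not a kernel structure. */
struct zone_model {
	size_t cma_total;    /* pages inside the CMA area */
	size_t normal_total; /* pages outside the CMA area */
	size_t cma_used;     /* movable pages currently sitting in CMA */
	size_t normal_used;  /* pages currently used outside CMA */
};

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Place n movable pages, preferred region first; returns pages placed. */
static size_t place_movable(struct zone_model *z, size_t n, enum policy p)
{
	size_t cma_free = z->cma_total - z->cma_used;
	size_t normal_free = z->normal_total - z->normal_used;
	size_t into_cma, into_normal;

	if (p == CMA_FIRST) {
		into_cma = min_sz(n, cma_free);
		into_normal = min_sz(n - into_cma, normal_free);
	} else {
		into_normal = min_sz(n, normal_free);
		into_cma = min_sz(n - into_normal, cma_free);
	}
	z->cma_used += into_cma;
	z->normal_used += into_normal;
	return into_cma + into_normal;
}

/* Unmovable pages may only come from outside CMA; returns pages placed. */
static size_t place_unmovable(struct zone_model *z, size_t n)
{
	size_t placed = min_sz(n, z->normal_total - z->normal_used);

	z->normal_used += placed;
	return placed;
}

/*
 * Best-case number of pages that must be migrated out of CMA before a
 * driver can claim `want` pages from it (ignores contiguity).
 */
static size_t migrations_for_cma_alloc(const struct zone_model *z, size_t want)
{
	size_t cma_free = z->cma_total - z->cma_used;

	assert(want <= z->cma_total);
	return want > cma_free ? want - cma_free : 0;
}
```

With a 16-page CMA area, 1000 pages outside it and 100 movable pages placed,
CMA_FIRST leaves the driver with 16 pages to migrate before it can claim the
whole area, while CMA_LAST leaves it with none. With 1000 pages on each side
and 1000 movable pages placed, the picture flips: under CMA_LAST a following
unmovable request finds nothing outside CMA, under CMA_FIRST it succeeds.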

> Movable pages can be migrated out of CMA when necessary, but we don't
> have a mechanism to migrate them *into* CMA to make room for unmovable
> allocations. The only recourse we have for these pages is reclaim,
> which due to a lack of swap is unavailable in our case.

Idk, should we introduce such a mechanism? Or use some alternative heuristic
that would be a better compromise between those who need CMA allocations to
always succeed and those who use large CMA areas for opportunistic huge page
allocations? Of course, we can add a boot flag/sysctl/per-CMA-area flag, but I
doubt we really want this.

Thanks!
