Re: [PATCH] mm: page_alloc: consume available CMA space first

From: Roman Gushchin
Date: Thu Jul 27 2023 - 13:08:22 EST


On Thu, Jul 27, 2023 at 11:34:13AM -0400, Johannes Weiner wrote:
> On Wed, Jul 26, 2023 at 04:38:11PM -0700, Roman Gushchin wrote:
> > On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> > > On a memcache setup with heavy anon usage and no swap, we routinely
> > > see premature OOM kills with multiple gigabytes of free space left:
> > >
> > > Node 0 Normal free:4978632kB [...] free_cma:4893276kB
> > >
> > > This free space turns out to be CMA. We set CMA regions aside for
> > > potential hugetlb users on all of our machines, figuring that even if
> > > there aren't any, the memory is available to userspace allocations.
> > >
> > > When the OOMs trigger, it's from unmovable and reclaimable allocations
> > > that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> > > dominated by the anon pages.
> > >
> > >
> > > Because we have more options for CMA pages, change the policy to
> > > always fill up CMA first. This reduces the risk of premature OOMs.
> >
> > I suspect it might cause regressions on small(er) devices where
> > a relatively small cma area (MBs) is often reserved for use by various
> > device drivers, which can't handle allocation failures well (even interim
> > allocation failures). Startup time can regress too: migrating pages out of
> > cma will take time.
>
> The page allocator is currently happy to give away all CMA memory to
> movables before entering reclaim. It will use CMA even before falling
> back to a different migratetype.
>
> Do these small setups take special precautions to never fill memory?
> Proactively trim file cache? Never swap? Because AFAICS, unless they
> do so, this would only change the timing of when CMA fills up, not if.

Imagine something like a web-camera or a router. It boots up, brings up some
custom drivers/hardware, starts some daemons and runs forever. It might never
reach the memory capacity or it might take hours or days. The point is that
during initialization cma is fully available.

>
> > And given the velocity of kernel upgrades on such devices, we won't learn about
> > it for the next couple of years.
>
> That's true. However, a potential regression with this would show up
> fairly early in kernel validation since CMA would fill up in a more
> predictable timeline. And the change is easy to revert, too.
>
> Given that we have a concrete problem with the current behavior, I
> think it's fair to require a higher bar for proof that this will
> indeed cause a regression elsewhere before raising the bar on the fix.

I'm not opposing the change, just raising a concern. I expect that
we'll need a more complicated solution at some point anyway.

>
> > > Movable pages can be migrated out of CMA when necessary, but we don't
> > > have a mechanism to migrate them *into* CMA to make room for unmovable
> > > allocations. The only recourse we have for these pages is reclaim,
> > > which due to a lack of swap is unavailable in our case.
> >
> > Idk, should we introduce such a mechanism? Or use some alternative heuristics,
> > which will be a better compromise between those who need cma allocations to always
> > pass and those who use large cma areas for opportunistic huge page allocations?
> > Of course, we can add a boot flag/sysctl/per-cma-area flag, but I doubt we
> > really want this.
>
> Right, having migration into CMA could be a viable option as well.
>
> But I would like to learn more from CMA users and their expectations,
> since there isn't currently a guarantee that CMA stays empty.

This change makes cma allocations less deterministic. Previously a cma allocation
almost always succeeded; with this change we'll see more interim failures.
(This is mostly about the period shortly after boot, when the majority of memory
is still free.)
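
To make the concern concrete, here is a toy userspace model (not kernel code)
that counts how many movable pages end up sitting in cma after a number of
early allocations under two policies. Everything in it is hypothetical: the
struct, the function names and the policy enum are made up for illustration,
and the "balanced" rule is only loosely modelled on the current behaviour as I
understand it, where movable allocations prefer cma once more than half of the
zone's free pages are cma.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy zone: free-page counters only, no buddy lists or migratetypes. */
struct toy_zone {
	long free_regular;   /* free non-cma pages */
	long free_cma;       /* free cma pages */
	long movable_in_cma; /* movable pages currently occupying cma */
};

/* Hypothetical policy names, for illustration only. */
enum toy_policy {
	POLICY_BALANCED,  /* use cma when it holds > 1/2 of free pages */
	POLICY_CMA_FIRST, /* always fill cma before regular memory */
};

/* Decide whether the next movable allocation is served from cma. */
static bool use_cma(const struct toy_zone *z, enum toy_policy p)
{
	if (z->free_cma == 0)
		return false;
	if (z->free_regular == 0)
		return true; /* nothing else left, fall back to cma */
	if (p == POLICY_CMA_FIRST)
		return true;
	return z->free_cma > (z->free_regular + z->free_cma) / 2;
}

/* Allocate one movable page; returns false if the zone is exhausted. */
static bool alloc_movable(struct toy_zone *z, enum toy_policy p)
{
	if (use_cma(z, p)) {
		z->free_cma--;
		z->movable_in_cma++;
		return true;
	}
	if (z->free_regular > 0) {
		z->free_regular--;
		return true;
	}
	return false;
}

/*
 * After n movable allocations on a fresh zone, return how many pages a
 * driver asking for the whole cma area would have to wait on to be
 * migrated out first.
 */
long pages_to_migrate_after(long regular, long cma, long n, enum toy_policy p)
{
	struct toy_zone z = { regular, cma, 0 };

	assert(regular >= 0 && cma >= 0 && n >= 0);
	for (long i = 0; i < n; i++)
		alloc_movable(&z, p);
	return z.movable_in_cma;
}
```

With 1000 regular pages, 100 cma pages and 50 early movable allocations, the
balanced rule leaves cma untouched, while cma-first puts all 50 pages into
cma, so a driver allocation arriving at that point has to migrate them (or
fail in the interim).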

>
> This patch would definitely be the simpler solution. It would also
> shave some branches and cycles off the buddy hotpath for many users
> that don't actively use CMA but have CONFIG_CMA=y (I checked archlinux
> and Fedora, not sure about Suse).

Yes, this is good.