Re: [PATCH v2 3/3] percpu: improve allocation success rate for non-GFP_KERNEL callers

From: Tejun Heo
Date: Mon Feb 27 2017 - 17:36:06 EST


Hello,

On Mon, Feb 27, 2017 at 05:00:31AM -0800, Tahsin Erdogan wrote:
> On Mon, Feb 27, 2017 at 1:52 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > On Sat 25-02-17 20:38:29, Tahsin Erdogan wrote:
> >> When pcpu_alloc() is called with gfp != GFP_KERNEL, the likelihood of
> >> a failure is higher than GFP_KERNEL case. This is mainly because
> >> pcpu_alloc() relies on previously allocated reserves and does not make
> >> an effort to add memory to its pools for non-GFP_KERNEL case.
> >
> > Who is going to use a different mask?
>
> blkg_create() makes a call with a non-GFP_KERNEL mask:
> new_blkg = blkg_alloc(blkcg, q, GFP_NOWAIT | __GFP_NOWARN);
>
> which turns into a call stack like below:
>
> __vmalloc+0x45/0x50
> pcpu_mem_zalloc+0x50/0x80
> pcpu_populate_chunk+0x3b/0x380
> pcpu_alloc+0x588/0x6e0
> __alloc_percpu_gfp+0xd/0x10
> __percpu_counter_init+0x55/0xc0
> blkg_alloc+0x76/0x230
> blkg_create+0x489/0x670
> blkg_lookup_create+0x9a/0x230
> generic_make_request_checks+0x7dd/0x890
> generic_make_request+0x1f/0x180
> submit_bio+0x61/0x120

As indicated by GFP_NOWAIT | __GFP_NOWARN, it's okay to fail there.
It's not okay to fail consistently for a long time but it's not a big
issue to fail occassionally even if somewhat bunched up. The only bad
side effect of that is temporary misaccounting of some IOs, which
shouldn't be noticeable outside of pathological cases. If you're
actually seeing adverse effects of this, I'd love to learn about it.

> > We already have __vmalloc_gfp, why this cannot be used? Also note that
> > vmalloc dosn't really support arbitrary gfp flags. One have to be really
> > careful because there are some internal allocations which are hardcoded
> > GFP_KERNEL. Also this patch doesn't really add any new callers so it is
> > hard to tell whether what you do actually makes sense and is correct.
>
> Did you mean to say __vmalloc? If so, yes, I should use that.

So, the last time I looked at it the thorny ones in that path are the
page table (pgd, pud...) allocation functions. There are several
layers of indirection there but they end up in arch-specific
implemntations which hard code GFP_KERNEL. Without fixing them up, we
can't guarantee mapping the allocated pages making things kinda moot.

The only reason percpu allocator has the background allocator stuff is
vmalloc path can't do non-blocking allocations. If we can properly
fix that up, we can get rid of all those code from percpu allocator
and simply path the gfp flag to vmap functions. Please take a look at
__pcpu_map_pages() in mm/percpu-vm.c. map_kernel_range_noflush() is
the function which has implicit GFP_KERNEL allocation in it and what's
requiring the reserve.

If you can get rid of that, awesome, but given that your patch doesn't
touch that at all, I can't see how it's supposed to work.

Thanks.

--
tejun