Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr

From: Yang, Bin
Date: Wed Jul 04 2018 - 02:08:02 EST


Thanks for reviewing my patch. I will send an updated patch version
based on your feedback soon.

On Tue, 2018-07-03 at 15:57 +0200, Thomas Gleixner wrote:
> Bin,
>
> On Thu, 28 Jun 2018, Bin Yang wrote:
>
> thanks for submitting this.
>
> > This issue can be easily triggered by free_initmem() functuion
> > on x86_64 cpu.
>
> Please do not use the output of git show for submitting a patch. Use
> git format-patch(1).

I use "git send-email -1" to submit patch for review. Should I run "git
format-patch" first and send the patch as email?

>
> > If cpu supports "pdpe1gb", kernel will try to use 1G large page.
> > In worst case, it needs to check every 4K page in 1G range in
> > try_preserve_large_page function. If __change_page_attr_set_clr
> > needs to change lots of 4K pages, cpu will be stuck for long time.
>
> Unfortunately you missed to describe what the actual interface
> function is which is called from a driver or whatever, including the
> parameters.

This issue was discovered during kernel boot time optimization.
Sometimes, free_initmem() takes about 600 ms to return on my x86_64
board with 4 GB of memory.

Since it is a generic x86_64 issue, it can be reproduced with qemu as
well. We can add some logging in the try_preserve_large_page()
function to measure the loop count and the elapsed time (a rough
sketch of that instrumentation is shown below). Please make sure the
host CPU has the "pdpe1gb" flag and run the qemu command below to
reproduce it:

qemu-system-x86_64 -enable-kvm -cpu host -serial mon:stdio -m 4G
-nographic -kernel bzImage -initrd ramdisk.img -append "console=ttyS0"

Since the default x86_64 kernel enables CONFIG_RANDOMIZE_BASE, it may
take several tries before the init memory happens to be randomized
into a 1G page range.
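
The instrumentation I used is roughly the following (hypothetical
sketch around the existing full range check loop in
try_preserve_large_page(), not part of the patch itself):

        /* measure how long one full range check takes */
        ktime_t start = ktime_get();
        unsigned int loops = 0;

        for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
                pgprot_t chk_prot = static_protections(req_prot, addr, pfn);

                loops++;
                if (pgprot_val(chk_prot) != pgprot_val(new_prot))
                        goto out_unlock;
        }

        /* only checks that run to completion are logged here */
        pr_info("cpa: full range check: %u loops, %lld us\n",
                loops, ktime_to_us(ktime_sub(ktime_get(), start)));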

Below is the new commit message I will use in the next patch version:

===========
If the CPU supports "pdpe1gb", the kernel will try its best to use 1G
large pages. When changing the attribute of a 4K page inside a 1G page
range, __change_page_attr() will either preserve the 1G page and
consume the 4K page into it, or split the 1G page into 2M pages and
try again. The retry will either consume the 4K page into a 2M page,
or split the 2M page into 4K pages. try_preserve_large_page() is
called by __change_page_attr() to make this decision by checking all
4K pages inside the large page one by one.

The cpu stuck issue is caused by the full range check of the 1G page.
If the result is not to split the 1G page, the check loop count is
1GB/4KB = 262144. If "N" 4K pages need to be changed and none of them
requires splitting the 1G page, the total loop count is N * 262144.
This patch aims to reduce the loop count from N * 262144 to roughly
1 * 262144.
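
For example (rough numbers, only to illustrate the scale of the worst
case this patch targets):

    one full range check:   1GB / 4KB = 262144 iterations
    N changed 4K pages:     ~N * 262144 iterations
    e.g. N = 4096 (16MB):   4096 * 262144 ~= 1.07e9 iterations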

The full range check (262144 loops) may take 500 ~ 1500 us, depending
on the CPU speed. During kernel boot, free_init_pages() can trigger
this issue when freeing the "unused kernel" and "initrd" memory. In
the worst case, it needs to run the full check for every 4K page of
these init memory areas, which can add hundreds of milliseconds of
kernel boot latency. If this issue happens at runtime, it can stall
the system for tens of milliseconds, which is unacceptable for time
critical tasks.

This patch caches the last address that was checked. If the next
address is inside the same large page and has the same attributes, the
cached result is used and the full range check is skipped.

Instrumentation in free_init_pages() and try_preserve_large_page()
shows that this patch reduces the full range check time to less than
5 us on a cache hit, and kernel boot time is reduced by hundreds of
milliseconds.

>
> > @@ -552,16 +552,20 @@ static int
> > try_preserve_large_page(pte_t *kpte, unsigned long address,
> > struct cpa_data *cpa)
> > {
> > + static unsigned long address_cache;
> > + static unsigned long do_split_cache = 1;
>
> What are the life time and protection rules of these static variables
> and why are they static in the first place.

They will be protected by pgd_lock. They only cache the previous
"do_split" result and are refreshed on every invocation. They are
static so that the cached result survives across calls to
try_preserve_large_page().

>
> > unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
> > pte_t new_pte, old_pte, *tmp;
> > pgprot_t old_prot, new_prot, req_prot;
> > int i, do_split = 1;
> > enum pg_level level;
> >
> > - if (cpa->force_split)
> > + spin_lock(&pgd_lock);
> > + if (cpa->force_split) {
> > + do_split_cache = 1;
>
> Returns with pgd_lock held which will immediately lockup the system
> on the next spin_lock(&pgd_lock) operation.
I am sorry for making such a stupid mistake. The force_split path was
not hit on my board :(
>
> Also what's the point of storing the already available information of
> cpa->force_split in yet another variable? This enforces taking the
> lock on every invocation for no reason.
As you know, do_split is initialized to 1. If do_split_cache == 1, the
cached value will not be used. If force_split == 1, the cached value
should be invalidated. Since do_split_cache is protected by this lock,
the lock needs to be taken here; I will fix the early return so that
the lock is dropped before returning.
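
To avoid returning with pgd_lock held, the early return would become
something like this (untested sketch, keeping the rest of the patch as
posted):

        spin_lock(&pgd_lock);
        if (cpa->force_split) {
                /* force_split invalidates the cached result */
                do_split_cache = 1;
                /* do not return with pgd_lock held */
                spin_unlock(&pgd_lock);
                return 1;
        }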
>
> > return 1;
> > + }
> >
> > - spin_lock(&pgd_lock);
> > /*
> > * Check for races, another CPU might have split this page
> > * up already:
> > @@ -627,13 +631,25 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
> >
> > new_prot = static_protections(req_prot, address, pfn);
> >
> > + addr = address & pmask;
> > + pfn = old_pfn;
> > + /*
> > + * If an address in same range had been checked just now, re-use the
> > + * cache value without full range check. In the worst case, it needs to
> > + * check every 4K page in 1G range, which causes cpu stuck for long
> > + * time.
> > + */
> > + if (!do_split_cache &&
> > + address_cache >= addr && address_cache < nextpage_addr &&
>
> On the first call, address_cache contains 0. On any subsequent call
> address_caches contains the address of the previous invocation of
> try_preserve_large_page().

On the first call, do_split_cache is 1. If do_split_cache == 1,
address_cache will not be used.

> But addr = address & pmask! So you are comparing apples and oranges.
> Not sure how that is supposed to work correctly.

Yes, the check logic here needs to be updated. It also needs to cache
the page attributes of the cached address, since the current requested
attributes might not be the same as the previous ones. I will fix this
in the next patch version; the updated check would look roughly like
the sketch below.
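
Roughly what I have in mind (untested sketch; prot_cache would be a
new static variable next to address_cache and do_split_cache, all
still protected by pgd_lock):

        addr = address & pmask;
        pfn = old_pfn;
        /*
         * Re-use the previous result only if the address is inside the
         * same large page and the resulting protections are identical.
         */
        if (!do_split_cache &&
            (address_cache & pmask) == addr &&
            pgprot_val(new_prot) == pgprot_val(prot_cache)) {
                do_split = 0;
                goto out_unlock;
        }
        ...
out_unlock:
        address_cache = address;
        prot_cache = new_prot;
        do_split_cache = do_split;
        spin_unlock(&pgd_lock);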

>
> > + pgprot_val(new_prot) == pgprot_val(old_prot)) {
> > + do_split = do_split_cache;
>
> This is an very obscure way to write:
>
>     do_split = 0;
>
> because do_split_cache must be 0 otherwise it cannot reach that code
> path.

I will update the cache logic.

> > + goto out_unlock;
> > + }
> > /*
> > * We need to check the full range, whether
> > * static_protection() requires a different pgprot for one of
> > * the pages in the range we try to preserve:
> > */
> > - addr = address & pmask;
> > - pfn = old_pfn;
> > for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
> > pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
> >
> > @@ -670,6 +686,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
> > }
> >
> > out_unlock:
> > + address_cache = address;
> > + do_split_cache = do_split;
> > spin_unlock(&pgd_lock);
>
> So here you store the 'cache' values and while this code suggests
> that it protects the 'cache' via pgd_lock (due to lack of comments),
> the protection is actually cpa_lock.
>
> But, that cache information stays around when cpa_lock is dropped,
> i.e. when the current (partial) operation has been done and this
> information stays stale for the next user. That does not make sense.

__change_page_attr() is the only function that changes page
attributes, and try_preserve_large_page() is called every time a large
page is checked. Once a large page has been split, it is never merged
back, so it is safe to cache the previous result inside
try_preserve_large_page().
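
For reference, my understanding of the flow (simplified pseudo-code,
paraphrased from __change_page_attr(), not the exact kernel code):

repeat:
        kpte = lookup_address(address, &level);
        if (level == PG_LEVEL_4K) {
                /* already split: update the pte directly and return;
                 * the cache check in try_preserve_large_page() is
                 * never reached for 4K mappings */
                return 0;
        }
        do_split = try_preserve_large_page(kpte, address, cpa);
        if (do_split <= 0)
                return do_split;        /* preserved (0) or error (<0) */
        if (!split_large_page(cpa, kpte, address))
                goto repeat;            /* retry at the smaller level */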

>
> I'm still puzzled how that can ever happen at all and which problem
> you are trying to solve.
>
> The first invocation of try_to_preserve_large_page() will either
> preserve the 1GB page and consume the number of 4K pages which fit
> into it, or it splits it into 2M chunks and tries again. The retry
> will either preserve the 2M chunk and consume the number of 4K pages
> which fit into it, or it splits it into 4K pages. Once split into 4K,
> subsequent invocations will return before reaching the magic cache
> checks.
>
> Can you please describe the call chain, the driver facing function
> used and the parameters which are handed in?

Please refer to the new commit message above.
>
> Thanks,
>
> tglx