RE: [PATCH v1] arm64/mm: avoid race condition of update page table when kernel init

From: Jianyong Wu
Date: Sun Nov 14 2021 - 21:09:25 EST


Ping ...
Any comments?

> -----Original Message-----
> From: Jianyong Wu
> Sent: Thursday, October 28, 2021 3:36 PM
> To: Anshuman Khandual <anshuman.khandual@xxxxxxx>; Catalin Marinas
> <Catalin.Marinas@xxxxxxx>; will@xxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> maz@xxxxxxxxxx; ardb@xxxxxxxxxx; david@xxxxxxxxxx; gshan@xxxxxxxxxx;
> Justin He <Justin.He@xxxxxxx>; nd <nd@xxxxxxx>
> Subject: RE: [PATCH v1] arm64/mm: avoid race condition of update page
> table when kernel init
>
> Hi Anshuman,
>
> > -----Original Message-----
> > From: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> > Sent: Thursday, October 28, 2021 1:58 PM
> > To: Jianyong Wu <Jianyong.Wu@xxxxxxx>; Catalin Marinas
> > <Catalin.Marinas@xxxxxxx>; will@xxxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx;
> > linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> > maz@xxxxxxxxxx; ardb@xxxxxxxxxx; david@xxxxxxxxxx;
> gshan@xxxxxxxxxx;
> > Justin He <Justin.He@xxxxxxx>; nd <nd@xxxxxxx>
> > Subject: Re: [PATCH v1] arm64/mm: avoid race condition of update page
> > table when kernel init
> >
> >
> >
> > On 10/27/21 3:18 PM, Jianyong Wu wrote:
> > > Race condition of page table update can happen in kernel boot period
> > > as both of memory hotplug action when kernel init and the
> > > mark_rodata_ro can update page table. For virtio-mem, the function
> > excute flow chart is:
> > >
> > > -------------------------
> > > kernel_init
> > > kernel_init_freeable
> > > ...
> > > do_initcall
> > > ...
> > > module_init [A]
> > >
> > > ...
> > > mark_readonly
> > > mark_rodata_ro [B]
> > > -------------------------
> > > virtio-mem can be initialized at [A] and spwan a workqueue to add
> > > memory, therefore the race of update page table can happen inside [B].
> > >
> > > What's more, the race condition can happen even for ACPI based
> > > memory hotplug, as it can burst into kernel boot period while page
> > > table is updating inside mark_rodata_ro.
> > >
> > > That's why memory hotplug lock is needed to guard mark_rodata_ro to
> > > avoid the race condition.
> > >
> > > It may only happen in arm64. As fixmap, which is the global
> > > resource, is used in page table creating. So, the change is only for arm64.
> > >
> > > The error often occurs inside alloc_init_pud() in
> > > arch/arm64/mm/mmu.c
> > >
> > > the race condition flow is:
> > >
> > > *************** begin ************
> > >
> > > kerenl_init virtio-mem workqueue
> > > ========= ========
> > > alloc_init_pud(...)
> > > pudp = pud_set_fixmap_offset(..) alloc_init_pud(...)
> > > ... ...
> > > READ_ONCE(*pudp) //OK! pudp = pud_set_fixmap_offset(
> > > ... ...
> > > pud_clear_fixmap() //fixmap break
> > > READ_ONCE(*pudp) //CRASH!
> > >
> > > **************** end *************
> > >
> > > I catch the related error when test virtio-mem (a new memory hotplug
> > > driver) on arm64.
> > >
> > > How to reproduce:
> > > (1) prepare a kernel with virtio-mem enabled on arm64
> > > (2) start a VM using Cloud Hypervisor using the kernel above
> > > (3) hotplug memory, 20G in my case, with virtio-mem
> > > (4) reboot or start a new kernel using kexec
> > >
> > > Test for server times, you may find the error below:
> > >
> > > [ 1.131039] Unable to handle kernel paging request at virtual address
> > fffffbfffda3b140
> > > [ 1.134504] Mem abort info:
> > > [ 1.135722] ESR = 0x96000007
> > > [ 1.136991] EC = 0x25: DABT (current EL), IL = 32 bits
> > > [ 1.139189] SET = 0, FnV = 0
> > > [ 1.140467] EA = 0, S1PTW = 0
> > > [ 1.141755] FSC = 0x07: level 3 translation fault
> > > [ 1.143787] Data abort info:
> > > [ 1.144976] ISV = 0, ISS = 0x00000007
> > > [ 1.146554] CM = 0, WnR = 0
> > > [ 1.147817] swapper pgtable: 4k pages, 48-bit VAs,
> > pgdp=00000000426f2000
> > > [ 1.150551] [fffffbfffda3b140] pgd=0000000042ffd003,
> > p4d=0000000042ffd003, pud=0000000042ffe003, pmd=0000000042fff003,
> > pte=0000000000000000
> > > [ 1.155728] Internal error: Oops: 96000007 [#1] SMP
> > > [ 1.157724] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G C 5.15.0-
> > rc3+ #100
> > > [ 1.161002] Hardware name: linux,dummy-virt (DT)
> > > [ 1.162939] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > > [ 1.165825] pc : alloc_init_pud+0x38c/0x550
> > > [ 1.167610] lr : alloc_init_pud+0x394/0x550
> > > [ 1.169358] sp : ffff80001001bd10
> > > ......
> > > [ 1.200527] Call trace:
> > > [ 1.201583] alloc_init_pud+0x38c/0x550
> > > [ 1.203218] __create_pgd_mapping+0x94/0xe0
> > > [ 1.204983] update_mapping_prot+0x50/0xd8
> > > [ 1.206730] mark_rodata_ro+0x50/0x58
> > > [ 1.208281] kernel_init+0x3c/0x120
> > > [ 1.209760] ret_from_fork+0x10/0x20
> > > [ 1.211298] Code: eb15003f 54000061 d5033a9f d5033fdf (f94000a1)
> > > [ 1.213856] ---[ end trace 59473413ffe3f52d ]---
> > > [ 1.215850] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > >
> > > We can see that the error derived from the l3 translation as the pte
> > > value is *0*. That is because the fixmap has been clear when access.
> > >
> > > Signed-off-by: Jianyong Wu <jianyong.wu@xxxxxxx>
> > > ---
> > > arch/arm64/mm/mmu.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index
> > > cfd9deb347c3..567dfba8f08a 100644
> > > --- a/arch/arm64/mm/mmu.c
> > > +++ b/arch/arm64/mm/mmu.c
> > > @@ -564,8 +564,10 @@ void mark_rodata_ro(void)
> > > * to cover NOTES and EXCEPTION_TABLE.
> > > */
> > > section_size = (unsigned long)__init_begin - (unsigned
> > > long)__start_rodata;
> > > + get_online_mems();
> > > update_mapping_prot(__pa_symbol(__start_rodata), (unsigned
> > long)__start_rodata,
> > > section_size, PAGE_KERNEL_RO);
> > > + put_online_mems();
> > >
> > > debug_checkwx();
> > > }
> > >
> >
> > While this should solve the current problem i.e race between
> > concurrent memory hotplug operation and mark_rodata_ro(), but I am
> > still wondering whether this is the fix at the right place and
> > granularity. Basically a hotplug operation queued in an work queue at
> > [A] can execute during [B] is the root cause of this problem.
> >
>
> Not exactly, this issue doesn't only happen at the the *pure* kernel boot.
> For example, hotplug memory through VM monitor when VM boot. We can't
> foresee when that happen. Thus, this issue can affect all kinds of memory
> hotplug mechanism, including ACPI based memory hotplug and virtio-mem.
> I'm not sure that fix it here is the best way. If the race only happens between
> kernel init and memory hotplug, I think it's fine to fix it here.
> IMO, this issue results from the race for "fixmap" resource. I wonder why
> this global resource is not protected by a lock. Maybe we can add one and fix
> it there.
>
> > start_kernel(..)
> > {
> >
> > arch_call_rest_init(..)
> > rest_init(..)
> > kernel_thread(kernel_init, ...)
> > kernel_init(..) {
> > kernel_init_freeable(..)
> > do_basic_setup(..)
> > do_initcalls(..)
> > ........
> > module_init(..) [A]
> > ................
> > ................
> > mark_readonly(..)
> > mark_rodata_ro(..)
> > update_mapping_prot(..)
> > __create_pgd_mapping(..) [B]
> > }
> > }
> >
> > Are there no other non-hotplug call path after module_init(), which
> > could land in __create_pgd_mapping() ?
> >
> > Please also note that this is not an existing problem on arm64
> > platform which needs to be fixed, as virtio-mem is yet to be enabled.
>
> This issue affects all kinds of memory hotplug mechanism, so, I think that's
> not the busyness with virtio-mem itself.
>
> Thanks
> Jianyong