Re: [mm/vunmap] e47110e905: WARNING:at_mm/vmalloc.c:#__vunmap

From: Oliver Sang
Date: Thu Apr 29 2021 - 03:37:37 EST


Hi, Linus,

On Sat, Apr 24, 2021 at 07:04:00PM -0700, Linus Torvalds wrote:
> On Sat, Apr 24, 2021 at 6:31 PM Oliver Sang <oliver.sang@xxxxxxxxx> wrote:
> > >
> > > Oliver - how reliable is that bisection?
> >
> > we will check further if any issue in our test env.
> >
> > by bot auto tests, we saw 12 issue instances out of 74 runs. but not happen
> > out of 100 runs of parent.
>
> Oh, that's interesting. So only 12 out of 74 runs saw that __vunmap
> warning, but if I understand your table correctly, there were some
> _other_ issues in there?

yes, there are.

>
> Are those also for that same commit? (ie those RIP:kfree /

yes. the 1st column (fail:runs) is for parent,
the 3rd colum (fail:runs) is for e47110e905

> RIP:kobject_add_internal / etc)?

the RIP:kfree happens just after __vunmap warning (as attached dmesg):
===================================================================
[ 198.730073] ------------[ cut here ]------------
[ 198.730597] Trying to vfree() bad address (0000000070935066)
[ 198.731223] WARNING: CPU: 0 PID: 1948 at mm/vmalloc.c:2247 __vunmap+0x663/0x990
..
[ 198.763940] ---[ end trace 572fd76a7879a124 ]---
[ 198.764449] stack segment: 0000 [#1] SMP KASAN
[ 198.764933] CPU: 0 PID: 1948 Comm: systemd-udevd Tainted: G W 5.9.0-rc1-00107-ge47110e90584a #1
[ 198.765985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 198.766872] RIP: 0010:kfree+0x64/0x360
==================================================================

and it did not happen on parent:
f3f99d63a8156c7a e47110e90584a22e9980510b00d
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:998 11% 108:432 dmesg.RIP:kfree

RIP:kobject_add_internal happens on both commits after more runs,
but much lower happen rate than __vunmap (or kfree) on e47110e905.

:998 13% 125:432 dmesg.RIP:__vunmap
:998 11% 108:432 dmesg.RIP:kfree
11:998 -1% 6:432 dmesg.RIP:kobject_add_internal


>
> I'm not sure how to read that table of yours - if I understand it
> correctly, it looks like the parent commit had some different ones

yes, the parent has some different ones that e47110e905 did not have
(below there is a full table after we run both parent and the e47110e905
more times which has more details)

> that the child did not (eg 2 cases of BUG_at_mm/usercopy.c?)

after more runs, the BUG_at_mm/usercopy.c is also reproduced on e47110e905
9:998 -1% 3:432 dmesg.kernel_BUG_at_mm/usercopy.c

>
> So it feels to me like there's some memory corruption somewhere, and
> that commit that it bisected to likely just changed the failure case
> (due to timing differences or allocation ordering changes).
>
> IOW, there seem to be other panics even in the parent.
>
> Yes/No?

Yes. we also concerned that the panics on parent could cover the __vunmap issue,
so we ran it up to almost 1000 times.


f3f99d63a8156c7a e47110e90584a22e9980510b00d
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
3:998 0% 6:432 dmesg.BUG:kernel_NULL_pointer_dereference,address
1:998 0% 2:432 dmesg.BUG:non-zero_pgtables_bytes_on_freeing_mm
24:998 0% 25:432 dmesg.BUG:unable_to_handle_page_fault_for_address
1:998 -0% :432 dmesg.Bad_pagetable:#[##]
2:998 -0% :432 dmesg.INFO:rcu_sched_detected_stalls_on_CPUs/tasks
40:998 14% 180:432 dmesg.Kernel_panic-not_syncing:Fatal_exception
:998 0% 1:432 dmesg.Kernel_panic-not_syncing:Fatal_exception_in_interrupt
27:998 0% 31:432 dmesg.Oops:#[##]
1:998 -0% :432 dmesg.RIP:__asan_report_load8_noabort
6:998 -0% 4:432 dmesg.RIP:__is_module_percpu_address
1:998 -0% :432 dmesg.RIP:__kasan_check_read
1:998 0% 5:432 dmesg.RIP:__lock_acquire
:998 1% 7:432 dmesg.RIP:__migration_entry_wait
1:998 0% 1:432 dmesg.RIP:__module_address
2:998 0% 3:432 dmesg.RIP:__rb_insert_augmented
:998 13% 125:432 dmesg.RIP:__vunmap
2:998 -0% :432 dmesg.RIP:anon_vma_interval_tree_insert
:998 0% 2:432 dmesg.RIP:apply_relocate_add
1:998 -0% :432 dmesg.RIP:check_memory_region
1:998 0% 1:432 dmesg.RIP:copy_pte_range
1:998 -0% :432 dmesg.RIP:copy_user_generic_string
2:998 -0% 1:432 dmesg.RIP:deactivate_slab
:998 0% 1:432 dmesg.RIP:del_usage_links
1:998 -0% :432 dmesg.RIP:do_raw_spin_trylock
:998 0% 1:432 dmesg.RIP:ep_send_events_proc
:998 0% 2:432 dmesg.RIP:find_vma
:998 1% 6:432 dmesg.RIP:free_percpu
1:998 0% 1:432 dmesg.RIP:handle_mm_fault
1:998 -0% :432 dmesg.RIP:kasan_report
1:998 0% 1:432 dmesg.RIP:kernfs_find_ns
1:998 -0% :432 dmesg.RIP:kernfs_link_sibling
:998 11% 108:432 dmesg.RIP:kfree
1:998 0% 1:432 dmesg.RIP:kmem_cache_alloc
:998 0% 2:432 dmesg.RIP:kmem_cache_alloc_trace
11:998 -1% 6:432 dmesg.RIP:kobject_add_internal
:998 0% 1:432 dmesg.RIP:llist_del_first
:998 0% 2:432 dmesg.RIP:load_module
1:998 -0% :432 dmesg.RIP:lock_release
7:998 -1% 1:432 dmesg.RIP:module_put
:998 0% 1:432 dmesg.RIP:module_remove_modinfo_attrs
1:998 -0% :432 dmesg.RIP:native_queued_spin_lock_slowpath
:998 0% 1:432 dmesg.RIP:native_safe_halt
:998 0% 1:432 dmesg.RIP:nmi_uaccess_okay
:998 0% 1:432 dmesg.RIP:paravirt_patch_default.cold
23:998 -1% 17:432 dmesg.RIP:print_modules
2:998 -0% 1:432 dmesg.RIP:qlist_free_all
1:998 -0% :432 dmesg.RIP:rb_erase
:998 0% 1:432 dmesg.RIP:rb_next
26:998 0% 27:432 dmesg.RIP:skip_spaces
:998 0% 1:432 dmesg.RIP:strncmp
:998 0% 1:432 dmesg.RIP:sysfs_file_ops
1:998 -0% :432 dmesg.RIP:sysfs_remove_group
:998 0% 1:432 dmesg.RIP:update_sd_lb_stats
7:998 -1% 1:432 dmesg.RIP:usercopy_abort
:998 0% 2:432 dmesg.RIP:vmalloc_to_page
:998 0% 1:432 dmesg.RIP:zap_p4d_range
1:998 -0% :432 dmesg.WARNING:at_fs/sysfs/group.c:#sysfs_remove_group
:998 0% 2:432 dmesg.WARNING:at_kernel/locking/lockdep.c:#__lock_acquire
7:998 -1% 1:432 dmesg.WARNING:at_kernel/module.c:#module_put
11:998 -1% 6:432 dmesg.WARNING:at_lib/kobject.c:#kobject_add_internal
:998 0% 1:432 dmesg.WARNING:at_mm/memory.c:#wp_page_copy
:998 13% 125:432 dmesg.WARNING:at_mm/vmalloc.c:#__vunmap
:998 0% 2:432 dmesg.WARNING:at_mm/vmalloc.c:#vmalloc_to_page
1:998 -0% :432 dmesg.WARNING:stack_going_in_the_wrong_direction?at_asm_exc_double_fault/0x
1:998 -0% :432 dmesg.WARNING:stack_recursion
42:998 3% 67:432 dmesg.canonical_address#:#[##]
9:998 -1% 4:432 dmesg.invalid_opcode:#[##]
1:998 -0% :432 dmesg.kernel_BUG_at_arch/x86/entry/common.c
:998 0% 1:432 dmesg.kernel_BUG_at_arch/x86/kernel/paravirt.c
9:998 -1% 3:432 dmesg.kernel_BUG_at_mm/usercopy.c
:998 10% 103:432 dmesg.stack_segment:#[##]


BTW, we noticed the e91d8d7823 ("mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING")
is a fix for e47110e90584, so we also tested it, as well as the v5.12-rc8,
found the issue still exists though lower happen rate.

f3f99d63a8156c7a e47110e90584a22e e91d8d78237de8d7 v5.12-rc8
---------------- ---------------- ---------------- ----------------
fail:runs fail:runs fail:runs fail:runs
| | | |
:998 125:432 46:299 35:334 dmesg.RIP:__vunmap
:998 125:432 46:299 35:334 dmesg.WARNING:at_mm/vmalloc.c:#__vunmap
>
> Linus

Attachment: dmesg.xz
Description: application/xz