Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

From: Linus Torvalds
Date: Mon Oct 16 2023 - 15:25:02 EST


On Mon, 16 Oct 2023 at 11:53, Uros Bizjak <ubizjak@xxxxxxxxx> wrote:
>
> Unfortunately, it does not work and dies early in the boot with:

Side note: build the kernel with debug info (the limited form is
sufficient), and then run oopses through

./scripts/decode_stacktrace.sh

to get much nicer oops information that has line numbers and inlining
information in the backtrace.

> [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff
> e9 48 fb

That decodes to

   0:   31 f6                   xor    %esi,%esi
   2:   e8 c1 49 f9 ff          call   0xfffffffffff949c8
   7:   e9 3c fa ff ff          jmp    0xfffffffffffffa48
   c:   31 f6                   xor    %esi,%esi
   e:   4c 89 ef                mov    %r13,%rdi
  11:   e8 b2 4a f9 ff          call   0xfffffffffff94ac8
  16:   e9 19 fa ff ff          jmp    0xfffffffffffffa34
  1b:   31 f6                   xor    %esi,%esi
  1d:   4c 89 ef                mov    %r13,%rdi
  20:   e8 23 4a f9 ff          call   0xfffffffffff94a48
  25:   e9 ea fa ff ff          jmp    0xfffffffffffffb14
  2a:*  f0 41 ff 0c 24          lock decl (%r12)        <-- trapping instruction
  2f:   0f 85 55 fb ff ff       jne    0xfffffffffffffb8a
  35:   4c 89 e7                mov    %r12,%rdi
  38:   e8 4b 02 df ff          call   0xffffffffffdf0288

but without a nicer backtrace it's nasty to guess where this is.

The "lock decl ; jne" is a good hint, though - that sequence is most
definitely "atomic_dec_and_test()".

And that in turn means that it's almost certainly mmdrop(), which is

	if (unlikely(atomic_dec_and_test(&mm->mm_count)))
		__mmdrop(mm);

where that

  35:   4c 89 e7                mov    %r12,%rdi
  38:   e8 4b 02 df ff          call   0xffffffffffdf0288

is exactly the unlikely "__mmdrop(mm)" part (and gcc decided to make
the likely branch a branch-out for some reason; presumably, with the
inlining, the code around it made that the better layout. Maybe this
was all inside another "unlikely()" branch).

And if I read that right, this has all been inlined from
begin_new_exec() -> exec_mmap() -> mmdrop_lazy_tlb().

Now, how and why 'mm' would be NULL in that path, and why any
'current' reloading optimization would matter in all this, I really
can't see. The call site in begin_new_exec() is

	/*
	 * Release all of the old mmap stuff
	 */
	acct_arg_size(bprm, 0);
	retval = exec_mmap(bprm->mm);
	if (retval)
		goto out;

	bprm->mm = NULL;

and "bprm->mm" is most definitely non-NULL there because we earlier did

	retval = set_mm_exe_file(bprm->mm, bprm->file);

using it, and that would have oopsed had bprm->mm been NULL then.

So I suspect the problem happened much earlier, caused some nasty
internal corruption, and the odd 'mm is NULL' is just a symptom.

So there's some serious corruption there, but from the oops itself I
can't tell the source. I guess if we get 'current' wrong anywhere, all
bets are off.

Linus