Re: [PATCH] Test for riscv fixes

From: Alexandre Ghiti
Date: Fri Oct 06 2023 - 07:38:20 EST


Hi Mark,

On 02/10/2023 15:41, Mark Rutland wrote:
> On Mon, Oct 02, 2023 at 09:13:52AM +0200, Alexandre Ghiti wrote:
>> Hi Edward,
>>
>> On Sat, Sep 30, 2023 at 1:06 AM Edward AD <twuufnxlz@xxxxxxxxx> wrote:
>>> Hi Alexandre,
>>>
>>> On Fri, 29 Sep 2023 10:25:59 +0200 Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
>>>> I'm still not convinced this will fix the kasan out-of-bounds
>>>> accesses, the page can be valid but the read can happen at an offset
>>>> not initialized and trigger such errors right? I still think there is
>>>> something weird about the stack frame, as to me this should not happen
>>>> (but admittedly I don't know much about that).
>>> The added check can confirm that the physical page is invalid (whether it is a
>>> vmalloc allocated page or a slab allocated page), and exit the for loop when it is invalid.
>> Yes, but to me this is not what happens in the bug report you link:
>>
>> | BUG: KASAN: out-of-bounds in walk_stackframe+0x130/0x2f2 arch/riscv/kernel/stacktrace.c:59
>> | Read of size 8 at addr ff20000006d37c38 by task swapper/1/0
>>
>> So the read at address ff20000006d37c38 is not "normal" according to
>> KASAN (you can see there is no trap, meaning the physical mapping
>> exists).
>>
>> | The buggy address belongs to the virtual mapping at
>> | [ff20000006d30000, ff20000006d39000) created by:
>> | kernel_clone+0x118/0x896 kernel/fork.c:2909
>>
>> The virtual address is legitimate since the vma exists ^
>>
>> | The buggy address belongs to the physical page:
>> | page:ff1c00000250dbc0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9436f
>>
>> And the physical page also exists ^
>>
>> So I insist, checking that a physical mapping exists to exit the loop
>> is not enough, to me, the error here is that the backtrace goes "too
>> far" at an address where nothing was written before and then KASAN
>> complains about that, again, we don't take any page fault here so it's
>> not a problem of existing physical mapping.
> Yep!
>
> I believe what's happening here is one task unwinding another (starting from
> whatever gets saved in switch_to()), and there's nothing that prevents that
> other task from running concurrently and modifying/poisoning its stack. In
> general trying to unwind a remote stack is racy and broken, but we're stuck
> with a few bits of the kernel trying to do that occasionally and so the arch
> code needs to handle that without blowing up.


Thanks for that. I had already fixed the "imprecise" unwinder (the one used when we don't have a frame pointer) with READ_ONCE_NOCHECK(), but I did not have this use case in mind, so I'll fix that too.


> For KASAN specifically you'll need to access the stack with unchecked accesses
> (e.g. using READ_ONCE_NOCHECK() to read the struct stackframe), and you'll
> probably want to add some explicit checks that pointers are within stack bounds
> since concurrent modification (or corruption) could result in entirely bogus
> pointers.
>
> I *think* that we do the right thing on arm64, so you might want to take a look
> at arm64's unwinder in arch/arm64/kernel/stacktrace.c,
> arch/arm64/include/asm/stacktrace.h, and
> arch/arm64/include/asm/stacktrace/common.h.
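
For illustration only, here is a minimal user-space sketch in C of the two precautions discussed above: reading the frame record with unchecked loads, and validating every frame pointer against the stack bounds before dereferencing it. This is not the riscv or arm64 kernel code. The names frame_record, stack_bounds, read_word_nocheck(), on_stack(), unwind_next() and unwind_all() are hypothetical, the volatile load merely stands in for READ_ONCE_NOCHECK() (which in the kernel also suppresses KASAN instrumentation), and the frame pointer is assumed to point directly at the saved {fp, ra} pair, which simplifies the real frame layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record: saved frame pointer and return address. */
struct frame_record {
	uintptr_t fp;
	uintptr_t ra;
};

/* Hypothetical description of the stack being unwound: [low, high). */
struct stack_bounds {
	uintptr_t low;
	uintptr_t high;
};

/*
 * User-space stand-in for READ_ONCE_NOCHECK(): one volatile load.
 * The kernel macro also bypasses KASAN checks; this sketch cannot.
 */
static uintptr_t read_word_nocheck(const uintptr_t *p)
{
	return *(const volatile uintptr_t *)p;
}

/* True if [addr, addr + size) lies entirely inside the stack. */
static bool on_stack(const struct stack_bounds *b, uintptr_t addr, size_t size)
{
	if (addr < b->low || addr > b->high)
		return false;
	return size <= b->high - addr;
}

/*
 * Advance one frame. Returns false as soon as the frame pointer looks
 * bogus, so a concurrently modified stack ends the walk early.
 */
static bool unwind_next(const struct stack_bounds *b, uintptr_t *fp,
			uintptr_t *ra)
{
	uintptr_t cur = *fp;
	const struct frame_record *rec;
	uintptr_t next_fp, next_ra;

	if (cur % sizeof(uintptr_t) != 0)
		return false;
	if (!on_stack(b, cur, sizeof(*rec)))
		return false;

	rec = (const struct frame_record *)cur;
	next_fp = read_word_nocheck(&rec->fp);
	next_ra = read_word_nocheck(&rec->ra);

	/* Callers live at higher addresses; require progress to terminate. */
	if (next_fp <= cur)
		return false;

	*fp = next_fp;
	*ra = next_ra;
	return true;
}

/* Collect up to max return addresses starting from fp. */
static size_t unwind_all(const struct stack_bounds *b, uintptr_t fp,
			 uintptr_t *out, size_t max)
{
	size_t n = 0;
	uintptr_t ra;

	while (n < max && unwind_next(b, &fp, &ra))
		out[n++] = ra;
	return n;
}
```

In this sketch the walk stops as soon as a frame pointer is misaligned, falls outside [low, high), or fails to move toward the stack base, so a record modified underneath the unwinder can at worst produce a bogus return address, not an out-of-bounds read or an endless loop.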


And I'll look at the arm64 unwinder for the stack bounds check.

Thanks again,

Alex



> Mark.