Re: [PATCH 0/8] stackleak: fixes and rework

From: Mark Rutland
Date: Tue Apr 26 2022 - 07:16:17 EST


On Tue, Apr 26, 2022 at 11:37:47AM +0100, Mark Rutland wrote:
> On Tue, Apr 26, 2022 at 11:10:52AM +0100, Mark Rutland wrote:
> > On Mon, Apr 25, 2022 at 03:54:00PM -0700, Kees Cook wrote:
> > > On Mon, Apr 25, 2022 at 12:55:55PM +0100, Mark Rutland wrote:
> > > > This series reworks the stackleak code. The first patch fixes some
> > > > latent issues on arm64, and the subsequent patches rework the code to
> > > > improve clarity and permit better code generation.
> > >
> > > This looks nice; thanks! I'll put this through build testing and get it
> > > applied shortly...
> >
> > Thanks!
> >
> > Patch 1 is liable to conflict with some other stacktrace bits that may go in
> > for v5.19, so it'd be good if either that could be queued as a fix for
> > v5.18-rc4, or we'll have to figure out how to deal with conflicts later.
> >
> > > > While the improvement is small, I think the improvement to clarity and
> > > > code generation is a win regardless.
> > >
> > > Agreed. I also want to manually inspect the resulting memory just to
> > > make sure things didn't accidentally regress. There's also an LKDTM test
> > > for basic functionality.
> >
> > I assume that's the STACKLEAK_ERASING test?
> >
> > I gave that a spin, but on arm64 that test is flaky even on baseline v5.18-rc1.
> > On x86_64 it seems consistent after 100s of runs. I'll go dig into that now.
>
> I hacked in some debug, and it looks like the sp used in the test is far above
> the current lowest_sp. The test is slightly wrong since it grabs the address of
> a local variable rather than using current_stack_pointer, but the offset I see
> is much larger:
>
> # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT
> [ 27.665221] lkdtm: Performing direct entry STACKLEAK_ERASING
> [ 27.665986] lkdtm: FAIL: lowest_stack 0xffff8000083a39e0 is lower than test sp 0xffff8000083a3c80
> [ 27.667530] lkdtm: FAIL: the thread stack is NOT properly erased!
>
> That's off by 0x2a0 (AKA 672) bytes, and it seems to be consistent from run to
> run.
>
> I note that an interrupt occurring could cause something similar (since on arm64 those are
> taken/triaged on the task stack before moving to the irq stack, and the irq
> regs alone will take 300+ bytes), but that doesn't seem to be the problem here
> given this is consistent, and it appears some prior function consumed a lot of
> stack.
>
> I *think* the same irq problem would apply to x86, but maybe that initial
> triage happens on a trampoline stack.
>
> I'll dig a bit more into the arm64 side...

That offset above seems to be due to the earlier logic in direct_entry(), which
I guess is running out-of-line. With that hacked to:

----------------
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index e2228b6fc09bb..53f3027e8202d 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -378,8 +378,9 @@ static ssize_t direct_entry(struct file *f, const char __user *user_buf,
 		size_t count, loff_t *off)
 {
 	const struct crashtype *crashtype;
-	char *buf;
+	char *buf = "STACKLEAK_ERASING";

+#if 0
 	if (count >= PAGE_SIZE)
 		return -EINVAL;
 	if (count < 1)
@@ -395,13 +396,17 @@ static ssize_t direct_entry(struct file *f, const char __user *user_buf,
 	/* NULL-terminate and remove enter */
 	buf[count] = '\0';
 	strim(buf);
+#endif

 	crashtype = find_crashtype(buf);
+
+#if 0
 	free_page((unsigned long) buf);
 	if (!crashtype)
 		return -EINVAL;
+#endif

-	pr_info("Performing direct entry %s\n", crashtype->name);
+	// pr_info("Performing direct entry %s\n", crashtype->name);
 	lkdtm_do_action(crashtype);
 	*off += count;

----------------

... the SP check doesn't fail, but I still see intermittent bad value failures.
Those might be due to interrupt frames.
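
As a sanity check on the 0x2a0 figure from the earlier log, here is a minimal
userspace sketch of the arithmetic. The helper name stack_offset_bytes() is
made up for illustration and is not anything in lkdtm; it only assumes the two
values are the "test sp" and "lowest_stack" addresses printed in the FAIL
message:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of lkdtm: distance in bytes between the
 * sp sampled by the test and the lowest_stack value it was compared to.
 */
static uint64_t stack_offset_bytes(uint64_t test_sp, uint64_t lowest_stack)
{
	/* The FAIL message is printed when lowest_stack is below test_sp. */
	assert(test_sp >= lowest_stack);
	return test_sp - lowest_stack;
}
```

Feeding it the values from the log, i.e.
stack_offset_bytes(0xffff8000083a3c80ULL, 0xffff8000083a39e0ULL), gives 0x2a0
(672), matching the offset quoted above.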

Thanks,
Mark.