Re: [PATCH 1/3] sched: add sched_task_call()

From: Peter Zijlstra
Date: Tue Feb 17 2015 - 13:15:54 EST


On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > So far stack unwinding has basically been a best effort debug output
> > kind of thing, you're wanting to make the integrity of the kernel depend
> > on it.
> >
> > You require an absolute 100% correctness of the stack unwinder -- where
> > today it is; as stated above; a best effort debug output thing.
> >
> > That is a _big_ change.
>
> True, this does seem to be the first attempt to rely on the correctness
> of the stack walking code.
>
> > Has this been properly considered; has all the asm of the relevant
> > architectures been audited? Are you planning on maintaining that level
> > of audit for all new patches?
>
> I agree, the unwinder needs to be 100% correct. Currently we only
> support x86_64. Luckily, in general, stacks are very well defined and
> walking the stack of a sleeping task should be straightforward. I don't
> think it would be too hard to ensure the stack unwinder is right for
> other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults and
the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

> > Because the way you propose to do things, we'll end up with silent but
> > deadly fail if the unwinder is less than 100% correct. No way to easily
> > debug that, no warns, just silent corruption.
>
> That's a good point. There's definitely room for additional error
> checking in the x86 stack walking code. A couple of ideas:
>
> - make sure it starts from a __schedule() call at the top
> - make sure we actually reach the bottom of the stack
> - make sure each stack frame's return address is immediately after a
> call instruction
>
> If we hit any of those errors, we can bail out, unregister with ftrace
> and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

> > Are you really really sure you want to go do this?
>
> Basically, yes. We've had a lot of conversations about many different
> variations of how to do this over the past year, and this is by far the
> best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on that
:-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

> FWIW, we've been doing something similar with kpatch and stop_machine()
> over the last 1+ years, and have run into zero problems with that
> approach.

Could be you've just been lucky...

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/