Re: Linux 3.19-rc5

From: Linus Torvalds
Date: Sat Jan 31 2015 - 16:54:43 EST


On Sat, Jan 31, 2015 at 12:16 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> All the stuff it flagged are genuinely wrong, albeit not disastrously
> so, things mostly just work.

I really disagree.

They weren't wrong. They *could* occasionally result in extra
reschedules, but that was never wrong to begin with.

But the debugging code made that much much worse, and made that "could
result in extra reschedules" happen all the time if it triggered.

And the whole "set task to sleep early" thing was rather intentional,
and was a fundamental part of the original design for
"select()/poll()" behavior, for example. Yes, we ended up having the
waitqueue functions and using that "pollwake()" and use
"wq->triggered" etc instead, but the whole optimistic early sleep
thing worked reasonably well for a long time even when the "early
TASK_SLEEP" was done before calling *thousands* of random poll
routines.

So you say "genuinely wrong", and I say "but that's how things were
designed - it's an optimistic approach, not an exact one". Your
debugging code changed that behavior, and actually introduced a real
bug, exactly because you felt that the "no nested sleeps" was a harder
requirement than it has ever actually been.

In other words, I think the debugging code itself is wrong, and then
that sched_annotate_sleep() thing is just a symptom of how it is
wrong. If you have to sprinkle these kinds of random workarounds in a
few core scheduler places (ok, mainly "wait()" paths, it looks like),
why would you expect random *drivers* to have to care about things
that even core kernel code says "I'm not going to care about this,
I'll just shut the warning up, because the warning is wrong".

Yes, the fact that select/poll was changed to try to avoid excessive
polling scheduling because it actually *was* a problem under some
loads does say that we generally want to try to avoid nested sleeping.
Because while it is rare and the optimistic approach works fine in
most cases, it certainly *can* become a problem if the optimistic "I'm
normally not going to sleep" thing ends up not being sufficiently
accurate.

So don't get me wrong - I think the whole "add debug code to find
places where we might have issues" was well worth it, and resulted in
improvements.

But once the low-hanging fruit and the core code that everybody hits
has been fixed, and people cannot apparently even be bothered with the
other cases it finds (like the pccardd case), at that point the value
of the debug code becomes rather less obvious.

And the downsides become bigger.

The pccardd example is an example of legacy use of our old and
original semantics of how the whole nested sleep was supposed to work.
And it *does* work. It's not a bug. It's how things have worked time
immemorial, and clearly nobody is really willing to bother with
changing working - but legacy - cardbus code. And at that point, I
think the debug code is actually *wrong*, and causes more problems
than it "fixes".

And debug code that causes more problems that it fixes should either
be removed, or improved to the point where the problems go away.

The "improved" part might be about saying "it's actually perfectly
_fine_ to have nested sleeps, as long as it is truly rare that the
nested sleep actually sleeps". And thus make the debug code really
test that it's *rare*, rather than test that it never happens. Warn if
it happens more than a couple of times a second for any particular
process, or something like that.

Hmm?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/