Re: live kernel upgrades (was: live kernel patching design)

From: Ingo Molnar
Date: Tue Feb 24 2015 - 05:23:38 EST



* Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:

> Your upgrade proposal is an *enormous* disruption to the
> system:
>
> - a latency of "well below 10" seconds is completely
> unacceptable to most users who want to patch the kernel
> of a production system _while_ it's in production.

I think this statement is false for the following reasons.

- I'd say the majority of system operators of production
systems can live with a couple of seconds of delay at a
well-defined moment of the day or week - with gradual,
pretty much open-ended improvements in that latency
down the line.

- I think your argument ignores the fact that live
upgrades would extend the scope of 'users willing to
patch the kernel of a production system' _enormously_.

For example, I have a production system with this much
uptime:

10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05

Currently I'm reluctant to reboot the system to upgrade
the kernel (due to a reboot's intrusiveness), which is
why it has achieved a relatively high uptime - but I'd
definitely allow the kernel to upgrade itself at 0:00am
just fine. (I'd even give it up to a few minutes, as
long as TCP connections don't time out.)

And I don't think my usecase is special.

What gradual improvements in live upgrade latency am I
talking about?

- For example, the majority of pure user-space process
pages in RAM could be carried over from the old kernel
into the new kernel - i.e. they'd stay in place in RAM,
but their descriptors would be re-hashed into the new
kernel's data structures. This avoids a big chunk of
checkpointing overhead. (See the metadata re-hashing
sketch below.)

- Likewise, most of the page cache could be saved from an
old kernel to a new kernel as well - further reducing
checkpointing overhead.

- The PROT_NONE mechanism of the current NUMA balancing
code could be used to transparently mark user-space
pages as 'checkpointed'. This would reduce system
interruption, as only 'newly modified' pages would have
to be checkpointed when the upgrade happens. (See the
dirty-tracking sketch below.)

- Hardware devices could be marked as 'already in a
well-defined state', skipping the more expensive steps
of driver initialization.

- Possibly full user-space page tables could be preserved
across an upgrade: this way user-space execution would
be unaffected even at the micro level: cache layout, TLB
patterns, etc.

There are lots of gradual speedups possible with such a
model IMO.
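
To make the metadata re-hashing point above a bit more
concrete, here's a toy user-space model of 'pages stay in
place in RAM, only the metadata gets re-hashed'. Every name
and structure in it is invented purely for the illustration
- it's not kernel code, it just shows that an upgrade can
rebuild the lookup structures without ever copying the page
contents:

/*
 * Toy model of "pages stay in place in RAM, only the metadata is
 * re-hashed": frames[] stands in for physical memory and is never
 * copied; the "upgrade" only re-inserts the frame descriptors into
 * a new lookup table with a different size and hash function.
 * Every name here is invented purely for the illustration.
 */
#include <stdio.h>

#define NFRAMES          64
#define OLD_BUCKETS     128             /* "old kernel" index shape */
#define NEW_BUCKETS     256             /* "new kernel" index shape */

struct frame {                          /* a page frame + its descriptor */
    unsigned long key;                  /* e.g. (mapping, offset) as one value */
    char data[64];                      /* page contents - never copied below */
};

static struct frame frames[NFRAMES];    /* "physical memory": stays put */
static struct frame *old_idx[OLD_BUCKETS];
static struct frame *new_idx[NEW_BUCKETS];

/* Insert with linear probing; load factor stays at or below 1/2 here,
 * so probing always terminates. */
static void insert(struct frame **slots, unsigned int nbuckets,
                   unsigned long mult, struct frame *f)
{
    unsigned long b = (f->key * mult) & (nbuckets - 1);

    while (slots[b])
        b = (b + 1) & (nbuckets - 1);
    slots[b] = f;
}

static struct frame *lookup(struct frame **slots, unsigned int nbuckets,
                            unsigned long mult, unsigned long key)
{
    unsigned long b = (key * mult) & (nbuckets - 1);

    while (slots[b]) {
        if (slots[b]->key == key)
            return slots[b];
        b = (b + 1) & (nbuckets - 1);
    }
    return NULL;
}

int main(void)
{
    int i;

    /* Old kernel: populate "memory" and the old lookup structure. */
    for (i = 0; i < NFRAMES; i++) {
        frames[i].key = 0x1000 + i;
        snprintf(frames[i].data, sizeof(frames[i].data), "page %d", i);
        insert(old_idx, OLD_BUCKETS, 2654435761UL, &frames[i]);
    }

    /*
     * "Upgrade": walk the old index and re-hash every descriptor into
     * the new, differently shaped index. The page contents themselves
     * are untouched - there is no bulk copy of user memory.
     */
    for (i = 0; i < OLD_BUCKETS; i++)
        if (old_idx[i])
            insert(new_idx, NEW_BUCKETS, 40503UL, old_idx[i]);

    printf("%s found via the new index, no data copied\n",
           lookup(new_idx, NEW_BUCKETS, 40503UL, 0x1000 + 7)->data);
    return 0;
}

The same pattern would apply to preserving the page cache:
the cached data stays put, only the indexing gets rebuilt
for the new kernel.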
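
And here's a minimal user-space sketch of the dirty-tracking
idea behind the PROT_NONE point above: write-protect
everything at checkpoint time and let faults record which
pages get modified afterwards. In user-space the obvious
stand-in is mprotect() plus a SIGSEGV handler; the in-kernel
variant would use NUMA-balancing style hinting faults
instead - so again, this is only a model of the idea:

/*
 * User-space model of incremental checkpointing: at checkpoint time
 * write-protect the whole region, then let write faults record which
 * pages get modified afterwards - the next checkpoint only has to
 * copy those. Not kernel code, just the idea.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static char *region;
static long page_size;
static unsigned char page_dirty[NPAGES];        /* set by the fault handler */

static void write_fault(int sig, siginfo_t *si, void *ctx)
{
    char *page = (char *)((unsigned long)si->si_addr & ~(page_size - 1));
    long idx = (page - region) / page_size;

    (void)sig;
    (void)ctx;
    if (idx < 0 || idx >= NPAGES)
        _exit(1);                       /* a real crash, not our write-protect */

    page_dirty[idx] = 1;
    /* Re-enable writes so the faulting store can be restarted. */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

static void checkpoint(void)
{
    /*
     * "Checkpoint": pretend everything was saved, then write-protect
     * the region so later modifications are tracked lazily by faults.
     */
    memset(page_dirty, 0, sizeof(page_dirty));
    mprotect(region, NPAGES * page_size, PROT_READ);
}

int main(void)
{
    struct sigaction sa;
    int i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page_size = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return 1;

    checkpoint();

    /* The workload touches only two pages after the checkpoint ... */
    region[0] = 1;
    region[5 * page_size] = 1;

    /* ... so only those two would have to be re-checkpointed: */
    for (i = 0; i < NPAGES; i++)
        if (page_dirty[i])
            printf("page %d modified since the checkpoint\n", i);
    return 0;
}

Running it prints just the two pages the 'workload' touched
after the checkpoint - exactly the set the next
(incremental) checkpoint would have to copy.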

With live kernel patching we run into a brick wall of
complexity straight away: we have to analyze the nature of
the kernel modification, in the context of live patching,
and that only works for the simplest of kernel
modifications.

With live kernel upgrades no such brick wall exists:
just about any transition between kernel versions is
possible.

Granted, with live kernel upgrades it's much more complex
to get even the 'simple' case into rudimentarily working
shape (the full user-space state has to be enumerated,
saved and restored - see the sketch below), but once we
are there, it's a whole new category of goodness, and it
probably covers 90%+ of the live kernel patching usecases
on day 1 already ...
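
As for enumerating the full user-space state: a lot of the
raw material is already exported via /proc, which is roughly
what checkpoint/restore tools like CRIU walk today. Here's a
trivial sketch that only lists a process's memory mappings
and open file descriptors - a real save/restore would of
course also need registers, credentials, signal state,
sockets, timers and so on:

/*
 * Minimal illustration of enumerating user-space state from /proc:
 * memory layout from /proc/self/maps, open files from /proc/self/fd.
 * Only a starting point - far from the complete state a live upgrade
 * would have to carry over.
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char line[512];
    struct dirent *de;

    /* 1) Memory layout: one VMA per line of /proc/self/maps. */
    FILE *maps = fopen("/proc/self/maps", "r");
    while (maps && fgets(line, sizeof(line), maps))
        fputs(line, stdout);
    if (maps)
        fclose(maps);

    /* 2) Open files: each entry in /proc/self/fd is a symlink to the
     *    underlying object (file, pipe, socket, ...). */
    DIR *fds = opendir("/proc/self/fd");
    while (fds && (de = readdir(fds))) {
        char path[PATH_MAX], target[PATH_MAX];
        ssize_t n;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/proc/self/fd/%s", de->d_name);
        n = readlink(path, target, sizeof(target) - 1);
        if (n >= 0) {
            target[n] = '\0';
            printf("fd %s -> %s\n", de->d_name, target);
        }
    }
    if (fds)
        closedir(fds);
    return 0;
}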

Thanks,

Ingo