Re: Stop the Linux kernel madness

From: Andrea Arcangeli
Date: Mon Jun 21 2004 - 21:09:11 EST


On Sun, Jun 20, 2004 at 06:16:26PM -0700, 4Front Technologies wrote:
> You and others can keep suggesting that put the world+kitchen sink into the
> kernel and have the problems go away but it's not realistic. Many drivers
> are still maintained outside the kernel and you aren't providing a solution.

Breaking driver interfaces gratuitously would be insane. We break the
API to drivers only when we *have* to, after thinking twice about the
problem and after ruling out backwards-compatible alternatives. And
normally the worst thing that can happen is that the driver doesn't
compile anymore.

Here I'm not talking about the build system issue you ran into. I'm
talking about true source-level breakage of kernel modules maintained
outside the kernel, which you may hit too and which is more difficult
to solve than the build system command already described in the
readme.suse.

To take the most recent example: we had to break the source API to
drivers to fix the release_pages race that Andrew found and fixed in
mainline too. That fix changes page->count into page->_count, and quite
a few drivers broke, even outside the kernel. I had the choice of not
breaking the API, but that would have forced us to disable irqs and take
a per-zone spinlock in every last put_page(), definitely not desirable
in an enterprise OS where numbers matter. I appreciate the ability to
fix things right and to boost performance to the maximum whenever
possible, as happens in the mainline kernel tree. I even had a lengthy
private discussion with Andrew, and it was he who suggested
local_irq_disable + atomic_dec_and_lock as the only possible
alternative, but it wasn't attractive enough for performance reasons.
Furthermore, in a few years people would be more annoyed by page->count
than by page->_count as they move to more recent mainline releases.
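
As an illustration of why a field rename like this is a deliberate
compile-time break, here is a minimal userspace model. It is not kernel
code: struct page_model and the page_model_* helpers are hypothetical
names, and storing the count biased by -1 is an assumption, loosely
modelled on how mainline handled the renamed field around 2.6.7. The
idea is that a driver going through an accessor keeps working, while a
driver reading the raw field stops compiling instead of silently
reading a value whose meaning has changed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model only; this is NOT the kernel's struct page. */
struct page_model {
    /* Assumption: stored value is (number of references - 1). */
    atomic_int _count;
};

/* A freshly set up page has exactly one reference (stored value 0). */
static void page_model_init(struct page_model *p)
{
    atomic_init(&p->_count, 0);
}

/* Accessor: hides the bias, so callers see the real reference count. */
static int page_model_count(struct page_model *p)
{
    return atomic_load(&p->_count) + 1;
}

/* Take one more reference. */
static void page_model_get(struct page_model *p)
{
    atomic_fetch_add(&p->_count, 1);
}

/*
 * Drop one reference with a single atomic operation.
 * Returns true when the caller dropped the last reference,
 * i.e. the stored value went from 0 to -1.
 */
static bool page_model_put_testzero(struct page_model *p)
{
    assert(page_model_count(p) > 0);   /* dropping a free page is a bug */
    return atomic_fetch_sub(&p->_count, 1) == 0;
}
```

Code that only calls page_model_count() never notices the bias. Code
that reads the field directly would be off by one, which is the kind of
silent breakage a rename turns into a build failure.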

At another time, during 2.4, to support databases using >16G of RAM and
running thousands of processes, I had to break the pte_offset API to
create pte-highmem, so that the ptes would not fill the whole lowmem
zone and run the box OOM. Luckily vmalloc_to_page was created at around
the same time, so a more generic API that works with mainline too could
be suggested to driver developers; even in this case, over time people
should have become more comfortable with vmalloc_to_page.

These things don't happen often, but they sometimes have to happen, and
it's good that we can fix them right. If we were shipping a
non-open-source OS, we would be forced to retain the same module API
just to boot the machine, and in turn to introduce ugly and slow hacks
to work around bugs. These days the kernel is quite mature, so hopefully
such breaks won't happen anymore during stable cycles (I mean after
2.6.7, which already carries the page->_count break), but you never
know.

NOTE: the source API to kernel modules must not be confused with the
_binary_ ABI to userspace. The ABI to userspace is a completely
different matter: the userspace ABI (syscalls, for example) must stay
the same across all Linux versions. That is very important. The kernel
API to modules not being fixed is a feature, not a bug.
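
To make the userspace side of that distinction concrete, here is a
small sketch, assuming a Linux system with glibc. It invokes getpid
through the raw syscall interface and compares the result with the libc
wrapper. raw_getpid and check_getpid_abi are hypothetical helper names
introduced only for this illustration. A binary doing this relies on
nothing but the syscall number and calling convention, which is the
part that has to stay the same from one kernel version to the next.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our pid directly, bypassing the libc wrapper. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* The raw syscall and the libc wrapper must agree on the answer. */
static void check_getpid_abi(void)
{
    assert(raw_getpid() == (long)getpid());
}
```

Nothing comparable is promised to code living inside the kernel: a
module is rebuilt against each kernel's headers and is expected to
follow source-level changes.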