Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

From: Linus Torvalds
Date: Sat Jan 09 2021 - 14:16:48 EST


On Sat, Jan 9, 2021 at 11:03 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> >
> > Sorry to ask but I'm curious, what also goes wrong if the user
> > modifies memory under GUP pin from vmsplice? That's not obvious to
> > see.
>
> It breaks the otherwise true rule that the data in pipe buffers is
> immutable.

Note that this continued harping on vmsplice() is entirely misguided.

Anything using GUP has the same issues.

This really has nothing to do with vmsplice() per se.

In many ways, vmsplice() might be the least of your issues, because
it's fairly easy to just limit that for untrusted use.

And no, that does not mean "we should make vmsplice root-only" kind of
limiting. There are no security issues in any normal situation. Again,
it's mainly about things that don't trust each other _despite_ running
in similar contexts as far as the kernel is concerned. IOW, exactly
that "zygote" kind of situation.

If you are a JIT (whether Zygote or a web browser), you basically need
to limit the things the untrusted JIT'ed code can do. And that
limiting may include vmsplice().

But note the "include" part of "include vmsplice()". Any other GUP
user really does have the same issues, it may just be less obvious and
have very different timings (or depend on access to devices etc).

Absolutely nothing cares about "data in pipe buffers changing" in any
other case. You can already write any data you want to a pipe, it
doesn't matter if it changes after the write or not.

(In many ways, "data in the page cache" is a *much* more difficult
issue for the kernel, and it's fundamental to any shared mmap. It's
much more difficult because that data is obviously very much also
accessible for DMA etc for writeout, and if you have something like
"checksums are calculated separately and non-atomically from the
actual DMA accesses", you will potentially get checksum errors where
the actual disk contents don't match your separately calculated
checksums until the _next_ write. This can actually be a feature -
seeing "further modifications were concurrent to the write" - but most
people end up considering it a bug).

Linus