Re: [PATCH v1] mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW

From: Linus Torvalds
Date: Tue Aug 09 2022 - 15:21:59 EST


On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> Since BUG_ON crashes the machine and Linus says that crashing the
> machine is bad, WARN_ON will also crash the machine if you set the
> panic_on_warn parameter, so it is also bad, thus we shouldn't use
> anything.

If you set 'panic_on_warn' you get to keep both pieces when something breaks.

The thing is, there are people who *do* want to stop immediately when
something goes wrong in the kernel.

Anybody doing large-scale virtualization presumably has all the
infrastructure to get debug info out of the virtual environment.

And people who run controlled loads in big server machine setups and
have a MIS department to manage said machines typically also prefer
for a machine to just crash over continuing.

So in those situations, a dead machine is still a dead machine, but
you get the information out, and panic_on_warn is fine, because panic
and reboot is fine.

And yes, that's actually a fairly common case. Things like syzkaller
etc *want* to abort on the first warning, because that's kind of the
point.

But while that kind of virtualized automation machinery is very very
common, and is a big deal, it's by no means the only deal, and it is
not the most important thing to the point where nothing else matters.

And if you are *not* in a farm, and if you are *not* using
virtualization, a dead machine is literally a useless brick. Nobody
has serial lines on individual machines any more. In most cases, the
hardware literally doesn't even exist any more.

So in that situation, you really cannot afford to take the approach of
"just kill the machine". If you are on a laptop and are doing power
management code, you generally cannot do that in a virtual
environment, and you already have enough problems with suspend and
resume being hard to debug, without people also going "oh, let's just
BUG_ON() and kill the machine".

Because the other side of that "we have a lot of machine farms doing
automated testing" is that those machine farms do not generally find a
lot of the exciting cases.

Almost every single merge window, I end up having to bisect and report
an oops or a WARN_ON(), because I actually run on real hardware. And
said problem was never seen in linux-next.

So we have two very different cases: the "virtual machine with good
logging where a dead machine is fine" - use 'panic_on_warn'. And the
actual real hardware with real drivers, running real loads by users.

Both are valid. But the second case means that BUG_ON() is basically
_never_ valid.

Linus