Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

From: John Hubbard
Date: Mon Nov 01 2021 - 15:31:53 EST


On 11/1/21 07:22, Pasha Tatashin wrote:
Yes, you are just repeating what the diffs say.

But it's still not good to have this function name doing something completely
different than its name indicates.

I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?


What? No, that's not where I was going at all. The function is already
named set_page_refcounted(), and one of the problems I see is that your
changes turn it into something that most certainly does not
set_page_refcounted(). Instead, this patch *increments* the refcount.
That is not the same thing.

And then it uses a .config-sensitive assertion to "prevent" problems.
And by that I mean, the wording throughout this series seems to equate
VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
however, in CONFIG_DEBUG_VM configurations, and provide no protection at
all for normal (most distros) users. That's something that the wording,
comments, and even design should be tweaked to account for.
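
To illustrate the point about config-sensitive assertions, here is a minimal
userspace sketch. SKETCH_VM_BUG_ON(), DEBUG_VM and sketch_set_refcounted() are
hypothetical stand-ins invented for this example; they are not the kernel's
macro definitions. The idea is only that a check which is compiled out gives no
protection in that build.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a debug-only assertion. With DEBUG_VM defined
 * the condition is checked at run time; without it the condition is only
 * type-checked by sizeof and is never evaluated.
 */
#ifdef DEBUG_VM
#define SKETCH_VM_BUG_ON(cond)                                  \
	do {                                                    \
		if (cond) {                                     \
			fprintf(stderr, "BUG: %s\n", #cond);    \
			abort();                                \
		}                                               \
	} while (0)
#else
#define SKETCH_VM_BUG_ON(cond) ((void)sizeof(!!(cond)))
#endif

/*
 * Check-then-set on a plain int. In a build without DEBUG_VM, a nonzero
 * starting value is silently overwritten with 1.
 */
int sketch_set_refcounted(int *refcount)
{
	SKETCH_VM_BUG_ON(*refcount != 0);
	*refcount = 1;
	return *refcount;
}
```

Built without -DDEBUG_VM, calling sketch_set_refcounted() on a counter that
already holds 5 returns 1 and reports nothing; built with -DDEBUG_VM, the same
call aborts.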

VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
sensitive, but in both cases *BUG_ON() means that there is an
unrecoverable problem that occurred. The only difference between the
two is that VM_BUG_ON() is not enabled when distros decide to reduce
the size of their kernel and improve runtime performance by skipping
some extra checking.

There is no logical separation between VM_BUG_ON and BUG_ON; there has
been a lengthy discussion about this:

https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@xxxxxxxxxxxxxx/
"so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
just allows some people to build smaller kernels, but apparently
distro people would rather have debugging than save a few kB of RAM."

Losing control of ref_count is an unrecoverable problem because it
leads to security sensitive memory corruptions. It is better to crash
the kernel when that happens instead of ending up with some pages
mapped into the wrong address space.

The races are tricky to spot, but set_page_count() is inherently
dangerous, so I am removing it entirely and replacing it with safer
operations which do the same thing.

One example is this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29

I don't think we have any disagreement about the landscape here. But it's
much easier to describe the problem than it is to fix it--as always. And
repeating the problem doesn't make a proposed fix more (or less) appropriate. :)


I understand where this patchset is going, but this intermediate step is
not a good move.

Also, for the overall series, if you want to change from
"set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
do that is *not* to depend solely on VM_BUG*() to verify. Instead,
return something like -EBUSY if incrementing the value results in a
surprise, and let the caller decide how to handle it.
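
A rough userspace sketch of that suggestion, using C11 atomics. The names
struct sketch_page and sketch_set_refcounted_or_busy() are hypothetical and
exist only for this example; whether a failed attempt should also undo its
increment is left open here.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct page; only the refcount matters here. */
struct sketch_page {
	atomic_int refcount;
};

/*
 * Increment the refcount and verify that the result is exactly 1.
 * Returns 0 on success, or -EBUSY if the count was not 0 beforehand,
 * leaving the decision about how to react to the caller.
 */
int sketch_set_refcounted_or_busy(struct sketch_page *p)
{
	/* atomic_fetch_add() returns the value held before the add. */
	int new_count = atomic_fetch_add(&p->refcount, 1) + 1;

	if (new_count != 1)
		return -EBUSY;
	return 0;
}
```

A caller that treats any surprise as fatal can still BUG() on a nonzero
return; a caller that can recover is not forced to crash.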

In set_page_refcounted() we already have:

VM_BUG_ON_PAGE(page_ref_count(page), page);
set_page_count(page, 1);

I am pointing out that the above code is racy:

Between the VM_BUG_ON_PAGE() check and the unconditional set to 1,
the value of page->_refcount can change.

I am replacing it with an identical version of code that is not racy.
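
The difference between the two patterns can be sketched in userspace with C11
atomics. This is an illustration under stated assumptions, not the code from
the patch: struct sketch_page and both function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for struct page; only the refcount matters here. */
struct sketch_page {
	atomic_int refcount;
};

/*
 * Check-then-set: the load and the store are two separate steps. Another
 * thread may change the count in between, and the store then overwrites
 * that change without anyone noticing. Returns 1 if the count read as 0.
 */
int sketch_set_racy(struct sketch_page *p)
{
	int was_zero = (atomic_load(&p->refcount) == 0);

	atomic_store(&p->refcount, 1);
	return was_zero;
}

/*
 * Increment-and-verify: one atomic read-modify-write. The value observed
 * is exactly the value that was modified, so there is no window between
 * the check and the update. Returns 1 if the count went from 0 to 1.
 */
int sketch_inc_and_verify(struct sketch_page *p)
{
	int old = atomic_fetch_add(&p->refcount, 1);

	return old == 0;
}
```

Note that the second function changes behavior as well as closing the window:
on an unexpected starting value it leaves old + 1 behind rather than forcing
the count to 1.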

And I'm pointing out that raciness is not the real bug, or at least, not
the only bug. "Fixing" the race does not fix the code, but the patch
series seems to imply that it does.

There is no need to complicate the code by introducing new -EBUSY
returns here, as it would increase the fragility of this code even
further.

Actually, -EBUSY would be OK if the problems were because we failed to

I am not sure -EBUSY would be OK here; it means we had a race which we
were not aware of, and which could have led to memory corruption.

modify refcount for some reason, but if we modified refcount and got
an unexpected value (i.e., underflow/overflow), we had better report it right
away instead of waiting for memory corruption to happen.


Having the caller do the BUG() or VM_BUG*() is not a significant delay.

I agree; however, helper functions exist to remove code duplication.
If we must verify the assumption of set_page_refcounted() that a
non-counted page is turned into a counted page, it is better to do it in
one place than at every call site. We do it today in this helper
function, and I do not see why we would change that.


Let's ignore this -EBUSY idea for now, because I'm not sure where you are
going with your next version, and maybe it won't even come up.


thanks,
--
John Hubbard
NVIDIA