Re: kvm splat in mmu_spte_clear_track_bits

From: Linus Torvalds
Date: Tue Aug 29 2017 - 15:06:51 EST


On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>
> Kirill did regress invalidate_page as it use to be call outside the
> spinlock and now it is call inside the spinlock thus reverting will
> introduce back a regression.

Honestly, this MMU notifier thing has been nothing but a badly
designed mistake from beginning to end, and bad rules for what can
sleep and what can not are one fundamental problem.

There are fundamentally two levels of VM locking, and those two levels
are not going to go away, and we're not budging on them:

- there's the "virtual address" level, which can block. We have a
nice mmap_semaphore, and we guarantee that it's held for writing for
all changes to the virtual memory layout

This is the "mmap/munmap" kind of granularity. The mmu callbacks at
*this* level are fine to block.

- then there is the "page level" VM handling, and honestly, that
*fundamentally* uses a spinlock. If we look at a particular page, that
page is meaningless without the lock. Really.

I honestly believe that any MMU callback at this level needs to be
atomic. Some of the absolutely *have* to be (that "change_pte", for
example).

In that second case, we might have a "begin/end" surrounding the
actual page table walk. And that might sleep, but then it
*fundamentally* cannot actually be able some particular single page
or stable range. Because without the page table spinlock, no such
stability exists. It's purely a "we are not going to start looking at
this range" kind of thing.

I really don't understand why the nVidia crap cannot follow those
simple rules. Because either

(a) you're working with virtual addresses, and you should be able to
work on that virtual layer

(b) you're actually working with physical pages, and you can just
hold on to those physical pages yourself.

I really detest our MMU callbacks. I shouldn't have allowed them to be
merged. And I definitely shoul.dn't allow them to screw up our VM
layer.

But we have them, and we should work at making sure people do sane things.

And yes, those sane things may include

(a) guaranteeing that the start/end range calls are always done
around the actual locked region.

(b) adding a ton of validation so that people *see* then they break
the rules. Even when they don't use some random notifier crud.

That (b) may involve adding a number of "might_sleep()" calls (not
deep in the notifiers themselves, but in the actual wrapper functions
even when notifiers are compiled out entirely!), but also adding calls
to validate those kinds of "you can't call
mmu_notifier_invalidate_page() without having first called
mmu_notifier_invalidate_range_start() in a sleepable context".

But (b) definitely should also be a very real onus on the mmu
notifiers themselves. No way can we sleep when we're traversing page
tables. We hold a page table lock. We can sleep before and after, but
not during actual page traversal.

Linus