Re: [PATCH] KVM: arm/arm64: fix unaligned hva start and end in handle_hva_to_gpa

From: Suzuki K Poulose
Date: Tue May 15 2018 - 13:41:20 EST


Hi Jia,

On 15/05/18 14:15, Jia He wrote:


On 5/15/2018 8:38 PM, Jia He wrote:
Hi Suzuki

On 5/15/2018 4:36 PM, Suzuki K Poulose wrote:

Hi Jia

On 05/15/2018 03:03 AM, Jia He wrote:
Hi Suzuki

I will merge the other thread into this, and add the necessary CC list

That WARN_ON() call trace is very easy to reproduce on my armv8a server after I
start 20 guests and run memhog in the host. Of course, KSM has to be enabled.

For your question about my fault-injection debug patch:


Thanks for the patch, comments below.



...

index 7f6a944..ab8545e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -290,12 +290,17 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
  * destroying the VM), otherwise another faulting VCPU may come in and mess
  * with things behind our backs.
  */
+extern int trigger_by_ksm;
 static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
 {
	pgd_t *pgd;
	phys_addr_t addr = start, end = start + size;
	phys_addr_t next;

+	if (trigger_by_ksm) {
+		end -= 0x200;
+	}
+
	assert_spin_locked(&kvm->mmu_lock);
	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
	do {

I need to point out that I never reproduced it without this debugging patch.

That could trigger the panic iff your "size" <= 0x200, leading to the
condition (end < start), which can make the loop run forever: we loop
while (addr < end) and end up accessing something which may not be a PGD
entry, and thus get a bad page with bad numbers all around. This case could
be hit only with your change plus the bug in KSM which hands us an address
near the page boundary.
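
As an illustration, here is a minimal userspace sketch (made-up addresses,
not the kernel walker) of that runaway: the lower-level stage-2 walkers step
page-by-page and compare for equality, so once end is below (or misaligned
against) addr, the terminating compare never matches:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000UL	/* assume 4K pages for the sketch */

int main(void)
{
	/* end below start, as with "size" <= 0x200 after the injected end -= 0x200 */
	uint64_t start = 0x12340100, end = start - 0x100;
	uint64_t addr = start;
	int steps = 0;

	do {
		steps++;			/* the kernel would consume a table entry here */
		addr += PAGE_SIZE;		/* fixed stride: steps over end, never equals it */
	} while (addr != end && steps < 8);	/* cap only for the demo */

	printf("after %d steps addr=0x%llx, end=0x%llx was never hit\n",
	       steps, (unsigned long long)addr, (unsigned long long)end);
	return 0;
}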
No, I injected the fault on purpose to simulate the case where size is less
than PAGE_SIZE (e.g. PAGE_SIZE - 0x200 = 65024, with 64K pages).
I got the panic info [1] *without* the debugging patch only once.

[1] https://lkml.org/lkml/2018/5/9/992

So, I think we can safely ignore the PANIC().
More below.

I am a bit confused now. Do you mean the panic was triggered *without* the
debugging patch? If that is the case, it is really worrying.



Suzuki, thanks for the comments.

I proposed another ksm patch https://lkml.org/lkml/2018/5/3/1042
The root cause is that KSM adds some extra flag bits to the address to indicate
whether the page is in the stable tree. This leaves the address no longer
PAGE_SIZE aligned.
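
Roughly, as a simplified sketch (in mm/ksm.c the flags live in the low bits of
rmap_item->address, and STABLE_FLAG is 0x200, which is where the 0x200 in this
thread comes from):

#include <stdio.h>

#define PAGE_SIZE	0x10000UL		/* 64K pages, as on my server */
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define STABLE_FLAG	0x200UL			/* "is listed from the stable tree" */

int main(void)
{
	unsigned long address = 0x12340000UL;

	address |= STABLE_FLAG;		/* ksm tags the rmap_item address */

	/* if a caller forgets to mask, the mmu notifier sees an hva
	 * that is no longer page aligned */
	printf("hva handed down: 0x%lx (masked: 0x%lx)\n",
	       address, address & PAGE_MASK);
	return 0;
}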
Thanks for the pointer. In the future, please Cc the people relevant to the
discussion in the patches.

From the arm KVM MMU point of view, do you think handle_hva_to_gpa() still
needs to handle the unaligned case?
I don't think we should do that. Had we done this, we would never have caught
this bug in KSM. Eventually, if some other new implementation comes up with a
new notifier consumer which doesn't check alignment and doesn't WARN, it could
simply do the wrong thing. So I believe what we have is a good measure to make
sure that things are in the right order.

IMO, the PAGE_SIZE alignment is still needed, because we should not leave it
to the bottom-level function kvm_age_hva_handler() to handle the exception.
Please refer to the implementations of kvm_handle_hva_range() on x86 and
powerpc: they both align the hva via hva_to_gfn_memslot().
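
The alignment falls out of the shift in hva_to_gfn_memslot(); roughly
(simplified struct and 4K pages assumed for the sketch):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* 4K pages for the sketch */

typedef uint64_t gfn_t;

struct kvm_memory_slot {		/* only the two fields the sketch needs */
	gfn_t base_gfn;
	unsigned long userspace_addr;
};

static gfn_t hva_to_gfn_memslot(unsigned long hva, struct kvm_memory_slot *slot)
{
	/* ">> PAGE_SHIFT" discards the sub-page offset, so a misaligned
	 * hva silently becomes the gfn of its containing page */
	gfn_t gfn_offset = (hva - slot->userspace_addr) >> PAGE_SHIFT;

	return slot->base_gfn + gfn_offset;
}

int main(void)
{
	struct kvm_memory_slot slot = {
		.base_gfn = 0x1000,
		.userspace_addr = 0x12340000UL,
	};

	/* a misaligned hva and its aligned counterpart land on the same gfn */
	printf("gfn(0x12341200)=0x%llx gfn(0x12341000)=0x%llx\n",
	       (unsigned long long)hva_to_gfn_memslot(0x12341200UL, &slot),
	       (unsigned long long)hva_to_gfn_memslot(0x12341000UL, &slot));
	return 0;
}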

From an API perspective, you are passed a "start" and an "end" address. So you
could potentially do the wrong thing if you align the "start" and "end". Maybe
those handlers should also do the same thing as we do.

But handle_hva_to_gpa() has possibly already partially adjusted the alignment:
    1750         kvm_for_each_memslot(memslot, slots) {
    1751                 unsigned long hva_start, hva_end;
    1752                 gfn_t gpa;
    1753
    1754                 hva_start = max(start, memslot->userspace_addr);
    1755                 hva_end = min(end, memslot->userspace_addr +
    1756                               (memslot->npages << PAGE_SHIFT));

At line 1755, let us assume that end = 0x12340200 and
memslot->userspace_addr + (memslot->npages << PAGE_SHIFT) = 0x12340000.
Then hva_start is not PAGE_SIZE aligned while hva_end is aligned, and the size
will be PAGE_SIZE - 0x200, just as in my fault-injection debugging patch.
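
Concretely, with made-up numbers (4K pages assumed for the sketch):

#include <stdio.h>

#define PAGE_SIZE 0x1000UL
#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* misaligned range from the notifier (start carries the 0x200) */
	unsigned long start = 0x1233f200UL, end = 0x12340200UL;
	/* the memslot ends at 0x12340000 */
	unsigned long slot_start = 0x12000000UL, slot_end = 0x12340000UL;

	unsigned long hva_start = max(start, slot_start);
	unsigned long hva_end   = min(end, slot_end);

	/* the handler gets size = PAGE_SIZE - 0x200, not a whole page */
	printf("hva_start=0x%lx hva_end=0x%lx size=0x%lx\n",
	       hva_start, hva_end, hva_end - hva_start);
	return 0;
}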

That's because we want to limit the handling of the hva/gpa range by memslot.
So we make sure we pass on only the range within the given memslot to
hva_to_gfn_memslot(). But we do iterate over the next memslot if the original
range falls into it. So, in practice, there is no alignment/trimming of the
range; it's just that we pass on the appropriate range for each slot.

Yes, I understand what the code does around hva_to_gfn_memslot(). What I mean
is that hva_end may be changed, so (hva_end - hva_start) will not be the same
as the parameter _size_?

ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);

Anyway, I have to admit that all the exceptions are originally caused by the
STABLE_FLAG in the KSM code. What I want to discuss here is how to make arm
KVM handle the exception more gracefully.

That's my point. We need the fix in KSM. Once we have the fix in place, do we
hit the WARN_ON() any more?


Hi Suzuki
How about this patch (balancing avoidance of a WARN_ON storm against debugging
convenience)?

The problem with WARN_ON_ONCE() is that it could suppress warnings from the
different call paths that lead to the trigger. E.g., unmap_stage2_range() can
be called from several paths, and a WARN_ON_ONCE() there could hide all but
the first of them.
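
For reference, WARN_ON_ONCE() keeps one static flag per call site; a rough
userspace model (only to show the suppression, not the real implementation):

#include <stdio.h>

/* rough model: one static __warned per call site, so the first trigger,
 * from whatever path reached it, silences every later one */
#define MY_WARN_ON_ONCE(cond) ({				\
	static int __warned;					\
	int __ret = !!(cond);					\
	if (__ret && !__warned) {				\
		__warned = 1;					\
		fprintf(stderr, "WARNING at %s:%d\n",		\
			__FILE__, __LINE__);			\
	}							\
	__ret;							\
})

static void unmap_range(unsigned long size)
{
	MY_WARN_ON_ONCE(size & 0xfffUL);	/* stand-in for size & ~PAGE_MASK */
}

int main(void)
{
	unmap_range(0xe00);	/* first trigger: warns */
	unmap_range(0x200);	/* later trigger, e.g. from another caller: silent */
	return 0;
}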

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 7f6a944..4033946 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -297,6 +297,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
 	phys_addr_t next;

 	assert_spin_locked(&kvm->mmu_lock);
+
+	WARN_ON_ONCE(size & ~PAGE_MASK);
 	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
 	do {
 		/*
@@ -1792,7 +1794,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
 {
 	pte_t *pte = (pte_t *)data;

-	WARN_ON(size != PAGE_SIZE);
+	WARN_ON_ONCE(size != PAGE_SIZE);
 	/*
 	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
 	 * flag clear because MMU notifiers will have unmapped a huge PMD before
@@ -1823,7 +1825,7 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
 	pmd_t *pmd;
 	pte_t *pte;

-	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
+	WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
 	pmd = stage2_get_pmd(kvm, NULL, gpa);
 	if (!pmd || pmd_none(*pmd)) /* Nothing there */
 		return 0;
@@ -1843,7 +1845,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
 	pmd_t *pmd;
 	pte_t *pte;

-	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
+	WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
 	pmd = stage2_get_pmd(kvm, NULL, gpa);
 	if (!pmd || pmd_none(*pmd)) /* Nothing there */
 		return 0;



Cheers
Suzuki