Re: kvm splat in mmu_spte_clear_track_bits

From: Mike Galbraith
Date: Tue Aug 29 2017 - 08:58:06 EST


On Mon, 2017-08-28 at 09:56 -0700, Nadav Amit wrote:
> Bernhard Held <berny156@xxxxxx> wrote:
>
> > On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
> >> 4.13-rc5 retested fails
> >> Crashed only after two hours or so of testing.
> >> 4.13-rc4 apparently works
> >> It survived several hours of varied tests (like 5 debian-installer runs, a
> >> win10 point release upgrade, some hurd package building, openbsd, etc),
> >> all while the host was likewise busy.
> >> Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
> >> but I wouldn't bet my life on it.
> >
> > I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but that's not 100% sure.
> >
> > There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:
> >
> > $ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
> > e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
> > f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
> > aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
> > d041353dc98a mm: fix list corruptions on shmem shrinklist
> > af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
> > c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
> > b3a81d0841a9 mm: fix KSM data corruption
> > 99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
> > 0a2dd266dd6b mm: make tlb_flush_pending global
> > 56236a59556c mm: refactor TLB gathering API
> > a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
> > 0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
> > 16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
> > 9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
> > 4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
> > 9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
> > 434b06ae23ba test_kmod: fix bug which allows negative values on two config options
> > a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
> > 5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
> > 75dddef32514 mm: ratelimit PFNs busy info message
> > d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
>
> Don't blame me for the TLB stuff... My money is on aac2fea94f7a.

You may be onto something.

FWIW, with an RT host/guest, I reproduced the problem yesterday in
fairly short order, but today, with that commit reverted, and pushing
markedly harder, nada.

(hohum, intermittent bugs tend to do that, they're particularly fond of
showing up about 10 seconds after you report them dead...9..8..7;)

A colleague suggested going back to mmu_notifier_invalidate_page(),
which I'm going to try shortly (hopefully noticing absolutely nothing
the least bit 'interesting'), but first, I'm gonna CC the author of
that _maybe_ culprit patch.

-Mike