Re: kvm splat in mmu_spte_clear_track_bits

From: Bernhard Held
Date: Tue Aug 29 2017 - 05:19:36 EST


On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
Bernhard Held <berny156@xxxxxx> wrote:

On 08/27/2017 at 02:35 PM, Adam Borowski wrote:
4.13-rc5 retested fails
Crashed only after two hours or so of testing.
4.13-rc4 apparently works
It survived several hours of varied tests (like 5 debian-installer runs, a
win10 point release upgrade, some hurd package building, openbsd, etc),
all while the host was likewise busy.
Thus: to the best of my knowledge, the problem is between 4.13-rc4 and 4.13-rc5
but I wouldn't bet my life on it.

I get crashes with Win10 in kvm with 4.13-rc5. 4.13-rc4 works for me. THP seems to accelerate the crash, but I'm not 100% sure of that.

There's still no crash after reverting merge 27df70 on 4.13-rc7. There are 21 commits in this merge, 10 are mm-related:

$ git log 4e082e9ba7cd..e86b298bebf7 --pretty=oneline --abbrev-commit
e86b298bebf7 userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
f357e345eef7 zram: rework copy of compressor name in comp_algorithm_store()
aac2fea94f7a rmap: do not call mmu_notifier_invalidate_page() under ptl
d041353dc98a mm: fix list corruptions on shmem shrinklist
af54aed94bf3 mm/balloon_compaction.c: don't zero ballooned pages
c0a6a5ae6b5d MAINTAINERS: copy virtio on balloon_compaction.c
b3a81d0841a9 mm: fix KSM data corruption
99baac21e458 mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
0a2dd266dd6b mm: make tlb_flush_pending global
56236a59556c mm: refactor TLB gathering API
a9b802500ebb Revert "mm: numa: defer TLB flush for THP migration as long as possible"
0a2c40487f3e mm: migrate: fix barriers around tlb_flush_pending
16af97dc5a89 mm: migrate: prevent racy access to tlb_flush_pending
9eeb52ae712e fault-inject: fix wrong should_fail() decision in task context
4e98ebe5f435 test_kmod: fix small memory leak on filesystem tests
9c56771316ef test_kmod: fix the lock in register_test_dev_kmod()
434b06ae23ba test_kmod: fix bug which allows negative values on two config options
a4afe8cdec16 test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
5af10dfd0afc userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
75dddef32514 mm: ratelimit PFNs busy info message
d507e2ebd2c7 mm: fix global NR_SLAB_.*CLAIMABLE counter reads

Don't blame me for the TLB stuff... My money is on aac2fea94f7a.

Amit, thanks for your courage to expose your patch!

I'm more and more confident that aac2fea94f7a is the culprit. Maybe it just accelerates the triggering of the splat. To be more sure, the kernel needs to be tested for a couple of days. It would be great if others could assist in testing aac2fea94f7a.
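
For anyone who wants to help, one way to set up such a test is sketched below. This is only a rough sketch: the helper name revert_suspect and the branch naming are made up for illustration, and it assumes the revert applies cleanly on top of the chosen tag, which has not been verified here.

```shell
# Hypothetical helper (not from the original thread): create a test branch
# at a given base and revert a single suspect commit on top of it.
# Assumes the revert applies without conflicts.
revert_suspect() {
    base=$1      # tag or commit to start from, e.g. v4.13-rc7
    commit=$2    # suspect commit to revert, e.g. aac2fea94f7a
    git checkout -b "revert-$commit" "$base" &&
    git revert --no-edit "$commit"
}

# Possible usage inside a kernel tree (then build, boot and run the
# KVM guests for a couple of days):
#   revert_suspect v4.13-rc7 aac2fea94f7a
```

If the splat no longer shows up on such a branch after a long enough run, that would support the suspicion; it would not prove it, since the commit might only change the timing.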

Have fun,
Bernhard