KVM nonblocking MMU notifier with KVM_GUEST_USES_PFN looks racy [but is currently unused]

From: Jann Horn
Date: Mon Sep 18 2023 - 13:13:39 EST


Hi!

I haven't tested this and might be missing something, but I think that
the MMU notifier for KVM_GUEST_USES_PFN pfncache is currently a bit
broken. Except that nothing seems to actually use KVM_GUEST_USES_PFN,
so currently it's not actually a problem?

gfn_to_pfn_cache_invalidate_start() contains the following:

/*
* If the OOM reaper is active, then all vCPUs should have
* been stopped already, so perform the request without
* KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
*/
if (!may_block)
req &= ~KVM_REQUEST_WAIT;

called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

WARN_ON_ONCE(called && !may_block);

The comment explains that we rely on OOM reaping only happening when a
process is sufficiently far into being stopped that it is no longer
executing vCPUs, but from what I can tell, that's not what the caller
actually guarantees. Especially on the path from the
process_mrelease() syscall (if we're dealing with a process whose mm
is not shared with other processes), we only check that the target
process has SIGNAL_GROUP_EXIT set. From what I can tell, that does
imply that delivery of a fatal signal has begun, but doesn't even
imply that the CPU running the target process has been IPI'd, let
alone that the target process has died or anything like that.

But I also don't see any reason why
gfn_to_pfn_cache_invalidate_start() actually has to do anything
special for non-blocking invalidation - from what I can tell, nothing
in there can block, basically everything runs with preemption
disabled. The first half of the function holds a spinlock; the second
half is basically a call to kvm_make_vcpus_request_mask(), which
disables preemption across the whole function with
get_cpu()/put_cpu(). A synchronous IPI spins until the IPI has been
acked but that doesn't count as sleeping. (And the rest of the OOM
reaping code will do stuff like synchronous IPIs for its TLB flushes,
too.)

So maybe you/I can just rip out the special-casing of nonblocking mode
from gfn_to_pfn_cache_invalidate_start() to fix this?

Relevant call paths for the theoretical race:

sys_kill
prepare_kill_siginfo
kill_something_info
kill_proc_info
rcu_read_lock
kill_pid_info
rcu_read_lock
group_send_sig_info [PIDTYPE_TGID]
do_send_sig_info
lock_task_sighand [task->sighand->siglock]
send_signal_locked
__send_signal_locked
prepare_signal
legacy_queue
signalfd_notify
sigaddset(&pending->signal, sig)
complete_signal
signal->flags = SIGNAL_GROUP_EXIT [mrelease will
work starting here]
for each thread:
sigaddset(&t->pending.signal, SIGKILL)
signal_wake_up [IPI happens here]
unlock_task_sighand [task->sighand->siglock]
rcu_read_unlock
rcu_read_unlock

sys_process_mrelease
find_lock_task_mm
spin_lock(&p->alloc_lock)
task_will_free_mem
SIGNAL_GROUP_EXIT suffices
PF_EXITING suffices if singlethreaded?
task_unlock
mmap_read_lock_killable
__oom_reap_task_mm
for each private non-PFNMAP/MIXED VMA:
tlb_gather_mmu
mmu_notifier_invalidate_range_start_nonblock
__mmu_notifier_invalidate_range_start
mn_hlist_invalidate_range_start
kvm_mmu_notifier_invalidate_range_start [as
ops->invalidate_range_start]
gfn_to_pfn_cache_invalidate_start
[loop over gfn_to_pfn_cache instances]
if overlap and KVM_GUEST_USES_PFN [UNUSED]: evict_vcpus=true
[if evict_vcpus]
kvm_make_vcpus_request_mask
__kvm_handle_hva_range
unmap_page_range
mmu_notifier_invalidate_range_end
tlb_finish_mmu
mmap_read_unlock