Re: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

From: Tianyi Liu
Date: Fri Oct 20 2023 - 05:24:44 EST


Hi Mark,

On Fri, 13 Oct 2023 15:01:22 +0100, Mark Rutland wrote:
> > > > The event processing flow is as follows (shown as backtrace):
> > > > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > > > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > > > <callback function pointers in `struct perf_guest_info_callbacks`>
> > > > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > > > #3 perf_callchain_guest
> > > > #4 get_perf_callchain
> > > > #5 perf_callchain
> > > >
> > > > Between #0 and #1 is the interface between KVM and the arch-specific
> > > > impl, while between #1 and #2 is the interface between Perf and KVM.
> > > > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > > > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > > > and modifies #4 #5. The last patch is for userspace utils.
> > > >
> > > > Since arm64 hasn't provided some foundational infrastructure (interface
> > > > for reading from a virtual address of guest), the arm64 implementation
> > > > is stubbed for now because it's a bit complex, and will be implemented
> > > > later.
> > >
> > > I hope you realise that such an "interface" would be, by definition,
> > > fragile and very likely to break in a subtle way. The only existing
> > > case where we walk the guest's page tables is for NV, and even that is
> > > extremely fragile.
> >
> > For walking the guest's page tables, yes, there're only very few
> > use cases. Most of them are used in nested virtualization and XEN.
>
> The key point isn't the lack of use cases; the key point is that *this is
> fragile*.
>
> Consider that walking guest page tables is only safe because:
>
> (a) The walks happen in the guest-physical / intermiediate-physical address
> space of the guest, and so are not themselves subject to translation via
> the guest's page tables.
>
> (b) Special traps were added to the architecture (e.g. for TLB invalidation)
> which allow the host to avoid race conditions when the guest modifies page
> tables.
>
> For unwind we'd have to walk structures in the guest's virtual address space,
> which can change under our feet at any time the guest is running, and handling
> that requires much more care.
>
> I think this needs a stronger justification, and an explanation of how you
> handle such races.

Yes, guests can modify the page tables at any time, so the page table
we obtain may be corrupted. We may not be able to complete the traversal
of the page table or may receive incorrect data.

However, these are not critical issues because we often encounter
incorrect stack unwinding results. In fact, here we assume that the
guest OS/program has stack frames (compiled with `fno-omit-frame-pointer`),
but many programs do not adhere to such an assumption, which often leads
to invalid results. This is almost unavoidable, especially when the
guest OS is running third-party programs. The unwind results we record
may be incorrect; if the unwind cannot continue, we only record the
existing results. Addresses that cannot be resolved to symbols will be
later marked as `[unknown]` by `perf kvm`, and this is very common.

Our unwind strategy is conservative to ensure safety and do our best in
readonly situations. If the guest page table is broken, or the address
to be read is somehow not in the guest page table, we will not inject a
page fault but simply stop the unwind. The function that walks the
page table is done entirely in software and is readonly, having no
additional impact on the guest. Some results could also be incorrect.
It is sufficient as long as most of the records are correct for profiling.

Do you think these address your concerns?

Thanks,
Tianyi Liu