Re: [PATCH 1/2] perf/x86/intel/pt: Fail event scheduling on conflict with VMX

From: Peter Zijlstra
Date: Wed Feb 15 2017 - 03:53:36 EST


On Wed, Feb 15, 2017 at 10:11:41AM +0200, Alexander Shishkin wrote:
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>
> > On Tue, Feb 14, 2017 at 07:38:07PM +0100, Peter Zijlstra wrote:
> >
> >> Right, so I question the whole 'lets not schedule PT when VMX' premise,
> >> it leads to inconsistencies all over. How about we treat it like
> >> ->add() succeeded and VMX simply results in no output.
> >>
> >> Esp. when you then emit 'fake' data into/from a vmlaunch/vmresume
> >> instruction.
> >
> > That is, what about something like the below? (completely untested for
> > obvious raisins).
> >
> > That should schedule PT like normal, and where VMXON will auto-clear
> > TraceEn for us, we make VMXOFF set it again.
>
> This makes sense, but this will only make a difference if we're tracing
> (the kernel side of) the process that did VMXON in the first place and
> we want to see what happens immediately after VMXOFF.

So I got annoyed by the inconsistencies in state depending on timing;
and this mucking about with pretending the event is not scheduled and
shouldn't be counted running when VMX when it bloody well is. It just so
happens it doesn't generate data.

Then I got worried when acme said he needed to kill all his VMs before
using PT; that's just wrong. You should just get no data for the VM but
the rest should very much show up.

And then when I looked at this, I couldn't figure out who would ever
re-enable PT for CPU events. VMXON will clear TraceEn, nobody will ever
set it again, because we generally don't schedule CPU events if we don't
have to.

Hence this proposal that should clear all that up.

> > @@ -1174,10 +1180,12 @@ void intel_pt_interrupt(void)
> > /*
> > * If VMX is on and PT does not support it, don't touch anything.
> > */
> > - if (READ_ONCE(pt->vmx_on))
> > + if (READ_ONCE(pt->vmx_on)) {
> > + WRITE_ONCE(pt->vmx_pmi_pending, 1);
> > return;
> > + }
>
> This is even simpler: we actually need to carry out the first part of
> the interrupt function anyway, which deals with updating buffer pointers
> etc, thus "handling" the PMI, but we don't restart the event, which will
> be then done by the intel_pt_handle_vmx(0), so we don't need the
> pending_pmi thingy.

Indeed! Make it so ;-)

> Now the fake data is worrying me much more. Consider this: we start an
> event while ->vmx_on==1, which means that before we write a fake VMCS
> packet, we need to write a whole bunch of other packets to establish the
> context with synchronization point and kitchen sink (PSB..PSBEND), then
> fake a trace start TIP.PGE, then VMCS, then TIP.PGD. Then, we should
> remember that we did this once to not do it again on every sched-in.
>
> Another corner case is when there's not enough room in the buffer and we
> need to postpone the fake VMCS until there is room again. Let me see if
> there's more.

I suppose the alternative is issuing PERF_RECORD_AUX events indicating
paused output or something.