Re: [RFC PATCH] perf/core: don't sample kernel regs upon skid

From: Peter Zijlstra
Date: Mon Jul 02 2018 - 11:47:11 EST


On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:
> Users can request that general purpose registers, instruction pointer,
> etc, are sampled when a perf event counter overflows. To try to avoid
> this resulting in kernel state being leaked, unprivileged users are
> usually forbidden from opening events which count while the kernel is
> running.
>
> Unfortunately, this is not sufficient to avoid leading kernel state.

'leaking' surely.

>
> For various reasons, there can be a delay between the overflow occurring
> and the resulting overflow exception (e.g. an NMI) being taken. During
> this window, other instructions may be executed, resulting in skid.
>
> This skid means that a userspace-only event overflowing may result in an
> exception being taken *after* entry to the kernel, allowing kernel
> registers to be sampled. Depending on the amount of skid, this may only
> leak the PC (breaking KASLR), or it may leak secrets which are currently
> live in GPRs.
>
> Let's avoid this by only sampling from the user registers when an event
> is supposed to exclude the kernel, providing the illusion that the
> overflow exception is taken from userspace.
>
> We also have similar cases when sampling a guest, where we get the host
> regs in some cases. It's not entirely clear to me how we should handle
> these.

Would not a similar:

if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? */ &&
perf_guest_cbs && !perf_guest_cbs->is_in_guest())
return perf_guest_cbs->guest_pt_regs();

work there? Of course, perf_guest_info_callbacks is currently lacking
that guest_pt_regs() thingy..

>
> Signed-off-by: Mark Rutland <mark.rutland@xxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Jin Yao <yao.jin@xxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> ---
> kernel/events/core.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8f0434a9951a..2ab2548b2e66 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
> return callchain ?: &__empty_callchain;
> }
>
> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,
> + struct pt_regs *regs)
> +{
> + /*
> + * Due to interrupt latency (AKA "skid"), we may enter the kernel
> + * before taking an overflow, even if the PMU is only counting user
> + * events.
> + *
> + * If we're not counting kernel events, always use the user regs when
> + * sampling.
> + *
> + * TODO: what do we do about sampling a guest's registers? The IP is
> + * special-cased, but for the rest of the regs they'll get the
> + * user/kernel regs depending on whether exclude_kernel is set, which
> + * is nonsensical.
> + *
> + * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU
> + * can't provide the GPRs), so should we just zero the GPRs when in a
> + * guest? Or skip outputting the regs in perf_output_sample?

Seems daft Xen cannot provide registers; why is that? Boris?

> + */
> + if (event->attr.exclude_kernel && !user_mode(regs))
> + return task_pt_regs(current);
> +
> + return regs;
> +}
> +
> void perf_prepare_sample(struct perf_event_header *header,
> struct perf_sample_data *data,
> struct perf_event *event,
> @@ -6368,6 +6394,8 @@ void perf_prepare_sample(struct perf_event_header *header,
> {
> u64 sample_type = event->attr.sample_type;
>
> + regs = perf_get_sample_regs(event, regs);
> +
> header->type = PERF_RECORD_SAMPLE;
> header->size = sizeof(*header) + event->header_size;

In any case ACK for this thing.