Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

From: Mathieu Desnoyers
Date: Thu Apr 30 2020 - 12:16:24 EST


----- On Apr 30, 2020, at 12:11 PM, rostedt <rostedt@xxxxxxxxxxx> wrote:

> On Thu, 30 Apr 2020 16:11:21 +0200
> Joerg Roedel <jroedel@xxxxxxx> wrote:
>
>> Hi,
>>
>> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
>> > Talking with Mathieu about this on IRC, he pointed out that my code does
>> > have a vzalloc() that is called:
>> >
>> > in trace_pid_write()
>> >
>> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>> >
>> > This is done when -P1,2 is on the trace-cmd command line.
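
(For scale, assuming the default pid_max of 32768, that allocation works
out to:

        (pid_max + 7) >> 3 = (32768 + 7) >> 3 = 4096 bytes

i.e. a single page, but vzalloc() still maps it in the vmalloc area, so
the mapping may be missing from other page tables.)
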
>>
>> Okay, I tracked it down; some instrumentation in the page-fault and
>> double-fault handlers gave me the stack traces. Here is what happens:
>>
>> As already pointed out, it all happens because of page-faults on the
>> vzalloc'ed pid bitmap. It starts with this stack-trace:
>>
>> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
>
> Interesting. Because that function is this:
>
> bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
> {
>         struct trace_array *tr = trace_file->tr;
>         struct trace_array_cpu *data;
>         struct trace_pid_list *no_pid_list;
>         struct trace_pid_list *pid_list;
>
>         pid_list = rcu_dereference_raw(tr->filtered_pids);
>         no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);
>
>         if (!pid_list && !no_pid_list)
>                 return false;
>
>         data = this_cpu_ptr(tr->array_buffer.data);
>
>         return data->ignore_pid;
> }
>
> Where it only checks whether the pid masks exist. That is, it looks for
> pointers to them; it doesn't actually touch the vmalloc'd area. This
> check handles a race between allocating and deallocating the buffers and
> setting the ignore_pid bit. The arrays themselves are read at
> sched_switch time, which sets or clears the ignore_pid field.
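
If I read the code correctly, the sched_switch side looks roughly like
this (a simplified sketch from memory, not the exact code; the real probe
also deals with the no_pid list and the prev/next combinations):

        /* sched_switch probe, simplified */
        pid_list = rcu_dereference_sched(tr->filtered_pids);

        this_cpu_write(tr->array_buffer.data->ignore_pid,
                       trace_ignore_this_task(pid_list, prev) &&
                       trace_ignore_this_task(pid_list, next));

That is the path which actually dereferences the vmalloc'd pid bitmap,
and it writes the per-cpu ignore_pid field that
trace_event_ignore_this_pid() later reads.
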
>
> That said, this only happens on buffer instances; it does not trigger
> on the top level instance, which uses the same code for the pid masks.
>
> Could this possibly be the tr->array_buffer.data, which is allocated
> with:
>
> allocate_trace_buffer() {
>         [..]
>         buf->data = alloc_percpu(struct trace_array_cpu);
>
> That is, the bug isn't the vmalloc being a problem, but perhaps the
> per_cpu allocation. This would explain why this crashes with the buffer
> instance and not with the top level instance. If it were related to the
> pid masks, it would trigger for either one (both are allocated the same
> way, at time of use). But the tr->array_buffer.data is created when an
> instance is made. For the top level that happens at boot up, and the
> pages would have been synced long ago. But for a newly created instance,
> it happens just before the instance is used. This could explain why it's
> not a problem when doing it manually by hand: the time between creating
> the instance and starting and stopping the tracing is long enough for
> something to sync the page tables.
>
> tl;dr; It's not an issue with the vmalloc, it's an issue with per_cpu
> allocations!
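
One way to confirm which allocation is faulting would be to print both
candidate addresses when the instance is created and compare them with
the faulting address from Joerg's instrumentation. Something like this
(hypothetical debug hack, not meant for merging):

        /* in trace_pid_write(), after the vzalloc() */
        pr_info("pid bitmap @ %px (+%d bytes)\n",
                pid_list->pids, (pid_list->pid_max + 7) >> 3);

        /* in allocate_trace_buffer(), after alloc_percpu() */
        pr_info("percpu trace_array_cpu (cpu0) @ %px\n",
                per_cpu_ptr(buf->data, 0));

If the faulting address falls within the per-cpu area rather than the
pid bitmap, that would back up the per_cpu theory.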

Did I mention that alloc_percpu uses:

static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
{
        if (WARN_ON_ONCE(!slab_is_available()))
                return NULL;

        if (size <= PAGE_SIZE)
                return kzalloc(size, gfp);
        else
                return __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL);
}

So yeah, it's vmalloc'd memory when size > PAGE_SIZE.
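
Which also suggests a band-aid we could use to test the theory: sync the
vmalloc mappings right after the per-cpu buffers of a new instance are
allocated. A rough sketch (assuming vmalloc_sync_mappings() is the right
interface for this; just to test, not a proposed fix):

        buf->data = alloc_percpu(struct trace_array_cpu);
        if (!buf->data)
                return -ENOMEM;

        /*
         * Make sure the (possibly vmalloc-backed) per-cpu mapping is
         * synced into all page tables before the new instance is used
         * from a context that cannot handle a vmalloc fault.
         */
        vmalloc_sync_mappings();

If the crash goes away with that in place, it would confirm the missing
page-table synchronization on the per_cpu allocation.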

Thanks,

Mathieu




--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com