Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

From: Alexey Budankov
Date: Wed Sep 12 2018 - 04:27:42 EST



Hi,

On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means
>>> incorporating some POSIX API implementation into the tool, avoiding
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good
>> starting point.
>
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffers allocation, needs to be taken into account by the
effective solution. Luckily data losses hasn't been observed when testing
matrix multiplication on 96 core dual socket machines.

>
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in HW, especially on larger scale.

>

Thanks,
Alexey