Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()

From: Nadav Amit
Date: Wed Feb 27 2019 - 13:56:22 EST


> On Feb 27, 2019, at 9:57 AM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>
>> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> Nadav Amit reported that commit:
>>>
>>> b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>>>
>>> added a bunch of constraints to all sorts of code; and while some of
>>> that was correct and desired, some of that seems superfluous.
>>
>> Trivial (but entirely untested) patch attached.
>>
>> That said, I didn't actually check how it affects code generation.
>> Nadav, would you check the code sequences you originally noticed?
>
> The original issue was raised while I was looking into a dropped patch of
> Matthew Wilcox that caused code size increase [1]. As a result I noticed
> that Peter's patch caused big changes to the generated assembly across the
> kernel - I did not have a specific scenario that I cared about.
>
> The patch you sent ("+m/-volatile") does increase the code size by 1728
> bytes. Although code size is not the only metric for "code optimization",
> the original patch of Peter ("volatile") only increased the code size by 201
> bytes. Peter's original change also affected only 72 functions vs 228 that
> are impacted by the new patch.
>
> I'll have a look at some specific function assembly, but overall, the "+m"
> approach might prevent even more code optimizations than the "volatile" one.
>
> I'll send an example or two later.

Here is one example:

Dump of assembler code for function event_filter_pid_sched_wakeup_probe_pre:
0xffffffff8117c510 <+0>: push %rbp
0xffffffff8117c511 <+1>: mov %rsp,%rbp
0xffffffff8117c514 <+4>: push %rbx
0xffffffff8117c515 <+5>: mov 0x28(%rdi),%rax
0xffffffff8117c519 <+9>: mov %gs:0x78(%rax),%dl
0xffffffff8117c51d <+13>: test %dl,%dl
0xffffffff8117c51f <+15>: je 0xffffffff8117c535 <event_filter_pid_sched_wakeup_probe_pre+37>
0xffffffff8117c521 <+17>: mov %rdi,%rax
0xffffffff8117c524 <+20>: mov 0x78(%rdi),%rdi
0xffffffff8117c528 <+24>: mov 0x28(%rax),%rbx # REDUNDANT
0xffffffff8117c52c <+28>: callq 0xffffffff81167830 <trace_ignore_this_task>
0xffffffff8117c531 <+33>: mov %al,%gs:0x78(%rbx)
0xffffffff8117c535 <+37>: pop %rbx
0xffffffff8117c536 <+38>: pop %rbp
0xffffffff8117c537 <+39>: retq

The instruction at 0xffffffff8117c528 is redundant, and does not exist
without the recent patch. It seems to be a result of -fno-strict-aliasing:
because of the new "memory write" ("+m") constraint, the compiler has to
assume the data may have changed and re-reads it.
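
To make the suspected effect easier to reproduce outside the kernel, here is
a minimal userspace sketch. It is not the kernel's percpu code: struct
filter_ctx and all four function names are made up for illustration, and the
asm templates are empty so that only the constraints differ. Comparing the
output of "gcc -O2 -fno-strict-aliasing -S" for the two update functions may
show an extra load of ctx->flag in the "+m" variant, but I have not verified
that this reduced case triggers it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a structure that holds a pointer to a flag. */
struct filter_ctx {
	bool *flag;
};

/*
 * Load *p, then tell the compiler through "+m" that the (empty) asm may
 * also have written *p. The asm itself emits no instruction.
 */
static inline bool read_flag_plus_m(bool *p)
{
	bool v = *p;

	__asm__("" : "+m" (*p));
	return v;
}

/* Same load, but with a read-only "m" input on a volatile asm. */
static inline bool read_flag_volatile(bool *p)
{
	bool v = *p;

	__asm__ volatile("" : : "m" (*p));
	return v;
}

bool update_flag_plus_m(struct filter_ctx *ctx, bool value)
{
	bool old = read_flag_plus_m(ctx->flag);

	/*
	 * With -fno-strict-aliasing the asm "write" to *p may alias
	 * ctx->flag itself, so the compiler may have to load ctx->flag
	 * a second time before this store.
	 */
	*ctx->flag = value;
	return old;
}

bool update_flag_volatile(struct filter_ctx *ctx, bool value)
{
	bool old = read_flag_volatile(ctx->flag);

	/* The asm only reads memory, so the earlier load can be reused. */
	*ctx->flag = value;
	return old;
}
```

Both functions behave identically at run time; any difference is only in the
generated code, which is what the size numbers above are measuring.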