Re: [PATCH v6] arm64: implement ftrace with regs

From: Julien Thierry
Date: Wed Jan 16 2019 - 13:01:08 EST




On 16/01/2019 15:56, Julien Thierry wrote:
> On 14/01/2019 12:26, Mark Rutland wrote:
>> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>>> On Fri, Jan 04, 2019 at 05:50:18PM +0000, Mark Rutland wrote:
>>>> Hi Torsten,
>>>>
>>>> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
>>>>> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
>>>>> of each function. Replace the first NOP thus generated with a quick LR
>>>>> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
>>>>> to ftrace, does not clobber the value. Ftrace will then generate the
>>>>> standard stack frames.
>>>
>>> Do we know what the overhead would be, if this was a link time change
>>> for the first instruction?
>>
>> No, but it should be possible to benchmark that for a given workload,
>> which is what I'd like to see.
>>
>
> So, I hacked up something to have the -fpatchable-function-entry=2 in the
> build and then have ftrace_init() patch in the "mov x9, lr" in the first
> nop of the function preludes.
>
> I tested it on an 8 x Cortex-A57 machine and compared with a version
> that just has the two nops in the function prelude.
>
> On workloads like hackbench, the average difference is within the noise
> (<1%). Time results below are in seconds.
>
>         +------------+-------------------+
>         | "nop; nop" | "mov x9, lr; nop" |
>         +------------+-------------------+
>         | 43.497     | 42.694            |
>         | 43.464     | 43.148            |
>         | 43.599     | 43.131            |
>         | 43.785     | 43.63             |
>         | 43.458     | 43.281            |
>         | 44.3       | 43.328            |
>         | 43.541     | 43.059            |
>         | 43.529     | 43.298            |
>         | 43.58      | 43.937            |
>         | 43.385     | 43.122            |
>         | 43.514     | 43.825            |
>         | 45.508     | 43.268            |
>         | 43.757     | 43.316            |
>         | 43.392     | 43.146            |
>         | 44.029     | 43.236            |
>         | 43.515     | 43.139            |
>         | 43.22      | 43.108            |
>         | 43.496     | 43.836            |
>         | 43.669     | 43.083            |
>         | 43.388     | 43.38             |
>         +------------+-------------------+
> average | 43.6813    | 43.29825          |
>         +------------+-------------------+
>
Here are also some results from running hackbench on 4 x Cortex-A53. Pay
no attention to the fact that the timescales are similar to the A57 run:
I changed the number of iterations done by hackbench so it wouldn't take
too long. Times are again in seconds.

        +------------+-------------------+
        | "nop; nop" | "mov x9, lr; nop" |
        +------------+-------------------+
        | 43.815     | 44.455            |
        | 43.758     | 45.173            |
        | 44.075     | 43.95             |
        | 44.021     | 44.185            |
        | 43.959     | 44.826            |
        | 44.039     | 44.478            |
        | 43.836     | 44.626            |
        | 44.071     | 45.177            |
        | 43.619     | 45.033            |
        | 44.052     | 45.095            |
        | 43.903     | 44.802            |
        | 43.773     | 44.955            |
        | 43.908     | 45.02             |
        | 43.441     | 44.986            |
        | 44.167     | 45.182            |
        | 44.106     | 45.229            |
        | 43.974     | 45.07             |
        | 43.859     | 45.283            |
        | 43.706     | 44.892            |
        | 43.897     | 44.194            |
        +------------+-------------------+
average | 43.899     | 44.835            |
        +------------+-------------------+


So, in this case performance takes a ~2% hit from keeping the mov
always present in the function prelude instead of a nop.
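
For anyone who wants to check the percentages, here is a minimal Python
sketch that derives the relative change from the averages reported in
the two tables. It is not part of the measurements themselves, and the
helper name relative_change is made up for illustration.

```python
def relative_change(baseline, patched):
    """Return the change of `patched` relative to `baseline`, in percent."""
    return (patched - baseline) / baseline * 100.0


# Averages in seconds, copied as reported from the two tables in this thread.
# baseline = "nop; nop", patched = "mov x9, lr; nop"
a57 = relative_change(43.6813, 43.29825)  # 8 x Cortex-A57
a53 = relative_change(43.899, 44.835)     # 4 x Cortex-A53

# A positive value means the "mov x9, lr; nop" build was slower.
print(f"Cortex-A57: {a57:+.2f}%")
print(f"Cortex-A53: {a53:+.2f}%")
```

With those averages the A57 change stays below 1% in magnitude (and is
in fact slightly negative), while the A53 change comes out a little
above 2%.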

That makes it a bit less obvious whether always having that mov there
(whether patched at build time or at run time) is good enough.

Cheers,

--
Julien Thierry