Re: [PATCH 0/2] arm64: Introduce boot parameter to disable TLB flush instruction within the same inner shareable domain

From: qi.fuli@xxxxxxxxxxx
Date: Tue Jul 02 2019 - 22:45:58 EST


Hi Will,

Thanks for your comments.

On 6/27/19 7:27 PM, Will Deacon wrote:
> On Mon, Jun 24, 2019 at 10:34:02AM +0000, qi.fuli@xxxxxxxxxxx wrote:
>> On 6/18/19 2:03 AM, Will Deacon wrote:
>>> On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
>>>> From: Takao Indoh <indou.takao@xxxxxxxxxxx>
>>>>
>>>> I found a performance issue related on the implementation of Linux's TLB
>>>> flush for arm64.
>>>>
>>>> When I run a single-threaded test program on moderate environment, it
>>>> usually takes 39ms to finish its work. However, when I put a small
>>>> apprication, which just calls mprotest() continuously, on one of sibling
>>>> cores and run it simultaneously, the test program slows down significantly.
>>>> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
>>>> ThunderX1 and Fujitsu A64FX.
>>> This is a problem for any applications that share hardware resources with
>>> each other, so I don't think it's something we should be too concerned about
>>> addressing unless there is a practical DoS scenario, which there doesn't
>>> appear to be in this case. It may be that the real answer is "don't call
>>> mprotect() in a loop".
>> I think there has been a misunderstanding, please let me explain.
>> This application is just an example using for reproducing the
>> performance issue we found.
>> Our original purpose is reducing OS jitter by this series.
>> The OS jitter on massively parallel processing systems have been known
>> and studied for many years.
>> The 2.5% OS jitter can result in over a factor of 20 slowdown for the
>> same application [1].
> I think it's worth pointing out that the system in question was neither
> ARM-based nor running Linux, so I'd be cautious in applying the conclusions
> of that paper directly to our TLB invalidation code. Furthermore, the noise
> being generated in their experiments uses a timer interrupt, which has a
> /vastly/ different profile to a DVM message in terms of both system impact
> and frequency.
My original purpose was to explain that the OS jitter is a vital issue for
large-scale HPC environment by referencing this paper.
Please allow me to introduce the issue that had occurred to our HPC
environment.
We used FWQ [1] to do an experiment on 1 node of our HPC environment,
we expected it would be tens of microseconds of maximum OS jitter, but
it was
hundreds of microseconds, which didn't meet our requirement. We tried to
find
out the cause by using ftrace, but we cannot find any processes which would
cause noise and only knew the extension of processing time. Then we
confirmed
the CPU instruction count through CPU PMU, we also didn't find any changes.
However, we found that with the increase of that the TLB flash was called,
the noise was also increasing. Here we understood that the cause of this
issue
is the implementation of Linux's TLB flush for arm64, especially use of
TLBI-is
instruction which is a broadcast to all processor core on the system.
Therefore,
we made this patch set to fix this issue. After testing for several
times, the
noise was reduced and our original goal was achieved, so we do think
this patch
makes sense.

As I mentioned, the OS jitter is a vital issue for large-scale HPC
environment.
We tried a lot of things to reduce the OS jitter. One of them is task
separation
between the CPUs which are used for computing and the CPUs which are
used for
maintenance. All of the daemon processes and I/O interrupts are bounden
to the
maintenance CPUs. Further more, we used nohz_full to avoid the noise
caused by
computing CPU interruption, but all of the CPUs were affected by TLBI-is
instruction, the task separation of CPUs didn't work. Therefore, we
would like
to implement that TLB flush is done on minimal CPUs to reducing the OS
jitter
by using this patch set.

[1] https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Thanks,
QI Fuli

>> Though it may be an extreme example, reducing the OS jitter has been an
>> issue in HPC environment.
>>
>> [1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell.
>> "Characterizing application sensitivity to OS interference using
>> kernel-level noise injection." Proceedings of the 2008 ACM/IEEE
>> conference on Supercomputing. IEEE Press, 2008.
>>
>>>> I suppose the root cause of this issue is the implementation of Linux's TLB
>>>> flush for arm64, especially use of TLBI-is instruction which is a broadcast
>>>> to all processor core on the system. In case of the above situation,
>>>> TLBI-is is called by mprotect().
>>> On the flip side, Linux is providing the hardware with enough information
>>> not to broadcast to cores for which the remote TLBs don't have entries
>>> allocated for the ASID being invalidated. I would say that the root cause
>>> of the issue is that this filtering is not taking place.
>> Do you mean that the filter should be implemented in hardware?
> Yes. If you're building a large system and you care about "jitter", then
> you either need to partition it in such a way that sources of noise are
> contained, or you need to introduce filters to limit their scope. Rewriting
> the low-level memory-management parts of the operating system is a red
> herring and imposes a needless burden on everybody else without solving
> the real problem, which is that contended use of shared resources doesn't
> scale.
>
> Will