RE: [PATCH] mm: clear 1G pages with streaming stores on x86

From: Elliott, Robert (Servers)
Date: Mon Mar 30 2020 - 20:41:20 EST

> -----Original Message-----
> From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
> owner@xxxxxxxxxxxxxxx> On Behalf Of Arvind Sankar
> Sent: Wednesday, March 11, 2020 1:33 PM
> To: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
> Cc: Arvind Sankar <nivedita@xxxxxxxxxxxx>; Cannon Matthews
> <cannonmatthews@xxxxxxxxxx>; Matthew Wilcox <willy@xxxxxxxxxxxxx>;
> Andi Kleen <ak@xxxxxxxxxxxxxxx>; Michal Hocko <mhocko@xxxxxxxxxx>;
> Mike Kravetz <mike.kravetz@xxxxxxxxxx>; Andrew Morton <akpm@linux-
> foundation.org>; David Rientjes <rientjes@xxxxxxxxxx>; Greg Thelen
> <gthelen@xxxxxxxxxx>; Salman Qazi <sqazi@xxxxxxxxxx>; linux-
> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx
> Subject: Re: [PATCH] mm: clear 1G pages with streaming stores on x86
>
> On Wed, Mar 11, 2020 at 11:16:07AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Mar 10, 2020 at 11:35:54PM -0400, Arvind Sankar wrote:
> > >
> > > The rationale for MOVNTI instruction is supposed to be that it avoids
> > > cache pollution. Aside from the bench that shows MOVNTI to be faster for
> > > the move itself, shouldn't it have an additional benefit in not trashing
> > > the CPU caches?
> > >
> > > As string instructions improve, why wouldn't the same improvements be
> > > applied to MOVNTI?
> >
> > String instructions are inherently more flexible. The implementation can
> > choose a caching strategy depending on the operation size (cx) and other
> > factors. For example, if the operation is large enough and the cache is
> > full of dirty cache lines that are expensive to free up, it can choose to
> > bypass the cache. MOVNTI is stricter in its semantics and more opaque to
> > the CPU.
>
> But with today's processors, wouldn't writing 1G via the string
> operations empty out almost the whole cache? Or are there already
> optimizations to prevent one thread from hogging the L3?
>
> If we do want to just use the string operations, it seems like the
> clear_page routines should just call memset instead of duplicating
> it.
>

The last time I checked, glibc memcpy() chose non-temporal stores based
on transfer size, L3 cache size, and the number of cores.
For example, with glibc-2.26-16.fc27 (August 2017) on a Broadwell
system with E5-2699 (36 cores, 45 MiB L3 cache), non-temporal stores
only start to be used for transfers above 36 MiB.
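
To make the size-based choice concrete, here is a minimal userspace sketch of
a clear routine that switches to streaming stores only above a threshold. It
is not glibc's or the kernel's code: the names clear_buffer(),
should_use_nontemporal(), and NT_THRESHOLD_BYTES are hypothetical, the 36 MiB
value is just the figure observed above rather than something derived from
the cache topology, and the routine assumes a 16-byte-aligned buffer whose
length is a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed threshold; a real library would compute its own
 * value from cache size and thread count at startup. */
#define NT_THRESHOLD_BYTES ((size_t)36 << 20)

/* Use non-temporal stores only for transfers strictly above the threshold. */
static int should_use_nontemporal(size_t len, size_t threshold)
{
    return len > threshold;
}

/* Zero a buffer, bypassing the cache for large sizes.
 * Assumes dst is 16-byte aligned and len is a multiple of 16. */
static void clear_buffer(void *dst, size_t len, size_t threshold)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);

    if (!should_use_nontemporal(len, threshold)) {
        memset(dst, 0, len);            /* small: let it stay in cache */
        return;
    }

    __m128i zero = _mm_setzero_si128();
    char *p = dst;
    for (size_t i = 0; i < len; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);  /* MOVNTDQ */

    /* Streaming stores are weakly ordered; fence before others read. */
    _mm_sfence();
}
```

Calling clear_buffer(buf, len, NT_THRESHOLD_BYTES) takes the memset() path
for anything up to 36 MiB and the streaming path beyond that, which is the
same shape of decision described above, minus the cache-size and core-count
inputs.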