Re: [PATCH v2] memcpy_flushcache: use cache flushing for larger lengths

From: Dan Williams
Date: Tue Mar 31 2020 - 17:19:47 EST


[ add x86 and LKML ]

On Tue, Mar 31, 2020 at 5:27 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
>
>
> On Tue, 31 Mar 2020, Elliott, Robert (Servers) wrote:
>
> >
> >
> > > -----Original Message-----
> > > From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> > > Sent: Monday, March 30, 2020 6:32 AM
> > > To: Dan Williams <dan.j.williams@xxxxxxxxx>; Vishal Verma
> > > <vishal.l.verma@xxxxxxxxx>; Dave Jiang <dave.jiang@xxxxxxxxx>; Ira
> > > Weiny <ira.weiny@xxxxxxxxx>; Mike Snitzer <msnitzer@xxxxxxxxxx>
> > > Cc: linux-nvdimm@xxxxxxxxxxxx; dm-devel@xxxxxxxxxx
> > > Subject: [PATCH v2] memcpy_flushcache: use cache flushing for larger
> > > lengths
> > >
> > > I tested dm-writecache performance on a machine with Optane nvdimm
> > > and it turned out that for larger writes, cached stores + cache
> > > flushing perform better than non-temporal stores. This is the
> > > throughput of dm-writecache measured with this command:
> > > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> > >
> > > block size    512        1024       2048       4096
> > > movnti        496 MB/s   642 MB/s   725 MB/s   744 MB/s
> > > clflushopt    373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
> > >
> > > We can see that for smaller blocks, movnti performs better, but for
> > > larger blocks, clflushopt has better performance.
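
[ Aside for anyone following the thread: the two rows above boil down
to roughly the sketch below. This is a simplified user-space
illustration with intrinsics, not the kernel code; alignment and tail
handling are omitted and the 8-byte store width is just an assumption
for the example. ]

/*
 * Simplified sketch of the two strategies compared in the table.
 * Assumes 8-byte-aligned buffers and a 64-byte-multiple length;
 * build with e.g. -mclflushopt.
 */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* "movnti" row: non-temporal 8-byte stores that bypass the cache */
static void copy_nt(void *dst, const void *src, size_t n)
{
	long long *d = dst;
	const long long *s = src;
	size_t i;

	for (i = 0; i < n / 8; i++)
		_mm_stream_si64(&d[i], s[i]);
	_mm_sfence();				/* order the NT stores */
}

/* "clflushopt" row: cached stores, then flush each 64-byte line */
static void copy_cached_flush(void *dst, const void *src, size_t n)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	size_t off, i;

	for (off = 0; off < n; off += 64) {
		for (i = 0; i < 8; i++)		/* fill one cache line */
			d[off / 8 + i] = s[off / 8 + i];
		_mm_clflushopt((char *)dst + off); /* write back + invalidate */
	}
	_mm_sfence();				/* order the flushes */
}
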
> >
> > There are other interactions to consider... see threads from the last
> > few years on the linux-nvdimm list.
>
> dm-writecache is the only Linux driver that uses memcpy_flushcache on
> persistent memory. There is also the btt driver; it uses the "do_io"
> method to write to persistent memory, and I don't know where this
> method comes from.
>
> Anyway, if patching memcpy_flushcache conflicts with something else, we
> should introduce memcpy_flushcache_to_pmem.
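
[ If a separate helper did turn out to be the way to go, I picture
something along these lines. This is only a sketch: the cutover size
is a made-up placeholder that would need to come from benchmarking,
and it leans on the existing arch_wb_cache_pmem() helper for the
per-line writeback. ]

#include <linux/string.h>
#include <linux/libnvdimm.h>

/* hypothetical cutover between NT stores and cached stores + writeback */
#define FLUSHCACHE_NT_MAX	2048	/* placeholder, not a measured value */

static void memcpy_flushcache_to_pmem(void *dst, const void *src, size_t n)
{
	if (n <= FLUSHCACHE_NT_MAX) {
		/* small copies: keep the current movnti-based path */
		memcpy_flushcache(dst, src, n);
	} else {
		/* large copies: cached stores, then clwb/clflushopt each line */
		memcpy(dst, src, n);
		arch_wb_cache_pmem(dst, n);
	}
	/* callers still issue the usual sfence/wmb before relying on durability */
}
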
>
> > For example, software generally expects that read()s take a long time and
> > avoids re-reading from disk; the normal pattern is to hold the data in
> > memory and read it from there. By using normal stores, CPU caches end up
> > holding a bunch of persistent memory data that is probably not going to
> > be read again any time soon, bumping out more useful data. In contrast,
> > movnti avoids filling the CPU caches.
>
> But if I write one cacheline and flush it immediately, it would consume
> just one associative entry in the cache.
>
> > Another option is the AVX vmovntdq instruction (if available), the
> > most recent of which does 64-byte (cache line) sized transfers to
> > zmm registers. There's a hefty context switching overhead (e.g.,
> > 304 clocks), and the CPU often runs AVX instructions at a slower
> > clock frequency, so it's hard to judge when it's worthwhile.
>
> The benchmark shows that 64-byte non-temporal avx512 vmovntdq is as good
> as 8-, 16- or 32-byte writes.
>                                           ram        nvdimm
> sequential write-nt 4 bytes               4.1 GB/s   1.3 GB/s
> sequential write-nt 8 bytes               4.1 GB/s   1.3 GB/s
> sequential write-nt 16 bytes (sse)        4.1 GB/s   1.3 GB/s
> sequential write-nt 32 bytes (avx)        4.2 GB/s   1.3 GB/s
> sequential write-nt 64 bytes (avx512)     4.1 GB/s   1.3 GB/s
>
> With cached writes (where each cache line is immediately followed by clwb
> or clflushopt), 8-, 16- or 32-byte writes perform better than non-temporal
> stores, and avx512 performs worse.
>
> sequential write 8 + clwb                 5.1 GB/s   1.6 GB/s
> sequential write 16 (sse) + clwb          5.1 GB/s   1.6 GB/s
> sequential write 32 (avx) + clwb          4.4 GB/s   1.5 GB/s
> sequential write 64 (avx512) + clwb       1.7 GB/s   0.6 GB/s
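
[ To make sure I'm reading the 64-byte rows the same way you are, I
take them to be roughly the following, a sketch with intrinsics that
assumes aligned, line-multiple buffers (-mavx512f, -mclwb): ]

#include <immintrin.h>
#include <stddef.h>

/* "write-nt 64 bytes (avx512)": one vmovntdq per line, no cache fill */
static void copy_nt_avx512(void *dst, const void *src, size_t n)
{
	size_t off;

	for (off = 0; off < n; off += 64)
		_mm512_stream_si512((__m512i *)((char *)dst + off),
				    _mm512_load_si512((const char *)src + off));
	_mm_sfence();
}

/* "write 64 (avx512) + clwb": cached 64-byte store, then write the line back */
static void copy_avx512_clwb(void *dst, const void *src, size_t n)
{
	size_t off;

	for (off = 0; off < n; off += 64) {
		_mm512_store_si512((__m512i *)((char *)dst + off),
				   _mm512_load_si512((const char *)src + off));
		_mm_clwb((char *)dst + off);	/* write back, keep line cached */
	}
	_mm_sfence();
}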

This is indeed compelling straight-line data. My concern, similar to
Robert's, is what it does to the rest of the system. In addition to
increasing cache pollution, which I agree is difficult to quantify, it
may also increase read-for-ownership traffic. Could you collect 'perf
stat' data for this clwb vs nt comparison to check whether any of that
incidental overhead shows up in the numbers? Here is a 'perf stat'
line that might capture it.

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses \
    -r 5 $benchmark

In both cases, nt stores and explicit clwb, there's nothing that
prevents the dirty cacheline, or the fill buffer, from being written
back / flushed before the full line is populated, and maybe you are
hitting that scenario differently with the two approaches? I did not
immediately see a perf counter for events like that. Going forward I
think this gets better with the movdir64b instruction, because that
can guarantee full-line-sized store-buffer writes.
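
[ For reference, with compiler and hardware support the movdir64b path
could look roughly like the sketch below; the _movdir64b() intrinsic
needs -mmovdir64b, and the destination must be 64-byte aligned. ]

#include <immintrin.h>
#include <stddef.h>

/*
 * MOVDIR64B always issues a single, full 64-byte direct store, so a
 * partially filled line can never be written out on its own.
 */
static void copy_movdir64b(void *dst, const void *src, size_t n)
{
	size_t off;

	for (off = 0; off < n; off += 64)
		_movdir64b((char *)dst + off, (const char *)src + off);
	_mm_sfence();	/* direct stores are weakly ordered */
}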

Maybe the perf data can help make a decision about whether we go with
your patch in the near term?

>
>
> > In user space, glibc faces similar choices for its memcpy() functions;
> > glibc memcpy() uses non-temporal stores for transfers > 75% of the
> > L3 cache size divided by the number of cores. For example, with
> > glibc-2.26-16.fc27 (August 2017), on a Broadwell system with an
> > E5-2699 (36 cores, 45 MiB L3 cache), non-temporal stores are used
> > for memcpy()s over 36 MiB.
>
> BTW, what does glibc do with reads? Does it flush them from the cache
> after they are consumed?
>
> AFAIK glibc doesn't support persistent memory - i.e. there is no function
> that flushes data, and the user has to use inline assembly for that.

Yes, and I don't know of any copy routines that try to limit the cache
pollution of pulling the source data for a copy, only the destination.

> > It'd be nice if glibc, PMDK, and the kernel used the same algorithms.

Yes, it would, although I think PMDK would make a different decision
than the kernel: it can optimize for the highest bandwidth for the
local application, while the kernel has to care about bandwidth
efficiency across all applications.