Re: [RFC PATCH v1 5/6] mm: parallelize clear_gigantic_page

From: Dave Hansen
Date: Mon Jul 17 2017 - 12:02:46 EST


On 07/14/2017 03:16 PM, daniel.m.jordan@xxxxxxxxxx wrote:
> Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 cpus, 1T memory
> Test: Clear a range of gigantic pages
> nthread  speedup  size (GiB)  min time (s)  stdev
>       1                  100         41.13   0.03
>       2    2.03x         100         20.26   0.14
>       4    4.28x         100          9.62   0.09
>       8    8.39x         100          4.90   0.05
>      16   10.44x         100          3.94   0.03
...
>       1                  800        434.91   1.81
>       2    2.54x         800        170.97   1.46
>       4    4.98x         800         87.38   1.91
>       8   10.15x         800         42.86   2.59
>      16   12.99x         800         33.48   0.83

What was the actual test here? Did you just use sysfs to allocate 800GB
of 1GB huge pages?
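
For concreteness, the sysfs route asked about here could look roughly like the sketch below. It is not the test from the patch posting, which is not described in the quoted text. The helper names and the `sysfs_root` parameter are made up for illustration (the parameter only exists so the code can be pointed at a scratch directory); the `nr_hugepages` file under `hugepages-1048576kB` is the standard hugetlb knob for 1GB pages on x86-64.

```python
from pathlib import Path

GIGANTIC_KB = 1048576  # 1 GiB in kB, as it appears in the sysfs directory name


def nr_hugepages_path(size_kb=GIGANTIC_KB, sysfs_root="/sys"):
    """Path of the global pool-size knob for huge pages of the given size."""
    return (Path(sysfs_root) / "kernel/mm/hugepages"
            / f"hugepages-{size_kb}kB" / "nr_hugepages")


def request_hugepages(count, size_kb=GIGANTIC_KB, sysfs_root="/sys"):
    """Ask for `count` pages in the pool, then read back what was granted.

    Needs root on a real system. The kernel may grant fewer pages than
    requested, so the value read back is the one to trust.
    """
    path = nr_hugepages_path(size_kb, sysfs_root)
    path.write_text(f"{count}\n")
    return int(path.read_text())
```

Calling `request_hugepages(800)` as root would correspond to the 800GB case in the table, assuming that is how the pages were allocated.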

This test should be entirely memory-bandwidth-limited, right? Are you
contending here that a single core can only use 1/10th of the memory
bandwidth when clearing a page?
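
To put numbers on that, the quoted times can be turned into an implied clearing rate. This is only arithmetic on the table above, and it assumes the reported time is dominated by page clearing, which the quoted text does not confirm. The function name is illustrative.

```python
def clear_rate_gib_per_s(size_gib, seconds):
    """Implied clearing throughput: bytes cleared divided by wall-clock time."""
    return size_gib / seconds


# (nthread, size in GiB, min time in s), copied from the quoted table
ROWS = [
    (1, 100, 41.13),
    (16, 100, 3.94),
    (1, 800, 434.91),
    (16, 800, 33.48),
]

if __name__ == "__main__":
    for nthread, size_gib, seconds in ROWS:
        rate = clear_rate_gib_per_s(size_gib, seconds)
        print(f"{nthread:2d} threads, {size_gib} GiB: {rate:.2f} GiB/s")
```

For the 100 GiB case this works out to roughly 2.4 GiB/s with one thread against roughly 25 GiB/s with 16.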

Or, does all the gain here come because we are round-robin-allocating
the pages across all 8 NUMA nodes' memory controllers and the speedup
here is because we're not doing the clearing across the interconnect?
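
One way to check the placement half of that question is to look at how the pool is spread across nodes. The sketch below reads the per-node hugetlb counters from sysfs. The function name and the `sysfs_root` parameter are assumptions for illustration; the per-node `nr_hugepages` files are the standard ones under `/sys/devices/system/node/`.

```python
from pathlib import Path


def per_node_hugepages(size_kb=1048576, sysfs_root="/sys"):
    """Return {node_id: pages in the pool on that node} for one huge page size."""
    node_dir = Path(sysfs_root) / "devices/system/node"
    counts = {}
    for node in node_dir.glob("node[0-9]*"):
        counter = node / "hugepages" / f"hugepages-{size_kb}kB" / "nr_hugepages"
        if counter.exists():
            # Directory names look like "node0", "node10"; strip the prefix.
            counts[int(node.name[4:])] = int(counter.read_text())
    return dict(sorted(counts.items()))
```

An even split across all eight nodes would be consistent with round-robin allocation, but it would not by itself show which CPUs did the clearing.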