RE: [PATCH v3] percpu-internal/pcpu_chunk: Re-layout pcpu_chunk structure to reduce false sharing

From: Ma, Yu
Date: Tue Jun 13 2023 - 13:41:39 EST



> Hi Andrew,
>
> On Mon, Jun 12, 2023 at 02:43:31PM -0700, Andrew Morton wrote:
> > On Fri, 9 Jun 2023 23:07:30 -0400 Yu Ma <yu.ma@xxxxxxxxx> wrote:
> >
> > > When running UnixBench/Execl throughput case, false sharing is
> > > observed due to frequent read on base_addr and write on free_bytes,
> chunk_md.
> > >
> > > UnixBench/Execl represents a class of workload where bash scripts
> > > are spawned frequently to do some short jobs. It will do system call
> > > on execl frequently, and execl will call mm_init to initialize
> > > mm_struct of the process. mm_init will call __percpu_counter_init
> > > for percpu_counters initialization. Then pcpu_alloc is called to
> > > read the base_addr of pcpu_chunk for memory allocation. Inside
> > > pcpu_alloc, it will call pcpu_alloc_area to allocate memory from a
> specified chunk.
> > > This function will update "free_bytes" and "chunk_md" to record the
> > > rest free bytes and other meta data for this chunk. Correspondingly,
> > > pcpu_free_area will also update these 2 members when free memory.
> > > Call trace from perf is as below:
> > > + 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
> > > + 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
> > > - 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
> > > - 53.54% 0x654278696e552f34
> > > main
> > > __execve
> > > entry_SYSCALL_64_after_hwframe
> > > do_syscall_64
> > > __x64_sys_execve
> > > do_execveat_common.isra.47
> > > alloc_bprm
> > > mm_init
> > > __percpu_counter_init
> > > pcpu_alloc
> > > - __mutex_lock.isra.17
> > >
> > > In current pcpu_chunk layout, ‘base_addr’ is in the same cache line
> > > with ‘free_bytes’ and ‘chunk_md’, and ‘base_addr’ is at the last 8
> > > bytes. This patch moves ‘bound_map’ up to ‘base_addr’, to let
> > > ‘base_addr’ locate in a new cacheline.
> > >
> > > With this change, on Intel Sapphire Rapids 112C/224T platform, based
> > > on v6.4-rc4, the 160 parallel score improves by 24%.
> >
> > Well that's nice.
> >
> > >
> > > ...
> > >
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -41,10 +41,17 @@ struct pcpu_chunk {
> > > struct list_head list; /* linked to pcpu_slot lists */
> > > int free_bytes; /* free bytes in the chunk */
> > > struct pcpu_block_md chunk_md;
> > > - void *base_addr; /* base address of this chunk
> */
> > > + unsigned long *bound_map; /* boundary map */
> > > +
> > > + /*
> > > + * base_addr is the base address of this chunk.
> > > + * To reduce false sharing, current layout is optimized to make sure
> > > + * base_addr locate in the different cacheline with free_bytes and
> > > + * chunk_md.
> > > + */
> > > + void *base_addr ____cacheline_aligned_in_smp;
> > >
> > > unsigned long *alloc_map; /* allocation map */
> > > - unsigned long *bound_map; /* boundary map */
> > > struct pcpu_block_md *md_blocks; /* metadata blocks */
> > >
> > > void *data; /* chunk data */
> >
> > This will of course consume more memory. Do we have a feel for the
> > worst-case impact of this?
> >
>
> The pcpu_chunk struct is a backing data structure per chunk, so the
> additional memory should not be dramatic. A chunk covers ballpark between
> 64kb and 512kb memory depending on some config and boot time stuff, so I
> believe the additional memory used here is nominal at best.
>
> Working the #s on my desktop:
> Percpu: 58624 kB
> 28 cores -> ~2.1MB of percpu memory.
> At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
> Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB of
> overhead?
>
> I believe we can do a little better to avoid eating that full padding, so likely
> less than that.
>
> Acked-by: Dennis Zhou <dennis@xxxxxxxxxx>
>

Thanks Andrew and Dennis for agreement on the patch.
The layout of this structure (printed by pahole) before and after this patch is as below for reference.
The default size is 136 Bytes with 3 cachelines. With patch v3, it is 192 Bytes with 56 extra padding.
For "____cacheline_aligned_in_smp ", initially it was not added with the same concern on memory,
as it can obtain the same performance gain with reshuffle on base_addr only. Thanks to Dennis with
expertise on the overall usage, it is added to be more clear and bring convenience for future changes.

--default v6.4-rc4--
struct pcpu_chunk {
struct list_head list; /* 0 16 */
int free_bytes; /* 16 4 */
struct pcpu_block_md chunk_md; /* 20 32 */

/* XXX 4 bytes hole, try to pack */

void * base_addr; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
long unsigned int * alloc_map; /* 64 8 */
long unsigned int * bound_map; /* 72 8 */
struct pcpu_block_md * md_blocks; /* 80 8 */
void * data; /* 88 8 */
bool immutable; /* 96 1 */
bool isolated; /* 97 1 */

/* XXX 2 bytes hole, try to pack */

int start_offset; /* 100 4 */
int end_offset; /* 104 4 */

/* XXX 4 bytes hole, try to pack */

struct obj_cgroup * * obj_cgroups; /* 112 8 */
int nr_pages; /* 120 4 */
int nr_populated; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
int nr_empty_pop_pages; /* 128 4 */

/* XXX 4 bytes hole, try to pack */

long unsigned int populated[]; /* 136 0 */

/* size: 136, cachelines: 3, members: 17 */
/* sum members: 122, holes: 4, sum holes: 14 */
/* last cacheline: 8 bytes */
};

--with patch v3--
struct pcpu_chunk {
struct list_head list; /* 0 16 */
int free_bytes; /* 16 4 */
struct pcpu_block_md chunk_md; /* 20 32 */

/* XXX 4 bytes hole, try to pack */

long unsigned int * bound_map; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
void * base_addr; /* 64 8 */
long unsigned int * alloc_map; /* 72 8 */
struct pcpu_block_md * md_blocks; /* 80 8 */
void * data; /* 88 8 */
bool immutable; /* 96 1 */
bool isolated; /* 97 1 */

/* XXX 2 bytes hole, try to pack */

int start_offset; /* 100 4 */
int end_offset; /* 104 4 */

/* XXX 4 bytes hole, try to pack */

struct obj_cgroup * * obj_cgroups; /* 112 8 */
int nr_pages; /* 120 4 */
int nr_populated; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
int nr_empty_pop_pages; /* 128 4 */

/* XXX 4 bytes hole, try to pack */

long unsigned int populated[]; /* 136 0 */

/* size: 192, cachelines: 3, members: 17 */
/* sum members: 122, holes: 4, sum holes: 14 */
/* padding: 56 */
};


Regards
Yu

> Thanks,
> Dennis