[RFC PATCH v2 1/3] blk-mq: Move csd inside struct request so it's 32-byte aligned

From: Leonardo Bras
Date: Sat May 20 2023 - 01:31:16 EST


Currently, request->csd has type struct __call_single_data.

call_single_data_t is defined in include/linux/smp.h :

/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
typedef struct __call_single_data call_single_data_t
__aligned(sizeof(struct __call_single_data));

As the comment above the typedef suggests, having struct __call_single_data
split between 2 cachelines causes the need to fetch / invalidate / bounce 2
cachelines instead of 1 when the cpu receiving the request gets to run the
requested function. This is usually bad for performance, due to one extra
memory access and 1 extra cacheline usage.

As an example with a 64-bit machine with
CONFIG_BLK_RQ_ALLOC_TIME=y
CONFIG_BLK_WBT=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_INLINE_ENCRYPTION=y

Will output pahole with:
struct request {
[...]
union {
struct __call_single_data csd; /* 240 32 */
u64 fifo_time; /* 240 8 */
}; /* 240 32 */
[...]
}

At this config, and any cacheline size between 32 and 256, will cause csd
to be split between 2 cachelines: csd->node (16 bytes) in the first
cacheline, and csd->func (8 bytes) & csd->info (8 bytes) in the second.

During blk_mq_complete_send_ipi(), csd->func and csd->info are getting
changed, and when it calls __smp_call_single_queue() csd->node will get
changed.

On the cpu which got the request, csd->func and csd->info get read by
__flush_smp_call_function_queue() and csd->node gets changed by
csd_unlock(), meaning the two cachelines containing csd will get accessed.

To avoid this, it would be necessary to make sure request->csd is placed
somewhere else in the struct, so it is always in a single cacheline,
while avoiding the introduction of any hole in the struct. In order to
achieve this, move request->csd to after 'struct block_device *part'.

The rationale of this change is that:
- There will be no CONFIG_*-dependent field before csd, so there is no
chance of having unexpected holes on given configs.
- On 64-bit machines, csd will be at byte 96, and
- On 32-bit machines, csd will be at byte 64.

This means after this change, request->csd will always be cacheline aligned
for cachelines >= 32-bytes (64-bit) and cachelines >= 16-bytes (32-bits),
as long as struct request is cacheline aligned.

In above change, the struct request size is not supposed to change in any
configuration.

Signed-off-by: Leonardo Bras <leobras@xxxxxxxxxx>
---
include/linux/blk-mq.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 06caacd77ed6..44201e18681f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -105,6 +105,11 @@ struct request {
};

struct block_device *part;
+
+ union {
+ struct __call_single_data csd;
+ u64 fifo_time;
+ };
#ifdef CONFIG_BLK_RQ_ALLOC_TIME
/* Time that the first bio started allocating this request. */
u64 alloc_time_ns;
@@ -189,11 +194,6 @@ struct request {
} flush;
};

- union {
- struct __call_single_data csd;
- u64 fifo_time;
- };
-
/*
* completion callback.
*/
--
2.40.1