Re: [PATCH v3 2/2] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()

From: David Howells
Date: Fri Aug 18 2023 - 11:21:15 EST


Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> > Although I'm not sure the bit-fields really help.
> > There are 8 bytes at the start of the structure, might as well
> > use them :-)
>
> Actuallyç I wrote the patch that way because it seems to improve code
> generation.
>
> The bitfields are generally all set together as just plain one-time
> constants at initialization time, and gcc sees that it's a full byte
> write. And the reason 'data_source' is not a bitfield is that it's not
> a constant at iov_iter init time (it's an argument to all the init
> functions), so having that one as a separate byte at init time is good
> for code generation when you don't need to mask bits or anything like
> that.
>
> And once initialized, having things be dense and doing all the
> compares with a bitwise 'and' instead of doing them as some value
> compare again tends to generate good code.

Actually... I said that switch(enum) seemed to generate suboptimal code...
However, if the enum is renumbered such that the constants are in the same
order as in the switch() it generates better code.

So we want this order:

enum iter_type {
ITER_UBUF,
ITER_IOVEC,
ITER_BVEC,
ITER_KVEC,
ITER_XARRAY,
ITER_DISCARD,
};

to match:

switch (iov_iter_type(iter)) {
case ITER_UBUF:
progress = iterate_ubuf(iter, len, priv, priv2, ustep);
break;
case ITER_IOVEC:
progress = iterate_iovec(iter, len, priv, priv2, ustep);
break;
case ITER_BVEC:
progress = iterate_bvec(iter, len, priv, priv2, step);
break;
case ITER_KVEC:
progress = iterate_kvec(iter, len, priv, priv2, step);
break;
case ITER_XARRAY:
progress = iterate_xarray(iter, len, priv, priv2, step);
break;
case ITER_DISCARD:
default:
progress = len;
break;
}

then gcc should be able to generate a ternary tree, which it does here:

<+77>: mov (%rdx),%al
<+79>: cmp $0x2,%al
<+81>: je 0xffffffff81779bcc <_copy_from_iter+394>
<+87>: ja 0xffffffff81779aa9 <_copy_from_iter+103>

though it really split the number space in the wrong place. It can then use
one CMP (or TEST) to split again:

<+89>: test %al,%al
<+91>: mov 0x8(%rdx),%rdx
<+95>: jne 0xffffffff81779b48 <_copy_from_iter+262>
<+101>: jmp 0xffffffff81779b17 <_copy_from_iter+213>

It then should only require one CMP at this point, since AL can only be 4, 5
or 6+:

<+103>: cmp $0x3,%al
<+105>: je 0xffffffff81779c6e <_copy_from_iter+556>
<+111>: cmp $0x4,%al
<+113>: jne 0xffffffff81779dd2 <_copy_from_iter+912>

The end result being that it should have at most two CMP instructions for any
number in the range 0 to 5 - though in this case, it will have three for AL>3.

However, it doesn't do this with if-if-if with a number sequence or a bitmask,
but rather generates an chain of cmp-cmp-cmp or test-test-test, presumably
because it fails to spot the if-conditions are related.

I should log a gcc bug on this on the poor switch() behaviour.

Also, if we renumber the enum to put UBUF and IOVEC first, we can get rid of
iter->user_backed in favour of:

static inline bool user_backed_iter(const struct iov_iter *i)
{
return iter_is_ubuf(i) || iter_is_iovec(i);
}

which gcc just changes into something like a "CMP $1" and a "JA".


Comparing Linus's bit patch (+ is better) to renumbering the switch (- is
better):

__iov_iter_get_pages_alloc inc 0x32a -> 0x331 +0x7
_copy_from_iter dcr 0x3c7 -> 0x3bf -0x8
_copy_from_iter_flushcache inc 0x364 -> 0x370 +0xc
_copy_from_iter_nocache inc 0x33e -> 0x347 +0x9
_copy_mc_to_iter dcr 0x3bc -> 0x3b6 -0x6
_copy_to_iter inc 0x34a -> 0x34b +0x1
copy_page_from_iter_atomic.part.0 dcr 0x424 -> 0x41c -0x8
copy_page_to_iter_nofault.part.0 dcr 0x3a9 -> 0x3a5 -0x4
csum_and_copy_from_iter inc 0x3e5 -> 0x3e8 +0x3
csum_and_copy_to_iter dcr 0x449 -> 0x448 -0x1
dup_iter inc 0x34 -> 0x37 +0x3
fault_in_iov_iter_readable dcr 0x9c -> 0x9a -0x2
fault_in_iov_iter_writeable dcr 0x9c -> 0x9a -0x2
iov_iter_advance dcr 0xde -> 0xd8 -0x6
iov_iter_alignment inc 0xe0 -> 0xe3 +0x3
iov_iter_extract_pages inc 0x416 -> 0x418 +0x2
iov_iter_gap_alignment dcr 0x69 -> 0x67 -0x2
iov_iter_init inc 0x27 -> 0x31 +0xa
iov_iter_is_aligned dcr 0x104 -> 0xf5 -0xf
iov_iter_npages inc 0x119 -> 0x11d +0x4
iov_iter_revert dcr 0x93 -> 0x86 -0xd
iov_iter_single_seg_count dcr 0x41 -> 0x3c -0x5
iov_iter_ubuf inc 0x39 -> 0x3a +0x1
iov_iter_zero dcr 0x338 -> 0x32e -0xa
memcpy_from_iter_mc inc 0x2a -> 0x2b +0x1

I think there may be more savings to be made if I go and convert more of the
functions to using switch().

David