Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

From: Ard Biesheuvel
Date: Tue Mar 26 2024 - 04:52:06 EST


On Tue, 26 Mar 2024 at 10:06, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later. However, an implementation using plain AESNI + AVX is also added
> and provides a small boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro. This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake. These CPU models default to an
> implementation using ymm registers instead.
>
> This patchset increases the throughput of AES-256-XTS decryption by the
> following amounts on the following CPUs:
>
> | 4096-byte messages | 512-byte messages |
> ----------------------+--------------------+-------------------+
> Intel Skylake | 1% | 11% |
> Intel Ice Lake | 92% | 59% |
> Intel Sapphire Rapids | 115% | 78% |
> AMD Zen 1 | 25% | 20% |
> AMD Zen 2 | 26% | 20% |
> AMD Zen 3 | 82% | 40% |
> AMD Zen 4 | 118% | 48% |
>
> (The results for encryption are very similar to decryption. I just tend
> to measure decryption because decryption performance is more important.)
>
> There's no separate kconfig option for the new AES-XTS implementations,
> as they are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API. They are prioritized
> appropriately so that the best one for the CPU is used by default.
>

This is very nice work!

I didn't check the performance delta on my system (it's Intel but I
have no idea which uarch), but it supports all flavours that you
implemented here, and all pass the selftests with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, so for the series

Tested-by: Ard Biesheuvel <ardb@xxxxxxxxxx>

I will try to make time to review the code as well.

> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
> registers the right one? arch/x86/crypto/poly1305_glue.c thinks that
> only Skylake has the bad downclocking. My current proposal is a bit
> more conservative; it also excludes Ice Lake and Tiger Lake. Those
> CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
> place? It probably doesn't make sense to have random different
> policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode? It
> seems to work, and technically it's not new because Poly1305 and ARIA
> already use AVX512, including the mask registers and zmm registers up
> to 31. So if there was a major issue, like the new registers not
> being properly saved and restored, it probably would have already been
> found. But AES-XTS support would introduce a wider use of it.
>

I don't have much input here, except that I think we should just
disable AVX512 kernel-wide on systems where there is no benefit in
terms of throughput. I suspect this might change with algorithms that
rely more heavily on the masking, but so far, we have been making
quite effective use of simple permute vectors and overlapping loads
and stores to do the same. And as Eric points out, the only relevant
use case in the kernel is blocks of size 2^n where n is at least 9.