Re: [PATCH 3/3] crypto: x86/sm4 - add AES-NI/AVX/x86_64 assembler implementation

From: Tianjia Zhang
Date: Sun Jun 13 2021 - 06:15:02 EST


Hi Eric,

On 6/11/21 7:27 AM, Eric Biggers wrote:
On Thu, Jun 10, 2021 at 09:44:59PM +0800, Tianjia Zhang wrote:
This patch adds AES-NI/AVX/x86_64 assembler implementation of SM4
block cipher. Through two affine transforms, we can use the AES
S-Box to simulate the SM4 S-Box to achieve the effect of instruction
acceleration.


Benchmark results, please.

Also, is this passing the self-tests, including the fuzz tests?


I will provide this information in the next version.

+/*
+ * void sm4_aesni_avx_expand_key(const u8 *key, u32 *rk_enc,
+ * u32 *rk_dec, const u32 *fk, const u32 *ck);
+ */
+SYM_FUNC_START(sm4_aesni_avx_expand_key)
+ /* input:
+ * %rdi: 128-bit key
+ * %rsi: rkey_enc
+ * %rdx: rkey_dec
+ * %rcx: fk array
+ * %r8: ck array
+ */
+ FRAME_BEGIN

Key expansion isn't performance-critical. Can the C library version be used, or
does the key need to be expanded in a way specific to this x86 implementation?


It can be replaced by a common implementation of C library. For expand key that are not called frequently, the optimization of a specific instruction set does not bring much benefit. Of course, it is possible to delete this implementation.

+/*
+ * void sm4_aesni_avx_crypt4(const u32 *rk, u8 *dst,
+ * const u8 *src, int nblocks)
+ */
+SYM_FUNC_START(sm4_aesni_avx_crypt4)
+ /* input:
+ * %rdi: round key array, CTX
+ * %rsi: dst (1..4 blocks)
+ * %rdx: src (1..4 blocks)
+ * %rcx: num blocks (1..4)
+ */
+ FRAME_BEGIN
[...]

+static void sm4_encrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+ const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ if (crypto_simd_usable()) {
+ kernel_fpu_begin();
+ sm4_aesni_avx_crypt4(ctx->rkey_enc, out, in, 1);
+ kernel_fpu_end();
+ } else
+ crypto_sm4_do_crypt(ctx->rkey_enc, out, in);
+}
+
+static void sm4_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
+{
+ const struct crypto_sm4_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ if (crypto_simd_usable()) {
+ kernel_fpu_begin();
+ sm4_aesni_avx_crypt4(ctx->rkey_dec, out, in, 1);
+ kernel_fpu_end();
+ } else
+ crypto_sm4_do_crypt(ctx->rkey_dec, out, in);
+}

Your assembly code appears to handle encrypting up to 4 blocks at a time.
However you have only wired this up to the "cipher" API which does 1 block at a
time. Is this intentional?

What are your performance results with real-world chaining modes like XTS, and
do you plan to implement any of these modes directly?


This implementation is intentional. First, a general block encryption is provided. There is no obvious performance improvement in this implementation. The key to optimization is to make full use of parallel four blocks encryption at a time. This is still under development, and I will continue to implement things like XTS in the future. Optimization of such specific modes.

+
+static struct crypto_alg sm4_asm_alg = {
+ .cra_name = "sm4",
+ .cra_driver_name = "sm4-asm",

In arch/x86/crypto/, "-asm" usually means a vanilla x86 assembly implementation
without any AES-NI, SSE, AVX, etc. instructions. Calling this something like
"sm4-aesni-avx" would make more sense. (Or is it actually avx2, not avx?)


will do in next version patch.

+config CRYPTO_SM4_AESNI_AVX_X86_64
+ tristate "SM4 cipher algorithm (x86_64/AES-NI/AVX)"
+ depends on X86 && 64BIT
+ select CRYPTO_SKCIPHER
+ select CRYPTO_SIMD
+ select CRYPTO_ALGAPI
+ select CRYPTO_LIB_SM4

As-is, neither CRYPTO_SKCIPHER nor CRYPTO_SIMD needs to be selected here.


ditto.

+ help
+ SM4 cipher algorithms (OSCCA GB/T 32907-2016) (x86_64/AES-NI/AVX).
+
+ SM4 (GBT.32907-2016) is a cryptographic standard issued by the
+ Organization of State Commercial Administration of China (OSCCA)
+ as an authorized cryptographic algorithms for the use within China.
+
+ SMS4 was originally created for use in protecting wireless
+ networks, and is mandated in the Chinese National Standard for
+ Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure)
+ (GB.15629.11-2003).
+
+ The latest SM4 standard (GBT.32907-2016) was proposed by OSCCA and
+ standardized through TC 260 of the Standardization Administration
+ of the People's Republic of China (SAC).
+
+ The input, output, and key of SMS4 are each 128 bits.
+
+ See also: <https://eprint.iacr.org/2008/329.pdf>
+
+ If unsure, say N.

This is the help text for the x86 implementation specifically. Please don't
have boilerplate text about the algorithm here; that already exists for the
generic implementation. The text should explain about the x86 implementation.


ditto.

- Eric


Thanks for your suggestion.

Cheers,
Tianjia