Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig_pool for TCP-AO

From: Dmitry Safonov
Date: Fri Aug 11 2023 - 17:03:02 EST

Next message: tip-bot2 for Cristian Ciocaltea: "[tip: x86/urgent] x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405"
Previous message: Andrew Lunn: "Re: [PATCH v2 2/4] soc: qcom: aoss: Add debugfs interface for sending messages"
In reply to: Eric Dumazet: "Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig_pool for TCP-AO"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Eric,

On 8/8/23 10:58, Eric Dumazet wrote:
> On Wed, Aug 2, 2023 at 7:27 PM Dmitry Safonov <dima@xxxxxxxxxx> wrote:
[..]
>> - hash = crypto_alloc_ahash("md5", 0, CRYPTO_ALG_ASYNC);
>> - if (IS_ERR(hash))
>> - return;
>> -
>> - for_each_possible_cpu(cpu) {
>> - void *scratch = per_cpu(tcp_md5sig_pool, cpu).scratch;
>> - struct ahash_request *req;
>> -
>> - if (!scratch) {
>> - scratch = kmalloc_node(sizeof(union tcp_md5sum_block) +
>> - sizeof(struct tcphdr),
>> - GFP_KERNEL,
>> - cpu_to_node(cpu));
>> - if (!scratch)
>> - return;
>> - per_cpu(tcp_md5sig_pool, cpu).scratch = scratch;
>> - }
>> - if (per_cpu(tcp_md5sig_pool, cpu).md5_req)
>> - continue;
>> -
>> - req = ahash_request_alloc(hash, GFP_KERNEL);
>> - if (!req)
>> - return;
>> -
>> - ahash_request_set_callback(req, 0, NULL, NULL);
>> -
>> - per_cpu(tcp_md5sig_pool, cpu).md5_req = req;
>> + scratch_size = sizeof(union tcp_md5sum_block) + sizeof(struct tcphdr);
>> + ret = tcp_sigpool_alloc_ahash("md5", scratch_size);
>> + if (ret >= 0) {
>> + tcp_md5_sigpool_id = ret;
>
> tcp_md5_alloc_sigpool() can be called from multiple cpus,
> yet you are writing over tcp_md5_sigpool_id here without any
> spinlock/mutex/annotations ?

Yeah, it's writing-only: as long as there was a TCP-MD5 key in
the system, sigpool_id returned by tcp_sigpool_alloc_ahash() would stay
the same. The only time when it writes a different value - iff all
previous MD5 keys were released.

> KCSAN would eventually file a report, and a compiler might very well
> transform this to
>
> tcp_md5_sigpool_id = random_value;
> <window where readers might catch garbage>
> tcp_md5_sigpool_id = ret;

Agree, haven't thought about the optimizing compiler nightmares.
Will add WRITE_ONCE().

[..]
>> +
>> +/**
>> + * sigpool_reserve_scratch - re-allocates scratch buffer, slow-path
>> + * @size: request size for the scratch/temp buffer
>> + */
>> +static int sigpool_reserve_scratch(size_t size)
>> +{
>> + struct scratches_to_free *stf;
>> + size_t stf_sz = struct_size(stf, scratches, num_possible_cpus());
>
> This is wrong. num_possible_cpus() could be 2, with two cpus numbered 0 and 8.
>
> Look for nr_cpu_ids instead.

Hmm, but it seems num_possible_cpus() was exactly what I wanted?

As free_old_scratches() does:

: while (stf->cnt--)
: kfree(stf->scratches[stf->cnt]);

So, that's just the number of previously allocated scratches, not
per-CPU index.

>> + int cpu, err = 0;
>> +
>> + lockdep_assert_held(&cpool_mutex);
>> + if (__scratch_size >= size)
>> + return 0;
>> +
>> + stf = kmalloc(stf_sz, GFP_KERNEL);
>> + if (!stf)
>> + return -ENOMEM;
>> + stf->cnt = 0;
>> +
>> + size = max(size, __scratch_size);
>> + cpus_read_lock();
>> + for_each_possible_cpu(cpu) {
>> + void *scratch, *old_scratch;
>> +
>> + scratch = kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
>> + if (!scratch) {
>> + err = -ENOMEM;
>> + break;
>> + }
>> +
>> + old_scratch = rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
>> + scratch, lockdep_is_held(&cpool_mutex));
>> + if (!cpu_online(cpu) || !old_scratch) {
>> + kfree(old_scratch);
>> + continue;
>> + }
>> + stf->scratches[stf->cnt++] = old_scratch;
>> + }
>
>
> I wonder why you are not simply using something around __alloc_percpu()

I presume that was a heritage of __tcp_alloc_md5sig_pool() that
allocated scratches this way. In pre-crypto-cloning versions per-CPU
ahash_requests were allocated with alloc_percpu().

But now looking closer, if I read pcpu allocator code correctly, it does
pcpu_mem_zalloc() for populating per-CPU chunks which depending on
pcpu_chunk_struct_size may be allocated with vmalloc() or kzalloc().

IOW, as long as CONFIG_NEED_PER_CPU_KM != n, which is:

: config NEED_PER_CPU_KM
: depends on !SMP || !MMU

The memory, returned by pcpu allocator has no guarantee of being from
direct-map. Which I presume defeats the purpose of having any
scratch/temporary area for sg/crypto.

Am I missing something?

>> + cpus_read_unlock();
>> + if (!err)
>> + __scratch_size = size;
>> +
>> + call_rcu(&stf->rcu, free_old_scratches);
>> + return err;
>> +}
>> +
>> +static void sigpool_scratch_free(void)
>> +{
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu)
>> + kfree(rcu_replace_pointer(per_cpu(sigpool_scratch, cpu),
>> + NULL, lockdep_is_held(&cpool_mutex)));
>
> I wonder why bothering about freeing some space ? One scratch buffer
> per cpu is really small.

When all TCP-MD5/TCP-AO keys removed, why not free some memory?

[..]
>> + /* slow-path */
>> + mutex_lock(&cpool_mutex);
>> + ret = sigpool_reserve_scratch(scratch_size);
>> + if (ret)
>> + goto out;
>> + for (i = 0; i < cpool_populated; i++) {
>> + if (!cpool[i].alg)
>> + continue;
>> + if (strcmp(cpool[i].alg, alg))
>> + continue;
>> +
>> + if (kref_read(&cpool[i].kref) > 0)
>> + kref_get(&cpool[i].kref);
>> + else
>> + kref_init(&cpool[i].kref);
>
> This looks wrong. kref_init() should be called once, after allocation,
> not based on kref_read().

Well, as long as it's the slow-path and cpool_mutex was grabbed, this
essentially re-allocating an entry that had it's last reference
released, but wasn't yet destroyed by __cpool_free_entry().

I think, if I interpreted it correctly, it was suggested-by Jakub:
https://lore.kernel.org/all/20230106175326.2d6a4dcd@xxxxxxxxxx/T/#u

[..]
>> +static void __cpool_free_entry(struct sigpool_entry *e)
>> +{
>> + crypto_free_ahash(e->hash);
>> + kfree(e->alg);
>> + memset(e, 0, sizeof(*e));
>> +}
>> +
>> +static void cpool_cleanup_work_cb(struct work_struct *work)
>
> Really this looks over complicated to me.
>
> All this kref maintenance and cpool_mutex costs :/

Hmm, unsure: I can remove this sigpool thing and just allocate a crypto
tfm for every new setsockopt(). Currently, this allocates one tfm per
hash algorithm, and without it it would be one tfm per setsockopt()/key.

The supported scale currently is thousands of keys per-socket, which
means that with sizeof(struct crypto_ahash) == 104, that will increase
memory consumption by hundreds of kilobytes to megabytes. I thought
that's a reasonable thing to do.

Should I proceed TCP-AO patches without this manager thing?

>> +{
>> + bool free_scratch = true;
>> + unsigned int i;
>> +
>> + mutex_lock(&cpool_mutex);
>> + for (i = 0; i < cpool_populated; i++) {
>> + if (kref_read(&cpool[i].kref) > 0) {
>> + free_scratch = false;
>> + continue;
>> + }
>> + if (!cpool[i].alg)
>> + continue;
>> + __cpool_free_entry(&cpool[i]);
>> + }
>> + if (free_scratch)
>> + sigpool_scratch_free();
>> + mutex_unlock(&cpool_mutex);
>> +}
[..]

Thanks,
Dmitry

Next message: tip-bot2 for Cristian Ciocaltea: "[tip: x86/urgent] x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405"
Previous message: Andrew Lunn: "Re: [PATCH v2 2/4] soc: qcom: aoss: Add debugfs interface for sending messages"
In reply to: Eric Dumazet: "Re: [PATCH v9 net-next 01/23] net/tcp: Prepare tcp_md5sig_pool for TCP-AO"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]