Re: [PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

From: Li Feng
Date: Mon Apr 17 2023 - 23:28:04 EST

> On Apr 17, 2023, at 9:33 PM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>
>>>> On Thu, Apr 13, 2023 at 09:29:41PM +0800, Li Feng wrote:
>>>>> The default worker affinity policy uses all online CPUs, i.e. 0 to N-1.
>>>>> However, when some CPUs are busy with other jobs, nvme-tcp performance
>>>>> suffers.
>>>>
>>>> Can you explain in detail how nvme-tcp performs worse in this situation?
>>>>
>>>> If some of the CPUs are known to be busy, you can submit the nvme-tcp IO
>>>> jobs on other, non-busy CPUs via taskset, or the scheduler is supposed to
>>>> choose proper CPUs for you. And usually an nvme-tcp device should be
>>>> saturated with a limited IO depth or number of jobs/CPUs.
>>>>
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>>
>>> Taskset can’t work on nvme-tcp io-queues, because the worker CPU is decided
>>> at the nvme-tcp ‘connect’ stage, not at the IO submission stage. Assume there
>>> is only one io-queue: its bound CPU is CPU0, no matter which CPUs the IO
>>> jobs run on.
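
To make that concrete, roughly how the io cpu ends up fixed per queue at
setup/connect time (a simplified sketch paraphrased from my reading of
drivers/nvme/host/tcp.c; the struct and function names here are made up for
the example, and details differ across kernel versions):

/*
 * Illustrative sketch only, not the upstream code.
 */
#include <linux/cpumask.h>

struct tcp_queue_sketch {
        int io_cpu;             /* worker cpu, chosen once per queue */
};

static void set_queue_io_cpu_sketch(struct tcp_queue_sketch *queue, int qid)
{
        int n = qid - 1;        /* qid 0 is the admin queue */

        /* spread io queues round-robin over the online cpus */
        queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
}

Once this has run at connect time, changing the affinity of the submitting
processes with taskset has no effect on where the queue's io work runs.
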
>> OK, it looks like the problem is queue->io_cpu, see nvme_tcp_queue_request().
>> But I am wondering why nvme-tcp doesn't queue the io work on the current
>> cpu? And why was queue->io_cpu introduced? Given that blk-mq defines cpu
>> affinities for each hw queue, the driver is supposed to submit the IO
>> request to hardware on the local CPU.
>> Sagi and guys, any ideas about introducing queue->io_cpu?
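
For reference, the behaviour Ming points at boils down to something like the
sketch below (simplified from memory, not the exact upstream code; names are
illustrative): the io_work is queued on queue->io_cpu, never on the cpu that
called .queue_rq.

#include <linux/workqueue.h>

struct tcp_queue_sketch {
        int io_cpu;
        struct work_struct io_work;
};

extern struct workqueue_struct *nvme_tcp_wq;

static void queue_io_work_sketch(struct tcp_queue_sketch *queue)
{
        /* target cpu was fixed at connect time, see the earlier sketch */
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
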
>
> Hey Ming,
>
> I have some vague memories wrt this, but from what I recall:
>
> - The number of queues is dependent on both the controller and
> the user (Not a reason/motivation on its own, just clarifying).
>
> - It simply matches what pci does (to some extent, outside of rx side
> entropy that may exist); it just happens to take more cpu cycles due to
> the network stack overhead.
>
> - I didn't want io threads to change CPUs because of the RFS/aRFS
> optimizations that people use, which allow the NIC to steer interrupts
> (and napi context) to where the io thread is running, and thus minimize
> latency due to improved locality. That on its own was shown to be worth
> over a 30% reduction.
>
RFS does not work well here. On my aarch64 machine, the NIC irq is handled on
a NUMA node 2 CPU, while the nvme-tcp io-queue is busy on CPU0.

> - At some point nvme-tcp rx context used to run in softirq, and having
> to synchronize different cores (on different numa nodes maybe, depends
> on what RSS decided) when processing the socket resulted in high
> latency as well. This is not the case today (due to some nics back then
> that surfaced various issues with this), but it may come back in
> the future (if shown to provide value).
>
> - Also, today when there is a sync network send from the .queue_rq path,
> it is only executed when the running cpu == queue->io_cpu, to avoid high
> contention. My concern is that if the io context is not bound to
> a specific cpu, it may create heavier contention on queue serialization.
> Today there are at most 2 contexts that compete: the io context (triggered
> from network rx or scheduled in the submission path) and the .queue_rq
> sync network send path. I'd prefer not to introduce more contention with
> an increasing number of threads accessing an nvme controller.
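
The serialization described above looks roughly like this (a simplified,
paraphrased sketch, not the exact upstream code; names are illustrative): a
sync send from .queue_rq is only attempted when we already run on
queue->io_cpu and the send_mutex is free, otherwise the request is left to
io_work, which is pinned to queue->io_cpu, so at most two contexts compete
per queue.

#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/smp.h>

struct tcp_queue_sketch {
        int io_cpu;
        struct mutex send_mutex;
        struct work_struct io_work;
};

extern struct workqueue_struct *nvme_tcp_wq;

static void queue_request_sketch(struct tcp_queue_sketch *queue, bool sync)
{
        if (sync && queue->io_cpu == raw_smp_processor_id() &&
            mutex_trylock(&queue->send_mutex)) {
                /* direct send from the submitting context */
                /* ... do the network send here ... */
                mutex_unlock(&queue->send_mutex);
        } else {
                /* defer to the io worker pinned to queue->io_cpu */
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }
}
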
>
> Having said that, I don't think there is a fundamental issue with
> using queue_work, queue_work_node(cur_cpu) or
> queue_work_node(netdev_home_cpu), if that does not introduce
> additional latency in the common cases, although having io threads
> bounce around is going to regress users that rely on RFS/aRFS...
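
As a purely illustrative sketch of that direction (not a proposed patch, just
to show the shape of the alternative; queue_work_node() currently expects an
unbound workqueue, which nvme_tcp_wq is not, so this is an assumption for the
example):

#include <linux/workqueue.h>
#include <linux/smp.h>
#include <linux/topology.h>

extern struct workqueue_struct *nvme_tcp_wq;

static void queue_io_work_local_sketch(struct work_struct *io_work, bool per_node)
{
        if (per_node)
                /* let the workqueue pick a worker on the caller's node
                 * (queue_work_node() expects an unbound workqueue) */
                queue_work_node(numa_node_id(), nvme_tcp_wq, io_work);
        else
                /* run the work on whatever cpu .queue_rq is on right now */
                queue_work_on(raw_smp_processor_id(), nvme_tcp_wq, io_work);
}

Either variant would let busy CPUs be avoided, but gives up the stable
per-queue binding that the RFS/aRFS locality argument above relies on.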