RE: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts

From: Long Li
Date: Fri Aug 23 2019 - 20:13:57 EST


>>>Subject: Re: [PATCH 3/3] nvme: complete request in work queue on CPU
>>>with flooded interrupts
>>>
>>>
>>>> Sagi,
>>>>
>>>> Here are the test results.
>>>>
>>>> Benchmark command:
>>>> fio --bs=4k --ioengine=libaio --iodepth=64
>>>> --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1
>>>> --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test
>>>> --group_reporting --gtod_reduce=1
>>>>
>>>> With your patch: 1720k IOPS
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>>
>>>> Interrupts are the fastest, but we need to find a way to throttle them.
>>>
>>>This is the workload that generates the flood?
>>>If so, I did not expect the perf difference to be this large...
>>>
>>>If completions keep pounding on the CPU, I would expect irqpoll to simply
>>>keep running forever and poll the CQs. There is no fundamental reason why
>>>polling from the interrupt handler would be faster; what matters could be:
>>>1. we reschedule more than we need to
>>>2. we don't reap enough completions in every poll round, which triggers
>>>rearming the interrupt, and then when it fires we reschedule another softirq...
>>>

Yes, I think it's the rescheduling that costs us. With the patch there is a lot of ksoftirqd activity (compared to nearly none without the patch).
A 90-second fio run shows a big difference in the number of context switches across all CPUs:
With patch: 5755849
Without patch: 1462931
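
For context, the driver-side irq_poll pattern that leads to this behavior looks roughly like the sketch below. It illustrates your point 2: whenever the poll callback reaps fewer completions than its budget, it completes the poll and rearms the interrupt, so a steady completion stream turns into interrupt -> softirq -> rearm cycles. This is only a minimal sketch; struct my_queue, my_process_one_cqe() and the my_*_queue_irq() helpers are made-up placeholders, not the actual nvme code.
--
#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* Hypothetical per-queue context; the real nvme_queue looks different. */
struct my_queue {
	struct irq_poll iop;
};

/* Hypothetical helpers to mask/unmask the queue's interrupt and to
 * reap a single completion queue entry. */
static void my_disable_queue_irq(struct my_queue *q);
static void my_enable_queue_irq(struct my_queue *q);
static bool my_process_one_cqe(struct my_queue *q);

/* Hard IRQ handler: mask the queue interrupt and defer to irq_poll. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct my_queue *q = data;

	my_disable_queue_irq(q);
	irq_poll_sched(&q->iop);	/* raises IRQ_POLL_SOFTIRQ */
	return IRQ_HANDLED;
}

/* Softirq poll callback: reap up to @budget completions per round. */
static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = 0;

	while (done < budget && my_process_one_cqe(q))
		done++;

	if (done < budget) {
		/*
		 * CQ drained before the budget ran out: stop polling and
		 * rearm the interrupt. If completions keep arriving, the
		 * next interrupt schedules another softirq round -- the
		 * ksoftirqd churn measured above.
		 */
		irq_poll_complete(iop);
		my_enable_queue_irq(q);
	}
	return done;
}

/* At init time; weight is the max completions reaped per poll round:
 * irq_poll_init(&q->iop, 32, my_poll);
 */
--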

>>>Maybe we need to look at some irq_poll optimizations?
>>>
>>>Does this (untested) patch make any difference?
>>>--
>>>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>>>index 2f17b488d58e..0e94183eba15 100644
>>>--- a/lib/irq_poll.c
>>>+++ b/lib/irq_poll.c
>>>@@ -12,7 +12,8 @@
>>> #include <linux/irq_poll.h>
>>> #include <linux/delay.h>
>>>
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>-static unsigned int irq_poll_budget __read_mostly = 256;
>>>+static unsigned int irq_poll_budget __read_mostly = 3000;
>>>+unsigned int __read_mostly irqpoll_budget_usecs = 2000;
>>>
>>> static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
>>>
>>>@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
>>>
>>> static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>> {
>>>- struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
>>>- int rearm = 0, budget = irq_poll_budget;
>>>- unsigned long start_time = jiffies;
>>>+ struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
>>>+ unsigned int budget = irq_poll_budget;
>>>+ unsigned long time_limit =
>>>+ jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
>>>+ LIST_HEAD(list);
>>>
>>> local_irq_disable();
>>>+ list_splice_init(irqpoll_list, &list);
>>>+ local_irq_enable();
>>>
>>>- while (!list_empty(list)) {
>>>+ while (!list_empty(&list)) {
>>> struct irq_poll *iop;
>>> int work, weight;
>>>
>>>- /*
>>>- * If softirq window is exhausted then punt.
>>>- */
>>>- if (budget <= 0 || time_after(jiffies, start_time)) {
>>>- rearm = 1;
>>>- break;
>>>- }
>>>-
>>>- local_irq_enable();
>>>-
>>> /* Even though interrupts have been re-enabled, this
>>> * access is safe because interrupts can only add new
>>> * entries to the tail of this list, and only ->poll()
>>> * calls can remove this head entry from the list.
>>> */
>>>- iop = list_entry(list->next, struct irq_poll, list);
>>>+ iop = list_first_entry(&list, struct irq_poll, list);
>>>
>>> weight = iop->weight;
>>> work = 0;
>>>@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>>
>>> budget -= work;
>>>
>>>- local_irq_disable();
>>>-
>>> /*
>>> * Drivers must not modify the iopoll state, if they
>>> * consume their assigned weight (or more, some drivers can't
>>>@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>>> if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
>>> __irq_poll_complete(iop);
>>> else
>>>- list_move_tail(&iop->list, list);
>>>+ list_move_tail(&iop->list, &list);
>>> }
>>>+
>>>+ /*
>>>+ * If softirq window is exhausted then punt.
>>>+ */
>>>+ if (budget <= 0 || time_after_eq(jiffies, time_limit))
>>>+ break;
>>> }
>>>
>>>- if (rearm)
>>>+ local_irq_disable();
>>>+
>>>+ list_splice_tail_init(irqpoll_list, &list);
>>>+ list_splice(&list, irqpoll_list);
>>>+ if (!list_empty(irqpoll_list))
>>> __raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
>>>
>>> local_irq_enable();
>>>--

It gets slightly better IOPS. With this 2nd patch, the number of context switches is 5445863 (a ~5% improvement over the 1st patch).
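
For anyone reproducing the numbers: one simple way to sample the system-wide context switch count is the "ctxt" line in /proc/stat, read once before and once after the fio run and subtracted. A minimal userspace sketch (just one way to collect this; the counts above may have been gathered differently):
--
/* ctxt.c -- print the total context switch count since boot. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/stat", "r");

	if (!f) {
		perror("/proc/stat");
		return 1;
	}
	/* Scan for the "ctxt <count>" line and print the count. */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "ctxt ", 5)) {
			fputs(line + 5, stdout);
			break;
		}
	}
	fclose(f);
	return 0;
}
--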