Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Michael S. Tsirkin
Date: Sat Dec 09 2023 - 05:42:39 EST

Next message: Krzysztof Kozlowski: "[PATCH 1/4] ARM: dts: aspeed: greatlakes: correct Mellanox multi-host property"
Previous message: kernel test robot: "io_uring/poll.c:480:43: sparse: sparse: incorrect type in initializer (different base types)"
In reply to: Michael S. Tsirkin: "Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)"
Next in thread: Jason Wang: "Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > >
> > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > the terminology.
> > > > >
> > > > > - vhost is looping through available queues.
> > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > - kworker does something with that queue and terminates quickly.
> > > > >
> > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > it will just keep on looping until it is migrated off the CPU.
> > > >
> > > >
> > > > Normally it takes the buffers off the queue and is done with it.
> > > > I am guessing that at the same time guest is running on some other
> > > > CPU and keeps adding available buffers?
> > > >
> > >
> > > It seems to do just that, there are multiple other vhost instances
> > > involved which might keep filling up thoses queues.
> > >
> >
> > No vhost is ever only draining queues. Guest is filling them.
> >
> > > Unfortunately, this makes the problematic vhost instance to stay on
> > > the CPU and prevents said kworker to get scheduled. The kworker is
> > > explicitly woken up by vhost, so it wants it to do something.
> > >
> > > At this point it seems that there is an assumption about the scheduler
> > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > far, it seems like EEVDF does what is intended to do.
> > >
> > > Shouldn't there be a more explicit mechanism in use that allows the
> > > kworker to be scheduled in favor of the vhost?
> > >
> > > It is also concerning that the vhost seems cannot be preempted by the
> > > scheduler while executing that loop.
> >
> >
> > Which loop is that, exactly?
>
> The loop continously passes translate_desc in drivers/vhost/vhost.c
> That's where I put the trace statements.
>
> The overall sequence seems to be (top to bottom):
>
> handle_rx
> get_rx_bufs
> vhost_get_vq_desc
> vhost_get_avail_head
> vhost_get_avail
> __vhost_get_user_slow
> translate_desc << trace statement in here
> vhost_iotlb_itree_first

I wonder why do you keep missing cache and re-translating.
Is pr_debug enabled for you? If not could you check if it
outputs anything?
Or you can tweak:

#define vq_err(vq, fmt, ...) do { \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
if ((vq)->error_ctx) \
eventfd_signal((vq)->error_ctx, 1);\
} while (0)

to do pr_err if you prefer.

> These functions show up as having increased overhead in perf.
>
> There are multiple loops going on in there.
> Again the disclaimer though, I'm not familiar with that code at all.

So there's a limit there: vhost_exceeds_weight should requeue work:

} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

then we invoke scheduler each time before re-executing it:

{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

node = llist_del_all(&worker->work_list);
if (node) {
__set_current_state(TASK_RUNNING);

node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
cond_resched();
}
}

return !!node;
}

These are the byte and packet limits:

/* Max number of bytes transferred before requeueing the job.
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
* Using this limit prevents one virtqueue from starving others with small
* pkts.
*/
#define VHOST_NET_PKT_WEIGHT 256

Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

--
MST

Next message: Krzysztof Kozlowski: "[PATCH 1/4] ARM: dts: aspeed: greatlakes: correct Mellanox multi-host property"
Previous message: kernel test robot: "io_uring/poll.c:480:43: sparse: sparse: incorrect type in initializer (different base types)"
In reply to: Michael S. Tsirkin: "Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)"
Next in thread: Jason Wang: "Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]