Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

From: Jason Wang
Date: Mon Dec 11 2023 - 02:27:10 EST


On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping accross
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > > >
> > > >
> > > > It seems to do just that, there are multiple other vhost instances
> > > > involved which might keep filling up thoses queues.
> > > >
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance to stay on
> > > > the CPU and prevents said kworker to get scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me vhost doesn't use workqueue but the worker by itself.

> > > >
> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost did a brunch of copy_from_user() which should trigger
__might_fault() so a __might_sleep() most of the case.

> > > >
> > > > It is also concerning that the vhost seems cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > >
> > > Which loop is that, exactly?
> >
> > The loop continously passes translate_desc in drivers/vhost/vhost.c
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why do you keep missing cache and re-translating.
> Is pr_debug enabled for you? If not could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do { \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
> if ((vq)->error_ctx) \
> eventfd_signal((vq)->error_ctx, 1);\
> } while (0)
>
> to do pr_err if you prefer.
>
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
> } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke scheduler each time before re-executing it:
>
>
> {
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
>
> node = llist_del_all(&worker->work_list);
> if (node) {
> __set_current_state(TASK_RUNNING);
>
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
> kcov_remote_start_common(worker->kcov_handle);
> work->fn(work);
> kcov_remote_stop();
> cond_resched();
> }
> }
>
> return !!node;
> }
>
> These are the byte and packet limits:
>
> /* Max number of bytes transferred before requeueing the job.
> * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
> * Using this limit prevents one virtqueue from starving others with small
> * pkts.
> */
> #define VHOST_NET_PKT_WEIGHT 256
>
>
> Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

Or a dirty hack like cond_resched() in translate_desc().

Thanks


>
> --
> MST
>