Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

From: Michael S. Tsirkin
Date: Thu Aug 10 2023 - 15:42:15 EST


On Tue, Aug 08, 2023 at 10:30:56AM +0800, Jason Wang wrote:
> On Mon, Jul 31, 2023 at 2:30 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 27, 2023 at 5:46 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 04:59:33PM +0800, Jason Wang wrote:
> > > > > They really shouldn't - any NIC that takes forever to
> > > > > program will create issues in the networking stack.
> > > >
> > > > Unfortunately, it's not rare as the device/cvq could be implemented
> > > > via firmware or software.
> > >
> > > Currently that means one either has sane firmware with a scheduler that
> > > can meet deadlines, or loses the ability to report errors back.
> > >
> > > > > But if they do they can always set this flag too.
> > > >
> > > > This may have false negatives and may confuse the management.
> > > >
> > > > Maybe we can extend the networking core to allow some device specific
> > > > configurations to be done with device specific lock without rtnl. For
> > > > example, split the set_channels to
> > > >
> > > > pre_set_channels
> > > > set_channels
> > > > post_set_channels
> > > >
> > > > The device specific part could be done in pre and post without a rtnl lock?
> > > >
> > > > Thanks
> > >
> > >
> > > Would the benefit be that errors can be reported to userspace then?
> > > Then maybe. I think you will have to show how this works for at least
> > > one card besides virtio.
> >
> > Even for virtio, this does not seem easy, as e.g. the
> > virtnet_send_command() and netif_set_real_num_tx_queues() need to
> > appear to be atomic to the networking core.
> >
> > I wonder if we can re-consider the way of a timeout here and choose a
> > sane value as a start.
>
> Michael, any more input on this?
>
> Thanks

I think this is just mission creep. We are trying to fix
vduse - let's do that for starters.

Recovering from firmware timeouts is far from trivial, and
assuming that the device will not access memory just because
the command timed out is just as likely to cause memory
corruption, with worse results than an infinite spin.

I propose we fix this for vduse and assume hardware/firmware
is well behaved. Or maybe firmware that is not well behaved
can set the flag, losing the ability to report errors.



> >
> > Thanks
> >
> > >
> > >
> > > --
> > > MST
> > >