Re: [PATCHSET] block, mempool, percpu: implement percpu mempool andfix blkcg percpu alloc deadlock

From: Andrew Morton
Date: Thu Dec 22 2011 - 22:08:51 EST


On Thu, 22 Dec 2011 21:54:11 -0500 Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:

> On Thu, Dec 22, 2011 at 05:38:20PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 20:21:12 -0500 Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> >
> > > On Thu, Dec 22, 2011 at 03:41:38PM -0800, Andrew Morton wrote:
> > > >
> > > > If the code *does* correctly handle ->stats_cpu == NULL then we have
> > > > options.
> > > >
> > > > a) Give userspace a new procfs/debugfs file to start stats gathering
> > > > on a particular cgroup/request_queue pair. Allocate the stats
> > > > memory in that.
> > > >
> > > > b) Or allocate stats_cpu on the first call to blkio_read_stat_cpu()
> > > > and return zeroes for this first call.
> > >
> > > But the purpose of stats is that they are gathered even if somebody
> > > has not read them even once?
> >
> > That's not a useful way of using stats. The normal usage would be to
> > record the stats then start the workload then monitor how the stats
> > have changed as work proceeds.
>
> I have atleast one example "iostat" which does not follow this. Its
> first report shows the total stats since the system boot and each
> subsequent report covers time since previous report. With stats being
> available since the cgroup creation time, one can think of extending
> iostat tool to display per IO cgroup stats too.

If that's useful (dubious) then it can be addressed by creating the
stats when a device is bound to the cgroup (below).

> Also we have a knob "reset_stats" to reset all the stats to zero. So
> one can first reset stats, starts workload and then monitor it (if one
> does not like stats since the cgroup creation time).
>
> >
> > > So if I create a cgroup and put some
> > > task into it which does some IO, I think stat collection should start
> > > immediately without user taking any action.
> >
> > If you really want to know the stats since cgroup creation then trigger
> > the stats initialisation from userspace when creating the blkio_cgroup.
>
> These per cpu stats are per cgroup per device. So if a workload in a
> cgroup is doing IO to 4 devices, we allocate 4 percpu stat areas for
> stats. So at cgroup creation time we just don't know how many of these
> to create and also it does not cover the case of device hotplug after
> cgroup creation.

Mark the cgroup as "needing stats" then allocate the stats (if needed)
when a device is bound to the cgroup. Rather than on first I/O.

> >
> > > Forcing the user to first
> > > read a stat before the collection starts is kind of odd to me.
> >
> > Well one could add a separate stats_enable knob. Doing it
> > automatically from read() would be for approximate-back-compatibility
> > with existing behaviour.
> >
> > Plus (again) this way we also avoid burdening non-stats-users with the
> > overhead of stats.
>
> Even if we do that we have the problem with hoplugged device. Assume a
> cgroup created, stats enabled now a new devices shows up and some task
> in the group does IO on that device. Now we need to create percpu data
> area for that cgroup-device pair dynamically in IO path and we are back
> to the same problem.

Why do the allocation during I/O? Can't it be done in the hotplug handler?


Please, don't expend brain cells thinking of reasons why this all can't
be done. Instead expend them on finding a way in which it *can* be
done!

Maybe doing all this cgroups stuff within ->elevator_set_req_fn() was
the wrong design decision.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/