Re: [RFC 0/3] block: proportional based blk-throttling

From: Tejun Heo
Date: Thu Jan 21 2016 - 16:10:12 EST


Hello, Shaohua.

On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> Currently we have 2 iocontrollers. blk-throttling is bandwidth based. CFQ is

Just a nit. blk-throttle is both bw and iops based.

> weight based. It would be great if there were a unified iocontroller for the
> two. And blk-mq doesn't support ioscheduler, leaving blk-throttling the only
> option for blk-mq. It's time to have a scalable iocontroller supporting both
> bandwidth/weight based control and working with blk-mq.
>
> blk-throttling is a good candidate, it works for both blk-mq and legacy queue.
> It has a global lock which is scary for scalability, but it's not terrible in
> practice. In my test, the NVMe IOPS can reach 1M with all CPUs running IO.
> Enabling blk-throttle has around 2~3% IOPS and 10% cpu utilization impact. I'd
> expect this isn't a big problem for today's workloads. This patchset then tries
> to make a unified iocontroller. I'm leveraging blk-throttling.

Have you tried with some level, say 5, of nesting? IIRC, how it
implements hierarchical control is rather braindead (and yeah I'm
responsible for the damage).

> The idea is pretty simple. If we know disk total bandwidth, we can calculate
> cgroup bandwidth according to its weight. blk-throttling can use the calculated
> bandwidth to throttle cgroup. Disk total bandwidth changes dramatically per IO
> pattern. Long history is meaningless. The simple algorithm in patch 1 works
> pretty well when IO pattern changes.

So, that part is fine but I don't think it makes sense to make weight
based control either bandwidth or iops based. The fundamental problem
is that it's a false choice. It's like asking someone who wants a car
to choose between accelerator and brake. It's a choice without a good
answer. Both are wrong. Also note that there's an inherent
difference from the currently implemented absolute limits. Absolute
limits can be combined. Weights based on different metrics can't be.
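
As a rough illustration of why absolute limits combine cleanly (a Python
sketch with made-up numbers; the function name is hypothetical and this is
not how blk-throttle is coded): when a group has both a bytes/s and an IOs/s
limit, an IO goes through only when it fits under both, so the effective
rate is simply the tighter of the two.

```python
def effective_iops(io_size_bytes, bps_limit, iops_limit):
    """IOs per second a group can issue when both absolute limits apply.

    Hypothetical helper: the bandwidth limit is converted to an IO rate
    for the given IO size, and the smaller of the two rates wins.
    """
    return min(iops_limit, bps_limit / io_size_bytes)


MIB = 1024 * 1024

# 4k IOs against 100 MiB/s and 10000 IOs/s: the iops limit is the tighter one.
small = effective_iops(4096, 100 * MIB, 10000)

# 1 MiB IOs against the same limits: the bandwidth limit is the tighter one.
large = effective_iops(MIB, 100 * MIB, 10000)
```

Two weights don't reduce like this. A share of bandwidth and a share of iops
are each relative to a device total that moves with the IO pattern, so there
is no fixed "tighter" one to pick.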

Even with modern SSDs, both iops and bandwidth play major roles in
deciding how costly each IO is and I'm fairly confident that this is
fundamental enough to be the case for quite a while. I *think* the
cost model can be approximated from measurements. Devices are
becoming more and more predictable in their behaviors after all. For
weight based distribution, the unit of distribution should be IO time,
not bandwidth or iops.
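
A minimal sketch of what distributing IO time could look like, assuming a
linear cost model (a fixed per-IO cost plus a per-byte cost). The model, the
coefficients and all names below are assumptions for illustration only; a
real model would have to be fitted from measurements of the device.

```python
def io_cost_ns(size_bytes, per_io_ns, per_byte_ns):
    """Estimated device time for one IO under an assumed linear model."""
    return per_io_ns + size_bytes * per_byte_ns


def io_time_budget_ns(period_ns, weights):
    """Split one period of device time among groups in proportion to weight."""
    weight_sum = sum(weights.values())
    return {name: period_ns * w // weight_sum for name, w in weights.items()}


# Made-up coefficients: 20us fixed cost per IO, 1ns per byte transferred.
PER_IO_NS = 20_000
PER_BYTE_NS = 1

# Two equally weighted groups share a 100ms period of device time.
budget = io_time_budget_ns(100_000_000, {"a": 100, "b": 100})

# Group "a" issues 4k IOs, group "b" issues 1 MiB IOs.
ios_a = budget["a"] // io_cost_ns(4096, PER_IO_NS, PER_BYTE_NS)
ios_b = budget["b"] // io_cost_ns(1024 * 1024, PER_IO_NS, PER_BYTE_NS)
```

With equal weights both groups get the same amount of device time, but the
number of IOs and the number of bytes each one moves differ widely, because
the cost of an IO depends on both its count and its size.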

Thanks.

--
tejun