Re: [RFC 0/3] block: proportional based blk-throttling

From: Tejun Heo
Date: Thu Jan 21 2016 - 17:42:13 EST


Hello, Shaohua.

On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > Have you tried with some level, say 5, of nesting? IIRC, how it
> > implements hierarchical control is rather braindead (and yeah I'm
> > responsible for the damage).
>
> Not yet. I agree that nesting increases the locking time, but my test
> is already an extreme case: I had 32 threads on 2 nodes running IO and
> the IOPS was 1M/s. I don't think a real workload will act like this.
> The locking issue definitely should be revisited in the future though.

The thing is that most of the possible contentions can be removed by
implementing a per-cpu cache, which shouldn't be too difficult. A 10%
extra cost on current-gen hardware is already pretty high.
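
To make the idea concrete, something roughly like the below (just a
userspace sketch to show the shape of it; tg_budget, tg_try_charge,
REFILL_BYTES and so on are made-up names, not anything that exists in
blk-throttle today):

/*
 * Sketch of a per-cpu budget cache.  Each CPU refills a local byte budget
 * in batches from the shared pool, so the group lock is only taken once
 * per REFILL_BYTES worth of IO instead of once per bio.  The lock must be
 * initialized with pthread_spin_init() before use.
 */
#include <stdbool.h>
#include <stdint.h>
#include <pthread.h>

#define NR_CPUS		64
#define REFILL_BYTES	(1 << 20)	/* 1MB of budget per lock round-trip */

struct tg_budget {
	pthread_spinlock_t lock;	/* stands in for the queue/group lock */
	int64_t shared_bytes;		/* budget left in the current window */
	int64_t percpu_cache[NR_CPUS];
};

/* Charge @bytes for an IO; only hit the shared lock when the cache is dry. */
static bool tg_try_charge(struct tg_budget *tg, int cpu, int64_t bytes)
{
	if (tg->percpu_cache[cpu] >= bytes) {
		tg->percpu_cache[cpu] -= bytes;		/* fast path */
		return true;
	}

	pthread_spin_lock(&tg->lock);
	if (tg->shared_bytes >= REFILL_BYTES) {
		tg->shared_bytes -= REFILL_BYTES;
		tg->percpu_cache[cpu] += REFILL_BYTES;
	}
	pthread_spin_unlock(&tg->lock);

	if (tg->percpu_cache[cpu] < bytes)
		return false;				/* throttle the bio */
	tg->percpu_cache[cpu] -= bytes;
	return true;
}

The slow path still takes the shared lock, but only once per
REFILL_BYTES of IO instead of once per bio, which is where most of the
contention should go away.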

> I disagree that IO time is a better choice. Actually I think IO time will be

If IO time isn't the right term, let's call it IO cost. Whatever the
term, what matters is the actual fraction of the device's capacity that
each IO is incurring.

> the least we should consider for SSD. Ideally, if we knew each IO's
> cost and the total disk capability, things would be easy. Unfortunately
> there is no way to know the IO cost. Bandwidth isn't perfect, but it
> might be the best we have.
>
> I don't know why you think devices are predictable. SSDs are never
> predictable. I'm not sure how you would measure IO time. Modern SSDs
> have large queue depths (blk-mq supports a 10k queue depth). That means
> we can send 10k IOs in a few ns. Measuring IO start/finish time doesn't
> help either: a 4k IO at queue depth 1 might take 10us, while a 4k IO at
> queue depth 100 might take more than 100us. The IO time increases with
> higher queue depth. The fundamental problem is that a disk with a large
> queue depth can buffer a practically unlimited number of IO requests. I
> think IO time only works for a queue-depth-1 disk.

They're way more predictable than rotational devices when measured
over a period. I don't think we'll be able to measure anything
meaningful at the individual command level, but aggregate numbers
should be fairly stable. A simple approximation of IO cost, such as a
fixed cost per IO plus a cost proportional to IO size, would do a far
better job than just depending on bandwidth or IOPS, and that only
requires approximating two variables over time. I'm not sure how easy
/ feasible that actually would be, though.
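
As a very rough illustration, something like the below, where the two
coefficients are re-estimated from aggregate numbers each sampling
window (the names, the 50/50 split of busy time and the 1/8 EWMA weight
are all made up for the sketch):

/*
 * Two-parameter cost model: cost = c_fixed + c_byte * size.  The
 * coefficients are periodically re-estimated from the aggregate IOs and
 * bytes the device completed in the last window and smoothed with an
 * EWMA so that individual samples don't swing them around.
 */
#include <stdint.h>

struct io_cost_model {
	double c_fixed;		/* estimated cost per IO */
	double c_byte;		/* estimated cost per byte */
};

/* Cost charged for a single IO of @bytes. */
static inline double io_cost(const struct io_cost_model *m, uint64_t bytes)
{
	return m->c_fixed + m->c_byte * (double)bytes;
}

/*
 * Called once per sampling window with device-wide totals: @ios and
 * @bytes completed and @busy_usec of non-idle device time.  Crudely
 * split the busy time between the per-IO and per-byte terms and fold
 * the new estimate in with weight 1/8.
 */
static void io_cost_update(struct io_cost_model *m,
			   uint64_t ios, uint64_t bytes, uint64_t busy_usec)
{
	double per_io, per_byte;

	if (!ios || !bytes)
		return;

	per_io = (double)busy_usec / 2.0 / (double)ios;
	per_byte = (double)busy_usec / 2.0 / (double)bytes;

	m->c_fixed = (7.0 * m->c_fixed + per_io) / 8.0;
	m->c_byte = (7.0 * m->c_byte + per_byte) / 8.0;
}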

> On the other hand, how do you utilize IO time? If we use a similar
> algorithm to the one in this patch set (e.g. cgroup's IO time slice =
> cgroup_share / all_cgroup_share * disk_IO_time_capability), how do you
> get disk_IO_time_capability? Or do you use the CFQ algorithm (e.g.
> switch cgroups once a cgroup uses up its IO time slice)? But CFQ is
> known not to work well with NCQ unless the disk idles, because a disk
> with a large queue depth can dispatch all cgroups' IO immediately.
> Idling should of course be avoided for high speed storage.

I wasn't talking about time slicing as in CFQ but rather about
approximating the cost of each IO. I don't think it makes sense to
implement bandwidth-based weight control when the cost of IOs can vary
significantly depending on IO direction and size. The approximation
doesn't have to be perfect, but we should be able to land somewhere in
the ballpark.
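
IOW, charge each cgroup the estimated cost of its IOs and compare that
against its weight fraction, along the lines of (again just a sketch
with made-up names, building on the cost model above):

/*
 * Weight-based control on top of an estimated per-IO cost.  A cgroup
 * gets throttled when the fraction of total cost it has consumed in the
 * current window exceeds the fraction of the total weight it owns.
 */
#include <stdbool.h>

struct cost_cgroup {
	unsigned int weight;	/* relative weight, e.g. 100 */
	double cost_used;	/* estimated cost charged this window */
};

static bool over_fair_share(const struct cost_cgroup *cg,
			    unsigned int total_weight, double total_cost)
{
	if (!total_cost || !total_weight)
		return false;

	return cg->cost_used / total_cost >
	       (double)cg->weight / (double)total_weight;
}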

Thanks.

--
tejun