Re: [RFC 0/3] block: proportional based blk-throttling

From: Vivek Goyal
Date: Fri Jan 22 2016 - 14:09:17 EST


On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > > Hello, Shaohua.
> > >
> > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > > > The thing is that most of the possible contentions can be removed by
> > > > > implementing per-cpu cache which shouldn't be too difficult. 10%
> > > > > extra cost on current gen hardware is already pretty high.
> > > >
> > > > I did think about this. per-cpu cache does sound straightforward, but it
> > > > could severely impact fairness. For example, we give each cpu a budget,
> > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock.
> > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which
> > > > breaks fairness very much. I have no idea how this can be fixed.
> > >
> > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > > practice, this may have to be adjusted down depending on the number of
> > > cgroups performing active IOs. For a given cgroup, B can be
> > > distributed among the CPUs that are actively issuing IOs in that
> > > cgroup. It will degenerate to round robin of small budget if there
> > > are too many active for the budget available but for most cases this
> > > will cut down most of cross-CPU traffic.
> > >
> > > > > They're way more predictable than rotational devices when measured
> > > > > over a period. I don't think we'll be able to measure anything
> > > > > meaningful at individual command level but aggregate numbers should be
> > > > > fairly stable. A simple approximation of IO cost such as fixed cost
> > > > > per IO + cost proportional to IO size would do a far better job than
> > > > > just depending on bandwidth or iops and that requires approximating
> > > > > two variables over time. I'm not sure how easy / feasible that
> > > > > actually would be tho.
> > > >
> > > > It still sounds like IO time, otherwise I can't imagine we can measure
> > > > the cost. If we use some sort of aggregate number, it likes a variation
> > > > of bandwidth. eg cost = bandwidth/ios.
> > >
> > > I think cost of an IO can be approxmiated by a fixed per-IO cost +
> > > cost proportional to the size, so
> > >
> > > cost = F + R * size
> > >
> >
> > Hi Tejun,
> >
> > May be we can throw in a cost differentiation for IO direction also here.
> > This still will not take care of cost based on IO pattern, but that's
> > another level of complexity which can be added to keep track of IO pattern
> > of cgroup and bump up cost accordingly.
> >
> > Here are some random thoughts basically adding some more details to your idea.
> > I am not sure whether it makes sense or not or how difficult it is to
> > implement it.
> >
> > Assume we ensure fairness in a time interval of T and have total of N
> > tokens for IO in that time interval T. When a new inteval starts, we
> > distribute these N tokens to the pending cgroups based on their weight and
> > proportional share. And keep on distributing N tokens after each time
> > interval.
> >
> > We will have to come up with some sort of cost matrix to determine how many
> > tokens should be charged per IO (cost per IO). And how to adjust that cost
> > dynamically.
> >
> > Both N and T will be variable and will have to be adjusted continuously.
> > For N we could start with some initial number. If we distributed too many
> > tokens then device can handle in time T, then in next cycle we will have
> > to reduce the value of N and distribute less tokens. If we distributed
> > too less tokens and device is fast and finished in less time than T,
> > then we can start next cycle sooner and distribute more tokens for next
> > cycle. So based on device throughput in a certain time interval, number
> > of tokens issued for next cycle will vary.
>
> Note, we don't know if we dispatch too many/too less tokens. A device
> with large queue depth can accept all requests. If queue depth is 1,
> things would be easy.

If device accepts too many requests then we will keep on increasing tokens
and cgroups will keep on submitting IOs in the proportion of their weight.
Once queue is full, then we will hit a wall and we will start decreasing
number of tokens. So I guess this should still work.

One problem with deep queue depths will be though that a heavy writer
will be able to fill up the queue in a very short interval and block
small readers behind it. I guess until and unless devices start doing
some prioritization of IO, this problem will be hard to solve. Driving
smaller queue depth is not an option as it makes the bandwidth drop.

>
> > Initially I guess cost could be fixed also. That is say, 5 tokens for each
> > IO plus 1 token for each 4KB of IO size. If we underestimate the cost of
> > IO, then N tokens will not be consumed in time T and next time we will
> > distribute less tokens. If we overestimate the cost of IO, then N tokens
> > will finish fast and next time we will give more. So exact cost of IO
> > might not be a huge factor.
>
> we still need know the R. any idea for this?

Hmm..., thinking loud. Will following work.

Can we keep track of average bw and average iops of the queue. And then
use that to come up with per IO cost and BW cost.

Say average queue bandwidth is ABW and average IOPS is AIOPS.

So in interval T, all cgroup cumulatively can dispatch T * ABW size IO.

A cgroup's fractional cost of IO = IO_size/(T * ABW)

As we are supposed to dispatch N tokens in time T, cgroups cost of IO
in terms of tokens will be

Cgroup_cost_BW = (N * IO_Size)/(T * ABW)

Similarly, a cgroup's per IO cost based on IOPS will be.

Cgroup_Cost_IOPS = (N * 1) /(T * AIOPS)

So per IO per cgroup we could charge following tokens.

Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IO

As we are charging cgroup twice (once based on bandwidth and once based
on IOPS), may be we can half the effective cost.

Effectivey_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IO)/2

Does it make sense?

Thanks
Vivek