Re: [RFC 0/3] block: proportional based blk-throttling

From: Shaohua Li
Date: Fri Jan 22 2016 - 14:46:12 EST


On Fri, Jan 22, 2016 at 02:09:10PM -0500, Vivek Goyal wrote:
> On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> > On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > > > Hello, Shaohua.
> > > >
> > > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > > > > The thing is that most of the possible contentions can be removed by
> > > > > > implementing per-cpu cache which shouldn't be too difficult. 10%
> > > > > > extra cost on current gen hardware is already pretty high.
> > > > >
> > > > > I did think about this. per-cpu cache does sound straightforward, but it
> > > > > could severely impact fairness. For example, we give each cpu a budget,
> > > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock.
> > > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which
> > > > > breaks fairness very much. I have no idea how this can be fixed.
> > > >
> > > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > > > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > > > practice, this may have to be adjusted down depending on the number of
> > > > cgroups performing active IOs. For a given cgroup, B can be
> > > > distributed among the CPUs that are actively issuing IOs in that
> > > > cgroup. It will degenerate to round robin of small budget if there
> > > > are too many active for the budget available but for most cases this
> > > > will cut down most of cross-CPU traffic.
> > > >
> > > > > > They're way more predictable than rotational devices when measured
> > > > > > over a period. I don't think we'll be able to measure anything
> > > > > > meaningful at individual command level but aggregate numbers should be
> > > > > > fairly stable. A simple approximation of IO cost such as fixed cost
> > > > > > per IO + cost proportional to IO size would do a far better job than
> > > > > > just depending on bandwidth or iops and that requires approximating
> > > > > > two variables over time. I'm not sure how easy / feasible that
> > > > > > actually would be tho.
> > > > >
> > > > > It still sounds like IO time, otherwise I can't imagine we can measure
> > > > > the cost. If we use some sort of aggregate number, it likes a variation
> > > > > of bandwidth. eg cost = bandwidth/ios.
> > > >
> > > > I think cost of an IO can be approxmiated by a fixed per-IO cost +
> > > > cost proportional to the size, so
> > > >
> > > > cost = F + R * size
> > > >
> > >
> > > Hi Tejun,
> > >
> > > May be we can throw in a cost differentiation for IO direction also here.
> > > This still will not take care of cost based on IO pattern, but that's
> > > another level of complexity which can be added to keep track of IO pattern
> > > of cgroup and bump up cost accordingly.
> > >
> > > Here are some random thoughts basically adding some more details to your idea.
> > > I am not sure whether it makes sense or not or how difficult it is to
> > > implement it.
> > >
> > > Assume we ensure fairness in a time interval of T and have total of N
> > > tokens for IO in that time interval T. When a new inteval starts, we
> > > distribute these N tokens to the pending cgroups based on their weight and
> > > proportional share. And keep on distributing N tokens after each time
> > > interval.
> > >
> > > We will have to come up with some sort of cost matrix to determine how many
> > > tokens should be charged per IO (cost per IO). And how to adjust that cost
> > > dynamically.
> > >
> > > Both N and T will be variable and will have to be adjusted continuously.
> > > For N we could start with some initial number. If we distributed too many
> > > tokens then device can handle in time T, then in next cycle we will have
> > > to reduce the value of N and distribute less tokens. If we distributed
> > > too less tokens and device is fast and finished in less time than T,
> > > then we can start next cycle sooner and distribute more tokens for next
> > > cycle. So based on device throughput in a certain time interval, number
> > > of tokens issued for next cycle will vary.
> >
> > Note, we don't know if we dispatch too many/too less tokens. A device
> > with large queue depth can accept all requests. If queue depth is 1,
> > things would be easy.
>
> If device accepts too many requests then we will keep on increasing tokens
> and cgroups will keep on submitting IOs in the proportion of their weight.
> Once queue is full, then we will hit a wall and we will start decreasing
> number of tokens. So I guess this should still work.

queue will never be full. Typical application drives < 32 IO depth. NVMe
SSD queue can have 64k queue depth for each hardware queue according to
the spec.
> One problem with deep queue depths will be though that a heavy writer
> will be able to fill up the queue in a very short interval and block
> small readers behind it. I guess until and unless devices start doing
> some prioritization of IO, this problem will be hard to solve. Driving
> smaller queue depth is not an option as it makes the bandwidth drop.

yep, this is the problem. Disk accepts new requests even all pending
requests already exhaust its resources.
> > > Initially I guess cost could be fixed also. That is say, 5 tokens for each
> > > IO plus 1 token for each 4KB of IO size. If we underestimate the cost of
> > > IO, then N tokens will not be consumed in time T and next time we will
> > > distribute less tokens. If we overestimate the cost of IO, then N tokens
> > > will finish fast and next time we will give more. So exact cost of IO
> > > might not be a huge factor.
> >
> > we still need know the R. any idea for this?
>
> Hmm..., thinking loud. Will following work.
>
> Can we keep track of average bw and average iops of the queue. And then
> use that to come up with per IO cost and BW cost.
>
> Say average queue bandwidth is ABW and average IOPS is AIOPS.
>
> So in interval T, all cgroup cumulatively can dispatch T * ABW size IO.
>
> A cgroup's fractional cost of IO = IO_size/(T * ABW)
>
> As we are supposed to dispatch N tokens in time T, cgroups cost of IO
> in terms of tokens will be
>
> Cgroup_cost_BW = (N * IO_Size)/(T * ABW)
>
> Similarly, a cgroup's per IO cost based on IOPS will be.
>
> Cgroup_Cost_IOPS = (N * 1) /(T * AIOPS)
>
> So per IO per cgroup we could charge following tokens.
>
> Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IO
>
> As we are charging cgroup twice (once based on bandwidth and once based
> on IOPS), may be we can half the effective cost.
>
> Effectivey_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IO)/2

So the cost = Cgroup_cost_BW * A + Cgroup_cost_IO * B

bandwidth based: A = 1, B = 0
IOPS based: A = 0, B = 1
The proposal: A = 1/2, B = 1/2

I'm sure people will invent other A/B combinations. It's hard to say
which one is better. Maybe we really should have simple ones first, eg,
either bandwidth based or IOPS based, and have a knob to choose. That
pretty much shows the powerless from kernel side, but that's something
we can offer.

Thanks,
Shaohua