Re: [PATCH] Avoid preferential treatment of groups that aren't backlogged

From: Vivek Goyal
Date: Wed Feb 09 2011 - 22:57:56 EST


On Wed, Feb 09, 2011 at 06:45:25PM -0800, Chad Talbott wrote:
> On Wed, Feb 9, 2011 at 6:09 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > In upstream code, once a group gets backlogged we put it at the end
> > of the tree, not at the beginning. (I am wondering, are you looking
> > at the Google internal code? :-))
> >
> > So I don't think the issue of a low weight group getting more disk
> > time than its fair share is present in upstream kernels.
>
> You've caught me re-using a commit description. :)
>
> Here's an example of the kind of tests that fail without this patch
> (run via the test that Justin and Akshay have posted):
>
> 15:35:35 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
> 15:35:55 INFO Experiment completed in 20.4 seconds
> 15:35:55 INFO experiment 14 achieved DTFs: 886, 113
> 15:35:55 INFO experiment 14 FAILED: max observed error is 64, allowed is 50
>
> 15:35:55 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
> 15:36:16 INFO Experiment completed in 20.5 seconds
> 15:36:16 INFO experiment 15 achieved DTFs: 891, 108
> 15:36:16 INFO experiment 15 FAILED: max observed error is 59, allowed is 50
>
> Since this is Jens' unmodified tree, I've had to change
> BLKIO_WEIGHT_MIN to 10 to allow this test to proceed. We typically
> run many jobs with small weights and achieve the requested isolation;
> see the results below with this patch:
>
> 14:59:17 INFO ----- Running experiment 14: 950 rdrand, 50 rdrand.delay10
> 14:59:36 INFO Experiment completed in 19.0 seconds
> 14:59:36 INFO experiment 14 achieved DTFs: 947, 52
> 14:59:36 INFO experiment 14 PASSED: max observed error is 3, allowed is 50
>
> 14:59:36 INFO ----- Running experiment 15: 950 rdrand, 50 rdrand.delay50
> 14:59:55 INFO Experiment completed in 18.5 seconds
> 14:59:55 INFO experiment 15 achieved DTFs: 944, 55
> 14:59:55 INFO experiment 15 PASSED: max observed error is 6, allowed is 50
>
> As you can see, it's with seeky workloads that come and go from the
> service tree that this patch is required.

I have not looked into or run the tests posted by Justin and Akshay. Can
you give more details about these tests?

Are you running with group_isolation=0 or 1? These tests seem to be
random reads, and if group_isolation=0 (the default), then all the random
read queues should go into the root group and there will be no service
differentiation.
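
To be concrete, here is a toy sketch of that placement rule (this is not
the actual kernel code; the types and helpers are invented for
illustration, only the decision itself mirrors the cfq-iosched.c behavior
described above):

/* Toy model: with group_isolation=0, seeky (sync-noidle) queues are
 * accounted to the root group rather than their own group, so per-group
 * weights never apply to them. Everything here is illustrative. */
#include <stdbool.h>
#include <stdio.h>

enum wl_type { ASYNC_WORKLOAD, SYNC_NOIDLE_WORKLOAD, SYNC_WORKLOAD };

struct cfq_group { const char *name; };

static struct cfq_group root_group = { "root" };

static struct cfq_group *account_group(bool group_isolation,
                                       enum wl_type type,
                                       struct cfq_group *own_group)
{
        /* seeky readers are pulled into the root group unless isolated */
        if (!group_isolation && type == SYNC_NOIDLE_WORKLOAD)
                return &root_group;
        return own_group;
}

int main(void)
{
        struct cfq_group low = { "weight-50" };

        printf("group_isolation=0 -> charged to %s\n",
               account_group(false, SYNC_NOIDLE_WORKLOAD, &low)->name);
        printf("group_isolation=1 -> charged to %s\n",
               account_group(true, SYNC_NOIDLE_WORKLOAD, &low)->name);
        return 0;
}

With isolation off, both of your random readers would land in the root
group and their weights would be irrelevant.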

If you ran different random readers in different groups of different
weights with group_isolation=1, then there is a case for service
differentiation. In that case we will idle for 8ms on each group before
we expire the group. So in these test cases, are the low weight groups
not submitting IO within 8ms? Putting a random reader in a separate group
with think time > 8ms is, I think, going to hurt a lot, because for every
single IO dispatched the group is going to wait for 8ms before it is
expired.
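
Back of the envelope, that idle cost dominates everything else. A small
sketch with assumed numbers (the 0.5ms per-read service time is purely
illustrative):

#include <stdio.h>

int main(void)
{
        double service_ms = 0.5;    /* assumed cost of one random read */
        double group_idle_ms = 8.0; /* idle window before group expiry */

        /* if the reader's think time exceeds the idle window, every IO
         * also costs the disk the full idle window */
        double per_io = service_ms + group_idle_ms;

        printf("disk time per IO: %.1f ms, utilization: %.0f%%\n",
               per_io, 100.0 * service_ms / per_io);
        return 0;
}

That is roughly 8.5ms of disk time burned per IO, so only about 6% of the
group's disk time actually moves data.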

So the only case that comes to my mind where this patch can help is
when there are lots of groups doing IO with different weights. These
groups have think times greater than 8ms and hence get deleted from the
service tree. The next time a low weight group has IO, instead of being
put at the end of the service tree, it might be put even farther back,
allowing a higher weight group to get backlogged ahead of it.
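
Roughly, the re-insertion rule in question looks like this (a simplified
sketch of the upstream cfq_group_service_tree_add() logic; the rbtree is
replaced by an explicit tail pointer and the delay value is illustrative,
not the kernel's):

#include <stdio.h>

#define CFQ_IDLE_DELAY 200 /* illustrative vdisktime bump */

struct cfq_group {
        const char *name;
        unsigned long long vdisktime; /* key on the group service tree */
};

/* Re-queue @cfqg behind the current tail of the tree (@last, may be
 * NULL): a group that went idle loses its old position entirely. */
static void service_tree_add(struct cfq_group *cfqg,
                             struct cfq_group *last,
                             unsigned long long min_vdisktime)
{
        if (last)
                cfqg->vdisktime = last->vdisktime + CFQ_IDLE_DELAY;
        else
                cfqg->vdisktime = min_vdisktime;
}

int main(void)
{
        struct cfq_group heavy = { "weight-950", 1000 };
        struct cfq_group light = { "weight-50", 0 };

        /* the low weight group was deleted from the tree; now it has
         * IO again and re-enters behind the backlogged heavy group */
        service_tree_add(&light, &heavy, 0);
        printf("%s re-enters at vdisktime %llu (tail was %llu)\n",
               light.name, light.vdisktime, heavy.vdisktime);
        return 0;
}

So a group that comes and goes always re-enters at the tail, no matter
what share of service it has received so far.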

Can you run blktrace and verify what's happening?

Thanks
Vivek