Re: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit

From: Jens Axboe
Date: Mon Aug 23 2004 - 10:54:18 EST


On Mon, Aug 23 2004, Marcelo Tosatti wrote:
> On Sun, Aug 22, 2004 at 09:18:51PM +0200, Karl Vogel wrote:
> > When using elevator=as I'm unable to trigger the swap of death, so it seems
> > that the CFQ scheduler is to blame here.
> >
> > With the AS scheduler, the system recovers in roughly 10 seconds; vmstat
> > output during that time:
> >
> > procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 1 0 0 295632 40372 49400 87 278 324 303 1424 784 7 2 78 13
> > 0 0 0 295632 40372 49400 0 0 0 0 1210 648 3 1 96 0
> > 0 0 0 295632 40372 49400 0 0 0 0 1209 652 4 0 96 0
> > 2 0 0 112784 40372 49400 0 0 0 0 1204 630 23 34 43 0
> > 1 9 156236 788 264 8128 28 156220 3012 156228 3748 3655 11 31 0 59
> > 0 15 176656 2196 280 8664 0 20420 556 20436 1108 374 2 5 0 93
> > 0 17 205320 724 232 7960 28 28664 396 28664 1118 503 7 12 0 81
> > 2 12 217892 1812 252 8556 248 12584 864 12584 1495 318 2 7 0 91
> > 4 14 253268 2500 268 8728 188 35392 432 35392 1844 399 3 7 0 90
> > 0 13 255692 1188 288 9152 960 2424 1408 2424 1173 2215 10 5 0 85
> > 0 7 266140 2288 312 9276 604 10468 752 10468 1248 644 5 5 0 90
> > 0 7 190516 340636 348 9860 1400 0 2016 0 1294 817 4 8 0 88
> > 1 8 190516 339460 384 10844 552 0 1556 4 1241 642 3 1 0 96
> > 1 3 190516 337084 404 11968 1432 0 2576 4 1292 788 3 1 0 96
> > 0 6 190516 333892 420 13612 1844 0 3500 0 1343 850 5 2 0 93
> > 0 1 190516 333700 424 13848 480 0 720 0 1250 654 3 2 0 95
> > 0 1 190516 334468 424 13848 188 0 188 0 1224 589 3 2 0 95
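
For reference, the si/so columns above come from the kernel's swap-in/swap-out
page counters. A minimal watcher that prints their per-second deltas could look
like the sketch below (the file name and everything in it are illustrative, not
from the thread, and the counts are in pages rather than KB):

/*
 * swapwatch.c - print per-second swap-in/swap-out page counts,
 * roughly the si/so columns of the vmstat output above.
 * Assumes /proc/vmstat exposes pswpin/pswpout, as 2.6 kernels do.
 */
#include <stdio.h>
#include <unistd.h>

static void read_swap(unsigned long *in, unsigned long *out)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	*in = *out = 0;
	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		/* only the matching line updates each counter */
		sscanf(line, "pswpin %lu", in);
		sscanf(line, "pswpout %lu", out);
	}
	fclose(f);
}

int main(void)
{
	unsigned long in, out, last_in, last_out;

	read_swap(&last_in, &last_out);
	for (;;) {
		sleep(1);
		read_swap(&in, &out);
		printf("si %lu  so %lu (pages/s)\n", in - last_in, out - last_out);
		last_in = in;
		last_out = out;
	}
	return 0;
}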
> >
> > With CFQ, processes got stuck in 'D' state and never left it. See the URLs
> > in my initial post for diagnostics.
>
> I can confirm this on a 512MB box with 512MB swap (2.6.8-rc4). Using CFQ the
> machine swaps out 400MB, with AS it swaps out 30MB.
>
> That leads to allocation failures/etc.
>
> CFQ allocates a huge number of bio/biovecs:
>
> cat /proc/slabinfo | grep bio
> biovec-(256) 256 256 3072 2 2 : tunables 24 12 0 : slabdata 128 128 0
> biovec-128 256 260 1536 5 2 : tunables 24 12 0 : slabdata 52 52 0
> biovec-64 265 265 768 5 1 : tunables 54 27 0 : slabdata 53 53 0
> biovec-16 260 260 192 20 1 : tunables 120 60 0 : slabdata 13 13 0
> biovec-4 272 305 64 61 1 : tunables 120 60 0 : slabdata 5 5 0
> biovec-1 121088 122040 16 226 1 : tunables 120 60 0 : slabdata 540 540 0
> bio 121131 121573 64 61 1 : tunables 120 60 0 : slabdata 1992 1993 0
>
>
> biovec-(256) 256 256 3072 2 2 : tunables 24 12 0 : slabdata 128 128 0
> biovec-128 256 260 1536 5 2 : tunables 24 12 0 : slabdata 52 52 0
> biovec-64 265 265 768 5 1 : tunables 54 27 0 : slabdata 53 53 0
> biovec-16 258 260 192 20 1 : tunables 120 60 0 : slabdata 13 13 0
> biovec-4 257 305 64 61 1 : tunables 120 60 0 : slabdata 5 5 0
> biovec-1 66390 68026 16 226 1 : tunables 120 60 0 : slabdata 301 301 0
> bio 66389 67222 64 61 1 : tunables 120 60 0 : slabdata 1102 1102 0
>
> (these are freed later on, but they are the cause of the thrashing during the
> swap IO).
>
> While AS does:
>
> [marcelo@yage marcelo]$ cat /proc/slabinfo | grep bio
> biovec-(256) 256 256 3072 2 2 : tunables 24 12 0 : slabdata 128 128 0
> biovec-128 256 260 1536 5 2 : tunables 24 12 0 : slabdata 52 52 0
> biovec-64 260 260 768 5 1 : tunables 54 27 0 : slabdata 52 52 0
> biovec-16 280 280 192 20 1 : tunables 120 60 0 : slabdata 14 14 0
> biovec-4 264 305 64 61 1 : tunables 120 60 0 : slabdata 5 5 0
> biovec-1 4478 5424 16 226 1 : tunables 120 60 0 : slabdata 24 24 0
> bio 4525 5002 64 61 1 : tunables 120 60 0 : slabdata 81 82 0
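
A small watcher that samples the active-object counts of the bio and biovec-1
caches (the second column of the slabinfo lines above) once a second makes the
difference between the two schedulers easy to see during a run. This is only a
sketch; the program name is made up and it assumes the 2.6 slabinfo layout in
which the cache name and active_objs are the first two fields:

/*
 * biowatch.c - print the active object counts of the "bio" and
 * "biovec-1" slab caches once a second.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long slab_active(const char *cache)
{
	FILE *f = fopen("/proc/slabinfo", "r");
	char line[256], name[64];
	unsigned long active = 0, tmp;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* first two fields: cache name, active_objs */
		if (sscanf(line, "%63s %lu", name, &tmp) == 2 &&
		    strcmp(name, cache) == 0) {
			active = tmp;
			break;
		}
	}
	fclose(f);
	return active;
}

int main(void)
{
	for (;;) {
		printf("bio %lu  biovec-1 %lu\n",
		       slab_active("bio"), slab_active("biovec-1"));
		sleep(1);
	}
	return 0;
}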
>
>
> The odd thing is that the 400MB swapped out is not reclaimed after exp (the
> 512MB callocator) exits. With AS, almost all swapped-out memory is reclaimed
> on exit.
>
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 492828 13308 320 3716 0 0 0 0 1002 5 0 0 100 0
>
>
> Jens, is this huge number of bio/biovec allocations expected with CFQ? It's
> really, really bad.

Nope, it's not by design :-)

A test case would be nice; then I'll fix it as soon as possible. But
please retest with 2.6.8.1, Marcelo: 2.6.8-rc4 is missing an important
fix to ll_rw_blk that can easily cause this. The first report was against
2.6.8.1, though, so that one puzzles me more.
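
In case it helps, here is a rough sketch of the kind of test case being asked
for, modelled on the 512MB "exp" allocator mentioned above (its source isn't in
this thread, so this is only a guess at its behaviour): calloc() roughly 512MB
in chunks and touch every page, so a 512MB box is pushed into swap.

/*
 * pressure.c - crude memory-pressure test: allocate ~512MB and touch
 * every page so the VM has to swap. All sizes are easy to adjust.
 */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK      (16UL * 1024 * 1024)   /* 16MB per allocation */
#define NR_CHUNKS  32                     /* 32 x 16MB = 512MB   */
#define PAGE_SZ    4096                   /* assume 4K pages     */

int main(void)
{
	char *chunks[NR_CHUNKS];
	unsigned long i, off;

	for (i = 0; i < NR_CHUNKS; i++) {
		chunks[i] = calloc(1, CHUNK);
		if (!chunks[i]) {
			fprintf(stderr, "calloc failed at chunk %lu\n", i);
			return 1;
		}
		/* touch each page so it is actually instantiated */
		for (off = 0; off < CHUNK; off += PAGE_SZ)
			chunks[i][off] = 1;
		printf("allocated %lu MB\n", (i + 1) * (CHUNK >> 20));
	}

	/* walk everything once more to force the older pages back in */
	for (i = 0; i < NR_CHUNKS; i++)
		for (off = 0; off < CHUNK; off += PAGE_SZ)
			chunks[i][off]++;

	for (i = 0; i < NR_CHUNKS; i++)
		free(chunks[i]);
	return 0;
}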

--
Jens Axboe
