Re: swapping and the value of /proc/sys/vm/swappiness
From: Con Kolivas
Date: Thu Sep 16 2004 - 19:27:40 EST
Marcelo Tosatti wrote:
Con!
Spent some time reading your patch...
Great!
Well if "distress" is getting higher (with similar workload/pressure)
that's because the VM is having a harder time freeing pages (priority
increases, distress increases).
You say "distress is getting higher in later kernels". Can you expand
more on that? How did you find this out, and can you be more specific
wrt "later kernels"?
When I say earlier kernels I mean prior to 2.6.8.
I'm still referring to the "hard_swappiness patch" that Florin was using
to fix his problem, which is how it was diagnosed that distress had
increased. I'm sorry if my newer patch confuses the issue.
hard_swappiness effectively changed this:
-distress = 100 >> zone->prev_priority;
-mapped_ratio = (sc->nr_mapped * 100) / total_memory;
-swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
-if (swap_tendency >= 100)
- reclaim_mapped = 1;
into this:
+mapped_ratio = (sc->nr_mapped * 100) / total_memory;
+swap_tendency = mapped_ratio / 2 + vm_swappiness;
+if (swap_tendency >= 100)
+ reclaim_mapped = 1;
This made swap_tendency dependent _only_ on the mapped_ratio. Now if you
load up the same desktop and applications, your mapped_ratio will be
virtually identical regardless of the kernel. If you then copy a large
file or convert a large video file etc., the mapped ratio will be
unchanged. Therefore, if the swapping increased with this workload in
2.6.8 and later kernels but did _not_ increase with hard_swappiness, the
cause must be the "distress" value, which is entirely dependent on
zone->prev_priority. Does that make my conclusion clearer?
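To make the arithmetic concrete, here is a minimal userspace sketch of the two calculations quoted above. It is not kernel code: the function names (swap_tendency_mainline, swap_tendency_hard, reclaim_mapped) are hypothetical, the kernel variables are replaced by plain parameters, and vm_swappiness is a stand-in global set to 60, which I am assuming as the sysctl default.

```c
#include <stdbool.h>

/* Stand-in for the vm_swappiness sysctl (assumed default: 60). */
int vm_swappiness = 60;

/* Mainline-style calculation from the removed ("-") lines.
 * prev_priority is assumed to lie in 0..12, so the shift is well defined. */
int swap_tendency_mainline(int prev_priority, long nr_mapped, long total_memory)
{
    int distress = 100 >> prev_priority;
    int mapped_ratio = (int)((nr_mapped * 100) / total_memory);

    return mapped_ratio / 2 + distress + vm_swappiness;
}

/* hard_swappiness variant from the added ("+") lines: no distress term. */
int swap_tendency_hard(long nr_mapped, long total_memory)
{
    int mapped_ratio = (int)((nr_mapped * 100) / total_memory);

    return mapped_ratio / 2 + vm_swappiness;
}

/* Mapped pages become reclaim candidates once the tendency reaches 100. */
bool reclaim_mapped(int swap_tendency)
{
    return swap_tendency >= 100;
}
```

With 40% of memory mapped and swappiness at 60, the hard_swappiness variant always yields 80, so mapped pages are left alone. The mainline formula yields the same 80 while prev_priority is 12 (distress 0), but 105 once prev_priority drops to 2 (distress 25), which crosses the threshold with no change in mapped_ratio.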
Below here you're referring to my mapped_watermark patch so I'll address
that separately to avoid confusion.
I see you add a "z->nr_unmapped" watermark a bit above "z->pages_high",
and use that to set "pgdat->mapped_nrpages" to what needs to be freed
so z->free_pages reaches "z->nr_unmapped".
And then you use that per-pgdat "mapped_nrpages" count to avoid:
- moving mapped pages to inactive list (wasting the swappiness algorithm)
- swapping out pages at shrink_list
Those two only happen when pgdat->mapped_nrpages is zero, which
becomes true when we go below pages_low.
To summarize, deactivation/swapout of mapped pages only happens when we
go below pages_low in any zone.
Correct?
Yes, apart from one big caveat. Scanning is expensive, so it only scans
at lowest priority (DEF_PRIORITY). If it fails to release enough memory
it simply returns quietly. This means that if vm pressure is hard enough
and occurs frequently/fast enough it will still drop down below
pages_high even if the watermarks have not been re-achieved. Then the
normal algorithm will take over.
Now with v2.6 stock kernel, kswapd will deactivate (using vm_swappiness algorithm)
and swapout pages between the low and high zone watermarks.
That avoids swapping out as hard as possible until we go below pages_low.
IMHO this might be OK for common desktop workloads where people complain
about swap, but might be harmful for other workloads where swapping out
unused anonymous process memory in advance is a _gain_.
As I said, it only does it lightly, and it's tunable.
I don't understand this check in balance_pgdat (kswapd worker function):
+ if (maplimit && sc.nr_mapped * 100 / total_memory > vm_mapped)
+ return 0;
+
So "if no zone is under pages_low, and more than vm_mapped % of RAM
is mapped, bail out."
This will only be hit if "maplimit" is true. This means we have entered
balance_pgdat only due to the unmapped watermark (zone->pages_min * 4).
Here is where the real "tunable" comes into play. If more than
vm_mapped % of RAM is mapped (i.e. application) pages, it will not do
anything at this watermark. By default it is set to 66%. Setting it to 0
disables this patch entirely and makes the vm behave much like
setting swappiness to 100 in mainline.
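As a rough illustration of that check in isolation, here is a userspace sketch. The helper name kswapd_should_bail is hypothetical, the kernel's sc.nr_mapped and total_memory are passed in as plain parameters, and vm_mapped is a stand-in global set to the 66 default mentioned above; the rest of the patch is not modelled.

```c
#include <stdbool.h>

/* Stand-in for the patch's vm_mapped tunable (default 66, per the text). */
int vm_mapped = 66;

/* Mirrors the quoted condition: bail out only when we were woken for the
 * unmapped watermark (maplimit) and the mapped percentage exceeds vm_mapped. */
bool kswapd_should_bail(bool maplimit, long nr_mapped, long total_memory)
{
    return maplimit && nr_mapped * 100 / total_memory > vm_mapped;
}
```

With the default, a system that is 70% mapped bails out of the watermark-triggered pass while one that is 50% mapped does not. With vm_mapped set to 0, any non-zero mapped percentage bails out, so the extra pass never does any work, which is consistent with the setting switching the patch off.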
I still think swapout behaviour can be correctly tuned with vm_swappiness,
and agree with Andrew that we should not change anything in the algorithm
if this can be tuned.
I agree it can be, but something in the logic has definitely changed,
and a different value is not giving users like Florin the desired result
any more.
Cheers,
Con