Re: [PATCH/RFC] mm/swapfile: reduce kswapd overhead by not filling up disks

From: Vlastimil Babka
Date: Mon Dec 21 2015 - 10:58:36 EST


On 12/11/2015 04:09 PM, Christian Borntraeger wrote:
If a user has more than one swap disk with different priorities, the
swap code will fill up the high-prio disk until the last block is
used.
The swap code continues to scan the first disk even when it is
already filling the 2nd or 3rd disk.
When this happens, we have seen kswapd running at 100% CPU, with the
majority of hits in the scanning code of scan_swap_map, even for
non-rotational disks.
For example with 3 disks
disk1 99.9%
disk2 10%
disk3 0%
it will scan the bitmap of disk1 (and since the disk is full, the
cluster optimization does not trigger) for every page that will
likely go to disk2 anyway.

By doing a first scan that only fills disks up to 98%, we force the
swap code to use the 2nd disk slightly earlier, which reduces kswapd
CPU usage significantly. The 2nd scan then fills the remaining 2%,
again starting with the highest-prio disk.

The code does not affect cases where all swap priorities are the
same, unless all disks are about 98% full.
There is one issue with this approach: if there is a mix of same
and different priorities, the code will loop too often due to the
requeue, so an idea for a better fix is welcome.

Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>

IMHO you should resend and CC the relevant people directly (e.g. via ./scripts/get_maintainer.pl), or this might simply get lost in the high-volume mailing lists.

Note that I'm not familiar with this code, but my first thought would be to put a cache with batch refill/free in front of the bitmap. During the "first" round, only consider si's with enough free slots to satisfy a whole batch refill.

---
mm/swapfile.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5887731..d3817cf 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -640,6 +640,7 @@ swp_entry_t get_swap_page(void)
{
struct swap_info_struct *si, *next;
pgoff_t offset;
+ bool first = true;

if (atomic_long_read(&nr_swap_pages) <= 0)
goto noswap;
@@ -653,6 +654,12 @@ start_over:
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
spin_lock(&si->lock);
+ /* at 98% usage, let's try the other swap devices */
+ if (first && si->inuse_pages * 100 > si->pages * 98) {
+ spin_lock(&swap_avail_lock);
+ spin_unlock(&si->lock);
+ goto nextsi;
+ }
if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
spin_lock(&swap_avail_lock);
if (plist_node_empty(&si->avail_list)) {
@@ -692,6 +699,10 @@ nextsi:
if (plist_node_empty(&next->avail_list))
goto start_over;
}
+ if (first) {
+ first = false;
+ goto start_over;
+ }

spin_unlock(&swap_avail_lock);


