Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

From: Buddy Lumpkin
Date: Wed Apr 11 2018 - 02:38:46 EST



> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this. If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way. If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>>
>> The default value provides one kswapd thread per NUMA node, the same
>> it was without the patch. Also, I would point out that just because you devote
>> more threads to kswapd, doesn't mean they are busy. If multiple kswapd threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple
>> extra context switches due to preemption.
>
> [...]
>
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.

If the memory hog is generating enough demand to keep multiple kswapd
tasks busy, then it is generating enough demand to trigger direct
reclaims. Direct reclaim is CPU bound and runs in the context of the
allocating task itself, so the preemptions you are concerned about are
happening anyway.
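
To make the mechanics concrete, here is a simplified sketch (details
elided, not the exact mm/page_alloc.c code) of how an allocation ends
up reclaiming in the caller's context once kswapd falls behind:

	static struct page *alloc_slowpath_sketch(gfp_t gfp_mask,
						  unsigned int order)
	{
		struct page *page;

		/* Kick kswapd; this part is asynchronous. */
		wake_all_kswapds(order, ...);

		page = get_page_from_freelist(...);
		if (page)
			return page;

		/*
		 * kswapd could not keep up, so the allocating task
		 * scans and frees pages itself, on its own CPU time.
		 */
		try_to_free_pages(..., order, gfp_mask, ...);
		return get_page_from_freelist(...);
	}

Either way the cycles get spent; the only question is whether they are
spent in kswapd or injected into whichever task happened to be
allocating at the time.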

> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.

This makes direct reclaim sound like a positive thing overall, and that
is simply not the case. If cleaning is the metaphor for direct reclaim,
then it is cleaning the kitchen with a garden hose. Once the conditions
for direct reclaim are present, it can occur in any task on the system
that happens to allocate memory. It injects latency at random points
and it decreases filesystem throughput.

When software engineers try to build their own cache, I usually try to
talk them out of it. This rarely works, since they usually have reasons
they believe make the project compelling, so I just ask that they
compare their results using direct I/O and a private cache against
simply letting the page cache do its thing. I can't make this pitch any
more, because direct reclaims have too much of an impact on filesystem
throughput.

The only positive thing that direct reclaim provides is a means to
prevent the system from crashing or deadlocking when it runs too low
on memory.

> If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>
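
For reference, the change itself is tiny. An untested sketch against
the current kswapd_run() in mm/vmscan.c would be something like:

	pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
	if (!IS_ERR(pgdat->kswapd))
		/* positive nice: yield to busy user tasks */
		set_user_nice(pgdat->kswapd, 5);

My concern with a nice'd kswapd is the failure mode: if CPU bound tasks
can starve kswapd, the watermarks are breached sooner and every
allocating task falls into direct reclaim, which is exactly the
behavior I am trying to avoid.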