Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

From: Glauber Costa
Date: Mon Nov 19 2012 - 09:00:54 EST


On 11/17/2012 05:21 AM, Anton Vorontsov wrote:
> On Fri, Nov 16, 2012 at 01:57:09PM -0800, David Rientjes wrote:
>>>> I'm wondering if we should have more than three different levels.
>>>>
>>>
>>> In the case I outlined below, for backwards compatibility. What I
>>> actually mean is that memcg *currently* allows arbitrary notifications.
>>> One way to merge those, while moving to a saner 3-point notification, is
>>> to still allow the old writes and fit them in the closest bucket.
>>
>> Yeah, but I'm wondering why three is the right answer.
>
> You were not Cc'ed, so let me repeat why I ended up w/ the levels (not
> necessarily three levels), instead of relying on the 0..100 scale:
>
> The main change is that I decided to go with discrete levels of the
> pressure.
>
> When I started writing the man page, I had to describe the 'reclaimer
> inefficiency index', and while doing this I realized that I'm describing
> how the kernel is doing the memory management, which we try to avoid in
> the vmevent. And applications don't really care about these details:
> reclaimers, their inefficiency indexes, scanning window sizes, priority
> levels, etc. -- it's all "not interesting", and purely the kernel's
> business. So I guess Mel Gorman was right, we need some sort of levels.
>
> What applications (well, activity managers) are really interested in is
> this:
>
> 1. Do we sacrifice resources for new memory allocations (e.g. the file
> cache)?
> 2. Does the cost of new memory allocations become too high, and does the
> system suffer because of it?
> 3. Are we about to OOM soon?
>
> And here are the answers:
>
> 1. VMEVENT_PRESSURE_LOW
> 2. VMEVENT_PRESSURE_MED
> 3. VMEVENT_PRESSURE_OOM
>
> There is no "high" pressure, since I really don't see any definition of
> it, but it's possible to introduce new levels without breaking ABI.
>
> Later I came up with the fourth level:
>
> Maybe it makes sense to implement something like PRESSURE_MILD/BALANCE
> with an additional nr_pages threshold, which basically hints to the
> kernel how many easily reclaimable pages userland has (that would be a
> part of our definition for the mild/balance pressure level).
>
> I.e. the fourth level can serve as a two-way communication w/ the kernel.
> But again, this would be just an extension, I don't want to introduce this
> now.
>
>>>> Umm, why do users of cpusets not want to be able to trigger memory
>>>> pressure notifications?
>>>>
>>> Because cpusets only deal with memory placement, not memory usage.
>>
>> The set of nodes that a thread is allowed to allocate from may face memory
>> pressure up to and including oom while the rest of the system may have a
>> ton of free memory. Your solution is to compile and mount memcg if you
>> want notifications of memory pressure on those nodes. Others in this
>> thread have already said they don't want to rely on memcg for any of this
>> and, as Anton showed, this can be tied directly into the VM without any
>> help from memcg as it sits today. So why implement a simple and clean
>
> You meant 'why not'?
>
>> mempressure cgroup that can be used alone or co-existing with either memcg
>> or cpusets?
>>
>>> And it is not that moving a task to a cpuset prevents you from doing
>>> any of this: you could, as long as the same set of tasks is placed in
>>> a corresponding memcg.
>>>
>>
>> Same thing with a separate mempressure cgroup. The point is that there
>> will be users of this cgroup that do not want the overhead imposed by
>> memcg (which is why it's disabled in defconfig) and there's no direct
>> dependency that causes it to be a part of memcg.
>
> There's also an API "inconvenience issue" with memcg's usage_in_bytes
> stuff: applications have a hard time resetting the threshold to 'emulate'
> the pressure notifications, and they also have to count bytes (like 'total
> - used = free') to set the threshold. A separate 'pressure' notification,
> by contrast, shows exactly what apps actually want to know: the pressure.
>

Anton,

The API you propose is far superior to memcg's current interface IMHO.
That is why my proposal is to move memcg to yours, and deprecate the old
interface.

We can do this easily by still allowing the old threshold writes, and
then moving them to the closest pressure bucket. This is more or less
what was done for timers to reduce wakeups.
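
To make the "closest bucket" idea concrete, here is a minimal sketch, not
the actual memcg code. It assumes the old threshold is written in bytes
against a known limit, and that each level has a representative point
expressed as a percentage of that limit. The enum names, the 50/80/100
percentages, and closest_level() are all hypothetical; nothing in this
thread defines them.

```c
#include <assert.h>

/* Hypothetical names mirroring the three levels discussed in the thread. */
enum pressure_level {
	PRESSURE_LOW,
	PRESSURE_MED,
	PRESSURE_OOM,
	NR_PRESSURE_LEVELS
};

/*
 * Assumed representative point of each level, as a percentage of the
 * limit. These values are made up for illustration only.
 */
static const unsigned int level_point[NR_PRESSURE_LEVELS] = { 50, 80, 100 };

/*
 * Map an arbitrary legacy usage threshold (in bytes) to the level whose
 * representative point is nearest. Ties go to the lower level.
 * Note: threshold * 100 can overflow for limits above ULLONG_MAX / 100;
 * a real implementation would have to guard against that.
 */
static enum pressure_level closest_level(unsigned long long threshold,
					 unsigned long long limit)
{
	unsigned int pct, diff, best_diff = ~0U;
	enum pressure_level best = PRESSURE_LOW;
	int i;

	assert(limit > 0);

	if (threshold > limit)
		threshold = limit;
	pct = (unsigned int)(threshold * 100 / limit);

	for (i = 0; i < NR_PRESSURE_LEVELS; i++) {
		diff = pct > level_point[i] ? pct - level_point[i]
					    : level_point[i] - pct;
		if (diff < best_diff) {
			best_diff = diff;
			best = (enum pressure_level)i;
		}
	}
	return best;
}
```

With these assumed points, a legacy threshold at 70% of the limit would
land in the medium bucket and one at 95% in the OOM bucket, so old users
keep working while only three distinct events are ever delivered.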

What I noted in a previous e-mail is that memcg triggers notifications
based on "usage" *before* the per-cpu stock is drained. This means the
usage figure can be wrong by as much as 32 * NR_CPUS * PAGE_SIZE, and so
far nobody has seemed to care.
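
For a sense of scale, the bound above can be worked out for a couple of
machine sizes. This is only arithmetic on the formula quoted in this
mail; the function name and the constant name are made up for the
example, and the CPU counts and 4 KiB page size are assumed inputs.

```c
#include <assert.h>

/* Pages that may sit in each CPU's stock, per the formula in this mail. */
#define STOCK_PAGES_PER_CPU 32ULL

/*
 * Worst-case gap, in bytes, between the reported usage and the real
 * usage when every CPU holds a full, undrained stock.
 */
static unsigned long long max_usage_error_bytes(unsigned int nr_cpus,
						unsigned long page_size)
{
	assert(nr_cpus > 0 && page_size > 0);
	return STOCK_PAGES_PER_CPU * nr_cpus * page_size;
}
```

With 4 KiB pages, max_usage_error_bytes(4, 4096) is 524288 bytes
(512 KiB) and max_usage_error_bytes(64, 4096) is 8388608 bytes (8 MiB).
Snapping thresholds to a few coarse buckets adds imprecision of a similar
kind to what the existing interface already tolerates.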


