Re: [patch 00/11] userspace out of memory handling

From: Jianguo Wu
Date: Tue Mar 11 2014 - 08:06:14 EST


On 2014/3/6 10:52, David Rientjes wrote:

> On Wed, 5 Mar 2014, Andrew Morton wrote:
>
>>> This patchset introduces a standard interface through memcg that allows
>>> both of these conditions to be handled in the same clean way: users
>>> set memory.oom_reserve_in_bytes to define the reserve, and that
>>> amount is allowed to be overcharged to the memcg of the process
>>> handling the oom condition. If used with the root memcg, this amount
>>> is allowed to be allocated below the per-zone watermarks for root
>>> processes that are handling such conditions (only root may write to
>>> cgroup.event_control for the root memcg).
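For concreteness, registration might look like the following minimal C
sketch. memory.oom_reserve_in_bytes is the file proposed by this
patchset, the eventfd registration via cgroup.event_control is the
existing memcg notification API, /sys/fs/cgroup/memory is an assumed
mount point, and error handling is abbreviated:

/* Minimal sketch: register as the userspace oom handler for a memcg.
 * memory.oom_reserve_in_bytes is the file proposed by this patchset;
 * the eventfd registration is the existing memcg oom notification
 * API.  Error handling is abbreviated. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define MEMCG "/sys/fs/cgroup/memory"	/* assumed mount point */

static int register_oom_handler(void)
{
	char buf[64];
	int efd, ofd, cfd;

	/* allow this handler to overcharge up to 32MB while handling oom */
	cfd = open(MEMCG "/memory.oom_reserve_in_bytes", O_WRONLY);
	if (cfd < 0)
		return -1;
	write(cfd, "33554432", 8);
	close(cfd);

	efd = eventfd(0, 0);
	ofd = open(MEMCG "/memory.oom_control", O_RDONLY);
	cfd = open(MEMCG "/cgroup.event_control", O_WRONLY);
	if (efd < 0 || ofd < 0 || cfd < 0)
		return -1;

	/* "<eventfd> <fd of memory.oom_control>" arms the notification */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, strlen(buf));
	close(cfd);

	return efd;	/* reads on this fd block until an oom event fires */
}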
>>
>> If process A is trying to allocate memory, cannot do so and the
>> userspace oom-killer is invoked, there must be means via which process
>> A waits for the userspace oom-killer's action.
>
> It does so by relooping in the page allocator, waiting for memory to be
> freed, just as it would if the kernel oom killer were called and process
> A were waiting for the oom kill victim, process B, to exit. We don't
> have the ability to put it on a waitqueue because we don't touch the
> freeing hotpath. The userspace oom handler may not even necessarily
> kill anything; it may be able to free its own memory and start
> throttling other processes, for example.
>
>> And there must be
>> fallbacks which occur if the userspace oom killer fails to clear the
>> oom condition, or times out.
>>
>
> I agree completely and proposed this before as memory.oom_delay_millisecs
> at http://lwn.net/Articles/432226, which we use internally when memory
> can't be freed or a memcg's limit cannot be expanded. I guess it makes
> more sense alongside the rest of this patchset now; I can add it as an
> additional patch next time around.
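Assuming that tunable keeps the name and semantics from the patch above
(the kernel oom killer takes over once the delay expires), arming the
fallback would be a single write, as in this sketch:

/* Sketch: arm the proposed fallback so the kernel oom killer takes
 * over if the userspace handler makes no progress within 10 seconds.
 * Assumes memory.oom_delay_millisecs from the patch above is merged. */
#include <fcntl.h>
#include <unistd.h>

static void arm_oom_fallback(void)
{
	int fd = open("/sys/fs/cgroup/memory/memory.oom_delay_millisecs",
		      O_WRONLY);

	if (fd >= 0) {
		write(fd, "10000", 5);
		close(fd);
	}
}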
>
>> Would be interested to see a description of how all this works.
>>
>
> There's also an LWN article being developed on this topic. As
> mentioned in that article, I think it would be best to generalize a lot of
> the common functions and the eventfd handling entirely into a library.
> I've attached an example implementation that just invokes a function to
> handle the situation.
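To make the library idea concrete, here is a minimal sketch of what its
core loop could look like; oom_events_run() and the callback type are
hypothetical names, not taken from the attached example. A read on the
eventfd blocks until the kernel signals an oom event, then a
user-supplied policy callback runs:

/* Hypothetical sketch of a library core loop: block on the registered
 * eventfd and dispatch each oom event to a user-supplied callback. */
#include <stdint.h>
#include <unistd.h>

typedef void (*oom_handler_fn)(void *private);

static int oom_events_run(int efd, oom_handler_fn handler, void *private)
{
	uint64_t events;

	for (;;) {
		/* blocks until the kernel signals an oom condition */
		if (read(efd, &events, sizeof(events)) != sizeof(events))
			return -1;
		handler(private);
	}
}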
>
> For Google's usecase specifically, at the root memcg level (system oom) we
> want to do priority based memcg killing. We want to kill from within a
> memcg hierarchy that has the lowest priority relative to other memcgs.
> This cannot be implemented with /proc/pid/oom_score_adj today. Those
> priorities may also change depending on whether a memcg hierarchy is
> "overlimit", i.e. its limit has been increased temporarily because it has
> hit a memcg oom and additional memory is readily available on the system.
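A sketch of what such a priority policy could look like in the handler
follows. Note the kernel exposes no per-memcg priority file, so
lookup_priority() below is a hypothetical stub standing in for a policy
table the handler itself maintains:

/* Sketch of a priority-based system oom policy: kill everything in the
 * lowest-priority memcg hierarchy. */
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stub: the kernel exposes no per-memcg priority, so a
 * real handler would consult a policy table it maintains itself. */
static int lookup_priority(const char *memcg)
{
	(void)memcg;
	return 0;
}

static void kill_lowest_priority(const char **memcgs, int count)
{
	char path[256], line[32];
	const char *victim = NULL;
	int i, lowest = INT_MAX;
	FILE *f;

	for (i = 0; i < count; i++) {
		int prio = lookup_priority(memcgs[i]);

		if (prio < lowest) {
			lowest = prio;
			victim = memcgs[i];
		}
	}
	if (!victim)
		return;

	/* kill every thread group attached to the victim hierarchy */
	snprintf(path, sizeof(path), "%s/cgroup.procs", victim);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		kill((pid_t)atoi(line), SIGKILL);
	fclose(f);
}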
>
> So why not just introduce a memcg tunable that specifies a priority?
> Well, it's not that simple. Other users will want to implement different
> policies on system oom (think about things like existing panic_on_oom or
> oom_kill_allocating_task sysctls). I introduced oom_kill_allocating_task
> originally for SGI because they wanted a fast oom kill rather than an
> expensive tasklist scan: the allocating task itself is rather irrelevant;
> it was just the unlucky task that was allocating at the moment oom was
> triggered. What's guaranteed in that case is that killing current will
> always free memory for the context that is oom (it's not a member of some
> other mempolicy or cpuset that would be needlessly killed). Both sysctls
> could trivially be reimplemented in userspace with this feature.
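As one example of such a reimplementation, here is a minimal sketch of
a panic_on_oom analogue, written as a callback for the hypothetical
loop sketched earlier; it assumes the handler runs with CAP_SYS_BOOT:

/* Sketch: a userspace analogue of the panic_on_oom sysctl.  Instead of
 * letting the kernel pick a victim, reboot on a system oom event. */
#include <unistd.h>
#include <sys/reboot.h>

static void oom_panic_policy(void *unused)
{
	(void)unused;
	sync();			/* flush what we can first */
	reboot(RB_AUTOBOOT);	/* then restart the machine */
}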
>
> I have other customers who don't run in a memcg environment at all; they
> simply reattach all processes to root and delete all other memcgs. These
> customers are only concerned about system oom conditions and want to do
> something "interesting" before a process is killed. Some want to log the
> VM statistics as an artifact to examine later, some want to examine heap
> profiles, others can start throttling and freeing memory rather than kill
> anything. All of this is impossible today because the kernel oom killer
> will simply kill something immediately and any stats we collect afterwards
> don't represent the oom condition. The heap profiles are lost, throttling
> is useless, etc.
>
> Jianguo (cc'd) may also have usecases not described here.
>

I want to log memory usage (slabinfo, vmalloc info, page-cache info, etc.)
before killing anything.
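A minimal sketch of that logging step: snapshot /proc memory state to a
log before anything is killed. Note that even stdio allocates here,
which is exactly what the proposed oom reserve is meant to guarantee
room for (the paths below require root):

/* Sketch: dump /proc memory statistics to a log before taking any
 * kill decision, so the stats still reflect the oom condition. */
#include <stdio.h>

static void dump_file(FILE *out, const char *path)
{
	char buf[4096];
	size_t n;
	FILE *in = fopen(path, "r");

	if (!in)
		return;
	fprintf(out, "==> %s <==\n", path);
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
		fwrite(buf, 1, n, out);
	fclose(in);
}

static void log_oom_state(const char *logpath)
{
	FILE *out = fopen(logpath, "a");

	if (!out)
		return;
	dump_file(out, "/proc/meminfo");
	dump_file(out, "/proc/slabinfo");
	dump_file(out, "/proc/vmallocinfo");
	fclose(out);
}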

>> It is unfortunate that this feature is memcg-only. Surely it could
>> also be used by non-memcg setups. Would like to see at least a
>> detailed description of how this will all be presented and implemented.
>> We should aim to make the memcg and non-memcg userspace interfaces and
>> user-visible behaviour as similar as possible.
>>
>
> It's memcg-only because it can handle both system and memcg oom conditions
> with the same clean interface. It would be possible to implement only
> system oom condition handling through procfs (a little sloppy since it
> needs to register the eventfd), but then a userspace oom handler would
> need to determine which interface to use based on whether it was running
> in a memcg or non-memcg environment. I implemented this feature with
> userspace in mind: I didn't want it to need two different implementations
> of the same thing depending on memcg. The way it is written, a userspace
> oom handler does not know (and need not care) whether it is constrained by
> the amount of system RAM or by a memcg limit. It can simply write the
> reserve to its memcg's memory.oom_reserve_in_bytes, attach to
> memory.oom_control, and be done.
>
> This does mean that memcg needs to be enabled for the support, though.
> This is already done on most distributions; the cgroup just needs to be
> mounted. Would it be better to duplicate the interface in two different
> spots depending on CONFIG_MEMCG? I didn't think so, and I think the idea
> of a userspace library that takes care of this registration (and mounting,
> perhaps) proposed on LWN would be the best of both worlds.
>
>> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
>> I'll cherrypick those, OK?
>>
>
> Ok! I'm hoping that the PF_MEMPOLICY bit that is removed in those patches
> is at least temporarily reserved for the PF_OOM_HANDLER introduced here; I
> removed it purposefully :)


