Re: [dm-devel] System freezes after OOM

From: Ondrej Kozina
Date: Thu Jul 14 2016 - 06:52:03 EST


On 07/13/2016 05:02 PM, Mikulas Patocka wrote:


On Wed, 13 Jul 2016, Michal Hocko wrote:

On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
The problem of swapping to dm-crypt is this.

The free memory goes low, kswapd decides that some page should be swapped
out. However, when you swap to an encrypted device, writeback of each page
requires another page to hold the encrypted data. dm-crypt uses mempools
for all its structures and pages, so that it can make forward progress
even if there is no memory free. However, the mempool code first allocates
from the general memory allocator and resorts to the mempool reserve only
when that allocation fails (i.e. when free memory is low).
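
For illustration, a minimal sketch of that extra allocation - the helper
name below is made up, the real code is crypt_alloc_buffer() in
drivers/md/dm-crypt.c:

#include <linux/mempool.h>
#include <linux/gfp.h>

/*
 * Sketch only: every page being written to the encrypted device needs a
 * second page to hold the ciphertext, and that page comes from a mempool
 * whose allocation path tries the page allocator first.
 */
static struct page *alloc_ciphertext_page(mempool_t *page_pool)
{
	/*
	 * mempool_alloc() asks the regular page allocator first and only
	 * falls back to the preallocated reserve when that fails, so a
	 * swapout under memory pressure still consumes fresh pages.
	 */
	return mempool_alloc(page_pool, GFP_NOWAIT | __GFP_HIGHMEM);
}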

OK, thanks for the clarification. I guess the core part happens in
crypt_alloc_buffer, right?

So every attempt to swap out some page allocates another page.

As long as swapping is in progress, the free memory is below the limit
(because the swapping activity itself consumes any memory over the limit).
And that triggered the OOM killer prematurely.

I am not sure I understand the last part. Are you saying that we trigger
OOM because the initiated swapout will not be able to finish the IO and
thus release the page in time?

On kernel 4.6, premature OOM is triggered simply because the free memory
stays below the limit for some time.

On kernel 4.7-rc6 (that contains your OOM patch
0a0337e0d1d134465778a16f5cbea95086e8e9e0), OOM is not triggered, but the
machine slows down to a crawl, because the allocator applies throttling to
the process that is doing the encryption.

The oom detection waits for an ongoing writeout if there is no
reclaim progress and at least half of the reclaimable memory is either
dirty or under writeback. Pages under swapout are marked as under
writeback AFAIR. The writeout path (dm-crypt worker in this case) should
be able to allocate memory from the mempool, hand it over to the crypto
layer and finish the IO. Is it possible this might take a lot of time?

See the backtrace below - the dm-crypt worker is making progress, but the
memory allocator deliberately stalls it in throttle_vm_writeout.

On Tue, 12 Jul 2016, Michal Hocko wrote:

Look at the amount of free memory. It is completely depleted. So it
smells like a process which has access to memory reserves has consumed
all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
context which went off the leash.

It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior
to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never
exhausted reserved memory. With this commit, mempool allocations drop
__GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC,
they can bypass all limits).

Hmm, but the patch allows access to the memory reserves only when the
pool is empty. And even then the caller would have to request access to
the reserves explicitly, either by __GFP_MEMALLOC or PF_MEMALLOC. That

PF_MEMALLOC is set when you enter the block driver when swapping. So, some
of the mempool allocations are done with PF_MEMALLOC.

doesn't seem to be the case for the dm-crypt, though. Or do you suspect
that some other mempool user might be doing so?

Bisection showed that that patch triggered the dm-crypt swapping problems.

Without the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
1. allocates memory from the page allocator, up to the __GFP_NOMEMALLOC limit
2. allocates memory from the mempool reserve
3. waits until some objects are returned to the mempool

With the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
1. allocates memory from the page allocator, up to the __GFP_NOMEMALLOC limit
2. allocates memory from the mempool reserve
3. allocates all remaining memory, until total exhaustion
4. waits until some objects are returned to the mempool
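
The difference lives in how mempool_alloc() builds the gfp mask for its
first pass into the page allocator. A simplified sketch - the helper name
is invented, see mm/mempool.c for the real code:

#include <linux/types.h>
#include <linux/mempool.h>
#include <linux/gfp.h>

static gfp_t mempool_first_pass_mask(mempool_t *pool, gfp_t gfp_mask,
				     bool with_f9054c70)
{
	/* the first pass never sleeps; it is only an opportunistic attempt */
	gfp_t gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);

	if (!with_f9054c70 || pool->curr_nr > 0)
		gfp_temp |= __GFP_NOMEMALLOC;	/* stay out of the reserves */

	/*
	 * Without __GFP_NOMEMALLOC, a PF_MEMALLOC caller (the swapout path)
	 * may drain the memory reserves completely in step 3 above before
	 * it finally sleeps waiting for elements to come back to the pool.
	 */
	return gfp_temp;
}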

No, this doesn't sound like a proper solution. The current decision
logic, as explained above, relies on the feedback from the reclaim. Free
swap space doesn't really mean we can make forward progress.

I'm interested - why would you need to trigger the OOM killer if there is
free swap space?

Let me clarify. If there is swappable memory then of course we shouldn't
normally trigger the OOM killer. And that should be the case with the
current implementation. We just rely on the swapout making some progress

And what exactly does "making some progress" mean? How do you measure it?

and back off only if that is not the case after several attempts, with
throttling based on the writeback counters. Checking the available swap
space doesn't guarantee forward progress, though. If the swapout is
stuck for some reason then it should be safer to trigger the OOM killer
rather than wait or thrash forever (or for an excessive amount of time).

Now, I can see that the retry logic might need some tuning for complex
setups like dm-crypt swap partitions, because the progress might be much
slower there. But I would like to understand what the worst-case estimate
for the swapout path is, with all the roadblocks on the way for this setup,
before we can think of a proper retry logic tuning.

For example, you could increment a percpu counter each time writeback of a
page finishes. If the counter stays the same for some pre-determined
period of time, writeback is stuck and you could trigger the OOM killer
early.
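
Something like the sketch below - purely hypothetical, none of these
symbols exist in the kernel, it only illustrates the idea:

#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, writeback_done_count);

/* to be called from the page writeback completion path */
static void note_writeback_done(void)
{
	this_cpu_inc(writeback_done_count);
}

/* returns true if no writeback completed since the last call */
static bool writeback_is_stuck(unsigned long *last_seen)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(writeback_done_count, cpu);

	if (sum != *last_seen) {	/* some writeback completed */
		*last_seen = sum;
		return false;
	}
	return true;			/* no completions since last check */
}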

But the memory management code doesn't do that. So what does it really do,
and what is the intention behind it?

Another question is - do we really want to try to recover in case of stuck
writeback? If the swap device dies so that it stops processing I/Os, the
system is dead anyway - there is no point in trying to recover it by
killing processes.

The question is if these safeguards against stuck writeback are really
doing more harm than good. Do you have some real use case where you get
stuck writeback and where you need to recover by OOM killing?

This is not the first time I've seen premature OOM. A long time ago, I saw
a case where the admin set /proc/sys/vm/swappiness to a low value (because
he was running some scientific calculations on the machine and he preferred
memory being allocated to those calculations rather than to the cache) -
and the result was premature OOM killing while the machine had plenty of
free swap space.

Kernel 4.7-rc almost deadlocks in another way. The machine got stuck
and the following stacktrace was obtained while swapping to dm-crypt.

We can see that dm-crypt does a mempool allocation. But the mempool
allocation somehow falls into throttle_vm_writeout. There, it waits for
0.1 seconds. So, as a result, the dm-crypt worker thread ends up
processing requests at an unusually slow rate of 10 requests per second,
and that results in the machine being stuck (it would probably recover if
we waited for an extreme amount of time).
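
The 0.1 second figure is the congestion_wait(BLK_RW_ASYNC, HZ/10) in
throttle_vm_writeout()'s loop, which looks roughly like this in 4.7
(simplified, from memory - not a verbatim copy of mm/page-writeback.c):

	unsigned long background_thresh, dirty_thresh;

	for ( ; ; ) {
		global_dirty_limits(&background_thresh, &dirty_thresh);
		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
		dirty_thresh += dirty_thresh / 10;	/* a little headroom */

		if (global_page_state(NR_UNSTABLE_NFS) +
		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;
		/* the 0.1 second stall the dm-crypt worker keeps hitting */
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		/* (the real loop also bails out for callers without __GFP_FS/__GFP_IO) */
	}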

Hmm, that throttling has basically been there forever. I do not see what
would have changed that recently, but I haven't looked too closely to be
honest.

I agree that throttling a flusher (which this worker definitely is)
doesn't look like a correct thing to do. We have PF_LESS_THROTTLE for
this kind of thing. So maybe the right thing to do is to use this flag
for the dm_crypt worker:

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 4f3cb3554944..0b806810efab 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void kcryptd_crypt(struct work_struct *work)
 {
 	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	unsigned int pflags = current->flags;
 
+	current->flags |= PF_LESS_THROTTLE;
 	if (bio_data_dir(io->base_bio) == READ)
 		kcryptd_crypt_read_convert(io);
 	else
 		kcryptd_crypt_write_convert(io);
+	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
 }
 
 static void kcryptd_queue_crypt(struct dm_crypt_io *io)

^^^ That fixes just one specific case - but there may be other threads
doing mempool allocations in the device mapper subsystem - and you would
need to mark all of them.

I would try the patch below - generally, allocations from the mempool
subsystem should not wait in the memory allocator at all. I don't know if
there are other cases where these allocations can sleep. I'm interested in
whether it fixes Ondrej's case - or whether it uncovers some other sleeping.

An alternate possibility would be to drop the flag __GFP_DIRECT_RECLAIM in
mempool_alloc - so that mempool allocations never sleep in the allocator.
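
Roughly, that alternative would amount to one line at the top of
mempool_alloc() - a sketch only, untested:

	/*
	 * Sketch, not a tested patch: masking the flag off here would make
	 * every pass into the page allocator non-sleeping, so a failed
	 * allocation falls back to the pool reserve (and eventually to the
	 * mempool wait queue) instead of stalling in reclaim throttling.
	 */
	gfp_mask &= ~__GFP_DIRECT_RECLAIM;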

Good news (I hope). With Mikulas's patch below I'm able to run the test without the utility getting OOM-killed. Nor does the system livelock for dozens of minutes as it did with plain 4.7.0-rc6. Here's the syslog:

https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/0/4.7.0-rc7+.log

Just for the record, the test utility allocates memory: more than the physical RAM installed, but much less than total RAM including swap.
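
A hypothetical stand-in for the utility - not the exact code I use, just
the shape of it, and the size below is only an example:

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	size_t page = sysconf(_SC_PAGESIZE);
	/* pick something above physical RAM but below RAM + swap */
	size_t total = (size_t)6 * 1024 * 1024 * 1024;
	char *buf = malloc(total);

	if (!buf)
		return 1;
	for (size_t off = 0; off < total; off += page)
		buf[off] = 1;	/* dirty every page so it must be backed by RAM or swap */
	pause();		/* hold on to the memory */
	return 0;
}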

As you can see, the swap fills slowly as expected. I wouldn't say it's an ideal fix, since the system is not very responsive during the test, but I'd still call it progress.

Regards O.


---
mm/page-writeback.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: linux-4.7-rc7/mm/page-writeback.c
===================================================================
--- linux-4.7-rc7.orig/mm/page-writeback.c 2016-07-12 20:57:53.000000000 +0200
+++ linux-4.7-rc7/mm/page-writeback.c 2016-07-12 20:59:41.000000000 +0200
@@ -1945,6 +1945,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 
+	/*
+	 * If we came here from mempool_alloc, we don't want to wait 0.1s.
+	 * We want to fail as soon as possible, so that the allocation is tried
+	 * from mempool reserve.
+	 */
+	if (unlikely(gfp_mask & __GFP_NORETRY))
+		return;
+
 	for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);