Re: Tasks stuck jbd2 for a long time

From: Theodore Ts'o
Date: Thu Aug 17 2023 - 21:20:42 EST


On Thu, Aug 17, 2023 at 11:59:03AM -0700, Bhatnagar, Rishabh wrote:
>
> I think I found the thread that is holding the transaction handle. It seems
> to be in runnable state though.

This looks like it's a case of livelock...

> It has the journal_info set to the journal handle that has the matching
> transaction as the journal's running transaction.
> Here is the associated stack trace. It is converting unwritten extents to
> extents.
>
> PID: 287    TASK: ffff976801890000  CPU: 20  COMMAND: "kworker/u96:35"
>  #0 [ffffbd6c40b3f498] __schedule+617 at ffffffffbb912df9
>  #1 [ffffbd6c40b3f530] _cond_resched+38 at ffffffffbb9133e6
>  #2 [ffffbd6c40b3f538] shrink_lruvec+670 at ffffffffbb244d1e
>  #3 [ffffbd6c40b3f640] _cond_resched+21 at ffffffffbb9133d5
>  #4 [ffffbd6c40b3f648] shrink_node+552 at ffffffffbb2452a8
>  #5 [ffffbd6c40b3f6c8] do_try_to_free_pages+201 at ffffffffbb245829
>  #6 [ffffbd6c40b3f718] try_to_free_pages+239 at ffffffffbb246c0f
>  #7 [ffffbd6c40b3f7b0] __alloc_pages_slowpath.constprop.114+913 at
> ffffffffbb28d741
>  #8 [ffffbd6c40b3f890] __alloc_pages_nodemask+679 at ffffffffbb28e2e7
>  #9 [ffffbd6c40b3f900] allocate_slab+726 at ffffffffbb2b0886
> #10 [ffffbd6c40b3f958] ___slab_alloc+1173 at ffffffffbb2b3ff5
> #11 [ffffbd6c40b3f988] insert_revoke_hash+37 at ffffffffc016f435 [jbd2] 
> (/home/ec2-user/linux/fs/jbd2/revoke.c:146)

insert_revoke_hash() is trying to allocate a 48-byte structure using
kmem_cache_alloc() with the __GFP_NOFAIL bit set.  (We use
__GFP_NOFAIL because if the memory allocation fails, the only
recourse we have is to shut down the journal and force the file
system read-only --- or crash the system, of course.)
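
For reference, the allocation in question looks roughly like this
(paraphrased from memory of fs/jbd2/revoke.c; the exact details may
differ in your 5.10 tree):

static int insert_revoke_hash(journal_t *journal,
                              unsigned long long blocknr, tid_t seq)
{
        struct list_head *hash_list;
        struct jbd2_revoke_record_s *record;
        gfp_t gfp_mask = GFP_NOFS;

        if (journal_oom_retry)
                gfp_mask |= __GFP_NOFAIL;
        /*
         * The 48-byte revoke record.  With __GFP_NOFAIL set this can
         * never return NULL --- the allocator retries internally
         * instead of failing.
         */
        record = kmem_cache_alloc(jbd2_revoke_record_cache, gfp_mask);
        if (!record)
                return -ENOMEM;

        record->sequence = seq;
        record->blocknr = blocknr;
        hash_list = hash(journal, blocknr);
        spin_lock(&journal->j_revoke_lock);
        list_add(&record->hash, hash_list);
        spin_unlock(&journal->j_revoke_lock);
        return 0;
}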

Since __GFP_NOFAIL is set, and the memory allocator apparently can't
find even a single free page for the slab allocator, it just keeps
trying to reclaim memory --- and failing miserably.
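
To see why that turns into a livelock rather than an allocation
failure, the page allocator's slow path conceptually does something
like this for __GFP_NOFAIL requests (heavily simplified; the helper
names are placeholders, not the real functions in mm/page_alloc.c):

/* __alloc_pages_slowpath(), heavily simplified sketch */
retry:
        page = try_harder_to_allocate();  /* placeholder: freelists, reclaim, compaction */
        if (page)
                goto got_pg;
        maybe_invoke_oom_killer();        /* placeholder: may make no progress at all */
        if (gfp_mask & __GFP_NOFAIL) {
                cond_resched();           /* let other tasks run ... */
                goto retry;               /* ... then try again, forever */
        }
        /* only failable allocations give up and return NULL here */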

Hmm... something that might be worth trying is running the job in a
memcg, since how the kernel handles OOM, and how it handles OOM
kills, will differ depending on whether it is being constrained by
container memory or by the system completely running out of memory.

I wonder why this version of the 5.10 kernel isn't solving the
problem by performing an OOM kill to free memory.  We're running a
5.10-based kernel in our data centers at $WORK, and normally the OOM
killer is quite free to make memory available by killing processes as
necessary to deal with these situations.  (As Spock once said, "The
needs of the many outweigh the needs of the few --- or the one."  And
sometimes the best way to keep the system running is to sacrifice one
of the userspace processes.)  Do you by any chance have all or most
of the user processes exempted from the OOM killer?

> #16 [ffffbd6c40b3faa0] __ext4_forget+338 at ffffffffc02efb32 [ext4] 
> (/home/ec2-user/linux/fs/ext4/ext4_jbd2.c:298)
> #17 [ffffbd6c40b3fae0] ext4_free_blocks+2437 at ffffffffc031fd55 [ext4]
> (/home/ec2-user/linux/fs/ext4/mballoc.c:5709 (discriminator 2))
> #18 [ffffbd6c40b3fbb0] ext4_ext_handle_unwritten_extents+596 at
> ffffffffc02f56a4 [ext4] ((inlined by) ext4_ext_handle_unwritten_extents at
> /home/ec2-user/linux/fs/ext4/extents.c:3892)

This part of the stack trace is weird. I don't see what callees of
ext4_ext_handle_unwritten_extents() would result in ext4_free_blocks()
getting called. Unfortunately, the code is very heavily inlined, and
we only see the first level of inlining. My best guess is that it was
in some error handling code, such as this:

if (err) {
        /* free all allocated blocks in error case */
        for (i = 0; i < depth; i++) {
                if (!ablocks[i])
                        continue;
                ext4_free_blocks(handle, inode, NULL, ablocks[i], 1,
                                 EXT4_FREE_BLOCKS_METADATA);
        }
}

... and this call to ext4_free_blocks() resulted in the call to
__ext4_forget, which in turn tried to create a journal revoke record.
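
If that guess is right, the chain from there down to the allocation
at the top of the trace would be roughly (from memory, not a verbatim
quote of the source):

ext4_free_blocks(handle, inode, NULL, block, count,
                 EXT4_FREE_BLOCKS_METADATA)
  -> ext4_forget() / __ext4_forget()
       -> jbd2_journal_revoke(handle, blocknr, bh)
            -> insert_revoke_hash(journal, blocknr, seq)
                 -> kmem_cache_alloc(jbd2_revoke_record_cache,
                                     GFP_NOFS | __GFP_NOFAIL)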

And the error itself may very well have been caused by some other
failed memory allocation, if the system was so desperately low on
memory.


Anyway, the big question is why the kernel allowed the system to get
so low on memory in the first place.  In addition to OOM killing
processes, one of the things that Linux is supposed to do is "write
throttling": if a process is dirtying too many pages, the kernel puts
the guilty process to sleep so that page cleaning has a chance to
catch up.
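
For buffered writes that throttling happens under
balance_dirty_pages(), which the generic write path reaches via
balance_dirty_pages_ratelimited() as pages are dirtied; the idea,
very roughly sketched (the helper names below are made up for
illustration, this is not the actual mm/page-writeback.c code):

if (nr_dirty_pages() > dirty_background_limit())
        wake_up_flusher_threads();              /* kick off background writeback */
if (nr_dirty_pages() > dirty_hard_limit())      /* vm.dirty_ratio / vm.dirty_bytes */
        sleep_until_writeback_catches_up();     /* the dirtying task is throttled here */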

Quoting from section 14.1.5 (Writeback) of [1]:

    As applications write to files, the pagecache becomes dirty and
    the buffercache may become dirty. When the amount of dirty memory
    reaches a specified number of pages in bytes
    (vm.dirty_background_bytes), or when the amount of dirty memory
    reaches a specific ratio to total memory
    (vm.dirty_background_ratio), or when the pages have been dirty for
    longer than a specified amount of time
    (vm.dirty_expire_centisecs), the kernel begins writeback of pages
    starting with files that had the pages dirtied first. The
    background bytes and ratios are mutually exclusive and setting one
    will overwrite the other. Flusher threads perform writeback in the
    background and allow applications to continue running. If the I/O
    cannot keep up with applications dirtying pagecache, and dirty
    data reaches a critical setting (vm.dirty_bytes or
    vm.dirty_ratio), then applications begin to be throttled to
    prevent dirty data exceeding this threshold.

[1] https://documentation.suse.com/sles/15-SP3/html/SLES-all/cha-tuning-memory.html

So it might be worth checking whether your system has non-default
values for
/proc/sys/vm/{dirty_bytes,dirty_ratio,dirty_background_bytes,etc.}.

Cheers,

- Ted