Re: Tasks stuck jbd2 for a long time

From: Bhatnagar, Rishabh
Date: Wed Aug 16 2023 - 14:34:07 EST



On 8/16/23 7:53 AM, Jan Kara wrote:



On Tue 15-08-23 20:57:14, Bhatnagar, Rishabh wrote:
On 8/15/23 7:28 PM, Theodore Ts'o wrote:



It would be helpful if you can translate the addresses in the stack trace to
line numbers. See [1] and the script in
./scripts/decode_stacktrace.sh in the kernel sources. (It is
referenced in the web page at [1].)

[1] https://docs.kernel.org/admin-guide/bug-hunting.html
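
For readers unfamiliar with the frame format: each Call Trace entry has the form symbol+0xoffset/0xsize, optionally followed by [module], and it is the symbol and offset that get resolved to a source line. The following is a minimal Python sketch of splitting a frame into those fields. It is not part of the kernel tooling, and the name parse_frame is hypothetical.

```python
import re

# Matches frames such as "? kjournald2+0xcf/0x360 [jbd2]".
# A leading "?" marks an address the unwinder could not verify.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)

def parse_frame(line):
    """Return the fields of one Call Trace line, or None if it is not a frame."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),      # total function size in bytes
        "module": m.group("module"),           # None for built-in code
        "reliable": m.group("unreliable") is None,
    }

frame = parse_frame("jbd2_journal_commit_transaction+0x35d/0x1880 [jbd2]")
print(frame["symbol"], hex(frame["offset"]), frame["module"])
```

The symbol, offset, and module extracted this way are only meaningful against the exact kernel build that produced the trace, which is why the matching sources and commit ID are needed.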

Of course, in order to interpret the line numbers, we'll need a
pointer to the git repo of your kernel sources and the git commit ID
you were using that presumably corresponds to 5.10.184-175.731.amzn2.x86_64.

The stack trace for which I am particularly interested is the one for
the jbd2/md0-8 task, e.g.:
Thanks for checking Ted.

We don't have the fast_commit feature enabled. So it should correspond to this
line:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/jbd2/commit.c?h=linux-5.10.y#n496

Not tainted 5.10.184-175.731.amzn2.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:jbd2/md0-8 state:D stack: 0 pid: 8068 ppid: 2 flags:0x00004080
Call Trace:
__schedule+0x1f9/0x660
schedule+0x46/0xb0
jbd2_journal_commit_transaction+0x35d/0x1880 [jbd2] <--------- line #?
? update_load_avg+0x7a/0x5d0
? add_wait_queue_exclusive+0x70/0x70
? lock_timer_base+0x61/0x80
? kjournald2+0xcf/0x360 [jbd2]
kjournald2+0xcf/0x360 [jbd2]
Most of the other stack traces you referenced are tasks that are waiting
for the transaction commit to complete so they can proceed with some
file system operation. The stack traces which have
start_this_handle() in them are examples of this going on. Stack
traces of tasks that do *not* have start_this_handle() would be
especially interesting.
I see all other stacks apart from kjournald have "start_this_handle".
That would be strange. Can you post the full output of "echo w >
/proc/sysrq-trigger" to dmesg, ideally passed through scripts/faddr2line as
Ted suggests? Thanks!

Sure, I'll try to collect that. The system freezes when such a situation happens and I'm not able
to collect much information. I'll try to crash the kernel and collect a kdump and see if I can get that info.

Could low available memory prevent a thread from closing its transaction handle for a long time?
Maybe some writeback thread starts the handle but is not able to complete writeback?


Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR