Re: 6.5.0rc5 fs hang - ext4? raid?

From: Dr. David Alan Gilbert
Date: Tue Aug 15 2023 - 21:32:33 EST


(Copying in Christoph and Jens)

* Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote:
> * Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote:
> > * Theodore Ts'o (tytso@xxxxxxx) wrote:
> > > On Mon, Aug 14, 2023 at 09:02:53PM +0000, Dr. David Alan Gilbert wrote:
> > > > dg 29594 29592 0 18:40 pts/0 00:00:00 /usr/bin/ar --plugin /usr/libexec/gcc/x86_64-redhat-linux/13/liblto_plugin.so -csrDT src/intel/perf/libintel_perf.a src/intel/perf/libintel_perf.a.p/meson-generated_.._intel_perf_metrics.c.o src/intel/perf/libintel_perf.a.p/intel_perf.c.o src/intel/perf/libintel_perf.a.p/intel_perf_query.c.o src/intel/perf/libintel_perf.a.p/intel_perf_mdapi.c.o
> > > >
> > > > [root@dalek dg]# cat /proc/29594/stack
> > > > [<0>] md_super_wait+0xa2/0xe0
> > > > [<0>] md_bitmap_unplug+0xd2/0x120
> > > > [<0>] flush_bio_list+0xf3/0x100 [raid1]
> > > > [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> > > > [<0>] __blk_flush_plug+0xd7/0x150
> > > > [<0>] blk_finish_plug+0x29/0x40
> > > > [<0>] ext4_do_writepages+0x401/0xc90
> > > > [<0>] ext4_writepages+0xad/0x180
> > >
> > > If you want a few seconds and try grabbing cat /proc/29594/stack
> > > again, what does the stack trace stay consistent as above?
> >
> > I'll get back to that and retry it.
>
> Yeh, the stack is consistent; this time around it's an 'ar' in a kernel
> build:
>
> [root@dalek dg]# cat /proc/17970/stack
> [<0>] md_super_wait+0xa2/0xe0
> [<0>] md_bitmap_unplug+0xad/0x120
> [<0>] flush_bio_list+0xf3/0x100 [raid1]
> [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> [<0>] __blk_flush_plug+0xd7/0x150
> [<0>] blk_finish_plug+0x29/0x40
> [<0>] ext4_do_writepages+0x401/0xc90
> [<0>] ext4_writepages+0xad/0x180
> [<0>] do_writepages+0xd2/0x1e0
> [<0>] filemap_fdatawrite_wbc+0x63/0x90
> [<0>] __filemap_fdatawrite_range+0x5c/0x80
> [<0>] ext4_release_file+0x74/0xb0
> [<0>] __fput+0xf5/0x2a0
> [<0>] task_work_run+0x5d/0x90
> [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> [<0>] do_syscall_64+0x6c/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [root@dalek dg]# cat /proc/17970/stack
> [<0>] md_super_wait+0xa2/0xe0
> [<0>] md_bitmap_unplug+0xad/0x120
> [<0>] flush_bio_list+0xf3/0x100 [raid1]
> [<0>] raid1_unplug+0x3b/0xb0 [raid1]
> [<0>] __blk_flush_plug+0xd7/0x150
> [<0>] blk_finish_plug+0x29/0x40
> [<0>] ext4_do_writepages+0x401/0xc90
> [<0>] ext4_writepages+0xad/0x180
> [<0>] do_writepages+0xd2/0x1e0
> [<0>] filemap_fdatawrite_wbc+0x63/0x90
> [<0>] __filemap_fdatawrite_range+0x5c/0x80
> [<0>] ext4_release_file+0x74/0xb0
> [<0>] __fput+0xf5/0x2a0
> [<0>] task_work_run+0x5d/0x90
> [<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> [<0>] do_syscall_64+0x6c/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> > > Also, if you have iostat installed (usually part of the sysstat
> > > package), does "iostat 1" show any I/O activity on the md device?
>
> iostat is showing something odd, most devices are at 0,
> except for 3 of the dm's that are stuck at 100% utilisation with
> apparently nothing going on:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.06 0.00 0.03 53.06 0.00 46.84
>
> Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
> ...
> dm-16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> dm-17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> dm-18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> dm-20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> ....
>
> dm-20 is the /dev/mapper/main-more which is the RAID on which the
> fs runs, 16 and 17 are main-more_rmeta_0 and main-more_rimage_0
> so something screwy is going on there.

I've just finished a bisect of this hang, and got to:

615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
Author: Christoph Hellwig <hch@xxxxxx>
Date: Fri May 19 06:40:48 2023 +0200

blk-mq: defer to the normal submission path for post-flush requests

Requests with the FUA bit on hardware without FUA support need a post
flush before returning to the caller, but they can still be sent using
the normal I/O path after initializing the flush-related fields and
end I/O handler.

Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Bart Van Assche <bvanassche@xxxxxxx>
Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@xxxxxx
Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>

the full set of runs is:
2 good: 6995e2de6891c724bfeb2db33d7b87775f913ad1 v6.4
7 good 0aa69d53ac7c30f6184f88f2e310d808b32b35a5 6.4.0dg-00428-g0aa69d53ac7c
12 good c1075e548ce6e6b5c7b71f2b05d344164ebc52bb 6.4.0-rc2dg-00020-gc1075e548ce6
15 good be4c427809b0a746aff54dbb8ef663f0184291d0 6.4.0-rc2dg-00022-gbe4c427809b0
17 bad 615939a2ae734e3e68c816d6749d1f5f79c62ab7 6.4.0-rc2dg-00023-g615939a2ae73
16 bad 1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 6.4.0-rc2dg-00024-g1e82fadfc6b9
14 bad 9a67aa52a42b31ad44220cc218df3b75a5cd5d05 6.4.0-rc2dg-00025-g9a67aa52a42b
13 bad 38f2dd34410f9070b60969a07ff7d8743b4fd56c 6.4.0-rc2dg-00030-g38f2dd34410f
11 bad a450f49708ea2ccabd1c5d2fe8a702ca5ef77941 6.4.0-rc2dg-00071-ga450f49708ea
10 bad a4f75764d16bed317276b05a9fe2c179ef61680d 6.4.0-rc2dg-00112-ga4f75764d16b
9 bad 658afed19ceed54a52b9e9e69c0791c8868ff55d 6.4.0-rc2dg-00193-g658afed19cee
8 bad 7cffdbe3607a6cc2dc02d135e13732ec36bc4e28 6.4.0dg-00752-g7cffdbe3607a
6 bad 2605e80d3438c77190f55b821c6575048c68268e 6.4.0dg-01094-g2605e80d3438
5 bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0 6.4.0dg-02143-g6e17c6de3ddf
4 bad 3a8a670eeeaa40d87bd38a587438952741980c18 6.4.0dg-04247-g3a8a670eeeaa
3 bad 1b722407a13b7f8658d2e26917791f32805980a2 6.4.0dg-07066-g1b722407a13b
1 bad: 52a93d39b17dc v6.5-rc5

a sysrq-w from build 16 ( 1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 )
s below.

This is very repeatable for me;
a) boot into desktop
b) Start rhythmbox playing
c) Open 2 terminals
d) In one start make clean on kernel tree
e) in other start iostat -x 1
f) Once the make clean is done, make -j 32
g) If that succeeds try ag somejunk in kernel tree (i.e. parallel
grep)

It's not clear to me if starting rhythmbox is actually necessary;
but with that recipe it often fails in a few seconds.

Let me know if there's anything you'd like me to try.

Dave

[ 90.614548] sysrq: Show Blocked State
[ 90.615300] task:kworker/u64:3 state:D stack:0 pid:232 ppid:2 flags:0x00004000
[ 90.615307] Workqueue: writeback wb_workfn (flush-253:20)
[ 90.615315] Call Trace:
[ 90.615317] <TASK>
[ 90.615321] __schedule+0x3eb/0x14c0
[ 90.615327] ? update_sd_lb_stats.constprop.0+0x135/0x8d0
[ 90.615334] schedule+0x5e/0xd0
[ 90.615336] io_schedule+0x46/0x70
[ 90.615338] bit_wait_io+0x11/0x70
[ 90.615341] __wait_on_bit+0x46/0x140
[ 90.615343] ? __pfx_bit_wait_io+0x10/0x10
[ 90.615346] out_of_line_wait_on_bit+0x95/0xc0
[ 90.615349] ? __pfx_wake_bit_function+0x10/0x10
[ 90.615353] do_get_write_access+0x266/0x410
[ 90.615358] jbd2_journal_get_write_access+0x5f/0x80
[ 90.615362] __ext4_journal_get_write_access+0x85/0x180
[ 90.615366] ? ext4_get_group_desc+0x5c/0xe0
[ 90.615369] ext4_mb_mark_diskspace_used+0x10d/0x460
[ 90.615374] ext4_mb_new_blocks+0x18b/0x1000
[ 90.615377] ? __kmalloc+0x4d/0x150
[ 90.615381] ? ext4_find_extent+0x3c6/0x420
[ 90.615385] ext4_ext_map_blocks+0x60d/0x1980
[ 90.615389] ? allocate_slab+0x251/0x4c0
[ 90.615393] ? ext4_alloc_io_end_vec+0x23/0x80
[ 90.615397] ? mpage_prepare_extent_to_map+0x454/0x490
[ 90.615401] ext4_map_blocks+0x251/0x620
[ 90.615404] ext4_do_writepages+0x76a/0xc90
[ 90.615408] ext4_writepages+0xad/0x180
[ 90.615414] do_writepages+0xd2/0x1e0
[ 90.615418] ? __wb_calc_thresh+0x4b/0x130
[ 90.615421] __writeback_single_inode+0x3d/0x360
[ 90.615424] ? wbc_detach_inode+0x101/0x220
[ 90.615427] writeback_sb_inodes+0x1ed/0x4b0
[ 90.615432] __writeback_inodes_wb+0x4c/0xf0
[ 90.615435] wb_writeback+0x172/0x2f0
[ 90.615439] wb_workfn+0x2b5/0x510
[ 90.615442] ? __schedule+0x3f3/0x14c0
[ 90.615444] ? __mod_timer+0x11f/0x370
[ 90.615449] process_one_work+0x1c7/0x3d0
[ 90.615453] worker_thread+0x51/0x390
[ 90.615456] ? __pfx_worker_thread+0x10/0x10
[ 90.615458] kthread+0xe8/0x120
[ 90.615462] ? __pfx_kthread+0x10/0x10
[ 90.615466] ret_from_fork+0x2c/0x50
[ 90.615472] </TASK>
[ 90.615613] task:kworker/u64:4 state:D stack:0 pid:290 ppid:2 flags:0x00004000
[ 90.615617] Workqueue: writeback wb_workfn (flush-253:20)
[ 90.615621] Call Trace:
[ 90.615622] <TASK>
[ 90.615623] __schedule+0x3eb/0x14c0
[ 90.615626] ? __submit_bio+0x88/0x170
[ 90.615629] ? submit_bio_noacct_nocheck+0x159/0x370
[ 90.615632] schedule+0x5e/0xd0
[ 90.615635] md_super_wait+0xa2/0xe0
[ 90.615640] ? __pfx_autoremove_wake_function+0x10/0x10
[ 90.615644] md_bitmap_unplug+0xf3/0x120
[ 90.615650] flush_bio_list+0x1e/0xe0 [raid1]
[ 90.615659] raid1_unplug+0xaf/0xd0 [raid1]
[ 90.615665] __blk_flush_plug+0xdb/0x160
[ 90.615669] blk_finish_plug+0x29/0x40
[ 90.615671] wb_writeback+0x2bc/0x2f0
[ 90.615674] wb_workfn+0x2b5/0x510
[ 90.615677] ? ttwu_queue_wakelist+0xef/0x110
[ 90.615681] process_one_work+0x1c7/0x3d0
[ 90.615684] worker_thread+0x51/0x390
[ 90.615687] ? __pfx_worker_thread+0x10/0x10
[ 90.615689] kthread+0xe8/0x120
[ 90.615692] ? __pfx_kthread+0x10/0x10
[ 90.615696] ret_from_fork+0x2c/0x50
[ 90.615701] </TASK>
[ 90.616150] task:mdX_raid1 state:D stack:0 pid:1425 ppid:2 flags:0x00004000
[ 90.616153] Call Trace:
[ 90.616154] <TASK>
[ 90.616155] __schedule+0x3eb/0x14c0
[ 90.616157] ? __schedule+0x3f3/0x14c0
[ 90.616161] schedule+0x5e/0xd0
[ 90.616163] md_super_wait+0xa2/0xe0
[ 90.616167] ? __pfx_autoremove_wake_function+0x10/0x10
[ 90.616171] md_bitmap_daemon_work+0x211/0x3a0
[ 90.616175] ? __pfx_md_thread+0x10/0x10
[ 90.616179] md_check_recovery+0x42/0x5a0
[ 90.616182] raid1d+0x87/0x16d0 [raid1]
[ 90.616188] ? __pfx_md_thread+0x10/0x10
[ 90.616191] ? __schedule+0x3f3/0x14c0
[ 90.616193] ? psi_task_switch+0x84/0x230
[ 90.616198] ? lock_timer_base+0x61/0x80
[ 90.616202] ? prepare_to_wait_event+0x60/0x180
[ 90.616206] ? __pfx_md_thread+0x10/0x10
[ 90.616209] md_thread+0xae/0x190
[ 90.616212] ? __pfx_autoremove_wake_function+0x10/0x10
[ 90.616217] kthread+0xe8/0x120
[ 90.616220] ? __pfx_kthread+0x10/0x10
[ 90.616223] ret_from_fork+0x2c/0x50
[ 90.616228] </TASK>
[ 90.616310] task:jbd2/dm-20-8 state:D stack:0 pid:1501 ppid:2 flags:0x00004000
[ 90.616313] Call Trace:
[ 90.616314] <TASK>
[ 90.616315] __schedule+0x3eb/0x14c0
[ 90.616317] ? mempool_alloc+0x89/0x1b0
[ 90.616322] schedule+0x5e/0xd0
[ 90.616324] md_super_wait+0xa2/0xe0
[ 90.616328] ? __pfx_autoremove_wake_function+0x10/0x10
[ 90.616332] md_bitmap_unplug+0xba/0x120
[ 90.616336] flush_bio_list+0x1e/0xe0 [raid1]
[ 90.616342] raid1_unplug+0xaf/0xd0 [raid1]
[ 90.616349] __blk_flush_plug+0xdb/0x160
[ 90.616352] blk_finish_plug+0x29/0x40
[ 90.616354] jbd2_journal_commit_transaction+0xfdb/0x1a00
[ 90.616362] kjournald2+0xad/0x280
[ 90.616367] ? __pfx_autoremove_wake_function+0x10/0x10
[ 90.616372] ? __pfx_kjournald2+0x10/0x10
[ 90.616375] kthread+0xe8/0x120
[ 90.616378] ? __pfx_kthread+0x10/0x10
[ 90.616382] ret_from_fork+0x2c/0x50
[ 90.616387] </TASK>
[ 90.618244] task:oggdemux0:sink state:D stack:0 pid:3216 ppid:2103 flags:0x00000002
[ 90.618248] Call Trace:
[ 90.618249] <TASK>
[ 90.618251] __schedule+0x3eb/0x14c0
[ 90.618254] schedule+0x5e/0xd0
[ 90.618256] io_schedule+0x46/0x70
[ 90.618259] folio_wait_bit_common+0x13d/0x350
[ 90.618262] ? __pfx_wake_page_function+0x10/0x10
[ 90.618265] filemap_get_pages+0x5fe/0x630
[ 90.618268] ? psi_task_switch+0xd6/0x230
[ 90.618272] ? finish_task_switch.isra.0+0x94/0x2f0
[ 90.618276] filemap_read+0xdf/0x350
[ 90.618279] ? dput+0x3a/0x310
[ 90.618284] vfs_read+0x1f6/0x320
[ 90.618289] ksys_read+0x6f/0xf0
[ 90.618291] do_syscall_64+0x60/0x90
[ 90.618295] ? exit_to_user_mode_prepare+0x142/0x1f0
[ 90.618298] ? syscall_exit_to_user_mode+0x1b/0x40
[ 90.618302] ? do_syscall_64+0x6c/0x90
[ 90.618305] ? exit_to_user_mode_prepare+0x142/0x1f0
[ 90.618307] ? syscall_exit_to_user_mode+0x1b/0x40
[ 90.618310] ? do_syscall_64+0x6c/0x90
[ 90.618312] ? do_syscall_64+0x6c/0x90
[ 90.618314] ? do_syscall_64+0x6c/0x90
[ 90.618316] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 90.618321] RIP: 0033:0x7f2b3f9083ba
[ 90.618346] RSP: 002b:00007f2b179fe180 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 90.618349] RAX: ffffffffffffffda RBX: 0000559e098dfd20 RCX: 00007f2b3f9083ba
[ 90.618351] RDX: 00000000000025ae RSI: 00007f2b04038190 RDI: 0000000000000015
[ 90.618352] RBP: 00007f2b179fe1a0 R08: 0000000000000000 R09: 00007f2b3f1e16a0
[ 90.618354] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000025ae
[ 90.618355] R13: 0000000000000000 R14: 00007f2b179ff648 R15: 00000000000025ae
[ 90.618358] </TASK>
[ 90.618426] task:gcc state:D stack:0 pid:6486 ppid:6485 flags:0x00000006
[ 90.618429] Call Trace:
[ 90.618430] <TASK>
[ 90.618431] __schedule+0x3eb/0x14c0
[ 90.618435] schedule+0x5e/0xd0
[ 90.618437] io_schedule+0x46/0x70
[ 90.618440] folio_wait_bit_common+0x13d/0x350
[ 90.618443] ? __pfx_wake_page_function+0x10/0x10
[ 90.618445] filemap_get_pages+0x5fe/0x630
[ 90.618449] ? path_openat+0xa2c/0x1140
[ 90.618454] filemap_read+0xdf/0x350
[ 90.618457] ? do_filp_open+0xb3/0x160
[ 90.618461] vfs_read+0x1f6/0x320
[ 90.618465] ksys_read+0x6f/0xf0
[ 90.618467] do_syscall_64+0x60/0x90
[ 90.618469] ? exc_page_fault+0x7f/0x180
[ 90.618472] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 90.618476] RIP: 0033:0x7efee8b08381
[ 90.618480] RSP: 002b:00007ffc7e238648 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 90.618482] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efee8b08381
[ 90.618483] RDX: 0000000000000400 RSI: 000055b8799377e0 RDI: 0000000000000005
[ 90.618484] RBP: 00007ffc7e238770 R08: 0000000000000400 R09: 00007efee88442c0
[ 90.618486] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[ 90.618487] R13: 00007ffc7e238810 R14: 00007ffc7e238690 R15: 00007ffc7e238680
[ 90.618490] </TASK>

> Dave
>
>
> > > What about the underying block dvices used by the md device? If the
> > > md device is attached to HDD's where you can see the activity light,
> > > can you see (or hear) any disk activity?
> >
> > It's spinning rust, and I hear them go quiet when the hang happens.
> >
> > Dave
> >
> > > This sure seems like either the I/O driver isn't processing requests,
> > > or some kind of hang in the md layer....
> > >
> > > - Ted
> > --
> > -----Open up your eyes, open up your mind, open up your code -------
> > / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> > \ dave @ treblig.org | | In Hex /
> > \ _________________________|_____ http://www.treblig.org |_______/
> --
> -----Open up your eyes, open up your mind, open up your code -------
> / Dr. David Alan Gilbert | Running GNU/Linux | Happy \
> \ dave @ treblig.org | | In Hex /
> \ _________________________|_____ http://www.treblig.org |_______/
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux | Happy \
\ dave @ treblig.org | | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/