Re: Tasks stuck jbd2 for a long time

From: Theodore Ts'o
Date: Thu Aug 17 2023 - 22:43:04 EST


On Fri, Aug 18, 2023 at 01:31:35AM +0000, Lu, Davina wrote:
>
> Looks like this is similar to an issue I saw before with an fio test (buffered IO with 100 threads); it also shows the "ext4-rsv-conversion" work queue taking lots of CPU and making journal updates get stuck.

Given the stack traces, it is very much a different problem.

> There is a patch; can you see if this is the same issue? This is not the
> final patch since there may be some issues from Ted. I will
> forward that email to you in a different loop. I didn't continue on
> this patch at that time since we thought it might not be the real case
> in RDS.

The patch which you've included is dangerous and can cause file system
corruption. See my reply at [1], and your corrected patch which
addressed my concern at [2]. If folks want to try a patch, please use
the one at [2], and not the one you quoted in this thread, since it's
missing critically needed locking.

[1] https://lore.kernel.org/r/YzTMZ26AfioIbl27@xxxxxxx
[2] https://lore.kernel.org/r/53153bdf0cce4675b09bc2ee6483409f@xxxxxxxxxx

The reason why we never pursued it is because (a) at one of our weekly
ext4 video chats, I was informed by Oleg Kiselev that the performance
issue was addressed in a different way, and (b) I'd want to reproduce
the issue on a machine under my control so I could understand what was
going on and so we could examine the dynamics of what was happening
with and without the patch. So I would have needed to know how many
CPUs and what kind of storage device (HDD? SSD? md-raid? etc.) were in
use, in addition to the fio recipe.

Finally, I'm a bit nervous about setting the internal __WQ_ORDERED
flag with max_active > 1. What was that all about, anyway?

- Ted