Re: [PATCH 5.10 109/145] mm: make wait_on_page_writeback() wait for multiple pending writebacks

From: Hugh Dickins
Date: Mon Jan 11 2021 - 12:55:48 EST


On Mon, 11 Jan 2021, Greg Kroah-Hartman wrote:

> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.
>
> Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
> logic") we've had some very occasional reports of BUG_ON(PageWriteback)
> in write_cache_pages(), which we thought we already fixed in commit
> 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").
>
> But syzbot just reported another one, even with that commit in place.
>
> And it turns out that there's a simpler way to trigger the BUG_ON() than
> the one Hugh found with page re-use. It all boils down to the fact that
> the page writeback is ostensibly serialized by the page lock, but that
> isn't actually really true.
>
> Yes, the people _setting_ writeback all do so under the page lock, but
> the actual clearing of the bit - and waking up any waiters - happens
> without any page lock.
>
> This gives us this fairly simple race condition:
>
> CPU1 = end previous writeback
> CPU2 = start new writeback under page lock
> CPU3 = write_cache_pages()
>
> CPU1 CPU2 CPU3
> ---- ---- ----
>
> end_page_writeback()
> test_clear_page_writeback(page)
> ... delayed...
>
> lock_page();
> set_page_writeback()
> unlock_page()
>
> lock_page()
> wait_on_page_writeback();
>
> wake_up_page(page, PG_writeback);
> .. wakes up CPU3 ..
>
> BUG_ON(PageWriteback(page));
>
> where the BUG_ON() happens because we woke up the PG_writeback bit
> becasue of the _previous_ writeback, but a new one had already been
> started because the clearing of the bit wasn't actually atomic wrt the
> actual wakeup or serialized by the page lock.
>
> The reason this didn't use to happen was that the old logic in waiting
> on a page bit would just loop if it ever saw the bit set again.
>
> The nice proper fix would probably be to get rid of the whole "wait for
> writeback to clear, and then set it" logic in the writeback path, and
> replace it with an atomic "wait-to-set" (ie the same as we have for page
> locking: we set the page lock bit with a single "lock_page()", not with
> "wait for lock bit to clear and then set it").
>
> However, out current model for writeback is that the waiting for the
> writeback bit is done by the generic VFS code (ie write_cache_pages()),
> but the actual setting of the writeback bit is done much later by the
> filesystem ".writepages()" function.
>
> IOW, to make the writeback bit have that same kind of "wait-to-set"
> behavior as we have for page locking, we'd have to change our roughly
> ~50 different writeback functions. Painful.
>
> Instead, just make "wait_on_page_writeback()" loop on the very unlikely
> situation that the PG_writeback bit is still set, basically re-instating
> the old behavior. This is very non-optimal in case of contention, but
> since we only ever set the bit under the page lock, that situation is
> controlled.
>
> Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@xxxxxxxxxxxxxxxxxxxxxxxxx
> Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
> Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxx
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

I think it's too early to push this one through to stable:
Linus mentioned on Friday that Michael Larabel of Phoronix
has observed a performance regression from this commit.

Correctness outweighs performance of course, but I think
stable users might see the performance issue much sooner
than they would ever see the BUG fixed. Wait a bit,
while we think some more about what to try next?

Hugh

>
> ---
> mm/page-writeback.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2826,7 +2826,7 @@ EXPORT_SYMBOL(__test_set_page_writeback)
> */
> void wait_on_page_writeback(struct page *page)
> {
> - if (PageWriteback(page)) {
> + while (PageWriteback(page)) {
> trace_wait_on_page_writeback(page, page_mapping(page));
> wait_on_page_bit(page, PG_writeback);
> }