Re: [PATCH 1/1] Revert "iomap: fall back to buffered writes for invalidation failures"

From: Darrick J. Wong
Date: Wed Feb 09 2022 - 23:59:20 EST


On Wed, Feb 09, 2022 at 08:52:43AM +0000, Lee Jones wrote:
> This reverts commit 60263d5889e6dc5987dc51b801be4955ff2e4aa7.
>
> Reverting since this commit opens a potential avenue for abuse.

What kind of abuse? Did you conclude there's an avenue solely because
some combination of userspace rigging produced a BUG warning? Or is
this a real problem that someone found?

> The C-reproducer and more information can be found at the link below.

No. Post the information and your analysis here. I'm not going to dig
into some Google site to find out what happened, and I'm not going to
assume that future developers will be able to access that URL to learn
why this patch was created.

I am not convinced that this revert is a proper fix for ... whatever is
the problem, because you never explained what happened.

> With this patch applied, I can no longer get the repro to trigger.

With ext4 completely reverted, I can no longer get the repro to trigger
either.

> kernel BUG at fs/ext4/inode.c:2647!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 459 Comm: syz-executor359 Tainted: G W 5.10.93-syzkaller-01028-g0347b1658399 #0

What BUG on fs/ext4/inode.c:2647?

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/ext4/inode.c?h=v5.10.93#n2647

All I see is a call to pagevec_release()? There's a BUG_ON further up
if we wait for page writeback but then it still has Writeback set. But
I don't see anything in pagevec_release that would actually result in a
BUG warning.

Oh, right, this is one of those inscrutable syzkaller things, where a
person can spend hours figuring out what the hell it set up.

Yeah...no, you don't get to submit patches to core kernel code, claim
it's not your responsibility to know anything about a subsystem that you
want to patch, and then expect us to do the work for you. If you pick
up a syzkaller report, you get to figure out what broke, why, and how to
fix it in a reasonable manner.

You're a maintainer, would you accept a patch like this?

NAK.

--D

OH WAIT, you're running this on the Android 5.10 kernel, aren't you?
The BUG report came from page_buffers failing to find any buffer heads
attached to the page.
https://android.googlesource.com/kernel/common/+/refs/heads/android12-5.10-2022-02/fs/ext4/inode.c#2647

Yeah, don't care.

> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:mpage_prepare_extent_to_map+0xbe9/0xc00 fs/ext4/inode.c:2647
> Code: e8 fc bd c3 ff e9 80 f6 ff ff 44 89 e9 80 e1 07 38 c1 0f 8c a6 fe ff ff 4c 89 ef e8 e1 bd c3 ff e9 99 fe ff ff e8 87 c9 89 ff <0f> 0b e8 80 c9 89 ff 0f 0b e8 79 1e b8 02 66 0f 1f 84 00 00 00 00
> RSP: 0018:ffffc90000e2e9c0 EFLAGS: 00010293
> RAX: ffffffff81e321f9 RBX: 0000000000000000 RCX: ffff88810c12cf00
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffffc90000e2eb90 R08: ffffffff81e31e71 R09: fffff940008d68b1
> R10: fffff940008d68b1 R11: 0000000000000000 R12: ffffea00046b4580
> R13: ffffc90000e2ea80 R14: 000000000000011e R15: dffffc0000000000
> FS: 00007fcfd0717700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcfd07d5520 CR3: 000000010a142000 CR4: 00000000003506b0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> ext4_writepages+0xcbb/0x3950 fs/ext4/inode.c:2776
> do_writepages+0x13a/0x280 mm/page-writeback.c:2358
> __filemap_fdatawrite_range+0x356/0x420 mm/filemap.c:427
> filemap_write_and_wait_range+0x64/0xe0 mm/filemap.c:660
> __iomap_dio_rw+0x621/0x10c0 fs/iomap/direct-io.c:495
> iomap_dio_rw+0x35/0x80 fs/iomap/direct-io.c:611
> ext4_dio_write_iter fs/ext4/file.c:571 [inline]
> ext4_file_write_iter+0xfc5/0x1b70 fs/ext4/file.c:681
> do_iter_readv_writev+0x548/0x740 include/linux/fs.h:1941
> do_iter_write+0x182/0x660 fs/read_write.c:866
> vfs_iter_write+0x7c/0xa0 fs/read_write.c:907
> iter_file_splice_write+0x7fb/0xf70 fs/splice.c:686
> do_splice_from fs/splice.c:764 [inline]
> direct_splice_actor+0xfe/0x130 fs/splice.c:933
> splice_direct_to_actor+0x4f4/0xbc0 fs/splice.c:888
> do_splice_direct+0x28b/0x3e0 fs/splice.c:976
> do_sendfile+0x914/0x1390 fs/read_write.c:1257
>
> Link: https://syzkaller.appspot.com/bug?extid=41c966bf0729a530bd8d
>
> From the patch:
> Cc: Stable <stable@xxxxxxxxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Dave Chinner <dchinner@xxxxxxxxxx>
> Cc: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> Cc: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Cc: Bob Peterson <rpeterso@xxxxxxxxxx>
> Cc: Damien Le Moal <damien.lemoal@xxxxxxx>
> Cc: Theodore Ts'o <tytso@xxxxxxx> # for ext4
> Cc: Andreas Gruenbacher <agruenba@xxxxxxxxxx>
> Cc: Ritesh Harjani <riteshh@xxxxxxxxxxxxx>
>
> Others:
> Cc: Johannes Thumshirn <jth@xxxxxxxxxx>
> Cc: linux-xfs@xxxxxxxxxxxxxxx
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: linux-ext4@xxxxxxxxxxxxxxx
> Cc: cluster-devel@xxxxxxxxxx
>
> Fixes: 60263d5889e6d ("iomap: fall back to buffered writes for invalidation failures")
> Reported-by: syzbot+0ed9f769264276638893@xxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Lee Jones <lee.jones@xxxxxxxxxx>
> ---
> fs/ext4/file.c | 2 --
> fs/gfs2/file.c | 3 +--
> fs/iomap/direct-io.c | 16 +++++-----------
> fs/iomap/trace.h | 1 -
> fs/xfs/xfs_file.c | 4 ++--
> fs/zonefs/super.c | 7 ++-----
> 6 files changed, 10 insertions(+), 23 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 3ed8c048fb12c..cb347c3b35535 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -551,8 +551,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iomap_ops = &ext4_iomap_overwrite_ops;
> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> is_sync_kiocb(iocb) || unaligned_io || extend);
> - if (ret == -ENOTBLK)
> - ret = 0;
>
> if (extend)
> ret = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index b39b339feddc9..847adb97380d3 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -835,8 +835,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
>
> ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
> is_sync_kiocb(iocb));
> - if (ret == -ENOTBLK)
> - ret = 0;
> +
> out:
> gfs2_glock_dq(gh);
> out_uninit:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5becd..ddcd06c0c452d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -10,7 +10,6 @@
> #include <linux/backing-dev.h>
> #include <linux/uio.h>
> #include <linux/task_io_accounting_ops.h>
> -#include "trace.h"
>
> #include "../internal.h"
>
> @@ -413,9 +412,6 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> * can be mapped into multiple disjoint IOs and only a subset of the IOs issued
> * may be pure data writes. In that case, we still need to do a full data sync
> * completion.
> - *
> - * Returns -ENOTBLK In case of a page invalidation invalidation failure for
> - * writes. The callers needs to fall back to buffered I/O in this case.
> */
> struct iomap_dio *
> __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> @@ -493,15 +489,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> if (iov_iter_rw(iter) == WRITE) {
> /*
> * Try to invalidate cache pages for the range we are writing.
> - * If this invalidation fails, let the caller fall back to
> - * buffered I/O.
> + * If this invalidation fails, tough, the write will still work,
> + * but racing two incompatible write paths is a pretty crazy
> + * thing to do, so we don't support it 100%.
> */
> if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
> - end >> PAGE_SHIFT)) {
> - trace_iomap_dio_invalidate_fail(inode, pos, count);
> - ret = -ENOTBLK;
> - goto out_free_dio;
> - }
> + end >> PAGE_SHIFT))
> + dio_warn_stale_pagecache(iocb->ki_filp);
>
> if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
> ret = sb_init_dio_done_wq(inode->i_sb);
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index fdc7ae388476f..5693a39d52fb6 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -74,7 +74,6 @@ DEFINE_EVENT(iomap_range_class, name, \
> DEFINE_RANGE_EVENT(iomap_writepage);
> DEFINE_RANGE_EVENT(iomap_releasepage);
> DEFINE_RANGE_EVENT(iomap_invalidatepage);
> -DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
>
> #define IOMAP_TYPE_STRINGS \
> { IOMAP_HOLE, "HOLE" }, \
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b0f93f738372..43e2c057214d9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -589,8 +589,8 @@ xfs_file_dio_aio_write(
> xfs_iunlock(ip, iolock);
>
> /*
> - * No fallback to buffered IO after short writes for XFS, direct I/O
> - * will either complete fully or return an error.
> + * No fallback to buffered IO on errors for XFS, direct IO will either
> + * complete fully or fail.
> */
> ASSERT(ret < 0 || ret == count);
> return ret;
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bec47f2d074be..d54fbef3db4df 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -851,11 +851,8 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
> return -EFBIG;
>
> - if (iocb->ki_flags & IOCB_DIRECT) {
> - ssize_t ret = zonefs_file_dio_write(iocb, from);
> - if (ret != -ENOTBLK)
> - return ret;
> - }
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return zonefs_file_dio_write(iocb, from);
>
> return zonefs_file_buffered_write(iocb, from);
> }
> --
> 2.35.0.263.gb82422642f-goog
>