Re: current->journal_info got nested! (was Re: [syzbot] [xfs?] [ext4?] general protection fault in jbd2__journal_start)

From: Darrick J. Wong
Date: Tue Jan 30 2024 - 22:46:48 EST


On Wed, Jan 31, 2024 at 10:37:18AM +1100, Dave Chinner wrote:
> On Tue, Jan 30, 2024 at 06:52:21AM -0800, syzbot wrote:
> > syzbot has found a reproducer for the following issue on:
> >
> > HEAD commit: 861c0981648f Merge tag 'jfs-6.8-rc3' of github.com:kleikam..
> > git tree: upstream
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=13ca8d97e80000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=b0b9993d7d6d1990
> > dashboard link: https://syzkaller.appspot.com/bug?extid=cdee56dbcdf0096ef605
> > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=104393efe80000
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1393b90fe80000
> >
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/7c6cc521298d/disk-861c0981.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/6203c94955db/vmlinux-861c0981.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/17e76e12b58c/bzImage-861c0981.xz
> > mounted in repro: https://storage.googleapis.com/syzbot-assets/d31d4eed2912/mount_3.gz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+cdee56dbcdf0096ef605@xxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > general protection fault, probably for non-canonical address 0xdffffc000a8a4829: 0000 [#1] PREEMPT SMP KASAN
> > KASAN: probably user-memory-access in range [0x0000000054524148-0x000000005452414f]
> > CPU: 1 PID: 5065 Comm: syz-executor260 Not tainted 6.8.0-rc2-syzkaller-00031-g861c0981648f #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
> > RIP: 0010:jbd2__journal_start+0x87/0x5d0 fs/jbd2/transaction.c:496
> > Code: 74 63 48 8b 1b 48 85 db 74 79 48 89 d8 48 c1 e8 03 42 80 3c 28 00 74 08 48 89 df e8 63 4d 8f ff 48 8b 2b 48 89 e8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 ef e8 4a 4d 8f ff 4c 39 65 00 0f 85 1a
> > RSP: 0018:ffffc900043265c8 EFLAGS: 00010203
> > RAX: 000000000a8a4829 RBX: ffff8880205fa3a8 RCX: ffff8880235dbb80
> > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88801c1a6000
> > RBP: 000000005452414e R08: 0000000000000c40 R09: 0000000000000001
> ^^^^^^^^
> Hmmmm - TRAN. That's looks suspicious, I'll come back to that.
>
> > R10: dffffc0000000000 R11: ffffed1003834871 R12: ffff88801c1a6000
> > R13: dffffc0000000000 R14: 0000000000000c40 R15: 0000000000000002
> > FS: 0000555556f90380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020020000 CR3: 0000000021fed000 CR4: 00000000003506f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > __ext4_journal_start_sb+0x215/0x5b0 fs/ext4/ext4_jbd2.c:112
> > __ext4_journal_start fs/ext4/ext4_jbd2.h:326 [inline]
> > ext4_dirty_inode+0x92/0x110 fs/ext4/inode.c:5969
> > __mark_inode_dirty+0x305/0xda0 fs/fs-writeback.c:2452
> > generic_update_time fs/inode.c:1905 [inline]
> > inode_update_time fs/inode.c:1918 [inline]
> > __file_update_time fs/inode.c:2106 [inline]
> > file_update_time+0x39b/0x3e0 fs/inode.c:2136
> > ext4_page_mkwrite+0x207/0xdf0 fs/ext4/inode.c:6090
> > do_page_mkwrite+0x197/0x470 mm/memory.c:2966
> > wp_page_shared mm/memory.c:3353 [inline]
> > do_wp_page+0x20e3/0x4c80 mm/memory.c:3493
> > handle_pte_fault mm/memory.c:5160 [inline]
> > __handle_mm_fault+0x26a3/0x72b0 mm/memory.c:5285
> > handle_mm_fault+0x27e/0x770 mm/memory.c:5450
> > do_user_addr_fault arch/x86/mm/fault.c:1415 [inline]
> > handle_page_fault arch/x86/mm/fault.c:1507 [inline]
> > exc_page_fault+0x2ad/0x870 arch/x86/mm/fault.c:1563
> > asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:570
>
> EXt4 is triggering a BUG_ON:
>
> handle_t *handle = journal_current_handle();
> int err;
>
> if (!journal)
> return ERR_PTR(-EROFS);
>
> if (handle) {
> >>>>>>>>> J_ASSERT(handle->h_transaction->t_journal == journal);
> handle->h_ref++;
> return handle;
> }
>
> via a journal assert failure. It appears that current->journal_info
> isn't what it is supposed to be. It's finding something with TRAN in
> it, I think. I'll come back to this.
>
> What syzbot is doing is creating a file on it's root filesystem and
> write()ing 0x208e24b bytes (zeroes from an anonymous mmap() region,
> I think) to it to initialise it's contents.
>
> Then it mmap()s the ext4 file for 0xb36000 bytes and copies the raw
> test filesystem image in the source code into it. It then creates a
> memfd that it decompresses the data in the mapped ext4 file into and
> creates a loop device that points to that memfd. It then mounts the
> loop device and we get an XFS filesystem which doesn't appear to
> contain any corruptions in it at all. It then runs a bulkstat pass
> on the the XFS filesystem.
>
> This is where it gets interesting. The user buffer that XFS
> copies the inode data into points to a memory address inside the
> range of the ext4 file that the filesystem image was copied to.
> It does not overlap with the filesystem image.
>
> Hence when XFS goes to copy the inodes into the user buffer, it
> triggers write page faults on the ext4 backing file.
>
> That's this part of the trace:
>
>
> > RIP: 0010:rep_movs_alternative+0x4a/0x70 arch/x86/lib/copy_user_64.S:71
> > Code: 75 f1 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 df 83 f9 08 73 e8 eb c9 <f3> a4 c3 48 89 c8 48 c1 e9 03 83 e0 07 f3 48 a5 89 c1 85 c9 75 b3
> > RSP: 0018:ffffc900043270f8 EFLAGS: 00050202
> > RAX: ffffffff848cda01 RBX: 0000000020020040 RCX: 0000000000000040
> > RDX: 0000000000000000 RSI: ffff8880131873b0 RDI: 0000000020020000
> > RBP: 1ffff92000864f26 R08: ffff8880131873ef R09: 1ffff11002630e7d
> > R10: dffffc0000000000 R11: ffffed1002630e7e R12: 00000000000000c0
> > R13: dffffc0000000000 R14: 000000002001ff80 R15: ffff888013187330
> > copy_user_generic arch/x86/include/asm/uaccess_64.h:112 [inline]
> > raw_copy_to_user arch/x86/include/asm/uaccess_64.h:133 [inline]
> > _copy_to_user+0x86/0xa0 lib/usercopy.c:41
> > copy_to_user include/linux/uaccess.h:191 [inline]
> > xfs_bulkstat_fmt+0x4f/0x120 fs/xfs/xfs_ioctl.c:744
> > xfs_bulkstat_one_int+0xd8b/0x12e0 fs/xfs/xfs_itable.c:161
> > xfs_bulkstat_iwalk+0x72/0xb0 fs/xfs/xfs_itable.c:239
> > xfs_iwalk_ag_recs+0x4c3/0x820 fs/xfs/xfs_iwalk.c:220
> > xfs_iwalk_run_callbacks+0x25b/0x490 fs/xfs/xfs_iwalk.c:376
> > xfs_iwalk_ag+0xad6/0xbd0 fs/xfs/xfs_iwalk.c:482
> > xfs_iwalk+0x360/0x6f0 fs/xfs/xfs_iwalk.c:584
> > xfs_bulkstat+0x4f8/0x6c0 fs/xfs/xfs_itable.c:308
> > xfs_ioc_bulkstat+0x3d0/0x450 fs/xfs/xfs_ioctl.c:867
> > xfs_file_ioctl+0x6a5/0x1980 fs/xfs/xfs_ioctl.c:1994
> > vfs_ioctl fs/ioctl.c:51 [inline]
> > __do_sys_ioctl fs/ioctl.c:871 [inline]
> > __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
> > do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > do_syscall_64+0xf5/0x230 arch/x86/entry/common.c:83
> > entry_SYSCALL_64_after_hwframe+0x63/0x6b
>
> What is interesting here is this is running in an empty XFS
> transaction context so that the bulkstat operation garbage collects
> all the buffers it accesses without us having to explicit cleanup -
> they all get released when we cancel the transaction context at the
> end of the process.
>
> But that means copy-out is running inside a transaction context, and
> that means current->journal_info contains a pointer to the current
> struct xfs_trans the bulkstat operation is using.
>
> Guess what we have as the first item in a struct xfs_trans? That's
> right, it's a magic number, and that magic number is:
>
> #define XFS_TRANS_HEADER_MAGIC 0x5452414e /* TRAN */
>
> It should be obvious what has happened now -
> current->journal_info is not null, so ext4 thinks it owns the
> structure attached there and panics when it finds that it isn't an
> ext4 journal handle being held there.
>
> I don't think there are any clear rules as to how filesystems can
> and can't use current->journal_info. In general, a task can't jump
> from one filesystem to another inside a transaction context like
> this, so there's never been a serious concern about nested
> current->journal_info assignments like this in the past.
>
> XFS is doing nothing wrong - we're allowed to define transaction
> contexts however we want and use current->journal_info in this way.
> However, we have to acknowledge that ext4 has also done nothing
> wrong by assuming current->journal_info should below to it if it is
> not null. Indeed, XFS does the same thing.

Getting late here, so this will be pretty terse--

Thinking narrowly about just xfs, I think this means that the
bulkstat/inumbers implementations need to allocate a bounce buffer to
format records into so that it can copy_to_user without any locks held.
We have no idea if the destination page is a file page or anonymous
memory or whatever. Or: Do we really need to set current->journal_info
for empty transactions?

> The question here is what to do about this? The obvious solution is
> to have save/restore semantics in the filesystem code that
> sets/clears current->journal_info, and then filesystems can also do
> the necessary "recursion into same filesystem" checks they need to
> ensure that they aren't nesting transactions in a way that can
> deadlock.

We don't know what locks might be held by whatever code set
journal_info. I don't see how we could push it aside sanely?

> Maybe there are other options - should filesystems even be allowed to
> trigger page faults when they have set current->journal_info?

I wonder if we ought to be checking current->journal_info in our own
page_mkwrite handler and throwing back an errno? I don't think we want
to go down the rabbithele of "someone else was running a transaction,
maybe we can proceed with an update anyway???".

> What other potential avenues are there that could cause this sort of
> transaction context nesting that we haven't realised exist? Does
> ext4 data=jounral have problems like this in the data copy-in path?
> What other filesystems allow page faults in transaction contexts?

Ugh, whackamole. :/

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>