Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD

From: Luis Chamberlain
Date: Fri Apr 28 2023 - 20:12:13 EST


On Mon, Apr 24, 2023 at 07:49:26AM +0200, Christoph Hellwig wrote:
> +const struct address_space_operations def_blk_aops = {
> + .dirty_folio = filemap_dirty_folio,
> + .release_folio = iomap_release_folio,
> + .invalidate_folio = iomap_invalidate_folio,
> + .read_folio = blkdev_read_folio,
> + .readahead = blkdev_readahead,
> + .writepages = blkdev_writepages,
> + .is_partially_uptodate = iomap_is_partially_uptodate,
> + .error_remove_page = generic_error_remove_page,
> + .migrate_folio = filemap_migrate_folio,
> +};
> +#endif /* CONFIG_BUFFER_HEAD */

We've tested this with bs > ps (block size > page size, LBS) devices
and it crashes. As Pankaj notes, this is perhaps because the block cache
path does not support higher-order folios yet. The same crash happens
with NVMe (using the out-of-tree nvme_core.debug_large_lbas boot
parameter to enable NVMe LBS) and with brd with LBS. Enabling either one
requires out-of-tree patches, so I've stashed these into a branch,
large-block-20230426 [0], to help folks who may want to experiment
further.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=large-block-20230426

[ 11.245248] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 11.254581] #PF: supervisor read access in kernel mode
[ 11.257387] #PF: error_code(0x0000) - not-present page
[ 11.260921] PGD 0 P4D 0
[ 11.262600] Oops: 0000 [#1] PREEMPT SMP PTI
[ 11.264993] CPU: 7 PID: 198 Comm: (udev-worker) Not tainted 6.3.0-large-block-20230426 #2
[ 11.269385] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 11.275054] RIP: 0010:iomap_page_create.isra.0+0xc/0xd0
[ 11.277924] Code: 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 f5 53 <48> 8b 06 48 c1 e8 0d 89 c6 83 e6 01 0f 84 a1 00 00 00 4c 8b 65 28
[ 11.287293] RSP: 0018:ffffb0f0805ef9d8 EFLAGS: 00010293
[ 11.289964] RAX: ffff9de3c1fa8388 RBX: ffffb0f0805efa78 RCX: 000000037ffe0000
[ 11.293212] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000d
[ 11.296485] RBP: 0000000000000000 R08: 0000000000021000 R09: ffffffff9c733b20
[ 11.299724] R10: 0000000000000001 R11: 000000000000c000 R12: 0000000000000000
[ 11.302974] R13: ffffffff9be96260 R14: ffffb0f0805efa58 R15: 0000000000000000
[ 11.306206] FS: 00007f03ea8368c0(0000) GS:ffff9de43bdc0000(0000) knlGS:0000000000000000
[ 11.309949] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 11.312464] CR2: 0000000000000000 CR3: 0000000117ec6006 CR4: 0000000000770ee0
[ 11.315442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 11.318310] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 11.321010] PKRU: 55555554
[ 11.322212] Call Trace:
[ 11.323224] <TASK>
[ 11.324146] iomap_readpage_iter+0x96/0x300
[ 11.325694] iomap_readahead+0x174/0x2d0
[ 11.327129] read_pages+0x69/0x1f0
[ 11.328389] ? folio_add_lru+0x7e/0xe0
[ 11.329751] page_cache_ra_unbounded+0x187/0x1d0
[ 11.331301] force_page_cache_ra+0x94/0xb0
[ 11.332681] filemap_get_pages+0x10e/0x650
[ 11.334073] ? _raw_spin_lock+0x13/0x40
[ 11.335287] filemap_read+0xbf/0x340
[ 11.336430] ? aa_file_perm+0x117/0x4b0
[ 11.337646] ? generic_fillattr+0x45/0xf0
[ 11.338887] ? _copy_to_user+0x22/0x30
[ 11.340026] ? cp_new_stat+0x150/0x180
[ 11.341166] blkdev_read_iter+0x5e/0x140
[ 11.342357] vfs_read+0x1f0/0x2c0
[ 11.343354] ksys_read+0x63/0xe0
[ 11.344331] do_syscall_64+0x37/0x90
[ 11.345411] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 11.346760] RIP: 0033:0x7f03eaf3903d

(gdb) l *(iomap_readpage_iter+0x96)
0xffffffff814021b6 is in iomap_readpage_iter (fs/iomap/buffered-io.c:280).
275 if (iomap->type == IOMAP_INLINE)
276 return iomap_read_inline_data(iter, folio);
277
278 /* zero post-eof blocks as the page may be mapped */
279 iop = iomap_page_create(iter->inode, folio, iter->flags);
280 iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
281 if (plen == 0)
282 goto done;
283
284 if (iomap_block_needs_zeroing(iter, pos)) {
(gdb) l *(iomap_page_create+0xc)
0xffffffff81400cdc is in iomap_page_create (./arch/x86/include/asm/bitops.h:207).
202 }
203
204 static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
205 {
206 return ((1UL << (nr & (BITS_PER_LONG-1))) &
207 (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
208 }
209
210 static __always_inline bool constant_test_bit_acquire(long nr, const volatile unsigned long *addr)
211 {
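
For anyone reading the oops without the tree at hand: the faulting
bytes <48> 8b 06 48 c1 e8 0d ... 83 e6 01 decode to a load from (%rsi),
a shift right by 0xd and a mask with 1, i.e. a test of bit 13 in the
first word behind the pointer in RSI, and RSI is 0. That matches the
constant_test_bit() frame gdb resolves above. Reading this as a folio
flag test on a NULL folio is my interpretation, not something confirmed
in this thread. Below is a minimal user-space sketch of the same
arithmetic; the sketch_* names are made up for illustration and are not
kernel symbols.

```c
/*
 * User-space sketch only, not kernel code. It mirrors the shape of
 * constant_test_bit() from the gdb listing above. All sketch_* names
 * are hypothetical.
 */
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Select the word that holds bit nr, then mask that bit out of it. */
static bool sketch_test_bit(long nr, const volatile unsigned long *addr)
{
	return ((1UL << ((size_t)nr % SKETCH_BITS_PER_LONG)) &
		addr[(size_t)nr / SKETCH_BITS_PER_LONG]) != 0;
}

/* Index of the word that a test of bit nr has to load. */
static size_t sketch_word_index(long nr)
{
	return (size_t)nr / SKETCH_BITS_PER_LONG;
}

/* Byte offset from the base pointer of that load. */
static size_t sketch_load_offset(long nr)
{
	return sketch_word_index(nr) * sizeof(unsigned long);
}
```

For bit 13 the load offset is 0, so with a NULL base pointer the load
hits address 0, which is consistent with CR2 in the oops.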

To reproduce, one would want a system with only, say, XFS as the root
image. I've enabled this on kdevops through the "pure-iomap" option:

https://github.com/linux-kdevops/kdevops/blob/master/docs/lbs.md

Luis