Re: [RFC 00/23] Enable block size > page size in XFS

From: Dave Chinner
Date: Thu Sep 21 2023 - 17:16:15 EST


On Wed, Sep 20, 2023 at 09:57:56PM -0700, Luis Chamberlain wrote:
> On Wed, Sep 20, 2023 at 08:00:12PM -0700, Luis Chamberlain wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=large-block-linus
> >
> > I haven't tested yet the second branch I pushed though but it applied without any changes
> > so it should be good (usual famous last words).
>
> I have run some preliminary tests on that branch as well above using fsx
> with larger LBA formats running them all on the *same* system at the
> same time. Kernel is happy.
>
> root@linus ~ # uname -r
> 6.6.0-rc2-large-block-linus+
>
> root@linus ~ # mount | grep mnt
> /dev/nvme17n1 on /mnt-16k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme13n1 on /mnt-32k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme11n1 on /mnt-64k-16ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme18n1 on /mnt-32k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme14n1 on /mnt-64k-32ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme7n1 on /mnt-64k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme4n1 on /mnt-32k-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme3n1 on /mnt-16k-512b type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme9n1 on /mnt-64k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=64k,noquota)
> /dev/nvme8n1 on /mnt-32k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme6n1 on /mnt-16k-4ks type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme5n1 on /mnt-4k type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/nvme1n1 on /mnt-512 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>
> root@linus ~ # ps -ef| grep fsx
> root 45601 45172 44 04:02 pts/3 00:20:26 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k/foo
> root 46207 45658 39 04:04 pts/5 00:17:18 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-16ks/foo
> root 46792 46289 35 04:06 pts/7 00:14:36 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-16ks/foo
> root 47293 46899 39 04:08 pts/9 00:15:30 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k/foo
> root 47921 47338 34 04:10 pts/11 00:12:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-32ks/foo
> root 48898 48484 32 04:14 pts/13 00:10:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-512b/foo
> root 49313 48939 35 04:15 pts/15 00:11:38 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-512/foo
> root 49729 49429 40 04:17 pts/17 00:12:27 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-512b/foo
> root 50085 49794 33 04:18 pts/19 00:09:56 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-64k-4ks/foo
> root 50449 50130 36 04:19 pts/21 00:10:28 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-32k-4ks/foo
> root 50844 50517 41 04:20 pts/23 00:11:22 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-16k-4ks/foo
> root 51135 50893 52 04:21 pts/25 00:13:57 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-4k/foo
> root 52061 51193 49 04:25 pts/27 00:11:21 /var/lib/xfstests/ltp/fsx -q -S 0 -p 1000000 /mnt-512/foo
> root 57668 52131 0 04:48 pts/29 00:00:00 grep fsx

So I just pulled this, built it and ran generic/091 as the very
first test on this:

# ./run_check.sh --mkfs-opts "-m rmapbt=1 -b size=64k" --run-opts "-s xfs_64k generic/091"
.....
meta-data=/dev/pmem0 isize=512 agcount=4, agsize=32768 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=65536 blocks=131072, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=65536 ascii-ci=0, ftype=1
log =internal log bsize=65536 blocks=2613, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=65536 blocks=0, rtextents=0
....
Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_64k generic/091
SECTION -- xfs_64k
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 test3 6.6.0-rc2-large-block-linus-dgc+ #1906 SMP PREEMPT_DYNAMIC Thu Sep 21 15:19:47 AEST 2023
MKFS_OPTIONS -- -f -m rmapbt=1 -b size=64k /dev/pmem1
MOUNT_OPTIONS -- -o dax=never -o context=system_u:object_r:root_t:s0 /dev/pmem1 /mnt/scratch

generic/091 10s ... [failed, exit status 1]- output mismatch (see /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad)
--- tests/generic/091.out 2022-12-21 15:53:25.467044754 +1100
+++ /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad 2023-09-21 15:47:48.222559248 +1000
@@ -1,7 +1,113 @@
QA output created by 091
fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
...
(Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/091.out /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad' to see the entire diff)
Failures: generic/091
Failed 1 of 1 tests
Xunit report: /home/dave/src/xfstests-dev/results//xfs_64k/result.xml

SECTION -- xfs_64k
=========================
Failures: generic/091
Failed 1 of 1 tests


real 0m4.214s
user 0m0.972s
sys 0m3.603s
#

For all these assertions about how none of your testing is finding
bugs in this code, it's taken me *4 seconds* of test runtime to find
the first failure.

And, well, it's the same failure as I reported for the previous
version of this code:

# cat /home/dave/src/xfstests-dev/results//xfs_64k/generic/091.out.bad
/home/dave/src/xfstests-dev/ltp/fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/test/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support exchange range, disabling!
fallocating to largest ever: 0x79f06
READ BAD DATA: offset = 0x18000, size = 0xf000, fname = /mnt/test/junk
OFFSET GOOD BAD RANGE
0x21000 0x0000 0x9008 0x0
operation# (mod 256) for the bad data may be 144
0x21001 0x0000 0x0810 0x1
operation# (mod 256) for the bad data may be 16
0x21002 0x0000 0x1000 0x2
operation# (mod 256) for the bad data may be 16
0x21005 0x0000 0x8e00 0x3
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21007 0x0000 0x82ff 0x4
operation# (mod 256) for the bad data may be 255
0x21008 0x0000 0xffff 0x5
operation# (mod 256) for the bad data may be 255
0x21009 0x0000 0xffff 0x6
operation# (mod 256) for the bad data may be 255
0x2100a 0x0000 0xffff 0x7
operation# (mod 256) for the bad data may be 255
0x2100b 0x0000 0xff00 0x8
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21010 0x0000 0x700b 0x9
operation# (mod 256) for the bad data may be 112
0x21011 0x0000 0x0b10 0xa
operation# (mod 256) for the bad data may be 16
0x21012 0x0000 0x1000 0xb
operation# (mod 256) for the bad data may be 16
0x21014 0x0000 0x038e 0xc
operation# (mod 256) for the bad data may be 3
0x21015 0x0000 0x8e00 0xd
operation# (mod 256) for the bad data unknown, check HOLE and EXTEND ops
0x21017 0x0000 0x82ff 0xe
operation# (mod 256) for the bad data may be 255
0x21018 0x0000 0xffff 0xf
operation# (mod 256) for the bad data may be 255
LOG DUMP (69 total operations):
1( 1 mod 256): FALLOC 0x6ba10 thru 0x79f06 (0xe4f6 bytes) EXTENDING
2( 2 mod 256): SKIPPED (no operation)
3( 3 mod 256): SKIPPED (no operation)
4( 4 mod 256): TRUNCATE DOWN from 0x79f06 to 0x51800
5( 5 mod 256): SKIPPED (no operation)
6( 6 mod 256): READ 0x1b000 thru 0x21fff (0x7000 bytes)
7( 7 mod 256): PUNCH 0x2ce7a thru 0x39b9e (0xcd25 bytes)
8( 8 mod 256): PUNCH 0x29238 thru 0x29f57 (0xd20 bytes)
9( 9 mod 256): COPY 0x3000 thru 0x9fff (0x7000 bytes) to 0x40400 thru 0x473ff
10( 10 mod 256): READ 0x16000 thru 0x21fff (0xc000 bytes)
11( 11 mod 256): FALLOC 0x4a42b thru 0x4b8f7 (0x14cc bytes) INTERIOR
12( 12 mod 256): TRUNCATE DOWN from 0x51800 to 0x15c00 ******WWWW
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): READ 0xb000 thru 0x14fff (0xa000 bytes)
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): SKIPPED (no operation)
17( 17 mod 256): SKIPPED (no operation)
18( 18 mod 256): READ 0x3000 thru 0x11fff (0xf000 bytes)
19( 19 mod 256): FALLOC 0x69b94 thru 0x6c922 (0x2d8e bytes) EXTENDING
20( 20 mod 256): SKIPPED (no operation)
21( 21 mod 256): SKIPPED (no operation)
22( 22 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
23( 23 mod 256): SKIPPED (no operation)
24( 24 mod 256): SKIPPED (no operation)
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): ZERO 0x1fba0 thru 0x2c568 (0xc9c9 bytes) ******ZZZZ
27( 27 mod 256): READ 0x4f000 thru 0x50fff (0x2000 bytes)
28( 28 mod 256): READ 0x39000 thru 0x3afff (0x2000 bytes)
29( 29 mod 256): WRITE 0x40200 thru 0x4cdff (0xcc00 bytes)
30( 30 mod 256): SKIPPED (no operation)
31( 31 mod 256): WRITE 0x47e00 thru 0x547ff (0xca00 bytes)
32( 32 mod 256): SKIPPED (no operation)
33( 33 mod 256): READ 0x28000 thru 0x29fff (0x2000 bytes)
34( 34 mod 256): SKIPPED (no operation)
35( 35 mod 256): READ 0x69000 thru 0x6bfff (0x3000 bytes)
36( 36 mod 256): READ 0x16000 thru 0x20fff (0xb000 bytes)
37( 37 mod 256): ZERO 0x45150 thru 0x47e9c (0x2d4d bytes)
38( 38 mod 256): SKIPPED (no operation)
39( 39 mod 256): SKIPPED (no operation)
40( 40 mod 256): COPY 0x10000 thru 0x11fff (0x2000 bytes) to 0x22a00 thru 0x249ff
41( 41 mod 256): WRITE 0x29000 thru 0x2efff (0x6000 bytes)
42( 42 mod 256): ZERO 0x59c7 thru 0x13eee (0xe528 bytes)
43( 43 mod 256): FALLOC 0x1fdbf thru 0x2e694 (0xe8d5 bytes) INTERIOR ******FFFF
44( 44 mod 256): SKIPPED (no operation)
45( 45 mod 256): ZERO 0x740f5 thru 0x7a11f (0x602b bytes)
46( 46 mod 256): SKIPPED (no operation)
47( 47 mod 256): WRITE 0x14200 thru 0x1e3ff (0xa200 bytes)
48( 48 mod 256): READ 0x69000 thru 0x6bfff (0x3000 bytes)
49( 49 mod 256): TRUNCATE DOWN from 0x6c922 to 0x16a00 ******WWWW
50( 50 mod 256): WRITE 0x15000 thru 0x163ff (0x1400 bytes)
51( 51 mod 256): PUNCH 0x3b5e thru 0xa2c1 (0x6764 bytes)
52( 52 mod 256): SKIPPED (no operation)
53( 53 mod 256): SKIPPED (no operation)
54( 54 mod 256): WRITE 0x34a00 thru 0x3fdff (0xb400 bytes) HOLE ***WWWW
55( 55 mod 256): WRITE 0x38000 thru 0x397ff (0x1800 bytes)
56( 56 mod 256): PUNCH 0x7922 thru 0x115f0 (0x9ccf bytes)
57( 57 mod 256): SKIPPED (no operation)
58( 58 mod 256): SKIPPED (no operation)
59( 59 mod 256): SKIPPED (no operation)
60( 60 mod 256): FALLOC 0x300a8 thru 0x331d0 (0x3128 bytes) INTERIOR
61( 61 mod 256): ZERO 0x3799c thru 0x39245 (0x18aa bytes)
62( 62 mod 256): ZERO 0x62fc3 thru 0x6b630 (0x866e bytes)
63( 63 mod 256): SKIPPED (no operation)
64( 64 mod 256): ZERO 0x6110a thru 0x61dad (0xca4 bytes)
65( 65 mod 256): FALLOC 0x1d8ca thru 0x20876 (0x2fac bytes) INTERIOR
66( 66 mod 256): COPY 0x65000 thru 0x68fff (0x4000 bytes) to 0x22400 thru 0x263ff
67( 67 mod 256): SKIPPED (no operation)
68( 68 mod 256): WRITE 0x36a00 thru 0x415ff (0xac00 bytes)
69( 69 mod 256): READ 0x18000 thru 0x26fff (0xf000 bytes) ***RRRR***
Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Guess what? The fsx parameters being used mean it is testing things you
aren't. Yes, the '-Z -R -W' means it is using direct IO for reads and writes,
and mmap() is disabled. The other parameters indicate that it is using 4k
aligned reads and 512 byte aligned writes and truncates.
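
To make the alignment point concrete, here is a rough sketch (not part
of fsx or fstests) that takes a handful of operations copied from the
log dump above and checks them against the boundaries from the fsx
command line (-r 4096, -w 512) and against the 4096 byte sector size
mkfs reported. The names classify() and aligned() are made up for
illustration only.

```python
# Sketch only: check a few operations from the fsx log dump above against
# the fsx alignment boundaries and the filesystem sector size.

READ_BDY = 4096    # fsx -r 4096
WRITE_BDY = 512    # fsx -w 512
FS_SECTOR = 4096   # sectsz=4096 from the mkfs output


def aligned(value, boundary):
    """True if value is a multiple of boundary."""
    return value % boundary == 0


def classify(op, offset, length):
    """Report alignment of one IO against the fsx boundary and the fs sector."""
    bdy = READ_BDY if op == "READ" else WRITE_BDY
    return {
        "fsx_aligned": aligned(offset, bdy) and aligned(length, bdy),
        "sector_aligned": aligned(offset, FS_SECTOR) and aligned(length, FS_SECTOR),
    }


# (operation, offset, length) copied from the log dump
ops = [
    ("WRITE", 0x23000, 0x5600),  # op 22
    ("WRITE", 0x34a00, 0xb400),  # op 54
    ("WRITE", 0x36a00, 0xac00),  # op 68
    ("READ", 0x18000, 0xf000),   # op 69, the read that saw bad data
]

for op, offset, length in ops:
    print(op, hex(offset), hex(length), classify(op, offset, length))
```

All three writes satisfy the 512 byte boundary fsx was asked to use,
but none of them is 4096 byte aligned in both offset and length. Only
the read is.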

There is a reason there are multiple different fsx tests in fstests;
they all exercise different sets of IO behaviours and alignments,
and they exercise the IO paths differently.

So there's clearly something wrong here - it's likely that the
filesystem IO alignment parameters pulled from the underlying block
device (4k physical, 512 byte logical sector sizes) are improperly
interpreted, i.e. for a filesystem with a sector size of 4kB,
direct IO with an alignment of 512 bytes should be rejected......
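
As a minimal sketch of the behaviour I'd expect, assuming the check is
purely "offset and length must be multiples of the filesystem sector
size": the function below is hypothetical, it is not the kernel's
implementation, and it ignores memory buffer alignment entirely.

```python
import errno


def check_dio_alignment(offset, length, sector_size):
    """Hypothetical direct IO alignment check.

    Returns 0 if the IO is acceptable, -EINVAL if offset or length is not
    a multiple of sector_size. sector_size must be a power of two.
    """
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector_size must be a power of two")
    mask = sector_size - 1
    # Any low bit set in either offset or length means misalignment.
    if (offset | length) & mask:
        return -errno.EINVAL
    return 0


# Write op 54 from the log: 512 byte aligned, but not 4kB aligned.
print(check_dio_alignment(0x34a00, 0xb400, 512))   # accepted at 512 bytes
print(check_dio_alignment(0x34a00, 0xb400, 4096))  # rejected at 4kB
```

With a 4kB sector size the 512 byte aligned write would come back
with -EINVAL rather than being issued.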

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx