Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression

From: Yujie Liu
Date: Fri Mar 08 2024 - 03:46:13 EST


On Thu, Mar 07, 2024 at 06:19:46PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> > Thanks for testing! This is an interesting result and certainly unexpected
> > for me. The readahead code allocates naturally aligned pages so based on
> > the distribution of allocations it seems that before commit ab4443fe3ca6
> > readahead window was at least 32 pages (128KB) aligned and so we allocated
> > order 5 pages. After the commit, the readahead window somehow ended up only
> > aligned to 20 modulo 32. To follow natural alignment and fill 128KB
> > readahead window we allocated order 2 page (got us to offset 24 modulo 32),
> > then order 3 page (got us to offset 0 modulo 32), order 4 page (larger
> > would not fit in 128KB readahead window now), and order 2 page to finish
> > filling the readahead window.
> >
> > Now I'm not 100% sure why the readahead window alignment changed with
> > different rounding when placing readahead mark - probably that's some
> > artifact when readahead window is tiny in the beginning before we scale it
> > up (I'll verify by tracing whether everything ends up looking correctly
> > with the current code). So I don't expect this is a problem in ab4443fe3ca6
> > as such but it exposes the issue that readahead page insertion code should
> > perhaps strive to achieve better readahead window alignment with logical
> > file offset even at the cost of occasionally performing somewhat shorter
> > readahead. I'll look into this once I dig out of the huge heap of email
> > after vacation...
>
> I was surprised by what you said here, so I went and re-read the code
> and it doesn't work the way I thought it did. So I had a good long think
> about how it _should_ work, and I looked for some more corner conditions,
> and this is what I came up with.
>
> The first thing I've done is separate out the two limits. The EOF is
> a hard limit; we will not allocate pages beyond EOF. The ra->size is
> a soft limit; we will allocate pages beyond ra->size, but not too far.
>
> The second thing I noticed is that index + ra_size could wrap. So add
> a check for that, and set it to ULONG_MAX. index + ra_size - async_size
> could also wrap, but this is harmless. We certainly don't want to kick
> off any more readahead in this circumstance, so leaving 'mark' outside
> the range [index..ULONG_MAX] is just fine.
>
> The third thing is that we could allocate a folio which contains a page
> at ULONG_MAX. We don't really want that in the page cache; it makes
> filesystems more complicated if they have to check for that, and we
> don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
> This _should_ already be prohibited by the "Don't allocate pages past EOF"
> check, but let's explicitly prohibit it.
>
> Compile tested only.

We applied the diff on top of commit ab4443fe3ca6 but got a kernel panic
when running the dd test:

[ 109.259674][ C46] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [ dd:8616]
[ 109.268946][ C46] Modules linked in: xfs loop intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm irqbypass crct10dif_pclmul crc32_pclmul sd_mod raid6_pq ghash_clmulni_intel libcrc32c crc32c_intel sg sha512_ssse3 i915 nvme rapl drm_buddy nvme_core intel_gtt ahci t10_pi drm_display_helper ast intel_cstate libahci ipmi_ssif ttm drm_shmem_helper mei_me i2c_i801 crc64_rocksoft_generic video crc64_rocksoft acpi_ipmi intel_uncore megaraid_sas mei drm_kms_helper joydev libata i2c_ismt i2c_smbus dax_hmem crc64 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter drm fuse ip_tables
[ 109.336216][ C46] CPU: 46 PID: 8616 Comm: dd Tainted: G I 6.8.0-rc1-00005-g6c6de6e42e46 #1
[ 109.347892][ C46] Hardware name: NULL NULL/NULL, BIOS 05.02.01 05/12/2023
[ 109.356324][ C46] RIP: 0010:page_cache_ra_order (mm/readahead.c:521)
[ 109.363394][ C46] Code: cf 48 89 e8 4c 89 fa 48 d3 e0 48 01 c2 75 09 83 e9 01 48 89 e8 48 d3 e0 49 8d 77 ff 48 01 f0 49 39 c6 73 11 83 e9 01 48 89 e8 <48> d3 e0 48 01 f0 49 39 c6 72 ef 31 c0 83 f9 01 8b 3c 24 0f 44 c8
All code
========
0: cf iret
1: 48 89 e8 mov %rbp,%rax
4: 4c 89 fa mov %r15,%rdx
7: 48 d3 e0 shl %cl,%rax
a: 48 01 c2 add %rax,%rdx
d: 75 09 jne 0x18
f: 83 e9 01 sub $0x1,%ecx
12: 48 89 e8 mov %rbp,%rax
15: 48 d3 e0 shl %cl,%rax
18: 49 8d 77 ff lea -0x1(%r15),%rsi
1c: 48 01 f0 add %rsi,%rax
1f: 49 39 c6 cmp %rax,%r14
22: 73 11 jae 0x35
24: 83 e9 01 sub $0x1,%ecx
27: 48 89 e8 mov %rbp,%rax
2a:* 48 d3 e0 shl %cl,%rax <-- trapping instruction
2d: 48 01 f0 add %rsi,%rax
30: 49 39 c6 cmp %rax,%r14
33: 72 ef jb 0x24
35: 31 c0 xor %eax,%eax
37: 83 f9 01 cmp $0x1,%ecx
3a: 8b 3c 24 mov (%rsp),%edi
3d: 0f 44 c8 cmove %eax,%ecx

Code starting with the faulting instruction
===========================================
0: 48 d3 e0 shl %cl,%rax
3: 48 01 f0 add %rsi,%rax
6: 49 39 c6 cmp %rax,%r14
9: 72 ef jb 0xfffffffffffffffa
b: 31 c0 xor %eax,%eax
d: 83 f9 01 cmp $0x1,%ecx
10: 8b 3c 24 mov (%rsp),%edi
13: 0f 44 c8 cmove %eax,%ecx
[ 109.385897][ C46] RSP: 0018:ffa0000012837c00 EFLAGS: 00000206
[ 109.393176][ C46] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000020159674
[ 109.402607][ C46] RDX: 000000000004924c RSI: 0000000000049249 RDI: ff11003f6f3ae7c0
[ 109.412038][ C46] RBP: 0000000000000001 R08: 0000000000038700 R09: 0000000000000013
[ 109.421447][ C46] R10: 0000000000022c04 R11: 0000000000000001 R12: ffa0000012837cb0
[ 109.430868][ C46] R13: ffd400004fee4b40 R14: 0000000000049249 R15: 000000000004924a
[ 109.440270][ C46] FS: 00007f777e884640(0000) GS:ff11003f6f380000(0000) knlGS:0000000000000000
[ 109.450756][ C46] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.458603][ C46] CR2: 00007f2d4d425020 CR3: 00000001b4f84005 CR4: 0000000000f71ef0
[ 109.468003][ C46] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 109.477392][ C46] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 109.486794][ C46] PKRU: 55555554
[ 109.491197][ C46] Call Trace:
[ 109.495300][ C46] <IRQ>
[ 109.498922][ C46] ? watchdog_timer_fn (kernel/watchdog.c:548)
[ 109.505074][ C46] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:466)
[ 109.511620][ C46] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[ 109.518059][ C46] ? hrtimer_interrupt (kernel/time/hrtimer.c:1817)
[ 109.524088][ C46] ? __sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1065 arch/x86/kernel/apic/apic.c:1082)
[ 109.531286][ C46] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
[ 109.538190][ C46] </IRQ>
[ 109.541867][ C46] <TASK>
[ 109.545545][ C46] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649)
[ 109.552832][ C46] ? page_cache_ra_order (mm/readahead.c:521)
[ 109.559122][ C46] filemap_get_pages (mm/filemap.c:2500)
[ 109.564935][ C46] filemap_read (mm/filemap.c:2594)
[ 109.570241][ C46] xfs_file_buffered_read (fs/xfs/xfs_file.c:315) xfs
[ 109.577202][ C46] xfs_file_read_iter (fs/xfs/xfs_file.c:341) xfs
[ 109.583749][ C46] vfs_read (include/linux/fs.h:2079 fs/read_write.c:395 fs/read_write.c:476)
[ 109.588762][ C46] ksys_read (fs/read_write.c:619)
[ 109.593660][ C46] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 109.599038][ C46] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
[ 109.605982][ C46] RIP: 0033:0x7f777e78d3ce
[ 109.611255][ C46] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
All code
========
0: c0 e9 b6 shr $0xb6,%cl
3: fe (bad)
4: ff (bad)
5: ff 50 48 call *0x48(%rax)
8: 8d 3d 6e 08 0b 00 lea 0xb086e(%rip),%edi # 0xb087c
e: e8 69 01 02 00 call 0x2017c
13: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
1a: 00 00
1c: 64 8b 04 25 18 00 00 mov %fs:0x18,%eax
23: 00
24: 85 c0 test %eax,%eax
26: 75 14 jne 0x3c
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 5a ja 0x8c
32: c3 ret
33: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
3a: 00 00
3c: 48 83 ec 28 sub $0x28,%rsp

Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 5a ja 0x62
8: c3 ret
9: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
10: 00 00
12: 48 83 ec 28 sub $0x28,%rsp
[ 109.633619][ C46] RSP: 002b:00007ffc78ab2778 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 109.643392][ C46] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f777e78d3ce
[ 109.652686][ C46] RDX: 0000000000001000 RSI: 00005629c0f7c000 RDI: 0000000000000000
[ 109.661976][ C46] RBP: 00005629c0f7c000 R08: 00005629c0f7bd30 R09: 00007f777e870be0
[ 109.671251][ C46] R10: 00005629c0f7c000 R11: 0000000000000246 R12: 0000000000000000
[ 109.680528][ C46] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffff
[ 109.689808][ C46] </TASK>
[ 109.693512][ C46] Kernel panic - not syncing: softlockup: hung tasks


# mm/readahead.c

486 void page_cache_ra_order(struct readahead_control *ractl,
487 struct file_ra_state *ra, unsigned int new_order)
488 {
489 struct address_space *mapping = ractl->mapping;
490 pgoff_t index = readahead_index(ractl);
491 pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
492 pgoff_t limit = index + ra->size;
493 pgoff_t mark = index + ra->size - ra->async_size;
494 int err = 0;
495 gfp_t gfp = readahead_gfp_mask(mapping);
496
497 if (!mapping_large_folio_support(mapping) || ra->size < 4)
498 goto fallback;
499
500 if (new_order < MAX_PAGECACHE_ORDER) {
501 new_order += 2;
502 if (new_order > MAX_PAGECACHE_ORDER)
503 new_order = MAX_PAGECACHE_ORDER;
504 while ((1 << new_order) > ra->size)
505 new_order--;
506 }
507
508 if (limit < index)
509 limit = ULONG_MAX;
510
511 filemap_invalidate_lock_shared(mapping);
512 while (index < limit) {
513 unsigned int order = new_order;
514
515 /* Align with smaller pages if needed */
516 if (index & ((1UL << order) - 1))
517 order = __ffs(index);
518 if (index + (1UL << order) == 0)
519 order--;
520 /* Don't allocate pages past EOF */
521 while (index + (1UL << order) - 1 > last)
522 order--;
523 /* THP machinery does not support order-1 */
524 if (order == 1)
525 order = 0;
526 err = ra_alloc_folio(ractl, index, mark, order, gfp);
527 if (err)
528 break;
529 index += 1UL << order;
530 }
531
532 if (index > limit) {
533 ra->size += index - limit - 1;
534 ra->async_size += index - limit - 1;
535 }
536
537 read_pages(ractl);
538 filemap_invalidate_unlock_shared(mapping);
539
540 /*
541 * If there were already pages in the page cache, then we may have
542 * left some gaps. Let the regular readahead code take care of this
543 * situation.
544 */
545 if (!err)
546 return;
547 fallback:
548 do_page_cache_ra(ractl, ra->size, ra->async_size);
549 }


Regards,
Yujie

> diff --git a/mm/readahead.c b/mm/readahead.c
> index 130c0e7df99f..742e1f39035b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
> {
> struct address_space *mapping = ractl->mapping;
> pgoff_t index = readahead_index(ractl);
> - pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> + pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> + pgoff_t limit = index + ra->size;
> pgoff_t mark = index + ra->size - ra->async_size;
> int err = 0;
> gfp_t gfp = readahead_gfp_mask(mapping);
> @@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
> if (!mapping_large_folio_support(mapping) || ra->size < 4)
> goto fallback;
>
> - limit = min(limit, index + ra->size - 1);
> -
> if (new_order < MAX_PAGECACHE_ORDER) {
> new_order += 2;
> new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
> new_order = min_t(unsigned int, new_order, ilog2(ra->size));
> }
>
> + if (limit < index)
> + limit = ULONG_MAX;
> filemap_invalidate_lock_shared(mapping);
> - while (index <= limit) {
> + while (index < limit) {
> unsigned int order = new_order;
>
> /* Align with smaller pages if needed */
> if (index & ((1UL << order) - 1))
> order = __ffs(index);
> + /* Avoid wrap */
> + if (index + (1UL << order) == 0)
> + order--;
> /* Don't allocate pages past EOF */
> - while (index + (1UL << order) - 1 > limit)
> + while (index + (1UL << order) - 1 > last)
> order--;
> err = ra_alloc_folio(ractl, index, mark, order, gfp);
> if (err)