Re: BUG: KCSAN: data-race in folio_batch_move_lru / mpage_read_end_io

From: Matthew Wilcox
Date: Thu Aug 31 2023 - 10:53:02 EST


On Mon, Aug 28, 2023 at 11:14:23PM +0200, Mirsad Todorovac wrote:
> BUG: KCSAN: data-race in folio_batch_move_lru / mpage_read_end_io

This one's still niggling at me. I've trimmed the timestamps and some
of the other irrelevant stuff out of this to make it easier to read.

> value changed: 0x0017ffffc0020001 -> 0x0017ffffc0020004

Notionally I understand this. This is page->flags and the PG_locked bit
was set initially, but after a short delay PG_locked was cleared and
PG_uptodate was set. That's _normal_. For many, many pages, we set the
locked bit, initiate a read; the device does a DMA, sends an interrupt;
the interrupt handler sets the PG_uptodate bit and clears the PG_locked
bit to indicate the page is no longer under I/O.

But what I don't understand is how we see this for _this_ page.

> write (marked) to 0xffffef9a44978bc0 of 8 bytes by interrupt on cpu 28:
> mpage_read_end_io (arch/x86/include/asm/bitops.h:55 include/asm-generic/bitops/instrumented-atomic.h:29 include/linux/page-flags.h:739 fs/mpage.c:55)
> bio_endio (block/bio.c:1617)
> blk_mq_end_request_batch (block/blk-mq.c:850 block/blk-mq.c:1088)
> nvme_pci_complete_batch (drivers/nvme/host/pci.c:986) nvme
> nvme_irq (drivers/nvme/host/pci.c:1086) nvme

This is the interrupt handler. It's doing what it's supposed to;
marking the page uptodate and unlocking it.

> read to 0xffffef9a44978bc0 of 8 bytes by task 348 on cpu 12:
> folio_batch_move_lru (./include/linux/mm.h:1814 ./include/linux/mm.h:1824 ./include/linux/memcontrol.h:1636 ./include/linux/memcontrol.h:1659 mm/swap.c:216)
> folio_batch_add_and_move (mm/swap.c:235)
> folio_add_lru (./arch/x86/include/asm/preempt.h:95 mm/swap.c:518)
> folio_add_lru_vma (mm/swap.c:538)
> do_anonymous_page (mm/memory.c:4146)

This is the part I don't understand. The path to calling
folio_add_lru_vma() comes directly from vma_alloc_zeroed_movable_folio():

	folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
	if (!folio)
		goto oom;
	if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
		goto oom_free_page;
	folio_throttle_swaprate(folio, GFP_KERNEL);
	__folio_mark_uptodate(folio);
	entry = mk_pte(&folio->page, vma->vm_page_prot);
	entry = pte_sw_mkyoung(entry);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
			&vmf->ptl);
	if (!vmf->pte)
		goto release;
	if (vmf_pte_changed(vmf)) {
		update_mmu_tlb(vma, vmf->address, vmf->pte);
		goto release;
	}
	ret = check_stable_address_space(vma->vm_mm);
	if (ret)
		goto release;
	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		folio_put(folio);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}
	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
	folio_add_new_anon_rmap(folio, vma, vmf->address);
	folio_add_lru_vma(folio, vma);

(Sorry, that's a lot of lines.) But _nowhere_ in there sets
PG_locked. It's a freshly allocated page; all page flags (that are
actually flags; ignore the stuff up at the top) should be clear. We
even check that with PAGE_FLAGS_CHECK_AT_PREP. Plus, it doesn't
make sense that we'd start I/O; the page is freshly allocated, full of
zeroes; there's no backing store to read the page from.

It really feels like this page was freed while it was still under I/O
and it's been reallocated to this victim process.

I'm going to try a few things and see if I can figure this out.

> __handle_mm_fault (mm/memory.c:3662 mm/memory.c:4939 mm/memory.c:5079)
> handle_mm_fault (mm/memory.c:5233)
> do_user_addr_fault (arch/x86/mm/fault.c:1392)
> exc_page_fault (./arch/x86/include/asm/paravirt.h:695 arch/x86/mm/fault.c:1494 arch/x86/mm/fault.c:1542)
> asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)
> copyout (./arch/x86/include/asm/uaccess_64.h:112 ./arch/x86/include/asm/uaccess_64.h:133 lib/iov_iter.c:168)
> _copy_to_iter (lib/iov_iter.c:316 (discriminator 5))
> copy_page_to_iter (lib/iov_iter.c:483 lib/iov_iter.c:468)
> filemap_read (mm/filemap.c:2712)
> blkdev_read_iter (block/fops.c:620)
> vfs_read (./include/linux/fs.h:1871 fs/read_write.c:389 fs/read_write.c:470)
> ksys_read (fs/read_write.c:613)
> __x64_sys_read (fs/read_write.c:621)