Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

From: Hugh Dickins
Date: Mon Oct 04 2021 - 22:26:47 EST


On Mon, 4 Oct 2021, Matthew Wilcox wrote:
> On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@xxxxxxxxxxxxxx/
> > > > > or
> > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@xxxxxxxxxxxxxx/
> > > > >
> > > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > > could well be nastier); but for the bug in this thread, I expect
> > > >
> > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > > me AFAIU.
> > >
> > > I don't see why we should have an S_ISREG() check. I agree it's not the
> > > intended usecase, but it ought to work fine. Unless there's something
> > > I'm missing?
> >
> > Check out this bug report:
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@xxxxxxxxxxxxxx/
> > and the patch from me:
> > https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@xxxxxxxxx/
> >
> > I don't think we handle buffers correctly for file THP, right? My
> > patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> > me. Why do we have THP collapsed for unintended usecase in the first
> > place?
>
> OK, I've done some more digging. I think what's going on with this
> report is userspace opens the block device RO, causes the page cache to
> be loaded with data, then khugepaged comes in and creates THPs.

Yes.

> What confuses me is that these THPs have private data attached to them.
> I don't know how that happens. If it's block device specific, then
> yes, something like your S_ISREG() idea should work fine. Otherwise,
> we might need to track down another problem.

Agreed, the file THP is created without PagePrivate, so the puzzle was
why the read-only cached page would later become page_has_private().

The C repro showed that it uses a BLKRRPART ioctl (0x125f), plus a
BTRFS_IOC_ADD_DEV ioctl which might not be relevant. I didn't follow
BLKRRPART all the way down, but I imagine it has to attach buffer-heads
to re-read the partition table, which would explain it.

Aside from that particular ioctl, it seems a good idea to insist on
S_ISREG just to shrink the attack surface: as Yang Shi says, executable
THP on block device was never an intended usecase, and not a usecase
anyone is likely to miss! And that fuzzer appears to delight in
tormenting /dev/nullb0, so let's just seal off that avenue.

You're right to have some doubt, as to whether there might be other
ways for buffer-heads to get attached, even on a read-only regular
file; but no way has sprung to my mind, and READ_ONLY_THP_FOR_FS has
survived well in its intended usage: so I think we should proceed on
the assumption that no further bugs remain - then fix them when found.

I wasn't able to reproduce the problem with the repro; I would need to
waste many hours to do so. But here's the untested S_ISREG patch I
came up with. Sorry, I've mixed something else in: in moving the
alignment part to clarify the conditions, I was alarmed to see that
shmem with !shmem_huge_enabled was falling through to THP_FOR_FS to
give unexpected huge pages: fixed that, though later found there's
a separate shmem_huge_enabled() check which should exclude it.

--- 5.15-rc4/mm/khugepaged.c 2021-09-12 17:39:21.943438422 -0700
+++ linux/khugepaged.c 2021-10-03 20:41:13.194822795 -0700
@@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm
 	if (!transhuge_vma_enabled(vma, vm_flags))
 		return false;
 
+	if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+				vma->vm_pgoff, HPAGE_PMD_NR))
+		return false;
+
 	/* Enabled via shmem mount options or sysfs settings. */
-	if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
-	}
+	if (shmem_file(vma->vm_file))
+		return shmem_huge_enabled(vma);
 
 	/* THP settings require madvise. */
 	if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
 		return false;
 
 	/* Read-only file mappings need to be aligned for THP to work. */
-	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
-	    !inode_is_open_for_write(vma->vm_file->f_inode) &&
-	    (vm_flags & VM_EXEC)) {
-		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR);
+	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+	    (vm_flags & VM_EXEC) && vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_inode;
+
+		return !inode_is_open_for_write(inode) &&
+			S_ISREG(inode->i_mode);
 	}
 
 	if (!vma->anon_vma || vma->vm_ops)