Re: [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

From: John Hubbard
Date: Wed Feb 23 2022 - 19:44:26 EST


On 2/23/22 3:31 PM, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
>>
>> Just a small thing I'll say once, to get it out of my system. No action
>> required here, I just want it understood:
>>
>> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
>> pin_user_pages_remote()"), you would have written that like this:
>>
>> "process_vm_writev() is dirtying pages without properly warning the file
>> system in advance..."
>>
>> Because, there were many callers that were doing this:
>>
>> get_user_pages*()
>> ...use the pages...
>>
>> for_each_page() {
>> set_page_dirty*()
>> put_page()
>> }
>
> Sure, but that's not sufficient when modifying file-backed pages.

We are in complete agreement. I was just trying to hint (too subtly)
that the problem is ages old, and after doing all this work on the
prerequisites to solving it (pin_user_pages() is a prerequisite), it kind
of bothers me to see a commit with "work around bugs in mm/gup.c" in the
title. That's all. :)


> Previously, there was only two ways of modifying file-backed pages in
> the page cache --- either using the write(2) system call, or when a
> mmaped page is modified by the userspace.
>
> In the case of write(2), the file system gets notified before the page
> cache is modified by a call to the address operation's write_begin()
> call, and after the page cache is modified, the address operation's
> write_end() call lets the file system know that the modification is
> done. After the write is done, the 30 second writeback timer is
> triggered, and in the case of ext4's data=journalling mode, we close
> the ext4 micro-transation (and therefore the time between write_begin
> and write_end calls needs to be minimal); otherwise this can block
> ext4 transactions.
>
> In the case of a user page fault, the address operation's
> page_mkwrite() is called, and at that point we will allocate any
> blocks needed to back memory if necessary (in the case of delayed
> allocation, file system space has to get reserved). The problem here
> for remote access is that page_mkwrite() can block, and it can
> potentially fail (e.g., with ENOSPC or ENOMEM). This is also why we
> can't just add the page buffers and do the file system allocation in
> set_page_dirty(), since set_page_dirty() isn't allowed to block.

Oh yes. Believe me, I am well-versed in that story! But it's always
nice to hear it again, especially from a file system maintainer.
Each time there is something new, such as the micro-transaction detail.

>
> One approach might be to make all of the pages writeable when
> pin_user_pages_remote() is called. That that would mean that in the
> case of remote access via process_vm_writev or RDMA, all of the blocks
> will be allocated early. But that's probably better since at that
> point the userspace code is in a position to receive the error when
> setting up the RDMA memory, and we don't want to be asking the file
> system to do block allocation when incoming data is coming in from
> Infiniband or iWARP.

So it sounds like the file lease idea, yes? I'm hoping that that
is still viable. I still think it's a good LSF/MM topic, to work through.

>
>> I see that ext4_warning_inode() has rate limiting, but it doesn't look
>> like it's really intended for a per-page rate. It looks like it is
>> per-superblock (yes?), so I suspect one instance of this problem, with
>> many pages involved, could hit the limit.
>>
>> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
>> either, but it might be better here. OTOH, I have minimal experience
>> with ext4_warning_inode() and maybe it really is just fine with per-page
>> failure rates.
>
> With the syzbot reproducer, we're not actually triggering the rate
> limiter, since the ext4 warning is only getting hit a bit more than
> once every 4 seconds. And I'm guessing that in the real world, people
> aren't actually trying to do remote direct access to file-backed
> memory, at least not using ext4, since that's an invitation to a
> kernel crash, and we would have gotten user complaints. If some user

Actually...I can confirm that real customers really are doing *exactly*
that! Despite the kernel crashes--because the crashes don't always
happen unless you have a large (supercomputer-sized) installation. And
even then it is not always root-caused properly.

I guess that goes in the "weird but true" category. :)


thanks,
--
John Hubbard
NVIDIA

> actually tries to use process_vm_writev for realsies, as opposed to a
> random fuzzer or from a malicious program , we do want to warn them
> about the potential data loss, so I'd prefer to warn once for each
> inode --- but I'm not convinced that it's worth the effort.
>
> Cheers,
>
> - Ted