Re: [RFC 0/4] convert create_page_buffers to create_folio_buffers

From: Matthew Wilcox
Date: Sat Apr 15 2023 - 13:10:08 EST


On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> On 4/15/23 05:44, Matthew Wilcox wrote:
> > I do wonder how much it's worth doing this vs switching to non-BH methods.
> > I appreciate that's a lot of work still.
>
> That's what I've been wondering, too.
>
> I would _vastly_ prefer to switch over to iomap; however, the blasted
> sb_bread() is getting in the way. Currently iomap only runs on entire
> pages / folios, but a lot of (older) filesystems insist on doing 512

Hang on, no, iomap can issue sub-page reads. eg iomap_read_folio_sync()
will read the parts of the folio which have not yet been read when
called from __iomap_write_begin().

> byte I/O. While this seem logical (seeing that 512 bytes is the
> default, and, in most cases, the only supported sector size) question
> is whether _we_ from the linux side need to do that.
> We _could_ upgrade to always do full page I/O; there's a good
> chance we'll be using the entire page anyway eventually.
> And with storage bandwidth getting larger and larger we might even
> get a performance boost there.

I think we need to look at this from the filesystem side. What do
filesystems actually want to do? The first thing is they want to read
the superblock. That's either going to be immediately freed ("Oh,
this isn't a JFS filesystem after all") or it's going to hang around
indefinitely. There's no particular need to keep it in any kind of
cache (buffer or page). Except ... we want to probe a dozen different
filesystems, and half of them keep their superblock at the same offset
from the start of the block device. So we do want to keep it cached.
That's arguing for using the page cache, at least to read it.

Now, do we want userspace to be able to dd a new superblock into place
and have the mounted filesystem see it? I suspect that confuses just
about every filesystem out there. So I think the right answer is to read
the page into the bdev's page cache and then copy it into a kmalloc'ed
buffer which the filesystem is then responsible for freeing. It's also
responsible for writing it back (so that's another API we need), and for
a journalled filesystem, it needs to fit into the journalling scheme.
Also, we may need to write back multiple copies of the superblock,
possibly with slight modifications.

There are a lot of considerations here, and I don't feel like I have
enough of an appreciation of filesystem needs to come up with a decent
API. I'd hope we can get a good discussion going at LSFMM.