Re: [PATCH 04/10] iov_iter: Add mapping and discard iterator types

From: David Howells
Date: Mon Sep 17 2018 - 16:59:01 EST


Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> > Add two new iterator types to iov_iter:
> >
> > (1) ITER_MAPPING
> >
> > This walks through a set of pages attached to an address_space that
> > are pinned or locked, starting at a given page and offset and walking
> > for the specified amount of space. A facility to get a callback each
> > time a page is entirely processed is provided.
> >
> > This is useful for copying data from socket buffers to inodes in
> > network filesystems.
>
> Interesting... Questions:
> * what will hold those pages? IOW, where will you unlock/drop/whatnot
> those sucker?

The caller needs to have those pages pinned - say with PG_locked or
PG_writeback. Sorry - I mentioned this in the cover, but not here.

You can either undo there changes in the callback or upon completion of the
iteration.

> * "callback" sounds dangerous - it appears to imply that you won't
> copy to/from the same page twice. Not true for a lot of iov_iter users; what
> happens if you pass such a beast to them?

Similar to ITER_PIPE. There's no rewind. Once you've passed a page, it's
gone. Under what circumstances would you want to copy to/from the same page
twice?

> * why not simply "build and populate ITER_BVEC aliasing a piece of
> mapping", possibly in "grab" and "grab+lock" variants?

ITER_BVEC is inefficient. This is what the upstream now. See
afs_load_bvec().

That what the code currently uses. There's a
practical limit to the number of pages I can shovel into one in one go.

Further, every time I reach the end of a ITER_BVEC, I have to return to
process context, which then has to round up the next bundle of pages by
calling the radix tree. It seems to work out better to put the radix
iteration into the iterator if we can. The caller guarantees that the
contents of the region of interest are (a) fully populated and (b) pinned.

Yet further, with ITER_BVEC, I can't release any of the pinned pages until the
entire iteration is finished. That means if I have a 4GB BVEC, those pages
are going to be pinned a long time. With ITER_MAPPING, they're released
incrementally via the callback.

> Those ITER_MAPPING do seem to be related to ITER_BVEC, at the very least.

Only in the sense that the current position can be described by the same three
numbers: page, len, offset. I'm reusing struct bio_vec so that I can share
some of the code with ITER_BVEC.

> Note, BTW, that iov_iter_get_pages...() might mutate into something similar
> - "build and populate ITER_BVEC aliasing a piece of given iov_iter". Or,
> perhaps, a nicer-on-memory analogue of ITER_BVEC - with <offset, bytes,
> pointer to pages array> instead of <offset, bytes, page> as elements, with
> the same "populate from mapping" to get something similar to your
> functionality and "populate from iov_iter" for
> iov_iter_get_pages... replacement

The whole point is to avoid having to use ITER_BVEC. ITER_BVEC has a number
of issues that ITER_MAPPING overcomes - though ITER_MAPPING can only be used
with a mapping (or, at least, a radix tree).

There is no point to the loop shifting page runs into a page array for use
with a BVEC when the mapping carries the same information. You save memory
and processing time.

David