Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: NeilBrown
Date: Sun May 17 2015 - 19:16:49 EST


On Sun, 17 May 2015 09:43:34 -0700 Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Sun, May 17, 2015 at 3:55 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > And that is complete crap. Multi-component lookups do make sense; once
> > we are at the edge of the area present in dcache, we _know_ there won't
> > be any existing mountpoints involved; parsing the components and feeding
> > them to fs at once, along with an array of dentries to fill makes perfect
> > sense. Why bother with a bunch of roundtrips when we can have one?
>
> Yes, the edges are easier. And yes, it's fine to do components one by one.
>
> Maybe I misunderstood, but I thought that was exactly what Neil
> *didn't* want to do, though. It sounded like he wanted to do
> path-based lookup, not component-based one.

Just to be crystal clear about what I want:
I want the filesystem to be in control.

Any examples, whether about multi-component lookup or path-based lookup or
O_EXCL opens are just throw-away examples. I have no desire to implement or
re-implement anything like that. I just want the filesystem to have control.
The reason I want that is that it will (ultimately) make the code easier to
understand and so easier to verify. And it will make implementing unusual
filesystems easier.

The dcache is just a cache. It is a great cache, but it isn't the filesystem.

So filesystems should be able to put things in the cache. And the VFS should
be able to look up things in the cache. And if the VFS finds everything it
needs to follow a full path all the way to the inode at the end, that is
great. But as soon as it hits something that the cache doesn't have an
answer for, it asks the filesystem.
As a useful simple case it can ask via d_revalidate in RCU mode, in which
case the filesystem either says (based on its own caching rules) "Yeah, this
one's OK really" and the VFS just keeps going, or the filesystem says "Nope,
I need more time with this one" and we drop out of RCU and go to the more
general case.
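
To make that fast path concrete, here is a toy userspace sketch. It is not
kernel code: toy_dentry, toy_revalidate and toy_walk are made-up names, and
the "fresh" flag stands in for whatever caching rules a filesystem has. The
only thing borrowed from the kernel is the convention that a d_revalidate
which cannot decide without blocking in RCU mode returns -ECHILD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; not the kernel's structures. */
enum walk_mode { WALK_RCU, WALK_REF };

struct toy_dentry {
    const char *name;
    int fresh;  /* 1 if the filesystem's own caching rules accept the entry */
};

/*
 * Hypothetical revalidate hook.
 * Returns 1 for "this one's OK", or -ECHILD for "I need more time"
 * when asked in RCU mode about an entry it cannot vouch for.
 */
static int toy_revalidate(struct toy_dentry *d, enum walk_mode mode)
{
    if (d->fresh)
        return 1;
    if (mode == WALK_RCU)
        return -ECHILD;
    /* Ref mode: the filesystem may block, so pretend it refreshed the entry. */
    d->fresh = 1;
    return 1;
}

/*
 * Walk n cached components, starting in RCU mode.
 * Returns the index at which the walk had to leave RCU mode,
 * or n if the whole path was handled without leaving it.
 */
static size_t toy_walk(struct toy_dentry *path, size_t n)
{
    enum walk_mode mode = WALK_RCU;
    size_t dropped_at = n;

    for (size_t i = 0; i < n; i++) {
        int ret = toy_revalidate(&path[i], mode);

        if (ret == -ECHILD) {
            /* Drop out of RCU and ask again in the general case. */
            mode = WALK_REF;
            dropped_at = i;
            ret = toy_revalidate(&path[i], mode);
        }
        assert(ret == 1);
    }
    return dropped_at;
}
```

With three cached components of which only the middle one is stale,
toy_walk() reports that it left RCU mode at index 1 and finished the rest of
the walk in the slower mode.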

In that general case it just hands everything to the filesystem.
The filesystem then uses generic helpers (or not) to find the answers and adds
more current information to the cache.
It could potentially just return and let the VFS continue down the cache (now
with current data), but it probably makes more sense for the filesystem to
explicitly return what it has.

So for Al's example of revalidating multiple components at once, once the VFS
gets to a point in the path where d_revalidate says "I need more time",
the VFS just passes the rest of the path to the filesystem.
The filesystem can then see what is in the cache and revalidate multiple
dentries in parallel. Or it could just send the rest of the path to the
server requesting attributes for each directory in the path, and then can pop
all of that into the dcache/icache and let the lookup complete.
Or it can just do one component at a time.
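
A throw-away sketch of that hand-off, again as a userspace toy rather than
anything resembling real VFS code. The cache is modelled as a flat list of
path prefixes, and toy_cache, toy_walk, toy_fill_all and toy_fill_one are all
hypothetical names. It assumes relative paths with no empty components. The
point is only that the walker does not care whether the filesystem answers
for the whole remainder at once or for one component per call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 16
#define TOY_MAX_PATH 64

/* Toy cache: each entry is a path prefix such as "a" or "a/b". */
struct toy_cache {
    char paths[TOY_MAX_ENTRIES][TOY_MAX_PATH];
    size_t count;
};

/* Hypothetical filesystem hook: fill the cache for path starting at offset from. */
typedef void (*toy_fill_fn)(struct toy_cache *c, const char *path, size_t from);

static int toy_cache_has(const struct toy_cache *c, const char *path, size_t len)
{
    for (size_t i = 0; i < c->count; i++)
        if (strlen(c->paths[i]) == len && memcmp(c->paths[i], path, len) == 0)
            return 1;
    return 0;
}

static void toy_cache_add(struct toy_cache *c, const char *path, size_t len)
{
    assert(c->count < TOY_MAX_ENTRIES && len < TOY_MAX_PATH);
    memcpy(c->paths[c->count], path, len);
    c->paths[c->count][len] = '\0';
    c->count++;
}

/* Offset of the end of the component that starts at offset from. */
static size_t toy_next_end(const char *path, size_t from)
{
    const char *slash = strchr(path + from, '/');
    return slash ? (size_t)(slash - path) : strlen(path);
}

/* Filesystem that answers for the whole rest of the path in one call. */
static void toy_fill_all(struct toy_cache *c, const char *path, size_t from)
{
    size_t total = strlen(path);

    while (from < total) {
        size_t end = toy_next_end(path, from);

        if (!toy_cache_has(c, path, end))
            toy_cache_add(c, path, end);
        from = end + 1;
    }
}

/* Filesystem that only answers for the next component. */
static void toy_fill_one(struct toy_cache *c, const char *path, size_t from)
{
    toy_cache_add(c, path, toy_next_end(path, from));
}

/*
 * Walk path through the cache; on a miss, hand the rest of the path to
 * the filesystem and carry on. Returns how many times it had to ask.
 */
static int toy_walk(struct toy_cache *c, const char *path, toy_fill_fn fill)
{
    size_t total = strlen(path), from = 0;
    int calls = 0;

    while (from < total) {
        size_t end = toy_next_end(path, from);

        if (!toy_cache_has(c, path, end)) {
            fill(c, path, from);
            calls++;
            assert(toy_cache_has(c, path, end));
        }
        from = end + 1;
    }
    return calls;
}
```

Starting with only "a" cached, walking "a/b/c/d" asks toy_fill_all once and
toy_fill_one three times; either way the walk completes from the cache.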


>
> But yes, if it's purely about preloading the cache, then *that* should
> be reasonably easy. In fact, it should work as-is today, if we just
> added a "const char *hint" to the lookup callback which told the
> filesystem what will come after this lookup. But it would be a hint
> for pre-loading the dcache, nothing more.

"hint" being a synonym for "layering violation" ??


NeilBrown

>
> So if we have a pathname like "a/b/c" that we don't have in the
> dcache, and we're going to look up component "a", we could give "b/c"
> as the hint, and a filesystem that currently populates the dcache with
> "a" by doing
>
> d_instantiate(dentry, inode);
>
> could decide that *before* it does that "d_instantiate()", it could
> pre-populate the child list of 'dentry' with the lookup information
> for 'b' (and possibly recursively for 'c' too under 'b').
>
> But you'd still have to do the components one by one, you couldn't
> just do the "final" tip.
>
> And no, I absolutely refuse to even entertain the thought of the
> filesystem actually doing any of the do_last crap. It would be purely
> about pre-populating the dcache deeper than the one single component,
> and then the VFS layer would just find the pre-populated dentries and
> do the normal thing.
>
> Doing things that way means that not only does do_last() at the vfs
> level already do the right thing, but we get all the per-component
> semantics (with security checks etc) right, because we'd still be
> traversing the pathname one component at a time. It's just the
> filesystem that could prime the cache.
>
> If *that* was what Neil wanted to do (rather than do "a/b/c" as one
> single lookup to the server), then I withdraw all my complaints and am
> sorry for having misunderstood.
>
> Linus
