Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

From: Jan Kara
Date: Thu Nov 04 2021 - 06:03:23 EST


On Wed 03-11-21 18:36:06, Vivek Goyal wrote:
> On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > Hi Jan,
> > > > > >
> > > > > > Agreed. That's what Ioannis is trying to say: some of the remote events
> > > > > > can be lost if the fuse/guest local inode is unlinked. I think the
> > > > > > problem exists both for the shared and the non-shared directory case.
> > > > > >
> > > > > > With local filesystems we have control: we can first queue up the
> > > > > > event in the buffer before we remove the local watches. With events
> > > > > > travelling from a remote server, there is no such control or
> > > > > > synchronization. It can very well happen that events get delayed
> > > > > > somewhere in the communication path, the local watches go away, and
> > > > > > then there is no way to deliver those events to the application.
> > > > >
> > > > > So after thinking about this for some time I have the following question
> > > > > about the architecture of this solution: Why do you actually have local
> > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean,
> > > > > can we not have fsnotify marks only on the FUSE server and generate all
> > > > > events there? When e.g. a file is created from the client, the client
> > > > > tells the server about the creation, the server performs the creation,
> > > > > which generates the fsnotify event; that is received by the server and
> > > > > forwarded back to the client, which just queues it into the notification
> > > > > group's queue for userspace to read.
> > > > >
> > > > > Now with this architecture there's no problem with duplicate events for
> > > > > local & server notification marks. Similarly there's no problem with lost
> > > > > events after inode deletion, because events received by the client are
> > > > > queued directly into the notification queue without any checking whether
> > > > > the inode is still alive etc. Would this work or am I missing something?
> > > > >
> > > >
> > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > events?
> > > >
> > > > Do you propose to maintain separate event queues over the protocol?
> > > > Attach a "recipient list" to each event?
> > >
> > > Yes, that was my idea. Essentially, when we see group A create a mark on
> > > FUSE for path P, we notify the server, which creates notification group A
> > > on the server (if it does not already exist - there we need some
> > > notification group identifier unique among all clients) and places a mark
> > > for it on path P. Then the full stream of notification events generated
> > > for group A on the server is just forwarded to the client and inserted
> > > into A's notification queue. IMO this is a very simple solution to
> > > implement - you just need to forward mark addition / removal requests from
> > > the client to the server and forward the event stream from the server to
> > > the client. Everything else is handled by the fsnotify infrastructure on
> > > the server.
> > >
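
To make the wire part of that concrete, I imagine something along these
lines - the message names and layout are made up for illustration, they are
not existing FUSE opcodes:

#include <stdint.h>

/*
 * Completely made-up wire messages, just to illustrate the idea -- these
 * are not existing FUSE opcodes or structures.
 */

/* Client -> server: group <group_id> adds or removes a mark on <nodeid>. */
struct fuse_notify_mark_in {
	uint64_t	group_id;	/* unique among all clients */
	uint64_t	nodeid;		/* watched object */
	uint32_t	mask;		/* requested event mask */
	uint32_t	flags;		/* add / remove */
};

/* Server -> client: one event generated for <group_id> on the server,
 * queued verbatim into that group's notification queue on the client. */
struct fuse_notify_fsevent_out {
	uint64_t	group_id;
	uint64_t	nodeid;
	uint32_t	mask;		/* e.g. FS_CREATE, FS_DELETE_SELF */
	uint32_t	namelen;	/* name follows, if any */
};

The client side then does nothing but copy the payload of the second
message into the matching group's queue.
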
> > > > I just don't see how this can scale other than:
> > > > - Local marks and connectors manage the subscriptions on local machine
> > > > - Protocol updates the server with the combined masks for watched objects
> > >
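
Just to check that I understand the alternative you have in mind, on the
client it would boil down to something like this (the struct and the helper
are made up for illustration):

#include <stdint.h>

/*
 * Rough sketch of the "combined mask" variant: the client keeps its own
 * groups and connectors, and the server only ever sees the union of all
 * local marks on an object (much like fsnotify_recalc_mask() computes the
 * connector mask locally).
 */
struct local_mark {
	uint32_t mask;
	struct local_mark *next;
};

/* Hypothetical helper that sends the watch request over the protocol. */
extern void fuse_send_watch(uint64_t nodeid, uint32_t combined_mask);

static void update_remote_watch(uint64_t nodeid, const struct local_mark *marks)
{
	uint32_t combined = 0;

	for (; marks; marks = marks->next)
		combined |= marks->mask;

	/* One combined mask per object, however many groups watch it. */
	fuse_send_watch(nodeid, combined);
}
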
> > > I agree that, depending on the use case and the particular FUSE
> > > filesystem, the performance of this solution may be a concern. OTOH the
> > > only additional cost of this solution I can see (compared to all those
> > > processes just watching files locally) is passing the events from the
> > > server to the client. For local FUSE filesystems such as virtiofs this
> > > should be rather cheap since you have to do very little processing for
> > > each generated event. For filesystems such as sshfs, I can imagine this
> > > would be a bigger deal.
> > >
> > > Also, one problem I can see with my proposal is stuff such as leases -
> > > i.e., when the client does not notify the server of changes promptly but
> > > rather batches local operations and tells the server about them only on
> > > special occasions. I don't know enough about FUSE filesystems to tell
> > > whether this is a frequent problem or not.
> > >
> > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > an inode, the fuse client inode can be finally evicted.
> > > > I haven't tried to see how hard that would be to implement.
> > >
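
If I understand the limbo idea correctly, the client-side decision would be
roughly the following (only the FS_* values are the real fsnotify mask
bits, everything else is made up for illustration):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the "inode in limbo" idea, not actual fuse code: the unlinked
 * client inode stays pinned while a mark is attached, and is finally
 * evicted only once the server signals that no further events will arrive
 * for it.
 */
#define FS_DELETE_SELF	0x00000400
#define FS_IN_IGNORED	0x00008000

struct limbo_inode {
	uint64_t nodeid;
	bool	 unlinked;	/* local inode is S_DEAD */
	bool	 has_marks;	/* fsnotify marks still attached */
};

static bool can_finally_evict(const struct limbo_inode *i, uint32_t event_mask)
{
	if (!i->unlinked)
		return false;
	/* Either nobody watches it any more, or the server told us the
	 * watch is gone / the object was deleted on its side. */
	return !i->has_marks ||
	       (event_mask & (FS_DELETE_SELF | FS_IN_IGNORED));
}
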
> > > Sure, there can be other solutions to this particular problem. I just
> > > want to discuss the other architecture to see why we cannot do it in a
> > > simple way :).
> > >
> >
> > Fair enough.
> >
> > Beyond the scalability aspects, I think that a design that exposes the
> > group to the remote server and allows it to "inject" events into the group
> > queue will prevent users from getting useful features going forward.
> >
> > For example, a fanotify ignored_mask could be added to a group, even on
> > a mount mark, and it would just work even if the remote server only
> > supports inode marks.
> >
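
To spell out what I think you mean with the ignored_mask example - this is
ordinary fanotify usage from userspace, nothing FUSE-specific, and the
paths are made up:

#include <fcntl.h>
#include <sys/fanotify.h>

/*
 * The ignore mark is evaluated locally by the group when events are
 * queued, so it can work even if the remote protocol only ever learns
 * about plain inode watches.
 */
int setup_watch(void)
{
	int fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

	if (fd < 0)
		return -1;

	/* Watch modifications on the whole mount... */
	fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
		      FAN_MODIFY | FAN_CLOSE_WRITE, AT_FDCWD, "/mnt/fuse");

	/* ...but locally ignore one file we do not care about. */
	fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_IGNORED_MASK |
			  FAN_MARK_IGNORED_SURV_MODIFY,
		      FAN_MODIFY | FAN_CLOSE_WRITE, AT_FDCWD,
		      "/mnt/fuse/scratch.tmp");

	return fd;
}
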
> > Another point of view on the post-mortem events:
> > As Miklos once noted and as you wrote above, for cache coherency and
> > leases, an async notification queue is not adequate and synchronous
> > notifications are too costly, so there needs to be some shared memory
> > solution involving guest cache invalidation by the host.
>
> Any shared memory solution works only in a limited setup. If the server is
> remote on another machine, there is no sharing. I am hoping that this can
> be generic enough to support other remote filesystems down the line.

OK, so do I understand both you and Amir correctly that you think that
always relying on the FUSE server to generate the events and just piping
them to the client is not a long-term viable design for FUSE? Mostly
because caching of modifications on the client is essentially inevitable,
and hence events generated on the server would be unreliable (delayed too
much)?

Honza

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR