Re: [PATCH v3 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Alexander Larsson
Date: Fri Feb 03 2023 - 06:33:14 EST


On Thu, 2023-02-02 at 15:37 +0800, Gao Xiang wrote:
>
>
> On 2023/2/2 15:17, Gao Xiang wrote:
> >
> >
> > On 2023/2/2 14:37, Amir Goldstein wrote:
> > > On Wed, Feb 1, 2023 at 1:22 PM Gao Xiang
> > > <hsiangkao@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On 2023/2/1 18:01, Gao Xiang wrote:
> > > > >
> > > > >
> > > > > On 2023/2/1 17:46, Alexander Larsson wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > >
> > > > > > >                                     | uncached(ms)|
> > > > > > > cached(ms)
> > > > > > > ----------------------------------|-------------|--------
> > > > > > > ---
> > > > > > > composefs (with digest)           | 326         | 135
> > > > > > > erofs (w/o -T0)                   | 264         | 172
> > > > > > > erofs (w/o -T0) + overlayfs       | 651         | 238
> > > > > > > squashfs (compressed)                | 538         | 211
> > > > > > > squashfs (compressed) + overlayfs | 968         | 302
> > > > > >
> > > > > >
> > > > > > Clearly erofs with sparse files is the best fs now for the
> > > > > > ro-fs +
> > > > > > overlay case. But still, we can see that the additional
> > > > > > cost of the
> > > > > > overlayfs layer is not negligible.
> > > > > >
> > > > > > According to amir this could be helped by a special
> > > > > > composefs-like mode
> > > > > > in overlayfs, but its unclear what performance that would
> > > > > > reach, and
> > > > > > we're then talking net new development that further
> > > > > > complicates the
> > > > > > overlayfs codebase. Its not clear to me which alternative
> > > > > > is easier to
> > > > > > develop/maintain.
> > > > > >
> > > > > > Also, the difference between cached and uncached here is
> > > > > > less than in
> > > > > > my tests. Probably because my test image was larger. With
> > > > > > the test
> > > > > > image I use, the results are:
> > > > > >
> > > > > >                                     | uncached(ms)|
> > > > > > cached(ms)
> > > > > > ----------------------------------|-------------|----------
> > > > > > -
> > > > > > composefs (with digest)           | 681         | 390
> > > > > > erofs (w/o -T0) + overlayfs       | 1788        | 532
> > > > > > squashfs (compressed) + overlayfs | 2547        | 443
> > > > > >
> > > > > >
> > > > > > I gotta say it is weird though that squashfs performed
> > > > > > better than
> > > > > > erofs in the cached case. May be worth looking into. The
> > > > > > test data I'm
> > > > > > using is available here:
> > > > >
> > > > > As another wild guess, cached performance is a just vfs-
> > > > > stuff.
> > > > >
> > > > > I think the performance difference may be due to ACL (since
> > > > > both
> > > > > composefs and squashfs don't support ACL).  I already asked
> > > > > Jingbo
> > > > > to get more "perf data" to analyze this but he's now busy in
> > > > > another
> > > > > stuff.
> > > > >
> > > > > Again, my overall point is quite simple as always, currently
> > > > > composefs is a read-only filesystem with massive symlink-like
> > > > > files.
> > > > > It behaves as a subset of all generic read-only filesystems
> > > > > just
> > > > > for this specific use cases.
> > > > >
> > > > > In facts there are many options to improve this (much like
> > > > > Amir
> > > > > said before):
> > > > >     1) improve overlayfs, and then it can be used with any
> > > > > local fs;
> > > > >
> > > > >     2) enhance erofs to support this (even without on-disk
> > > > > change);
> > > > >
> > > > >     3) introduce fs/composefs;
> > > > >
> > > > > In addition to option 1), option 2) has many benefits as
> > > > > well, since
> > > > > your manifest files can save real regular files in addition
> > > > > to composefs
> > > > > model.
> > > >
> > > > (add some words..)
> > > >
> > > > My first response at that time (on Slack) was "kindly request
> > > > Giuseppe to ask in the fsdevel mailing list if this new overlay
> > > > model
> > > > and use cases is feasible", if so, I'm much happy to integrate
> > > > in to
> > > > EROFS (in a cooperative way) in several ways:
> > > >
> > > >    - just use EROFS symlink layout and open such file in a
> > > > stacked way;
> > > >
> > > > or (now)
> > > >
> > > >    - just identify overlayfs "trusted.overlay.redirect" in
> > > > EROFS itself
> > > >      and open file so such image can be both used for EROFS
> > > > only and
> > > >      EROFS + overlayfs.
> > > >
> > > > If that happened, then I think the overlayfs "metacopy" option
> > > > can
> > > > also be shown by other fs community people later (since I'm not
> > > > an
> > > > overlay expert), but I'm not sure why they becomes impossible
> > > > finally
> > > > and even not mentioned at all.
> > > >
> > > > Or if you guys really don't want to use EROFS for whatever
> > > > reasons
> > > > (EROFS is completely open-source, used, contributed by many
> > > > vendors),
> > > > you could improve squashfs, ext4, or other exist local fses
> > > > with this
> > > > new use cases (since they don't need any on-disk change as
> > > > well, for
> > > > example, by using some xattr), I don't think it's really hard.
> > > >
> > >
> > > Engineering-wise, merging composefs features into EROFS
> > > would be the simplest option and FWIW, my personal preference.
> > >
> > > However, you need to be aware that this will bring into EROFS
> > > vfs considerations, such as  s_stack_depth nesting (which AFAICS
> > > is not see incremented composefs?). It's not the end of the
> > > world, but this
> > > is no longer plain fs over block game. There's a whole new class
> > > of bugs
> > > (that syzbot is very eager to explore) so you need to ask
> > > yourself whether
> > > this is a direction you want to lead EROFS towards.
> >
> > I'd like to make a separate Kconfig for this.  I consider this
> > just because
> > currently composefs is much similar to EROFS but it doesn't have
> > some ability
> > to keep real regular file (even some README, VERSION or Changelog
> > in these
> > images) in its (composefs-called) manifest files. Even its on-disk
> > super block
> > doesn't have a UUID now [1] and some boot sector for booting or
> > some potential
> > hybrid formats such as tar + EROFS, cpio + EROFS.
> >
> > I'm not sure if those potential new on-disk features is unneeded
> > even for
> > future composefs.  But if composefs laterly supports such on-disk
> > features,
> > that makes composefs closer to EROFS even more.  I don't see
> > disadvantage to
> > make these actual on-disk compatible (like ext2 and ext4).
> >
> > The only difference now is manifest file itself I/O interface --
> > bio vs file.
> > but EROFS can be distributed to raw block devices as well,
> > composefs can't.
> >
> > Also, I'd like to separate core-EROFS from advanced features (or
> > people who
> > are interested to work on this are always welcome) and composefs-
> > like model,
> > if people don't tend to use any EROFS advanced features, it could
> > be disabled
> > from compiling explicitly.
>
> Apart from that, I still fail to get some thoughts (apart from
> unprivileged
> mounts) how EROFS + overlayfs combination fails on automotive real
> workloads
> aside from "ls -lR" (readdir + stat).
>
> And eventually we still need overlayfs for most use cases to do
> writable
> stuffs, anyway, it needs some words to describe why such < 1s
> difference is
> very very important to the real workload as you already mentioned
> before.
>
> And with overlayfs lazy lookup, I think it can be close to ~100ms or
> better.
>

If we had an overlay.fs-verity xattr, then I think the overlayfs
approach would not lack any individual feature needed for the
automotive usecase I'm working on, nor for the OCI container usecase.
However, the fact that something is possible doesn't make it the better
technical solution.

The container usecase is very important in real world Linux use today,
and as such it makes sense to have a technically excellent solution for
it, not just a workable solution. Obviously we all have different
viewpoints of what that is, but these are the reasons why I think a
composefs solution is better:

* It is faster than all other approaches for the one thing it actually
needs to do (lookup and readdir performance). Other kinds of
performance (file i/o speed, etc) are up to the backing filesystem
anyway.

Even if there are possible approaches to make overlayfs perform better
here (the "lazy lookup" idea) it will not reach the performance of
composefs, while further complicating the overlayfs codebase. (btw, did
someone ask Miklos what he thinks of that idea?)

For the automotive usecase we have strict cold-boot time requirements
that make cold-cache performance very important to us. Of course, there
is no simple time requirement for the specific case of listing files
in an image, but any improvement in cold-cache performance for both the
ostree rootfs and the containers started during boot will be worth its
weight in gold when trying to reach these hard KPIs.
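
To put rough numbers on that gap, here is a small Python sketch that
just divides the timings from my larger test image (the second table
quoted above) by the composefs ones. The helper name slowdown_vs is
made up for illustration, and the numbers are single "ls -lR" style
runs, so treat the ratios as indicative only.

```python
# Timings in ms, copied from the second table quoted above:
# (uncached, cached)
timings = {
    "composefs (with digest)": (681, 390),
    "erofs (w/o -T0) + overlayfs": (1788, 532),
    "squashfs (compressed) + overlayfs": (2547, 443),
}


def slowdown_vs(baseline, table):
    """Return {name: (uncached_ratio, cached_ratio)} relative to baseline."""
    base_cold, base_warm = table[baseline]
    return {
        name: (cold / base_cold, warm / base_warm)
        for name, (cold, warm) in table.items()
        if name != baseline
    }


ratios = slowdown_vs("composefs (with digest)", timings)
for name, (cold, warm) in ratios.items():
    print(f"{name}: {cold:.2f}x uncached, {warm:.2f}x cached")
```

The uncached ratios are the ones that matter for the cold-boot case.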

* It uses less memory, as we don't need the extra inodes that come
with the overlayfs mount. (See profiling data in Giuseppe's mail[1]).

The use of loopback vs directly reading the image file from page cache
also has effects on memory use. Normally we have both the loopback
file in page cache, plus the block cache for the loopback device. We
could use loopback with O_DIRECT, but then we don't use the page cache
for the image file, which I think could have performance implications.

* The userspace API complexity of the combined overlayfs approach is
much greater than for composefs, with more moving pieces. For
composefs, all you need is a single mount syscall for set up. For the
overlay approach you would need to first create a loopback device, then
create a dm-verity device-mapper device from it, then mount the
readonly fs, then mount the overlayfs. All this complexity has a cost
in terms of setup/teardown performance, userspace complexity and
overall memory use.
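
As a rough illustration of the difference in moving pieces, here is a
minimal Python sketch that only builds (and never runs) the command
sequences for the two setups. All paths, the device-mapper name, the
loop device names and ROOT_HASH are hypothetical placeholders. The
composefs basedir= option is written as I understand the current patch
series, and the overlayfs options are just one plausible configuration,
not a tested recipe.

```python
# Sketch only: builds argv lists, nothing is executed.
# Every path, device name and ROOT_HASH below is a hypothetical placeholder.
IMAGE = "/var/lib/images/rootfs.img"
OBJECTS = "/var/lib/images/objects"
LOWER = "/run/images/lower"
TARGET = "/mnt/rootfs"
ROOT_HASH = "<dm-verity root hash>"


def composefs_setup():
    """A single mount call; basedir= is assumed from the patch series."""
    return [
        ["mount", "-t", "composefs", IMAGE,
         "-o", f"basedir={OBJECTS}", TARGET],
    ]


def overlay_setup(loop_dev="/dev/loopN", hash_dev="/dev/loopM"):
    """Loopback, dm-verity, read-only fs mount, then the overlayfs mount."""
    return [
        # 1. Attach the image to a loop device.
        ["losetup", "--find", "--show", "--read-only", IMAGE],
        # 2. Create the dm-verity mapping on top of it.
        ["veritysetup", "open", loop_dev, "rootfs-verity",
         hash_dev, ROOT_HASH],
        # 3. Mount the read-only metadata filesystem.
        ["mount", "-t", "erofs", "-o", "ro",
         "/dev/mapper/rootfs-verity", LOWER],
        # 4. Stack overlayfs over the metadata layer and the content store.
        ["mount", "-t", "overlay", "overlay", "-o",
         f"lowerdir={LOWER}:{OBJECTS},metacopy=on,redirect_dir=follow",
         TARGET],
    ]


for label, steps in (("composefs", composefs_setup()),
                     ("erofs + overlayfs", overlay_setup())):
    print(f"{label}: {len(steps)} step(s)")
    for argv in steps:
        print("   ", " ".join(argv))
```

Each extra step also has a matching teardown step, which the sketch
leaves out.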

Are any of these a hard blocker for the feature? Not really, but I
would find it sad to use an (imho) worse solution.



The other mentioned approach is to extend EROFS with composefs
features.  For this to be interesting to me it would have to include: 

* Direct reading of the image from page cache (not via loopback)
* Ability to verify fs-verity digest of that image file
* Support for stacked content files in a set of specified basedirs 
(not using fscache).
* Verification of expected fs-verity digest for these basedir files

Anything less than this and I think the overlayfs+erofs approach is a
better choice.

However, this is essentially just proposing we re-implement all the
composefs code with a different name. And then we get a filesystem
supporting *both* stacking and traditional block device use, which
seems a bit weird to me. It will certainly make the erofs code more
complex having to support all these combinations. Also, given the harsh
arguments and accusations towards me on the list I don't feel very
optimistic about how well such a cooperation would work.

(A note about Kconfig options: I'm totally uninterested in using a
custom build of erofs. We always use a standard distro kernel that has
to support all possible uses of erofs, so we can't ship a neutered
version of it.)


[1] https://lore.kernel.org/lkml/87wn5ac2z6.fsf@xxxxxxxxxx/

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Alexander Larsson, Red Hat, Inc
alexl@xxxxxxxxxx alexander.larsson@xxxxxxxxx
He's a world-famous day-dreaming cop on his last day in the job. She's a
plucky streetsmart wrestler descended from a line of powerful witches.
They fight crime!