Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

From: Ian Kent
Date: Thu Dec 10 2020 - 21:03:55 EST


On Thu, 2020-12-10 at 16:44 +0000, Fox Chen wrote:
> Hi,
>
> I found this series of patches solves exact the problem I am trying
> to solve.
> https://lore.kernel.org/lkml/20201202145837.48040-1-foxhlchen@xxxxxxxxx/

Right.

>
> The problem is reported by Brice Goglin on thread:
> Re: [PATCH 1/4] drivers core: Introduce CPU type sysfs interface
> https://lore.kernel.org/lkml/X60dvJoT4fURcnsF@xxxxxxxxx/
>
> I independently comfirmed that on a 96-core AWS c5.metal server.
> Do open+read+write on /sys/devices/system/cpu/cpu15/topology/core_id
> 1000 times.
> With a single thread it takes ~2.5 us for each open+read+close.
> With one thread per core, 96 threads running simultaneously takes 540
> us
> for each of the same operation (without much variation) -- 200x
> slower than the
> single thread one.

Right, interesting that the it's actually a problem on such
small system configurations.

I didn't think it would be evident on hardware that doesn't
have a much larger configuration.

>
> My Benchmark code is here:
> https://github.com/foxhlchen/sysfs_benchmark
>
> The problem can only be observed in large machines (>=16 cores).
> The more cores you have the slower it can be.
>
> Perf shows that CPUs spend most of the time (>80%) waiting on mutex
> locks in
> kernfs_iop_permission and kernfs_dop_revalidate.
>
> After applying this, performance gets huge boost -- with the fastest
> one at ~30 us
> to the worst at ~180 us (most of on spin_locks, the delay just
> stacking up, very
> similar to the performance on ext4).

That's the problem isn't it.

Unfortunately we don't get large improvements for nothing so I
was constantly thinking, what have I done here that isn't ok ...
and I don't have an answer for that.

The series needs review from others for that but we didn't get
that far.

>
> I hope this problem can justifies this series of patches. A big mutex
> in kernfs
> is really not nice. Due to this BIG LOCK, concurrency in kernfs is
> almost NONE,
> even though you do operations on different files, they are
> contentious.

Well, as much as I don't like to admit it, Greg (and Tejun) do have
a point about the number of sysfs files used when there is a very
large amount of RAM. But IIUC the suggestion of altering the sysfs
representation for this devices memory would introduce all sorts
of problems, it then being different form all device memory
representations (systemd udev coldplug for example).

But I think your saying there are also visible improvements elsewhere
too, which is to be expected of course.

Let's not forget that, as the maintainer, Greg has every right to
be reluctant to take changes because he is likely to end up owning
and maintaining the changes. That can lead to considerable overhead
and frustration if the change isn't quite right and it's very hard
to be sure there aren't hidden problems with it.

Fact is that, due to Greg's rejection, there was much more focus
by the reporter to fix it at the source but I have no control over
that, I only know that it helped to get things moving.

Given the above, I was considering posting the series again and
asking for the series to be re-considered but since I annoyed
Greg so much the first time around I've been reluctant to do so.

But now is a good time I guess, Greg, please, would you re-consider
possibly accepting these patches?

I would really like some actual review of what I have done from
people like yourself and Al. Ha, after that they might well not
be ok anyway!

>
> As we get more and more cores on normal machines and because sysfs
> provides such
> important information, this problem should be fix. So please
> reconsider accepting
> the patches.
>
> For the patches, there is a mutex_lock in kn->attr_mutex, as Tejun
> mentioned here
> (https://lore.kernel.org/lkml/X8fe0cmu+aq1gi7O@xxxxxxxxxxxxxxx/),
> maybe a global
> rwsem for kn->iattr will be better??

I wasn't sure about that, IIRC a spin lock could be used around the
initial check and checked again at the end which would probably have
been much faster but much less conservative and a bit more ugly so
I just went the conservative path since there was so much change
already.

Ian