Re: [linus:master] [xfs] 2edf06a50f: fsmark.files_per_sec -5.7% regression

From: Feng Tang
Date: Mon May 15 2023 - 22:52:15 EST


On Tue, May 16, 2023 at 08:20:34AM +1000, Dave Chinner wrote:
[...]
> >
> > For commit 2edf06a50f5, it seems to change the semantics a little
> > about handling of 'flags' for xfs_alloc_fix_freelist(). With the debug
> > below, the performance is restored.
> >
> >
> > ecd788a92460eef4 2edf06a50f5bbe664283f3c55c4 68721405630744da1c07c9c1c3c
> > ---------------- --------------------------- ---------------------------
> >
> > 14349 -5.7% 13527 +0.6% 14437 fsmark.files_per_sec
> > 486.29 +5.8% 514.28 -0.5% 483.70 fsmark.time.elapsed_time
> >
> > Please help to review if the debug patch miss anything as I don't
> > know the internals of xfs, thanks.
>
> Well spotted. :)
>
> The relevant commit dropped the trylock flag, so the perf regression
> and change of allocator behaviour is due to it blocking on a busy AG
> instead of skipping to the next uncontended on and so all
> allocations came from extents in the last block of the free space
> btree in that AG.

Thanks for the confirming and analysis!

Late yesterday, I added trace printk in xfs_alloc_vextent_iterate_ags()
and it did show the flag has XFS_ALLOC_FLAG_TRYLOCK bit set.

fs_mark-28005 [016] ..... 14993.945487: xfs_alloc_vextent_iterate_ags: flags = 0x1
fs_mark-28004 [002] ..... 14993.945487: xfs_alloc_vextent_iterate_ags: flags = 0x1
fs_mark-27986 [006] ..... 14993.945497: xfs_alloc_vextent_iterate_ags: flags = 0x1

> >
> > ---
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 98defd19e09e..8c85cc68c5f4 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -3246,12 +3246,12 @@ xfs_alloc_vextent_set_fsbno(
> > */
> > static int
> > __xfs_alloc_vextent_this_ag(
> > - struct xfs_alloc_arg *args)
> > + struct xfs_alloc_arg *args, int flag)
> > {
> > struct xfs_mount *mp = args->mp;
> > int error;
> >
> > - error = xfs_alloc_fix_freelist(args, 0);
> > + error = xfs_alloc_fix_freelist(args, flag);
> > if (error) {
> > trace_xfs_alloc_vextent_nofix(args);
> > return error;
> > @@ -3289,7 +3289,7 @@ xfs_alloc_vextent_this_ag(
> > }
> >
> > args->pag = xfs_perag_get(mp, args->agno);
> > - error = __xfs_alloc_vextent_this_ag(args);
> > + error = __xfs_alloc_vextent_this_ag(args, 0);
> >
> > xfs_alloc_vextent_set_fsbno(args, minimum_agno);
> > xfs_perag_put(args->pag);
> > @@ -3329,7 +3329,7 @@ xfs_alloc_vextent_iterate_ags(
> > args->agno = start_agno;
> > for (;;) {
> > args->pag = xfs_perag_get(mp, args->agno);
> > - error = __xfs_alloc_vextent_this_ag(args);
> > + error = __xfs_alloc_vextent_this_ag(args, flags);
> > if (error) {
> > args->agbno = NULLAGBLOCK;
> > break;
>
> I don't think this is the right way to fix this.

I see. I understand this commit is in the middle of a series.

I should have stated clearly the debug hack patch was tried
mainly for debugging the regression.

> The code is -very-
> different at the end of the series that this is in the middle of,
> and I need to check what callers now use the trylock behaviour to
> determine how the trylock flag shold be passed around...

Sure. When the patch is fine, we can test it.


> > Also for the turbostat.Bzy_MHz diff, IIUC, 0Day always uses
> > 'performance' cpufreq governor. And as the test case is running
> > 32 thread in a platform with 96 CPUs, there are many CPUs in idle
> > state in average, and I suspect the Bzy_MHz may be calculated
> > considering those cpufreq and cpuidle factors.
>
> If "busy MHz" includes the speed of idle CPUs, then it's not really
> a measure of the speed of "busy" CPUs. If what you say is true, then
> it is, at best, badly names - it would just be the "average Mhz",
> right?


I found the turbostat.c in kernel tree tools/power/x86/turbostat/

if (DO_BIC(BIC_Bzy_MHz)) {
if (has_base_hz)
outp +=
sprintf(outp, "%s%.0f", (printed++ ? delim : ""), base_hz / units * t->aperf / t->mperf);
else
outp += sprintf(outp, "%s%.0f", (printed++ ? delim : ""),
tsc / units * t->aperf / t->mperf / interval_float);
}

Rui Zhang told me the 'aperf' is the actual cpu cycles of a CPU in a
period of time, and it only count when CPU is in C0 state, and will
stop counting when cpu is in idle power state. Like in one second
interval, if the CPU spends 500 ms running at 1000 MHz, and the other
500 ms in idle, then the Bzy_MHz will be shown 500 MHz.

Thanks,
Feng

> -Dave.
>
> --
> Dave Chinner
> david@xxxxxxxxxxxxx