Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression

From: Dave Chinner
Date: Fri Sep 04 2015 - 18:03:33 EST


On Fri, Sep 04, 2015 at 01:32:33PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> > You probably don't even need a VM to reproduce it - that would
> > certainly be an interesting counterpoint if it didn't....
>
> Even though you managed to restore your DEBUG_SPINLOCK performance by
> changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
> actual hardware just to test.
>
> [ Note: In any case, I would recommend you use (or at least try)
> PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking for
> performance, the test-and-set fallback really wasn't meant as a
> performance option (although it clearly sucks worse than expected).

I will try it, but that can happen when I've got a bit of spare
time...

> Pre qspinlock, your setup would have used regular ticket locks on
> vCPUs, which mostly works as long as there is almost no vCPU
> preemption, if you overload your machine such that the vCPU threads
> get preempted that will implode into silly-land. ]

I don't tend to overload the host CPUs - all my test loads are IO
bound - so this has never really been a problem I've noticed
in the past.

> So on to native performance:
>
> - IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
> - 1.1T of md-stripe (5x200GB) SSDs
> - Linux v4.2 (distro style .config)
> - Debian "testing" base system
> - xfsprogs v3.2.1
>
>
> # mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0

If you use xfsprogs v3.2.4 (current Debian unstable), these are the
default options.

> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md0 isize=512 agcount=32, agsize=9157504 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1
> data = bsize=4096 blocks=293038720, imaxpct=5
> = sunit=128 swidth=640 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> log =internal log bsize=4096 blocks=143088, version=2
> = sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> # mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
>
> # ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
> -d /mnt/scratch/0 -d /mnt/scratch/1 \
> -d /mnt/scratch/2 -d /mnt/scratch/3 \
> -d /mnt/scratch/4 -d /mnt/scratch/5 \
> -d /mnt/scratch/6 -d /mnt/scratch/7 \
> -d /mnt/scratch/8 -d /mnt/scratch/9 \
> -d /mnt/scratch/10 -d /mnt/scratch/11 \
> -d /mnt/scratch/12 -d /mnt/scratch/13 \
> -d /mnt/scratch/14 -d /mnt/scratch/15
>
>
> Regular v4.2 (qspinlock) does:
>
> 0 6400000 0 286491.9 3500179
> 0 7200000 0 293229.5 3963140
> 0 8000000 0 271182.4 3708212
> 0 8800000 0 300592.0 3595722
>
> Modified v4.2 (ticket) does:
>
> 0 6400000 0 310419.6 3343821
> 0 7200000 0 348346.5 4721133
> 0 8000000 0 328098.2 3235753
> 0 8800000 0 316765.3 3238971
>
>
> Which shows that qspinlock is clearly slower, even for these large-ish
> NUMA boxes where it was supposed to be better.

Be careful just reading the throughput numbers like that. You can
have the files/s number go down, but the benchmark wall time get
faster because the userspace portion runs faster (i.e. CPU cache
residency effects). In this case, however, the userspace time
is down by 5-10% and the files/s rate is up by 5-10%, so (without
knowing the wall time) I'd say these numbers are
significant....

FWIW, you've got a lot more CPUs than I have - you can scale up the
parallelism of the workload by increasing the number of working
directories (i.e. -d <dir> options). You'd also need to scale up the
amount of allocation concurrency in XFS - 32 AGs will be the
limiting factor for any more workload concurrency. i.e. use "-d
agcount=<xxx>" on the mkfs.xfs command line to increase the AG
count. For artificial scalability testing like this, you want the AG
count to be at least 2x the number of directories you are working in
concurrently.
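As a sketch of what that scaling looks like, assuming 32 working
directories and the 2x AG rule of thumb above (the device, mount
point, and fs_mark parameters are just the ones from the earlier
commands; the script only echoes the commands, it doesn't run them):

```shell
# Hypothetical scaling of the workload above: NDIRS working
# directories, agcount = 2 * NDIRS. Dry run only - commands are
# echoed, not executed.
NDIRS=32
AGCOUNT=$((NDIRS * 2))

echo "mkfs.xfs -f -m crc=1,finobt=1 -d agcount=${AGCOUNT} /dev/md0"

# Build the -d option list programmatically instead of spelling out
# every directory by hand.
DIRS=""
i=0
while [ "$i" -lt "$NDIRS" ]; do
    DIRS="${DIRS} -d /mnt/scratch/${i}"
    i=$((i + 1))
done
echo "fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32${DIRS}"
```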

> Clearly our benchmarks used before this were not sufficient, and more
> work needs to be done.
>
>
> Also, I note that after running to completion, there is only 14G of
> actual data on the device, so you don't need silly large storage to run
> this -- I expect your previous 275G quote was due to XFS populating the
> sparse file with meta-data or something along those lines.

Yeah, that would have been after lots of other work being done on
the sparse file I use to back the 500TB filesystem I test on in the
VM. Currently:

$ ls -lh /mnt/fast-ssd
total 61G
-rw------- 1 root root 500T Sep 4 19:36 vm-500t.img
$ df -h /mnt/fast-ssd
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 400G 61G 340G 16% /mnt/fast-ssd
$

I'm using 61GB of space in the file that backs the 500TB device I'm
testing against. Every so often I punch out the file so that it gets
laid out again - I usually do that after running btrfs testing, as
btrfs fragments the crap out of the backing file even with extent
size hints set to minimise the fragmentation...
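Punching out a file like that can be done with fallocate(1). A sketch,
demonstrated on a small temporary file rather than the real 500T image
(for the real thing you'd point it at the vm-500t.img path with the
full length):

```shell
# Demonstrate hole-punching on a small scratch file. --punch-hole
# deallocates the blocks and --keep-size preserves the apparent file
# size, so a sparse backing image stays the same size while its
# extents are freed and can be laid out again on the next write.
IMG=$(mktemp)
truncate -s 16M "$IMG"                         # sparse 16M file
dd if=/dev/zero of="$IMG" bs=1M count=1 conv=notrunc 2>/dev/null
fallocate --punch-hole --keep-size --offset 0 --length 16M "$IMG" ||
    echo "hole punching not supported on this filesystem"
stat -c '%s bytes apparent, %b blocks allocated' "$IMG"
```

On XFS, xfs_io -c "fpunch 0 16m" on the file does the same thing.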

> Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)

That's why I do it in parallel - step 6 of my test script is:

echo removing files
for f in /mnt/scratch/* ; do time rm -rf $f & done
wait

And so:

.....
removing files

real 4m2.752s
user 0m3.387s
sys 2m56.801s
....
real 4m17.326s
user 0m3.333s
sys 2m57.831s
$

It takes a lot less than forever :)

Really, the fsmark run is just one part of my concurrent XFS inode
test script, which takes about 20 minutes to run. It does:

Prep: mkfs, mount
1. run fsmark to create inodes in parallel
2. run xfs_repair with maximum concurrency
3. run multi-threaded bulkstat
4. run concurrent find+stat
5. run concurrent ls -R
6. run concurrent rm -rf

It stresses all sorts of stuff:

- steps 1 and 6 stress the XFS inode allocation and
transaction subsystems - it runs at about 400,000-500,000
transaction commits a second here.

- Step 2 absolutely thrashes the mmap_sem from userspace due
to the memory demand and concurrent access patterns of
xfs_repair.

- Step 3 is a cold cache inode traversal - it pushes close
to a million inodes/second through the slab caches. It
puts a hell of a lot of load on the inode and xfs_buf slab
cache, the xfs_buf slab shrinker and all the VFS inode
instantiation and teardown paths. It is currently limited
in scalability by the inode_sb_list_lock contention.

- Step 4 and 5 do different types of directory traversal,
putting heavy demand on the XFS buffer cache and inode
cache shrinkers to work effectively.
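A rough dry-run skeleton of that six-step sequence, with the device,
mount point, and concurrency details elided; "bulkstat-test" stands in
for a custom multi-threaded bulkstat program, not a standard tool:

```shell
# Dry-run skeleton of the test sequence above. run() only echoes the
# commands - swap the echo for "$@" to actually execute them. DEV,
# MNT and bulkstat-test are placeholders.
DEV=/dev/md0
MNT=/mnt/scratch
run() { echo "+ $*"; }

run mkfs.xfs -f "$DEV"                                     # prep
run mount -o logbsize=262144,nobarrier "$DEV" "$MNT"
run fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 -d "$MNT/0"   # 1. create inodes
run umount "$MNT"
run xfs_repair "$DEV"                                      # 2. repair
run mount -o logbsize=262144,nobarrier "$DEV" "$MNT"
run ./bulkstat-test "$MNT"                                 # 3. bulkstat (placeholder)
run sh -c "find $MNT -exec stat {} + > /dev/null"          # 4. find + stat
run sh -c "ls -R $MNT > /dev/null"                         # 5. ls -R
run sh -c "rm -rf $MNT/*"                                  # 6. remove
```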

I have several variants - small files, different filesystems,
different directory structures, etc - because they all stress
different aspects of filesystem and core infrastructure. It's found
locking regressions. It's found mm/ subsystem regressions. It's
found writeback regressions. It's found all sorts of bugs in my code
over the years - it's a very useful test, so I keep using it. ;)

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx