RE: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates

From: Elliott, Robert (Server Storage)
Date: Sat May 16 2015 - 21:21:33 EST



> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@xxxxxxxxxxxx] On Behalf Of
> Dan Williams
> Sent: Tuesday, April 28, 2015 1:26 PM
> To: linux-nvdimm@xxxxxxxxxxxx
> Cc: Ingo Molnar; Neil Brown; Greg KH; Dave Chinner; linux-
> kernel@xxxxxxxxxxxxxxx; Andy Lutomirski; Jens Axboe; H. Peter Anvin;
> Christoph Hellwig
> Subject: [Linux-nvdimm] [PATCH v2 19/20] nd_btt: atomic sector updates
>
> From: Vishal Verma <vishal.l.verma@xxxxxxxxxxxxxxx>
>
> BTT stands for Block Translation Table, and is a way to provide power
> fail sector atomicity semantics for block devices that have the ability
> to perform byte granularity IO. It relies on the ->rw_bytes() capability
> of provided nd namespace devices.
>
> The BTT works as a stacked block device, and reserves a chunk of space
> from the backing device for its accounting metadata. BLK namespaces may
> mandate use of a BTT and expect the bus to initialize a BTT if not
> already present. Otherwise if a BTT is desired for other namespaces (or
> partitions of a namespace) a BTT may be manually configured.
...
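
For reference, the atomicity scheme described above boils down to an
indirection table: reads translate a premap LBA through an on-media map,
and writes land in a free block that is committed by a single map entry
update. A minimal sketch of the idea (illustrative only, not the patch
code; all names here are made up):

#include <stdint.h>

/* premap LBA -> postmap block, one 32-bit entry per sector */
struct btt_sketch {
	uint32_t *map;
	/* free-block list, RTT, lane locks, etc. omitted */
};

/* reads simply translate through the map */
static uint32_t btt_lookup(struct btt_sketch *btt, uint32_t lba)
{
	return btt->map[lba];
}

/*
 * writes copy the new sector into a free block first; the single
 * 32-bit map store is the commit point, so a power failure leaves
 * either the old or the new sector intact, never a torn mix
 */
static void btt_write_sketch(struct btt_sketch *btt, uint32_t lba,
			     uint32_t free_blk)
{
	/* ->rw_bytes() would copy the new data into free_blk here */
	btt->map[lba] = free_blk;
}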

Running btt above pmem with a variety of workloads, I see an awful lot
of time spent in two places:
* _raw_spin_lock
* btt_make_request

This occurs for fio to raw /dev/ndN devices, ddpt over ext4 or xfs,
cp -R of large directories, and running make on the Linux kernel.

Some specific results:

fio 4 KiB random reads, WC cache type, memcpy:
* 43175 MB/s, 8 M IOPS pmem0 and pmem1
* 18500 MB/s, 1.5 M IOPS nd0 and nd1

fio 4 KiB random reads, WC cache type, memcpy with non-temporal
loads (when everything is 64-byte aligned; sketched below):
* 33814 MB/s, 4.3 M IOPS nd0 and nd1
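
The non-temporal variant is along these lines (a userspace illustration
with SSE4.1 intrinsics, not the kernel patch itself). On a WC mapping,
normal loads take the slow uncached path, while MOVNTDQA can use the
streaming-load buffers; that is also why the fast path only applies
when source, destination, and length are 64-byte aligned:

#include <immintrin.h>
#include <stddef.h>

/* assumes dst/src are 64-byte aligned and len is a multiple of 64 */
static void memcpy_nt_load(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	const __m128i *s = src;

	for (; len >= 64; len -= 64, d += 4, s += 4) {
		/* non-temporal loads: don't pull src through the cache */
		__m128i x0 = _mm_stream_load_si128((__m128i *)&s[0]);
		__m128i x1 = _mm_stream_load_si128((__m128i *)&s[1]);
		__m128i x2 = _mm_stream_load_si128((__m128i *)&s[2]);
		__m128i x3 = _mm_stream_load_si128((__m128i *)&s[3]);

		_mm_store_si128(&d[0], x0);
		_mm_store_si128(&d[1], x1);
		_mm_store_si128(&d[2], x2);
		_mm_store_si128(&d[3], x3);
	}
}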

Zeroing out 32 GiB with ddpt:
* 19 s, 1800 MiB/s pmem
* 55 s, 625 MiB/s btt

If btt_make_request needs to stall this much, maybe it'd be better
to utilize blk-mq request queues: park waiting requests on per-CPU
queues, and use IPIs for the completion interrupts when they're
finally done.
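
A minimal sketch of that direction, against the blk-mq API of this
kernel (hypothetical names, not a tested conversion):

#include <linux/blk-mq.h>

static int btt_queue_rq(struct blk_mq_hw_ctx *hctx,
			const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* btt map lookup plus ->rw_bytes() would run here, possibly
	 * deferred; on completion (from any context/CPU): */
	blk_mq_complete_request(rq);	/* IPIs to the submitting CPU */
	return BLK_MQ_RQ_QUEUE_OK;
}

static void btt_softirq_done(struct request *rq)
{
	blk_mq_end_request(rq, 0);
}

static struct blk_mq_ops btt_mq_ops = {
	.queue_rq	= btt_queue_rq,
	.map_queue	= blk_mq_map_queue,	/* per-CPU software queues */
	.complete	= btt_softirq_done,
};

That would let a stalled request sit on a software queue instead of
burning a CPU in cpu_relax().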


fio 4 KiB random reads without non-temporal memcpy
==================================================
perf top shows memcpy_erms taking nearly all the time; that function
does its copies with the REP; MOVSB string instruction (sketched after
this profile):
85.78% [kernel] [k] memcpy_erms
1.21% [kernel] [k] _raw_spin_lock
0.72% [nd_btt] [k] btt_make_request
0.67% [kernel] [k] do_blockdev_direct_IO
0.47% fio [.] get_io_u
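
For reference, memcpy_erms (arch/x86/lib/memcpy_64.S) is little more
than one string instruction; roughly equivalent to:

#include <stddef.h>

static void *memcpy_erms_equiv(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	/* ERMS microcode makes a single REP MOVSB fast for large copies */
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : : "memory");
	return ret;
}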

fio 4 KiB random reads with non-temporal memcpy
===============================================
perf top shows there are still quite a few unaligned accesses that
fall back to the legacy memcpy, but about equal time is now spent
in the legacy vs. NT copy:
30.47% [kernel] [k] memcpy_erms
26.27% [kernel] [k] memcpy_lnt_st_64
5.37% [kernel] [k] _raw_spin_lock
2.20% [kernel] [k] btt_make_request
2.03% [kernel] [k] do_blockdev_direct_IO
1.41% fio [.] get_io_u
1.22% [kernel] [k] btt_map_read
1.15% [kernel] [k] pmem_rw_bytes
1.01% [kernel] [k] nd_btt_rw_bytes
0.98% [kernel] [k] nd_region_acquire_lane
0.89% fio [.] get_next_rand_block
0.88% fio [.] thread_main
0.79% fio [.] ios_completed
0.76% fio [.] td_io_queue
0.75% [kernel] [k] _raw_spin_lock_irqsave
0.68% [kernel] [k] kmem_cache_free
0.66% [kernel] [k] kmem_cache_alloc
0.59% [kernel] [k] __audit_syscall_exit
0.57% [kernel] [k] aio_complete
0.54% [kernel] [k] do_io_submit
0.52% [kernel] [k] _raw_spin_unlock_irqrestore
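
The nd_region_acquire_lane and _raw_spin_lock(_irqsave) entries above
are related: when a region has fewer lanes than CPUs, multiple CPUs
hash onto the same lane and serialize on a per-lane lock. A simplified
sketch of that pattern (not the actual libnd code):

#include <linux/spinlock.h>
#include <linux/smp.h>

static unsigned int acquire_lane_sketch(spinlock_t *lane_locks,
					unsigned int num_lanes)
{
	unsigned int cpu = get_cpu();	/* also disables preemption */
	unsigned int lane = cpu % num_lanes;

	/* the lock is only contended when CPUs must share a lane */
	if (num_lanes < num_possible_cpus())
		spin_lock(&lane_locks[lane]);
	return lane;	/* release side (spin_unlock/put_cpu) omitted */
}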

fio randrw workload
===================
perf top shows that adding writes to the mix brings btt_make_request
and its cpu_relax() loop to the forefront:
21.09% [nd_btt] [k] btt_make_request
19.06% [kernel] [k] memcpy_erms
14.35% [kernel] [k] _raw_spin_lock
10.38% [nd_pmem] [k] memcpy_lnt_st_64
1.57% [kernel] [k] do_blockdev_direct_IO
1.51% [nd_pmem] [k] memcpy_lt_snt_64
1.43% [nd_btt] [k] nd_btt_rw_bytes
1.39% [kernel] [k] radix_tree_next_chunk
1.33% [kernel] [k] put_page
1.21% [nd_pmem] [k] pmem_rw_bytes
1.11% fio [.] get_io_u
0.90% fio [.] io_u_queued_complete
0.74% [kernel] [k] system_call
0.72% [libnd] [k] nd_region_acquire_lane
0.71% [nd_btt] [k] btt_map_read
0.62% fio [.] thread_main

inside btt_make_request:

        |              /* Wait if the new block is being read from */
        |              for (i = 0; i < arena->nfree; i++)
   2.98 |       ↓ je     2b4
   0.05 |         mov    0x60(%r14),%rax
   0.00 |         mov    %ebx,%edx
        |         xor    %esi,%esi
   0.03 |         or     $0x80000000,%edx
   0.05 |         nop
        |              while (arena->rtt[i] == (RTT_VALID | new_postmap))
  22.98 |  290:   mov    %esi,%edi
   0.01 |         cmp    %edx,(%rax,%rdi,4)
  30.97 |         lea    0x0(,%rdi,4),%rcx
  21.05 |       ↓ jne    2ab
        |         nop
        |              }
        |
        |              /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
        |              static inline void rep_nop(void)
        |              {
        |                      asm volatile("rep; nop" ::: "memory");
        |  2a0:   pause
        |         mov    0x60(%r14),%rax
        |         cmp    (%rax,%rcx,1),%edx
        |       ↑ je     2a0
        |              }
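
In C, the loop above amounts to (reconstructed from the annotation;
field names as in the patch):

	/* Wait if the new block is being read from */
	for (i = 0; i < arena->nfree; i++)
		while (arena->rtt[i] == (RTT_VALID | new_postmap))
			cpu_relax();	/* the pause above */

Every writer scans all nfree RTT slots and spins on any reader still
referencing its new block, so adding writes to the mix makes this
busy-wait dominate.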


ddpt zeroing out
================
perf top shows 27% in spinlocks, and 14% in btt_make_request (all in
the "wait if the new block is being read from" loop).

26.48% [kernel] [k] _raw_spin_lock
14.46% [nd_btt] [k] btt_make_request
13.14% [kernel] [k] memcpy_erms
10.34% [kernel] [k] copy_user_enhanced_fast_string
3.12% [nd_pmem] [k] memcpy_lt_snt_64
1.15% [kernel] [k] __block_commit_write.isra.21
0.96% [nd_pmem] [k] pmem_rw_bytes
0.96% [nd_btt] [k] nd_btt_rw_bytes
0.86% [kernel] [k] unlock_page
0.65% [kernel] [k] _raw_spin_lock_irqsave
0.58% [kernel] [k] bdev_read_only
0.56% [kernel] [k] release_pages
0.54% [nd_pmem] [k] memcpy_lnt_st_64
0.53% [ext4] [k] ext4_mark_iloc_dirty
0.52% [kernel] [k] __wake_up_bit
0.52% [kernel] [k] __clear_user

---
Robert Elliott, HP Server Storage