Re: Expense of read_iter

From: Zhongwei Cai
Date: Tue Jan 12 2021 - 08:54:54 EST



I'm working with Mingkai on optimizations for Ext4-dax.
We think that optimizing the read_iter method cannot achieve the
same performance as a dedicated read method for Ext4-dax.
We ran Mikulas's benchmark on Ext4-dax. The overall time and perf
results are listed below:

Overall time of 2^26 4KB reads:

Method       Time
read         26.782s
read_iter    36.477s
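
To put the totals in per-call terms, here is a rough sketch of the arithmetic. It assumes each total covers exactly 2^26 pread() calls and nothing else; the variable names are only illustrative.

```python
# Per-call cost derived from the two totals above.
# Assumption: each total covers exactly 2**26 pread() calls of 4 KB.
CALLS = 2 ** 26

total_seconds = {"read": 26.782, "read_iter": 36.477}

# Convert each total to nanoseconds per call.
ns_per_call = {name: t / CALLS * 1e9 for name, t in total_seconds.items()}

# Extra per-call cost of the read_iter path, and the read/read_iter time ratio.
extra_ns = ns_per_call["read_iter"] - ns_per_call["read"]
ratio = total_seconds["read"] / total_seconds["read_iter"]

for name, ns in ns_per_call.items():
    print(f"{name:10s} {ns:7.1f} ns/call")
print(f"{'extra':10s} {extra_ns:7.1f} ns/call")
print(f"read / read_iter = {ratio:.2%}")
```

Under that assumption the read_iter path costs on the order of 145 ns more per call than the read path.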

Perf result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13379476464
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
20.09% pread [kernel.vmlinux] [k] copy_user_generic_string
6.58% pread [kernel.vmlinux] [k] iomap_apply
6.01% pread [kernel.vmlinux] [k] syscall_return_via_sysret
4.85% pread libc-2.31.so [.] __libc_pread
3.61% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
3.25% pread [kernel.vmlinux] [k] _raw_read_lock
2.80% pread [kernel.vmlinux] [k] entry_SYSCALL_64
2.71% pread [ext4] [k] ext4_es_lookup_extent
2.71% pread [kernel.vmlinux] [k] __fsnotify_parent
2.63% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.55% pread [kernel.vmlinux] [k] new_sync_read
2.39% pread [ext4] [k] ext4_iomap_begin
2.38% pread [kernel.vmlinux] [k] vfs_read
2.30% pread [kernel.vmlinux] [k] dax_iomap_actor
2.30% pread [kernel.vmlinux] [k] __srcu_read_lock
2.14% pread [ext4] [k] ext4_inode_block_valid
1.97% pread [kernel.vmlinux] [k] _copy_mc_to_iter
1.97% pread [ext4] [k] ext4_map_blocks
1.89% pread [kernel.vmlinux] [k] down_read
1.89% pread [kernel.vmlinux] [k] up_read
1.65% pread [ext4] [k] ext4_file_read_iter
1.48% pread [kernel.vmlinux] [k] dax_iomap_rw
1.48% pread [jbd2] [k] jbd2_transaction_committed
1.15% pread [nd_pmem] [k] __pmem_direct_access
1.15% pread [kernel.vmlinux] [k] ksys_pread64
1.15% pread [kernel.vmlinux] [k] __fget_light
1.15% pread [ext4] [k] ext4_set_iomap
1.07% pread [kernel.vmlinux] [k] atime_needs_update
0.82% pread pread [.] main
0.82% pread [kernel.vmlinux] [k] do_syscall_64
0.74% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.66% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.66% pread [nd_pmem] [k] 0x00000000000001d0
0.59% pread [kernel.vmlinux] [k] dax_direct_access
0.58% pread [nd_pmem] [k] 0x00000000000001de
0.58% pread [kernel.vmlinux] [k] bdev_dax_pgoff
0.49% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.49% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.49% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.41% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.33% pread [nd_pmem] [k] 0x0000000000001083
0.33% pread [kernel.vmlinux] [k] dax_get_private
0.33% pread [kernel.vmlinux] [k] timestamp_truncate
0.33% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] copyout_mc
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] touch_atime
0.25% pread [nd_pmem] [k] 0x000000000000107f
0.25% pread [kernel.vmlinux] [k] rw_verify_area
0.25% pread [ext4] [k] ext4_iomap_end
0.25% pread [kernel.vmlinux] [k] _cond_resched
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.16% pread [kernel.vmlinux] [k] iov_iter_init
0.16% pread [kernel.vmlinux] [k] current_time
0.16% pread [nd_pmem] [k] 0x0000000000001075
0.16% pread [ext4] [k] ext4_inode_datasync_dirty
0.16% pread [kernel.vmlinux] [k] copy_mc_to_user
0.08% pread pread [.] pread@plt
0.08% pread [kernel.vmlinux] [k] __x86_indirect_thunk_r11
0.08% pread [kernel.vmlinux] [k] security_file_permission
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.08% pread [nd_pmem] [k] 0x000000000000108f
0.08% pread [nd_pmem] [k] 0x0000000000001095
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.00% pread [kernel.vmlinux] [k] native_write_msr


#
# (Tip: Show current config key-value pairs: perf config --list)
#

Perf result, using the read method we added for Ext4-dax:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13364755903
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
28.65% pread [kernel.vmlinux] [k] copy_user_generic_string
7.99% pread [ext4] [k] ext4_dax_read
6.50% pread [kernel.vmlinux] [k] syscall_return_via_sysret
5.43% pread libc-2.31.so [.] __libc_pread
4.45% pread [kernel.vmlinux] [k] entry_SYSCALL_64
4.20% pread [kernel.vmlinux] [k] down_read
3.38% pread [kernel.vmlinux] [k] _raw_read_lock
3.13% pread [ext4] [k] ext4_es_lookup_extent
3.05% pread [kernel.vmlinux] [k] __srcu_read_lock
2.72% pread [kernel.vmlinux] [k] __fsnotify_parent
2.55% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.47% pread [kernel.vmlinux] [k] vfs_read
2.31% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
1.89% pread [kernel.vmlinux] [k] up_read
1.73% pread [ext4] [k] ext4_map_blocks
1.65% pread pread [.] main
1.56% pread [kernel.vmlinux] [k] __fget_light
1.48% pread [ext4] [k] ext4_inode_block_valid
1.34% pread [kernel.vmlinux] [k] ksys_pread64
1.23% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
1.08% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
1.07% pread [nd_pmem] [k] __pmem_direct_access
0.99% pread [kernel.vmlinux] [k] atime_needs_update
0.91% pread [kernel.vmlinux] [k] security_file_permission
0.91% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.66% pread [kernel.vmlinux] [k] timestamp_truncate
0.58% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.49% pread pread [.] pread@plt
0.41% pread [kernel.vmlinux] [k] current_time
0.41% pread [kernel.vmlinux] [k] dax_direct_access
0.41% pread [kernel.vmlinux] [k] do_syscall_64
0.41% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.41% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] touch_atime
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] copy_mc_to_user
0.25% pread [kernel.vmlinux] [k] dax_get_private
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.25% pread [nd_pmem] [k] 0x0000000000001095
0.16% pread [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.16% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.16% pread [nd_pmem] [k] 0x0000000000001083
0.16% pread [kernel.vmlinux] [k] rw_verify_area
0.16% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] dax_read_lock
0.16% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] update_irq_load_avg
0.08% pread [nd_pmem] [k] 0x000000000000109d
0.08% pread [nd_pmem] [k] 0x000000000000107a
0.08% pread [kernel.vmlinux] [k] __x64_sys_pread64
0.00% pread [kernel.vmlinux] [k] native_write_msr


#
# (Tip: Sample related events with: perf record -e '{cycles,instructions}:S')
#

Note that the overall time of the read method is 73.42% of that of the
read_iter method. If we sum up the percentages of the read_iter-specific
functions (including
ext4_file_read_iter, iomap_apply, dax_iomap_actor, _copy_mc_to_iter,
ext4_iomap_begin, jbd2_transaction_committed, new_sync_read, dax_iomap_rw,
ext4_set_iomap, ext4_iomap_end and iov_iter_init), we get 21.96%.
In the second trace, ext4_dax_read, which replaces all of these functions,
consumes only 7.99%.
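
The sum can be rechecked from the first trace. A minimal sketch, with the percentages copied from the read_iter perf output above: the eleven listed entries together come to 21.96%, and without ext4_set_iomap the total would be 20.81%.

```python
# Overheads (percent of cycles) copied from the read_iter trace above.
read_iter_specific = {
    "ext4_file_read_iter": 1.65,
    "iomap_apply": 6.58,
    "dax_iomap_actor": 2.30,
    "_copy_mc_to_iter": 1.97,
    "ext4_iomap_begin": 2.39,
    "jbd2_transaction_committed": 1.48,
    "new_sync_read": 2.55,
    "dax_iomap_rw": 1.48,
    "ext4_set_iomap": 1.15,
    "ext4_iomap_end": 0.25,
    "iov_iter_init": 0.16,
}

# Overhead of ext4_dax_read in the second trace.
ext4_dax_read = 7.99

total = sum(read_iter_specific.values())
without_set_iomap = total - read_iter_specific["ext4_set_iomap"]

print(f"read_iter-specific total: {total:.2f}%")
print(f"without ext4_set_iomap:   {without_set_iomap:.2f}%")
print(f"minus ext4_dax_read:      {total - ext4_dax_read:.2f} points")
```

Note that the two traces have slightly different event counts, so subtracting percentages across them is only an approximation.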

The overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating over it (i.e., new_sync_read,
_copy_mc_to_iter and iov_iter_init). The second is the DAX I/O mechanism
provided by VFS (i.e., dax_iomap_rw, iomap_apply and ext4_iomap_begin).
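
For a rough sense of how the two parts compare, the entries named above can be grouped as follows. This is only a sketch: it uses the new_sync_read entry from the first trace for the iterator setup, and the group labels are illustrative.

```python
# Percentages copied from the read_iter trace, grouped into the two parts.
iterator_part = {
    "new_sync_read": 2.55,
    "_copy_mc_to_iter": 1.97,
    "iov_iter_init": 0.16,
}
dax_io_part = {
    "dax_iomap_rw": 1.48,
    "iomap_apply": 6.58,
    "ext4_iomap_begin": 2.39,
}

iterator_total = sum(iterator_part.values())
dax_io_total = sum(dax_io_part.values())

print(f"iov_iter construction/iteration: {iterator_total:.2f}%")
print(f"DAX I/O mechanism:               {dax_io_total:.2f}%")
```

With these six functions alone, the DAX I/O part (10.45%) is roughly twice the iterator part (4.68%).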

There are two possible approaches to optimization: 1) implementing a read
method without the complexity of iterators and dax_iomap_rw; 2) optimizing
both the iterators and how dax_iomap_rw works. Since dax_iomap_rw requires
ext4_iomap_begin, which in turn involves the iomap structure and other
state (e.g., journaling status locks in Ext4), we think implementing the
read method would be easier.

Thanks,
Zhongwei