Re: [PATCH v2] locking/rtmutex: Limit # of lock stealing for non-RT waiters

From: Waiman Long
Date: Fri Jun 24 2022 - 22:23:37 EST


On 6/24/22 02:42, Sebastian Andrzej Siewior wrote:
On 2022-06-23 10:41:17 [-0400], Waiman Long wrote:
On 6/23/22 09:32, Sebastian Andrzej Siewior wrote:
Do you have more insight on how this was tested/ created? Based on that,
systemd and a random kworker waited on a lock for more than 10 minutes.
The hang happens when our QE team run thier kernel tier 1 test which, I
think, lasts several hours. The hang happens in some runs but not all of
them. So it is kind of opportunistic. Mike should be able to provide a
better idea about frequency and so on.
So we talk here about 64+ CPU or more than that?

The bug report that I saw showed 2 hangs in two different one-socket 4-core 8-thread systems. So they are not 64+ cpus. Of course, the more cpus that we have, the more chance we will get into trouble.



I added a trace-printk each time a non-RT waiter got the lock stolen,
kicked a kernel build and a package upgrade and took a look at the stats
an hour later:
- sh got its lock stolen 3416 times. I didn't lock the pid so I can't
look back and check how long it waited since the first time.
- the median average of stolen locks is 173.
Maybe we should also more lock stealing per waiter than the 10 that I used
in the patch. I am open to suggestion to what is a good value to use.
I have no idea either. I just looked at a run to see what the number
actually are. I have no numbers in terms of performance. So what most
likely happens is that on an unlock operation the waiter gets a wake-up
but before he gets a chance to acquire the lock, it is already taken and
he goes back to sleep again. While this looks painful it might be better
performance wise because the other task was able to acquire the lock
without waiting. But then it is not fair and this happens.
One thing that I'm curious about is, what lock is it (one or two global
hot spots or many). And how to benchmark this…

The hang can happen in different places. Like

[ 1873.205398] INFO: task pool-restraintd:122206 blocked for more than 622 seconds.
[ 1873.205400]       Tainted: G        W  OE  X --------- --- 5.14.0-104.rt21.104.el9.x86_64 #1
[ 1873.205401] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.205402] task:pool-restraintd state:D stack:    0 pid:122206 ppid:     1 flags:0x00000000
[ 1873.205405] Call Trace:
[ 1873.205407]  __schedule+0x21c/0x5b0
[ 1873.205410]  ? migrate_enable+0xcd/0x140
[ 1873.205413]  schedule+0x46/0xc0
[ 1873.205415]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1873.205418]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1873.205421]  __rwbase_read_lock+0x9c/0xe0
[ 1873.205425]  do_madvise+0x109/0x9c0
[ 1873.205429]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205432]  ? do_syscall_64+0x69/0x80
[ 1873.205434]  __x64_sys_madvise+0x28/0x30
[ 1873.205436]  do_syscall_64+0x5c/0x80
[ 1873.205438]  ? fpregs_restore_userregs+0x53/0x100
[ 1873.205442]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1873.205445]  ? syscall_exit_to_user_mode+0x12/0x30
[ 1873.205447]  ? do_syscall_64+0x69/0x80
[ 1873.205449]  ? do_syscall_64+0x69/0x80
[ 1873.205452]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1873.205455] RIP: 0033:0x7f35f5a93f5b
[ 1873.205457] RSP: 002b:00007f35f4a48e18 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
[ 1873.205460] RAX: ffffffffffffffda RBX: 00007f35f4a49640 RCX: 00007f35f5a93f5b
[ 1873.205461] RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007f35f4249000
[ 1873.205463] RBP: 00007f35f4249000 R08: 00007f35ec0026c0 R09: 0000000000000005
[ 1873.205464] R10: 0000000000000008 R11: 0000000000000206 R12: 00007f35f4a49640
[ 1873.205466] R13: 0000000000000000 R14: 00007f35f5af4560 R15: 0000000000000000

or

[ 1872.559562] INFO: task auditd:926 blocked for more than 622 seconds.
[ 1872.559563]       Tainted: G        W        --------- --- 5.14.0-103.rt21.103.el9.x86_64 #1
[ 1872.559563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1872.559564] task:auditd          state:D stack:    0 pid:  926 ppid:     1 flags:0x00000000
[ 1872.559565] Call Trace:
[ 1872.559566]  __schedule+0x21c/0x5b0
[ 1872.559568]  schedule+0x46/0xc0
[ 1872.559569]  rt_mutex_slowlock_block.constprop.0+0x30/0x150
[ 1872.559570]  __rt_mutex_slowlock.constprop.0+0x6b/0xe0
[ 1872.559572]  __rwbase_read_lock+0x9c/0xe0
[ 1872.559574]  xlog_cil_commit+0x30/0x190 [xfs]
[ 1872.559631]  __xfs_trans_commit+0xb0/0x320 [xfs]
[ 1872.559686]  xfs_vn_update_time+0x12c/0x240 [xfs]
[ 1872.559741]  file_update_time+0xea/0x140
[ 1872.559744]  ? rwbase_write_lock+0xde/0x1b0
[ 1872.559746]  xfs_file_write_checks+0x247/0x310 [xfs]
[ 1872.559799]  xfs_file_buffered_write+0x76/0x2c0 [xfs]
[ 1872.559852]  ? update_load_avg+0x80/0x7d0
[ 1872.559855]  new_sync_write+0x11f/0x1b0
[ 1872.559858]  vfs_write+0x25a/0x2f0
[ 1872.559860]  ksys_write+0x65/0xe0
[ 1872.559862]  do_syscall_64+0x5c/0x80
[ 1872.559864]  ? fpregs_restore_userregs+0x12/0x100
[ 1872.559866]  ? exit_to_user_mode_prepare+0xda/0xe0
[ 1872.559868]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1872.559870] RIP: 0033:0x7f4bf7545d9f
[ 1872.559872] RSP: 002b:00007ffce95c2920 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 1872.559874] RAX: ffffffffffffffda RBX: 0000000000000111 RCX: 00007f4bf7545d9f
[ 1872.559874] RDX: 0000000000000111 RSI: 0000559015de18f0 RDI: 0000000000000005
[ 1872.559875] RBP: 0000559015de18f0 R08: 0000000000000000 R09: 00007f4bf75b74e0
[ 1872.559876] R10: 00007f4bf75b73e0 R11: 0000000000000293 R12: 0000000000000111
[ 1872.559877] R13: 0000559015dd9310 R14: 0000000000000111 R15: 00007f4bf75fb9e0

Since it is causing 25-50% of our test jobs to fail, we really want to get it fixed ASAP.

Regards,
Longman