Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm_cid

From: Mathieu Desnoyers
Date: Thu Apr 13 2023 - 09:42:26 EST


On 2023-04-13 09:13, Aaron Lu wrote:
On Wed, Apr 12, 2023 at 04:57:50PM -0400, Mathieu Desnoyers wrote:
On 2023-04-12 00:27, Aaron Lu wrote:
Just noticed below warning in dmesg on v4, the warning is triggered by
WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().

I think I know what triggers this in v4.

See sched_mm_cid_migrate_from() (near the end):

+ /*
+ * The src_cid is unused, so it can be unset.
+ */
+ if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
+ mm_cid_set_lazy_put(src_cid))
+ return;
+ __mm_cid_put(mm, src_cid);

There is a short window here between successful cmpxchg and clear mask bit
where the thread has ownership of the cid without rq lock held. This can
lead to rare over-allocation of cids beyond nr_possible_cpus if the src cpu
runqueue reallocates a cid concurrently, which causes __mm_cid_get() to
return -1.

Because we want to avoid taking the src rq lock in migrate-from, I plan to
fix this by doing the following:

- disable interrupts around this cmpxchg and __mm_cid_put(),

To reduce the short window?

Yes, just reduce. Eliminating the window would require grabbing the
src rq lock which we want to avoid.


- modify __mm_cid_get so it retries if cpumask_first_zero() returns -1.

This should take care of this kind of extremely rare over-allocation
scenario through retry, which will be bounded by the duration of the
instruction sequence between cmpxchg and clearing the bit in the bitmask.

I have not reproduced the warning on my end, only figured this out from
code review.

Thoughts ?

Sounds reasonable to me.

I tested below diff to verify your theory:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..2c0764e53b83 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)

cpumask = mm_cidmask(mm);
cid = cpumask_first_zero(cpumask);
- if (cid >= nr_cpu_ids)
+ if (cid >= nr_cpu_ids) {
+ WARN_ON_ONCE(1);
return -1;
+ }
cpumask_set_cpu(cid, cpumask);
return cid;
}

And it indeed fired.

I then applied the below diff according to your description:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425766cc1300..881571519a90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11536,6 +11536,7 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
struct mm_struct *mm = t->mm;
struct rq *src_rq;
struct task_struct *src_task;
+ unsigned long flags;
if (!mm)
return;
@@ -11623,10 +11624,12 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
/*
* The src_cid is unused, so it can be unset.
*/
+ local_irq_save(flags);
if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
mm_cid_set_lazy_put(src_cid))
return;
__mm_cid_put(mm, src_cid);
+ local_irq_restore(flags);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3e7dc2cd1cc..41ed7022bab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
cpumask = mm_cidmask(mm);
cid = cpumask_first_zero(cpumask);
- if (cid >= nr_cpu_ids)
+ if (cid >= nr_cpu_ids) {
+ WARN_ON_ONCE(1);
return -1;
+ }
cpumask_set_cpu(cid, cpumask);
return cid;
}
@@ -3345,8 +3347,10 @@ static inline int mm_cid_get(struct mm_struct *mm)
if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
+ cid = -1;
raw_spin_lock(&mm->cid_lock);
- cid = __mm_cid_get_locked(mm);
+ while (cid == -1)
+ cid = __mm_cid_get_locked(mm);
raw_spin_unlock(&mm->cid_lock);
WRITE_ONCE(*pcpu_cid, cid);
return cid;

And did several iterations of hackbench, no more warnings from the
original place rseq_update_cpu_node_id(). It appears solved the issue,
will let you know if I ever hit it again.

One more thing is, with the above diff applied, I still get the warning
about cid == -1 in __mm_cid_get_locked() so I suppose disabling irq in
migrate_from can't entirely close the race.

Yes, this is expected. We need to do a retry loop on "get" when we find
a fully set bitmap and retry. This is why I want to disable irqs in
migrate-from: to keep this retry loop done with irqs off and rq lock held as small as possible. Disabling interrupts on the other side achieves this.


Another thing, it doesn't appear there is an user of __mm_get_cid()?

True. I'm preparing a v5 which significantly changes things: I introduce
a lock-free get, which hopefully should help reducing contention on the qspinlock.

Thanks,

Mathieu



Thanks,
Aaron

[ 1819.649803] ------------[ cut here ]------------
[ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
[ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
[ 1819.649903] sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
[ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
[ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
[ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
[ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ffff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
[ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
[ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
[ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
[ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
[ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
[ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
[ 1819.649947] FS: 00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
[ 1819.649950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
[ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1819.649957] PKRU: 55555554
[ 1819.649958] Call Trace:
[ 1819.649960] <TASK>
[ 1819.649964] exit_to_user_mode_prepare+0x13b/0x1a0
[ 1819.649970] syscall_exit_to_user_mode+0x2a/0x50
[ 1819.649976] ? __x64_sys_read+0x1d/0x30
[ 1819.649980] do_syscall_64+0x6d/0x90
[ 1819.649984] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1819.649990] RIP: 0033:0x7f81a77149cc
[ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
[ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
[ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
[ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
[ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
[ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
[ 1819.650012] </TASK>
[ 1819.650013] ---[ end trace 0000000000000000 ]---

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com