Re: contention on pwq->pool->lock under heavy NFS workload

From: Chuck Lever III
Date: Thu Jun 22 2023 - 10:38:55 EST

> On Jun 21, 2023, at 5:28 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
>> lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
>> heavily contended lock on my test NFS client. The issue appears to be that the
>> three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
>> all get placed in the same worker_pool, so they contend for a single pool lock.
>>
>> I notice that ib_comp_wq is allocated with the same flags, yet I don't see
>> significant contention there; a trace_printk() in __queue_work() shows that
>> work items queued on that WQ seem to alternate between at least two different
>> worker_pools.
>>
>> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
>> amongst the worker_pools?
>
> Can you share the output of lstopo on the test machine?

Machine (P#0 total=32480548KB DMIProductName="Super Server" DMIProductVersion=0123456789 DMIBoardVendor=Supermicro DMIBoardName=X12SPL-F DMIBoardVersion=2.00 DMIBoardAssetTag="Base Board Asset Tag" DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=0123456789 DMIChassisAssetTag="Chassis Asset Tag" DMIBIOSVendor="American Megatrends International, LLC." DMIBIOSVersion=1.1a DMIBIOSDate=08/05/2021 DMISysVendor=Supermicro Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=6.4.0-rc7-00005-ga0c30c01f971 OSVersion="#8 SMP PREEMPT Wed Jun 21 11:29:02 EDT 2023" HostName=morisot.XXXXXXXXXXX.net Architecture=x86_64 hwlocVersion=2.5.0 ProcessName=lstopo)
  Package L#0 (P#0 total=32480548KB CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=106 CPUModel="Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz" CPUStepping=6)
    NUMANode L#0 (P#0 local=32480548KB total=32480548KB)
    L3Cache L#0 (size=18432KB linesize=64 ways=12 Inclusive=0)
      L2Cache L#0 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#0 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#0 (P#0)
              PU L#0 (P#0)
      L2Cache L#1 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#1 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#1 (P#1)
              PU L#1 (P#1)
      L2Cache L#2 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#2 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#2 (P#2)
              PU L#2 (P#2)
      L2Cache L#3 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#3 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#3 (P#3)
              PU L#3 (P#3)
      L2Cache L#4 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#4 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#4 (P#4)
              PU L#4 (P#4)
      L2Cache L#5 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#5 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#5 (P#5)
              PU L#5 (P#5)
      L2Cache L#6 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#6 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#6 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#6 (P#6)
              PU L#6 (P#6)
      L2Cache L#7 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#7 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#7 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#7 (P#7)
              PU L#7 (P#7)
      L2Cache L#8 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#8 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#8 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#8 (P#8)
              PU L#8 (P#8)
      L2Cache L#9 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#9 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#9 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#9 (P#9)
              PU L#9 (P#9)
      L2Cache L#10 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#10 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#10 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#10 (P#10)
              PU L#10 (P#10)
      L2Cache L#11 (size=1280KB linesize=64 ways=20 Inclusive=0)
        L1dCache L#11 (size=48KB linesize=64 ways=12 Inclusive=0)
          L1iCache L#11 (size=32KB linesize=64 ways=8 Inclusive=0)
            Core L#11 (P#11)
              PU L#11 (P#11)
  HostBridge L#0 (buses=0000:[00-07])
    PCI L#0 (busid=0000:00:11.5 id=8086:a1d2 class=0106(SATA))
      Block(Removable Media Device) L#0 (Size=1048575 SectorSize=512 LinuxDeviceID=11:0 Model=ASUS_DRW-24F1ST_b Revision=1.00 SerialNumber=E5D0CL034213) "sr0"
    PCI L#1 (busid=0000:00:17.0 id=8086:a182 class=0106(SATA))
    PCIBridge L#1 (busid=0000:00:1c.0 id=8086:a190 class=0604(PCIBridge) link=0.25GB/s buses=0000:[01-01])
      PCI L#2 (busid=0000:01:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#1 (Address=3c:ec:ef:7a:0b:fa) "eno1"
    PCIBridge L#2 (busid=0000:00:1c.1 id=8086:a191 class=0604(PCIBridge) link=0.25GB/s buses=0000:[02-02])
      PCI L#3 (busid=0000:02:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
        Network L#2 (Address=3c:ec:ef:7a:0b:fb) "eno2"
    PCIBridge L#3 (busid=0000:00:1c.5 id=8086:a195 class=0604(PCIBridge) link=0.62GB/s buses=0000:[05-06])
      PCIBridge L#4 (busid=0000:05:00.0 id=1a03:1150 class=0604(PCIBridge) link=0.62GB/s buses=0000:[06-06])
        PCI L#4 (busid=0000:06:00.0 id=1a03:2000 class=0300(VGA))
    PCIBridge L#5 (busid=0000:00:1d.0 id=8086:a198 class=0604(PCIBridge) link=3.94GB/s buses=0000:[07-07])
      PCI L#5 (busid=0000:07:00.0 id=c0a9:540a class=0108(NVMExp) link=3.94GB/s)
        Block(Disk) L#3 (Size=244198584 SectorSize=512 LinuxDeviceID=259:0 Model=CT250P2SSD8 Revision=P2CR012 SerialNumber=2116E597CC4F) "nvme0n1"
  HostBridge L#6 (buses=0000:[17-18])
    PCIBridge L#7 (busid=0000:17:02.0 id=8086:347a class=0604(PCIBridge) link=15.75GB/s buses=0000:[18-18])
      PCI L#6 (busid=0000:18:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=6)
        Network L#4 (Address=ec:0d:9a:92:b2:46 Port=1) "ens6np0"
        OpenFabrics L#5 (NodeGUID=ec0d:9a03:0092:b246 SysImageGUID=ec0d:9a03:0092:b246 Port1State=4 Port1LID=0x0 Port1LMC=0 Port1GID0=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID1=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID2=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID3=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID4=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID5=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID6=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID7=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID8=fe80:0000:0000:0000:88dd:0692:352e:0cec Port1GID9=fe80:0000:0000:0000:88dd:0692:352e:0cec) "rocep24s0"
  HostBridge L#8 (buses=0000:[50-51])
    PCIBridge L#9 (busid=0000:50:04.0 id=8086:347c class=0604(PCIBridge) link=15.75GB/s buses=0000:[51-51])
      PCI L#7 (busid=0000:51:00.0 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#6 (Address=00:00:05:f4:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0a Port=1) "ibs4f0"
        OpenFabrics L#7 (NodeGUID=b8ce:f603:0037:7a0a SysImageGUID=b8ce:f603:0037:7a0a Port1State=4 Port1LID=0xc Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0a) "ibp81s0f0"
      PCI L#8 (busid=0000:51:00.1 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
        Network L#8 (Address=00:00:03:d3:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0b Port=1) "ibs4f1"
        OpenFabrics L#9 (NodeGUID=b8ce:f603:0037:7a0b SysImageGUID=b8ce:f603:0037:7a0a Port1State=1 Port1LID=0xffff Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0b) "ibp81s0f1"
depth 0: 1 Machine (type #0)
depth 1: 1 Package (type #1)
depth 2: 1 L3Cache (type #6)
depth 3: 12 L2Cache (type #5)
depth 4: 12 L1dCache (type #4)
depth 5: 12 L1iCache (type #9)
depth 6: 12 Core (type #2)
depth 7: 12 PU (type #3)
Special depth -3: 1 NUMANode (type #13)
Special depth -4: 10 Bridge (type #14)
Special depth -5: 9 PCIDev (type #15)
Special depth -6: 10 OSDev (type #16)
Memory attribute #2 name `Bandwidth' flags 5
NUMANode L#0 = 1790 from cpuset 0x00000fff (Machine L#0)
Memory attribute #3 name `Latency' flags 6
NUMANode L#0 = 7600 from cpuset 0x00000fff (Machine L#0)
CPU kind #0 efficiency 0 cpuset 0x00000fff
FrequencyMaxMHz = 3300
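
For reference, all three of the workqueues mentioned above are created as
unbound queues, which is why they can end up sharing a worker_pool on a
single-NUMA-node box like this one. A rough sketch from memory (treat the
exact flags as approximate; the authoritative calls are in
net/sunrpc/sched.c, net/sunrpc/xprt.c, and fs/nfs/inode.c):

	/* sunrpc: RPC state machine work */
	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

	/* sunrpc: transport (socket/RDMA) work */
	wq = alloc_workqueue("xprtiod", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);

	/* nfs: completion and cleanup work */
	wq = alloc_workqueue("nfsiod", WQ_UNBOUND, 0);

None of them currently pass WQ_SYSFS, so there's no runtime knob exposed.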


> The following branch has pending workqueue changes which make unbound
> workqueues finer-grained by default and a lot more flexible in how they're
> segmented.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
>
> Can you please test with the branch? If the default doesn't improve the
> situation, you can set WQ_SYSFS on the affected workqueues and change their
> scoping by writing to /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope. Please
> take a look at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350
>
> for more details.

I will give this a try.
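
If the branch's new defaults don't move the needle, my plan would be
roughly the following (an untested sketch): add WQ_SYSFS to the
allocation so the queue shows up under sysfs, e.g. for rpciod:

	/* hypothetical: expose rpciod so its affinity scope is tunable */
	wq = alloc_workqueue("rpciod",
			     WQ_SYSFS | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

and then write one of the scope names your documentation lists (e.g.
"cache", which I gather groups workers per L3 complex) to
/sys/devices/virtual/workqueue/rpciod/affinity_scope before re-running
the test.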


--
Chuck Lever