Re: [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags

From: Gabriel Krisman Bertazi
Date: Tue Nov 08 2022 - 22:03:35 EST


Chaitanya Kulkarni <chaitanyak@xxxxxxxxxx> writes:

>> For more interesting cases, where there is queueing, we need to take
>> into account the cross-communication of the atomic operations. I've
>> been benchmarking by running parallel fio jobs against a single hctx
>> nullb in different hardware queue depth scenarios, and verifying both
>> IOPS and queueing.
>>
>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>> varying only the hardware queue length per test.
>>
>> queue size  2                4                8                16                32                64
>> 6.1-rc2     1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
>> patched     1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>
>>

Hi Chaitanya,

Thanks for the feedback.

> So if I understand correctly
> QD 2, 4, 8 show a clear performance benefit from this patch whereas
> QD 16, 32, 64 show a drop in performance, is that correct?
>
> If my observation is correct then applications with high QD will
> observe drop in the performance ?

To be honest, I'm not sure. Given how much the standard deviation
(in parentheses) overlaps the difference between the means, I doubt the
observed drop is statistically significant. In my prior analysis, I
concluded it wasn't.
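
As a rough sanity check on that overlap, here is a minimal Python sketch
of Welch's t-test computed from the summary numbers in the table above.
It assumes the values in parentheses are sample standard deviations, that
each cell comes from n=5 runs, and that the runs are independent and
roughly normal. welch_t, RESULTS and N_RUNS are hypothetical names used
only for illustration; they are not part of the patch or of fio.

```python
import math

# Summary statistics copied from the table above, in K IOPS.
# Layout: queue size -> ((baseline mean, stddev), (patched mean, stddev)).
# Assumption: the parenthesized values are sample standard deviations
# over N_RUNS = 5 repetitions per cell.
N_RUNS = 5
RESULTS = {
    2: ((1681.1, 1.6), (1721.8, 15.1)),
    4: ((2633.0, 12.7), (3016.7, 3.8)),
    8: ((6940.8, 16.3), (7543.0, 89.4)),
    16: ((8172.3, 617.5), (8132.5, 303.4)),
    32: ((8391.7, 367.1), (8324.2, 230.6)),
    64: ((8606.1, 351.2), (8401.8, 284.7)),
}


def welch_t(mean_a, sd_a, mean_b, sd_b, n):
    """Return (t, df) for Welch's unequal-variance t-test with n samples per group."""
    var_a = sd_a ** 2 / n  # squared standard error of group a
    var_b = sd_b ** 2 / n  # squared standard error of group b
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / ((var_a ** 2 + var_b ** 2) / (n - 1))
    return t, df


if __name__ == "__main__":
    for qd, ((m0, s0), (m1, s1)) in RESULTS.items():
        t, df = welch_t(m0, s0, m1, s1, N_RUNS)
        print(f"queue size {qd:2d}: t = {t:8.2f}, df = {df:4.1f}")
```

With these inputs, |t| comes out at roughly 1 or less for queue sizes 16,
32 and 64, well short of the usual two-sided 5% thresholds (about 2.3 to
2.8 for 4 to 8 degrees of freedom), while queue sizes 2, 4 and 8 land far
above them. Five runs is a small sample, so this is only indicative.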

I also don't see where a significant difference would come from: the
higher the QD, the more likely it is to go through the uncontended
path, where sbq->ws_active == 0. That hot path is identical to the
existing implementation.

> Also, please share a table with block size/IOPS/BW/CPU (system/user)
> /LAT/SLAT with % increase/decrease and document the raw numbers at the
> end of the cover-letter for completeness, along with the fio job so others
> can repeat the experiment...

This was issued against nullb with a fixed IO size matching the
device's block size (512 bytes), which is why I am tracking only IOPS
and not BW. With a fixed IO size, BW is just IOPS times a constant, so
I'm not sure it adds anything in this scenario.
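
To make the IOPS/BW relationship concrete, here is a small Python sketch
of the conversion. It assumes "512b" means 512 bytes per IO and that "K"
in the table means 1000 IOPS; bandwidth_mib_s is a hypothetical helper
name used only for illustration.

```python
# Assumption: every IO is exactly one 512-byte block.
BLOCK_SIZE_BYTES = 512


def bandwidth_mib_s(iops, block_size=BLOCK_SIZE_BYTES):
    """Convert an IOPS figure to MiB/s for a fixed IO size."""
    return iops * block_size / (1024 * 1024)


if __name__ == "__main__":
    # Queue size 64 means from the table, assuming 1K = 1000 IOPS.
    baseline_iops = 8606.1 * 1000
    patched_iops = 8401.8 * 1000
    print(f"baseline: {bandwidth_mib_s(baseline_iops):.1f} MiB/s")
    print(f"patched:  {bandwidth_mib_s(patched_iops):.1f} MiB/s")
    # The ratio between the two is the same whether taken on IOPS or BW.
    print(f"ratio:    {patched_iops / baseline_iops:.4f}")
```

Since the block size cancels out of any relative comparison, a BW column
would carry the same percentage changes as the IOPS column.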

I'll definitely follow up with CPU time and latencies, and share the
fio job. I'll also take another look at the significance of the
measured values for high QD.

Thank you,

--
Gabriel Krisman Bertazi