Re: [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V

From: Robin Murphy
Date: Wed Dec 22 2021 - 06:57:16 EST


On 2021-12-21 22:00, Nicolin Chen wrote:
[...]
The challenge to make ECMDQ useful to Linux is how to make sure that all
the commands expected to be within scope of a future CMD_SYNC plus that
sync itself all get issued on the same queue, so I'd be mildly surprised
if you didn't have the same problem.

PATCH-3 in this series actually helps keep the issued commands
and the SYNC on the same command queue, if bool sync == true.
Yet, if doing something like issue->issue->issue_with_sync, it
could be trickier.

Indeed between the iommu_iotlb_gather mechanism and low-level command
batching things are already a lot more concentrated than they could be,
but arm_smmu_cmdq_batch_add() and its callers stand out as examples of
where we'd still be vulnerable to preemption. What I haven't even tried
to reason about yet is assumptions in the higher-level APIs, e.g. if
io-pgtable might chuck out a TLBI during an iommu_unmap() which we
implicitly expect a later iommu_iotlb_sync() to cover.

Though I might have oversimplified the situation here, I see
that the arm_smmu_cmdq_batch_add() calls are typically followed
by arm_smmu_cmdq_batch_submit(). Could we just add a SYNC in
_batch_submit() to all the queues that it previously touched
in _batch_add()?
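
(For the sake of discussion, I read that as something vaguely like the sketch below - note that touched_qs, nr_ecmdqs and arm_smmu_ecmdq_sync() are names made up purely for illustration, not anything in this series or the current driver:)

/* _batch_add() would set a bit in touched_qs for each queue it issues to */
struct arm_smmu_cmdq_batch {
	u64			cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
	unsigned long		touched_qs;	/* hypothetical: bit per queue used */
	int			num;
};

static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
				      struct arm_smmu_cmdq_batch *cmds)
{
	int q, ret;

	/* Issue whatever is left in the batch, without the usual trailing sync */
	ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmds, cmds->num, false);

	/* Then put a CMD_SYNC on every queue the batch ended up touching */
	for_each_set_bit(q, &cmds->touched_qs, smmu->nr_ecmdqs) {
		int err = arm_smmu_ecmdq_sync(smmu, q);	/* hypothetical helper */

		if (err && !ret)
			ret = err;
	}

	return ret;
}
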

Keeping track of which queues a batch has touched is certainly possible, but it's yet more overhead to impose across the board when intra-batch preemption should (hopefully) be very rare in practice. I was thinking more along the lines of disabling preemption/migration for the lifetime of a batch, or more pragmatically just hoisting the queue selection all the way out to the scope of the batch itself (which also conveniently seems about the right shape for potentially forking off a whole other dedicated command submission flow from that point later).
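
Very roughly, I mean something like the below - again, arm_smmu_cmdq_batch_init() and arm_smmu_cmdq_issue_cmdlist_on() are only placeholders for the sketch, not proposed names:

/* Also purely illustrative: pick the queue once per batch and stick with it */
struct arm_smmu_cmdq_batch {
	u64			cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
	struct arm_smmu_cmdq	*cmdq;	/* picked once, reused for the whole batch */
	int			num;
};

static void arm_smmu_cmdq_batch_init(struct arm_smmu_device *smmu,
				     struct arm_smmu_cmdq_batch *cmds)
{
	cmds->num = 0;
	/* Whatever per-CPU/ECMDQ selection policy applies, applied exactly once */
	cmds->cmdq = arm_smmu_get_cmdq(smmu);
}

static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
				      struct arm_smmu_cmdq_batch *cmds)
{
	/* The commands and their trailing CMD_SYNC all land on the same queue */
	return arm_smmu_cmdq_issue_cmdlist_on(smmu, cmds->cmdq, cmds->cmds,
					      cmds->num, true);
}

If the CPU migrates mid-batch we'd just keep using the originally-chosen queue, which might cost a little locality but not correctness, since the commands and their sync still stay together.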

We still can't mitigate inter-batch preemption, though, so we'll just have to audit everything very carefully to make sure we don't have (or inadvertently introduce in future) any places where that could be problematic. We really want to avoid over-syncing as that's liable to end up being just as bad for performance as the contention that we're nominally avoiding.

I've been thinking that in many ways per-domain queues make quite a bit
of sense and would be easier to manage than per-CPU ones - plus that's
pretty much the usage model once we get to VMs anyway - but that fails
to help the significant cases like networking and storage where many
CPUs are servicing a big monolithic device in a single domain :(

Yea, and it's hard to predict which client would use the CMDQ
more frequently, in order to balance or assign more queues to
that client, which feels like a QoS conundrum.

Indeed, plus once we start assigning queues to VMs we're going to want to remove them from the general pool for host usage, so we definitely want to plan ahead here.

Cheers,
Robin.