Re: [PATCH v6 5/6] nouveau: use new mmu interval notifiers

From: Ralph Campbell
Date: Wed Jan 15 2020 - 17:09:51 EST



On 1/14/20 5:00 AM, Jason Gunthorpe wrote:
On Mon, Jan 13, 2020 at 02:47:02PM -0800, Ralph Campbell wrote:
void
nouveau_svmm_fini(struct nouveau_svmm **psvmm)
{
struct nouveau_svmm *svmm = *psvmm;
+ struct mmu_interval_notifier *mni;
+
if (svmm) {
mutex_lock(&svmm->mutex);
+ while (true) {
+ mni = mmu_interval_notifier_find(svmm->mm,
+ &nouveau_svm_mni_ops, 0UL, ~0UL);
+ if (!mni)
+ break;
+ mmu_interval_notifier_put(mni);

Oh, now I really don't like the name 'put'. It looks like mni is
refcounted here, and it isn't. put should be called 'remove_deferred'

OK.

And then you also need a way to barrier this scheme on driver unload.

Good point. I can add something like
void mmu_interval_notifier_synchronize(struct mm_struct *mm)
that waits for deferred operations to complete similar to
mmu_interval_read_begin().

+ }
svmm->vmm = NULL;
mutex_unlock(&svmm->mutex);
- mmu_notifier_put(&svmm->notifier);

While here it was actually a refcount.

+static void nouveau_svmm_do_unmap(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range)
+{
+ struct svmm_interval *smi =
+ container_of(mni, struct svmm_interval, notifier);
+ struct nouveau_svmm *svmm = smi->svmm;
+ unsigned long start = mmu_interval_notifier_start(mni);
+ unsigned long last = mmu_interval_notifier_last(mni);

This whole algorithm only works if it is protected by the read side of
the interval tree lock. Deserves at least a comment if not an
assertion too.

This is called from the invalidate() callback and while holding the
driver page table lock so the struct mmu_interval_notifier and
the interval tree can't change.
I will add comments for v7.

static int nouveau_range_fault(struct nouveau_svmm *svmm,
struct nouveau_drm *drm, void *data, u32 size,
- u64 *pfns, struct svm_notifier *notifier)
+ u64 *pfns, u64 start, u64 end)
{
unsigned long timeout =
jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
/* Have HMM fault pages within the fault window to the GPU. */
struct hmm_range range = {
- .notifier = &notifier->notifier,
- .start = notifier->notifier.interval_tree.start,
- .end = notifier->notifier.interval_tree.last + 1,
+ .start = start,
+ .end = end,
.pfns = pfns,
.flags = nouveau_svm_pfn_flags,
.values = nouveau_svm_pfn_values,
+ .default_flags = 0,
+ .pfn_flags_mask = ~0UL,
.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT,
};
- struct mm_struct *mm = notifier->notifier.mm;
+ struct mm_struct *mm = svmm->mm;
long ret;
while (true) {
if (time_after(jiffies, timeout))
return -EBUSY;
- range.notifier_seq = mmu_interval_read_begin(range.notifier);
- range.default_flags = 0;
- range.pfn_flags_mask = -1UL;
down_read(&mm->mmap_sem);

mmap sem doesn't have to be held for the interval search, and again we
have lifetime issues with the membership here.

I agree mmap_sem isn't needed for the interval search, it is needed if
the search doesn't find a registered interval and one needs to be created
to cover the underlying VMA. If an arbitrary size interval was created
instead, then mmap_sem wouldn't be needed.
I don't understand the lifetime/membership issue. The driver is the only thing
that allocates, inserts, or removes struct mmu_interval_notifier and thus
completely controls the lifetime.

+ ret = nouveau_svmm_interval_find(svmm, &range);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ return ret;
+ }
+ range.notifier_seq = mmu_interval_read_begin(range.notifier);
ret = hmm_range_fault(&range, 0);
up_read(&mm->mmap_sem);
if (ret <= 0) {

I'm still not sure this is a better approach than what ODP does. It
looks very expensive on the fault path..

Jason


ODP doesn't have this problem because users have to call ib_reg_mr()
before any I/O can happen to the process address space. That is when
mmu_interval_notifier_insert() / mmu_interval_notifier_remove() can
be called and the driver doesn't have to worry about the interval
changing sizes or being removed while I/O is happening.
For GPU like devices, I'm trying to allow hardware access to any user
level address without pre-registering it. That means inserting mmu
interval notifiers for the ranges the GPU page faults on and updating
the intervals as munmap() calls remove parts of the address space.
I don't want to register an interval per page so the logical range
is the underlying VMA.

It isn't that expensive, there is an extra driver lock/unlock as
part of the lookup and possibly a find_vma() and kmalloc(GFP_ATOMIC)
for new intervals. Also, the deferred interval updates for munmap().
Compared to the cost of updating PTEs in the device and GPU fault
handling, this is minimal overhead.