Re: [PATCH v3 0/2] iommu/iova: Make the rcache depot properly flexible

From: Robin Murphy
Date: Tue Jan 09 2024 - 06:27:24 EST


On 2024-01-09 6:23 am, Ethan Zhao wrote:

On 1/9/2024 1:54 PM, Ethan Zhao wrote:

On 1/9/2024 1:35 AM, Robin Murphy wrote:
On 2023-12-28 12:23 pm, Ido Schimmel wrote:
On Tue, Sep 12, 2023 at 05:28:04PM +0100, Robin Murphy wrote:
v2: https://lore.kernel.org/linux-iommu/cover.1692641204.git.robin.murphy@xxxxxxx/

Hi all,

I hope this is good to go now, just fixed the locking (and threw
lockdep at it to confirm, which of course I should have done to begin
with...) and picked up tags.

Hi,

After pulling the v6.7 changes we started seeing the following memory
leaks [1] of 'struct iova_magazine'. I'm not sure how to reproduce it,
which is why I didn't perform bisection. However, looking at the
mentioned code paths, they seem to have been changed in v6.7 as part of
this patchset. I reverted both patches and didn't see any memory leaks
when running a full regression (~10 hours), but I will repeat it to be
sure.

Any idea what could be the problem?

Hmm, we've got what looks to be a set of magazines forming a plausible depot list (or at least the tail end of one):

ffff8881411f9000 -> ffff8881261c1000

ffff8881261c1000 -> ffff88812be26400

ffff88812be26400 -> ffff8188392ec000

ffff8188392ec000 -> ffff8881a5301000

ffff8881a5301000 -> NULL

which I guess has somehow become detached from its rcache->depot without being freed properly? However I'm struggling to see any conceivable way that could happen which wouldn't already be more severely broken in other ways as well (i.e. either general memory corruption or someone somehow still trying to use the IOVA domain while it's being torn down).
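For illustration only, here is a minimal userspace sketch of a depot as a singly linked list of magazines, matching the shape of the chain above. It assumes a simplified layout: `struct mag`, `struct cache`, and the `depot_*` helpers are hypothetical stand-ins, not the kernel's definitions. It shows why a chain that is no longer reachable from its head pointer is missed by the normal teardown loop and gets reported as leaked.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a magazine; only the list linkage is modelled. */
struct mag {
	struct mag *next;
};

/* Hypothetical stand-in for an rcache: a head pointer plus a count. */
struct cache {
	struct mag *depot;
	unsigned int depot_size;
};

/* Push a magazine onto the head of the depot list. */
static void depot_push(struct cache *c, struct mag *m)
{
	m->next = c->depot;
	c->depot = m;
	c->depot_size++;
}

/* Pop the head magazine, or return NULL if the depot is empty. */
static struct mag *depot_pop(struct cache *c)
{
	struct mag *m = c->depot;

	if (!m)
		return NULL;
	c->depot = m->next;
	c->depot_size--;
	return m;
}

/* Teardown: free everything still reachable from c->depot. */
static unsigned int depot_free_all(struct cache *c)
{
	unsigned int freed = 0;
	struct mag *m;

	while ((m = depot_pop(c)) != NULL) {
		free(m);
		freed++;
	}
	return freed;
}

/* Count the magazines in a chain, wherever its head pointer lives. */
static unsigned int chain_length(const struct mag *m)
{
	unsigned int n = 0;

	for (; m; m = m->next)
		n++;
	return n;
}
```

If `c->depot` were overwritten with NULL while magazines were still linked, `depot_free_all()` would free nothing and the chain would remain intact but orphaned, which is consistent with (though not proof of) the pattern in the report.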

Out of curiosity, does reverting just patch #2 alone make a difference? And is your workload doing anything "interesting" in relation to IOVA domain lifetimes, like creating and destroying SR-IOV virtual functions, changing IOMMU domain types via sysfs, or using that horrible vdpa thing, or are you seeing this purely from regular driver DMA API usage?

There is no lock held in free_iova_rcaches(); is it possible for free_iova_rcaches() to race with the delayed cancel_delayed_work_sync()?

I don't know why cancel_delayed_work_sync(&rcache->work) isn't called first in free_iova_rcaches() to avoid a possible race.

Is a race possible between the following pairs of functions if they are called concurrently?

1. free_iova_rcaches() with iova_depot_work_func()

   free_iova_rcaches() holds no lock, iova_depot_work_func() holds rcache->lock.

Unless I've completely misunderstood the workqueue API, that can't happen, since free_iova_rcaches() *does* synchronously cancel the work before it starts freeing the depot list.
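As a rough userspace analogy for that ordering (a sketch, not the kernel code): a worker thread stands in for the delayed work and pops list nodes under a mutex, and `pthread_join()` plays the role that `cancel_delayed_work_sync()` plays in the kernel. All names here (`struct model_cache`, `model_init()`, `model_teardown()`) are hypothetical. Once the join returns, nothing else can touch the list, so the teardown path can free the remainder without taking the lock.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct node {
	struct node *next;
};

/* Hypothetical model of an rcache with a background trimming worker. */
struct model_cache {
	pthread_mutex_t lock;		/* protects depot while the worker runs */
	struct node *depot;
	atomic_bool stop;		/* asks the worker to exit */
	pthread_t worker;
	unsigned int freed_by_worker;
	unsigned int freed_by_teardown;
};

/* Worker: pop one node at a time under the lock, free it outside the lock. */
static void *worker_fn(void *arg)
{
	struct model_cache *c = arg;
	struct node *n;

	while (!atomic_load(&c->stop)) {
		pthread_mutex_lock(&c->lock);
		n = c->depot;
		if (n)
			c->depot = n->next;
		pthread_mutex_unlock(&c->lock);
		if (!n)
			break;
		free(n);
		c->freed_by_worker++;
	}
	return NULL;
}

/* Free whatever is left; only safe once no worker can be running. */
static void drain_unlocked(struct model_cache *c)
{
	while (c->depot) {
		struct node *n = c->depot;

		c->depot = n->next;
		free(n);
		c->freed_by_teardown++;
	}
}

/* Build a depot of n nodes and start the worker; returns 0 on success. */
static int model_init(struct model_cache *c, unsigned int n)
{
	int rc;

	pthread_mutex_init(&c->lock, NULL);
	c->depot = NULL;
	atomic_init(&c->stop, false);
	c->freed_by_worker = 0;
	c->freed_by_teardown = 0;

	while (n--) {
		struct node *node = malloc(sizeof(*node));

		if (!node) {
			drain_unlocked(c);
			return -1;
		}
		node->next = c->depot;
		c->depot = node;
	}

	rc = pthread_create(&c->worker, NULL, worker_fn, c);
	if (rc)
		drain_unlocked(c);
	return rc;
}

/* Teardown: stop and wait for the worker first, then free without the lock. */
static void model_teardown(struct model_cache *c)
{
	atomic_store(&c->stop, true);
	pthread_join(c->worker, NULL);	/* analogue of the synchronous cancel */
	drain_unlocked(c);
	pthread_mutex_destroy(&c->lock);
}
```

How the nodes are split between the worker and the teardown path varies from run to run, but every node is freed exactly once, because the lockless drain only starts after the worker has finished.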

2. iova_cpuhp_dead() with iova_depot_work_func()

  iova_cpuhp_dead() holds the per-CPU lock cpu_rcache->lock; iova_depot_work_func() holds rcache->lock.

That's not a race because those are touching completely different things - the closest they come to interacting is where they both free IOVAs back to the rbtree.

3. iova_cpuhp_dead() with free_iova_rcaches()

   iova_cpuhp_dead() holds the per-CPU lock cpu_rcache->lock; free_iova_rcaches() holds no lock.

See iova_domain_free_rcaches() - by the time we call free_iova_rcaches(), the hotplug handler has already been removed (and either way it couldn't account for *this* issue since it doesn't touch the depot at all).

4. iova_cpuhp_dead() with free_global_cached_iovas()

   iova_cpuhp_dead() holds the per-CPU lock cpu_rcache->lock, and free_global_cached_iovas() holds rcache->lock.

Again, they hold different locks because they're touching unrelated things.

Thanks,
Robin.