Re: [PATCH v1] ALSA: memalloc: Fix indefinite hang in non-iommu case

From: Takashi Iwai
Date: Thu Feb 15 2024 - 03:40:20 EST


On Thu, 15 Feb 2024 04:45:27 +0100,
Hillf Danton wrote:
>
> On Wed, 14 Feb 2024 17:07:25 -0700 Karthikeyan Ramasubramanian <kramasub@xxxxxxxxxxxx>
> > Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> > the alsa non-contiguous allocator always called the alsa fallback
> > allocator in the non-iommu case. This allocated non-contig memory
> > consisting of progressively smaller contiguous chunks. Allocation was
> > fast due to the OR-ing in of __GFP_NORETRY.
> >
> > After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> > the code tries the dma non-contig allocator first, then falls back to
> > the alsa fallback allocator. In the non-iommu case, the former supports
> > only a single contiguous chunk.
> >
> > We have observed experimentally that under heavy memory fragmentation,
> > allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL
> > triggers an indefinite hang in the dma non-contig allocator. This has
> > high-impact, as an occurrence will trigger a device reboot, resulting in
> > loss of user state.
> >
> > Fix the non-iommu path by letting dma_alloc_noncontiguous() fail quickly
> > so it does not get stuck looking for that elusive large contiguous chunk,
> > in which case we will fall back to the alsa fallback allocator.
>
> The faster dma_alloc_noncontiguous() fails the more likely the paperover
> in 9d8e536d36e7 fails to work, so this is another case of bandaid instead
> of mitigating heavy fragmentation at the first place.

Yes, the main problem is the indefinite hang from
dma_alloc_noncontiguous().

So, is the behavior more or less the same even if you pass
__GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
already implicitly set somewhere in the middle? It shouldn't hang
indefinitely, but other impacts on the system, such as the OOM killer
kicking in, may be seen.

As of now, I'm inclined to take the suggested workaround. It'll work
in most cases. The original issue worked around by commit
9d8e536d36e7 still remains, and we need to address it differently.
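
For illustration only, the flow under discussion (one quick attempt at a
single contiguous chunk, then a fallback to progressively smaller
chunks) can be modelled in plain userspace C. This is a rough sketch
under stated assumptions, not the memalloc code: struct fake_heap,
try_contig() and alloc_with_fallback() are hypothetical names, no kernel
API is called, and fragmentation is reduced to a single "largest free
run" number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a fragmented heap: a contiguous request
 * succeeds only if it fits into the largest free run. */
struct fake_heap {
	size_t largest_free_run;	/* bytes */
};

/* One attempt, no retry: returns immediately on failure. */
static bool try_contig(const struct fake_heap *h, size_t size)
{
	return size <= h->largest_free_run;
}

/*
 * Returns the number of chunks used to cover `size`, or 0 on failure.
 * Step 1: try a single contiguous chunk and fail fast.
 * Step 2: fall back to chunks that are halved after each failure;
 *         give up once a chunk of at most `min_chunk` bytes fails.
 */
static size_t alloc_with_fallback(const struct fake_heap *h,
				  size_t size, size_t min_chunk)
{
	size_t chunk = size, remaining = size, nchunks = 0;

	assert(min_chunk > 0);	/* guarantees the loop terminates */
	if (size == 0)
		return 0;

	if (try_contig(h, size))
		return 1;	/* the single-chunk attempt worked */

	while (remaining > 0) {
		if (chunk > remaining)
			chunk = remaining;
		if (try_contig(h, chunk)) {
			remaining -= chunk;
			nchunks++;
			continue;
		}
		if (chunk <= min_chunk)
			return 0;	/* even the smallest chunk failed */
		chunk /= 2;
	}
	return nchunks;
}
```

The point of the model is only that the first attempt must return
quickly for the fallback to be reached at all; how the real allocator
behaves under fragmentation is not captured here.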


thanks,

Takashi