Re: Bug in add_dma_entry()'s debugging code

From: Ferry Toth
Date: Thu Nov 30 2023 - 15:08:30 EST


Hi,

On 28-11-2023 at 18:44, Catalin Marinas wrote:
On Tue, Nov 28, 2023 at 10:18:19AM -0500, Alan Stern wrote:
On Tue, Nov 28, 2023 at 02:37:02PM +0100, Christoph Hellwig wrote:
I'd actually go one step back:

1) for not cache coherent DMA you can't do overlapping operations inside
a cache line

Rephrasing slightly: You mustn't perform multiple non-cache-coherent DMA
operations that touch the same cache line concurrently. (The word
"overlapping" is a little ambiguous in this context.)

The problem is worse. I'd say you should not perform even a single
non-cache-coherent DMA (usually from-device or bidirectional) operation
if the cache line is shared with anything else modifying it. It doesn't
need to be another DMA operation. But that's more difficult to add to
the DMA API debug code (maybe something like the bouncing logic in
dma_kmalloc_needs_bounce()).

The logical conclusion from that would be that IFF dma-debug is enabled on
any platform we need to set ARCH_DMA_MINALIGN to the cache line size.

Or just force the kmalloc() min align to cache_line_size(), something
like:

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4a658de44ee9..3ece20367636 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -543,6 +543,8 @@ static inline int dma_get_cache_alignment(void)
 #ifdef ARCH_HAS_DMA_MINALIGN
 	return ARCH_DMA_MINALIGN;
 #endif
+	if (IS_ENABLED(CONFIG_DMA_API_DEBUG))
+		return cache_line_size();
 	return 1;
 }
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8d431193c273..d0b21d6e9328 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -879,7 +879,7 @@ static unsigned int __kmalloc_minalign(void)
 	unsigned int minalign = dma_get_cache_alignment();
 	if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
-	    is_swiotlb_allocated())
+	    is_swiotlb_allocated() && !IS_ENABLED(CONFIG_DMA_API_DEBUG))
 		minalign = ARCH_KMALLOC_MINALIGN;
 	return max(minalign, arch_slab_minalign());

With the above suggestion ("force the kmalloc() min align to cache_line_size()") plus Alan's debug code:

root@yuna:~# journalctl -k | grep hub
kernel: usbcore: registered new interface driver hub
kernel: hub 1-0:1.0: USB hub found
kernel: usb usb1: hub buffer at 71c7180, status at 71c71c0
kernel: hub 1-0:1.0: 1 port detected
kernel: hub 2-0:1.0: USB hub found
kernel: usb usb2: hub buffer at 71c79c0, status at 71c7a00
kernel: hub 2-0:1.0: 1 port detected
kernel: hub 1-1:1.0: USB hub found
kernel: usb 1-1: hub buffer at 65b36c0, status at 6639340
kernel: hub 1-1:1.0: 7 ports detected

and the stack trace indeed goes away.

IOW, the 2 root hub kmalloc() buffers are now also aligned to the cache line size, even though these never triggered the stack trace. Strange: the hub status is allocated far away from the hub buffer; kmalloc mysteries.
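
To make the "shared cache line" condition concrete, here is a minimal userspace sketch, not kernel code. The helper names are hypothetical, the 64-byte line size is an assumption, and the buffer lengths (8 and 4 bytes) are assumed for illustration only; the aligned addresses are the ones printed in the log above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the cache line that contains addr (hypothetical helper). */
static uint64_t cache_line_index(uint64_t addr, uint64_t line_size)
{
	return addr / line_size;
}

/*
 * True if [a, a + a_len) and [b, b + b_len) touch at least one common
 * cache line. Both lengths must be non-zero.
 */
static bool buffers_share_cache_line(uint64_t a, size_t a_len,
				     uint64_t b, size_t b_len,
				     uint64_t line_size)
{
	uint64_t a_first = cache_line_index(a, line_size);
	uint64_t a_last = cache_line_index(a + a_len - 1, line_size);
	uint64_t b_first = cache_line_index(b, line_size);
	uint64_t b_last = cache_line_index(b + b_len - 1, line_size);

	/* Two ranges of line indices intersect iff neither ends before the other starts. */
	return a_first <= b_last && b_first <= a_last;
}
```

With an assumed 64-byte line, buffers_share_cache_line(0x71c7180, 8, 0x71c71c0, 4, 64) is false, matching the aligned log output, while two hypothetical small allocations at 0x1000 and 0x1008 would share a line even though the byte ranges themselves do not overlap.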

This still did not land for me: are we detecting a false alarm here, as the 2 DMA operations can never happen on the same cache line on non-cache-coherent platforms? If so, shouldn't we fix the dma debug code so that it does not raise a false alarm, instead of changing the alignment?
Or is this a bona fide warning (for non-cache-coherent platforms)? Then we should not silence it by forcing the alignment, but issue a WARN (on a cache-coherent platform) that is more useful (i.e. here we have not an overlap but a shared cache line). On a non-cache-coherent platform something stronger than a WARN might be appropriate?

Also note that to_cacheline_number() in kernel/dma/debug.c only takes
into account the L1_CACHE_SHIFT. On arm64 for example, cache_line_size()
returns the maximum line of all the cache levels (and we've seen
hardware where the L1 is 64-byte, L2 is 128).
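
The L1-versus-maximum line size mismatch can be illustrated with plain arithmetic. This is a simplified userspace sketch and not the actual to_cacheline_number() implementation: line_number() is a hypothetical helper, and the shift values 6 (64-byte L1) and 7 (128-byte L2) are assumptions taken from the example sizes mentioned above.

```c
#include <stdint.h>

/* Assumed shifts: 64-byte L1 line and 128-byte maximum line. */
#define ASSUMED_L1_SHIFT	6
#define ASSUMED_MAX_LINE_SHIFT	7

/* Cache line number of addr for a line of (1 << shift) bytes (hypothetical helper). */
static uint64_t line_number(uint64_t addr, unsigned int shift)
{
	return addr >> shift;
}
```

Taking two addresses 64 bytes apart, such as 0x71c7180 and 0x71c71c0, line_number() yields different values with the assumed L1 shift but the same value with the assumed 128-byte shift. A check keyed only on the L1 shift would therefore treat them as separate lines even if a larger outer cache line covers both.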

BUT: we're actually reducing our dependency on ARCH_DMA_MINALIGN by
moving to bounce buffering unaligned memory for non-coherent
architectures,

What's the reason for this? To allow the minimum allocation size to be
smaller than the cache line size? Does the savings in memory make up
for the extra overhead of bounce buffering?

Or is this just to allow people to be more careless about how they
allocate their DMA buffers (which doesn't seem to make sense)?

It's the former and it does make a difference with lots of small
structure or string allocations.

[...]
I get the impression that you would really like to have two different
versions of kmalloc() and friends: one for buffers that will be used in
DMA (and hence require cache-line alignment) and one for buffers that
won't be.

We've been there for the past 2-3 years (and a few other options). It's
hard to guess in a generic way because the allocation place may not
necessarily know how the device is going to access that memory (PIO,
DMA). The conclusion was that for those cases where the buffer is small,
we just do a bounce. If it's performance critical, the driver can use a
kmem_cache_create(SLAB_HWCACHE_ALIGN) and avoid the bouncing.