Correct use of DMA api (Some newbie questions)

From: Nikolai Zhubr
Date: Sun Jul 14 2019 - 13:01:37 EST


Hi all,

After reading several (apparently contradictory) revisions of the DMA API references in Documentation/DMA-*.txt, some (contradictory) discussions thereof, and even digging through the in-tree drivers in search of a good, enlightening example, I still have to ask for advice.

I'm crafting a tiny driver (or rather, a kernel-mode helper) for a very special PCIe device. It actually works already, but performs differently on different kernels. I'm targeting x86 (i686) only (although preferably the driver should stay platform-neutral) and I need to support kernels 4.9+. Due to how the device is designed and used, very little has to be done in kernel space. The device has a large internal memory, which accumulates some measurement data, and it is capable of transferring that data to the host using DMA (with at least a 32-bit address space available). Arranging memory for DMA is pretty much the only thing that userspace cannot reasonably do, so this needs to be in the driver. So the layout I'm currently attempting is as follows:

1. In the (kernel-mode) driver, allocate a large contiguous block of physical memory to DMA into. It will later be reused several times. This block does not need a kernel-mode virtual address because it will never be accessed from the driver directly. The block size is typically 128M and I use CMA=256M. Currently I use dma_alloc_coherent(), but for performance reasons (see below) I'm not convinced it really needs to be strictly coherent memory. Also, AFAICS on x86 dma_alloc_coherent() always creates a kernel address mapping anyway, so maybe I'd better simply kmalloc() with a subsequent dma_map_single()?

2. Upon DMA completion (from device to host), some sort of barrier/synchronization might be necessary (to be safe WRT speculative loads, caches, etc.), like dma_cache_sync() or dma_sync_single_for_cpu(). However, the latter looks like a no-op on x86 AFAICS, and the former is apparently flush_write_buffers(), which is not very involved either (asm lock; nop) and does not look useful for my case. Currently I do not use either, and it seems to work OK, maybe by pure luck. So, is it really that trivially simple on x86, or am I missing something horribly big here?

3. mmap this buffer for userspace. Reading from it should be as fast as possible, so AFAICS this block should be cacheable (and prefetchable, and whatever else helps performance), at least from the userspace context. It is not quite clear whether such properties depend on the block allocation method (step 1 above) or only on the remapping attributes. Currently I use dma_mmap_coherent() for the mmap, but it also seems possible to use remap_pfn_range() and to adjust vm_page_prot somewhat. I've already found that e.g. pgprot_noncached() hurts performance quite a lot, but presumably without it some DMA barrier (step 2 above) is still necessary?
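
For reference, the userspace side of step 3 looks roughly like the sketch below. It is only an illustration under assumptions: the node path, the helper names (map_readonly, sum_bytes, unmap_buffer) and the use of offset 0 for the DMA buffer are all made up here, and the real code differs in detail. It only shows how the buffer is consumed, i.e. a plain read-only shared mapping that is then read sequentially.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first `len` bytes exposed by the node at `path` read-only.
 * `path` (e.g. "/dev/mydev") is a hypothetical name; offset 0 is assumed
 * to be where the driver's mmap handler places the DMA buffer.
 * Returns NULL on failure.
 */
static const unsigned char *map_readonly(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* an established mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : (const unsigned char *)p;
}

/* Touch every byte once; a stand-in for the real data processing. */
static uint64_t sum_bytes(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Drop the mapping; returns 0 on success, -1 on error (as munmap() does). */
static int unmap_buffer(const unsigned char *buf, size_t len)
{
    return munmap((void *)buf, len);
}
```

Usage would be something like map_readonly("/dev/mydev", (size_t)128 << 20) once the device has signalled DMA completion, followed by the processing loop and unmap_buffer(). Nothing on this side selects caching attributes; as far as I understand, those come entirely from the page protection the driver's mmap handler sets up, which is exactly the part I am unsure about.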

Any hints greatly appreciated,

Regards,
Nikolai