Re: DMA from/to user-space memory

Robert Kaiser
Fri, 15 May 1998 12:38:30 +0200 (MEST)


[I'm also CC'ing this to linux-kernel as you did. I'm not on that
list, so please, anybody engaging in this discussion, also CC
it to me -- Thanks]

On Thu, 14 May 1998, Linus Torvalds wrote:

> On Thu, 14 May 1998, Robert Kaiser wrote:
> >
> > For a project I recently did, I had to develop a device driver for
> > a frame grabber card. This driver has a requirement to do busmaster
> > DMA directly into user-space buffers. Having developed drivers for
> > other UNIXes before, I was a bit surprised that Linux didn't already
> > provide support for that. I looked through the kernel code and
> > found that it was fairly easy to do by (mis-)using the mlock()
> > function (more precisely, function do_mlock() in mm/mlock.c).
> I feel that misusing the kernel mlock functionality is exactly the wrong
> thing to do. It has horrible latency, and in general it just isn't "the
> right thing". It may be ok for some particular applications, but it has
> lots of down-sides (for example, there is no way for the kernel to handle
> restricted DMA memory with this approach - the mlock approach doesn't know
> about 16M limits or about the need for larger physically contiguous areas).

As I already said, I'm not so much interested in the way these
DMA functions work internally but more in what their API looks like.
If there is a more efficient way of implementing these functions
without using mlock, that's just fine with me (though I won't be
able to help coding it).

My patch already does concatenate physical pages if they happen
to be contiguous, but, yes, it has no way to enforce contiguousness.
Same thing about 16M address limits, though I think this is only
required by (IMHO) obsolete hardware. PCI busmaster devices don't
seem to have this problem. (After all, user-space DMA is intended
as an option for very high-speed devices. Devices capable of such
high data rates, say more than 20 MB per second, will most likely
be PCI devices these days.)
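
The concatenation step mentioned above can be illustrated with a minimal
user-space sketch. This is not the code from the patch: the names
sg_entry, coalesce_pages and PAGE_SZ are hypothetical, and a 4 kB page
size is assumed. It only shows the idea of merging pages that happen to
be physically adjacent into one scatter-gather entry.

```c
#include <stddef.h>

#define PAGE_SZ 4096UL  /* assumption: 4 kB pages, as on i386 */

/* One scatter-gather entry: a physically contiguous run (hypothetical type). */
struct sg_entry {
    unsigned long physaddr;
    unsigned long len;
};

/*
 * Merge the per-page physical addresses in pages[] (given in buffer order)
 * into out[], joining neighbours that happen to be physically adjacent.
 * Returns the number of entries written, or 0 if out[] is too small
 * (0 is also returned for an empty input).
 */
static size_t coalesce_pages(const unsigned long *pages, size_t npages,
                             struct sg_entry *out, size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < npages; i++) {
        /* Extend the previous entry if this page starts where it ends. */
        if (n > 0 && out[n - 1].physaddr + out[n - 1].len == pages[i]) {
            out[n - 1].len += PAGE_SZ;
            continue;
        }
        if (n == max_out)
            return 0;
        out[n].physaddr = pages[i];
        out[n].len = PAGE_SZ;
        n++;
    }
    return n;
}
```

Note that this can only exploit contiguity that already exists; it cannot
create it, which is why enforcing it has to happen at allocation time.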

> For example, some DMA engines are a _lot_ more efficient if they can have
> slightly larger areas in their scatter-gather list: some devices will
> generate an interrupt for _each_ entry in the SG list simply because they
> are too stupid to do this automatically, so they need a bit of
> hand-holding with the interrupt routine pointing them to the next entry.

Very true. It would be nice if there were a way to enforce the use of
physically contiguous pages, but this is not a question of whether
mlock() is used for buffer locking, but more a question of how
buffer _allocation_ is done.

> Anyway, the approach I'd prefer is to have something that is expressly
> DMA-specific (you already added three system calls, so let's make those
> system calls do something really DMA specific),

?? I have not added any *system calls* .... misunderstanding or am I
picking nits ?

I merely added three kernel-level functions that can be called
by device drivers.

> and instead of allocating
> memory and _later_ tell the kernel that you want it for DMA (by that time
> it may be too late to sanely fix up issues like 16M and contiguous memory),
> you have those system calls set the stuff up the way the DMA code wants
> from the very beginning.
> So the kind of interface I'd prefer is more akin to something like this:
>
>     typedef struct {
>         unsigned long physaddr;
>         unsigned long len;
>     } dma_entry_t;
>
>     void * dma_map(void *addr, size_t length,
>                    int prot, int flags,
>                    dma_entry_t * dma_table);
>
> which would work pretty much like mmap() (and if you look at the
> declaration for mmap() you'll find that this one looks similar). The
> "addr" parameter would be the preferred virtual address you'd like the
> mapping on, or NULL if you don't care, while "length" would be the size of
> the area, and "prot" would be the same prot as for mmap(). "flags" would
> be an extension of the mmap flags: you could have
>  - MAP_CONTIGUOUS: require it to be _one_ contiguous chunk and return
>    ENOMEM if none is available.
>  - MAP_LIMITED: require the memory to be limited to physically below the
>    16MB mark
>  - ... any other DMA-specific requirements - this may well be
>    architecture-dependent ...
> And then the "dma_table" would be something that the mapping process fills
> in as it does the chunks. So for example, you could ask for a 64kB area,
> and if you don't specify MAP_CONTIGUOUS, then the kernel might decide to
> give you a "dma_table" that looks like
>     0x00014000, 16384
>     0x00102000, 8192
>     0x001C0000, 8192
>     0x00140000, 32768
> depending on how it actually found memory (so it would try to give you
> largish chunks, but it wouldn't guarantee it). The above information,
> together with the information on where it is mapped in virtual space
> (which is what the dma_map() system call would return) is sufficient for
> you to now have full knowledge of what the virtual mapping for that area
> is.
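
To make the quoted dma_table example concrete, here is a minimal
user-space sketch of how such a table could be walked. dma_entry_t is
taken from the proposal above; the helpers dma_offset_to_phys and
dma_table_bytes are hypothetical, and the sketch assumes that the chunks
appear in the virtual mapping in table order, back to back, which the
proposal does not state explicitly.

```c
#include <stddef.h>

/* Entry type as proposed in the quoted mail. */
typedef struct {
    unsigned long physaddr;
    unsigned long len;
} dma_entry_t;

/*
 * Translate a byte offset into the mapped area to a physical address by
 * walking the table (hypothetical helper). Assumes chunks are mapped in
 * table order. Returns 0 on success, -1 if the offset is past the area.
 */
static int dma_offset_to_phys(const dma_entry_t *table, size_t nentries,
                              unsigned long offset, unsigned long *phys)
{
    for (size_t i = 0; i < nentries; i++) {
        if (offset < table[i].len) {
            *phys = table[i].physaddr + offset;
            return 0;
        }
        offset -= table[i].len;  /* skip this chunk */
    }
    return -1;
}

/* Total number of bytes described by the table (hypothetical helper). */
static unsigned long dma_table_bytes(const dma_entry_t *table, size_t nentries)
{
    unsigned long total = 0;

    for (size_t i = 0; i < nentries; i++)
        total += table[i].len;
    return total;
}
```

With the four entries from the example, the lengths add up to 65536
bytes, i.e. the 64kB that was asked for.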

The *big* problem I have with that is that the _application_ has to be
aware that DMA will be used to fill the buffer. Thus, you will never
be able to write a driver that implements, for instance, the plain
simple read() and write() system calls using user-space DMA, even
if your hardware is smart enough to do 32-bit addressing and unattended
scatter/gather DMA.

It seems to me that all the potential problems you mentioned are not
really related to the use of mlock() for page locking, but originate
from the way the DMA buffers are allocated.

Why not offer applications a special memory allocation routine similar
to malloc(), but with additional parameters that allow you to enforce
physical contiguousness and addresses below 16M? This would allow both:
you could stick to the restrictions if your hardware needs them, but you
could also get rid of them and make full use of your hardware's abilities
without application-level programs needing to be aware of all that.
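
As a rough illustration of what those additional parameters would have to
guarantee, the following user-space sketch checks a list of physical
chunks against two optional restrictions. All names here (DMA_REQ_CONTIGUOUS,
DMA_REQ_BELOW_16M, struct dma_chunk, dma_chunks_ok) are hypothetical and
do not exist in the kernel or in the patch; a real allocator would have
to pick suitable pages in the first place rather than check afterwards.

```c
#include <stddef.h>

/* Hypothetical restriction flags an allocation routine might accept. */
#define DMA_REQ_CONTIGUOUS 0x1u        /* buffer must be one physical run */
#define DMA_REQ_BELOW_16M  0x2u        /* buffer must lie below 16 MB     */

#define DMA_16M_LIMIT      0x1000000UL /* first address above the limit   */

/* One physically contiguous piece of a buffer (hypothetical type). */
struct dma_chunk {
    unsigned long physaddr;
    unsigned long len;
};

/*
 * Return 1 if the chunks (in buffer order) satisfy every restriction
 * requested in flags, 0 otherwise. An empty list never qualifies.
 */
static int dma_chunks_ok(const struct dma_chunk *c, size_t n, unsigned flags)
{
    if (n == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        /* Each chunk must start exactly where the previous one ended. */
        if ((flags & DMA_REQ_CONTIGUOUS) && i > 0 &&
            c[i - 1].physaddr + c[i - 1].len != c[i].physaddr)
            return 0;

        /* The whole chunk, including its last byte, must be below 16 MB. */
        if ((flags & DMA_REQ_BELOW_16M) &&
            (c[i].len > DMA_16M_LIMIT ||
             c[i].physaddr > DMA_16M_LIMIT - c[i].len))
            return 0;
    }
    return 1;
}
```

Passing flags of 0 accepts any non-empty chunk list, which corresponds to
hardware that can do 32-bit addressing and unattended scatter/gather.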

> So now you could build up your own scatter-gather table any way you'd like
> to (which gives you quite a lot of freedom: you might want to include the
> same physical range more than once, for example (and yes, those kinds of
> things _do_ make sense - imagine graphics-related DMA where you want to
> DMA patterns or similar)).

There is nothing keeping you from doing that with the API I suggested.
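
For illustration, a scatter-gather list that names the same physical
range more than once could be built along the following lines. This is
only a sketch under assumptions: struct sg_entry and sg_repeat_pattern
are hypothetical names, not part of either proposed API, and the code
only manipulates the list, it does not program any hardware.

```c
#include <stddef.h>

/* One scatter-gather entry (hypothetical type). */
struct sg_entry {
    unsigned long physaddr;
    unsigned long len;
};

/*
 * Fill out[] with entries that cover `total` bytes by pointing at the
 * same physical pattern chunk again and again; the last entry is
 * shortened if total is not a multiple of the pattern length.
 * Returns the number of entries written, or 0 if the pattern is empty
 * or out[] is too small.
 */
static size_t sg_repeat_pattern(struct sg_entry pattern, unsigned long total,
                                struct sg_entry *out, size_t max_out)
{
    size_t n = 0;

    if (pattern.len == 0)
        return 0;

    while (total > 0) {
        if (n == max_out)
            return 0;
        out[n].physaddr = pattern.physaddr;
        out[n].len = total < pattern.len ? total : pattern.len;
        total -= out[n].len;
        n++;
    }
    return n;
}
```

Either interface allows this, since the list is assembled from physical
addresses and lengths that the caller already knows.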

> Done right, you never need to have any "dma_unmap()", because you can just
> use the normal "munmap()" on the region when you're done. Similarly, I
> suspect that your "build_sglist()" is unnecessary, because you can do all
> the building in user space because you have full information about what's
> up.

Yes, but the _need_ for the application to be aware of the DMA being used
by the driver is a big disadvantage IMHO.



Robert Kaiser email:
Carl-Zeiss-Str. 41 phone: (49) 6131 9138-80
D-55129 Mainz / Germany fax: (49) 6131 9138-10
