Re: Thread-private mappings and graphics (was Re: Per-Processor Data

James Simmons (jsimmons@edgeglobal.com)
Sun, 19 Dec 1999 22:11:21 -0500 (EST)


This thread is getting old, and nobody, myself included, is going to budge on
this issue, so this is the last I will say about it.

BS point #1: the multiple-writes/page-granularity issue. Obviously the
pagefault tricks are meant to be used with a non-direct-rendering system.
You do provide something that behaves like direct rendering to userland, but
on a page fault the kernel first grabs the commands and validates them. On a
sane card you can just mmap the real accel region; see my earlier post to
Linus for the problems you hit on poorly designed cards and the solutions you
can implement with page faults. If you are going to use a direct rendering
system on poor hardware and lose all your stability and security anyway, you
might as well get rid of everything between Mesa and the hardware.
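
For the curious, the trap-and-validate idea looks roughly like the following
userspace analogy. The real thing lives in the driver's fault handler, the
mapping here is just an anonymous page standing in for the accel region, and
the "validation" is only a stub:

/* Userspace analogy of trap-and-validate: the client writes to what it
 * thinks is the accel region, the fault handler steps in, validates the
 * queued commands, and only then lets the write through. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile unsigned int *fake_regs;   /* stand-in for the mmapped accel region */
static size_t pagesz;

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
        (void)sig; (void)ctx;
        /* a real driver would sanity-check the queued commands here */
        write(STDERR_FILENO, "fault: validating commands\n", 27);
        /* open the page up so the faulting store can complete */
        mprotect((void *)((uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1)),
                 pagesz, PROT_READ | PROT_WRITE);
}

int main(void)
{
        struct sigaction sa;

        pagesz = (size_t)sysconf(_SC_PAGESIZE);
        fake_regs = mmap(NULL, pagesz, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        fake_regs[0] = 0xdeadbeef;         /* faults, gets validated, then lands */
        fprintf(stderr, "wrote %#x after validation\n", fake_regs[0]);
        return 0;
}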

BS point #2: there is no such thing as a setuid library. Sure there is:
XFree86 is exactly that. You can argue about the difference between an
application and a library, but within X you make Xlib calls that end up with
userspace code hitting hardware registers directly; I'd say there is no
meaningful difference.

BS point #3: in-kernel acceleration bloats the kernel. Not really, by any
definition of 'bloat' I'd call reasonable. In-kernel acceleration usually
involves switching on a command word, jumping to the appropriate accel
programming function, doing some simple validation checks on the accel
parameters, and then writing to a bunch of registers or a DMA buffer.
There's just not enough code there to cause any significant amount of
kernel bloat. There's no .data overhead, and much of the code can be
shared between drivers in any case. You can set up the drivers so that
acceleration handling is in a separate module, so that it won't have to be
resident in-kernel unless it is actually being used. Look at the existing OSS
sound drivers in the kernel if you want to see how onerous this bloat is in
real life: it is basically unnoticeable, and the ALSA drivers will be even
less bloated while still doing no "direct sound rendering" from userspace.
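
To give an idea of the scale involved, the whole accel path I am describing
amounts to something like this. Toy code, not from any real driver: the
opcodes, the register layout and the screen size are invented, and in a real
driver blit_regs would be the ioremapped MMIO window rather than an array:

/* Sketch of an in-kernel accel dispatcher: switch on a command word,
 * check the parameters, poke registers. */
#include <stdio.h>

struct accel_cmd {
        unsigned int op;                   /* command word */
        int x, y, w, h;
        unsigned int color;
};

#define OP_FILLRECT 1
#define OP_COPYRECT 2

enum { FB_WIDTH = 1024, FB_HEIGHT = 768 };

static unsigned int fake_mmio[8];
static volatile unsigned int *blit_regs = fake_mmio;

static int do_accel_cmd(const struct accel_cmd *c)
{
        /* a few cheap sanity checks: refuse anything off-screen */
        if (c->x < 0 || c->y < 0 || c->w <= 0 || c->h <= 0 ||
            c->x + c->w > FB_WIDTH || c->y + c->h > FB_HEIGHT)
                return -1;

        switch (c->op) {
        case OP_FILLRECT:
                blit_regs[0] = c->color;
                blit_regs[1] = ((unsigned)c->x << 16) | (unsigned)c->y;
                blit_regs[2] = ((unsigned)c->w << 16) | (unsigned)c->h; /* kick */
                return 0;
        case OP_COPYRECT:
                /* same idea, different registers */
                return 0;
        default:
                return -1;                 /* unknown command word */
        }
}

int main(void)
{
        struct accel_cmd fill = { OP_FILLRECT, 10, 10, 64, 64, 0x00ff00 };
        struct accel_cmd bad  = { OP_FILLRECT, -5, 0, 9000, 64, 0 };

        printf("fill: %d, bad: %d\n", do_accel_cmd(&fill), do_accel_cmd(&bad));
        return 0;
}

That, repeated for a handful of commands, is the whole "bloat" we are arguing
about.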

BS point #4: direct rendering is the only way to get acceptable speed
under Linux. Utter hogwash. DRI improves command _latency_ by
eliminating the expensive kernel-user transition, but it does not improve
performance AT ALL unless the latency of the kernel-user transition
results in your not being able to feed the hardware accel engine quickly
enough. This dramatic latency hit is present for ioctl-per-primitive, but
no one seriously considers using ioctls for anything except debugging,
where speed doesn't really matter.
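
To put rough numbers on it (ballpark figures, not measurements from any
particular card or CPU):

    one frame at 100 fps                  = 10,000 microseconds of budget
    one kernel<->user round trip          ~ a few microseconds
    a handful of buffer flushes per frame ~ well under 1% of the frame

The transition only starts to hurt when you insist on paying it once per
primitive.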

With framerates rarely exceeding 100 frames per second, the
inherent latency in any rendering system is going to be huge no matter
what you do, so unless you compare DRI against a command transfer system
with absurdly high latencies like ioctls you'll see ZERO performance gain
from direct rendering. Even a 'dumb' system such as sending one ioctl
every VSYNC (max 100 ring transitions per second, not at all unreasonable)
to flush a static command buffer would probably work just fine, even with
command validation. If you use a streaming flush-on-pagefault scheme such as
pingpong buffers, though, you get perfectly acceptable latency and
performance with even less overhead and with 100% command sanity validation.
On SMP systems, a kernel thread processing one command buffer in parallel
with userspace filling the next reduces the overhead even further; DRI
ignores SMP at best and is broken by it at worst.
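
If that sounds hand-wavy, here is the pingpong idea modelled in userspace,
with a worker thread standing in for the kernel thread. The buffer sizes and
the explicit flush() call are made up; in the real scheme the pagefault on
the exhausted buffer triggers the handoff:

/* Userspace model of the pingpong scheme: the app fills one buffer while a
 * worker validates and drains the other, then they swap. */
#include <pthread.h>
#include <stdio.h>

#define BUF_CMDS 256

struct cmdbuf {
        unsigned int cmds[BUF_CMDS];
        int count;
};

static struct cmdbuf bufs[2];
static int fill_idx;                    /* buffer the app is currently filling */
static int pending = -1;                /* buffer handed to the consumer, -1 = none */
static int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;

static void *consumer(void *arg)        /* the "kernel thread" */
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done || pending >= 0) {
                while (pending < 0 && !done)
                        pthread_cond_wait(&kick, &lock);
                if (pending >= 0) {
                        struct cmdbuf *b = &bufs[pending];
                        pthread_mutex_unlock(&lock);
                        /* validate the commands and feed the hardware here */
                        fprintf(stderr, "flushed %d commands\n", b->count);
                        b->count = 0;
                        pthread_mutex_lock(&lock);
                        pending = -1;   /* that buffer is free again */
                        pthread_cond_signal(&kick);
                }
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void flush(void)                 /* the pagefault would trigger this */
{
        pthread_mutex_lock(&lock);
        while (pending >= 0)            /* previous buffer still being drained */
                pthread_cond_wait(&kick, &lock);
        pending = fill_idx;
        fill_idx ^= 1;                  /* keep filling the other buffer */
        pthread_cond_signal(&kick);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t tid;
        int frame, i;

        pthread_create(&tid, NULL, consumer, NULL);
        for (frame = 0; frame < 3; frame++) {
                for (i = 0; i < BUF_CMDS; i++)
                        bufs[fill_idx].cmds[bufs[fill_idx].count++] = i;
                flush();
        }
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&kick);
        pthread_mutex_unlock(&lock);
        pthread_join(tid, NULL);
        return 0;
}

Even this toy version keeps one CPU filling while the other drains and
validates, which is exactly the parallelism DRI throws away.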

[Strongest point]

And now that you are doing more of the command processing in the
kernel, you can realistically expect to make good use of FIFO high/low
watermark interrupts, command timeout interrupts, etc which are present on
modern video cards. These interrupts are designed to help the driver
balance the state of hardware and software command FIFOs, to prevent
buffer underrun or overrun stalls and keep the hardware fed and rendering
smoothly. This _cannot_ be done with DRI because fast IRQ callbacks to
userspace don't exist in Linux, and even if you wrote up a stub driver to
inform userspace of the interrupts it would still introduce so much
latency that the interrupts would become useless. And these FIFO
balancing IRQs are _FAR_ from the only example of very useful hardware
features that DRI is unable to make use of.
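
To make it concrete, the refill path for a low-watermark interrupt is on the
order of the following. The FIFO depth, the watermark and the hw_* helpers
are all invented; in a driver the "hardware" side would be a status register
and a FIFO port:

/* A low-watermark IRQ lets the driver top the hardware FIFO back up
 * straight from an in-kernel command ring, with no userspace round trip. */
#include <stdio.h>

#define FIFO_SIZE    256               /* depth of the hardware FIFO (invented) */
#define FIFO_HIGH_WM 192               /* refill up to this many queued entries */

struct cmd_ring {
        unsigned int *cmds;
        unsigned int head, tail, size;  /* software command ring in the kernel */
};

/* stand-ins for reading the FIFO status register and writing the FIFO port */
static unsigned int hw_queued;          /* entries currently in the "hardware" FIFO */
static unsigned int hw_fifo_space(void) { return FIFO_SIZE - hw_queued; }
static void hw_fifo_write(unsigned int cmd) { (void)cmd; hw_queued++; }

/* what the card's IRQ handler would call when the FIFO falls below the
 * low watermark */
static void fifo_low_watermark_irq(struct cmd_ring *ring)
{
        unsigned int queued = FIFO_SIZE - hw_fifo_space();

        while (queued < FIFO_HIGH_WM && ring->head != ring->tail) {
                hw_fifo_write(ring->cmds[ring->head]);
                ring->head = (ring->head + 1) % ring->size;
                queued++;
        }
}

int main(void)
{
        unsigned int cmds[512] = { 0 };
        struct cmd_ring ring = { cmds, 0, 300, 512 };   /* 300 commands queued */

        hw_queued = 32;                 /* pretend the FIFO just hit the low mark */
        fifo_low_watermark_irq(&ring);
        printf("hardware FIFO now holds %u entries, ring head at %u\n",
               hw_queued, ring.head);
        return 0;
}

The whole refill runs in interrupt context, microseconds after the FIFO
drains; there is simply nowhere to put this in a userspace-only design.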

In sum, the only serious argument in favor of userspace direct
rendering (it is needed for acceptable performance) is a complete and
total strawman. Once that strawman has been knocked down, you have to start
asking what features are being sacrificed in its name. And once you start to
compile the list of what you have to give up for DRI (watermark interrupts,
hardware semaphores, etc.), you will find that it quickly grows to
unacceptable proportions.

James Simmons (o_
fbdev/gfx developer (o_ (o_ //\
http://www.linux-fbdev.org (/)_ (/)_ V_/_
http://linuxgfx.sourceforge.net
