Don't worry. Byte-sized accesses aren't really _that_ expensive. In places
where byte-sized copies are natural (and I'd agree that the tty layer
certainly counts as one) just continue to use them. After all, the tty layer
tends to do some operations on those bytes, so it's actually _logical_ to get
them as bytes rather than in larger blocks.

The places I reacted against were not places like the tty layer, but places
that really don't do "byte" operations in the first place. The ELF loader
really does a "memset()", which is definitely not a byte-at-a-time operation
except for the most stupid implementation. Similarly, most fast "strncpy()"
implementations tend to do word copies and do various tricks to find the zero
in the word. And I'm not talking about the kernel implementation here: I'm
talking about optimized _libc_ implementations.
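
To illustrate the "find the zero in the word" trick mentioned above, here is a minimal user-space sketch. It is not taken from any particular libc: the names has_zero_byte() and word_strnlen() are made up for this example, and a 32-bit word is assumed. The scan reads four bytes at a time, uses one arithmetic test to see whether any of them is zero, and only drops back to byte steps to pin down the exact position.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Nonzero if any of the four bytes in w is 0x00.
 * Subtracting 1 from each byte borrows into its top bit only when the
 * byte was zero (or had its top bit set, which "& ~w" filters out).
 */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/*
 * Hypothetical strnlen-style scan: length of s, looking at no more than
 * n bytes. Whole words are only read while at least four bytes remain,
 * so the function never touches memory outside s[0..n-1].
 */
static size_t word_strnlen(const char *s, size_t n)
{
    size_t i = 0;

    while (n - i >= 4) {
        uint32_t w;

        memcpy(&w, s + i, sizeof w);   /* no alignment assumption */
        if (has_zero_byte(w))
            break;                     /* terminator is in this word */
        i += 4;
    }

    /* Find the exact byte, or finish the short tail, one byte at a time. */
    while (i < n && s[i] != '\0')
        i++;

    return i;
}
```

Real implementations usually align the pointer first and use the native word size, but the zero test itself has the same shape.
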

Note that the overhead of "get_user()" is something like 10 assembler
instructions. It's NOT a costly operation in itself, but the instructions do
add up if you keep on doing them ;)

In short: if you actually do some _operation_ on the byte or word you're
fetching, the 10 instructions to fetch it are generally not the problem, and
doing double-buffering would only complicate the code more (certainly more
than 10 instructions). It's only if you're doing things like area copies or
clears that byte-wise operations are really silly, because there is obviously
a much better way to do them.
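
As a rough user-space analogy for the area-clear case (this is not kernel code, and it does not use get_user() or put_user()): the sketch below models the fixed per-access cost as one range check per call. struct region, range_ok(), clear_bytewise() and clear_bulk() are all hypothetical names invented for the example. The byte-wise version pays the check once per byte, while the bulk version pays it once and then hands the whole area to memset().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a checked memory area; not a kernel API. */
struct region {
    unsigned char *base;
    size_t len;
};

/* Counts how many range checks were made, purely for illustration. */
static unsigned long checks_done;

/* Models the fixed per-access overhead: validate [off, off + n). */
static int range_ok(const struct region *r, size_t off, size_t n)
{
    checks_done++;
    return off <= r->len && n <= r->len - off;
}

/* Byte-at-a-time clear: one check per byte written. */
static int clear_bytewise(struct region *r, size_t off, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!range_ok(r, off + i, 1))
            return -1;
        r->base[off + i] = 0;
    }
    return 0;
}

/* Bulk clear: one check for the whole area, then a single memset(). */
static int clear_bulk(struct region *r, size_t off, size_t n)
{
    if (!range_ok(r, off, n))
        return -1;
    memset(r->base + off, 0, n);
    return 0;
}
```

Clearing 16 bytes costs 16 checks the first way and one check the second way, and the gap grows with the size of the area.
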

(And I wouldn't worry about an alpha keeping up with the tty layer. Trust
me, most alphas have no problem at all with keeping up ;)

Linus