You're right, and you're not. There are two reasons why modern 3D GPUs keep the world mesh in the card's onboard memory and do all the T&L on the GPU. One reason is that it's faster in dedicated hardware. The other, more pressing, reason is that the host interface (PCI or AGP) puts a very low upper bound on your triangle rendering rate.
The number of parameters necessary to specify a single texture-mapped triangle is literally in the dozens. Even if you used only 32-bit fixed point (16.16) for coordinates, that's still a huge amount of data to move across the bus just to instruct the chip to render a single triangle. And what if that triangle ends up covering a single pixel? Think about the waste!
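To put rough numbers on it, here's a back-of-the-envelope sketch in C. The vertex layout and the ~133 MB/s figure (classic 32-bit/33 MHz PCI peak) are illustrative assumptions, not any specific card's format:

```c
/* Back-of-the-envelope: how many bytes cross the bus per triangle if
 * the host sends fully-specified vertices, and what that implies for
 * the bus-limited triangle rate.  The vertex layout below is a
 * plausible hypothetical, not a real card's command format. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    int32_t x, y, z;    /* position, 16.16 fixed point */
    int32_t u, v;       /* texture coordinates, 16.16 fixed point */
    int32_t r, g, b, a; /* per-vertex color for Gouraud shading */
} Vertex;               /* 9 parameters * 4 bytes = 36 bytes */

int main(void) {
    size_t tri_bytes = 3 * sizeof(Vertex);  /* 108 bytes per triangle */
    double pci_peak = 133e6;                /* 32-bit/33 MHz PCI, bytes/s */
    double tris_per_sec = pci_peak / tri_bytes;

    printf("bytes per triangle: %zu\n", tri_bytes);
    printf("bus-limited rate:   %.2f million triangles/s (theoretical)\n",
           tris_per_sec / 1e6);
    return 0;
}
```

Even at 100% bus utilization, which no real system gets anywhere near, that tops out around a million triangles per second, before a single pixel has been drawn.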
Instead, the un-transformed geometry is loaded into card memory only once, and the GPU is instructed to render the scene based on the camera perspective and lighting information. Aside from caching textures on-card, this is another reason cards need LOTS of graphics memory.
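A minimal sketch of that "upload once, render many times" model, using OpenGL vertex buffer objects (assuming a GL 1.5+ header or extension loader provides these entry points; mesh_verts and mesh_count are hypothetical inputs):

```c
#include <stddef.h>
#include <GL/gl.h>

void upload_static_mesh(const float *mesh_verts, size_t mesh_count)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* One bus transfer: the un-transformed model-space vertices move
     * into card memory and stay there (GL_STATIC_DRAW hints exactly
     * that usage pattern to the driver). */
    glBufferData(GL_ARRAY_BUFFER,
                 mesh_count * 3 * sizeof(float),  /* x,y,z per vertex */
                 mesh_verts, GL_STATIC_DRAW);
}

/* Each frame thereafter, the host sends only a handful of bytes --
 * camera and lighting parameters -- plus a draw call, and the GPU
 * transforms and rasterizes out of its own memory. */
```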
I guess I would look at this as an opportunity to make a "visual coprocessor" that also has the hardware necessary to output to a monitor (preferably multiple monitors).
I don't think that's practical. We could build it, but for the reasons above it would have terrible performance.