On Mon, 13 Apr 2009, Avi Kivity wrote:
> > - create a big file,
>
> Just creating a 5GB file in a 64KB filesystem was interesting - Windows was throwing out 256KB I/Os even though I was generating 1MB writes (and cached too). Looks like a paranoid IDE driver (qemu exposes a PIIX4).
Heh, ok. So the "big file" really only needed to be big enough to not be cached, and 5GB was probably overkill. In fact, if there's some way to blow the cache, you could have made it much smaller. But 5G certainly works ;)
[...]
> (a) Windows caches things with a 4kB granularity, so the 512-byte write turned into a read-modify-write
You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for your example!). It's a total disaster. Imagine what would happen to user application performance if kmalloc() always returned 16kB-aligned chunks of memory, all sized as integer multiples of 16kB. It would absolutely _suck_. Sure, it would be fine for your large allocations, but any time you handle strings you'd be allocating 16kB of memory for every small 5-byte string. You'd have horrible cache behavior, and you'd run out of memory much too quickly.
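Just to put rough numbers on that (a purely illustrative userspace calculation - the allocation sizes are made up, nothing here is measured):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical small allocations (bytes) vs allocator granularity */
		unsigned long sizes[] = { 5, 64, 500, 4096 };
		unsigned long grans[] = { 64, 4096, 16384, 65536 };

		for (int g = 0; g < 4; g++) {
			printf("granularity %6lu:", grans[g]);
			for (int s = 0; s < 4; s++) {
				/* round every allocation up to the granularity */
				unsigned long rounded =
					(sizes[s] + grans[g] - 1) / grans[g] * grans[g];
				printf("  %lub -> %lub", sizes[s], rounded);
			}
			printf("\n");
		}
		return 0;
	}

At a 64kB granularity that 5-byte string costs you 65536 bytes - a four-orders-of-magnitude overhead - while a sane small-object allocator gets it down to a few dozen bytes.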
The same is true in the kernel. The single biggest memory user under almost all normal loads is the disk cache. That _is_ the normal allocator for any OS kernel. Everything else is almost details (ok, so Linux in particular does cache metadata very aggressively, so the dcache and inode cache are seldom "just details", but the page cache is still generally the most important part).
So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane system does that. It's only useful if you absolutely _only_ work with large files - ie you're a database server. For just about any other workload, that kind of granularity is totally unacceptable.
So doing a read-modify-write on a 1-byte (or 512-byte) write when the block size is 4kB is easy - we just have to do it anyway.
Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is also _doable_, and from the IO pattern standpoint it is no different. But from a memory allocation pattern standpoint it's a disaster - because now you're always working with chunks that are just 'too big' to be good building blocks of a reasonable allocator.
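For reference, the read-modify-write itself is conceptually trivial - something like the sketch below (illustrative only: plain pread()/pwrite(), the write assumed to fit inside a single block, and 'blocksize' could be 4kB or 64kB without changing the I/O pattern):

	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	/* Write 'len' bytes at 'pos' to a file that only does whole
	 * blocks.  Assumes the write does not cross a block boundary. */
	static int rmw_write(int fd, const void *buf, size_t len,
			     off_t pos, size_t blocksize)
	{
		off_t start = pos & ~((off_t)blocksize - 1);
		char *block = malloc(blocksize);
		int ret = -1;

		if (!block)
			return -1;

		/* read the whole containing block ... */
		if (pread(fd, block, blocksize, start) != (ssize_t)blocksize)
			goto out;

		/* ... patch in the new bytes ... */
		memcpy(block + (pos - start), buf, len);

		/* ... and write the whole block back out */
		if (pwrite(fd, block, blocksize, start) == (ssize_t)blocksize)
			ret = 0;
	out:
		free(block);
		return ret;
	}

The I/O side is identical either way; the difference is that with a 64kB blocksize every one of those blocks also has to live in memory as a 64kB cache unit.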
If you always allocate 64kB for file caches, and you work with lots of small files (like a source tree), you will literally waste all your memory.
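A back-of-the-envelope example (the numbers are entirely made up, but in the right ballpark for something like a kernel source tree with ~30,000 small files):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical source tree: 30,000 small files */
		unsigned long long files = 30000;

		/* cache cost if every cached file needs at least one unit */
		unsigned long long with_4k  = files * 4096;
		unsigned long long with_64k = files * 65536;

		printf("4kB pages:   ~%llu MB of page cache\n", with_4k  >> 20);
		printf("64kB blocks: ~%llu MB of page cache\n", with_64k >> 20);
		return 0;
	}

Same data, roughly 117MB vs 1875MB of cache - a 16x blow-up that is almost entirely padding.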