The application (a userspace filesystem with its own cache) manages memory in 4K pages, but can perform much larger I/Os, for example during readahead and after merging writes. After a very short while memory is completely fragmented. I think what happened is that the submitted iocbs (64 iocbs of 4K each) did not merge because the device queue depth was very large; no queueing occurred, and (I imagine) merging only happens while a request sits in the queue waiting for the disk to become ready.
Why did you submit 64 iocbs of 4K each? Was every page virtually
discontiguous, or did you arbitrarily decide to create a worst case?