Re: Linux-2.2.0 bad VM behaviour "dd if=/dev/zero of=/dev/hdc bs=256k"

Zygo Blaxell (uixjjji1@umail.furryterror.org)
29 Jan 1999 13:23:06 -0500


In article <Pine.LNX.3.95.990127111433.30467R-100000@penguin.transmeta.com>,
Linus Torvalds <torvalds@transmeta.com> wrote:
>On Wed, 27 Jan 1999, Andrea Arcangeli wrote:
>> 1. get the buffer uptodate
>> 2. write to the buffer the new data in the right place
>> 3. mark the buffer dirty and uptodate
>> 4. release the buffer
>>
>> but we don't start I/O at all, so in a msec we'll be in a state with
>> 128mbyte full of dirty and so not freeable buffers.
>
>This is one of my peeves - we should really start the IO once we've filled
>up "X nr" of full buffers. But I've always been too lazy to do it.

Solaris does this for very small X. Solaris is _much_ slower than
Linux at disk I/O patterns that involve modifying small files, because
it keeps forcing the disk heads to seek too early.

Compare Linux and Solaris on identical hardware tagging just a few
revisions in a CVS repository full of many large files. Linux wins every
time (at least up to Solaris 2.6; don't know if Sun fixed this in 2.7).

Linux will defer the buffer writing for a few seconds (default 5, but
I set that to 30 or 60 using /proc/sys/vm/bdflush). This is a huge
win for performance as the heads are usually kept where they should be
for the next read, and not "distracted" by writes sprinkled throughout.
The writes all happen in bursts, so they can be properly sorted before
they hit the disk, and the total time spent seeking is much less.
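
The current settings are visible by reading /proc/sys/vm/bdflush.  The
exact order and meaning of the fields differs between kernel versions,
so check fs/buffer.c for your tree before writing new values; the buffer
age is expressed in jiffies (HZ=100 on i386, so 3000 jiffies would be 30
seconds).  A minimal sketch that just dumps the current parameters:

    /* Dump the current bdflush tuning parameters.  Read-only on
     * purpose: which field is which varies by kernel version, so
     * look at fs/buffer.c before echoing new values back in. */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/bdflush", "r");
            char line[256];

            if (!f) {
                    perror("/proc/sys/vm/bdflush");
                    return 1;
            }
            if (fgets(line, sizeof line, f))
                    printf("bdflush parameters: %s", line);
            fclose(f);
            return 0;
    }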

Of course, the places where Solaris wins over Linux are huge writes with
no intervening reads, and totally random-access databases; in both cases
the buffer cache just gets in the way.  But the win isn't very large
except in one case.  Here's the real Linux brain damage (a C sketch of
the whole sequence follows the list of steps):

1. Start with a machine that has 100*(number of bytes your hard disks
can write per second) bytes of RAM. Start fresh: boot cleanly,
most of the memory should be free, not cache.

2. mmap() a file on the aforementioned disk with MAP_SHARED,
PROT_READ|PROT_WRITE, the same size as RAM; call the mapping char *ptr
with length size_t size.  The file should be truncated to 0 length and
then have its final size set with ftruncate() prior to the mmap().  It
is important that there be no data actually in the file to read.

3. char *p; for (p = ptr; p < ptr + size; p += 1024) *p = 0;
In other words, dirty every page of the mapping as quickly as possible.
The CPU side of this takes almost no time at all.

4. msync()

5. sync()

6. Go back to step 3.
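
Putting those steps together, the loop looks roughly like this in C.
It's only a sketch: the filename and mapping size below are
placeholders, and a real run would put the file on the disk under test
and make the mapping as large as RAM.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
            const char *filename = "/mnt/test/bigfile";  /* placeholder */
            size_t size = 128 * 1024 * 1024;   /* placeholder: ~RAM size */
            char *ptr, *p;
            int fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) { perror("open"); return 1; }

            /* Step 2: set the final size without writing any data. */
            if (ftruncate(fd, size) < 0) { perror("ftruncate"); return 1; }

            ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (ptr == MAP_FAILED) { perror("mmap"); return 1; }

            for (;;) {
                    /* Step 3: dirty every page as fast as possible. */
                    for (p = ptr; p < ptr + size; p += 1024)
                            *p = 0;

                    /* Steps 4 and 5: force it all out at once. */
                    msync(ptr, size, MS_SYNC);
                    sync();
                    /* Step 6: go back to step 3. */
            }
    }

The sync() after msync() matches step 5: it pushes out whatever dirty
buffers (metadata included) the msync() didn't already cover.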

Between steps 3 and 6, your Linux box will be without a usable disk I/O
subsystem for almost two full minutes. If it happens in a loop and you
aren't executing the code on a console or serial port, you might as well
hit the reset button, because you aren't getting out of this loop today.
If step 3 above is too contrived for you, replace it with a video
capture loop at 140Mbit/second for a similar effect.

Linux e2fs also doesn't pre-allocate space while writing, which causes
the rest of the performance problems for large writes. Every working
real-time data capture Linux program I've written has had to have a
huge non-swappable buffer in process-space memory for holding data while
e2fs reads the inode and block bitmaps searching for a place to extend
the current file. Ideally the filesystem would be doing something
analogous to read-ahead for writes: try to read the bitmaps and inode
tables necessary for the next write beyond end of file.
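
The workaround looks roughly like this.  Again a sketch: BUF_SIZE is a
placeholder that a real capture program would size to ride out the
worst-case stall it expects, and the actual capture/writeout machinery
is elided.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BUF_SIZE (32 * 1024 * 1024)  /* placeholder: worst-case backlog */

    int main(void)
    {
            char *buf = malloc(BUF_SIZE);

            /* Lock the staging buffer into RAM so it can't be swapped
             * out while the filesystem is off reading bitmaps to find
             * room to extend the output file.  Needs root. */
            if (!buf || mlock(buf, BUF_SIZE) < 0) {
                    perror("mlock");
                    return 1;
            }

            /* ... fill buf from the capture device in one path, drain
             * it to the output file in another, so a slow write() never
             * drops incoming data ... */

            munlock(buf, BUF_SIZE);
            free(buf);
            return 0;
    }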

-- 
Zygo Blaxell, Linux Engineer, Corel Corporation, zygob@corel.ca (work),
zblaxell@furryterror.org (play).  It's my opinion, I tell you! Mine! All MINE!
Size of 'cvs diff' between 'cvs.winehq.com' and 'linuxmaster' as of
Fri Jan 29 12:23:01 EST 1999: 44635 line(s)
