Re: Linux 2.6.29

From: Linus Torvalds
Date: Tue Mar 24 2009 - 13:39:59 EST




On Tue, 24 Mar 2009, Jesper Krogh wrote:
>
> Theodore Tso wrote:
> > That's definitely a problem too, but keep in mind that by default the
> > journal gets committed every 5 seconds, so the data gets flushed out
> > that often. So the question is how quickly can you *dirty* 1.6GB of
> > memory?

Doesn't at least ext4 default to the _insane_ model of "data is less
important than meta-data, and it doesn't get journalled"?

And ext3 with "data=writeback" does the same, no?

Both of which are - as far as I can tell - total braindamage. At least
with ext3 it's not the _default_ mode.

I never understood how anybody doing filesystems (especially ones that
claim to be crash-resistant due to journalling) would _ever_ accept the
"writeback" behavior of having "clean fsck, but data loss".

> Say it's a file that you already have in memory cache read in.. there
> is plenty of space in 16GB for that.. then you can dirty it at memory speed..
> that's about ½ sec. (correct me if I'm wrong).

No, you'll still have to get per-page locks etc. If you use mmap(), you'll
page-fault on each page; if you use write(), you'll do all the page lookups
etc. But yes, it can be pretty quick - the biggest cost probably _will_ be
the speed of memory itself (doing one-byte writes at each block would
change that, and the bottleneck would become the system call and page
lookup/locking path, but that's probably in the same rough ballpark as the
cost of writing out one whole page).
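
To make that "one-byte write at each block" case concrete, a sketch (the
file name and the 1GB size are made up): touch one byte per page through an
mmap() of an already-cached file, and every store takes the
fault/lookup/locking path but copies almost nothing, so that path is what
you end up measuring.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1GB, purely illustrative */
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One store per page: each store faults the page in and marks it
	 * dirty, without paying for a full page worth of copying. */
	for (size_t off = 0; off < len; off += page)
		p[off] = 1;

	munmap(p, len);
	close(fd);
	return 0;
}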

That said, this is all why we now have 'dirty_*bytes' limits too.

The problem is that the dirty_[background_]bytes value really should be
scaled up by the speed of IO. And we currently have no way to do that.
Some machines can write a gigabyte in a second with some fancy RAID
setups. Others will take minutes (or hours) to do that (crappy SSDs that
get 25kB/s throughput on random writes).
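
Today that means tuning them by hand. A sketch of what that looks like -
the byte values are made-up examples, not recommendations; the point is
just that they're absolute counts rather than a percentage of memory:

#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	/* Illustrative values: start background writeback at 64MB of
	 * dirty data, and throttle writers outright at 256MB. */
	write_sysctl("/proc/sys/vm/dirty_background_bytes", "67108864");
	write_sysctl("/proc/sys/vm/dirty_bytes", "268435456");
	return 0;
}

Writing the *_bytes knobs overrides the corresponding *_ratio percentages,
but picking the right numbers still depends entirely on how fast the
backing store actually is.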

The "dirty_[background_ratio" percentage doesn't scale up by the speed of
IO either, of course, but at least historically there was generally a
pretty good correlation between amount of memory and speed of IO. The
machines that had gigs and gigs of RAM tended to always have fast IO too.
So scaling up dirty limits by memory size made sense both in the "we have
tons of memory, so allow tons of it to be dirty" sense _and_ in the "we
likely have a fast disk, so allow more pending dirty data" sense.

Linus