Re: True fsync() in Linux (on IDE)

From: Heikki Tuuri
Date: Mon Mar 22 2004 - 08:14:52 EST


Hi!

I have written the InnoDB backend to MySQL. Some notes on the fsync()
processing problem:

1. It is dangerous for a database if fsync'ed files are physically written
to the disk in an order different from the order in which the fsync's were
called on them. In a power outage this can cause database corruption.

For example, a database must make sure that the log file is written to the
disk at least up to the 'log sequence number' of any data page written to
disk. Thus, we must first write to the log file and call fsync() on it, and
only after that are we allowed to write the data page to a data file and call
fsync() on the data file.
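
A minimal sketch of that ordering in C, with illustrative descriptors
and offsets (log_fd, data_fd, page_off and the function name are
placeholders, not InnoDB's actual interfaces):

#include <sys/types.h>
#include <unistd.h>

int flush_page(int log_fd, int data_fd,
               const void *log_rec, size_t log_len,
               const void *page, size_t page_len, off_t page_off)
{
        /* 1. The log record first, and make it durable. */
        if (write(log_fd, log_rec, log_len) != (ssize_t) log_len)
                return -1;
        if (fsync(log_fd) != 0)
                return -1;

        /* 2. Only after the log fsync may the data page go out. */
        if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
                return -1;
        return fsync(data_fd);
}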

2. An 'atomic' file write in the OS does not solve the problem of partially
written database pages in a power outage if the disk drive is not guaranteed
to stay operational long enough to be able to write the whole page
physically to disk. An InnoDB data page is 16 kB, which is probably not
guaranteed to correspond to any 'atomic' unit of physical disk writes.
However, in
practice, half-written pages (either because of the OS or the disk) seem to
be very rare.

3. Jeffrey Siegal wrote to me that he checked a few disk drives to see
whether they support a cache flush. Some of them did, others did not. If
the disk drive does not support a cache flush, then the only way to do a
proper fsync is to configure it not to cache writes at all. In some
drives, though, even that non-caching configuration option may be missing.
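
For drives that do support it, the flush can at least be requested from
user space. A hedged sketch, assuming a Linux IDE drive and the
HDIO_DRIVE_CMD ioctl from <linux/hdreg.h>; the ATA FLUSH CACHE opcode is
0xE7, and whether the drive actually honors it is exactly the open
question above:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

#ifndef WIN_FLUSH_CACHE
#define WIN_FLUSH_CACHE 0xE7            /* ATA FLUSH CACHE opcode */
#endif

int flush_drive_cache(const char *dev)  /* e.g. "/dev/hda", needs root */
{
        unsigned char args[4] = { WIN_FLUSH_CACHE, 0, 0, 0 };
        int fd = open(dev, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        /* Fails on drives that do not implement the command. */
        if (ioctl(fd, HDIO_DRIVE_CMD, args) != 0) {
                perror("HDIO_DRIVE_CMD(FLUSH CACHE)");
                close(fd);
                return -1;
        }
        close(fd);
        return 0;
}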

Best regards,

Heikki Tuuri
Innobase Oy
http://www.innodb.com

...........
List: linux-kernel
Subject: Re: True fsync() in Linux (on IDE)
From: Peter Zaitsev <peter () mysql ! com>
Date: 2004-03-20 19:48:23
Message-ID: <1079812102.3182.31.camel () abyss ! local>

On Sat, 2004-03-20 at 02:20, Jamie Lokier wrote:
> Peter Zaitsev wrote:
> > If the file system guaranteed atomicity of write() calls (synchronous
> > would be enough), we could disable it and get good extra performance.
>
> Store an MD5 or SHA digest of the page in the page itself, or elsewhere.
> (Obviously the digest doesn't include the bytes used to store it).
>
> Then partial write errors are always detectable, even if there's a
> hardware failure, so journal writes are effectively atomic.

Jamie,

The problem is not detecting the partial page writes, but dealing with
them. Obviously there is a checksum on the page (it is, however, not
MD5/SHA, which are designed for cryptographic needs), so page corruption
is detected if it happens for whatever reason.

The problem is that you can't do anything with the page if only an
unknown portion of it was modified.
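
To be clear, the detection half really is the cheap part; a sketch with
a made-up checksum and page trailer layout (not InnoDB's actual format):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 16384

static uint32_t page_checksum(const unsigned char *page)
{
        uint32_t sum = 0;
        size_t i;

        /* Cover everything except the 4-byte trailer storing the sum. */
        for (i = 0; i < PAGE_SIZE - 4; i++)
                sum = sum * 31 + page[i];
        return sum;
}

/* Returns 1 if the page looks intact, 0 if (partially) torn. */
static int page_intact(const unsigned char *page)
{
        uint32_t stored;

        memcpy(&stored, page + PAGE_SIZE - 4, 4);
        return stored == page_checksum(page);
}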

InnoDB uses a sort of "logical" logging, which just says something like
"delete row #2 from page #123", so if a page is badly corrupted the log
will not help to recover it.
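
Roughly, such a record carries only the operation and its target, along
these lines (a hypothetical shape, field names made up):

struct logical_log_rec {
        unsigned int type;      /* e.g. a made-up LOG_DELETE_ROW */
        unsigned int page_no;   /* page #123 */
        unsigned int row_no;    /* row #2 */
        /* no page image: replay needs an intact page to apply this to */
};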

Of course you can log full pages, but this will increase overhead
significantly, especially for small row sizes.

This is why the solution now is to use a long-term "logical" log and a
short-term "physical" log, which the background page writer uses before
writing pages to their original locations.
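
A sketch of that write path, again with assumed descriptors and offsets
rather than InnoDB's real interfaces: the full page image is made
durable in the physical log first, so a write torn at the final
location can be repaired from the intact copy during recovery:

#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 16384

int write_page_in_place(int phys_log_fd, int data_fd,
                        const char page[PAGE_SIZE],
                        off_t log_off, off_t page_off)
{
        /* 1. Full page image to the short-term physical log, durably. */
        if (pwrite(phys_log_fd, page, PAGE_SIZE, log_off) != PAGE_SIZE)
                return -1;
        if (fsync(phys_log_fd) != 0)
                return -1;

        /* 2. Only now overwrite the page at its original location; if
         *    this write is torn by a crash, recovery re-copies the
         *    page from the physical log instead of replaying logical
         *    records against a corrupt page. */
        if (pwrite(data_fd, page, PAGE_SIZE, page_off) != PAGE_SIZE)
                return -1;
        return fsync(data_fd);
}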


--
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com
