Re: [sqlite] light weight write barriers

From: David Lang
Date: Fri Nov 16 2012 - 14:14:52 EST


On Fri, 16 Nov 2012, Howard Chu wrote:

David Lang wrote:
barriers keep getting mentioned because they are an easy concept to understand:
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done", and they fit well with the requirements of the
users.

Users readily accept that if the system crashes, they will lose the most recent
stuff that they did,

*some* users may accept that. *None* should.

When users are given a choice between having all their work be very slow, or having it be fast but, in the unlikely event of a crash, losing their most recent changes, they are willing to lose their most recent changes.

If you think about it, this is not much different from the fact that you lose all changes since the last time you saved the thing you are working on. Many programs save state periodically so that if the application crashes the user hasn't lost everything, but any application that tried to save after every single change would be so slow that nobody would use it.

There is always going to be a window after a user hits 'save' where the data can be lost, because it's not yet on disk.

There are a couple industry failures here:

1) the drive manufacturers sell drives that lie, and consumers accept it because they don't know better. We programmers, who know better, have failed to raise a stink and demand that this be fixed.
A) Drives should not lose data on power failure. If a drive accepts a write request and says "OK, done" then that data should get written to stable storage, period. Whether it requires capacitors or some other onboard power supply, or whatever, they should just do it. Keep in mind that today, most of the difference between enterprise drives and consumer desktop drives is just a firmware change; the hardware is already identical. Nobody should accept a product that doesn't offer this guarantee. It's inexcusable.

This option is available to you. However, if you have enabled write caching and reordering, you have explicitly told the system to be faster at the expense of losing data under some conditions. The fact that you then lose data under those conditions should not surprise you.

The idea that you must have enough power to write all the pending data to disk is problematic as that then severely limits the amount of cache that you have.

B) It should go without saying: drives should reliably report back to the host when something goes wrong. E.g., if a write request has been accepted, cached, and reported complete, but then during the actual write an ECC failure is detected in the cacheline, the drive needs to tell the host "oh by the way, block XXX didn't actually make it to disk like I told you it did 10ms ago."

The issue isn't a drive having a write error, it's the system shutting down (or crashing) before the data is written. No OS-level tricks will help you here.


The real problem here isn't the drive claiming the data has been written when it hasn't. The real problem is that the application has said 'write this data' to the OS, and the OS has not done so yet.

The OS delays the writes for many legitimate reasons (the disk may be busy, it can get things done more efficiently by combining and reordering the writes, etc).

Unless the system crashes, this is not a problem: the data will eventually be written out, and on system shutdown everything is good.

But if the system crashes, some of this postponed work doesn't get done, and that can be a problem.

Applications can call fsync if they want to be sure that their data is safe on disk NOW, but they currently have no way of saying "I want to make sure that A happens before B, but I don't care if A happens now or 10 seconds from now".
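
As a rough sketch of what an application can do today (Python used only for brevity; the file names and the write_all / ordered_write helpers are made up for illustration and are not from any real project), the only portable way to get "A before B" is to force A out immediately with fsync:

```python
import os
import tempfile


def write_all(fd, data):
    """Write every byte of data to fd, retrying on short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def ordered_write(path_a, data_a, path_b, data_b):
    """Hypothetical helper: make sure A is flushed before B is even issued.

    There is no "A before B, whenever" primitive, so the ordering is
    bought with a full synchronous flush of A.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC

    fd_a = os.open(path_a, flags, 0o644)
    try:
        write_all(fd_a, data_a)
        # Blocks until the OS reports A's data as flushed to the device.
        # Whether the drive's own cache honors that is a separate issue.
        os.fsync(fd_a)
    finally:
        os.close(fd_a)

    fd_b = os.open(path_b, flags, 0o644)
    try:
        # No fsync here: B may sit in the page cache for a while, which
        # is fine because A is already out.
        write_all(fd_b, data_b)
    finally:
        os.close(fd_b)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    ordered_write(
        os.path.join(workdir, "journal"), b"intent: update row 7\n",
        os.path.join(workdir, "database"), b"row 7 = new value\n",
    )
```

The fsync in the middle pays the full latency of a synchronous flush even though all the application wanted was ordering. It would have been just as happy for both writes to land 10 seconds later, as long as A landed first.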

That is the gap that it would be useful to provide a mechanism to deal with, and it doesn't matter what your disk system does in terms of lying or not; there still isn't a way to deal with this today.

David Lang