Re: SuSE O_DIRECT|O_NONBLOCK overload

From: NeilBrown
Date: Wed Mar 12 2014 - 20:16:15 EST


On Wed, 12 Mar 2014 04:00:15 -0700 Christoph Hellwig <hch@xxxxxxxxxxxxx>
wrote:

> The SLES12 tree has various patches to implement special
> O_DIRECT|O_NONBLOCK semantics for block devices:
>
> https://gitorious.org/opensuse/kernel-source/source/806eab3e4b02e798c1ae942440051f81c822ca35:patches.suse/block-nonblock-causes-failfast
>
> this seems genuinely useful and I'd be really happy if people would do
> this work upstream for two reasons:
>
> a) implementing different semantics only in a vendor kernel is a
> nightmare. No proper way to document it in the man pages for
> example, and silent breakage of applications that expect it to be
> present, or even more nasty not present.
> b) Which brings us to: we had various issues with adding O_NONBLOCK to
> files that didn't support it before. How well was this whole feature
> tested?


This "feature" was really just a hack because a particular customer needed
something in a particular situation.

At the core of this in my thinking is the 'failfast' BIO flag ... or 'flags'
really because there are now three of them.

They don't seem to be documented or uniformly supported or used much at
all. dm-multipath uses one, and btrfs uses another. There could be value in
using one or more of them in md, but as they aren't documented and could
mean almost anything, I have stayed away.
I tried adding some sort of 'failfast' support to md once and I would get
occasional failures from regular sata devices which otherwise appeared to be
working perfectly well. So it seemed that "fast" was altogether *too* fast.

For a particular customer with some particular hardware there were issues
where that hardware could choose not to respond for extended periods. So we
modified the driver to accept a 'timeout' module parameter and to cause
REQ_FAILFAST_DEV (I think) requests to fail with -ETIMEDOUT if they could not
be serviced in that time.

We then modified md to cope with that particular well-defined semantic, and
hacked "O_NONBLOCK" support in so that mdadm could access the device without
the risk of hanging indefinitely.
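
For illustration only, the access pattern described above might look roughly
like this from user space. This is a minimal sketch, not mdadm's actual code:
probe_device() and PROBE_SIZE are hypothetical names, and the fail-fast
behaviour (including a possible -ETIMEDOUT) assumes a kernel carrying the
vendor patch together with a driver that implements the timeout. On other
kernels the same calls are assumed to behave like an ordinary direct read.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define PROBE_SIZE 4096       /* assumed O_DIRECT alignment and transfer size */

/*
 * Hypothetical helper: read the first PROBE_SIZE bytes of a block device.
 * Returns 0 on success or a negative errno value.
 * Any fail-fast effect of O_DIRECT|O_NONBLOCK depends on the patched kernel.
 */
int probe_device(const char *path)
{
    void *buf = NULL;
    ssize_t n;
    int fd, ret = 0;

    fd = open(path, O_RDONLY | O_DIRECT | O_NONBLOCK);
    if (fd < 0)
        return -errno;

    /* O_DIRECT transfers need a suitably aligned buffer. */
    if (posix_memalign(&buf, PROBE_SIZE, PROBE_SIZE) != 0) {
        close(fd);
        return -ENOMEM;
    }

    n = pread(fd, buf, PROBE_SIZE, 0);
    if (n < 0)
        ret = -errno;         /* with the patch, possibly -ETIMEDOUT */
    else if (n != PROBE_SIZE)
        ret = -EIO;           /* treat a short read as an error */

    free(buf);
    close(fd);
    return ret;
}
```

A caller would treat -ETIMEDOUT as "the device did not answer in time" and
move on rather than wait.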

I would be happy to bring at least some of this functionality into mainline,
but I would need a "FAILFAST" flag that actually meant something useful and
was sufficiently well documented so that if some driver got it wrong, I would
be justified in blaming the driver for not meeting the expectations that I
encoded into md.

I think that the FAILFAST flag that I need would do some error recovery but
would be time limited. Maybe a software TLER (Time Limited Error Recovery).
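
The idea of time-limited error recovery can be sketched in user-space C as a
retry loop with a deadline. This is only an illustration of the concept, not
kernel code: attempt_fn, recover_with_deadline() and fail_n_times() are
hypothetical names, and a real implementation would presumably back off
between attempts instead of retrying immediately.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <errno.h>
#include <time.h>

/* Hypothetical callback: one recovery attempt, 0 on success, -errno on failure. */
typedef int (*attempt_fn)(void *ctx);

/* Current time in milliseconds on the monotonic clock. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Retry 'attempt' until it succeeds or 'budget_ms' has elapsed.
 * At least one attempt is always made. Returns 0 on success,
 * or -ETIMEDOUT once the time budget is used up.
 */
int recover_with_deadline(attempt_fn attempt, void *ctx, long long budget_ms)
{
    long long deadline = now_ms() + budget_ms;

    do {
        if (attempt(ctx) == 0)
            return 0;
    } while (now_ms() < deadline);

    return -ETIMEDOUT;
}

/* Example attempt: fails with -EIO until the counter in ctx reaches zero. */
int fail_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return -EIO;
    }
    return 0;
}
```

The point of the sketch is that the caller gets some recovery effort, but
with a bounded worst-case delay and a well-defined error when the bound is
hit.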

I also think there should probably be just one FAILFAST flag. Whether it was
the DEV or the TRANSPORT or the DRIVER that failed could be returned in the
error code for any caller that cared. But as I don't know why the one became
three, I could well be missing something important.
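
To make the single-flag idea concrete, here is a rough sketch of how a
completion result could carry the failing layer alongside the error. None of
these names exist in the kernel: enum ff_source, struct ff_result and
ff_should_try_other_path() are purely hypothetical, and the retry policy
shown is just one example of a caller that cares about the source.

```c
#include <errno.h>

/* Hypothetical: which layer gave up on a fail-fast request. */
enum ff_source {
    FF_NONE,        /* request succeeded */
    FF_DEV,         /* the device itself reported or caused the failure */
    FF_TRANSPORT,   /* the link or path to the device failed */
    FF_DRIVER       /* the driver gave up */
};

/* Hypothetical completion result: errno-style status plus the failing layer. */
struct ff_result {
    int error;              /* 0 on success, negative errno otherwise */
    enum ff_source source;  /* meaningful only when error != 0 */
};

/*
 * Example policy for a caller that cares about the source: retrying on
 * another path only makes sense when the transport, not the device, failed.
 */
int ff_should_try_other_path(const struct ff_result *res)
{
    return res->error != 0 && res->source == FF_TRANSPORT;
}
```

A caller that does not care would look only at the error field and ignore
the source.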


As for testing, there was only basic "does it function as expected" testing.
Part of the reason for only modifying O_NONBLOCK behaviour where O_DIRECT was
also set was to make it extremely unlikely that any code would use this
feature except code that specifically needed it.

NeilBrown
