Re: [PATCH 3.16 106/217] sd: disable discard_zeroes_data for UNMAP

From: Martin K. Petersen
Date: Fri Apr 29 2016 - 08:17:22 EST


>>>>> "Rafael" == Rafael David Tinoco <rafael.tinoco@xxxxxxxxxxxxx> writes:

Rafael,

Rafael> And this happened because the storage in question didn't set
Rafael> properly "max_ws_blocks" (it was 0) from VPD 0xb0 (max_ws_blocks
Rafael> is calculated using it).

We appear to be talking about a device that has an internal limit but
does not advertise it in the dedicated field.

Please send me the output of:

sg_inq
sg_readcap -l
sg_vpd -p lbpv
sg_vpd -p bl

and we'll quirk it. Or feel free to submit a patch.

Rafael> Anyway, 2 examples of disk servers that had problems after this
Rafael> change. IMHO the change is good for regular kernel development,
Rafael> and it does guarantee further READ CMDs to read zeros from LBAs,
Rafael> but, it jeopardises already functioning storage servers.

... as opposed to jeopardizing the integrity of people's data?

Rafael> Without properly setting NDOB bit in WRITESAME(16), the data
Rafael> buffer will be read on every SCSI WRITESAME(16) command and that
Rafael> will impact the "discard method" performance (will probably be
Rafael> slower than regular UNMAP command).

NDOB is a recent performance optimization for high performance SSD
devices that allows us to skip sending the zeroed data buffer. However,
there is no way to tell whether a device supports NDOB without sending a
WRITE SAME command which has the nasty side effect of being destructive.
I am also not aware of any devices that actually support it yet.

Whether to set NDOB or not is completely orthogonal to the issue at hand
which is whether to use the UNMAP command or WRITE SAME with the UNMAP
bit set. WRITE SAME w/ UNMAP has been around for a long, long
time. WRITE SAME was used for thin provisioned devices before the UNMAP
command even existed.

Currently the filesystem code relies on being able to get predictable
results for zeroing metadata blocks etc. So if a device advertises that
it supports LBPRZ we'll use WRITE SAME with the UNMAP bit set since it
provides hard guarantees. Unlike the UNMAP command which by definition
is advisory.

As I mentioned earlier, I have some changes in the pipeline that will
separate the ioctl implementation from the library functions. That will
allow us to use WRITE SAME w/ UNMAP for zeroouts and UNMAP for discards
on the same device. However, this is still not going to solve the
problem with your device that fails WRITE SAME w/ UNMAP. Because we are
not going to discontinue using that combination. Quite the contrary, we
are increasingly depending on it.

Consequently, if you have a device that has problems in that area we
need to quirk it rather than try to work around it in the core code
heuristics.

--
Martin K. Petersen Oracle Linux Engineering