Re: [PATCH v3 15/15] nvme: Ensure atomic writes will be executed atomically

From: John Garry
Date: Thu Jan 25 2024 - 06:30:23 EST


On 25/01/2024 00:52, Keith Busch wrote:
On Wed, Jan 24, 2024 at 11:38:41AM +0000, John Garry wrote:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5045c84f2516..6a34a5d92088 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -911,6 +911,32 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+ /*
+ * Ensure that nothing has been sent which cannot be executed
+ * atomically.
+ */
+ if (req->cmd_flags & REQ_ATOMIC) {
+ struct nvme_ns_head *head = ns->head;
+ u32 boundary_bytes = head->atomic_boundary;
+
+ if (blk_rq_bytes(req) > ns->head->atomic_max)
+ return BLK_STS_IOERR;
+
+ if (boundary_bytes) {
+ u32 mask = boundary_bytes - 1, imask = ~mask;
+ u32 start = blk_rq_pos(req) << SECTOR_SHIFT;
+ u32 end = start + blk_rq_bytes(req);
+
+ if (blk_rq_bytes(req) > boundary_bytes)
+ return BLK_STS_IOERR;
+
+ if (((start & imask) != (end & imask)) &&
+ (end & mask)) {
+ return BLK_STS_IOERR;
+ }
+ }
+ }
Aren't these new fields, atomic_max and atomic_boundary, duplicates of
the equivalent queue limits? Let's just use the queue limits instead.

I was making a copy of these limits in the driver out of an abundance of caution. I suppose the atomic_max and atomic_boundary values won't be modified in the block layer, so I could use them instead.


And couldn't we generically validate the constraints are not violated in
submit_bio_noacct() instead of doing that in the low level driver? The
driver assumes all other requests are already sanity checked, so I don't
think we should change the responsibility for that just for this flag.

As a safety mechanism, we want to ensure the complete stack is not giving us out-of-limits atomic writes.

We have limits checks in XFS iomap and fops.c, but we would also want to ensure that the the block layer is not doing anything it shouldn't be doing after submit_bio_noacct(), like merging atomic write BIOs which straddle a boundary or exceed atomic_max (if there were any merging).

The SCSI standard already has provision for error'ing an atomic write command which exceeds the target atomic write capabilities, while NVMe doesn't.

BTW, Christoph did mention that he would like to see this:
https://lore.kernel.org/linux-nvme/20231109153603.GA2188@xxxxxx/

Thanks,
John