Re: [BUG] Regression introduced with "block: split bios to max possible length"

From: Linus Torvalds
Date: Thu Jan 21 2016 - 23:15:47 EST


On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch <keith.busch@xxxxxxxxx> wrote:
> On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
>>
>> I assume that in this case it's simply that
>>
>> - max_sectors is some odd number in sectors (ie 65535)
>>
>> - the block size is larger than a sector (ie 4k)
>
> Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
> sectors is supposed to be a valid transfer in multiples of the physical
> sector size?

If the controller interface is some 16-bit register, then the maximum
number of sectors you can specify is 65535.

But if the disk then doesn't like 512-byte accesses, but wants 4kB or
whatever, then clearly you can't actually *feed* it that maximum
number. Not because it's a maximal, but because it's not aligned.

But that doesn't mean that it's non-sensical. It just means that you
have to take both things into account. There may be two totally
independent things that cause the two (very different) rules on what
the IO can look like.

Obviously there are probably games we could play, like always limiting
the maximum sector number to a multiple of the sector size. That would
presumably work for Stefan's case, by simply "artificially" making
max_sectors be 65528 instead.

But I do think it's better to consider them independent issues, and
just make sure that we always honor those things independently.

That "honor things independently" used to happen automatically before,
simply because we'd never split in the middle of a bio segment. And
since each bio segment was created with the limitations of the device
in mind, that all worked.

Now that it splits in the middle of a vector entry, that splitting
just needs to honor _all_ the rules. Not just the max sector one.

>> What I think it _should_ do is:
>>
>> (a) check against max sectors like it used to do:
>>
>> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
>> goto split;
>
> This can create less optimal splits for h/w that advertise chunk size. I
> know it's a quirky feature (wasn't my idea), but the h/w is very slow
> to not split at the necessary alignments, and we used to handle this
> split correctly.

I suspect few high-performance controllers will really have big issues
with the max_sectors thing. If you have big enough IO that you could
hit the maximum sector number, you're already pretty well off, you
might as well split at that point.

So I think it's ok to split at the max sector case early.

For the case of nvme, for example, I think the max sector number is so
high that you'll never hit that anyway, and you'll only ever hit the
chunk limit. No?

So in practice it won't matter, I suspect.

Linus