Re: ATA 4 KiB sector issues.

From: Mike Snitzer
Date: Thu Mar 11 2010 - 11:01:52 EST


On Thu, Mar 11, 2010 at 10:00 AM, Nikanth Karthikesan <knikanth@xxxxxxx> wrote:
> On Thursday 11 March 2010 19:58:11 Theodore Tso wrote:
>> On Mar 11, 2010, at 8:57 AM, Nikanth Karthikesan wrote:
>> > I guess, what he meant was, to keep filesystem blocks aligned, even if
>> > the partition is not. Say if the partition is mis-aligned by 512-bytes,
> let the filesystem waste 4k-512bytes and keep its blocks aligned. But it
>> > might be a case of over-engineering, possibly requiring disk format
>> > change.
>>
>> Ah, yes, I agree with you; that's probably what he meant.
>>
>> Sure, that's theoretically possible, but it would mean changing every
>>  single filesystem, and it would require a file system format change --- or
>>  at least a file system format extension.
>>
>> It would seem to be way easier to simply fix the partitioning tools to do
>>  the right thing, though.
>>
>
> Yes. Maybe just a simple but transparent device-mapper-like mapping on top
> of the mis-aligned partition, to do the alignment. Then the file-system code
> need not change much.
>
> But Linux already has device-mapper, and Linux will not be affected by
> misaligned partitions when we use LVM.

Well, device-mapper and LVM needed to be updated to make them "just
work" but yes that work has been done.

> But the actual problem here is that partitioning tools might create partitions
> that won't allow other operating systems to boot. So it might be enough if the
> partitioning tools just create partitions that meet Windows' (mis-)alignment
> requirements.

I'm not following...

Anyway, 4K drives that are 512b logical and 4K physical may or may not
also have "DOS partition compensation" that uses LBA -1 as the first
naturally (4K) aligned start. This means that the partition tools
need to shift the start of the first primary partition to be offset by
3584 bytes (7 512b sectors) for use with Linux. As for Windows,
AFAIK Windows Vista and Windows 7 create all partitions aligned on 1MB
boundaries. Linux's parted and fdisk now create 1MB aligned partitions
too.
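The arithmetic behind the 3584-byte shift can be checked with a small
sketch. It assumes 512-byte logical and 4096-byte physical sectors,
reads "DOS partition compensation" as the drive mapping LBA 63 onto a
physical boundary (LBA -1 modulo 8 is LBA 7, i.e. byte 3584), and
borrows the meaning of the kernel's alignment_offset sysfs attribute.
The function names are hypothetical, not taken from parted or fdisk.

```python
# Sketch: is a partition start physically aligned, and where is the next
# aligned LBA? alignment_offset is the byte offset of the first naturally
# aligned physical sector from the start of the device (0 = no compensation).

SECTOR = 512                        # logical sector size in bytes
ONE_MIB_LBA = (1 << 20) // SECTOR   # 2048 logical sectors


def is_aligned(start_lba, physical=4096, logical=SECTOR, alignment_offset=0):
    """True if start_lba begins on a physical-sector boundary."""
    return (start_lba * logical - alignment_offset) % physical == 0


def next_aligned_lba(min_lba, physical=4096, logical=SECTOR, alignment_offset=0):
    """Smallest LBA >= min_lba that begins on a physical-sector boundary."""
    step = physical // logical  # logical sectors per physical sector
    for lba in range(min_lba, min_lba + step):
        if is_aligned(lba, physical, logical, alignment_offset):
            return lba
    raise ValueError("alignment_offset must be a multiple of the logical sector size")


# Drive with compensation: LBA 63 is aligned, a plain 1 MiB start is not.
compensated = 7 * SECTOR  # 3584 bytes
print(is_aligned(63, alignment_offset=compensated))                 # True
print(is_aligned(ONE_MIB_LBA, alignment_offset=compensated))        # False
print(next_aligned_lba(ONE_MIB_LBA, alignment_offset=compensated))  # 2055

# Drive without compensation: 1 MiB is aligned, the old default of 63 is not.
print(is_aligned(ONE_MIB_LBA))  # True
print(is_aligned(63))           # False
```

In other words, on a compensated drive a tool that honours the offset
would start the first partition at LBA 2055 (1 MiB plus 3584 bytes)
rather than at LBA 2048.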

So the only outliers are older versions of Windows (< Vista) and Linux
(old fdisk, parted, etc., which also use DOS partitioning) that don't
use naturally aligned (e.g. 1MB) partition boundaries. In those
versions of Windows and Linux there are ways to change the default
start sector of 63. That said, there is an opportunity to improve
documentation on how to work around DOS partitioning on these
operating systems.
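To make the cost of the sector-63 default concrete: on a 4K physical
drive without compensation (alignment offset 0, assumed here), a
partition starting at LBA 63 begins 3584 bytes past a physical
boundary, so every 4K filesystem block straddles two physical sectors
and each write becomes a read-modify-write inside the drive. The
helper below is a hypothetical sketch that counts the physical sectors
a byte range touches.

```python
# Sketch: how many physical sectors does an I/O touch?

SECTOR = 512     # logical sector size in bytes
FS_BLOCK = 4096  # filesystem block size in bytes


def physical_sectors_touched(byte_offset, length, physical=4096):
    """Count the physical sectors spanned by [byte_offset, byte_offset + length)."""
    first = byte_offset // physical
    last = (byte_offset + length - 1) // physical
    return last - first + 1


# First filesystem block of a partition at the old DOS default (LBA 63)
# versus a 1 MiB aligned partition (LBA 2048). Every later block has the
# same offset modulo 4096, so the result holds for the whole partition.
print(physical_sectors_touched(63 * SECTOR, FS_BLOCK))    # 2
print(physical_sectors_touched(2048 * SECTOR, FS_BLOCK))  # 1
```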

One other piece worth mentioning about this "IO Topology" support in
the entire Linux I/O stack is the virt layers. hch has already extended
the virt-io protocol, and qemu is in the final stages of being
updated to properly consume the "IO Topology" information. So we
really don't have any gaps in the Linux I/O stack.
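For anyone wanting to see what a given disk reports, the topology
information is exposed through sysfs (the attributes are described in
io-limits.txt). A minimal sketch that reads them for a whole disk
follows; the function name and the choice to report missing
attributes (e.g. on older kernels) as None are assumptions.

```python
# Sketch: read the I/O topology attributes of a whole disk from sysfs.
from pathlib import Path

TOPOLOGY_ATTRS = (
    "queue/logical_block_size",
    "queue/physical_block_size",
    "queue/minimum_io_size",
    "queue/optimal_io_size",
    "alignment_offset",
)


def read_io_topology(disk, sysfs_root="/sys/block"):
    """Return {attribute name: int or None} for the given disk (e.g. "sda")."""
    base = Path(sysfs_root) / disk
    topo = {}
    for attr in TOPOLOGY_ATTRS:
        name = attr.split("/")[-1]
        try:
            topo[name] = int((base / attr).read_text().strip())
        except FileNotFoundError:
            topo[name] = None  # attribute not present on this kernel/device
    return topo
```

For example, read_io_topology("sda") on a 512b logical / 4K physical
drive with compensation would be expected to show a physical_block_size
of 4096 and an alignment_offset of 3584.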

mkp in particular, Jens, James, myself, and others implemented and
refined the SCSI and block changes. kzak, jim meyering, hans de
goede, hch, eric sandeen, bob peterson, myself and others updated all
other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
FYI, all of these advances will be in Fedora 13 (quite a few are
already in Fedora 12).

There are obviously other Linux systems and userland tools (likely
Xen, other mkfs.* and more) that should be updated. Hopefully
maintainers and/or contributors of these projects will follow-up to
address those that need updating.

Again please see:
http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
http://people.redhat.com/msnitzer/docs/io-limits.txt
Some omissions from those documents include Linux MD, which has been
updated as mkp pointed out, and virt-io and qemu, which I neglected to
talk about (but, like I said, they have been updated too).

Hopefully we're all closer to being on the same page now.

Mike