Re: ATA 4 KiB sector issues.

From: Greg Freemyer
Date: Mon Mar 08 2010 - 00:38:17 EST


cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working the
issue for the last year.

On Sun, Mar 7, 2010 at 10:48 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, guys.
>
> It looks like the transition to ATA 4k drives will be quite painful
> and we aren't really ready although these drives are already selling
> widely.  I've written up a summary document on the issue to clarify
> things, as it's getting more and more confusing, and to develop some
> consensus.  It's also on the linux ata wiki.
>
>  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
>
> I've cc'd people whom I can think of off the top of my head but I
> surely have missed some people who would have been interested.  Please
> feel free to add cc's or forward the message to other MLs.
> Especially, I don't know much about partitioners so the details there
> are pretty shallow and could be plain wrong.  It would be great if
> someone who knows more about this stuff can chime in.
>
> Thanks.
>
> === Document follows ===
>
> ATA 4 KiB sector issues
>
> Background
> ==========
>
> Up until recently, all ATA hard drives have been organized in 512 byte
> sectors.  For example, my 500 GB (about 466 GiB) hard drive is
> organized into 976773168 512 byte sectors numbered from 0 to
> 976773167.  This is how
> a drive communicates with the driver.  When the operating system wants
> to read 32 KiB of data at 1 MiB position, the driver asks the drive to
> read 64 sectors from LBA (Logical block address, sector number) 2048.
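As a rough illustration of the addressing described above, here is a
minimal Python sketch that turns a byte range into a starting LBA and
a sector count, assuming 512 byte sectors.  The helper name
byte_range_to_lba is made up for this example and is not part of any
driver API.

```python
SECTOR_SIZE = 512  # bytes per sector, as in the example above
KiB = 1024
MiB = 1024 * KiB

def byte_range_to_lba(offset_bytes, length_bytes, sector_size=SECTOR_SIZE):
    """Return (start_lba, sector_count) for a sector-aligned byte range."""
    if offset_bytes % sector_size or length_bytes % sector_size:
        raise ValueError("byte range is not sector aligned")
    return offset_bytes // sector_size, length_bytes // sector_size

# Reading 32 KiB at the 1 MiB position.
print(byte_range_to_lba(1 * MiB, 32 * KiB))  # (2048, 64)

# Capacity of the example drive in bytes.
print(976773168 * SECTOR_SIZE)  # 500107862016
```

The second print shows where the "500 GB" figure comes from: the
sector count times 512 bytes.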
>
> Because each sector should be addressable, readable and writable
> individually, the physical medium also is organized in the same sized
> sectors.  In addition to the area to store the actual data, each
> sector requires extra space for book keeping - inter-sector space to
> enable locating and addressing each sector and ECC data to detect and
> correct inevitable raw data errors.
>
> As the densities and capacities of hard drives keep growing, stronger
> ECC becomes necessary to guarantee an acceptable level of data
> integrity, increasing the space overhead.  In addition, in most
> applications,
> hard drives are now accessed in units of at least 8 sectors or 4096
> bytes and maintaining 512 byte granularity has become somewhat
> meaningless.
>
> This reached a point where enlarging the sector size to 4096 bytes
> would yield measurably more usable space given the same raw data
> storage size and hard drive manufacturers are transitioning to 4 KiB
> sectors.
>
> Anandtech has a good article which illustrates the background and
> issues with pretty diagrams[1].
>
>
> Physical vs. Logical
> ====================
>
> Because the 512 byte sector size has been around for a very long time
> and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
> sector size assumption is scattered across all the layers -
> controllers or bridge chips snooping commands, BIOSs, boot code,
> drivers, partitioners and system utilities - which makes it very
> difficult to change the sector size from 512 bytes without massively
> breaking backward compatibility.
>
> As a workaround, the concept of logical sector size was introduced.
> The physical medium is organized in 4 KiB sectors but the firmware on
> the drive will present it as if the drive is composed of 512 byte
> sectors thus making the drive behave as before, so if the driver asks
> the hard drive to read 64 sectors from LBA 2048, the firmware will
> translate it and read 8 4 KiB sectors from hardware sector 256.  As a
> result, the hard drive now has two sector sizes - the physical one
> which the physical media is actually organized in, and the logical one
> which the firmware presents to the outside world.
>
> A straightforward example mapping between physical sector and LBA
> would be
>
>  LBA = 8 * phys_sect
>
>
> Alignment problem on 4 KiB physical / 512 logical drives
> =======================================================
>
> This workaround keeps older hardware and software working while
> allowing the drive to use a larger sector size internally.  However,
> the discrepancy between physical and logical sector sizes creates an
> alignment issue.  For example, if the driver wants to read 7 sectors
> from LBA 2047, the firmware has to read hardware sectors 255 and 256
> and trim the leading 7*512 bytes and the trailing 2*512 bytes.
>
> For reads, this isn't an issue as drives read in larger chunks anyway
> but for writes, the drive has to do read-modify-write to achieve the
> requested action.  It first has to read hardware sectors 255 and 256,
> update the requested parts and then write those sectors back, which can
> cause significant performance degradation[2].
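To make the trimming concrete, here is a small Python sketch of what
the firmware has to work out for a request, assuming the plain
LBA = 8 * phys_sect mapping.  The function names are invented for
this example.  For the 7 sector request at LBA 2047, hardware sector
255 holds LBAs 2040-2047 and hardware sector 256 holds LBAs
2048-2055, so LBAs 2054 and 2055 are left over at the tail.

```python
LOGICAL = 512                   # bytes per logical sector
PHYSICAL = 4096                 # bytes per physical sector
PER_PHYS = PHYSICAL // LOGICAL  # 8 logical sectors per physical sector

def physical_footprint(lba, count):
    """Map a logical request onto physical sectors.

    Returns (first_phys, n_phys, lead_trim_bytes, tail_trim_bytes).
    """
    first_phys = lba // PER_PHYS
    last_phys = (lba + count - 1) // PER_PHYS
    lead = (lba % PER_PHYS) * LOGICAL
    tail = ((last_phys + 1) * PER_PHYS - (lba + count)) * LOGICAL
    return first_phys, last_phys - first_phys + 1, lead, tail

def needs_read_modify_write(lba, count):
    """A write needs read-modify-write if it covers a partial physical sector."""
    _, _, lead, tail = physical_footprint(lba, count)
    return lead != 0 or tail != 0

print(physical_footprint(2048, 64))      # (256, 8, 0, 0)
print(physical_footprint(2047, 7))       # (255, 2, 3584, 1024)
print(needs_read_modify_write(2047, 7))  # True
```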
>
> The problem is aggravated by the way DOS partitions[3] have been laid
> out traditionally.  For reasons dating back more than two decades,
> they are laid out according to something called disk geometry, whose
> parameters are nowadays arbitrary values with a number of restrictions
> accumulated over the years for backward compatibility.  The end result
> is that until recently (most Linux variants and up to Windows XP) the
> first partition ends up on sector 63 and later ones on cylinder
> boundaries where each cylinder usually is composed of 255 * 63
> sectors.
>
> Most modern filesystems generate 4 KiB aligned accesses relative to
> the start of the partition they are in.  If a drive maps 4 KiB
> physical sectors to 512
> byte logical sectors from LBA0, the filesystem in the first partition
> will always be misaligned and filesystems in later partitions are
> likely to be misaligned too.
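A quick Python check of the traditional layout against 4 KiB
physical sectors, assuming the usual 255 * 63 geometry and the plain
LBA = 8 * phys_sect mapping.  The function names are just for
illustration.  Since 255 * 63 = 16065 is odd, only every eighth
cylinder boundary lands on a physical sector boundary.

```python
SECTORS_PER_TRACK = 63
HEADS = 255
CYLINDER = HEADS * SECTORS_PER_TRACK  # 16065 logical sectors
PER_PHYS = 4096 // 512                # 8 logical sectors per physical sector

def is_aligned(lba):
    """True if lba starts on a physical sector boundary."""
    return lba % PER_PHYS == 0

def aligned_cylinders(limit):
    """Cylinder numbers below limit whose first sector is 4 KiB aligned."""
    return [c for c in range(1, limit) if is_aligned(c * CYLINDER)]

print(is_aligned(63))         # False
print(aligned_cylinders(25))  # [8, 16, 24]
```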
>
>
> Solving the alignment problem on 4 KiB physical / 512 logical drives
> ====================================================================
>
> There are multiple approaches that attempt to solve the problem.
>
> S-1. Yet another workaround from the firmware - offset-by-one.
>
>  Yet another workaround which can be done by the firmware is to
>  offset physical to logical mapping by one logical sector such that
>  LBA 63 ends up on physical sector boundary, which aligns the first
>  partition to physical sectors without requiring any software update.
>  The example mapping between phys_sector and LBA becomes
>
>    LBA = 8 * phys_sect - 1
>
>  The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
>  right after that point.  phys_sect 1 maps to LBA 7 and phys_sect 8 to
>  63, making LBA 63 aligned on hardware sector.
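The offset-by-one mapping can be sketched in a few lines of Python.
This is only an illustration of the example mapping
LBA = 8 * phys_sect - 1, with made-up function names.  It also shows
the flip side: an LBA that is aligned without the offset, such as
2048, is no longer aligned with it.

```python
PER_PHYS = 8  # logical sectors per 4 KiB physical sector

def first_lba(phys_sect, offset_by_one):
    """First LBA stored in phys_sect; -1 marks the unused leading 512 bytes."""
    return PER_PHYS * phys_sect - (1 if offset_by_one else 0)

def is_aligned(lba, offset_by_one):
    """True if lba sits on a physical sector boundary under the mapping."""
    return (lba + (1 if offset_by_one else 0)) % PER_PHYS == 0

print([first_lba(p, True) for p in (1, 8)])             # [7, 63]
print(is_aligned(63, True), is_aligned(63, False))      # True False
print(is_aligned(2048, True), is_aligned(2048, False))  # False True
```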
>
>  Although this aligns only the first partition, for many use cases,
>  especially the ones involving older software, this workaround was
>  deemed useful and some recent drives with 4 KiB physical sectors are
>  equipped with a dip switch to turn on or off offset-by-one mapping.
>
> S-2. The proper solution.
>
>  Correct alignments for all partitions can't be achieved by the
>  firmware alone.  The system utilities should be informed about the
>  alignment requirements and align partitions accordingly.
>
>  The above firmware workaround complicates the situation because the
>  two different configurations require different offsets to achieve
>  the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
>  export the physical and logical sector sizes and the LBA offset
>  which is aligned to the physical sectors.
>
>  In Linux, these parameters are exported via the following sysfs
>  nodes.
>
>    physical sector size        : /sys/block/sdX/queue/physical_block_size
>    logical sector size         : /sys/block/sdX/queue/logical_block_size
>    alignment offset            : /sys/block/sdX/alignment_offset
>
>  Let the physical sector size be PSS, logical sector size LSS and
>  alignment offset AOFF.  The system software should place partitions
>  such that the starting LBAs of all partitions are aligned on
>
>    (n * PSS + AOFF) / LSS
>
>  For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
>  and AOFF 3584 and with n of 7 the above becomes,
>
>    (7 * 4096 + 3584) / 512 == 63
>
>  making sector 63 an aligned LBA where the first partition can be
>  put, but without the offset-by-one mapping, AOFF is zero and LBA 63
>  is not aligned.
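Here is a hedged Python sketch of how a partitioner might apply the
formula above.  read_topology reads the three sysfs nodes listed
earlier.  The helper names, and the assumption that AOFF is a
multiple of LSS, are mine and not taken from any existing tool.

```python
from pathlib import Path

def read_topology(dev):
    """Read (PSS, LSS, AOFF) in bytes for a device such as 'sda' from sysfs."""
    base = Path("/sys/block") / dev
    pss = int((base / "queue/physical_block_size").read_text())
    lss = int((base / "queue/logical_block_size").read_text())
    aoff = int((base / "alignment_offset").read_text())
    return pss, lss, aoff

def is_aligned(lba, pss, lss, aoff):
    """True if lba has the form (n * PSS + AOFF) / LSS."""
    return (lba * lss - aoff) % pss == 0

def next_aligned_lba(lba, pss, lss, aoff):
    """Smallest aligned LBA that is >= lba."""
    n = -(-(lba * lss - aoff) // pss)  # ceiling division
    return (n * pss + aoff) // lss

# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584.
print(next_aligned_lba(63, 4096, 512, 3584))    # 63
print(next_aligned_lba(2048, 4096, 512, 3584))  # 2055
# Same sizes without the offset: AOFF 0.
print(next_aligned_lba(63, 4096, 512, 0))       # 64
```

On a real system the values would come from something like
read_topology("sda") instead of being hard coded.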
>
>  With the above new alignment requirement in place, it becomes
>  difficult to honor the legacy one - first partition on sector 63 and
>  all other partitions on cylinder boundary (255 * 63 sectors) - as
>  the two alignment requirements contradict each other.  This might be
>  worked around by adjusting how LBA and CHS addresses are mapped but
>  the disk geometry parameters are hard coded everywhere and there is
>  no reliable way to communicate custom geometry parameters.
>
>
> Complications
> =============
>
> Unfortunately, there are complications.
>
> C-1. The standard is not and won't be followed as-is.
>
>  Some of the existing BIOSs and/or drivers can't cope with drives
>  which report 4 KiB physical sector size.  To work around this, some
>  drive models falsely report a physical sector size of 512 bytes when
>  the actual configuration is 4 KiB without offsetting.
>
>  This nullifies the provisions for alignment in the ATA standard but
>  results in the correct alignment for Windows Vista and 7.  OS
>  behaviors will be described further later.
>
>  For these drives, which are likely to continue to be shipped for the
>  foreseeable future, traditional LBA 63 and cylinder based aligning
>  results in misalignment.
>
> C-2. Windows XP depends on the traditional partition layout.
>
>  Windows XP makes use of the CHS start/end addresses in the partition
>  table and gets confused if partitions are not laid out
>  traditionally.  This means that XP can't be installed into a
>  partition prepared by later versions of Windows[4].  This isn't a
>  big problem for Windows because in most cases the later version is
>  replacing the older one, not the other way around.
>
>  Unfortunately, the situation is more complex for Linux because Linux
>  is often co-installed with various versions of Windows and XP is
>  still quite popular.  This means that when a Linux partitioner is
>  used to prepare a partition which may be used by Windows, the
>  partitioner might have to consider which version of Windows is going
>  to be used and whether to align the partitions for the correct
>  alignment or compatibility with older versions of Windows.
>
> C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.
>
>  The DOS partition format uses 32 bits each for the starting LBA and
>  the number of sectors and, reportedly, 32 bit Windows XP shares the
>  limitation.  With 32 bit addressing and 512 byte logical sector
>  size, the maximum addressable sector + 1 is at
>
>    2^32 * 2^9 == 2^41 == 2 TiB
>
>  The DOS partition format allows a partition to reach beyond 2 TiB as
>  long as the starting LBA is under 2 TiB; however, both Windows XP
>  and the Linux kernel (at least up to v2.6.33) refuse such
>  partition configurations.
>
>  With the right combination of host controller, BIOS and driver, this
>  barrier can be overcome by enlarging the logical sector size to 4
>  KiB, which will push the barrier out to 16 TiB.  On the right
>  configuration, Windows XP is reportedly able to address beyond the 2
>  TiB barrier with a DOS partition and 4 KiB logical sector size.
>  Linux kernels up to v2.6.33 don't work under such configurations but
>  a patch to make it work is pending[5].
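The barrier arithmetic is easy to check.  A minimal Python sketch,
assuming 32 bit sector numbers as in the DOS partition format (the
function name is made up):

```python
TiB = 2 ** 40

def addressable_bytes(logical_sector_size, lba_bits=32):
    """Bytes reachable with lba_bits-wide sector numbers."""
    return (2 ** lba_bits) * logical_sector_size

print(addressable_bytes(512) // TiB)   # 2
print(addressable_bytes(4096) // TiB)  # 16
```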
>
>  This might also be beneficial for operating systems which don't
>  suffer from this limitation.  A different partition format - GPT[6]
>  - should be used beyond 2^32 sectors, which could harm compatibility
>  with older BIOSs or other operating systems which don't recognize
>  the new format.
>
>  As mentioned previously, the 512 byte sector assumption has been there
>  for a very long time and changing it is likely to cause various
>  compatibility problems at many different layers from hardware up to
>  the system utilities.
>
>
> Windows
> =======
>
> As hard drive vendors aim for performance and compatibility in modern
> Windows environments, it is worthwhile to investigate how Windows
> partitions drives with different alignment requirements.  Up until
> Windows XP, it followed the traditional layout - the first partition
> on LBA 63 and the others on cylinder boundaries where a cylinder is
> defined as 255 tracks with 63 sectors each.
>
> Windows Vista and 7 align partitions differently.  As the two behave
> similarly, only 7's behavior is shown here.  These partition tables
> are created by Windows 7 RC installer on blank disks.
>
> W-1. 512 byte physical and logical sector drive.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202100 07 df130c 00080000 00200300
>  00 df140c 07 feffff 00280300 00689e12
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2048 + 204800 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   20  : 206848
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206848 + 312371200 = 312578048
>
>  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
>
> W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202100 07 df130c 00080000 00200300
>  00 df140c 07 feffff 00280300 00b83f25
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2048 + 204800 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   20  : 206848
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206848 + 624932864 = 625139712
>
>  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
>
> W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.
>
>  ST FIRST  T  LAST   LBA      NBLKS
>  80 202800 07 df130c 07080000 f91f0300
>  00 df1b0c 07 feffff 07280300 f9376d74
>  00 000000 00 000000 00000000 00000000
>  00 000000 00 000000 00000000 00000000
>
>  Part0:        FIRST   C    0  H   32  S   40  : 2055          (63 sec/trk)
>                LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
>                LBA     2055 + 204793 = 206848
>
>  Part1:        FIRST   C   12  H  223  S   27  : 206855
>                LAST    C 1023  H  254  S   63  : E
>                LBA     206855 + 1953314809 = 1953521664
>
>  Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.
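For anyone who wants to double check the decoded tables above, here
is a short Python sketch that parses one 16 byte MBR partition entry
from the hex columns as printed (ST, FIRST, T, LAST, LBA, NBLKS).
The CHS to LBA conversion assumes the 255 heads / 63 sectors per
track geometry noted in the tables.  The function names are only for
illustration.

```python
import struct

HEADS = 255
SECTORS_PER_TRACK = 63

def decode_chs(raw):
    """Decode a 3 byte CHS field into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]
    return cylinder, head, sector

def chs_to_lba(cylinder, head, sector):
    """Convert CHS to LBA under the assumed 255/63 geometry."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + sector - 1

def decode_entry(hexline):
    """Decode one partition entry given as space separated hex columns."""
    raw = bytes.fromhex(hexline.replace(" ", ""))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {
        "status": status,
        "type": ptype,
        "first": decode_chs(first),
        "last": decode_chs(last),
        "lba": lba,      # little-endian starting LBA
        "nblks": nblks,  # little-endian sector count
    }

# First entry of W-3 (offset-by-one drive).
entry = decode_entry("80 202800 07 df130c 07080000 f91f0300")
print(entry["first"], entry["lba"], entry["nblks"])  # (0, 32, 40) 2055 204793
print(chs_to_lba(*entry["first"]))                   # 2055
print(entry["lba"] % 2048)                           # 7
```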
>
> The partitioner seems to be using 1 MiB as the basic alignment unit
> and offsetting from there if explicitly requested by the drive.  There
> is no difference between the handling of 512 byte and 4 KiB drives,
> which explains why C-1 works for hard drive vendors.
>
> In all cases, the partitioner ignores both traditional requirements -
> the first partition on LBA 63 and the others on cylinder boundaries -
> while still using the same 255*63 cylinder size.  Also, note that in
> W-3, both part 0 and 1 end up with an odd number of sectors.  It
> seems that they simply
> decided to completely break away from the traditional layout, which is
> understandable given that there really isn't one good solution which
> can cover all the cases and that the default larger alignment benefits
> earlier SSDs.
>
> Windows Vista basically shows the same behavior.  Vista was tested by
> creating two partitions using the management tool.  Test data is
> available at [7].
>
>  *-alignment_offset    : alignment_offset reported by Linux kernel
>  *-fdisk               : fdisk -l output
>  *-fdisk-u             : fdisk -lu output
>  *-hdparm              : hdparm -I output
>  *-mbr                 : dump of mbr
>  *-part                : decoded partition table from mbr
>
> Please note that hdparm is misreporting the alignment offset.  It
> should be reporting 512 instead of 256 for offset-by-one drives.
>
>
> So, what now for Linux?
> =======================
>
> The situation is not easy.  Considering all the factors, the only
> workable solution looks like doing what Windows is doing.  Hard drive
> and SSD vendors are focusing on compatibility and performance on
> recent Windows releases and are happy to do things which break the
> standard defined mechanism as shown by C-1, so departing from what
> Windows does would be unnecessarily painful.
>
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that.  There will be many installations where a
> modern Linux distro shares a hard drive with older releases of
> Windows.  At this point, I can't see a silver bullet solution.
>
> Partitioners should perhaps align only the partitions which will be
> used by Linux and default to the traditional layout for others while
> allowing explicit override.  I think Windows XP wouldn't have a
> problem with differently aligned partitions as long as it doesn't
> actually use them, but I haven't tested it.
>
> Reportedly, commonly used partitioners aren't ready to handle drives
> larger than 2 TiB in any configuration and alignment isn't done
> properly for drives with 4 KiB physical sectors.  4 KiB logical sector
> support is broken in both the kernel and partitioners.  (need more
> details and probably a whole section on partitioner behaviors)
>
> Unfortunately, the transition to 4 KiB sector size, physical only or
> logical too, is looking fairly ugly.  Hopefully, a reasonable solution
> can be reached in the not too distant future but even with all the
> software side updated, it looks like it's gonna cause a significant
> amount of confusion and frustration.
>
>
> [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
> [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
> [3] http://en.wikipedia.org/wiki/Master_boot_record
> [4] http://support.microsoft.com/kb/931760
> [5] http://thread.gmane.org/gmane.linux.kernel/953981
> [6] http://en.wikipedia.org/wiki/GUID_Partition_Table
> [7] http://userweb.kernel.org/~tj/partalign/
>
> * Mar 04 2009
>        Initial draft, Tejun Heo <tj@xxxxxxxxxx>
> * Mar 08 2009
>        Updated according to comments from Daniel Taylor
>        <Daniel.Taylor@xxxxxxx>.  Other minor updates.



--
Greg Freemyer