Re: [PATCH v3 1/2] fs: New zonefs file system

From: Damien Le Moal
Date: Wed Dec 25 2019 - 01:05:21 EST


On 2019/12/24 13:40, Darrick J. Wong wrote:
[...]
>> config FS_DAX
>> bool "Direct Access (DAX) support"
>> diff --git a/fs/Makefile b/fs/Makefile
>> index 1148c555c4d3..527f228a5e8a 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -133,3 +133,4 @@ obj-$(CONFIG_CEPH_FS) += ceph/
>> obj-$(CONFIG_PSTORE) += pstore/
>> obj-$(CONFIG_EFIVAR_FS) += efivarfs/
>> obj-$(CONFIG_EROFS_FS) += erofs/
>> +obj-$(CONFIG_ZONEFS_FS) += zonefs/
>> diff --git a/fs/zonefs/Kconfig b/fs/zonefs/Kconfig
>> new file mode 100644
>> index 000000000000..6490547e9763
>> --- /dev/null
>> +++ b/fs/zonefs/Kconfig
>> @@ -0,0 +1,9 @@
>> +config ZONEFS_FS
>> + tristate "zonefs filesystem support"
>> + depends on BLOCK
>> + depends on BLK_DEV_ZONED
>> + help
>> + zonefs is a simple File System which exposes zones of a zoned block
>> + device as files.
>
> I wonder if you ought to mention here some examples of zoned block
> devices, such as SMR drives?

Yes, will add that.

>> +
>> + If unsure, say N.
>> diff --git a/fs/zonefs/Makefile b/fs/zonefs/Makefile
>> new file mode 100644
>> index 000000000000..75a380aa1ae1
>> --- /dev/null
>> +++ b/fs/zonefs/Makefile
>> @@ -0,0 +1,4 @@
>> +# SPDX-License-Identifier: GPL-2.0
>> +obj-$(CONFIG_ZONEFS_FS) += zonefs.o
>> +
>> +zonefs-y := super.o
>> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
>> new file mode 100644
>> index 000000000000..417de3099fe0
>> --- /dev/null
>> +++ b/fs/zonefs/super.c
>
> <snip>
>
>> +static int zonefs_report_zones_err_cb(struct blk_zone *zone, unsigned int idx,
>> + void *data)
>> +{
>> + struct inode *inode = data;
>> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> + loff_t pos;
>> +
>> + /*
>> + * The condition of the zone may have change. Check it and adjust the
>> + * inode information as needed, similarly to zonefs_init_file_inode().
>> + */
>> + if (zone->cond == BLK_ZONE_COND_OFFLINE) {
>> + inode->i_flags |= S_IMMUTABLE;
>
> Can a zone go from offline (or I suppose readonly) to one of the other
> not-immutable states? If a zone comes back online, you'd want to clear
> S_IMMUTABLE.

ZBC/ZAC specifications do not define transitions into and out of the
READONLY and OFFLINE states.

For both offline and read-only states, the standard says that a zone
transitions into offline or read-only for:

"a) as a result of media failure; or
b) for reasons outside the scope of this standard."

As for the transition out of these states:

"All transitions out of this state are outside the scope of this standard."

So from the file system point of view, once these states are seen,
nothing can be explicitly done to get out of them and even if the drive
itself does something, there is no notification mechanism and only
regularly doing report zones will allow detecting the change.
Of all the SMR drives I know of, these states are only used if there is
indeed a media failure/head failure. Seeing these states is likely
synonymous with "your drive is dying". NVMe Zoned Namespace may define
these in slightly different ways (work in progress) though, so we may
need to revisit the immutable flag management for that case.

For now, irreversibly setting the immutable flag matches the zone state
management by the disk, so I think it is OK.


>
>> + inode->i_mode = S_IFREG;
>
> i_mode &= ~S_IRWXUGO; ?

Yes, indeed that is better. checkpatch.pl does spit out a warning if one
uses the S_Ixxx macros though. See below.

>
> Note that clearing the mode flags won't prevent programs with an
> existing writable fd from being able to call write(). I'd imagine that
> they'd hit EIO pretty fast though, so that might not matter.
>
>> + zone->wp = zone->start;
>> + } else if (zone->cond == BLK_ZONE_COND_READONLY) {
>> + inode->i_flags |= S_IMMUTABLE;
>> + inode->i_mode &= ~(0222); /* S_IWUGO */
>
> Might as well just use S_IWUGO directly here?

Because checkpatch spits out a warning if I do. I would prefer using the
macro as I find it much easier to read. Should I just ignore checkpatch
warning ?

>> +static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone)
>> +{
>> + struct super_block *sb = inode->i_sb;
>> + struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> + umode_t perm = sbi->s_perm;
>> +
>> + if (zone->cond == BLK_ZONE_COND_OFFLINE) {
>> + /*
>> + * Dead zone: make the inode immutable, disable all accesses
>> + * and set the file size to 0.
>> + */
>> + inode->i_flags |= S_IMMUTABLE;
>> + zone->wp = zone->start;
>> + perm = 0;
>> + } else if (zone->cond == BLK_ZONE_COND_READONLY) {
>> + /* Do not allow writes in read-only zones */
>> + inode->i_flags |= S_IMMUTABLE;
>> + perm &= ~(0222); /* S_IWUGO */
>> + }
>> +
>> + zi->i_ztype = zonefs_zone_type(zone);
>> + zi->i_zsector = zone->start;
>> + zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
>> + zone->len << SECTOR_SHIFT);
>> + if (zi->i_ztype == ZONEFS_ZTYPE_CNV)
>> + zi->i_wpoffset = zi->i_max_size;
>> + else
>> + zi->i_wpoffset = (zone->wp - zone->start) << SECTOR_SHIFT;
>> +
>> + inode->i_mode = S_IFREG | perm;
>> + inode->i_uid = sbi->s_uid;
>> + inode->i_gid = sbi->s_gid;
>> + inode->i_size = zi->i_wpoffset;
>> + inode->i_blocks = zone->len;
>> +
>> + inode->i_fop = &zonefs_file_operations;
>> + inode->i_op = &zonefs_file_inode_operations;
>> + inode->i_mapping->a_ops = &zonefs_file_aops;
>> +
>> + sb->s_maxbytes = max(zi->i_max_size, sb->s_maxbytes);
>
> Uhh, just out of curiosity, can zones be larger than 16T? Bad things
> happen on 32-bit kernels when you set s_maxbytes larger than that.
>
> (He says with the hubris of having spent days sorting out various
> longstanding bugs in 32-bit XFS.)

In theory, yes, zones can be larger than 16TB. The standards do not
prevent it. However, the chunk_sectors queue limit attribute that holds
the zone size of a device is an unsigned int and there are checks when
it is initialized that the device zone size is not larger than UINT_MAX.

In any case, please note that I did make sure that we do not exceed
MAX_LFS_FILESIZE: a few line above the one you commented, there is:

zi->i_max_size = min_t(loff_t, MAX_LFS_FILESIZE,
zone->len << SECTOR_SHIFT);

So for sb->s_maxbytes, 16TB maximum is a hard limit on 32-bit arch that
cannot be exceeded.

Best regards.


--
Damien Le Moal
Western Digital Research