dm-zoned

The dm-zoned device mapper target exposes a zoned block device (ZBC and ZAC compliant devices) as a regular block device without any write pattern constraints. In effect, it implements a drive-managed zoned block device which hides from the user (a file system or an application doing raw block device accesses) the sequential write constraints of host-managed zoned block devices and can mitigate the potential device-side performance degradation due to excessive random writes on host-aware zoned block devices.

For a more detailed description of the zoned block device models and their constraints see (for SCSI devices):

https://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU and memory usage as well as storage capacity loss). For a 10 TB host-managed disk with 256 MB zones, dm-zoned memory usage per disk instance is at most 4.5 MB and as little as 5 zones will be used internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm

dm-zoned implements an on-disk buffering scheme to handle non-sequential write accesses to the sequential zones of a zoned block device. Conventional zones are used for caching as well as for storing internal metadata. It can also use a regular block device together with the zoned block device; in that case the regular block device will be split logically in zones with the same size as the zoned block device. These zones will be placed in front of the zones from the zoned block device and will be handled just like conventional zones.

The zones of the device(s) are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata. Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be sequential zones used exclusively to store user data. The conventional zones of the device may be used also for buffering user random writes. Data in these zones may be directly mapped to the conventional zone, but later moved to a sequential zone so that the conventional zone can be reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes, irrespective of the physical sector size of the backend zoned block device being used. This allows reducing the amount of metadata needed to manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the super block which describes the on-disk amount and position of metadata blocks.

2) Following the super block, a set of blocks is used to describe the mapping of the logical device blocks. The mapping is done per chunk of blocks, with the chunk size equal to the zone size of the zoned block device. The mapping table is indexed by chunk number and each mapping entry indicates the zone number of the device storing the chunk of data. Each mapping entry may also indicate the zone number of a conventional zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of blocks in the data zones follows the mapping table. A valid block is defined as a block that was written and not discarded. For a buffered data chunk, a block is always valid only in the data zone mapping the chunk or in the buffer zone of the chunk.
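As a rough illustration of why this layout stays compact, the sketch below estimates the number of 4096-byte metadata blocks in one metadata set for a 10 TB disk with 256 MB zones. It assumes binary units, one mapping entry per zone, and an 8-byte mapping entry (one 32-bit data zone number plus one 32-bit buffer zone number); these are assumptions made for illustration, and the function names are hypothetical.

```python
BLOCK_SIZE = 4096  # logical block size exposed by dm-zoned, in bytes


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def metadata_estimate(capacity_bytes, zone_bytes, map_entry_bytes=8):
    """Estimate the metadata blocks of one metadata set.

    map_entry_bytes=8 is an assumption (two 32-bit zone numbers per chunk).
    """
    nr_zones = capacity_bytes // zone_bytes
    blocks_per_zone = zone_bytes // BLOCK_SIZE
    # One validity bit per block, rounded up to whole 4096-byte blocks.
    bitmap_blocks_per_zone = ceil_div(blocks_per_zone, BLOCK_SIZE * 8)
    # One mapping entry per chunk; upper bound of one chunk per zone.
    map_blocks = ceil_div(nr_zones * map_entry_bytes, BLOCK_SIZE)
    # Super block + mapping table + bitmaps for every zone (upper bound).
    total_blocks = 1 + map_blocks + nr_zones * bitmap_blocks_per_zone
    return {
        "nr_zones": nr_zones,
        "map_blocks": map_blocks,
        "bitmap_blocks_per_zone": bitmap_blocks_per_zone,
        "total_blocks": total_blocks,
    }


estimate = metadata_estimate(10 * 2**40, 256 * 2**20)
print(estimate)
```

Under these assumptions the bitmaps dominate the metadata footprint, while the mapping table itself needs only a few dozen blocks.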

For a logical chunk mapped to a conventional zone, all write operations are processed by directly writing to the zone. If the mapping zone is a sequential zone, the write operation is processed directly only if the write offset within the logical chunk is equal to the write pointer offset within the sequential data zone (i.e. the write operation is aligned on the zone write pointer). Otherwise, write operations are processed indirectly using a buffer zone. In that case, an unused conventional zone is allocated and assigned to the chunk being accessed. Writing a block to the buffer zone of a chunk will automatically invalidate the same block in the sequential zone mapping the chunk. If all blocks of the sequential zone become invalid, the zone is freed and the chunk buffer zone becomes the primary zone mapping the chunk, resulting in native random write performance similar to a regular block device.
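The write path decision described above can be sketched as follows. This is a minimal model, not the kernel implementation: the Zone class and the write_path function are hypothetical names, and offsets are counted in 4096-byte blocks.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    sequential: bool        # True for a sequential zone, False for conventional
    write_pointer: int = 0  # next writable block offset (sequential zones only)


def write_path(data_zone, chunk_block):
    """Return "direct" or "buffered" for a write at block offset chunk_block."""
    if not data_zone.sequential:
        # Conventional zones accept writes at any offset.
        return "direct"
    if chunk_block == data_zone.write_pointer:
        # The write is aligned on the zone write pointer.
        return "direct"
    # Unaligned write to a sequential zone: go through a buffer zone.
    return "buffered"


print(write_path(Zone(sequential=True, write_pointer=8), 8))
print(write_path(Zone(sequential=True, write_pointer=8), 3))
```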

Read operations are processed according to the block validity information provided by the bitmaps. Valid blocks are read either from the sequential zone mapping a chunk, or if the chunk is buffered, from the buffer zone assigned. If the accessed chunk has no mapping, or the accessed blocks are invalid, the read buffer is zeroed and the read operation is terminated.
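A minimal sketch of this read logic, assuming each zone is modeled as a dictionary that maps valid block offsets to their 4096-byte contents (so dictionary membership stands in for the validity bitmap). The read_block name and the data model are illustrative assumptions.

```python
BLOCK_SIZE = 4096  # logical block size, in bytes


def read_block(block, data_zone=None, buffer_zone=None):
    """Return the content of one block of a chunk.

    data_zone and buffer_zone are dicts {block offset: bytes} holding only
    valid blocks; None means the chunk has no such zone.
    """
    if data_zone is not None and block in data_zone:
        return data_zone[block]
    if buffer_zone is not None and block in buffer_zone:
        return buffer_zone[block]
    # Unmapped chunk or invalid block: the read buffer is zeroed.
    return bytes(BLOCK_SIZE)


data = {0: b"A" * BLOCK_SIZE}
buf = {5: b"B" * BLOCK_SIZE}
print(read_block(5, data, buf)[:1], read_block(9, data, buf)[:1])
```

Because a block is valid in at most one of the two zones, the order of the two lookups does not change the result.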

After some time, the limited number of conventional zones available may be exhausted (all used to map chunks or buffer sequential zones) and unaligned writes to unbuffered chunks become impossible. To avoid this situation, a reclaim process regularly scans used conventional zones and tries to reclaim the least recently used zones by copying the valid blocks of the buffer zone to a free sequential zone. Once the copy completes, the chunk mapping is updated to point to the sequential zone and the buffer zone is freed for reuse.
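One reclaim pass can be sketched as below. The data model is a deliberate simplification and entirely hypothetical: used conventional zones are kept in a dictionary with a last-use counter, gaps between valid blocks are ignored, and a chunk is assumed to be mapped by a single zone.

```python
def reclaim_one(conv_zones, free_seq_ids, chunk_map):
    """Reclaim the least recently used conventional zone.

    conv_zones:   {zone id: {"chunk": int, "last_used": int, "blocks": dict}}
    free_seq_ids: list of free sequential zone ids
    chunk_map:    {chunk number: zone id currently mapping the chunk}
    Returns (freed zone id, target zone id, copied blocks) or None.
    """
    if not conv_zones or not free_seq_ids:
        return None
    # Pick the least recently used conventional zone as the victim.
    victim_id = min(conv_zones, key=lambda z: conv_zones[z]["last_used"])
    victim = conv_zones.pop(victim_id)
    target_id = free_seq_ids.pop(0)
    # Copy valid blocks in ascending offset order, as a sequential zone
    # can only be written sequentially.
    copied = {blk: victim["blocks"][blk] for blk in sorted(victim["blocks"])}
    # Only after the copy completes is the chunk remapped and the
    # conventional zone considered free again.
    chunk_map[victim["chunk"]] = target_id
    return victim_id, target_id, copied


zones = {
    3: {"chunk": 7, "last_used": 50, "blocks": {2: b"x", 0: b"y"}},
    4: {"chunk": 9, "last_used": 10, "blocks": {1: b"z"}},
}
mapping = {7: 3, 9: 4}
print(reclaim_one(zones, [100, 101], mapping), mapping)
```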

Metadata Protection

To protect metadata against corruption in case of sudden power loss or system crash, 2 sets of metadata zones are used. One set, the primary set, is used as the main metadata region, while the secondary set is used as a staging area. Modified metadata is first written to the secondary set and validated by updating the super block in the secondary set; a generation counter is used to indicate that this set contains the newest metadata. Once this operation completes, in-place metadata block updates can be done in the primary metadata set. This ensures that one of the sets is always consistent (all modifications committed or none at all). Flush operations are used as a commit point. Upon reception of a flush request, metadata modification activity is temporarily blocked (for both incoming BIO processing and the reclaim process) and all dirty metadata blocks are staged and updated. Normal operation is then resumed. Flushing metadata thus only temporarily delays write and discard requests. Read requests can be processed concurrently while a metadata flush is being executed.
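The ordering of a metadata flush, and why one set always remains consistent, can be modeled with the sketch below. It is an illustration under stated assumptions: each set is a dictionary with a generation counter and a block table, the crash_after parameter simulates a crash after a given step, and the recovery rule (prefer the set with the higher generation) is an assumption, not a description of the actual recovery code.

```python
def flush(primary, secondary, dirty, crash_after=None):
    """Commit dirty metadata blocks; optionally stop after a given step."""
    new_gen = primary["gen"] + 1
    secondary["blocks"].update(dirty)   # step 1: stage in the secondary set
    if crash_after == 1:
        return
    secondary["gen"] = new_gen          # step 2: validate via its super block
    if crash_after == 2:
        return
    primary["blocks"].update(dirty)     # step 3: in-place update of primary
    if crash_after == 3:
        return
    primary["gen"] = new_gen            # step 4: primary super block update


def consistent_set(primary, secondary):
    """Assumed recovery rule: use the secondary set only if it is newer."""
    return secondary if secondary["gen"] > primary["gen"] else primary


old, new = {"a": 1, "b": 2}, {"a": 10, "b": 20}
for step in (1, 2, 3, None):
    p = {"gen": 5, "blocks": dict(old)}
    s = {"gen": 5, "blocks": dict(old)}
    flush(p, s, new, crash_after=step)
    print(step, consistent_set(p, s)["blocks"])
```

In this model, a crash before step 2 leaves the old metadata in effect, and a crash at any later point yields the new metadata, so a partially updated set is never selected.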

If a regular device is used in conjunction with the zoned block device, a third set of metadata (without the zone bitmaps) is written to the start of the zoned block device. This metadata has a generation counter of ‘0’ and will never be updated during normal operation; it just serves for identification purposes. The first and second copy of the metadata are located at the start of the regular block device.

Usage

A zoned block device must first be formatted using the dmzadm tool. This will analyze the device zone configuration, determine where to place the metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

If two drives are to be used, both devices must be specified, with the regular block device as the first device.

Ex:

dmzadm --format /dev/sdxx /dev/sdyy

Formatted device(s) can also be started with the dmzadm utility:

Ex:

dmzadm --start /dev/sdxx /dev/sdyy

Information about the internal layout and current usage of the zones can be obtained with the ‘status’ callback from dmsetup:

Ex:

dmsetup status /dev/dm-X

will return a line

0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_zones> is the total number of zones, <nr_unmap_rnd> is the number of unmapped (i.e. free) random zones, <nr_rnd> the total number of random zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq> the total number of sequential zones.
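For scripting, the status line can be split into its fields with a regular expression. The sketch below assumes the output matches exactly the format shown above; the parse_status name is hypothetical and the numbers in the sample line are made up.

```python
import re

# Pattern for: 0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd>
# random <nr_unmap_seq>/<nr_seq> sequential
STATUS_RE = re.compile(
    r"^(?P<start>\d+) (?P<size>\d+) zoned (?P<nr_zones>\d+) zones "
    r"(?P<nr_unmap_rnd>\d+)/(?P<nr_rnd>\d+) random "
    r"(?P<nr_unmap_seq>\d+)/(?P<nr_seq>\d+) sequential$"
)


def parse_status(line):
    """Parse one dm-zoned status line into a dict of integers."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected status line: %r" % line)
    return {name: int(value) for name, value in match.groupdict().items()}


# Sample line with made-up values.
sample = "0 2000000 zoned 1000 zones 30/100 random 850/900 sequential"
print(parse_status(sample))
```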

Normally the reclaim process will be started once there are fewer than 50 percent free random zones. In order to start the reclaim process manually even before reaching this threshold, the ‘dmsetup message’ function can be used:

Ex:

dmsetup message /dev/dm-X 0 reclaim

will start the reclaim process and random zones will be moved to sequential zones.
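A small helper can decide whether automatic reclaim should already be running, based on the random zone counts from the status line, and build the manual trigger otherwise. Both function names are hypothetical, and the sketch only constructs the argument list of the command shown above without executing it.

```python
def automatic_reclaim_expected(nr_unmap_rnd, nr_rnd, threshold_percent=50):
    """True if the share of free random zones is below the threshold."""
    if nr_rnd == 0:
        return False
    # Integer comparison avoids rounding issues with percentages.
    return nr_unmap_rnd * 100 < threshold_percent * nr_rnd


def manual_reclaim_argv(device):
    """Argument list for: dmsetup message <device> 0 reclaim"""
    return ["dmsetup", "message", device, "0", "reclaim"]


if not automatic_reclaim_expected(nr_unmap_rnd=60, nr_rnd=100):
    # Above the threshold: reclaim would have to be requested manually.
    print(" ".join(manual_reclaim_argv("/dev/dm-X")))
```

The resulting list could be passed to subprocess.run() by a script that has the privileges required by dmsetup.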