ZFS on Linux for Scalable Storage Volumes

ZFS (the Zettabyte File System) is an enterprise file system for Unix-based operating systems, first introduced in OpenSolaris by Sun Microsystems, which was later acquired by Oracle. Initially, it was most widely used on Solaris and BSD-based operating systems, including Solaris, OpenSolaris, illumos, OpenIndiana, FreeBSD, and NetBSD, but the ZFS on Linux project has made ZFS available as a kernel module for Linux users as well.

In fact, FreeBSD decided to rebase its implementation of ZFS on the upstream code from the ZFS on Linux project (now OpenZFS). One of the main reasons ZFS has never been merged into the Linux kernel is licensing: Sun released ZFS under the CDDL, which is widely considered incompatible with the kernel’s GPL license. Fortunately, this license incompatibility doesn’t really affect end users, and you can install ZFS directly from your distribution’s package manager.

ZFS, the “last word in filesystems”

ZFS is named the Zettabyte File System because, as a 128-bit file system, it can theoretically scale to 256 quadrillion zettabytes. One zettabyte is one million petabytes. To give an idea of how enormous that is, Facebook’s data warehouse for its roughly 1 billion users held 300 petabytes in 2014. Clearly, ZFS is suitable for scale-out storage use cases, including government, scientific computing, and global-scale websites.

As an average organization, you are most likely dealing with terabytes of data or less. Even so, ZFS has other advantages besides enormous scale. The first, and most important, is data integrity. ZFS checksums every block of data, so “bit flips” caused by a hardware fault or even cosmic radiation can be automatically detected and corrected. This prevents the “bit rot” that accumulates over time on conventional file systems. ECC RAM, which comes standard on server-grade hardware, is recommended for systems running ZFS, so that data is also protected against corruption while in memory.
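The checksum verification described above can also be triggered on demand with a pool-wide “scrub.” A minimal sketch, assuming a pool named tank already exists (the pool name is just a placeholder):

```shell
# Read back every block in the pool "tank" and verify it against its
# checksum; blocks that fail are repaired automatically from a healthy
# mirror copy or from parity, if the vdev has redundancy.
zpool scrub tank

# Review the outcome; the CKSUM column counts checksum errors found.
zpool status tank
```

Many administrators schedule a scrub from cron (for example, monthly) so silent corruption is caught and repaired before a second failure makes it unrecoverable.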

As a copy-on-write (CoW) file system, ZFS also natively provides advanced features such as live snapshots and rollbacks. Because ZFS writes incremental changes to a file into new blocks instead of overwriting the original, it can instantly take a point-in-time snapshot of a dataset without initially consuming additional disk space. This is a very useful feature for system administrators who need a crash-consistent backup of a set of data at a single point in time.
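In practice, snapshots and rollbacks are one-line operations. A sketch using a hypothetical dataset tank/projects:

```shell
# Take an instant, initially zero-cost snapshot of the dataset
zfs snapshot tank/projects@before-upgrade

# List existing snapshots and the space they currently occupy
zfs list -t snapshot

# Revert the dataset to exactly how it looked at snapshot time
zfs rollback tank/projects@before-upgrade
```

The snapshot only begins to consume space as the live dataset diverges from it, since ZFS must then retain both the old and new blocks.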

Similar to LVM’s volume groups (VGs), physical volumes (PVs), and logical volumes (LVs), ZFS has a concept called storage pools (zpools), which span one or more virtual devices (vdevs) consisting of physical disks. A pool’s capacity is then presented to the operating system either as file system datasets mounted at a mount point, or as block devices called volumes (zvols).
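These layers can be sketched in three commands, using hypothetical device and pool names:

```shell
# Create a pool "tank" backed by one mirrored vdev of two disks
zpool create tank mirror /dev/sdb /dev/sdc

# Create a file system dataset inside the pool;
# by default it is mounted automatically at /tank/data
zfs create tank/data

# Create a 10 GB zvol, exposed as the block device /dev/zvol/tank/vol1
zfs create -V 10G tank/vol1
```

Note that unlike LVM, there is no separate formatting or fstab step: creating a dataset both formats and mounts it in one operation.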

Mirroring, Striped Mirroring, and RAID-Z Redundancy in ZFS

The primary modes of providing redundancy for ZFS storage are mirroring, striped mirroring, and RAID-Z.

Mirroring is the simplest solution for a vdev consisting of two disks. It is similar to RAID 1: any data written to the ZFS volume is mirrored to every disk within the vdev (1+1 for two disks). Even if you add a third disk to the vdev, the capacity of the vdev remains that of a single disk. Additional disks can only serve as further mirrors (1+1+1 with three disks), slowing writes and wasting the additional storage capacity.

Another common setup for ZFS is striped mirroring (RAID 10, or RAID 1+0), which adds read performance on top of the redundancy within each mirrored set. Half of the total installed storage is usable, and capacity can be added two disks (one mirrored pair) at a time, making RAID 10 relatively safe and scalable.

Suppose Disks 1 and 2 form one mirrored set, and Disks 3 and 4 form another. One drive can fail within each mirrored set and the data is still intact. Another benefit of RAID 10 over RAID 5 is that recovering from a failure places less stress on the remaining disks. With RAID 10, the data is simply copied from the healthy disk in the pair, instead of being rebuilt from parity information read from all of the disks, as in RAID 5 (which can lead to a cascading failure if the disks are already in poor condition).
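The four-disk layout described above maps directly onto the zpool command line: each `mirror` keyword starts a new mirrored vdev, and ZFS stripes writes across the vdevs. A sketch with hypothetical device names:

```shell
# Striped mirror (RAID 10): two mirrored pairs, striped together.
# /dev/sdb+/dev/sdc form one set, /dev/sdd+/dev/sde the other.
zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

# Grow the pool later by striping in another mirrored pair
zpool add tank mirror /dev/sdf /dev/sdg
```

After the `zpool add`, the pool has three mirrored vdevs and six disks, with the capacity of three disks usable.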

RAIDZ-1 is the ZFS implementation of RAID 5, RAIDZ-2 corresponds to RAID 6, and RAIDZ-3 (triple parity) has no standard RAID equivalent, though it is sometimes informally called RAID 7. These RAID levels use parity information to enable rebuilding of the array in case of a physical disk failure.

  • Parity: RAIDZ-1 (RAID 5) can tolerate 1 disk failure without data loss. A minimum of 3 disks is required. The storage of N-1 disks is usable.
  • Double Parity: RAIDZ-2 (RAID 6) can tolerate up to 2 disk failures without data loss. A minimum of 4 disks is required. The storage of N-2 disks is usable.
  • Triple Parity: RAIDZ-3 can tolerate up to 3 disk failures without data loss. A minimum of 5 disks is required. The storage of N-3 disks is usable.
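Creating a RAID-Z vdev follows the same pattern as a mirror, with the parity level named on the command line. A sketch with hypothetical devices (the pool name tank is a placeholder):

```shell
# RAIDZ-1 over 4 disks: the capacity of 3 disks is usable,
# and any 1 disk may fail without data loss
zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# RAIDZ-2 over 6 disks: the capacity of 4 disks is usable,
# and any 2 disks may fail without data loss
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
```

The `raidz1`/`raidz2`/`raidz3` keywords select the parity level; everything after the keyword up to the next vdev keyword belongs to that vdev.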

As the RAID-Z level increases, usable storage shrinks and performance declines, because additional parity data must be calculated, while the comparative level of data redundancy increases. If you are deploying your ZFS storage pool on a storage backing that is already inherently redundant (e.g. a cloud provider’s block volumes, or a dedicated server with hardware RAID), using the highest RAID-Z levels is probably overkill, leading to increased costs and reduced performance – without gaining much in “peace of mind.”

A RAID-Z vdev in ZFS cannot be expanded without being rebuilt, with the data migrated to a new RAID-Z array. However, it is possible to expand the capacity of a zpool by adding additional vdevs to it. This is transparent to the operating system, which simply sees the mount point exposed by the zpool with expanded capacity.
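Growing a pool by adding a vdev is a single command. A sketch, assuming an existing pool named tank that already contains one RAIDZ-2 vdev:

```shell
# Stripe a second RAIDZ-2 vdev into the existing pool; the extra
# capacity appears immediately at the pool's existing mount point
zpool add tank raidz2 /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm

# Confirm the pool now lists two raidz2 vdevs
zpool status tank
```

One caveat worth noting: `zpool add` is effectively permanent for RAID-Z vdevs, so it is worth double-checking the command before running it.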

ZFS on Linux compared to BTRFS

ZFS is often compared to another file system called Btrfs (the B-tree file system). Btrfs development began on Linux as an answer to ZFS, in large part because of the licensing concerns that keep ZFS out of the mainline kernel. Nonetheless, we believe ZFS is a better option for most enterprises looking for an enterprise file system, because it is much more mature and better documented.

If you are deploying a file sync and share solution such as NextCloud or Seafile, it makes sense to consider ZFS as a file system that will probably outlive the lifespan of the software solution itself. The developers at Sun Microsystems called ZFS the “last word in filesystems,” and we are confident that it will stand the test of time, as it already has since development began in 2001. In the world of technology, almost 20 years is virtually an eternity, and we believe ZFS, as a time-tested solution for ever-expanding data sets, will remain relevant for many decades to come.

Contact our cloud storage consultants for personalized advice on how to deploy NextCloud on-prem or with the cloud provider of your choice as an open source cloud collaboration solution for your distributed workforce.