Clustered filesystems are an essential component of scale-out, distributed systems, such as applications hosted on a Kubernetes cluster. They provide storage volumes that can be written to by multiple pods across multiple Kubernetes nodes simultaneously. Even if a storage device (e.g. an iSCSI block volume) supports multi-attach to different hosts at the same time, a cluster-aware filesystem must be layered on top to avoid file corruption when the volume is used in ReadWriteMany mode.
Many clustered storage solutions that emerged in the nascent days of Kubernetes are no longer under active development, are receiving maintenance releases only, or are otherwise reaching end of life. This includes solutions such as GlusterFS, Longhorn, and StorageOS. For this reason, many Kubernetes administrators are looking for an alternative to migrate their Kubernetes storage to. The Cloud Native Computing Foundation (CNCF) and its industry backers such as IBM and Red Hat have rallied behind the Rook storage orchestrator and the Ceph file system as the de facto standard for Kubernetes storage.
Compared to Gluster, Longhorn, or StorageOS, which were relatively lightweight and simple to administer in small clusters, Ceph is designed to scale up to exabytes of storage. In fact, Ceph is the underlying technology for block, object, and file storage at many cloud providers, especially OpenStack-based providers. For a successful Ceph deployment, it is important to carefully plan the architecture and allocate the resources (CPU, RAM, I/O, network) needed for the storage your Kubernetes workloads require.
Performance Recommendations for Architecting a Ceph Cluster
Recommended specs for a minimal Ceph cluster deployment include:
- 3 nodes in different fault domains
- 4 GB RAM and 4 vCPUs per node
- 1 GbE (or better) network connectivity between nodes
- SATA or NVMe SSD storage for the OSDs (30 IOPS/GB or better)
- A minimum of 5-10 GB for each MON's PVC
The hardware resources required will increase along with the number of nodes and OSDs (disks) in the cluster. The network connectivity between the Ceph daemons is essential to performance: on each write, data is replicated across multiple nodes, and a write is only acknowledged once all of the OSDs in the replica set (3 OSDs on different nodes for a replication factor of 3x) have persisted the write operation.
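As an illustration, the excerpt below shows how this minimal sizing might be expressed in a Rook CephCluster resource, with three MONs spread across fault domains and modest per-daemon resource requests. This is a sketch, not a reference deployment: the image tag and resource values are placeholders to be tuned for your environment and Rook version.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18   # pin to a Ceph release you have tested
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3                       # one MON per fault domain
    allowMultiplePerNode: false
  mgr:
    count: 2
  resources:                       # illustrative requests based on the sizing above
    mon:
      requests:
        cpu: "1"
        memory: 2Gi
    osd:
      requests:
        cpu: "1"
        memory: 4Gi
```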
If you have two classes of storage devices available, one higher-capacity (and more economical) and one faster, then it can be advisable for performance to place the BlueStore metadata (WAL/DB) on the faster medium and use the higher-capacity devices for the OSDs where the actual data is stored.
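In a host-based Rook cluster, this split can be expressed with the metadataDevice setting, as sketched below. The device names and filter are placeholders for your actual hardware, and the field names should be verified against your Rook version.

```yaml
spec:
  storage:
    useAllNodes: true
    useAllDevices: false
    deviceFilter: "^sd[b-c]"       # large, economical devices for OSD data
    config:
      metadataDevice: "nvme0n1"    # fast device for the BlueStore WAL/DB
```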
Ceph has become the standard for software-defined distributed storage for Kubernetes due to its incredible scalability, contributors from leading technology companies, and flexibility to be deployed in any public or private cloud environment. It is a viable open-source alternative to proprietary storage offerings from Dell EMC, NetApp, and Qumulo for SAN (block), NAS (file), and object storage.
The Rook operator for Kubernetes is the most maintainable way to deploy a new Ceph cluster, as the storage orchestrator creates the CRDs (custom resource definitions) needed for your Kubernetes pods to consume the Ceph storage through CSI drivers. It also monitors the health of your cluster, automatically rescheduling the Ceph components (mon, mgr, mds) if a node fails, and rebalancing the data if an OSD drive is replaced.
Rook Ceph can be easily deployed onto any existing Kubernetes cluster in its own namespace (by default, called rook-ceph). Even though the Ceph cluster and OSDs reside in a separate namespace from your other Kubernetes deployments, the StorageClass allows the creation of PVCs and PVs in any namespace across the cluster.
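For example, once a CephFS-backed StorageClass exists (named rook-cephfs here, following Rook's upstream examples), an application in any namespace can request a shared volume from it with an ordinary PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
  namespace: my-app            # an application namespace, not rook-ceph
spec:
  accessModes:
    - ReadWriteMany            # shared filesystem semantics via CephFS
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 100Gi
```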
PVC-Based Cluster vs. Host Storage Cluster
For our client projects residing on public cloud providers, we typically use Rook’s “cluster on PVC” pattern to consume block volumes as a service (e.g. EBS or similar) for the OSDs and MONs of the Ceph cluster through the provider’s official CSI driver. Then, we create a StorageClass which leverages the CephFS CSI driver to allow the creation of ReadWriteMany persistent volumes (PVs) that can be mounted by pods across multiple nodes at the same time. Finally, the deployments on the Kubernetes cluster can use this StorageClass to create any PVs they require through a PVC.
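The two building blocks of this pattern are sketched below: a storageClassDeviceSet in the CephCluster spec that provisions OSDs on cloud block volumes (gp3 is a placeholder for the provider's StorageClass), and a CephFS StorageClass modelled on Rook's upstream example (the filesystem, pool, and secret names may differ in your deployment).

```yaml
# Excerpt of a "cluster on PVC" CephCluster spec
spec:
  mon:
    count: 3
    volumeClaimTemplate:
      spec:
        storageClassName: gp3        # provider block storage for MON data
        resources:
          requests:
            storage: 10Gi
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3                     # one OSD per fault domain
        portable: true
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              storageClassName: gp3  # provider block storage for OSD data
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 500Gi
---
# CephFS StorageClass consumed by application PVCs
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
```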
For bare metal Kubernetes clusters, there are other patterns which can be considered, such as using storage devices directly on the host as the OSDs. The OSDs can be thought of as the “raw storage” of the Ceph cluster, where the usable storage is a fraction of that, depending on the replication factor (2x, 3x, etc.). In small clusters, 3x replication is the most common choice (1/3 usable storage), providing a balance between data durability, performance, and usable capacity. To optimize for usable storage, some deployments use 2x replication (1/2 usable storage) or erasure coding, which reduces the raw storage required to achieve a given capacity but creates a greater overhead on CPU and RAM.
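A host-based cluster consumes raw devices directly, and the replication factor is set on the Ceph pools themselves. The sketch below, based on Rook's example manifests, shows the host storage selection alongside a CephFilesystem with 3x replication; with three nodes of 1 TB raw capacity each, roughly 1 TB would be usable.

```yaml
# Excerpt of a host-based CephCluster storage section
spec:
  storage:
    useAllNodes: true
    useAllDevices: true        # consume unused raw devices found on each node
---
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3                  # 3x replication for CephFS metadata
  dataPools:
    - name: replicated
      replicated:
        size: 3                # 3x replication: usable capacity = raw / 3
  metadataServer:
    activeCount: 1
    activeStandby: true
```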
Rook Ceph Encryption At-Rest & Over-the-Wire
For security and compliance, Rook-Ceph supports various encryption options.
The best encryption method for ReadWriteMany filesystems using the CephFS CSI driver is OSD encryption, where the underlying storage is encrypted at rest using LUKS and dm-crypt. The keys (one per OSD) are stored as Kubernetes Secrets (and can optionally be managed by your KMS). For this option, encrypted: true should be set when deploying the deviceSet for the OSD storage.
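A minimal sketch of this setting on a PVC-based cluster is shown below (for host-based clusters, the analogous setting is encryptedDevice: "true" under the storage config; check the exact key against your Rook version). The StorageClass name and size are placeholders.

```yaml
spec:
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3
        encrypted: true              # OSDs are created on LUKS/dm-crypt devices
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              storageClassName: gp3  # placeholder for your block StorageClass
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 500Gi
```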
All storage (block, object, or file) created on encrypted OSDs will be automatically encrypted, and for performance reasons, it is not advisable to double-encrypt the data using any other encryption method.
If you only plan to create ReadWriteOnce block volumes on your Ceph cluster using the RBD CSI driver, then RBD encryption is another available method, and it gives you the option of selectively encrypting some volumes but not others. This is achieved by creating a separate StorageClass for encrypted RBD volumes that references a passphrase stored in a Kubernetes Secret.
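A hedged sketch of such a StorageClass follows. The encrypted and encryptionKMSID parameters belong to the Ceph-CSI RBD driver; the KMS ID must match an entry in the Ceph-CSI KMS configuration, which in the simplest case points to a passphrase held in a Kubernetes Secret. Consult the Rook/Ceph-CSI documentation for the exact KMS wiring in your version; the KMS entry name and pool below are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block-encrypted
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  encrypted: "true"
  encryptionKMSID: "user-secret-metadata"   # assumed name of a KMS config entry
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
```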
The communication between Ceph nodes can also be encrypted over-the-wire, which is particularly helpful to meet certain compliance requirements, or if the Ceph cluster is not operating in a 100% trusted network environment. This is accomplished by setting encryption: enabled for the network when deploying the Ceph cluster.
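In the CephCluster spec, this corresponds to the msgr2 connection settings sketched below (over-the-wire encryption requires a Ceph release with msgr2 and, for kernel-mounted clients, a kernel that supports it):

```yaml
spec:
  network:
    connections:
      encryption:
        enabled: true      # encrypt traffic between Ceph daemons and clients
      compression:
        enabled: false     # optional; can be enabled independently
```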
Co-located vs. Disaggregated Ceph Cluster
One of the key design decisions when deploying a Ceph cluster using Rook is whether to go with a co-located or disaggregated Ceph cluster. A co-located cluster is where Ceph is situated on the same Kubernetes nodes where other applications are deployed. A disaggregated cluster resides on Kubernetes nodes that are exclusively used for Ceph.
The scheduling of Ceph cluster components onto the appropriate nodes can be accomplished using Kubernetes taints and tolerations (taints are applied to nodes, while tolerations are set on pods), usually combined with node labels and affinity rules.
A co-located architecture is more suitable for smaller clusters with fewer nodes, where compute & storage are often hyperconverged onto the same Kubernetes nodes; it reduces the overall number of servers to manage. A disaggregated architecture is more advantageous for larger clusters, where the resource consumption of Ceph and other applications needs to be clearly separated.
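For a disaggregated layout, a common approach is to taint and label the dedicated storage nodes, then configure the CephCluster placement to tolerate the taint and require those nodes. The sketch below uses illustrative taint and label names:

```yaml
# Assumed node preparation (illustrative names):
#   kubectl taint nodes storage-node-1 storage-node=true:NoSchedule
#   kubectl label nodes storage-node-1 role=storage-node
spec:
  placement:
    all:                             # applies to mon, mgr, osd, mds, etc.
      tolerations:
        - key: storage-node
          operator: Exists
          effect: NoSchedule
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values:
                    - storage-node
```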
When to use CephFS vs. Managed File Storage
Whether deploying Rook and CephFS on your own or using a managed file store from your cloud provider depends on your requirements. Given a 1 GbE or 10 GbE network on your Kubernetes cluster, and sufficient IOPS for your Ceph MON, BlueStore, and OSD volumes, the performance of Ceph file systems can easily reach hundreds of MB/s. This is far superior to the NFS v3/v4 or SMB/CIFS based solutions provided by Amazon EFS, Azure Files, or Google Cloud Filestore.
Using a proprietary solution such as enterprise file storage based on NetApp Cloud Volumes ONTAP is also an option, but some NetApp-based offerings do not support at-rest encryption, particularly with your own encryption key. Also, the NetApp volumes must still be mounted using the Kubernetes NFS CSI driver, running into performance bottlenecks similar to those of EFS.
Deploying your own Ceph cluster on Kubernetes circumvents many of the performance- and security-related limitations of managed file shares. It also prevents cloud lock-in, as Rook and Ceph can be deployed on any commodity hardware, providing portability. In the event of a future cloud migration (to a different cloud, or to your own datacenter), the data for the MONs, BlueStore, and OSDs can be migrated to comparable PVCs.
Example Use Case – Ceph storage for Nextcloud
One of the applications which we regularly support deploying on Kubernetes is Nextcloud, a leading open source cloud storage & groupware suite. By default, the Nextcloud Helm chart provided by the community assumes that you are using a ReadWriteOnce volume for persistence, limiting the number of replicas to 1. For a scale-out deployment this is clearly insufficient, as it is recommended to have at least 3 replicas behind a load-balanced Ingress for a highly available setup.
Swapping out StorageClasses that can only support ReadWriteOnce, such as block volumes like EBS, for a CephFS StorageClass that supports ReadWriteMany enables Nextcloud to be scaled out on Kubernetes.
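With the community Helm chart, this typically amounts to values like the following. The key names reflect the chart at the time of writing and may differ between chart versions; rook-cephfs is the example StorageClass name used earlier.

```yaml
replicaCount: 3                  # multiple Nextcloud pods behind the Ingress
persistence:
  enabled: true
  storageClass: rook-cephfs      # CephFS-backed StorageClass
  accessMode: ReadWriteMany      # shared data directory across all replicas
  size: 200Gi
```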
Ceph OSD encryption vs. Nextcloud server-side encryption
Leveraging the at-rest encryption provided by Ceph on its OSDs is also preferable to using Nextcloud’s server-side encryption at the application level. The “default encryption module” built into Nextcloud has been known to lock users out following an upgrade between certain server versions. This was especially the case when “user key” (as opposed to “master key”) encryption was enabled, which derives the encryption key for each user’s share from their account password. Incompatibilities in the way the key was derived between different versions caused users to lose access to their data or their external shares.
For this reason, we do not recommend the use of Nextcloud’s built-in encryption feature.
With CephFS, the PVC for the Nextcloud data directory is presented to Nextcloud as an ordinary filesystem, while encryption & decryption occur transparently behind the scenes using LUKS and dm-crypt. The keys can be backed up by the administrator from the Kubernetes Secrets, and can be further protected using a KMS for key rotation.
File vs. Object Storage as Nextcloud Primary Storage
Another feature which the maintainers of Nextcloud have been heavily pushing to customers deploying Nextcloud on Kubernetes is “object storage as primary storage”, which allows using an S3-compatible bucket (such as one provided by MinIO) or OpenStack Swift as Nextcloud’s storage backend.
The issue with Nextcloud’s object storage backend is that it places the appdata folder containing the application’s static assets (CSS, JS, images) into the object storage along with the data uploaded by users. This creates a significant performance bottleneck and unexpected timeouts, especially on the initial load of the Nextcloud dashboard when these assets are not already cached in the browser.
Many object storage services, such as AWS S3, impose rate limits and charge per 1,000 API requests, leaving Nextcloud users who enable “object storage as primary storage” with an unpredictable cloud bill, as the application continually calls the S3 API for directory listings and thumbnail generation.
Furthermore, the filenames of the data uploaded to an instance with object storage enabled are obfuscated by Nextcloud into numbered urn:oid objects. This makes it difficult to browse and back up the stored data directly without complex queries to cross-reference the Nextcloud database for the original file name of each object. If the state of the database becomes inconsistent, it can become impossible to match a urn:oid object with the file originally uploaded into Nextcloud.
For enterprise-grade deployments of Nextcloud, we recommend using CephFS or a comparable clustered filesystem instead of object storage. The performance is many-fold better than an object storage-based solution, and the data can easily be backed up using traditional backup tools.
Integrating Rook Ceph with Kubernetes
Does all the above sound complicated? It doesn’t have to be.
Whether you are integrating Ceph with Nextcloud, or require Ceph-based storage for any other Kubernetes applications, our vendor-agnostic Kubernetes & Ceph consultants are eager to assist your organization with:
- Architectural planning for Ceph cluster & block, file, object storage requirements
- Deploying Rook operator & Ceph cluster – on any Kubernetes provider of choice
- Operating an existing Ceph cluster (adding nodes or OSDs, key rotation, etc.)
- Backing up or migrating data to/from CephFS volumes
Learn more about Ceph implementation at the major cloud providers, your own OpenStack cloud, or on bare metal – and how our consultants can help.