A Primer on Hedvig Persistent Volumes For Containers – Part II (Snapshots And Clones)

By Srividhya Kavaipatti Anantharamakrishnan and Abhijith Shenoy

In part one, we showcased the capabilities of the Hedvig CSI Driver to support complete storage lifecycle management for stateful container workflows. In this post, we take an in-depth look at Hedvig snapshots and clones, the benefits of snapshots, and how they are seamlessly integrated into container orchestrators through the Hedvig CSI Driver.

With the growing adoption of Kubernetes within enterprise organizations, the cloud has become the destination of choice not only for modern applications but also for legacy ones. With a cloud-first strategy, an organization’s data can be spread across multiple on-premises and/or cloud sites. Without a uniform data protection scheme, continuous data protection across such disparate sites can pose a significant challenge.

With a single storage fabric that spans multiple sites, declarative data placement policies, and built-in snapshot capabilities, the Hedvig Distributed Storage Platform provides a uniform, location-transparent scheme for protecting organizational data.

Continuous data protection using snapshots

A snapshot can be defined as the state of a storage volume captured at a given point in time. Persistent point-in-time states of volumes provide a fast recovery mechanism in the event of failures, because a volume can be restored from a known working point. This capability is particularly beneficial in the following scenarios:

  • Faster recovery from accidental updates to critical datasets (such as databases)
  • Protection against ransomware attacks
  • Space-efficient alternative to full backups

Hedvig Snapshots

Hedvig volume snapshots are space-efficient, metadata-based, zero-copy snapshots. Every newly created Hedvig volume has a version number and a version tree associated with it. The version number starts at “1” and is incremented on every successful snapshot operation, along with an update to the version tree. Every block of data written is versioned with the version number associated with the Hedvig volume at the time of the corresponding write operation.

Let’s take ransomware attacks as an example to understand how Hedvig snapshots provide data protection. Consider the following sequence of events:

  • A Hedvig volume is provisioned for application data at time t1 (version number: 1)
  • A periodic snapshot is triggered at time t2 (version number: 2)
  • A periodic snapshot is triggered at time t3 (version number: 3)
  • Ransomware attack on application at time t4

At this point, any new writes that happen as part of the ransomware attack are recorded with version number 3. Reverting the Hedvig volume to the previous version (2), which is the state captured by the snapshot at t3, recovers the application instantly.

The process of reverting a Hedvig volume to an earlier version does not depend on the size of the volume or the amount of data it contains. No volume data needs to be copied during either the snapshot or the revert operation, resulting in a data protection scheme that is simple, fast, and inexpensive from an operational point of view.
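The versioning scheme described above can be illustrated with a small toy model. This is a sketch under our own assumptions, not Hedvig's implementation: the class name VersionedVolume, the in-memory dictionaries, and in particular the choice to model a revert as a new branch in the version tree are all hypothetical. It only shows why neither a snapshot nor a revert needs to copy any block data.

```python
class VersionedVolume:
    """Toy model: every write is tagged with the volume's current version."""

    def __init__(self):
        self.version = 1          # a new volume starts at version 1
        self.parent = {1: None}   # version tree: version -> parent version
        self.blocks = {}          # block id -> {version: data}
        self._next = 2            # next unused version number

    def _branch_from(self, parent_version):
        # Metadata-only update: add a node to the version tree.
        self.parent[self._next] = parent_version
        self.version = self._next
        self._next += 1

    def _lineage(self):
        # Walk from the current version back to the root of the tree.
        v = self.version
        while v is not None:
            yield v
            v = self.parent[v]

    def write(self, block_id, data):
        self.blocks.setdefault(block_id, {})[self.version] = data

    def read(self, block_id):
        # Return the newest data visible from the current version.
        versions = self.blocks.get(block_id, {})
        for v in self._lineage():
            if v in versions:
                return versions[v]
        return None

    def snapshot(self):
        # Freeze the current version and continue writing in a new one.
        frozen = self.version
        self._branch_from(frozen)
        return frozen

    def revert(self, to_version):
        # Assumption: a revert starts a new branch below an older version.
        if to_version not in self.parent:
            raise ValueError(f"unknown version {to_version}")
        self._branch_from(to_version)


demo = VersionedVolume()              # t1: volume provisioned, version 1
demo.write("orders", "clean-v1")
demo.snapshot()                       # t2: version becomes 2
demo.write("orders", "clean-v2")
demo.snapshot()                       # t3: version becomes 3
demo.write("orders", "ENCRYPTED")     # t4: attack writes land in version 3
demo.revert(2)                        # version 3 drops out of the lineage
print(demo.read("orders"))            # clean-v2
```

Note that both snapshot() and revert() touch only the version tree, so their cost in this model is independent of how many blocks the volume holds.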

Scheduled snapshots and SLAs

Data protection schemes for application workloads are defined in terms of Service Level Agreements (SLAs). At the very minimum, an SLA is defined by specifying the following:

  • A unique name
  • Periodicity of snapshots
  • Retention period for every successful snapshot
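As a rough illustration, the three fields above map naturally onto a small data structure. The sketch below is hypothetical (the class name SnapshotSLA and its methods are ours, not part of any Hedvig API) and shows how periodicity decides when a snapshot is due and how the retention period decides which snapshots can be removed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SnapshotSLA:
    name: str             # unique name
    period: timedelta     # periodicity of snapshots
    retention: timedelta  # how long each successful snapshot is kept

    def __post_init__(self):
        if self.period <= timedelta(0) or self.retention <= timedelta(0):
            raise ValueError("period and retention must be positive")

    def is_due(self, last_snapshot, now):
        """True if no snapshot exists yet or a full period has elapsed."""
        return last_snapshot is None or now - last_snapshot >= self.period

    def expired(self, snapshot_times, now):
        """Return the snapshot timestamps whose retention period has passed."""
        return [t for t in snapshot_times if now - t >= self.retention]


# Hypothetical SLA: one snapshot per hour, each kept for seven days.
sla = SnapshotSLA("hourly-7d", timedelta(hours=1), timedelta(days=7))
now = datetime(2020, 3, 8, 12, 0)
taken = [now - timedelta(days=8), now - timedelta(minutes=30)]
print(sla.is_due(taken[-1], now))   # False, only 30 minutes have elapsed
print(sla.expired(taken, now))      # only the eight-day-old snapshot
```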

SLAs are set to align with your organization’s business needs with an inherent focus on business continuity. More specifically, SLAs are created to fulfill compliance rules for organizational data. As more applications move to the cloud, SLAs are also created to meet the application need for continuous delivery.

As an organization’s data grows, so does the number of SLAs, and the manual process of creating and updating them can become a deal-breaker. It is therefore important to have a policy-driven method of managing SLAs that offers create-and-forget semantics, so that new data is protected automatically.

Data protection for containerized applications

In this section, we string together the concepts presented thus far and demonstrate how they can be applied to protect containerized applications. The Hedvig CSI driver gives users the ability to create on-demand snapshots as well as automated, scheduled snapshots of stateful containerized applications. Snapshot management through the Hedvig CSI driver is completely policy-driven, which extends automation all the way to the data layer. Let’s take a look at how this is done.

On-demand Snapshots

The workflow for on-demand snapshots is implemented as follows:

Create a VolumeSnapshotClass for creating snapshots of persistent volumes.

Create a VolumeSnapshot of an existing persistent volume using this class.

Use the volume snapshot to create a new PersistentVolumeClaim.
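The three steps above can be sketched with the standard Kubernetes snapshot objects. The Python below only builds the manifests and prints them as JSON, which kubectl apply -f accepts. Several things here are assumptions rather than facts from this post: the driver name io.hedvig.csi, the snapshot.storage.k8s.io/v1 API version (older clusters expose alpha or beta versions with slightly different fields), and every object name and the storage class name, which are purely illustrative.

```python
import json

SNAPSHOT_API = "snapshot.storage.k8s.io/v1"  # assumed; check your cluster


def snapshot_class(name, driver):
    """Step 1: a VolumeSnapshotClass bound to a CSI driver."""
    return {
        "apiVersion": SNAPSHOT_API,
        "kind": "VolumeSnapshotClass",
        "metadata": {"name": name},
        "driver": driver,
        "deletionPolicy": "Delete",
    }


def volume_snapshot(name, class_name, source_pvc):
    """Step 2: a VolumeSnapshot of an existing PersistentVolumeClaim."""
    return {
        "apiVersion": SNAPSHOT_API,
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": class_name,
            "source": {"persistentVolumeClaimName": source_pvc},
        },
    }


def pvc_from_snapshot(name, snapshot_name, storage_class, size):
    """Step 3: a new PersistentVolumeClaim populated from the snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# All names below are hypothetical placeholders.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [
        snapshot_class("hedvig-snapclass", "io.hedvig.csi"),
        volume_snapshot("app-data-snap", "hedvig-snapclass", "app-data"),
        pvc_from_snapshot("app-data-restore", "app-data-snap",
                          "hedvig-sc", "1Gi"),
    ],
}
print(json.dumps(manifests, indent=2))
```

The requested size of the new claim should be at least the size of the volume the snapshot was taken from.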

Scheduled Snapshots

While the ability to create on-demand snapshots is an important feature, it is not a feasible option for managing a large-scale production container ecosystem. With scheduled snapshots, users can create snapshot schedules for their persistent volumes, and the built-in snapshot scheduler of the Hedvig CSI driver takes consistent snapshots as per the specified SLA.

Neither Kubernetes nor the CSI spec provides a native type for creating snapshot schedules. Snapshot schedules are therefore implemented as a CRD (CustomResourceDefinition), which is created by the Hedvig CSI driver. After the CSI driver has been deployed, a user can create snapshot schedules by specifying the periodicity and the retention period as follows:

This example creates a simple interval schedule that creates a new snapshot every minute and deletes the snapshot after two minutes. Snapshot schedules can be easily customized to meet the application needs.

After a snapshot schedule has been created, create a new storage class with the snapshot schedule. This will ensure that any new persistent volume provisioned using this storage class will be protected as per the snapshot schedule.

After the storage class has been created, create a persistent volume claim (PVC).

Based on the associated snapshot schedule, you should now see snapshots created for this persistent volume claim every minute. To list the snapshots, run the following command:

Use the snapshot AGE column to verify that the snapshots are deleted as per the retention period.
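The same check can be automated. The sketch below assumes the snapshot list has been exported as JSON (for example with kubectl get volumesnapshot -o json) and loaded into a dictionary. It relies only on the standard metadata.name and metadata.creationTimestamp fields. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(stamp):
    """Parse a Kubernetes UTC timestamp such as 2020-03-01T10:00:00Z."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def overdue_snapshots(snapshot_list, retention, now):
    """Return the names of snapshots older than the retention period."""
    overdue = []
    for item in snapshot_list["items"]:
        meta = item["metadata"]
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age > retention:
            overdue.append(meta["name"])
    return overdue


# Hypothetical listing: one fresh snapshot and one that should be gone.
sample = {
    "items": [
        {"metadata": {"name": "snap-new",
                      "creationTimestamp": "2020-03-01T10:04:30Z"}},
        {"metadata": {"name": "snap-old",
                      "creationTimestamp": "2020-03-01T10:01:00Z"}},
    ]
}
now = datetime(2020, 3, 1, 10, 5, 0, tzinfo=timezone.utc)
print(overdue_snapshots(sample, timedelta(minutes=2), now))  # ['snap-old']
```

Deletion is unlikely to be instantaneous, so in practice it may be reasonable to add some slack to the retention value before treating a snapshot as overdue.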

Protecting primary application data and recovering from errors or failures can be a complex and time-consuming task for organizations. By leveraging snapshots for continuous data protection, Hedvig alleviates this problem with a fast mechanism for restoring applications to a stable state, thereby significantly reducing application downtime.

Stay tuned for our next blog post on how we push the envelope further with seamless container data migration between test/dev environments across on-premises and disparate public cloud environments.