A Primer On Hedvig Persistent Volumes for Containers – Part III (Container Data Mover)

By Srividhya Kavaipatti Anantharamakrishnan and Abhijith Shenoy

In the second blog of our three-part series on persistent volumes for containers, we highlighted the importance of scheduled snapshots to drive continuous data protection by offering capabilities for fast application recovery in the event of errors or failures. In this final blog, we’ll present a unique capability built on top of snapshots that enables automated container data migration between Hedvig clusters across on-premises and disparate public cloud environments.

Seldom do organizations run a single large Kubernetes cluster. Each group within an organization might have its preferred choice of Kubernetes distribution as well as a preferred location (either on-premises or in the cloud, or both) for persistent application data. Even though Hedvig provides a single distributed fabric that can span multiple on-premises and cloud sites, different groups might choose to isolate their data (for reasons such as compliance and risk mitigation) within different Hedvig clusters, essentially creating a one-to-one mapping between a Kubernetes cluster and a Hedvig cluster.

A practical use case for isolating application data is test and dev environments. Developers deploy their code and test any newly implemented features in the dev environment. Once the code is stable, the application is deployed in the test/staging environment for further testing. Any defects found in the testing phase are reported back to the developers, and this cycle continues until the code passes the testing phase. Enabling data migration between test and dev environments can significantly shorten these test/dev cycles, reducing the time taken to triage and resolve application issues.

Intelligent and fast data migration

The most efficient mechanism to keep a dataset in sync across source and target locations is to constantly keep track of changes in the dataset at the source and periodically sync these changes to the target location. The commonly used phrase to describe this mechanism is “Change Block Tracking” or CBT. Change Block Tracking is an age-old technique that is widely used as an incremental backup technology for a large variety of datasets.
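To make the idea concrete, here is a minimal sketch of change block tracking in general, not of Hedvig's implementation. The `TrackedVolume` class and the `sync` function are hypothetical names chosen for illustration. The volume records the index of every block that is written, and a sync copies only those blocks to the target.

```python
class TrackedVolume:
    """A toy block device that remembers which blocks changed since the last sync."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.dirty = set()  # indices of blocks written since the last sync

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.blocks[index] = data
        self.dirty.add(index)


def sync(source, target):
    """Copy only the dirty blocks to the target and return the indices copied."""
    copied = sorted(source.dirty)
    for index in copied:
        target[index] = source.blocks[index]
    source.dirty.clear()
    return copied


source = TrackedVolume(num_blocks=8)
target = list(source.blocks)  # the target starts out as a full copy

source.write(2, b"aaaa")
source.write(5, b"bbbb")
source.write(2, b"cccc")  # rewriting a block does not add a second entry

first_sync = sync(source, target)   # copies blocks 2 and 5 only
second_sync = sync(source, target)  # nothing changed in between, so nothing is copied
```

The cost of each sync is proportional to the number of changed blocks rather than to the size of the volume, which is what makes the technique attractive for incremental transfers.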

So how is Hedvig’s data migration technology any different? Yes, we still use CBT, but what differentiates us from the rest is the intelligence built into the Hedvig Distributed Storage Platform that leverages kernel-to-kernel copies, providing a fast data transfer channel between Hedvig clusters.

Data migration is orchestrated through snapshots, and we constantly keep track of the data blocks that change in every version of the Hedvig volume. When a migration job is initiated, we can identify the exact files containing the changed blocks at the source and stream these files to the target, which keeps the transfer fast. Data migration does not involve individual data block reads or writes, which means that the data is available for consumption at the target location as soon as the file streaming is complete.
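The file selection step can be pictured with a small sketch. The on-disk layout of a Hedvig volume is not described here, so the `file_versions` mapping (file name to the set of volume versions in which that file received writes) and the `files_to_stream` helper are assumptions made purely for illustration.

```python
def files_to_stream(file_versions, last_migrated_version):
    """Return the files written in any version newer than the last migrated one."""
    return sorted(
        name
        for name, versions in file_versions.items()
        if any(v > last_migrated_version for v in versions)
    )


# Hypothetical bookkeeping: which volume versions touched which backing file.
file_versions = {
    "data-0001": {1},
    "data-0002": {1, 3},
    "data-0003": {2},
    "data-0004": {4},
}

# Versions 1 and 2 have already reached the target, so only files touched
# in versions 3 and 4 need to be streamed.
pending = files_to_stream(file_versions, last_migrated_version=2)
```

In this sketch whole files are selected rather than individual blocks, mirroring the point above that the transfer streams files instead of replaying block-level reads and writes.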

A deep dive into data migration

Before we present the policy-driven approach for migrating stateful containerized applications, it is necessary to understand what transpires behind the scenes as the data moves from the source location to the target location. Let’s illustrate that with the help of an example.

This diagram shows two disparate setups:

  • Test setup – This setup consists of a Kubernetes cluster (Amazon EKS) along with a Hedvig cluster running completely within an AWS region. The application data resides (within AWS) on persistent volumes provisioned from the Hedvig cluster.
  • Dev setup – This setup consists of a Kubernetes cluster (Azure AKS) along with a Hedvig cluster running completely within an Azure region. The application data residing on the Hedvig cluster in AWS is migrated to the Hedvig cluster in Azure and consumed as persistent volumes in the AKS cluster.

The Hedvig Distributed Storage Platform leverages a novel distributed barrier abstraction to implement a state machine for data migration. This process involves the following steps:

  • The data migration job is initiated on a coordinator node on the Test cluster (source).
  • The coordinator node then gets the latest state information from all nodes involved in the migration.
  • Data migration to the Dev cluster (target) happens in a distributed manner with each replica node updating its current state to the barrier as the migration progresses.
  • In case of recoverable errors/failures, barriers enable replica nodes to perform smart retries.
  • In case of irrecoverable failures, such as a network partition, the coordinator node decides the outcome of the migration by reviewing the state machine view.
  • While migration is in progress for a given source volume, the corresponding target volume is frozen and cannot be consumed.
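The steps above can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Barrier` class, the `ReplicaState` values, the retry budget, and the rule that the coordinator commits only when every replica is done are all hypothetical, since the internal protocol is not described here.

```python
from enum import Enum


class ReplicaState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"


class Barrier:
    """Shared view of per-replica migration state, reviewed by the coordinator."""

    def __init__(self, replicas, max_retries=2):
        self.states = {r: ReplicaState.PENDING for r in replicas}
        self.retries = {r: 0 for r in replicas}
        self.max_retries = max_retries  # assumed retry budget per replica

    def update(self, replica, state):
        self.states[replica] = state

    def report_error(self, replica):
        """Record an error; return True if the replica should retry."""
        if self.retries[replica] < self.max_retries:
            self.retries[replica] += 1
            self.states[replica] = ReplicaState.IN_PROGRESS
            return True
        self.states[replica] = ReplicaState.FAILED
        return False

    def outcome(self):
        """Coordinator decision, assuming all replicas must finish to commit."""
        values = set(self.states.values())
        if values == {ReplicaState.DONE}:
            return "committed"
        if ReplicaState.FAILED in values:
            return "aborted"
        return "in_progress"


barrier = Barrier(["node-a", "node-b", "node-c"])
for node in ("node-a", "node-b", "node-c"):
    barrier.update(node, ReplicaState.IN_PROGRESS)

barrier.update("node-a", ReplicaState.DONE)
retry_ok = barrier.report_error("node-b")  # recoverable: node-b retries
midway = barrier.outcome()

barrier.update("node-b", ReplicaState.DONE)
barrier.update("node-c", ReplicaState.DONE)
final = barrier.outcome()
```

Keeping the per-replica state in one shared structure is what lets a coordinator reach a decision even when some replicas stop reporting, because it only needs to inspect the last state each replica recorded.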

When the migration is complete, migrated volumes contain the most recent point-in-time snapshot received from the corresponding source volumes. Applications can then consume the migrated data as clones of migrated volumes on the Dev cluster (target).

Policy-driven container data mover

In this section, we’ll describe how data migration can be seamlessly enabled through policies assigned to Kubernetes constructs. Snapshot schedules provided through the Hedvig CSI driver have been enhanced to allow users to configure data migration based on the snapshot retention period. The data migration workflow for CSI volumes is as follows:

Create a migration location

A migration location is implemented as a CRD (CustomResourceDefinition) and is managed by the Hedvig CSI driver. A migration location can be created on the source Kubernetes cluster by specifying the name of the target Hedvig cluster and its seeds. The following is an example:

Create a snapshot schedule

This example creates an interval schedule that creates a new snapshot every hour and retains it for two hours.
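To make the timing of such a schedule concrete, here is a small sketch that computes when each snapshot is created and when it expires. It is not the Hedvig schedule definition itself; the `snapshot_timeline` helper and the start time are hypothetical and serve only to illustrate an hourly interval with a two-hour retention.

```python
from datetime import datetime, timedelta


def snapshot_timeline(start, interval, retention, count):
    """Return (created_at, expires_at) pairs for `count` consecutive snapshots."""
    return [
        (start + i * interval, start + i * interval + retention)
        for i in range(count)
    ]


start = datetime(2020, 1, 1, 0, 0)  # arbitrary start time for the illustration
timeline = snapshot_timeline(
    start,
    interval=timedelta(hours=1),
    retention=timedelta(hours=2),
    count=3,
)

for created, expires in timeline:
    print(created.strftime("%H:%M"), "->", expires.strftime("%H:%M"))
# 00:00 -> 02:00
# 01:00 -> 03:00
# 02:00 -> 04:00
```

Once the schedule has been running for two hours, one snapshot expires every hour, and at any moment at most two unexpired snapshots exist.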

Create a storage class

After the migration location(s) and snapshot schedule have been created, create a new storage class by setting the following parameters:

  • migrationEnable – Set to “true”
  • migrationLocations – Comma-separated list of one or more migration location names
  • schedulePolicy – Snapshot schedule name

An example:

Any persistent volume provisioned using this storage class will have migration enabled. Based on the snapshot schedule associated with the storage class, a new snapshot will be created for the persistent volume every hour, and when that snapshot expires (two hours after it was created), the changed data blocks will be migrated to the target cluster.
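As a rough sketch of what such a storage class could look like, the snippet below builds a standard Kubernetes StorageClass manifest and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. Only the three parameter names come from the list above. The storage class name, the migration location names, the schedule name, and especially the provisioner string are placeholders, so substitute the values from your own Hedvig CSI driver deployment.

```python
import json


def storage_class_manifest(name, provisioner, migration_locations, schedule_policy):
    """Build a StorageClass manifest with migration enabled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # StorageClass parameters are always strings, including the boolean flag.
        "parameters": {
            "migrationEnable": "true",
            "migrationLocations": ",".join(migration_locations),
            "schedulePolicy": schedule_policy,
        },
    }


manifest = storage_class_manifest(
    name="sc-with-migration",                           # hypothetical name
    provisioner="REPLACE-WITH-HEDVIG-CSI-DRIVER-NAME",  # placeholder, not the real driver name
    migration_locations=["loc-a", "loc-b"],             # hypothetical migration location names
    schedule_policy="hourly-snapshots",                 # hypothetical snapshot schedule name
)

print(json.dumps(manifest, indent=2))
```

Listing more than one migration location yields a single comma-separated string, matching the format described for the `migrationLocations` parameter.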

To summarize, with all these capabilities, Hedvig brings to the market a comprehensive storage infrastructure platform deeply integrated into the Kubernetes ecosystem. This concludes our three-part series on persistent volumes for containers. Be sure to learn more about Hedvig and our use cases.