By Jonathan Howard
A number of companies use snapshots for data protection. As Curtis Preston points out in his blog series, snapshots as part of the storage platform can provide a critical aspect of data protection for just about every organization today. The great thing about today’s storage technology is that many storage offerings support snapshots freely. As recently as five years ago, some storage couldn’t use snapshots at scale. One vendor had written in its best practices that if snapshots were to be utilized, the recommendation was to use one transient snapshot and delete it immediately after use. This wasn’t one snapshot per volume; this was one snapshot for the entire array. Times have changed, and the technologies are much improved. However, not all storage platforms are created equal. It’s important to understand the use cases and where they fit when looking at snapshots as part of data protection.
A single data protection operation can include multiple types of snapshots, all working in concert, to provide the desired outcome. A backup for VMware virtual machines can include three different snapshot operations to ensure proper data and application consistency: a VSS quiesce inside the virtual machine itself (which can be referred to as a VSS snapshot), a software snapshot inside VMware, and then a hardware snapshot as part of the storage subsystem. So while storage snapshots are a critical component, they are also only one part of a complete solution. That complete solution will address the data and infrastructure orchestration on multiple levels. Commvault leads the way in hardware snapshot integration, with more than 275 different array models qualified for our IntelliSnap technology. We provide a consistent experience for our storage experts and end users alike, and bring a unique viewpoint to this discussion.
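To make the layering concrete, here is a minimal Python sketch of how the three snapshot operations might be sequenced. Every function name here is a hypothetical stand-in that only records what it would do; none of them is a Commvault, VMware or array API, and the ordering shown (quiesce, software snapshot, release the guest, hardware snapshot, clean up) is an assumption made for illustration.

```python
def quiesce_guest(vm, log):
    # Hypothetical stand-in for a VSS quiesce inside the virtual machine.
    log.append(f"quiesce {vm}")


def thaw_guest(vm, log):
    # Hypothetical stand-in for releasing the quiesced application.
    log.append(f"thaw {vm}")


def create_vm_snapshot(vm, log):
    # Hypothetical stand-in for the software snapshot inside VMware.
    log.append(f"vm-snapshot create {vm}")


def remove_vm_snapshot(vm, log):
    # Hypothetical stand-in for removing the software snapshot afterwards.
    log.append(f"vm-snapshot remove {vm}")


def take_array_snapshot(volume, log):
    # Hypothetical stand-in for the hardware snapshot on the storage array.
    log.append(f"array-snapshot {volume}")


def protect(vm, volume):
    """Run the three layers in an assumed order and return the event log."""
    log = []
    quiesce_guest(vm, log)
    try:
        create_vm_snapshot(vm, log)
    finally:
        # Always release the guest, even if the software snapshot fails.
        thaw_guest(vm, log)
    try:
        take_array_snapshot(volume, log)
    finally:
        # Always clean up the software snapshot once the array step is done.
        remove_vm_snapshot(vm, log)
    return log


events = protect("sql-vm-01", "datastore-lun-7")
print("\n".join(events))
```

The point of the sketch is the coordination: each layer has to be entered and released in a known order, including when a step fails, which is the part an orchestration layer takes on.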
Commvault understands the delicate balance between a snapshot and an application-consistent snapshot, as data protection operations and snapshots are not freely interchangeable. To prove this point, we performed a set of tests with one of our storage partners in its labs. One configuration was protected with only the array's native snapshots, and the other with Commvault providing the orchestration on top of those native snapshots. The first series of tests was on a SQL database, the second on 10 VMs. On both occasions we ensured that there was data in flight at the time of each snapshot operation, and we re-ran the operation 10 times to ensure valid results. The result? Nine out of 10 times the database failed to come back cleanly, and all 10 of the VM snapshots had issues. Snapshot operations that reported success produced a 95 percent recovery failure rate (19 failures out of 20).
The snapshot is only as good as the data that is being put into it. Or simply put, garbage in produces garbage out. If data is actively being written, then there are buffers and caches outside of the storage system that need to be addressed. Add in virtualization, and there are multiple layers of them involved. This is why an application-consistent backup is so critical as a recovery point. When data recovery is required, the recovery point needs to be trusted.
Utilizing snapshots properly provides a great first line of defense. As Preston pointed out (and we couldn’t agree more), it should never be the only line of defense. Because snapshots are linked to the production data, they can still suffer from primary array failures, site failures and user error. This is where another critical point of integration is required: what to do next? Create an alternate disk-based backup copy? Send it to the cloud? Replicate and then send it to the cloud? Replicate to another array?
Replication to another array can extend the functionality of the snapshots to an alternate site, which can solve site and array failure challenges. But remember, the replica is still on the same storage platform, and that platform is still managed by people. A few years ago, a customer moved to an all-snapshot, all-replicated model for data protection. They had 200TB of data at a remote site and were utilizing snapshots as their protection, and then replicating the data to the main data center with no other copies configured. The architect was doing maintenance directly on the arrays, and from the array console deleted the wrong volumes, the snapshots along with them, and the replicated copies. He mixed up the site designation and thought he was logging into a remote site, and not the production site. The fortunate part for him was that the backup admin had taken one last tertiary copy with Commvault before turning off the additional protection, so he had a copy to recover from. Even the smartest people make mistakes. Making the snapshot and replication process an integrated part of the protection and retention strategy is important, and this requires that the data is consistent and can be trusted from the start of the process.
Access to protected data, and the ability to recover it, is something that end users, application owners and storage admins alike all require. But would you give an end user direct access to the storage array to recover snapshots? Most certainly not, and this is where another critical aspect of snapshots being used for data protection comes to light. If the recovery process presents a unified access console that lets data owners recover data regardless of data tier, you can accelerate a customer’s true RTO, because it eliminates the multiple handoffs otherwise needed to get to a quick recovery. A number of customers can script up data protection. However, even in the best of times that doesn’t solve access and recovery challenges. If your company has an RPO/RTO of 15 minutes, meeting the RPO with scripts can be challenging enough. And with the demands on storage staffs today, even reaching them within 15 minutes could be a challenge. Allowing customers to recover their own data is critical. Snapshots need to be integrated into the data protection system, not external to it.
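As a small illustration of what a 15-minute RPO means in practice, the following Python sketch checks whether the newest recovery point is recent enough. The function name, the timestamps and the idea of holding recovery points in a list are all assumptions made for this example, not part of any product.

```python
from datetime import datetime, timedelta


def meets_rpo(recovery_points, now, rpo):
    """Return True if the newest recovery point is no older than the RPO."""
    if not recovery_points:
        # With no recovery point at all, the objective cannot be met.
        return False
    newest = max(recovery_points)
    return now - newest <= rpo


# Hypothetical example data: two snapshots and a 15-minute objective.
now = datetime(2024, 1, 1, 12, 0)
points = [datetime(2024, 1, 1, 11, 30), datetime(2024, 1, 1, 11, 50)]
rpo = timedelta(minutes=15)

print(meets_rpo(points, now, rpo))      # newest point is 10 minutes old
print(meets_rpo(points[:1], now, rpo))  # newest point is 30 minutes old
```

A check like this only covers the RPO side. The RTO side depends on how quickly someone with access can actually start the recovery, which is the handoff problem described above.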
Additionally, scripts can easily become unmanageable. A few years ago I was on-site at a large financial institution looking at its infrastructure challenges surrounding its critical databases. Recovering its databases in the primary and secondary sites was such a problem that the institution had one of its storage administrators writing scripts to control some of the rudimentary aspects of its data protection needs. The customer had more than 1,700 scripts to manage this workload, and every time a volume was changed or added, one of these scripts had to be edited. Tracking those changes was impossible. The system could be set up once, but it could never be maintained. Being able to automatically discover the data, where it resides, and on what storage infrastructure is critical to the ongoing use of snapshots. Changing an array from one array class to another, let alone one array vendor to another, can change the entire script language or interface.
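The difference between per-volume scripts and discovery can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Volume` record, the `plan_snapshots` function and the sample inventory are hypothetical, and a real system would obtain the inventory by querying the applications and arrays rather than from a hard-coded list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str         # volume identifier (hypothetical)
    array: str        # storage array that hosts the volume (hypothetical)
    application: str  # application whose data lives on the volume


def plan_snapshots(inventory):
    """Group discovered volumes by array so each array is handled once."""
    plan = defaultdict(list)
    for volume in inventory:
        plan[volume.array].append(volume.name)
    return {array: sorted(names) for array, names in plan.items()}


# Hypothetical discovered inventory. Adding or moving a volume changes
# this data, not the code that builds the plan.
inventory = [
    Volume("vol1", "array-a", "sql-prod"),
    Volume("vol2", "array-b", "sql-prod"),
    Volume("vol3", "array-a", "sql-dr"),
]

plan = plan_snapshots(inventory)
print(plan)
```

With this shape, a new or relocated volume shows up in the next discovery pass and lands in the plan automatically, instead of requiring someone to find and edit one of 1,700 scripts.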
As I said before, snapshots can make a critical first line of defense in a data protection operation when used the right way. As part of any good data protection system, knowing which types of technology provide the best type of recovery is critical, and taking a layered approach that covers multiple failure scenarios is crucial.