Three Reasons Why the Content Index Scales Backup & Recovery Performance

Posted 04/25/2014 by Commvault

Lately there has been some spirited discussion around how, exactly, a backup and recovery solution enables scale and increases performance. So we thought it made sense to share a little about what makes Commvault’s architecture one of the leading architectures in the industry today, focusing specifically on content indexing, incremental forever backups, and virtual machine recovery. It all starts with the index, which can boost backup and recovery performance in three key ways.

Data Intelligence that is Efficient and Available

The Simpana software platform provides granular intelligence about the data being managed for more efficient and cost-effective data protection, recovery and access. The metadata (data about the data) ranges from the name of a file to the subject of an email, and it is captured in a content index as part of a backup, snapshot, or OnePass job. Commvault uses a distributed indexing structure in which all job-based metadata is retained in a central database (we call it the CommServe DB). Job and backup catalog information is also stored on the Media Agent protecting the job, automatically copied to the backup media containing the job (that is, the backup target, which these days is typically disk), and optionally copied to an index cache server. This distributed index keeps the size of the CommServe DB relatively small, enabling more scalable backups and faster, easier restores. It also creates a more reliable, more robust architecture: the CommServe needs only a manageable amount of horsepower, and the index has no single point of failure. Data can still be restored even if the Media Agent or CommServe DB goes down, because a copy of the index is still available elsewhere. The index architecture is a core component of the Simpana ContentStore, and part of the reason that the ContentStore is a highly efficient, scalable virtual repository for data being managed across an enterprise, regardless of where the data is physically located (see Figure 1).
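To make the "no single point of failure" idea concrete, here is a minimal sketch of a lookup that falls back across several copies of a job's index. It is an illustration only, not Simpana code: the `IndexCopy` class, the `find_job_index` function, the location names, and the sample job are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IndexCopy:
    """One location holding a copy of a job's detailed index (hypothetical model)."""
    location: str
    online: bool = True
    entries: dict = field(default_factory=dict)  # job_id -> list of file paths

    def lookup(self, job_id: str) -> Optional[list]:
        if not self.online:
            raise ConnectionError(f"{self.location} is unavailable")
        return self.entries.get(job_id)


def find_job_index(job_id: str, copies: list) -> tuple:
    """Return (location, entries) from the first reachable copy that has the job."""
    for copy in copies:
        try:
            entries = copy.lookup(job_id)
        except ConnectionError:
            continue  # this location is down; try the next copy
        if entries is not None:
            return copy.location, entries
    raise LookupError(f"no reachable index copy for job {job_id}")


# Assumed scenario: the Media Agent is down, but other copies survive.
job_files = {"job-42": ["/data/report.docx", "/data/budget.xlsx"]}
copies = [
    IndexCopy("media_agent", online=False, entries=job_files),
    IndexCopy("backup_media", entries=job_files),
    IndexCopy("index_cache_server", entries=job_files),
]
location, files = find_job_index("job-42", copies)
print(location, files)
```

In this sketch the restore can proceed from the copy on the backup media even though the first location is offline, which is the property the distributed index is meant to provide.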


Figure 1. A typical Commvault architecture

On the other hand, traditional and competing backup solutions use a centralized index that contains all job summary data and the detailed index information. As a result, this database can become extremely large, often limiting the scale of that solution.

In real numbers, the average size of a CommServe database is in the 3 to 5 GB range, with the largest we’ve seen in the 20 GB range. We recently met with a large government client who has 10 instances of a legacy competitive product, and they told us that for each of these instances, the database exceeds 2 TB. In a worst-case scenario, if that organization’s entire environment goes down, rebuilding 10 separate instances of our competitor’s environment, each with a backend database of more than 2 TB, will take several days to complete, and that’s before they can restore any data. By comparison, a single instance of the CommServe (which can support over 15,000 clients) with a 15 GB database enables an environment to be back up and running in a fraction of the time: hours instead of days. We can keep our CommServe DB this small because it stores ONLY the metadata.

Incremental Forever vs. Synthetic Full vs. DASH Full

This talk of index structure may sound like a simple architectural difference, but its effects are far-reaching. For example, Commvault’s 'incremental forever' backup strategy is made possible because we don’t have issues with the scale of our index (we never have). Customers can take an initial full backup once and then only need to back up the changed data (incremental forever), without the requirement to consolidate those incremental copies into a synthetic full backup.
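As a rough illustration of why an index makes this workable, the sketch below resolves the current view of a client from one full backup plus a chain of incrementals, without consolidating any data. It is a simplified model under stated assumptions, not Commvault's implementation: the `latest_view` function, the job names, and the convention that `None` marks a deleted file are all hypothetical.

```python
def latest_view(full: dict, incrementals: list) -> dict:
    """Merge a full backup's index with later incrementals; the newest entry wins.

    Each index maps a file path to the job that holds its latest copy.
    A value of None in an incremental marks the file as deleted.
    """
    view = dict(full)
    for inc in incrementals:  # oldest to newest
        for path, job in inc.items():
            if job is None:
                view.pop(path, None)  # file was deleted in this increment
            else:
                view[path] = job      # file was added or changed
    return view


full_index = {"a.txt": "full-1", "b.txt": "full-1"}
incremental_indexes = [
    {"a.txt": "inc-2"},                 # a.txt changed
    {"b.txt": None, "c.txt": "inc-3"},  # b.txt deleted, c.txt created
]
view = latest_view(full_index, incremental_indexes)
print(view)  # {'a.txt': 'inc-2', 'c.txt': 'inc-3'}
```

The restore only needs to know which job holds the latest copy of each file; the backup data itself is never rewritten to answer that question.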

In contrast, other products must consolidate backups into periodic synthetic fulls in order to reduce the size of their index. That process requires a full replay of the original full and all incrementals, often taking multiple days in large environments. During those days very little other activity can occur in the backup environment, because the construction of the synthetic full takes more than its fair share of resources. Productivity slows to a crawl because you’ve essentially added a whole new backup window for these other solutions.

As part of Commvault’s incremental forever methodology, we recommend that a customer perform periodic 'DASH Full' backups. A DASH Full is a read-optimized synthetic full backup job that avoids the performance toll of a traditional synthetic full. Instead of requiring a full read, a DASH Full operation simply updates the index files and deduplication database to signify that a full backup has been performed; it does not read blocks from the disk library back to the Media Agent. Because of Simpana software’s distributed index, DASH Fulls aren’t necessary to clear the index. Their primary purpose is to facilitate data pruning during the next data aging operation. Because no blocks are read back, a DASH Full has a massive performance advantage over a traditional synthetic full.
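The difference between the two approaches can be sketched with a toy model that counts block reads. This is an assumption-laden illustration, not the actual DASH Full logic: the `DiskLibrary` class, both functions, and the use of a `Counter` as a stand-in for the deduplication database are hypothetical.

```python
from collections import Counter


class DiskLibrary:
    """Toy block store that counts how many blocks are read back."""

    def __init__(self):
        self.blocks = {}
        self.reads = 0

    def write(self, block_hash: str, data: bytes) -> None:
        self.blocks[block_hash] = data

    def read(self, block_hash: str) -> bytes:
        self.reads += 1
        return self.blocks[block_hash]


def traditional_synthetic_full(view: dict, library: DiskLibrary) -> dict:
    """Read every block back from the library to assemble a new full copy."""
    return {path: [library.read(h) for h in hashes] for path, hashes in view.items()}


def dash_style_full(view: dict, ref_counts: Counter) -> dict:
    """Record a new full by referencing existing blocks; no block data is read."""
    new_index = {}
    for path, hashes in view.items():
        new_index[path] = list(hashes)  # the new full points at existing blocks
        ref_counts.update(hashes)       # dedup database gains one reference per block
    return new_index


library = DiskLibrary()
for block_hash in ("h1", "h2", "h3"):
    library.write(block_hash, b"data-" + block_hash.encode())

view = {"a.txt": ["h1", "h2"], "b.txt": ["h3"]}
ref_counts = Counter({"h1": 1, "h2": 1, "h3": 1})

traditional_synthetic_full(view, library)
reads_traditional = library.reads  # 3: every block was read back

library.reads = 0
new_full_index = dash_style_full(view, ref_counts)
reads_dash = library.reads         # 0: only index and reference counts changed
print(reads_traditional, reads_dash, dict(ref_counts))
```

In this model the cost of the traditional approach grows with the amount of data, while the DASH-style approach grows only with the number of index entries.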

Accelerating VMware and Hyper-V backup & recovery

One of the biggest advantages the content index and incremental forever backup method offer Commvault customers, resellers and service provider partners is the ability to modernize data protection for virtualized and private cloud environments. When it comes to improving performance of VMware backup and recovery, streaming backups using vStorage APIs for Data Protection (VADP) and Changed Block Tracking (CBT) are cool, but even slightly accelerated adaptations of VADP/CBT backups are still not fast enough for most high-I/O applications. Using a tiered approach to VM management based on SLA is the most reliable way to ensure performance and protection in both the short and long terms.

For Microsoft Hyper-V, Simpana software offers similar advantages. The major difference is that there is no CBT-like functionality for Hyper-V available in the industry as of this writing (but stay tuned!). So the impact of incremental forever backups is foundational to efficient Hyper-V data protection.

Commvault provides heterogeneous hardware snapshot management capabilities to integrate snapshots into your backup and recovery methodologies, and we continue to lead the way with more than 20 array families on our storage hardware support matrix. Employing hardware snapshots is crucial for applications with high change rates, often the most critical applications, because it dramatically reduces the amount of time that a VM must be quiesced to complete an application-consistent snapshot. Our IntelliSnap technology integrates snapshot management and recovery for both Hyper-V and vSphere VMs.

Then, for your rank-and-file VM with a moderate to low change rate, VADP/CBT or other hypervisor-based streaming backups are great. Use them to your heart’s content. They’ll give you all the performance you need for 80 percent of the applications you run.

For still other mission-critical applications, there are complexities that need an extra level of application integration (I’m looking at you, Oracle, SQL, Exchange, SharePoint, etc.) to lay claim to the full level of protection needed. This fact isn’t negated just because an application lives in a virtual machine. While it’s not the preferred method of backing up every virtual machine, the surgical use of application-specific agents in guests is certainly warranted for these applications.

Applying each of these methodologies to the appropriate workloads creates the optimal solution. Trying to come up with a one-size-fits-all solution is like trying to maintain your car with a single tool – you’ll end up with a lot of bloody knuckles, and in certain cases it just won’t work.
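The tiering described above can be written down as a small decision rule. The sketch below is one possible reading, not a Commvault policy engine: the `VM` and `Method` types, the change-rate labels, and in particular the order in which the checks are applied are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Method(Enum):
    HARDWARE_SNAPSHOT = "hardware snapshot"
    IN_GUEST_AGENT = "in-guest application agent"
    HYPERVISOR_STREAMING = "hypervisor streaming backup (e.g. VADP/CBT)"


@dataclass
class VM:
    name: str
    change_rate: str                     # "low", "moderate", or "high"
    needs_app_integration: bool = False  # e.g. Oracle, SQL, Exchange, SharePoint


def choose_method(vm: VM) -> Method:
    """Pick a protection method per VM. The check order is an assumption."""
    if vm.change_rate == "high":
        return Method.HARDWARE_SNAPSHOT   # minimize quiesce time
    if vm.needs_app_integration:
        return Method.IN_GUEST_AGENT      # surgical use of application agents
    return Method.HYPERVISOR_STREAMING    # rank-and-file VMs


vms = [
    VM("erp-db", change_rate="high", needs_app_integration=True),
    VM("mail-01", change_rate="moderate", needs_app_integration=True),
    VM("web-07", change_rate="low"),
]
plan = {vm.name: choose_method(vm) for vm in vms}
for name, method in plan.items():
    print(f"{name}: {method.value}")
```

A real environment would tune the thresholds and ordering to its own SLAs; the point is only that each workload is matched to a method rather than forced into one.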


What does all of this have to do with the index? Plenty. Simpana software’s ability to integrate all of this - the broadest array of hardware snapshots, VMware and Microsoft hypervisor-based backups, and agent-based backups - into a single, distributed index with massive scale provides clear value to our customers and partners. Think 15,000 clients; 120,000 jobs; physical and virtual servers; and thousands of simultaneous web 2.0 connections all protected and managed by a single software platform. Having granular intelligence about the data can also enable new methods of virtual machine integration including VM lifecycle management. With a single view of the data, regardless of where it’s physically stored, customers and partners can easily adopt private, hybrid or public clouds, and leverage cloud services for disaster recovery. In the end, the modern enterprise can extend the value of the content index to the point where it becomes a strategic differentiator that can save loads of IT budget and time.

So what does this mean to you? We mentioned the implications for recovery times earlier. Now connect two concepts. First, Commvault can protect data across the entire spectrum of applications in the enterprise, from mission-critical Oracle & SAP, through productivity workhorses like Exchange and SharePoint, to web, file, and print servers, whether they run in a physical context or virtualized. Second, with the massively scalable and highly efficient indexing and catalog structure, you get a data management solution that can deliver protection across literally every piece of data supporting every application and sitting in any location, and then deliver it back either to IT, or directly to the end user, to run searches, recovery operations, or even mobile access and sharing. And, most important, this functionality comes in a deeply integrated solution, so you don’t have to write a bunch of scripts or do a complex custom installation to accomplish this.