It's time to rethink your data deduplication strategy. Most environments using dedupe today suffer from resource bottlenecks, scale limitations, or both. This is particularly true given the rapid spread of dedupe appliances. Don't get us wrong: a dedupe appliance is a useful tool that solves some important challenges when customers need to roll out the technology quickly and see a fast benefit. But this strategy retains a fundamental reliance on hardware that can create other problems. In particular, as data sets continue to accelerate in growth, these approaches ask you to throw more and more hardware at the problem. As a result, if you're deduping this way today, you may be overpaying, underperforming, or both. Read on to learn how to remedy these problems.
Let's set the stage: on average, datasets today are growing at about a 40 percent annual rate. In other words, they are roughly doubling every two years, with no end in sight. Other industry pundits offer even more aggressive data growth forecasts[1]. The point is, no matter which stat you choose to believe, rapid data growth is here to stay[2].
With this in mind, Commvault® Simpana® 10 delivers 4th generation data deduplication, engineered to meet the challenges of this continued explosion in data growth. The major productivity improvement in 4th Gen Dedupe is "Parallel Deduplication" technology, which gets your infrastructure working smarter rather than harder. The fundamental premise behind parallel deduplication is to deliver massively scalable and highly resilient deduplication, via a software-centric approach, designed for the largest datasets and most demanding business applications. It does this by applying a grid-based architecture to the deduplication database (DDB) and the media agent.
With this grid architecture, Simpana 10 parallel deduplication federates multiple DDBs together to present a single, very large deduplication pool for use by data protection jobs (clients and subclients). Figure 1 below shows an example of a 2-node parallel deduplication pool. Using this type of architecture, we can scale deduplication capacity and throughput in a near-linear fashion to support very large dedupe workloads.
Figure 1: Example of 2-Node Parallel Deduplication Pool Configuration
In this example (Figure 1), we have federated two deduplication nodes, each of which can individually protect up to 120 TB of front-end storage[3] at approximately 4.5 TB/hr of throughput[4]. By federating the two nodes into a single dedupe pool, we can now manage deduplication of up to 240 TB of data at 9 TB/hr of throughput.
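The arithmetic above can be sketched in a few lines. This is purely an illustrative back-of-envelope helper, not a Commvault tool; the function name and structure are our own, and it simply assumes the near-linear scaling the text describes:

```python
# Illustrative sizing arithmetic for a federated dedupe pool.
# Per-node figures come from the example above; the helper itself
# is hypothetical, not part of any Commvault API.

PER_NODE_CAPACITY_TB = 120       # front-end TB per dedupe node (SSD required for the DDB)
PER_NODE_THROUGHPUT_TB_HR = 4.5  # preliminary v10 throughput per node

def pool_sizing(nodes: int) -> tuple[float, float]:
    """Return (total front-end TB, total TB/hr) for an n-node pool,
    assuming near-linear scaling across federated nodes."""
    return nodes * PER_NODE_CAPACITY_TB, nodes * PER_NODE_THROUGHPUT_TB_HR

capacity, throughput = pool_sizing(2)
print(capacity, throughput)  # -> 240 9.0
```

Swap in your own per-node figures from your sizing exercise; the point is simply that capacity and throughput both grow with the node count.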
Beyond the large scale and throughput this approach delivers, we can also combine parallel deduplication with Commvault's unique GridStor® capability to deliver full load balancing and job failover. If one node in the dedupe pool goes down, the other nodes in the pool immediately pick up the load to prevent downtime.
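To make the routing-plus-failover idea concrete, here is a minimal conceptual sketch of how a federated pool might partition block signatures across DDB nodes and skip a node that goes offline. This is our own illustration of the general technique (deterministic hash partitioning), not Commvault's actual implementation; all names here are hypothetical:

```python
import hashlib

# Conceptual sketch: route each block's signature to a DDB node by hash,
# skipping any node that is down. Not Commvault's implementation; real
# systems typically use consistent hashing to limit remapping on failure.

class DedupePool:
    def __init__(self, nodes):
        self.nodes = list(nodes)   # e.g. ["ddb-node-1", "ddb-node-2"]
        self.down = set()          # nodes currently offline

    def route(self, block: bytes) -> str:
        """Pick a DDB node for a block's signature among live nodes."""
        sig = hashlib.sha256(block).digest()
        live = [n for n in self.nodes if n not in self.down]
        if not live:
            raise RuntimeError("no dedupe nodes available")
        # Deterministic: the same block always maps to the same live node.
        return live[int.from_bytes(sig[:4], "big") % len(live)]

pool = DedupePool(["ddb-node-1", "ddb-node-2"])
primary = pool.route(b"some data block")
pool.down.add(primary)                            # simulate a node failure
assert pool.route(b"some data block") != primary  # surviving node picks up the load
```

The design point is that deduplication lookups stay deterministic while any live node can absorb work from a failed peer, which is the behavior the GridStor failover description implies.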
There are a couple of caveats about parallel deduplication that we want to address up front, which customers should understand:
- As of this post, Simpana 10 supports two nodes in a parallel dedupe policy, although there is no architectural hard limit on the number of dedupe nodes that can be federated. Customers can expect Commvault to continue to raise the number of dedupe nodes supported in a single parallel dedupe policy.
- Parallel dedupe nodes must be configured up front in the Storage Policy: a single node cannot later be converted to two nodes, nor two nodes to four, so you still need to plan ahead and architect the solution for growth.
Creating a 2-Node Parallel Deduplication Storage Policy in Commvault Simpana 10
Finally, we want to provide a quick walkthrough of the highlights of configuring a parallel dedupe storage policy. This isn't the full step-by-step guide; for that, you can reference our Books Online. Create a new Storage Policy for Parallel Deduplication in Simpana 10:
Designate this Storage Policy for Parallel Deduplication (select Multiple Deduplication Databases option):
Select Number of Deduplication Database Nodes (a.k.a. Partitions):
And finally, complete the Storage Policy Wizard and review your configuration:
As you can see, the configuration process itself is fairly straightforward and wizard-driven. The key is to size your DDB and media agents appropriately for the workloads. To get this sizing just right, we typically recommend working with your Commvault technical and services teams. But for initial guidance and direction, check out the Simpana 10 Dedupe Requirements and Sizing on Books Online.
Parallel dedupe is just one of several capabilities available in Simpana 10 that help you dedupe smarter rather than harder. Here are several more I'll address in future posts:
- Consolidate remote and central office deduplication in a single software-based architecture. You can leverage single node dedupe policies at the remote site. Then run DASH Copy operations to the central office using a parallel dedupe policy at the central site. Combining single and multiple node dedupe gives you the flexibility to right size the capabilities at each location based on the business need.
- Run incremental-forever backups leveraging DASH Full. This drives a much smarter backup strategy with minimal impact on production servers and the network, and helps drive better infrastructure utilization. For example, with the traditional weekly-full/daily-incremental schedule, VM backups can only drive 20-25 TB per node; with incremental forever and DASH Copy, the same node can drive 40-50 TB of VM data.
- Holistically manage multiple dedupe pools, based on data type, from a single console. This ensures you are creating dedupe pools with the maximum dedupe benefits to optimize resource consumption.
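The incremental-forever point above is easy to see with a quick worked estimate. The 5% daily change rate below is our own assumption for illustration; the protected-capacity figure is the low end of the per-node example in the text:

```python
# Back-of-envelope: data moved per week under a traditional
# weekly-full/daily-incremental schedule vs. incremental forever with
# DASH Full. The 5% daily change rate is an assumption for illustration.

FRONT_END_TB = 20          # protected data per node (low end of the text's range)
DAILY_CHANGE_RATE = 0.05   # assumed fraction of data changing each day

# Traditional: one full copy plus six incrementals per week.
traditional_tb = FRONT_END_TB + 6 * FRONT_END_TB * DAILY_CHANGE_RATE

# Incremental forever: only changed blocks move each day; DASH Full
# synthesizes the full from blocks already in the dedupe store.
incremental_forever_tb = 7 * FRONT_END_TB * DAILY_CHANGE_RATE

print(traditional_tb, incremental_forever_tb)  # roughly 26 vs 7 TB per week
```

Under these assumed numbers, roughly a quarter of the data moves each week, which is consistent with the text's claim that the same node can protect about twice the front-end capacity.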
[1] IDC, Big Data Technology and Services 2012-2016 Forecast, January 2013
[2] ESG, 2013 Spending Intentions Survey, January 2013
[3] Protecting 120 TB requires SSD storage for the DDB.
[4] Throughput is a preliminary v10 metric; expect this number to improve during the life of v10.