Ratio Schmatio: Time to Get Over Dedupe Ratios and Other Myths

Posted 02/05/2014 by Commvault

Doesn’t it feel like data deduplication has been around forever? But it wasn’t that long ago that it was the Big New Thing in data protection. The first Data Domain box shipped in 2004 but it took a few years more for dedupe to really get rolling. And like with most Big New Things, there were endless vendor wars over the best way to do the Big New Thing. Some vendors argued fiercely that in-line deduplication was the only way ('no landing zone!') while others touted the virtues of post-processing ('landing zones mean faster backups!') and some eventually offered both. Similar arguments raged over the best target format: tape-emulation or file system? Those arguments seem ancient now.

When dedupe technology moved into backup software, there were arguments over source-side versus target-side. There still are. But as a data protection technology, deduplication is nearly ubiquitous in backup software, though many users still aren’t taking advantage of it.

One thing that hasn’t changed all that much is the emphasis on deduplication ratios. Appliance vendors in particular still like to tout extravagant claims like 30-to-1 deduplication ratios. But do ratios mean anything anymore? If you ask us, we'd say ratio schmatio. It was never more than a marketing tactic.

The truth is that high deduplication ratios derive from the 'garbage-in, garbage-thrown-out' concept. Namely, if you dump a whole bunch of duplicate data into a box (say, a nightly full backup) and then get rid of it, you get a high dedupe ratio. But why bother dumping it into the box in the first place?
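To put rough numbers on that, here is a minimal sketch in Python. The figures (a 1,000 GB data set, 10 GB of new data per day, 30 days of backups) and the function names are made up for illustration, and the model assumes perfect deduplication with no compression.

```python
def dedupe_ratio(logical_gb, stored_gb):
    """Ratio of data sent to the box versus data the box actually keeps."""
    return logical_gb / stored_gb


def simulate(full_gb, changed_gb_per_day, days, nightly_full):
    """Return (logical_gb, stored_gb) after `days` backups.

    Hypothetical model: day one is always a full backup, and only the
    changed data on later days is genuinely new to the box.
    """
    logical = full_gb
    stored = full_gb
    for _ in range(days - 1):
        # A nightly full re-sends everything; an incremental sends only changes.
        logical += full_gb if nightly_full else changed_gb_per_day
        # Either way, only the changed data needs to be kept.
        stored += changed_gb_per_day
    return logical, stored


if __name__ == "__main__":
    for nightly_full in (True, False):
        logical, stored = simulate(1000, 10, 30, nightly_full)
        label = "nightly fulls" if nightly_full else "incrementals"
        print(f"{label}: sent {logical} GB, kept {stored} GB, "
              f"ratio {dedupe_ratio(logical, stored):.1f}:1")
```

Under these assumptions both schedules leave the same amount of data on disk. The only thing that changes the ratio is how much redundant data was shipped to the box first.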

For years we’ve used the simple metaphor of moving a pile of bricks. Let’s say you had 1,000 heavy bricks in 10 different colors, and your job was to carry one brick of each color up five flights of stairs. Would you haul up all 1,000 bricks and then sort them out at the top? Or would you sort them out first and only carry up 10 of them? To ask the question is to answer it. But a target-only approach to dedupe means asking your applications and your network to move all the bricks so the device can throw most of them away at the end of the line. So much wasted effort! 

We don’t want to be too simplistic; there are use cases where a target-only approach makes sense, but not as a general rule of thumb. Source-side dedupe is more efficient because you save all the work of 'moving the bricks,' meaning far less disk I/O on the source and much less data sent over the network. (There is also a discussion to be had about the efficiency of archiving data to remove it from the backup stream, but let’s stick to dedupe for now.)
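The brick pile translates into code fairly directly. The sketch below is a toy model, not how any particular product works: the 4-byte "brick" size, the function names, and the in-memory dictionary standing in for the target's block store are all assumptions, and it ignores the small amount of traffic needed to exchange block signatures.

```python
import hashlib

BRICK = 4  # bytes per block, deliberately tiny for illustration


def split(data, size=BRICK):
    """Cut data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def target_side(data, store):
    """Ship every block; the target throws away duplicates on arrival.

    Returns the number of payload bytes that crossed the network.
    """
    for block in split(data):
        store.setdefault(hashlib.sha256(block).digest(), block)
    return len(data)


def source_side(data, store):
    """Check each block's signature first and ship only unseen blocks.

    The `key not in store` test stands in for a signature lookup
    against the target. Returns payload bytes sent.
    """
    sent = 0
    for block in split(data):
        key = hashlib.sha256(block).digest()
        if key not in store:
            store[key] = block
            sent += len(block)
    return sent


if __name__ == "__main__":
    # 1,000 bricks in 10 colors: ten distinct blocks repeated 100 times.
    colors = [bytes([65 + i]) * BRICK for i in range(10)]
    pile = b"".join(colors) * 100
    print("target-side bytes sent:", target_side(pile, {}))
    print("source-side bytes sent:", source_side(pile, {}))
```

Both functions leave the same ten blocks in the store. The difference is how many bricks were carried up the stairs to get there.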

With Simpana software, you can use source-side as well as target-side, for a truly global deduplication approach. You can distribute the dedupe workload in various ways, which allows you to deploy in whatever mode makes the most sense for a given workload and infrastructure.

A lot of yakkity yak also takes place in the argument over variable-length block versus fixed-length block dedupe. To which we say: variable schmariable and fixed schmixed. Those terms are often used in misleading ways. Some vendors make us want to go all Inigo Montoya and say, “You keep using that word. I do not think it means what you think it means.”      

Target devices need to use variable-length deduplication for a simple reason: they don’t know what’s coming. Since the data they are receiving will be formatted differently based on the backup software used to move it, they need to adjust on the fly based on the chunks of meta-data inserted into the backup stream, which is different for every backup product. This requires additional processing power but is necessary to get decent data reduction results and to keep up the required data ingest rates. But it’s not the only way.
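Here is a rough sketch of why a device that doesn't know the stream layout tends to need variable-length chunking. It compares a fixed 64-byte grid with a simple content-defined chunker when a few bytes of header are inserted ahead of otherwise identical data. Everything here is an illustrative assumption: the window size, the mask, and the use of `zlib.crc32` recomputed over a sliding window. Real products typically use a true rolling hash and enforce minimum and maximum chunk sizes.

```python
import hashlib
import random
import zlib

FIXED = 64    # fixed block size in bytes
WINDOW = 16   # bytes examined when deciding whether to cut
MASK = 0x3F   # cut when the low 6 bits of the window hash are zero


def fixed_chunks(data, size=FIXED):
    """Cut on a rigid grid measured from the start of the stream."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data):
    """Cut wherever the last WINDOW bytes hash to a chosen pattern.

    Cut points depend only on nearby content, so they move with the
    data when something is inserted earlier in the stream.
    """
    out, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1:i + 1]
        if (zlib.crc32(window) & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


def lost_after_shift(chunker, data, header):
    """Return (chunks no longer recognized, total distinct chunks)
    once `header` is inserted in front of `data`."""
    before = fingerprints(chunker(data))
    after = fingerprints(chunker(header + data))
    return len(before - after), len(before)


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(4096))
    for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
        lost, total = lost_after_shift(chunker, data, b"HDR")
        print(f"{name}: {lost} of {total} chunks no longer match")
```

With the fixed grid, a three-byte insertion moves every later block boundary, so previously stored blocks stop matching. The content-defined chunker loses at most the first chunk and then falls back into step, at the cost of hashing a window at every byte position.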

Sometimes vendors who use a variable-length approach like to sniff at Commvault and say, 'Oh, they use fixed-length dedupe,' as if they were saying we just took the last piece of pie. Well, first of all, that’s misinformation, since Simpana 10 deduplication provides a variable-length option, which we recommend for certain use cases. For most situations, though, Simpana software uses a fixed-length approach to save on compute resources. But there’s a big, big difference here: we know what’s coming.

Unlike the target device, which is flying blind until the data shows up, Commvault’s source-side deduplication is content-aware. In other words, we know our own backup format so we can make upfront adjustments based on our meta-data. This allows us to easily align our block chunking at the proper data boundaries without the need to spend a lot of high-impact compute cycles figuring out the most effective starting points.
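As a loose illustration of that idea (and emphatically not a description of Simpana internals), the sketch below models a backup stream as a list of (header, payload) records. The record layout, names, and 8-byte block size are all hypothetical. A "blind" chunker lays a fixed grid over the raw stream, while a format-aware chunker restarts the grid at the start of each payload and leaves the headers out of the blocks altogether.

```python
import hashlib

BLOCK = 8  # fixed block size in bytes, tiny for illustration


def fixed_blocks(data, size=BLOCK):
    return [data[i:i + size] for i in range(0, len(data), size)]


def blind_chunks(records):
    """Fixed grid over the raw stream, headers and payloads mixed together."""
    stream = b"".join(header + payload for header, payload in records)
    return fixed_blocks(stream)


def aligned_chunks(records):
    """Fixed grid restarted at each payload boundary.

    Headers are assumed to be tracked separately as metadata, so they
    never shift the payload blocks.
    """
    out = []
    for _header, payload in records:
        out.extend(fixed_blocks(payload))
    return out


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


if __name__ == "__main__":
    file_a = bytes(range(0, 32))
    file_b = bytes(range(100, 140))
    # Same two files on both nights; only the per-job headers differ.
    night1 = [(b"J1|", file_a), (b"J1|", file_b)]
    night2 = [(b"JOB-1002|", file_a), (b"JOB-1002|", file_b)]
    for name, chunker in (("blind", blind_chunks), ("aligned", aligned_chunks)):
        first = fingerprints(chunker(night1))
        second = fingerprints(chunker(night2))
        print(f"{name}: {len(first & second)} of {len(second)} blocks already stored")
```

In this toy setup the aligned chunker recognizes every payload block on the second night using nothing more than cheap fixed-size cuts, because it knew where the payloads started. The blind fixed grid gets thrown off by the change in header length.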

If all that chunk talk sounds mysterious, the bottom line is that Simpana 10 deduplication ends up as effective as variable-length approaches in reducing data, but does it far more efficiently. We call that smart.  

And while it’s fun to talk about ratio schmatio and variable schmariable, these technologies don’t exist in a vacuum and nobody uses them in a vacuum. They are used operationally as an integral part of the end-to-end data protection lifecycle, and that’s where Simpana software really shines. We’ll take a look at the full lifecycle next time around.