Understanding Your Data: Peeking Into The Commvault 4D Index And How It Works

By Matt Tyrer

The need to know more about your data continues to grow.  Whether it’s data privacy initiatives, governance and regulatory changes, or “a close call” with ransomware or a data breach – knowing what data you have and where it is shouldn’t require a private investigator.

Let Commvault Activate™ be your Sherlock and provide the clues that help you understand your data and give you the ability to act on it.

So, how does Activate do this?  Behind the scenes, Commvault’s unique 4D Index provides the intelligence to wrap your data management strategy around.  Grab your trusty magnifying glass and let’s take a closer look into how it works…

What’s better than 3D? 4D!

The Commvault 4D index is at the center of our architecture and acts as a central namespace that describes all data managed by or defined to Commvault (backup, archive, live data).  It comprises four aspects, which we will examine in more detail:

1. Basic Metadata

Out of the box, we collect basic metadata about the data during the regular backup and archiving operations that load it into the index.  This metadata is critical for managing these processes, and also for data recall or recovery. The metadata for an email could be items like subject, from, to, date sent, date received, size, etc. For a file, it could be path, location, size, create date, created by, modify date, owner, last modified by, and so on. This basic metadata and indexing is standard within Commvault Complete™ Backup & Recovery, and can be searched (as part of our base backup offering) to support self-serve restore and other operational tasks.
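To make the idea of basic file metadata concrete, here is a minimal Python sketch that gathers a few such fields for a single file using only the standard library. The function name and field names are hypothetical and chosen for illustration; this is not how Commvault collects or stores metadata.

```python
import os
from datetime import datetime, timezone

def basic_file_metadata(path):
    """Collect a few basic metadata fields for one file (illustrative only)."""
    info = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        # Modification time converted to an ISO 8601 string in UTC
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }
```

A record like this can be stored and searched without ever opening the file's contents, which is what separates this layer from content indexing.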

2. Content Indexing

This is a feature specific to Commvault Activate™ and is where we make the file/email contents themselves searchable. This is sometimes referred to as full-text indexing: effectively, we load whatever text we can find into additional metadata fields within the index. We can harvest this text from office files, emails, PDFs and a variety of other unstructured, semi-structured, and structured data sources. Once this additional metadata is populated, our search operations can be applied to it, which makes it possible to find keywords or phrases within files, or to apply archiving policies based on the content of the files.  Activate can further extend the collection of this additional metadata beyond the data managed directly by Commvault and index data sources sitting “live” in production systems.
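As a rough illustration of what full-text indexing means in general, the Python sketch below builds a tiny inverted index (a map from each word to the documents containing it) and runs keyword searches against it. The document names, sample text, and function names are made up for the example; it is a generic textbook technique, not a description of Commvault's internals.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical extracted text from three assets
documents = {
    "memo.docx": "Quarterly budget review for the finance team",
    "mail-001": "Please send the signed contract before the budget meeting",
    "notes.pdf": "Meeting notes: contract renewal timeline",
}
index = build_index(documents)
```

Here `search(index, "contract meeting")` returns only the assets whose text contains both words, without rescanning the original files.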

3. Classification

The classification of data uses a process called entity detection, or entity extraction, which allows us to identify specific types of information that could be stored within the data. This also allows you to define categories/types of data (entities) and make the categories searchable. Examples could include sensitive data, PII flags, credit card numbers, customer IDs, sentiment analysis, purchase order numbers, broad financial details, etc.  The idea here is that you would define a (regular expression) pattern for, say, a driver’s license format. As we perform content indexing, the full text is searched for any matches to the patterns, and those matches are stored in metadata fields for that entity. Because the matches are extracted, you could then search for the presence of a specific driver’s license number, or you could search for any asset that looks like it might contain a driver’s license number.

This is central to our offering for Sensitive Data Governance (SDG), where we help to reduce your data risk by highlighting the presence of personal (or other sensitive) data across your environment.
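The pattern-based entity detection described above can be sketched in a few lines of Python. The two patterns below are deliberately simplified assumptions: the driver's license format (one capital letter followed by seven digits) is invented for the example, the card pattern only checks for sixteen digits in groups of four, and the sample number is a well-known test value. Real formats vary by jurisdiction and issuer, and the entity and function names are hypothetical.

```python
import re

# Hypothetical entity definitions: entity name -> regular expression
ENTITY_PATTERNS = {
    "drivers_license": re.compile(r"\b[A-Z]\d{7}\b"),
    "credit_card": re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b"),
}

def extract_entities(text):
    """Return a dict of entity name -> list of matches found in the text."""
    found = {}
    for name, pattern in ENTITY_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

sample = "Renewal for licence D1234567, paid with card 4111 1111 1111 1111."
entities = extract_entities(sample)
```

Storing the result alongside an asset's other metadata supports both kinds of search mentioned above: looking for one specific value, or looking for any asset where a given entity is present at all.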

4. Advanced Insights 

Here is where we leverage different types of Artificial Intelligence (AI) to enrich our index, usually from technology partners like Microsoft, Google, and AWS. AI comes in many forms – and whether we’re talking about statistical techniques like machine learning (ML), semantic techniques like natural language processing (NLP), or other approaches – the goal is to enrich the information we have about your data to make it more searchable, accessible, usable and actionable.

With the help of AI, you have the potential to use data more productively and effectively.

Layer Cake

Putting together all the layers of the Commvault 4D Index, we can look at some incredibly powerful methods to not only find and manage the data intelligently, but to visualize and derive additional value from it as well.

Examples could include:

  • Allow for the possibility that search requests or assets contain spelling mistakes and should still be returned when searching.
  • Make our entity detection more accurate, giving a more reliable assessment of data risks and opportunities.  Fewer “false positives” boost confidence in exception-based searches and make automated data policies more reliable.
  • Allow for document classifications – “find me all contracts”, “find me all contracts that have a clause like this.”
  • Extract meaning and context from rich media as well as documents – “find me all pictures of cats,” “find me all pictures that have a stop-sign and a red car,” “find me any pictures of documents that look like purchase orders,” “find me a list of videos that mention Commvault Virtual Connections more than five times.”
  • Use information about the consumption or access to data as a consideration when retrieving search results or recommending content. Over time, we can become more intuitive about what people are looking for, and how their search terms more closely align with what content they’re actually consuming.
  • Understand intent or emotion expressed within information. This sentiment analysis could be used to help understand and shape customer/employee experience or produce better search results.
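The first example in the list, tolerating spelling mistakes in a search, can be approximated even without an AI service. The Python sketch below uses simple string similarity from the standard library's `difflib` module to map a misspelled term onto known index terms. The vocabulary, function name, and 0.8 cutoff are assumptions for illustration, and this is a much cruder technique than the AI-driven matching described above.

```python
from difflib import get_close_matches

def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return up to three vocabulary terms that closely resemble the search term."""
    return get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

# Hypothetical list of terms already present in an index
vocabulary = ["contract", "invoice", "purchase", "budget", "license"]

suggestions = fuzzy_lookup("contrcat", vocabulary)
```

The suggested terms can then be searched in place of, or alongside, the original query, so a typo does not have to mean an empty result.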

The possibilities are truly endless.

This is why the Commvault 4D index is such a powerful tool to increase efficiency, reduce risk and, ultimately, let you know more about your data.  You can see Activate in action in this on-demand webinar, which includes a full technical demo.