Don’t Let Big Data Overwhelm Your Business – Part 1 of 2
What Does Big Data Mean for Business?
Big data is one of the most important elements of the drive toward digital business, which is reshaping the competitive landscape. IDC divides the world’s top 2,000 companies into ‘thrivers’ and ‘survivors’ based on whether they have digital transformation at the heart of their corporate strategy. If they do not, IDC believes they are risking their very survival. Combinations of new technology are driving new business models, forging new partnerships and opening new markets in novel and surprising ways. A mattress business, digitally enabled with sensors and apps, becomes ‘a cloud connected bio monitoring business’; an aviation tire company sells the data it collects on its airborne test rig to the concrete manufacturer that supplies airports; a car company offers new services every month based on its intelligent auto platform: this week managing the car’s speed to avoid red lights, next week finding open parking places in a crowded city. Big data plays a vital role here by analyzing larger quantities of data to provide finer-grained insights, which help define the next business model, transform business processes or spark the idea for the next killer product.
How Do You Define Big Data?
Big data is a different kind of workload, one that is beyond the reach of traditional database, storage and compute infrastructure and that provides a finer granularity of business insight. It is based on a distributed cluster which brings supercomputing levels of compute to the data and is resilient to local failures. Its purpose is to explore complex and evolving relationships across heterogeneous data sources. There is a trade-off to keep in mind: the same architecture that delivers resilience and compute efficiency is not very storage efficient, because data is typically held as multiple copies spread across the cluster.
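To put a rough number on that storage trade-off, here is a minimal sketch in Python. It assumes the cluster protects data by keeping whole copies of every block, with a replication factor of three (the default in HDFS). The function names and the 100 TB figure are purely illustrative, and clusters that use other protection schemes will see different overheads.

```python
def raw_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk (in TB) needed to hold `logical_tb` of data when every
    block is stored `replication_factor` times across the cluster."""
    if logical_tb < 0 or replication_factor < 1:
        raise ValueError("logical_tb must be >= 0 and replication_factor >= 1")
    return logical_tb * replication_factor


def storage_efficiency(replication_factor: int = 3) -> float:
    """Fraction of raw disk that holds unique data (1.0 means no overhead)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return 1 / replication_factor


# Illustrative figures only: 100 TB of data with three copies of each block.
needed = raw_capacity_tb(100, replication_factor=3)
efficiency = storage_efficiency(3)
print(f"Raw capacity needed: {needed:.0f} TB, efficiency: {efficiency:.0%}")
```

Under these assumptions, only about a third of the raw disk in the cluster holds unique data. The rest is the price paid for resilience and for keeping compute close to the data.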
The data sources of big data include: operational data, which comes from transaction systems, streaming data and sensors; ‘dark data’ (the data you already own but don’t use); commercial data; social data; and public data. Public data can take numerous formats and cover many topics, such as economic and socio-demographic data. It is provided by numerous government open data initiatives, which serve up data on everything from climate to crime. New data sources are emerging all the time. According to Gartner, 70% of organizations running big data are now analyzing (or planning to analyze) location data, and 64% are analyzing (or planning to analyze) free-form text.
The Evolution of Big Data
According to Gartner, the proportion of companies investing or planning to invest in big data is quite high at around 76%, but production deployments remain flat at 14%. Ownership of these projects and programs is split fairly evenly between business units and CIOs, and the projects tend to start out experimental and R&D-oriented. It is when they go into production that they become much more like conventional applications, in that they need service levels, and that’s where Commvault can really help!
Big Data Challenges
The most obvious challenge is the massive volume of data! Traditional data protection solutions will struggle to scan the data and walk the directory structure fast enough, and they will struggle just as much to transfer such quantities of data at a fast enough rate. That scale, together with the heterogeneity of the data, makes big data complex. In the case of Hadoop, the stack is changing constantly, which also increases complexity. Big data is also relatively new and lacks maturity. As a result, a lot of the native tools require manual customization and scripting. Manual work is bad news in any environment, but it is worse in a complex environment where any change can mean hours of work that cannot be automated.
Also, whilst the built-in resilience that big data brings as standard is undoubtedly welcome, it can’t protect against logical errors and user errors. In some cases, reliance on that native resilience leads to the misconception that data protection is not required at all, which is manifestly untrue. There are failure scenarios where the cluster will self-heal, but there are also wider-scale scenarios where it will not, and hardware resilience does not provide a way to return the data to an earlier point in time when it was logically consistent.
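The difference between surviving a hardware failure and recovering from a user error can be shown with a toy model. The sketch below is not how any real cluster or product works internally. `ReplicatedStore` and its methods are hypothetical names invented for illustration, and the "snapshot" is simply an in-memory copy standing in for a proper point-in-time backup.

```python
import copy


class ReplicatedStore:
    """Toy key-value store that applies every change to all replicas."""

    def __init__(self, replicas: int = 3):
        self.replicas = [dict() for _ in range(replicas)]
        self.snapshots = {}

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # A delete is a valid operation, so it is replicated faithfully,
        # whether or not the user meant it.
        for replica in self.replicas:
            replica.pop(key, None)

    def get(self, key):
        return self.replicas[0].get(key)

    def fail_and_heal(self, index: int):
        """Simulate losing one replica and rebuilding it from a survivor."""
        survivor = self.replicas[(index + 1) % len(self.replicas)]
        self.replicas[index] = copy.deepcopy(survivor)

    def snapshot(self, name: str):
        """Keep a point-in-time copy that later changes cannot touch."""
        self.snapshots[name] = copy.deepcopy(self.replicas[0])

    def restore(self, name: str):
        saved = self.snapshots[name]
        self.replicas = [copy.deepcopy(saved) for _ in self.replicas]


store = ReplicatedStore()
store.put("orders", [101, 102])
store.snapshot("before-cleanup")

store.fail_and_heal(1)                     # hardware failure: cluster self-heals
after_hardware_failure = store.get("orders")

store.delete("orders")                     # user error: gone from every replica
after_user_error = store.get("orders")

store.restore("before-cleanup")            # only the point-in-time copy helps
after_restore = store.get("orders")
```

In this model the replicas happily absorb the lost node, but they also carry out the accidental delete three times over. Only the separate point-in-time copy brings the data back.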
Lastly, big data can be hard to predict and control from a capacity standpoint, and big surges of incoming data can consume capacity very quickly.
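A back-of-the-envelope calculation shows how quickly a surge can eat into headroom. This is a rough sketch only: it assumes a steady daily ingest rate and that every terabyte ingested is stored three times (a common replication factor), and all of the figures are made up for illustration.

```python
def days_until_full(capacity_tb: float, used_tb: float,
                    daily_ingest_tb: float, replication_factor: int = 3) -> float:
    """Days of headroom left at a steady ingest rate, counting the raw
    disk consumed by every replicated copy of the incoming data."""
    free_tb = max(capacity_tb - used_tb, 0)
    daily_raw_tb = daily_ingest_tb * replication_factor
    if daily_raw_tb <= 0:
        return float("inf")  # nothing coming in, so the cluster never fills
    return free_tb / daily_raw_tb


# Hypothetical cluster: 1,000 TB raw capacity, 700 TB already used.
normal_days = days_until_full(1000, 700, daily_ingest_tb=2)
surge_days = days_until_full(1000, 700, daily_ingest_tb=10)
print(f"Normal ingest: {normal_days:.0f} days, during a surge: {surge_days:.0f} days")
```

With these illustrative numbers, a five-fold surge in incoming data cuts the remaining headroom from 50 days to 10.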
Stay tuned for my next blog entry, where I will discuss Part 2 of this topic.