Big Data | Complete guide
Big Data. It’s a term on the lips of seemingly every industry pundit and in the headlines of every social media feed. Far from being just the latest trend, Big Data is a new reality for firms in every industry (and even for non-profit organizations and government agencies), fundamentally transforming the way they do business. Forward-thinking firms have integrated Big Data into strategic planning and invested in the workforce and financial resources necessary to leverage it in marketing, sales, manufacturing, and supply chain management. Business schools and graduate degree programs increasingly emphasize data analysis, and firms are eagerly snapping up talent. But what, exactly, is it, and why, if you are an entrepreneur or executive, should it be on your mind?
In this article we will 1) define exactly what is meant by the term Big Data; 2) explore the origins & the uses of Big Data; 3) discuss the benefits of using Big Data; 4) explore the sources of Big Data; 5) provide an overview of the management of Big Data; and 6) cover a case study of a business using Big Data to achieve a strategic business objective.
WHAT IS BIG DATA?
Big Data is a term that refers to the data gathered by businesses and organizations and stored digitally. It is data that is too large for standard database processing systems and that varies greatly in type. It can include anything from social media mentions to keywords and hashtags to manufacturing equipment logs to digital images. In short, it is any data that is, or can be, aggregated and digitized for commercial use. It includes not only data captured and aggregated online, but also data from internal processes and external events, as well as data acquired from third parties. Technology research firm IDC estimates that Big Data grows at a rate of 50 percent a year. Big Data is often described in terms of the three Vs: volume, velocity, and variety.
Volume refers to the size and scale of the data at hand, though conversations about volume usually concern database storage and access solutions. Modeling data with more variables rather than fewer should lead to more accurate and complete insights. However, one must determine which storage solutions best allow for security, ease of maintenance, standardization, and easy access by end users. Think about it: how exactly would you develop and implement a system that accepts potentially 200 different file types and still allows users to run queries and conduct analyses? Luckily, there are many software solutions on the market that have answered this question for you, such as Hadoop and Greenplum. However, every solution has its own limitations, and ultimately a determination must be made, based on the availability of IT and financial resources, about how data storage and access will be handled.
Velocity refers to the speed with which data is automatically fed back into the database. Depending on the number of data sources and the speed with which new data is generated, aggregated, and fed back into the firm’s database, velocity necessitates some planning and investment. High frequency trading is an example of the use of sophisticated, high-velocity Big Data management tools: these trading systems take in data from capital markets at intervals of a fraction of a second. Proprietary software automatically builds mathematical models from this data to process, execute, and settle trades just as quickly. Institutional investors can trade millions of times a day, in hundreds or thousands of positions representing hundreds of millions or even billions of dollars.
The third V is variety, which represents not only different underlying data types, but also differing delivery types. Different browsers and databases, with differing packaging protocols, can easily complicate your ability to standardize and analyze your data. When you standardize it, invariably some of the data must be discarded. However, Big Data is by definition messy, coming from a variety of sources, at different times, and in different formats. We prefer data to be clean and tidy, but humans are by nature not; otherwise, complex mathematical models would not be needed to predict behavior. Inevitably, data must be reduced to numbers according to some relevant rubric(s) for modeling. But in case those numbers do not tell the whole story, best practices call for all of the original data to be preserved.
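To make the idea of standardizing data without discarding the original concrete, here is a minimal sketch in Python. The two sources (a JSON web event and a comma-separated point-of-sale line), their field names, and the `Record` type are all assumptions invented for illustration; real feeds will look different.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Record:
    source: str        # hypothetical label for where the record came from
    customer_id: str   # standardized field used for modeling
    amount: float      # standardized numeric value
    raw: Any           # original payload, preserved untouched


def from_web_json(payload: str) -> Record:
    """Standardize a JSON web event (the field names are assumptions)."""
    data = json.loads(payload)
    return Record("web", str(data["user"]), float(data["total"]), payload)


def from_pos_csv(line: str) -> Record:
    """Standardize a point-of-sale line of the assumed form: id,amount,notes."""
    customer_id, amount, _notes = line.split(",", 2)
    return Record("pos", customer_id.strip(), float(amount), line)


records = [
    from_web_json('{"user": 17, "total": "19.99", "coupon": "SPRING"}'),
    from_pos_csv("17, 5.01, paid cash; asked about loyalty card"),
]

# The model only sees the standardized numbers...
total_for_17 = sum(r.amount for r in records if r.customer_id == "17")

# ...but the coupon code and the cashier's note are still available in `raw`
# if those numbers turn out not to tell the whole story.
```

The design choice here is simply to carry the raw payload alongside the reduced fields, so that standardization never has to be a one-way step.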
Fortunately, advances in machine learning have increased the complexity of data a computer can process and the speed at which it can do so. Further, the more Big Data is parsed, the more people can refine the algorithms that power machine learning. In other words, the more Big Data grows over time, the better machines can process it, without the need for discarding non-standard elements.
In addition to the traditional three Vs, some industry pundits have added a fourth V, veracity.
Data managers and scientists must also be concerned with the veracity of the data they capture. The inability to capture correct data, or quickly correct or purge incorrect data, can seriously affect market research, financial forecasting, optimization efforts, and other attempts to harness Big Data.
But beyond data that is flat-out incorrect, whether through inaccurate aggregating methods, data corruption, or human error, veracity refers to a data scientist’s ability to use the data to make predictions with a particular degree of certainty. In statistics, modeling is done within certain confidence levels that take into account estimates of the effects of variation (uncertainty) on the model. If the variables in a particular model have considerable variation, then the amount of uncertainty will increase. Data scientists must take great pains to minimize the uncertainty in their models (through careful gathering of data, selection of datasets, and exploration of relevant variables) to ensure that the actionable insights they present to their bosses are, in fact, reflective of reality. What complicates this further is the real-time nature of the data: the velocity of Big Data.
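The link between variation and uncertainty can be illustrated with a short Python sketch. It computes an approximate 95 percent confidence interval for a mean using a normal approximation (for samples this small a t-distribution would usually be more appropriate), and the two samples are made-up numbers chosen to share the same mean.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)       # about 1.96 for 95 percent
    margin = z * stdev(sample) / sqrt(len(sample))  # grows with the spread
    centre = mean(sample)
    return centre - margin, centre + margin


# Two invented samples with the same mean (100) but very different variation.
steady = [98, 99, 100, 101, 102, 100, 99, 101]
noisy = [60, 140, 85, 115, 100, 130, 70, 100]

steady_low, steady_high = confidence_interval(steady)
noisy_low, noisy_high = confidence_interval(noisy)

print(f"steady: {steady_low:.1f} to {steady_high:.1f}")
print(f"noisy:  {noisy_low:.1f} to {noisy_high:.1f}")
```

Both intervals are centred on the same value, but the interval for the noisier sample is much wider, which is the sense in which more variation means less certainty about the insight being reported.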
THE ORIGINS & USES OF BIG DATA
Big Data is a phenomenon whose roots are firmly grounded in the Digital Era. Firms have not only built online tools to capture significant amounts of consumer information, but have also begun to incorporate digital technologies into everything from managing internal timekeeping to optimizing manufacturing to refining supply chain management.
Big Data is most often used to predict patterns in consumer behavior and/or gain product insights. The ability to aggregate data from a variety of sources in real time, and to use sophisticated analytical tools like SAS (an industry-standard statistical software package) to standardize and analyze that data, has greatly reduced firms’ costs and increased their capabilities to do market research and to account for continuous change in the market space.
These new capacities have allowed for mass customization of traditional products and even for the creation of new products, services, and revenue-driving features. For example, the recommendation engine, a software application that uses predictive analytics (analysis of key data that predicts a user’s behavior and preferences), has allowed Amazon and Netflix to suggest products that their users are likely to purchase with a considerable degree of accuracy, allowing them to upsell effectively.
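As a rough illustration of how a recommendation engine can work, the Python sketch below suggests items that were frequently bought together with what a customer already owns. This is a deliberately simple co-occurrence approach with invented purchase histories; it is not a description of how Amazon or Netflix actually build their systems.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories, invented for this example.
baskets = {
    "ann": {"tent", "stove", "lantern"},
    "bob": {"tent", "stove"},
    "cara": {"tent", "lantern", "boots"},
    "dev": {"stove", "kettle"},
}


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = {}
    for items in baskets.values():
        for a, b in permutations(items, 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(owned, counts, k=2):
    """Rank items not yet owned by how often they co-occur with owned items."""
    scores = Counter()
    for item in owned:
        for other, n in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += n
    # Sort by score, then by name, so ties are broken predictably.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


counts = build_cooccurrence(baskets)
print(recommend({"stove"}, counts, k=1))  # ['tent']
```

A customer who owns a stove is shown a tent because two other customers bought both, while the lantern and the kettle each co-occur with the stove only once.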
THE BENEFITS OF USING BIG DATA
By collecting and aggregating Big Data, firms enable their data scientists to build mathematical models that refine offline and online marketing strategies (including targeting and pricing); yield product insights; optimize supply chain management and distribution strategies; produce financial forecasts; and more. Implementing insights gained from Big Data can be a source of competitive advantage.
Big Data, Small World: Kirk Borne at TEDxGeorgeMasonU
SOURCES OF BIG DATA
Reams of Big Data are aggregated on the back end of websites and mobile apps, and are available for firms to use. Big Data is also captured digitally through the proliferation and integration of digital technologies in traditionally non-digital processes. For example, manufacturers may add digital sensors to assembly arms to measure how effectively work is performed at varying speeds, with minimal defects and minimal increases in wear and tear. This can help them determine the right speed for optimal production. Big Data can include publicly available information. For example, Google Maps provides APIs that allow web developers to build applications that aggregate and manipulate its satellite data in real time; this could be a source of Big Data. Big Data can include digital photos, video, and music created by individuals and published online; weather reports; online purchases; Tweets; email lists acquired from direct marketing list brokers; mileage logs from a firm’s delivery trucks; and more.
MANAGING BIG DATA
Managing Big Data is a difficult task, even for the most forward-thinking and cash-rich firm. For one, data is always changing. That demographic data you scraped from one of many customer Facebook pages can change with a keystroke. Ensuring the integrity of the data you capture, both through continuous data mining and through processes designed to minimize data-related human error, is critical. The determination of what data to capture is also key. Choosing to capture everything creates data storage issues and can contribute to faulty decision-making.
Data managers must be able to pare volumes of data down into actionable insights. Decision makers at the executive level do not have time to be guided through the finer points of how a data scientist has arrived at their conclusion. They need to know the story the data is telling. And that starts with asking the right question.
What is the right question, or what are the right questions? In part, it depends on the strategic business objective and, at least initially, on a series of open-ended questions. For example, a drink manufacturer looking to launch a new soft drink may ask what soft drink customers are most likely to buy. They might then look at what the data tells them about their average customer and what makes that customer buy, and/or what other soft drinks their average customer buys. But when you are aggregating data about everything from the typical consumer’s daily routine (from location-based social networks like Foursquare), to websites visited (through cookies), to responses to email marketing messages, to social media messages, the task can become daunting indeed. Data analysis must proceed from the initial question, to the determination of relevant variables and datasets, to hypothesis formulation and testing, and finally to a conclusion.
There are many IT systems available on the market and, with the right IT personnel, firms can also build homegrown systems or customize out-of-the-box ones. There are also cloud-based and SaaS systems readily available for firms of every size, from the sole proprietor to the multinational corporation.
At its root, whether it is stored in-house or outsourced, harnessing Big Data requires hardware and software for data storage and management. Server and network administrators are needed to maintain the hardware; database administrators are needed to manage how data is stored in, and accessed from, the database; and data scientists are needed to analyze and interpret the data.
Do you have the expertise to manage Big Data in-house? Most firms don’t. Is it worth investing in the personnel, training, and technology infrastructure to manage it in-house? This depends on a number of factors. It can be an expensive proposition in terms of initial setup (hiring, IT planning, and deployments) and ongoing associated expenses. Nevertheless, it may be worth it if your firm’s near-future and long-term strategic objectives depend on your ability to leverage Big Data. For example, consider a local bicycle retailer whose current marketing initiatives include events-based marketing and online interest communities targeting cyclists, and which plans in the near future to shift its core product offering to fitness-based wearable technology it is developing for cyclists. Such a firm might want to invest in the personnel and IT infrastructure to ensure that all of its data is integrated and properly analyzed.
This same bicycle firm could outsource the data warehousing, integration, and analysis. However, there are several trade-offs to this approach. One is that the firm, which is essentially becoming a technology firm, will lack in-house a key core competency needed to compete, because it has outsourced it. This may hinder its ability to make strategic decisions, and the speed with which it can make them. In-house staff members, who understand the business both from the front lines and from a data analysis perspective, can develop actionable, innovative insights of a quality, and at a rate, that the outside firm likely cannot match. Another trade-off is that you will only be as good as your data firm, and if you lack even the understanding to properly evaluate your data needs, you may very well pick a data management firm that is a poor fit.
Privacy and security
Privacy and security are key concerns. In many countries, you are expected by law to reasonably protect the confidentiality and security of user data, and to inform consumers promptly if there has been a security breach. Further, you may be liable for damages, in full or in part, to compensate your consumers for any harm they have suffered as a result of the breach.
Another aspect of privacy involves the use of data. Many marketers have gotten a bad rap since the beginning of the Digital Age for using consumer information to create intrusive marketing messages (e.g. spam emails). You can undercut the efforts you put into managing Big Data if your usage turns off consumers. Governments have stepped in, and have in many countries enacted privacy laws that define what you can and cannot do with consumer data.
Susan Etlinger: What do we do with all this big data?
CASE STUDY OF BIG DATA USAGE
One of the most well-known users of Big Data is Wal-Mart. Indeed, Big Data is integral to Wal-Mart’s operation. The retailer uses real-time inventory information to rapidly replenish its goods. It rids itself of excess inventory through a pricing strategy known as Everyday Low Pricing, which allows it to deeply discount that inventory in accordance with its real-time stock levels. It is also able to leverage store-wide pricing data, inventory data, and financial forecasts to ensure that it is not taking a net loss on a daily basis due to the discounts. It can do this by optimizing how much it purchases from a variety of suppliers, again, in real time. For example, if a Wal-Mart store is overstocked with toothbrushes from a particular supplier, but running out of more expensive toothbrushes from another supplier, it might deeply discount the overstocked items and order more from the more expensive supplier. If this does not mitigate the loss on the first set of toothbrushes, Wal-Mart can look at a complementary good or an entirely different category where it can increase revenue by strategically ordering more inventory, and marking prices up or down, to ensure profitability.
Due to Wal-Mart’s popularity, suppliers clamor to be stocked on its shelves. Because of this, Wal-Mart can influence how much it pays wholesalers for their goods.
Wal-Mart has been leveraging Big Data since 1975 to collect and aggregate inventory and supply chain-related data. By 1989, it had reduced its distribution costs to less than half of Kmart’s costs and under a third of Sears’s costs, giving it a considerable strategic advantage. Today, it is widely recognized for its profitability, low internal costs, low prices, and popularity.
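The toothbrush example can be sketched as a simple rule in Python. Everything here is hypothetical: the thresholds, the discount rate, the item figures, and the `plan` function are invented to illustrate the kind of decision that real-time inventory data makes possible, and do not describe Wal-Mart’s actual systems.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    on_hand: int        # units currently in the store
    daily_sales: float  # recent average units sold per day
    price: float        # current shelf price


def plan(item, low_days=3, high_days=30, target_days=14, discount=0.25):
    """Decide whether to discount, reorder, or hold, based on days of stock."""
    if item.daily_sales > 0:
        days_of_stock = item.on_hand / item.daily_sales
    else:
        days_of_stock = float("inf")  # nothing is selling, so stock never runs out

    if days_of_stock > high_days:
        # Overstocked: mark the price down to move inventory.
        return ("discount", round(item.price * (1 - discount), 2))
    if days_of_stock < low_days:
        # Running out: order enough to get back to the target coverage.
        return ("reorder", round(item.daily_sales * target_days) - item.on_hand)
    return ("hold", None)


budget_brush = Item("budget toothbrush", on_hand=900, daily_sales=10, price=2.00)
premium_brush = Item("premium toothbrush", on_hand=20, daily_sales=12, price=4.50)

print(plan(budget_brush))   # ('discount', 1.5)
print(plan(premium_brush))  # ('reorder', 148)
```

A real system would also weigh margins, supplier terms, and forecasts across categories, as the case study describes; the sketch only shows the first step of turning stock levels into an action.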