Big Data: How to Manage Hadoop
In the era of Big Data, IT managers need robust and scalable solutions that allow them to process, sort, and store Big Data. This is a tall order, given the complexity and scope of available data in the Digital Era. Several solutions exist, though perhaps none is more popular than Apache Hadoop (“Hadoop” for short).
If you do not work in IT, and/or do not keep up-to-date with the latest in IT trends, you may not have heard of Hadoop; but the open source framework is used by firms ranging from social networking sites like Facebook to phone companies like AT&T to energy companies like Chevron. These firms must not only store and process the data from both their internal operations and a wide variety of other sources; they must be able to leverage that data in order to develop a strong competitive advantage. And while leveraging Big Data requires vision, firm-wide buy-in, and skilled personnel, Hadoop provides arguably the most robust and versatile technical solution to a firm’s Big Data needs.
In this article, we will cover 1) what Hadoop is, 2) the components of Hadoop, 3) how Hadoop works, 4) deploying Hadoop, 5) managing Hadoop deployments, 6) an overview of common Hadoop products and services for Big Data management, and 7) a brief glossary of Hadoop-related terms.
WHAT IS HADOOP?
Hadoop is an open-source Big Data management framework, written in Java and developed under the Apache Software Foundation. It is built around a cross-platform distributed file system that allows individuals and organizations to store and process Big Data on clusters of commodity hardware (inexpensive, widely available computing components that can be combined for parallel computing). Doug Cutting (then a Yahoo software engineer) and Mike Cafarella created Hadoop in 2005 to support Nutch, an open-source search engine. Hadoop is scalable and allows for the processing of both a large volume and a wide variety of data types and data flows.
Hadoop is a file system, not, as is often thought, a database management system, because it lacks certain traditional database features. Indeed, many relational databases provide a wealth of business intelligence applications that the Hadoop base package lacks. Hadoop’s core benefits lie in its ability to store and process large volumes of diverse data (both structured and unstructured), and to provide end-users with advanced analytical tools, such as machine learning and mathematical modeling, which can be applied to the data that Hadoop stores and processes.
COMPONENTS OF HADOOP
When people speak of Hadoop, they are usually speaking about the Hadoop ecosystem, which includes a base package as well as a variety of other software products that can be used in conjunction with it. The base package itself is composed of four modules:
- Hadoop Distributed File System (HDFS): This distributed file system is where the data is stored within the Hadoop framework.
- Hadoop YARN: This module provides a framework for both cluster resource management and job scheduling.
- Hadoop MapReduce: Using YARN, this Java-based module allows for the parallel processing of large data sets.
- Hadoop Common: This consists of utilities that support the other three modules.
Beyond the base package, there are many other software packages, collectively comprising the Hadoop ecosystem. These include:
- Apache Pig – a programming platform for creating MapReduce programs;
- Apache Spark – an open-source cluster computing framework; and
- Apache HBase – a non-relational distributed database designed to store large quantities of sparse data.
Hadoop is also modular and allows IT personnel to replace different components for different software applications, depending on the deployment and the desired functionality.
HOW HADOOP WORKS
Fundamentally, HDFS ingests data through batch and/or interactive data processing. Because HDFS is a distributed file system, data is split into blocks that are spread across the machines in a Hadoop cluster, and MapReduce tasks run on the machines that hold the data, which provides both redundancy and increased processing speed. In a Hadoop cluster, a single machine is designated as the NameNode, which tracks where data is stored throughout the cluster, while the others are designated as DataNodes, which store the data itself. HDFS replicates each block of data on several machines (three by default). The more machines added, the more space is gained. Further, spreading replicas across multiple DataNodes mitigates the failure of any single component of the cluster.
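The storage model described above can be illustrated with a toy simulation. The following is a minimal sketch, not Hadoop's actual placement logic: the function names and node names are hypothetical, the block size (128 MB) and replication factor (3) are assumed to match common HDFS defaults, and the round-robin placement ignores the rack awareness that real HDFS uses.

```python
BLOCK_SIZE_MB = 128   # assumption: a common HDFS default block size
REPLICATION = 3       # assumption: the common HDFS default replication factor


def place_blocks(file_size_mb, datanodes,
                 block_size=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a toy 'NameNode' map: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = -(-file_size_mb // block_size)  # ceiling division
    block_map = {}
    for block in range(n_blocks):
        # Simple round-robin placement (hypothetical; real HDFS is rack-aware).
        block_map[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return block_map


def readable_after_failure(block_map, failed_node):
    """A file stays readable if every block keeps at least one live replica."""
    return all(any(node != failed_node for node in nodes)
               for nodes in block_map.values())


nodes = ["dn1", "dn2", "dn3", "dn4"]          # hypothetical DataNode names
block_map = place_blocks(500, nodes)          # a 500 MB file
print(block_map)
print(readable_after_failure(block_map, "dn2"))
```

In this sketch the dictionary plays the role of the NameNode's metadata, while the lists stand in for the DataNodes that hold each replica; losing any one node still leaves every block readable.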
MapReduce is used to process the data. In the original MapReduce design, it does so through two components: a JobTracker on the master node of a Hadoop cluster, and TaskTrackers on the DataNodes. The JobTracker splits a computing job into its component tasks and distributes them to the TaskTrackers, which carry them out on their own machines (the map step). The intermediate results are then combined, or reduced, into the final output. In addition to the space, the more machines added, the more processing power is gained.
YARN, which stands for “Yet Another Resource Negotiator,” separates the CPU and memory resource management functionality from MapReduce, so that the same cluster resources can be used by other processing engines and multiple applications can run in Hadoop at once.
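The map-then-reduce flow can be sketched in a few lines of plain Python. This is an illustration of the programming model only, using the classic word-count example; it runs in a single process and does not use the Hadoop API, and the function names are chosen for this sketch.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk stands in for a block of data processed on a different machine.
chunks = ["big data needs big clusters", "clusters store data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster, each call to the map function would run on the machine holding that chunk, and only the much smaller intermediate pairs would travel over the network to be reduced.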
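The idea of a shared resource negotiator can be sketched as follows. This is a deliberately simplified, first-fit toy, not YARN's scheduler: the `Node` class, the `allocate` function, and the job names are hypothetical, and real YARN schedulers apply queues, priorities, and fairness policies that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mb: int       # memory still available on the node
    free_vcores: int   # virtual cores still available on the node


def allocate(nodes, request_mb, request_vcores):
    """Grant a container on the first node with enough free memory and
    vcores; return that node's name, or None if the request must wait."""
    for node in nodes:
        if node.free_mb >= request_mb and node.free_vcores >= request_vcores:
            node.free_mb -= request_mb
            node.free_vcores -= request_vcores
            return node.name
    return None


cluster = [Node("node1", 8192, 4), Node("node2", 4096, 2)]
grants = [
    ("mapreduce-job", allocate(cluster, 6144, 2)),
    ("spark-job", allocate(cluster, 4096, 2)),
    ("late-job", allocate(cluster, 4096, 2)),
]
print(grants)
```

The point of the sketch is that the allocator knows nothing about the engine making the request, which is what lets MapReduce and other engines share one cluster.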
Further, the architecture and features of the base package allow for rebalancing data on multiple nodes, allocation of storage and task scheduling based on node location, versioning of the HDFS, and high availability. It is designed to be cost-effective, flexible and scalable as well as fault-tolerant.
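To make the idea of rebalancing concrete, the following sketch flags nodes whose disk utilization strays too far from the cluster average. It is loosely modeled on the threshold idea behind the HDFS balancer, but the function name, the sample figures, and the 10-percentage-point threshold are assumptions made for illustration.

```python
def unbalanced_nodes(used_gb, capacity_gb, threshold_pct=10.0):
    """Return {node: deviation} for nodes whose utilization differs from the
    cluster-wide utilization by more than threshold_pct percentage points."""
    cluster_util = 100.0 * sum(used_gb.values()) / sum(capacity_gb.values())
    flagged = {}
    for node in used_gb:
        node_util = 100.0 * used_gb[node] / capacity_gb[node]
        deviation = node_util - cluster_util
        if abs(deviation) > threshold_pct:
            flagged[node] = round(deviation, 1)
    return flagged


# Hypothetical figures: three DataNodes of equal capacity, unevenly filled.
used = {"dn1": 900, "dn2": 500, "dn3": 100}
capacity = {"dn1": 1000, "dn2": 1000, "dn3": 1000}
print(unbalanced_nodes(used, capacity))
```

A positive deviation marks a node that should give up data and a negative one marks a node that can receive it.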
DEPLOYING HADOOP
For IT managers, Hadoop deployment involves configuring, deploying and managing a Hadoop cluster. This assumes that the firm has the in-house talent, or the resources to recruit and retain such talent, to do so. If the firm has neither, there are firms that specialize in outsourcing and insourcing this function, though it remains to be seen whether, in the long term, one can commoditize the innovation, creativity, and firm-specific insights that can come from in-house talent.
If the firm does decide to deploy Hadoop and manage its data itself, it must ask the same questions necessitated by any Big Data deployment – questions requiring cross-departmental and cross-functional teams to answer them. These include:
- What the overall goals for Big Data integration at the firm are;
- How the use of Big Data will be aligned with the firm’s strategic business objectives;
- What the firm’s overall data needs are, based on existing and needed data infrastructure;
- What processes and procedures will be implemented to ensure that the firm and its data science unit achieve their organizational goals;
- What human resources are needed; and
- What technical resources are needed, among others.
The firm must determine whether it will use a managed service provider or deploy Hadoop on-premises. The benefits of integrating Hadoop into business operations through a managed service provider are many, and include backup/recovery, automated upgrades, data security tools, technical support, automated configuration and provisioning, query analysis tools, data visualization tools, and testing environments, among others. In essence, managed service providers allow firms to focus on the data analysis aspects of Big Data without the necessity of managing a Hadoop environment. However, the firm must have a solid plan for managing data inflows and outflows to and from the managed service provider, as well as the analytics; both will require human capital and training resources.
If the usage of Hadoop is likely to be sporadic or low volume, then a managed (cloud) deployment makes sense, from both a cost and a human capital perspective. If, however, there is regular high-volume usage, and the firm envisions the need for rapid scaling, then an on-premises install is recommended, since in such an instance usage would likely drive up the costs of using the managed service provider. Further, firms with on-premises installations may find it easier to innovate and enter new markets, especially in data-intensive industries. For example, an entertainment website with a branded social network may decide to start selling branded merchandise through an online storefront. An on-premises installation may allow the firm to scale its new data collection needs more rapidly, and less expensively, than a firm using a managed services provider. Moreover, depending on how the Internet of Things (the connection of both objects and living organisms through embedded computing technologies) affects a firm’s existing operations and/or is harnessed to create new opportunities, the rapid scaling of data available through on-premises deployments may make them the best option.
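The cost trade-off between sporadic and heavy usage can be framed as a simple break-even calculation. The sketch below is illustrative only: every figure in it (the fixed monthly on-premises cost, the per-node hourly rate, and the node count) is a hypothetical placeholder, and a real comparison would also include staffing, data transfer, and upgrade costs.

```python
def managed_monthly_cost(hours, rate_per_node_hour, nodes):
    """Pay-per-use cost of a managed cluster for a month."""
    return hours * rate_per_node_hour * nodes


def break_even_hours(fixed_monthly_onprem, rate_per_node_hour, nodes):
    """Monthly usage hours above which a fixed-cost on-premises cluster
    becomes cheaper than paying a provider per node-hour."""
    return fixed_monthly_onprem / (rate_per_node_hour * nodes)


# Hypothetical inputs: $12,000/month fixed cost, $0.50 per node-hour, 40 nodes.
hours = break_even_hours(fixed_monthly_onprem=12000.0,
                         rate_per_node_hour=0.50, nodes=40)
print(hours)  # usage beyond this many hours per month favors on-premises
```

Under these assumed figures, a firm that runs its cluster only occasionally stays well below the break-even point, while one that runs it almost continuously exceeds it.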
From a technical perspective, the firm must consider the design and architecture of the Hadoop cluster for an on-premises deployment. This starts with setting up and configuring a cluster, on which Hadoop will be installed. This must be carefully thought-out in terms of: choice of operating system; the number of map/reduce slots needed; memory and storage requirements; the number of hard drive disks and their capacity; and the optimal network configuration. These must be chosen with scalability in mind. The cluster install must be evaluated and tested before Hadoop installation. Once it is installed, IT managers also must consider which Hadoop applications to deploy and/or develop to meet the firm’s specific data needs, and map out a plan to obtain or create them.
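Storage requirements in particular lend themselves to back-of-the-envelope arithmetic. The following sketch estimates raw disk capacity and node count from a target data volume; the replication factor of 3 is assumed to match the common HDFS default, while the 25% reserve for temporary and intermediate data, the 12 disks per node, and the 4 TB disk size are hypothetical planning values.

```python
import math


def nodes_needed(data_tb, replication=3, temp_overhead=0.25,
                 disks_per_node=12, disk_tb=4.0):
    """Estimate raw storage (TB) and DataNode count for a given data volume.

    raw storage = data * replication / (1 - share reserved for temp data)
    """
    raw_tb = data_tb * replication / (1 - temp_overhead)
    per_node_tb = disks_per_node * disk_tb
    return raw_tb, math.ceil(raw_tb / per_node_tb)


raw_tb, node_count = nodes_needed(100)   # plan for 100 TB of data
print(raw_tb, node_count)
```

Re-running the estimate with projected data growth gives a rough sense of how many nodes must be added over time, which is one way to keep scalability in view during the design stage.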
From a human resources perspective, an on-premises deployment includes the personnel to deploy and maintain the Hadoop cluster, as well as the data scientists (either in a standalone data science department, dispersed throughout strategic business units, or a combination of the two), and other staff who will serve as end-users. From an implementation perspective, it is recommended to hire a consultant or consulting firm to design the architecture and assist in the deployment, due to the complexity of enterprise-level Hadoop implementations and the paucity of professionals versed in the latest Hadoop-specific technologies and products. And from a process standpoint, IT managers must work with department heads to determine which legacy systems should be phased out and what training end-users will need.
MANAGING HADOOP DEPLOYMENTS
Once deployed, firms must ensure that the environment operates with low latency, processes dynamic data in real-time and supports both data parallelism and high computing intensity. The deployment must be able to handle analytics tasks that place a high demand on computing resources without failing and without necessitating further customization or server space, with the attendant loss of data center space and financial resources. Instead, IT managers must use Hadoop’s framework to improve server utilization and ensure load balancing. They must also optimize data ingestion to ensure the integrity of the data, and perform regular maintenance on the various nodes throughout the cluster, replacing and upgrading nodes and operating systems when necessary, to minimize the effects of drive failures, overheating servers, obsolete technologies, and other developments that can create service interruptions. When scaling the firm’s Hadoop deployment, they should use open source frameworks to ensure flexibility, compatibility, and innovative approaches to common data processing and storage problems. Of course, they also must ensure that common business analytics programs such as SAS and Revolution R can easily pull data from Hadoop for the use of internal data scientists, with minimal service interruptions.
Further, IT managers, whether utilizing an on-premises deployment or a managed services provider, must ensure the security of the data. Many managed service providers, such as Cloudera, offer security options like Project Rhino and Sentry, both of which are now open source Hadoop security initiatives. Some providers also offer governance solutions, such as Cloudera’s Cloudera Navigator, to help ensure legal compliance. Given the volumes of data and the prevalence of hackers, viruses, and other security threats, this is one responsibility that cannot be taken lightly.
HADOOP PRODUCTS AND SERVICES FOR BIG DATA MANAGEMENT
It is important to note that while Hadoop is open source, deploying it is not free: deployment, customization and optimization can drive up costs. However, since it is open source, any firm can offer Hadoop-based products and services. The firms offering the most robust releases of Hadoop-based services for Big Data management include Amazon Web Services, Cloudera, and Hortonworks, with Cloudera (which counts Doug Cutting as an employee) arguably being the most popular. Hortonworks is also well known for employing many Hadoop experts who formerly worked at Yahoo, a heavy Hadoop user.
The Apache Software Foundation offers many open source packages that can be added to a Hadoop installation. Further, many third-party business intelligence providers, like SAP AG, SAS, IBM, and Oracle, provide support for Hadoop implementations, regardless of the managed service provider.
GLOSSARY OF HADOOP-RELATED TERMS
The following are a few of the terms critical to understanding how Hadoop can be deployed at a firm to harness its data.
- Commodity computing: this refers to the optimization of computing components to maximize computation and minimize cost, and is usually performed with computing systems utilizing open standards. This is also known as commodity cluster computing.
- DataNode: this is what stores data in Hadoop. In a Hadoop cluster, there are multiple DataNodes, with data replicated across them.
- Database management system: a system that allows a user to order, store, retrieve and manipulate data from a database. HDFS is commonly referred to as a database management system when, in reality, it is a file system.
- ELT (Extract, Load and Transform): an acronym describing the processes, in order, which comprise a data manipulation method: data is extracted from its source, loaded into a file system or a database management system, and then transformed there.
- File system: a directory-based system that controls how data is stored and retrieved.
- Hadoop cluster: this is a computational cluster designed to store, process and analyze large volumes of unstructured data in a distributed computing environment.
- Hive: a Hadoop-based open source data warehouse that provides many relational database features, such as querying and analysis. Hive uses a query language similar to SQL, called HiveQL.
- HUE: a browser-based interface for Hadoop end-users.
- JobTracker: this is a service that distributes MapReduce subtasks to specific nodes, once a client application has submitted a job.
- NameNode: the repository of the directory tree of all files in the HDFS, that also tracks where data is kept throughout the Hadoop cluster.
- Parallel processing: the concurrent use of two or more CPUs to run a program, which increases computing speed.
- Pig: a programming platform, with its own scripting language (Pig Latin), used to create MapReduce programs in Hadoop.
- Unstructured data: data that lacks a predesigned data model, such as social media comments, or one that is not organized in a prearranged manner, such as tags in a number of documents.
- ZooKeeper: Hadoop-based infrastructure and services that allow cluster-wide synchronization.