Monthly Archives: October 2013

Apache Hadoop and its components

Hadoop consists of two components

  1. MapReduce –
    • programming framework
    • Map
      • distributes work to different Hadoop nodes
    • Reduce
      • gathers results from multiple nodes and resolves them into a single value (a toy sketch of this flow appears after this list)
      • the input comes from HDFS, and the output is typically written back to HDFS
    • Job Tracker: manages nodes
    • Task Tracker: takes orders from the Job Tracker
    • MapReduce was originally developed by Google.
    • Apache MapReduce is built on top of Apache YARN which is a framework for job scheduling and cluster resource management.
  2. HDFS (Hadoop Distributed File System) – file store
    • It is not quite a traditional file system and not quite a database; it has characteristics of both.
    • Within HDFS are two components
      • Data Nodes:
        • data repository
      • Name Nodes:
        • where to find the data; maps blocks of data on slave nodes (where job and task trackers are running)
        • Open, Close, Rename files
    • On top of HDFS you can run HBase
      • Super scalable (billions of rows, millions of columns) repository for key-value pairs
      • This is not a relational database; it cannot have multiple indices

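To make the Map and Reduce roles concrete, here is a toy, single-machine sketch in Python of the classic word-count pattern (nothing here is Hadoop code; the function names and the in-memory "shuffle" step are just for illustration). Map emits key-value pairs, the framework groups them by key, and Reduce collapses each group to a single value.

    from collections import defaultdict

    # Toy illustration of the MapReduce flow on one machine (not Hadoop code).
    # On a real cluster, map tasks run on many nodes and the framework
    # performs the shuffle/sort before the reduce tasks run.

    def map_phase(lines):
        """Map: emit a (word, 1) pair for every word in the input."""
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle(pairs):
        """Group values by key -- Hadoop does this between map and reduce."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(grouped):
        """Reduce: collapse each key's values into a single count."""
        for key, values in grouped:
            yield key, sum(values)

    if __name__ == "__main__":
        lines = ["the quick brown fox", "the quick red fox"]
        for word, count in reduce_phase(shuffle(map_phase(lines))):
            print(word, count)
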
Programming MapReduce

MapReduce is often programmed in Java, but other options are available. Hadoop Streaming is a utility for programming against MapReduce in languages such as C, Perl, Python, C++, and Bash. For example, Python can be used for the Mapper and AWK for the Reducer.
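
As a rough sketch of how Streaming works (the file names below are my own, and this is the standard word-count example rather than anything from a real project): the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value pairs to stdout, and Hadoop sorts the mapper's output by key before feeding it to the reducer.

    #!/usr/bin/env python
    # mapper.py -- reads raw text on stdin, emits one "word<TAB>1" line per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- Hadoop sorts the mapper output by key before this runs,
    # so all lines for a given word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The pair would typically be submitted with the hadoop-streaming jar (the jar's path varies by distribution), passing -mapper and -reducer along with -input and -output directories on HDFS.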

Hive can be used to program MapReduce using a subset of SQL.

Pig is another high-level procedural language created specifically for MapReduce programming.

Cheap Hardware

In theory, a big data cluster uses low cost commodity hardware (2 CPUs, 6-12 drives, 32 GB RAM). By clustering many cheap machines, high performance can be achieved at a low cost, along with high reliability due to decentralization.

There is little benefit to running Hadoop nodes in a virtualized environment (e.g., VMware), since an active node (batch processing) may push RAM and CPU utilization to its limits. This is in contrast to an application or database server, which has idle periods and bursts but generally runs at a fairly constant, moderate utilization. What is of greater benefit is a cloud implementation (e.g., Amazon Elastic Compute Cloud) in which one can scale from a few nodes to hundreds or thousands of nodes in real time as the batch cycles through its process.

Unlike a traditional n-tier architecture, Hadoop combines compute & storage on the same box. In contrast, an Oracle cluster would typically store its databases on a SAN, and application logic would reside on yet another set of application servers that probably do not use their inexpensive internal drives for application-specific tasks.

A Hadoop cluster is linearly scalable, up to 4000 nodes and dozens of petabytes of data.

In a traditional db cluster (such as Oracle RAC), the architecture of the cluster should be designed with knowledge of the schema and the volume (input and retrieval) of the data. With Hadoop, scalability is, at worst, linear. Using a cloud architecture, Hadoop nodes can be added or removed on the fly as utilization increases or decreases.

Cloudera Distribution of Hadoop

Hadoop is an open source Apache project, but a lot of the contributions come from Cloudera.

The Cloudera Distribution of Hadoop (CDH) appears to be the de facto standard, although other vendors such as IBM have their own distributions. Cloudera provides a downloadable VM with a fully configured single node of Hadoop. I was able to get this up and running on my own MacBook Pro running Oracle VirtualBox in about 15 minutes.

Cloudera claims that they have more customers and more experience than any other Hadoop vendor.


What makes “Big Data” different from “A Lot” of data

Big Data pertains to any/all of the following

  • Data may be captured at a rapid rate
  • Data may be unstructured, or consist of a variety of structures; in other words, it is not well suited to a relational database
  • The data can't easily be mined using SQL; extraction seems better suited to procedural algorithms
  • Does not have to be a lot of data. Could be a few MBs, but it also could be a few petabytes.