Apache Hadoop and its components

Hadoop consists of two core components:

  1. MapReduce –
    • programming framework
    • Map
      • work is split into map tasks that run in parallel on different Hadoop nodes
    • Reduce
      • gathers results from multiple nodes and resolves the values for each key into a single value
      • the input comes from HDFS, and the output is typically written back to HDFS
    • Job Tracker: manages nodes and assigns tasks to them
    • Task Tracker: takes orders from the Job Tracker
    • MapReduce was originally developed by Google.
    • Apache MapReduce is built on top of Apache YARN, which is a framework for job scheduling and cluster resource management. (As of Hadoop 2, YARN takes over the roles of the Job Tracker and Task Trackers.)
  2. HDFS (Hadoop Distributed File System) – file store
    • It is a distributed file system rather than a database, although it often serves as the data store for database-like tools that run on top of it.
    • Within HDFS are two components
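      • Data Nodes:
        • data repository; they store the actual blocks of data
      • Name Nodes:
        • where to find the data; maps blocks of data to the slave nodes that hold them (where the task trackers are also running)
        • handles file system operations such as opening, closing, and renaming files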
    • On top of HDFS you can run HBase
      • Super scalable (billions of rows, millions of columns) repository for key-value pairs
      • This is not a relational database: rows are looked up by a single row key, and it cannot have multiple indices
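
To make the Map and Reduce steps above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the "nodes" are just function calls. It only shows the shape of the computation, using the classic word-count example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: resolve all values for one key into a single value."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each input split would be mapped on a different node;
    # here everything runs in one process (an assumption made for illustration).
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the input lines would be read from HDFS and the counts written back to HDFS, with the framework doing the grouping step between the two phases.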

Cloudera Distribution of Hadoop

Hadoop is an open source Apache project, but a lot of the contributions come from Cloudera.

The Cloudera Distribution of Hadoop (CDH) appears to be the de facto standard, although other vendors such as IBM have their own. Cloudera provides a downloadable VM with a fully configured single node of Hadoop. I was able to get this up and running on my own MacBook Pro running Oracle VirtualBox in about 15 minutes.

Cloudera claims to have more customers and more experience than any other Hadoop vendor.

Sources: