Hadoop consists of two components
- MapReduce – programming framework
- Map
- distributes work to different Hadoop nodes
- Reduce
- gathers results from multiple nodes and resolves them into a single value
- the source data comes from HDFS, and the output is typically written back to HDFS
- Job Tracker: manages nodes
- Task Tracker: takes orders from the Job Tracker
- MapReduce was originally developed by Google.
- Apache MapReduce is built on top of Apache YARN which is a framework for job scheduling and cluster resource management.
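The map/shuffle/reduce flow above can be sketched in plain Python as a toy in-memory word count. This does not use Hadoop itself; it only illustrates the roles: Map emits key-value pairs on many nodes, a shuffle groups them by key, and Reduce resolves each group into a single value.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs; in Hadoop this runs on many nodes in parallel
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key before reduction
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: resolve the list of values for one key into a single value
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In real Hadoop the mappers and reducers run as tasks scheduled across the cluster, and the intermediate shuffle happens over the network rather than in a local dict.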
- HDFS (Hadoop Distributed File System) – file store
- It is not quite a conventional file system and not quite a database, yet it has traits of both.
- Within HDFS are two components
- Data Nodes:
- data repository
- Name Nodes:
- where to find the data; maps blocks of data on slave nodes (where job and task trackers are running)
- Open, Close, Rename files
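The Name Node's role as a block map can be sketched with a toy lookup table. The file path, block IDs, and data-node names below are invented for illustration; real HDFS also replicates each block (default factor 3, shown here with 2).

```python
# Toy name-node table: file path -> ordered list of (block_id, data_nodes).
block_map = {
    "/logs/day1.txt": [
        ("blk_001", ["datanode1", "datanode3"]),
        ("blk_002", ["datanode2", "datanode3"]),
    ],
}

def locate(path):
    # A client asks the name node where a file's blocks live, then
    # reads the actual bytes directly from the listed data nodes.
    return block_map[path]

locations = locate("/logs/day1.txt")
print(locations[0])  # ('blk_001', ['datanode1', 'datanode3'])
```

This is why the Name Node holds metadata only (where blocks are, open/close/rename), while the Data Nodes hold the data itself.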
- On top of HDFS you can run HBase
- Super scalable (billions of rows, millions of columns) repository for key-value pairs
- HBase is not a relational database; it is indexed only by row key and cannot have multiple (secondary) indices
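HBase's data model is roughly a sorted map of row key → column → value. A minimal in-memory sketch (the table contents and column names are invented) shows why only row-key lookups and row-key range scans are cheap:

```python
# Sketch of the HBase model: row key -> {"family:qualifier" -> value}.
# The row key is the only index, hence no secondary indices out of the box.
table = {}

def put(row_key, column, value):
    table.setdefault(row_key, {})[column] = value

def get(row_key, column):
    return table.get(row_key, {}).get(column)

def scan(start, stop):
    # Rows are kept sorted by key, so range scans over the row key are cheap
    for key in sorted(table):
        if start <= key < stop:
            yield key, table[key]

put("user#1001", "info:name", "Ada")
put("user#1002", "info:name", "Grace")
print(get("user#1001", "info:name"))          # Ada
print([k for k, _ in scan("user#1000", "user#2000")])
```

Looking up by anything other than the row key (e.g. by `info:name`) would require a full scan, which is the practical meaning of "cannot have multiple indices."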