eBay worked with Hortonworks and ScaledRisk to improve Mean Time to Recovery (MTTR). This required not only faster recovery but also faster detection of failures.
The following types of failure were considered, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.
- Node/Region server failed while writing
- Node/Region server failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot (due to CPU temperature)
- NIC speed steps down to 100Mb/s from gigabit speeds
The tests had favorable results, and the resulting improvements (some implemented, some proposed) were submitted to Apache HBase and HDFS.
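To see why detection time matters as much as recovery time, here is a minimal sketch that treats the outage for a single failure as detection time plus recovery time. The function name and all of the figures are hypothetical and are not taken from the eBay tests.

```python
def mean_time_to_recovery(detection_s: float, recovery_s: float) -> float:
    """Outage for one failure: time to notice it plus time to recover from it."""
    if detection_s < 0 or recovery_s < 0:
        raise ValueError("durations must be non-negative")
    return detection_s + recovery_s


# Hypothetical figures, for illustration only (seconds).
baseline = mean_time_to_recovery(detection_s=180.0, recovery_s=60.0)
faster_recovery_only = mean_time_to_recovery(detection_s=180.0, recovery_s=30.0)
faster_detection_too = mean_time_to_recovery(detection_s=30.0, recovery_s=30.0)

print(baseline, faster_recovery_only, faster_detection_too)
```

With these made-up numbers, halving the recovery time alone barely moves the total, because the detection delay dominates.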
I’m summarizing this article. For specifics (such as how to split machines across racks and configure the network switches), see the article. None of this content is specific to an operating system or hardware vendor, but the discussions generally assume Linux.
The goal is to minimize data movement by processing data on the same machine that stores it. Therefore each machine in the cluster needs appropriate CPU and disk. The problem is that when the cluster is being built, the nature of the queries and the resulting bottlenecks may not yet be known. If a business is building its first Hadoop cluster, it may not yet fully understand the types of business problems that the cluster will eventually solve. That’s in contrast to a business deploying its Nth Oracle server.
Types of bottlenecks:
- IO: reading from disk (or a network location)
  - data import/export
  - data transformation
- CPU: processing the Map query
  - text mining
  - natural language processing
Other issues, since a cluster could eventually scale to hundreds or thousands of machines
The Cloudera Manager can provide realtime statistics about how a currently running MapReduce job impacts the CPU, disk, and network load.
Roles of the components within a Hadoop cluster:
- Name Node (and Standby Name Node): coordinating data storage on the cluster
- Job Tracker: coordinating data processing
- Task Tracker
- Data Node
Data Node and Task Tracker
- The vast majority of the machines in a cluster will only perform the roles of Data Node and Task Tracker, which should not be run on the same nodes as the Name Node and Job Tracker.
- Other components (such as HBase) should only be run on the Data Nodes if they operate on data. You want to keep data local as much as possible. HBase needs about 16 GB of heap to avoid garbage collection timeouts. Impala will consume up to 80% of available RAM.
- Assumed to be lower performance machines than the Name Node and Job Tracker
Name Node and Job Tracker
- Standby Name Node should (obviously) not be on the same machine as the Name Node.
- Name Node (and Standby Name Node) and Job Tracker should be enterprise-class machines (redundant power supplies, enterprise-class RAID disks)
- Name Node should have RAM in proportion to the number of data blocks in the cluster: 1 GB of RAM for every 1 million blocks in HDFS. For a cluster of 100 Data Nodes, 64 GB of RAM is fine. Since the machine’s tasks will be disk intensive, you’ll want enough RAM to minimize virtual memory swapping to disk.
- 4 – 6 TB of disk, not raid’ed (JBOD configuration)
- 2 CPUs (at least quad core). More CPUs and/or cores are recommended over faster clock speeds: in a large cluster the higher speed draws more power and generates more heat, yet does not scale as well as simply adding more CPUs or, better yet, more nodes.
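The Name Node sizing rule above (1 GB of RAM per 1 million HDFS blocks) can be turned into a quick estimate. This is a rough sketch only: the function names are made up for illustration, the 128 MB block size is an assumption (check your cluster's configured block size), and it assumes files fill whole blocks, so a cluster with many small files will have more blocks than this predicts.

```python
BLOCKS_PER_GB_RAM = 1_000_000  # rule of thumb from the text: 1 GB RAM per 1 million blocks


def estimate_blocks(data_bytes: int, block_size_bytes: int = 128 * 1024**2) -> int:
    """Estimate the HDFS block count, assuming data packs into full blocks.

    The 128 MB default is an assumption, not a value from the article.
    """
    if data_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-data_bytes // block_size_bytes)  # integer ceiling division


def name_node_ram_gb(total_blocks: int) -> int:
    """RAM (in GB) suggested by the 1 GB per million blocks rule, rounded up."""
    if total_blocks < 0:
        raise ValueError("block count must be non-negative")
    return -(-total_blocks // BLOCKS_PER_GB_RAM)


# Example: 1 PiB of data stored in 128 MB blocks.
blocks = estimate_blocks(1024**5)
print(blocks, name_node_ram_gb(blocks))
```

Under these assumptions a petabyte of data needs only a few gigabytes for block metadata, which fits the statement that 64 GB is comfortable for a 100-node cluster.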
Cloudera has defined four standard configurations
- Light Processing (I’m not sure what the use case is for this. Prototype? Sandbox?)
- Balanced Compute (recommended for your 1st cluster, since it’s not likely you’ll properly identify which configuration is best suited for your use case)
- Storage Heavy
- Compute Heavy
Hadoop consists of two components
- MapReduce:
  - programming framework
  - distributes work to different Hadoop nodes
  - gathers results from multiple nodes and resolves them into a single value
  - the source data comes from HDFS, and the output is typically written back to HDFS
  - Job Tracker: manages nodes
  - Task Tracker: takes orders from the Job Tracker
  - MapReduce was originally developed by Google.
  - Apache MapReduce is built on top of Apache YARN, which is a framework for job scheduling and cluster resource management.
- HDFS (Hadoop Distributed File System): file store
  - It is neither strictly a file system nor a database, yet it has aspects of both.
  - Within HDFS are two components
    - Data Nodes:
    - Name Nodes:
      - where to find the data; maps blocks of data on slave nodes (where the Data Nodes and Task Trackers are running)
      - Open, Close, Rename files
  - On top of HDFS you can run HBase
    - Super scalable (billions of rows, millions of columns) repository for key-value pairs
    - This is not a relational database; it cannot have multiple indices
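The MapReduce flow described above (distribute work, then gather the results and resolve them to a single value per key) can be simulated in a few lines of plain Python. This is only an in-process sketch of the idea, not the Hadoop API: the function names are hypothetical, and the list of strings stands in for blocks of a file that would normally live in HDFS.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: resolve each key's list of values into a single value."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


# Each string stands in for a block of data that a different node would process.
chunks = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster the map calls would run on the nodes that hold each block, which is what keeps data movement low.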