Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of adding more storage and more servers reaching a logistical limit, Facebook’s engineers have increased their compression ratio to 8x (using a custom modification of the Hortonworks ORCFile), up from 5x with RCFile. The Hortonworks ORCFile is generally faster than RCFile when reading, but slower when writing. Facebook’s custom ORCFile was consistently the fastest on both reads and writes and also achieved the best compression.
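To put the two compression ratios in perspective, here is a quick back-of-the-envelope sketch. It assumes (my assumption, not something the article states) that both ratios are measured against the same uncompressed data. The function names are only for illustration.

```python
def stored_size(raw_size, compression_ratio):
    """Size on disk for raw_size units of uncompressed data."""
    return raw_size / compression_ratio


def savings_fraction(old_ratio, new_ratio):
    """Fraction of disk space saved by moving from old_ratio to new_ratio."""
    return 1 - old_ratio / new_ratio


# Moving from 5x to 8x compression on the same raw data
print(savings_fraction(5, 8))  # 0.375
```

In other words, under that assumption the same raw data would need a bit over a third less disk at 8x than at 5x.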
I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2GB RAM, but requires 4GB in order to enable Ambari and HBase. Good thing that I have 8GB in my laptop.
The “Hello World” tutorial gave me hands-on experience with:
- Uploading a file into HCatalog
- Typing queries into Beeswax, which is a GUI into Hive
- Running a more complex query by writing a short script in Pig
There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.
GridGain has a 100% HDFS-compatible RAM solution that it claims is 10x faster for IO and network intensive MapReduce processing. I understand the IO, but am not sure why it would help with network intensive operations. It can be used standalone or alongside disk-based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.
Hortonworks developed a way to add to Hive the ability to update multiple records as a single transaction following the ACID model. Part of the complexity of transactional updates is that the data must be written to all applicable nodes before the transaction can be considered complete. The naming convention within HDFS folders includes a transaction ID so that both committed and uncommitted files persist until all portions of the transaction have been completed. Because the transaction ID is included, any read operations that occur before the transaction has completed will access the old data.
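Here is a minimal sketch of how a reader could use transaction IDs embedded in file names to skip uncommitted data. The `base_<id>` / `delta_<id>` naming scheme and the `visible_files` helper are hypothetical, made up for illustration; the actual layout Hive uses may differ.

```python
import re

# Hypothetical naming scheme for illustration: "<kind>_<transaction id>",
# e.g. "base_5" or "delta_7". The real Hive layout may differ.
FILE_PATTERN = re.compile(r"^(?:base|delta)_(\d+)$")


def visible_files(file_names, committed_txn_ids):
    """Return only the files whose transaction ID has been committed."""
    visible = []
    for name in file_names:
        match = FILE_PATTERN.match(name)
        if match and int(match.group(1)) in committed_txn_ids:
            visible.append(name)
    return visible


files = ["base_5", "delta_6", "delta_7"]
# Transaction 7 is still in flight, so readers ignore its file.
print(visible_files(files, committed_txn_ids={5, 6}))
```

The uncommitted file sits in the folder next to the committed ones, but a reader that filters by transaction ID keeps seeing the old data until the transaction completes.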
Why go through all of this work to add an ACID model to Hive rather than just use HBase, which already supports transactions? The primary reason is that HBase only supports Consistency at the level of a single row update, rather than across a larger set of operations. Without Consistency, there is no ACID. Hortonworks lists a few other reasons, but I’m discounting them because they are general reasons why they prefer Hive over HBase.
eBay worked with Hortonworks and ScaledRisk to improve Mean Time to Recovery (MTTR). This required not only faster recovery, but also faster detection of failures.
The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.
- Node/Region server failed while writing
- Node/Region server failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot (due to CPU temperature)
- NIC speed steps down to 100Mb/s from gigabit speeds
The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.
Cloudera has announced a realtime search engine running on top of HBase and HDFS, enabling natural language keyword searches.
Indices are stored in HDFS and indexing takes place in batches using MapReduce. Realtime indexing happens via Flume and the Lily HBase indexer.
Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.
Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase because it has no master node, which eliminates a single point of failure (and the need for traditional redundancy). In theory, its scalability is strictly linear; doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.
Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large volumes of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.
DataStax provides a commercially packaged version of Cassandra.
MongoDB is a good non-HBase alternative to Cassandra.
I’m summarizing this article. For specifics (such as how to split machines across racks and configure the network switches), see the article. None of this content is specific to an operating system or hardware vendor, but the discussions generally assume Linux.
The goal is to minimize data movement by processing data on the same machine that stores it. Therefore each machine in the cluster needs appropriate CPU and disk. The problem is that when building the cluster, the nature of the queries and the resulting bottlenecks may not yet be known. If a business is building its first Hadoop cluster, it may not yet fully understand the types of business problems that it will eventually solve. That’s in contrast to a business deploying its Nth Oracle server.
Types of bottlenecks:
- IO: reading from disk (or a network location)
  - data import/export
  - data transformation
- CPU: processing the Map query
  - text mining
  - natural language processing
Other issues, since a cluster could eventually scale to hundreds or thousands of machines
The Cloudera Manager can provide realtime statistics about how a currently running MapReduce job impacts the CPU, disk, and network load.
Roles of the components within a Hadoop cluster:
- Name Node (and Standby Name Node): coordinating data storage on the cluster
- Job Tracker: coordinating data processing
- Task Tracker
- Data Node
Data Node and Task Tracker
- The vast majority of the machines in a cluster will only perform the roles of Data Node and Task Tracker, which should not be run on the same nodes as the Name Node and Job Tracker.
- Other components (such as HBase) should only be run on the Data Nodes if they operate on data. You want to keep data local as much as possible. HBase needs about a 16 GB heap to avoid garbage collection timeouts. Impala will consume up to 80% of available RAM.
- Assumed to be lower performance machines than the Name Node and Job Tracker
Name Node and Job Tracker
- Standby Name Node should (obviously) not be on the same machine as the Name Node.
- Name Node (and Standby Name Node) and Job Tracker should be enterprise class machines (redundant power supplies, enterprise class RAID’ed disks)
- Name Node should have RAM in proportion to the number of data blocks in the cluster: about 1 GB of RAM for every 1 million blocks in HDFS. For a cluster of 100 Data Nodes, 64 GB of RAM is fine. Since the machine’s tasks will be disk intensive, you’ll want enough RAM to minimize virtual memory swapping to disk.
- 4 – 6 TB of disk, not RAID’ed (JBOD configuration)
- 2 CPUs (at least quad core). More CPUs and/or cores are recommended over faster CPU speed, since in a large cluster the higher speed will draw more power and generate more heat, yet not scale as well as simply adding more CPUs or, better yet, more nodes.
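The Name Node RAM rule of thumb above (1 GB per 1 million blocks) is easy to turn into a rough calculator. This is only a sketch: the function names are mine, and the 128 MB block size is an assumption, so check your cluster’s configured value.

```python
import math

GB_RAM_PER_MILLION_BLOCKS = 1  # rule of thumb quoted above


def namenode_ram_gb(num_blocks):
    """Estimate Name Node RAM (GB) from the number of HDFS blocks, rounded up."""
    return math.ceil(num_blocks / 1_000_000) * GB_RAM_PER_MILLION_BLOCKS


def blocks_for_data(data_bytes, block_size_bytes=128 * 1024**2):
    """Approximate block count, assuming data is packed into full blocks.

    The 128 MB default is an assumption, not a value from the article.
    """
    return math.ceil(data_bytes / block_size_bytes)


print(namenode_ram_gb(64_000_000))  # 64
```

Keep in mind that many small files produce far more blocks than the full-block estimate suggests, so treat the result as a lower bound.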
Cloudera has defined four standard configurations:
- Light Processing (I’m not sure what the use case is for this. Prototype? Sandbox?)
- Balanced Compute (recommended for your first cluster, since it’s not likely you’ll properly identify which configuration is best suited for your use case)
- Storage Heavy
- Compute Heavy
Ambari features:
- Wizard for installing/configuring Hadoop services across many hosts
- Start, stop, reconfigure Hadoop across many hosts
- Dashboard for health & status
- Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
- Alerting via Nagios
- Integrate provisioning, management, and monitoring into your own applications using the Ambari REST APIs
These tools are supported by Ambari:
- HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop
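To give a feel for the REST integration mentioned above, here is a small sketch that builds (but does not send) a request listing the services of a cluster. The host name, cluster name, and credentials are placeholders, and the `/api/v1/clusters/<name>/services` path is what I recall from the Ambari API documentation, so verify it against your Ambari version.

```python
import base64
import urllib.request


def build_ambari_request(base_url, cluster_name, user, password):
    """Build (but do not send) a GET request listing a cluster's services."""
    # Endpoint path is an assumption based on the Ambari v1 REST API.
    url = f"{base_url}/api/v1/clusters/{cluster_name}/services"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host, cluster name, and credentials
req = build_ambari_request("http://ambari.example.com:8080", "sandbox", "admin", "admin")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```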
Sqoop
- Tool for bi-directional data transfer between Hadoop and relational databases using JDBC.
- Optimized drivers for specific database vendors are available.
- Command line tool
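As a rough illustration of what such a command-line, JDBC-based transfer looks like, the sketch below assembles a Sqoop import command without running it. The JDBC URL, table, user, and target directory are made-up placeholders, and the helper function is mine.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username):
    """Assemble (but do not run) a Sqoop import command line."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC connection string
        "--username", username,
        "--table", table,            # source table in the relational database
        "--target-dir", target_dir,  # destination directory in HDFS
    ]


# Placeholder values for illustration only
cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "orders", "/user/hadoop/orders", "etl_user"
)
print(shlex.join(cmd))
```

A real invocation would also need a way to supply the password, and the list form can be passed straight to `subprocess.run` on a machine where Sqoop is installed.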
Flume and FlumeNG (Next Generation)
- Enables realtime streaming into HDFS and HBase.
- The use case for Flume is streaming data, such as continual input from web server logs.