Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute force approach of more storage and more servers reaching a logistical limit, the Facebook engineers have increased their level of data compression to 8x (using a custom modification of the Hortonworks ORCFile) from a previous 5x (using RCFile) compression. The Hortonworks ORCFile is generally faster than RCFile when reading, but is slower on writing. Facebook’s custom ORCFile was always fastest on both read and write and also the best compression.
Telecom OEM WebNMS discusses their use of Hadoop. In one trial, they stored latency data from 7 million cable modems. Using a Hadoop cluster of 20 nodes, they observers a factor of 1o increase in performance compared to a relational database. In addition, the cost to deploy was a small fraction of the a traditional infrastructure.
By definition, a SAN is about consolidating data and Hadoop is about distributing data. Can they co-exist? Not according to this article.
If you take data out of a Hadoop node and put it on a SAN, you’re reducing performance. You want data to transfer to the CPU at bus speed, not network speed. And maybe a heavy Hadoop load could saturate your network.
Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.
Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.
Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.
Posted in C++, cloudera, Hive, HortonWorks, Impala, Java, MapReduce, performance, Python, Rails, Scala, Shark, Spark, SQL, Twitter
Tagged apache.org, berkeley.edu, databricks.com, gigaom.com, scala-lang.org
There’s limited potential for improvements in throughput of high performance transactional databases. Financial institutions are looking to Hadoop to supplement their application stack, but need to accept these cultural differences.
- Data quality is not 100%. Must use algorithms to refine on an ongoing basis during the transaction. Otherwise look for use cases where close is close enough. For example, Spotify uses Hadoop (HortonWorks) to select song recommendations. You probably wouldn’t use Big Data results to make decisions on which stocks to trade, at what second, at what price to trade them.
- Batch will never be real time. Some users are able to get algorithms to complete in hours rather than days, but even if the hours can be reduced to some number of minutes, the evolution of Hadoop does not seem to be approaching real time.
GridGain has a 100% HDFS compatible RAM solution that it claims is 10x faster for IO and network intensive MapReduce processing. I understand the IO, but am not sure why it work help with network intensive operations. It can be used standalone or along with disk based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.
The Hadoop ecosystem contains many application components. Configuring them (memory, ram, etc) is challenging so getting a full distribution from a single vendor, as opposed to downloading it all from Apache sites or from multiple vendors, can be a good idea.
The application components interface with each other via Java APIs. One poorly written piece of custom Java code can result in performance problems that cascade throughout the entire system. You don’t want to write these integrations yourself. Let the vendor provide it.
eBay worked with HortonWorks and ScaledRisk to improve Mean Time to Recovery (MTTR). Not only did this require faster recovery time, but also faster detection of failures.
The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.
- Node/Region server failed while writing
- Node/Region server failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot (due to CPU temperature)
- NIC speed steps down to 100Mb/s from gigabit speeds
The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.
- Samza is a massively scalable framework for distributed stream transport and limited processing
- Samza uses Yarn and Apache Kafka (publish/subscribe messaging able to handle 100s of MB reads/writes per second)
- LinkedIn utilizes Samza to publish 26+ billion unique messages per day to 100s of message feeds that are picked up by 1000s of automated subscribers (some are real time, others batch)