Category Archives: Twitter

Databricks to commercialize Spark and Shark in-memory processing

Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.



Twitter creates Hadoop hybrid system to mitigate tradeoffs between batch and stream processing

Storm is an open sourced system (from Twitter) that processes streams of big data in realtime (but without 100% guaranteed accuracy), making it the opposite of Hadoop which processes a repository of big data in batch.

Twitter has needs for both streaming and batch, so created an open sourced hybrid system called Summingbird. It does what Storm does, then uses Hadoop for error correction.

Twitter’s use cases include updating timelines and trending topics in real time, but then making sure that the analytics are accurate.

Yahoo’s contribution to this effort was to enable Storm to be configured using Yarn.