Category Archives: Spark

Machine learning on its way from Cloudera?

In 2013 Cloudera acquired a company called Myrrix, which has morphed into project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal. Before is becomes a product it’ll be rewritten using Spark.

Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).

This competes with Apache Mahout, which processes in batch mode only.



Hydra is a non-Hadoop database for realtime analysis of dynamic data

Hydra is not built on top of Hadoop, but functions similar to Summingbird, Storm, and Spark.

Data can stream into it, and analytics can be run in real time, rather than only in batch.

AddThis is the company that originally developed Hydra, which is now in open sourced through Apache. AddThis runs six Hydra clusters, one of which is comprised of 156 servers and processes 3.5 billion transactions per day.


Databricks to commercialize Spark and Shark in-memory processing

Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.


What’s the root of the differences been Big Data and Relational Database

Database World

  • Encode the relationships between objects in tables, and use keys to link the tables together
  • Standard query language (with emphasis on standard, applying to all database vendors, versions, implementations, programmers) relies on the relationship encoding and vendor architecture for optimization/efficiency
  • Algorithms rely on a single pass execution, using operations such as Joins and Group Bys and Counts.

Big Data World

  • Based on linear algebra and probability theory
  • Encode objects using a property list
  • Data  stored as a matrix, similar to relational tables, except that the intersection of multiple matrices does not imply relationships
  • Algorithms have iterative solutions with multiple steps each of which store results that are used as input by the next step, which is very inefficient to execute in SQL
  • Indices are not needed, since massively scaled hardware will be used to process the entire data set by brute force or by intelligent jobs (on the front side in Map or the back side in Reduce).

Either you structure your data ahead of time so that SQL algorithms will work, or you break down your algorithms in to algebra (MapReduce jobs) in order to process semi-structured data.

Where does this leave systems like Hive, that enable programmers to write something that looks like SQL and is transformed on the backend into MapReduce jobs? Maybe purists don’t like Hive because it’s used by people on the fence between Database and Big Data, instead of those who have fully converted to Big Data?

Systems similar yet different from Hadoop/MapReduce. They claim to be Big Data, but have roots in the database world.

  • Twitter’s Storm/Summingbird is event driven (not batch) so can target real time applications
  • Spark uses iterative algorithms and in-memory processing with the goal of being a few orders of magnitude faster than MapReduce