Category Archives: Python

Databricks to commercialize Spark and Shark in-memory processing

Shark runs SQL queries in memory for complex analytics and is compatible with Apache Hive. The name “Shark” is supposed to be shorthand for “Hive on Spark”. It appears to compete with Cloudera Impala and with the Hortonworks implementation of Hive.

Apache Spark exposes APIs in Python, Scala, and Java for in-memory processing with very fast reads and writes, and claims to be up to 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark is best considered an alternative to MapReduce, not an alternative to Hadoop.
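To make the in-memory versus disk-based distinction concrete, here is a toy sketch in plain Python. It does not use Spark or Hadoop and says nothing about their real performance. The function names and the "add one per pass" workload are made up for illustration. It only counts how many file reads and writes a multi-pass job needs when every pass goes through disk, compared with keeping the working set in memory.

```python
import json
import os
import tempfile


def disk_based_passes(records, passes):
    """Toy disk-based loop: each pass reads its input from a file
    and writes its output back to the file (hypothetical workload)."""
    io_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage.json")
        # Initial load of the input onto disk.
        with open(path, "w") as f:
            json.dump(records, f)
        io_ops += 1
        for _ in range(passes):
            with open(path) as f:
                data = json.load(f)
            io_ops += 1
            data = [x + 1 for x in data]  # the "work" done in this pass
            with open(path, "w") as f:
                json.dump(data, f)
            io_ops += 1
        # Final read of the result.
        with open(path) as f:
            result = json.load(f)
        io_ops += 1
    return result, io_ops


def in_memory_passes(records, passes):
    """Toy in-memory loop: the working set never leaves memory."""
    data = list(records)
    for _ in range(passes):
        data = [x + 1 for x in data]
    return data, 0


disk_result, disk_io = disk_based_passes([1, 2, 3], passes=3)
mem_result, mem_io = in_memory_passes([1, 2, 3], passes=3)
```

Both functions return the same data. The only difference is the number of trips to disk, which grows with the number of passes in the first version and stays at zero in the second.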

Scala is an interesting language. Companies such as Twitter use it and regard it as both higher performing and easier to write than Java. Some companies that originally developed in Rails or C++ are migrating to Scala rather than to Java.

Source:

Hive can be used to program MapReduce using a subset of SQL

Hive lets you program MapReduce in something that looks like SQL, instead of a procedural language like Java or Python. This is useful when a team of database programmers, as opposed to application programmers, is called upon to program MapReduce.

Using Hive tables requires defining a schema.

Queries written in the SQL-like language (called HiveQL) are converted into one or more MapReduce jobs.
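As a rough picture of what that conversion means, the sketch below walks a simple grouped count through map, shuffle, and reduce phases in plain Python. It is not Hive's actual query planner. The query, the `employees` table, and the function names are all hypothetical, chosen only to show how a GROUP BY lines up with the MapReduce model.

```python
from collections import defaultdict

# Hypothetical HiveQL query this sketch mimics (illustration only):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1


def shuffle_phase(pairs):
    """Group all values that share a key, as the framework would
    do between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


def run_group_by_count(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Made-up sample rows standing in for a Hive table.
employees = [
    {"name": "ana", "dept": "eng"},
    {"name": "bo", "dept": "ops"},
    {"name": "cy", "dept": "eng"},
]

result = run_group_by_count(employees)
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reducer. A database programmer writes only the query, and the equivalent of these three functions is generated for them.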

Hue is a browser-based GUI in which you can do Hive work. You type your query and see the results as a table, and you can export them as a CSV file to open in Excel. ODBC drivers are also available for connecting other tools to Hive.

The Apache page for Hive calls it “a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets.” I’m not sure how the data warehouse piece applies.

Source: