- Hive is a SQL-like layer on top of Hadoop
- Use it when your data has some sort of structure.
- JDBC and ODBC drivers let you interface with traditional systems, though performance is not high.
- Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
- Pig is sometimes compared to Hive, in that they are both “languages” that are layered on top of Hadoop. However, Pig is more analogous to a procedural language to write applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.
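Hive queries read like ordinary SQL. As a rough, runnable sketch of the kind of summary query Hive executes over Hadoop (not against a real Hive server), here is an equivalent aggregation using Python's built-in sqlite3 as a stand-in; the table and column names are hypothetical.

```python
import sqlite3

# Stand-in for a Hive table; in Hive the rows would live in HDFS and the
# query would be compiled into MapReduce jobs behind the scenes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "US", 3), ("u2", "US", 5), ("u3", "BR", 2)],
)

# A daily-summary style aggregation, the sort of thing the Facebook and
# Netflix use cases above describe.
rows = conn.execute(
    "SELECT country, SUM(views) FROM page_views GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('BR', 2), ('US', 8)]
```

The same GROUP BY would be several lines of explicit LOAD/GROUP/FOREACH steps in Pig's procedural style, which is the contrast the note above is drawing.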
At its core, what the NSA is doing is finding anti-patterns. Crunching through huge sets of non-interesting data is the only way to find the interesting data.
Also, the Department of Defense sees the success the NSA is having with Hadoop technologies, and is considering using them (most likely Accumulo) to store large amounts of unstructured, schema-less data.
In 2013 Cloudera acquired a company called Myrrix, which has morphed into a project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal; before it becomes a product it’ll be rewritten using Spark.
Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).
This competes with Apache Mahout, which processes in batch mode only.
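To make the recommendation-engine use case concrete, here is a toy item-based recommender using cosine similarity between users. This is a minimal sketch of the general technique, not the Oryx or Mahout API; all names and ratings are invented.

```python
from math import sqrt

# Hypothetical user -> item ratings.
ratings = {
    "alice": {"book_a": 5, "book_b": 3},
    "bob":   {"book_a": 4, "book_b": 2, "book_c": 5},
    "carol": {"book_b": 4, "book_c": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Suggest items the most similar other user rated but `user` hasn't."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(i for i in ratings[nearest] if i not in ratings[user])

print(recommend("alice"))  # ['book_c']
```

A production system like Oryx would train such a model continuously on streaming data rather than over an in-memory dict, which is the real-time distinction versus Mahout's batch mode.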
Hadoop works well when a problem can be broken down into discrete, parallel sub-tasks. Some problems, however, must be applied to an entire dataset; she lists several: correlation, covariance, principal component analysis, multivariate statistics, and generalized linear models.
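Word counting is the canonical example of a problem that decomposes cleanly, because partial counts from independent splits combine by simple addition. A minimal MapReduce-style sketch in plain Python (no Hadoop involved):

```python
from collections import Counter
from itertools import chain

# Each "split" could be handled by a different node; the map phase needs
# no data outside its own split, which is what makes this parallelizable.
splits = [
    "hadoop splits work across nodes",
    "nodes count words in parallel",
]

def map_phase(text):
    """Emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    """Sum counts per word (in Hadoop, the shuffle groups pairs by key)."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

counts = reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
print(counts["nodes"])  # 2
```

The whole-dataset problems listed above lack this property: the answer for any one record depends on every other record, so the work can't be confined to independent splits in the same way.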
I haven’t tried this myself (I don’t have a Raspberry Pi, only an Arduino), and even if it can be installed I’m not sure what the runtime could accomplish, but this guy has published a short list of instructions for installing Hadoop on a Raspberry Pi.
Western Union has 70 million customers in 200 countries, and processes 29 payment-service transactions per second. It is now using Hadoop for real-time analytics, which is surprising, as I’d expect batch analytics to be the more likely use case.
Hadoop is generally assumed to run on clusters of generic commodity hardware. Intel has just released a customized/optimized distribution that it claims is up to 30x faster when run on the Xeon E7 v2 family of processors, which is hardly generic or commodity.
Not sure how well this will work, or whether the use cases support it. Rather than optimize Hadoop for the use cases it was designed for, Teradata is merging Hadoop into its legacy core data warehouse. Will Hadoop add value, or make the warehouse overly complex?
Hydra is not built on top of Hadoop, but functions similarly to Summingbird, Storm, and Spark.
Data can stream into it, and analytics can be run in real time, rather than only in batch.
AddThis is the company that originally developed Hydra, which has now been open sourced through Apache. AddThis runs six Hydra clusters, one of which comprises 156 servers and processes 3.5 billion transactions per day.
The advantage is that schemas don’t need to be created in order to search for patterns, since Hadoop is leveraged. That makes sense: by creating a schema the user is already making assumptions about where the patterns exist. A schema-less analysis makes it possible to find unexpected anomalies within known patterns, and to find entirely new patterns.
Splunk includes visualization components.
Splunk’s Director of Big Data Marketing, Brett Sheppard, says this is well suited to the Internet of Things (IoT), which can leverage visualization tools that report the results of searching for anomalies in large amounts of machine-generated data.
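The schema-less idea can be illustrated with a toy sketch: treat every whitespace token in raw log lines as a potential feature, with no fields declared up front, and flag the rare ones. This is a minimal illustration of the approach, not Splunk's implementation; the log lines are invented.

```python
from collections import Counter

# Raw, schema-less machine-generated lines; no schema is defined, so no
# assumption is made in advance about which field matters.
logs = [
    "host1 GET /index 200",
    "host2 GET /index 200",
    "host1 GET /login 200",
    "host3 GET /index 200",
    "host2 GET /admin 500",
]

# Count every token, then flag tokens seen only once -- rare values are
# candidate anomalies, whatever position they happen to occupy.
freq = Counter(tok for line in logs for tok in line.split())
anomalies = sorted(tok for tok, n in freq.items() if n == 1)
print(anomalies)  # ['/admin', '/login', '500', 'host3']
```

Note the lone 500 status surfaces without anyone having declared a "status code" column, which is the point of skipping the schema.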