- Hive is a SQL-like layer on top of Hadoop
- Use it when you have some sort of structure to your data.
- You can use JDBC and ODBC drivers to interface with your traditional systems. However, it’s not high performance.
- Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
- Pig is sometimes compared to Hive, in that they are both “languages” that are layered on top of Hadoop. However, Pig is more analogous to a procedural language to write applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.
In 2013 Cloudera acquired a company called Myrrix, which has morphed into project (not yet a product) called Oryx. The system still uses MapReduce, which is not optimal. Before is becomes a product it’ll be rewritten using Spark.
Oryx will enable construction of machine learning models that can process data in real time. Possible use cases are spam filters and recommendation engines (which seems to be its sweet spot).
This competes with Apache Mahout, which processes in batch mode only.
Hadoop works well when a problem can be broken down into discrete and parallel sub-tasks. Some problems must be applied to an entire dataset. She lists some of these: correlation, covariance, principal component analysis, multivariate statistics, generalized linear models.
Western Union has 70 million customers in 200 countries, and processes 29 payment service transactions per second. They are now using Hadoop for real time analytics, which seems surprising as I’d expect a more likely use case to be batch analytics.
Telecom OEM WebNMS discusses their use of Hadoop. In one trial, they stored latency data from 7 million cable modems. Using a Hadoop cluster of 20 nodes, they observers a factor of 1o increase in performance compared to a relational database. In addition, the cost to deploy was a small fraction of the a traditional infrastructure.
Paytronix analyzes data from 8,000 restaurants that adds up to a few tens of terrabytes of data. Not that complex in terms of volume, but there are a lot of data fields and potential reports. They migrated from MS SQL Sever and constantly evolving ETL jobs to Hadoop and MongoDB with a lot of success.
Interesting article with examples of presentations made by large corporations of how they use Hadoop. Most presentations at this conference were about standalone big data.
HSBC created a 360 degree view of the customer, but it was for “agile reporting” not the traditional sort that would be used in a call center or from a data warehouse. There wasn’t, however, a plan on reconciling Hadoop and the data warehouse. They were parallel and standalone.
Many presentations avoided core enterprise concerns such as governance. Some seemed “proud” to bypass this as somehow being exempt from an inflexible model.
1. Big Data Exploration
I don’t agree with the author’s category. He admits that this is a “one size fits all category”. Almost seems like he had four use cases, and decided to make it into five by says adding that you can search, visualize, and understand data from multiple sources to help decision making. Haven’t we been doing this all along, with whatever database tools we’ve had?
2. Enhanced 360 degree view of the customer
From my own experience I had a project where we did this for a call center. However, the key was that we did real time queries to generate the 360 degree view when the call center agent took the call from the customer. The problem there was that in order to produce the view in only a couple of seconds we were very limited in what sort of data we had access to, and how we could analyze this. The Big Data perspective of 360 degrees assumes that the Hadoop repository retains a persistent copy of the data, something that many organizations don’t want. For example, the data will likely not be real time. However, having a copy of the data, and having the time to crunch it in batch mode will give a deeper insight into the customer. Perhaps what’s needed is a hybrid of realtime and batch, sort of like what Twitter is doing with Storm.
3. Security/Intelligence Extension
Searching for past occurrences of fraud, or creating a predictive model of possible future occurrences is very much a batch operation, and Hadoop works well on this since the scope of the analysis is limited only by the depth of the data and the duration of operations upon it.
4. Operations Analysis
I think that the author’s example of the “internet of things” might be a stretch, but commingling and analysis of unstructured and/or semi-structured server and application logs is a perfect use case for Hadoop. This is especially true if the log data streams in, so that the results of your analysis are updated as each batch cycle completes.
5. Data Warehouse Augmentation
Some data can be pre-processed in Hadoop before loading into a traditional data warehouse. Other data can be analyzed without needing to load into a data warehouse at all, where it might just clutter up other queries. Hadoop lets you dump everything in, and sort it out later. Data warehouses are intended to be kept tidy.
There’s limited potential for improvements in throughput of high performance transactional databases. Financial institutions are looking to Hadoop to supplement their application stack, but need to accept these cultural differences.
- Data quality is not 100%. Must use algorithms to refine on an ongoing basis during the transaction. Otherwise look for use cases where close is close enough. For example, Spotify uses Hadoop (HortonWorks) to select song recommendations. You probably wouldn’t use Big Data results to make decisions on which stocks to trade, at what second, at what price to trade them.
- Batch will never be real time. Some users are able to get algorithms to complete in hours rather than days, but even if the hours can be reduced to some number of minutes, the evolution of Hadoop does not seem to be approaching real time.
This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.
Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).
The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.
Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
Posted in apache, cloudera, Flume, HDFS, Hue, MapReduce, Solr, Tika, tutorial, Use Case
Tagged cloudera.com, github.com, lucene.apache.org, tika.apache.org