Clients should access an HDFS mount using Fuse for Production use, but the NFS proxy gets you initially deployed faster. Here’s how.
Posted in Fuse, HDFS, NFS
Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.
Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.
Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.
Posted in C++, cloudera, Hive, HortonWorks, Impala, Java, MapReduce, performance, Python, Rails, Scala, Shark, Spark, SQL, Twitter
Tagged apache.org, berkeley.edu, databricks.com, gigaom.com, scala-lang.org
There’s limited potential for improvements in throughput of high performance transactional databases. Financial institutions are looking to Hadoop to supplement their application stack, but need to accept these cultural differences.
- Data quality is not 100%. Must use algorithms to refine on an ongoing basis during the transaction. Otherwise look for use cases where close is close enough. For example, Spotify uses Hadoop (HortonWorks) to select song recommendations. You probably wouldn’t use Big Data results to make decisions on which stocks to trade, at what second, at what price to trade them.
- Batch will never be real time. Some users are able to get algorithms to complete in hours rather than days, but even if the hours can be reduced to some number of minutes, the evolution of Hadoop does not seem to be approaching real time.
Sqoop is a tool for efficient and large loads/extracts between RDMS and Hadoop.
This ecosystem has enough made up words that it’s important to get the commonplace industry standard words correct — “JDBC Driver” and “JDBC Connector”.
- Driver is a JDBC driver.
- Connector could be generic or vendor specific
- Sqoop’s Generic JDBC connector is always available as part of the standard distribution.
- Also includes connectors for MySQL, PostgreSQL, Oracle, MS SQL, IBM DB2, and Neteza. However, the DB vendors (or someone else) might have customized/optimized connectors.
- If the programmer doesn’t select a connector, or if the data source is not known until runtime, Sqoop can try to figure out what the appropriate connector is. Sometimes this is easy, such as if the url to access the data is something like jdbc::myslq//…
Hadoop mindset glorifies having as much raw data as possible. Just build more nodes if necessary. However, there is currently a lack of good meta data tools. Where did the data come from? What’s the retention policy? Who has access to read it, delete it?
CTO of Sqrrl thinks this is a result of the Hadoop environment being designed for developers, not for business users.
Cloudera had appeared to be the defacto standard of Hadoop distributions, but Hortonworks has scored big in this deal. Spotify has a 690 node Cloudera cluster that it will be moving to a Hortonworks cluster (undisclosed size). Apparently it’s the new Hive implementation that makes Hortonworks so attractive.
When Spotify launched in 2008 it had a 30 node cluster hosted on Amazon’s AWS, then switched to an on-premises 60 node cluster that grew to 690 nodes. The cluster currently contains 4 petabytes of data which grows by 200 gigabytes per day.
Spotify has a 12 person Hadoop team and uses a Python (not Java) framework for batch processing.
GridGain has a 100% HDFS compatible RAM solution that it claims is 10x faster for IO and network intensive MapReduce processing. I understand the IO, but am not sure why it work help with network intensive operations. It can be used standalone or along with disk based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.
This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.
Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).
The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.
Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
Posted in apache, cloudera, Flume, HDFS, Hue, MapReduce, Solr, Tika, tutorial, Use Case
Tagged cloudera.com, github.com, lucene.apache.org, tika.apache.org