Email indexing using Cloudera Search

This article from Cloudera presents use cases (such as customer sentiment analysis) and a tutorial on indexing email with Cloudera Search. It covers two methods: near-real-time indexing with Apache Flume, as emails arrive on your mail server, and batch indexing of email archives with MapReduce (specifically, MapReduceIndexerTool). The two methods can be combined, for example if you start with real-time indexing and later decide to add another MIME header field to the index.

Cloudera Search is based on Apache Solr, which brings together components such as Apache Lucene, SolrCloud, Apache Tika, and Solr Cell.

Each email, including its MIME headers, is parsed with the help of Cloudera Morphlines. Flume then pushes the messages into HDFS, while Solr intercepts and indexes the contents of the email fields.
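To make the parsing step concrete, here is a minimal sketch of turning a raw email into a flat set of fields that a search index could store. It uses Python's standard `email` module as a stand-in and is not Morphlines or the tutorial's actual configuration; the function name `email_to_document`, the sample message, and the output field names are all assumptions made for illustration.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message, used only for illustration.
RAW = """\
From: Alice Example <alice@example.com>
To: bob@example.com
Subject: Quarterly report
Date: Mon, 06 Jan 2014 10:15:00 -0800
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The report is attached.
"""


def email_to_document(raw, header_fields=("From", "To", "Subject", "Date")):
    """Parse a raw email and return a flat dict of fields to index."""
    msg = message_from_string(raw)
    # One index field per requested header; missing headers become "".
    doc = {name.lower(): msg.get(name, "") for name in header_fields}
    # Keep the bare address separately so it can be searched exactly.
    doc["from_address"] = parseaddr(doc.get("from", ""))[1]
    # The sample is a single-part message, so the payload is a plain string.
    doc["body"] = msg.get_payload().strip()
    return doc


if __name__ == "__main__":
    for field, value in email_to_document(RAW).items():
        print(f"{field}: {value}")
```

Adding another MIME header to the index then amounts to extending `header_fields` and re-running the parser over the stored messages, which is roughly the situation where the batch method becomes useful.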

Searching and viewing the results can be done using the Solr GUI or Hue’s search application.



Programming MapReduce

MapReduce jobs are often written in Java, but other options are available. Hadoop Streaming is a utility that lets you write MapReduce jobs in languages such as C, C++, Perl, Python, and Bash; any executable that reads from standard input and writes to standard output can act as the mapper or reducer. For example, Python can be used for the mapper and AWK for the reducer.
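As an illustration of the streaming model, here is a minimal word-count sketch in Python. It assumes the usual Hadoop Streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key. The file name `wordcount.py` and the `map`/`reduce` command-line argument are hypothetical choices for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
    stages = {"map": map_lines, "reduce": reduce_lines}
    stage = sys.argv[1] if len(sys.argv) > 1 else ""
    if stage in stages:
        for record in stages[stage](sys.stdin):
            print(record)
```

The job can be checked locally, without a cluster, by piping a text file through `python wordcount.py map`, then `sort`, then `python wordcount.py reduce`. On a cluster, the same script would be passed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options; the location of that jar depends on the distribution.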

Hive can be used to program MapReduce using HiveQL, a SQL-like query language that supports a subset of SQL.

Pig is another option: its high-level procedural language, Pig Latin, was created specifically for MapReduce programming.