Clients should access an HDFS mount using Fuse for Production use, but the NFS proxy gets you initially deployed faster. Here’s how.
Posted in Fuse, HDFS, NFS
GridGain has a 100% HDFS compatible RAM solution that it claims is 10x faster for IO and network intensive MapReduce processing. I understand the IO, but am not sure why it work help with network intensive operations. It can be used standalone or along with disk based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.
This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.
Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).
The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.
Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
Posted in apache, cloudera, Flume, HDFS, Hue, MapReduce, Solr, Tika, tutorial, Use Case
Tagged cloudera.com, github.com, lucene.apache.org, tika.apache.org
eBay worked with HortonWorks and ScaledRisk to improve Mean Time to Recovery (MTTR). Not only did this require faster recovery time, but also faster detection of failures.
The types of failures considered included the following, but only Node/Region server failures were included in the tests. The HBase tables contained 900 million rows.
- Node/Region server failed while writing
- Node/Region server failed while reading
- Rack failure
- Whole cluster failure
- Machine reboot (due to CPU temperature)
- NIC speed steps down to 100Mb/s from gigabit speeds
The tests had favorable results, with improvements submitted (some implemented, some proposed) into Apache HBase and HDFS.
Cloudera has announced a realtime search engine running on top of HBase and HDFS, enabling natural language keyword searches.
Indices are stored in HDFS and indexing takes place in batches using MapReduce. Realtime indexing happens via Flume and the Lily HBase indexer.
Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.
Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase since Cassandra has no master node, so eliminates a single point of failure (and need for traditional redundancy). In theory, its scalability is strictly linear; doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.
Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large amount of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.
DataStax provides a commercially packaged version of Cassandra.
MongoDB is a good non-HBase alternative to Cassandra.
Posted in apache, Cassandra, Facebook, HBase, HDFS, Hive, mongodb, NoSQL, Oozie, Pig, Relational DB, SQL, Use Case
Tagged arnnet.com.au, datastax.com, dbta.com, wiki.apache.org/cassandra
Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers, with years of experience in relational databases, to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.
HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.
Posted in cloudera, Data Warehouse, Facebook, hadoop, HDFS, Hive, HortonWorks, Impala, MapReduce, Relational DB, SQL, Stinger
Tagged gigaom.com, hortonworks.com
Most of the discussions about NSA data collection are devoid of technical facts. The media just likes to throw around the word “metadata” as if that means nothing to those of us who work all day with nothing other than metadata.
Here’s an article that doesn’t talk down to us, but explains how simple it is to replicate the HDFS nodes from Yahoo and Google data centers.
The problem seems to be that Yahoo and Google encrypt data in motion, but not data at rest. Would Accumulo solve the encryption problem for data at rest? However, Accumulo was originally developed for the NSA, who can likely break the encryption using the processing power of huge Hadoop clusters.