Category Archives: hadoop

Big Data as a Service provider has free developer account

Founders of Qubole built some of the big data technology at Facebook (scaled to 25 petabytes). Their new company offers a hosted Hadoop infrastructure. Interestingly, the small and free accounts take the IT configuration work out of learning Hadoop.


You can have too much data. “How to avoid drowning in surplus information”

The Hadoop mindset glorifies keeping as much raw data as possible; just add more nodes if necessary. However, there is currently a lack of good metadata tools. Where did the data come from? What's the retention policy? Who has access to read it or delete it?

CTO of Sqrrl thinks this is a result of the Hadoop environment being designed for developers, not for business users.


In-memory Hadoop – use it when speed matters

GridGain has a 100% HDFS-compatible RAM solution that it claims is 10x faster for IO- and network-intensive MapReduce processing. I understand the IO gain, but am not sure why it would help with network-intensive operations. It can be used standalone or alongside disk-based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.


Benefits of a single-vendor Hadoop system

The Hadoop ecosystem contains many application components. Configuring them (memory allocation and so on) is challenging, so getting a full distribution from a single vendor, as opposed to downloading it all from Apache sites or from multiple vendors, can be a good idea.

The application components interface with each other via Java APIs. One poorly written piece of custom Java code can result in performance problems that cascade throughout the entire system. You don’t want to write these integrations yourself. Let the vendor provide it.


Hadoop’s part in a Modern Data Architecture

Hadoop must:

  • Integrate with existing infrastructure. You can't expect a green field. Hadoop can certainly replace some existing components, but it will need to augment others even if it is capable of replacing them.
  • Utilize existing staff. Hive isn’t optimal, but there is value in having existing staff who understand existing systems, business needs, and data sets work in the new Hadoop environment.


HortonWorks trying to make Hive faster, contrasting it to Impala

Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers with years of experience in relational databases to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.


JSON and Big Data

JSON is a good fit for NoSQL databases, and for analysis within Hadoop, because it uses key/value pairs. Keeping the same data model throughout an application (from Hadoop, to a NoSQL db, to a web front end that uses JSON) might make sense.
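As a minimal sketch of that "same data model everywhere" idea, here is a round trip using Python's standard json module (the record fields are hypothetical, standing in for whatever your application stores):

```python
import json

# A hypothetical activity record: the same key/value structure could live
# in Hadoop output files, a NoSQL document store, and the web front end.
record = {"user_id": 42, "page": "/home", "duration_ms": 350}

# Serialize for storage or transport...
payload = json.dumps(record)

# ...and parse it back unchanged on the other side, with no schema
# translation layer between the tiers.
restored = json.loads(payload)
assert restored == record
```

The point is that no mapping code is needed between tiers: every layer reads and writes the same keys.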

Article on IBM DeveloperWorks

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL


Synchronizing nodes within a Hadoop cluster

ZooKeeper is a backend service for managing synchronization within a Hadoop cluster. I saw in one article that there are two kinds of people who mess around with ZooKeeper: contributors to the Apache project, and people who are doing something that they shouldn't be doing.


Programming MapReduce

MapReduce is often programmed in Java. However, other options are available. Hadoop Streaming is a utility for writing MapReduce jobs in languages such as C, C++, Perl, Python, and Bash. For example, Python can be used for the Mapper and AWK for the Reducer.
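The Streaming contract is simple: the mapper and reducer each read lines from stdin and write tab-separated key/value lines to stdout, with Hadoop sorting the mapper output by key in between. Here is a sketch of a word-count pair in Python; in a real job these would be two separate scripts passed to the streaming jar via its mapper and reducer options, and the local "shuffle" below is just a sort for illustration:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Hadoop delivers mapper output sorted by key; sum counts per word."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Simulate map -> shuffle (sort by key) -> reduce on two input lines.
    mapped = sorted(mapper(["big data", "big cluster"]))
    for line in reducer(mapped):
        print(line)  # big 2, cluster 1, data 1 (tab-separated)
```

The same two functions work unchanged when each is wrapped to read `sys.stdin` and print its output, which is all Hadoop Streaming requires of them.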

Hive can be used to program MapReduce using a subset of SQL.

Pig is another high-level procedural language created specifically for MapReduce programming.