Tag Archives: gigaom.com

Sqrrl co-founder explains how NSA uses Accumulo

At its core, what the NSA is doing is finding anti-patterns. Crunching through huge sets of non-interesting data is the only way to find the interesting data.

Also, the Department of Defense sees the success that NSA is having with Hadoop technologies, and is considering using it (most likely Accumulo) to store large amounts of unstructured and non-schema data.


Databricks to commercialize Spark and Shark in-memory processing

Shark utilizes in-memory SQL queries for complex analytics, and is Apache Hive compatible. The name “Shark” is supposed to be short hand for “Hive on Spark”. This seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language being used by companies such as Twitter as both higher performance and easier to write than Java. Some companies that had originally developed using Rails or C++ are migrating to Scala rather than to Java.


You can have too much data. “How to avoid drowning in surplus information”

Hadoop mindset glorifies having as much raw data as possible. Just build more nodes if necessary. However, there is currently a lack of good meta data tools. Where did the data come from? What’s the retention policy? Who has access to read it, delete it?

CTO of Sqrrl thinks this is a result of the Hadoop environment being designed for developers, not for business users.


Four tips (and a few use cases) to take you beyond the big data hype cycle

  1. Think big: Oregon Health & Services University is using big data to speed up analyis of human genone profiles (approx 1 TB per patient). What if sequencing became commonplace instead of rare, and sequencing needed to be done 5k times per day?
  2. Find relevant data for the business
  3. Be flexible: Iterate, don’t build the final system in the 1st release. The lack of schema definition in Hadoop supports this model. Pecan Street In, in Austin, Texas is on their 3rd iteration of a system that collects smart grid energy data, partly because energy meters have become more advanced and provide additional data points.
  4. Connect the dots: Intel uses manufacturing data to change its design process so that future manufacturing processes will become more efficient.


HortonWorks trying to make Hive faster, contrasting it to Impala

Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers, with years of experience in relational databases, to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.


Twitter creates Hadoop hybrid system to mitigate tradeoffs between batch and stream processing

Storm is an open sourced system (from Twitter) that processes streams of big data in realtime (but without 100% guaranteed accuracy), making it the opposite of Hadoop which processes a repository of big data in batch.

Twitter has needs for both streaming and batch, so created an open sourced hybrid system called Summingbird. It does what Storm does, then uses Hadoop for error correction.

Twitter’s use cases include updating timelines and trending topics in real time, but then making sure that the analytics are accurate.

Yahoo’s contribution to this effort was to enable Storm to be configured using Yarn.