Category Archives: Storm

Hydra is a non-Hadoop database for realtime analysis of dynamic data

Hydra is not built on top of Hadoop, but functions similarly to Summingbird, Storm, and Spark.

Data can stream into it, and analytics can be run in real time, rather than only in batch.

AddThis is the company that originally developed Hydra, which is now open sourced under the Apache License. AddThis runs six Hydra clusters, one of which consists of 156 servers and processes 3.5 billion transactions per day.

Sources:


Top 5 Big Data Use Cases

1. Big Data Exploration

I don’t agree with the author’s category. He admits that this is a “one size fits all category”. It almost seems like he had four use cases, and decided to make it five by adding that you can search, visualize, and understand data from multiple sources to help decision making. Haven’t we been doing this all along, with whatever database tools we’ve had?

2. Enhanced 360 degree view of the customer

In my own experience, I had a project where we did this for a call center. However, the key was that we ran real time queries to generate the 360 degree view when the call center agent took the call from the customer. The problem there was that in order to produce the view in only a couple of seconds we were very limited in what sort of data we had access to, and how we could analyze it. The Big Data perspective of 360 degrees assumes that the Hadoop repository retains a persistent copy of the data, something that many organizations don’t want. That copy will also likely not be real time. However, having a copy of the data, and having the time to crunch it in batch mode, will give a deeper insight into the customer. Perhaps what’s needed is a hybrid of realtime and batch, sort of like what Twitter is doing with Storm.

3. Security/Intelligence Extension

Searching for past occurrences of fraud, or creating a predictive model of possible future occurrences, is very much a batch operation, and Hadoop works well for this since the scope of the analysis is limited only by the depth of the data and the duration of operations upon it.

4. Operations Analysis

I think that the author’s example of the “internet of things” might be a stretch, but commingling and analyzing unstructured and/or semi-structured server and application logs is a perfect use case for Hadoop. This is especially true if the log data streams in, so that the results of your analysis are updated as each batch cycle completes.

5. Data Warehouse Augmentation

Some data can be pre-processed in Hadoop before loading into a traditional data warehouse. Other data can be analyzed without needing to load into a data warehouse at all, where it might just clutter up other queries. Hadoop lets you dump everything in, and sort it out later. Data warehouses are intended to be kept tidy.

Source:

Twitter creates Hadoop hybrid system to mitigate tradeoffs between batch and stream processing

Storm is an open sourced system (from Twitter) that processes streams of big data in realtime (but without 100% guaranteed accuracy), making it the opposite of Hadoop, which processes a repository of big data in batch.

Twitter needs both streaming and batch, so it created an open sourced hybrid system called Summingbird. It does what Storm does, then uses Hadoop for error correction.
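To make the hybrid idea concrete, here is a minimal toy sketch in Python. It is not Summingbird's actual API (Summingbird is a Scala library); the class and method names are hypothetical, and the "lost" realtime event is simulated with a flag. It only shows the general pattern: a fast but possibly lossy realtime count, later replaced by an exact count recomputed in batch from the full log.

```python
from collections import Counter


class HybridCounter:
    """Toy hybrid of a streaming layer and a batch layer (hypothetical names)."""

    def __init__(self):
        self.log = []                      # full event log, the source of truth
        self.batch_counts = Counter()      # exact counts as of the last batch run
        self.realtime_counts = Counter()   # possibly lossy counts since that run

    def on_event(self, key, delivered=True):
        # Every event lands in the log, but the streaming layer may miss some.
        self.log.append(key)
        if delivered:
            self.realtime_counts[key] += 1

    def run_batch(self):
        # The batch job recomputes exact counts from the log, which corrects
        # anything the streaming layer dropped, then resets the realtime view.
        self.batch_counts = Counter(self.log)
        self.realtime_counts.clear()

    def query(self, key):
        # Answer = exact batch result + realtime increments since the batch.
        return self.batch_counts[key] + self.realtime_counts[key]


counter = HybridCounter()
counter.on_event("#storm")
counter.on_event("#storm", delivered=False)  # simulated loss in the stream
before = counter.query("#storm")             # realtime view undercounts
counter.run_batch()
after = counter.query("#storm")              # batch run restores the exact count
```

Queries stay fast because they never scan the log; accuracy catches up each time the batch job completes.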

Twitter’s use cases include updating timelines and trending topics in real time, but then making sure that the analytics are accurate.

Yahoo’s contribution to this effort was to enable Storm to be configured using Yarn.

Source:

Using Yarn to monitor resources and provision capacity in order to run other applications alongside MapReduce

Hadoop 2.0 enables clusters to grow as large as 4000 nodes within deployments that contain multiple clusters. I think that companies like Google and Facebook each run tens of thousands of nodes.

Using Yarn, developers can run additional applications within the cluster by monitoring what the applications need, and then creating CPU/RAM containers within the cluster (and across clusters?) to run them.
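As a rough illustration of what "creating CPU/RAM containers" means, here is a toy first-fit allocator in Python. It is only a sketch under my own assumptions: the `Node` class and `allocate` function are hypothetical and are not part of Yarn, whose real schedulers are considerably more sophisticated.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A cluster node with its remaining capacity (hypothetical model)."""
    name: str
    free_vcores: int
    free_mem_mb: int


def allocate(nodes, vcores, mem_mb):
    """First-fit: reserve a container on the first node with enough room.

    Returns the name of the hosting node, or None if no node can fit it.
    """
    for node in nodes:
        if node.free_vcores >= vcores and node.free_mem_mb >= mem_mb:
            node.free_vcores -= vcores
            node.free_mem_mb -= mem_mb
            return node.name
    return None


cluster = [Node("n1", 2, 4096), Node("n2", 8, 16384)]
first = allocate(cluster, vcores=4, mem_mb=8192)   # only n2 has 4 free vcores
second = allocate(cluster, vcores=2, mem_mb=4096)  # n1 fits this one
third = allocate(cluster, vcores=8, mem_mb=1024)   # nothing has 8 vcores left
```

The point is that the application states what it needs, and the cluster carves that out of whatever capacity is left, rather than the developer picking a machine.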

There’s speculation that eventually Yarn could provide a PaaS using Hadoop in order to compete with VMware’s Cloud Foundry. I suppose that while with VMware you need to first think in terms of virtualizing hardware components and an operating system, Yarn jumps past that to provide an environment that’s abstracted for a specific application.

Source: