Category Archives: Data Warehouse

Facebook compresses its 300 petabyte Hadoop Hive data warehouse layer by a factor of 8x

Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of adding more storage and more servers reaching a logistical limit, Facebook’s engineers have increased their data compression from a previous 5x (using RCFile) to 8x (using a custom modification of the Hortonworks ORCFile). The Hortonworks ORCFile is generally faster than RCFile on reads but slower on writes; Facebook’s custom ORCFile was consistently fastest on both reads and writes and also achieved the best compression.
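
For scale: going from 5x to 8x compression means the same raw bytes occupy only 5/8 of their previous footprint, roughly a 37.5% saving, which matters at hundreds of petabytes. Facebook’s ORCFile modifications are internal, but here is a minimal HiveQL sketch of the stock conversion they built on, assuming hypothetical clicks_rcfile and clicks_orc tables:

    -- Hypothetical HiveQL sketch: rewriting an RCFile table as ORC.
    -- Table and column names are illustrative; Facebook's custom ORCFile
    -- changes are internal and not shown here.
    CREATE TABLE clicks_orc (
      user_id BIGINT,
      url     STRING,
      ts      TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");  -- stock ORC compression codec

    -- Rewrite the existing RCFile data into the ORC layout; the columnar
    -- format plus the codec is where the higher compression ratio comes from.
    INSERT OVERWRITE TABLE clicks_orc
    SELECT user_id, url, ts FROM clicks_rcfile;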

Source:

Teradata to integrate Hadoop into its legacy platform

I’m not sure how well this will work, or whether the use cases support it. Rather than optimizing Hadoop for the use cases it was designed for, Teradata is merging Hadoop into its legacy core data warehouse. Will Hadoop add value, or just make the platform overly complex?

Source:

Enterprise use cases for Hadoop

An interesting article with examples of presentations by large corporations about how they use Hadoop. Most presentations at this conference were about standalone big data.

HSBC created a 360 degree view of the customer, but it was for “agile reporting”, not the traditional sort that would be used in a call center or served from a data warehouse. There was, however, no plan for reconciling Hadoop with the data warehouse; the two were parallel and standalone.

Many presentations avoided core enterprise concerns such as governance. Some presenters even seemed “proud” to bypass it, as if they were somehow exempt from an inflexible model.

Source:

Top 5 Big Data Use Cases

1. Big Data Exploration

I don’t agree with the author’s category, and he admits that it is a “one size fits all” category. It almost seems as if he had four use cases and decided to make it five by adding that you can search, visualize, and understand data from multiple sources to help decision making. Haven’t we been doing that all along with whatever database tools we’ve had?

2. Enhanced 360 degree view of the customer

From my own experience, I had a project where we did this for a call center. The key was that we ran real-time queries to generate the 360 degree view at the moment the call center agent took the customer’s call. The problem was that, in order to produce the view within a couple of seconds, we were very limited in what data we could access and how we could analyze it. The Big Data version of the 360 degree view assumes that the Hadoop repository retains a persistent copy of the data, something many organizations don’t want, and that copy will likely not be real time. On the other hand, having a copy of the data and the time to crunch it in batch mode gives a deeper insight into the customer. Perhaps what’s needed is a hybrid of real-time and batch, rather like what Twitter is doing with Storm.
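
As a sketch of the batch half of that hybrid: a nightly Hive job could precompute the 360 degree summary so the call center lookup becomes a cheap key fetch instead of a live multi-source join. All table and column names below are hypothetical:

    -- Hypothetical nightly batch job (HiveQL): precompute a per-customer
    -- summary so the call center lookup is a cheap key fetch rather than a
    -- live multi-source join. All names are illustrative.
    INSERT OVERWRITE TABLE customer_360
    SELECT
      c.customer_id,
      c.name,
      o.lifetime_orders,
      o.lifetime_spend,
      o.last_order_date,
      t.open_tickets
    FROM customers c
    LEFT JOIN (
      SELECT customer_id,
             COUNT(*)        AS lifetime_orders,
             SUM(amount)     AS lifetime_spend,
             MAX(order_date) AS last_order_date
      FROM orders
      GROUP BY customer_id
    ) o ON o.customer_id = c.customer_id
    LEFT JOIN (
      SELECT customer_id, COUNT(*) AS open_tickets
      FROM support_tickets
      WHERE status = 'OPEN'
      GROUP BY customer_id
    ) t ON t.customer_id = c.customer_id;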

3. Security/Intelligence Extension

Searching for past occurrences of fraud, or building a predictive model of possible future occurrences, is very much a batch operation, and Hadoop works well here since the scope of the analysis is limited only by the depth of the data and the duration of the operations run against it.
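
A minimal HiveQL sketch of what such a batch scan might look like, assuming a hypothetical transactions table and a deliberately crude threshold rather than a real fraud model:

    -- Hypothetical HiveQL fraud scan over the full transaction history.
    -- The schema and the threshold are illustrative, not a real fraud model.
    SELECT
      card_id,
      TO_DATE(txn_ts) AS txn_day,
      COUNT(*)        AS txns,
      SUM(amount)     AS total_amount
    FROM transactions                  -- multi-year history kept in Hadoop
    GROUP BY card_id, TO_DATE(txn_ts)
    HAVING COUNT(*) > 50               -- crude anomaly threshold
    ORDER BY txns DESC;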

4. Operations Analysis

I think the author’s example of the “internet of things” might be a stretch, but commingling and analyzing unstructured and/or semi-structured server and application logs is a perfect use case for Hadoop. This is especially true if the log data streams in, so that the results of your analysis are refreshed as each batch cycle completes.
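
A minimal HiveQL sketch of that pattern, with hypothetical paths and an hourly batch cycle: an external table sits over the raw logs, each landed batch is registered as a partition, and the rollup is re-run as cycles complete:

    -- Hypothetical HiveQL: an external table over raw, semi-structured logs,
    -- partitioned by the batch cycle that lands them in HDFS. Paths and
    -- names are illustrative.
    CREATE EXTERNAL TABLE app_logs (
      log_line STRING
    )
    PARTITIONED BY (dt STRING, hr STRING)
    LOCATION '/data/raw/app_logs';

    -- Register the newest batch as it arrives...
    ALTER TABLE app_logs ADD PARTITION (dt = '2014-04-14', hr = '09');

    -- ...then refresh a simple error rollup over everything landed so far.
    SELECT dt, hr, COUNT(*) AS errors
    FROM app_logs
    WHERE log_line LIKE '%ERROR%'
    GROUP BY dt, hr;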

5. Data Warehouse Augmentation

Some data can be pre-processed in Hadoop before being loaded into a traditional data warehouse. Other data can be analyzed without being loaded into a data warehouse at all, where it might just clutter up other queries. Hadoop lets you dump everything in and sort it out later; data warehouses are meant to be kept tidy.
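
A minimal HiveQL sketch of the pre-processing flavor of augmentation, with hypothetical table names: Hadoop boils the raw events down to the daily aggregate the warehouse actually needs, while the raw detail never leaves Hadoop:

    -- Hypothetical HiveQL pre-processing step: boil raw events down to the
    -- daily aggregate the warehouse actually needs. Names are illustrative.
    INSERT OVERWRITE DIRECTORY '/staging/dw/daily_page_views'
    SELECT
      TO_DATE(event_ts)       AS event_day,
      page_id,
      COUNT(*)                AS views,
      COUNT(DISTINCT user_id) AS unique_visitors
    FROM raw_events
    WHERE event_type = 'page_view'
    GROUP BY TO_DATE(event_ts), page_id;
    -- The warehouse ETL bulk-loads the staging directory; the raw_events
    -- detail stays in Hadoop and never clutters the warehouse.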

Source:

SAP HANA combines database, data processing, and application platform capabilities in-memory

  • Enables OLTP and OLAP data processing within a single in-memory column-based data store (see the sketch after this list)
  • Eliminates data redundancy and latency
  • Provides real-time analytics:
    • Operational Reporting
    • Data Warehousing
    • Predictive and Text Analytics on up to 80 terabytes of data, integrated with Hadoop
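
A minimal SQL sketch of the first bullet, using an illustrative schema: the same in-memory column table serves the transactional write path and the analytic query, with no ETL hop in between (CREATE COLUMN TABLE is HANA’s syntax for its column store):

    -- Illustrative SAP HANA SQL: one in-memory column table serves both the
    -- transactional write and the analytic read, with no copy in between.
    CREATE COLUMN TABLE sales (
      order_id   INTEGER PRIMARY KEY,
      region     NVARCHAR(20),
      amount     DECIMAL(12,2),
      order_date DATE
    );

    -- OLTP: the operational write path.
    INSERT INTO sales VALUES (1001, 'EMEA', 2500.00, '2014-04-14');

    -- OLAP: real-time reporting against the same store, no ETL latency.
    SELECT region, SUM(amount) AS revenue
    FROM sales
    GROUP BY region;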

Source:

Hortonworks trying to make Hive faster, contrasting it with Impala

Hive was invented by Facebook as a data warehouse layer on top of Hadoop and has been adopted by Hortonworks. The benefit of Hive is that it lets programmers with years of relational database experience write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

Hortonworks is pushing to optimize the developer-friendly toolset provided by Hive (via its Stinger initiative). Cloudera has abandoned Hive in favor of Impala: rather than translating SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.
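
To make the contrast concrete, here is an illustrative query (hypothetical web_logs table) as it would be run against each engine; the SQL is the same, only the execution path underneath differs:

    -- In the Hive shell: the query is compiled into MapReduce stages,
    -- which EXPLAIN makes visible as map and reduce operator trees.
    EXPLAIN
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url;

    -- In impala-shell: the identical statement is executed by Impala's
    -- own long-running parallel daemons reading HDFS directly, with no
    -- MapReduce jobs launched at all.
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url;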

Sources: