Category Archives: Data Warehouse

Facebook compresses its 300 petabyte Hadoop Hive data warehouse layer by a factor of 8x

Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of adding more storage and more servers reaching a logistical limit, Facebook’s engineers have increased their data compression from a previous 5x (using RCFile) to 8x (using a custom modification of the Hortonworks ORCFile). The Hortonworks ORCFile is generally faster than RCFile on reads but slower on writes; Facebook’s custom ORCFile was consistently fastest on both reads and writes and also achieved the best compression.
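
For scale: going from 5x to 8x compression means the same raw bytes occupy only 5/8 of their previous footprint, roughly a 37.5% saving, which matters at hundreds of petabytes. Facebook’s ORCFile modifications are internal, but here is a minimal HiveQL sketch of the stock conversion they built on, assuming hypothetical clicks_rcfile and clicks_orc tables:

    -- Hypothetical HiveQL sketch: rewriting an RCFile table as ORC.
    -- Table and column names are illustrative; Facebook's custom ORCFile
    -- changes are internal and not shown here.
    CREATE TABLE clicks_orc (
      user_id BIGINT,
      url     STRING,
      ts      TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");  -- stock ORC compression codec

    -- Rewrite the existing RCFile data into the ORC layout; the columnar
    -- format plus the codec is where the higher compression ratio comes from.
    INSERT OVERWRITE TABLE clicks_orc
    SELECT user_id, url, ts FROM clicks_rcfile;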

Source:

Teradata to integrate Hadoop into its legacy platform

I’m not sure how well this will work, or whether the use cases support it. Rather than optimizing Hadoop for the use cases it was designed for, Teradata is merging Hadoop into its legacy core data warehouse. Will Hadoop add value, or just make the platform overly complex?

Source:

Enterprise use cases for Hadoop

An interesting article with examples of presentations by large corporations about how they use Hadoop. Most presentations at this conference were about standalone big data.

HSBC created a 360 degree view of the customer, but it was for “agile reporting”, not the traditional sort that would be used in a call center or served from a data warehouse. There was, however, no plan for reconciling Hadoop with the data warehouse; the two were parallel and standalone.

Many presentations avoided core enterprise concerns such as governance. Some presenters even seemed “proud” to bypass it, as if they were somehow exempt from an inflexible model.

Source:

Top 5 Big Data Use Cases

1. Big Data Exploration

I don’t agree with the author’s category, and he admits that it is a “one size fits all” category. It almost seems as if he had four use cases and decided to make it five by adding that you can search, visualize, and understand data from multiple sources to help decision making. Haven’t we been doing that all along with whatever database tools we’ve had?

2. Enhanced 360 degree view of the customer

From my own experience, I had a project where we did this for a call center. The key was that we ran real-time queries to generate the 360 degree view at the moment the call center agent took the customer’s call. The problem was that, in order to produce the view within a couple of seconds, we were very limited in what data we could access and how we could analyze it. The Big Data version of the 360 degree view assumes that the Hadoop repository retains a persistent copy of the data, something many organizations don’t want, and that copy will likely not be real time. On the other hand, having a copy of the data and the time to crunch it in batch mode gives a deeper insight into the customer. Perhaps what’s needed is a hybrid of real-time and batch, rather like what Twitter is doing with Storm.
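
As a sketch of the batch half of that hybrid: a nightly Hive job could precompute the 360 degree summary so the call center lookup becomes a cheap key fetch instead of a live multi-source join. All table and column names below are hypothetical:

    -- Hypothetical nightly batch job (HiveQL): precompute a per-customer
    -- summary so the call center lookup is a cheap key fetch rather than a
    -- live multi-source join. All names are illustrative.
    INSERT OVERWRITE TABLE customer_360
    SELECT
      c.customer_id,
      c.name,
      o.lifetime_orders,
      o.lifetime_spend,
      o.last_order_date,
      t.open_tickets
    FROM customers c
    LEFT JOIN (
      SELECT customer_id,
             COUNT(*)        AS lifetime_orders,
             SUM(amount)     AS lifetime_spend,
             MAX(order_date) AS last_order_date
      FROM orders
      GROUP BY customer_id
    ) o ON o.customer_id = c.customer_id
    LEFT JOIN (
      SELECT customer_id, COUNT(*) AS open_tickets
      FROM support_tickets
      WHERE status = 'OPEN'
      GROUP BY customer_id
    ) t ON t.customer_id = c.customer_id;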

3. Security/Intelligence Extension

Searching for past occurrences of fraud, or building a predictive model of possible future occurrences, is very much a batch operation, and Hadoop works well here since the scope of the analysis is limited only by the depth of the data and the duration of the operations run against it.
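
A minimal HiveQL sketch of what such a batch scan might look like, assuming a hypothetical transactions table and a deliberately crude threshold rather than a real fraud model:

    -- Hypothetical HiveQL fraud scan over the full transaction history.
    -- The schema and the threshold are illustrative, not a real fraud model.
    SELECT
      card_id,
      TO_DATE(txn_ts) AS txn_day,
      COUNT(*)        AS txns,
      SUM(amount)     AS total_amount
    FROM transactions                  -- multi-year history kept in Hadoop
    GROUP BY card_id, TO_DATE(txn_ts)
    HAVING COUNT(*) > 50               -- crude anomaly threshold
    ORDER BY txns DESC;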

4. Operations Analysis

I think the author’s example of the “internet of things” might be a stretch, but commingling and analyzing unstructured and/or semi-structured server and application logs is a perfect use case for Hadoop. This is especially true if the log data streams in, so that the results of your analysis are refreshed as each batch cycle completes.
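
A minimal HiveQL sketch of that pattern, with hypothetical paths and an hourly batch cycle: an external table sits over the raw logs, each landed batch is registered as a partition, and the rollup is re-run as cycles complete:

    -- Hypothetical HiveQL: an external table over raw, semi-structured logs,
    -- partitioned by the batch cycle that lands them in HDFS. Paths and
    -- names are illustrative.
    CREATE EXTERNAL TABLE app_logs (
      log_line STRING
    )
    PARTITIONED BY (dt STRING, hr STRING)
    LOCATION '/data/raw/app_logs';

    -- Register the newest batch as it arrives...
    ALTER TABLE app_logs ADD PARTITION (dt = '2014-04-14', hr = '09');

    -- ...then refresh a simple error rollup over everything landed so far.
    SELECT dt, hr, COUNT(*) AS errors
    FROM app_logs
    WHERE log_line LIKE '%ERROR%'
    GROUP BY dt, hr;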

5. Data Warehouse Augmentation

Some data can be pre-processed in Hadoop before being loaded into a traditional data warehouse. Other data can be analyzed without being loaded into a data warehouse at all, where it might just clutter up other queries. Hadoop lets you dump everything in and sort it out later; data warehouses are meant to be kept tidy.
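
A minimal HiveQL sketch of the pre-processing flavor of augmentation, with hypothetical table names: Hadoop boils the raw events down to the daily aggregate the warehouse actually needs, while the raw detail never leaves Hadoop:

    -- Hypothetical HiveQL pre-processing step: boil raw events down to the
    -- daily aggregate the warehouse actually needs. Names are illustrative.
    INSERT OVERWRITE DIRECTORY '/staging/dw/daily_page_views'
    SELECT
      TO_DATE(event_ts)       AS event_day,
      page_id,
      COUNT(*)                AS views,
      COUNT(DISTINCT user_id) AS unique_visitors
    FROM raw_events
    WHERE event_type = 'page_view'
    GROUP BY TO_DATE(event_ts), page_id;
    -- The warehouse ETL bulk-loads the staging directory; the raw_events
    -- detail stays in Hadoop and never clutters the warehouse.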

Source:

SAP HANA combines database, data processing, and application platform capabilities in-memory

  • Enables OLTP and OLAP data processing within a single in-memory column-based data store (see the sketch after this list)
  • Eliminates data redundancy and latency
  • Provides real-time analytics:
    • Operational Reporting
    • Data Warehousing
    • Predictive and Text Analytics on up to 80 terabytes of data, integrated with Hadoop
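
A minimal SQL sketch of the first bullet, using an illustrative schema: the same in-memory column table serves the transactional write path and the analytic query, with no ETL hop in between (CREATE COLUMN TABLE is HANA’s syntax for its column store):

    -- Illustrative SAP HANA SQL: one in-memory column table serves both the
    -- transactional write and the analytic read, with no copy in between.
    CREATE COLUMN TABLE sales (
      order_id   INTEGER PRIMARY KEY,
      region     NVARCHAR(20),
      amount     DECIMAL(12,2),
      order_date DATE
    );

    -- OLTP: the operational write path.
    INSERT INTO sales VALUES (1001, 'EMEA', 2500.00, '2014-04-14');

    -- OLAP: real-time reporting against the same store, no ETL latency.
    SELECT region, SUM(amount) AS revenue
    FROM sales
    GROUP BY region;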

Source:

Hortonworks trying to make Hive faster, contrasting it with Impala

Hive was invented by Facebook as a data warehouse layer on top of Hadoop and has been adopted by Hortonworks. The benefit of Hive is that it lets programmers with years of relational database experience write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

Hortonworks is pushing to optimize the developer-friendly toolset provided by Hive (via its Stinger initiative). Cloudera has abandoned Hive in favor of Impala: rather than translating SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.
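
To make the contrast concrete, here is an illustrative query (hypothetical web_logs table) as it would be run against each engine; the SQL is the same, only the execution path underneath differs:

    -- In the Hive shell: the query is compiled into MapReduce stages,
    -- which EXPLAIN makes visible as map and reduce operator trees.
    EXPLAIN
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url;

    -- In impala-shell: the identical statement is executed by Impala's
    -- own long-running parallel daemons reading HDFS directly, with no
    -- MapReduce jobs launched at all.
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url;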

Sources: