Telecom OEM WebNMS discusses their use of Hadoop. In one trial, they stored latency data from 7 million cable modems. Using a Hadoop cluster of 20 nodes, they observers a factor of 1o increase in performance compared to a relational database. In addition, the cost to deploy was a small fraction of the a traditional infrastructure.
Paytronix analyzes data from 8,000 restaurants that adds up to a few tens of terrabytes of data. Not that complex in terms of volume, but there are a lot of data fields and potential reports. They migrated from MS SQL Sever and constantly evolving ETL jobs to Hadoop and MongoDB with a lot of success.
Sqoop is a tool for efficient and large loads/extracts between RDMS and Hadoop.
This ecosystem has enough made up words that it’s important to get the commonplace industry standard words correct — “JDBC Driver” and “JDBC Connector”.
- Driver is a JDBC driver.
- Connector could be generic or vendor specific
- Sqoop’s Generic JDBC connector is always available as part of the standard distribution.
- Also includes connectors for MySQL, PostgreSQL, Oracle, MS SQL, IBM DB2, and Neteza. However, the DB vendors (or someone else) might have customized/optimized connectors.
- If the programmer doesn’t select a connector, or if the data source is not known until runtime, Sqoop can try to figure out what the appropriate connector is. Sometimes this is easy, such as if the url to access the data is something like jdbc::myslq//…
Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.
Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase since Cassandra has no master node, so eliminates a single point of failure (and need for traditional redundancy). In theory, its scalability is strictly linear; doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.
Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large amount of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.
DataStax provides a commercially packaged version of Cassandra.
MongoDB is a good non-HBase alternative to Cassandra.
Posted in apache, Cassandra, Facebook, HBase, HDFS, Hive, mongodb, NoSQL, Oozie, Pig, Relational DB, SQL, Use Case
Tagged arnnet.com.au, datastax.com, dbta.com, wiki.apache.org/cassandra
Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers, with years of experience in relational databases, to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.
HortonWorks is pushing for optimization (via project Stinger) of the developer friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translate SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.
Posted in cloudera, Data Warehouse, Facebook, hadoop, HDFS, Hive, HortonWorks, Impala, MapReduce, Relational DB, SQL, Stinger
Tagged gigaom.com, hortonworks.com
- Encode the relationships between objects in tables, and use keys to link the tables together
- Standard query language (with emphasis on standard, applying to all database vendors, versions, implementations, programmers) relies on the relationship encoding and vendor architecture for optimization/efficiency
- Algorithms rely on a single pass execution, using operations such as Joins and Group Bys and Counts.
Big Data World
- Based on linear algebra and probability theory
- Encode objects using a property list
- Data stored as a matrix, similar to relational tables, except that the intersection of multiple matrices does not imply relationships
- Algorithms have iterative solutions with multiple steps each of which store results that are used as input by the next step, which is very inefficient to execute in SQL
- Indices are not needed, since massively scaled hardware will be used to process the entire data set by brute force or by intelligent jobs (on the front side in Map or the back side in Reduce).
Either you structure your data ahead of time so that SQL algorithms will work, or you break down your algorithms in to algebra (MapReduce jobs) in order to process semi-structured data.
Where does this leave systems like Hive, that enable programmers to write something that looks like SQL and is transformed on the backend into MapReduce jobs? Maybe purists don’t like Hive because it’s used by people on the fence between Database and Big Data, instead of those who have fully converted to Big Data?
Systems similar yet different from Hadoop/MapReduce. They claim to be Big Data, but have roots in the database world.
- Twitter’s Storm/Summingbird is event driven (not batch) so can target real time applications
- Spark uses iterative algorithms and in-memory processing with the goal of being a few orders of magnitude faster than MapReduce
|Max Data Size per DB
||interactive & batch
||read & update many times
||read many times — write once
||Define before inital load of data, and difficult to make (and test?) extensive schema changesAlso has structured data, in which field entries are in a defined format such as XML or the type of a db column.
The keys/values are defined at the the time of data insert.
|No schema required, and if schema then it can be changed dynamically without regression impact.Fields may be unstructured (free text, images) or semi-structured (spreadsheet or even this table, in which some description is implied by the row and column headings).
The keys/values are defined at the time of processing.
||Generally designed to be high.Recovery made more difficult due to multiple autonomous components (SAN, database cluster, applications server farm), that must communicate with each other.
||Assumed to be low.Failure is assumed to have occured, so computations occur on multiple redundant nodes whose results are sorted and merged, or are rescheduled.
Tasks generally do not have dependencies on each other in a shared-nothing architecture.
||non-linear (cannot simply add additional cloned servers to the cluster forever)
||linear (can scale by factor of 10, 100, or even 1000 cloned nodes)
||Database performs better if data is normalized.
||Database performs better is data is de-normalized since the data is scattered across multiple nodes. As de-normalized, all information is available within each block as it is read. Example is a webserver log, in which hostnames are specified in full each time.
||Data is typically stored on a SAN, and access speed limited by fibre bandwidth. Additional processing may be done on an application server, in which bandwidth is limited by the datacenter LAN.
||Data is processed by CPU on the same computer which hosts the data on internal drives. Access speed is limited by internal bus bandwidth. This model is called “lack of data motion”.