Category Archives: SQL

Interesting use case about migrating away from SQL to Hadoop and NoSQL

Posted on December 7, 2013 | Comments Off

Paytronix analyzes data from 8,000 restaurants, which adds up to a few tens of terabytes. That is not much in terms of volume, but there are many data fields and potential reports. They migrated from MS SQL Server and constantly evolving ETL jobs to Hadoop and MongoDB with a lot of success.

source:

http://www.informationweek.com/software/information-management/making-the-case-for-hadoop-variety-not-volume/d/d-id/1112894

Posted in database, mongodb, NoSQL, Relational DB, SQL, Use Case

Summary of Teradata’s big data approach

Posted on December 4, 2013 | 1 comment

Teradata Aster 6 platform
Includes graph analysis engine (visualization), in addition to traditional rows/columns.
Enables execution of SQL across multiple NoSQL repositories
Integrates with multiple 3rd parties for solutions such as analytical workflow (Alteryx), advanced analytics algorithms (Fuzzy Logix).
Cloud services at comparable cost to on-premises

Source

http://www.information-management.com/blogs/teradata-establishes-trust-in-big-data-technology-10025111-1.html

Posted in cloud, NoSQL, SQL, Teradata, visualization

Tagged information-management.com

Databricks to commercialize Spark and Shark in-memory processing

Posted on November 15, 2013 | Comments Off

Shark runs SQL queries in memory for complex analytics, and is compatible with Apache Hive. The name “Shark” is supposed to be shorthand for “Hive on Spark”. It appears to compete with Cloudera Impala and the Hortonworks implementation of Hive.

Apache Spark utilizes APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered as an alternative to MapReduce, not an alternative to Hadoop.

Scala is an interesting language used by companies such as Twitter, which regard it as both higher performing and easier to write than Java. Some companies that originally developed in Rails or C++ are migrating to Scala rather than to Java.

Posted in C++, cloudera, Hive, HortonWorks, Impala, Java, MapReduce, performance, Python, Rails, Scala, Shark, Spark, SQL, Twitter

Tagged apache.org, berkeley.edu, databricks.com, gigaom.com, scala-lang.org

10 Key/Value Store, Distributed, Open Source Databases

Posted on November 4, 2013 | Comments Off

Riak

HTTP API
Master-less, so remains operational even if multiple nodes fail
Near linear scalability
Architecture is the same for both large and small clusters
Key/value model, flat namespace, can store anything

Redis

Key/value. Can store data types such as lists, sets, sorted sets, and hashes, and perform operations on them such as set intersection and incrementing a value in a hash.
In-memory dataset
Easy to set up, master/slave replication
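
The set and hash operations mentioned above can be illustrated with plain Python data structures. This is only a sketch of the semantics, not the Redis client API: the key names are made up, and the helper functions (sadd, sinter, hincrby) are hypothetical stand-ins that simply borrow the names of the corresponding Redis commands.

```python
# In-memory stand-in for a Redis-style key/value store (illustrative only).
store = {}

def sadd(key, *members):
    """Add members to the set stored at key, creating the set if needed."""
    store.setdefault(key, set()).update(members)

def sinter(*keys):
    """Return the intersection of the sets stored at the given keys."""
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()

def hincrby(key, field, amount=1):
    """Increment an integer field inside the hash stored at key."""
    h = store.setdefault(key, {})
    h[field] = h.get(field, 0) + amount
    return h[field]

sadd("tags:post1", "sql", "nosql", "hadoop")
sadd("tags:post2", "nosql", "redis")
common = sinter("tags:post1", "tags:post2")   # tags shared by both posts

hincrby("stats:post1", "views")               # counter starts at 0
views = hincrby("stats:post1", "views", 5)    # then grows by 5
```

The point of the richer value types is that the intersection and the increment happen inside the store, so the client does not have to read, modify, and write back whole values.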

Hibari

Very simple data model with 5 attributes: keys, values, timestamps, expiry date, flags for metadata
Chain replication across nodes that are geographically dispersed. No single point of failure
Excellent performance for large batches (~200k) of read/write operations
Runs on commodity hardware or blades. Does not require SAN

Hypertable

High performance, massively scalable, modeled after Google’s Bigtable
Runs on top of a distributed file system such as Apache Hadoop DFS (HDFS), GlusterFS, or Kosmos File System
Data model is a traditional but huge table that is physically stored in sort order of the primary key
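
Storing rows in primary-key order is what makes range scans cheap: a scan becomes two binary searches plus a contiguous read. The following is a minimal single-machine sketch of that idea, not Hypertable's implementation; the SortedTable class and the key format are hypothetical.

```python
import bisect

class SortedTable:
    """Toy table whose keys are kept in sorted order (hypothetical class)."""

    def __init__(self):
        self._keys = []   # primary keys, always sorted
        self._rows = {}   # primary key -> row

    def put(self, key, row):
        # Insert new keys at their sorted position; overwrite existing rows.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, end):
        """Return (key, row) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

table = SortedTable()
for key in ["user:003", "user:001", "page:home", "user:002"]:
    table.put(key, {"id": key})

# All keys with the "user:" prefix: ";" is the character right after ":".
users = [k for k, _ in table.scan("user:", "user;")]
```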

Voldemort

High scalability due to allowing only very simple key/value data access.
Used by LinkedIn
Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
Includes in-memory caching, so separate caching tier isn’t required

MemcacheDB

High performance persistent storage that’s compatible with Memcache protocol

Tarantool

NoSQL database with messaging server
All data maintained in RAM. Persistence via a write ahead log.
Asynchronous replication and hot standby
Supports stored procedures
Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)
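
The combination of in-RAM data and a write-ahead log can be sketched as follows. This is a toy illustration of the general technique, not Tarantool's file format or API: the WalStore class, the JSON-lines log, and the file name are all assumptions made for the example.

```python
import json
import os
import tempfile

class WalStore:
    """Toy in-memory tuple store with write-ahead logging (hypothetical)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}    # all tuples live in RAM
        self._replay()    # rebuild state from the log on startup

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                key, fields = json.loads(line)
                self.data[key] = fields

    def put(self, key, fields):
        # Append to the log and force it to disk BEFORE changing memory,
        # so an acknowledged write survives a crash.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, fields]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = fields

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal_store = WalStore(log_path)
wal_store.put(1, ["alice", 30])
wal_store.put(2, ["bob", 25])

# A second instance reading the same log simulates recovery after a restart.
recovered = WalStore(log_path)
```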

Apache Cassandra

Can use a massive cluster of commodity servers with no single point of failure. Can be deployed across multiple data centers.
Was used by Facebook for Inbox Search until 2010
Read/write scales linearly with number of nodes
Data replicated across multiple nodes
Supports MapReduce, Pig, and Hive
Has SQL-like CQL providing for a hybrid between key/value and tabular database

HyperDex

NoSQL key/value that provides lower latency and higher throughput than some alternatives
Replicates data to multiple nodes
Very easy to administer and maintain
Data model: key plus zero or more attributes

Lightcloud

Great performance even on small clusters with millions of keys
Nodes replicated via master-to-master replication. Hot backups and restores
Very small client footprint
Built on top of Tokyo Tyrant

Posted in apache, Cassandra, database, Facebook, Hibari, Hive, HyperDex, Hypertable, Lightcloud, LinkedIn, MapReduce, MemcacheDB, NoSQL, Pig, Redis, Riak, scalability, SQL, Tarantool, Voldemort

Tagged basho.com, cassandra.apache.org, github.com, hyperdex.org, hypertable.com, memcachedb.org, project-voldemort.com, redis.io, tarantool.org, toolsjournal.com

Cassandra – NoSQL database to use in conjunction with Hadoop

Posted on November 3, 2013 | Comments Off

Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.

Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase because it has no master node, which eliminates a single point of failure (and the need for traditional redundancy). In theory, its scalability is strictly linear: doubling the number of nodes doubles the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, additional nodes can be deployed programmatically to address production performance problems.

Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large volumes of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.

DataStax provides a commercially packaged version of Cassandra.

MongoDB is a good non-HBase alternative to Cassandra.

Posted in apache, Cassandra, Facebook, HBase, HDFS, Hive, mongodb, NoSQL, Oozie, Pig, Relational DB, SQL, Use Case

Tagged arnnet.com.au, datastax.com, dbta.com, wiki.apache.org/cassandra

HortonWorks trying to make Hive faster, contrasting it to Impala

Posted on November 3, 2013 | Comments Off

Hive was invented by Facebook as a data warehouse layer on top of Hadoop, and has been adopted by HortonWorks. The benefit of Hive is that it enables programmers with years of experience in relational databases to write MapReduce jobs using SQL. The problem is that MapReduce is slow, and Hive slows it down even further.

HortonWorks is pushing for optimization (via project Stinger) of the developer-friendly toolset provided by Hive. Cloudera has abandoned Hive in favor of Impala. Rather than translating SQL queries into MapReduce, Impala implements a massively parallel relational database on top of HDFS.

Posted in cloudera, Data Warehouse, Facebook, hadoop, HDFS, Hive, HortonWorks, Impala, MapReduce, Relational DB, SQL, Stinger

Tagged gigaom.com, hortonworks.com

What’s the root of the differences between Big Data and Relational Database

Posted on November 3, 2013 | Comments Off

Database World

Encode the relationships between objects in tables, and use keys to link the tables together
Standard query language (with emphasis on standard, applying to all database vendors, versions, implementations, programmers) relies on the relationship encoding and vendor architecture for optimization/efficiency
Algorithms rely on a single pass execution, using operations such as Joins and Group Bys and Counts.
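
The database-world approach above can be shown with a small example using Python's built-in sqlite3 module. The table and column names are invented for illustration: two tables are linked by a key, and a single declarative query combines a join, a group by, and a count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Relationships are encoded in tables; orders.restaurant_id links the two.
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        restaurant_id INTEGER REFERENCES restaurants(id),
        total REAL
    );
    INSERT INTO restaurants VALUES (1, 'Diner'), (2, 'Bistro');
    INSERT INTO orders VALUES (1, 1, 12.5), (2, 1, 8.0), (3, 2, 20.0);
""")

# One query: join on the key, group by restaurant, count the orders.
rows = conn.execute("""
    SELECT r.name, COUNT(*) AS order_count
    FROM restaurants AS r
    JOIN orders AS o ON o.restaurant_id = r.id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
```

The query states what is wanted; how to execute it efficiently is left to the database engine, which is the optimization dependency described above.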

Big Data World

Based on linear algebra and probability theory
Encode objects using a property list
Data stored as a matrix, similar to relational tables, except that the intersection of multiple matrices does not imply relationships
Algorithms have iterative solutions with multiple steps, each of which stores results that are used as input by the next step. This is very inefficient to execute in SQL
Indices are not needed, since massively scaled hardware will be used to process the entire data set by brute force or by intelligent jobs (on the front side in Map or the back side in Reduce).
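
As an illustration of an iterative, linear-algebra-style algorithm in which each step consumes the previous step's result, here is a minimal power iteration sketch in plain Python. The choice of algorithm and the 2x2 matrix are assumptions made for the example; the source post does not name a specific algorithm.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def normalize(vector):
    """Scale a vector to unit length."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector]

def power_iteration(matrix, steps=50):
    """Estimate the dominant eigenvalue and eigenvector of a matrix."""
    vector = normalize([1.0] + [0.0] * (len(matrix) - 1))
    for _ in range(steps):
        # The output of one step is the input of the next.
        vector = normalize(mat_vec(matrix, vector))
    # Rayleigh quotient of the (unit-length) final vector.
    eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector

eigenvalue, vector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

Expressing the loop in SQL would require materializing an intermediate table on every pass, which is the inefficiency the list above refers to.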

Either you structure your data ahead of time so that SQL algorithms will work, or you break down your algorithms into algebra (MapReduce jobs) in order to process semi-structured data.

Where does this leave systems like Hive, that enable programmers to write something that looks like SQL and is transformed on the backend into MapReduce jobs? Maybe purists don’t like Hive because it’s used by people on the fence between Database and Big Data, instead of those who have fully converted to Big Data?

Some systems are similar to, yet different from, Hadoop/MapReduce. They claim to be Big Data, but have roots in the database world.

Twitter’s Storm/Summingbird is event driven (not batch) so can target real time applications
Spark uses iterative algorithms and in-memory processing with the goal of being a few orders of magnitude faster than MapReduce

Source:

http://blog.mikiobraun.de/2013/09/data-base-vs-data-science.html

Posted in Hive, MapReduce, Relational DB, Spark, SQL, Summingbird

Tagged mikiobraun.de

Hive can be used to program MapReduce using a subset of SQL

Posted on November 2, 2013 | Comments Off

Hive enables MapReduce to be programmed using something that looks like SQL, instead of a procedural language like Java or Python. This is useful if a team of database programmers, as opposed to application programmers, is called upon to program MapReduce.

Using Hive tables requires defining a schema.

The SQL-like language (called HiveQL) is converted to a MapReduce job.
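
To make the conversion concrete, the sketch below models by hand how a simple GROUP BY / COUNT query could be expressed as map, shuffle, and reduce phases. It is a conceptual illustration in plain Python, not the plan Hive actually generates; the visits data and the function names are hypothetical.

```python
from collections import defaultdict

# Query being modeled (HiveQL-style, shown for reference only):
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
visits = [{"page": "/home"}, {"page": "/menu"}, {"page": "/home"}]

def map_phase(record):
    """Emit a (group key, 1) pair for each input row."""
    yield record["page"], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse the values for one key into a single count."""
    return key, sum(values)

mapped = [pair for record in visits for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reduce step.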

Hue is a browser-based GUI within which you can do Hive work. You type your query and see tabular results. Hue has ODBC drivers, and can export a CSV to Excel.

The Apache page for Hive calls it “a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets.” I’m not sure how the data warehouse piece applies.
