Category Archives: database

Apache Hive: 5 facts

Posted on March 7, 2014

Hive is a SQL-like layer on top of Hadoop.
Use it when your data has some sort of structure.
You can use JDBC and ODBC drivers to interface with your traditional systems. However, it is not a high-performance interface.
Originally built by Facebook, which still uses it, to bring traditional database concepts into Hadoop for analytics. Netflix also uses it to run daily summaries.
Pig is sometimes compared to Hive because both are “languages” layered on top of Hadoop. However, Pig is closer to a procedural language for writing applications, while Hive targets traditional DB programmers moving over to Hadoop.
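
To make the "SQL-like layer" idea concrete: Hive has traditionally translated queries into MapReduce jobs. Below is a minimal pure-Python sketch (not Hive code) of how a query such as SELECT page, COUNT(*) FROM visits GROUP BY page could break down into map, shuffle and reduce phases. The visits table, its columns and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a "visits" table: (user, page)
visits = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]

def map_phase(rows):
    """Emit one (key, 1) pair per row; the GROUP BY column becomes the key."""
    for _user, page in rows:
        yield page, 1

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each group to one value; here COUNT(*) is a sum of ones."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(visits)))
print(counts)
```

The point of Hive is that a DB programmer writes only the one-line query and never has to spell out these phases by hand.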

Source:

http://www.cbronline.com/news/tech/software/businessintelligence/five-things-you-should-know-about-apache-hive-4190151

Posted in database, Facebook, Hive, JDBC, Netflix, ODBC, Pig, tutorial, Use Case

Tagged cbronline.com

Interesting use case about migrating away from SQL to Hadoop and NoSQL

Posted on December 7, 2013

Paytronix analyzes data from 8,000 restaurants, which adds up to a few tens of terabytes. That is not much in terms of volume, but there are a lot of data fields and potential reports. They migrated from MS SQL Server and constantly evolving ETL jobs to Hadoop and MongoDB with a lot of success.

Source:

http://www.informationweek.com/software/information-management/making-the-case-for-hadoop-variety-not-volume/d/d-id/1112894

Posted in database, mongodb, NoSQL, Relational DB, SQL, Use Case

10 Key/Value Store, Distributed, Open Source Databases

Posted on November 4, 2013

Riak

HTTP API
Master-less, so remains operational even if multiple nodes fail
Near linear scalability
Same architecture for both large and small clusters
Key/value model, flat namespace, can store anything

Redis

Key/value store. Values can be data types such as lists, sets, sorted sets and hashes, with operations on them such as set intersection and incrementing a value in a hash.
In-memory dataset
Easy to set up, master/slave replication
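
The two operations mentioned above correspond to the Redis commands SINTER (set intersection) and HINCRBY (increment a field in a hash). The sketch below is not a Redis client; it is a toy in-memory model of what those two commands do, and the class name TinyStore and the key names are invented for the example.

```python
class TinyStore:
    """Toy in-memory stand-in for a data-structure server (not a Redis client)."""

    def __init__(self):
        self._data = {}  # key -> set or dict

    def sadd(self, key, *members):
        """Add members to the set stored at key, creating it if needed."""
        self._data.setdefault(key, set()).update(members)

    def sinter(self, *keys):
        """Return the intersection of the sets stored at the given keys."""
        sets = [self._data.get(k, set()) for k in keys]
        return set.intersection(*sets) if sets else set()

    def hincrby(self, key, field, amount=1):
        """Increment an integer field inside the hash stored at key."""
        h = self._data.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

kv = TinyStore()
kv.sadd("likes:alice", "jazz", "rock", "folk")
kv.sadd("likes:bob", "rock", "folk", "metal")
common = kv.sinter("likes:alice", "likes:bob")  # tastes both users share

kv.hincrby("page:/home", "views")               # first view
views = kv.hincrby("page:/home", "views", 5)    # five more
```

The operations run on the server side against the stored value, so the client does not have to fetch a whole set or hash, modify it and write it back.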

Hibari

Very simple data model with 5 attributes: keys, values, timestamps, expiry date, flags for metadata
Chain replication across nodes that are geographically dispersed. No single point of failure
Excellent performance for large batches (~200k) of read/write operations
Runs on commodity hardware or blades. Does not require SAN
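
Chain replication, in its general form, sends every write to the head of a chain of replicas, passes it along node by node, and treats it as committed once the tail has it; reads go to the tail. The following is a minimal sketch of that general idea only. It says nothing about how Hibari implements it, and the class and node names are hypothetical.

```python
class ChainNode:
    """One replica in the chain; holds its own copy of the key/value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.next = None  # successor in the chain, None for the tail

    def write(self, key, value):
        """Apply the write locally, then forward it down the chain."""
        self.data[key] = value
        if self.next is not None:
            return self.next.write(key, value)
        return self.name  # the tail acknowledges the write

def build_chain(names):
    """Link nodes so that each one forwards to the next name in the list."""
    nodes = [ChainNode(n) for n in names]
    for node, successor in zip(nodes, nodes[1:]):
        node.next = successor
    return nodes

chain = build_chain(["head", "middle", "tail"])
head, tail = chain[0], chain[-1]

acked_by = head.write("user:1", "alice")  # writes enter at the head
value = tail.data.get("user:1")           # reads are served by the tail
```

Because an acknowledged write has passed through every node, any surviving replica holds it if another node in the chain is lost.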

Hypertable

High performance, massively scalable, modeled after Google’s Bigtable
Runs on top of a distributed file system such as Apache Hadoop DFS, GlusterFS, or Kosmos File System
Data model is a traditional but huge table that is physically stored in sort order of the primary key
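
Storing rows in primary-key order is what makes range scans cheap: all keys sharing a prefix sit next to each other. Here is a small sketch of that property using Python's bisect module. It is not Hypertable's API; the SortedTable class and the row keys are assumptions made for illustration.

```python
import bisect

class SortedTable:
    """Toy table that keeps rows ordered by primary key (not Hypertable's API)."""

    def __init__(self):
        self._keys = []  # primary keys, always kept sorted
        self._rows = {}  # key -> row

    def put(self, key, row):
        """Insert or overwrite a row, keeping the key list sorted."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan(self, start, end):
        """Return rows with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

table = SortedTable()
for key in ["com.example/b", "org.sample/a", "com.example/a", "net.demo/x"]:
    table.put(key, {"fetched": True})

# Keys with the prefix "com.example/" are adjacent; "0" sorts right after "/".
result = [k for k, _ in table.scan("com.example/", "com.example0")]
```

A scan only has to locate the start of the range and read forward, rather than touching every row.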

Voldemort

High scalability due to allowing only very simple key/value data access.
Used by LinkedIn
Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
Includes in-memory caching, so a separate caching tier isn’t required

MemcacheDB

High performance persistent storage that’s compatible with the memcached protocol

Tarantool

NoSQL database with messaging server
All data maintained in RAM. Persistence via a write-ahead log.
Asynchronous replication and hot standby
Supports stored procedures
Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)
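
The combination of "all data in RAM" and "persistence via a write-ahead log" can be sketched in a few lines: every write is appended to a log file before it is applied in memory, and after a restart the log is replayed to rebuild the state. This is a generic illustration, not Tarantool's API or file format; WalStore and the JSON log layout are assumptions.

```python
import json
import os
import tempfile

class WalStore:
    """Toy in-memory store with a write-ahead log (not Tarantool's API)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state by re-applying every logged write."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        """Log the write durably first, then apply it in memory."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
live = WalStore(log_path)
live.put(1, ["alice", 30])  # a tuple: unique key plus other fields
live.put(2, ["bob", 25])

recovered = WalStore(log_path)  # simulates a restart: RAM is empty, log is replayed
```

Reads never touch the disk, while the append-only log keeps writes sequential, which is the usual argument for this design.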

Apache Cassandra

Can use a massive cluster of commodity servers with no single point of failure. Can be deployed across multiple data centers.
Was used by Facebook for Inbox Search until 2010
Read/write scales linearly with number of nodes
Data replicated across multiple nodes
Supports MapReduce, Pig, and Hive
Offers the SQL-like CQL, making it a hybrid between a key/value and a tabular database
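
"No single point of failure" and "data replicated across multiple nodes" both follow from placing keys on a hash ring: each key hashes to a position, and the next few nodes clockwise from that position hold its replicas. The sketch below shows only this basic idea. Real Cassandra adds virtual nodes, pluggable partitioners and replication strategies; the Ring class, the use of SHA-256 and the node names here are assumptions for illustration.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a ring position (SHA-256 chosen only for illustration)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Toy hash ring with one token per node."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def nodes_for(self, key):
        """Walk clockwise from the key's token and pick `replicas` nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.nodes_for("user:42")

# If one owner fails, the key is still held by the remaining replicas.
survivors = [n for n in owners if n != owners[0]]
```

Every node can compute the same placement on its own, so no master is needed to route requests.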

HyperDex

NoSQL key/value that provides lower latency and higher throughput than some alternatives
Replicates data to multiple nodes
Very easy to administer and maintain
Data model: key plus zero or more attributes

Lightcloud

Great performance even on small clusters with millions of keys
Nodes replicated via master-to-master replication. Hot backups and restores
Very small client footprint
Built on top of Tokyo Tyrant

Sources:

Posted in apache, Cassandra, database, Facebook, Hibari, Hive, HyperDex, Hypertable, Lightcloud, LinkedIn, MapReduce, MemcacheDB, NoSQL, Pig, Redis, Riak, scalability, SQL, Tarantool, Voldemort

Tagged basho.com, cassandra.apache.org, github.com, hyperdex.org, hypertable.com, memcachedb.org, project-voldemort.com, redis.io, tarantool.org, toolsjournal.com

MapReduce vs Traditional RDBMS

Posted on November 2, 2013

	Traditional RDBMS	MapReduce
Max Data Size per DB	many gigabytes	many petabytes
Access Method	interactive & batch	batch
Updates	read & update many times	read many times — write once
Schema	Defined before the initial load of data, and extensive schema changes are difficult to make (and test?). Data is structured: field entries are in a defined format such as XML or the type of a DB column. The keys/values are defined at the time of data insert.	No schema required, and if there is a schema it can be changed dynamically without regression impact. Fields may be unstructured (free text, images) or semi-structured (a spreadsheet or even this table, in which some description is implied by the row and column headings). The keys/values are defined at the time of processing.
System Integrity/Availability	Generally designed to be high. Recovery is made more difficult by multiple autonomous components (SAN, database cluster, application server farm) that must communicate with each other.	Assumed to be low. Failure is assumed to occur, so computations run on multiple redundant nodes whose results are sorted and merged, or are rescheduled. Tasks generally do not depend on each other in a shared-nothing architecture.
Scalability	non-linear (cannot simply add additional cloned servers to the cluster forever)	linear (can scale by factor of 10, 100, or even 1000 cloned nodes)
Normalization	Database performs better if data is normalized.	Performs better if data is de-normalized, since the data is scattered across multiple nodes. With de-normalized data, all information is available within each block as it is read. An example is a web server log, in which hostnames are specified in full each time.
Data locality	Data is typically stored on a SAN, and access speed is limited by fibre bandwidth. Additional processing may be done on an application server, in which case bandwidth is limited by the datacenter LAN.	Data is processed by the CPU on the same computer that hosts the data on internal drives. Access speed is limited by internal bus bandwidth. This model is called “lack of data motion”.
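
The Schema and Normalization rows can be illustrated together. In the sketch below the raw data is a de-normalized web server log, and the structure is applied only when the data is read, so two different jobs can pick different keys over the same lines. The log format, the field names and the regular expression are made up for this example.

```python
import re
from collections import Counter

# Hypothetical de-normalized log: the hostname is repeated in full on every line.
raw_lines = [
    "web01.example.com GET /home 200",
    "web02.example.com GET /cart 500",
    "web01.example.com GET /cart 200",
    "not a well-formed line",
]

LINE = re.compile(
    r"^(?P<host>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)

def parse(lines):
    """Apply a schema at read time; lines that do not fit are skipped."""
    for line in lines:
        match = LINE.match(line)
        if match:
            yield match.groupdict()

# Two jobs choose different keys over the same raw data, with no reload.
hits_per_host = Counter(rec["host"] for rec in parse(raw_lines))
hits_per_status = Counter(rec["status"] for rec in parse(raw_lines))
```

Each line carries everything needed to process it, so any node can work on any block of the file without a join against another table.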

Source:

http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

Posted in database, MapReduce, Relational DB, scalability

Tagged hortonworks.com

Realtime Databases

Posted on November 2, 2013

Retrieval from Hadoop is not a real-time process. Depending on the dataset and the query, a retrieval may be quick or may take many days to execute. It is therefore important to extract results from Hadoop and store them in a transactional database. Options include traditional SQL databases such as MS SQL, Oracle, or DB2; open source databases such as MySQL (or its fork, Drizzle); and NoSQL databases such as MongoDB or CouchDB.

Hadoop is used for exhaustive data analysis, whereas the SQL or NoSQL database is used for retrieval by applications. Hadoop can also feed into a data warehouse (though probably not extract from one?). A data warehouse holds data that is structured for very fast retrieval, based on analysis that has already been performed. Hadoop’s data almost seems to be structured in the opposite manner.
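
The pattern described here, a slow batch analysis whose results are loaded into a database for fast lookups, can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in for whichever SQL or NoSQL store is actually used, and batch_job, the table name and the sample events are hypothetical.

```python
import sqlite3
from collections import Counter

def batch_job(events):
    """Stand-in for a long-running batch analysis over the full dataset."""
    return Counter(user for user, _action in events)

events = [("alice", "view"), ("bob", "view"), ("alice", "buy")]
summary = batch_job(events)

# Load the precomputed results into a transactional store.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE user_activity (user_id TEXT PRIMARY KEY, event_count INTEGER)"
)
with db:  # commits on success, rolls back on error
    db.executemany(
        "INSERT OR REPLACE INTO user_activity (user_id, event_count) VALUES (?, ?)",
        summary.items(),
    )

def lookup(user_id):
    """Fast keyed read, the kind an application would issue per request."""
    row = db.execute(
        "SELECT event_count FROM user_activity WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else 0
```

The application only ever calls something like lookup("alice"), which answers from the indexed table no matter how long the batch job took to produce the numbers.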

Sources:

Posted in apache, CouchDB, database, drizzle, mongodb

Tagged couchdb.apache.org, drizzle.org, mongodb.org