Paytronix analyzes data from 8,000 restaurants, which adds up to a few tens of terabytes. That is not especially large in volume, but there are many data fields and potential reports. They migrated from MS SQL Server and a set of constantly evolving ETL jobs to Hadoop and MongoDB with considerable success. Their uses include:
- Batch aggregation of data processed in Hadoop, and then stored in MongoDB for later ad-hoc analysis
- Staging area for batch loads into Hadoop
- Using MapReduce for complex ETL migrations
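The ETL pattern in the last bullet can be sketched without a cluster. Below is a minimal pure-Python simulation of the map, shuffle/sort, and reduce phases, aggregating hypothetical restaurant check records (the field names are illustrative, not Paytronix's actual schema):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical transaction records; in a real job each would be one
# line in an HDFS input split.
transactions = [
    {"store": "boston-01", "check_total": 42.50},
    {"store": "boston-01", "check_total": 18.75},
    {"store": "austin-07", "check_total": 33.00},
]

def mapper(record):
    # Emit (key, value) pairs: store id -> check total.
    yield record["store"], record["check_total"]

def reducer(key, values):
    # Aggregate all values for one key: total revenue per store.
    return key, sum(values)

# Simulate the shuffle/sort phase, then reduce each key group.
pairs = sorted(kv for rec in transactions for kv in mapper(rec))
results = dict(reducer(k, [v for _, v in g])
               for k, g in groupby(pairs, key=itemgetter(0)))
# results maps each store to its summed check totals.
```

A real Hadoop job would distribute the same mapper and reducer logic across many machines; the shuffle step here stands in for Hadoop's sort-and-group machinery.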
Some use cases feed data directly into Hadoop from its source (such as web server logs), while others feed into Hadoop from a database repository. Still others produce a massive output of data that needs to be stored somewhere for post-processing. One model for handling such a dataset is a NoSQL database, as opposed to SQL or flat files.
Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase: Cassandra has no master node, which eliminates a single point of failure (and the need for traditional redundancy). In theory, its scalability is strictly linear; doubling the number of nodes will exactly double the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.
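Cassandra's masterless design rests on a consistent-hash ring: every node owns many points ("virtual nodes") on the ring, so keys spread evenly and adding nodes adds capacity roughly in proportion. The sketch below illustrates that idea in miniature (node names and the virtual-node count are illustrative; this is not Cassandra's actual partitioner code):

```python
import hashlib
from bisect import bisect
from collections import Counter

def ring_position(label):
    # Hash a label to a position on the ring.
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=64):
    # Each node claims many virtual positions for even distribution.
    return sorted((ring_position(f"{n}#{i}"), n)
                  for n in nodes for i in range(vnodes))

def owner(ring, key):
    # A key belongs to the first node clockwise from its hash.
    positions = [p for p, _ in ring]
    idx = bisect(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

ring = build_ring(["node-a", "node-b", "node-c", "node-d"])
counts = Counter(owner(ring, f"row-{i}") for i in range(10_000))
# Each of the four nodes ends up owning roughly a quarter of the rows.
```

Because no node is special, any node can be removed or added without electing a master, which is the property behind the "no single point of failure" claim.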
Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large volumes of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.
DataStax provides a commercially packaged version of Cassandra.
MongoDB is a good non-HBase alternative to Cassandra.
JSON is a good fit for NoSQL databases, and for analysis within Hadoop, because it uses key/value pairs. Keeping the same data model throughout an application (from Hadoop, to a NoSQL db, to a web front end that uses JSON) might make sense.
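A small sketch of that end-to-end data model, using a hypothetical check record: the same JSON document serves as one line of Hadoop streaming output, a MongoDB-style document, and the payload a JavaScript front end consumes.

```python
import json

# Hypothetical check record; field names are illustrative.
check = {
    "store_id": "boston-01",
    "check_total": 42.50,
    "items": [{"sku": "burger", "qty": 2}, {"sku": "soda", "qty": 2}],
}

line = json.dumps(check)     # one line of Hadoop streaming output
restored = json.loads(line)  # what MongoDB or the web tier would see
assert restored == check     # identical data model at every stage
```

Nothing is remapped between stages; each layer reads and writes the same nested key/value structure.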
Retrieval from Hadoop is not a real-time process. Depending on the dataset and the query, retrieval may be quick or it may take many days to execute. Therefore it’s important to extract results from Hadoop and store them in a transactional database. Traditional SQL databases can be used, such as MS SQL, Oracle, DB2, etc., open source databases such as MySQL (or MySQL Drizzle), or NoSQL databases such as MongoDB or CouchDB.
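A minimal sketch of that hand-off, using Python's built-in sqlite3 as a stand-in for whichever transactional database is chosen (the table and column names are illustrative):

```python
import sqlite3

# Hypothetical nightly load: per-store aggregates computed in Hadoop
# are written to a small SQL table the application can query instantly.
rows = [("boston-01", 61.25), ("austin-07", 33.00)]

conn = sqlite3.connect(":memory:")  # stand-in for MS SQL, MySQL, etc.
conn.execute("CREATE TABLE daily_revenue (store TEXT PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", rows)
conn.commit()

# The application now gets millisecond lookups instead of a Hadoop job.
total = conn.execute(
    "SELECT total FROM daily_revenue WHERE store = ?", ("boston-01",)
).fetchone()[0]
```

The batch job can take hours; the application only ever touches the small, indexed result table.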
Hadoop is used for exhaustive data analysis, whereas the SQL or NoSQL database is used for retrieval for application use. Hadoop can also feed into a data warehouse (though probably not extract from one?). A data warehouse holds data that is structured for very fast retrieval, based on analysis that has already been performed. Hadoop’s data almost seems to be structured in the opposite manner.