Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute-force approach of adding more storage and more servers reaching a logistical limit, Facebook’s engineers have increased their data compression from roughly 5x (using RCFile) to 8x (using a custom modification of the Hortonworks ORCFile). The Hortonworks ORCFile is generally faster than RCFile when reading, but slower when writing. Facebook’s custom ORCFile was consistently fastest on both reads and writes and also delivered the best compression.
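For reference, here’s roughly what declaring an ORC-backed table with a compression codec looks like in HiveQL. The table, columns, and codec below are placeholders rather than anything Facebook has published, and the compression ratio you actually get depends heavily on the data.

```sql
-- Illustrative only: a Hive table stored as ORC with a compression codec set.
-- (Table name, columns, and codec are assumptions, not Facebook's schema.)
CREATE TABLE page_views_orc (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```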
I much preferred this Hive tutorial to the previous one using Pig. Using the same dataset in each example made the comparison clearer.
Pig makes sense for sequential steps, such as an ETL job. Hive seemed better suited to the kinds of tasks for which we’d write stored procedures in a more traditional database server.
Another difference came with debugging.
- The Pig editor bundled into the Hortonworks sandbox isn’t very sophisticated as IDEs go: no breakpoints, no viewing of intermediate data, etc. Perhaps there’s a way to accomplish this, but (thankfully) it isn’t covered in such an early-stage tutorial. There’s a button to upload a UDF jar, so I’ve got to research how one develops that jar outside of the Pig script editor.
- The Hive tutorial makes it easier to view progress at each step, since you can think of each step as an independent SQL (actually HiveQL) statement (see the sketch after this list). If the programming task were far more complex, I could see myself structuring a Pig script in a way that might be easier to debug than the Hive equivalent.
- Hive seemed good for an ad-hoc query and Pig for a complex procedural task.
- The next tutorial combines Pig and Hive. I’ll see how that shapes my perceptions.
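To make the “independent statements” point concrete, here’s a rough HiveQL sketch (table and column names are invented) of a two-step task where the intermediate result is materialized so it can be inspected before the next step runs:

```sql
-- Step 1: materialize an intermediate table that can be checked on its own.
-- (Table and column names are placeholders, not the tutorial's exact schema.)
CREATE TABLE high_volume_days AS
SELECT stock_symbol, trade_date, volume
FROM   nyse_stocks
WHERE  volume > 1000000;

-- Step 2: once step 1 looks right, run the next independent statement against it.
SELECT stock_symbol, COUNT(*) AS num_days
FROM   high_volume_days
GROUP BY stock_symbol;
```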
I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2 GB of RAM, but requires 4 GB in order to enable Ambari and HBase. Good thing I have 8 GB in my laptop.
The “Hello World” tutorial gave me hands-on experience with:
- Uploading a file into HCatalog
- Typing queries into Beeswax, a GUI front end for Hive (sample query after this list)
- Running a more complex query by writing a short script in Pig
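For a flavor of what gets typed into Beeswax, here’s a minimal HiveQL query of the sort the tutorial walks through; the table and column names are placeholders for whatever file was uploaded into HCatalog.

```sql
-- A simple aggregate query run through Beeswax against a table registered in HCatalog.
-- (Table and column names are placeholders.)
SELECT   stock_symbol, AVG(close_price) AS avg_close
FROM     nyse_stocks
GROUP BY stock_symbol
ORDER BY avg_close DESC
LIMIT    10;
```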
There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.
- Hive is a SQL-like layer on top of Hadoop
- Use it when your data has some sort of structure (see the sketch after this list).
- You can use JDBC and ODBC drivers to interface with your traditional systems. However, it’s not high performance.
- Originally built by (and still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Also used by Netflix to run daily summaries.
- Pig is sometimes compared to Hive, in that both are “languages” layered on top of Hadoop. However, Pig is more analogous to a procedural language for writing applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.
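A quick sketch of what “structure on top of Hadoop” means in practice: a Hive table declared over delimited files already sitting in HDFS. The path and columns here are assumptions for illustration.

```sql
-- Define a table over existing files in HDFS; Hive applies the schema at read time.
-- (Path and columns are illustrative assumptions.)
CREATE EXTERNAL TABLE web_logs (
  ip         STRING,
  request_ts STRING,
  url        STRING,
  status     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';

-- The same statement could be submitted through a JDBC or ODBC client,
-- just not at interactive-database speeds.
SELECT status, COUNT(*) AS hits
FROM   web_logs
GROUP BY status;
```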
Elasticworks enables real-time search and analytics. YARN is supported, and integration extends into Hive and Pig.
Shark utilizes in-memory SQL queries for complex analytics and is Apache Hive compatible. The name “Shark” is supposed to be shorthand for “Hive on Spark”. It seems to be a competitor to Cloudera Impala or the Hortonworks implementation of Hive.
Apache Spark provides APIs (Python, Scala, Java) for in-memory processing with very fast reads and writes, claiming to be 100x faster than disk-based MapReduce. Spark is the engine behind Shark. Spark can be considered an alternative to MapReduce, not an alternative to Hadoop.
Scala is an interesting language used by companies such as Twitter because it offers both higher performance than Java and easier-to-write code. Some companies that had originally developed in Rails or C++ are migrating to Scala rather than to Java.
GridGain has a 100% HDFS-compatible RAM solution that it claims is 10x faster for IO- and network-intensive MapReduce processing. I understand the IO part, but am not sure why it would help with network-intensive operations. It can be used standalone or alongside disk-based HDFS as a cache. It is compatible with all Hadoop distributions as well as standard tools like HBase, Hive, etc.
Hortonworks has developed additions to Hive that allow multiple records to be updated as a single transaction, following the ACID model. Part of the complexity of transactional updates is that the data must be written to all applicable nodes before the transaction can be considered complete. The naming convention within HDFS folders includes a transaction ID, so both committed and uncommitted files persist until all portions of the transaction have completed. Because the transaction ID is included, any read operations that occur before the transaction has completed will access the old data.
Why go through all of this work to add an ACID model to Hive rather than just use HBase, which already supports transactions? The primary reason is that HBase only supports Consistency at the level of a single row update, rather than across a larger set of operations. Without Consistency, there is no ACID. Hortonworks lists a few other reasons, but I’m discounting them because they are really just general reasons why they prefer Hive over HBase.
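To make the Hive side of this concrete, here’s a minimal sketch of a transactional Hive table and a multi-row update, assuming the transactional feature is enabled on the cluster; the table, bucketing, and columns are invented for illustration.

```sql
-- Transactional Hive tables must be stored as ORC and flagged as transactional.
-- (Table name, bucketing, and columns are assumptions for this sketch.)
CREATE TABLE accounts (
  id      INT,
  balance DECIMAL(10,2)
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- One statement touching many rows commits (or fails) as a unit; readers that
-- started before the commit keep seeing the old data, which is exactly what the
-- transaction ID embedded in the file names makes possible.
UPDATE accounts SET balance = balance * 1.01 WHERE balance > 0;
```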
- HTTP API
- Masterless, so it remains operational even if multiple nodes fail
- Near linear scalability
- Architecture is the same for both large and small clusters
- Key/value model, flat namespace, can store anything
- Key/value. Can store data types such as sets, sorted lists, and hashes, and can perform operations on them such as set intersection and incrementing a value in a hash.
- In-memory dataset
- Easy to set up; master/slave replication
- Very simple data model with five attributes: keys, values, timestamps, expiry dates, and flags for metadata
- Chain replication across nodes that are geographically dispersed. No single point of failure
- Excellent performance for large batches (~200k) of read/write operations
- Runs on commodity hardware or blades. Does not require a SAN
- High performance, massively scalable, modeled after Google’s Bigtable
- Runs on top of a distributed file system such as Apache Hadoop’s HDFS, GlusterFS, or the Kosmos File System
- Data model is a traditional, but huge, table that is physically stored in sort order of the primary key
- High scalability due to allowing only very simple key/value data access.
- Used by LinkedIn
- Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
- Includes in-memory caching, so separate caching tier isn’t required
- High performance persistent storage that’s compatible with Memcache protocol
- NoSQL database with messaging server
- All data maintained in RAM. Persistence via a write-ahead log.
- Asynchronous replication and hot standby
- Supports stored procedures
- Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)
- Can use a massive cluster of commodity servers with no single point of failure. Can be deployed across multiple data centers.
- Was used by Facebook for Inbox Search until 2010
- Read/write scales linearly with number of nodes
- Data replicated across multiple nodes
- Supports MapReduce, Pig, and Hive
- Has the SQL-like CQL, providing a hybrid between a key/value and a tabular database (see the sketch below)
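For a sense of that hybrid, here’s a small CQL sketch (table and columns are invented): the definition looks tabular, but the primary key splits into a partition key that behaves like the key in a key/value store and a clustering column that orders rows within it.

```sql
-- CQL, not HiveQL: the partition key (user_id) is the lookup key, and event_ts
-- clusters the rows stored under it. Names are placeholders.
CREATE TABLE user_events (
  user_id  uuid,
  event_ts timestamp,
  payload  text,
  PRIMARY KEY (user_id, event_ts)
);

-- Reads are addressed by the partition key, key/value style.
SELECT event_ts, payload
FROM   user_events
WHERE  user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```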
- NoSQL key/value that provides lower latency and higher throughput than some alternatives
- Replicates data to multiple nodes
- Very easy to administer and maintain
- Data model: key plus zero or more attributes
- Great performance even on small clusters with millions of keys
- Nodes replicated via master-to-master replication. Hot backups and restores
- Very small client footprint
- Built on top of Tokyo Tyrant
- Integrate with existing infrastructure. You can’t expect a green field. Hadoop can certainly replace some existing components, but it will need to augment others even if it is capable of replacing them.
- Utilize existing staff. Hive isn’t optimal, but there is value in having existing staff, who already understand current systems, business needs, and data sets, working in the new Hadoop environment.