Facebook’s 300 PB data warehouse grows by approximately 600 TB per day and resides on more than 100k servers (although I’m not certain how many of those are Hadoop nodes). With the brute force approach of more storage and more servers reaching a logistical limit, the Facebook engineers have increased their level of data compression to 8x (using a custom modification of the Hortonworks ORCFile) from a previous 5x (using RCFile) compression. The Hortonworks ORCFile is generally faster than RCFile when reading, but is slower on writing. Facebook’s custom ORCFile was always fastest on both read and write and also the best compression.
Hydra is not built on top of Hadoop, but functions similar to Summingbird, Storm, and Spark.
Data can stream into it, and analytics can be run in real time, rather than only in batch.
AddThis is the company that originally developed Hydra, which is now in open sourced through Apache. AddThis runs six Hydra clusters, one of which is comprised of 156 servers and processes 3.5 billion transactions per day.
Clients should access an HDFS mount using Fuse for Production use, but the NFS proxy gets you initially deployed faster. Here’s how.
Posted in Fuse, HDFS, NFS
This article from Cloudera offers up use cases (such as customer sentiment) and a tutorial for using Apache Flume for near-real-time indexing (as emails arrive on your mail server) or MapReduce (actually MapReduceIndexerTool) for batch indexing of email archives. The two methods can be combined if you decide to do real-time, but later decide to add another MIME header field into the index.
Cloudera Search is based on Apache Solr (which contains components like Apache Lucene, SolrCloud, Apache Tika, and Solr Cell).
The email (including the MIME header) is parsed (with the help of Cloudera Morphlines), then uses Flume to push the messages into HDFS, as Solr intercepts and indexes the contents of the email fields.
Searching and viewing the results can be done using the Solr GUI or Hue’s search application.
Posted in apache, cloudera, Flume, HDFS, Hue, MapReduce, Solr, Tika, tutorial, Use Case
Tagged cloudera.com, github.com, lucene.apache.org, tika.apache.org
- HTTP API
- Master-less, so remains operational even if multiple nodes fail
- Near linear scalability
- Architecture same of both large and small clusters
- Key/value model, flat namespace, can store anything
- Key/value. Can store data types such as sets, sorted lists, hashes and do operations on them such as set intersection and incrementing the value in a hash.
- In-memory dataset
- Easy to setup, master/slave replication
- Very simple data model with 5 attributes: keys, values, timestamps, expiry date, flags for metadata
- Chain replication across nodes that are geographically dispersed. Not single points of failure
- Excellent performance for large batches (~200k) read/write operations
- Runs on commodity hardware or blades. Does not require SAN
- High performance, massively scalable, modeled after Google’s Bigtable
- Runs on top of a distributed file system such as Apache Hadoop DFS, GlusterDS, or Kosmos File System
- Data model is a traditional, but huge table, that is physically stored in sort order of the primary key
- High scalability due to allowing only very simple key/value data access.
- Used by LinkedIn
- Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
- Includes in-memory caching, so separate caching tier isn’t required
- High performance persistent storage that’s compatible with Memcache protocol
- NoSQL database with messaging server
- All data maintained in RAM. Persistence via a write ahead log.
- Asynchronous replication and hot standby
- Supports stored procedures
- Data model: tuples (unique key plus any number of other fields); spaces (multiple tuples)
- Can use massive cluster of commodity servers with no single point of failure. Can be deploy across multiple data centers.
- Was used by Facebook for Inbox Search until 2010
- Read/write scales linearly with number of nodes
- Data replicated across multiple nodes
- Supports MapReduce, Pig, and Hive
- Has SQL-like CQL providing for a hybrid between key/value and tabular database
- NoSQL key/value that provides lower latency and higher throughput than some alternatives
- Replicates data to multiple nodes
- Very easy to administer and maintain
- Data model: key plus zero or more attributes
- Great performance even on small clusters with millions of keys
- Nodes replicated via master-to-master replication. Hot backups and restores
- Very small client footprint
- Built on top of Tokyo Tyrant
Posted in apache, Cassandra, database, Facebook, Hibari, Hive, HyperDex, Hypertable, Lightcloud, LinkedIn, MapReduce, MemcacheDB, NoSQL, Pig, Redis, Riak, scalability, SQL, Tarantool, Voldemort
Tagged basho.com, cassandra.apache.org, github.com, hyperdex.org, hypertable.com, memcachedb.org, project-voldemort.com, redis.io, tarantool.org, toolsjournal.com
- Wizard for installing/configuring Hadoop services across many hosts
- Start, stop, reconfigure Hadoop across many hosts
- Dashboard for health & status
- Metrics via Ganglia (Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids)
- Alerting via Nagios
- Integrate provisioning, mangement, and monitoring into their own application using the Ambari REST APIs
These tools are supported by Ambari:
- HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop
Posted in Ambari, apache, Ganglia, HBase, HCatalog, HDFS, Hive, MapReduce, Nagios, Oozie, Pig, REST, Sqoop, Zookeeper
Tagged ganglia.sourceforge.net, github.com, incubator.apache.org, nagios.org