
10 Key/Value Store, Distributed, Open Source Databases

Riak

  • HTTP API (see the usage sketch after this list)
  • Master-less, so remains operational even if multiple nodes fail
  • Near linear scalability
  • Architecture is the same for both large and small clusters
  • Key/value model, flat namespace, can store anything
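
A minimal sketch of the HTTP API using Python’s requests library, assuming a single node on localhost:8098 (Riak’s default HTTP port); the bucket and key names are made up, and older Riak releases used /riak/&lt;bucket&gt;/&lt;key&gt; rather than the /buckets/… path shown here:

    import requests

    BASE = "http://localhost:8098"  # default Riak HTTP port (assumes a local node)

    # Store an opaque value under bucket "demo", key "user:42"
    requests.put(
        f"{BASE}/buckets/demo/keys/user:42",
        data=b'{"name": "Ada"}',
        headers={"Content-Type": "application/json"},
    )

    # Fetch it back
    resp = requests.get(f"{BASE}/buckets/demo/keys/user:42")
    print(resp.status_code, resp.text)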

Redis

  • Key/value. Can store data types such as sets, sorted sets, and hashes, and do operations on them such as set intersection and incrementing a value in a hash (see the sketch after this list)
  • In-memory dataset
  • Easy to set up, with master/slave replication
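
A quick sketch of those operations with the redis-py client; the host, port, and key names are assumptions for illustration:

    import redis

    r = redis.Redis(host="localhost", port=6379)  # 6379 is Redis's default port

    # Sets and set intersection
    r.sadd("editors", "alice", "bob")
    r.sadd("admins", "bob", "carol")
    print(r.sinter("editors", "admins"))  # {b'bob'}

    # Increment a field inside a hash
    r.hincrby("page:home", "views", 1)
    print(r.hgetall("page:home"))         # {b'views': b'1'}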

Hibari

  • Very simple data model with five attributes per record: key, value, timestamp, expiry time, and flags for metadata (sketched after this list)
  • Chain replication across nodes that are geographically dispersed. No single point of failure
  • Excellent performance for large batches (~200K) of read/write operations
  • Runs on commodity hardware or blades. Does not require a SAN
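
Hibari itself is accessed through Erlang (and protocol) clients, so here is only a plain-Python sketch of the five-attribute record described above; the field names are my own labels, not Hibari’s:

    from collections import namedtuple
    import time

    # One record: key, value, timestamp, expiry time, and metadata flags
    Record = namedtuple("Record", ["key", "value", "timestamp", "exp_time", "flags"])

    rec = Record(
        key=b"user:42",
        value=b'{"name": "Ada"}',
        timestamp=time.time(),
        exp_time=time.time() + 3600,   # expire in one hour
        flags={"content_type": "application/json"},
    )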

Hypertable

  • High performance, massively scalable, modeled after Google’s Bigtable
  • Runs on top of a distributed file system such as Apache Hadoop DFS (HDFS), GlusterFS, or the Kosmos File System (KFS)
  • Data model is a traditional, but huge, table that is physically stored in sort order of the primary key (illustrated after this list)
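
To see why sort-ordered physical storage matters, here is a toy Python illustration (not Hypertable’s API): because keys are kept sorted, a range scan over the primary key is two binary searches plus a contiguous slice, instead of a full-table filter.

    import bisect

    # Row keys kept in primary-key sort order, as Hypertable stores them
    keys = ["row001", "row002", "row105", "row106", "row200"]
    rows = {k: f"data-{k}" for k in keys}

    def range_scan(start, end):
        """Return all rows with start <= key < end."""
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return [(k, rows[k]) for k in keys[lo:hi]]

    print(range_scan("row100", "row200"))  # rows 105 and 106 only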

Voldemort

  • High scalability due to allowing only very simple key/value data access (see the sketch after this list)
  • Used by LinkedIn
  • Not an object or a relational database. Just a big, distributed, fault-tolerant, persistent hash table
  • Includes in-memory caching, so separate caching tier isn’t required
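
Voldemort’s native client is Java, but the deliberately small access model is easy to sketch in Python; this is an illustration of the contract, not Voldemort’s API:

    class SimpleStore:
        """Roughly the entire contract a Voldemort-style store exposes."""

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

        def delete(self, key):
            self._data.pop(key, None)

    store = SimpleStore()
    store.put("member:1", {"name": "Ada"})
    print(store.get("member:1"))

Because the contract is this small (no joins, no range queries), the table can be partitioned and replicated across many nodes with little coordination, which is where the scalability comes from.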

MemcacheDB

  • High-performance persistent storage that’s compatible with the memcached protocol (example below)
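
Because MemcacheDB speaks the memcached protocol, any memcached client should work against it; here is a sketch with pymemcache, where the port (21201, MemcacheDB’s default as I recall) is an assumption to check against your install:

    from pymemcache.client.base import Client

    client = Client(("localhost", 21201))  # assumed MemcacheDB default port

    client.set("greeting", "hello")  # persisted to disk, unlike plain memcached
    print(client.get("greeting"))    # b'hello'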

Tarantool

  • NoSQL database with messaging server
  • All data maintained in RAM. Persistence via a write-ahead log.
  • Asynchronous replication and hot standby
  • Supports stored procedures
  • Data model: tuples (a unique key plus any number of other fields), grouped into spaces (collections of tuples); see the sketch after this list
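
A plain-Python sketch of that tuple/space model (an illustration of the model only, not the Tarantool client API):

    # A space is a collection of tuples; each tuple is a unique key plus other fields
    users = {}  # the "users" space

    def insert(space, record):
        key = record[0]  # first field is the unique key
        if key in space:
            raise KeyError(f"duplicate key {key!r}")
        space[key] = record

    insert(users, ("user:1", "Ada", "ada@example.com"))
    insert(users, ("user:2", "Bob", "bob@example.com"))
    print(users["user:1"])  # ('user:1', 'Ada', 'ada@example.com')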

Apache Cassandra

  • Can run on a massive cluster of commodity servers with no single point of failure. Can be deployed across multiple data centers.
  • Was used by Facebook for Inbox Search until 2010
  • Read/write throughput scales linearly with the number of nodes
  • Data replicated across multiple nodes
  • Supports MapReduce, Pig, and Hive
  • Has the SQL-like CQL, providing a hybrid between a key/value and a tabular database (example after this list)
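
A sketch of that hybrid through CQL with the DataStax Python driver; the contact point, keyspace, and table are assumptions for illustration:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # add more contact points in production
    session = cluster.connect("demo")  # assumes a keyspace named "demo" exists

    # Tabular on the surface, partitioned by the primary key underneath
    session.execute("""
        CREATE TABLE IF NOT EXISTS users (
            user_id text PRIMARY KEY,
            name    text,
            email   text
        )
    """)
    session.execute(
        "INSERT INTO users (user_id, name, email) VALUES (%s, %s, %s)",
        ("user:42", "Ada", "ada@example.com"),
    )
    row = session.execute(
        "SELECT name FROM users WHERE user_id = %s", ("user:42",)
    ).one()
    print(row.name)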

HyperDex

  • NoSQL key/value store that provides lower latency and higher throughput than some alternatives
  • Replicates data to multiple nodes
  • Very easy to administer and maintain
  • Data model: key plus zero or more attributes (see the sketch after this list)
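
Going from memory of HyperDex’s Python bindings (treat the module path, port, and space name as assumptions), a put attaches attributes to a key and a get returns them:

    import hyperdex.client

    # Assumes a coordinator on localhost:1982 and a pre-created space "profiles"
    c = hyperdex.client.Client("127.0.0.1", 1982)

    # Key plus zero or more attributes
    c.put("profiles", "user:42", {"name": "Ada", "karma": 100})
    print(c.get("profiles", "user:42"))  # {'name': 'Ada', 'karma': 100}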

Lightcloud

  • Great performance even on small clusters with millions of keys
  • Nodes replicated via master-to-master replication. Supports hot backups and restores
  • Very small client footprint
  • Built on top of Tokyo Tyrant (usage sketch after this list)
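
Lightcloud ships a small Python client; from memory of its README, usage looks roughly like this (the node layout and function names should all be treated as assumptions):

    import lightcloud

    # Two lookup nodes and two storage nodes on one box, for illustration
    LIGHT_CLOUD = {
        "lookup1_A": ["127.0.0.1:41401", "127.0.0.1:41402"],
        "storage1_A": ["127.0.0.1:51401", "127.0.0.1:51402"],
    }
    lookup_nodes, storage_nodes = lightcloud.generate_nodes(LIGHT_CLOUD)
    lightcloud.init(lookup_nodes, storage_nodes)

    lightcloud.set("hello", "world")
    print(lightcloud.get("hello"))  # 'world'
    lightcloud.delete("hello")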

Apache Samza: LinkedIn’s Real-time Stream Processing Framework

  • Samza is a massively scalable framework for distributed stream transport and limited processing
  • Samza uses Apache Hadoop YARN and Apache Kafka (publish/subscribe messaging able to handle hundreds of MB of reads/writes per second)
  • LinkedIn uses Samza to publish 26+ billion unique messages per day to hundreds of message feeds that are picked up by thousands of automated subscribers (some real-time, others batch); the sketch after this list shows the underlying pattern
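
Samza jobs themselves are written in Java against Samza’s StreamTask API, but the shape of the work (consume from a Kafka topic, transform, publish downstream) can be sketched in Python with the kafka-python package; the topic names and broker address are made up:

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("page-views", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Each message is transformed and re-published; Samza parallelizes and
    # supervises exactly this kind of loop across a YARN cluster
    for msg in consumer:
        enriched = msg.value.upper()  # stand-in for real enrichment logic
        producer.send("page-views-enriched", enriched)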

Use Cases for Hadoop

The Apache “Powered by Hadoop” page lists many companies that use Hadoop. Some only list the company name. Others have a sentence or two about what they’re using Hadoop for. And some, like LinkedIn, list the specs of the hardware they have Hadoop running on. There’s also a link to a great article about how the NYTimes used Hadoop (in 2007!!!), running on 100 machines in Amazon’s AWS cloud, to generate PDFs of 11 million articles in 24 hours. One of the things I find interesting about the NYTimes use case is that they used Hadoop for a one-time batch process. A lot of what we read about Hadoop assumes that the use case is an ongoing, maybe multi-year, application.
