Tag Archives: linkedin.com

Big data is not about storing a lot of data; it is about analyzing multiple data sets in ways that previously were not practical.

Key enablers:

  • cloud infrastructure
  • network bandwidth
  • unstructured data

Use cases:

  • Walmart uses data from past buying patterns, store inventory, mobile phone location data and social media, as well as weather information. It can then send a coupon for a BBQ grill cleaner to a customer's smartphone, but only if that person previously bought a barbecue at Walmart, the weather is currently nice, and he or she is currently within a 3-mile radius of a Walmart store that has the BBQ cleaner in stock.
  • The UK Olympic cycling team uses bike sensors and wearable devices to collect both mechanical and medical information (vastly different data schemas). Mining social media posts adds data points on the emotional states of athletes.
  • Hospitals collect data on every breath and heartbeat of premature babies, which accumulates into a vast data store over the course of days or weeks.
  • eHarmony correlates far more site analytics and user-entered profile and search data than could be done without Hadoop.
  • Cities correlate traffic data (cameras, traffic observations and mass transit schedule updates, along with social media) to re-route or re-schedule traffic in near real time.
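
The Walmart case is essentially a rule that joins several data sets (purchase history, weather, location, inventory) at decision time. Below is a minimal Python sketch of that rule. The class names, field names and the `should_send_coupon` function are hypothetical and only for illustration; nothing here reflects how Walmart actually implements it. Distance is computed with the standard haversine formula.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


@dataclass
class Shopper:            # hypothetical record built from purchase and location data
    bought_grill: bool
    lat: float
    lon: float


@dataclass
class Store:              # hypothetical record built from inventory data
    lat: float
    lon: float
    cleaner_in_stock: bool


def should_send_coupon(shopper, store, weather_is_nice, radius_miles=3.0):
    """Apply all four conditions from the use case; every one must hold."""
    if not (shopper.bought_grill and weather_is_nice and store.cleaner_in_stock):
        return False
    distance = haversine_miles(shopper.lat, shopper.lon, store.lat, store.lon)
    return distance <= radius_miles


store = Store(lat=36.0, lon=-94.0, cleaner_in_stock=True)
nearby = Shopper(bought_grill=True, lat=36.02, lon=-94.0)   # roughly 1.4 miles away
far_away = Shopper(bought_grill=True, lat=36.1, lon=-94.0)  # roughly 7 miles away
print(should_send_coupon(nearby, store, weather_is_nice=True))
print(should_send_coupon(far_away, store, weather_is_nice=True))
```

The hard part in practice is not the rule itself but having all four inputs fresh and joinable at the same moment, which is the point of the use case.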


Apache Samza: LinkedIn’s Real-time Stream Processing Framework

  • Samza is a massively scalable framework for distributed stream transport and limited processing
  • Samza uses YARN and Apache Kafka (publish/subscribe messaging able to handle hundreds of MB of reads and writes per second)
  • LinkedIn uses Samza to publish 26+ billion unique messages per day to hundreds of message feeds that are picked up by thousands of automated subscribers (some real-time, others batch)
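
To make the publish/subscribe idea concrete, here is a toy in-memory sketch in Python. It is not the Samza or Kafka API; the `Broker` class and its methods are invented for illustration. It assumes the log-style model in which each feed is an append-only list and each subscriber tracks its own read position, which is what lets a real-time consumer and a batch consumer read the same feed at different paces.

```python
from collections import defaultdict


class Broker:
    """Toy publish/subscribe broker (hypothetical, in-memory only)."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic's log."""
        self._log[topic].append(message)

    def poll(self, topic, subscriber, max_messages=None):
        """Return unread messages for this subscriber and advance its offset."""
        start = self._offsets[(topic, subscriber)]
        log = self._log[topic]
        end = len(log) if max_messages is None else min(len(log), start + max_messages)
        self._offsets[(topic, subscriber)] = end
        return log[start:end]


broker = Broker()
for event in ["view:1", "view:2", "view:3"]:
    broker.publish("page-views", event)

# A "real-time" subscriber reads one message at a time as it arrives.
first = broker.poll("page-views", "realtime", max_messages=1)

# A "batch" subscriber reads everything accumulated so far in one go.
batch = broker.poll("page-views", "batch")

print(first, batch)
```

Because offsets are tracked per subscriber, neither consumer affects what the other sees.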


Discussion in LinkedIn group on load testing Hadoop with large public domain datasets

Discussion thread on LinkedIn Group:

Benchmarking and Stress Testing an Hadoop Cluster With TeraSort, TestDFSIO & Co.

Project Gutenberg (approximately 30,000 books)

Wikipedia (full download)

Datasets available through Amazon, such as the Human Genome Project and US Census Database
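
Text corpora such as Project Gutenberg or Wikipedia are commonly exercised with a word-count job. The thread does not prescribe a particular workload, so the following is only a local Python sketch of the map and reduce steps, handy for sanity-checking logic on a small sample before running anything on a cluster. The function names are hypothetical.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every word in a line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run map then reduce over an iterable of lines, all in one process."""
    return reduce_counts(pair for line in lines for pair in map_words(line))


sample = ["The cat sat.", "The hat sat too."]
print(word_count(sample))
```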