Cloudera had appeared to be the defacto standard of Hadoop distributions, but Hortonworks has scored big in this deal. Spotify has a 690 node Cloudera cluster that it will be moving to a Hortonworks cluster (undisclosed size). Apparently it’s the new Hive implementation that makes Hortonworks so attractive.
When Spotify launched in 2008 it had a 30 node cluster hosted on Amazon’s AWS, then switched to an on-premises 60 node cluster that grew to 690 nodes. The cluster currently contains 4 petabytes of data which grows by 200 gigabytes per day.
Spotify has a 12 person Hadoop team and uses a Python (not Java) framework for batch processing.