Normally we’d like to think of Hadoop running on hundreds of racks of commodity hardware, but that doesn’t mean that we should forget all of the reasons why we love virtualization.
This case study explains the how and why, and provides benchmarks from an experiment running Hadoop on VMware. Of course the experiment was successful, since the study was published by VMware.
The moral of the story is that just because Hadoop can run on commodity hardware doesn’t mean that it has to, or that it’s the best way to deploy.
Hadoop 2.0 enables clusters to grow beyond the roughly 4,000-node ceiling of Hadoop 1.x, within deployments that can contain multiple clusters. I think that companies like Google and Facebook each run tens of thousands of nodes.
Using YARN, developers can run additional applications within the cluster: YARN tracks what each application needs, then allocates CPU/RAM containers within the cluster (and across clusters?) to run it.
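Concretely, those CPU/RAM containers are carved out of the resources each NodeManager advertises to the ResourceManager. A minimal `yarn-site.xml` sketch of the relevant knobs (the values here are illustrative, not recommendations):

```xml
<!-- yarn-site.xml: resource limits for container allocation (illustrative values) -->
<configuration>
  <!-- Total memory and CPU this NodeManager offers to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Largest single container the scheduler will grant to an application -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>
</configuration>
```

Each application's ApplicationMaster then requests containers sized within these limits, and the scheduler packs them onto nodes with spare capacity.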
There’s speculation that eventually YARN could turn Hadoop into a PaaS that competes with VMware’s Cloud Foundry. I suppose that while with VMware you need to first think in terms of virtualizing hardware components and an operating system, YARN jumps past that to provide an environment that’s abstracted for a specific application.