Tag Archives: developer.yahoo.com
Big Data Tutorials
Posted in tutorial
Tagged bigdatauniversity.com, ctovision.com, developer.yahoo.com, hortonworks.com, techtarget.com
What benefit does Yarn bring to the existing MapReduce?
Classic MapReduce is built around the Job Tracker component. Yarn splits the Job Tracker's two responsibilities, resource management and job scheduling/monitoring, into separate components: a global Resource Manager (aka RM), which allocates CPU, RAM, etc., and a per-application Application Master (aka AM), which negotiates resources from the Resource Manager and works with the Node Managers to execute tasks. A Node Manager (aka NM) operates at the level of a single node/machine. The Job Tracker is already an ancient architecture: five years old!
Yarn is sometimes referred to as MapReduce 2.0 or MRv2.
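To make the division of labor above concrete, here is a toy model in Python. It is only a sketch: the class and method names (`allocate`, `launch`, `run`) are made up for illustration and are not the real Yarn API, and memory is the only resource tracked.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free memory on one machine and launches containers."""
    name: str
    free_mb: int
    running: list = field(default_factory=list)

    def launch(self, task: str, mb: int) -> None:
        # Reserve the memory and record the task as running on this node.
        self.free_mb -= mb
        self.running.append(task)


class ResourceManager:
    """Global allocator: only decides which node has room for a request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mb: int):
        # Pick the first node with enough free memory (simplified placement).
        for node in self.nodes:
            if node.free_mb >= mb:
                return node
        return None  # cluster is full; the caller has to wait or retry


class ApplicationMaster:
    """Per-application coordinator: asks the RM for room, then has an NM run the task."""

    def __init__(self, rm: ResourceManager):
        self.rm = rm

    def run(self, tasks: dict) -> dict:
        placement = {}
        for task, mb in tasks.items():
            node = self.rm.allocate(mb)
            if node is not None:
                node.launch(task, mb)
                placement[task] = node.name
        return placement


nodes = [NodeManager("node1", 2048), NodeManager("node2", 4096)]
am = ApplicationMaster(ResourceManager(nodes))
placement = am.run({"map-0": 1024, "map-1": 1024, "reduce-0": 3072})
print(placement)
```

The point of the sketch is that the Resource Manager never knows what a "map" or a "reduce" is; all job-specific logic lives in the Application Master.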
The Resource Manager supports hierarchical application queues to guarantee allocation ratios of cluster resources. However, it is a pure scheduler: it does not monitor applications, and it does not handle recovery from application or hardware failures. Scheduling methods include FIFO (default) and Capacity. Fair is not currently supported.
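One way to read "hierarchical queues with guaranteed ratios" is that each queue gets a fraction of its parent's share, so a leaf queue's guarantee is the product of the fractions on the path from the root. The Python sketch below assumes exactly that; the queue names and the dictionary layout are hypothetical and do not mirror any real scheduler configuration format.

```python
def guaranteed_shares(queue, parent_share=1.0, prefix=""):
    """Flatten a queue tree into {leaf_path: guaranteed fraction of the cluster}."""
    path = prefix + queue["name"]
    # A queue's share is its fraction of whatever its parent was guaranteed.
    share = parent_share * queue["capacity"]
    children = queue.get("children", [])
    if not children:
        return {path: share}
    shares = {}
    for child in children:
        shares.update(guaranteed_shares(child, share, path + "."))
    return shares


# Hypothetical queue tree: prod gets 75% of the cluster, dev gets 25%,
# and dev is split evenly between two sub-queues.
tree = {
    "name": "root", "capacity": 1.0,
    "children": [
        {"name": "prod", "capacity": 0.75},
        {"name": "dev", "capacity": 0.25, "children": [
            {"name": "etl", "capacity": 0.5},
            {"name": "adhoc", "capacity": 0.5},
        ]},
    ],
}

shares = guaranteed_shares(tree)
for path, share in shares.items():
    print(f"{path}: {share:.1%}")
```

The leaf shares always sum to the root's share as long as each set of siblings sums to 1.0, which is what makes the ratios a guarantee rather than a hint.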
ZooKeeper monitors the Resource Manager so that the cluster can switch to a secondary if the Resource Manager itself fails. In a failover scenario, running applications are restarted and the queue continues. Preservation of state within currently running applications is handled by checkpoints stored by the Application Master within HDFS.
Rather than having specific containers to execute Map jobs and Reduce jobs, Yarn enables containers for more generic jobs, which enables developers to write other applications that run on the cluster.
It’s unclear whether Yarn will make the system run faster or slower. Generalization and modularization usually come at a cost. However, Yarn allows for more complete utilization of CPU and RAM, so in theory it can squeeze every last bit of capacity out of a cluster, whereas the fixed-size containers in MapReduce 1.0 could leave some resources idle. Yarn does not manage I/O, which is typically a bigger bottleneck than RAM. There’s also no management of network bandwidth in Yarn. (Note to self, got to figure this out: I saw another article that says that Yarn does manage CPU, disk and network, yet didn’t mention RAM).
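The utilization argument can be illustrated with a back-of-the-envelope model in Python. The numbers are invented: one 8 GB node, MapReduce 1.0 configured with four 1 GB map slots and four 1 GB reduce slots, and a job that is still in its map phase with eight map tasks waiting. Only memory is modeled, and the function names are hypothetical.

```python
NODE_MB = 8192  # assumed memory on a single node


def slots_used_mb(pending_maps, pending_reduces,
                  map_slots=4, reduce_slots=4, slot_mb=1024):
    """MapReduce 1.0 style: map tasks can only use map slots, reduces only reduce slots."""
    running = min(pending_maps, map_slots) + min(pending_reduces, reduce_slots)
    return running * slot_mb


def containers_used_mb(requests_mb, node_mb=NODE_MB):
    """Yarn style: any request fits as long as the node still has enough free memory."""
    used = 0
    for mb in requests_mb:
        if used + mb <= node_mb:
            used += mb
    return used


fixed = slots_used_mb(pending_maps=8, pending_reduces=0)
generic = containers_used_mb([1024] * 8)

print(f"fixed slots:        {fixed / NODE_MB:.0%} of node memory in use")
print(f"generic containers: {generic / NODE_MB:.0%} of node memory in use")
```

With fixed slots, the four reduce slots sit idle during the map phase, so half the node is unused; with generic containers all eight map tasks run at once.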
Another benefit of a more modularized architecture is that it makes the system easier to maintain. Any update to MapReduce 1.0 requires replacing a pretty big chunk of software. Being able to run multiple versions of MapReduce side by side within a cluster of thousands of nodes is important; otherwise, upgrades would require significant downtime.
Source:
- http://stackoverflow.com/questions/12992743/what-additional-benefit-does-yarn-bring-to-the-existing-map-reduce
- http://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html
- http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
- http://spark.incubator.apache.org/
Posted in ApplicationManager, JobTracker, MapReduce, NodeManager, ResourceManager, Yarn, Zookeeper
Tagged cloudera.com, developer.yahoo.com, spark.incubator.apache.org, stackoverflow.com
Twitter creates Hadoop hybrid system to mitigate tradeoffs between batch and stream processing
Storm is an open source system (from Twitter) that processes streams of big data in real time (but without 100% guaranteed accuracy), making it the opposite of Hadoop, which processes a repository of big data in batch.
Twitter needs both streaming and batch, so it created an open source hybrid system called Summingbird. It does what Storm does, then uses Hadoop for error correction.
Twitter’s use cases include updating timelines and trending topics in real time, but then making sure that the analytics are accurate.
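The "stream first, correct with batch later" idea can be sketched in a few lines of Python. This is not Summingbird's API (Summingbird itself is a Scala library); the function names, the hashtag data and the simulated dropped event are all made up for illustration. The assumption is that the batch job has exact counts for everything before some horizon, and the streaming counts are used only for newer events.

```python
from collections import Counter


def batch_counts(events, horizon):
    """Exact counts from the full log, for events strictly before `horizon`."""
    return Counter(tag for ts, tag in events if ts < horizon)


def merged_view(batch, realtime_by_ts, horizon):
    """Exact batch totals plus streaming counts for timestamps the batch has not reached."""
    view = Counter(batch)
    for ts, counts in realtime_by_ts.items():
        if ts >= horizon:
            view.update(counts)
    return view


# Full event log as (timestamp, hashtag); this is what the batch job reads.
events = [(1, "#hadoop"), (1, "#storm"), (2, "#hadoop"), (3, "#hadoop"), (4, "#storm")]

# Simulated streaming layer: it lost the "#hadoop" event at ts=2.
realtime_by_ts = {
    1: Counter({"#hadoop": 1, "#storm": 1}),
    2: Counter(),
    3: Counter({"#hadoop": 1}),
    4: Counter({"#storm": 1}),
}

streaming_only = sum(realtime_by_ts.values(), Counter())
corrected = merged_view(batch_counts(events, horizon=3), realtime_by_ts, horizon=3)

print("streaming only:", dict(streaming_only))
print("after batch correction:", dict(corrected))
```

The streaming-only view undercounts `#hadoop` because of the dropped event; once the batch job has covered timestamps 1 and 2, the merged view matches the true totals while still including the newest events.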
Yahoo’s contribution to this effort was to enable Storm to be configured using Yarn.
Source:
Posted in Storm, Summingbird, Twitter, Yahoo, Yarn