HortonWorks / Apache Tez

HortonWorks / Apache Tez provides an alternative to MapReduce for processing near-real-time jobs at petabyte scale. The HortonWorks Stinger project uses Tez to speed up Hive and Pig by an order of magnitude or more.

Tez is based on a multi-stage dataflow architecture (pre-processor, sampler, partition, aggregate), in contrast to the traditional two-stage Map and Reduce.
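
To make the four stages concrete, here is a minimal sketch in plain Python. It does not use the Tez API; every function name is hypothetical, and the word-count workload and the choice of range partitioning from sampled keys are assumptions made purely for illustration.

```python
import bisect
from collections import Counter


def preprocess(lines):
    # Pre-processor: normalize raw lines into lowercase words.
    return [word for line in lines for word in line.lower().split()]


def sample_boundaries(words, num_partitions, step=1):
    # Sampler: look at every `step`-th word and pick split points
    # that divide the sampled key space into roughly equal ranges.
    keys = sorted(set(words[::step]))
    if not keys:
        return []
    return [keys[i * len(keys) // num_partitions]
            for i in range(1, num_partitions)]


def partition(words, boundaries):
    # Partition: route each word to the range its key falls into.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for word in words:
        parts[bisect.bisect_right(boundaries, word)].append(word)
    return parts


def aggregate(parts):
    # Aggregate: count words independently within each partition.
    return [Counter(part) for part in parts]


def run_pipeline(lines, num_partitions=2):
    # Chain the four stages; each one feeds the next directly.
    words = preprocess(lines)
    boundaries = sample_boundaries(words, num_partitions)
    return aggregate(partition(words, boundaries))
```

Calling `run_pipeline(["b a", "c a d"])` sends the words through all four stages in one pass. The point of the sketch is only that stages can be chained freely rather than forced into a single Map step followed by a single Reduce step.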

Tez relies on YARN for resource acquisition, so it cannot run in legacy (pre-YARN) Hadoop environments. It also assumes complex user-defined logic that eliminates duplicate work in order to increase performance. Legacy Hadoop, by contrast, accepts duplicate work, which is made less painful by the massive scale of the cluster and which brings the benefit of redundancy.

Tez may also run multiple instances within a single YARN container, which reduces the overhead of launching additional containers. However, this may reduce resource utilization at very large scale: spreading work across many YARN containers helps allocate every last available hardware resource, whereas Tez squeezes as much as possible into fewer containers.
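
The container trade-off can be sketched with some back-of-the-envelope arithmetic. This is a toy model, not a measurement: the function names, the per-container startup cost, and the number of tasks packed into each container are all hypothetical values chosen for illustration.

```python
import math


def launches_without_reuse(num_tasks):
    # One fresh container per task.
    return num_tasks


def launches_with_reuse(num_tasks, tasks_per_container):
    # Several tasks share each container, so fewer launches are needed.
    return math.ceil(num_tasks / tasks_per_container)


def startup_overhead_seconds(launches, startup_cost_s):
    # Total time spent just starting containers (assumed fixed cost each).
    return launches * startup_cost_s


# Hypothetical workload: 1000 tasks, 2 s startup cost, 8 tasks per container.
no_reuse = startup_overhead_seconds(launches_without_reuse(1000), 2.0)
reuse = startup_overhead_seconds(launches_with_reuse(1000, 8), 2.0)
print(no_reuse, reuse)
```

Under these assumed numbers, packing tasks cuts the container launches from 1000 to 125. The model deliberately ignores the other side of the trade-off described above: fewer, fuller containers give the scheduler less freedom to place work on every idle resource.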

