Hadoop Tez, Stinger's Baby

The Tragedy of Tez

Tez is one of the marvelous ironies of the fast-moving big data and open source software space: a piece of brilliant technology that was obsolete almost as soon as it was released. In the second in my series of short posts on Hadoop data processing frameworks, I’ll look at the bouncing baby born of the Stinger Initiative, and point out where it’s ugly.

In 2013, Hadoop 2.0 (2.2 really) with YARN (Yet Another Resource Negotiator) made Hadoop essentially an operating system that could coordinate many different types of applications on a single cluster. MapReduce was no longer the only game in town. This meant that improving MapReduce was no longer the only way to improve Hadoop. People in the Hadoop community were beginning to realize that the shortcomings of MapReduce could not be solved without a major overhaul.

About that same time, an open source project called the Stinger Initiative was launched, mainly sponsored by Hortonworks. Its main goal was to improve the speed of Hive, a SQL-like interface that ran on top of MapReduce. Since MapReduce was the problem, they needed a replacement.

So, a new data processing framework was born, Tez.

Data Processing Paradigm (How does it work?)

Tez works very similarly to MapReduce. This is both its greatest strength and its greatest weakness.

One technological advancement in Tez is the use of a DAG (Directed Acyclic Graph) to define workflows. Dataflow DAGs are a strategy used for many years in the HPCC (High Performance Cluster Computing) world. A DAG allows developers to define the steps they want taken without focusing on the low level aspects of execution. The parallel execution of that graph is handled by an automated optimizer at runtime. This allows a lot of flexibility in execution environment, and greatly simplifies, and therefore speeds, development. Having workflows defined as a DAG also allows the optimizer to look at the workflow as a whole when planning the execution, and plan when data can be held in memory and when it needs to hit disk.
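To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use the Tez API; the step names and the `plan_stages` helper are hypothetical, and the "optimizer" here only groups steps whose dependencies are already satisfied, which is far simpler than what a real engine does.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def plan_stages(dag):
    """Group steps into stages. Steps in the same stage have no unmet
    dependencies, so they could be run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = plan_stages(workflow)
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

The developer only declares what depends on what. The order of execution, and which steps can run side by side, is worked out from the graph.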

Planning when data can stay in memory gives Tez another advancement over MapReduce: data pipelining. No more MapReduce pre-clogged pipes or kangaroo data processing. Tez still has the Mapper and Reducer as the building blocks of all jobs, but if a job doesn’t require writing each intermediate step to disk, then Tez doesn’t write it. That means the execution engine can read data from disk, perform several Map and Reduce processing steps in memory, and then write the results to disk.

This means that Tez jobs can run as much as an order of magnitude faster than the same workflow written in MapReduce.

No matter who you are or what job you’re doing, you can’t argue with jobs that are easier to build and execute as much as 10 times faster.
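The pipelining idea from this section can be illustrated with a small Python sketch. This is not Tez code: the two runner functions, the JSON files, and the three toy steps are all assumptions made up for the illustration. The first runner writes every intermediate result to disk and reads it back, the way chained MapReduce jobs do, while the second chains the steps in memory and writes only the final result.

```python
import json
import os
import tempfile


def run_with_intermediate_files(records, steps, workdir):
    """Chained-jobs style: every step writes its full output to disk,
    and the next step reads it back in."""
    writes = 0
    data = list(records)
    for index, step in enumerate(steps):
        data = [step(record) for record in data]
        path = os.path.join(workdir, f"step_{index}.json")
        with open(path, "w") as handle:
            json.dump(data, handle)
        writes += 1
        with open(path) as handle:
            data = json.load(handle)
    return data, writes


def run_pipelined(records, steps, workdir):
    """Pipelined style: steps are chained lazily in memory and only the
    final result is written to disk."""
    stream = iter(records)
    for step in steps:
        stream = map(step, stream)  # nothing is computed until consumed
    result = list(stream)
    with open(os.path.join(workdir, "final.json"), "w") as handle:
        json.dump(result, handle)
    return result, 1


# Three toy processing steps applied to each record.
steps = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]

with tempfile.TemporaryDirectory() as workdir:
    staged, staged_writes = run_with_intermediate_files(range(5), steps, workdir)
    pipelined, pipelined_writes = run_pipelined(range(5), steps, workdir)

print(staged == pipelined, staged_writes, pipelined_writes)
```

Both runners produce the same answer, but the first touches disk once per step and the second only once. On a real cluster those extra writes and reads are a large part of what makes chained MapReduce jobs slow.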

Interface (How do you use it?)

Because Tez jobs are logically very similar to MapReduce jobs, the API calls are also very similar. This was one of the goals of the Stinger Initiative, to make a MapReduce replacement that was highly compatible with things like Hive that used MapReduce to do their work. This means that Hive, Pig, Cascading and other interfaces that generate MapReduce can now generate Tez code.

Weak Areas (What is it NOT good for?)

The biggest tragedy of Tez doesn’t have anything to do with Tez itself. Tez is a brilliant advancement on the MapReduce paradigm, a marvelous move forward that solves the biggest weakness of its predecessor. The problem is that the apple didn’t fall far enough away from the tree.

Tez still tightly conforms to the strict Map Shuffle Reduce pattern of MapReduce. Jobs that do not fit that pattern well are just as difficult to build with Tez as they are with MapReduce. Tez is MapReduce with turbo boost. This means that it still isn’t very good at sophisticated machine learning or other complex operations that don’t match up to a Map Shuffle Reduce pattern.
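For readers who have not worked with it, the Map Shuffle Reduce pattern mentioned above looks roughly like this. The sketch is a single-machine word count in plain Python, using neither the MapReduce nor the Tez API, and the three function names are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: turn each input record into (key, value) pairs."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into one result."""
    return {key: sum(values) for key, values in groups.items()}


lines = ["tez runs on yarn", "hive runs on tez"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Counting, grouping and aggregating fit this shape naturally. A job that needs many dependent passes over the same data, as iterative machine learning algorithms often do, has to be forced into repeated rounds of the same three phases.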

Every other data processing framework that I know of for Hadoop, other than MapReduce, also uses DAGs for ease of development, data pipelining and execution speed.  The difference is that other frameworks actually looked at the problem of parallel data processing on Hadoop clusters in completely different ways, and came up with different strategies for handling data. This gives them the exact same speed boost that Tez has from the DAG strategy, plus other advantages, making Tez already outclassed.

Best Use Case (What is it good for?)

The biggest strength of Tez, ironically, is how tightly it conforms to the MapReduce programming paradigm. Because the two frameworks are logically well-matched, it is relatively easy for MapReduce programmers to learn how to use Tez. It is also fairly easy for applications that support MapReduce to add support for Tez.

One of the main goals of the Stinger Initiative was to find a way to speed up Hive. While Hive is the most versatile and capable of the many SQL or SQL-like ways of accessing data on Hadoop, it is also the slowest. With MapReduce as its execution engine, it was so slow that interactive query speeds comfortable for a human weren’t really an option. You put in your query, went to lunch, and hoped it was done by the time you got back. Not exactly interactive.

Tez gives Hive a speed boost that brings it closer to the level of true interactive speed: minutes or even seconds on queries that took hours before. Hive’s greatest strength is its ability to define schema at query time, with no limits on data scale. Cloudera’s Impala, for instance, is faster, but lacks Hive’s breadth of capabilities. Hive on Tez brings Hive into the realm of tolerable speed for most jobs while keeping all that power.

If you need to do interactive-speed queries on heterogeneous, poorly defined data (the kind of messy data Hadoop is famous for), Hive on Tez is arguably the very best engine for the job. In that way, the Stinger Initiative accomplished its main goal. This may change as Hive on Spark develops further, but for now, Hive on Tez can’t be beat.

Tez is also good for the same kind of slow batch, large scale, chugging along, get it done eventually jobs that MapReduce works well for, such as ETL (Extract Transform Load) style jobs. The difference is that Tez makes slow batch jobs faster. Even a job that can afford to take 20 hours is better if it runs in 2 hours on the same hardware and is easier to build.

With Tez in existence, I really can’t think of any reason why people would continue to use MapReduce for new work. Tez does everything MapReduce does, only faster.

General Comparison to Other Options

MapReduce was the original Hadoop data processing framework. It parallelizes data processing across a cluster, and therefore can process data at unlimited scale. But MapReduce is surprisingly slow, difficult to use, and overly rigid, and its fundamental design guarantees that it will continue to be slow, difficult and rigid.

MapReduce and Tez use the same logical programming paradigm, but Tez uses dataflow DAGs for resource optimization and data pipeline planning. This means that Tez can provide as much as an order of magnitude speed boost over MapReduce, but has the same overly rigid design limitations.

Other options I’ll look at in later posts are Spark, DataFlow, Storm, Heron and Flink.

Six Comments

  1. Jeff On June 24, 2015 at 22:18

    When discussing the “tragedy” of timing for Tez re: Spark, we should be careful not to conflate use cases. While the apple indeed needed to fall far from the MR tree for ML augmented analytics, the ETL at petascale use cases rightly needed to stay nearer to MR roots. Too many Spark fanboys overlook the issues that remain. Tez has value in a post Spark era. In the world of big data, there is no one-size-fits-all solution.

    • Paige On June 25, 2015 at 13:48

      Very true. Tez doesn’t have the memory requirements and other weaknesses that hamper Spark in many types of implementations, which makes it ideal for some purposes that Spark isn’t great for. I tend to think in terms of more interactive types of applications, just because that has been my focus for a while. Large scale ETL is still a really important and tough job, and Tez is the best tool for that job at this time.

      Thanks for bringing the discussion over here from the Big Data, Low Latency group on LinkedIn.

  2. Paige On June 25, 2015 at 18:09

    Lots of good discussion on this post in the Big Data, Low Latency group on LinkedIn.

  3. Majid Azimi On July 6, 2015 at 13:30

    There are two amazing points about Tez:
    1. It runs old MapReduce code just fine with better performance. A huge amount of code has been written until now and converting it to Spark would take too much time. Any legacy code built with Cascading and Crunch runs on Tez without changing a single line of code.
    2. God bless Tez installation. Just copy Tez jars into HDFS and there you go.

    • Paige On July 6, 2015 at 23:13

      Very true. Backwards compatibility with existing legacy apps and interfaces is a huge advantage.

  4. Orval Kidd On July 30, 2015 at 3:33

    Thanks for sharing the information, Paige!
    Thing I love about Tez is that it brings you nice and smooth performance even though the code is reduced. You know, like Majid said, it’s really hard to convert huge amount of code since the process will waste too much time.