Tez is one of the marvelous ironies of the fast-moving big data and open source software space: a piece of brilliant technology that was obsolete almost as soon as it was released. In the second in my series of short posts on Hadoop data processing frameworks, I’ll look at the bouncing baby born of the Stinger Initiative, and point out where it’s ugly.
In 2013, Hadoop 2.0 (2.2 really) with YARN (Yet Another Resource Negotiator) made Hadoop essentially an operating system that could coordinate many different types of applications on a single cluster. MapReduce was no longer the only game in town. This meant that improving MapReduce was no longer the only way to improve Hadoop. People in the Hadoop community were beginning to realize that the shortcomings of MapReduce could not be solved without a major overhaul.
About that same time, an open source project called the Stinger Initiative was launched, mainly sponsored by Hortonworks. Its main goal was to improve the speed of Hive, a SQL-like interface that ran on top of MapReduce. Since MapReduce was the problem, they needed a replacement.
So, a new data processing framework was born, Tez.
Data Processing Paradigm (How does it work?)
Tez works very similarly to MapReduce. This is both its greatest strength and its greatest weakness.
One technological advancement in Tez is the use of a DAG (Directed Acyclic Graph) to define workflows. Dataflow DAGs are a strategy used for many years in the HPCC (High Performance Cluster Computing) world. A DAG allows developers to define the steps they want taken without focusing on the low-level aspects of execution. The parallel execution of that graph is handled by an automated optimizer at runtime. This allows a lot of flexibility in execution environment, and greatly simplifies, and therefore speeds, development. Having workflows defined as a DAG also allows the optimizer to look at the workflow as a whole when planning the execution, and plan when data can be held in memory and when it needs to hit disk.
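To make the idea concrete, here is a minimal sketch in plain Python of what "define the steps, let the planner work out the execution" looks like. It is not Tez code. The step names and the plan_waves helper are made up for illustration, and it assumes Python 3.9 or later for the standard library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "read_orders": set(),
    "read_customers": set(),
    "filter_orders": {"read_orders"},
    "join": {"filter_orders", "read_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def plan_waves(dag):
    """Group steps into waves; every step in a wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(plan_waves(workflow), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The developer only states what depends on what. The planner discovers on its own that the two read steps have no dependencies and can run side by side, which is the kind of decision a real optimizer makes with far more information.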
Planning when data can stay in memory provides another advancement Tez has over MapReduce: data pipelining. No more MapReduce pre-clogged pipes or kangaroo data processing. Tez still has the Mapper and Reducer as the building blocks of all jobs, but if the job doesn’t require writing each intermediate step to disk, then it doesn’t. That means the execution engine can read data from disk, perform several Map and Reduce processing steps in memory, and then write the results to disk.
This means that Tez jobs can run as much as an order of magnitude faster than the same workflow written in MapReduce.
No matter who you are or what job you’re doing, you can’t argue with jobs that are easier to build and execute as much as 10 times faster.
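The pipelining difference can be sketched with a toy example. This is plain Python, not Tez or MapReduce code. FakeDisk, the three steps, and both run functions are hypothetical stand-ins, and counting writes is only a rough proxy for the real cost of hitting disk between steps.

```python
class FakeDisk:
    """Stand-in for cluster storage that just counts writes (illustrative only)."""

    def __init__(self):
        self.files = {}
        self.writes = 0

    def write(self, name, records):
        self.files[name] = list(records)
        self.writes += 1

    def read(self, name):
        return list(self.files[name])


# Three hypothetical processing steps applied in sequence.
def parse(records):
    return [int(r) for r in records]


def keep_even(records):
    return [r for r in records if r % 2 == 0]


def square(records):
    return [r * r for r in records]


STEPS = [parse, keep_even, square]


def run_with_materialization(disk, source):
    """Chained-jobs style: every step writes its output before the next reads it."""
    current = source
    for index, step in enumerate(STEPS):
        output = f"step_{index}"
        disk.write(output, step(disk.read(current)))
        current = output
    return disk.read(current)


def run_pipelined(disk, source, sink):
    """Pipelined style: intermediates stay in memory; only the result is written."""
    records = disk.read(source)
    for step in STEPS:
        records = step(records)
    disk.write(sink, records)
    return disk.read(sink)


chained_disk = FakeDisk()
chained_disk.write("input", ["1", "2", "3", "4"])
chained_disk.writes = 0  # do not count loading the input
chained_result = run_with_materialization(chained_disk, "input")

pipelined_disk = FakeDisk()
pipelined_disk.write("input", ["1", "2", "3", "4"])
pipelined_disk.writes = 0  # do not count loading the input
pipelined_result = run_pipelined(pipelined_disk, "input", "output")

print("chained:  ", chained_result, "writes:", chained_disk.writes)
print("pipelined:", pipelined_result, "writes:", pipelined_disk.writes)
```

Both runs produce the same answer, but the chained version writes once per step while the pipelined version writes once in total. On a real cluster each of those extra writes is a trip to distributed storage, which is where the time goes.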
Interface (How do you use it?)
Because Tez jobs are logically very similar to MapReduce jobs, the API calls are also very similar. This was one of the goals of the Stinger Initiative, to make a MapReduce replacement that was highly compatible with things like Hive that used MapReduce to do their work. This means that Hive, Pig, Cascading and other interfaces that generate MapReduce can now generate Tez code.
Weak Areas (What is it NOT good for?)
The biggest tragedy of Tez doesn’t have anything to do with Tez itself. Tez is a brilliant advancement on the MapReduce paradigm, a marvelous move forward that solves the biggest weakness of its predecessor. The problem is that the apple didn’t fall far enough away from the tree.
Tez still tightly conforms to the strict Map Shuffle Reduce pattern of MapReduce. Jobs that do not fit that pattern well are just as difficult to build with Tez as they are with MapReduce. Tez is MapReduce with turbo boost. This means that it still isn’t very good at sophisticated machine learning or other complex operations that don’t match up to a Map Shuffle Reduce pattern.
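For anyone who hasn’t written one, here is a minimal sketch of the Map Shuffle Reduce pattern in plain Python, using the classic word count. The function names are my own and nothing here is Tez or Hadoop API. It only shows the shape every job has to be squeezed into.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (key, value) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key lands on one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values to a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count([
    "Tez is fast",
    "MapReduce is slow",
    "Tez is MapReduce with turbo boost",
])
print(counts)
```

Word count fits this shape naturally. An iterative algorithm, where each pass depends on the output of the previous one, has to be forced through the same three phases over and over, and that is where the pattern starts to hurt.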
Every other data processing framework that I know of for Hadoop, other than MapReduce, also uses DAGs for ease of development, data pipelining and execution speed. The difference is that other frameworks actually looked at the problem of parallel data processing on Hadoop clusters in completely different ways, and came up with different strategies for handling data. This gives them the exact same speed boost that Tez has from the DAG strategy, plus other advantages, making Tez already outclassed.
Best Use Case (What is it good for?)
The biggest strength of Tez, ironically, is how tightly it conforms to the MapReduce programming paradigm. Because the two frameworks are logically well-matched, it is relatively easy for MapReduce programmers to learn how to use Tez. It is also fairly easy for applications that support MapReduce to support Tez as well.
One of the main goals of the Stinger Initiative was to find a way to speed up Hive. While Hive is the most versatile and capable of the many SQL or SQL-like ways of accessing data on Hadoop, it is also the slowest. With MapReduce as its execution engine, it was so slow that human-comfortable interactive query speeds weren’t really an option. You put in your query, went to lunch, and hoped it was done by the time you got back. Not exactly interactive.
Tez gives Hive a speed boost that brings it closer to the level of true interactive speed: minutes or even seconds on queries that took hours before. Hive’s greatest strength is its ability to define schema at query time, with no limits on data scale. Cloudera’s Impala, for instance, is faster, but lacks Hive’s breadth of capabilities. Hive on Tez brings Hive into the realm of tolerable speed for most jobs while keeping all that power.
If you need to do interactive-speed queries on heterogeneous, poorly defined data (the kind of messy data Hadoop is famous for), Hive on Tez is arguably the very best engine for the job. In that way, the Stinger Initiative accomplished its main goal. This may change as Hive on Spark develops further, but for now, Hive on Tez can’t be beat.
Tez is also good for the same kind of slow-batch, large-scale, chugging-along, get-it-done-eventually jobs that MapReduce works well for, such as ETL (Extract Transform Load) style jobs. The difference is that Tez makes slow batch jobs faster. Even a job that can afford to take 20 hours would be better if it could run in 2 hours on the same hardware and was easier to build.
With Tez in existence, I really can’t think of any reason why people would continue to use MapReduce for new work. Tez does everything MapReduce does, only faster.
General Comparison to Other Options
MapReduce was the original Hadoop data processing framework. It parallelizes data processing across a cluster, and therefore can process data at unlimited scale. But MapReduce is surprisingly slow, difficult to use, and overly rigid, and its fundamental design guarantees that it will continue to be slow, difficult and rigid.
MapReduce and Tez use the same logical programming paradigm, but Tez uses dataflow DAGs for resource optimization and data pipeline planning. This means that Tez provides an order of magnitude speed boost over MapReduce, but has the same overly rigid design limitations.
Other options I’ll look at in later posts are Spark, DataFlow, Storm, Heron and Flink.
When discussing the “tragedy” of timing for Tez relative to Spark, we should be careful not to conflate use cases. While the apple indeed needed to fall far from the MapReduce tree for ML-augmented analytics, the petascale ETL use cases rightly needed to stay nearer to their MapReduce roots. Too many Spark fanboys overlook the issues that remain. Tez has value in a post-Spark era. In the world of big data, there is no one-size-fits-all solution.