Using MapReduce is Like Plumbing with Pre-Clogged Pipes

On June 16, 2015 in Big Data, Data Management, Hadoop

MapReduce is no longer the only way to process data on Hadoop. In fact, it’s arguably the worst Hadoop data processing framework.

By now, everyone knows how awesome Hadoop is for large scale, data storage, processing and analysis. Hadoop is the darling of large scale data processing, while MapReduce keeps getting nothing but bad press and complaints that it’s too slow, too hard to use, and generally doesn’t live up to its hype. But aren’t Hadoop and MapReduce the same thing?

MapReduce and HDFS were the original center of the Hadoop universe, but thanks to YARN, the door is open to other ways to do work on clusters. Hadoop has become, not two specific technologies, but a huge and growing ecosystem of technologies. As MapReduce has been more and more widely used, more and more problems with it have shown up. Newer, smarter data processing frameworks have popped up to address those shortcomings. But trying to pick which one to use can be a serious challenge.

I’m going to write a short series of posts on the various Hadoop cluster data processing framework options available now. If you’re thinking about doing a Hadoop implementation, and bewildered by the available options, hopefully this will help. I’m going to try to answer some of these questions:

Which data processing framework makes sense for what you’re trying to do, Spark, Tez, DataFlow, Storm, Heron, Flink?
What do all of these frameworks even do?
Why not just use MapReduce for everything like we did in Hadoop early days?
What if you’re not a programmer, or don’t have a team of specialized programmers?
Is there any way to process data on a Hadoop cluster, other than writing code for some obscure framework you never heard of?
When is it a really bad idea to use a particular framework, and when is it the perfect fit for the job?

Let’s start by taking a look at the original Hadoop data processing framework, MapReduce.

Data Processing Paradigm (How does it work?)

MapReduce was originally designed for fast batch processing across a cluster. Essentially, it divides compute instructions across the computers that contain data on a cluster, pushes those instructions out to the data, and executes them. This is what all cluster computing frameworks have in common. They all parallelize instructions so that they can be executed all at once on partitions of the data. They all push compute to data, rather than trying to pull data to the computation. This is what gives these frameworks the power to do operations on massive data sets in reasonable amounts of time, without buying expensive mega-computers.

Now, specifically, MapReduce is set up to do three operations for every instruction, every step in a data processing workflow.

First, it reads data off disk into memory and executes a single instruction, a Map operation.
Second, it shuffles, sorts and/or partitions output data as needed and passes it to the reducer. The Reducer returns the result of the operation back.
This puts data results back onto disk.

Then, the next instruction in the workflow is executed, and these three steps are repeated. For complex operations, multiple MapReduce cycles would be required for a single action, but the same pattern is always followed.

Interface (How do you use it?)

The most common way that MapReduce is used is simply by writing MapReduce code, which is a flavor of parallel Java. This is not base level, beginner’s Java. We’re talking some pretty serious coding to make a decent MapReduce workflow. MapReduce is tough to write, time consuming, and it requires a programmer with a deep understanding of parallel cluster computing to do it well. This is why MapReduce developers are practically worth their weight in gold in the job market right now.

This is NOT the only way to create workflows in MapReduce, however.

One of the most common ways that MapReduce is used is through Hive and Pig. Hive is a query interface that is SQL-like. Users can interact with data with something similar to SQL. Under the covers, the SQL will be converted into MapReduce jobs that will do the work. Pig is essentially a scripting language for MapReduce. Essentially, Pig is to MapReduce as Javascript is to Java. Pig is an easier way to write MapReduce applications while focusing more on the logical needs of the current job, and less on the low level code. Some people also use development environments like Cascading to try to reduce the pain of MapReduce applicaiton development.

In addition, quite a few applications now provide drag and drop visual interfaces to define MapReduce workflows. Informatica, Pentaho and Talend are some examples. These use their normal drag and drop workflow design interfaces to generate MapReduce code that you can execute on a cluster. That’s machine generated MapReduce, so it’s not going to be anywhere near as efficient or fast as code written by a skilled developer, but it will get the job done.

This means that if you already use one of those tools for data processing, you probably don’t need to hire a team of MapReduce coders just because you’re dealing with larger data sets. It would probably still be a good idea to have at least one person on your team who really gets MapReduce to dig in and troubleshoot issues when they come up. But you can have your normal ETL (Extract, Transform, Load) and BI (Business Intelligence) experts use the interface they’re accustomed to, and get the vast majority of your company’s work done.

Weaknesses (What is it NOT good for?)

The way MapReduce reads data off disk, does a single simple operation, and then writes data back to disk is what I think of as kangaroo data processing. Read data up, do one operation, write data down. Read data up, do one operation, write data down. Every time a data processing workflow hits disk, that’s a pipe buster, a choke point, a place where fast, in-memory data processing grinds down to a snail’s pace. With MapReduce, that means every single step in a data processing workflow is like a clog in your pipe. The more steps in the workflow, the slower it will execute on the whole.

In addition, MapReduce is inflexible in the way it processes data. Every single job MUST be broken down into individual Map and Reduce functions. While this works well for some jobs, it requires some pretty extreme coding gymnastics to work with other types of jobs.

This means that MapReduce is utterly useless for interactive purposes, a bad choice for dealing with high speed streaming data, and not the best idea for any application that requires a response in seconds or sub-seconds. It’s not the best bet for many machine learning algorithms, and it generally sucks at jobs that require a lot of steps to complete. I don’t mean that it can’t, and hasn’t, been used for all of those situations. I just mean that there are far better data processing options now for these scenarios. Even for batch processing, if you need the batch processing done in a short time frame and don’t want to have to use hundreds of nodes to accomplish that goal, there are better options now than MapReduce.

Strengths (What is it good for?)

MapReduce is a slow batch workhorse framework. It’s not specialized to any particular type of data processing. It’s an all purpose swiss army knife. Large scale ETL is a good use case, especially when there are not a lot of transformation steps involved. If all you want to do is move massive amounts of data in parallel, or make a few large scale data changes or searches, and you have plenty of time, MapReduce will get the job done.

Other options

A lot of other options exist in the Hadoop ecosystem, and new ones are popping up all the time. Tez was one of the first on the market, an outcome of Hortonworks stinger project. Actian DataFlow, formerly Pervasive DataRush, has been in existence a long time, but is the one proprietary option on this list, so it hasn’t gotten the exposure that a lot of the others have. It was originally created to handle parallelization of workflows on massively multi-core servers (scale up), and later modified to parallelize across clusters (scale out) as Hadoop started hitting its heyday. Spark came out of Berkeley a few years back and has been a huge hit. Storm was developed specifically to elegantly handle streaming data processing. Those are the relatively mature frameworks.

Very recently, Twitter has come up with Heron as a way to address some of Storm’s shortcomings. Flink is the brand new kid on the block, that looks like it wants to take on Spark’s territory. Heron and Flink were the darlings of the recent Hadoop Summit, which, for the first time in years, I did not attend. I did watch the keynotes live streamed, but somehow, it just wasn’t the same. (Shameless plug: I’m currently looking for a new position as a technology analyst or evangelist in the big data and Hadoop space. Feel free to contact me with leads. Thanks in advance.)

I’ll look at the quirks, strengths and weaknesses of each of these in posts over the next few weeks, and look at when, where and why you would use each one. If you know of another data processing framework I haven’t mentioned, remind me in the blog comments, and I’ll see if I can put it into context with the others.

five Comments

TheSerpentKing On June 17, 2015 at 16:36

What about Hive and Hbase? I admit that MapReduce requires a lot of proficiency in Java but there are already options for people who are from a datatwarehousimg background. There are other solutions as well like Impala ( said to be fastest as per Cloudera).
Paige On June 17, 2015 at 18:01

Good point! I meant to put Hive under MapReduce interfaces, and forgot. D’oh!

A lot of MapReduce code gets created with SQL, or SQL-like interfaces such as Hive. I would estimate that more MapReduce code gets created with Hive than written directly, although that’s just a guess based on what I’ve heard. I don’t have any research to back it.

The Cascading development environment is something I forgot to mention as well. It’s main reason for existence is to make MapReduce easier to develop.

Impala actually doesn’t use MapReduce at all. Instead it uses something similar to an RDBMS query optimizer, and a structured data format (Parquet). NOT using MapReduce is one of the reasons Impala’s so much faster than other options for interactive querying.
Paige On June 22, 2015 at 22:08

I did an edit above to add information about Hive, Pig and Cascading.
William Vambenepe On July 1, 2015 at 7:25

Among the “other options” to consider, I’d recommend looking at Google Cloud Dataflow, a recently open-sourced programming model (https://github.com/GoogleCloudPlatform/DataflowJavaSDK) which unifies stream and batch processing in one pipeline. Those pipelines can then run on a fully-managed service on Google Cloud (https://cloud.google.com/dataflow/) but they can also execute on other runtimes like Spark or Flink. The programming model is designed to be portable across different runtimes. Cloudera wrote the adapter (called a “runner”) to make Dataflow pipelines work on Spark, while Data Artisans (the company behind Flink) wrote the runner for Flink.

Disclosure: I work on Cloud Dataflow. Happy to talk about it if you’d like.
Paige On July 1, 2015 at 20:26

That sounds like some very cool tech. Google has all the nice toys. I don’t know enough about it to compare it to the others. Is it actually a new runtime of its own? Or, is it a new way to generate runtime code, like Hive can generate MapReduce or Tez?

Big Data Page by Paige

Thoughts on Analytics, Software and Data Management

Using MapReduce is Like Plumbing with Pre-Clogged Pipes