Spark is the darling of the open source community right now. It’s setting the Hadoop world on fire with its power and speed in large scale data processing on Hadoop clusters. Spark is one of the most active big data open source projects, has bunches of enthusiastic committers, has its own group of ecosystem applications, and is now part of most standard Hadoop distributions. Neat trick for a data processing framework that didn’t even start life as a Hadoop project.
Spark was developed by academics at Berkeley. I suspect they were looking for a better way to design and execute large scale machine learning algorithms. Regardless of the reasons, Spark has become one of the best performing and most well-loved of the Hadoop frameworks. I have adored this bright, fascinating engine since it was just a bit more than a twinkle in the eye of the BDAS (Berkeley Data Analytics Stack) team. I have also watched it burn a lot of other promising technologies in its wake. The Tragedy of Tez, Drill and Impala, as Michael Segel on LinkedIn so eloquently put it, is mainly the timing of Spark.
Data Processing Paradigm (How does it work?)
Like Actian DataFlow, Spark doesn’t require Hadoop to function. Spark was originally developed to run on Mesos clusters. Like DataFlow and Tez, it implements DAGs, which gives it excellent flexibility and parallel speed. Spark pipelines like a champ. Also, like DataFlow, but unlike Tez, Spark does not use the MapReduce mapper and reducer programming paradigm. It does provide a map and reduce capability, so if that’s what you need, you can still do it with Spark. However, Spark is free of that restrictive paradigm’s stilted style of execution. (No kangaroo data processing here.)
Spark does all of its data processing in memory in its own unique data format called an RDD (Resilient Distributed Dataset). It starts by reading data from disk into its in-memory RDD format, and the data stays in memory from there until you tell it to put some sort of output on disk.
One of the interesting aspects of the RDD format is that it’s immutable. Once created, an RDD is never modified. Data transformation commands result in a new RDD being created. A simple command can make an RDD hang around in memory, giving it some of the useful characteristics of an in-memory distributed database.
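To make the immutability idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the `ToyDataset` class and its `map` and `filter` methods are hypothetical names invented for illustration, and the data lives in an ordinary tuple rather than being distributed across a cluster.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for an RDD: an immutable tuple of records."""

    records: Tuple[int, ...]

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # A transformation never edits self; it builds a brand new dataset.
        return ToyDataset(tuple(fn(r) for r in self.records))

    def filter(self, keep: Callable[[int], bool]) -> "ToyDataset":
        # Same rule: the original records are left untouched.
        return ToyDataset(tuple(r for r in self.records if keep(r)))


base = ToyDataset((1, 2, 3, 4))
doubled = base.map(lambda r: r * 2)       # new dataset, base is unchanged
large = doubled.filter(lambda r: r > 4)   # another new dataset
```

Every step leaves the earlier datasets intact, which is also why a chain of transformations can multiply memory use. In Spark itself, the "simple command" that keeps an RDD in memory is, to my knowledge, `cache()` or `persist()` on the RDD.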
Interface (How do you use it?)
The Spark API has historically been easy to use and well-documented with a good library of pre-built functions to work with. It is definitely a Scala centric framework, with Python and Java interfaces lagging behind.
Databricks has developed a cloud offering with a notebook-style interface that has elevated Spark from raw Scala programming to something approaching an IDE (Integrated Development Environment), and that vastly simplifies getting acquainted with this sparkly tool.
However, the interactive Spark shell is the obvious and most commonly used way to interact with this framework. Anyone who is comfortable programming in Scala is most likely going to go that route.
Weak Areas (What is it NOT good for?)
One of the common weaknesses of any in-memory data processing framework is recovering from failure. Due to the ephemeral nature of RAM, a hardware failure in the middle of an in-memory database process can wipe out information irrecoverably. Any process in-flight will have to be started over again from the beginning. This is NOT a Spark weakness.
The “Resilient” aspect of the RDD format means that each RDD keeps a record of the transformations that produced it, known as its lineage. In the case of a hardware failure, those transformations can be replayed against the source data to restore the lost parts of the RDD to the state they were in just before the crash.
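The recovery idea, rebuilding lost in-memory data from a record of the changes that produced it, can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how Spark is implemented: `LineageDataset`, `transform`, `collect` and `simulate_failure` are hypothetical names, and the "source" is just a function that can be called again to re-read the input.

```python
from typing import Callable, List, Optional

Step = Callable[[List[int]], List[int]]


class LineageDataset:
    """Hypothetical toy: remembers how to rebuild itself instead of keeping backups."""

    def __init__(self, source: Callable[[], List[int]],
                 steps: Optional[List[Step]] = None) -> None:
        self._source = source              # re-readable input, e.g. a file on disk
        self._steps = list(steps or [])    # ordered record of transformations
        self._cached: Optional[List[int]] = None

    def transform(self, step: Step) -> "LineageDataset":
        # Record the step in a new dataset; nothing is computed yet.
        return LineageDataset(self._source, self._steps + [step])

    def collect(self) -> List[int]:
        # Compute on first use, or recompute if the in-memory copy was lost.
        if self._cached is None:
            data = self._source()
            for step in self._steps:
                data = step(data)
            self._cached = data
        return list(self._cached)

    def simulate_failure(self) -> None:
        # Pretend the node holding the in-memory result crashed.
        self._cached = None


ds = (LineageDataset(lambda: [3, 1, 2])
      .transform(sorted)
      .transform(lambda xs: [x * 10 for x in xs]))
before = ds.collect()
ds.simulate_failure()
after = ds.collect()   # rebuilt by replaying the recorded steps
```

The cost of this approach is recomputation time after a failure, which is usually a better trade than restarting the whole job from the beginning.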
Spark’s real weakness is also its strength. It pulls entire data sets into memory, and does all of its work there. Doing everything in-memory makes Spark a huge memory hog. It gets particularly ugly if multiple transformation steps cause the creation of multiple, slightly different copies of the RDD. Spark has intelligent spill-to-disk capabilities, like any good in-memory database, but if it has to use them, you lose the processing speed advantage of an in-memory data processing framework.
The main thing this means is that you have to beef up the RAM on your data processing nodes if you have a Spark-based cluster. RAM is becoming less expensive over the years, but that still means an extra expense on every single node. On a large cluster, that can add up fast. One of the big advantages of Hadoop clusters has always been the inexpensive hardware. If you need each server in your cluster to have half a terabyte or more of RAM to work well, those dollar signs can add up in a hurry.
Spark also has some security issues, and concurrency issues, especially at larger scale, but I suspect those will be ironed out as the software matures. For now, you’re probably better off having several small Spark clusters, rather than one large one.
Best Use Case (What is it good for?)
Spark shines brightest when it gets used for machine learning. A lot of the sophisticated machine learning algorithms that are changing our world require multiple reads of the same data set to do their jobs. That makes Spark’s RDD concept an ideal catalyst to machine learning brilliance.
It also works exceptionally well for high speed querying, does a great job of processing streaming data with microbatches, and frankly, it burns through most cluster type workloads at blazing speeds.
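The "multiple reads of the same data set" pattern behind the machine learning case can be illustrated with a tiny gradient descent loop in plain Python. This is only a sketch: the `fit_slope` function, the sample points, and the learning rate are all made up for illustration, and nothing here uses Spark. The point is simply that every pass touches every record again, so data that is already in memory pays off on each iteration.

```python
from typing import List, Tuple


def fit_slope(points: List[Tuple[float, float]],
              passes: int = 200,
              learning_rate: float = 0.01) -> Tuple[float, int]:
    """Fit y = w * x by gradient descent; return (w, number of record reads)."""
    w = 0.0
    reads = 0
    for _ in range(passes):
        # Each pass re-reads the full data set to compute the gradient.
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        reads += len(points)
        w -= learning_rate * grad
    return w, reads


# Hypothetical sample data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
slope, reads = fit_slope(points)
```

With three records and 200 passes, the loop reads records 600 times. Scale that to billions of records and the difference between re-reading from disk and re-reading from memory becomes the whole story.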
General Comparison to Other Options
When it comes to comparisons to other engines, only DataFlow can keep up in terms of versatility and performance. I’ve written entire technical papers comparing these two engines. Which engine is faster depends on the situation.
There are three situations where proprietary DataFlow outshines the shining star of the open source world. One is in workflows that have lots of data transformation steps. DataFlow’s roots as a high speed ETL engine give it some efficiency bonuses when it comes to complex data munging. Another situation is fairly obvious. When RAM is limited, Spark has to do a lot of its processing on disk. DataFlow pipelines a bit more efficiently, and doesn’t try to cram entire data sets into memory. This means it still powers on when Spark slows down for RAM limitations. This also means that DataFlow runs faster than Spark on cheap or older hardware.
The third situation is not only not obvious, but counter-intuitive. After sorting data, DataFlow persists the data to disk. Spark doesn’t persist sorted data at all. Hitting disk should slow DataFlow way down, right? Not if you look at overall workflow performance. The reason data gets sorted, generally, is so later steps in the workflow can access the data in that sorted order. Spark has to keep re-sorting the data every time another workflow step needs sorted data to work properly. If you have several of those steps in a row, that can really bog things down.
There is one situation where Spark leaves DataFlow eating its smoke: machine learning. Really, Spark blazes in any scenario that requires multiple reads of the same data set. Having that data set in memory already is a huge performance advantage. Of course, community support and no license costs put Spark way ahead in other ways.
So, what about Storm and Heron and the other streaming data engines? Unsurprisingly, in true streaming scenarios, they snuff Spark. In all other scenarios, they’re not even in the running. The thing that most people don’t realize is that true streaming data scenarios are relatively rare, and micro-batching works like gangbusters for most real world needs. If you have sensor data pouring in from all over town, and you chop that into 10 second batches, then process each batch in 7 seconds and re-set for the next batch in 2 seconds, you’re golden. In most situations, there’s no real advantage to processing each individual record as it flies by.
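The arithmetic in that sensor example is worth spelling out: a micro-batch setup keeps up as long as processing time plus reset time fits inside the batch interval. The sketch below, in plain Python, checks that budget and chops timestamped readings into fixed-width batches. The function names `keeps_up` and `micro_batches` and the sample readings are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def keeps_up(interval_s: float, process_s: float, reset_s: float) -> bool:
    """True if one batch is fully handled before the next one is due."""
    return process_s + reset_s <= interval_s


def micro_batches(events: List[Tuple[float, float]],
                  interval_s: float) -> Dict[int, List[float]]:
    """Group (timestamp_seconds, reading) pairs into fixed-width time windows."""
    batches: Dict[int, List[float]] = defaultdict(list)
    for timestamp, reading in events:
        # Window 0 covers [0, interval), window 1 covers [interval, 2 * interval), ...
        batches[int(timestamp // interval_s)].append(reading)
    return dict(batches)


# The numbers from the text: 10 second batches, 7 s to process, 2 s to reset.
on_schedule = keeps_up(interval_s=10, process_s=7, reset_s=2)
slack_s = 10 - (7 + 2)

# Made-up sensor readings as (timestamp in seconds, value).
events = [(0.5, 20.1), (9.9, 20.4), (10.0, 21.0), (27.3, 19.8)]
batches = micro_batches(events, interval_s=10)
```

With the numbers above there is one second of slack per batch. If processing crept up to 9 seconds, batches would start to queue and latency would grow without bound, which is the point where a true streaming engine starts to look attractive.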
Well, what about Flink? That is the question of the day. Flink is the new kid on the block, with a stack like Spark and an attitude like Spark, and a chance to take on the champ. But Flink is immature, unproven, and just starting to take its first baby steps out into the real Hadoop world. We’ll have to wait and see if Flink has what it takes to blow out the Spark, or if it’s just another candle in the wind.
In the meantime, Spark is lighting up the Hadoop world. If I were going to build a cluster right now, for the vast majority of Hadoop workloads, I would probably go with a Spark-based cluster. Spark is shiny.