In Hadoop’s ecosystem of massively parallel cluster computing frameworks, Actian DataFlow is an anomaly. It’s a powerful little engine that thinks it can take on any data processing problem, no matter the scale. The trouble is that unlike MapReduce, Tez, Spark, Storm and all of the other Hadoop engines, DataFlow is proprietary, not open source.
Having worked at Actian, at Pervasive Software before that, and before that at Data Junction, the little startup ETL software company where DataFlow was born, I know way more about this engine than the shiny paint on the surface. I know it down to the dirt and grease under the wheels.
Since I no longer work for Actian, I now have the option to give my completely honest opinion of DataFlow’s strengths and weaknesses. I no longer have any party line to toe. Perhaps surprisingly, that hasn’t significantly changed what I have to say about it. I still think that this little engine could take over the Hadoop execution world, but I also think that it probably won’t.
Data Processing Paradigm (How does it work?)
DataFlow was originally invented back in the early 2000s for the multi-core revolution. As Moore’s Law started to slow down and computer chips stopped getting faster at the same rate, a lot of hardware folks adapted by putting more and more cores on each chip. DataFlow was designed to automatically scale up at runtime to make best use of all those cores, without knowing ahead of time how many cores it was going to be running on. Its power lay in a philosophy of “Create once, run many,” and of leaving no hardware power behind. It squeezed power levels out of standard hardware that no one previously believed possible.
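To make the “Create once, run many” idea concrete, here is a minimal sketch in plain Python. It is not the DataFlow API, and the names run_job and partition are made up for illustration. The job is written once, and the number of workers is only discovered when it runs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partition(records, parts):
    """Split records into at most `parts` contiguous chunks, preserving order."""
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_job(records, transform, workers=None):
    """Apply `transform` to every record using one worker per available core.

    Hypothetical helper: the worker count is decided at run time,
    not when the job is written.
    """
    workers = workers or os.cpu_count() or 1
    chunks = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(lambda chunk: [transform(r) for r in chunk], chunks)
        return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    print(run_job(list(range(10)), lambda x: x * 2))
```

The same script sizes itself to a laptop or a big server without being edited. One caveat: CPython threads will not actually speed up CPU-bound work, so treat this as a picture of the structure, not of the performance.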
Then along came Hadoop, and instead of spending money on scaling up on machines with more and more cores, businesses started scaling out to do their data processing on multiple computers. A few code tweaks later, and voilà, DataFlow was an engine for clusters that detected available cores and nodes, and automatically parallelized jobs at runtime to make best use of all available hardware.
DataFlow uses the same high performance computing DAG strategy that gives Tez its advantages over MapReduce, but it doesn’t have any of the MapReduce baggage, since DataFlow actually predates MapReduce by quite a few years. DataFlow was never influenced by the MapReduce kangaroo data processing paradigm. Pipelining data in memory was the focus when DataFlow was created. Since it was intended to be a next-gen data- and compute-intensive ETL engine, a lot of thought was put into transforming data many times in many ways as efficiently as possible. Later, parallel machine learning and predictive analytics operators were added that took advantage of the same multiple pipeline strategy, but at its heart, DataFlow is an ETL engine.
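For readers who haven’t seen pipelining up close, here is a rough sketch using plain Python generators. Again, this is not DataFlow code, and the stage names (read, clean, keep_valid) and the sample record fields are invented for the example. Each record flows through every stage in memory, and no stage writes out a complete intermediate data set before the next one starts.

```python
def read(rows):
    """Source stage: hand records downstream one at a time."""
    for row in rows:
        yield row


def clean(rows):
    """Transform stage: normalize the (hypothetical) name field."""
    for row in rows:
        yield {**row, "name": row["name"].strip().title()}


def keep_valid(rows):
    """Filter stage: drop records with a negative age."""
    for row in rows:
        if row["age"] >= 0:
            yield row


def pipeline(source, *stages):
    """Chain stages so each record streams through all of them lazily."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


if __name__ == "__main__":
    rows = [{"name": "  ada lovelace ", "age": 36}, {"name": "bad", "age": -1}]
    for row in pipeline(read(rows), clean, keep_valid):
        print(row)
```

A real dataflow engine runs its operators concurrently and across many pipelines at once. Generators run on a single thread, so this only shows the “no intermediate hop to storage” half of the story.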
Interface (How do you use it?)
The folks developing DataFlow really had their fellow engineers in mind when they designed the framework, so a lot of emphasis was put on making it easier to develop with. Like many of the Hadoop engines, the first users of DataFlow were the same people who invented it. DataFlow’s creators built a lot of abstraction into the framework itself to handle the difficult parallel aspects of multi-threaded application building. The DataFlow Java API is a breeze compared to writing MapReduce or most other types of parallel code. Most Java programmers can pick it up in about a week.
Or, thanks to Actian’s partnership with KNIME, the open source, Eclipse-based data mining platform, you can build applications by dragging and dropping with a mouse.
Weak Areas (What is it NOT good for?)
DataFlow’s biggest weakness is obvious. It’s not open source. I’ve got a lot of love and admiration for this zippy little engine, but it’s just not going to make it up that Hadoop elephant-shaped hill without support from the open source community.
Right now, KNIME is the only open source community that even notices DataFlow’s existence, but they can’t touch the source code. So, even if the data mining and predictive analytics folks WANTED to improve, support and build around this engine, they couldn’t. For them, it’s just a handy bit of freeware that they can use to boost speed on larger data mining jobs.
Best Use Case (What is it good for?)
Like Spark, DataFlow doesn’t really require Hadoop. It will run fine on anything from a laptop to a super-computer, almost any platform with a JVM. However, DataFlow has worked hand-in-hand with Hadoop development. It has its own built-in cluster manager and resource allocation capabilities, created specifically so it could share resources on pre-YARN Hadoop versions. Then, DataFlow was practically first in line for the new YARN-Ready certification. DataFlow edged in through the back door as a second-class citizen, then YARN opened the door and made it welcome.
Like all Hadoop engines, and Hadoop itself, DataFlow was built to solve a problem, namely compute-intensive data matching and profiling jobs that bogged down and took forever. The little engine was put to work more than a decade ago in Pervasive’s data quality tools, Data Profiler, Data Matcher and Data MatchMerge. Years before Hadoop was more than a crazy idea that Google did a research paper on, DataFlow (or DataRush as it was once called) was executing parallel fuzzy matching algorithms for high speed record de-duplication, and blowing through data quality validation jobs against hundreds of business rules in seconds on plain old desktop computers. It’s had a lot of battle testing, and been refined by the use, abuse and demands of real users over those years.
That level of maturity and time-tested solidity isn’t something you see yet in other Hadoop engines. If you have old-school ETL and data quality problems at modern massive scales, DataFlow can power through those at unmatched speeds, dependably. That’s DataFlow’s sweet spot.
Also, if you need basic statistics or machine learning style analytics, DataFlow handles those fairly well. The library of operators is limited, but if they meet your needs, the performance is excellent.
General Comparison to Other Options
If you look at sheer power to do what a Hadoop engine should do, crunch through data at high speeds, DataFlow looks pretty darned impressive. Spark is the only batch style engine that even approaches DataFlow’s speed, and DataFlow doesn’t have the huge memory requirements that Spark has. (Yes, I know, Spark Streaming. Spark Streaming does micro-batching, not true stream processing. And so does DataFlow.)
Unfortunately, the ability to get the job done isn’t the only thing that guides adoption. Per-node software license fees are not popular with the kind of companies that favor open source and are therefore likely to choose Hadoop. The ease of use and generally higher processing speed might make up for that with some companies, and the several-year head start in software maturity could also help ease the pain of proprietary license costs.
The problem is that Spark has something in the neighborhood of 300 committers, and an entire ecosystem of its own being built around it. No company smaller than IBM or Oracle can afford to pay that many developers. Actian doesn’t stand a chance of keeping up. MapReduce and Tez have their own built-in communities and integrated stacks, as does Storm. That’s the one thing that every successful open source project absolutely must have: community support.
DataFlow functionality may be ahead for now, thanks to Actian CTO Mike Hoskins’s ability to see into the future and build software to solve problems other folks didn’t even know were problems yet. But it won’t take long for open source to make up that lead and pass DataFlow. Without community support, this powerful little engine that could, won’t.