Apache Hadoop took the world by storm and looked like it was going to own the data analytics and data management industries for a while there. But now, the hype machine, and the weaknesses of Hadoop – complexity, lack of security and governance, slow performance, poor concurrency, etc. – have everyone looking for a good Hadoop alternative.
Let’s look at some of the options that are being touted for doing Hadoop data analytics, and their pros and cons as Hadoop alternatives.
There’s a big push that says, “Go to the cloud. All your problems will be solved.” Ironically, that sounds a lot like the big push from a few years back that said, “Go to Hadoop. All your problems will be solved.” Funny how the hype machine makes the latest cool thing sound like a magic bullet.
Public clouds, for the longest time, had issues around security that kept people from adopting them. Now, public cloud providers have better security than most other organizations. Security is only a blocker if secrecy is the core of what your organization does. Secretive three-letter agencies aren’t going to be putting their data on a public cloud any time soon, but for most companies, cloud security is pretty solid.
Advantages of the cloud:
Ease of Use
Ease of use is one of the most underestimated and most important aspects of any data analytics capability. Putting your data analytics on a cloud means you don’t have to maintain your own data center. You don’t have to do, or know how to do, a lot of things, especially if you use software as a service. The barriers to getting started with data analytics in the cloud are incredibly small. You can do it in a few minutes.
If you need to scale up quickly, or you have highly variable or seasonal workloads, the cloud is the best option. You don’t have to try to guess what your maximum capacity needs will be and maintain infrastructure for all of that, even when you’re not using half of it. I know of one gaming company that released two highly successful games in the same year. They had to scale up their infrastructure 50 times, in one year. The cloud is the only sensible way to do that.
Disadvantages of the cloud:
Wildly unpredictable costs
Some people reading this will be surprised I didn’t put costs under “Advantages.” A lot of pushes to the cloud are about saving costs. And yes, moving to the cloud will certainly save you in hardware costs, and the costs associated with paying people to manage and maintain that hardware. However, the concept that cloud data analytics is cheap is a bit flawed. Storage on the cloud is fairly cheap. Compute of any kind is not cheap, and cloud data analytics software can vary wildly in TCO, especially if it has any sort of auto-scaling feature. Auto-scaling means that the software will automatically provide more and more of that expensive compute as you put demands on it. Your CFO may have a heart attack when the auto-scaled bill comes around.
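To make the auto-scaling point concrete, here is a minimal back-of-the-envelope sketch. Every number in it is made up for illustration: the price per node-hour, the node counts, and the hours at each level are assumptions, not figures from any real provider.

```python
from typing import Dict

# All prices and workload numbers are hypothetical, for illustration only.
PRICE_PER_NODE_HOUR = 2.50  # assumed compute price in USD
HOURS_PER_MONTH = 730


def fixed_cluster_cost(nodes: int) -> float:
    """Monthly cost of a cluster that always runs the same number of nodes."""
    return nodes * PRICE_PER_NODE_HOUR * HOURS_PER_MONTH


def autoscaled_cost(hours_at_node_count: Dict[int, int]) -> float:
    """Monthly cost when the node count follows demand.

    hours_at_node_count maps a node count to the hours spent at that count.
    """
    return sum(
        nodes * hours * PRICE_PER_NODE_HOUR
        for nodes, hours in hours_at_node_count.items()
    )


# What the budget assumed: a steady 10-node cluster.
budgeted = fixed_cluster_cost(nodes=10)

# A quiet month: mostly 4 nodes, occasionally 10.
quiet_month = autoscaled_cost({4: 600, 10: 130})

# A busy month: auto-scaling jumps to 40 nodes for 330 hours.
busy_month = autoscaled_cost({10: 400, 40: 330})

print(f"budgeted:    ${budgeted:,.2f}")
print(f"quiet month: ${quiet_month:,.2f}")
print(f"busy month:  ${busy_month:,.2f}")
```

With these assumed numbers, the quiet month comes in at about half the budget and the busy month at more than double it. The price per hour never changed; only the demand did.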
Performance on a cloud can be unpredictable due to the multi-tenancy aspect, and the noisy neighbor effect. When the network is quiet, performance is excellent. When a lot of people are using it, the network may be slow and unreliable. Data analytics software on the cloud often has strategies to deal with that variable network speed. The real problem is that most cloud applications get paid by usage, meaning how long you use their software. There isn’t a lot of incentive for cloud-based data analytics applications to improve performance when faster queries would actually reduce their paychecks. Check out the recent McKnight benchmark for some cloud data analytics performance numbers.
The first time I heard of an “egress fee” I thought it was a joke. The essence of data management is that data flows from place to place, gets combined, stored, used, changed, moved, etc. I tend to think of it like water. Data needs to flow. Some cloud data analytics software will literally charge you money to move your data out of their platform. And the network speed to move data out of a public cloud is somehow way slower than the network speed to put data into a public cloud. Imagine that. Add in the proprietary services and software packages that only work in one place, and trying to do data analytics in the cloud can bring lock-in to a whole new level. You can get in really easily, but getting out is a whole different story.
Is the public cloud actually a Hadoop alternative?
If you think of Hadoop as a scalable infrastructure to store and process data on, then yeah, public clouds can do that. If you think of Hadoop as a set of software that you can use to manage and analyze data, the public cloud can provide a completely different set of software to do that, sure. This is where it gets interesting, though. The public cloud can be simply an alternative to the hardware, the big data centers that you used to run Hadoop ecosystem software on. There’s not a single reason in the world why you couldn’t just put Hadoop on the cloud, and in fact, many people do.
So, the public cloud isn’t so much a Hadoop alternative as an alternative to on-premises data centers. The software you choose to use on the cloud is up to you. It could be Hadoop, or it could be something else, like a Hadoop alternative.
Okay, so what’s another thing people say is a Hadoop alternative?
Spark is an alternative to Hadoop, depending again on what you think of Hadoop as being. A lot of people think of Hadoop as being the two projects that got it started, HDFS and MapReduce, or maybe they add in YARN. I tend to think of it broadly as the entire ecosystem that was used to manage a wide variety of data for big data analytics. But some think of it just as MapReduce.
Advantages of Spark:
If you think of Spark as a replacement for MapReduce, then it is so completely, off-the-charts better than MapReduce that this isn’t really a discussion. Years ago, I described MapReduce’s kangaroo data processing, where you write to disk and read from disk after every little step, as plumbing with pre-clogged pipes. Heck, yeah, Spark is far more performant than MapReduce. That’s a no-brainer. If you’re building ETL pipes on HDFS right now, do not, I repeat, DO NOT use MapReduce. Spark is just a way better option all the way around.
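Here is a toy sketch of that difference in plain Python. It is not how MapReduce or Spark is actually implemented. It only contrasts a pipeline that writes each intermediate result to disk and reads it back with one that hands results from step to step in memory. The step functions and sample records are hypothetical.

```python
import json
import os
import tempfile


def clean(records):
    """Trim whitespace and lowercase each record."""
    return [r.strip().lower() for r in records]


def dedupe(records):
    """Remove duplicates and return the records in sorted order."""
    return sorted(set(records))


STEPS = [clean, dedupe]


def run_with_disk_round_trips(records, steps, workdir):
    """Write every intermediate result to disk, then read it back for the next step."""
    disk_writes = 0
    for i, step in enumerate(steps):
        records = step(records)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(records, f)
        disk_writes += 1
        with open(path) as f:
            records = json.load(f)
    return records, disk_writes


def run_in_memory(records, steps):
    """Pass each result straight to the next step without touching disk."""
    for step in steps:
        records = step(records)
    return records


raw = [" Alice", "bob ", "ALICE", "Bob"]

with tempfile.TemporaryDirectory() as workdir:
    disk_result, writes = run_with_disk_round_trips(raw, STEPS, workdir)

memory_result = run_in_memory(raw, STEPS)

print(disk_result, "after", writes, "disk writes")
print(memory_result, "after 0 disk writes")
```

Both versions produce the same answer. The first one just pays for a disk write and a disk read after every step, and that cost grows with both the number of steps and the size of the data.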
Spark was originally developed to be good at machine learning workloads, and it still is. That also means being good at data preparation for machine learning. This makes Spark super-good at ETL-style workloads on clusters. A bunch of ETL tools now let you use their interfaces to define Spark workflows.
Then Spark got massive community support and its own ecosystem of software. There’s Spark SQL and Spark Streaming and a wide variety of things you can do with Spark. A lot of the ecosystem aspects and advantages of Hadoop are now in the Spark ecosystem. This means that a lot of things you used to do with Hadoop, you can now do with Spark.
That seems to me like a real alternative to Hadoop, but there are some key things Spark is missing.
Disadvantages of Spark:
Persistent data storage
Spark uses RDDs and DataFrames which are in-memory structured data formats kind of like a database table. They’re hugely useful. The trouble is, they’re ephemeral. The Hadoop Distributed File System (HDFS) is a wonderful distributed, fault tolerant file system, and Spark doesn’t have a persistent file system. It generally uses Hadoop’s. In fact, the vast majority of the times that I’ve seen Spark in use, it was on HDFS, in the middle of some other Hadoop ecosystem products. Spark can run on cloud object storage like S3, but it uses the Hadoop file APIs to do it. Hard to replace Hadoop when you depend on a fundamental part of it.
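As a rough illustration of that dependency, here is a minimal PySpark sketch of reading from S3. The bucket name and key are hypothetical. The `s3a://` scheme is served by Hadoop’s S3A connector (the hadoop-aws module), so the read step assumes PySpark is installed and the matching Hadoop jars are on the classpath. The import is deferred so the URI helper runs without Spark.

```python
def to_s3a_uri(bucket: str, key: str) -> str:
    """Build a URI with the s3a:// scheme, which Hadoop's S3A connector handles."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_parquet_from_s3(uri: str):
    """Read Parquet data from object storage into a Spark DataFrame.

    Assumes pyspark is installed and the hadoop-aws connector is available
    to the Spark session. Credentials handling is left out of this sketch.
    """
    # Imported lazily so the URI helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-sketch").getOrCreate()
    return spark.read.parquet(uri)


# Hypothetical bucket and key, for illustration only.
uri = to_s3a_uri("example-bucket", "/events/2021/")
print(uri)
```

Even with no HDFS cluster anywhere in sight, the path Spark reads from is still resolved through a Hadoop file system implementation.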
I know, I already put performance under Spark’s plusses. But everything is relative. Compared to MapReduce, Spark screams. But if you’re doing data analytics, you’re generally comparing data analytics options: SQL-on-Hadoop options, or database options. Spark SQL tends to run at about the same speed as, or mildly slower than, other SQL-on-Hadoop options like Presto, and considerably slower than actual distributed analytical database options. It also lacks a lot of the other strengths of a data warehouse, but we’re not looking to replace a data warehouse here, we’re looking for a viable alternative to Hadoop.
Is Spark actually a Hadoop alternative?
I think I’ve made that point already. Spark is almost a Hadoop alternative, except for the part where it depends on parts of Hadoop to function. That makes it more of a complement to Hadoop, or an improvement on the Hadoop concept, than an alternative.
By the way, check out this post by the folks at Domino Data Lab if you want to compare Apache Spark with its alternatives.
Enterprise Data Warehouse
So, I mentioned a distributed data warehouse earlier. And it can certainly be an alternative to some aspects of Hadoop, or to almost all of it if you’re using an evolved modern data warehouse or this crazy new concept of a unified analytics warehouse that EMA is talking about. I’ve also heard it called a data lakehouse, but I think that name is a bit silly. But since I work at Vertica, a company that makes these, I don’t think I’m anywhere close to an objective reviewer. And it would kind of change the point of the whole article, so go read my article on Medium about the Evolution of Modern Data Warehouses if you’re considering that.
So, what is the best alternative to Hadoop?
I think the main problem with trying to answer this question is that people think of Hadoop as one thing, and the hype machine has decided to crunch down on that thing, saying it’s dead and you should find an alternative.
Hadoop is an ecosystem of applications used to scalably and affordably store large amounts of many types of data, process that data, and put it to work.
Unless you suddenly don’t need to store large amounts of data of multiple types or put it to work, you probably aren’t really looking for a Hadoop alternative. You’re looking for an alternative to some aspect of Hadoop that isn’t quite working for you, or a complement to Hadoop, or an improvement on Hadoop. You may not even care one way or the other about Hadoop. You’re just looking for a way to get the job done that you need to do.
Consider this: In the last ten years, every large-scale production data analytics architecture I’ve seen uses some form of scalable, affordable data storage paired with some form of data processing. A huge percentage of them are using Hadoop, whether on-premises or on the cloud. Not all, but a lot. Some are using distributed data warehouses instead, some cloud software, some a Spark stack. Some, a combination of all of those.
But wait, isn’t Hadoop dead, or something? Cloudera doesn’t seem to think so. Databricks is still doing pretty well with their Spark and Hadoop ecosystem product. My own company, Vertica, incorporated a lot of the strengths of Hadoop, and the ability to work with Hadoop ecosystem software, into its core. All those public cloud vendors have object storage that is essentially an improvement on HDFS, and they often use something that looks a lot like MapReduce to do data processing.
If you’re looking for an alternative to Hadoop, ignore the hype, and consider what aspect of Hadoop you’re really looking to replace or improve.
Is MapReduce way too slow for you, but you still like having an extensive open source ecosystem of products? Use Spark. If you need to do high scale data preparation or ETL, I’d also recommend Spark.
Is maintaining all that hardware on-premises a pain, or do you have highly variable workloads? Take your Hadoop cluster to the cloud, or better yet, take the highly variable part of your workload to the cloud so that crazy unpredictability of price only hits you when your workload gets nuts.
Having other issues like a need for faster queries or greater concurrency or workload isolation? Come talk to us at Vertica about a modern enterprise data warehouse, or a unified analytics warehouse that combines some of the strengths of both Hadoop and an enterprise data warehouse.
There isn’t one best alternative to Hadoop because Hadoop was never just one thing. Instead of listening to the voices that say Hadoop is dead, nail down what it is you genuinely need, and what aspect of Hadoop doesn’t meet those needs. That will very rapidly guide you to the right Hadoop alternative or Hadoop complement for you.