Of all the open source SQL on Hadoop options, I think Presto is the most technically sound. It has a lot of what you need in a SQL interface on top of Apache Hadoop data lakes. But part of the Hadoop hype has been that it will replace the data warehouse. As people have bumped into the shortcomings of the SQL on Hadoop options, a lot of that hype has faded. But, why?
Presto is the best of the SQL on Hadoop open source bunch. Why not just use it and ditch your analytical database?
Well, Uber tried that, and it didn’t work out so well. Uber’s data architecture evolution over the last few years is an excellent illustration of why Apache Hadoop by itself hasn’t worked as a data analytics platform for most companies.
What Does Uber Have to Do With This?
First of all, they have made a pretty typical journey over the last five or six years from standard data warehouse to data lake to something new that is, in many ways, better than both. Uber is a good, illustrative example of what is happening in the data analytics space across multiple industries and use cases.
Second, they’re very public about their data analytics architecture. Nearly every Strata data convention I’ve been to in the last several years has had an Uber data engineer or two or three presenting on the latest tweak they’ve made to the architecture. And when they finish adding new capabilities, they open source a lot of their innovations, so even the code is publicly available. Since I attended a lot of those lectures over the years and still have my notes, and a lot of the slides are on SlideShare, they’ve given everyone a front-row seat to their architecture’s evolution. A lot of companies are shy about that sort of thing; Uber is an open book.
And lastly, Uber has been a data-driven company from day one. They collect massive amounts of a wide variety of data types and do a lot of different kinds of data analytics. They have over 100 petabytes of data these days, but it doesn’t just sit around in archives. They use data for everything from rider retention to spend analysis, A/B testing to real-time fraud detection, business performance to location predictions. They’re the perfect example of putting data to work for the business in a practical, rather than theoretical, way.
All In On Hadoop Data Lake
So, let’s set the wayback machine to 2014. Uber is a data warehouse kind of place. Here’s their architecture at the time (presented at Strata 2016).
Even in 2014, they were on the cutting edge, running that Vertica data warehouse on Amazon’s cloud. They were doing ad-hoc data analytics, data science, running applications, and generally running their company from the data warehouse. But unless you’ve been living under a rock, you know that Uber has grown since then. A lot. More data, different kinds of data, and a greater need for more advanced data analytics led them, as it did many, to the Hadoop data lake.
Their data rapidly exploded into petabytes. Their concurrent users increased, too. They had thousands of ops folks and hundreds of data scientists hitting that data with queries daily. That was on top of the normal business intelligence, interface experimentation, and in-database ETL.
So, back in 2016, they did a presentation at Strata that essentially said they were putting all their chips in on Hadoop. They were switching everything over to the Hadoop data lake with Presto for the SQL analytics part.
But they ran into some limitations.
In 2016, the above illustration was the ideal goal. No more analytical database. All in on Hadoop. Uber was leading the charge in that imaginary holy war that pretty much everyone believed in back then, especially Silicon Valley startups. The “rules” said a company had to be either all open source or all proprietary, no compromises. Uber stood solidly on the side of open source.
This was the way.
And, with their massive increase in the classic three V’s of data, volume, velocity, and variety, a Hadoop data lake was arguably the only way forward for them. A data warehouse alone couldn’t serve all their needs. So, they jumped on the crest of the Hadoop wave and planned to ride it to victory.
Except it didn’t work out.
In 2019, at Strata, they did a presentation that referenced what their architecture in 2015 and 2016 actually looked like:
Photo of Uber 2019 Strata NYC presentation slide from my notes. (Presentation: Creating an extensible 100+ PB real-time Big Data Platform by unifying storage & serving)
They were running both their data warehouse and their data lake side by side.
There were a lot of factors, a lot of things that a distributed data warehouse can do well that Hadoop can’t, but I’m going to drill down on one major aspect.
Let’s look at the strengths and weaknesses of SQL on Hadoop in general, and Presto in particular as its shining star.
Presto SQL on Hadoop Strengths
One of the reasons Presto has largely won the open source SQL on Hadoop war is that its creators made some smart technology choices under the covers. The first smart thing they did was support ANSI standard SQL, instead of trying to invent their own or telling everyone they could get by with just a subset. People have been doing business intelligence in SQL for a long time now. Business intelligence visualization tools, and a huge army of data analysts, do not want your pseudo-SQL or your semi-SQL, thanks.
Presto pipelines everything, avoiding the slow MapReduce paradigm of writing to disk between each operation. It vectorizes data when possible, like Spark with Tungsten, to take advantage of chip cache for greater speed.
Presto on Parquet is particularly excellent. Parquet has a lot of smart choices built into the data format already. Uber improved Presto’s performance on Parquet by adding predicate pushdown with column pruning, so you don’t waste resources reading data that’s not part of your query.
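To make that concrete, here’s a toy sketch in Python of how predicate pushdown over Parquet-style row-group min/max statistics, combined with column pruning, avoids reading data that can’t match the query. This is an invented illustration, not Presto’s actual code; the row groups, statistics, and column names are all made up.

```python
# Simulated Parquet file: row groups carrying min/max stats per column.
# (Invented data; real Parquet stores these stats in its footer metadata.)
row_groups = [
    {"stats": {"fare": (1.0, 9.0)},
     "columns": {"fare": [2.0, 5.5, 9.0], "city": ["SF", "NY", "LA"]}},
    {"stats": {"fare": (10.0, 42.0)},
     "columns": {"fare": [10.0, 42.0], "city": ["SF", "SF"]}},
]

def scan(groups, wanted_columns, column, low, high):
    """Return rows for `wanted_columns` where low <= column <= high,
    skipping whole row groups whose min/max stats rule them out."""
    out = []
    for group in groups:
        gmin, gmax = group["stats"][column]
        if gmax < low or gmin > high:   # predicate pushdown: skip the group
            continue
        # column pruning: only touch the columns the query asked for
        cols = {name: group["columns"][name] for name in wanted_columns}
        for i, value in enumerate(group["columns"][column]):
            if low <= value <= high:
                out.append({name: cols[name][i] for name in wanted_columns})
    return out

print(scan(row_groups, ["city", "fare"], "fare", 10.0, 100.0))
```

The first row group is skipped entirely because its fare statistics rule it out, and only the requested columns are ever read, which is the whole point of the optimization.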
They also did some smart things to handle the evolution of data schemas over time, an inevitability of working with data you didn’t necessarily create. And, they did their best to address some of the single point of failure limitations created by the Hadoop name node by making their metadata footprint as small as possible.
One big advantage is that Presto does a decent job of managing its resources. This is a biggie, because some of the SQL on Hadoop options that have dominated the market have also fed the market’s disappointment by doing a bad job of this, or by depending on YARN to do it. Poor resource management results in limited concurrent user capacity.
If you want your company to be data-driven, you need to let a lot of people in the company use the data.
Frequent query failures and huge drops in performance with more concurrent users are not something a company can tolerate in their data analysis tool, but those problems have haunted SQL on Hadoop query engines.
Presto kicks out queries right away if they demand more resources than are available, and it has a sensible queuing system for concurrent queries that the system can’t handle all at once. It’s not in the ballpark of the concurrency possible in modern analytical databases. Even so, it’s way better than any other open source SQL on Hadoop option.
Presto SQL on Hadoop Weaknesses
On the other hand, some of Presto’s application architecture is not so smart. One thing they did was try to do everything in memory. They were going for the performance advantages, but the larger and more complex the query, the more likely this strategy is to backfire. In its earlier days, Presto didn’t have a spill-to-disk capability at all. It still has only a few situations in which it can do that, and performance drops like a stone in those circumstances. In most cases, if a query exceeds the set memory limits, it simply fails.
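A minimal sketch of that failure mode, with invented names and numbers, looks like this: operators reserve memory as the query runs, and with no spill-to-disk path, crossing the limit kills the whole query rather than slowing it down.

```python
# Toy model of an all-in-memory query engine (invented, not Presto internals).
QUERY_MEMORY_LIMIT = 100  # arbitrary units; an assumed per-query cap

class QueryExceededMemoryLimit(Exception):
    pass

def run_query(operator_costs, limit=QUERY_MEMORY_LIMIT):
    """Each operator reserves memory as it runs; with no spill-to-disk
    fallback, exceeding the limit fails the entire query."""
    reserved = 0
    for cost in operator_costs:
        reserved += cost
        if reserved > limit:
            raise QueryExceededMemoryLimit(f"needed {reserved}, limit {limit}")
    return "finished"

print(run_query([30, 40, 20]))       # fits within the budget
try:
    run_query([60, 70])              # a big join blows the budget
except QueryExceededMemoryLimit as e:
    print("query failed:", e)
```

Note there is no partial result: the query either fits in memory end to end or dies, which is exactly the fault-tolerance gap described above.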
This can cause some major issues because Presto doesn’t have any fault tolerance on long running queries. If it fails, it fails completely.
Another unwise decision was giving Presto a master/slave architecture in which the Presto coordinator is a single server. That’s a single point of failure, one of the big weaknesses of a lot of big data technologies, and I’m a little surprised to see the mistake repeated here.
Presto prioritizes shorter queries, which means you get good performance on those. But longer queries may wait for a very long time before the larger amount of resources they need are free because shorter queries constantly jump the line.
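The starvation effect is easy to sketch. This toy scheduler (an invented illustration, not Presto’s real scheduling logic) always runs the waiting query with the smallest estimated cost, so a big join that arrived first still finishes dead last:

```python
import heapq

def schedule(queries):
    """queries: list of (arrival_order, name, estimated_cost).
    Always runs the waiting query with the smallest estimated cost,
    so short queries constantly jump the line ahead of long ones."""
    heap = [(cost, order, name) for order, name, cost in queries]
    heapq.heapify(heap)
    finished = []
    while heap:
        _cost, _order, name = heapq.heappop(heap)
        finished.append(name)
    return finished

# The long query arrived first but runs last behind every dashboard query.
print(schedule([(0, "big_join", 500), (1, "dash1", 5),
                (2, "dash2", 5), (3, "dash3", 7)]))
```

In a real system new short queries keep arriving while the long one waits, which makes the starvation even worse than this static example suggests.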
This particularly causes grief when doing data preparation jobs, like joining and aggregating for machine learning or other analyses. Presto has extreme difficulty doing, say, a cross join on a large fact table. Available memory limits what Presto can do, so some of the most advanced and demanding analyses are out of reach.
Some of the things Presto can’t do well that Uber needed to do, and could do in the analytical data warehouse:
- Machine learning – there’s a Presto add-in, but it’s not very mature
- Batch processing
- Geographic joins, prep, and analyses – Uber did some workarounds for this, and contributed some improvements
That last one was big. Geospatial analytics is a big part of Uber’s data analytic workload. But the huge joins required tend to overload memory. They found workarounds, and improved Presto as much as they could, but it was still a weakness. Vertica, the analytical data warehouse Uber uses, has geospatial analytics and other machine learning capabilities built in.
Uber compensated for Presto’s shortcomings by using Spark for their data science team, and for a lot of their ETL as well, since much of the data preparation work required joins of large tables, aggregations, and other demanding operations that Presto couldn’t handle. So, in 2017, 2018, and 2019, their architecture looked like this (source: multiple Uber presentations on SlideShare):
But surely an in-memory query engine gets excellent performance, right? Well, it doesn’t suck. It definitely beats other open source SQL on Hadoop options like Hive.
How does Presto performance stack up against an analytical data warehouse? Uber uses the Vertica database, which can handle the Uber scale, and Vertica is known for its performance speed. But, we’re talking about SQL on Hadoop data like ORC and Parquet, not data stored in optimized form in an analytical database. So, while Vertica queries scream compared to Presto queries, it’s not an apples to apples comparison.
Ironically, Presto doesn’t blow away Vertica’s SQL on Hadoop option for querying Parquet and ORC like you’d think. It does reasonably well, better in some ways and worse in others. But if there’s one thing you’d expect a SQL on Hadoop query engine to do better than a database, it’s querying Hadoop data.
In-house, Vertica’s engineers did some benchmarking to see how their SQL on Hadoop capability stacked up against some of the open source SQL on Hadoop options like Presto, and Hive with LLAP.
As far as performance, Presto did well.
The other important aspect of query response, though, is failure. In a practical situation, getting your query back slowly is always preferable to not getting a response to your question at all. A higher percentage of query failures is, in my opinion, the biggest flaw in trying to use a SQL on Hadoop query engine in place of an analytical database.
When it comes to failed query percentage, Presto didn’t do so well.
Hive on LLAP did better, but Vertica’s SQL on Hadoop nailed it.
Of course, TPC-DS queries are specifically designed to test the limits of SQL technologies. On a practical workload, your mileage may vary. But the big point is, even when doing SQL on Parquet, rather than its own internal database format, Vertica succeeds more often than even the best other option.
So, why not just use Vertica for all of it? Well, Presto does more than just query ORC and Parquet. It queries Elasticsearch, thanks to an Uber contribution, as well as Cassandra, MongoDB, and some other Hadoop ecosystem formats. Vertica has connectivity to some of those, but can’t query them directly.
So, what’s the best possible data analytic architecture choice? Well, let’s look at Uber.
What Does Uber’s Data Architecture Look Like Now?
So, Uber did the data warehouse thing with Vertica, and they did the data lake thing with Presto. Neither architecture alone had all the capabilities they needed. And they’ve been growing and growing and growing. Their data analysis problems are not getting any easier.
I saw a presentation Uber did at one of the last Stratas, in late 2019 in New York. Strata conferences are sadly a COVID casualty. I always looked forward to them, and will miss them. At that last one, Uber talked about the latest tweaks to their data architecture, but the focus was on speeding up ETL to reduce data latency. Plans for the future focused on improving overall data quality.
The fast query response requirements were still handled by the Vertica database. That hadn’t changed since 2014. They were still using a modern data warehouse. They were also still using a Hadoop data lake with Presto.
Photo of Uber 2019 Strata NYC presentation slide from my notes. (Presentation: Creating an extensible 100+ PB real-time Big Data Platform by unifying storage & serving)
This past March, 2020, just as the COVID craziness hit, Uber did a presentation at the Vertica Big Data Conference, which was switched to virtual, instead of in-person as planned. Uber talked about how they’d expanded their use of Vertica to the point where they have three large Vertica clusters now instead of just one. Instead of phasing out the data warehouse, they expanded it.
The multiple Vertica clusters serve different teams with different types of queries. Two large clusters both hold the most commonly queried data. The third, smaller cluster holds less frequently queried information.
They built some stuff themselves, like a proxy that points queries to the right cluster, and a manager and manifest to keep track of which data went where. The point is to avoid resource conflicts. If a long-running query needs to hog all the resources, then shorter running queries can go use the other cluster for a while until it’s done.
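The idea behind that homegrown proxy can be sketched in a few lines of Python. The cluster names, manifest layout, and routing rule here are all invented for illustration; Uber’s actual proxy and manifest are internal tools.

```python
# Hypothetical manifest: which clusters hold which tables.
CLUSTER_MANIFEST = {
    "trips_hot": ["cluster_a", "cluster_b"],   # common data lives on both big clusters
    "trips_archive": ["cluster_c"],            # rarely queried data on the small one
}

def route_query(table, busy):
    """Pick a cluster that holds `table`, preferring one not currently
    tied up by a long-running, resource-hogging query."""
    candidates = CLUSTER_MANIFEST[table]
    free = [c for c in candidates if c not in busy]
    return (free or candidates)[0]

busy_clusters = {"cluster_a"}  # a long-running query is hogging cluster_a
print(route_query("trips_hot", busy_clusters))      # routed to cluster_b
print(route_query("trips_archive", busy_clusters))  # only cluster_c has it
```

Even this crude rule captures the benefit: short interactive queries get steered away from whichever cluster a heavy job has monopolized.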
Vertica has great resource management, with extremely rare query failures. Still, it makes practical sense to isolate different workloads. One person’s ad-hoc query that hogs a ton of resources won’t slow down another team’s BI report drill-downs. One person training a machine learning model won’t delay another person generating daily reports.
Uber is doing a lot of work with Vertica, including taking advantage of the built-in geospatial analytics capabilities.
Since Vertica saw more and more of their customers needing workload isolation like Uber does, they built sub-clustering capabilities into their new Vertica version 10. Long-running joins, aggregations and other data preparation functions can run on one sub-cluster, another can do ad-hoc BI, while another runs dashboards, and another trains machine learning models.
So, Uber’s next architecture tweak will undoubtedly be to take advantage of that to get similar benefits to what they were getting, but without a bunch of extra code to make it work, and with easier sub-cluster management. They’re talking about moving some of their workload to the Google Cloud, too.
So, what about Presto? Did Uber just abandon their data lake and SQL on Hadoop?
Not hardly. This is one of my own slides from the talk I’ve been doing at various virtual conferences this year on patterns in successful IoT data architectures. It pulls together Uber’s architecture and puts it in one image. The style is my own, not Uber’s, but I think I’ve captured the high points.
Uber has what I’d call a Cooperative Architecture, combining the strengths of data lake and data warehouse by having both working side by side each doing what it does best.
Vertica works and plays well with the Hadoop ecosystem, so it fits right in with the open source guys. It has a Spark connector, a Kafka connector, the ability to auto-parse JSON and Avro. It can query, import and export Parquet, even complex data types like Maps, Arrays, and Structs. It also has its own internal storage format, optimized for blazing fast query responses, far beyond what any SQL on Hadoop technology can provide (even their own), and in-database machine learning, including PMML and TensorFlow integration. Uber even made a better Python interface for it, vertica-python, and open sourced it, like they do.
A lot of companies are finding that Cooperative Architecture style useful, but I think it’s a transitory, in-between kind of stage. Over time, what I see happening is the data lake and the data warehouse merging into a single new way to do data analytics that combines advantages of both. I’m calling it the Unified Analytics Warehouse. EMA is on board with that vision of the future, and sees a lot of companies racing to provide the technology to make it work.
I expect, in another year or two, I’ll see Uber presenting at a conference with a cool new data analytics architecture. I look forward to COVID being a thing of the past, so I’ll get to see that in person.