I just started working on the Vertica team a few months ago. As the “new guy,” my first few weeks of work have been largely about cramming as much Vertica information into my brain as I can in the shortest possible time. I’ve been aware of the Vertica Analytics Platform for a while. I used to work for a competitor. However, my crazy career in data management software has been focused mainly on data movement and transformation, more than storage and analysis, with an emphasis on the big data side of things for most of the last decade. So, making the shift over to evangelizing a big data analytics database has taken a little adjustment and a lot of educating myself. That’s my favorite part of working in the big data analysis world. You always have to keep learning, like being a freshman in college again just when you were afraid you might hit graduation day and have to leave. It’s awesome.
I knew Vertica had some excellent technology, but wow! Now that I’m focused on it, I’m learning some things that I did NOT know that are blowing my mind. I’m willing to bet you didn’t know either, so …
Let’s start with scale. There are a lot of aspects that go into making a technology scalable. (I submitted a talk proposal for the Texas Scalability Summit on this very subject.)
Data storage that can expand indefinitely without breaking the budget is probably the one thing people think about first when scalability comes to mind, but there’s a lot more to it. You not only have to store massive amounts of data, you have to be able to find it and do something useful with it in a reasonable amount of time. Depending on the use case, a reasonable amount of time can be anywhere from a few hours to less than a second. Unfortunately, as anyone who has used MapReduce to do pretty much anything on a cluster knows, processing data at that scale tends to be SLOW.
To be useful, you need to scale up not just storage, but processing speed along with it, so your response time stays roughly constant. You’d think scalability would be a given in the big data tech landscape, but you’d be surprised at how many supposedly scalable applications break down at a certain level. Maybe they can handle hundreds of terabytes, but choke when you get to petabytes. Maybe they can even handle petabytes, but processing speed drops as you add hardware underneath, from a near-linear increase to something that looks almost flat. So, eventually, you hit a point where adding more nodes just isn’t worth it. You can only process the data so fast.
Vertica doesn’t stop scaling, like ever, as far as I can tell. Its processing speed scales linearly no matter how big the dataset gets. They have customers doing fast queries on petabytes of data. I don’t even know if there is a top end. The kinds of companies that crunch the most data in the world, like the social media giants and the smart-sensor data companies, and the companies with the shortest SLAs for their use cases, like the cybersecurity companies, all use Vertica. Vertica marketing advertises exabyte scale, and they’re not kidding.
Scaling up to perform fast queries even on massive data is great, but if only three people can use it at a time, that has limited utility. If you’ve got a team of 20 data scientists and data analysts, or 10 teams of 10 each across the enterprise, and they all need to access that huge mass of data to do various types of analytics at speed, you’ve got a problem. The ability to handle multiple concurrent operations efficiently is the Achilles heel of most big data technologies. But if you’re using Vertica, you don’t have a problem, because Vertica has all the normal concurrency capabilities you expect from a good BI database, but at big data scale. Throw those concurrent users at it. It can take hundreds of them and still give back excellent query response speed.
Full ANSI SQL Independent of Storage Type
There are a lot of fairly new storage types out there created to try to speed things up a bit, since trying to analyze massive amounts of raw files in a file system is an exercise in extreme patience and frustration. Most of the new storage types depend on one of two strategies: indexing metadata about the data, the strategy in most object databases, or imposing structure on the data itself, usually by moving it into columnar formats like ORC, Parquet, etc. Other formats, like Avro and JSON, have been created for serializing streaming data or chunking it into messages.
A lot of the columnar, SQL-optimized storage formats come with their own special SQL engine designed to query large amounts of data at high speed. Those engines support various subsets of the full glory of standard ANSI SQL, but not the whole thing. There’s always a fair amount of SQL that they just can’t execute, and they generally only work on their own special type of stored data.
Vertica has full support for the whole range of ANSI standard SQL commands, independent of data storage type. ORC, Parquet, ROS, whatever. It supports all the columnar and streaming types. It supports IoT use cases, builds aggregates, does JOINs and all the other normal data preparation with standard SQL, returns query results fast, and can even call machine learning models from within SQL commands. More on that next.
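To make that concrete, here’s a minimal sketch of what querying non-native storage can look like. The table name, columns, and file path are all hypothetical; the general pattern is Vertica’s external-table syntax over Parquet files:

```sql
-- Hypothetical example: expose a directory of Parquet files as an
-- external table, then query it with ordinary ANSI SQL.
CREATE EXTERNAL TABLE web_clicks (
    user_id    INT,
    page_url   VARCHAR(2048),
    clicked_at TIMESTAMP
) AS COPY FROM '/data/clicks/*.parquet' PARQUET;

-- Standard SQL works the same whether the data lives in Vertica's
-- native ROS format or sits outside the database in Parquet.
SELECT page_url, COUNT(*) AS clicks
FROM web_clicks
GROUP BY page_url
ORDER BY clicks DESC
LIMIT 10;
```

The point is that the query half of the sketch is just plain SQL; nothing about the storage format leaks into it.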
Now, I always thought there was SQL in databases, and there was machine learning in clusters using compute engines like Spark. Maybe that was just my own mental block, but I thought you could only create, train, and put to work machine learning models on data that lived outside the database. Which, now that I think about it, was pretty dumb, actually. If you train a machine learning model in Spark, you first put the data into a DataFrame or RDD, which is sort of like an in-memory database table. If you have to put the data in something like a table, then why couldn’t you just use a table?
Well, one good reason is that most machine learning algorithms aren’t written that way. But Vertica has a good set of workhorse machine learning algorithms built in. Vertica has all the advanced analytics algorithms you expect, plus maybe some you wouldn’t expect, like time series, geospatial, pattern matching, and projections. You can call them with SQL just like any other SQL command, and train them on the data in the database. Plus, there’s a robust UDF framework, so you can add your own.
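Here’s a rough sketch of what in-database machine learning looks like, using Vertica’s linear regression functions. The model name, tables, and columns are made up for illustration:

```sql
-- Hypothetical sketch: train a linear regression model entirely in
-- SQL, on a table that is already sitting in the database.
SELECT LINEAR_REG('rent_model', 'listings', 'monthly_rent',
                  'square_feet, num_bedrooms');

-- Score new rows with the trained model, right inside a SELECT.
SELECT listing_id,
       PREDICT_LINEAR_REG(square_feet, num_bedrooms
                          USING PARAMETERS model_name = 'rent_model')
           AS predicted_rent
FROM new_listings;
```

No export step, no separate cluster: the training data never leaves the database, and the prediction is just another column in a query result.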
You get higher performance than Spark, especially as the data sets get bigger. You write less code. And you don’t have to move the data anywhere before you run your analyses.
And a bunch of other stuff …
I’m learning a lot of other stuff that’s cool about Vertica, like:
- Complete platform independence – on-prem, cloud, hybrid, one Hadoop vendor or another, one cloud vendor or another – it doesn’t matter.
- Intense levels of security, enough to make it a favorite with alphabet soup agencies and cybersecurity specialists.
- The huge number of other well-known technologies that have Vertica embedded inside them. Nearly a third of Vertica’s business is OEM. Chances are, you’re already using it.
- A bunch more stuff that I’m trying to cram into my brain as rapidly as possible.
There’s one more thing I found really surprising about Vertica, though, and it’s not so much about the tech itself as about the customers that use it. Companies like Twitter, Uber, Etsy, Cerner, Chevron, DreamWorks, GoodData, Intuit, New York Genome Center, Fidelis Cybersecurity, AdGear, Trane, Inovvo, Wayfair, Choice Hotels, Guess, Bank of America, and many others. Vertica customers are all over the globe, in every major industry you can think of. If a company needs to crunch a ton of data fast, Vertica has them covered.