In the age of businesses with data that lives on dozens or even hundreds of servers, transactional integrity, data consistency and data currency are supposedly old-fashioned notions. On Hadoop, you just have to settle for the new NoSQL standard of BASE and eventual consistency. That’s what they say. But, as usual, “they” are wrong. Not all Hadoop users have to drop ACID.
Having spent a pleasant day at the Hortonworks roadshow in Houston recently, I topped it off with a fun chat with Owen O’Malley, a Hortonworks co-founder and architect. We started with some mutual geekiness over the references in his speeches to “one platform to rule them all” and how he came to name ORC files with that particular acronym (Optimized Row Columnar, officially). Then, we moved on to discussing the changing face of programming, computers, data storage and the dizzyingly rapid progress we’ve both seen in our lifetimes. In the end, we landed on the fact that ACID is becoming a high priority in the SQL-on-Hadoop world now, where it was virtually unheard of a short time ago.
As a general rule, Hadoop users have had to give up ACID and settle for the new standard, BASE, but like so many things in the data wrangling industry, that’s changing fast. This may come as a shock to a lot of current Hadoop users and to database users considering making the switch to Hadoop, but using Hadoop doesn’t mean you have to give up your ACID habit.
Where did this ACID come from?
ACID is a database acronym that lots of folks use, but even the folks who use it a lot don’t necessarily know the origin of it, so here’s a very simplified definition. (Note from the friendly, neighborhood acronym translator: There are plenty of more in-depth definitions of ACID out there. If you want more detail, Google is your friend. Although, if you don’t want your mind completely blown, you might want to search “ACID database,” not just ACID.)
Atomicity: All parts of a transaction are treated as a single action. All are completed or none are.
Consistency: Transactions follow the rules and restrictions of the database. So, no transaction creates an invalid data state.
Isolation: No incomplete transaction can affect another incomplete transaction.
Durability: Once a transaction is committed, it is done and will persist, even if there is a system failure.
ACID is one of those remnants of the ’70s relational database revolution that we don’t really want to see go the way of headbands and bell-bottom jeans. ACID compliance means, in the tie-dyed words of Owen O’Malley, that a database provides “consistent views of changing data.” ACID created the mind-altering concept of transactional integrity that made relational databases the revolution of the 20th century when it came to data management.
ACID makes the standard CRUD (Create, Retrieve, Update, Delete) operations of a database happen in a predictable, traceable way. If you want to be able to query a historical view of data, see where the data stood last week or last month, you need ACID. If you want to track changes made on a particular row or column over time, when they were made, by whom, etc., that’s ACID. If you want data, like people’s email addresses for instance, to stay current, and always get the correct current address when you query, that’s ACID. If you have to delete old records past a certain date for compliance with a policy or law, you need ACID. If some data was inserted inaccurately and you want to be able to update it with corrected data, ACID. If you need to treat multiple changes as one action, so for instance, money can’t be deducted from one account unless it is added to the other, ACID.
Basically, if you’re accustomed to inserting, updating and deleting data as it changes, and having the data behave predictably and reliably, you’ve been doing ACID.
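For the code-minded, here’s a minimal sketch of that money-transfer example. It uses Python’s built-in sqlite3 module, which is not a Hadoop technology, just a handy ACID database to poke at. The accounts table, the names in it, and the transfer helper are all made up for illustration. The credit and the debit either both happen or neither does (atomicity), and a rule on the table refuses negative balances (consistency).

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit breaks the CHECK rule, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

# Hypothetical table and data, in memory only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # succeeds: both rows change
failed = transfer(conn, "alice", "bob", 500)  # overdraft: neither row changes
print(ok, failed, balances(conn))
```

The second transfer credits bob first and only then fails on alice’s overdraft, yet bob’s balance comes back untouched. That rollback of the half-finished work is the whole point.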
Hadoop is all about the BASE
In the age of big data, ACID just hasn’t been hip. Most NoSQL and Hadoop data stores don’t do ACID. They work on a principle called BASE. (This is another obscure database acronym, so here’s another quickie definition.)
Basically Available: Even if a compute unit fails, a node in a cluster for example, all data will still be available for queries.
Soft state: The data state can change over time, even without any additional data changes being made. This is because of eventual consistency.
Eventual consistency is really the core of the BASE concept. The trouble with trying to maintain changing data in a cluster-based data storage system is that data is replicated across multiple locations. A change that is made in one place may take a while to propagate to another place. So, if two people send a query at the same time and hit two different replicated versions of the data, they may get two different answers. Eventually, the data will be replicated across all copies, and the data, assuming no other changes are made in the meantime, will then be consistent. This is called “eventual consistency.”
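If you’d rather see that than read it, here’s a toy sketch of the idea. It is an assumption-laden model, not how any particular NoSQL store actually replicates: the EventuallyConsistentStore class, its methods, and the email example are all invented for illustration. A write lands on one replica right away and sits in a queue for the others, so two readers hitting different replicas can disagree until the queue drains.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # (replica_index, key, value) not yet applied

    def write(self, key, value, replica=0):
        # Apply the change locally, then queue it for every other replica.
        self.replicas[replica][key] = value
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, value))

    def read(self, key, replica):
        # A reader only sees whatever its replica currently holds.
        return self.replicas[replica].get(key)

    def propagate(self):
        """Apply every queued change, as background replication eventually would."""
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i][key] = value

store = EventuallyConsistentStore(replica_count=3)
store.write("email", "old@example.com")
store.propagate()                        # all replicas agree on the old address

store.write("email", "new@example.com")  # change has only reached replica 0
before = [store.read("email", r) for r in range(3)]
store.propagate()
after = [store.read("email", r) for r in range(3)]

print(before)  # replica 0 has the new address, replicas 1 and 2 still have the old one
print(after)   # every replica has the new address
```

Same query, same moment, different answers, depending on which replica you happen to hit. Keep writing before the queue drains and the replicas never fully agree.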
This concept is why BASE is considered the polar opposite of ACID. Eventual consistency assumes that data will reach a consistent, undisturbed resting state. The thing about data, though, in general, is that it never rests. It changes constantly. While eventual consistency tries to catch up, new data changes are highly likely to impact the system. This means that NoSQL databases often find themselves in that soft state where the data shifts and moves, never becoming stable.
For extremely high volume, low change, non-transactional data systems where transactional integrity isn’t really where it’s at, this works fine. It allows your data to scale without constantly checking to make sure each change passes a bunch of rules from the man. It frees your data from a lot of restrictions and gives it more freedom to grow and find itself.
But this comes at a price.
Eventual consistency fails the ACID test
In a business setting, if two people send the same query to the same data at the same time, they will expect to get the same answer. If two people ask the same question of the data and get two different answers, which answer is correct? The general reaction of humans to this situation is to not trust either answer. Then, they throw the query at the data again, and even though they haven’t changed anything, the soft state concept means that they might get yet a third answer. This has a high impact on trust.
Businesses that adopt the data lake concept have Hadoop clusters that are essentially an ever-changing dumping ground for new data. Even if only certain data is expected to be queried in a SQL fashion, that data is highly unlikely to remain static. If the data is constantly changing and being replicated out like drips of liquid with ripples working their way outward, there is never a time when the data can reach that settled, ideally consistent state. There never comes a time when the query results can be considered definitive.
ACID compliance on a cluster-based data management system means the end of “eventual consistency” and the return of data query results that are simply consistent.
What Hadoop technologies do ACID?
O’Malley leads the charge for ACID compliance on Hive and HBase at Hortonworks. Actian, where I work, has one of only two other Hadoop-based database technologies that I know of that already have ACID compliance, Actian Vortex. The other one would be the Splice Machine. The folks at Splice Machine have been proudly wearing t-shirts at various Hadoop events for years that say, “Still Doing ACID.”
HBase/Hive – That psychedelic slide I stole (with O’Malley’s permission) was from a presentation he did at Hadoop Summit 2014, Adding ACID Updates to Hive. His colleague, Alan Gates, did a presentation at Strata + Hadoop World last month in San Jose, Hive 0.14 Does ACID. Those slides give a pretty good state-of-the-union on Hive with HBase. So, go and read them. Right now, please. I’ll wait. Notice, on Gates’ third slide, “Do or Do Not, There is NO Try.” I love how that neatly and geekily summarizes the essence of ACID.
One of his other slides very emphatically says, “Not OLTP!” (Acronym translation service: Online Transaction Processing.) HBase and Hive are not meant to run transactional, operational day-to-day systems, such as POS (Point Of Sale), or ERP (Enterprise Resource Planning). The insert, update, and delete capabilities are intended to keep data current and queries consistent, not to make HBase a new Oracle on steroids.
So, that’s what Hive on HBase isn’t good for. What IS it good for? It’s absolutely awesome for time series and streaming data sets. HBase can ingest data from the fire hose like nobody’s business and store it in a Hadoop cluster for as long as you need it. I’m not saying that’s the only thing it’s good for. HBase and Hive are very versatile systems. But persisting streaming data and doing historical analysis is what it really knocks out of the park from my experience.
If you thought you couldn’t have nice, consistent, current queries on a massive, high volume, time series data set, you’re on crack. Hive is on ACID.
Splice Machine – You know that slide for Hive and HBase that says, “Not OLTP!”? Well, Splice is OLTP. Splice is the only technology on Hadoop that I know of that is intended to be used as a day-to-day, transactional, operational RDBMS (Relational Database Management System). Essentially, they ARE Oracle on steroids. More accurately, they have the kind of capabilities that people expect from an RDBMS like Oracle, only built on the affordable, industry standard hardware and scale-out architecture of Hadoop. (They’re built on a Derby and HBase core.)
It’s a crazy, wacked out, radical concept, but you really don’t have to trade off capabilities like standard ANSI SQL (Sorry, I’m done. Acronym overload. Google it.) support and ACID compliance to get affordability and scale. Folks have been under the impression that using Hadoop meant sacrificing ease of use, business intelligence tool support, ACID compliance, etc.
Nope. You can have your Hadoop and your RDBMS, too.
By the way, I am in no way affiliated with Splice Machine. What I know about them you could learn by visiting their website or chatting with them at a trade show, which you should definitely do.
Actian Vortex – On the other hand, since I work at Actian, I do know a thing or two about Actian Vortex (formerly known as Hadoop SQL Edition of the Actian Analytics Platform, but everyone got tired of saying all that. I know I did. And no one wanted to keep translating AAP – HSE either.) Actian Vortex is not an OLTP database like Splice. It’s an OLAP (Analytical, not Transactional) database. Essentially, Vortex is more like Netezza on steroids, except that Actian Vector, the single-server version of Vortex, blows Netezza out of the water with query speed. Go check out the TPC-H non-cluster benchmark records. When it comes to query response, Actian Vector pretty much blows everything else away in the small data arena. So, Vortex is essentially Vector on the affordable, scale-out Hadoop technology. (It has an Ingres core and uses HDFS and YARN.) No published benchmarks yet, but Cloudera’s Impala is already eating our dust. Watch those cluster-based TPC-H records. You can bet money that Vortex will be taking those over soon.
That’s another thing folks have thought for a while that they had to give up when they made the move to Hadoop: interactive query response speed. No one expects to be able to throw an ANSI standard SQL query into a Hadoop data source with a hundred terabytes of data and get back a nice, reliable ACID compliant answer in seconds. But they should. This technology is here, it’s mature, and it works.
What does that make it good for? It’s kicking butt at all kinds of response time sensitive analytics like financial risk analysis, stopping ATM fraud, and customer data analysis. Vortex also particularly rocks at feeding BI tools like Tableau, MicroStrategy, Actuate, Yellowfin, etc. If you’re used to being able to click a dot on Tableau and have it instantly expand out and show you all the details about your sales figures in the state of New Hampshire, but your data got too big to fit in Tableau’s single-server in-memory format, Vortex will solve your problem. If you want to throw ad hoc queries at your big data and answer new questions in a few minutes, not hours, days or weeks, that’s Vortex. If you need to do some pretty complex SQL gymnastics with your data, but you still have to be able to get answers back fast, Vortex is your drug of choice.
There’s a free version, too, Actian Vortex Express. Feel free to download that puppy and kick the tires. As they say, the first taste is always free.
(I monitor the Actian Vortex and DataFlow community forums, so if you want to chat about Vortex capabilities, or get stuck on how to do something, just ask. I’ll answer, or if I can’t, I’ll track down someone who can.)
Anyone else doing ACID on Hadoop?
Not that I know of, but I don’t have a crystal ball. Stuff changes in this business so fast it’s trippy. There may already be a database on Hadoop somewhere that does ACID that I just don’t happen to know about. Or, someone may build one soon. Ping me in the comments if you know of a Hadoop system that’s ACID compliant that I didn’t mention. I’m always learning.
So, using Hadoop doesn’t mean dropping ACID?
No, you have options. If BASE is your scene, there are a lot of good data storage and management technologies in the Hadoop ecosystem. If you need transactional integrity or you just need a consistent view of your changing data, even though it’s Hadoop elephant-sized, that option is already here and growing in maturity and diversity every day. Whether you need an ACID-compliant data system for flowing, time-related data, day-to-day transactional operational data, or low-latency analytical data, the Hadoop ecosystem can make sure you don’t have a bad trip.