In an article written last year, an industry analyst I respect, IDC’s Carl Olofson, gave the impression that in-memory analytics are the wave of the future, the new paradigm for high performance analytic databases. He said, “embrace the new paradigm and plan for it.”
For once, I didn’t agree with him.
In-memory analytics are last decade’s revolution, or even last century’s. The wave of the future is something far faster, and far more revolutionary.
I think the biggest swell of the in-memory analytics sea change has already hit. Those of us more focused on cluster-based computing are still feeling the crest of it wash over us now, but it hit the single server market ages ago. The new, larger wave that we’re just starting to feel the pull of will be in-chip analytics. That’s the new paradigm that analytics software will need to embrace and plan for. That’s the big wave that new technologies will be able to surf all the way in. Only a few database technologies have caught on to this new wave that’s coming, though. I predict that you’ll see more and more in the coming years until, by the middle of the next decade, any analytics database that doesn’t have in-chip vector processing will be caught in the undertow.
Way Back Machine Set to RDBMS
To show the pattern of data processing technology waves over time, I’m going to go into a little of the history of databases. If you set the way back machine to the 1970’s, it was an exciting time in data processing. Computers were really taking off, and people were working on finding the best software strategies to make this new technology crunch data efficiently.
The relational database eventually won out as the best way to organize, analyze and process data. Ingres database, arguably the father of relational databases, spawned Sybase and SQL Server and Postgres and others. Oracle and IBM spawned their own branches of the database family. Most of these wonderful new RDBMS’s (Relational Database Management Systems) were designed to hold only small amounts of data in RAM (Random Access Memory, or just “memory”) at a time. They also only passed one piece of data at a time through the chip cache, the chip’s own special high speed chunk of memory. This was necessary, not because of limitations of software technology, but because of limitations of hardware, and economics. RAM was expensive and could generally only hold a tiny fraction of the data that the disks could hold. Chips on affordable computers were scalar processing SISD (Single Instruction, Single Data) chips. They could do one operation on one data point per clock cycle.
In-Memory Databases Were Born
In the 80’s and 90’s, the in-memory database came along. The in-memory version of IBM’s DB2 database launched in the early 90’s. This revolution was made possible in part by IBM’s big iron computers with a comparatively large amount of RAM. The computers were not cheap, but the capabilities they provided, including the speed of in-memory data processing, were worth the extra dollar signs. They were especially worth the money to large companies who dealt with “big data” before anyone had coined the term.
Even today, in-memory databases still process data one piece at a time through scalar SISD chips, but they store the data in memory, not on disk. Reading data from memory is easily ten times faster than reading it from disk. As a prime example of this, the TimesTen in-memory database, named for its order of magnitude improvement in data processing speed, was launched in the mid 90’s. Oracle bought it in 2005, and it’s been part of their Oracle Exalytics appliance for the last decade.
In-Memory Wave Crests
A little over a decade ago, the price of RAM started dropping like a rock, while its capacity kept improving. Normal, industry standard hardware suddenly came with a lot more memory. And upgrading the memory was no longer a pricey proposition. So, in-memory data processing technology, the cutting edge revolutionary concept of the 90’s, found its heyday. Most analytics appliances and high performance single server analytic databases store data in-memory these days, or at least have the option of running in memory or spilling to disk if the memory gets too full.
Beyond data processing speed, there are a lot of other advantages to an in-memory database. Flexibility of schema is one advantage that Carl Olofson pointed out in that article I mentioned earlier, The In-Memory Database Revolution. (A columnar data structure can offer some of those same schema flexibility strengths, but that’s another blog post.)
There are also some weaknesses to the in-memory database concept, mainly in the Durability requirement of ACID. (If you don’t know that acronym, go read my other post: Not All Hadoop Users Drop ACID.) Over the last couple of decades, a lot of those weaknesses have been addressed, though, as the technology has matured. Ways to add resiliency and failure tolerance to in-memory data structures have been invented. Ways to spill data to disk when the RAM is overwhelmed have become a common feature for in-memory databases. Of course, you lose the performance gains if you have to do that, but seriously, RAM is cheap nowadays. Just add more memory.
It still costs quite a bit to upgrade all the memory in a large cluster of servers, but even that is becoming less of a limitation. As clusters take over the world of big data analytics, in-memory cluster databases like MemSQL and Spark SQL are moving into that world and establishing themselves. That is the point where the wave is still cresting. Large scale, cluster-based in-memory data analysis is still young, cool technology.
But on single servers, in-memory database technology has already been out-classed and passed up.
Vector Processing Makes a Splash
The majority of single server TPC-H benchmarks for the last five years haven’t been set by in-memory analytics databases. They’ve been set by an in-chip analytics database.
I’m going to jump back to the mid 80’s again for just a minute. The Cray supercomputers, the superstars of the 80’s computing scene, did something different. They used better chips than other computers of the time, vector processing chips called SIMD (Single Instruction, Multiple Data) chips. In order to take advantage of these cutting edge, awesome chips, Cray designed very different data processing software from everyone else. The chip cache, the chip’s own memory space, allowed data to be passed in as a vector, a single dimensional array of data. That container full of data could all be processed in one clock cycle. To use it, the software on the Cray supercomputers stored and processed all the data in vectors. This gave them a huge speed edge over every other technology of the time, including in-memory technologies, but it also made their supercomputers super-expensive and made the data crunching software super-complicated.
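To make the scalar versus vector difference a little more concrete, here is a toy Python sketch. It is only a counting model, not real SIMD: Python doesn't issue vector instructions on its own, and the four-lane width (LANES), the function names and the sample numbers are all made up for illustration. All it does is count how many "instructions" each style would need to do the same work.

```python
# Toy model: count "instructions" for scalar (SISD) vs. vector (SIMD) adds.
# LANES is a hypothetical vector width, not a property of any real chip.
LANES = 4


def scalar_add(a, b):
    """One add per data item: one 'instruction' for every pair of values."""
    out = []
    instructions = 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def vector_add(a, b, lanes=LANES):
    """One 'instruction' per vector of up to `lanes` items."""
    out = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # On a real SIMD chip these lane-wise adds would happen together;
        # here we only model that they cost a single instruction.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return out, instructions


prices = [10, 20, 30, 40, 50, 60, 70, 80]
taxes = [1, 2, 3, 4, 5, 6, 7, 8]

scalar_result, scalar_ops = scalar_add(prices, taxes)
vector_result, vector_ops = vector_add(prices, taxes)

print(scalar_result == vector_result)  # same answer either way
print(scalar_ops, vector_ops)          # 8 scalar steps vs. 2 vector steps
```

In this model, eight values take eight scalar steps but only two vector steps. Real hardware speedups depend on a lot more than lane count, so treat the numbers as an illustration of the idea, not a benchmark.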
Hardware Makes Its Own Wave
Just as the rapid drop in RAM prices triggered the cresting of the in-memory database wave, a sea change in chip design could have triggered a similar wave of in-chip databases, but it didn’t. Today, you don’t need a Cray supercomputer to get advanced SIMD vector processing chips. Intel and AMD and all the chip manufacturers now put them on the affordable industry standard servers, desktops and laptops that everyone uses. My Lenovo laptop, my Dell laptop, my MacBook, and my little HP 4-node mini cluster all have SIMD chips. All of the big hardware guys have been putting these powerful supercomputer chips in regular computers since the 90’s. They did it to support smoother and faster graphics rendering. Video game software and graphics accelerators all process data in vectors through the chip cache.
Processing data through chip cache gives an average of not just one, but two orders of magnitude speed increase for data processing. No kidding. Vector processing of data through chip cache can give near a hundred times the performance of scalar SISD, one data item at a time, processing.
So, the obvious question is, why isn’t all analytics software built to process data in the chip cache?
The Tide Gets Turned by Some Really Smart Dutch Folks
The trouble is that, while storing data in memory isn’t that big a leap in software technology over storing it on disk, processing data in chip cache requires a whole new way to store, move, retrieve and generally handle data. The data has to all be in vectors, like neat little organized containers full of data of the same type, in order for the SIMD chips to process it at the crazy speeds they can handle. Relational databases don’t store data in vectors.
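Here is a rough Python sketch of that difference in data layout. It assumes a tiny made-up product table, and the names (rows, price_column, VECTOR_SIZE) are hypothetical, not taken from any real database. The first function walks the table a row at a time, the way a classic row store would. The second works on a single column stored as one contiguous array of same-typed values and consumes it in fixed-size vectors. Python won't turn this into SIMD instructions, so it only shows the shape of the data handling, not the speed.

```python
from array import array

# Row-oriented layout: each record keeps mixed types side by side.
rows = [
    (1, "widget", 2.50),
    (2, "gadget", 4.00),
    (3, "gizmo", 1.25),
    (4, "doohickey", 3.00),
    (5, "thingamajig", 0.75),
]


def sum_prices_row_at_a_time(table):
    """Touch every whole row just to read one field from it."""
    total = 0.0
    for row in table:
        total += row[2]
    return total


# Column-oriented layout: one contiguous array of doubles, all the same type.
price_column = array("d", (row[2] for row in rows))

# Hypothetical vector size; a real engine would size this to fit chip cache.
VECTOR_SIZE = 2


def sum_prices_vector_at_a_time(column, vector_size=VECTOR_SIZE):
    """Consume the column in small same-typed vectors, one vector per step."""
    total = 0.0
    for start in range(0, len(column), vector_size):
        vector = column[start:start + vector_size]
        total += sum(vector)
    return total


row_total = sum_prices_row_at_a_time(rows)
vector_total = sum_prices_vector_at_a_time(price_column)
print(row_total, vector_total)  # both approaches give the same total
```

The point of the second layout is that every vector is a neat container of one data type, which is the form a SIMD chip needs before it can work on several values at once.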
“These kind of video instruction enhancements that were introduced into common CPUs, like Intel CPUs and AMD CPUs and PCs, were not really used very well by database systems.” – Peter Boncz interview on BEye by Ron Powell
Just as the new century was being born, a new way to process data was being created. The X100 (Times a hundred) project got going in Amsterdam at CWI. (I don’t do Dutch acronyms, but it basically means institute for math and computer science.) A brilliant guy named Dr. Peter Boncz (the guy who architected MonetDB) and some of his equally brilliant PhD students decided to find a way to make an OLAP database system two orders of magnitude faster than a standard RDBMS. They came up with lots of nifty shortcuts and software tricks, but their real breakthrough was … (you should know this by now) … vector processing of data in SIMD chip cache. Their 1999 research paper, “Database architecture optimized for the new bottleneck: Memory access” won a bunch of awards, and launched a project that will change the way data is processed forever.
In 2008, the Vectorwise project was spun off from the MonetDB project. Ingres Corporation bought Vectorwise, and later changed their name to Actian. In 2010, Actian used the knowledge gained from the X100 project to turn the father of open source relational databases, Ingres, into the fastest BI and analytics database ever created, Actian Vectorwise, later renamed Actian Vector. For the last five years, they’ve been sweeping the TPC-H benchmarks, making all the hardware guys look good by showing off what modern chips can really do.
Riding the In-Chip Analytics Wave
Actian certainly isn’t the only data crunching software company out there who has figured out which way the tide is flowing. Bruno Aziza wrote a smart piece over on the Sisense blog, Bet Your Chips on In-Chip Analytics! Sisense is going so far as to programmatically specify which chip cache is the best one to process specific types of data. I would be very surprised to see any new analytics databases appear on the market from this point forward that DON’T process vector data in chip cache.
In-chip vector-based analytics is the future of single server analytic databases. – Me, just now.
But what about clusters? It took years for in-memory technology to start making headway on cluster-based computing systems like Hadoop. I don’t think it will take that long for in-chip analytics software. It turns out that while it’s hard to port in-memory databases onto clusters, it’s relatively easy to port in-chip databases onto clusters. The same SIMD chips that are in my laptop are in thousand-node Hadoop clusters. Last year in June at Hadoop Summit, Actian announced Actian Vortex, the analytics platform that is, at its heart, the single server superstar Actian Vector ported into the Hadoop cluster operating system. And, since it’s vector-based, it immediately left the fastest non-vector Hadoop SQL querying technology in the dust.
So, there I was at Strata + Hadoop World a few months ago, and I overheard some folks talking about a project incorporating this new, cool idea, vector processing. (Peter Boncz was one of the speakers at Hadoop Summit last year.) I’d be stunned if there weren’t some open source programmers out there working their tails off right now putting vector processing technology into a new Hadoop project. The wave is coming, and it’s going to be a big one.