(Originally posted on the Vertica blog)
Data Day Texas is an event in Austin that was started about nine years ago by an old acquaintance of mine, Lynn Bender, who founded Global Data Geeks. The one big theme that struck me as running through the whole conference was that proprietary and open source software now share a highly cooperative landscape, and that a good architect doesn’t choose sides. I’ll highlight two of my favorite presentations to give you an idea of what I mean.
Data Day has grown over the years to be a pretty intense conference on the cutting edge of data management and data science. There’s a graph track, an AI and NLP focused track, and my favorite, the data engineering and architecture track.
Keynote – Lies Enterprise Architects Tell
Gwen Shapira, Product Manager at Confluent, gave the keynote, Lies Enterprise Architects Tell. Shapira started out by wondering aloud why a bunch of people would show up to listen to her lecture about enterprise architecture at 8 AM on a Saturday, but Lynn announced the open bar about then, and she said, “Now, it makes sense.”
Shapira lamented some of the realities of being a product manager, something to the effect of cat-herding being a less demanding profession. “You can’t make software engineers do anything. The only thing you can do is convince them.”
Her presentation put a lot of emphasis on reinterpreting the big waves that have hit the data engineering space in the last decade, and on how architects lie, sometimes to themselves, about how well they’re riding those waves. The lies we tell relate to real-time systems, big data, hybrid and multi-cloud, best of breed, and the new cool thing: microservices. The lie can be that these things don’t exist, that we are doing them extremely well, or even that we understand what the heck the words mean. In the end, a good architect who understands and communicates clearly, without lying to themselves or their enterprise, can make or break a company.
“Good architects enable their businesses to not be overtaken by the next Amazon or the next Uber or the next Netflix.” – Gwen Shapira
She set the tone and the main theme for the conference: use what works best, whether that is open source or proprietary software. Don’t decide that everything has to be open source, and don’t assume that only proprietary vendors have the best features either. Using what works best for the purpose makes a lot of sense, but be careful.
One of the lies people tell themselves can be, “We always use best of breed.” What they often really mean is “We have one of everything.” You have to keep maintenance and integration costs in mind. They can negate the advantages if your architecture gets out of hand. This can be true whether you’re talking about a dozen different proprietary vendors in your stack, a dozen different open source projects in your stack, or any combination.
“Don’t just use open source. It is a good idea to take advantage of what proprietary vendors offer. But avoid lock-in by having an escape plan.” – Gwen Shapira
Shapira recommended Kafka as a great way to keep your data mobile, so you can switch from one data store to another if you need to. Kafka has become a kind of lingua franca in a lot of businesses for data in motion, so that makes sense.
Davin Potts – Choosing Sides When Choosing Tools Hurts
Another highlight of the conference for me was Davin Potts, a core Python committer who now has his own data science consultancy, Appliomics. He talked about avoiding bias when choosing your enterprise architecture. As a data science consultant, he has to be Switzerland with his clients, and use whatever fits with their architecture, not choose based on his own biases. One of the tools that helps him with that is KNIME, an open source tool that got its start in Europe as a data mining tool, especially in the pharmaceuticals industry, and has taken off from there.
Sometimes, we choose to use tools because they’re Java stack tools, and we’re used to that, or they’re in the land of R, but that might not actually be the right tool to use for the job. And, as a consultant working with many companies, he can’t afford to have that mental bias.
“If you want to learn ‘machine learning,’ all the books and classes require you to choose a language. Once you head down that path, forever will it dominate your fate.” – Davin Potts
I’ve spent a fair amount of time using KNIME, myself, and it has a very nice interface. One of my favorite features is that when you choose an operator, the documentation for that operator automatically pops up in the interface. That feature alone saved me a lot of time when I was working with the tool.
The thing Davin focused on for his lecture, though, is KNIME’s marvelously open capabilities. It can work with virtually any other technology, whether proprietary or open source.
He showed an example KNIME workflow for smart building data science that integrates Python functions and SQL for data manipulation inside relational databases. The same could be done in KNIME with R, Spark, etc. You don’t have to choose sides.
Another important point he made was that doing data manipulation and some analysis inside the database makes far more sense than trying to move the data somewhere else before you use it. We had a long talk after his presentation on a wide variety of subjects, including the vastly expanded set of things you can do inside a database without ever moving the data. (I’ll post some of that conversation later.) In-database machine learning, or other forms of advanced analytics, can save a lot of I/O and CPU time compared with moving the data out to something else and then doing your machine learning on it.
He was definitely preaching to the Vertica choir with that one.
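To make the in-database point a little more concrete, here’s a minimal sketch of the idea. It is not Davin’s workflow and it doesn’t use Vertica or KNIME; I’m assuming Python’s built-in sqlite3 module as a stand-in database, and the readings table, the room names, and the function names are all made up for illustration. The first function pulls every row out and averages in Python; the second lets the database do the averaging and only ships back the answer.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few hypothetical smart-building readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (room TEXT, temp_c REAL)")
    rows = [
        ("lobby", 21.0),
        ("lobby", 23.0),
        ("lab", 18.0),
        ("lab", 20.0),
        ("lab", 22.0),
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def avg_by_room_client_side(conn):
    """Move every row out of the database, then aggregate in Python."""
    fetched = conn.execute("SELECT room, temp_c FROM readings").fetchall()
    totals = {}
    for room, temp in fetched:
        running_sum, count = totals.get(room, (0.0, 0))
        totals[room] = (running_sum + temp, count + 1)
    averages = {room: s / n for room, (s, n) in totals.items()}
    return averages, len(fetched)  # second value: rows that left the database


def avg_by_room_in_database(conn):
    """Let the database aggregate; only one row per room comes back."""
    fetched = conn.execute(
        "SELECT room, AVG(temp_c) FROM readings GROUP BY room"
    ).fetchall()
    return dict(fetched), len(fetched)


if __name__ == "__main__":
    conn = build_demo_db()
    print(avg_by_room_client_side(conn))
    print(avg_by_room_in_database(conn))
```

Both functions give the same averages, but the second one only moves one row per room instead of every reading. With five rows that’s trivial; with billions, the difference in I/O is the whole argument.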
So, that’s a tiny taste of Data Day for this year. There were awesome presentations by a bunch of my friends, including Holden Karau, Jesse Anderson, Joey Echeverria, Steve Sarsfield, and far more people than I can do justice to in one blog post. It’s a great conference if you happen to be in Austin in January. Global Data Geeks is hosting the Texas Scalability Summit in September at the same venue. Check them out if you get the chance.
Oh, and KNIME is hiring if you’re looking.