Archive for Spark tag

Davin Potts, CEO Appliomics, KNIME Co-Founder, Core Python Committer

One on One with Davin Potts

At the recent Data Day Texas event, I sat down with Davin Potts, who I have known for many years, and had a long conversation about a wide variety of subjects. Over on the Vertica blog, I broke the conversation into chunks, but I wanted to put it all together in one place so you can see what we chatted about end to end.

Wide variety of programming languages and tools for data science, and how Davin Potts became a core Python committer.

Paige Roberts: Let’s start by introducing you to the blog readers who may not know who you are yet.

Davin Potts: I’m Davin Potts. I have my own consultancy based in Austin, Texas, Appliomics, where I mostly work on mathematical modeling, scientific software, and other data science related things. Sometimes that involves cool tools like KNIME. Often it involves things like Python. But honestly, it covers a wide gamut depending upon whatever tools other people are choosing to use.

I’m happy to switch and adapt. My talk (Choosing Sides When Choosing Tools Hurts) was all about that. Fortran, C, C++, Erlang. These are all fair game. And they’re being actively used by groups that I’ve done work for in just the last two years.

Roberts: Wow. That’s all over the place.

Potts: JavaScript is in there, too.

Roberts: Java, Python, R?

Potts: Java, Python and Scala, yes. R, I tend to hesitate with. I haven’t done anything with R in the last two years.

What’s the hesitation with R?

I think R is a fantastic tool. It’s made a lot of people highly effective in a short period of time, and ggplot rocks. The thing that makes me hesitant to start new projects with R is that I’ve been asked too many times to help on projects where clients built up a corpus of code in R that they have now decided they need to move away from. A common theme: as they were building up their code, they were not thinking about the architecture around it, or how to get that code to scale.

That’s not to say that R isn’t capable. It means that people have dug themselves this hole repeatedly. And more often than not, when they’re trying to switch to something else, from what I’ve seen, which is a limited view of the world, they tend to want to switch either to Python or to the C or C++ stack. For that reason, if a group is already using R, fantastic, I’m not going to talk anybody out of anything. …

But you’re not going to start out using R code?

I’m not going to start up fresh with it because, if people don’t have the mindset from the beginning of planning ahead for taking the code to production, people have been getting surprised. Groups like Revolution Analytics have tried their darnedest to deliver tools to help people achieve performance with R, and that’s helped lots of groups, but it’s not able to help everyone.

Do you see people moving into Spark, or are Python and C++ the two preferred tools?

Again, I don’t have an explanation for that, and it may purely be just what I happened to be exposed to. I haven’t met a group that decided, “We need to move out of R into Spark,” or “We need to move into Scala.” I don’t know why that is.

Facebook is one of the most publicly visible users of PHP. They have gone to the extent of writing their own compilers for it, because it was part of their framework from the ground up. They’ve invested a huge amount of effort to try and squeeze every last ounce of performance that they can out of it. They have also publicly talked about different aspects of their efforts to transition from PHP to Python. And when you see big companies like that transition to Python, it probably does influence others to think, “Ooh, I’ve got to get me some of that.”

Or maybe others are seeing the same factors that drove Facebook to make that switch.

That’s the thought as well. Facebook clearly sat down and they thought about it. They probably had no shortage of arguments over what they should switch to before they finally made that choice. And stories like that are highly influential, especially for smaller groups that don’t have the time to put a dozen people on studying that sort of thing.

I was also interested to learn today that you are one of the core committers on Python. How did that come about?

Multiple different funny ways, but it came, first of all, from doing a little too much Python coding.

[laughs] You have to watch out for that.

Secondly, I took aside one of the existing long-standing Python core committers and said, “I really think something needs to be done about this particular thing.” That person knew me pretty well, so the response that I got was not just one of “Yeah, yeah.” It was more like, “You’re right. That really needs some TLC. Would you be interested in helping in a very serious way?”

And my initial answer was “No. No, no. Not at all.”

[laughing]

That was not the purpose of this conversation. The purpose was to make you aware of an issue.

To get you to fix it, not to make me fix it.

No, I was thinking of pitching in. But he was bringing up the notion of something much more long term. I was thinking, how can I help in a short-term way.

But you got volunteered.

The idea slowly grew on me. Becoming one of the core developers is not a highly formalized process. It’s a tight-knit group. One person could easily poison the well, so to speak. Finding people with the right style of insanity, who are there to try and move Python in a positive direction, is important. On top of that, the core developers have a public perception of being highly approachable, friendly, easy-to-talk-to folks. Finding that combination of characteristics is difficult, and not something they do flippantly, even on just one person’s recommendation.

I can see where that’s a challenge.

Learn more about Python on Github.

Learn more about the Vertica-Python interface.

Advantages of KNIME for a data science consultant, and SQL for data manipulation and analysis.

Paige Roberts: So, I just attended your talk on not choosing sides when choosing tools. (Choosing Sides When Choosing Tools Hurts) As a consultant, you can’t choose sides. You have to work with whatever your customer wants. So, I know you’ve been a long-time user of KNIME, one of Vertica’s partners. You used to do talks on KNIME back when I was hosting the Austin KNIME meetups, and you used to even work at KNIME, right?

Davin Potts: I was one of the founders of the company.

Paige Roberts: One of the founders? I didn’t know that. So, what are the advantages of KNIME for a consultant who has to go in and use whatever is required?

Davin Potts: So, one of the neat payoffs, especially when starting an engagement with a new group, where not everybody in the room knows you: I’m trying to convey that I understand some of what they’re talking about. With KNIME, I’m able to make some initial traction in being able to show that understanding, not just verbally, but in a very visual way.

KNIME gives that visual presentation of “See, here I am reading in your data. Here I am transforming something about your data. Here I am calculating something new from your data. And now, I’m presenting information about the data back to you. All in that graphical interface.” It provides a really nice way to communicate first and foremost.

Whereas if I start out by writing code, no one has ever claimed that that’s an exciting or engaging way to present information. Let’s just put a bunch of code up on the screen. No. Even with tools like Jupyter Notebooks which are fantastic, you’re still struggling to explain to the non-technical people. They’re not interested in the code. They want to get past the code quickly to the graphics, to the visuals.

And with KNIME, they feel like it’s almost all approachable. They can wrap their heads around what they’re seeing at a level that they want to operate at. And if they want to delve deeper, they can. So, in terms of helping new engagements, KNIME is an excellent tool for consultants.

Roberts: Communicating complex concepts has been my job for years. The communication aspect is one of the things that I always thought was pretty impressive about KNIME. But the other aspect, you emphasized in your talk: You don’t have to pick a single stack. You want to use Spark, you want to use Python, you want to use R, you want to use Java, you want to use … whatever it is that you want to use, you can. You can put it all in a KNIME flow. And you demonstrated that.

And, of course, now that I’m working with Vertica, I was particularly interested in the emphasis you put on using SQL. You can do in-database SQL queries, and data manipulation. You don’t have to take data out of the database, then operate on the data. Just pass in SQL and go on.

Potts: Right. For that initial part of the conversation with a new client, KNIME is great. But one of the biggest issues within virtually every company is siloed data. Maybe it’s just human nature that we create these silos. For better or for worse, it’s what happens all too often. So, the ability to quickly tap into that silo is essential.

Like you were saying, as a consultant, I try to adjust to whatever it is that the client has chosen as their technology stack. And I’m happy to do that, be flexible, and contribute in a meaningful way in a lot of different tech stacks. But I can’t do them all, and there’s no hope for one person doing that. The ability to quickly tap into the silos with KNIME means I can demonstrate something, but it’s not just visuals. I can take it into a production environment, on any stack. That is something that I have done with a lot of groups, and will continue to do with KNIME.

So, it’s not just about: Give me a nice graphical, quick-feedback experience that feels rewarding. It’s actually something that they can think about taking to production as well. Not every company is going to want to do that, and that has to be okay. So when they want certain things implemented in Scala because that’s the one true language, or it has to be in Fortran because that’s the one true language (there might be a company like that, right?), that also has to be okay in the end.
If you go in trying to convince people, “Stop using your favorite tool. Use my company’s tool instead,” that is a hard slog. And a number of the other companies here as sponsors of Data Day Texas are in that game of trying to convince people: Stop using your old database. Use ours instead.

I might know something about that. Yeah.

More power to them. And I’m sure each of those tools brings some cool new features that, for the right people, are an excellent choice, but that is such a hard fight. And as a tool vendor company, they can’t be flexible in the same way as a consultant can. But a consultant can’t do some other things that they’re able to do as larger companies.

To me, now working at a specific database vendor, one of the nice things about KNIME is, even if I go in and I convince a customer that whatever database they had before, “That’s a bad idea, you should use my database. And here, let’s switch you all over to Vertica.” The key workflows that the company counts on are still going to work because KNIME works with whatever database you have, and whatever other tech you have. I think that’s powerful, that flexibility.

I think, to a very significant extent, for companies, Vertica included, to pick on them for a moment, the relationship between the database and the application developers is not always a healthy one, right? The application developer often doesn’t understand what a database can actually do for them. And to a certain extent, it’s almost like a religious belief, or lack of belief.

It can be a holy war.

And so trying to beat the application developer over the head and say, “No, Vertica will totally kick butt. It’ll do exactly what you need. You should totally use it.” Their boss may even go to them and say, “Thou shalt use Vertica.” And they may use it under protest or duress. But they may not use it in a way that really benefits them. So, you get that schism.
Some of what helps is the database tools making themselves easier to use by providing different sorts of APIs, providing things other than SQL. There are a lot of different strategies that different groups have pursued. I’m sure all of those have helped different people.

The thing we’ll still struggle with is the cost when that schism remains and the application developer side is misusing the database. The cost that we pay in terms of performance often comes from the application pulling too much data out of the database, and doing things in the application code that should’ve been done inside of Vertica.

Yeah.

Or they’re holding on to data in the application when they should let the database do its stuff…

Let the database do what databases do well.

They’re creating risk as well. They’re not able to write code that operates as fast as what Vertica is capable of because Vertica has years and years of effort in optimizations that have gone into it.

Vertica focuses on just that one thing, crunching a lot of data at optimal performance. That’s what we’re good at.

Exactly. But when we move that data across the wire, we pay a significant penalty.

Always. Yeah.

And when we transition the data from how it’s represented in the data store into the application code, there’s a translation event, so that costs CPU cycles. The transmission also costs IO cycles, and we’re paying double duty on them.

Learn more about KNIME.

Learn more about Vertica.

A cool new feature coming in the next version of Python

Davin Potts: Something new is planned as a part of the upcoming release of Python. It should be more along the lines of what I talked about earlier.

Paige Roberts: Shared memory?

Davin Potts: Shared memory is not a new idea at all. If anything is new, it’s the idea of shared memory having a modern use. The old-school version that became widespread was System V shared memory. As an indication of how old it is, they used the Roman numeral V instead of a 5.

Paige Roberts: [laughs]

Potts: Nowadays, we have somewhat more modern incarnations of it, directly derived from it, but they go by different names. POSIX shared memory is on all of the Unix platforms in a consistent way. And on Windows, because Windows sometimes feels it needs to do things differently, it’s called Named Shared Memory.

But exposing it in a language like Python, with a single consistent API that works across all of the modern platforms everybody is focused on, gives us a single consistent tool to use. It can still stay platform independent.

Roberts: Without moving your data around and translating it constantly and having that slow down.

Potts: You can avoid that cost. And especially in Python, where, to be nerdy about it, we think about having distinct processes. People tend to think about using threads to get parallel performance out of their code. It’s the go-to solution that we’ve all been taught. Writing multi-threaded code is the first thing we think of, but it’s not the only choice. And one of the reasons for its popularity is that all of the threads can see all of the same things in memory at the same time.

So, we avoid the need to translate and communicate and transmit data. That’s a huge win. The gotcha is that you can see everything in memory across all of the threads.

And manipulate it and they can bump into each other and–yeah.

And very bad things happen. So, to protect against that, we have the concept of locks and semaphores, but people also talk about, in modern languages, the concept of thread-local storage. The idea of: I can hear too many people talking in memory. Too much noise. What I need is a quiet space to be by myself. That’s thread-local storage: the things that I create there, none of the other threads can see or touch or manipulate. I need my quiet space.
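The thread-local idea Davin describes maps directly onto Python’s threading.local. Here is a minimal illustrative sketch (the worker function and values are made up for illustration, not from the conversation):

```python
import threading

# Thread-local storage: each thread gets its own private 'value',
# invisible to every other thread. The shared 'results' dict is only
# used to report back what each thread saw.
tls = threading.local()
results = {}

def worker(n):
    tls.value = n * 10          # private to this thread
    results[n] = tls.value      # no other thread can read this tls.value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results is now {0: 0, 1: 10, 2: 20}
```

Each thread sees its own tls.value even though they all touch the same tls object, which is exactly the “quiet space” behavior described above.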

Which is great, but then you can lose that advantage that you had before when the memory was being shared.

So, the idea is, with shared memory, you can create processes that don’t trip over one another and do things in parallel, where traditionally you had to transmit the data, communicate, translate it. Instead of everything being shared by default, so that you have to create a private little space for the things you really don’t want to share, it’s the flip of that. Everything is exclusive to a process, and you create a shared space where you do want to share things, so you don’t accidentally over-share.

So, you only put the things you want shared in the shared space. It makes sense.

That’s the idea. And the technique has been used to great effect for decades now, from System V to the POSIX shared memory stuff, in C and C++ especially, but that shared memory construct is accessible in lots of different areas. The focus for the next Python release is the Python module that was created as a prototype, which has been tested and beaten upon; it’s remained unchanged for six months now. It’s actually been around for closer to a year and a half, so it seems to be stable and ready for everyone’s use.
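As a rough sketch of what that module looks like in use, assuming the multiprocessing.shared_memory API as it landed in Python 3.8 (variable names here are just illustrative):

```python
from multiprocessing import shared_memory

# Create a named block of shared memory and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# Any process that knows the name can attach to the same block;
# no copy of the underlying bytes is made.
other = shared_memory.SharedMemory(name=shm.name)
data = bytes(other.buf[:5])

other.close()
shm.close()
shm.unlink()  # free the block once everyone is done with it
```

The name acts as the cross-process handle: you share the small name string, not the data itself.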

When do you think Python 3.8 is going to come out?

The releases are on a fixed schedule. They’re published long ahead of time. It’s on an 18-month release cycle. And so the release is going to be in December this year.

There’s a nice long alpha cycle and beta cycle before we actually get to that point. So the first alpha is actually going to start in a week.

Oh. So if people want to try it out and bang on it, to make sure it’s good before it goes live?

The more testers, the better.

And in the meantime, they can actually get the source code and build that directly, but people usually have to be a bit more devoted to actually want to go and do that.

It could be somebody like you. [laughs]

Yes, it turns out to be really, really easy to do. Granted, not everyone is a developer, but the task is: download the source code, unpack the tarball, and run “configure” and “make”. That should work without any other special flags on pretty much any of the platforms.

Wow.

So, yeah, there are a gazillion flags on it, but you probably don’t need to mess with them.

Just do that and it’ll probably work. And amazingly enough, the whole thing will build in, I don’t know, a minute or two, depending upon how old your system is, but, yeah. Not everybody is gonna do that. That’s okay. That’s fine.

You don’t want everybody doing that. You just want the people who are really dedicated to do that.

Honestly, I would be happy to have more people helping, testing things. I can’t imagine reaching the point where I start saying, “Yeah, okay, that’s too many.”

As far as this new feature, if you make it better for Python and you make it shared, that means it’s also better for anybody else who wants to come in and work with the same data.

So that’s the central message around that in particular. When I’ve talked with other core developers about the idea of what other people could do with it, once we’ve exposed it, meaning other people outside of the land of Python–they see the opportunities, but at the same time, they have reservations about that. Man, if we go out promising that this is some magic pill that solves everybody’s ills, that will end badly.

Overpromising can lead to a lot of disappointment and disillusionment.

Yeah, so while I see some other really interesting opportunities for using it, and that’s part of my inspiration, the idea is to make it simple for everybody to use in Python. I mean, with shared memory, even just explaining what the concept is…

It’s complicated.

It’s not something that you learn on the first day of programming class in grade school. But making it so that it’s easy for people to use, maybe as an almost everyday tool, because it just becomes part of their workflow… It doesn’t have to reach that point. But if you try to think along the lines of could it ever be useful in that way?

It helps guide your thinking so that you don’t just resign yourself to: Only the real nerds are going to poke at this, so I’m not going to bother with documentation on this one.

Ack. That’s a self-fulfilling prophecy right there. [laughs]

Yeah, exactly.

So, what could you accomplish with Python shared memory, if it was a normal thing that everyone used?

I see it first and foremost from a “running big code on parallel-capable hardware” point of view, of “This makes what I do faster, and it’s not by a little bit, it’s by a lot,” versus what I otherwise would do. It’s a huge performance increase, like orders of magnitude depending upon the situation.

But this is also for people who aren’t performance driven; they’re just trying to explore their data. I’m just trying to use Jupyter Notebooks. They can have one Jupyter Notebook up, with so much data loaded that they can’t load the data again in another Jupyter Notebook to play with it at the same time. Shared memory would mean they could have the data loaded once in memory and use it across many different notebooks.

In that case, you’re not even thinking about performance. Just, how many copies of my data can I store in memory?

It’s actually a convenience thing, where you’re constrained by the laptop in front of you. I’ve had tests in the last few months where I needed to load on the order of 20 gigs of data into memory. And I only have 16 gigs on my Mac laptop. And I knew I had to switch to another machine. I had a 32-gig machine, but I still couldn’t hold on to two copies at the same time. Shared memory gave me a way where I can actually do lots of things on the hardware that I have.

On a very basic machine.

And you say, “Well, in that case, you should go and buy a real server with a lot of memory.”

That can be damned expensive, and you can’t haul one of those around everywhere.

So, what I was doing at first was leaving the data largely on disk and caching chunks of it. The memory is swapping back and forth, and my tasks were taking pretty long. And then I had the accidental thought of, “Wait a minute, what if I use the shared memory stuff that I’ve been working on for the last year plus?” And, holy cow… I was like, “THIS is the primary feature. The performance stuff is nice, but THIS is it. This is the killer feature.”

So, you can use less memory and accomplish more things. That is huge.

Learn more about the open source Python-Vertica interface.

Learn more about how you can help with the Python 3.8 test cycle.

Get a copy of Python 3.8 alpha 2 and test.

How you can contribute to open source projects like Python in an important way without being an expert.

Davin Potts: KNIME is open source, Python is open source – there are a lot of open source projects. I think a lot of people use open source and harbor a hidden desire to contribute back in some way, but they’re hesitant because they think, “I don’t know enough,” or “I’m not good enough,” or “I need to learn a lot more,” or “I’ll do that this summer,” or something like that. It’s one of those New Year’s resolution type things that never gets kept.

Paige Roberts: Also, if it’s not part of your day job, it’s hard to convince yourself that you should work on code you’re not going to get paid for.

Potts: Exactly. And so, if as part of your day job, you use any kind of open source, even if you’re not a developer, I think it’s generally true that if you’re using some sort of an open source tool and something doesn’t behave the way you wanted it to, it didn’t do what you thought it should, it’s incredibly helpful …

Roberts: Report that.

Potts: Because you need that to work. It’s part of your job. Report, “Hey, I had this problem.”

Roberts: Nobody can fix it if they don’t know there’s a bug.

Right. And while you’re there, because it doesn’t take much to add an entry to say, “Hey, I did the following.” The hardest part is trying to explain it coherently to another human being who wasn’t sitting there watching you do it.

So they can reproduce it.

If you can describe it well enough. Thankfully, there are a lot of people who do that. But there are a lot of people who don’t describe it coherently …

It’s a lot harder than you would think to give reproducible directions.

Writing articles about things, writing up documentation for things, is a lot harder than people give it credit for. But another very impactful and helpful thing: if you’re ever going to take the time to add an entry to say, “Hey, I did the following, and it turned out to be red. I thought it should’ve been blue,” then when you add that entry, look around at some of the other things on the issue tracker. What you’ll find is other entries that don’t clearly describe what happened.

If you add an entry to those saying, “I don’t understand what you were trying to explain,” you’re already taking a huge load off of other people, because for any open source project, there are only so many people who actually work on the source code itself, and they can’t keep up with all of the issues that get opened. It helps if you review and provide even the most cursory feedback of, “I know what you’re talking about. I had the same problem the other week. It only happens when the moon is half full.”

If you stand on one leg and rotate clockwise… Yeah. [laughing]

That is so terribly helpful to be able to say, “Well, I just did that on my system and it worked for me.” You’ve just saved other people who have extremely limited time a huge amount of effort to provide that feedback, and everybody can potentially do that, and be motivated to do it as part of their day job.

And you don’t have to know a whole lot about the tech to do that. All you have to be is a user.

Right. And that’s super meaningful. The number of issues that get opened up against code that I’m supposed to be responsible for, I can’t keep up with. I don’t have enough time in the day. And even if they doubled my salary for working on that …

Double nothing is still nothing.

[laughing] Yeah. So, I can’t possibly keep up with it, even if that was the only thing that I did. So, having other people willing to spend even just an occasional bit of time helps. If I see an issue that’s been opened by an individual and no one else has commented on it, versus another issue where at least one other person, a different person, has commented, guess which one I’m going to pay attention to first?

Especially if the comment is, “Yeah, I had that problem, too. It does it every time you do x, y, z.”

It’s supposed to be a community. If it’s open source, there’s an invitation to others to participate. It doesn’t mean that they have to. They’re not under an obligation. They should never feel like it’s an obligation. But if they want to make a contribution back, and it can be justified as part of the work they do in their day job, everybody wins, including their employer.

Because you get better code to work with.

Yes. In terms of Python, I have some focused effort on making sure that the shared memory code makes it into the release, that it clears the hurdles.

It needs to get properly tested and go through those cycles.

Exactly. And so that’s pulling quite a bit of my attention there. If there are other people who are interested and excited by the idea of shared memory, who have a use case for it, double fantastic. I would love to talk to those people.

Do you want me to put contact information for you in the blog post?

Sure, if it’s Python related, my e-mail is just Davin@python.org.

Oh. Nice.

I don’t get paid, but there are a few fringe benefits. That vanity email is one.

That is a nice one.


Read related Vertica blog post: Open Source Software is Free, Like a Puppy

How open source and Vertica interact, and the new open source Python interface for Vertica.

Paige Roberts: We’re actually incredibly grateful to Uber, who is one of our customers. They created a new Python interface for Vertica and open sourced it.

Davin Potts: Oh, I have not seen that. That’s awesome.

Roberts: I just started at Vertica a few months ago and I was surprised to find out they had an open source Python interface, as well as Python UDFs so you could use custom Python algorithms inside Vertica. I thought that was pretty neat.

Potts: Nifty. That is cool.

For the KNIME stuff, there’s the integration that I showed, where you don’t just call Python code from KNIME; you can have Python code call into KNIME. If you’re a Python shop, or you love using Jupyter Notebooks, fantastic. You can see your KNIME workflows from inside of Jupyter Notebooks. You can trigger the execution of them, and effectively, it’s just another function that you call from Python, except that it runs other things that people created in KNIME.

I can see a lot of potential use cases for that. Say you really don’t like using SQL, but you know how to use KNIME to interact with databases. Do all your database things in KNIME, and it’ll use the database in a really efficient way. Then you can call it from Python, and there’s your meeting point, if that’s what you want to do.

Or you can use the Python-Vertica interface.

Whatever floats your boat, because most data stores have some sort of Python interface, whether or not that’s the first thing people think of.

If you live in a Python universe, Python is your world, and if a data store doesn’t have a Python interface, you’re not going to talk to it.

Yep.

And KNIME can be a bridge across that gap, but it’s even better if you have something that’s a little closer.

How well is Vertica promoting the existence of the Python open source interface?

The new Uber interface, Vertica-Python, came out recently. We’ve had a Python interface and a Python UDF framework for a while, but Vertica-Python is a pure Python interface. Our old one had C++ code that we built in-house for better performance, with Python bindings on top. Even internally, though, we’re moving all our support to the new one. We haven’t done a lot of promotion on it yet, but it’s one of the things I’d like to get the word out about a little more. A huge chunk of our business is OEM, so a lot of applications are Python or something else on the surface, with Vertica embedded inside.
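For readers curious what the pure Python driver looks like, here is a hedged sketch using the open source vertica-python package; the connection values and the “sales” table are placeholders for illustration, not details from the conversation:

```python
# Requires: pip install vertica-python, plus a running Vertica instance.
# All connection values and the 'sales' table below are placeholders.
conn_info = {
    "host": "127.0.0.1",
    "port": 5433,        # Vertica's default client port
    "user": "dbadmin",
    "password": "",
    "database": "vdb",
}

def totals_by_region():
    import vertica_python  # imported here so the sketch reads standalone
    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # Push the aggregation down into the database, rather than
        # pulling raw rows across the wire and summing in Python.
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
        return cur.fetchall()
```

Note the shape of the query: the database does the summing, echoing the earlier point about not pulling too much data out of the database and redoing work in application code.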

That makes sense. I knew about the OEM business, but I didn’t know relatively speaking how much time and attention they got. That is very cool.

Learn more about the intersection between Vertica and open source.

Advantages of doing machine learning inside a database

This is the final installment of my discussion with Davin Potts. I have to say it was a lot of fun catching up and talking shop.

Paige Roberts: One misconception I’d had for a long time, probably from hanging out with the Hadoop and Spark crowd, was that you need to do machine learning in something like Spark or Python. You pull data out of the database and you put it in a dataframe or something, and then you do machine learning. Then, you put your results back in the database. It was kind of an epiphany to realize, the data is already there in a table. Why move it?

Davin Potts: I’ve never seen anyone do a careful survey; all I have are anecdotes, but I get that same impression. Relatively few people are doing their machine learning work inside of the database. And I think that’ll change with time, but it’s not going to happen overnight, because whatever machine learning they were doing before, they were already doing it in a particular way.

And when they shift to doing that inside of the database, there’s also a mental shift. Like during the talk, I put up the first slide about running Python code inside of Postgres. There were actually two people in the audience who I saw do a double take. Like, “What?”

Roberts: You can’t do that.

Potts: Yeah. First, there was that. Then, I saw on their faces a “Wow,” and then there was a smirk of, “No, that’s crazy.”

Roberts: Yeah.

Potts: People need to overcome that. If they do, the reward is the performance. You’re not going to do machine learning in a database for the cool factor. You need to do it because it’s more performant.

Because you don’t have to move your data around. You don’t take the I/O and CPU hit from data movement or data transformation. And you don’t have to downsample or anything. You just leave the data where it is and do your machine learning there.
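Vertica exposes its in-database ML through SQL functions, but the core argument here is engine-agnostic: ship the computation to the data instead of the data to the computation. As a stand-in sketch, using Python’s built-in sqlite3 rather than Vertica, compare pulling every row into Python with asking the database for just the answer:

```python
import random
import sqlite3

random.seed(0)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (v REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?)",
    [(random.gauss(10, 2),) for _ in range(100_000)],
)

# Anti-pattern: move every row out of the database, then compute.
rows = [r[0] for r in con.execute("SELECT v FROM readings")]
mean_moved = sum(rows) / len(rows)

# In-database: the engine scans the data where it lives and ships
# back one number instead of 100,000 rows.
(mean_in_db,) = con.execute("SELECT AVG(v) FROM readings").fetchone()

assert abs(mean_moved - mean_in_db) < 1e-6
print(round(mean_in_db, 3))
```

On a real columnar MPP database the gap is far larger, since the aggregate runs in parallel next to the data and nothing but the result crosses the network.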

The other challenge is that, since they weren’t already doing machine learning in SQL, you’re asking them to make a transition to a new language as well. Look at any of Vertica’s competitors, any of the data stores, and there’s usually one premier language: PL/pgSQL for Postgres, PL/SQL for Oracle, and so on down the list. Some of them don’t even give you a choice beyond “Well, you get ANSI SQL, and why would you ever need anything else?” “Yeah. Okay, MySQL.”

But consider what Microsoft has started doing, embedding Python and R inside of the database. That’s a serious commitment on the part of any of those companies.

It is.

It gives more people more reasons not just to adopt the database, but to spend their whole lives inside of it. That might be a reasonable strategy, and I’m sure watching Microsoft will show us how much of a difference that actually makes between…

You don’t have to watch Microsoft. Vertica did that years ago. You can build machine learning algorithms in Vertica, in R, in Python, in Java, in SQL, in whatever it is you want to do it in. And our problem is, no one knows that.

Actually, I didn’t know that either until recently.

See? Nobody knows. And that’s a problem. It needs to change. People think, Vertica isn’t some shiny new NoSQL database, so it must not be able to do what I want, because I’m doing something cutting edge, like tracking streaming aircraft data. I just walked out of a talk downstairs where a guy described building a special database for time series analysis because “there just wasn’t a good database for that.” And I’m like, “Hello?” We’re awesome at time series analysis. His special database wasn’t even columnar. It’s not like columnar analytics databases are a brand-new idea; they’ve been around a while. If you want good performance and scale for analytics, you need to use a columnar format.

Yep.

That’s like basement level. That’s below the foundation. [laughs] It’s interesting to me, sometimes, to see this attitude about data management software right now: If it’s not open source, and it’s not brand new, it must not be any good. There is that big chunk of prejudice that established proprietary software has to overcome.

There is. It’s funny how that has shifted, right? If we go back, hmm, maybe 15 years ago, people were still using stupid phrases like “open sores.”

Yeah, yeah.

Now, oh man, now the proprietary code is the bad guy. That’s not cool either.

I was glad to hear Gwen Shapira say something about that: no, you can’t do it all in open source. Use some of the vendors’ stuff. Just be ready to escape if you need to, so you’re not locked in. Good advice.

Take the Python developers: with a few exceptions, weirdos like me who are consultants, the vast majority of them work for companies with proprietary software products.

Yeah.

Some of those products have significant open source components, and a lot of big companies have both proprietary and open source offerings. But proprietary software is part of what the open source community works on. It’s part of their day jobs. It shouldn’t feel like an either/or thing.

They should know. I mean, you would think that if you work on proprietary software, you know that it’s good software. You built it.

Yep. I think it’s the fan boys and fan girls running around wanting to rally behind a banner, creating the appearance of sides that need to be rooted for, who exacerbate the situation. There will always be people like that, but the hype machine will shift and come back to a more reasonable middle ground. When proprietary tools interact with open source tools and show a willingness to play well together, people respond; when you create the perception of “we’re against that,” people reject you.

Yeah. Microsoft figured that out.

Microsoft figured that out. They had to get rid of that guy who knew how to throw chairs in order to figure it out, but they figured it out. As for Vertica, I don’t know of a great open source story around Vertica, but I’ve never heard of Vertica being down on open source either. And embracing the fact that there are open source things you can already do inside of the database, today: I think that’s a cool story.

I really appreciate Davin Potts giving me so much of his time and his wisdom. He’s been doing data science since before anyone called it that. Thanks for reading. And if you want to try out some of the things we chatted about, here’s a link to the free Vertica Community Edition, and the equally free KNIME community edition. Happy analyzing!

Learn more about in-database machine learning in Vertica.

Learn more about doing time series analysis in Vertica Analytics Platform.

Learn more about the intersection between Vertica and open source.

Try out Vertica for free.

Read more...
Orc O'Malley of the Yellow Elephant clan says LLAP

Owen O’Malley on the Origins of Hadoop, Spark and a Vulcan ORC

Owen O’Malley is one of the folks I chatted with at the last Hadoop Summit in San Jose. I already discovered the first time I met him that he was the big Tolkien geek behind the naming of ORC files, as well as making sure that Not All Hadoop Users Drop ACID. In this conversation, I learned that Hadoop and Spark are both partially his fault, about the amazing performance strides that Hive with ORC, Tez, and LLAP has made, and that he’s a Trek geek, too.

Read more...
Metron Eye On Cyber Security

Cyber Security with Apache Metron and Storm

A few weeks ago at Hadoop Summit, I caught up with some friends from the project I worked on last year with Hortonworks, including Ryan Merriman who is now an Apache Metron architect. Since Apache Metron was a project I knew virtually nothing about beforehand, I quizzed Ryan about it. The conversation evolved into a discussion of the merits of Storm versus Flink and Heron, something I’ve been meaning to delve into for months here.

Read more...
Holden Karau's audience at High Performance Spark preso at Data Day Texas

Interviews with Brilliant People on Hadoop and the Future of Big Data Tech

I have been doing some very cool interviews with brilliant people, usually at events like Strata + Hadoop World and Hadoop Summit. The intention is to use their brilliant thoughts so that I don’t have to take the extra time to come up with my own. Not to mention I get the bonus of learning new things, and getting the unique perspectives of folks who really know their stuff. Nothing like learning tech from the folks who literally wrote the book on it.

Read more...
Hadoop Changes as Fast as Texas Weather

How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because, not only do you have to hire someone with the skills to build jobs on another new framework, and re-build all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

Read more...
You Keep Using that Word, Real-Time

Four Really Real Meanings of Real-Time

Our director of engineering told me that she had a customer ask if we could do real-time data processing with Syncsort DMX-h. Knowing that real-time means different things to different people, she asked what exactly the customer meant by real-time. He said, “We want to be able to move our data out of the database and into Hadoop in real-time every two hours.”

When she told me that story, I wanted to quote Inigo Montoya from “The Princess Bride.” You keep using that word, “real-time.” I do not think it means what you think it means.

But what does real-time actually mean? And what do you really mean when you say real-time? What do other people usually mean when they say real-time? How can you tell which meaning people are using? And what the heck is near real-time?

Read more...
Tungsten is Shiny

Spark with Tungsten Burns Brighter

Project Tungsten is a new thing in the Spark world. As we all know, Spark is taking over the big data landscape. But as always happens in the big data space, what Spark could do a year ago is radically different from what Spark can do today. It broke the big data sort benchmark record last year, and it just keeps getting better. Tungsten represents a huge leap forward for Spark, particularly in the area of performance. But, being me, I wanted to know what Tungsten was, how it worked, and why it improved Spark performance so much.

Read more...
Spark

The Spark that Set the Hadoop World on Fire

Spark is the darling of the open source community right now. It’s setting the Hadoop world on fire with its power and speed in large scale data processing on Hadoop clusters. Spark is one of the most active big data open source projects, has bunches of enthusiastic committers, has its own group of ecosystem applications, and is now part of most standard Hadoop distributions. Neat trick for a data processing framework that didn’t even start life as a Hadoop project.

Read more...
The Little Actian DataFlow Engine That Could

Actian DataFlow, the Little Hadoop Engine That Could, But Probably Won’t

In Hadoop’s ecosystem of massively parallel cluster computing frameworks, Actian DataFlow is an anomaly. It’s a powerful little engine that thinks it can take on any data processing problem, no matter the scale. The trouble is that unlike MapReduce, Tez, Spark, Storm and all of the other Hadoop engines, DataFlow is proprietary, not open source.

Read more...