Owen O’Malley is one of the folks I chatted with at the last Hadoop Summit in San Jose. I already discovered the first time I met him that he was the big Tolkien geek behind the naming of ORC files, as well as making sure that Not All Hadoop Users Drop ACID. In this conversation, I learned that Hadoop and Spark are both partially his fault, about the amazing performance strides Hive with ORC, Tez and LLAP have made, and that he’s a Trek geek, too.
You haven’t heard it until you’ve heard it from the horse’s mouth! We brought you technical vision from Hortonworks CTO, Scott Gnau. Now, here is an interesting one-on-one with Hortonworks co-founder and technical fellow, Owen O’Malley, one of the first to begin coding Hadoop. He’s still very much in it, and he had a lot to share when he sat down with Syncsort’s Paige Roberts at the last Hadoop Summit.
Paige Roberts: I have a question that I’ve been asking everybody, and it’s always fun to get the answer, especially from someone with your background. What is Hadoop for?
Owen O’Malley: [Laughs] Hadoop is for processing a lot of data on a lot of machines, all at once.
Roberts: I think that’s the most succinct, straightforward answer I’ve gotten from anyone.
O’Malley: Well, I’ve been answering that question for ten and a half years now, so, I’ve had a lot of experience!
[Laughing] I know about your contributions to ORC and some other things like Hive, but you’re pretty much Hadoop ground floor. So talk to me about where you started and how you got here.
Okay. I was working in Yahoo! Search, the team that did the back-end search for building the index. We had a large graph of the entire Web that they knew about, where every node was a URL, and all the links between them were modeled as edges in the graph. We used that to build the index. It took 800 computers a month to build each iteration of the index, and we wanted to speed that up. That project was called WebMap in a Week. Google had written the papers about MapReduce and the distributed file system. We were like, “Okay, we want one of those.” So we started a project to code up our own implementation, and we wanted to open source it. It was in C++ and we worked on it for about six months.
Then we saw the code that eventually became Hadoop, and so we dropped the code that we were doing and picked up Hadoop. Hadoop, at that point, ran on ten machines, if you were lucky. We picked up Hadoop because it would be easier to get Yahoo! legal permission to work on an existing open source project than to open source a brand-new thing. So, that old code went away. We adopted Hadoop, and I’ve been working on Hadoop ever since.
When was this?
This was 2006. Ten years ago. Actually, I made a sandbox of what the code looked like back then, and some people have been downloading and playing with it. There are pieces of it you recognize right away, but then there’s a bunch of stuff that is very different. It’s much simpler. Back then, Hadoop consisted of like 5,000 lines of Java code for HDFS and 7,000 for MapReduce. The whole thing was a relatively small project. It was basically a grad student project. Then Yahoo! worked on it, and the community picked it up since then to make it into the large project that it is today.
Talk to me about ORC. [Optimized Row Columnar]
ORC came about because we were doing the Stinger project to speed up Hive by two orders of magnitude. A lot of that work involved vectorization and pipelining the operators, because traditionally database execution engines were row by row. They would feed each row into the pipeline and then send it to the pipe point, but that involved a lot of virtual dispatch calls, a lot of if-then-else’s. All that stuff stalls the operator pipeline on the CPU every time you do it. We wanted to rewrite the engine to use vectorization. But, down at the bottom level, the storage formats didn’t have enough information, at that point, to help do the vectorization. Also, to allow you to do things like predicate push-down.
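The difference between row-by-row and vectorized execution can be sketched in a few lines. This is an illustration only, not Hive's engine: the `CountingOp` class, the two toy operators, and the batch size of 1024 are all assumptions made for the example. It counts how many times each operator is invoked, which is a rough stand-in for the per-row dispatch overhead described above.

```python
BATCH_SIZE = 1024  # assumed batch size for this sketch


class CountingOp:
    """Wraps a per-value function and counts how often the operator is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def process_row(self, value):
        self.calls += 1
        return self.fn(value)

    def process_batch(self, values):
        self.calls += 1
        return [self.fn(v) for v in values]


def run_row_by_row(ops, column):
    """Feed every value through the operator chain one at a time."""
    out = []
    for value in column:
        for op in ops:
            value = op.process_row(value)
        out.append(value)
    return out


def run_vectorized(ops, column):
    """Feed fixed-size batches through the operator chain."""
    out = []
    for start in range(0, len(column), BATCH_SIZE):
        batch = column[start:start + BATCH_SIZE]
        for op in ops:
            batch = op.process_batch(batch)
        out.extend(batch)
    return out


column = list(range(10_000))
row_ops = [CountingOp(lambda v: v + 1), CountingOp(lambda v: v * 2)]
vec_ops = [CountingOp(lambda v: v + 1), CountingOp(lambda v: v * 2)]

row_result = run_row_by_row(row_ops, column)
vec_result = run_vectorized(vec_ops, column)
row_calls = sum(op.calls for op in row_ops)
vec_calls = sum(op.calls for op in vec_ops)
print(row_calls, vec_calls)  # 20000 20
```

Both paths produce the same output; the vectorized path simply enters each operator once per batch instead of once per row.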
In predicate push-down, the execution engine says, “Okay, I only need to see records that look like this. So, for example, I only want to see records where people have more than a million dollars in their bank account.” Or, if the file is sorted on the customer ID, saying, “I just need to see IDs in this range or this particular ID.” Then the file can take that predicate and say, “Okay, I only need this one chunk of the file.”
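A minimal sketch of that idea, assuming each chunk of a file carries minimum and maximum statistics for one column. The `Chunk` class, the helper names, and the sample balances are hypothetical and are not ORC's actual layout or API; the point is only that a chunk whose maximum cannot satisfy the predicate is never read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """One chunk of a file, with min/max statistics for the balance column."""
    min_balance: int
    max_balance: int
    rows: List[Tuple[int, int]]  # (customer_id, balance)


def make_chunk(rows):
    """Build a chunk and record the statistics a writer would store with it."""
    balances = [balance for _, balance in rows]
    return Chunk(min(balances), max(balances), rows)


def scan_with_pushdown(chunks, threshold):
    """Return rows with balance > threshold, skipping chunks whose
    statistics prove that no row can match."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.max_balance <= threshold:
            continue  # ruled out from the statistics alone
        chunks_read += 1
        matches.extend(row for row in chunk.rows if row[1] > threshold)
    return matches, chunks_read


chunks = [
    make_chunk([(1, 500), (2, 12_000)]),
    make_chunk([(3, 2_500_000), (4, 900)]),
    make_chunk([(5, 75_000), (6, 999_999)]),
]
matches, chunks_read = scan_with_pushdown(chunks, 1_000_000)
print(matches, chunks_read)  # [(3, 2500000)] 1
```

Only one of the three chunks is opened for the "more than a million dollars" query; the other two are eliminated by comparing the predicate against their stored maximum.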
I would think having the data in a columnar format would help as well to only select the fields you want.
Columnar is good, but there were already columnar formats.
Yeah, so that wasn’t new.
Right, that wasn’t a new thing.
What was new was the improved use of vectorization, as well as the understanding of the type, because you need to understand what type it is in order to build an index. You don’t want your index for time stamps to get confused, because sometimes it looks like a string. You actually need to know what type it is, and the existing columnar formats were just treating every type as a bunch of bytes. So, they couldn’t build indexes and they couldn’t do the predicate push-down that we needed. That’s really where a lot of it came from.
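A small sketch of why an index needs to know the type. It uses integers rather than time stamps to keep the example short, and the function name and sample values are made up for illustration. When the values are compared as text, the recorded maximum is wrong for numeric purposes, and a range predicate can discard a chunk that actually contains matching rows.

```python
def may_contain_greater(stat_max, literal):
    """A chunk can hold a value > literal only if its recorded max exceeds it."""
    return stat_max > literal


raw_values = ["3", "20", "100"]  # the column as it sits in the file, untyped
literal = 50                     # predicate: value > 50

# Untyped: statistics and comparison both operate on text.
untyped_max = max(raw_values)
untyped_keep = may_contain_greater(untyped_max, str(literal))

# Typed: the writer knows the column is an integer.
typed_max = max(int(v) for v in raw_values)
typed_keep = may_contain_greater(typed_max, literal)

print(untyped_max, untyped_keep)  # 3 False
print(typed_max, typed_keep)      # 100 True
```

The untyped index would skip this chunk even though it contains 100, so the query would silently lose a row. With the type known, the statistics are ordered the way the values are.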
What was the timeline on that? When did you come out with that?
I think it was three and a half years ago now.
So, it’s relatively young still.
In the grand scheme of things, yeah, it’s relatively young.
Except in the Hadoop ecosystem where that’s practically ancient.
Syncsort’s been doing a lot with HCat to ORC. How did HCat support get going?
We did HCat to ORC relatively quickly after ORC came out. Actually, if you’re still using that, you should probably move over to the more native interface, because HCat throws its own translation layer in the middle. So, you’ll get more performance if you go to ORC directly.
We’ve been considering that. One of the advantages of the HCat layer is the metadata. It’s really important to our customers to track that metadata.
The other thing that’s interesting about ORC is now there’s the C++ reader for ORC, which is significantly faster than even the Java reader. That’s another option, because you guys are in C++.
Syncsort is written in C++, yes. How do you use the C++ reader?
It’s just another library. It’s pure C++ and doesn’t reference the Java at all. It was originally done between Hortonworks and Vertica, because Vertica wanted to be able to access ORC files in HDFS from their Vertica engine without incorporating any Java into their process engine. It’s really hard to control the virtual memory space that Java allocates.
That makes sense.
We’re finding that a lot of partners with C++ engines are pulling it in, so they can access it directly without dealing with Java.
What about YARN? If some of your apps are in C and some are in Java, how do you manage the resources?
That actually is okay. YARN will allocate you resources, and then as long as your application knows how to talk to YARN and request resources, it doesn’t actually care whether it’s C++ or Java.
It basically says, you can run stuff over there, and it’s up to you what you run.
What do you think about Spark?
Spark is really amazing. It’s a very different paradigm than a lot of the traditional Hadoop stuff. I went to Berkeley and gave a talk to their machine learning group, so …
It’s your fault! [laughing]
It’s my fault, yeah. As a result of that talk, they ended up making Spark.
The only way to do [distributed machine learning] before then would’ve been to do it in MapReduce and then chain a bunch of MapReduce jobs together. Really, what Spark was made for was that kind of iterative algorithm where you need to load up the data, do some processing, and then send the results across to all the workers, and then repeat. In MapReduce, the only way to do that is to make a series of jobs that’s 50 or 100 long.
And really slow.
It’s going to be really slow because you have to do all the setup over and over again. It doesn’t work well. That’s really where Spark came from. So, especially for those iterative use cases, Spark does an amazing job. The thing Spark did really well is they have really nice developer APIs. If you’re programming against it, it’s got very, very nice APIs.
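The contrast can be sketched without either framework. This is plain Python, not Spark or MapReduce code, and the `Storage` class and the toy `step` function are assumptions made for the example. It models one part of the repeated setup cost: a chain of independent jobs reloads its input on every iteration, while a loop that keeps the data in memory loads it once.

```python
class Storage:
    """Stand-in for a distributed file system that counts full reads."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def step(points, estimate):
    """One iteration: move the estimate halfway toward the mean of the data."""
    mean = sum(points) / len(points)
    return estimate + 0.5 * (mean - estimate)


def chained_jobs(storage, iterations):
    """Each iteration is a separate job that starts from storage again."""
    estimate = 0.0
    for _ in range(iterations):
        points = storage.load()
        estimate = step(points, estimate)
    return estimate


def cached_loop(storage, iterations):
    """Load the data once, then iterate over the in-memory copy."""
    points = storage.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(points, estimate)
    return estimate


data = [2.0, 4.0, 6.0, 8.0]
chained_storage, cached_storage = Storage(data), Storage(data)
chained_result = chained_jobs(chained_storage, 50)
cached_result = cached_loop(cached_storage, 50)
print(chained_storage.reads, cached_storage.reads)  # 50 1
```

Both versions compute the same estimate; the chained version simply pays the load cost fifty times. Real job chains also pay scheduling and startup costs per job, which this sketch does not model.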
Which is how you end up with a nice ecosystem around it.
So, what about Tez?
Tez is very much the extension of MapReduce. It was our attempt to take the problems of MapReduce and the limitations of MapReduce and …
Yeah, fix them. When people say MapReduce is dead, well yeah, it was killed by Tez. Tez is an amazing thing.
Have you seen any of the LLAP demos? LLAP combined with Tez is really amazing. I was just seeing a demo today that combines LLAP and Tez. For a table with six billion rows, it’s coming back in under a second out of Hive.
You actually get very, very fast response. LLAP stands for “Live Long and Process.”
[Laughing] You jumped from Tolkien to Star Trek!
Yeah. I really need an orc doing the Vulcan hand sign just to tie the whole ORC and LLAP thing together.
Love that idea. [So, I created one.]
LLAP basically has the servers up and pre-warmed. Part of what we noticed is that when you spin up a new JVM, not only does the spin-up take a lot of time, the process also starts before the HotSpot compiler has kicked in. So, it’s not until after your process has been running a second that the HotSpot compiler has gone in and optimized the parts of your Java executable that are getting run a lot.
You have both effects. It takes a while to get going, and then you don’t run as quickly until HotSpot kicks in. LLAP fixes that because it leaves the servers up and shares them across users. It also caches in memory the files and columns, because it actually understands the columnar format that the data is in. Then the processing runs across the copy that’s in memory, if it has a copy. Otherwise, it will fetch it out of HDFS. So, you get these amazing speed-ups, especially when you combine Tez and LLAP together.
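The caching behavior described here can be sketched as a shared, long-lived cache keyed by file and column. This is an illustration under assumptions, not LLAP's implementation: the `ColumnCache` class, the `fetch_from_storage` function, and the in-memory dictionary standing in for HDFS are all hypothetical.

```python
class ColumnCache:
    """Long-lived cache of column data, keyed by (file path, column name)."""

    def __init__(self, fetch):
        self._fetch = fetch  # called only on a cache miss
        self._cache = {}
        self.misses = 0

    def get(self, path, column):
        key = (path, column)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(path, column)
        return self._cache[key]


# Stand-in for columnar files in a distributed file system.
FAKE_STORAGE = {
    ("sales.orc", "amount"): [10, 20, 30],
    ("sales.orc", "region"): ["east", "west", "east"],
}


def fetch_from_storage(path, column):
    """Read a single column; a columnar format makes this possible."""
    return FAKE_STORAGE[(path, column)]


cache = ColumnCache(fetch_from_storage)
total_a = sum(cache.get("sales.orc", "amount"))  # first user: fetched from storage
total_b = max(cache.get("sales.orc", "amount"))  # second user: served from memory
print(total_a, total_b, cache.misses)  # 60 30 1
```

Because the cache works per column, a query that touches only `amount` never pulls `region` into memory, and a second user asking for the same column is served from the copy that is already there.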
That’s amazing. What about streaming processing? I just saw the Capital One presentation, and they have all of these processors they’re dealing with. It’s getting crazy out there.
[Laughing] It’s a very dynamic field. That’s part of what’s exciting, though. It’s really exciting watching the whole ecosystem evolve so quickly. That’s really how you see whether an ecosystem is healthy or not. Watching how much activity is happening. Are new products coming in that can do different things?
It’s vibrant and alive.
What makes Hadoop such an exciting field to be in right now is watching all that activity.
With streaming, there are a lot of different contenders. I haven’t looked at them in a whole lot of detail, but I’m hearing really good things about Flink. We’ll see. Of course, with just general data in motion, Apache NiFi is a fascinating project.
It is. My friend Yolanda is working on it.
It’s headed in great directions.
Well, thanks for taking time to talk to me. I appreciate it.