Holden Karau's audience at High Performance Spark preso at Data Day Texas

Interviews with Brilliant People on Hadoop and the Future of Big Data Tech

I have been doing some very cool interviews with brilliant people, usually at events like Strata + Hadoop World and Hadoop Summit. The intention is to use their brilliant thoughts so that I don’t have to take the extra time to come up with my own. Not to mention I get the bonus of learning new things, and getting the unique perspectives of folks who really know their stuff. Nothing like learning tech from the folks who literally wrote the book on it.

Hadoop Changes as Fast as Texas Weather

How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because, not only do you have to hire someone with the skills to build jobs on another new framework, and re-build all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

You Keep Using that Word, Real-Time

Four Really Real Meanings of Real-Time

Our director of engineering told me that she had a customer ask if we could do real-time data processing with Syncsort DMX-h. Knowing that real-time means different things to different people, the engineer asked what exactly the customer meant by real-time. He said, “We want to be able to move our data out of the database and into Hadoop in real-time every two hours.”

When she told me that story, I wanted to quote Inigo Montoya from “The Princess Bride.” You keep using that word, “real-time.” I do not think it means what you think it means.

But what does real-time actually mean? And what do you really mean when you say real-time? What do other people usually mean when they say real-time? How can you tell which meaning people are using? And what the heck is near real-time?

David and Goliath

Pitching Stones with David

It’s a brand new year, and I’ve got a brand new job. As of today, you’re looking at the new Product Marketing Manager for Syncsort.

It’s true. After spending half a year doing a little freelance white paper work for the Bloor Group, and documenting for Hortonworks the most complex ETL process I’ve seen in nearly two decades in the business, I’ve found a new home to settle into. I got courted by some Goliaths in the data management software and hardware space, but in the end, I chose a tech savvy David, Syncsort.


The Spark that Set the Hadoop World on Fire

Spark is the darling of the open source community right now. It’s setting the Hadoop world on fire with its power and speed in large scale data processing on Hadoop clusters. Spark is one of the most active big data open source projects, has bunches of enthusiastic committers, has its own group of ecosystem applications, and is now part of most standard Hadoop distributions. Neat trick for a data processing framework that didn’t even start life as a Hadoop project.

MapReduce Clogged Pipes

Using MapReduce is Like Plumbing with Pre-Clogged Pipes

MapReduce is no longer the only way to process data on Hadoop. In fact, it’s arguably the worst Hadoop data processing framework.

By now, everyone knows how awesome Hadoop is for large scale data storage, processing and analysis. Hadoop is the darling of large scale data processing, while MapReduce keeps getting nothing but bad press and complaints that it’s too slow, too hard to use, and generally doesn’t live up to its hype. But aren’t Hadoop and MapReduce the same thing?

Water jet cutting patterns in steel

Hadoop Can’t Do That

I just got back from a little executive summit conference in Dallas for Chief Data Officers. Frustratingly, I heard a lot of folks telling me what Hadoop CAN’T do. Now, I know that Hadoop can’t bring about world peace or get my husband to put the toilet seat down, but the things people keep saying it can’t do are things that I’ve personally DONE on Hadoop clusters, so I know they’re doable.

If you asked most people if water could cut through steel, they would probably tell you it can’t. They would be wrong, too.

In-Memory wave crests in-chip wave coming

In-Memory Analytic Databases are So Last Century

In an article written last year, an industry analyst I respect, IDC’s Carl Olofson, gave the impression that in-memory analytics are the wave of the future, the new paradigm for high performance analytic databases. He said, “embrace the new paradigm and plan for it.”

For once, I didn’t agree with him.

In-memory analytics are last decade’s revolution, or even last century’s. The wave of the future is something far faster, and far more revolutionary.


Not All Hadoop Users Drop ACID

In the age of businesses with data that lives on dozens or even hundreds of servers, expecting transactional integrity, data consistency, and data currency is an old-fashioned notion. On Hadoop, you just have to settle for the new NoSQL standard of BASE and eventual consistency. That’s what they say. But, as usual, “they” are wrong. Not all Hadoop users have to drop ACID…

