Happy 10 Years Hadoop

Ten Years of Hadoop, Apache NiFi and Being Alone in a Crowd

Hadoop Summit in San Jose this year celebrated Hadoop’s 10th birthday. All of the folks on stage are people who contributed to Hadoop during those 10 years. One of them is Yolanda Davis.

Yolanda and I worked together on a Hortonworks project last year, where she was in charge of the user interface design and development team. I caught up with her early on the last morning of Hadoop Summit and quizzed her on the new project she’s working on, one you may have heard of: Apache NiFi. As promised, here is my interview with her about NiFi and the new HDF (Hortonworks DataFlow) streaming data processing platform, which includes NiFi, Apache Kafka and Apache Storm.

Metron Eye On Cyber Security

Cyber Security with Apache Metron and Storm

A few weeks ago at Hadoop Summit, I caught up with some friends from the project I worked on last year with Hortonworks, including Ryan Merriman, who is now an Apache Metron architect. Since I knew virtually nothing about Apache Metron beforehand, I quizzed Ryan about it. The conversation evolved into a discussion of the merits of Storm versus Flink and Heron, something I’ve been meaning to delve into here for months.

Holden Karau's audience at High Performance Spark preso at Data Day Texas

Interviews with Brilliant People on Hadoop and the Future of Big Data Tech

I have been doing some very cool interviews with brilliant people, usually at events like Strata + Hadoop World and Hadoop Summit. The intention is to use their brilliant thoughts so that I don’t have to take the extra time to come up with my own. Not to mention I get the bonus of learning new things, and getting the unique perspectives of folks who really know their stuff. Nothing like learning tech from the folks who literally wrote the book on it.

Hadoop Changes as Fast as Texas Weather

How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because, not only do you have to hire someone with the skills to build jobs on another new framework, and re-build all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

You Keep Using that Word, Real-Time

Four Really Real Meanings of Real-Time

Our director of engineering told me that a customer had asked her if we could do real-time data processing with Syncsort DMX-h. Knowing that real-time means different things to different people, she asked what exactly the customer meant by real-time. The customer said, “We want to be able to move our data out of the database and into Hadoop in real-time every two hours.”

When she told me that story, I wanted to quote Inigo Montoya from “The Princess Bride.” You keep using that word, “real-time.” I do not think it means what you think it means.

But what does real-time actually mean? And what do you really mean when you say real-time? What do other people usually mean when they say real-time? How can you tell which meaning people are using? And what the heck is near real-time?

Tungsten is Shiny

Spark with Tungsten Burns Brighter

Project Tungsten is a new thing in the Spark world. As we all know, Spark is taking over the big data landscape. But as always happens in the big data space, what Spark could do a year ago is radically different from what Spark can do today. It busted the big data sort benchmark last year, and is just getting better as it goes. A project called Tungsten represents a huge leap forward for Spark, particularly in the area of performance. But, being me, I wanted to know what Tungsten was, how it worked, and why it improved Spark performance so much.

Herbert the Syncsort Big Data mascot on a cup

Coffee Cups, Women in Tech, and Rampant Competence

2016 marks my nineteenth year in the field of data wrangling. Yee haw. (I’m from Texas. I can say that.) January also marks my first year, my first week in fact, at my new job at Syncsort. I flew up to the company headquarters in New Jersey and spent the first week of the year getting to know the new team and the new technology. Certain things jumped out at me as good signs. It started with a coffee cup.

David and Goliath

Pitching Stones with David

It’s a brand new year, and I’ve got a brand new job. As of today, you’re looking at the new Product Marketing Manager for Syncsort.

It’s true. After spending half a year doing a little freelance white paper work for the Bloor Group, and documenting for Hortonworks the most complex ETL process I’ve seen in nearly two decades in the business, I’ve found a new home to settle into. I got courted by some Goliaths in the data management software and hardware space, but in the end, I chose a tech-savvy David, Syncsort.

Big Data Analytics Miss

Four Reasons Why Big Data Analytics Projects Fail, Or Do They?

A few months back, I was presenting with a friend at a Chief Data Officer summit in Dallas, and my co-presenter put up a slide that said, “60% of all big data analytics projects fail.” Someone in the audience asked, “Why do they fail?” My friend said, “I think Paige could answer that better than I could.”

Put on the spot, three reasons that have been confirmed from multiple sources jumped immediately into my head. I used those three to answer the question. But later, when I had time to think, I realized there was one other reason that shows up repeatedly, but often gets downplayed or written off as not the REAL problem, when in my opinion, it very much is.
