Archive for Spark tag

Davin Potts, CEO Appliomics, KNIME Co-Founder, Core Python Committer

One on One with Davin Potts

At the recent Data Day Texas event, I sat down with Davin Potts, who I have known for many years, and had a long conversation about a wide variety of subjects. Over on the Vertica blog, I broke the conversation into chunks, but I wanted to put it all together in one place so you can see what we chatted about end to end.

Wide variety of programming languages and tools for data science, and how Davin Potts became a core Python committer.

Paige Roberts: Let’s start by introducing you to the blog readers who may not know who you are yet.

Davin Potts: I’m Davin Potts. I have my own consultancy based in Austin, Texas, Appliomics, where I mostly work on mathematical modeling, scientific software, and other data science related things. Sometimes that involves cool tools like KNIME. Often it involves things like Python. But honestly, it covers a wide gamut depending upon whatever tools other people are choosing to use.

I’m happy to switch and adapt. My talk (Choosing Sides When Choosing Tools Hurts) was all about that. Fortran, C, C++, Erlang. These are all fair game. And they’re being actively used by groups that I’ve done work for in just the last two years.

Roberts: Wow. That’s all over the place.

Potts: JavaScript is in there, too.

Roberts: Java, Python, R?

Potts: Java, Python and Scala, yes. R, I tend to hesitate with. I haven’t done anything with R in the last two years.

What’s the hesitation with R?

I think R is a fantastic tool. It’s made a lot of people highly effective in a short period of time, and ggplot rocks. The thing that makes me hesitant to start new projects with R is that I’ve been asked too many times to help on projects where clients built up a corpus of code in R that they have now decided they need to move away from. A common theme: as they were building up their code, they were not thinking about the architecture around it, or how to get that code to scale.

That’s not to say that R isn’t capable. It means that people have dug themselves this hole repeatedly. And more often than not, when they’re trying to switch to something else, from what I’ve seen, which is a limited view of the world, they tend to want to switch either to Python or to the C or C++ stack. For that reason, if a group is already using R, fantastic, I’m not going to talk anybody out of anything. …

But you’re not going to start out using R code?

I’m not going to start up fresh with it because, if people don’t have the mindset from the beginning of planning ahead for taking the code to production, people have been getting surprised. Groups like Revolution Analytics have tried their darnedest to deliver tools to help people achieve performance with R, and that’s helped lots of groups, but it’s not able to help everyone.

Do you see people moving into Spark, or are Python and C++ the two preferred tools?

Again, I don’t have an explanation for that, and it may purely be just what I happened to be exposed to. I haven’t met a group that decided, “We need to move out of R into Spark,” or “We need to move into Scala.” I don’t know why that is.

Facebook is one of the most publicly visible users of PHP. They have gone to the extent of writing their own compilers for it, because it was part of their framework from the ground up. They’ve invested a huge amount of effort to try and squeeze every last ounce of performance that they can out of it. They have also publicly talked about different aspects of their efforts to transition from PHP to Python. And when you see big companies like that transition to Python, it probably does influence others to think, “Ooh, I’ve got to get me some of that.”

Or maybe others are seeing the same factors that drove Facebook to make that switch.

That’s the thought as well. Facebook clearly sat down and they thought about it. They probably had no shortage of arguments over what they should switch to before they finally made that choice. And stories like that are highly influential, especially for smaller groups that don’t have the time to put a dozen people on studying that sort of thing.

I was also interested to learn today that you are one of the core committers on Python. How did that come about?

Multiple different funny ways, but it came, first of all, from doing a little too much Python coding.

[laughs] You have to watch out for that.

Secondly, I took aside one of the existing long-standing Python core committers and said, “I really think something needs to be done about this particular thing.” That person knew me pretty well, so the response that I got was not just one of “Yeah, yeah.” It was more like, “You’re right. That really needs some TLC. Would you be interested in helping in a very serious way?”

And my initial answer was “No. No, no. Not at all.”

[laughing]

That was not the purpose of this conversation. The purpose was to make you aware of an issue.

To get you to fix it, not to make me fix it.

No, I was thinking of pitching in. But he was bringing up the notion of something much more long term. I was thinking, how can I help in a short-term way.

But you got volunteered.

The idea slowly grew on me. Becoming one of the core developers is not a highly formalized process. It’s a tight-knit group. One person could easily poison the well, so to speak. Finding people with the right style of insanity, who are there to try and move Python in a positive direction, is important. On top of that, the core developers have a public perception of being highly approachable, friendly, easy-to-talk-to folks. Finding that combination of characteristics is difficult, and not something they do flippantly, even on just one person’s recommendation.

I can see where that’s a challenge.

Learn more about Python on Github.

Learn more about the Vertica-Python interface.

Advantages of KNIME for a data science consultant, and SQL for data manipulation and analysis.

Paige Roberts: So, I just attended your talk on not choosing sides when choosing tools. (Choosing Sides When Choosing Tools Hurts) As a consultant, you can’t choose sides. You have to work with whatever your customer wants. So, I know you’ve been a long-time user of KNIME, one of Vertica’s partners. You used to do talks on KNIME back when I was hosting the Austin KNIME meetups, and you used to even work at KNIME, right?

Davin Potts: I was one of the founders of the company.

Paige Roberts: One of the founders? I didn’t know that. So, what are the advantages of KNIME for a consultant who has to go in and use whatever is required?

Davin Potts: So, one of the neat payoffs, especially when starting an engagement with a new group, where not everybody in the room knows you: I’m trying to convey that I understand some of what they’re talking about. With KNIME, I’m able to make some initial traction in being able to show that understanding, not just verbally, but in a very visual way.

KNIME gives that visual presentation of “See, here I am reading in your data. Here I am transforming something about your data. Here I am calculating something new from your data. And now, I’m presenting information about the data back to you. All in that graphical interface.” It provides a really nice way to communicate first and foremost.

Whereas if I start out by writing code, no one has ever claimed that that’s an exciting or engaging way to present information. Let’s just put a bunch of code up on the screen. No. Even with tools like Jupyter Notebooks which are fantastic, you’re still struggling to explain to the non-technical people. They’re not interested in the code. They want to get past the code quickly to the graphics, to the visuals.

And with KNIME, they feel like it’s almost all approachable. They can wrap their heads around what they’re seeing at a level that they want to operate at. And if they want to delve deeper, they can. So, in terms of helping new engagements, KNIME is an excellent tool for consultants.

Roberts: Communicating complex concepts has been my job for years. The communication aspect is one of the things that I always thought was pretty impressive about KNIME. But the other aspect, you emphasized in your talk: You don’t have to pick a single stack. You want to use Spark, you want to use Python, you want to use R, you want to use Java, you want to use … whatever it is that you want to use, you can. You can put it all in a KNIME flow. And you demonstrated that.

And, of course, now that I’m working with Vertica, I was particularly interested in the emphasis you put on using SQL. You can do in-database SQL queries, and data manipulation. You don’t have to take data out of the database, then operate on the data. Just pass in SQL and go on.

Potts: Right. For that initial part of the conversation with a new client, KNIME is great. But one of the biggest issues within virtually every company is siloed data. Maybe it’s just human nature that we create these silos. For better or for worse, it’s what happens all too often. So, the ability to quickly tap into that silo is essential.

Like you were saying, as a consultant, I try to adjust to whatever it is that the client has chosen as their technology stack. And I’m happy to do that, be flexible, and contribute in a meaningful way in a lot of different tech stacks. But I can’t do them all, and there’s no hope for one person doing that. The ability to quickly tap into the silos with KNIME means I can demonstrate something, but it’s not just visuals. I can take it into a production environment, on any stack. That is something that I have done with a lot of groups, and will continue to do with KNIME.

So, it’s not just about: Give me a nice graphical, quick-feedback experience that feels rewarding. It’s actually something that they can think about taking to production as well. Not every company is going to want to do that, and that has to be okay. So when they want certain things implemented in Scala because that’s the one true language, or it has to be in Fortran because that’s the one true language (there might be a company like that, right?), that also has to be okay in the end.
If you go in trying to convince people, “Stop using your favorite tool. Use my company’s tool instead,” that is a hard slog. And a number of the other companies here as sponsors of Data Day Texas are in that game of trying to convince people: Stop using your old database. Use ours instead.

I might know something about that. Yeah.

More power to them. And I’m sure each of those tools brings some cool new features that, for the right people, are an excellent choice, but that is such a hard fight. And as a tool vendor company, they can’t be flexible in the same way as a consultant can. But a consultant can’t do some other things that they’re able to do as larger companies.

To me, now working at a specific database vendor, one of the nice things about KNIME is, even if I go in and I convince a customer that whatever database they had before, “That’s a bad idea, you should use my database. And here, let’s switch you all over to Vertica.” The key workflows that the company counts on are still going to work because KNIME works with whatever database you have, and whatever other tech you have. I think that’s powerful, that flexibility.

I think, to a very significant extent, for companies, Vertica included, to pick on them for a moment, the relationship between the database and the application developers is not always a healthy one, right? The application developer often doesn’t understand what a database can actually do for them. And to a certain extent, it’s almost like a religious belief, or lack of belief.

It can be a holy war.

And so trying to beat the application developer over the head and say, “No, Vertica will totally kick butt. It’ll do exactly what you need. You should totally use it.” Their boss may even go to them and say, “Thou shalt use Vertica.” And they may use it under protest or duress. But they may not use it in a way that really benefits them. So, you get that schism.
Some of what helps is the database tools making themselves easier to use by providing different sorts of APIs, providing things other than SQL. There are a lot of different strategies that different groups have pursued. I’m sure all of those have helped different people.

The thing we’ll still struggle with is the cost when that schism remains and the application developer side is misusing the database. The cost that we pay in terms of performance often comes from the application pulling too much data out of the database, and doing things in the application code that should’ve been done inside of Vertica.

Yeah.

Or they’re holding on to data in the application when they should let the database do its stuff…

Let the database do what databases do well.

They’re creating risk as well. They’re not able to write code that operates as fast as what Vertica is capable of because Vertica has years and years of effort in optimizations that have gone into it.

Vertica focuses on just that one thing, crunching a lot of data at optimal performance. That’s what we’re good at.

Exactly. But when we move that data across the wire, we pay a significant penalty.

Always. Yeah.

And when we transition the data from how it’s represented in the data store into the application code, there’s a translation event, so that costs CPU cycles. The transmission also costs IO cycles, and we’re paying double duty on them.

Learn more about KNIME.

Learn more about Vertica.

A cool new feature coming in the next version of Python

Davin Potts: Something new is planned as a part of the upcoming release of Python. It should be more along the lines of what I talked about earlier.

Paige Roberts: Shared memory?

Davin Potts: Shared memory is not a new idea at all. If anything is new, it’s the idea of shared memory having a modern use. The old-school version that became widespread was System V shared memory. As an indication of how old it is, they used the Roman numeral V instead of a 5.

Paige Roberts: [laughs]

Potts: Nowadays, we have somewhat more modern incarnations of it, directly derived from it, but they go by different names. POSIX shared memory is on all of the Unix platforms in a consistent way. And on Windows, because Windows sometimes feels it needs to do things differently, it’s called Named Shared Memory.

But exposing it in a language like Python, with a single consistent API that works across all of the modern platforms everybody is focused on, gives us a single consistent tool to use. It can still stay platform independent.

Roberts: Without moving your data around and translating it constantly and having that slow down.

Potts: You can avoid that cost. And especially in Python, where, to be nerdy about it, we think about having distinct processes. People tend to think about using threads to get parallel performance out of their code. It’s the go-to solution that we’ve all been taught. Writing multi-threaded code is the first thing we think of, but it’s not the only choice. And one of the reasons for its popularity is that all of the threads can see all of the same things in memory at the same time.

So, we avoid the need to translate and communicate and transmit data. That’s a huge win. The gotcha is that you can see everything in memory across all of the threads.

And manipulate it and they can bump into each other and–yeah.

And very bad things happen. So, to protect against that, we have the concept of locks and semaphores, but people also talk about, in modern languages, the concept of thread-local storage. The idea of: I can hear too many people talking in memory. Too much noise. What I need is a quiet space to be by myself. That’s thread-local storage: the things that I create there, none of the other threads can see or touch or manipulate. I need my quiet space.
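The thread-local idea Davin describes maps directly onto Python’s threading.local. Here is a minimal illustrative sketch (the worker function and values are made up for illustration, not from the conversation):

```python
import threading

# Thread-local storage: each thread gets its own private 'value',
# invisible to every other thread. The shared 'results' dict is only
# used to report back what each thread saw.
tls = threading.local()
results = {}

def worker(n):
    tls.value = n * 10          # private to this thread
    results[n] = tls.value      # no other thread can read this tls.value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results is now {0: 0, 1: 10, 2: 20}
```

Each thread sees its own tls.value even though they all touch the same tls object, which is exactly the “quiet space” behavior described above.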

Which is great, but then you can lose that advantage that you had before when the memory was being shared.

So, the idea is, with shared memory, you can create processes that don’t trip over one another and do things in parallel, where traditionally you had to transmit the data, communicate, translate it. Instead of everything being shared by default, so that you have to create a private little space for the things you really don’t want to share, it’s the flip of that. Everything is exclusive to a process, and you create a shared space where you do want to share things, so you don’t accidentally over-share.

So, you only put the things you want shared in the shared space. It makes sense.

That’s the idea. And the technique has been used to great effect for decades now, from System V to the POSIX shared memory stuff, in C and C++ especially, but that shared memory construct is accessible in lots of different areas. The focus for the next Python release is the Python module that was created as a prototype, which has been tested and beaten upon; it’s remained unchanged for six months now. It’s actually been around for closer to a year and a half, so it seems to be stable and ready for everyone’s use.
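As a rough sketch of what that module looks like in use, assuming the multiprocessing.shared_memory API as it landed in Python 3.8 (variable names here are just illustrative):

```python
from multiprocessing import shared_memory

# Create a named block of shared memory and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# Any process that knows the name can attach to the same block;
# no copy of the underlying bytes is made.
other = shared_memory.SharedMemory(name=shm.name)
data = bytes(other.buf[:5])

other.close()
shm.close()
shm.unlink()  # free the block once everyone is done with it
```

The name acts as the cross-process handle: you share the small name string, not the data itself.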

When do you think Python 3.8 is going to come out?

The releases are on a fixed schedule. They’re published long ahead of time. It’s on an 18-month release cycle. And so the release is going to be in December this year.

There’s a nice long alpha cycle and beta cycle before we actually get to that point. So the first alpha is actually going to start in a week.

Oh. So if people want to try it out and bang on it, to make sure it’s good before it goes live?

The more testers, the better.

And in the meantime, they can actually get the source code and build that directly, but people usually have to be a bit more devoted to actually want to go and do that.

It could be somebody like you. [laughs]

Yes, it turns out to be really, really easy to do. Granted, not everyone is a developer, but the task is: download the source code, unpack the tarball, and run “configure” and “make”. That should work without any other special flags on pretty much any of the platforms.

Wow.

So, yeah, there are a gazillion flags on it, but you probably don’t need to mess with them.

Just do that and it’ll probably work. And amazingly enough, the whole thing will build in, I don’t know, a minute or two, depending upon how old your system is, but, yeah. Not everybody is gonna do that. That’s okay. That’s fine.

You don’t want everybody doing that. You just want the people who are really dedicated to do that.

Honestly, I would be happy to have more people helping, testing things. I can’t imagine reaching the point where I start saying, “Yeah, okay, that’s too many.”

As far as this new feature, if you make it better for Python and you make it shared, that means it’s also better for anybody else who wants to come in and work with the same data.

So that’s the central message around that in particular. When I’ve talked with other core developers about the idea of what other people could do with it, once we’ve exposed it, meaning other people outside of the land of Python–they see the opportunities, but at the same time, they have reservations about that. Man, if we go out promising that this is some magic pill that solves everybody’s ills, that will end badly.

Overpromising can lead to a lot of disappointment and disillusionment.

Yeah, so while I see some other really interesting opportunities for using it, and that’s part of my inspiration, the idea is to make it simple for everybody to use in Python. I mean, with shared memory, even just explaining what the concept is…

It’s complicated.

It’s not something that you learn on the first day of programming class in grade school. But making it so that it’s easy for people to use, maybe as an almost everyday tool, because it just becomes part of their workflow… It doesn’t have to reach that point. But if you try to think along the lines of could it ever be useful in that way?

It helps guide your thinking so that you don’t just resign yourself to: Only the real nerds are going to poke at this, so I’m not going to bother with documentation on this one.

Ack. That’s a self-fulfilling prophecy right there. [laughs]

Yeah, exactly.

So, what could you accomplish with Python shared memory, if it was a normal thing that everyone used?

I see it first and foremost from a “running big code on parallel-capable hardware” point of view, of “This makes what I do faster, and it’s not by a little bit, it’s by a lot,” versus what I otherwise would do. It’s a huge performance increase, like orders of magnitude depending upon the situation.

But this is also for people who aren’t performance driven; they’re just trying to explore their data. I’m just trying to use Jupyter Notebooks. They can have one Jupyter Notebook up, with so much data loaded that they can’t load the data again in another Jupyter Notebook to play with it at the same time. Shared memory would mean they could have the data loaded once in memory and use it across many different notebooks.

In that case, you’re not even thinking about performance. Just, how many copies of my data can I store in memory?

It’s actually a convenience thing, where you’re constrained by the laptop in front of you. I’ve had tests in the last few months where I needed to load on the order of 20 gigs of data into memory. And I only have 16 gigs on my Mac laptop. And I knew I had to switch to another machine. I had a 32-gig machine, but I still couldn’t hold on to two copies at the same time. Shared memory gave me a way where I can actually do lots of things on the hardware that I have.

On a very basic machine.

And you say, “Well, in that case, you should go and buy a real server with a lot of memory.”

That can be damned expensive, and you can’t haul one of those around everywhere.

So, what I was doing at first was leaving the data largely on disk and caching chunks of it. The memory is swapping back and forth, and my tasks were taking pretty long. And then I had the accidental thought of, “Wait a minute, what if I use the shared memory stuff that I’ve been working on for the last year plus?” And, holy cow… I was like, “THIS is the primary feature. The performance stuff is nice, but THIS is it. This is the killer feature.”

So, you can use less memory and accomplish more things. That is huge.

Learn more about the open source Python-Vertica interface.

Learn more about how you can help with the Python 3.8 test cycle.

Get a copy of Python 3.8 alpha 2 and test.

How you can contribute to open source projects like Python in an important way without being an expert.

Davin Potts: KNIME is open source, Python is open source – there are a lot of open source projects. I think a lot of people use open source and harbor a hidden desire to contribute back in some way, but they’re hesitant because they think, “I don’t know enough,” or “I’m not good enough,” or “I need to learn a lot more,” or “I’ll do that this summer,” or something like that. It’s one of those New Year’s resolution type things that never gets kept.

Paige Roberts: Also, if it’s not part of your day job, it’s hard to convince yourself that you should work on code you’re not going to get paid for.

Potts: Exactly. And so, if as part of your day job, you use any kind of open source, even if you’re not a developer, I think it’s generally true that if you’re using some sort of an open source tool and something doesn’t behave the way you wanted it to, it didn’t do what you thought it should, it’s incredibly helpful …

Roberts: Report that.

Potts: Because you need that to work. It’s part of your job. Report, “Hey, I had this problem.”

Roberts: Nobody can fix it if they don’t know there’s a bug.

Right. And while you’re there, because it doesn’t take much to add an entry to say, “Hey, I did the following.” The hardest part is trying to explain it coherently to another human being who wasn’t sitting there watching you do it.

So they can reproduce it.

If you can describe it well enough. Thankfully, there are a lot of people who do that. But there are a lot of people who don’t describe it coherently …

It’s a lot harder than you would think to give reproducible directions.

Writing articles about things, writing up documentation for things, is a lot harder than people give it credit for. But another very impactful and helpful thing: if you’re ever going to take the time to add an entry to say, “Hey, I did the following, and it turned out to be red. I thought it should’ve been blue,” then when you add that entry, look around at some of the other things on the issue tracker. What you’ll find is other entries that don’t clearly describe what happened.

If you add an entry to those saying, “I don’t understand what you were trying to explain,” you’re already taking a huge load off of other people, because for any open source project, there are only so many people who actually work on the source code itself, and they can’t keep up with all of the issues that get opened. It helps if you review and provide even the most cursory feedback of, “I know what you’re talking about. I had the same problem the other week. It only happens when the moon is half full.”

If you stand on one leg and rotate clockwise… Yeah. [laughing]

That is so terribly helpful to be able to say, “Well, I just did that on my system and it worked for me.” You’ve just saved other people who have extremely limited time a huge amount of effort to provide that feedback, and everybody can potentially do that, and be motivated to do it as part of their day job.

And you don’t have to know a whole lot about the tech to do that. All you have to be is a user.

Right. And that’s super meaningful. The number of issues that get opened up against code that I’m supposed to be responsible for, I can’t keep up with. I don’t have enough time in the day. And even if they doubled my salary for working on that …

Double nothing is still nothing.

[laughing] Yeah. So, I can’t possibly keep up with it, even if that was the only thing that I did. So, having other people willing to spend even just an occasional bit of time helps. If I see an issue that’s been opened by an individual and no one else has commented on it, versus another issue where at least one other person, a different person, has commented, guess which one I’m going to pay attention to first?

Especially if the comment is, “Yeah, I had that problem, too. It does it every time you do x, y, z.”

It’s supposed to be a community. If it’s open source, there’s an invitation to others to participate. It doesn’t mean that they have to. They’re not under an obligation. They should never feel like it’s an obligation. But if they want to make a contribution back, and it can be justified as part of the work they do in their day job, everybody wins, including their employer.

Because you get better code to work with.

Yes. In terms of Python, I have some focused effort on making sure that the shared memory code makes it into the release, that it clears the hurdles.

It needs to get properly tested and go through those cycles.

Exactly. And so that’s pulling quite a bit of my attention there. If there are other people who are interested and excited by the idea of shared memory, who have a use case for it, double fantastic. I would love to talk to those people.

Do you want me to put contact information for you in the blog post?

Sure, if it’s Python related, my e-mail is just Davin@python.org.

Oh. Nice.

I don’t get paid, but there are a few fringe benefits. That vanity email is one.

That is a nice one.


Read related Vertica blog post: Open Source Software is Free, Like a Puppy

How open source and Vertica interact, and the new open source Python interface for Vertica.

Paige Roberts: We’re actually incredibly grateful to Uber, who is one of our customers. They created a new Python interface for Vertica and open sourced it.

Davin Potts: Oh, I have not seen that. That’s awesome.

Roberts: I just started at Vertica a few months ago and I was surprised to find out they had an open source Python interface, as well as Python UDFs so you could use custom Python algorithms inside Vertica. I thought that was pretty neat.

Potts: Nifty. That is cool.

For the KNIME stuff, there’s the integration that I showed, where you don’t just call Python code from KNIME; you can have Python code call into KNIME. If you’re a Python shop, or you love using Jupyter Notebooks, fantastic. You can see your KNIME workflows from inside of Jupyter Notebooks. You can trigger the execution of them, and effectively, it’s just another function that you call from Python, except that it runs other things that people created in KNIME.

I can see a lot of potential use cases for that. Say you really don’t like using SQL, but you know how to use KNIME to interact with databases. Do all your database things in KNIME, and it’ll use the database in a really efficient way. Then you can call it from Python, and there’s your meeting point, if that’s what you want to do.

Or you can use the Python-Vertica interface.

Whatever floats your boat, because most data stores have some sort of Python interface, whether or not that’s the first thing people think of.

If you live in a Python universe, Python is your world, and if a data store doesn’t have a Python interface, you’re not going to talk to it.

Yep.

And KNIME can be a bridge across that gap, but it’s even better if you have something that’s a little closer.

How well is Vertica promoting the existence of the Python open source interface?

The new Uber interface, Vertica-Python, came out recently. We’ve had a Python interface and a Python UDF framework for a while, but Vertica-Python is a pure Python interface. Our old one had C++ code that we built in-house for better performance, with Python bindings on top. Even internally, though, we’re moving all our support to the new one. We haven’t done a lot of promotion on it yet, but it’s one of the things I’d like to get the word out about a little more. A huge chunk of our business is OEM, so a lot of applications are Python or something else on the surface, with Vertica embedded inside.
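For readers curious what the pure Python driver looks like, here is a hedged sketch using the open source vertica-python package; the connection values and the “sales” table are placeholders for illustration, not details from the conversation:

```python
# Requires: pip install vertica-python, plus a running Vertica instance.
# All connection values and the 'sales' table below are placeholders.
conn_info = {
    "host": "127.0.0.1",
    "port": 5433,        # Vertica's default client port
    "user": "dbadmin",
    "password": "",
    "database": "vdb",
}

def totals_by_region():
    import vertica_python  # imported here so the sketch reads standalone
    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        # Push the aggregation down into the database, rather than
        # pulling raw rows across the wire and summing in Python.
        cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
        return cur.fetchall()
```

Note the shape of the query: the database does the summing, echoing the earlier point about not pulling too much data out of the database and redoing work in application code.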

That makes sense. I knew about the OEM business, but I didn’t know relatively speaking how much time and attention they got. That is very cool.

Learn more about the intersection between Vertica and open source.

Advantages of doing machine learning inside a database

This is the final installment of my discussion with Davin Potts. I have to say it was a lot of fun catching up and talking shop.

Paige Roberts: One misconception I’d had for a long time, probably from hanging out with the Hadoop and Spark crowd, was that you need to do machine learning in something like Spark or Python. You pull data out of the database and you put it in a dataframe or something, and then you do machine learning. Then, you put your results back in the database. It was kind of an epiphany to realize, the data is already there in a table. Why move it?

Davin Potts: I’ve never seen anyone do a careful survey; all I have are anecdotes, but I get that same impression. Relatively few people are doing their machine learning work inside of the database. And I think that’ll change with time, but it’s not going to happen overnight, because whatever machine learning they were doing before, they were already doing it in a particular way.

And when they shift to doing that inside of the database, there’s also a mental shift. Like during the talk, I put up the first slide about running Python code inside of Postgres. There were actually two people in the audience who I saw do a double take. Like, “What?”

Roberts: You can’t do that.

Potts: Yeah. First, there was that. Then, I saw on their faces a “Wow,” and then there was a smirk of, “No, that’s crazy.”

Roberts: Yeah.

Potts: People need to overcome that. If they do, the reward is the performance. You’re not going to do machine learning in a database for the cool factor. You need to do it because it’s more performant.

Because you don’t have to move your data around. You don’t take the I/O and CPU hit from data movement or data transformation. And you don’t have to downsample or anything. You just leave the data where it is and do your machine learning there.
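Vertica exposes its in-database ML through SQL functions, but the core argument here is engine-agnostic: ship the computation to the data instead of the data to the computation. As a stand-in sketch, using Python’s built-in sqlite3 rather than Vertica, compare pulling every row into Python with asking the database for just the answer:

```python
import random
import sqlite3

random.seed(0)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (v REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?)",
    [(random.gauss(10, 2),) for _ in range(100_000)],
)

# Anti-pattern: move every row out of the database, then compute.
rows = [r[0] for r in con.execute("SELECT v FROM readings")]
mean_moved = sum(rows) / len(rows)

# In-database: the engine scans the data where it lives and ships
# back one number instead of 100,000 rows.
(mean_in_db,) = con.execute("SELECT AVG(v) FROM readings").fetchone()

assert abs(mean_moved - mean_in_db) < 1e-6
print(round(mean_in_db, 3))
```

On a real columnar MPP database the gap is far larger, since the aggregate runs in parallel next to the data and nothing but the result crosses the network.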

The other challenge is that, since they weren’t already doing machine learning in SQL, you’re asking them to make a transition to a new language as well. Look at any of Vertica’s competitors, any of the data stores, and there’s usually one premier language: PL/pgSQL for Postgres, PL/SQL for Oracle, and so on down the list. Some of them don’t even give you a choice beyond “Well, you get ANSI SQL, and why would you ever need anything else?” “Yeah. Okay, MySQL.”

But consider what Microsoft has started doing, embedding Python and R inside of the database. That’s a serious commitment on the part of any of those companies.

It is.

It gives more people more reasons not just to adopt the database, but to spend their whole lives inside of it. That might be a reasonable strategy, and I’m sure watching Microsoft will show us how much of a difference that actually makes between…

You don’t have to watch Microsoft. Vertica did that years ago. You can build machine learning algorithms in Vertica, in R, in Python, in Java, in SQL, in whatever it is you want to do it in. And our problem is, no one knows that.

Actually, I didn’t know that either until recently.

See? Nobody knows. And that’s a problem. It needs to change. People think, Vertica isn’t some shiny new NoSQL database, so it must not be able to do what I want, because I’m doing something cutting edge, like tracking streaming aircraft data. I just walked out of a talk downstairs where a guy described building a special database for time series analysis because “there just wasn’t a good database for that.” And I’m like, “Hello?” We’re awesome at time series analysis. His special database wasn’t even columnar. It’s not like columnar analytics databases are a brand-new idea; they’ve been around a while. If you want good performance and scale for analytics, you need to use a columnar format.

Yep.

That’s like basement level. That’s below the foundation. [laughs] It’s interesting to me, sometimes, to see this attitude about data management software right now: If it’s not open source, and it’s not brand new, it must not be any good. There is that big chunk of prejudice that established proprietary software has to overcome.

There is. It’s funny how that has shifted, right? If we go back, hmm, maybe 15 years ago, people were still using stupid phrases like “open sores.”

Yeah, yeah.

Now, oh man, now the proprietary code is the bad guy. That’s not cool either.

I was glad to hear Gwen Shapira say something about that: no, you can’t do it all in open source. Use some of the vendors’ stuff. Just be ready to escape if you need to, so you’re not locked in. Good advice.

Take the Python developers: with a few exceptions, weirdos like me who are consultants, the vast majority of them work for companies with proprietary software products.

Yeah.

Some of those products have significant open source components, and a lot of big companies have both proprietary and open source offerings. But proprietary software is part of what the open source community works on. It’s part of their day jobs. It shouldn’t feel like an either/or thing.

They should know. I mean, you would think that if you work on proprietary software, you know that it’s good software. You built it.

Yep. I think it’s the fan boys and fan girls running around wanting to rally behind a banner, creating the appearance of sides that need to be rooted for, who exacerbate the situation. There will always be people like that, but the hype machine will shift and come back to a more reasonable middle ground. When proprietary tools interact with open source tools and show a willingness to play well together, people respond; when you create the perception of “we’re against that,” people reject you.

Yeah. Microsoft figured that out.

Microsoft figured that out. They had to get rid of that guy who knew how to throw chairs in order to figure it out, but they figured it out. As for Vertica, I don’t know of a great open source story around Vertica, but I’ve never heard of Vertica being down on open source either. And embracing the fact that there are open source things you can already do inside of the database, today: I think that’s a cool story.

I really appreciate Davin Potts giving me so much of his time and his wisdom. He’s been doing data science since before anyone called it that. Thanks for reading. And if you want to try out some of the things we chatted about, here’s a link to the free Vertica Community Edition, and the equally free KNIME community edition. Happy analyzing!

Learn more about in-database machine learning in Vertica.

Learn more about doing time series analysis in Vertica Analytics Platform.

Learn more about the intersection between Vertica and open source.

Try out Vertica for free.

Read more...
Orc O'Malley of the Yellow Elephant clan says LLAP

Owen O’Malley on the Origins of Hadoop, Spark and a Vulcan ORC

Owen O’Malley is one of the folks I chatted with at the last Hadoop Summit in San Jose. I already discovered the first time I met him that he was the big Tolkien geek behind the naming of ORC files, as well as making sure that Not All Hadoop Users Drop ACID. In this conversation, I learned that Hadoop and Spark are both partially his fault, about the amazing performance strides that Hive with ORC, Tez, and LLAP has made, and that he’s a Trek geek, too.

Read more...
Metron Eye On Cyber Security

Cyber Security with Apache Metron and Storm

A few weeks ago at Hadoop Summit, I caught up with some friends from the project I worked on last year with Hortonworks, including Ryan Merriman who is now an Apache Metron architect. Since Apache Metron was a project I knew virtually nothing about beforehand, I quizzed Ryan about it. The conversation evolved into a discussion of the merits of Storm versus Flink and Heron, something I’ve been meaning to delve into for months here.

Read more...
Holden Karau's audience at High Performance Spark preso at Data Day Texas

Interviews with Brilliant People on Hadoop and the Future of Big Data Tech

I have been doing some very cool interviews with brilliant people, usually at events like Strata + Hadoop World and Hadoop Summit. The intention is to use their brilliant thoughts so that I don’t have to take the extra time to come up with my own. Not to mention I get the bonus of learning new things, and getting the unique perspectives of folks who really know their stuff. Nothing like learning tech from the folks who literally wrote the book on it.

Read more...
Hadoop Changes as Fast as Texas Weather

How Do You Move Data Preparation Work from MapReduce to Spark without Re-Coding?

So, is this a situation you recognize? Your team creates ETL and data preparation jobs for the Hadoop cluster, puts a ton of work into them, tunes them, tests them, and gets them into production. But Hadoop tech changes faster than Texas weather. Now, your boss is griping that the jobs are taking too long, but they don’t want to spring for any more nodes. Oh, and “Shouldn’t we be using this new Spark thing? It’s what all the cool kids are doing and it’s sooo much faster. We need to keep up with the competition, do this in real-time.”

You probably want to pound your head on your desk because, not only do you have to hire someone with the skills to build jobs on another new framework, and re-build all of your team’s previous work, but you just know that in a year or two, about the time everything is working again, some hot new Hadoop ecosystem framework will be the next cool thing, and you’ll have to do it all over again.

Doing the same work over and over again is so very not cool. There’s got to be a better way. Well, there is, and my company invented it. And now I’m allowed to talk about it.

Read more...
You Keep Using that Word, Real-Time

Four Really Real Meanings of Real-Time

Our director of engineering told me that she had a customer ask if we could do real-time data processing with Syncsort DMX-h. Knowing that real-time means different things to different people, she asked what exactly the customer meant by real-time. He said, “We want to be able to move our data out of the database and into Hadoop in real-time every two hours.”

When she told me that story, I wanted to quote Inigo Montoya from “The Princess Bride.” You keep using that word, “real-time.” I do not think it means what you think it means.

But what does real-time actually mean? And what do you really mean when you say real-time? What do other people usually mean when they say real-time? How can you tell which meaning people are using? And what the heck is near real-time?

Read more...
Tungsten is Shiny

Spark with Tungsten Burns Brighter

Project Tungsten is a new thing in the Spark world. As we all know, Spark is taking over the big data landscape. But as always happens in the big data space, what Spark could do a year ago is radically different from what Spark can do today. It broke the big data sort benchmark record last year, and it just keeps getting better. Tungsten represents a huge leap forward for Spark, particularly in the area of performance. But, being me, I wanted to know what Tungsten was, how it worked, and why it improved Spark performance so much.

Read more...
Spark

The Spark that Set the Hadoop World on Fire

Spark is the darling of the open source community right now. It’s setting the Hadoop world on fire with its power and speed in large scale data processing on Hadoop clusters. Spark is one of the most active big data open source projects, has bunches of enthusiastic committers, has its own group of ecosystem applications, and is now part of most standard Hadoop distributions. Neat trick for a data processing framework that didn’t even start life as a Hadoop project.

Read more...
The Little Actian DataFlow Engine That Could

Actian DataFlow, the Little Hadoop Engine That Could, But Probably Won’t

In Hadoop’s ecosystem of massively parallel cluster computing frameworks, Actian DataFlow is an anomaly. It’s a powerful little engine that thinks it can take on any data processing problem, no matter the scale. The trouble is that unlike MapReduce, Tez, Spark, Storm and all of the other Hadoop engines, DataFlow is proprietary, not open source.

Read more...