Podcast: The Growth & Evolution of Flink with special guest Marton Balassi

May 5, 2020 in Eventador Streams Podcast



In this episode of the Eventador Streams podcast, Kenny and I had the chance to sit down and chat with Marton Balassi, manager of the streaming analytics team at Cloudera. Between his experience as one of the earliest contributors to the Apache Flink streaming code and his current work integrating Flink into the product stack at Cloudera, Marton has fascinating insights into not only Flink but the streaming ecosystem as a whole.

Learn more about the history and evolution of (and great community surrounding) Apache Flink in this episode of the Eventador Streams podcast:

Eventador Streams · The Growth & Evolution of Flink with special guest Marton Balassi

We recorded this episode with Marton fresh off of his Flink Forward keynote. If you didn’t have a chance to see it (or Kenny’s!) we highly recommend you check it out on YouTube.

Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play and more. Happy listening!

Episode 05 Transcript: The Growth & Evolution of Flink with special guest Marton Balassi of Cloudera

Leslie Denson: We’re back once again to talk about one of our favorite topics, Apache Flink, this time with someone who’s been working with it since before it even had its name. Join Kenny and me in chatting with Marton Balassi, manager of the streaming analytics team integrating Flink at Cloudera, about Flink’s history, its benefits, and the great community that surrounds it, in this episode of Eventador Streams, a podcast about all things streaming data.

LD: Hey, everybody, it’s Leslie and Kenny here again with another episode of Eventador Streams. Today we’re super excited to have Marton Balassi with Cloudera on the line with us to talk about how they’re integrating Flink in with their platform and what they’re hearing from their customers. We are really excited for today’s episode and hope you are too. Thanks for joining us, Marton. We really appreciate it.

Marton Balassi: Thanks for having me, Leslie and Kenny. It’s an honor to be here.

Kenny Gorman: Excited. This will be a good time.

LD: It’s going to be a great time, and we’re coming off of Flink Forward. There was a lot of really great conversation around Flink in the market yesterday, and it’s going to continue over the next couple of days. It’s also just a great time to have it on the mind and keep talking about it.

Marton, why don’t you tell us a little bit about your background with Flink? I know you’ve been working with it for a while now, several years. And tell us what you’re doing at Cloudera. Give us just some insight into that.

MB: Sure. I can call myself very fortunate that I was introduced to Flink before it was called Flink. My background is actually in research in distributed systems, and I had the good fortune to start working on a project called Stratosphere. We had an EU-funded research project where basically one of our tasks was to take this engine called Stratosphere and give it a streaming API as well. That engine was then contributed to the Apache Software Foundation, and our contribution became Flink streaming.

LD: Oh, wow.

MB: I was focused on Flink streaming specifically for a couple of years. Then I wanted to have a little bit of a switch and understand, coming from the research side, how people actually use these systems and derive value from them in the industry. So I became a consultant within the Cloudera professional services team and spent roughly three years there. I worked directly with roughly 50 customers.

Then I decided to try my own luck and became an individual proprietor in this field. After roughly six months, I realized that this is not as much fun without a great team to collaborate with. Also, Cloudera reached out to me again at that point: my current manager, Jovi, who runs the whole data-in-motion organization, asked me to be on the team for Flink, and naturally I was really excited, because I thought it was the right team and the right context to make this happen, and we never stopped ever since.

KG: That’s awesome background. I didn’t actually know how deep your background was with Flink early on, so that’s very cool. Great context.

LD: It’s really interesting to chat with people who have worked with Flink this long. We joke that we’ve worked with Flink for two or three years now, and that seems like forever in the streaming world. So it’s always great to see folks who have used it even further back.

KG: I mean, you could even call that being a glutton for punishment. Distributed systems are much harder than traditional systems, and streaming distributed systems add a layer of complexity on top of that. Speaking from my background at least, when we first started using Flink, we were like, “What is this mysterious, magical thing called Flink?” We were pretty addicted to it, and it never stopped from there. It’s great to see others come in and adopt it, and even better to hear the history of how it got here and why it’s so prevalent now.

Marton, are you guys seeing a ton of … I mean, obviously data in motion is the moniker that Cloudera has put on your business unit and your effort surrounding Flink, I think. Could you tell us a little bit more about what that looks like from your perspective, being so technical, and give us some color there?

MB: I think Flink really unlocks a lot of value for our customer base, and I’m really glad that we are making this investment. Initially, when I joined Cloudera back in 2014, I was hoping that I could make that change happen, but actually we needed the merger with Hortonworks to have a sizeable enough investment that it made financial sense for the company to make this change and introduce Flink to the platform. That certainly put us in a position where the company felt comfortable making these investments. In both cases, I think it’s fair to say that Hortonworks had a much bigger data-in-motion investment, but even for them this meant bringing in a new piece of the puzzle; Cloudera itself now has 40-plus components within its offering.

So making the decision to bring on something new, then making all the necessary integrations, and making sure that you can position it in a proper way, that actually becomes a great challenge, because the customers look to us to also provide the recommendation and a platform that works as a whole. When you reach a customer base like the one Cloudera has, it’s not necessarily about picking the absolute best tool out there, but about picking a tool that is sufficient for the job and can have the necessary integrations. That means your vetting process has some extra hoops to jump through compared to being a new starter in this industry.

KG: Yeah, that makes tons of sense.

LD: It does, and it goes into one of the things I wanted to talk to you about today, because I know we’ve had a lot of these as we’ve gone along, and I think we can say we have a lot of them every day: what are some interesting learnings you’ve had? I mean, you’ve obviously worked with Flink for a very long time, but as you’ve been integrating it with Cloudera, and as customers have been getting started with it, what are some of the interesting things you’ve learned along that path?

MB: So I think, going back historically, this really derives from what Kenny mentioned previously, that distributed stream processing system design is inherently a bit challenging. The way people to this day often approach streaming is that they expect streaming to be inherently approximate, that you need to correct whatever streaming produces for users afterwards, and for a number of years that’s how people viewed streaming. Now, especially with, let’s say, the rise of Flink and other systems, for example Apache Beam, where we can guarantee correctness properly and do it in a scalable way at low latency, that really changes. Coming from my professional services background, I have really seen people open up to this new idea. I’ve worked with, for example, a telco where they had been doing automatic classification of network outages for years in a batch processing mode.

Then I posed the question to them: “What if we could do this, instead of in a day, in, let’s say, 20 seconds end-to-end, coming from the network tower?” The business unit manager was shocked, because he said, “Well, it takes roughly half a minute for our angry customers to call into the call center when they have a network outage.” So if, by that time, you can provide the reason why the outage is occurring to the call center folks who pick up the phone, you probably reduce the length of that call by 50%. The largest expense a telco provider has happens to be the call center. So that is a game changer, and this is a simple example. I’ve seen some more in banking, and I’ve seen a couple across verticals in cybersecurity, which is really a horizontal use case, but it speaks for itself.

But it was really a realization, when I was having that conversation at the telco, that made it clear to me: now that we actually get to do these things in a correct manner, in the general sense of the term, with quite low latency, that is really a game changer. We should make it so that these systems integrate really well with the rest of the stack, make that first-mile problem I talked about yesterday in the keynote really smooth, and make sure that the learning curve for systems like Flink is really simple.

KG: See, that is the fascinating thing to me too. I tend to agree. We oftentimes talk to folks about streaming and one of the ways we describe it is, “Look, every business can benefit from streaming data.” But way back when, maybe a decade ago, it was very common to say like, “Every company is a data company,” and I think that’s largely become true. Folks have figured out that data is a great asset for their company, but I don’t think everybody’s figured out that every company is a streaming data company. You just don’t know it yet.

It’s these kinds of use cases where the business has been moving along, trucking along at a certain velocity, and all of a sudden streaming can come in and actually change the game in terms of how they use data and how they delight their customers. Sometimes it makes them more competitive. Sometimes it gives them better customer retention numbers or better satisfaction for their customers. Streaming data is tricky and distributed systems are tricky, but the beauty is that when you get it right, it does exactly what you were talking about. It’s like a nuclear bomb that you drop on your infrastructure to change the game from A to B. I think that’s the exciting thing, at least for me, hearing those stories. That’s really cool. I’m glad you shared that.

MB: I also get excited about those, and that’s why I really pounced on the opportunity that joining the data-in-motion team has given me. I’m really glad that the people who joined my team also share this vision and really understand what I mean when it comes to this. So yeah, this is a great opportunity for all of us.

KG: That’s cool.

LD: We say it all the time. If our team internally wasn’t as excited about all of the things that we’re doing, it would be a whole lot harder to get done what gets done. Everybody loves to see when this succeeds, and everybody loves to see the fun use cases that come in with different customers: how they’re using streaming, how they’re using Flink, what they want to do, how they want to transition to SQL, all of those different things. It’s interesting to see and it’s a lot of fun to watch.

LD: So part of that also is … as we’ve said now a few times, you’ve been working with Flink for a long time. Is it now what you thought it would be? How has the evolution of the platform been from the very beginning?

MB: Yeah, it’s really interesting to look at it from that perspective. I’ve touched on it a little bit from the streaming perspective, how much that changed the system, and I think it’s fair to say, and probably our friends in Berlin would agree, that the biggest change for Flink was the checkpointing mechanism; that really turned the whole story around. It shifted Flink from a primarily batch-focused system, which it was originally, to a streaming-focused system, and it’s fun to see, just from that perspective, how much the scale has tipped in the other direction and how streaming-focused Flink is today. Saying this, and knowing that I was the guy, along with my good buddy Doula, writing the first lines of Flink streaming at the time, obviously that makes me immensely proud. But that was just the API.

The network and the engine, for the most part, were done by the guys in Berlin. But it means that we were the first advocates of Flink streaming. Looking at where we came from, how we had to convince people, even the folks in Berlin, that this is what we should invest in as a community, and seeing that it really came through … obviously that’s a matter of luck as well. I’m definitely not a fortune teller, but seeing how far this has come, yeah, it just makes me proud. That’s the best way to describe it.

KG: That’s cool. Did you have exposure to things like Storm and other systems previously or what drew your inspiration in those early days?

MB: Yes, we did, at the research institute where we embarked on this project. We had multiple projects running at the time, primarily with Storm, and obviously Storm was a great influence on our work. Storm definitely … it was the first; it has given this specific field a lot. But that was actually, ultimately, I think, its demise: people jumped on it so quickly, and it became a production application, and that made it quite challenging, from my perspective, for the Storm community to make some of the changes that maybe the engine ultimately needed. I think that’s why systems like Flink could get a leg up on Storm.

KG: That’s interesting. We initially implemented our platform on Storm using Streamparse, if you’re familiar; if you’re not, it’s essentially the Python API to Storm, for our listeners. We enjoyed it early on, but as soon as we wanted to get powerful and roll up our sleeves, we found limitations, and that’s when we turned to Flink. Even though it was very early on for Flink, it was still quite powerful. The focus on state and correctness was front and center for us and an important component. So that was it for us: we opened up IntelliJ, put down PyCharm, and went for it.

We wouldn’t have started really thinking about streaming if it wasn’t for Storm, so a lot of credit there. But as soon as we got neck-deep in it, we started to see, A, streaming systems are awesome, and B, oh my gosh, this is pretty complicated. What’s the best way to move forward, and which other platforms have, especially, that component of state that’s so important?

MB: Yeah, the way I like to think about Storm, and this is overly simplistic so I’m sorry if I offend anyone with this, is that it made a huge leap. It proved that something was possible. To a large extent, MapReduce did the same: MapReduce proved that you could build an inverted index, basically a dictionary, for something as large as the internet, and that was a huge accomplishment. Storm did something quite similar for the streaming space, so you should never discredit them.

KG: That’s right.

MB: But being able to follow up on that, that’s another story. And to reiterate this point: when I was working with Storm, what really frustrated me as a developer touching the engine, and at the time I was questioning whether it was just me not living up to expectations or whether this really was a more challenging task, is that the internals and the distributed resource management aspect of Storm were, back in the day, implemented in Clojure. That was a great solution for writing that part of the code base really quickly. Most of it was actually written by a single person and then a small group of developers, which was really efficient for coming up with something fast.

But once you build a large enterprise footprint around it, it also becomes quite important that everyone can understand how the system works and that you can accept contributions from a large community. I think that’s something the Flink community is very mindful of, and we try to maintain a welcoming community for that reason. That’s how we’re trying to embrace the whole philosophy of the Apache Software Foundation. Although Storm was doing the same, by choosing, let’s say, a slightly more obscure programming language than Java or Scala, they made it difficult for your average data engineer to contribute code back to Storm. But that’s just my two cents and I might be wrong; obviously my knowledge of Storm is limited.

KG: Yeah, I see how at the end there you got humble all of a sudden. That’s awesome, and I think that’s … a lot of the modern distributed systems projects are Java and Scala, and obviously things like Presto come to mind.

MB: Yeah, and maybe other languages … for example, we see Go coming up quite a lot these days with Kubernetes. I think Go is a great language as well in terms of making it really simple for people to understand. It’s more about finding a language with simple enough concepts and guidance that lets you express the right level of detail, and obviously, when it comes to these distributed systems, that wouldn’t necessarily be what you focus on. You then get to: does this system really do low latency? How scalable are checkpoints in this system?

You really start to think about the use cases that drive the business value, or the patterns that enable these new use cases. But it’s really interesting to me how this actually loops back to community in the long term. That’s also why I’m really focused on pushing this aspect during our Flink Forward keynotes, for example: we want to make sure the community understands that we as a vendor are there for the community as well, because I believe that’s the only way towards long-term success when it comes to the system.

LD: It dovetails nicely into one of the things that we’ve talked about amongst ourselves. We asked Jesse, and I think we’re going to ask most of our guests: what are you excited about coming down the pipeline with streaming? What about the future, whether it’s the technologies or the use cases or whatever it might be, makes you still excited to be in this space and doing this work?

MB: So right now the general … look, this also comes from my very recent experience of integrating Flink into the Cloudera stack. Frankly, the first three to five months were basically just integration work for us, and we knew exactly when we joined with our small team that this is what we had to deliver. We had to deliver integration with Cloudera Manager, basically. That was just … start typing really quickly.

But once we got there, that’s when we had the chance to open up to customers, and we ran a beta program with a handful of them. It was really enlightening to me that, over and over, we saw questions that we could categorize into two different themes. The first was around developer experience, and that had to do with the prior experience of the customer. Either they were very seasoned in YARN and Cloudera already, had seen that Flink does certain things slightly differently, and we needed to educate them on the concepts; or they were new to Flink and Cloudera at the same time, in many cases, or to the whole YARN setup, and we just needed to provide a simple enough user experience.

Obviously, for us, after three to five months of really going down into the trenches and the minuscule details of making the system work, it was enlightening to understand that, well, actually, if we could just show the running jobs in the same place where the completed jobs are, that would help our customers. This is user experience 101, so it’s fairly simple to suggest. But coming from an engineering background, we were pushing for yet another cool feature, whereas limiting the feature set, making it very slick, making sure it’s well organized, and providing a ton of content around it, I think that’s something we need to contribute to the Flink community. The Flink community overall is doing a better and better job of this, but historically I think it’s fair to say that Flink has had a really nice and clean code base, at least compared to some of the other systems in the space, and people enjoyed digging through the code.

But some of the documentation and some of the examples are lacking, and you can always improve on that. You can find corners of the code base where the code looks in shape, but it’s really difficult to figure things out from the documentation or find a nice, let’s say, step-by-step, tutorial-type guide on how to get up to speed with it. So we grouped these stories into this one theme, let’s call it developer experience: UI and providing content. That was a really important enlightenment for us. The other topic, and that’s also why I’m excited about Eventador in general, is streaming SQL.

Obviously the community is really focused on this, opening up to a way larger audience. I think I don’t need to introduce these concepts to the Eventador Streams listeners. But certainly that’s the other big topic I’m excited about. So, if you wish, now the base engine is done, quote unquote. Obviously there is always work to do, but what I mean by that is that, similarly to what I just said about developer experience, we have this engine that works quite well, but we need to make sure that you can extract value from it in the right way. From that perspective, making a nicer UI and opening up SQL are similar. Technically they are night and day in terms of implementation detail, but that’s how I view them in this context. And after working with a bunch of customers, I see how many more people will be able to interact with the system in a SQL-like manner than when we require them to write Java or Scala code.

KG: We had this one eye-opening moment with a customer … you reminded me of it. I think I’ve mentioned it before, but we had a customer who was super excited about Flink. They came in through the Flink ecosystem. He had been writing his own processors on side projects, got interested in it, and was looking at implementing stuff in Flink. But when he looked around at the team, he was the Java nerd, so to speak, and everybody else had some different set of skills, but it wasn’t really core Java. So we got to talking about how to implement some jobs, and he was writing stuff in SQL, and I was surprised. I said, “You’re the person carrying the flag here in the company for Java. Why would you write these jobs in SQL?”

His answer totally floored me. He said, “Well, I can write it in Java, but then if it breaks, then I’m the one who has to get woken up. When it needs a change, I’m the one who has to go work on it. So it’s not really the best thing for our organization. It’s better that I write them in SQL. I can still accomplish the same business goal, and then the entire suite of my dev ops team can work on it and mutate it and change it over time. If something breaks, the on-call pager can go to everybody.”

I thought that was, A, very mature of him to understand that landscape within his org and, B, exciting for SQL. Those are the two takeaways I took from that little conversation, and I thought, “Wow, this is going to be a thing,” if folks who are already dyed-in-the-wool Java folks, who like the DataStream API and the Table API, are saying, “Hey, you know what? Let’s go a step further and make it SQL just for that readability, maintainability, and socialization within our teams,” even within a developer community. I thought that was a very powerful statement, and I was excited to see that direction. I don’t know if you guys see parallels there.

MB: Yeah, absolutely. I think even in your or my day-to-day work life, let’s say I want to have a look at a Kudu table. Obviously I have the Java client to Kudu as well, but I can just open up a SQL engine like Impala and say select star or do a simple aggregation. I tend to choose the latter, even though I consider myself a half-decent Java developer, and that’s what I would recommend for people who choose this profession. That’s certainly true, and we see it in multiple cases. If you look at the broad landscape, you cannot expect someone to be a domain-specific expert, be great at statistics, and be a great Java developer at the same time. That’s just a bit too much to ask for, in my opinion.

We see it in many cases. One of the engineers on the team, Doula, used to work for a company called King. They make games like Candy Crush, and in one of the great platforms he produced with his team there, the idea was basically that, yes, there is the data engineering team that connects the bits and pieces and makes a platform, but the domain-specific logic is then plugged into that system by analysts. We see this pattern repeat over and over again. We see it at Netflix with Flink; we do to a certain extent with Alibaba. Obviously you need that data pipelining experience, but you want to enable, as you were also saying in your talk yesterday, the 80% of the business logic that can be plugged in by the crowd with the domain-specific knowledge, and let them do that. Let’s not make them ask for budget for a three-month project with two data engineers to implement what they could do in a SQL query.

KG: Yeah, it’s interesting as humans, how many times … even though you may know the schema, right? Maybe this is a well-known data structure. You know the schema. You still want to poke at it, right? You still want to say, “Show me a couple of messages, show me a couple of snippets there. I want to see the data and play with it and interact with it, and then I’m going to start writing code against it and understanding it in maybe a deeper way.” I think that’s a human experience. As programmers, no matter … Python’s my programming language of choice. But even if I have a dictionary, speaking in Python terms, a dictionary full of data, I want to poke at it. I want to look at it: what’s in there? Show it to me, and then I’ll make decisions. Then I know what kind of exceptions and handlers to write, and things like that.

So I think SQL has this great effect on people where they actually do poke at the data. Like you said, you might just jump in and select star from the table, limit 10 or whatever. Very traditional thing to do. Show me what’s in there, and okay, now I understand what I’m working with, and I can build the application or solve the business problem from there.
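That poke-at-the-data workflow Kenny describes can be sketched in a few lines. This is a hedged illustration only: it uses Python's built-in sqlite3 as a stand-in for a streaming SQL engine like Impala or Flink SQL, and the `events` table, its columns, and its rows are invented for the example.

```python
import sqlite3

# In-memory database standing in for a stream-backed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, kind TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "click", 0.5), (2, "view", 1.2), (3, "click", 0.7)],
)

# Step 1, "show me a couple of messages": peek at the data first.
sample = conn.execute("SELECT * FROM events LIMIT 10").fetchall()
print(sample)

# Step 2: once you understand the shape, write the real aggregation.
counts = conn.execute(
    "SELECT kind, COUNT(*) FROM events GROUP BY kind ORDER BY kind"
).fetchall()
print(counts)  # [('click', 2), ('view', 1)]
```

The point of the pattern is that both steps are the same `SELECT` interface; no Java client or compiled job stands between the analyst and the data.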

MB: This ultimately, I think, reduces time to value, and if I remember correctly, you also make this point very clearly in one of your blog posts: if you look at how much time it took you, years and years ago, to build a similar application, versus how much time it takes an enterprise to hook in a Kafka topic and provide a SQL interface to someone with domain knowledge, that is really a game changer. Obviously we usually talk about latency in terms of the data itself, but the latency of delivering the project, that’s also very valuable.

LD: Well, Marton, this has been a really fantastic conversation. I know we’ve enjoyed it and learned a lot. Being somebody who’s been in the Flink ecosystem and the Flink community for so long, I’m sure you love seeing, as we do, more and more people come into Flink and see its value and its benefits. To the point you made earlier, we are obviously massive fans of SQL over here, so we’re always happy to see it getting its props and its due within the streaming community, and to see how it’s growing there as well. So this has been a fantastic conversation. We really thank you for taking the time, especially after doing your keynote yesterday. I’m sure you may be a little talked out. So thanks so much for coming on and talking to us.

KG: Yeah, it was a great keynote. Enjoyed the demo. It was great to see Cloudera doubling down on the streaming space and on Flink in particular. So exciting times ahead. It was great stuff.

MB: Yep. Thank you very much.

LD: A big thank you once again to Marton for taking the time to chat after a fantastic keynote at Flink Forward, no less. If you didn’t get to see his talk about Cloudera or any of the other talks, including Kenny’s, check out the event’s YouTube page to watch. You won’t be disappointed. As always, if you find that you have questions about Flink or the Eventador platform, you can reach out to us at hello@eventador.io or go ahead and get started with a 14 day free trial of the platform at eventador.cloud/register. Happy streaming.
