Senior Director of Marketing
Since Flink became a top-level Apache project in late 2014, the community has grown, making it by 2019 one of the ASF’s most active projects in terms of commits and user group activity. And for very good reasons.
From state management to features like the Table and SQL APIs to more recent support for things like Java 9 and the Blink planner, more and more organizations are understanding the power of Apache Flink as a part of their stream processing ecosystem. There are a few common hurdles for organizations to overcome when ramping up their Flink systems – for instance, a common one we see is improper window sizing. However, the future with Flink—for both data engineering and data science teams—is bright.
Want to learn more about the good, the bad and the awesome of Apache Flink? Check out this episode of Eventador Streams.
And if you haven’t registered for Flink Forward, which is now being presented in a virtual format on April 22 – 24, 2020, go ahead and sign up now. Kenny will be presenting on Wednesday the 22nd at 3 pm ET/Noon PT on “Writing an interactive streaming SQL and materialized view engine using Flink.”
Want to make sure you don’t miss a future episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play and more. Happy listening!
Episode 02 Transcript: A Dive Into All Things Apache Flink
Leslie Denson: If you’re using Apache Flink, thinking about Flink, or have heard of it and just want to know why it’s one of the Apache Software Foundation’s most active projects right now, join me in chatting with Kenny Gorman about all things Flink in this episode of Eventador Streams, a podcast about All Things Streaming Data.
LD: Hey Kenny, how’s it going today?
Kenny Gorman: Good, how are you?
LD: I’m doing good, doing good. Thanks for joining me again, and this time…
LD: We’re going to talk about all things Apache Flink. How does that sound?
KG: Yay. That sounds great. [chuckle]
LD: Good. So I know that Flink was something that came pretty early on in the Eventador product suite. You started with Kafka, added in Flink, what was… Why was that the path that you all went down? Because this was a few years ago before it became even as popular as it is now.
KG: Yeah, I’d like to think we were kind of early on in seeing the design choices and the wisdom of the founders of the company now known as Ververica, in that they were really focused on correctness of results, and they were focused on state management and how much that mattered, and those were some of the things that were on our mind. We had previously implemented Storm as part of our platform very briefly and found limitations there in terms of scalability and state management and things like that. And we were… [chuckle] It was actually one of our principal engineers, John Moore, and Eric and I at lunch, fretting over this decision and having beers, and we ultimately decided that Flink had all the attributes, and then we went back to the office and immediately started implementing Flink. So, I’d like to say it was some well-thought-out, month-long, very cerebral exercise, but ultimately, we decided that, hey, it had a lot of the core components that we thought were important. We could tell that there was a groundswell. At that point, it was already a top-level Apache project, and so we dove in and started implementing it on our platform.
LD: I feel like over beers is how all the best decisions are made, just personally.
KG: Yeah. [laughter] Right. Ultimately, the thing that’s happening is, and just to give a context is, you might… Someone might say, “Well, why Flink? Why anything? Why wouldn’t you just write everything in your own microservice or your own Kubernetes or Docker ecosystem and just write processors just raw in Java or Scala or anything? Why not just pull stuff out of say, Kafka with native drivers and build your own management system?” And that works to some degree, but if you’re trying to grow an organization from just a processor to a full-fledged stream processing ecosystem where you’re doubling down on Kafka, you’re doubling down on a streaming data bus, you need a framework to rely on. You just… Manually managing that infrastructure on your own, it’s terrible.
KG: And the scaling it, having interprocess communication, and then doing things like… Flink does a really good job of building a DAG of the operators, and ultimately, figuring out… Chained operators, and then ultimately decomposing that job and scaling it independently. And that’s, it’s fantastic stuff. And that power is something that we… Once we kind of dove in, we became pretty infatuated with.
LD: Nice. Well, and that leads into obviously there were probably some surprises you had when implementing this. What were some of those? And any kind of insider tips and tricks for people out there who are thinking about Flink but then also hearing, “Well, this may be a little bit more difficult than I want to dive into.”
KG: Yeah, sure. Flink has multiple different APIs. You can use a real high-level API like the SQL API, you can use the DataStream API, which is a little bit lower level, where you would write your job in Java. Or you can even drop down below that. And I think that when we started writing jobs in Flink, we started writing them in Java and using the DataStream API. And I think ultimately, we started using the Table API and SQL more and more over time and began to get more excited about that. Early on, it wasn’t as mature, say, as it is now. And I think ultimately, folks started to realize that a nice Table API with simple operators, and then obviously, layering SQL on top of that, was super powerful. We were pleasantly surprised to go in that direction, and I think the community at large was going in that direction. And so we started writing more and more code in that way. And that’s… Ultimately, we started to see the power of SQL via that exercise, and of course, being database and SQL nerds, we started to figure out the whole, “Wow, this is really cool,” and we really want to just use SQL all the time. And I think that’s kind of where our heads were at. I mean, this is two years ago now. And that’s kind of what led to a lot of the excitement, frankly, on our platform around SQL.
LD: Yeah, that makes a lot of sense because if you can… I can only imagine that if you’re looking at writing 100-150 lines of Java code as opposed to a SQL query, then for most folks, that’s going to be a no brainer.
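To make that contrast concrete: a windowed aggregation that would take pages of DataStream Java can be a short statement in Flink SQL. A hypothetical example, with table and column names invented for illustration, using the group-window syntax of the Flink releases discussed in this episode:

```sql
-- Hypothetical: clicks per user over one-minute tumbling windows.
SELECT
  userId,
  TUMBLE_START(eventTime, INTERVAL '1' MINUTE) AS windowStart,
  COUNT(*) AS clicks
FROM clickstream
GROUP BY userId, TUMBLE(eventTime, INTERVAL '1' MINUTE);
```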
KG: Yeah. For most organizations, it’s like we say, the 80/20 rule, like, 20% of the folks are going to drop down and use something, maybe they’ve got existing jobs or maybe they’re just more comfortable in Java. So that’s kind of how at least I think about it.
LD: So as you mentioned, it’s been two plus years since we included Flink in the platform. What are some of the common issues that if somebody is trying Flink, they’d probably go, “Oh yeah, I’ve run into that before”?
KG: Well, the number one gotcha is… Okay, so if you’re not familiar, Apache Flink keeps track of state as you build a processor. Maybe you have a window operator where you’re looking at events over, say, a 24-hour period. Flink keeps the state to back those calculations. So if you’re writing a computation, maybe you’re summarizing data, or maybe you’re averaging or something, and as you do that over that time period, it’s going to keep track of state. And it’s very good at that, in terms of scaling, restartability, and fault tolerance with that state and all that stuff. It’s all based on RocksDB and it works quite well.
KG: But the gotcha is what folks will do is instead of making a 24-hour window, they’ll make it like forever, or a month, or whatever. And you’re putting a billion events a minute or whatever through your Kafka pipeline and Flink’s slow. And you say, “Hey, how come everything that is involved with Flink is completely broken?” And the answer is, “Well, yeah, you’re keeping an open window, or a window that’s so huge that that state just becomes crazy.”
KG: In some cases, if you’re going to do that, maybe just use a traditional database. ‘Cause if your window operators or your view of the world in terms of time is going to be that gigantic, then maybe a database is actually a good thing for you. That said, if you’re putting a billion events per minute or whatever it might be into Kafka and wanting to process that in Flink, its best role, I think, is taming the firehose, so to speak. You’re taking a huge amount of data, you’re persisting it in a distributed log like, say, Kafka. You’re pulling it into Flink, using the parallel processing capabilities of Flink and the simple job creation structure, and you’re processing that data down to something more bite-sized. That use case is what Flink is really, really good for, things like ML or data scientists who obviously care a lot about that. You just can’t process that many events over that time frame with these tools otherwise. I think when you think of it in that context, that you’re just kind of taming the firehose, Flink works really, really well for that. And that is, for sure, the most common mistake.
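The window-sizing pitfall is easy to see outside of Flink, too. Here is a minimal plain-Python sketch (not Flink code; the class name is invented) of a tumbling-window sum. The point is that the operator only keeps one partial aggregate per open window, and it emits and discards that state when the window fires, which is exactly what an unbounded or month-long window never does:

```python
from collections import defaultdict

class TumblingWindowSum:
    """Toy tumbling-window sum: one partial aggregate per open window,
    dropped when the window fires. Just an illustration of why window
    size bounds operator state -- not actual Flink code."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.state = defaultdict(float)  # window_start -> running sum
        self.results = []                # (window_start, sum) pairs

    def on_event(self, timestamp_ms, value):
        # Assign the event to its window and update the partial sum.
        window_start = timestamp_ms - (timestamp_ms % self.window_ms)
        self.state[window_start] += value

    def fire_before(self, watermark_ms):
        # Emit and drop every window that has fully passed. This is
        # what keeps state small: a "forever" window would never fire,
        # so self.state would grow without bound.
        closed = sorted(k for k in self.state
                        if k + self.window_ms <= watermark_ms)
        for start in closed:
            self.results.append((start, self.state.pop(start)))

op = TumblingWindowSum(window_ms=1000)
for t, v in [(100, 1.0), (900, 2.0), (1100, 5.0)]:
    op.on_event(t, v)
op.fire_before(2000)
print(op.results)     # [(0, 3.0), (1000, 5.0)]
print(len(op.state))  # 0 -- all state was released when the windows fired
```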
KG: But I guess beyond that, writing Scala and Java code against streams is… While the API is relatively simple, and I think that Flink does a good job of abstracting that, it is still complicated. You still have to know about things like watermarks, you still have to understand chaining operators, you still have to understand the concept of state. It is not just a matter of writing a bit of code, deploying it, and forgetting it. That’s a learning curve: if you’re a Java engineer and you’re thinking about writing Flink code, then understanding the API through and through is really important. And it’s changing relatively fast. Flink’s adoption is growing and the commits are growing, so it’s moving relatively fast. So I think that’s an area where most folks kind of say, “Oh, I’ll just write it in Flink.” And then they realize, “Oh, okay, that’s going to take a little while, actually.” And then, “Who else in my organization is going to be able to maintain that code?”
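Watermarks, one of the concepts mentioned above, are worth a quick illustration. Very roughly, a watermark is the system saying “I don’t expect events with timestamps at or before T anymore.” A common strategy lets the watermark trail the highest event timestamp seen by a fixed out-of-orderness bound. This is a plain-Python sketch of that idea (not the Flink API; the class name is invented):

```python
class BoundedOutOfOrdernessWatermark:
    """Toy watermark generator: the watermark trails the highest event
    timestamp seen so far by a fixed allowed lateness. An illustration
    of the concept, not the Flink API."""

    def __init__(self, max_out_of_orderness_ms):
        self.max_lateness = max_out_of_orderness_ms
        self.max_seen = 0

    def on_event(self, timestamp_ms):
        # Track the highest timestamp observed so far.
        self.max_seen = max(self.max_seen, timestamp_ms)

    def current_watermark(self):
        # Events at or before this value are considered complete;
        # anything older that still arrives is "late".
        return self.max_seen - self.max_lateness

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=500)
for t in [1000, 1700, 1400]:   # note the out-of-order event at 1400
    wm.on_event(t)
print(wm.current_watermark())  # 1200 -- so the 1400 event was still on time
```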
LD: Yeah. And you mentioned that adoption is growing fast. I think one of my favorite stats that I’ve seen over the last couple of months is the ASF, Apache Software Foundation, put out their stats for usage and user activity in 2019, and Flink ranked at the top, and/or in the top three for commits, number of emails on the dev email alias, and number of emails on the user email alias, which is awesome.
KG: Oh. Yeah, yeah. I didn’t know it was that high, actually. That’s really cool.
LD: Yeah, it is. It’s growing like gangbusters and it’s awesome, but what’s… What I kind of giggle about with the stats just to myself a little bit is, yeah, that’s showing really awesome adoption, but it’s also showing that people are going, “Wait, hold on, help me. How do I do this? I’ve run into something.” And those things, as we know with software development go part and parcel, but it’s an interesting stat, and it’s a fun thing to see that it’s growing so quickly.
KG: No, you bring up a good point, because I think it’s interesting that when we go to conferences or we talk to folks, of course, they’re naturally inclined to do the most complicated thing possible with Flink, like, “Oh, let me train ML models in real time,” or whatever that might be. And I think a lot… I think the lion’s share of folks are just trying to grapple with like, hey, how would I stand up a simple cluster against this Kafka topic, which is the most common use case, and how do I process a stream of data with some reasonable computation? And for the most part, it’s not complicated. Maybe they’re just doing simple aggregations. We see use cases all the time, right? Like, that are… You know, let me aggregate and filter this data from this Kafka topic so I can put it into an expansive data warehouse like a real time ETL-type use case or something like that.
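That “filter and aggregate, then ship to a warehouse” pattern is simple enough to sketch. In Flink SQL it would be a single streaming query; here is a plain-Python stand-in for the shape of the computation, over a finite batch for illustration (the event fields and values are made up):

```python
# Hypothetical clickstream records as they might arrive from a Kafka topic.
events = [
    {"user": "a", "page": "/pricing", "ms": 120},
    {"user": "b", "page": "/docs",    "ms": 80},
    {"user": "a", "page": "/pricing", "ms": 200},
    {"user": "c", "page": "/404",     "ms": 5},
]

# Filter out noise, then aggregate per page -- the same shape as a
# streaming SELECT page, COUNT(*), AVG(ms) ... WHERE page <> '/404'
# GROUP BY page, before loading the result into a warehouse.
counts, totals = {}, {}
for e in events:
    if e["page"] == "/404":          # the filter step
        continue
    counts[e["page"]] = counts.get(e["page"], 0) + 1
    totals[e["page"]] = totals.get(e["page"], 0) + e["ms"]

# page -> (count, average latency)
summary = {p: (counts[p], totals[p] / counts[p]) for p in counts}
print(summary)  # {'/pricing': (2, 160.0), '/docs': (1, 80.0)}
```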
KG: So those aren’t super complicated kind of processors, they’re relatively simple. But even then, that requires I think kind of the sea change from typical database mentality. It is not a database. You’re not going to just write code. It’s not just writing a cursor against a table and you’re returning data and shoving it somewhere. You do need to understand the complexities of state. And like I said, that window sizing is the… It’s always like… That’s like the anti-pattern number one, the first thing that everybody messes up. But I think from that standpoint, it’s interesting. I think we’re early… Even with those stats that you just mentioned, which I didn’t actually know were that great, it’s early days, and I think the next five years is going to be nuts.
LD: Yeah, I think it is, too. And one of the reasons why it’s going to be nuts, so we’ve covered a little bit of this as we’ve talked about some of the pitfalls which I think everybody just needs to know going into working with Flink, but there’s a reason why it’s so popular, there are a ton of benefits. And we talked a little bit about state, that’s obviously one of them, and you can dive into that more, but there’s a reason why Flink is growing the way that it’s growing.
KG: Yeah, for sure. I think the nexus of containers and Kafka and streaming in general have driven folks to kind of… Like, this design pattern when then it’s like, “Hey, I just want to write a simple processor for that.” Like, my computations, in theory, it’s… Everybody says, “Oh, it’s only 10 lines.” But it’s not. [chuckle] But in theory, yeah, in theory, the processor I want to… The data I want to emit from the stream is relatively simple, and I want to put together a piece of compact and discrete code to run this processor. And why can’t I? And Flink’s a great framework for that. There’s other things out there, obviously, that are good, too. I mentioned Storm earlier, but even things like Apache Beam, which can use clouds, or can even use Flink as its runner. There’s a lot of options out there, but writing stuff on Flink is I think a great first step for folks, and ultimately, for a lot of folks, scales and grows with them as they evolve.
LD: Yeah. No, that makes a lot of sense. I think there have also been some really great improvements in some of the recent releases of Flink. We’ve seen them internally, and our customers are able to benefit from them. Maybe if somebody tried Flink a year ago, they might kind of shake their head at it and go, “Yeah… I’m not so sure I want to go down that route.” But there’s been some great stuff released. Talk to me a little bit about what some of those features and functionalities have been, and what it means for organizations that are either growing and scaling their Flink usage or just trying to get started.
KG: Right. Yeah, good question. In the early days, it was all Java 8. And so, Java 9 support is just coming up to speed in Flink with the new 1.9 release, which is cool. So that’s exciting, but it’s still kind of early in that adoption curve. There are still some things that just don’t work, but I think the community will get there. I think the evolution of the state backend has gone through a number of changes and a number of iterations, from being very basic to being very robust, so that’s nice.
KG: If you didn’t know, Alibaba was a huge contributor to Flink and continues to be. Obviously, they’ve now… All the Ververica team is part of Alibaba. And so, they’re obviously a huge force, and most of the committers for the project. And early on, they created the Blink planner for the SQL API. And that’s just starting to get merged in now in the 1.9 and going forward. And I think that’s a super exciting area of development. From our standpoint, SQL is super important, SQL is a big part of our product road map, SQL is a big part of our expertise and kind of our thesis and beliefs around streaming data. And so, to see those Blink planner changes come in, that’s really cool. If you’re not familiar, the planner is, think of it as the optimizer in a regular database. Having that, the Blink planner means that we have just more SQL options, more comparison operators, more syntax capabilities, things like way more functions than we had before. So it’s bringing kind of a level of robustness to the SQL language and grammar that I think is welcome and is super exciting.
LD: I would agree with that. There’s a lot of really good things that have come out of it, but on the flip side of that, there’s definitely some things that it’s still missing and there’s work that the community is doing with Flink right now to add more features and functionality. If you could snap your fingers and have something added to Flink, both for us with the platform but then also for our customers to use or anybody to use, what would some of those things be? What do you think would make it more robust?
KG: Yeah, that’s an interesting one. I was just looking a while back. I think there’s this interesting trend and paradigm with streaming data where machine learning and data science is sort of overlapping with streaming data. And I suspect it’s because data science and ML workloads are starting to need real-time data because companies are putting their most exciting data, I’m just going to call it their most exciting data, into streams. That’s stuff that’s happening in the here and now, right?
KG: And so, if you can take advantage of that in your machine learning models, even if it’s just an internal analytics user, so it’s more of a BI or something more simplistic, or maybe it’s actual data science and people are doing something like training models, I think that’s the next frontier and where things are going to be super exciting. In Flink terms, one thing I was thinking of is FLIP-39. That is bringing the ML libraries into Flink: using them to train models, having things like k-means and linear regression algorithms and all those libraries built into Flink, being able to use those within the framework, building true training pipelines to train the models, and then being able to take those results and use them in a production application. That’s kind of the holy grail for machine learning. At least, in my rudimentary knowledge of data science.
KG: And I think from a data engineering perspective, which is in my wheelhouse, I think that’s where we can help the data science community have the tools they need. I think they obviously struggle with getting data feeds and cleansing data and spend less time doing data science and more time messing with data. And if we can help them by building libraries in, providing self-serve applications and self-serve platforms like Eventador and allowing them to have the libraries and tools they need to build those models and train models, I think that’s super exciting. There’s a bunch of those on the roadmap for Flink, a bunch of that kind of stuff. And I think the FLIP-39, check it out, is I think a pretty exciting one.
LD: Very cool. Let’s peer a little bit into the future. If you were to look into your crystal ball, what do you think in a year, two years, whatever it might be, what do you think these pipelines are going to look like? What do you think that our customers or anybody’s streaming pipeline is going to look like with Flink involved, whatever it might be? Is it they’re just scaling up, is it more and more complex jobs? What is it that you think folks have to look forward to?
KG: Yeah, I don’t know. I think a couple of things are known to be true. We know that companies, in order to be more competitive, are putting more streaming-type data into things like Kafka. And they’re building enterprise data buses, to bring an old-school term back to the forefront again. And a reasonable and real source of data for your project, if you’re an internal programmer or developer, data scientist, engineer, whatever, a credible source of data now, is your Kafka bus or your streaming bus. Whether it be clickstream, or maybe your company is now involved in IoT, or maybe your company has new sensors out as part of its own product. Network security companies are an obvious example, and banking and financial institutions are another. Those kinds of companies are obviously dealing with streaming data day in and day out, adding more events to their bus so that they can have better context around their customers, and then distilling that down and building simple computations in a distributed manner… Microservices have made a huge change in how we all can develop and write code against streams of data. No longer are there these monolith applications. You can just write a processor and deploy it, and it has some logic for some application downstream somewhere. I think that’s super exciting, and I think that’s going to continue to grow.
KG: And I think if you’re not using streaming data and you’re thinking about it, your competitors are either thinking about it or already doing it. And I think that’s the real interesting thing for me: if you’re in a data science group or a data engineering analytics group in one of these companies, your peers are already thinking about the same things you are. You go to the meet-ups, or at least right now, you’re not going to meet-ups, but [chuckle] when you interact with your peers, you can tell people are really excited about these kinds of workloads and building computations on streams.
KG: And I think that’s just going to continue to grow, and to some degree, kind of take over for a while. I think we’ve all kind of realized, as a community, as a data community, the limitations of batch and the Hadoop infrastructure. Spark is still strong, but I think message-by-message computation is a very exciting area. And I think streaming has a very bright future because of that. I think Confluent’s obviously driving Kafka adoption through the roof. And I think there’s a lot of cool startups, especially around the SQL area that I like to talk about. There’s Materialize, there’s us, there’s even ksqlDB, and there are others who are really starting to take up the charge around SQL and streams. I think that’s super exciting and I hope to see that really grow, ‘cause that, at least for me, is a big, passionate part of this puzzle, and I think a very powerful one for most organizations.
LD: Yeah, I would agree. Well, I think as we’ve talked about, Flink is an exciting topic, it’s one that is… We’re hearing more and more lately from the folks that we’re talking to, obviously, as we talked about, that committers are growing as well. If you haven’t seen, Flink Forward is doing their virtual conference later this month. And so, Kenny is actually speaking on April 22nd. You can hear him talk more about Flink and the Eventador platform then or check out the previous Flink Forward videos for his talk that he did in Berlin last year. So thanks, Kenny.
KG: Yeah, yeah. Thanks for having me.
LD: Well, we hope we answered some of your questions about Flink today, and just maybe we invited a few questions along the way. If we did, you can always reach out to us at firstname.lastname@example.org with any of those questions that you may have. Or if you’re starting to dip your toe in the waters with Flink, we did just introduce a new professional services offering that lets our expert engineering team who work with Flink day in and day out, help your organization build, scale, and optimize your Flink or Kafka-based streaming systems. And as always, you can get started today with a two-week free trial of the Eventador platform with the included runtime for Apache Flink by going to eventador.cloud/register. Thanks again for listening to this episode of Eventador Streams.