Podcast: A Dive Into Apache Heron & Understanding Streaming Systems with Josh Fischer and Ning Wang

July 7, 2020 in Eventador Streams Podcast




In this episode of the Eventador Streams podcast, Kenny and I chatted with Josh Fischer, Senior DevOps Developer at ScotCro Management Consulting, and Ning Wang, Software Engineer at Amplitude, about Apache Heron and their new book Grokking Streaming Systems, which you can get a 40% discount on with the code podeventador20.

Learn more about Apache Heron, its use and its benefits, as well as why it was important to Josh and Ning to write Grokking Streaming Systems as an introduction in this episode of Eventador Streams.


Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play, Spotify, and more. Happy listening!

Episode 11: A Dive Into Apache Heron and Understanding Streaming Systems

Leslie Denson: So we’ve talked about Flink, Kafka, Beam, and more, so isn’t it about time for us to learn more about Apache Heron? So Kenny and I are joined today by Josh Fischer and Ning Wang to chat all about Heron, as well as Josh and Ning’s new book, “Grokking Streaming Systems,” in this episode of Eventador Streams, a podcast about all things streaming data.

LD: Hey, everybody. Welcome back to another episode of Eventador Streams. Today, Kenny and I are joined by Josh Fischer and Ning Wang, who are the co-authors of the new book, “Grokking Streaming Systems,” as well as both having really great backgrounds in history and actually working with streaming systems, so we’re really excited to have them on the line today. Hey, Josh. Hi, Ning. How’re you doing?

Josh Fischer: It’s a good day so far.

Kenny Gorman: Well, we’ll change that.

LD: Why don’t you both tell us just a little bit about yourself, a little bit of your background, how you got to where you are now, and how you got to where you were writing this book?

JF: Sure. So I have been working in software for about the past five or six years. Just like everybody else, just had an interest in software, got self-taught. Found that I was really bad at UI type stuff, I couldn’t center… to save my life. Still can’t to this day. So I got more into the data side of things, went to APIs, and ended up getting involved in some open source project that was built out of Twitter. It was a streaming engine called Heron, it’s still around. And over the course of time, I ended up losing some of my personal life and investing some time into this project, and then I’ve just been kind of stuck there ever since.

LD: Ning, how about yourself?

Ning Wang: Hi, my name is Ning. Yeah, I’ve been working on software for quite a while. It’s very interesting. So I like inventing things without buying all the materials. All I need is a computer, that’s good. It’s kind of convenient. Right now, I’m an engineer at Amplitude. It’s a product intelligence startup, so I’m working on the data pipeline there. And before, I was working at Twitter in the real-time compute team. So we are basically the owner, the maintainer of the Heron project. I think data processing is an interesting topic, not only streaming but overall, data processing. There is just so much data these days, so it’s a fun area to work on.

LD: Super agree with that. There’s a reason why we’re in the business that we’re in as well. It’s interesting how things have, even over just the last year or two, have changed so significantly when it comes to overall data processing, especially stream processing. I’m sure you guys have seen that especially working with Heron and Twitter, which has obviously massive streaming data pipelines. What are some of the ways that you guys have seen… We always like hearing from the folks who are on the podcast about the way that they’ve gotten into streaming and how streaming has evolved since they’ve done that. What are some of the ways that you guys have seen it evolve, whether it’s use cases or the technology or whatever that might be?

JF: As far as use cases, I think that they are expanding every day. And as the technology is growing, it’s becoming much easier to use. Even two years ago or even a year ago, it was really hard to reliably create a streaming job. There’s just… With any new technology, it’s great, it’s awesome, but there’s always some trade-off that comes off, right?

LD: Right.

JF: And one of the biggest trade-offs in streaming jobs, really, to me, is the whole reliability of your job is, how do you guarantee that that event that hits that stream is gonna be processed correctly, and then that the results are gonna be put somewhere? ‘Cause you just don’t know.

KG: Like there’s a big difference between at least once and exactly once, a very big difference.

JF: There’s a little bit of a difference, right, and you know that… And everybody wants to put these mission-critical systems on these streaming jobs, and I think they’re great candidates for it, but there’s a lot of stuff that’s gotta be taken into consideration before you make that move. And over the years, the evolution of frameworks, like Kafka Streams or Heron or Twitter or Apache Storm, they’ve all kind of wrapped that really hard stuff to do under the covers and take care of that for developers so that they can just easily use their APIs.
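Editor's illustration (not part of the conversation): the gap between delivery guarantees can be sketched in a few lines. Assuming a source that may redeliver an event after a retry (at-least-once delivery), a job that simply sums amounts overcounts, while a job that de-duplicates on an event id gets the effect of processing each event once. The function names, the event shape, and the in-memory `seen` set are all hypothetical; real frameworks use checkpointing or transactional writes rather than this simplification.

```python
# Hypothetical sketch: at-least-once delivery may redeliver an event;
# de-duplicating on an event id gives an exactly-once *effect*.

def deliver_at_least_once(events):
    """Simulate a source that redelivers every event once (e.g. after a retry)."""
    for event in events:
        yield event
        yield event  # redelivery of the same event


def total_naive(stream):
    """Sum amounts with no protection against duplicates."""
    return sum(amount for _event_id, amount in stream)


def total_deduplicated(stream):
    """Sum amounts, skipping event ids that were already processed."""
    seen = set()
    total = 0
    for event_id, amount in stream:
        if event_id in seen:
            continue  # duplicate delivery, already counted
        seen.add(event_id)
        total += amount
    return total


events = [("e1", 10), ("e2", 5), ("e3", 20)]
naive = total_naive(deliver_at_least_once(events))
deduped = total_deduplicated(deliver_at_least_once(events))
print(naive, deduped)
```

The naive total is double the true value because every event arrives twice; the de-duplicated total matches the original events.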

KG: Yeah, we actually started using Storm with, I think, the libraries from Parse.ly on top of it for Python, way back when. And it’s mostly ’cause we were Python nerds, and we’re data nerds, and Python was a good language for that. And that worked pretty good. Ultimately, we switched away in favor of Flink, but boy, did we miss being able to use simple Python to build jobs, and the language you’re familiar with is oftentimes the one that’s gonna get you to the finish line just ’cause of your familiarity, even if it’s a non-perfect fit. And that was definitely true in that case, but, yeah… And back then, this is like three years ago, it seems like it’s ancient now as you were saying, to underscore your point, but yeah, it was really hard to do things correctly. It felt unnatural for the longest time. I think we’re finally starting to get there though. Is that how you guys feel?

NW: I think basically that’s the direction, to make life easier. Like streaming systems, they solve a particular set of problems for us, it’s basically just a tool. And how to make it easier. This thing is kind of complicated underneath, so how to make it easier for people to use and to maintain is definitely a critical part. And I think that’s the progress they are making, what all the frameworks are trying to achieve.

KG: Yeah, yeah, I agree.

LD: With that, we’ve talked a lot on the podcast, and we talk internally a lot about Kafka and Flink, and we know the usability improvements that have been made with that, but one of the things we haven’t had a chance to talk about a lot lately that you both have the background with is Apache Heron, and how that fits in within the streaming ecosystem. So would love it if you… For those of us, and our listeners out there, who maybe don’t have as much of a background in it, would give an overview of, again, what that is and how it fits in with the ecosystem.

JF: I will compare Kafka and Heron. My experience with Flink has been a little minimal, but I think Ning can expand on that. So Kafka started off as a pillar for processing events through a system. It was a queue, right? So everything would go to this thing which is Kafka and then something else would pull it off. So it’s basically like a hub for all your data to go through. That was the original… If I remember correctly, that was the original reason for Kafka. And then the popularity of streaming jobs started to come up and all these… Apache Storm was coming up and getting popular. And then Kafka came up with Kafka Streams, which is these lightweight functions that pull the events off the queues or off the topics and then place them on another queue. So now, Kafka is like a self-service system, in a way. It’s a queue and it has data processing jobs or functions that do stuff.
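Editor's illustration (not part of the conversation): the "function that pulls events off one topic and places them on another" pattern can be sketched without a broker. This is not the Kafka Streams API; the two `deque` objects are in-memory stand-ins for topics, and the event format and `process` function are hypothetical.

```python
from collections import deque

# In-memory stand-ins for two topics; a real deployment would use a broker.
input_topic = deque(["signup:alice", "click:bob", "signup:carol"])
output_topic = deque()


def process(event):
    """Hypothetical transformation: keep signups and upper-case the user name."""
    kind, user = event.split(":", 1)
    if kind != "signup":
        return None  # drop events we are not interested in
    return f"welcome:{user.upper()}"


# Pull each event off the input topic, transform it, and publish the result.
while input_topic:
    result = process(input_topic.popleft())
    if result is not None:
        output_topic.append(result)

print(list(output_topic))
```

The job itself holds no data of its own: everything it knows arrives on the input topic, and everything it produces lands on the output topic.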

LD: Right.

JF: So the nice thing about Kafka is it’s a stand-alone thing. Where I see Heron… Well, the difference of Heron is, I see Heron as the net that connects all the different data silos across an organization. With any organization, especially legacy ones or even ones that haven’t been around for that long, you get these different silo data sources for whatever reason, and it’s really hard to connect the dots across all of them. And for me, I see Heron working really well as the mesh that connects all those different data points across data silos to make decisions on where we cross. Heron is not a queue or a pub/sub mechanism by any means. It’s more of an underlying daemon that just processes the events as you need.

LD: Right.

NW: For Heron, I did do some more research, comparing Heron against the others. I think like we’ve mentioned before, all the systems, data streaming systems, and other systems and also different streaming systems, they have their own pros and cons, so they have their own tradeoffs. For Heron, it’s more kind of a re-architecture of Apache Storm based on some pain points Twitter had before. So for example, the maintenance or even debugging can be hard since the system is like a black box. So for Heron, it basically decouples many modules, and if you have any issue, debugging can be more straightforward. It’s like this component, you just worry about this single component instead of this whole thing. And also Twitter has a lot of streaming jobs and they are… They were under… They were implemented with the Apache Storm API. So backward compatibility is really important for Heron, so that’s basically the problem Heron wants to solve. On the other side, compared to the others, I think… Well, my fear is… Because this is pretty much a configurable and modularized system, on the other side, for development and operation, the learning curve can be a little bit steep compared to the others since, I think, Kafka Streams, Flink, they are fairly easy to use.

NW: You’ll start the cluster and have a simple job deploy. For Heron, you need to understand a little bit more so that when you operate in production, you’ll need to know them anyways. So that’s kind of why a lot of the decisions are made for Heron. It’s like the good and the bad stuff and there are many tradeoffs, that’s the reason behind it. And the old systems, there are… I think it’s pretty fun to compare the differences, so that gives you… Especially if you compare it with some historical reasons, many of them make sense this way, and we’ll see. I think in the future there will be more… Try to have a better one, like better systems in future.

KG: Yeah, and I’d add one actually. Just to underscore what you’re saying, Ning. I think if you’re looking at a typical topology enterprise-wide and you’re thinking… Maybe you’re using Kafka as part of your… You’re using a distributed log, maybe it’s Kafka… Probably it’s Kafka, and you’re trying to process those events like you say, Josh, between different data points in the enterprise, the separation of concerns is important, I think, from an enterprise architecture standpoint. If you wanna scale your Kafka cluster because you’re publishing a certain amount of events per second and you wanna retain those events for a certain timeline, that’s not always the same vector in which you wanna scale, say, some processing componentry like Heron or Flink or anything else, even if you’re writing your own micro-services.

KG: And so I think if you’re thinking enterprise-wide and you’re thinking big, being able to separate where your data lives and how it lives in that particular realm, like in Kafka in a distributed log structure, and then how you process that and deal with it in the enterprise, maybe you’re pushing data to S3 or maybe you’re feeding some other micro-service, each one of those has its own ability to handle late data, each one of those has its own ability to deal with a certain rate of data, maybe it’s down or whatever. And building a responsible infrastructure requires, I think, the separation of concerns. And that’s why, at least for me personally, when you think about these topologies, I like the idea that something like Heron can live in the middle, or maybe you call it on the top, of Kafka in this enterprise, in a greater enterprise architecture. Because if you’re a DevOps, you have much more control over how you manipulate data from various different sources and sinks, to use a Flink term. And I think that’s important, so I think that’s another… If I could humbly add one to your list, that’s what I’d say.

JF: I can humbly agree. Thank you.

LD: So what are… I’m sure you both have seen some… ‘Cause I know we’ve seen some really interesting use cases come through and we’ve also seen some really interesting best practices and worst practices happen as we have been working with folks. So talk to us a little bit about, with what you guys have done, Josh, you said it really well earlier, the use cases are expanding every day. What are some of the ones that you guys have seen that keep you really interested and excited about what you’re doing? And on that, what are some of the best practices that you guys have to recommend to folks, is there, whether it’s on the developer side or the DevOps side working with any of these technologies?

JF: Good question.

LD: I try to stump people every so often.

JF: So the use case that really got me involved in streaming was real-time fraud detection. So, a credit card’s swiped somewhere, anywhere in the world. That card’s gotta go through a series of rules to make sure that that transaction is not fraudulent before the customer is charged. And so that, to me, that whole use case, which was a big part of my life for a couple of years trying to work on it, was what really got me interested in it because you learned… The streaming part of it’s great and it’s very interesting about how all these systems work underneath the covers, but then you also get into the domain knowledge of how does fraud detection work. And then you gotta piece fraud detection and streaming together, they actually come up with a working product. That’s what got me interested in it. And as far as advice on getting into streaming, it may not be the answer you’re looking for but I would say stay as simple as you possibly can until it doesn’t work. Streaming systems are great. They can really move some mountains for organizations, but there is a cost that you have to be prepared to take on before you can make that leap. You really have to understand the pros and cons of them.
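Editor's illustration (not part of the conversation): "a series of rules" applied to each card swipe can be sketched as a list of small functions run against every transaction. The `Transaction` fields, the rule names, and the thresholds are hypothetical and not taken from any real fraud system.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    card_id: str
    amount: float
    country: str
    home_country: str


# Each hypothetical rule returns a reason string when it flags the
# transaction, and None when the transaction passes.
def large_amount(tx):
    return "amount over limit" if tx.amount > 1000 else None


def foreign_country(tx):
    return "outside home country" if tx.country != tx.home_country else None


RULES = [large_amount, foreign_country]


def evaluate(tx):
    """Run the transaction through every rule and collect the flag reasons."""
    reasons = []
    for rule in RULES:
        reason = rule(tx)
        if reason is not None:
            reasons.append(reason)
    return reasons


ok = Transaction("c1", 28.0, "US", "US")
suspicious = Transaction("c2", 2500.0, "FR", "US")
print(evaluate(ok), evaluate(suspicious))
```

In a streaming job, `evaluate` would run once per incoming event; an empty list means the charge can go through, and a non-empty list could trigger a block or a "Was this you?" message.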

KG: That’s good advice.

LD: That is really good advice. You kinda wanna walk before you run because, to your point, when you look at the fraud use case… Fraud use case is actually one that we talk about a lot. Internally, we have a lot of customers who are using us for fraud, and the difference in a well-working fraud pipeline and a not well-working fraud pipeline can mean the difference of millions or more dollars and the reputation of your company or any other number of things. And it can be really bad. And if you don’t…

KG: It’s a serious use case. Yeah.

LD: It is a serious use case. And it is not… I laugh in a lot of these and I sort of make fun of marketing, but it’s not like, “Oh, clickstream analysis. We’re seeing how people are interacting with the website.” That’s really important, don’t get me wrong. It’s how I do my job. But fraud can mean millions. And so having something that you know works and then slowly kind of scaling it up to do more and more, yeah, that makes a lot of sense.

KG: It’s interesting, Josh, you brought that up because when we were start… You’ve all probably seen it, right? You get a text on your phone that says, “You just spent whatever, $28 at the gas pump. Was this you? Yes or no?” And that happened to me. I first experienced that when we were getting started with Eventador, and we were starting to really dig into the streaming stuff, and it was very exciting. And I had one of those things happen when I was at the gas station. And I thought to myself, “Okay, that’s cool that happened in real-time and they’re checking in real-time.” But that wasn’t the real cool thing. That was part of it. I was like, “Okay, streaming data, super important, super neat.” But to me, the thing was is that it wasn’t just like some unsupervised machine-learning algorithm decided that that was bad or whatever. Somebody had figured out, “Well, why don’t we just ask the user, ‘Was that you?’ and let’s connect with our customers in a different way that we had done before.”

KG: And for me, the thing that absolutely blew my mind was it was a short text message, whatever many characters, I don’t know, 20, 15. It happened near instantly. It engaged me as a customer to trust the vendor more. So it was like, “Was this you? I just saw something happen.” And I’m like, “Yeah, it was me.” And then the next time I swiped my card at the pump, it works. So I had this new connection with my bank that I hadn’t before. And I thought, “Holy crap. That is the next generation of use case. That is so amazing.” ‘Cause they didn’t just do the anti-fraud measures, they actually connected with me better than I had been connected with them before. And that was the total, “Holy shit,” moment for streaming data for me. I’m glad you brought that up ’cause I have forgotten until Josh was like, “Fraud data is awesome.” Yup, it is.

LD: Ning, how about you? What is it that you’ve seen that’s been really interesting to you and made you really love this? I’m sure with your background especially at Twitter, I can only begin to imagine the kind of streaming…

KG: Yeah. I wanna hear about Twitter…

LD: Stuff that you guys heard.

KG: Yeah. I wanna hear about Twitter workloads.

NW: Yeah. Workload is kind of high, I don’t want… It’s a little bit risky to talk about numbers, but yeah. Having a lot of data for sure and also a lot of systems. Yeah. We are like a real-time compute team, almost the engine part, so we don’t really work on the specific feature or job really. Basically, we are supporting our colleagues. So I may not have a good example there. On the other side, my advice is more like don’t trust the engine. I think this is true for a lot of big data stuff because the system itself is very complicated. The interface may be simple. Like if you look at your code, just a few lines, unused ones, but there’s no magic, underneath there are many things working together. So keeping things simple is definitely important. Otherwise, if you find that anything’s wrong, it’s not easy to fix, not like if you have a simple Java program, you can fix it by yourself. Debugging and investigation can be really hard, so keep things simple and don’t trust the engine or the framework too much. There are limitations and you don’t know where the limitation is. And hopefully, if you are lucky enough, you don’t need to push to this limitation too much to make your thing reliable. For Twitter’s case, it is a little bit tricky because of the data size. Here and there, they may see the limitations but then that’s the real work.

KG: It’s interesting that on our last podcast, we talked to data scientists, and they were talking about training models and then deploying into production. And it’s interesting because, and I think it’s fairly common, data scientists will train off of some historical set, like maybe they’ll pull out of a database or whatever, or CSVs or something like this. And then they deploy that into production, but then they read a stream of data in some way or maybe it’s a sort of micro-batching in its design. But those two data sources are oftentimes orthogonal. How did you get it? What’s the lineage? Even just comparing schemas and things. So it was interesting when you say, “Don’t trust the data.”

KG: I don’t trust the data anyway. Their whole careers are built on not trusting data. But ultimately, I think you’re right. And especially with something like data science where they don’t have… It’s not like they have a connection to a relational database and a well-known schema and they’re running SQL queries, and it’s repeatable and understandable, and the business reasons or the business terms for the columns in the database are all well-known, things like that. Oftentimes, streaming data from a schema perspective is nuanced and opaque. And I said this last time, even in a big organization where maybe they’re using Avro and have a well-defined schema, it’s still a mess for most folks. And so I think sage advice, just kind of repeating back what I heard, and I think is spot on and is super interesting, is just don’t trust the data. And I guess my follow-up is, do you guys do different things because of that? What systems do you… One thing would be to use a lambda architecture or something like that to rectify data. What are your thoughts around how you deal with that if you don’t trust your data?

NW: Yeah, yeah. At Twitter, I saw lambda architecture, heavily. I think it’s like impossible to keep things simple because usually one of these is fairly expensive and they’re more work to do, maintenance work or some other works. So that’s why Twitter is… Basically the real-time pipelines, you’re seeing simple solutions maybe. And then there’s a batch pipeline, more complicated or more accurate calculation happens in the batch pipeline. So they keep the overall system simple. Well, simple. Now, that’s a different story. Some people don’t like lambda architecture because they have two jobs to maintain, I think. But from a calculation point of view, the benefit of these two systems are maximized in two different directions. So they try to keep the real-time pipeline as real-time as possible and keep the batch as accurate as possible. So this is just one way. Personally, I like it but it’s really some personal opinion. Some other people may not like as much or may not like it at all, but I think it’s reasonable.
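Editor's illustration (not part of the conversation): the split Ning describes, a real-time pipeline that stays fast and a batch pipeline that stays accurate, can be sketched with two counting functions and a view that prefers batch results where they exist. The event log, the duplicate delivery, and all function names are hypothetical; this is a toy model, not Twitter's setup.

```python
from collections import Counter

# Hypothetical event log: (day, event_id, hashtag). "e2" was delivered twice.
log = [
    ("mon", "e1", "#heron"),
    ("mon", "e2", "#heron"),
    ("mon", "e2", "#heron"),
    ("tue", "e3", "#heron"),
]


def speed_layer(events):
    """Fast path: count every event as it arrives, duplicates included."""
    counts = Counter()
    for day, _event_id, tag in events:
        counts[(day, tag)] += 1
    return counts


def batch_layer(events, days):
    """Slow path: recount the given days from the full log, de-duplicated by id."""
    unique = {(day, event_id, tag) for day, event_id, tag in events if day in days}
    counts = Counter()
    for day, _event_id, tag in unique:
        counts[(day, tag)] += 1
    return counts


def serving_view(events, batched_days):
    """Prefer batch results where they exist; fall back to real-time counts."""
    view = dict(speed_layer(events))
    view.update(batch_layer(events, batched_days))
    return view


view = serving_view(log, batched_days={"mon"})
print(view)
```

Monday has already been reprocessed by the batch job, so its count is the corrected one; Tuesday is still served from the real-time count until the next batch run covers it. The cost is the one raised in the conversation: two code paths to develop and maintain.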

KG: And how do you see the future of that? If you guys look forward, I don’t know, a few years… Five years sounds like a long time in the streaming world, but if you look forward a few years, what does that look like in your mind? Do we as a group, as streaming engineers and data engineers, do more lambda style, like you said, separation of concerns around accuracy versus latency? Does that scale? What are your thoughts there?

NW: I think scale-wise, this is easier to scale, this lambda architecture. Because you have two systems, they have their own focus. You don’t need to worry about everything in the same place. I think the challenge here is more about the capacity, you do have two systems to maintain. Development, maintenance they are both extra work, extra burden. So I think hopefully… My personal feeling is the direction is to make this seem as simple as possible, as reliable as possible. That’s the overall direction, I guess, that’s the same for many things. Lambda architecture is the same, trying to make it simpler, that could be helpful for many people. And streaming is just like a tool. To me, it’s not really better than the batch process or worse. It’s just like some use cases. I think there will be more on the more integration. Like lambda is integration of streaming and the batch. I think better integration and the better… Like a simpler solution is still… Hopefully… That’s my hope.

KG: Yeah.

NW: Just the direction to me.

LD: That makes a lot of sense. And that’s something that we’ve talked about actually on the podcast and internally a lot is it’s, A, you can look at it in one way where you can sort of use all data as a stream, even if you’re looking at batch data from a database, or you really need to be able… One doesn’t really exist without the other in whatever form that you’re using. So being able to make it easier to access in general and give people more power with it.

KG: Wait, I just wanna know, are Josh and Ning on the same page about lambda architecture? Is there some inherent disagreement here that we could capitalize on?

JF: Ning and I disagree all the time. It averages out…

LD: You guys obviously are passionate about streaming architectures because you’ve now written a book and are writing a book about it. So talk to us a little bit about, “Grokking Streaming Systems,” and what was out there that led you guys to look at each other? I’m gonna spill some beans to the audience of what Josh said to me before, which is Josh and Ning have not actually met in person, so clearly, they have something to say about streaming systems that they decided to go in and write a book together. So what drove that for you guys?

JF: About… I guess it was a little over a year ago now, I was contacted by Manning Publications. We just had a very casual conversation, or at least I talked with a guy named Mike Stevens, real nice. We talked about the pros and cons of Heron, how it would work, how big the community is, and stuff like… And just talked about pieces around Heron at the time. And I said, “Well, the community’s grown a little bit, but not as much as it could.” And I think the biggest blocker for a lot of people is that they don’t understand what streaming systems are, how they work, what the issues can be, and what the benefits can be.

KG: Right.

JF: And so they were talking about writing a book for Heron and I said, “I gotta be honest with you. I don’t think that’s what you wanna do. I don’t think it’s gonna do well.” And so we had a few more conversations that… Or we had another conversation… Or I guess further into that conversation, we got into it talking about what would really make a good book, and we talked about doing a framework agnostic book that people can really learn the core principles of streaming systems before they jump into our framework. And then from that conversation, I was like, “There’s no way I’m gonna be able to write this.” And so I called Ning and I said, “Let’s do this.”

LD: We see it also. There is that shade of gray where people can kind of understand what streaming systems are. People understand the concept of streaming data, but folks tend to either really get it and they’re super involved in it and they understand it, or it’s completely foreign to them. So it’s interesting.

NW: It’s hard. It’s hard. I think for me, similar… I think the lucky part… Okay, let’s say it’s lucky just to be positive. The lucky part is I have to work on it because that’s my job, and then I was kind of forced to dig into the implementation and all the concepts at the same time. But I can imagine for some people this is not easy, and especially if you don’t have the pressure like me, like this is your paycheck, it’s maybe harder. So, important for you to jump into if you are interested, but still not as easy to start. So our hope is, this book is… We don’t really care too much about the length, the real different kind of engines, we just want to introduce the concepts so people can pick it up and then they can learn other things easier. That’s our hope. And if they don’t like it… Like, “I’m interested. I just wanna know the concept.” And then it turns out, “Okay, this is boring. I don’t want to spend too much time.” That’s also a good point, good for you, helpful for you hopefully, it saves you some time in your life. Yeah.

JF: Probably the coolest thing about the book is that, since Ning and I haven’t met, we have these totally different opinions and perspective on things.

NW: Like fighting.

JF: Well, it really is. There’s a lot of long conversations that we gotta work through to get to the…

NW: Yup.

JF: But because of that… There’ll be a piece that I’ll write and I’ll think I got it all figured out, it’s perfect, and then Ning takes it and just blows it apart and says, “No, we gotta do it this way, this way, and this way.” And I hate to say it, but he’s typically right every single time. And part of it, it makes better material, it really does.

KG: Yeah. That is interesting. In COVID times, especially if you guys haven’t… You’ve been co-authoring a book remotely, is… That’s actually a fascinating side story.

NW: Yeah. I don’t know if it’s helpful or not. In fact, this working-from-home thing kind of makes things busier. So in fact, we have less time on the book.

LD: Yeah.

JF: Yeah, absolutely.

KG: So I have a question for you guys on Heron in general, and maybe it’s just generic advice, but if I’m thinking about what the audience might wanna hear, if I’m in their shoes, I’m wondering, “When would I use this versus… ” You mentioned Kafka Streams earlier, or Flink, or anything else, writing my own consumer code, or whatever that might be. Give us a couple of minute kind of intro. What’s it great at? When would you think about using it versus other things? And maybe you just talk about its design benefits or maybe some points that you think are particularly good about Heron just as an overview, would be great.

JF: Ning, go ahead.

NW: Okay, I was going to ask you to go ahead.

NW: That’s alright. Okay, okay. I can go. I can go.

JF: Alright, go ahead. Hit it.

NW: Yeah, I think it’s from the… Go back to the design goal for Heron. Heron is like re-architecture of Apache Storm. So that’s the basic story. The pain point is more operation, so if something happens, because at Twitter there are so many streaming jobs and they had problems in different places. So if something happens, investigation can be hard. So for Heron, many things are decoupled from each other, and this can be helpful for the overall operational issues, operational stuff. So that’s the one. So if you have different type of jobs and you care a lot of operation, Heron may be worth a try, but the warning is because of the decoupling, there are so many things to configure.

NW: At the start, you have to understand a little bit about, know more about what each configuration is used for, why this is used this way. I think another one is Heron functionality-wise, it’s just mostly still the old API. There’s a new one, but not as popular, and there are not as many fancy features. So if you need some fancy features or APIs, then this may be better to… May be harder to use Heron. So that’s another one. Finally, finally, it’s not really about Heron, that’s a more personal stuff. If I saw something so good, too good to be true like everything’s good, for me, it’s a negative thing. You have to tell me what is the cost, what is the trade-off. Otherwise, I don’t know if I can believe you or not. I think that’s just a general personal thing. I have trust issues. So yeah.

JF: Yeah, trust issues. Yeah, I can understand.

NW: Your turn now, Josh.

JF: Yeah. So I think I said earlier that Heron really works as the fabric on which data can travel between these different data silos in an organization. To me, it’s the net, the mesh that connects all these different pieces of an organization together to be able to make decisions, to be able to join data, to be able to create events, to be able to do whatever you wanna do. The thing you gotta think about… Well, the plus and the negative side to Heron, I think, in particular, is that for one, Heron is a process-based architecture as opposed to threaded. So because of that, Heron is more heavyweight, costs more money to run. However, it really creates, like Ning said, that separation of concerns. I think Kenny said that as well. For example, if you were running a job back on Storm 1.X, and this is nothing against Storm, this is just an observation I made, is that if you monitor the logs of an executor, you’d see multiple components printing logs onto the same log, and it was really hard to manually, with your eye, parse the logs. You’d have to, you know, different parts of the job would be printing at the same time and it can be very confusing.

JF: With Heron, since everything has its own process, everything prints to its own logs, so it’s easier to read and debug and it’s easier to parse with your naked eye. But the flip side to that, the downside to that, is that there are many more moving pieces that you really have to take into account and you really have to, in my opinion, understand the Heron architecture more than you would something like Storm, because there are more moving, decoupled pieces that can cause issues.
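
The logging difference Josh describes can be pictured with a small toy sketch. It is not Heron or Storm code: the component names and messages are hypothetical, and plain Python lists stand in for log files. It only contrasts one shared log, where lines from several components interleave, with one log per component.

```python
from collections import defaultdict

# Hypothetical components and messages, listed in the order they occur.
EVENTS = [
    ("spout", "emitted tuple 1"),
    ("bolt-a", "processed tuple 1"),
    ("spout", "emitted tuple 2"),
    ("bolt-b", "joined tuple 1"),
    ("bolt-a", "processed tuple 2"),
]


def shared_log(events):
    """Threaded-style layout: every component writes to the same log."""
    return [f"{component}: {message}" for component, message in events]


def per_process_logs(events):
    """Process-style layout: each component writes to a log of its own."""
    logs = defaultdict(list)
    for component, message in events:
        logs[component].append(message)
    return dict(logs)


if __name__ == "__main__":
    print("One shared log:")
    for line in shared_log(EVENTS):
        print("  " + line)

    print("One log per component:")
    for component, lines in per_process_logs(EVENTS).items():
        print(f"  [{component}.log]")
        for line in lines:
            print("    " + line)
```

In the shared log a reader has to pick one component's lines out of the mix by eye. In the per-component layout each log reads top to bottom on its own, at the cost of having more logs (and, in a real system, more processes) to keep track of.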

LD: We’ve talked a lot about what got you guys to writing the book, what got you to where you are now, why you’re working in streaming systems. What is it that you’re excited about with the future of streaming data and streaming systems? And it can be whether it’s a part of the technology, something that you can see coming on the horizon for Heron or whatever it might be, or it can just be a fun use case that you see coming down the pipe. What is it that makes you excited to continue down this path?

NW: Can you go first, Josh?

JF: Yeah. I’ll go, I’ll go. There’s a couple of things that excite me about where streaming systems are going, and where data’s going in general. Right now, the people who are building machine-learning models, like Kenny said, are using some type of spreadsheet, they’re using some data, some static set of data somewhere. And as of right now, that’s fine, it works. But I’m excited to see, ’cause I’ve seen people doing research on this, on how they can train models in real-time using data that’s brought into pipelines. I don’t know what the repercussions of that could be. How do you validate stuff’s working correctly? Because right now, all of the stuff done with machine learning models is really… It’s the practitioner using their best gut feeling on a lot of things. And so, how do you automate that away? How do you use AI to do that? I don’t know the answer to that, but that is what excites me, is to see people building models using real-time data to really increase… It’s just like that text message you got when you swiped your card, to really increase the effectiveness of a business in its domain. It seems really interesting to me.

KG: Totally. Totally.

NW: Agreed. Yeah, okay. My turn. To me, my interest is not only the streaming, I think it’s more like data processing, where it’s like a new technology. I think from the past few years we can see so many more data have been generated, collected, and in the future, it will be the same. I think there will be more and more data. Especially we’ll have 5G, we’ll have Starlink, things like that. The network is improving. We will have more data to process, and streaming to me is not just one type of technology, it’s more like… Make data processing more real-time, so we get the information faster. So, affect your real life, making it better. I think that’s the future. Like, Machine Learning is definitely one thing, Python as well, help you to make your life better. That’s the hope. It may make your life worse in some cases. Let’s hope. Yeah, that’s pretty much the… I’m just hoping for a better life in the future. Future life should be better.

KG: If you think about it, what’s interesting to me about that stat on to the… Leslie didn’t ask me what I think is cool about the future, but I’m gonna tell her anyway. What I think is cool is that there is so many… We were just talking to someone who makes beer, a company that makes beer, and you’d know who it is if I mentioned their name. But what’s interesting is a beer can is a sensor, and if you know its movement or its connection to the other five of them in a group, or you know how much is left in it or how often the accelerometer is going off, you can connect with your customers in new and perhaps strange and eerie and amazing ways. And there’ll be, sure, where the use case just gets abused or doesn’t work or is stupid, but the 1% or 2% where you’re just gonna absolutely delight the customer with streaming data, and they’re just gonna… Like I was trying to say about the fraud alert that I got on my phone, there’s gonna be ways that people get delighted in the future, and I’m doubling down on Ning’s point so hopefully, I’m making a good case here, where customers are totally blown away by the timeliness and accuracy and relevancy of data if we all do our jobs correctly. That’s where my head was at when you were speaking, and you were inspiring me to think about this. I think that’s cool too. And hopefully, you’re right. Hopefully, we make our lives better and when we’re out of beer, more beer shows up or whatever the use case might be.

LD: That sounds wonderful.

JF: But it sounds like it works against you at some point though.

KG: Yeah, right. Or maybe they just cut you off.

LD: Really double down on these points. Sometimes it makes it better, sometimes it makes it worse.

NW: So healthy.

LD: Well, thank you both so much for joining us today. We really appreciate it. Like I said, it’s been super interesting. We’ve had a really great opportunity to chat with guests about Flink specifically, and Kafka, but talking about Heron, our listeners are interested in all aspects of streaming, so that’s been…

KG: Yeah. Super cool.

LD: Really interesting, and we’re excited for the book. And for listeners, check out the show notes wherever you’ve looked at this, and I’ll have more information on the book there as well. Thank you guys so much, Josh, Ning, really appreciate it.

KG: Yeah, definitely. Thanks for chatting.

JF: Thank you, Leslie. Thank you, Kenny. Have a great day.

LD: Thanks to Josh and Ning for taking some time to talk today about Apache Heron, their perspective on today’s data processing pipelines, and their new book, “Grokking Streaming Systems,” which you can find the link for in the show notes along with the discount code for it. And as always, if you’re interested in learning more about us or if you have any questions for Kenny and me, you can reach out to us on Twitter @EventadorLabs or through email at hello@eventador.io. Happy streaming.

 
