In this episode of the Eventador Streams podcast, Kenny and I had the pleasure of chatting with Jesse Anderson, managing director of the Big Data Institute and author of the upcoming book “Data Teams: A unified model for successful data-focused teams,” about what the anatomy of a successful streaming data team looks like.
Jesse’s experience helping a vast range of companies build the right teams and systems for the streaming data job gives him a unique and invaluable perspective on the current state of streaming data teams, as well as what the future looks like for companies that are starting or scaling their streaming infrastructures. We dive into his expertise and advice on the role of data engineers and data scientists on data teams, as well as his thoughts on today’s streaming technologies (including Kafka and Flink) and tomorrow’s (including Pulsar and Druid).
Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play and more. Happy listening!
Episode 03 Transcript: Anatomy of a Successful Streaming Data Team with guest Jesse Anderson
Leslie Denson: Chances are, if you’ve worked on, managed or built a data team in recent years, you’ve either worked with our guest today, heard him speak, or read one of his articles. Kenny and I were stoked to chat with Jesse Anderson, Managing Director at the Big Data Institute, about the role of data engineers and how organizations can and really should build their data teams to get the most from streaming data. Learn more from the guy who, literally, wrote the book and soon to be books, about it in this episode of Eventador Streams, a podcast about all things streaming data.
LD: Well, Jesse, thank you so much for joining us on the Eventador Streams podcast, super excited to have you here as our very first guest. So, thanks for coming. How you doing today?
Jesse Anderson: Good, I’m excited to be your first guest, thank you for having me.
LD: Yeah, any time. So our listeners out there know, we also have Kenny on the line with me, who’s going to be participating in the conversation as well and hitting Jesse with those questions that maybe I just don’t have. So there we go.
KG: Maybe, hello everyone.
LD: So I think most of our listeners probably have at least a good idea of who you are Jesse, considering I’ve seen your talks at Flink Forward. I know you talked at Data Day Texas, there have been several places where you go out and talk so folks tend to know who you are, but for those who may not, give us just a quick intro into you and Big Data Institute, what you have going on.
JA: Sure. As you mentioned, I speak at a lot of conferences all over the world and I write pretty extensively on this, and what I found is that there wasn’t… There were a lot of voices; what I wanted to bring was a voice that was vendor-neutral. I had worked at a vendor, I had worked at Cloudera, so I kind of knew what it’s like to be at a vendor, and then when I left Cloudera and started my company, I made it a vendor-neutral company so that I could really be of service to my customers. They knew that if I made a recommendation, I wasn’t coming from a place of, “And here’s where my product fits…”
JA: Whether my product fits there or not, I will shoehorn that in there, they knew that I’m the best…
KG: That never happens.
JA: That never happens, but on the off chance, [chuckle] on the lottery chance, one in a trillion chance that happens. So people really appreciated that. And then, I’ve obviously specialized highly in streaming. I saw that need when I was at Cloudera, where we’d come in with these problems where people said, “Hey, I want X to happen, and I need it to happen in real-time.” We just didn’t have that, so I saw that need from the very beginning. So I started doing a lot with Kafka five years ago, and I try to tell people, “Five years doesn’t sound like a lot of time with a technology, but that’s ages for something.”
LD: I agree with you.
JA: And as you start to deal with these distributed systems and distributed state, then you get into Flink… And having that deep background allows you to say, “Okay, why would you choose a Flink versus a Kafka?” And I help my clients do that. So in my company, we do everything from architecture reviews to technical training and mentoring. So, I took a different stance where I thought, “Okay, people, you can’t just come in and work with a company for a day or two, or a week.” You actually need long-term engagements with people to see them be successful. And it’s not just on the technical side; I also deal with them on the management side, where I help them figure out the organization of the team, and that kind of leads into my next book, called Data Teams. It’s a unified model for how all three teams should work together. If you only have data scientists, hey, they’re gonna have different problems, and you need data engineering, you need operations, you need operational excellence. Or if your data scientists are doing everything, they’re probably not going to be doing that very well. So that’s basically what I’ve been working on for the past many years.
LD: That’s awesome, and I will say one of your posts where you go in-depth on the role of data scientist versus the role of data engineer where they overlap, where they maybe think they overlap, but don’t actually is one that we’ve passed around internally quite a bit as we’ve been talking about who we talk to, as well, because it’s really interesting and it’s incredibly spot on with what we’ve seen as well. So…
KG: Yeah, I’m actually super interested in hearing your thoughts around the role of the data scientist in the streaming context, ’cause I think traditionally, data scientists have been working in batch on important, but smaller, sets of data, analyzing those things and then producing some result. What are your thoughts around how streaming is affecting the role of the data scientist? What’s that look like in your mind?
JA: It’s a cognitive increase in complexity for the data scientists. So here you have a data scientist; sometimes they’re on the high end of technical experience and sometimes they’re on the very beginner end of that technical experience. And if they’re on that beginner end and you start saying, “And now go do things in real-time,” hey, that’s a big step up in complexity, and so the data scientists may not be able to do that. They may not even be able to think through the engineering implications of that. So here the role of data engineering, at the ideal end, is to completely obfuscate that from the data scientists.
JA: That the data scientists’ code that they’re trying to deal with, they don’t really have to think, is it batch or is it real-time? I know I’ve had clients try to create systems around that, where perhaps you could train in real-time, or do real-time decisioning or real-time modeling. Hey, those are things that you want to make sure that you have good data engineering around, otherwise you’ll fail. Failing in real-time and failing in batch are two very different things. The data scientist is used to, “Oh, that batch process didn’t run. I’ll just re-run it. Fix that code, get it back up there.” You fail in real-time, you’ve now dropped data on the table, you now have problems where whatever system downstream didn’t get its result. So there’s different levels of engineering effort that we really want data scientists to understand. Is that what you’re seeing as well?
KG: Yeah, I mean… Yeah, exactly. And I think… It’s interesting. We’ll chat with folks, and there’ll be, you know, interested parties around the table. There’ll be the data engineer, there’ll be the data scientist team. And the data scientists are like, “Yeah, we know how to pull data out of a database. We’ve done this for years, we put it in Pandas and we do this thing, or whatever it might be. We just wanna run our… but we wanna do it in real-time.” And like you said, the difference… Even just from a pure infrastructure standpoint, trying to pull something from Kafka and give it to a data scientist in a way that is concrete and usable for them, versus a complete paradigm shift, has been a big discussion, has been a conversation around, “Well, how do I actually freeze the data in time?” That’s one of the reasons we started creating materialized views.
KG: How do we freeze that data for the data scientist long enough for their models to actually train or pick up the latest state or aggregate data over time? Whatever those things might be, it has been very challenging to have that paradigm shift, especially if… Like you said, some of those data scientists, from a technical standpoint, might be a mathematician or statistician and are grappling with trying to actually keep up with kind of streaming data’s cutting edge at this point. How do I graft my mathematics knowledge to a streaming firehose of data? And that, in a lot of cases, is a big gap.
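To make the idea of “freezing” a stream concrete, here is a minimal sketch in plain Python. It is not Eventador’s implementation; the class name `MaterializedView` and the `key`/`value` event fields are assumptions made up for illustration. It folds a stream of events into a latest-state-per-key view and hands readers a point-in-time copy that later events cannot change.

```python
import copy
from typing import Any, Dict, Iterable


class MaterializedView:
    """Latest value per key, built from a stream of events (illustrative only)."""

    def __init__(self) -> None:
        self._state: Dict[str, Any] = {}

    def apply(self, event: Dict[str, Any]) -> None:
        # Upsert: a newer event for the same key replaces the older value.
        self._state[event["key"]] = event["value"]

    def snapshot(self) -> Dict[str, Any]:
        # Deep copy so events applied later do not alter what the reader holds.
        return copy.deepcopy(self._state)


def consume(view: MaterializedView, events: Iterable[Dict[str, Any]]) -> None:
    """Apply every event from a (hypothetical) stream to the view."""
    for event in events:
        view.apply(event)


view = MaterializedView()
consume(view, [
    {"key": "user-1", "value": 1},
    {"key": "user-2", "value": 5},
    {"key": "user-1", "value": 3},
])
frozen = view.snapshot()                      # what a model would train on
view.apply({"key": "user-1", "value": 9})     # the stream keeps moving
```

The snapshot stays stable while the live view continues to update, which is the property a data scientist needs in order to train against a consistent state.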
JA: It’s a big gap, and then trying to educate somebody on the difference between real-time state and just state. Like, hey, that’s a big deal. There’s a big difference there. I wrote a post about another technology, talking about the real-time state. And what was interesting about that post was the people coming in and asking questions; they weren’t asking questions from a state like you and I of, “Yeah, I understand that. And here’s what this… ” They were coming at it from this beginner state of, “What do you mean by checkpointing? What do you mean by shuffle sorts?”
KG: Right, okay.
JA: And I’m like, “Oh my goodness. To even answer your question we would have to educate you for an hour or two, or maybe even five hours on distributed systems.” It’s a different level.
KG: You know, you mentioned the state thing. One of the patterns we see, and I’d love your thoughts on, is that there’s this architecture where they say, “Hey, look, we’re streaming. We’re using Kafka, we’re streaming data. Yay for us. But we’re pulling the data, you know, maybe from clickstream, let’s just make up an example. And we’re doing some sort of processing on it, and then we’re sticking it into a traditional database. We’re just putting it in Postgres or something like that. Because how do I read it, how do I use it in my applications, how do I make sense of it?” But the real sad thing there is you’re really just using Kafka as like a distributed ingest pipeline and not using Kafka as actually a real system of truth or system of record. And ultimately you’re not really doing things in real-time, because you’re just putting it back in the database and reading it with, you know, a B-tree index and a fetch pattern, a cursor, and all the things that traditional database systems actually have. Do you see that too, and is that part of the continuum you see from starting off with streaming systems, and growing, and then becoming more advanced? Like, what is the continuum that you see in your travels?
JA: I see the wide spectrum. I see, “I’m more comfortable with my relational database, so I’m just going to shove my real-time data into that relational database.”
JA: And that may be okay for some things, but I have a customer who… We tried to put that real-time data in the database and the database couldn’t keep up. And this is exactly the reason why we have these distributed systems out there. If you’re going to ask difficult questions, they need to be done in a different way. They need to be done with a Flink, for example. And, yes, I have seen Kafka used as a glorified pipeline so that we just get back to our traditional tools. If you had a true big data problem or even a medium data problem, those traditional tools probably aren’t going to keep up with you. And now, if you’re kind of thinking of… Like, I’ll put my CTO hat on or my manager hat on. Okay, we spent $1 million, a million euros, on reinvesting and changing our technology stack, but did you really create a technology stack that’s scalable? If your circuitous route is to this database, and we query right off that database, a relational database will be back to being your…
JA: Single point of failure, your bottleneck. Or should you have really done something else?
JA: And what I find is that I… So there’s the reason, there’s the manifestation we see, what I find more interesting is looking at the root cause. What was the root reason why the team wanted to get back there? And what I find is that sometimes it’s an issue of their definition of data engineer meant DBA.
KG: Right. Yeah, interesting.
JA: And so that DBA needed to get back to a technology that they’re comfortable with. And so their definition of data engineer being DBA was the wrong one. We need our definition of data engineer to be a software engineer specializing in these big data tools, in these real-time tools. And they’ll be comfortable not just in relational databases, but also in the Kafkas and the Apache Flinks, so that we can choose the right tool for the job. And I think that’s one of the biggest difficulties facing management to a certain degree, but also architects and engineers: there’s this bevy of technologies. What is the right tool? And if you just go back to what you know and love, the relational database, hey, that may not be the right solution.
KG: Yeah, yeah, I think we see that all over the place. And we also see it like an evolution, right? So, like, phase one: I’m gonna still use the relational database to power my applications, whatever that might be, but now I’m feeding it with a Kafka stream or pipeline of some sort. And then phase two of the streaming adoption, or whatever you might wanna call it, is, hey, now I’m gonna start to actually write some sort of processor and hang it off Kafka. Maybe I’m using Flink or something. And now that’s my new system of truth, and it’s much faster and it saves cost. And in some cases it’s more complicated. But also, like you say, and I like the way you frame that, if you’re asking those questions that require that kind of a system, then there better be some business benefit to doing it or else you’re kinda wasting your time and money. And assuming there is, then look, that’s, like you said, the right tool for the job. So I totally agree. I think that’s really interesting.
JA: Thank you.
LD: You know, it’s funny, we’re sitting here talking about the change of definition, we’re talking about, okay, does it really make sense to throw it in the database or not? I laugh ’cause I flash back to a previous life before I was with Eventador, and I remember having this conversation with multiple people that we were talking to, from a customer perspective, and it was like, “Yeah, we need data in real-time.” “Okay, so what does real-time mean to you?” “If we can get it from the point where we ingest it to where we can actually access and do something with it in like, I don’t know, an hour to two hours, that’s totally good for us.” And that was only four years ago and now that just seems so outrageous and the idea…
KG: I can wait.
LD: Right. Like, real-time is an hour to two hours. Okay, that’s kind of where you nod your head and you’re like, “Okay, sure.”
KG: Well, there’s the other side of it too, right, where you ask the customer, and this has happened to us before, you say, “What’s your requirement in latency?” And they’re like, “Oh yeah, like two milliseconds.” And we’re like, “Oh really? Well, how many messages do you do?” And they’re like, you know, whatever, billions, trillions. And we said, “Okay, that’s gonna be really expensive and here’s what it’s gonna cost you in people and infrastructure and time.” And they’re like, “Oh, never mind. Second’s fine.” So, I think we see the other side of that too sometimes.
JA: And that’s exactly it. If you’re a data engineer watching this or you’re a manager, take this out of it: as you get a requirement from the business, you say, “This is what it’s going to cost you.” You don’t just haul off and start doing it.
JA: I think it’s our job as data engineers to say, “Here’s what we can do, here’s what we can do, and here’s what we can do. And here are the costs associated with that.” So that if the business comes back and says, “This isn’t fast enough,” you say, “Well, we could do that if you’d like to give us another, let’s say, 2 million a year, for example.”
KG: No, that’s super common. And we also… Everybody starts to like… Well, latency should be… They don’t wanna say zero ’cause they know that’s not realistic, but they’ll pick the next smallest integer in milliseconds and say, “Okay, that’s the requirement.” And as you point out, I think it’s interesting as data practitioners, digging into what that business use case is and really pushing, whether it’s the product manager or whether it’s some sort of architecture group or architect that’s designing that pipeline or building that business case, really trying to understand, “Okay, why are you wanting to do this in real-time?” We’re obviously fans of real-time data, that’s what we do, but at the same time pushing back and saying, how real-time is real-time? ’Cause you’re right, there’s a cost associated with it. There’s a fault tolerance cost that goes beyond that if you have to recover within a certain SLA. There’s all sorts of implications from a data architecture perspective, and I think as you point out, and I think that’s a really rich point, dig deep and figure out what those requirements really are because it’s gonna drive a lot of things. That’s a great point.
JA: Yeah. I think it does a disservice. There was a company saying everything should be real-time because, if you think about it, everything starts real-time, and I thought, “Boy, that is a real disservice to their customers.” And if you start with that mantra, it’s a disservice to your internal customers. It’s a disservice in that you are going to increase your cost, you’re going to increase your complexity. And now that we’re in this COVID period, those chickens come home to roost.
KG: Yeah, that’s right.
JA: So when you get pushback from the business and the business says, “Hey, you’re costing us 2 million or 10 million a year in whatever cost,” cloud cost, let’s say, then you’re able to say, “Yes, we did this because of X, Y, Z.” And if you can’t give a good, solid business reason for doing things in real-time, or if, worst case, you have an outage, hey, that’s a problem. You should have done the upfront work, the leg work, to prove those, and not everything should be real-time.
KG: Yeah, it’s sort of like, you know, it’d be interesting to draw a parallel between modern streaming systems and how the Hadoop… I know you have a background all across data. From the Hadoop side, it was almost a joke for a while. When they put every piece of data they could possibly find into a data lake, then the question was like, “Oh, that’s expensive. How am I getting value from that data? Do I really need this?” And it seems like streaming data has the possibility of being a better architecture for controlling costs, ’cause you can filter based on all sorts of things in the stream. And first of all, streaming data, at least my thesis is, is generally more important and more exciting to the business, ’cause the events are happening in the here and now, than stashing away a year-long log of something into a data lake and really never getting any value out of it. What are your thoughts on, kinda like, those transitions from the Hadoop and batch distributed system world to Kafka and streaming and the value of data? Is that a thing that comes up in your travels and in your meetings with customers?
JA: I’m the one who brings it up. When we talk about data, we establish very specific business value. I had one meeting where I was kind of giving a flow, and the technical folks were talking about, “And this is how we should implement it, and this is how we should implement it.” And I said, “No, this is what we’re gonna back up to. We’re going to establish a list of business priorities with ROIs for them, then we’re going to choose them based on… We’ll try to munge that with how difficult it is to implement, but we are not going to dictate what the business can do. We are going to be flowing out of what business value we can create.” So it’s really important there. I had another, kind of to that batch sort of mentality. I was having a conversation yesterday, we were designing a real-time system with a financial institution, and we were talking about, how soon do you need an insight? And…
JA: And so, those insights could be about customers, those insights could be about other things, but they were doing batch processing of those insights, in this case, a reconciliation. And so, if you’re reconciling 24 hours later, if there was a problem, a consistent problem…
JA: You will know now 24 hours later.
JA: And there’s an inherent cost to that, and when you’re a bank, a financial institution, that cost could be astronomical, because now we’re dealing with money. We’re dealing with hard cost. And I always try to make sure people understand, when you’re a bank or financial institution, there is a different level of engineering that needs to go in, with no disrespect, than into a social media application, for example. Missing a like, missing a retweet, not the end of the world. Losing a million dollars, people may get fired over that, and that’s a different level. So I always try to make sure that people are designing and thinking in those engineering terms. So that reconciliation, kind of coming full circle: we made a design where we would do that reconciliation in real-time. If something was consistently messing up, we would know less than a second later. And that business value… We didn’t have the business professionals in the room to ask that question, but I bet we could have established, “Okay, here is the ROI of doing that.”
JA: They were going to do it anyway, but now we could have an actual number from a business person, where we would ask them and say, “What has been the biggest hit you’ve ever had?” or this or that. And so we could actually say the ROI of this project is a million dollars a year, and that’s the encouragement I try to give to data engineers. Data engineers, often coming from that software engineering background, they haven’t thought in these terms, they haven’t thought in value. A software engineer’s value is generally releasing software: I get software out there. Now that they move into becoming a data engineer, their value is different. Their value is somewhat releasing software, but their value is based on the generated value from that data.
JA: And that’s an encouragement that I’d make to data engineers as they make this transition: think about the value that your data is creating. As you start to calculate your value, or what you do for the company, or what your project will do for the company, it is not saying, “Hey, everybody, let’s go deploy Flink.” That’s how you do that as a software engineer. The value is not that Flink is going to make something easier for you, but rather that Flink is going to do something for the business: it will let you reconcile in real-time, and by reconciling in real-time, we now save a million a year. I really encourage data engineers and managers to think along those terms. It’s a big change, but it needs to happen.
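The real-time reconciliation Jesse describes could be sketched roughly as follows. This is plain Python under stated assumptions, not the design from his engagement: the class `StreamingReconciler`, the `ledger`/`processor` side names, and the “alert after N consecutive mismatches” rule are all hypothetical. Each record is matched against its counterpart as soon as both sides have arrived, so a consistent problem surfaces within a few records rather than in the next day’s batch.

```python
from typing import Dict, Optional, Tuple


class StreamingReconciler:
    """Match two sides of each transaction as records arrive (illustrative only)."""

    def __init__(self, alert_after: int = 3) -> None:
        self.alert_after = alert_after
        # txn_id -> (side, amount_cents) for records still waiting on the other side
        self._pending: Dict[str, Tuple[str, int]] = {}
        self.consecutive_mismatches = 0

    def observe(self, side: str, txn_id: str, amount_cents: int) -> Optional[str]:
        """Return None while waiting, else "ok", "mismatch", or "alert"."""
        other = self._pending.get(txn_id)
        if other is None or other[0] == side:
            # First sighting (or a repeat from the same side): wait for the counterpart.
            self._pending[txn_id] = (side, amount_cents)
            return None
        del self._pending[txn_id]
        if other[1] == amount_cents:
            self.consecutive_mismatches = 0
            return "ok"
        self.consecutive_mismatches += 1
        if self.consecutive_mismatches >= self.alert_after:
            return "alert"  # a consistent problem, not a one-off
        return "mismatch"


reconciler = StreamingReconciler(alert_after=2)
results = [
    reconciler.observe("ledger", "t1", 100),
    reconciler.observe("processor", "t1", 100),
    reconciler.observe("ledger", "t2", 100),
    reconciler.observe("processor", "t2", 90),
    reconciler.observe("ledger", "t3", 50),
    reconciler.observe("processor", "t3", 40),
]
```

A production version would also need durable state and a timeout for records whose counterpart never arrives; both are left out here to keep the sketch short.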
KG: That’s great. You made me think of another area that has a similar characteristic, and that’s real-time experiments. I know you are familiar, but if the audience isn’t, with real-time experiments the most simplistic one is: I have a high-traffic website and I’m gonna change the “Buy It Now” button from blue to green, or something like this, and you’re gonna do it on some subset of your web servers and you’re gonna measure that in real-time. And the reason you’re doing that in real-time is, if customers stopped buying things because the color change was poor, then you wanna know about it instantly. And obviously, a lot of these experiments are much more complicated than just button color, but that’s the general game of real-time experiments.
KG: And I found it super fascinating that folks were running relatively sophisticated experiments and then going, “Oh, my God.” And the amount of money either up or down that it changed the experience, the revenue from their product, whether it be shopping portals or all sorts of e-commerce or even just things like adoption for iOS apps and things like that. It was unbelievable to see the amount of traction in either direction that you can see in real-time from these real-time experiments. I thought that was a fascinating, pretty cool powerful use case, because if you get it wrong, if the design is bad or the button is in the wrong place or people aren’t able to find it or buy, if you figured that out a day later, you’ve lost millions of dollars.
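As a rough illustration of the button-color experiment, the sketch below keeps running conversion counts per variant and flags a variant that is clearly lagging. It is plain Python with hypothetical names (`ExperimentTracker`, `min_views`, `max_relative_drop`), and the threshold check is a simple guardrail, not a statistical significance test.

```python
from collections import defaultdict
from typing import DefaultDict


class ExperimentTracker:
    """Running conversion rate per variant, updated one event at a time."""

    def __init__(self) -> None:
        self.views: DefaultDict[str, int] = defaultdict(int)
        self.purchases: DefaultDict[str, int] = defaultdict(int)

    def record(self, variant: str, purchased: bool) -> None:
        self.views[variant] += 1
        if purchased:
            self.purchases[variant] += 1

    def conversion_rate(self, variant: str) -> float:
        views = self.views.get(variant, 0)
        return self.purchases.get(variant, 0) / views if views else 0.0

    def underperforming(self, control: str, treatment: str,
                        min_views: int = 100,
                        max_relative_drop: float = 0.2) -> bool:
        """True once both variants have enough traffic and the treatment's
        rate has fallen more than max_relative_drop below the control's."""
        if min(self.views.get(control, 0), self.views.get(treatment, 0)) < min_views:
            return False
        floor = self.conversion_rate(control) * (1 - max_relative_drop)
        return self.conversion_rate(treatment) < floor


tracker = ExperimentTracker()
for i in range(10):
    tracker.record("blue", purchased=i < 5)    # 5 of 10 visitors buy
    tracker.record("green", purchased=i < 2)   # 2 of 10 visitors buy
```

With these made-up numbers, the green variant is flagged as soon as both variants pass the minimum traffic threshold, which is the “know about it instantly” behavior Kenny describes.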
JA: Yeah, and one thing I’ve been trying to tell people is that these millions of dollars, now with COVID, could be companies going out of business, could be layoffs. We, as data professionals, need to be thinking about this, about how our actions affect the company where we work, for good or bad.
KG: Yeah, that’s a great point. That’s a great point.
JA: Related to that point, I’ve had a few companies where they may have a model, whether it’s real-time or batch, and it improves 10%, let’s say, and they’re holding off, waiting for that 30%, that 50% improvement. Well, guess what? Maybe it’s worth it to take that 10% improvement now, if that means the difference between layoffs or not. I really encourage the data scientists watching this to think in those terms.
KG: That’s great. Yeah, that’s great. You’ve mentioned Flink a few times. Tell us the story about how you came across Flink and where your head is at with Apache Flink. You and I have talked a little bit about it together, but maybe share with the audience just like, what are your thoughts on Flink? How did you come across it? What initially made you excited about it? Where do you think it stacks up in the ecosystem today as a piece of software for the data practitioner?
JA: As I mentioned, I had been dealing with or working with Kafka for quite some time. Now, Kafka is very good for stateless things. And, in fact, if you’re doing some kind of stateless operation that isn’t doing the equivalent of a shuffle sort or something like that, hey, that’s a great way to do it. In fact, for some clients, we’ve started out that way. Let’s do it entirely stateless. We’ll do best effort caching, for example, and we’ll go from there, but that isn’t… The next level of analytic, the next level of engineering usually brings you into stateful engineering. In order for me to run that data science algorithm, I need to reach back to 10 points before that, 10 data points, 50 data points before that. How do you do that? And then you have the issue, the other issue of, how do I do this quickly?
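The step from stateless to stateful processing can be sketched in a few lines of plain Python. The class `LookbackState` and the rolling-mean calculation are assumptions chosen for illustration; the point is only that each new data point is evaluated against the previous N points for the same key, which requires the processor to remember something between events.

```python
from collections import defaultdict, deque
from typing import Deque, DefaultDict


class LookbackState:
    """Per-key history of the last `size` values (illustrative only)."""

    def __init__(self, size: int = 10) -> None:
        # A bounded deque drops the oldest value automatically once it is full.
        self._history: DefaultDict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=size)
        )

    def process(self, key: str, value: float) -> float:
        """Record the value and return the rolling mean over the lookback window."""
        window = self._history[key]
        window.append(value)
        return sum(window) / len(window)


state = LookbackState(size=3)
means = [state.process("sensor-a", v) for v in (1.0, 2.0, 3.0, 6.0)]
other = state.process("sensor-b", 10.0)   # a different key has its own history
```

Here the state lives in one process’s memory and is lost on a crash. Keeping that state partitioned by key and recoverable after failure is the part a stateful stream processor takes on for you.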
JA: And that brings us to issues of micro-batching versus not. So you have, obviously, Spark Streaming with its micro-batching, and for some use cases, that micro-batching is just not good enough. It’s either not fast enough or has too many limitations along with it. As I looked around and saw, “Okay, Flink doesn’t have that micro-batching,” I immediately realized, “Okay, this is actually really good. You can do your windowing if you need it, you can do your full statefulness.” It was very clear that it was an overall better design, better architecture, and doing some really interesting things that weren’t in other systems, for example, in Spark Streaming.
JA: One of those, if people don’t know, is timers. One of the design issues you may have is, yes, you may have a stateful issue, but what about the absence of data? How do you track the absence of data? That is a really difficult problem. And as you’re doing this in real-time… Hey, maybe in batch you didn’t have that problem. In fact, I’ve hit several companies where I’ve asked, “How do you deal with X, Y, or Z in batch?” “Well, it just came later, we added it later, it was 10 minutes later.” And so once they did the batch processing, it was okay.
JA: When you’re doing that in real-time, that’s a different story. So there were all these really good, really interesting things about Flink. And so I met the Flink people, Kostas, and it was an interesting thing I’ve seen in dealing with them: they’re really smart, and they’re really, really nice people, actually. And I asked other people I knew, whose opinion I respected, “What do you think about Flink?” And they said, “Oh yeah, that’s some of the best X, Y, Z real-time processing.” So I thought, “Okay, this coincides with my opinion. I’ve seen Flink and I’ve used it and I’m seeing this at just massive scales.” And it was really interesting and eye-opening starting to hear about the Alibaba use cases, now that they’re being more forthright… This is just an insane scale that they’re operating at. It was definitely eye-opening to see Flink fit the bill on each one of these needs that I had.
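The timer idea for tracking the absence of data can be illustrated with a small plain-Python sketch. This is not Flink’s API (Flink exposes timers through its process functions); the class `AbsenceDetector` and its method names are hypothetical. Every event pushes its key’s deadline forward, and when time advances past a deadline with no new event, the timer fires for that key.

```python
from typing import Dict, List


class AbsenceDetector:
    """Fire when a key has been silent for longer than `timeout_s` (illustrative only)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._deadline: Dict[str, float] = {}  # key -> time at which its timer fires

    def on_event(self, key: str, ts: float) -> None:
        # Each event replaces the key's timer with a later one.
        self._deadline[key] = ts + self.timeout_s

    def advance_time(self, now: float) -> List[str]:
        """Return the keys whose timers fired at or before `now`, clearing them."""
        fired = sorted(k for k, deadline in self._deadline.items() if deadline <= now)
        for key in fired:
            del self._deadline[key]
        return fired


detector = AbsenceDetector(timeout_s=30)
detector.on_event("sensor-1", ts=0)
detector.on_event("sensor-2", ts=10)
first = detector.advance_time(20)     # nothing has been silent for 30s yet
detector.on_event("sensor-1", ts=25)  # sensor-1 reports again; its timer moves to 55
second = detector.advance_time(45)    # sensor-2's timer (40) has passed
third = detector.advance_time(60)     # now sensor-1's timer (55) has passed
```

In batch, the missing record usually just shows up before the job runs; in a stream, something has to decide when “not yet” becomes “missing,” and that decision is what the timer encodes.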
KG: Yeah, that’s cool. I think we echo the same sentiment around the now-Ververica team around Flink. Great people, great community. And man, it’s growing. We’ve been doing Flink for a couple of years now and, like you said earlier, that’s like an eternity. Two or three years of streaming stuff feels like a lot these days. And two or three years ago, Flink was very, very early on. And I think now some of the stuff that’s coming out, like for us, the Blink planner is fantastically cool, and just really honing in on running on Kubernetes, that story is also very powerful. So yeah, I think it’s an exciting time for Flink. I’ve been through a couple of technology adoption curves in my career, and I think I can kind of feel this one really picking up and getting energy behind it, and it’s exciting. I’m excited that Eventador is part of it. It’s fun to see the community grow and kinda have that trajectory start to really take off, so that’s cool.
JA: Yeah, that’s definitely… I’m happy to see that they’re putting their effort into Blink, because it’s well-known within the Flink community that their batch just isn’t there. And so, when you talk to a heavy batch company, they say, “Well, everything’s batch.” But when you talk to a streaming company, they say, “Everything’s streaming.” What you really need is that single technology that does both. And this is important not just from a technical point of view, but from an operational point of view. I don’t want to have two separate systems to do streaming and batch. Ideally, I’d have a single one. And then on your developer side, with two systems you have two APIs and you have to support those two code bases. Some real cognitive overload there.
KG: Yeah, I was just gonna say, two different code bases is a huge thing. Yeah, I totally agree. Interesting.
LD: Jesse, I think this has all been super interesting. It’s been really interesting for me to hear you guys chat. With what you’re seeing, both in just conversations with general people in the industry and with the folks that you’re working with, what are you really excited about that’s coming up soon? What is it that’s coming down the path that you can’t wait to either see or use or start having folks implement?
JA: I think there’s two technologies that… And I’ve become involved with them because I saw the need for them. One is Apache Pulsar, which I have been really interested in. I’m not sure if you’ve looked at it. And then the other one is Apache Druid. There are a few non-trivial… And when I say non-trivial, I mean, enough for an entire company and product to exist around this, of, “How do we do real-time ingest to a database and be able to return responses in seconds or less?” Yeah, that’s Druid. That’s a non-trivial problem. And if you try to do that with a relational database, “Hey, that will probably break down.” So I’m really excited about the things that I’m seeing in Druid. I’ve worked with Imply, the company. Those are the people that created Druid itself. It’s been interesting seeing how they’ve improved the UI of the product, the user experience of the product. It’s now, crossing my fingers, it’s now a pretty non-trivial thing to do, or, excuse me, a pretty trivial thing to do, to get your data from Kafka into Druid. They have an entire UI that does it.
KG: Yeah, it’s a very popular stack for…
JA: It’s a very popular stack. If you have your BI people breathing down your neck and saying, “I need that data to be landed and ready and queryable in less than a second,” I’d be looking at Druid for that. That’s one possibility for that really ad hoc BI-style analytics. It’s kind of an interesting thing coming at it and thinking about SQL with Flink and SQL with Druid, how they’re addressing the problem differently. I think that we’ve got the “Yes, we can create… Do some queries on data in real-time and express that with SQL,” but you’ll still need the ability for your BI people to express those queries and get that data back quickly.
KG: Yeah, in our world, we call it continuous SQL, but that’s really the processing element of it, right? That’s really writing stream processors with structured query language. But on the flip side of that is, “Okay, so what am I doing with that data?” Yeah, I’m continuously, I don’t know, doing a GROUP BY or something. But where does that data go, and how can someone make use of it? You know, what kind of construct do they… What kind of API do data scientists or BI people, like you mentioned, or even just applications, need to have to access that data?
JA: Yeah. And then the other one I’ve been excited about is Pulsar. And so, having dealt with Kafka for quite a while, I’ve seen Kafka grow, but I’ve also seen some of the architectural limitations of Kafka. Looking over Pulsar, I’ve seen better design and an ability to deal with those limitations with Pulsar. Some of them are around… For example, as we get into the melding of real-time processing and batch processing, we’ll need better systems in order to lay that down. For example, how would you batch process against Kafka? Well, you have to start… You have to have laid it down with Kafka Connect into HDFS or S3, because you can’t really go directly against your broker in an efficient way. With Pulsar, you have those bookies, and you could use that reader interface to actually read that. So a company called StreamNative, and Sijie Guo, they’ve actually put a pretty significant amount of effort into making sure that the Pulsar and Flink bindings are really well done, so that you can do that. Yeah, so it’s… And in fact, Sijie gave a talk at the last Flink Forward there in Berlin.
KG: Oh okay, yeah. I missed that one. In fact, I’ll go back and check that out. Pulsar comes to my mind for the geo-replication stuff, that’s always been where it pops to the front of my… When someone says, “Oh, we have to replicate all over the planet.” “Okay, you should take a deeper look at Pulsar.”
JA: Yeah, so to add to that, if people aren’t familiar with how difficult it is to do geo-replication, it’s a… There’s non-trivial problems and then there’s really, really, really difficult problems.
JA: And then you get into active-active problems. And I don’t think… I personally don’t think that Kafka’s replication story is very strong. I mean, there’s obviously replication, you can do replication. But I think Pulsar’s replication, geo-replication, sorry, is significantly stronger because you’re doing the replication at the broker rather than, as with Kafka, in a separate MirrorMaker process.
KG: That’s right, yeah.
JA: And it sounds like you’ve been around long enough to know that there’s a big difference between shipping around files, and shipping around updates. When we have it built into the process, the process knows what needs to go, and we can do that significantly better.
KG: Yeah, you know, I think there’s just two headwinds there for the Kafka ecosystem. And though… You know it’s getting better, right? This is an important thing for the folks who are writing Kafka, I’m sure. Ultimately, if you want to replicate data across the globe in a geo-fashion, you’re doing that probably because of some sort of requirement around that data latency, and around… Or maybe recoverability.
KG: The hard part about that is by building on other pieces and parts, you’re now making it a relatively brittle and fragile stack. The net net is that you have to have a pretty damn good team, and a good on-call rotation, and all the things that you need when you’ve got kind of a stack that’s built with these… You know, a little bit more brittle components and fragile components. At least that’s what we’ve seen from a customer perspective.
JA: Yeah, when you think about this, just standing up a process to move data around, that sounds like a good logical flow. But distributed systems tend to have these…
JA: These Murphy’s Law sort of problems. And so not just for replication, but I’d say in general. Yeah, architects… I try to impress this upon architects and data engineers: yes, you can make something that works in theory, and it looks like it will work, but once you get into that messy real world of production, that’s when it starts breaking down. And that’s really where you need these people with the backgrounds, these experts, to be able to say, “No, it’s going to break this way. We should actually do it this way, and this way. There are other better ways to do this.”
KG: In my old crusty DBA days, we’d say, “Just ’cause you can doesn’t mean you should.”
LD: Well, Jesse, thank you so much for taking time to chat with us today. I know you mentioned it at the very beginning, and I’ll certainly put a link in the description of this for how folks can get more information, and then order it. But tell us a little bit about this book you have coming up.
JA: So the book is called Data Teams, and it’s a unified model for how companies make sure that their data scientists, data engineers, and operations teams are all working together. Because if you… We’re obviously talking about some pretty high-end technical issues. But usually, when a big data project or a streaming project fails, it’s usually not Flink failing, or it’s usually not Kafka failing, it’s usually the organizational design of the team that failed. And what the issue was, is that there wasn’t anybody who was saying, “And this is how you should… ” Or recommending, “And this is how you should create that team. This is your organization of that team.” And so, I’ve written one book already called Data Engineering Teams, in which I tried to do this. But this book, it’s even longer, it’s even bigger, it’s even better, and uncut, so that we have even more information about what the team should look like, and how the team should interact. And where I go even further is, I go deeper into how the business should interact as well.
JA: If you have a data strategy and your… You’ve done your data strategy in the business, and now the business washes its hands of it and says, “Alright, data teams, your go.” That’s probably not going to be a successful project. And so I’m really trying to key in on what makes you successful. And then to top it off, I did case studies. It’s a body of work. A lot of my body of work hasn’t ever either existed before or been talked about before. This is another thing that I haven’t ever seen exist, and that is case studies about how teams and companies have organized their data teams, and the good, bad and indifferent. There’s interviews from companies like Stitch Fix, and I interviewed Dmitriy Ryaboy, who started up Twitter’s data engineering team. It’s not just, “Here, let me get a one-month snapshot.” In some cases, this is many years, from people who were very early on in this, and I got to have those conversations. Those interviews are highly valuable unto themselves, looking at whether they’re startups or big banks or whatever, to help people understand, “Hey, this is the process that we go through.” So the actual URL is datateams.io.
JA: If they wanna go there, I have a landing page so that you can be notified of when the book comes out. And in the meantime, I’ll give them a copy of my data engineering teams book.
KG: Alright, I’m signing up. Sold.
LD: Super interesting.
KG: Okay. Great.
JA: Well, thank you, yeah. If you’re a manager listening to this, just remember to make sure that you have the right people, and the people in the right seats on the bus, as people term it. If you’re barely making it at batch, and you go to real-time, you’re probably not going to make that real-time switch. You’re probably going to be tripping and falling over all the time, so make sure you’ve got the right people.
KG: Good tips.
LD: Nice. Well, Jesse, thank you so much for joining us on the podcast, we really appreciate it. I had fun. I can tell Kenny’s had fun.
KG: Yeah, me too.
LD: Hopefully, we will have you again and hear you again on the podcast. So thanks so much.
JA: Well, thank you again.
KG: Thanks again, Jesse.
LD: Huge thanks again to Jesse for joining us today, we really enjoyed that conversation. If you did as well and you want to hear more from him, you should check out the Big Data Institute, and you should absolutely go sign up right now to find out more about his upcoming book at datateams.io. Now I do wanna take a quick moment to say that if you’re looking for a way that your team can unlock more value and get more from your Kafka data, or even simplify how you work with Flink, you should take a look at the Eventador platform, and sign up for the 14-day free trial. You can find that at eventador.cloud/register, or as always, you can drop us a line at email@example.com to ask any questions that you may have.