Podcast: Ditching the Database and Materializing Views of Data Streams

April 28, 2020 in Eventador Streams Podcast



In this episode of the Eventador Streams podcast, Kenny and I sat down to talk in more depth about why the ability to materialize views of Kafka data is already, and will increasingly be, a critical part of the streaming data pipeline, and how it can also reduce reliance on costly databases that were never meant for streaming.

And, for those not as familiar with materialized views, we talk through an overview of what they are, how they can be used with streaming data, and how organizations currently get those views (spoiler alert: it has something to do with databases). Hear more about when you can, and when you probably shouldn’t, ditch your database in this episode of the Eventador Streams podcast:

Eventador Streams · Ditching The Database and Materializing Views of Data Streams

We also took some time to chat about last week’s Flink Forward virtual event, which was full of fantastic talks and sessions. Want to learn more about the Eventador Platform as well as materialized views? Check out Kenny’s session, and we absolutely recommend you watch the rest as well!

Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including Soundcloud, Apple Podcasts, Google Play and more. Happy listening!

Episode 04 Transcript: Ditching the Database and Materializing Views of Data Streams

Leslie Denson: For a guy who calls himself one of the biggest, if not the biggest, database fans out there, Kenny sure is quick in this episode to talk about why you don’t need them, and it all has to do with a not-so-little something called materialized views of data streams. Learn more about when you can and should ditch your database in this episode of Eventador Streams, a podcast about all things streaming data.

LD: Hi Kenny, how’s it going?

KG: Good, how are you?

LD: Good, good.

KG: I’m back.

LD: We have you… You’re back.

KG: Yeah, I actually got invited back. How did this happen?

LD: How did it happen? I was going to say, so you had your Flink Forward session today, and it’s officially 4 in the afternoon where we are and we’re about to talk about databases, which I know is one of your favorite topics. So the question I have is, do you have a final cup of coffee…

KG: Coffee?

LD: In your cup, or do you have a beer in your cup right now?

KG: Yeah. Sadly, I have nothing. I was just actually eating Easter candy like crazy because we have kids and that’s what happens. But…

KG: With the Flink Conference, I was… Leading up to it, it was one of these unique things where we thought like, “Oh, this is going to be crazy, it’s online,” because just as background, this is COVID times right now. And so the Flink Forward Conference was scheduled to be like a virtual conference. And I was like, “Okay, we pre-recorded our thing.” And first of all, pre-recording is horrible because I had to do the whole thing in one shot and that didn’t work out very well.

LD: It looked good, though.

KG: Thank you. But it was maddening, it drove me to the brink of insanity. But ultimately, I think the whole thing turned out really good. I really enjoyed the sessions. I think… You didn’t really ask me for a Flink overview, but I’m giving it to you anyway, I suppose.

LD: That’s okay. Go for it.

KG: But it was great. I thought it was super professional. I thought everybody dealt with the adversity really well, and the passion around the topics came out anyway, which I thought was killer, such a good community and great folks, and that the Ververica folks were super helpful and getting… And a big shoutout and thanks to them for making it free and putting it out there and… I thought it was awesome.

LD: Yeah, I did, too. I thought it was a really great show today. So excited and I know we’ve got a couple more days of it, I can’t wait to see what everybody is doing. I always enjoy watching these sessions. So it’s good stuff. Well, we are actually going to talk today about one of the things that you kind of went into in your session topic, which is materialized views, but what we’re really talking about is how people can just go ahead and ditch their database because that’s not controversial at all, is it?

KG: Yeah, at a high level it is a snarky way of kind of attacking the problem, but we’ve been working on, and iterating on, using Structured Query Language on streams for a while now, and that’s been a big part of our product base and what we’ve been working on from an engineering standpoint and where we believe as a company, our innovation really lies, as to… It’s our experiences around SQL, we’re old, crusty, database nerds, and we’re bringing some of that knowledge to the table when we create and architect a streaming SQL platform like we have. Obviously, we use Apache Flink under the covers and it gives us a lot of niceties but it’s not the whole puzzle. And we had to write a number of cloud-native type microservices to handle various pieces of the puzzle, including another SQL engine itself, parsing the SQL, understanding the schema, bringing that all together. One of the things that came up this morning was I noticed how many other solutions… Not to brag for a sec, but I’m going to. How many other solutions really wanted you to have the notion of a table and wanted to have the kind of the schema piece of it, I’m going to say farmed out or delegated out to some other component. Assuming the schema is all perfect, this will work great, but that’s not how we’ve seen reality really happen for us.

KG: I don’t know if we just are cursed in having these use cases that are relatively dynamic, but I just feel like… Especially with JSON, if you’re using JSON and streams, if it’s super organized and that schema is perfect and you don’t have a lineage problem, I would be super surprised, and I might even say you probably don’t even know that you maybe have that problem. I would say it politely. But engineering a proper schema and understanding not just the data types and that kind of thing, but also understanding what business component does that belong to, what does that actually mean, that part, I think is largely… That’s why you have things like Schema Registry and the Avro serialization mechanism, there’s other ones as well, Protobuf and others. So it’s super interesting to kinda see this thing evolve.

KG: And I think from our standpoint, we’ve kind of assumed worst case. We said we think people are putting JSON in 90% of the time or 80% of the time just because it’s easy. And then based on that, your schema’s probably not… It’s probably fairly dynamic. In fact, we’ve seen people just pollute topics with crazy stuff. Like if you were to sample a hundred messages, there’d be whatever, 12 different types of messages in that one topic. And that’s crazy, but that’s life, we see that, and that’s an anti-pattern, for sure. But if you’re a data scientist and you’re trying to query the database, and that’s what you got, it’s like that’s what you got. You have to somehow make sense of it and try and still deliver your project on time and use the tools in your toolbox to make it all work.
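To make the "12 different types of messages in one topic" point concrete, here is a minimal Python sketch of sampling a topic and counting distinct message shapes. It is an illustration only, not Eventador's tooling: the function name `message_shapes` is hypothetical, a "shape" is assumed to mean the set of top-level JSON keys, and the payloads are assumed to be raw message values you have already read from Kafka.

```python
import json
from collections import Counter
from itertools import islice
from typing import Iterable, Tuple


def message_shapes(payloads: Iterable[bytes], sample_size: int = 100) -> Counter:
    """Count distinct top-level key sets in a sample of raw message values.

    Hypothetical helper: a "shape" here is just the sorted tuple of
    top-level JSON keys, which is a rough proxy for "message type".
    """
    shapes: Counter = Counter()
    for raw in islice(payloads, sample_size):
        try:
            doc = json.loads(raw)
        except (ValueError, TypeError):
            # Not valid JSON at all: bucket it separately.
            shapes[("<not json>",)] += 1
            continue
        if isinstance(doc, dict):
            shape: Tuple[str, ...] = tuple(sorted(doc))
        else:
            # Valid JSON, but not an object (e.g. a bare list or number).
            shape = (f"<{type(doc).__name__}>",)
        shapes[shape] += 1
    return shapes
```

If `len(message_shapes(sample))` comes back much larger than one, the topic probably carries several logical message types, and any schema inferred from it will need to account for that.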

KG: And the materialized view engine which we call Snapper, is a new piece of infrastructure for us. It’s a new service that ultimately, if you’re familiar, works with a retract stream in Flink, manages the insert, update, delete of data in a database essentially, by key. And so you define a primary key, the data comes flowing through, and then we materialize that view. And really, if someone’s not familiar with materialized views or views or what even that means, traditionally, a view has just been a SQL statement, it’s a persistent SQL statement. So you can create a view called ‘my view’, and it can… Maybe you join a couple of tables, maybe you summate something, a group or whatever, but it stays in your database as that view name.

KG: So maybe you call it like ‘last month’s finance’ and ‘this month’s finance’ or whatever projection, whatever that might be. And so that’s what views are. Materialized views are just a little bit different, it’s the same thing except the data doesn’t live in its source tables. The data’s actually saved in a new table, if you will. Materialized view is a table that was created as a select statement and named just like a view. And that’s really all it is, it’s not… There’s no huge magic there from a database perspective, but in the streaming context, it’s very interesting because it’s always being mutated. It’s always being changed by that retract stream, and so it’s a source of truth that you can go to just like a traditional database, to look up the data based on whatever is coming through you. A message bus, or whatever.
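As a rough illustration of a view that is "always being mutated" by a retract stream, the sketch below applies add and retract messages to an in-memory table keyed by a primary key. This is not how Snapper is implemented. It assumes the common encoding where each change is a `(flag, row)` pair, with `True` meaning "add this row" and `False` meaning "retract this row", and where an update arrives as a retraction of the old row followed by an add of the new one. The names `apply_changes` and `key` are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple

Row = Dict[str, Any]
Change = Tuple[bool, Row]  # (True, row) = add, (False, row) = retract


def apply_changes(
    view: Dict[Hashable, Row], changes: Iterable[Change], key: str = "id"
) -> Dict[Hashable, Row]:
    """Apply a retract-style changelog to a keyed, in-memory view."""
    for is_add, row in changes:
        k = row[key]
        if is_add:
            # Insert, or overwrite the previous version for this key.
            view[k] = row
        elif view.get(k) == row:
            # Only remove the row if it is still the version being retracted.
            del view[k]
    return view


changes = [
    (True, {"id": 1, "total": 5}),
    (False, {"id": 1, "total": 5}),  # update = retract the old row...
    (True, {"id": 1, "total": 8}),   # ...then add the new one
    (True, {"id": 2, "total": 3}),
    (False, {"id": 2, "total": 3}),  # delete
]
view = apply_changes({}, changes)
```

After the changes above are applied, `view` holds only the latest row for key `1`; a lookup by key at any moment returns the current state rather than the history.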

LD: No, that makes a lot of sense. I used the phrase ‘ditching the database’ as kind of my fun little marketing way to put it. I would say yes, people probably… We’re not telling folks to set fire to their data centers just yet, but I think to a point that you made a little bit earlier, some people may think that their streaming data right now is really simple, and they’re either going to find out soon that it’s not, and they’re not getting everything that they could out of it. Or even if it is right now, it’s not going to be in the next month, six months or a year. And the idea of being able to materialize it without having to dump everything in a database, because that, to be fair, that kind of ruins the idea of actually having streaming data when you have to put it actually into a database.

KG: Yeah.

LD: In a view of it, yeah.

KG: So let’s break it down. And so this is the most… And I think we talked about this on a previous podcast a little bit, and I’ve talked about it in some of my talks over and over but… Let’s just break it down. The most common use case, I think, for Kafka, the easiest thing someone does is they say, “Okay, let’s just use clickstream data. I’m going to write a piece of code that captures a click, maybe it’s just JavaScript or whatever, and it’s going to hit a service of some type, and that service is going to then turn it into a Kafka message, and so it’s going to produce a message to Kafka, and it’s asynchronous.” From that frontend framework standpoint, it all happens asynchronously, it’s super fast, and so the logical put of the data, if you will, is asynchronous, fast, and probably won’t break.

KG: And if it did, it would just timeout at some service or whatever. It’s very reliable, super robust, scales forever. Okay, got it, checkbox. Awesome. So then you got all your clicks in Kafka. Cool, great. And so you’re like, “Well, I really want to display a dashboard of… Maybe I’m doing… I just want to segment by browser type,” just to take a simple example and to write a report. Okay, cool. So now, I want to pull that data out of Kafka and write a report. Okay, well, you’re… Just take Python. I’m going to write Python. I’m going to use the consumer Python client, maybe I’ll use Confluent’s. It’s a pretty good one. I’ll read messages out of Kafka, then I have to put it somewhere. So what will I do with it? How will I get the summation of all the clicks over some time period?

KG: So I have to write code to do that. I’d write Python, and you can do that, but if you think like in a production mindset, it’s like, okay, that would have to be a microservice that lives somewhere, and then that would be connecting to Kafka. Okay, cool, cool, and then, okay. But then I want my dashboard users to read that, and I need to give them an API to read it. Oh okay, well, you know what I’m going to do? They use a database today, I’m just going to write it to the database.
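The hand-rolled aggregation Kenny describes might look something like the Python sketch below. It is a simplified assumption of that code, not anything from Eventador: it takes raw message values (for example, the bytes a Kafka consumer client hands back for each message) and counts clicks per browser. The field name `browser` and the function name are hypothetical, and the consumer loop, time windowing, offset handling, and recovery are deliberately left out, which is exactly the part that makes this hard to run in production.

```python
import json
from collections import Counter
from typing import Iterable


def count_clicks_by_browser(payloads: Iterable[bytes]) -> Counter:
    """Count click events per browser from raw JSON message values.

    Assumes each payload is a JSON object with a "browser" field
    (a hypothetical schema); malformed payloads are skipped.
    """
    counts: Counter = Counter()
    for raw in payloads:
        try:
            event = json.loads(raw)
        except (ValueError, TypeError):
            continue  # skip payloads that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip JSON that is not an object
        counts[str(event.get("browser", "unknown"))] += 1
    return counts
```

Note that the result lives only in this process's memory. Serving it to dashboard users still needs somewhere to keep that state and an API in front of it, which is where the database usually sneaks back in.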

KG: And so now, you’ve got this anti-pattern of where you had this asynchronous architecture, you had tons of scalability, you had tons of fault tolerance, and then you took that data, you jammed it into a gigantic database. It probably has other data in it, too. You continued this idea of this monolithic database, or even a cluster of databases. And that’s been the pattern I think everybody has been going through. And by the way, a lot of times, that database is something that’s… Maybe it’s RDS or something super expensive, or maybe even super opaque like Aurora, where it’s just hard to tune and you just can’t have the handholds you need to kinda make it performant, it just sucks. The whole thing just sucks.

KG: And you’re doing that because there’s no crisp API that has state materialized over some period by some key to hand out to your team, and that API is important. And it’s a very tricky topic to talk about, because people are like, “I can write something that handles that,” but inevitably, they’re going to use Redis, or they’re going to write it to S3, or they’re going to keep it in memory, but not have the ability to recover, or whatever it might be. They’re going to have to figure out, “How do I take that out and how do I manage that state?” And our take on that has been “Don’t use your legacy or traditional databases, don’t… ” because first of all, when you do that, you’re inserting… They’re messages, so you’re inserting, updating, and deleting probably a gigantic database.

KG: If you’re not familiar, what tends to happen there is you tend to fragment the heck out of that thing. You’re doing these updates on these message keys. Maybe that’s a logical move from a block standpoint on that database, maybe even it’s a logical shard move, if you change the key. It depends on the database you’re using, but it can make a mess, and it can be very slow, too, and very expensive. So our take on the whole thing has been like, just forget all that. If we could just dream and have a distributed database that would scale in lockstep with Flink, because that’s the underlying framework we use from a distributed system standpoint, it would have just a simple clean REST API. And ultimately, that’s the easiest way to query data, because it works in everything. It works whether it’s in Golang or something or Rust or whatever… More… Kind of a new school in popular languages of recent days.

KG: Or even something like a notebook or even a front-end application, Angular and React apps, and JavaScript apps or single page apps, things like that. So REST is, it’s ubiquitous and easy to use. And so that was kind of a design tenet of the system itself, and materialized views. We wanted customers to be able to say, “Look, I know how to query these views, it’s super easy. I can embed that in my app, I can do a lot of API keys, I can revoke them, I can have the security context I need to do that.” And then, bam, they’re off to the races. And they can skip, for the most part, skip huge pieces of expensive and potentially slow database infrastructure.
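For a sense of what "query a view over REST with an API key" could look like from a notebook or script, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the base URL, the path layout, the query parameters, and the use of a bearer token in the `Authorization` header are hypothetical and are not taken from the Eventador API.

```python
import json
import urllib.request
from typing import Any
from urllib.parse import urlencode


def build_view_request(
    base_url: str, view_name: str, api_key: str, **filters: str
) -> urllib.request.Request:
    """Build a GET request for a materialized view (hypothetical URL layout)."""
    query = urlencode(sorted(filters.items()))
    url = f"{base_url}/{view_name}" + (f"?{query}" if query else "")
    # Assumption: the API key is sent as a bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def fetch_view(request: urllib.request.Request, timeout: float = 5.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder endpoint and key; no network call is made until fetch_view() is used.
req = build_view_request(
    "https://example.invalid/api/v1/views", "clicks_by_browser", "my-api-key",
    browser="firefox",
)
```

The appeal is that the same handful of lines works from a notebook, a backend service, or (in its JavaScript equivalent) a single-page app, with no database driver involved.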

KG: It’s odd, because that’s what we’ve done for many, many years, is build database infrastructure, but we know the ins and outs of database infrastructure, how expensive, how much tuning goes into it. And we’ve seen over the last 10 years, so many different types of data storage come out. And they’re all trying to have a little bit different take on the same kind of… Similar kind of problems, whether it be scaling across the planet, or whether they’re thinking about time series, or whatever, schema-less, things like that. And so, from our standpoint, we said ultimately, we don’t want to care about any of that stuff. We don’t want to write 1,000 different connectors to 1,000 different databases. We just simply want to offer a view and the easiest way to query a view of real-time data. And it’s kind of that simple.

LD: Well, and I think it’s interesting. You and I have had this conversation a lot recently, too, just for internal reasons. I’m not saying that there’s nobody out there that can do it. What I’m saying is we don’t use those services and we haven’t found the services that can actually show us, to your example about clickstream data, some of the marketing data that we want. And so, it’s as easy as putting it into StreamLoader and querying it and using materialized view. And bam, we’ve got some of the marketing data that we really want to use and really want to see that we couldn’t get access to easily, at least, through any of our other services, or any other tools that we can find out there. And if you’re doing it with… Not to put my profession down at all, because I love what I do. But if you’re doing it with marketing data, then chances are, there are a lot of other pieces of the puzzle throughout your company that could be using data better if they had this kind of access and this kind of view of it. Or at least, that’s my view.

KG: I see where you’re going, right? And that is like… And I’d go a step further. I’d say, “Look, that’s just democratizing data.” If you are… Let’s just take clickstream, because that’s a really good example, good point. And you’re a data engineer, backend data… Backend programmer, data ops person or whatever. And you’re in a data team, and your role is to help folks make sense of that clickstream data within your org. If you’re not a two-person company or a five-person company, you need to be able to somehow capture that data, and democratize it, and distribute it in an easy way across your organization. That’s the popular movement, is to empower folks to make applications and decisions based on data.

KG: And the only way you’re going to be able to do that is to ultimately, “Let’s build some sort of backbone to capture that stuff.” And so clickstream… Yeah, Kafka is obvious. Like, that’s a very popular use case, but Kafka from the consumption side to the application, is a gigantic divide of space. And how do I address that? And most people are putting that data, I think most people are putting that into a database right now to materialize it so that their teams can read it. And that’s something we want to change. We think, “Look, you can have fast and real time data, you can have access to it in a performant way. You can still address it with familiar and easy APIs, but you don’t need to provision gigantic pieces of database infrastructure to do it.”

LD: We’ve talked a lot about the reasons why you would do it. And I think there’s probably a zillion reasons more why you would use a materialized view across the board. Let’s play devil’s advocate a little bit. When is the time where you have streaming data, and you probably don’t want to materialize it? What is the point at which, “Okay, yeah, this… I know we’re not actually ditching the database, yeah, we gotta put this data back into a database.” When does that happen?

KG: There’s a couple of use cases where… I think there’s two dimensions. One is, “Would I have a materialized view of it at all?” is one thing. And the other thing is, “Would you use our framework for doing it?” I think… For the first one, if you’re doing something like real-time alerting, if you’re routing… If you’re going kind of from a message paradigm to another message paradigm, then obviously, it makes tons of sense to just… Maybe you’re running some sort of simple lambda on something, you’re coming up with a simple small result set, and that data just gets piped to some service. Maybe you’re using CQRS patterns or something like that. You don’t necessarily need to think about materialization in the same way, in my mind. You may not use a database at all in those cases, and that stuff’s well-known and pretty easy to use. You write Kafka consumer code and use some driver in some language and you’re off to the races, no big deal.

KG: Now, in the case of why would you use our framework, well, I think there’s two schools of thought there. One of them is, if you’re already using some sort of database framework and you have a ton of data from other sources in there, maybe some of it’s legacy, maybe it’s batch, whatever, then you might want to just consider using that data store. That’s fine, and we support that, too. You can always write to a JDBC sink, you can… That’s been around for a long time, there’s nothing magical there.

KG: But it doesn’t mean you can’t do both. And I guess that’s where I was kinda going is, if you have an application that’s… And I always use this example, some sort of map on iOS or whatever, or a JavaScript app where you’re showing plots over time, or you’re maybe doing a heat map or something. It’s super nice to just be able to say, “Look, I’m just going to get this data right from this REST endpoint.” Data science and notebooks is another… If you’re using notebook interfaces, that’s another place where people are already used to kind of using that paradigm, and so it makes tons of sense to use it. And maybe you’re joining multiple different sources. I think it’s up to the user. And this is why stream processing gets complicated. It’s like, “Hey, do I take this source data and put it into Kafka and then join it and continue with SQL and then output something that’s clean?” Or maybe that data is coming from somewhere else, like an old-school Informatica batch load or something. And you need to join it downstream further because that’s just the nature of your business. Okay, that’s cool, too. We can support that. It just depends on the nature of the business, and kind of where you are on that adoption continuum. Not everybody has a brand new Kafka source of truth and that’s it. Many times, infrastructures are messier than that, and they have existing legacy data stores and some other things that need to be taken into account. It can be both, really.

LD: Well, and that’s what… I think in my mind, the way… And you’ve heard me use this example before, I may have actually even used it on the first podcast that we did. In my mind, a really great way to frame up materialized views is, you think about a ride-sharing app. It’s not useful to know every driver that’s ever been within a two-mile radius of where you’re trying to get picked up right now. You need to know who’s there right now. But at the same time, on the flip side of that, yeah, that ride-sharing company probably still needs… They’re getting in geo-location data in a streaming fashion. They still need to know everywhere that those people have been for things like, “Oh, we need to refund this because the driver got lost and went out of the way. Or we need to do whatever that may be.” So there is… In my mind, and I think what you’re saying also, is for the vast majority of companies, it’s going to… You have to have a healthy mix of the two.

KG: Yeah, and you brought up the ride-sharing app. That’s a really good use case for materialization because if you think about it, it’s an IoT sensor, right? It’s an iOS app in a car and it’s streaming data about its position and whatever else, customer that’s in the car, and it’s got a counter going for the cost, and blah, blah, blah. And that’s being streamed into… I’m just going to make up my own infrastructure here, it’s going to be streamed into Kafka but then you have to somehow say like, “I want to know who the customer is in the car, what their latest location is, what their total is on the spend, where is the counter at, and then maybe the driver ID or something like that.” Well, you have to materialize that result. If you want to see that as a widget in your iOS app or in an internal dashboard… Maybe internally, they have them… All the rides that are happening at once and they’ve got them all listed. You need to kind of do that aggregation. And materialized views are a great way to do that because it’s app-specific, you can protect it behind an API key, you can scale it independently, you have separation of concerns, and get that really tight single piece of data that you want out of a huge stream of the firehose of data coming in from that IoT device.
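The ride-sharing aggregation can be sketched in a few lines of Python: fold a firehose of position and fare events into one current row per ride. This is an illustrative assumption rather than Eventador's implementation. The event fields (`ride_id`, `driver_id`, `customer_id`, `lat`, `lon`, `fare_delta`, `ts`) and the function name are made up for the example, and a real system would do this continuously over a stream rather than over a finite list.

```python
from typing import Any, Dict, Iterable


def materialize_rides(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Fold ride events into one row per ride: latest location, running total."""
    view: Dict[str, Dict[str, Any]] = {}
    for ev in events:
        row = view.setdefault(ev["ride_id"], {
            "ride_id": ev["ride_id"],
            "driver_id": ev["driver_id"],
            "customer_id": ev["customer_id"],
            "location": None,
            "last_ts": float("-inf"),
            "total": 0.0,
        })
        # The fare counter accumulates regardless of arrival order.
        row["total"] += ev.get("fare_delta", 0.0)
        # Only move the location forward in time, so late events do not rewind it.
        if ev["ts"] >= row["last_ts"]:
            row["location"] = (ev["lat"], ev["lon"])
            row["last_ts"] = ev["ts"]
    return view
```

A dashboard listing all rides in progress would read `view.values()`; a widget for a single ride would look up one key. Neither needs the full history, which can still be kept elsewhere for things like refunds.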

LD: Yeah, that’s… For some reason, that example is always the way that it… For me, it’s really easy to frame up in a way that people who may not understand what materialized views are, and who honestly may even be slightly new to streaming themselves, can kind of understand because they see it, but it’s also a great way to understand… And again, there’s the need for both. You need to be able to materialize those views but there are use cases where, “Okay, we do need to plop some of this data into a database and keep it there for now because we’ve got more static and batch-type things that we need to do with it later on.”

KG: Yeah, it’s interesting. It’s kind of… Materialized views have been hot as of late. I’ve seen it in more Flink docs. I’ve seen it in more Confluent stuff. I’ve seen it… Everybody has a little bit different name for it and a little bit different design pattern around it, but it’s all fairly similar in its approach. And I think that’s really because you have to unlock the ability for people to actually read this data and make sense of it. That’s ultimately what we’re trying to do here. So anytime you can make it more useful to that end user, whether that be a developer or data scientist or whatever, then that’s great. And I think materialized views really present that kind of, let’s just call it a little bit more of a legacy API, a more approachable database-looking API with this notion of streams. And I think that’s the art or the beauty of it.

LD: Why do you think there may be roadblocks in the way to actually getting some companies to adopt this? Because what we’ve talked about is a lot of companies may not understand how complex their streaming probably is, and I think some of it may also be that some companies don’t know what’s possible because they don’t know what’s possible yet. They don’t want to go through and do something because they don’t understand the possibilities of what they could be getting because they just haven’t done it before. But are there any other roadblocks that you can think of that somebody who’s listening might go, “Okay yeah, that sounds familiar to me. Maybe I should check out this whole materialized views on streams thing and just see what it can do for me because that sounds super familiar to the problem I’m having.”

KG: I think it’s a general topic. We talked about this in the last podcast with Jesse Anderson. I think it’s a general high level topic, that streaming data is a paradigm shift. It is a little bit different thinking than the way we’ve been thinking for a long time. And it’s a boundless stream of tuples and SQL’s continuous and we’ve got materialized results. We’ve got these new moving parts that are different than things were before. And a lot of really heavy duty academic research has gone into this stuff. And I think we’re more on the practical side, frankly, trying to make a product that helps leverage some of those technologies and bring them to people.

KG: I think the gap is going to be from an adoption standpoint, folks will, they’ll say, “Well, I already have RDS,” or “I’m already using Oracle or whatever.” And so it’s going to say like, “I’m just going to put it in there. I can store it forever. It’s a well-known paradigm. I know how to back it up. I know how to scale it. It’s known.” Sadly what we’ll see is when that dashboard user goes to try and look up the data for that data point, it’s going to be a B-tree fetch on a big table, and maybe the streaming data tends to be… Especially… A retract stream tends to be scattered in its access patterns, so we’re going to have a lot of buffers that move or a lot of buffers that need to be fetched. Disk drives aren’t going to move like they used to because they’ll… Mostly SSD these days, but there’s a lot of moving parts to make those buffers return in a timely way. And that’s what databases are good at.

KG: But we’ve been trying to move to this kind of continuous process where we’re filtering data and able to use it right away, and that’s what continuous SQL does. It’s always emitting those results as they come in. And maybe you’re grouping over a window but they typically tend to be kind of small. I think what we have to do is help folks, teach folks and make it easy for them to try to use materialized views. And we’re on a learning curve, too. This is obviously new for a lot of people. We’re not the only ones in this space. We’re on a learning curve, too. We’re trying to help our customers and help folks use streaming data to best effect, to make awesome apps. And so we’re building and learning, too. And I think feedback from customers is key. I think having them explain to us that, “Look, if we just had a… ” I’m going to make it up. “An ODBC connector that would allow my legacy reporting app… ” Again, just making stuff up. “To connect, versus REST or whatever, then I would be able to use Excel as my query frontend.” And I’m sure there’s a lot of people who are still using Excel to query SQL Server and stuff. That’s been a pattern for a long time.

KG: And so, there’s going to be those flows that we want to bring the new capabilities of streaming data and streaming data architectures to, but we don’t want to have to say, “Hey, you have to retool your entire stack, learn everything new, just to get these new apps.” And I think that’s where SQL, materialized views, and being able to treat a stream of data like a database, that’s where the sweet spot is. So, we gotta help people go there. I think that the product and the design is kind of oriented to doing that, but I do think that, yeah, if you’re used to something that’s hard to move, and until the business says, “Oh my gosh, we can’t wait 10 minutes for that light to refresh or that piece of data to refresh. We actually need that to be in real-time,” they won’t go searching for something that can, and for good reason.

LD: I think the final thing I’m going to ask you is going to be a super tough one. You ready for it?

KG: No softballs. Come on.

LD: You ready for it? What kind of beer are you going to go have now that you’ve done this and Flink Forward today?

KG: Okay, this is the best question. We have to do this in every podcast from now on.

LD: Even if we’re recording at 7:00 in the morning, it’s just “What beer are you going to have after this?”

KG: Okay, good point. Yeah, it is in the afternoon so I think we’re legal here. I’m going to grab a pint of Electric Jellyfish from Pinthouse Pizza here in Austin.

LD: Nice.

KG: That’s my next move here.

LD: Solid choice.

KG: What about you?

LD: I appreciate that.

KG: What about you?

LD: I haven’t even begun to think about it. I did get wine delivered, so it may be cracking open a bottle of one of those. We’ll see.

KG: There you go. There you go. Well, it was a big day with Flink Forward, and we’re doing this podcast and a ton of things going on, so it’s been a really cool day.

LD: Yeah, it has. It’s been a good day. It’s been fun talking about and hearing about what everybody’s doing with Flink because again, the community, it’s growing like gangbusters and that’s been awesome, so it’s been a good day.

KG: Yeah, so again, shout out to the Flink folks. Great conference.

LD: Fantastic.

KG: Really enjoyed participating and excited about the future of Flink. A lot of the new features and a lot of the things that are going on there is super cool. Glad to be a little part of it.

LD: Yeah, same. Alright, thanks, Kenny. We’ll do this again, I think, probably.

KG: Alright, Leslie.

LD: If you want to hear more from Kenny on the topic of materialized views, you can find his Flink Forward talk, as well as all the other great sessions, on YouTube. But if you want to get started with them today, it’s as easy as signing up for the Eventador platform with its 14-day free trial at eventador.cloud/register. Or as always, shoot us a note at hello@eventador.io with any questions, comments, or anything else we can help you with. Happy streaming.
