Podcast: A Look at Modern Data Processing Architectures

May 27, 2020 in Eventador Streams Podcast

In this episode of the Eventador Streams podcast, Kenny and I took a look at today’s data processing architectures, and how, in reality, all data is a data stream. This reality brings a new level of clarity to real-time data analysis, but it also brings a new level of challenges to accessing, wrangling, transforming, and using that data. We chat through how this reality has impacted data analysts and how they’re overcoming these challenges.

Learn more about modern data processing stacks and the pitfalls data analysts face in navigating them in this episode of the Eventador Streams podcast.

Want to make sure you never miss an episode? You can follow Eventador Streams on a variety of platforms including SoundCloud, Apple Podcasts, Google Play, Spotify, and more. Happy listening!

Episode 08 Transcript: A Look at Modern Data Processing Architectures

Leslie Denson: So you’ve got all this data, streaming or batch. Now what do you do with it? Kenny and I dove in on today’s modern data architectures, specifically for data processing, to talk about what’s working, what’s not, and just who’s out there strengthening those pipelines in this week’s episode of Eventador Streams, a podcast about all things streaming data.

LD: Hey, everybody, welcome back to another episode of Eventador Streams. Kenny and I are here today and we are going to talk a little bit more broadly about the entire data processing stack. And it’s a topic that’s come up for us a lot lately and we’re really stoked to be chatting about. So, strap in and let’s go. How’s it going, Kenny?

Kenny Gorman: Good, good! I’ve got the coffeemaker things behind me.

LD: Nice!

KG: It’s working now. It works 100%. I just realized that if I don’t swear at it and I don’t touch it, I just put the water in it, it works out okay. So, I’m figuring out the hacks for the coffeemaker, because when it fails, man, it’s disastrous.

LD: It fails badly. Well, I know that feeling. I know that feeling well. That’s how I ended up with two espresso makers in the house just in case one gets-

KG: Is it hot failover?

LD: Yeah. Just like-

KG: Like if one fails, does it immediately start the other one to get… is there Ethernet between the two?

LD: No, I probably should do that.

KG: All right, yeah, we’re going to fix that.

LD: Yeah, I just thought it would be interesting. Well, to dive into today’s topic, past coffee, although I think the data processing stack, a whole ML model to know whether or not the downstairs coffeemaker or the upstairs coffeemaker has crapped out, could be an interesting test case for this. Let’s just dive in and start chatting about some of it.

LD: So again, we tend to see things bubble up as we’re having conversations, and it’s fun to see them when they start bubbling up in cycles. And one of the things that’s kind of started bubbling up with us is what the challenges are in some of the overall modern data processing stacks. Whether it be batch or streaming, kind of in general, whatever technologies you’re using, there are just some very common pain points that people are going through. So, let’s talk some about those. I know we’ve got a few that have come up. So why don’t we dive into it, because I feel like people are going to go, “Yup, I hear that from my team. That happens a lot.”

KG: Well, I think that the real sort of fork in the road is, if you’re talking about traditional batch analytics, you’re talking about BI, you’re talking about data warehousing, reporting, data engineering, from that perspective, the world is rife with tools and capabilities. And there’s always someone out there to take your money and provide you with a solution. And that continues to evolve; Snowflake comes to mind as being kind of a new contender but very popular.

LD: Right.

KG: But those things are steeped in a little bit more of a use case where, yeah, it’s okay to get the answer in 30 minutes. It’s okay to have an overnight processing cycle. I want to take trillions or billions of events and data points from my business and I want to turn that into some picture for the business. And that’s super well-known. But what’s not as well-known is this idea of real time analytics. And what does that even mean? And by real time, we mean human real time. We don’t mean like machine real time, right? So we mean minutes or seconds or I don’t know, it may be even a little bit longer than that.

KG: But really, use cases are requiring data in new ways and in new forms. In a lot of cases, it’s the rise of things like machines generating more data, things like IoT, things like personalization on websites, buying recommendations and things like that, the rise of data science and machine learning.

KG: There’s a lot of contributing factors that are changing the way we need to work with data. And a lot of that starts to push data engineering teams in the direction of how do I manage the latency from source to endpoint. And I think that’s kind of where our heads are at and that’s the use case that we’re talking about here. Not the kind of more traditional BI or the overnight processing or one hour processing. We’re talking minutes here.

LD: Right. So, why is it hard? I think I’ve mentioned in a podcast before and it’s what you just said, I’ve been working with data analytics companies for a while now. And five years ago, an hour was considered to some degree real time and that time is condensing and these people do need something in minutes because it is becoming more and more critical when you think about IoT for these.

KG: Right.

LD: You think about something like Uber, Lyft, having a differential of a couple of minutes of getting the insight can be detrimental to you. So why is it so hard? We’ve got the tools out there supposedly so why is it so difficult?

KG: Well, right. You’re right, good question. So, and just to put a fine point on those use cases, I mean, if you look at Lyft for instance, they are doing very interesting stuff with Geo encoding, and real time pricing.

LD: Right.

KG: Or you think about just kind of the broad fraud detection use case, right? Like if it’s an hour later, it’s probably useless, it’s probably not preventing fraud from happening. It might tell you how you were defrauded but it’s not preventing it, right?

LD: Yeah.

KG: And so, the tip of the spear in data engineering is being able to react in real time, being able to make great decisions whether they’re machine decisions or decisions by people or maybe even just simple precomputed things. That’s the tip of the tip of the spear here, but you’re right. It is hard because today we’re still using kind of a little bit more of traditional systems in most cases. Yes, we are starting to use Kafka. In general, you can go to a conference today and talk about how people are using Kafka, and it’s prevalent, it’s everywhere. And that’s obviously a source of what we’re calling streaming data for this conversation.

KG: Operational data stores, your transactional database, purchases, and user accounts and things like that are an important source as well. And so the streams coming from those databases are also important change vectors, or a stream of events if you will, that make up a big mix of intelligence for companies. And what’s hard is now I’ve got these disparate data sources. Sometimes it’s semi-structured data, sometimes it’s structured, and maybe it’s even mixed. They all have different schema definitions, or maybe some of them have a varying schema definition. Like maybe it’s JSON coming in that’s just semi-structured.

KG: And then what the industry has been doing for so many years is, okay, we take that and we do something and stick it into a database and tell the analyst or the system that’s reading it, the software that’s reading it, okay, go read it from the database. And maybe that database is like Postgres or Oracle, or Snowflake, or the ELK Stack or whatever, but ultimately that’s kind of the state of the art here: we’re taking event data, essentially, since everything is kind of a stream in this design, and then, to use a broad term, we’re draining it or loading it into a more traditional database.

KG: And that’s terrible because now you’ve put billions of events into a traditional database. And then you say, hey, go read it with a b-tree index or maybe it’s a column store and you’re doing a little bit better and maybe it’s a little bit better architecture. But ultimately, you still have this kind of hodgepodge kind of a design. And maybe in between there, you’re doing some ELT type stuff or ETL or whatever, whichever way your flow works, these days it’s not just straight ETL. And most companies have Python scripts and CSV loaders and microservices that move data everywhere and it’s a big giant mess.

LD: Right.

KG: And that’s kind of, I mean, we do enough engagements where we talk to folks, and it’s like, “Oh, yeah, well, we load the CSVs using Python. And we have a distributed loader that we’ve scheduled, we use Airflow or whatever.” Or, “Yeah, we drain Kafka, we use Kafka Connect, and we put it into Postgres, and we read it from there.” But none of these use cases are particularly performant. None of these use cases really get the data analyst, data scientist, or folks who are ultimately consuming that data, even if it’s an application, the performance that they need. And that’s kind of why the architecture is essentially broken.

LD: So, if the architecture is broken, what should it look like? And I say that because you see it in applications, you see it in IoT, and in a lot of ways it is baked into what the application is. But I think about, there are so many companies that are now going out there and actually selling the result of this pipeline as a major value prop to their org like I think about Capital One.

LD: There are commercials on TV right now that talk about their application that is their fraud detection, where it pops up and tells you this has happened. So it is becoming, as we’ve always talked about, more and more critical. So having that is good.

KG: Right. Good point, because that’s an exciting use case, right? Because a couple of unique attributes that… So, data infrastructure and modern data engineering have been able to bring that kind of capability to the table. And think about some of the unique things that that’s doing; sometimes we forget how far we’ve come. Number one is, we’re performing analytics and detection of some sort on a data stream in real time. So that’s cool, and it’s very personalized, right? So it’s personal to that user, it’s a user experience that’s on an account-by-account basis.

LD: Right.

KG: So that’s number two. Number three is, the results are fed right back to the customer. Whether it’s an SMS alert or whether it’s in their app or on their iOS device or whatever, they might get an alert or notification. That is a very compelling and personalized, very competitive application. And that’s built by these data streams, except it’s extremely expensive and extremely hard to get that for almost all organizations. Like you have to spend years building it, you have to have a huge team, and you’re going to be cobbling together a dozen different technologies or more to make that work right. In my mind, you’re right, that is the state of the art and that’s a great use case, and that’s a great example.

LD: To that point, because not everybody is a Capital One, what should people’s systems look like? Again, we talked about what the architecture shouldn’t look like, or why they’re broken. So, how can a just normal everyday company that’s out there have something that, it’s not going to be Capital One’s app, but how can they get more on that level?

KG: Well, every company is a little bit different in where their data feeds come from and how they’re organized, what systems they’re using, what the data format is. Some people might use Avro or Parquet, some people use JSON, whatever; the feeds are almost always going to be quite a bit different. But what we see is, look, there’s a couple of big problems, and I touched on them earlier. One of the problems is you’re feeding a traditional database. At the end of the day, you’re basically putting data into a database and not much has changed, and that’s fine.

KG: But there’s new indexing patterns, there’s new scalability patterns and all of that, that’s fine, but if we look ahead 5, 10 years, that’s not going to get us there. As data engineering practitioners, we’re not going to get to the finish line of real-time data with that kind of a design. And that’s fine. I think a lot of companies do that as a stopgap, right? They say, “Well, crap, we got to get these reports in Tableau to work Thursday. And, okay, half an hour is good enough for this and we’ll tune it up and invest later.” But that’s because the state of the art is expensive, it’s hard to figure out, it takes months of engineering, and it’s risky, like for people’s careers, it’s risky for them to propose things that are new and tricky.

KG: So, it does take a while, but ultimately the way we see the pipeline developing, and that’s really what it is, a data pipeline from data ingest to data presentation, has a lot more to do with streaming data and is more of a streaming data design than it is batch.

LD: Right.

KG: And the goal should be, for the people thinking about this at least kind of at a high level, how do I make the most sense of my data without putting it into a traditional database? How do I filter, aggregate, wrangle the data in some way so that I’ve taken the firehose of data that’s coming in, maybe it’s clickstreams, maybe it’s IoT, whatever it is, and I’ve pared it down to the most useful data that I need to see? So it’s not big data. It’s fast data. It’s important data, and you’re getting to that important piece sooner in your stack.

KG: So, moving that processing logic up the stack, not in the very end result in the database where you have to use b-tree indexes, but move it up the stack into the stream processing tier and therefore everything downstream is just these views that make sense to the business but they’re compact, they’ve already been pared down, they’ve already been joined, filtered, aggregated, whatever it might need to be to present that to the business.

LD: So, there’s a lot that goes into that and there’s a lot to unpack as you’re thinking about all of that. You mentioned data wrangling as one big piece. But I think another piece that, and it’s something that we’ve talked about from a streaming perspective for a while, is the idea of continuous SQL and query speed. So what-

KG: Yeah, it’s like the listener knows where we’re going with this, I think, so.

LD: I know.

KG: But follow along. I think you know where we’re going.

LD: Yes. So how much does query speed matter in this? Because we may not be talking about a millisecond, but if you’re still talking about, I need an answer back in a couple of seconds, I feel like that matters a lot.

KG: Right, right. So let’s just talk, unpack what continuous SQL is and why are we stoked about it? What does it do, like with this framework that we’ve laid out here, like what is continuous SQL and why is it exciting?

LD: Right.

KG: So, you can do a couple of things. You can make your database faster. I talked about that, right? You can scale it and you can make it planet scale. You can have a killer auto-sharding, fully elastic, cloud-aware capability, and all these magic things can happen. And that’s great, but you’re still querying and filtering the data in that database engine. And so that engine has to have all those blocks full of rows or documents or whatever the particular database stores it in; some of them are very different than others. But in a true traditional database, there are rows, and those rows live in buffers and blocks of data on disk somewhere.

KG: And that’s just kind of the physics of the equation, and you can make that faster, you can make that better. And we’ve been doing that as a data engineering community for many years. We started off with rudimentary data stores way back in the day, we’ve gone through a relational era. There’s things like index-organized tables, there’s things like sharding and global indexes. I think Greenplum was an early innovator there, and on and on and on. And by early, I mean decades ago now.

KG: And recently, we’ve seen distributed computing become way more prevalent and way, way better and more useful and actually reasonable to use. The Hadoop, Hive, and Spark ecosystem is going crazy. We’ve seen distributed databases and NoSQL databases; Mongo comes to mind as a good one. And now there’s a new breed of databases that all have to do with these different data patterns, whether it be Timescale or something like Elastic and the ELK Stack, which has very broad but very good Lucene-based use cases.

KG: That’s kind of where it’s been, but every one of those things has something in common, and you can’t get away from the physics of it: if you drain every IoT event from Kafka into that database, you’re still going to have to have a really, really expensive and awesome database on the back end. And so continuous SQL is a leap from that design. Continuous SQL says, I’m going to run SQL on my data, in this case a stream of data. And I’m going to continuously process and pare down that data, whatever that query is doing, whether it’s joining, or maybe it’s filtering, or maybe it’s aggregating over a window. And I’m going to emit that as a view to the business in some way.

KG: And therefore, the database really just doesn’t see that data. You’re not using the database to pare that down. You’re just saying, “Look, I’m just taking away essentially or joining or maybe I’m upper casing some fields or some simple transform like that, or maybe I’m not including personally identifiable information in some way, or maybe I’m using a UDF to join in another logical business stream of some sort.” But whatever that might be, I’m running that continuously against the stream of data. And the output of that query is continuously being fed, right, so it’s not a cursor that is opened and traversed and the results fed to an app like a traditional database would be.

KG: It’s just always executing against the stream. And then the results of that are being, like I said, fed to some downstream system. And maybe that is actually in this case, a database. But the advantage here is now we’re only feeding the data that makes sense to the machine learning algorithm or to the report that you’re trying to create or some sort of filtered view of personalization for an application. So, or maybe it’s like geo coordinates for a map or some IoT device, right.

KG: So, we’re taking the firehose, taming the firehose as early up the stack as we can with a very robust distributed SQL processing system. And the outputs of those queries are views that are materialized, and the apps can then plug directly into those.
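As a minimal sketch of the idea Kenny describes here, the Python below treats a query as something that runs forever against a stream and yields pared-down rows as events arrive, rather than a cursor that is opened and traversed. It is an illustration only, not Eventador’s implementation: the `continuous_query` function, the field names, and the threshold are all hypothetical.

```python
from typing import Dict, Iterable, Iterator


def continuous_query(events: Iterable[Dict]) -> Iterator[Dict]:
    """Hypothetical stand-in for a continuous query, roughly:

    SELECT UPPER(device) AS device, temp FROM events WHERE temp > 100

    Each result row is yielded as soon as its input event arrives.
    """
    for event in events:
        if event["temp"] > 100:  # the WHERE clause: filter early
            # Project only the needed columns; the PII field (email) is dropped.
            yield {"device": event["device"].upper(), "temp": event["temp"]}


# A finite list stands in for an unbounded stream such as a Kafka topic.
events = [
    {"device": "a1", "temp": 98, "email": "x@example.com"},
    {"device": "b2", "temp": 104, "email": "y@example.com"},
    {"device": "c3", "temp": 120, "email": "z@example.com"},
]
results = list(continuous_query(events))
```

Only two of the three events reach the downstream consumer, and neither carries the email field, which is the sense in which the database “just doesn’t see that data.”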

LD: We obviously are big proponents of the idea of Continuous SQL to do a lot of this, and you’re seeing it. I mean, it’s one of the things that we talked about. And Kenny has a great blog post up on his personal blog about kind of his history with SQL and the history of SQL in general, and why this is a great thing and why it will continue to be used.

LD: One of the things also, kind of context switching away from that a little bit, that I think we’re also… I don’t want to necessarily say we’re hearing about more and more, but I think when you’re talking about bringing latency down as much as you can, it becomes more and more important, is the idea of having a hybrid infrastructure. There are any number of ways that can be architected. How does that impact an analyst, and how does that impact what their workflow should look like?

KG: Right, right. Well, I mean, hybrid means a lot of things to me, right, when we’re talking hybrid, that word in data systems.

LD: Right.

KG: And I think one of the definitions that are interesting to talk about is a hybrid cloud. What does that mean? How are we stitching together various cloud resources to work together? And that’s common in most organizations. It may just be regional and that’s very simple, you’re in the same cloud provider, or maybe you’re using the analytic stack in the Google Cloud and most of your data sits in Amazon, that’s actually a pretty, pretty common one. But I also think of hybrid as like hybrid data sources, hybrid data formats.

KG: Like I said earlier, it’s very common to say, “You know, look, in order to make this report work or this piece of intelligence work or this model work, I have to stitch together these three or four different disparate data sources. And one of them is traditional batch, like it’s just a static, almost semi-static piece of data, maybe it’s even cultivated from historical data.” Then it’s like, “Oh, and then I also have to pull in this Kafka stream of like click events or transactions from a pipeline to detect fraud or something.” And, “Oh, and also I have to pull in the account changes from the ODS to make sure that these are all live accounts,” or something like that.
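To make the stitching concrete, here is a small Python sketch, under loud assumptions: the three sources are reduced to a static lookup table (the batch data), a transaction stream, and an account-change stream, with the two streams already merged in time order. The `stitch` function, the `source` tag, and every field name are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator


def stitch(events: Iterable[Dict], risk_by_country: Dict[str, str]) -> Iterator[Dict]:
    """Join a transaction stream against account changes and a static table."""
    live_accounts: Dict[str, bool] = {}  # latest state seen on the account-change feed
    for event in events:
        if event["source"] == "account_change":
            live_accounts[event["account"]] = event["active"]
        elif event["source"] == "transaction" and live_accounts.get(event["account"], False):
            # Only transactions on accounts known to be live are passed on,
            # enriched from the semi-static batch lookup.
            yield {
                "account": event["account"],
                "amount": event["amount"],
                "risk": risk_by_country.get(event["country"], "unknown"),
            }


risk_by_country = {"US": "low", "XX": "high"}  # stand-in for the batch source
events = [
    {"source": "account_change", "account": "a1", "active": True},
    {"source": "transaction", "account": "a1", "amount": 25, "country": "US"},
    {"source": "account_change", "account": "a1", "active": False},
    {"source": "transaction", "account": "a1", "amount": 900, "country": "XX"},
    {"source": "transaction", "account": "b2", "amount": 10, "country": "US"},
]
stitched = list(stitch(events, risk_by_country))
```

The second transaction on `a1` is dropped because the account was closed first, and the `b2` transaction is dropped because no account record has been seen for it.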

KG: And so, hybrid in my mind is like, it’s all those things. The modern data engineer has to think about all those things. And every one of those things is probably owned by a different team. They probably have different security creds or different ways of doing things. They use their systems differently. And how do you stitch that together? So one important thing that we thought about early on was, like, JSON is a very common data format. And in our stack, we support both JSON and Avro, it doesn’t matter.

KG: But in the JSON case, it’s a little bit more tricky. Because in Avro, you’ve already defined your schema and bam, it’s no big deal. You have a schema registry and your data is in Avro format, you deserialize it and you can do whatever you want. But in the case of JSON, which is I think more common, or just any kind of unstructured data, you have a schema that may mutate. And detecting that schema, which is something that we do automatically, and presenting that as structured, with types, and strongly typed, right? It’s got to be strongly typed for SQL to work.

KG: And so that’s an important construct: how do you traverse this notion of semi-structured data and loose data typing to strongly typed data in your flow? And that’s like a big problem with modern infrastructure. And I think, the way we handle it is, we say, look, you just tell us what the source is. Maybe it’s Kafka with JSON, and maybe it’s a Change Data Capture stream which would turn into JSON, and that’s… So we’ve got JSON and JSON, you want to join them. They’ve got different schemas. We automatically detect the schemas and data types.

KG: In our case, we sample the data to actually detect those types in the data, build you a schema automatically, and then now you can see that as what most people would call a table. It looks like columns with types. So, that’s big, and it sounds almost trivial. It sounds almost like, “Well, duh, of course, I’m going to need that.” But that is a big problem in a modern enterprise: how do I make sense of that? And now what you have is you have tables that represent these data streams and they’re named and they’re saved, and they’re durable.
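The sampling approach described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the detection logic Eventador actually uses: the `infer_schema` and `type_name` helpers, the SQL type names, and the widening rules (integer plus float becomes `DOUBLE`, any other conflict becomes `VARCHAR`) are all hypothetical simplifications.

```python
import json
from typing import Dict, Iterable


def type_name(value) -> str:
    """Map a decoded JSON value to an illustrative SQL type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "VARCHAR"  # strings, and (as a simplification) nested values and nulls


def infer_schema(sample: Iterable[str]) -> Dict[str, str]:
    """Build a column -> type mapping from a sample of JSON records."""
    schema: Dict[str, str] = {}
    for line in sample:
        for field, value in json.loads(line).items():
            seen = type_name(value)
            previous = schema.get(field)
            if previous is None or previous == seen:
                schema[field] = seen
            elif {previous, seen} == {"BIGINT", "DOUBLE"}:
                schema[field] = "DOUBLE"  # widen mixed numerics
            else:
                schema[field] = "VARCHAR"  # fall back on any other conflict
    return schema


sample = [
    '{"vehicle_id": 7, "temp": 71, "ok": true}',
    '{"vehicle_id": 8, "temp": 71.5, "ok": false, "note": "late"}',
]
schema = infer_schema(sample)
```

Note that the second record both changes the type of `temp` and adds a `note` field, which is the “schema that may mutate” case mentioned earlier in the conversation.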

KG: And you can say, “Oh, yeah, that’s the data feed for customer transactions. Oh, that’s the data feed for events from the website. That’s the data feed from the actual transactions from the point of sale or something.” And as you name them, now, you have these Lego blocks, these core components that you can build queries off of. And you can wrangle the data, you can start to write queries and filter and build your WHERE clause. And there’s no indexing to think about, right? Because you’re performing this on a stream continuously. So indexing is like not even a thing. You don’t have to index. We don’t think about indexes. It is the index, right? We’re processing continuously the index.

KG: It’s a paradigm shift from a kind of a mental model perspective, and the views that come out of this are ever mutating. And they look like a database view, like in a typical database, but it is the view of the stream. And that’s the real power here: you’ve now wrangled not only the data in terms of how you want to see it and have it presented and grouped and categorized and filtered, but you’ve also then pulled in various sources to form this picture of whatever makes sense for the business.

LD: So I think we’ve talked about it a little bit, but dovetailing from that and to put a finer point on it: what is the delta between, and I’ll say today’s streaming but really just today’s data technologies, and what data analysts really need? Because there are a lot of things out there, and they’re clearly doing some of this right now, even if it’s not great, but, as with anything, I think it could be better. So what is that delta?

KG: I agree. There is a delta between kind of today’s state of the art. And this is evolving, right? Like everybody that we talk to, and you go to the conferences, you go to meetups, you talk to folks on the forums, and ultimately, this is an evolving picture, especially the streaming space around Kafka and stream processors like Flink, and that whole ecosystem is rapidly, rapidly growing. But I think, the thing that we’ve really been keen on is that Structured Query Language is important.

KG: It is a great neutralizer in an organization of talent gap, it is a great reducer of time to market, and it’s known by everybody and I think there’s a certain comfort level there with almost everybody, whether you’re performing analytics on a stream or database or you’re a data scientist or you’re an application engineer, everybody’s used to using SQL in some shape or form. And if you’re not, man, there’s so much documentation.

LD: Right.

KG: It’s almost like if you don’t know SQL already and you’re one of those roles, it would behoove you to probably know that from a career development standpoint. So I think, SQL is a first class citizen in our mind. That’s exciting for me to use SQL in a new and interesting way on a stream, and basically circumvent a lot of the latency and slowness problems that we’ve seen in this flow and in these use cases. And I think it’s only going to get better. One might ask, is that SQL like standard SQL? Can you do all the things?

KG: The answer is complicated. It depends. Yes, for the most part, especially in our stack, we use Calcite, the Calcite SQL standard, if you will. And that’s great. But that’s not 100% of everything. And that’s evolving, and we’ll continue to kind of build more operators and allow time windows to be more granular. I mean, like today, you can do a hopping window or a tumbling window of various types. And we’ll continue to build that out and continue to work with the communities to build that out.
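For readers unfamiliar with the two window types just mentioned, the following Python sketch shows the usual semantics: a tumbling window assigns each event to exactly one fixed-size bucket, while a hopping window of the same size that advances in smaller steps assigns each event to several overlapping buckets. The function names and the half-open `[start, end)` convention are assumptions made for illustration, not the Calcite or Flink implementation.

```python
from typing import List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """Return the single [start, end) window of length `size` containing `ts`."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window of length `size`, with starts aligned
    to multiples of `hop`, that contains `ts` (newest window first)."""
    windows = []
    start = ts - ts % hop  # latest aligned start at or before ts
    while start > ts - size:  # the window must still cover ts
        windows.append((start, start + size))
        start -= hop
    return windows


# An event at second 430, with five-minute windows hopping every minute.
one = tumbling_window(430, 300)
many = hopping_windows(430, 300, 60)
```

With a window size of 300 seconds and a hop of 60 seconds, each event lands in `size / hop` overlapping windows.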

KG: So, I think there’s an evolution there, and I think SQL is kind of, frankly, at the top of that list in my mind. The other thing though is state management. And this is kind of an under-the-covers thing. But when you do SQL on a stream, you must manage state. And I don’t mean the state of the output data, that’s kind of the materialized view we’ve talked about. It’s the state of the computation as it’s in flight. So let’s say you’re averaging temperatures by vehicle ID over a five-minute period. You have to have state to know the key that you’re aggregating over, which is vehicle ID. You have to then keep a running total to do the average, right, because that’s how the math works for average.

KG: And somewhere, that state has to be saved so that you can actually compute it. And more so, if something fails, you don’t want to have to restart over that window to recompute the state, if that makes sense. And that, in its very basic form, is what they talk about when they say stateful stream processing. And that is something that we’re very good at today. And frankly, we’re really good at it because Flink is a big part of that.
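The vehicle-temperature example can be sketched directly. The Python below keeps a running sum and count per (vehicle ID, five-minute window), and shows why saving that state matters: after a simulated failure, the operator is restored from a snapshot and carries on without replaying the window. The `WindowedAverage` class and its JSON snapshot format are hypothetical and far simpler than Flink’s actual checkpointing.

```python
import json

WINDOW_SECONDS = 300  # the five-minute period from the example


class WindowedAverage:
    """Running average of temperature per vehicle per five-minute window."""

    def __init__(self):
        # In-flight state: (vehicle_id, window_start) -> [running_sum, count]
        self.state = {}

    def process(self, vehicle_id: str, ts: int, temp: float) -> None:
        window_start = ts - ts % WINDOW_SECONDS
        acc = self.state.setdefault((vehicle_id, window_start), [0.0, 0])
        acc[0] += temp
        acc[1] += 1

    def average(self, vehicle_id: str, window_start: int) -> float:
        total, count = self.state[(vehicle_id, window_start)]
        return total / count

    def checkpoint(self) -> str:
        """Serialize the in-flight state so it can survive a failure."""
        return json.dumps(
            [[vid, start, total, count] for (vid, start), (total, count) in self.state.items()]
        )

    @classmethod
    def restore(cls, snapshot: str) -> "WindowedAverage":
        op = cls()
        for vid, start, total, count in json.loads(snapshot):
            op.state[(vid, start)] = [total, count]
        return op


op = WindowedAverage()
op.process("truck-1", 10, 70.0)
op.process("truck-1", 20, 74.0)
snapshot = op.checkpoint()  # state saved mid-window

recovered = WindowedAverage.restore(snapshot)  # simulate a restart after failure
recovered.process("truck-1", 30, 78.0)
avg = recovered.average("truck-1", 0)
```

The recovered operator averages all three readings even though it only ever saw the third one itself, because the sum and count for the first two came from the snapshot.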

KG: And we use Flink heavily, its checkpointing capabilities and its underlying architecture and how it communicates between various systems in a distributed-systems sense. We use Flink for that, we leverage Flink to a large degree. And that’s a very, very powerful feature and an important one to think about when you’re thinking about distributed systems, state, fault tolerance, and truly scalable processing on streams of data. These are like core building blocks and important attributes, and that’s something that we’ve worked really hard to make sure is part of our stack.

KG: I mean, I don’t think you see that in other places. So that’s kind of unique to us. I think stateful stream management and stateful stream processing are going to be areas folks hear about more and more going forward in the future. Flink has been carrying that flag since kind of day one but more and more I think folks are understanding that if we have to process everything in the database just in order to use state, the database is going to fall on its head, and in most organizations it is.

KG: And so, moving it up the stack into the stream processor, you can’t totally get away from some of the things that the databases give you, some niceties they give you, like state management and recoverability. Those types of things have to remain as part of a responsible distributed system with data, that kind of thing.

LD: And that’s why we love Flink. So, who else is out there that’s doing this like-

KG: Well, yeah. I mean-

LD: So let’s talk a little bit about the competitive landscape here.

KG: This is an evolving field-

LD: It is, yes.

KG: And there’s a lot of complexities. And I think while we’re excited about solving the problems we’re solving, and I think obviously we’re thinking in a detailed way about them, this field, just to be frank, is still evolving. It is still a-

LD: It is.

KG: … a new science and a new way of thinking, paradigm shift for a lot of folks, streams of data, and thinking of the world in streams is a new paradigm. Jay at Confluent has talked about this quite a bit and obviously their company is kind of based on this premise. That’s kind of a mind shift. But if you think about the landscape, like who is kind of thinking about things in the same way as we are? The first thought in most people’s minds is, yeah, I just mentioned Confluent. Is Confluent a competitor or a person? No, the answer is no. And you might say, “Well, ksqlDB does some of these things.

KG: And what I would say to that is, just because you hear the word SQL and just because you hear the word streams doesn’t necessarily mean that we have the same use cases and the same processing paradigms in mind. ksqlDB is killer for, if Kafka is your data and you need to look at Kafka, then it’s a great way to do that. If you need to build a simple app on a stream and you want to connect with REST and make an app out of that, great. But if you’re thinking about an entire enterprise ecosystem and you’re thinking about plugging in Kafka to the rest of your company and how does it fit, and what is the componentry and where does the processing layer live for that?

KG: You start to think of something more like Materialize. That’s something that they do a really good … They’re written in Rust. They have a similar but different architecture around stream processing based on Timely Dataflow. They have a Postgres sync, if you will, so they’re highly tied to Postgres, which is great. I’m a huge fan. And I think those guys are very smart over there, really respect them. Big shout out to the guys at Materialize, totally get it, great company, great folks, smart folks.

KG: And then I’d say Rockset. Rockset is another company that kind of gets it; it’s going in a similar direction. Again, now Rockset is interesting because they’ve really invested in this idea of indexing. And essentially, Rockset is a RocksDB distributed system, a cloud native distributed system around RocksDB. And they have some really, really smart ways that they index data and ultimately do exactly what I’m talking about. And, frankly, Rockset can do a lot of this stuff. Now, where I think Rockset gets it wrong is they’re still a database. And so, when you think about indexing, well, we don’t think about that because we’re processing data upstream.

KG: And so, we don’t ever have to index those things. We don’t have to take that burden. We actually have the nicety of saying like, “Look, on a stream, we can operate continuously.” And so where they say fast SQL and have killer indexing and a killer distributed system to do that, we say, “We’re not fast query, we’re continuous SQL. We’re continuously processing on a stream and we have a killer distributed system to do that.” So, I think that’s kind of the nuance between the two.
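To make the contrast concrete, here is a minimal Python sketch, assuming a hypothetical stream of page-view events. The function names and the event shape are illustrative only and are not part of any Eventador, Rockset, or Flink API. The first function updates its answer every time an event arrives (the "continuous" style); the second scans whatever has been stored whenever someone asks (the "store, then query" style).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def continuous_count_by_key(
    events: Iterable[dict], key: str
) -> Iterator[Tuple[str, int]]:
    """Emit an updated (key, count) pair each time an event arrives.

    Roughly the continuous analogue of:
        SELECT <key>, COUNT(*) FROM events GROUP BY <key>
    The state is maintained incrementally, so nothing is re-scanned.
    """
    counts: Dict[str, int] = defaultdict(int)
    for event in events:
        counts[event[key]] += 1
        yield event[key], counts[event[key]]


def query_stored_events(stored: List[dict], key: str) -> Dict[str, int]:
    """Store-then-query: scan everything stored so far, on demand."""
    counts: Dict[str, int] = defaultdict(int)
    for event in stored:
        counts[event[key]] += 1
    return dict(counts)


# Hypothetical sample data.
events = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]

# Continuous style: one result update per incoming event.
updates = list(continuous_count_by_key(events, "page"))

# Query style: one full scan at the moment the question is asked.
snapshot = query_stored_events(events, "page")
```

Both styles agree on the final counts. The difference is when the work happens: per event as data arrives, or per query after the data has been stored.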

KG: And so if you really wanted to kind of look at what the state of the art is, and by the way, Materialize kind of seems to be going in that direction too. So, the three of us, I’d say, all have a little bit different take on things: Rockset, Materialize, and Eventador. And we’re all kind of chasing the same dream, frankly. And there are smart people in all those companies, tons of respect. And we get it wrong in some areas and they get it right in others. Frankly, I think our approach of moving SQL upstream and running it continuously on the stream is kind of our big innovation.

KG: We’ve invested heavily on making those queries interactive. So that’s something we bring to the table. So if you’re in an editor and you’re typing SQL, you can actually get instant feedback against that stream. So if you mistype a column, you get a value wrong, you get your SQL wrong, we’re instantly giving you that feedback. It’s not like SQL embedded in a string in a Java program, in a JVM running on a distributed cluster, it’s not that. That would give you some stack trace if it failed, right? The ability to interactively query and build these views of data in real time is a big part of the innovation that we bring to the streaming components.

KG: But I think that’s kind of what it looks like right now. If you think about the cloud vendors, you think about AWS and AWS Lambda, and some of the Google products, I think this stack is going to continue to be more and more exciting. There’s going to be innovation and, you know what, the clouds, especially Amazon, they do what they do. So when we think about use cases and pain points, I think that’s the real place that a company like Materialize or Rockset or ksqlDB can excel: really bringing a pointed solution, an opinionated solution around a real acute pain point that is very well tied in with the problem set that modern companies have.

KG: I think the clouds build gigantic generic solutions. And while that’s awesome, it’s not always the answer, so, yeah. And you’re also locked into their APIs and stuff for the most part. So, those are kind of… if someone were to say, like, “Hey, Kenny, what does the rest of the world look like? I hear your opinion on it but what do the rest of the players look like?” I think those would be the places that I’d look to kind of compare and contrast what we’re talking about with how others are doing.

LD: With that in mind, and I know you talked a little bit about this a second ago, but if somebody were out there going, “Okay, I see what you’re saying in this podcast, it sounds like my company.” What, from our perspective at least, from the Eventador side of things, would have us stand out from the crowd? What makes us uniquely suited for this type of use case?

KG: Right. Well, it’s interesting because our backgrounds are database engineering backgrounds. If you look at Erik’s and my past and the other companies that we have, I mean, we’re database nerds. But what’s interesting is the world has been changing and evolving and event streaming has become important. And frankly, in the modern data infrastructure, event streaming is an important piece of it. We don’t think that you should change your entire company to streaming, that’s very naive to say that. But there are use cases where event streaming and things like Kafka, Pravega, Pulsar, make tons of sense. And we believe heavily in that. And we believe the world is going in that direction.

KG: And so, our philosophy has been kind of a streaming first philosophy. Everything is a stream philosophy. And when you move to that mindset where it’s like, oh, okay, so even my ODS or my system of record or just my main database, if you have just a few databases, is generating an event stream from its recovery log, whether that be like your redo log from Oracle or your binlog or whatever that might be, whatever database technology you’re using, those databases are generating a change log, a recovery log typically of changes, and that can be captured as a stream and thought of as a stream.
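The idea that a database’s change log is itself a stream can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the change-record format (`op`, `id`, `row`) is hypothetical and does not match the actual Oracle redo log or MySQL binlog formats, which are binary and far richer.

```python
from typing import Dict, Iterable


def apply_change_log(changes: Iterable[dict]) -> Dict[int, dict]:
    """Rebuild table state by consuming a change log as a stream.

    Each change is a hypothetical record with an "op" of insert,
    update, or delete, the primary key "id", and (except for
    deletes) the changed columns in "row".
    """
    table: Dict[int, dict] = {}
    for change in changes:
        op, pk = change["op"], change["id"]
        if op in ("insert", "update"):
            # Merge the changed columns into the current row image.
            table[pk] = {**table.get(pk, {}), **change["row"]}
        elif op == "delete":
            table.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical change events, in commit order.
change_log = [
    {"op": "insert", "id": 1, "row": {"name": "ada", "plan": "free"}},
    {"op": "insert", "id": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "update", "id": 1, "row": {"plan": "pro"}},
    {"op": "delete", "id": 2},
]

state = apply_change_log(change_log)
```

Replaying the stream in order yields the same table a replica would hold, which is why the same log can feed a stream processor instead of, or in addition to, a replica.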

KG: And so when you think of it that way, kind of everything starts to look like a stream, you put on the stream glasses and you’re like, “Well, if everything is a stream, how can I leverage that? How can I take advantage of that from an architectural perspective to make a higher-performance solution for folks?” And the answer in our book is continuous SQL and using Flink and using Kafka and building our technology stack in Java around that. And that has been our take on things.

KG: So, the reason we’re optimistic is we think like, “Well, okay, but even Logstash now can be thought of as a stream of data coming into your system.” Even something we traditionally wouldn’t have thought of as being a stream, or maybe even a set of files on S3, is an input stream now. You can think of it that way as well. And so when you start to think of it that way, then, okay, I can understand, then we’d have to run SQL continuously. And when you get to that point, continuous SQL becomes amazing because you’re pulling all those different disparate data sources, you’re processing them in real time, and only the important pieces of data are coming out of that query.

KG: You have pared down, as high in the stack as you can, as early in the stack as you can, if you want to think of it like in a pipeline metaphor, the data to the most useful or most profitable bits in that stream, and you’ve done it in a highly performant, scalable, fault-tolerant way using open source and best of breed of components.

KG: And you’ve presented these APIs out to the business as durable, named APIs, these materialized views, and the business can consume them and trade them and use them in their apps. They’re durable. They have API keys and security wrapped around them. They have team features wrapped around them. And now, the enterprise has these data endpoints that are meaningful in their context that have been curated and built by SQL on the stream in real time at scale.
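A minimal Python sketch of that idea follows: a named view that is kept up to date as events arrive, keeps only the rows that matter, and is read through an API key. The class, its methods, and the sample key are hypothetical and only illustrate the concept; they are not Eventador’s API.

```python
from typing import Callable, Dict, List, Set


class MaterializedView:
    """A named, continuously maintained view over a stream (illustrative)."""

    def __init__(
        self,
        name: str,
        predicate: Callable[[dict], bool],
        key_field: str,
        api_keys: Set[str],
    ) -> None:
        self.name = name
        self._predicate = predicate
        self._key_field = key_field
        self._api_keys = api_keys
        self._rows: Dict[object, dict] = {}

    def on_event(self, event: dict) -> None:
        """Update the view as each event arrives; drop rows that no longer match."""
        key = event[self._key_field]
        if self._predicate(event):
            self._rows[key] = event
        else:
            self._rows.pop(key, None)

    def read(self, api_key: str) -> List[dict]:
        """Return the current contents of the view to an authorized caller."""
        if api_key not in self._api_keys:
            raise PermissionError(f"invalid API key for view {self.name!r}")
        return list(self._rows.values())


# Roughly: SELECT * FROM orders WHERE amount >= 100, keyed by order_id.
view = MaterializedView(
    name="large_orders",
    predicate=lambda e: e["amount"] >= 100,
    key_field="order_id",
    api_keys={"demo-key"},  # placeholder key for the example
)

for event in [
    {"order_id": 1, "amount": 250},
    {"order_id": 2, "amount": 20},   # filtered out before it is ever stored
    {"order_id": 3, "amount": 120},
    {"order_id": 3, "amount": 80},   # order 3 no longer qualifies
]:
    view.on_event(event)

result = view.read("demo-key")
```

The filtering happens as events arrive, so a consumer reading the view only ever sees the pared-down result rather than the raw stream.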

KG: That’s a mouthful, I probably couldn’t say all that twice. But that’s kind of where our heads are at. That’s how we’re thinking about the modern data stack. And that’s how we’re thinking about the modern real time analytics stack. And like I said early on, contrasting that against batch and things like that.

KG: We’re not Snowflake, we’re not trying to be Snowflake, we’re not trying to be an ODS, we’re not trying to be your system of record per se, we really plug in as early in the stream as we can to give high performance results to the business and take away a ton of that componentry, kind of duct tape infrastructure, ETL stuff, we replace that stuff pretty well, and do it in a massively high performance way. So that’s kind of where we are thinking about things and kind of where we’ve pointed the company and the solution.

LD: Listeners will hear us talk about this topic again. It’s large, there’s a lot to unpack. It’s very important to organizations and to us, but anything else as kind of the introductory primer?

KG: Yeah. I think that in reality, it would be easy to listen and say, “Sounds cool, but I’ve got seven databases and I’ve got all sorts of problems today.”

LD: Right.

KG: Or, “I’m trying to get data to folks and my analytics team has beaten me up.” And whether you’re the analyst or that data engineer building stuff or whatever, it would be easy to say like, “Well, everything is not a stream and it’s hard for me to go from A to B.” But I’d encourage you, the listener, just to not worry about that too much. Honestly, plugging change data capture into an existing database, that’s a well-known paradigm, that’s pretty easy to do these days. I mean, there are a lot of different ways of doing that. We have our own way of doing it, but we know you can do that with open source and spool data right to files pretty easily these days, right?

LD: Right.

KG: When you do replicas and whether it’s databases or something like Mongo or whatever, replicas typically work on this paradigm. It’s very well known, very robust. And Kafka is on its own adoption trajectory. And so that’s kind of like, “Look, Kafka has been adopted because it makes sense for a ton of stuff.” But downstream, you’re kind of held hostage to that because you’re like, “Well, how do I now use that? How do I now make sense of that in my business? How do I integrate that into a report or a piece of output that I’m used to using a database for?” And the answer is you just can’t.

KG: And so, that’s where continuous SQL or automatic schema inference and presenting these data APIs out to the organization come in. And now you can define and create a durable and named API, data API. You can use that in your tooling. And you can build your analytics on it. You can build a report, you can build apps on it, you can do machine learning on it, whatever it might be. And you don’t have to care. You don’t have to care about what the next thing is coming on the data. You don’t have to care about the rate of data. It’s never too much.

KG: Continuous SQL has that property of saying, “Hey, I’ve got the firehose but it just doesn’t matter.” Yes, the latency is affected by the amount of concurrent incoming messages but it’s highly distributed, highly scalable. So you can scale that cluster to accommodate all sorts of crazy workloads and I mean trillions and trillions of events. So, it really eases the data engineer’s burden because now you have this framework for ingesting basically any kind of feed and then providing that back out, even self-service style, to your organization in a way that they’re familiar with consuming.

LD: It’s one of those fun things: as we have conversations with customers, as we do these podcasts, as we do different things, you start to hear different topics bubble up to the top, and this is one that we’ve been obviously talking about a lot and we’re hearing more and more of, so it’s great to chat more-

KG: We need a way for customers to ping us back and to talk about these and follow up on these topics. Like I think that’s something that we’ve had a lot of people kind of email us about or whatever, and maybe we need to figure something out. I’d be interested in hearing from folks listening, like what would be a way that they’d want to follow up and engage and talk about some of this stuff if they have questions or they want to call us out or whatever. It would be good to get a dialogue going because I think we’ve had enough kind of inbound inquiries in various random forums, like I can’t keep up with the Twitter DMs. There are too many different ways, whatever.

LD: That’s a very good point. So, we always have our email address at the end of these, which is hello@eventador.io, and we hear a lot from people there or on Twitter or on LinkedIn, as Kenny had said. But, yeah, if there’s a specific way that you guys have interacted with podcasts before that you feel would be a good way to chat, email us or send us a Twitter DM to let us know, and we’ll take a look for sure. Because I think we’re episode nine into this, we have a few more in the hopper that are scheduled to come out as well.

LD: There’s been some really great feedback and we really want this to be community focused and we really want to be able to hear from you guys more and more. And if you have suggestions for guests or anything like that, we certainly want to hear it. So don’t be afraid. We don’t bite, we promise, we’re very nice-

KG: Oh, I think it’d be cool to hear from folks. What are they doing today? Like, what is your architecture today? And with the things that I said in mind, does it make you think differently about what you’re doing or not, and where could you see something like this plug in? It would be interesting to me to hear back and get feedback and keep going.

LD: Yeah. And you heard it here. Kenny wants you to send him all the emails.

KG: Oh, God.

LD: All the emails. But, yeah, no. We promise we’re nice. We don’t bite. We’d love to-

KG: We’ll get back to everybody. If people send a message, we’ll get back to 100% everybody.

LD: Absolutely, we will. All right. Well, you heard it here first, folks. Again, I think we’ll do this again sometime. This is good. This is great conversation.

KG: Yeah. Thanks, Leslie. Awesome.

LD: We’ll be back for more.

KG: All right. Thank you.

LD: Well, folks, again, you heard it here first. And we want to hear more from you. You can reach us via Twitter @EventadorLabs or on email at hello@eventador.io with any feedback you have including better ways to give feedback. And especially we want to hear what your organization’s data processing stack looks like and how you may have solved some of these problems. As always, check out eventador.io to learn more about us and to find links for all the ways to get in touch. Happy streaming!
