Unlocking Kafka’s Value for Machine Learning and Data Science Pipelines

April 2, 2020 in Continuous SQL




As organizations rely more and more heavily on streaming data to power business services and critical applications, data scientists are one key group looking for better access to real-time data so they can use it to improve existing machine learning models or build new ones. Across the board, data science teams are under added pressure to reduce latency in their machine learning pipelines, rapidly develop new AI-based applications, create alerting using complex event processing (CEP), and, overall, pinpoint previously inaccessible business insights with those models. It's a topic that comes up again and again with current customers and in other conversations we have: data science teams are hungry for better access to Kafka-based streaming data.

But to get that access and perform the work they're tasked with, there are some hurdles to overcome. These hurdles slow down access to Kafka-based data, and as more feeds are added or new streaming jobs are created, the performance impact can snowball. Some use cases, such as financial fraud detection, require sub-second response times to be effective. A machine learning pipeline that only tells you fraud occurred after the transaction has already gone through doesn't do much to help the business.

All Day Long I Dream About JVM Languages? 

If you surveyed 100 data scientists about their favorite programming language, it's doubtful that many (or any at all) would answer Java or Scala. In fact, according to the 2018 Kaggle Machine Learning and Data Science Survey, only 21% and 4% of data scientists surveyed use Java or Scala, respectively, on a regular basis. These languages aren't a natural fit for data science workflows, and they inject unneeded barriers and complexity into data pipelines.

But the reality is that most streaming technologies rely on those two languages instead of something more widely known, like SQL. So to get this functionality, data science teams are either bottlenecked waiting on technology and feature requests from the backend teams that own the underlying infrastructure, or they take matters into their own hands, spending more time building a stack that works for them and massaging the data into a usable format, and less time actually doing the work that is truly part of their job.

Consistently Inconsistent Data

Just as with static or batch data, not every dataset (or in this case, data stream) is consistent. And while data engineers typically take on the role of transforming these datasets, data scientists still need to refine and distill the data to extract value, and actionable insights, from it. This is the essence of what data science teams are built for.

However, as new data feeds are created or the required business logic becomes more complex, refining and distilling the data into a usable format also becomes more complex, and the actual work with machine learning models gets delayed. It's more beneficial to the data scientist, and to the company as a whole, for data science teams to spend less time normalizing or wrangling data and more time creating, training, and optimizing the machine learning models and pipelines that deliver business-critical information.

Streaming Data At Database Speeds

The beauty of streaming data is that it is, well, streaming. New or updated events stream in, and data scientists can have up-to-the-second (or sub-second) feeds of relevant data. That said, even though the data is continuously streamed in and the organization has it, getting a clear view of the most up-to-date information usually requires some sort of persistence mechanism, often a database. And the second you have to dump streaming data into a traditional RDBMS to work with it, it loses the competitive edge of being “real-time.”

For instance, how effective would a ride-sharing app be if it only showed which cars were in your area 5 or 10 minutes ago? Or, on the flip side, how effective would it be if it showed every car that's ever been in your area? The answer to both: not very effective at all.

A few tech- and data-forward companies have spent massive amounts of time, money, and personnel resources finding ways to overcome this—they had to in order to become and remain successful. But many companies don’t have the necessary resources to speed up their streaming pipelines by eliminating the reliance on databases to materialize views of their streaming data. 

Give Data Scientists the Keys to Kafka-based Streaming Data

The hurdles discussed here aren't unfamiliar to data scientists; they faced similar challenges when making the leap from pure RDBMS batch processing to a more near-real-time micro-batch (think Spark) technology stack. But squeezing every bit of latency out of a pipeline while also making the data easily (and instantly) accessible to data science teams is a challenging task that only a few companies have undertaken. StitchFix is one of those companies, and it's worth reading their write-up detailing why and how they did it, along with the challenges they faced along the way.

But for companies and data science teams that don't have the resources of a StitchFix, or don't want to spend them, the Eventador Platform quickly unlocks the value of streaming data with functionality that makes it dead simple to access and use Kafka data streams.

First and foremost, we've built a powerful and robust Continuous SQL engine that lets users query and create computations on Kafka data using familiar ANSI SQL. What might have taken hundreds of lines of Java code can be expressed in a simple SQL query that looks just like what you would use to query a database. Think about how much faster it would be to create, update, and optimize machine learning models based on Kafka data when you're relying on SQL rather than something like Java.
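To make that concrete, here's a minimal sketch of the kind of continuous query this enables. The topic name, fields, and filter are purely illustrative rather than a real schema; the point is that the whole computation is expressed in plain SQL.

```sql
-- Hypothetical example: a continuously maintained aggregate over a
-- Kafka topic of payment events. Topic and field names are illustrative.
SELECT
  card_id,
  COUNT(*)    AS txn_count,
  SUM(amount) AS total_spend
FROM payments          -- the Kafka topic, exposed as a SQL table
WHERE amount > 0
GROUP BY card_id;
```

Written against the raw Kafka client APIs in Java, the same computation would need a consumer, deserialization, and state handling before any modeling work could even begin.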

Additionally, the Eventador Platform delivers functionality including Input Transforms, which eliminate the complexity of normalizing JSON data, and JavaScript Functions, which let users easily create user-defined functions (UDFs) for custom or complex business logic and call them from SQL. Both of these tools give data scientists the power to better refine and distill data streams, so those streams are more easily integrated into, and usable from, their machine learning pipelines.
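As a rough illustration, suppose a team registers a JavaScript function that cleans up a messy field before it reaches the model. The function name and arguments below are hypothetical; what matters is that the custom logic is invoked directly from SQL rather than from a separate JVM job.

```sql
-- Hypothetical UDF call: NORMALIZE_MERCHANT stands in for a JavaScript
-- function a team might register; topic and field names are illustrative.
SELECT
  card_id,
  NORMALIZE_MERCHANT(merchant_name) AS merchant,  -- custom logic lives in the UDF
  amount
FROM payments;
```

Input Transforms play a similar role upstream, flattening or normalizing raw JSON before the SQL layer ever sees it.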

And with last month’s launch of v2.0 of the Eventador Platform, data science teams no longer have to rely on databases to get a materialized view of Kafka data. Now, directly from machine learning pipelines, notebooks, applications, and more, data scientists can use materialized views that are defined using ANSI SQL, automatically indexed and maintained, and arbitrarily queryable via RESTful endpoints. You have the flexibility to query by secondary key, perform range scans, and use a suite of common operators against these views.
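To give a feel for how the pieces fit together, here's an illustrative sketch. The exact DDL and the REST URL shape are assumptions made for the example, not the platform's documented syntax, and the view, key, and fields are hypothetical.

```sql
-- Hypothetical view definition: one always-current row of spend
-- statistics per card_id, maintained as events arrive on the topic.
CREATE MATERIALIZED VIEW card_spend AS
SELECT
  card_id,
  COUNT(*)    AS txn_count,
  SUM(amount) AS total_spend
FROM payments
GROUP BY card_id;

-- A notebook or application could then fetch the current row for a key
-- over REST, e.g. GET /views/card_spend?card_id=1234
-- (the URL shape is made up purely to illustrate the idea).
```

Because the view is indexed and maintained by the streaming engine itself, a notebook or model-serving endpoint can ask "what is the current state for this key?" without routing the data through a separate RDBMS first.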

This combination of functionality, along with features like the interactive SQL parser, the growing library of sources and sinks, support for complex schemas, and more, makes the Eventador Platform a powerful way to put streaming data directly into the hands of data scientists.

As always, if you want to take the Eventador Platform out for a test drive, you can sign up for a free trial. Or if you have questions, don't hesitate to reach out!
