Introducing SQLStreamBuilder

SQLStreamBuilder lets you declare stateful stream processors using SQL. It is massively scalable, fault tolerant, and production grade. Building streaming jobs in SQL brings a new level of simplicity and power, and makes creating and managing complete stream processing topologies quick and easy.

Background

Streaming data is eating the world, and for good reason. Data presented to users in real time makes for more compelling applications and more competitive businesses. Eventador tackled the problem of making streaming infrastructure simple to build out and manage when we introduced Fully Managed Apache Kafka and Fully Managed Apache Flink.

But as streaming use cases become more ubiquitous, and more endpoints are exposed to various departmental and product consumers, building and managing the stream processing pipelines themselves has become a challenge. Today, the state of the art is to build processors in Java, Scala, or other JVM-based languages. Teams need skilled backend engineers to write and manage jobs using powerful but complex APIs, and that works well for many use cases. But as the world adopts stream processing, a more approachable way to specify processors will speed widespread adoption.

Structured Query Language (SQL) has long been the bastion of declarative languages: you simply write SQL to specify the data you want to see. You can filter, aggregate, and apply functions in a single statement using a mature and extremely feature-rich language. One major difference between querying a traditional SQL database and querying boundless streams of data is that a streaming job runs its SQL statement against the stream continuously; it never completes. Because of this, a key design decision when building SQLStreamBuilder was to make every SQL statement a persistent job.
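
To make that concrete, here is a minimal sketch of a continuous query (the clickstream virtual table and its columns are illustrative, not part of SSB):

    -- Illustrative continuous query: rather than returning once and
    -- completing, it keeps running, updating per-user counts as new
    -- events arrive on the stream.
    SELECT user_id, COUNT(*) AS clicks
    FROM clickstream
    GROUP BY user_id;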

Today, we are launching SQLStreamBuilder (SSB): an interface to declare stream processing jobs using SQL. SQLStreamBuilder provides a powerful interface for writing SQL as well as managing SQL jobs. It allows you to specify input streams of data to process (sources) and output streams for the processed data (sinks). SSB is massively scalable and fault tolerant. It provides the ability to stop SQL jobs and restart them from a savepoint, and can quickly recover from failures. Users can interactively write SQL in a powerful editor, yet have the statement seamlessly run as a persistent, production-grade job.

Lastly, SSB is designed to connect a rich ecosystem of sources and sinks, not just Apache Kafka. You can build processors that join Kafka with cloud services like Amazon RDS, Kinesis, and MSK, letting you create and manage very rich, production-grade processing topologies.

Features

SSB has an interactive console for authoring and managing SQL jobs, and it communicates with the SQLStreamBuilder engine to parse, launch, and manipulate jobs.

The SQLStreamBuilder engine is written in Java using the Apache Flink API. SSB runs as Apache Flink jobs on Eventador Fully Managed Apache Flink, and users don’t need to worry about any of the underlying Flink components. A number of services make up the SSB backend and frontend.

Interactive Console

SSB provides a rich editing interface within the Eventador Console. It's designed to make it easy to author, validate, and launch streaming SQL jobs and includes a number of unique features:

  • Editor interface with context-sensitive menu
  • SQL history
  • VIM mode
  • Theming and color coding for SQL
  • Validate SQL before launching jobs
  • SQL Job management with stop/start/relaunch
  • Savepoint management – stop/restart from same place in stream
  • Apache Calcite standard SQL syntax
  • SQL comment support

SQL Parser

SSB has a built-in Apache Calcite SQL parser. As you execute SQL statements, the parser gives immediate feedback on syntax and validity. If your syntax isn't valid yet, the job doesn't launch; instead, SSB guides you with helpful feedback. This lets you author SQL interactively, iterating on a statement until it produces the result you want.

For example, running SQL along the following lines (the statement is illustrative) against a virtual table name that hasn't been registered as a data source
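
    -- Illustrative only: no virtual table named "payment" has been
    -- registered, so the statement fails validation instead of launching.
    SELECT * FROM payment WHERE amount > 100;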

produces the message:

5/28/2019, 5:43:50 PM – 25 – Job execution failed. No valid tables specified – check data sources and try again.

The user can then change the name of the virtual table and try again (again, illustrative):
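
    -- Corrected (illustrative): "payments" matches a registered virtual
    -- table, so the statement validates and runs as a job.
    SELECT * FROM payments WHERE amount > 100;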

This feature is a game changer: it brings the interactivity users expect from SQL tools to the power of the Flink platform. If you authored the same job in Java, you would not see the error until after the job had been built, deployed, and attempted to run. Interactive validation enables quick trial and error and speeds time to market for streaming jobs.

Sources and Sinks

Like Flink, SSB has the notion of sources and sinks. They are exactly what they sound like: the source of the data to process and the sink for the results. SSB lets you create a virtual table for a source or sink, then specify sources in SQL statements and sinks in the console itself. This means you declare not only the data you want to see but also where it comes from and where it goes, all in one compact interface.

Today we support Kafka-based sources and sinks, whether on the Eventador platform, Amazon MSK, or your own endpoint. You can use SQLStreamBuilder to process streams on just about any Kafka cluster.

But because sources and sinks are Flink compatible, SSB is positioned to support the entire Flink ecosystem of sources and sinks. This opens up unrivaled capabilities for building entire streaming systems that process, aggregate, filter, and route data between these endpoints. For instance, you could create streaming pipelines that (see the sketch after the list):

  • Aggregate a clickstream Kafka topic into minute, hourly, and daily buckets in MySQL
  • Filter IoT sensor data where a timestamp is not null and send it to Elasticsearch/Kibana
  • Write Kinesis sensor data to InfluxDB when moisture values rise above acceptable levels
  • Scrub PII from a production Kafka topic into a QA topic for testing

The combinations and possibilities are endless. Flink has a number of built-in sources and sinks, projects like Apache Bahir maintain a growing list, and you can write your own. Compatibility that opens all of these sources and sinks to SSB is coming soon, so stay tuned.

Schema Management

Schema management is critical when working with streaming data. SSB defines a schema for each source at creation time, then exposes that schema via metadata commands and SQL queries in the editor interface. SSB is compatible with JSON data, and schemas are defined using JSON-schema compatible syntax.

For example, a payments schema might be defined along these lines (the field names here are illustrative):
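
    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "title": "payments",
      "type": "object",
      "properties": {
        "payment_id": { "type": "integer" },
        "amount":     { "type": "number" },
        "currency":   { "type": "string" },
        "event_time": { "type": "string", "format": "date-time" }
      }
    }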

and SQL against that schema can then be written along these lines:
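
    -- Illustrative: filter the payments stream using fields declared
    -- in the schema above.
    SELECT payment_id, amount, currency
    FROM payments
    WHERE amount > 100;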

Job Management

Jobs are a core concept in SQLStreamBuilder: each statement becomes a job. A collection of one or more jobs makes up a streaming topology that serves some business need, process, or feature. Jobs are managed via the SQL Jobs tab, where they can be observed, canceled, and edited.

SQLStreamBuilder Jobs Management

SSB scales easily by design. In fact, it can seamlessly scale from one to hundreds of concurrent processes, all servicing a single SQL job. Each job is logically decomposed into a graph of operators, and the execution of each operator is physically decomposed into multiple parallel operator instances. Clusters can be expanded to serve as many concurrent SQL jobs as required.

Production grade

A unique capability of the SSB engine is that users can iterate on and perfect SQL, perhaps against development data sources, and then, when ready, simply run it against production data sources. There isn't a separate cluster to submit jobs to or a special mode to run the server in; it's just another job. Everything runs in a highly scalable, fault-tolerant manner by default, with nothing to configure or worry about.

One of the key design tenets of SSB was to ensure it was production grade. We have been frustrated by competitive offerings that just didn't meet our standard for what production means. SSB is a robust platform: it uses checkpoints to ensure jobs are restartable, it is fully integrated with Apache Kafka so that restarts pick up from the proper consumer offset, and the state store itself scales and is fault tolerant. We run redundant Flink management components and a generally best-of-breed stack for Flink. All of this is completely transparent to the end user; it just works.

Of course, SQLStreamBuilder comes with the same Eventador 24x7x365 support that our other products have.

Beta availability

SQLStreamBuilder is available on the AWS and GCP platforms. See the table below for specific source/sink compatibility during the Beta period:

Type                           Source  Sink  Availability  Region
AWS MSK                        *       *     AWS           all
Eventador Fully Managed Kafka  *       *     AWS, GCP      all

Next Steps

SSB is currently available in public Beta. If you're interested in trying it out, let us know below.
