There has been a lot of discussion recently around using Continuous SQL with Kafka—and rightfully so. The combination of the two can be a game-changer for your business. It simplifies streaming job deployment. It can massively accelerate streaming application time-to-market. And it can provide the extra push you need to help make your business more competitive by unlocking increased value from your Kafka deployment.
But why specifically is using SQL with Kafka so amazing? Let’s take a look at three of the best reasons:
1. SQL makes reasoning about Kafka-based data simple
It’s very powerful to be able to use SQL to simply ask for Kafka data, filter it, group it, etc. Let’s use an incoming Twitter stream as an example (note: the examples in this post define the schema and data as JSON):
{
  "created_at": "Tue May 14 13:21:32 +0000 2019",
  "source": "Android",
  "reply_count": 204,
  "retweet_count": 1518,
  "favorite_count": 3592,
  "hashtag": "#Moon2024",
  "thetext": "We are going to the Moon – to stay. We will build sustainable infrastructure to support missions to Mars and beyond. This is what we're building. This is what we're training for. We are going.",
  "link": "http://go.nasa.gov/2IjV2KQ"
},
{
  "created_at": "Thu Jun 06 09:14:12 +0000 2019",
  "source": "Android",
  "reply_count": 124,
  "retweet_count": 2021,
  "favorite_count": 433,
  "hashtag": "#DDay",
  "thetext": "Today, we remember the bravery & sacrifice on display during #DDay.",
  "link": "http://go.nasa.gov/3I0t5sy"
},
{
  "created_at": "Wed Jun 05 09:02:00 +0000 2019",
  "source": "iOS",
  "reply_count": 82,
  "retweet_count": 531,
  "favorite_count": 3200,
  "hashtag": "#Moon2024",
  "thetext": "We're working to send the first woman & next man to the Moon by 2024 on our Artemis mission and more than 78 Michigan companies are helping make it happen",
  "link": "http://go.nasa.gov/4IjV2KQ"
}
Now, you can use SQL to easily show tweets ordered by popularity:
-- order by popularity
SELECT thetext, link
FROM tweets
ORDER BY favorite_count DESC;
Or show the most popular Twitter client:
-- group by source
SELECT source, count(*)
FROM tweets
GROUP BY source;
Or show counts by hashtag and ordered by most recent:
-- count of hashtags, most recent first
SELECT hashtag, count(*)
FROM tweets
GROUP BY hashtag
ORDER BY MAX(created_at) DESC;
Or parse, filter, and aggregate Kafka data in many other ways that might take dozens of lines (or more) of Java code to achieve.
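To make that contrast concrete, here is a minimal, hypothetical sketch of just the "most popular client" query written as imperative code (assuming the tweets have already been consumed from Kafka into plain dictionaries):

```python
from collections import Counter

def client_counts(tweets):
    # imperative equivalent of:
    #   SELECT source, count(*) FROM tweets GROUP BY source
    return Counter(t["source"] for t in tweets)

tweets = [
    {"source": "Android", "hashtag": "#Moon2024"},
    {"source": "Android", "hashtag": "#DDay"},
    {"source": "iOS", "hashtag": "#Moon2024"},
]
print(client_counts(tweets))  # Counter({'Android': 2, 'iOS': 1})
```

And that is just the counting logic. A real Java consumer would also carry deserialization, state, and error handling that the SQL version hides entirely.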
2. SQL window functions and date manipulation on Kafka data are magical
SQL against Kafka-based streams works best when paired with date/time windows. SQL (and especially streaming SQL) uses windowing functions to group results based on various time boundaries. Consider a stream of oil-rig sensor readings:
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846314",
  "sensorName": "DrillSideLoad",
  "sensorType": "Load",
  "sensorValue": "243"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846315",
  "sensorName": "MainPressure",
  "sensorType": "Pressure",
  "sensorValue": "2401"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846316",
  "sensorName": "Valve01",
  "sensorType": "Pressure",
  "sensorValue": "2501"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846316",
  "sensorName": "PrimaryTank",
  "sensorType": "Temp",
  "sensorValue": "78"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846318",
  "sensorName": "DrillSideLoad",
  "sensorType": "Load",
  "sensorValue": "250"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846318",
  "sensorName": "MainPressure",
  "sensorType": "Pressure",
  "sensorValue": "2444"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846320",
  "sensorName": "Valve01",
  "sensorType": "Pressure",
  "sensorValue": "2330"
},
{
  "rigId": "28bd884e85cf32167ed72147db10adc7",
  "ts": "1559846320",
  "sensorName": "PrimaryTank",
  "sensorType": "Temp",
  "sensorValue": "90"
}
Convert the epoch timestamp to a human-readable string:
-- convert epoch seconds to a readable string
SELECT FROM_UNIXTIME(CAST(ts AS BIGINT)) AS thedatetime
FROM rigData;
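For reference, here is the same conversion sketched outside of SQL with only the standard library, using one of the sample timestamps above:

```python
from datetime import datetime, timezone

ts = "1559846314"  # epoch-seconds string from the rigData sample above
readable = datetime.fromtimestamp(int(ts), tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2019-06-06 18:38:34
```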
Now, average the pressure readings over tumbling five-second windows:
-- average pressure over 5 second windows
SELECT TUMBLE_END(ts, interval '5' second) AS ts,
       rigId,
       sensorName,
       AVG(CAST(sensorValue AS DOUBLE)) AS pressure
FROM rigData
WHERE sensorType = 'Pressure'
GROUP BY rigId, sensorName, TUMBLE(ts, interval '5' second);
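Tumbling-window semantics are easy to picture in plain code. This hypothetical helper mirrors the query above: every reading lands in exactly one fixed, non-overlapping five-second bucket, keyed by rig and sensor:

```python
def tumble_avg(events, size_s=5):
    """Average sensorValue per (window_end, rigId, sensorName) over
    tumbling windows of size_s seconds: [k*size_s, (k+1)*size_s)."""
    buckets = {}
    for e in events:
        if e["sensorType"] != "Pressure":
            continue
        # end of the tumbling window this event falls into
        window_end = (int(e["ts"]) // size_s + 1) * size_s
        key = (window_end, e["rigId"], e["sensorName"])
        buckets.setdefault(key, []).append(float(e["sensorValue"]))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

rig = "28bd884e85cf32167ed72147db10adc7"
events = [
    {"rigId": rig, "ts": "1559846315", "sensorName": "MainPressure",
     "sensorType": "Pressure", "sensorValue": "2401"},
    {"rigId": rig, "ts": "1559846318", "sensorName": "MainPressure",
     "sensorType": "Pressure", "sensorValue": "2444"},
]
# both readings fall in the window ending at 1559846320, averaging 2422.5
print(tumble_avg(events))
```

The SQL engine does exactly this bucketing for you, plus all of the state management needed to do it continuously over an unbounded stream.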
And create an alert to only see data when it is too high:
-- find load data > 400 psi in the last 5 minutes
SELECT rigId, sensorName, MAX(CAST(sensorValue AS DOUBLE)) AS MaxLoad
FROM rigData
WHERE sensorType = 'Load'
  AND CAST(sensorValue AS DOUBLE) > 400
  AND ts > CURRENT_TIMESTAMP - interval '5' minute
GROUP BY rigId, sensorName;
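Outside of SQL, this alert is simply a filter. A hypothetical sketch, with the threshold and time window as assumed parameters and event time carried as epoch seconds:

```python
def load_alerts(events, now_ts, threshold=400.0, window_s=300):
    """Return Load readings above threshold within the last window_s seconds."""
    return [
        e for e in events
        if e["sensorType"] == "Load"
        and float(e["sensorValue"]) > threshold
        and int(e["ts"]) > now_ts - window_s
    ]

events = [
    {"ts": "1559846314", "sensorType": "Load",
     "sensorName": "DrillSideLoad", "sensorValue": "243"},
    {"ts": "1559846318", "sensorType": "Load",
     "sensorName": "DrillSideLoad", "sensorValue": "425"},
]
alerts = load_alerts(events, now_ts=1559846320)
print([a["sensorValue"] for a in alerts])  # ['425']
```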
3. SQL makes collapsing out-of-order or partial events in Kafka more powerful
Apache Kafka is a great fit for IoT data, where producers often need to push records to a topic before they are complete.
Take the example of ADS-B data. Attributes arrive as they become known—the systems don’t wait for complete messages, they send what they have when they have it. This paradigm yields incoming messages like the following: a JSON source emitting events over time, keyed by a logical identifier (tail number or ICAO code):
{"ts": "1559840592", "icao": "A264A4", "lat": "", "lon": "", "altitude": "8590", "speed": ""},
{"ts": "1559840593", "icao": "A264A4", "lat": "", "lon": "", "altitude": "", "speed": "540"},
{"ts": "1559840595", "icao": "A264A4", "lat": "159.7774335", "lon": "35.0375525", "altitude": "", "speed": ""}
SQL aggregations make it easy to collapse these messages into one logical record showing the current picture of the flight.
-- find the latest complete picture of
-- tailnumber A264A4
SELECT icao,
       HOP_END(ts, interval '1' minute, interval '15' minute) AS timestamp,
       MAX(lat) AS latitude,
       MAX(lon) AS longitude,
       MAX(altitude) AS altitude,
       MAX(speed) AS speed
FROM adsb_source
WHERE icao = 'A264A4'
GROUP BY icao, HOP(ts, interval '1' minute, interval '15' minute);
The result is a flattened record, grouped by key, over a timestamp range:

icao   | timestamp  | latitude    | longitude  | altitude | speed
-------|------------|-------------|------------|----------|------
A264A4 | 1559840595 | 159.7774335 | 35.0375525 | 8590     | 540
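For intuition, the collapse can also be sketched imperatively: a hypothetical merge that keeps the latest non-empty value per field for each ICAO key, roughly what the MAX aggregations achieve over the window:

```python
def collapse(events):
    """Merge partial records per icao, keeping the latest non-empty value."""
    merged = {}
    for e in sorted(events, key=lambda e: int(e["ts"])):
        rec = merged.setdefault(e["icao"], {})
        for field, value in e.items():
            if value != "":  # skip attributes not yet known at this ts
                rec[field] = value
    return merged

events = [
    {"ts": "1559840592", "icao": "A264A4", "lat": "", "lon": "",
     "altitude": "8590", "speed": ""},
    {"ts": "1559840593", "icao": "A264A4", "lat": "", "lon": "",
     "altitude": "", "speed": "540"},
    {"ts": "1559840595", "icao": "A264A4", "lat": "159.7774335",
     "lon": "35.0375525", "altitude": "", "speed": ""},
]
flight = collapse(events)["A264A4"]
print(flight["altitude"], flight["speed"], flight["lat"])  # 8590 540 159.7774335
```

Again, the streaming SQL version gives you this continuously, windowed, and without hand-written state handling.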
Streaming SQL Amplifies the Benefits of Robust Kafka-based Data Systems
All of these reasons, and more, are why we developed SQLStreamBuilder to help you easily use SQL with your Kafka data. That ability lets you be more competitive, and often more productive, by simplifying your streaming data pipeline and accelerating streaming application time-to-market. Keep an eye out here as we continue to post more reasons why using Continuous SQL with Apache Kafka is such a powerful combination.
Want to get started today? Sign up for a free trial of SQLStreamBuilder and see how easy using SQL with Kafka can be. Want to learn more before you try? Check out our expert guide that dives into more detail on why combining SQL and Kafka is amazing.