Top 3 Reasons Why Kafka + SQL is an Amazing Combination

December 3, 2019 in Apache Kafka and SQL



There has been a lot of discussion recently around using SQL with Kafka—and rightfully so. The combination of the two can be a game-changer for your business. It simplifies streaming job deployment. It can massively accelerate streaming application time-to-market. And it can provide the extra push you need to help make your business more competitive by unlocking increased value from your Kafka deployment. 

But why specifically is using SQL with Kafka so amazing? Let’s take a look at three of the best reasons:

1. SQL makes reasoning about Kafka-based data simple

It’s very powerful to be able to use SQL to simply query Kafka data, filter it, group it, and so on. Let’s use an incoming Twitter stream as an example (note: the examples in this post define the schema and data as JSON):

[
  {
    "created_at": "Tue May 14 13:21:32 +0000 2019",
    "source": "Android",
    "reply_count": 204,
    "retweet_count": 1518,
    "favorite_count": 3592,
    "hashtag": "#Moon2024",
    "thetext": "We are going to the Moon — to stay. We will build sustainable infrastructure to support missions to Mars and beyond. This is what we're building. This is what we're training for. We are going.",
    "link": "http://go.nasa.gov/2IjV2KQ"
  },
  {
    "created_at": "Thu Jun 06 09:14:12 +0000 2019",
    "source": "Android",
    "reply_count": 124,
    "retweet_count": 2021,
    "favorite_count": 433,
    "hashtag": "#DDay",
    "thetext": "Today, we remember the bravery & sacrifice on display during #DDay.",
    "link": "http://go.nasa.gov/3I0t5sy"
  },
  {
    "created_at": "Wed Jun 05 09:02:00 +0000 2019",
    "source": "iOS",
    "reply_count": 82,
    "retweet_count": 531,
    "favorite_count": 3200,
    "hashtag": "#Moon2024",
    "thetext": "We're working to send the first woman & next man to the Moon by 2024 on our Artemis mission and more than 78 Michigan companies are helping make it happen",
    "link": "http://go.nasa.gov/4IjV2KQ"
  }
]

Now, you can use SQL to easily show tweets ordered by popularity:

-- order by popularity
SELECT thetext, link
FROM tweets
ORDER BY favorite_count DESC;

Or show the most popular Twitter client:

-- group by source, most popular first
SELECT source, count(*)
FROM tweets
GROUP BY source
ORDER BY count(*) DESC;

Or show counts by hashtag, ordered by most recent activity:

-- count of hashtags, most recent first
SELECT hashtag, count(*)
FROM tweets
GROUP BY hashtag
ORDER BY MAX(created_at) DESC;

And there are many other ways to parse, filter, and aggregate Kafka data in a few lines of SQL that might take dozens of lines (or more) of Java code to achieve.
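For example, filtering is just a WHERE clause. Here’s a minimal sketch against the same tweets stream (the 1,000-retweet threshold is an illustrative value):

-- filter to popular tweets for one hashtag
SELECT created_at, thetext, link
FROM tweets
WHERE hashtag = '#Moon2024'
AND retweet_count > 1000;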

2. SQL window functions and date manipulation on Kafka data are magical

SQL against Kafka-based streams works best when paired with date/time windows. SQL (and especially streaming SQL) uses windowing functions to group results over various time boundaries. Consider a stream of rig sensor readings:

[
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846314",
    "sensorName": "DrillSideLoad",
    "sensorType": "Load",
    "sensorValue": "243"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846315",
    "sensorName": "MainPressure",
    "sensorType": "Pressure",
    "sensorValue": "2401"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846316",
    "sensorName": "Valve01",
    "sensorType": "Pressure",
    "sensorValue": "2501"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846316",
    "sensorName": "PrimaryTank",
    "sensorType": "Temp",
    "sensorValue": "78"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846318",
    "sensorName": "DrillSideLoad",
    "sensorType": "Load",
    "sensorValue": "250"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846318",
    "sensorName": "MainPressure",
    "sensorType": "Pressure",
    "sensorValue": "2444"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846320",
    "sensorName": "Valve01",
    "sensorType": "Pressure",
    "sensorValue": "2330"
  },
  {
    "rigId": "28bd884e85cf32167ed72147db10adc7",
    "ts": "1559846320",
    "sensorName": "PrimaryTank",
    "sensorType": "Temp",
    "sensorValue": "90"
  }
]

Convert the epoch timestamp to a human-readable string:

-- FROM_UNIXTIME (as in Flink or MySQL) formats epoch seconds as 'yyyy-MM-dd HH:mm:ss'
SELECT FROM_UNIXTIME(CAST(ts AS bigint)) AS thedatetime
FROM rigData;
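If your SQL engine supports it, you can go a step further and derive a true TIMESTAMP value for the window functions below to operate on. Here’s a minimal sketch assuming Flink-style TO_TIMESTAMP and FROM_UNIXTIME built-ins (the post doesn’t name a specific engine, so adapt to yours):

-- hypothetical: derive a real TIMESTAMP column for windowing
-- FROM_UNIXTIME emits 'yyyy-MM-dd HH:mm:ss', the default format TO_TIMESTAMP parses
SELECT TO_TIMESTAMP(FROM_UNIXTIME(CAST(ts AS bigint))) AS event_time,
sensorName,
sensorValue
FROM rigData;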

Now, average the pressure readings over five-second windows:

-- average pressure over a 5-second tumbling window
SELECT TUMBLE_END(ts, interval '5' second) AS ts,
rigId,
sensorName,
AVG(CAST(sensorValue AS double)) AS pressure
FROM rigData
WHERE sensorType = 'Pressure'
GROUP BY rigId, sensorName, TUMBLE(ts, interval '5' second);

And create an alert that fires only when the load is too high:

-- find load data > 400 in the last 5 minutes
SELECT rigId, sensorName, MAX(CAST(sensorValue AS double)) AS MaxLoad
FROM rigData
WHERE sensorType = 'Load'
AND CAST(sensorValue AS double) > 400
AND ts > CURRENT_TIMESTAMP - interval '5' minute
GROUP BY rigId, sensorName;
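On a fully streaming engine, comparing event timestamps against CURRENT_TIMESTAMP can be awkward, so the same alert is often expressed as a windowed aggregate instead. Here’s a sketch in the same TUMBLE style used above (same schema assumptions; the threshold and window size are illustrative):

-- hypothetical windowed variant of the load alert
SELECT rigId, sensorName,
MAX(CAST(sensorValue AS double)) AS MaxLoad
FROM rigData
WHERE sensorType = 'Load'
GROUP BY rigId, sensorName, TUMBLE(ts, interval '5' minute)
HAVING MAX(CAST(sensorValue AS double)) > 400;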

3. SQL makes collapsing out-of-order and partial events in Kafka more powerful

Apache Kafka is a great way to handle IoT data, especially because producers often need to push events into a topic before they are complete.

Take the example of ADS-B data. Attributes arrive at the time they become known; the systems don’t wait for complete messages, they send what they have when they have it. The result is a stream of JSON events over time, each carrying a logical key (the tail number, or ICAO code), that might look like the following:

{"ts": "1559840592", "icao": "A264A4", "lat": "", "lon": "", "altitude": "8590", "speed": ""},
{"ts": "1559840593", "icao": "A264A4", "lat": "", "lon": "", "altitude": "", "speed": "540"},
{"ts": "1559840595", "icao": "A264A4", "lat": "159.7774335", "lon": "35.0375525", "altitude": "", "speed": ""}

SQL aggregations make it easy to collapse these messages into one logical record showing the current picture of this flight:

-- find the latest complete picture of
-- aircraft A264A4 (ICAO address)
SELECT icao,
HOP_END(ts, interval '1' minute, interval '15' minute) AS ts_end,
MAX(lat) AS latitude,
MAX(lon) AS longitude,
MAX(altitude) AS altitude,
MAX(speed) AS speed
FROM adsb_source
WHERE icao = 'A264A4'
GROUP BY icao, HOP(ts, interval '1' minute, interval '15' minute);

The result is a flattened record, grouped by key, over a timestamp range:

icao   | ts_end     | latitude    | longitude  | altitude | speed
A264A4 | 1559840595 | 159.7774335 | 35.0375525 | 8590     | 540

Streaming SQL Amplifies the Benefits of Robust Kafka-based Data Systems

All of these reasons, and more, are why we developed SQLStreamBuilder to help you very easily use SQL with your Kafka data. That ability lets you be more competitive—and often more productive—by simplifying your streaming data pipeline and accelerating streaming application time-to-market. Keep an eye out here as we continue to post more reasons why using streaming SQL with Apache Kafka is such a powerful combination.

Want to get started today? Sign up for a free trial of SQLStreamBuilder and see how easy using SQL with Kafka can be. Want to learn more before you try? Check out our expert guide that dives into more detail on why combining SQL and Kafka is amazing.
