All your streaming data are belong to Kafka
Apache Kafka continues its ascent as attention shifts from lumbering Hadoop and data lakes to real-time streams
Apache Kafka is on a roll. Last year it registered a 260 percent jump in developer popularity, as RedMonk’s Fintan Ryan highlights, a number that has only ballooned since then as IoT and other enterprise demands for real-time, streaming data become common. Kafka was hatched at LinkedIn, and its founding engineering team spun out to form Confluent, which has been a primary developer of the Apache project ever since.
But not the only one. Indeed, given the rising importance of Kafka, more companies than ever are committing code, including Eventador, started by Kenny Gorman and Erik Beebe, both co-founders of ObjectRocket (acquired by Rackspace). Whereas ObjectRocket provides the MongoDB database as a service, Eventador offers a fully managed Kafka service, further lowering the barriers to streaming data.
Talking with the Eventador co-founders, it became clear that streaming data is different, requiring “fresh eyes” because “data being mutated in real time enables new use cases and new possibilities.” Once an enterprise comes to depend on streaming data, it’s hard to go back. Getting to that point is the key.
Kafka vs. Hadoop
As popular as Apache Hadoop has been, the Hadoop workflow is simply too slow for the evolving needs of modern enterprises. Indeed, as Gorman tells it, “Businesses are realizing that the value of data increases as it becomes more real-time.” Companies that prefer to wait on adding a real-time data flow to their products and services risk the very real likelihood that their competitors will not be content to sit on their batchy laurels.
This trend is driving the adoption of technologies that can reliably and scalably deliver and process data as near real-time as possible. New frameworks dedicated to this architecture needed to exist. Hence, Apache Kafka was born.
What about Apache Spark? Well, as Gorman points out, Spark is capable of real-time processing, but isn’t optimally suited to it. The Spark streaming frameworks are still micro-batch by design.
This leaves Kafka, which “can offer a true exactly once, one-at-a-time processing solution for both the transport and the processing framework,” Gorman explains. Beyond that, additional components like Apache Flink, Beam, and others extend the functionality of these real-time pipelines to allow for easy mutation, aggregation, filtering, and more. All the things that make a mature, end-to-end, real-time data processing system.
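The difference between micro-batch and one-at-a-time processing can be illustrated with a minimal Python sketch. This is not Spark or Kafka code: the function names are hypothetical, and a fixed batch size stands in for the batch interval a micro-batch engine would use.

```python
from typing import Callable, Iterable, List


def process_one_at_a_time(events: Iterable[int],
                          handle: Callable[[int], None]) -> None:
    """Hand each event to the handler as soon as it arrives."""
    for event in events:
        handle(event)


def process_micro_batch(events: Iterable[int],
                        handle_batch: Callable[[List[int]], None],
                        batch_size: int) -> None:
    """Buffer events and hand them over only once a batch fills up."""
    buffer: List[int] = []
    for event in events:
        buffer.append(event)
        if len(buffer) >= batch_size:
            handle_batch(buffer)
            buffer = []  # start a fresh buffer for the next batch
    if buffer:
        handle_batch(buffer)  # flush whatever is left at the end


per_event_calls: List[int] = []
batch_calls: List[List[int]] = []

process_one_at_a_time(range(5), per_event_calls.append)
process_micro_batch(range(5), batch_calls.append, batch_size=2)
```

In the micro-batch version, the first event is not seen by the handler until the second one has arrived, which is the added latency Gorman is pointing at.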
Kafka’s pub-sub model
This wouldn’t matter if Kafka were a beast to learn and implement, but it’s not (on either count). As Gorman highlights, “The beauty of Apache Kafka is it exposes a powerful API yet has very simple semantics. It is all very approachable.” Not only that, but its API has been implemented in many different programming languages, so the odds are good that your favorite language has a driver available.
Kafka has the notion of a topic, which is simply a namespace for a stream of data. It’s very simple to publish data to a topic, and Kafka handles the routing, scalability, durability, availability, etc. Multiple consumers coordinate subscription to these topics, to fetch data and process or route it. Asked about how this translates into the application development experience, Gorman stressed that it’s not trivial but it’s straightforward: “Building applications that work with Kafka is fairly easy [as] the client libraries handle much of the nuances of the communication, and developers utilize the API to publish or subscribe to streams of data.”
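To make the topic idea concrete, here is a toy in-memory model in Python. It is a sketch under loud assumptions: the Topic class, its methods, and the group names are all hypothetical and are not the Kafka client API. It only shows a named, append-only stream that publishers write to and that groups of consumers read from at their own pace.

```python
from collections import defaultdict
from typing import Dict, List


class Topic:
    """Toy stand-in for a topic: a named, append-only log of messages."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []
        # Each consumer group tracks its own read position (offset).
        self.offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Return unread messages for this group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


clicks = Topic("clicks")
clicks.publish("page=/home")
clicks.publish("page=/pricing")

first_read = clicks.poll("analytics")
second_read = clicks.poll("analytics")  # nothing new since the last poll
other_group = clicks.poll("archiver")   # separate offset, sees everything
```

A real deployment adds partitioning, replication, and durable storage on top of this picture; the sketch leaves all of that out on purpose.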
The problem, if any, isn’t the technology. Rather, it’s a question of paradigms.
The real trick for developers, Gorman tells me, is “to think about using streaming data with a fresh pair of eyes.” Why? Because “data being mutated in real time enables new use cases and new possibilities.”
Let’s look at a tangible example. Perhaps a client publishes data about ridership of a ride-sharing service. One set of consumers analyzes this stream, running machine learning algorithms for dynamic pricing, while another set of consumers reads the same data to provide the location and availability of cars to customers’ mobile devices. Yet another consumer feeds an aggregation framework that sends ridership data to internal dashboards. Kafka is at the core of a data architecture that can feed all kinds of business needs, all in real time.
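The ride-sharing scenario can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the event fields, the three consumer functions, and the crude surge rule standing in for a pricing model. A plain list plays the role of the stream, so no Kafka client is involved; the point is only that one published stream serves three independent readers.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical ride events; the field names are invented for this sketch.
ride_events: List[Dict[str, str]] = [
    {"car": "A", "zone": "downtown", "status": "picked_up"},
    {"car": "B", "zone": "downtown", "status": "picked_up"},
    {"car": "C", "zone": "airport", "status": "available"},
    {"car": "A", "zone": "airport", "status": "available"},
]


def pricing_consumer(events: List[Dict[str, str]]) -> Dict[str, float]:
    """Toy surge rule in place of a real model: +0.5x per pickup in a zone."""
    pickups = Counter(e["zone"] for e in events if e["status"] == "picked_up")
    return {zone: 1.0 + 0.5 * count for zone, count in pickups.items()}


def availability_consumer(events: List[Dict[str, str]]) -> Dict[str, str]:
    """Keep each car's latest event and report the zones of available cars."""
    latest: Dict[str, Dict[str, str]] = {}
    for event in events:
        latest[event["car"]] = event
    return {car: e["zone"] for car, e in latest.items()
            if e["status"] == "available"}


def dashboard_consumer(events: List[Dict[str, str]]) -> Dict[str, int]:
    """Aggregate event counts by status for an internal dashboard."""
    return dict(Counter(e["status"] for e in events))


# Each consumer reads the same stream independently of the others.
surge = pricing_consumer(ride_events)
available = availability_consumer(ride_events)
totals = dashboard_consumer(ride_events)
```

None of the consumers knows about the others, so a fourth use of the data could be added later without touching the publisher.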
Kafka in the cloud
This is great for developers and the companies for which they work, but Kafka demand is no guarantee of Eventador’s success, given that it has to compete with Confluent, which has the distinction of having been founded by Kafka’s original creators. What’s more, Confluent, too, has announced a cloud offering that likely will compete with Eventador’s Kafka service.
Gorman is not bothered. As he describes,
“The real difference is that we aren’t limited just to Kafka. We use Kafka where it makes the most sense. We are an end-to-end, enterprise-grade, stream processing framework built on Apache Kafka and Apache Flink. We have connectors for AWS S3, a REST interface, integration with PrestoDB and Jupyter notebooks, as well as connections for popular databases and even other streaming systems like AWS Kinesis. We offer plans from a simple single node to full on-prem enterprise configurations.”
Besides, given the booming demand for real-time data, Gorman believes there is room for many different players. Not only does Eventador complement Kafka with Flink and more, it has taken to heart Rackspace’s mantra for “fanatical customer support,” which starts with a well-built, fully integrated product. Having spent decades doing operations for some of the world’s largest companies, Gorman continues, “We know what it means to run a first class, professional quality, rock solid, as-a-service offering.”
He’s absolutely right that the market is still young. Developers are still working to understand how Kafka can be integrated into their projects. The use cases are expanding every day, driven by this need to compete with data.
Years from now, however, “It will be common to rely on streaming data in your infrastructure,” Gorman points out, “and not just some ancillary workload.” This is the future they’re building for. “Once you start expecting data to be more real-time, it’s hard to stop.” Eventador, Confluent, and undoubtedly others are building for this real-time, streaming data future. For some, that future is now. For others, these startups hope to get them there sooner.