The reason Apache Kafka is so exciting

Apache Kafka was built by a team of engineers at LinkedIn circa 2010. They were trying to solve the problem of how to pipeline data between the various components of their microservices-based architecture. They decided they needed a single pub/sub messaging platform designed to move data efficiently. Kafka was born.

Flash forward to 2015. @erikbeebe and I were frustrated watching customers wait on slow queries over ridiculously huge data sets, and we started to ponder how they could architect applications so that the data is filtered in real time.

The light bulb went off: Kafka was the way.

Apache Kafka

The design of Kafka just made sense to us. It’s built as an append-only commit log that persists to disk. For database nerds, this is very similar to the recovery logs used in many database designs, such as the redo log in Oracle and the write-ahead log in PostgreSQL. The architecture and its nuances clicked immediately.

To scale eBay and PayPal during insane growth periods, we would game performance by turning updates into inserts and treating traditional relational databases more like an immutable, append-only queue. These same properties are essentially what make Kafka fast and scalable.
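To make the "updates into inserts" trick concrete, here is a minimal in-memory sketch. The `AppendOnlyTable` class and its method names are hypothetical, invented for illustration; this is not the schema or code used at eBay or PayPal. An update never rewrites a row. It appends a new version, and the current value is simply the latest version for that key.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    seq: int    # position in the append-only history
    key: str
    value: str


class AppendOnlyTable:
    """Hypothetical table where every write is an insert."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def write(self, key: str, value: str) -> Row:
        # An "update" is just another insert at the end of the history.
        row = Row(seq=len(self._rows), key=key, value=value)
        self._rows.append(row)
        return row

    def current(self, key: str) -> Optional[str]:
        # The newest row for a key wins; older versions stay untouched.
        for row in reversed(self._rows):
            if row.key == key:
                return row.value
        return None

    def history(self, key: str) -> List[Row]:
        # Every version ever written for this key, oldest first.
        return [row for row in self._rows if row.key == key]
```

Writes only ever touch the end of the structure, which is the property that carries over to a commit log.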

Just to be clear, messaging backends aren’t a new phenomenon in computer science. The paradigm has existed for a long time. What makes now so special is that it is being adapted to a developing problem set: big data. Specifically, there is a lot of it, in both size and rate.

A design that allows data to be mutated, aggregated, and routed in real time could solve a massive problem set for many customers. Instead of provisioning massive clusters of distributed query workers or fine-tuning traditional B-tree index paths, just deal with the data as it arrives, routing and processing it as needed.
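As a rough sketch of what "deal with the data as it arrives" means, the hypothetical `StreamRouter` below filters, counts, and routes each event exactly once, at arrival time. The class, the event fields (`level`, `service`), and the routing rule are assumptions made up for this example and are not part of Kafka or any stream processing engine.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]


class StreamRouter:
    """Hypothetical handler that touches each event once, on arrival."""

    def __init__(self, keep: Callable[[Event], bool], route_by: str) -> None:
        self.keep = keep              # filter predicate
        self.route_by = route_by      # event field that picks the destination
        self.counts: DefaultDict[str, int] = defaultdict(int)
        self.routes: DefaultDict[str, List[Event]] = defaultdict(list)

    def on_event(self, event: Event) -> bool:
        # Filter first: uninteresting events are dropped immediately.
        if not self.keep(event):
            return False
        destination = str(event[self.route_by])
        # Aggregate incrementally instead of scanning stored data later.
        self.counts[destination] += 1
        # Route the event to its destination bucket.
        self.routes[destination].append(event)
        return True
```

A query such as "how many errors per service?" becomes a lookup in `counts` rather than a scan over everything that was ever stored.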

Design Matters

The design of Kafka is quite nice. The core construct is the topic. A topic is simply a named channel to push data through. You produce data to a topic, and that topic can be consumed by N subscribers.

Each topic is split into partitions, and each partition is an immutable, ordered sequence of data records. This is the commit log I spoke of earlier. Every record in a partition has an offset, and as a consumer reads, it advances its own offset to mark how far it has gotten. Consuming a record does not remove it; data is retained for a configurable period of time.
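The topic, partition, and offset model can be sketched in a few lines. This is a toy in-memory model, not Kafka's implementation or client API: the `Topic` and `Consumer` classes are hypothetical, retention is left out, and `zlib.crc32` is only a stand-in for a key-hashing partitioner.

```python
import zlib
from typing import Dict, List, Tuple


class Topic:
    """Toy topic: a fixed set of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def produce(self, key: bytes, value: bytes) -> Tuple[int, int]:
        # Same key -> same partition, so per-key ordering is preserved.
        partition = zlib.crc32(key) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)                 # records are only ever appended
        return partition, len(log) - 1    # (partition, offset of the new record)

    def fetch(self, partition: int, offset: int, max_records: int = 10) -> List[bytes]:
        # Reading is a slice of the log; nothing is removed.
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Toy consumer that tracks its own position in each partition."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets: Dict[int, int] = {p: 0 for p in range(len(topic.partitions))}

    def poll(self, partition: int) -> List[bytes]:
        records = self.topic.fetch(partition, self.offsets[partition])
        self.offsets[partition] += len(records)   # advance past what was read
        return records
```

Because each consumer keeps its own offsets, two independent subscribers can read the same records without interfering with each other.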

Partitions are replicated across members of the cluster, with one replica acting as the leader and the others as followers. If a leader fails, a follower takes over. Load is balanced across servers: each broker leads some partitions and follows others.
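A simplified sketch of replica placement and failover follows. The helper names `assign_replicas` and `elect_leader` are hypothetical, and the logic is deliberately naive: replicas are staggered round-robin so leadership spreads across brokers, and the new leader is just the first surviving replica. Kafka's real election is coordinated by the cluster and only considers replicas that are caught up with the leader, which this sketch does not model.

```python
from typing import Dict, List, Optional, Set


def assign_replicas(num_partitions: int, brokers: List[int],
                    replication_factor: int) -> Dict[int, List[int]]:
    """Stagger replicas round-robin; the first replica is the preferred leader."""
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def elect_leader(replicas: List[int], live_brokers: Set[int]) -> Optional[int]:
    """Pick the first replica that is still alive, or None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None
```

With three brokers, three partitions, and a replication factor of two, each broker leads one partition and follows another. When a broker is lost, the partition it led moves to its follower and the rest are unaffected.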

This design lends itself nicely to building real time event streams. It inherently has:

  • Performance (an append-only commit log is fast)
  • Recoverability (topic partitions are replicated and durable)
  • Scalability (you can add brokers and move some topic partitions to them)

Kafka allows consumers to subscribe to topics and read records, advancing their offsets as they wish. The design facilitates having a group of consumers share the work of a topic, though any consumers in a group beyond the total number of partitions sit idle. Brokers can be scaled independently of producers and consumers, and the guarantees remain. ZooKeeper is used to keep track of the metadata for the cluster. This all works quite well.
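The limit on useful consumers per topic falls out of how partitions are handed out: each partition goes to exactly one consumer in a group. The sketch below uses a plain round-robin rule in a hypothetical `assign_partitions` helper; it illustrates the constraint and is not a reproduction of Kafka's own assignment strategies.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int],
                      consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer, round-robin."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With three partitions and two consumers, one consumer takes two partitions and the other takes one. With two partitions and three consumers, the third consumer gets nothing to read, which is why adding consumers past the partition count does not add throughput.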

The exciting bits

All of this is fairly academic if you can’t do something useful with the data. This is the part that is so exciting.

The core design of Kafka enables analysis in real time for use cases like:

  • Sensor data capture/routing
  • Immediate log analysis
  • Just-in-time operational metrics
  • Micro-services communication
  • In-application metrics
  • Manufacturing yields and feedback
  • Transportation efficiency

Combine these use cases with machine learning and artificial intelligence, and we feel like we are on the cusp of something amazing.

Stream processing engines like Apache Flink, Storm, Samza, and Kafka Streams allow for filtering, aggregation, mutation, and experimentation (don’t mistake batch mechanisms like Spark for true streaming mechanisms). We currently use PipelineDB, a real-time SQL engine based on PostgreSQL, for processing stream data. We chose PipelineDB because it allows customers to use any SQL-compliant tooling to easily consume the real-time data with simple SQL statements.

That said, there is still work to do. For instance, Kafka Streams is still in early development, monitoring and management can still be tricky, drivers have some divergence, and implementing a processing engine can be daunting. Of course, none of this is any good if it’s not fast, so performance and scalability matter a lot.

We don’t know what the future will bring exactly, but we have a roadmap that we couldn’t be more passionate about. We do know building a service on top of Kafka that facilitates real-time analysis and applications couldn’t be more exciting.
