Please pledge your support if you find this newsletter useful. I’m not planning to introduce paid-only posts anytime soon, but I’d appreciate some support from the readers. Thank you!
I started using the term “Streaming-First Architecture” a while ago, but recently many people have started asking what exactly I mean by it. I’d love to explain!
There are several building blocks, and most of these shouldn’t be surprising:
Kafka-compatible streaming platform (e.g. Apache Kafka, Confluent, Redpanda, WarpStream).
Stream-processing frameworks (e.g. Apache Flink, Kafka Streams, Materialize).
Connectors (e.g. Kafka Connect, Flink CDC).
Various databases and datastores that work well with real-time data (e.g. OLAP engines, LakeHouses).
However, the key principle is treating streaming data as a source of truth. Historically, streaming has been used for data transit: ingesting and delivering data to a data lake/warehouse or a database. Topics in a streaming platform like Kafka could have a few hours or a few days of retention. When that time is over, the data gets deleted (and if you couldn’t persist it for any reason, then it’s… gone).
In the Streaming-First Architecture, we treat streaming data as a source of truth, which practically means:
Modelling topics as changelogs (you can read more about this here). Ideally, enabling compaction.
This allows us to treat Kafka a bit more like a database: we can use messages not just as inserts, but also as updates and deletes. This makes it possible to correct or backfill data simply by emitting new messages (see the sketch after this list).
Setting infinite retention for the topics. This usually requires using some form of Tiered Storage.
All of these together allow us to keep the data in the streaming platform forever.
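Here’s a minimal sketch of what this setup can look like with the plain Kafka clients. The topic name, key, and JSON payloads are made up for illustration; the important parts are `cleanup.policy=compact`, disabled time-based retention, and the convention that re-sending a key means an update while a null value is a tombstone (delete):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ChangelogTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address

        // 1. Create a changelog-style topic: compacted, with no time-based deletion.
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("customers.changelog", 6, (short) 3)
                    .configs(Map.of(
                            "cleanup.policy", "compact", // keep the latest value per key
                            "retention.ms", "-1"         // never delete data based on time
                    ));
            admin.createTopics(List.of(topic)).all().get();
        }

        // 2. Produce keyed messages: the same key acts as an update,
        //    a null value is a tombstone that eventually removes the key after compaction.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("customers.changelog", "customer-42",
                    "{\"name\":\"Alice\",\"tier\":\"gold\"}"));       // insert
            producer.send(new ProducerRecord<>("customers.changelog", "customer-42",
                    "{\"name\":\"Alice\",\"tier\":\"platinum\"}"));   // update / correction
            producer.send(new ProducerRecord<>("customers.changelog", "customer-42", null)); // delete
            producer.flush();
        }
    }
}
```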
What are the benefits?
I don’t think I need to tell you about the benefits of streaming data (otherwise, this newsletter has failed 😀).
However, it’s not very common to keep data in the streaming platform forever, so let me expand here. Doing so was prohibitively expensive before the introduction of Tiered Storage. And it’s not just the cost. The operational complexity of archiving data from Kafka to long-term storage (first HDFS, then object storage, and now LakeHouses) is still fairly high. Sure, it might be fine for a handful of topics, but doing it at scale with schema evolution and versioning support is still non-trivial. And when using historical data, you either need to support a hybrid source pattern (reading historical data from the archive first, then switching to the “edge” in Kafka) or re-hydrate topics from the historical data (which inevitably messes up the ordering).
In the Streaming-First Architecture, you just set the consumer to read from the earliest offset and… that’s it. It may seem like a small detail, but it makes a huge difference when it comes to operational complexity. The Kafka API also allows you to initialize consumers from a given timestamp, which is very handy in this case.
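For example, with a plain Kafka consumer this is mostly a matter of configuration. A minimal sketch (broker address, group id, topic name, and timestamp below are illustrative):

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class BackfillConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // illustrative address
        props.put("group.id", "backfill-job");             // illustrative group id
        props.put("auto.offset.reset", "earliest");        // no archive needed: start from the beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Optionally, start from a specific point in time instead of the earliest offset.
            List<TopicPartition> partitions = consumer.partitionsFor("customers.changelog").stream()
                    .map(info -> new TopicPartition(info.topic(), info.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            long since = Instant.parse("2024-01-01T00:00:00Z").toEpochMilli();
            Map<TopicPartition, Long> timestamps = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> since));

            consumer.offsetsForTimes(timestamps).forEach((tp, offsetAndTimestamp) -> {
                if (offsetAndTimestamp != null) {
                    consumer.seek(tp, offsetAndTimestamp.offset());
                }
            });

            // From here on, it's a regular poll loop over the full history of the topic.
            consumer.poll(Duration.ofSeconds(1)).forEach(record ->
                    System.out.printf("%s -> %s%n", record.key(), record.value()));
        }
    }
}
```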
This approach also allows you to keep a lot of heavy transformation logic in the stream-processing layer. You can push the most demanding joins and other stateful computations to your database of choice, but many stateless computations, as well as aggregations, are straightforward to implement in the streaming world.
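As a rough illustration, here’s what a stateless filter plus a simple aggregation could look like in Kafka Streams. The topic names are hypothetical, and I’m assuming the orders topic is keyed by customer id:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class OrderAggregationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-aggregation"); // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative address

        StreamsBuilder builder = new StreamsBuilder();

        // Stateless transformation: drop malformed events and keep only paid orders.
        KStream<String, String> paidOrders = builder
                .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .filter((customerId, json) -> json != null && json.contains("\"status\":\"PAID\""));

        // Simple aggregation: count paid orders per customer key.
        paidOrders.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                  .count(Materialized.as("paid-orders-per-customer"))
                  .toStream()
                  .to("paid-order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```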
Something is missing
When you only have Kafka topics and consumers / stream-processors as building blocks, you quickly realize that you can’t perform many computations on historical data very efficiently. Kafka doesn’t support any kind of lookup queries, so very often you’ll have to scan through the whole topic to find a handful of records, and scan throughput is capped by the number of partitions. This may seem like a terrible and wasteful approach. However, it can work sometimes: consumers in Kafka are fairly cheap; you just need to get the networking right. Waiting extra time for backfills may also be justified.
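To make the trade-off concrete, this is roughly what such a “lookup” degenerates into: a full scan with client-side filtering, parallelized at most by the number of partitions (all names below are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TopicScanLookupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("group.id", "lookup-scan");             // illustrative group id
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        String wantedKey = "customer-42"; // the handful of records we're actually after

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customers.changelog"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                if (batch.isEmpty()) break; // naive stop condition, fine for a sketch
                batch.forEach(record -> {
                    if (wantedKey.equals(record.key())) {
                        System.out.println("found: " + record.value());
                    }
                });
            }
        }
    }
}
```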
But just like with Tiered Storage, we can expect to see a massive improvement in this area in the next ~6-9 months. Redpanda announced plans to introduce Iceberg support for Tiered Storage a while ago, Confluent is working on Tableflow, and WarpStream seems to be working on a query engine as well.
A query engine on top of the streaming platform is a game changer. We don’t need to archive our streaming data to a LakeHouse anymore; the streaming platform has become a LakeHouse!
This solves lookup queries (as long as the query engines support predicate pushdown) and improves backfill throughput.
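To give a flavour of where this is heading (and this part is speculative), a lookup could turn into a plain SQL query against the Iceberg tables backing your topics, for example through a JDBC-capable engine like Trino. Catalog, schema, table, and column names below are entirely hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamTableLookupSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical: topics exposed as Iceberg tables and queried through a SQL engine.
        String url = "jdbc:trino://localhost:8080/iceberg/streaming";

        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             PreparedStatement stmt = conn.prepareStatement(
                     // With predicate pushdown, the engine reads only the relevant data files
                     // instead of scanning the whole topic.
                     "SELECT customer_id, payload FROM customers_changelog WHERE customer_id = ?")) {
            stmt.setString(1, "customer-42");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("payload"));
                }
            }
        }
    }
}
```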
This may also solve the painful eventual consistency problems you face when relying heavily on streaming. Exactly-once is hard. I imagine most of the query engines built on top of Kafka storage will support some form of snapshot isolation.
Of course, some of this is still speculation. But the direction is very clear!
But does it scale?
We’ll see. Personally, I’ve worked with hundreds of terabytes stored in the streaming platform without any issues.
But a lot depends on the implementation of Tiered Storage. I’ve seen some implementations that don’t perform well under high concurrency; they assume historical data will only be accessed infrequently.
I have a lot of faith in the new wave of projects that heavily rely on object storage. Here I wrote about WarpStream, which I see as a clear leader in this area right now.
Isn’t it just Kappa Architecture?
In my opinion, the original Kappa Architecture is more of a vision. It doesn’t answer many of the hard questions that immediately come up when you try to implement it, and it prescribes almost nothing except using Kafka.
Using a different name makes it clear: I’m proposing something quite specific and opinionated.