Approaches for Defining Message Metadata
Metadata can be as important as core datasets.
The concept of a message or an event¹ is central to any data streaming or event processing system. However, we rarely discuss how to design messages.
Today I want to focus on defining message metadata (if you want to learn more, I highly recommend starting with Adam Bellemare’s course on event design).
What is metadata? I define it as data that are useful but not necessarily relevant to the current payload. For example, a message about a successful checkout in an e-commerce store can contain fields with information about line items (product ids, names, quantities), sale information (subtotal, tax, total values), and customer information (customer id, name).
But it can also contain fields like a unique message id, created timestamp, client IP address, etc. So another aspect of metadata is its universal applicability to a wide range of message types - pretty much all message types can benefit from this information.
So, where and how should this metadata be defined?
Payload metadata

This is the most obvious approach - just keep the metadata fields alongside the payload fields, for example:

```yaml
_id: a1b2c3
_created_at: 2024-05-01T12:00:00Z
_client_ip: 203.0.113.7
product_name: How Query Engines Work
quantity: 1
```
In this case, we want to logically separate the metadata fields from the rest of the payload fields, so we prefix them with an underscore (we could also use a prefix like meta).
This approach is very straightforward and easy to start with (but it can lead to problems down the road).
Data is easy to transform - these fields look just like regular payload fields.
When persisting this data into a data store, the metadata will be included by default. This is handy when writing data to a data lake or data warehouse. But the metadata can also contain PII or other values that need special handling (which means additional transformation steps).
Also, metadata fields are mixed with the payload fields at the schema level. This is not always desirable. What if you want to evolve the metadata fields? You can’t really do it separately from the payload fields, and that could mean updating hundreds or thousands of schemas. It also means that enforcing metadata fields at the schema level can be problematic.
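Those extra transformation steps can be as simple as dropping the underscore-prefixed fields before persisting; a minimal sketch (field names are illustrative, not from any real schema):

```python
# A checkout message with underscore-prefixed metadata fields mixed
# into the payload (all names and values are illustrative).
message = {
    "_id": "a1b2c3",
    "_created_at": "2024-05-01T12:00:00Z",
    "_client_ip": "203.0.113.7",  # PII: needs special handling
    "product_name": "How Query Engines Work",
    "quantity": 1,
    "total": 29.99,
}

# Strip the metadata fields (and the PII they may carry)
# before writing the record to a warehouse table.
payload_only = {k: v for k, v in message.items() if not k.startswith("_")}
```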
Message envelope

A message envelope can be used to separate metadata fields from payload fields. Here’s an example:

```yaml
metadata:
  id: a1b2c3
  created_at: 2024-05-01T12:00:00Z
  client_ip: 203.0.113.7
payload:
  product_name: How Query Engines Work
  quantity: 1
```
The biggest advantage of this approach is the flexibility to evolve metadata and payload schemas separately (you need to find a way to version them differently; the next section may help with that).
This approach addresses some concerns from the previous section but raises a new one: “flattening”. Some databases may not support nested fields well, so you may need to “flatten” the message before writing it to the destination (so that the end result may look like the message from the previous section). Still, with this approach, you have more flexibility and control over the message schemas.
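Flattening an envelope is itself a small transformation step; a minimal sketch in Python (field names are illustrative):

```python
def flatten(message: dict, prefix: str = "_") -> dict:
    """Merge envelope metadata into the payload, prefixing metadata
    keys so they don't collide with payload fields."""
    flat = {prefix + k: v for k, v in message["metadata"].items()}
    flat.update(message["payload"])
    return flat

enveloped = {
    "metadata": {"id": "a1b2c3", "created_at": "2024-05-01T12:00:00Z"},
    "payload": {"product_name": "How Query Engines Work", "quantity": 1},
}

# The result looks like the payload-metadata variant:
# {"_id": ..., "_created_at": ..., "product_name": ..., "quantity": ...}
flat = flatten(enveloped)
```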
Message headers

Most streaming and messaging platforms nowadays support the concept of headers. Headers can also be used for storing metadata, but it may be even harder to combine them with the rest of the payload fields to get a nicely “flattened” message before writing it to a database - many processing frameworks still don’t have good support for extracting headers from Kafka records (try writing Flink SQL that references Kafka message headers; it’s possible, I challenge you 😅).
But message headers have a big advantage. In modern data systems, serialization and deserialization usually consume a lot of computational power (I wrote about some struggles with Debezium / Flink CDC here; Deserialization has its own section). And accessing message headers doesn’t require deserializing the message! This makes them ideal for data routing and filtering, because deserializing a whole message just to skip it is very wasteful.
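A sketch of header-based filtering, simulating Kafka records as (headers, raw bytes) pairs; the header name and values are made up for illustration:

```python
import json

# Simulated Kafka records: (headers, serialized value).
# Real Kafka header values are raw bytes, hence b"...".
records = [
    ([("message_type", b"checkout_completed")], json.dumps({"total": 29.99}).encode()),
    ([("message_type", b"page_view")], json.dumps({"url": "/cart"}).encode()),
]

def wanted(headers, message_type=b"checkout_completed"):
    # Routing decision based on headers alone: the value is never touched.
    return any(k == "message_type" and v == message_type for k, v in headers)

# Only the records that pass the header filter are deserialized.
checkouts = [json.loads(value) for headers, value in records if wanted(headers)]
```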
Also, since storage is cheap (kinda), you can use headers in addition to the message envelope or payload metadata - just cherry-pick fields that may be useful.
And I believe that message headers are the best location for schema ids! This is true for both the payload and the envelope schemas - using version headers can be an elegant way to supply both at the same time.
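For illustration, a hypothetical sketch of resolving both schema versions from headers (the registry contents and header names are assumptions):

```python
# Made-up registries mapping schema ids to schema versions.
payload_schemas = {1: "checkout-v1", 2: "checkout-v2"}
envelope_schemas = {1: "envelope-v1"}

# Kafka header values are raw bytes.
headers = [("envelope_schema_id", b"1"), ("payload_schema_id", b"2")]

def header_value(headers, key):
    """Return the first header value for the given key."""
    return next(v for k, v in headers if k == key)

# Both schemas are resolved without deserializing the message body.
envelope_schema = envelope_schemas[int(header_value(headers, "envelope_schema_id").decode())]
payload_schema = payload_schemas[int(header_value(headers, "payload_schema_id").decode())]
```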
Topic names

Another place to find metadata is the Kafka topic name. At Activision/Demonware in ~2018, a typical topic naming convention (note: yes, it violated some of the existing best practices) encoded a lot of useful metadata in the name itself, like the environment name, the producer name and the message category.
Of course, this metadata is applicable to all records within the topic, so you don’t have per-record granularity. But it’s still useful in many situations.
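As a sketch, assuming a hypothetical convention of the form `<environment>.<producer>.<category>.<message_name>` (not the actual Activision/Demonware one), extracting this metadata is a simple split:

```python
# Hypothetical topic name following the assumed convention.
topic = "prod.checkout-service.commands.checkout_completed"

# Everything here is metadata shared by all records in the topic.
environment, producer, category, message_name = topic.split(".")
```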
Schema metadata

Finally, when using schema-based data formats like Avro or Protobuf, it’s possible to encode some metadata (not just field names and types!) in the schema itself. Both Avro and Protobuf have a concept of custom properties.
These also don’t have per-record granularity, so they can only be used to provide metadata for a collection of records that share the same schema. My favourite example here is specifying the columns that belong to the primary key for messages generated by a Change Data Capture process. This information can be very useful downstream, and it’s hard to obtain any other way short of connecting to the database directly.
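For example, a sketch with an Avro schema carrying a custom `pk_columns` property (the property name is an assumption; the Avro spec permits attributes it doesn’t define and preserves them as metadata):

```python
import json

# An Avro record schema with a custom top-level property ("pk_columns")
# naming the primary-key columns of the source table in a CDC pipeline.
schema = json.loads("""
{
  "type": "record",
  "name": "Customer",
  "pk_columns": ["customer_id"],
  "fields": [
    {"name": "customer_id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")

# Downstream consumers can read the property straight from the schema.
pk_columns = schema["pk_columns"]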
¹ I consider an event to be a type of message alongside a command and a document. Since metadata is applicable to all of them, I mostly use message in this post.