Length-Delimited Protobuf Streams
Protobuf is a great encoding format. It’s become wildly popular as an encoding protocol, not only for gRPC messages, but for all kinds of information exchange.
When using Protobuf in a streaming situation — i.e. when writing to a file, or over a socket — the problem of delimiting one message from the next becomes apparent. Encoded Protobuf messages do not mark the end of each message, so it is not be possible to reliably distinguish one message from another when reading the stream on the other end.
While Protobuf has no official guidelines regarding this problem, there is a commonly used solution to the problem called length-delimiting (sometimes called prefixing / framing) messages.
This article will go through some basic properties of Protobuf, why large messages are difficult to work with, the length-delimited (length-prefixed) implementation, and finally some considerations when using this streaming format.
Let’s get started
Protobuf Message Basics
Consider a very simple Protobuf definition of a User:
I put this in a file called simple.proto
and generate the output with
protoc --python_out=. simple.proto
The process for writing a Protobuf message to a file is dead-simple. Create the message, serialize it to binary, and write it to the file. To read the message from file, open and parse (decode) the entire contents of the file.
Multiple Messages in One File
Writing multiple Protobuf messages to a file is no harder than writing one message. Simply repeat the write for each message:
In order to parse one message from the file however, we need to know where the “Bob message” ends so that the encoded message can be read from the file and parsed individually.
There are typically two ways to predictably parse blocks from a stream — either each block is of fixed size, or blocks are delimited in some way.
For this reason, Avro has formed a specification called the Object Container File (OCF) format. In this format, a special sync-marker is placed in the file to delimit messages.
For Protobuf, we’ll have to find our own solution to the problem.
Wrapper Messages & Size Constraints
A quick-n-dirty solution to the lack of built-in message delimiters is to wrap one message in another:
There is however one big drawback to this: in order for Protobuf messages to be reliably decoded, the entire message must be decoded at once, i.e. the entire message must fit in memory. In some contexts, this is not a problem, in others, it may be catastrophic. While the wrapper message grows, the memory requirements can quickly become unwieldy.
The official recommendation is to keep Protobuf messages around 1MB in size. 64MB is the limit from Google, and the hard-limit is 2GB in total. These limits and recommendations are there for a reason. Having a message that is 64MB in size can require several GB of RAM to decode the message.
Length-Delimited Protobuf
Delimited streaming formats are used everywhere. In the case of textual formats, such as CSV, the delimiter is simply a newline character. For JSON, streaming can be achieved by using a specific line-delimited JSON format:
Since Protobuf is a binary format, and due to the encoding specification itself, it is not possible to insert some kind of character or byte sequence that delimits each message.
It is however possible to instruct the reader how far it should read before decoding each message. This is called length-delimiting or length-prefixing, and is similar to how the header of an IP packet contains a header that contains the length of the data payload.
Writing Length-Delimited Messages
The process for writing length-delimited messages is is simple: serialize each message, calculate its length in bytes, and store the length in binary format. In this example I’ve picked a big-endian, unsigned 32-bit integer (>L
). Finally, write both the length and the message to the stream:
The file content layout will look like this:
Reading Length-Delimited Messages
In order to read a message, start by reading the first four bytes. Then parse those bytes to get the message length. Finally, read the length of bytes from the stream before decoding the message:
Voilá, you’ve just written and read files one-by-one from a file. For most cloud providers, it is possible to perform this operation over the wire, i.e. each message can be streamed to and from the remote file in real-time. This is an extremely powerful property. Since we write both the length and message at the same time as well, we do not need to worry about incomplete writes.
Considerations
The advantages of using a streaming format are plenty:
- It is possible to stream individual messages over e.g. a Socket
- New messages can be appended to existing files without parsing previous content.
- It’s possible to count the number of messages in a file without decoding any of the messages
- By reading only the length-prefixes, an index can be created that can be used to split the file into smaller batches without decoding any messages.
However, it’s not all flowers and sunshine in Protobuf streaming land.
Lack of a Standard
This format is design. For example, the choice of a prefix varies between implementations, but is normally either a varint, uint32, or uint64.
In Java and C#, there are two functions that can be used ( parseDelimitedFrom
and writeDelimitedTo
), but this this compatibility issue has been open since 2018.
In 2015, a PR was opened that aimed to merge the Java implementation into the core C++ library, but it was ultimately closed in 2017.
For Go, there is a delimited reader /writer implementation found in the Gogo Protobuf package called Uint32DelimitedReader
and Uint32DelimitedWriter
. I’ve used this implementation quite a bit and it’s worked well with the Python implementation this article.
In the Rust protobuf package, there is a delimited reader / writer, but it has been marked as deprecated. Possibly for the same reasons the C++ PR was closed.
I wrote a small library for Python available on Github https://github.com/sebnyberg/ldproto-py that can be used as a reference for an implementation. The package is also available on Pip as ldproto
in case someone wants to try it out.
All in all, the format has been a contentious issue throughout the years, but luckily it is not all that hard to just implement it yourself.
If you made it this far, thank you for reading this article, and good luck on your Protobuf streaming adventures!