Length-Delimited Protobuf Streams

Protobuf Message Basics

Consider a very simple Protobuf definition of a User:

A simple user definition in Protobuf.

Generate the Python classes from the definition with protoc:

protoc --python_out=. simple.proto

Example of writing and reading a single Protobuf message.

Multiple Messages in One File

Writing multiple Protobuf messages to a file is no harder than writing one message. Simply repeat the write for each message:

Writing two Protobuf messages to one file

Wrapper Messages & Size Constraints

A quick-n-dirty solution to the lack of built-in message delimiters is to wrap one message in another:

Embedding a list of Protobuf messages using a repeated field.

Length-Delimited Protobuf

Delimited streaming formats are used everywhere. In the case of textual formats, such as CSV, the delimiter is simply a newline character. For JSON, streaming can be achieved by using a specific line-delimited JSON format:

Line-delimited JSON (actually invalid JSON)
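
To make the line-delimited idea concrete, here is a minimal Python sketch. The record fields (name, id) are made up for illustration and are not taken from the User definition above. Each record is written as one JSON document per line, and the reader decodes the stream one line at a time.

```python
import io
import json

# Hypothetical records; the field names are assumptions for this sketch.
records = [{"name": "Alice", "id": 1}, {"name": "Bob", "id": 2}]

# Write one JSON document per line, using the newline as the delimiter.
stream = io.StringIO()
for record in records:
    stream.write(json.dumps(record) + "\n")

# Read the stream back one line (and therefore one record) at a time.
stream.seek(0)
decoded = [json.loads(line) for line in stream if line.strip()]
```

Note that the stream as a whole is not a valid JSON document: passing the full contents to json.loads fails, which is why each line has to be parsed separately.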

Writing Length-Delimited Messages

The process for writing length-delimited messages is simple: serialize each message, calculate its length in bytes, and store the length in binary format. In this example I’ve picked a big-endian, unsigned 32-bit integer (>L in Python's struct format syntax). Finally, write both the length and the message to the stream:

Write two Protobuf Users to the same file, prefixing each message with its length.
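
The write steps can be sketched roughly as follows. To keep the example runnable without generated Protobuf classes, plain byte strings stand in for serialized messages (in real code they would come from something like user.SerializeToString()), and the helper name write_delimited is an assumption of this sketch.

```python
import io
import struct


def write_delimited(stream, payload: bytes) -> None:
    """Write one message as a 4-byte big-endian length followed by its bytes."""
    stream.write(struct.pack(">L", len(payload)))
    stream.write(payload)


# Stand-ins for serialized Protobuf messages (assumed, not real User messages).
payloads = [b"first-user", b"second-user"]

# An in-memory buffer is used here; a file opened with mode "wb" works the same way.
buffer = io.BytesIO()
for payload in payloads:
    write_delimited(buffer, payload)

encoded = buffer.getvalue()
```

A fixed 4-byte prefix caps a single message at 2**32 - 1 bytes, which is well above the size at which Protobuf messages remain practical.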

Reading Length-Delimited Messages

To read a message, start by reading the first four bytes. Then parse those bytes to get the message length. Finally, read that many bytes from the stream and decode them as a message:

Read the two users from the file.
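
A minimal sketch of the read loop, under the same assumptions as the prose: a 4-byte big-endian prefix, with plain byte strings standing in for serialized messages. The generator name read_delimited is hypothetical. In real code each yielded payload would be handed to the generated class, for example via ParseFromString.

```python
import io
import struct

PREFIX_SIZE = 4  # a ">L" prefix is always four bytes


def read_delimited(stream):
    """Yield the raw bytes of each length-prefixed message in the stream."""
    while True:
        prefix = stream.read(PREFIX_SIZE)
        if not prefix:
            return  # clean end of stream
        if len(prefix) < PREFIX_SIZE:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">L", prefix)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated message body")
        yield payload


# Build a small sample stream so the example is self-contained.
sample = b"".join(
    struct.pack(">L", len(p)) + p for p in (b"first-user", b"second-user")
)
messages = list(read_delimited(io.BytesIO(sample)))
```

Checking for short reads matters in practice: a file that was cut off mid-write would otherwise produce a silently truncated final message.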

Considerations

The advantages of using a streaming format are plenty:

  • New messages can be appended to existing files without parsing previous content.
  • It’s possible to count the number of messages in a file without decoding any of the messages.
  • By reading only the length-prefixes, an index can be created that can be used to split the file into smaller batches without decoding any messages.
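
As an illustration of the last two points, the sketch below walks a stream by reading only the 4-byte prefixes and seeking past each body. It assumes the big-endian uint32 prefix used earlier; the function name build_index is made up for this example.

```python
import io
import struct


def build_index(stream):
    """Return (offset, length) for every message body, reading only the prefixes."""
    index = []
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            break  # end of stream (or a truncated prefix)
        (length,) = struct.unpack(">L", prefix)
        index.append((stream.tell(), length))
        stream.seek(length, io.SEEK_CUR)  # skip the body without decoding it
    return index


# Three stand-in messages of different sizes.
sample = b"".join(struct.pack(">L", len(p)) + p for p in (b"aa", b"bbbb", b"c"))
index = build_index(io.BytesIO(sample))
message_count = len(index)
```

The resulting offsets can then be used to hand out byte ranges to separate workers, each of which decodes only its own slice of the file.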

Lack of a Standard

This format is a convention rather than a formal standard. For example, the choice of length prefix varies between implementations, but is normally a varint, uint32, or uint64.
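
For comparison with the fixed uint32 prefix, here is a rough sketch of a varint prefix using the base-128 encoding that Protobuf itself uses for integers. The helper names are hypothetical, and this is an illustration of the encoding rather than any particular library's implementation.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("varint length prefixes must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# A 300-byte message needs a two-byte varint prefix instead of four fixed bytes.
prefix = encode_varint(300)
length, next_pos = decode_varint(prefix)
```

The trade-off is that a varint saves space for small messages, while a fixed-width prefix is simpler to read because the reader always knows how many bytes to fetch first.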
