After the last post about text and binary protocols, Sagee sent me a link to Google’s protocol buffers, a protocol for sending structured data over the network that also provides backwards compatibility between versions. From the announcement:
“XML? No, that wouldn’t work. As nice as XML is, it isn’t going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy.
Do we write hand-coded parsing and serialization routines for each data structure? Well, we used to. Needless to say, that didn’t last long. When you have tens of thousands of different structures in your code base that need their own serialization formats, you simply cannot write them all by hand.
Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.”
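To get a feel for the definition language the announcement describes, here is a minimal proto2-style message definition; the message and field names are my own illustration, not from the post, and the number after each field is its wire identifier:

```proto
message Person {
  required string name  = 1;
  required int32  id    = 2;
  optional string email = 3;
}
```

Feeding this to the protoc compiler produces a Person class in C++, Java or Python, with typed accessors and one-call serialization as described above.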
How does it work?
It is something like a simplified version of ASN.1, with some nice additions:
- Pre-allocated field identifiers for third-party extensions
- Integration with C++, Java and Python
The latter addition is the main differentiation from ASN.1 decoders: an ASN.1 decoder usually creates a navigable tree (something like the XML DOM), while protocol buffers generate classes from the protocol definition files, so a received message is presented as an object with simple accessor functions for each field. The resulting code is a lot more legible than repeated calls to generic tree-navigation functions would be, and it is probably more efficient, although the program footprint may be larger.
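To make the contrast concrete, here is a toy Python sketch; the dataclass is only a stand-in for what protoc would generate (real generated classes also carry parse and serialize methods), and the XML side uses just the standard library:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Tree-navigation style, as a generic DOM-like decoder would give you:
# each field access walks the tree by name and converts the raw text.
doc = ET.fromstring("<person><name>Ada</name><id>7</id></person>")
name = doc.find("name").text
user_id = int(doc.find("id").text)

# Generated-class style: the message is an object with typed fields.
# (A stand-in only; protoc-generated classes add ParseFromString() etc.)
@dataclass
class Person:
    name: str = ""
    id: int = 0

person = Person(name="Ada", id=7)
assert (name, user_id) == (person.name, person.id)
```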
One thing seems strange to me: in addition to the standard 32-bit and 64-bit integers, they added a variable-length integer, and they use it for the lengths of strings, groups and embedded messages. The most significant bit of each byte indicates whether another byte follows (it is set on every byte but the last); the marker bits are stripped and the remaining 7-bit groups are concatenated to form the number. At first, I thought it was meant to encode very large integers, but this seems like a lot of work. A more sensible use of such a structure is to encode values that fit in 7 bits (up to 127) almost all of the time, but could rise to any value, where message size must be as small as possible.

Even under these constraints, I think the bit crunching is excessive. A more sensible encoding could be this: if the first bit of the first byte is zero, the value is under 128 and is equal to that byte; if the first bit is one, the rest of the first byte is the size in bytes of the integer that follows. For values under 128 the two encodings are identical, and larger integers cost only one byte more than their raw size; up to 7-byte integers that is exactly what the varint scheme spends on continuation bits, so the sizes still match.
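To pin both schemes down, here is a short Python sketch: decode_varint follows the base-128 scheme protocol buffers use (least significant 7-bit group first), while the encode/decode pair below it implements the length-prefixed alternative proposed above. The little-endian byte order in the alternative is my assumption; nothing above fixes it:

```python
def decode_varint(data: bytes) -> tuple[int, int]:
    """Base-128 varint: the MSb of each byte says whether another byte
    follows; the low 7 bits carry the value, least significant group
    first. Returns (value, number of bytes consumed)."""
    value, shift = 0, 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:            # continuation bit clear: last byte
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def encode_prefixed(n: int) -> bytes:
    """The proposed alternative: values 0-127 are a single byte as-is;
    larger values get a length byte (MSb set, low 7 bits = byte count)
    followed by the raw integer, little-endian by assumption."""
    if n < 0x80:
        return bytes([n])
    size = (n.bit_length() + 7) // 8   # raw size of the integer in bytes
    return bytes([0x80 | size]) + n.to_bytes(size, "little")


def decode_prefixed(data: bytes) -> tuple[int, int]:
    first = data[0]
    if not first & 0x80:               # MSb clear: the byte is the value
        return first, 1
    size = first & 0x7F                # MSb set: low 7 bits give the length
    return int.from_bytes(data[1:1 + size], "little"), 1 + size


# 150 is the worked example from the protocol buffers encoding docs:
# varint 0x96 0x01 and the prefixed form 0x81 0x96 are both two bytes.
assert decode_varint(bytes([0x96, 0x01])) == (150, 2)
assert encode_prefixed(150) == bytes([0x81, 0x96])
assert decode_prefixed(encode_prefixed(150)) == (150, 2)
```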
Will I use it?
Will I use such an encoding (it’s not really a full protocol)? If I were writing a proprietary protocol, I would: it’s efficient, provides useful services and keeps the code readable. However, I’m not writing a proprietary protocol, so I have to use whatever encoding the protocol dictates. It’s a shame that Google “reinvented the wheel” instead of using an established encoding like ASN.1, which they could have wrapped with the same APIs.