4 Gotchas of text-based protocols
As I have stated many times, I think binary encoding is superior to text-based encoding. Usually, however, when you are a software engineer implementing a protocol, the choice of protocol is not yours. In this post, I will consider the pitfalls and gotchas of text-based protocols and how to design your way around them. I mostly consider protocols defined by ABNF rules (such as SIP). If the protocol uses an XML schema, there is plenty of documentation on parsing XML into DOM trees, and standard libraries already provide such parsers. In fact, the advice in this post was influenced by XML parser principles, regular expression architectures and my own experience with text parsing and binary decoding.
1. Message Size
Text-based protocols are usually not kind enough to provide the length of the entire message in advance. The best way to design around this is to write a parser that handles just one line at a time. Such a parser must preserve its state between lines and report where the line it processed ended. If it does not find a line ending, it should report that and remain in its previous state. The connection layer should then append the next buffer it reads to the unprocessed section and resume parsing lines.
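The buffering described above can be sketched as follows. This is only a minimal illustration, not a reference implementation: the LineFeeder class and its on_line callback are hypothetical names, and the sketch assumes CRLF line endings (as SIP uses) and leaves the actual line parser as a callback.

```python
class LineFeeder:
    """Buffers raw bytes from the connection and hands out complete lines only."""

    def __init__(self, on_line):
        self._pending = b""      # unprocessed tail left over from earlier reads
        self._on_line = on_line  # hypothetical callback: the one-line parser

    def feed(self, chunk: bytes) -> int:
        """Append a freshly read buffer and parse every complete line in it.

        Returns the number of lines handed to the parser.
        """
        self._pending += chunk
        count = 0
        while True:
            end = self._pending.find(b"\r\n")  # assumption: CRLF line endings
            if end < 0:
                # No line ending yet: keep the tail and stay in the same state.
                return count
            line = self._pending[:end]
            self._pending = self._pending[end + 2:]
            self._on_line(line)
            count += 1


# Usage: the second header line arrives split across two reads.
lines = []
feeder = LineFeeder(lines.append)
first = feeder.feed(b"INVITE sip:bob@example.com SIP/2.0\r\nVia: SIP/2.0")
second = feeder.feed(b"/UDP host.example.com\r\n\r\n")
```

After the first read only the request line is delivered; the partial "Via" line waits in the buffer until the second read completes it.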
2. String Tokens
As soon as you write strcmp(), the parser's efficiency is gone, because matching a word against a list of known strings rescans it once per candidate. Opt for generalized tokenizing instead: build an enumeration with a value for every string token you expect, and create one large function that turns every known string into the matching enumerated token. The function should work like a state machine, reporting an accepting state (with the token code), a non-accepting state, or “end of line reached”. The following is a state machine for the words “ABC”, “ACB” and “CAB”:

This may look like a lot of states, but a small script can take the token words and generate a function that tokenizes them efficiently.
Notice that the tokenizer state machine in the diagram does not differentiate between upper and lower case and treats any white-space as the end of a word. This gives it some resilience to non-standard implementations that differ in spaces versus tabs or in upper versus lower case. Punctuation marks are still a problem, as the next section shows.
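As a rough sketch of such a generated tokenizer, the code below builds the transition table for “ABC”, “ACB” and “CAB” at start-up instead of emitting source code. The names Token, build_states and tokenize are made up for this example, and the three status strings stand in for the accepting, non-accepting and end-of-line outcomes.

```python
from enum import Enum, auto


class Token(Enum):
    ABC = auto()
    ACB = auto()
    CAB = auto()


def build_states(words):
    """Build the state machine: state 0 is the start state."""
    transitions = [{}]  # transitions[state][char] -> next state
    accepting = {}      # state -> Token
    for text, token in words.items():
        state = 0
        for ch in text.upper():
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting[state] = token
    return transitions, accepting


TRANSITIONS, ACCEPTING = build_states({t.name: t for t in Token})


def tokenize(line, pos=0):
    """Read one word starting at pos; return (status, token, next_pos)."""
    # Skip leading white-space, so spaces and tabs are treated alike.
    while pos < len(line) and line[pos].isspace():
        pos += 1
    if pos == len(line):
        return "eol", None, pos
    state = 0
    while pos < len(line) and not line[pos].isspace():
        if state is not None:
            # Upper-casing each character makes the match case-insensitive.
            state = TRANSITIONS[state].get(line[pos].upper())
        pos += 1  # consume the whole word even after a failed transition
    token = ACCEPTING.get(state)
    status = "accept" if token is not None else "reject"
    return status, token, pos


line = "abc\tCab  xyz"
r1 = tokenize(line)
r2 = tokenize(line, r1[2])
r3 = tokenize(line, r2[2])
r4 = tokenize(line, r3[2])
```

Each input character is examined once, however many token words there are, which is the point of replacing a chain of string comparisons.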
3. Message Syntax
Often, a text protocol is defined by ABNF rules. These are substitution rules, and they are usually nested. For instance, one rule defines a time format and another rule uses that definition in another command:
$time = DIGIT DIGIT “:” DIGIT DIGIT
$timeRange = $time “-” $time
If an ABNF rule were made up entirely of primitives (such as DIGIT in our example), we could write a regular expression to parse it and return field-value pairs. The complete parser has to extend the regular expression engine with the ability to nest. When the syntax specifies a primitive, the engine matches it; when it refers to another rule, the engine invokes that rule recursively until everything is parsed into primitives. Primitives, in our case, are tokens (which we already know how to tokenize), numerical values and string values. This nested regular expression engine must be able to preserve its state between calls, as some ABNF rules can span more than one line. The engine can also be taught to treat different punctuation symbols as equivalent, to deal with non-standard messages.
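The nesting idea can be illustrated with the two rules above. This is a deliberately small sketch under several assumptions: the GRAMMAR table, the PRIMITIVES table and the match function are hypothetical names, rules are plain sequences (no alternation or repetition), and state is not preserved across lines.

```python
# Each rule is a sequence of elements: another rule name, a primitive name,
# or literal text. This mirrors the $time and $timeRange rules in the post.
GRAMMAR = {
    "time": ["DIGIT", "DIGIT", ":", "DIGIT", "DIGIT"],
    "timeRange": ["time", "-", "time"],
}

# Primitives match exactly one character in this sketch.
PRIMITIVES = {
    "DIGIT": lambda ch: ch in "0123456789",
}


def match(rule, text, pos=0):
    """Match `rule` at text[pos:].

    Returns (tree, next_pos) on success or None on failure. The tree is
    (rule_name, children); children are matched strings or sub-trees.
    """
    children = []
    for element in GRAMMAR[rule]:
        if element in GRAMMAR:
            # Nested rule: invoke it recursively.
            result = match(element, text, pos)
            if result is None:
                return None
            subtree, pos = result
            children.append(subtree)
        elif element in PRIMITIVES:
            # Primitive: test a single character.
            if pos >= len(text) or not PRIMITIVES[element](text[pos]):
                return None
            children.append(text[pos])
            pos += 1
        else:
            # Literal text such as ":" or "-".
            if not text.startswith(element, pos):
                return None
            children.append(element)
            pos += len(element)
    return (rule, children), pos


good = match("timeRange", "09:30-17:45")
bad = match("timeRange", "09:30/17:45")
```

Treating "/" as equivalent to "-" for non-standard senders would be a change to the literal branch only, which is one reason to keep punctuation handling in a single place.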
4. Data Structures
If you are familiar with XML or HTML, you may have used the Document Object Model (DOM). The DOM is simply a tree data structure that holds an XML or HTML document according to its hierarchy. Each node represents an XML tag, an attribute or a value (text). Any text parser can use a similar data structure to hold parsed messages. In fact, a text parser will have to use one tree data structure to hold the message syntax and smaller message trees to hold just the token values in each message. The syntax tree is also important for encoding text messages: the layers above the parser may have built the message tree in any order, and the encoder has to encode the values in the order defined by the syntax tree.
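To make the role of the syntax tree in encoding concrete, here is a toy sketch. The node layout, the "$" prefix for field names and the encode function are all assumptions made for this example; a real syntax tree would be derived from the protocol's ABNF.

```python
# Syntax tree: a node is (name, children). A child is another node, a field
# name prefixed with "$" (toy convention), or literal text.
SYNTAX = ("timeRange", [
    ("start", ["$hour", ":", "$minute"]),
    "-",
    ("end", ["$hour", ":", "$minute"]),
])


def encode(node, values):
    """Serialize `values` (the message tree for this node) in syntax order."""
    _name, children = node
    parts = []
    for child in children:
        if isinstance(child, tuple):
            # Nested node: recurse into the matching message subtree.
            parts.append(encode(child, values[child[0]]))
        elif child.startswith("$"):
            # Field: take the value from the message tree.
            parts.append(values[child[1:]])
        else:
            # Literal punctuation supplied by the syntax itself.
            parts.append(child)
    return "".join(parts)


# The upper layers fill in the message tree in whatever order suits them.
message = {}
message["end"] = {"minute": "45", "hour": "17"}
message["start"] = {"hour": "09", "minute": "30"}

encoded = encode(SYNTAX, message)
```

The message tree holds only values; the order and the punctuation both come from the syntax tree, so the output does not depend on how the message was built.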
1. Todd Bradley | July 3rd, 2008 at 5:21 pm
I’m really surprised that in this day and age you still prefer binary protocols. For debugging, text is so much nicer. And message size is almost a non-issue these days.
2. Ran Arad | July 7th, 2008 at 7:42 am
I’m really surprised that anyone still buys the “easier to debug” folly. Did no-one ever hear of Wireshark? Why is human readability a factor in computer communications, especially when it costs so much? Message size is an issue for cellular networks; that’s why SIP uses SigComp for IMS. The compression costs additional computing resources on top of text parsing, and for cellular devices that’s a problem.