[personal profile] diziet

tl;dr: Use MessagePack, rather than CBOR.

Introduction

I recently wanted to choose a binary encoding. This was for a project using Rust serde, so I looked at the list of formats there. I ended up reading about CBOR and MessagePack.

Both of these are binary formats for a JSON-like data model. Both of them are "schemaless", meaning you can decode them without knowing the structure. (This also provides some forwards compatibility.) They are, in fact, quite similar (although they are totally incompatible). This is no accident: CBOR is, effectively, a fork of MessagePack.

Both formats continue to exist and both are being used in new programs. I needed to make a choice but lacked enough information. I thought I would try to examine the reasons and nature of the split, and to make some kind of judgement about the situation. So I did a lot of reading [11]. Here are my conclusions.

History and politics

Between about 2010 and 2013 there was only MessagePack. Unfortunately, MessagePack had some problems. The biggest of these was that it lacked a separate string type. Strings were to be encoded simply as byte blocks. This caused serious problems for many MessagePack library implementors: for example, when decoding a MessagePack file the Python library wouldn't know whether to produce a Python bytes object, or a string. Straightforward data structures wouldn't round trip through MessagePack. [1] [2]

It seems that in late 2012 this came to the attention of someone with an IETF background. According to them, after unsatisfactory conversations with MessagePack upstream, they decided they would have to fork. They submitted an Internet-Draft for a partially-incompatible protocol [3] [4]. Little seemed to happen in the IETF until shortly before the in-person IETF meeting in Orlando in early 2013.[5]

These conversations sparked some discussion in the MessagePack issue tracker. There were long threads, including about process [1,2,4 ibid]. But there was also a useful technical discussion about proposed backward-compatible improvements to the MessagePack spec.[6] The prominent IETF contributor provided some helpful input in these discussions in the MessagePack community - but also pushed quite hard for a "tagging" system, a suggestion which was not accepted (see my technical analysis, below).

An improved MessagePack spec resulted, with string support, developed largely by the MessagePack community. It seems to have been available in useable form since mid-2013 and was officially published as canonical in August 2013.

Meanwhile a parallel process was pursued in the IETF, based on the IETF contributor's fork, with 11 Internet-Drafts from February[7] to September[8]. This seems to have continued even though the original technical reason for the fork - lack of string vs binary distinction - no longer applied. The IETF proponent expressed unhappiness about MessagePack's stewardship and process as much as they did about the technical details [4, ibid]. The IETF process culminated in the CBOR RFC[9].

The discussion on process questions between the IETF proponent and MessagePack upstream, in the MessagePack issue tracker [4, ibid] should make uncomfortable reading for IETF members. The IETF acceptance of CBOR despite clear and fundamental objections from MessagePack upstream[13] and indeed other respected IETF members[14], does not reflect well on the IETF. The much vaunted openness of the IETF process seems to have been rather one-sided. The IETF proponent here was an IETF Chair. Certainly the CBOR author was very well-spoken and constantly talked about politeness and cooperation and process; but what they actually did was very hostile. They accused the MessagePack community of an "us and them" attitude while simultaneously pursuing a forked specification!

The CBOR RFC does mention MessagePack in Appendix E.2. But not to acknowledge that CBOR was inspired by MessagePack. Rather, it does so to make a set of tendentious criticisms of MessagePack. Perhaps these criticisms were true when they were first written in an I-D but they were certainly false by the time the RFC was actually published, which occurred after the MessagePack improvement process was completely concluded, with a formal spec issued.

Since then both formats have existed in parallel. Occasionally people discuss which one is better, and sometimes it is alleged that "yes CBOR is the successor to MessagePack", which is not really fair.[9][10]

Technical differences

The two formats have a similar arrangement: initial byte which can encode small integers, or type and length, or type and specify a longer length encoding. But there are important differences. Overall, MessagePack is very significantly simpler.
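As a concrete illustration, here is a small Python sketch (my own, covering only a few representative typecodes, not either project's reference code) of how the initial byte is read in each format:

```python
def cbor_initial(byte):
    """Split a CBOR initial byte into its 3-bit major type and
    5-bit additional-information field (per RFC 7049)."""
    return byte >> 5, byte & 0x1f

def msgpack_kind(byte):
    """Classify a MessagePack initial byte (a few representative cases)."""
    if byte <= 0x7f:
        return "positive fixint", byte
    if 0xa0 <= byte <= 0xbf:
        return "fixstr", byte & 0x1f      # string length in the low 5 bits
    if byte == 0xce:
        return "uint 32", None            # 4-byte big-endian value follows
    if 0xe0 <= byte <= 0xff:
        return "negative fixint", byte - 0x100
    return "other", None

# 0x83 in CBOR is major type 4 (array) with length 3;
# 0x07 in MessagePack is the positive fixint 7.
print(cbor_initial(0x83))   # (4, 3)
print(msgpack_kind(0x07))   # ('positive fixint', 7)
```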

Floating point

CBOR supports five floating point formats! Not only three sizes of IEEE754, but also decimal floating point, and bigfloats. This seems astonishing for a supposedly-simple format. (Some of these are supported via the semi-optional tag mechanism - see below.)

Indefinite strings and arrays

Like MessagePack, CBOR mostly precedes items with their length. But CBOR also supports "indefinite" strings, arrays, and so on, where the length is not specified at the beginning. The object (array, string, whatever) is terminated by a special "break" item. This seems to me to be a mistake. In the kind of application where MessagePack or CBOR would be useful, streaming sub-objects of unknown length is not that important. This possibility considerably complicates decoders.
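To show what decoders must be prepared to cope with, here is a sketch (illustrative Python, handling only short chunks) of how an indefinite-length CBOR byte string is assembled:

```python
def cbor_indefinite_bytes(chunks):
    """Encode chunks as a CBOR indefinite-length byte string:
    0x5f, then each chunk as an ordinary definite-length byte
    string, then the 'break' byte 0xff (per RFC 7049)."""
    out = bytearray([0x5f])            # major type 2, additional info 31
    for chunk in chunks:
        assert len(chunk) < 24         # this sketch uses the short header form only
        out.append(0x40 | len(chunk))  # major type 2, length in the low 5 bits
        out += chunk
    out.append(0xff)                   # break: terminates the indefinite string
    return bytes(out)

# b"hi" and b"!" stream out as two chunks of one logical string.
print(cbor_indefinite_bytes([b"hi", b"!"]).hex())   # 5f4268694121ff
```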

CBOR tagging system

CBOR has a second layer of sort-of-type which can be attached to each data item. The set of possible tags is open-ended and extensible, but the CBOR spec itself gives tag values for: two kinds of date format; positive and negative bignums; decimal floats (see above); binary but expected to be encoded if converted to JSON (in base64url, base64, or base16); nestedly encoded CBOR; URIs; base64 data (two formats); regexps; MIME messages; and a special tag to make file(1) work.
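For concreteness, a sketch (illustrative Python, shorter tag numbers only) of how a tag head is encoded, as CBOR major type 6:

```python
def cbor_tag_header(tag):
    """Encode a CBOR tag head: major type 6 plus the tag number
    (this sketch covers only tags below 65536)."""
    if tag < 24:
        return bytes([0xc0 | tag])                 # tag number in the initial byte
    if tag < 256:
        return bytes([0xd8, tag])                  # 1-byte tag number follows
    return bytes([0xd9]) + tag.to_bytes(2, "big")  # 2-byte tag number follows

# Tag 1 is the epoch-based date tag; tag 55799 is the self-describing
# "make file(1) work" magic, whose head is the bytes d9 d9 f7.
print(cbor_tag_header(1).hex())       # c1
print(cbor_tag_header(55799).hex())   # d9d9f7
```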

In practice it is not clear how many of these are used, but a decoder must be prepared to at least discard them. The amount of additional spec complexity here is quite astonishing. IMO binary formats like this will (just like JSON) be used by a next layer which always has an idea of what the data means, including (where the data is a binary blob) what encoding it is in etc. So these tags are not useful.

These tags might look like a middle way between (i) extending the binary protocol with a whole new type such as an extension type (incompatible with old readers) and (ii) encoding your new kind of data in an existing type (leaving readers who don't know the schema to print it as just integers or bytes or strings). But I think they are more trouble than they are worth.

The tags are uncomfortably similar to the ASN.1 tag system, which is widely regarded as one of ASN.1's unfortunate complexities.

MessagePack extension mechanism

MessagePack explicitly reserves some encoding space for users and for future extensions: there is an "extension type". The payload is an extension type byte plus some more data bytes; the data bytes are in a format to be defined by the extension type byte. Half of the possible extension byte values are reserved for future specification, and half are designated for application use. This is pleasingly straightforward.
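A sketch of this encoding (illustrative Python; only the fixext and ext 8 forms from the spec):

```python
def msgpack_ext(ext_type, data):
    """Encode a MessagePack extension value. Payload sizes of exactly
    1/2/4/8/16 bytes get the compact fixext typecodes (0xd4-0xd8);
    other lengths up to 255 use ext 8 (0xc7)."""
    fixext = {1: 0xd4, 2: 0xd5, 4: 0xd6, 8: 0xd7, 16: 0xd8}
    t = ext_type & 0xff                # type byte: 0..127 for applications
    if len(data) in fixext:
        return bytes([fixext[len(data)], t]) + data
    assert len(data) < 256             # this sketch covers only the ext 8 form
    return bytes([0xc7, len(data), t]) + data

# Application-defined type 5 carrying four bytes uses fixext 4.
print(msgpack_ext(5, b"\x01\x02\x03\x04").hex())   # d60501020304
```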

(There is also one unused primary initial byte value, but that would be rejected by existing decoders and doesn't seem like a likely direction for future expansion.)

Minor other differences in integer encoding

The encodings of integers differ.

In MessagePack, signed and unsigned integers have different typecodes. In CBOR, signed and unsigned positive integers have the same typecodes; negative integers have a different set of typecodes. This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.
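A sketch of the difference (illustrative Python; only the 16-bit forms, with typecodes taken from the published specs):

```python
import struct

def msgpack_int16(x):
    """MessagePack 'int 16': typecode 0xd1, then big-endian two's complement."""
    return struct.pack(">Bh", 0xd1, x)

def msgpack_uint16(x):
    """MessagePack 'uint 16': a distinct typecode, 0xcd."""
    return struct.pack(">BH", 0xcd, x)

def cbor_int16ish(x):
    """CBOR: one typecode family for non-negative values (major type 0),
    another for negatives (major type 1, storing -(x + 1))."""
    if x >= 0:
        return bytes([0x19]) + x.to_bytes(2, "big")       # major 0, 2-byte argument
    return bytes([0x39]) + (-(x + 1)).to_bytes(2, "big")  # major 1, 2-byte argument

print(msgpack_int16(-2).hex())   # d1fffe
print(cbor_int16ish(-2).hex())   # 390001
```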

MessagePack reserves fewer shortcodes for small negative integers than for small positive integers.

Conclusions and lessons

MessagePack seems to have been prompted into fixing the missing string type problem, but only by the threat of a fork. However, this fork went ahead even after MessagePack clearly accepted the need for a string type. MessagePack had a fixed protocol spec before the IETF did.

The continued pursuit of the IETF fork was ostensibly motivated by disapproval of the MessagePack development process and in particular by a sense that the IETF process was superior. However, it seems to me that the IETF process was abused by CBOR's proponent, who just wanted things their own way. I have seen claims by IETF proponents that the open decisionmaking system inherently produces superior results. However, in this case the IETF process produced a bad specification. To the extent that other IETF contributors had influence over the ultimate CBOR RFC, I don't think they significantly improved it. CBOR has been described as MessagePack bikeshedded by the IETF. That would have been bad enough, but I think it's worse than that. To a large extent CBOR is one person's NIH-induced bad design rubber stamped by the IETF. CBOR's problems are not simply matters of taste: it's significantly overcomplicated.

One lesson for the rest of us is that although being the upstream and nominally in charge of a project seems to give us a lot of power, it's wise to listen carefully to one's users and downstreams. Once people are annoyed enough to fork, the fork will have a life of its own.

Another lesson is that many of us should be much warier of the supposed moral authority of the IETF. Many IETF standards are awful (Oauth 2 [12]; IKE; DNSSEC; the list goes on). Sometimes (especially when network adoption effects are weak, as with MessagePack vs CBOR) better results can be obtained from a smaller group, or even an individual, who simply need the thing for their own uses.

Finally, governance systems of public institutions like the IETF need to be robust in defending the interests of outsiders (and hence of society at large) against eloquent insiders who know how to work the process machinery. Any institution which nominally serves the public good faces a constant risk of devolving into self-servingness. This risk gets worse the more powerful and respected the institution becomes.

References

  1. #13: First-class string type in serialization specification (MessagePack issue tracker, June 2010 - August 2013)
  2. #121: Msgpack can't differentiate between raw binary data and text strings (MessagePack issue tracker, November 2012 - February 2013)
  3. draft-bormann-apparea-bpack-00: The binarypack JSON-like representation format (IETF Internet-Draft, October 2012)
  4. #129: MessagePack should be developed in an open process (MessagePack issue tracker, February 2013 - March 2013)
  5. Re: JSON mailing list and BoF (IETF apps-discuss mailing list message from Carsten Bormann, 18 February 2013)
  6. #128: Discussions on the upcoming MessagePack spec that adds the string type to the protocol (MessagePack issue tracker, February 2013 - August 2013)
  7. draft-bormann-apparea-bpack-01: The binarypack JSON-like representation format (IETF Internet-Draft, February 2013)
  8. draft-bormann-cbor: Concise Binary Object Representation (CBOR) (IETF Internet-Drafts, May 2013 - September 2013)
  9. RFC 7049: Concise Binary Object Representation (CBOR) (October 2013)
  10. "MessagePack should be replaced with [CBOR] everywhere ..." (floatboth on Hacker News, 8th April 2017)
  11. Discussion with very useful set of history links (camgunz on Hacker News, 9th April 2017)
  12. OAuth 2.0 and the Road to Hell (Eran Hammer, blog posting from 2012, via Wayback Machine)
  13. Re: [apps-discuss] [Json] msgpack/binarypack (Re: JSON mailing list and BoF) (IETF list message from Sadyuki Furuhashi, 4th March 2013)
  14. "no apologies for complaining about this farce" (IETF list message from Phillip Hallam-Baker, 15th August 2013)
    Edited 2020-07-14 18:55 to fix a minor formatting issue, and 2020-07-14 22:54 to fix two typos

CBOR features vs bugs

Date: 2020-07-14 08:13 pm (UTC)
From: (Anonymous)
I've been researching such formats as well, recently. I have none of the history here, and only know what I've found about the formats themselves. For my application (streaming over the network), I ended up choosing CBOR, because it has features I need. I'm a little confused about some of the things you've held up as problems with CBOR.

If you're storing data you have fully available, then certainly using a fixed-length string/byte type makes sense. But people do use CBOR to stream over a network, and it's helpful to not have to fully buffer data before sending it out. Suppose you're reading bytes being generated by a program, into a buffer, and wanting to stream those out over the network. Using CBOR, you can send out the result of each read() or each 4096-byte buffer as a packet in an indefinite string, and send a break when you get the EOF. As far as I can tell, with MessagePack, either you have to buffer all the data and send it only when you have it all, or you have to create your own equivalent to the CBOR indefinite string atop MessagePack by saying "semantically, in this protocol, when you get this series of strings at this point in the protocol, you should interpret them as one logical buffer".

Valid implementations are allowed to reject indefinite-length encodings if they don't want to deal with them, or for that matter they could specify a length limit; for instance, a tiny device that doesn't have malloc could reject them, or could set a maximum limit based on a buffer size.

You mentioned that MessagePack has encoding space reserved for extensions. CBOR does as well: it has space for coordinated/standardized extensions (where there must be a standard), and space for less-coordinated extensions (where the extension number can just be reserved for a given usage without requiring substantial justification). I'm trying to figure out what you're suggesting MessagePack has done there that CBOR hasn't.

Given that both formats are schemaless (with optional schema support available), anything you're not expecting you can generally just reject, as long as you handle it. So, if your format doesn't expect (say) 16-bit floats, or doesn't expect floats at all, you could just treat them like you do any other unknown format, and reject them. A general-purpose decoding library would need to handle them, but that's just one more branch in a switch statement and one more variant in an enum/ADT.

It sounds to me like both MessagePack and CBOR will work well for many purposes, and there are other purposes for which one or the other may work better. If you specifically want fixed-size buffers and a simpler implementation, MessagePack sounds helpful; if you need indefinite-length buffers or some additional data types, CBOR could make that easier.
From: [identity profile] mdw [distorted.org.uk]
IMO binary formats like this will (just like JSON) be used by a next layer which always has an idea of what the data means, including (where the data is a binary blob) what encoding it is in etc. So these tags are not useful.
I'm not sure how true this is. In a simple system, at least, the sender and (intended) recipient of a message can be expected to know what the various pieces mean. But there are more players than this, and some of them may lack detailed information about the underlying protocol. Indeed, much of the attraction of data encodings like JSON, MessagePack, and CBOR, comes from the fact that messages can be processed, to a significant extent (though, to be sure, imperfectly) by programs which don't understand their high-level semantics. For example, it's possible to write a tool which dumps a MessagePack or CBOR message in a useful human-readable format, for debugging purposes, say.

For example, I guess that the Base64 tags you mention are there specifically to support a protocol-agnostic conversion into JSON. This doesn't seem like a good idea to me (the reverse conversion seems very difficult, in general), but it presumably made some sense to someone.

I remember designing an encoding scheme of this kind once. It ended up more similar to CBOR than MessagePack in its basic shape. Anyway, I eventually -- and against my better judgement -- added an `annotation' feature very similar to CBOR's tags. The primary motivation for this, as I recall, was to support a debugging dump which could -- optionally! (I was very clear on this point) -- redact `secret' data from the dump. But, for this to be done in a protocol-agnostic way, there must be some way to identify the values which should be redacted. Hence the annotations. Of course, once the feature was added, more uses for it were identified. Maybe one of those was a good idea.

(You're going to ask me why I didn't use one of the existing things. I designed this three years before MessagePack was a thing. Besides, there are features I wanted that neither of these things provides.)

The tags are uncomfortably similar to the ASN.1 tag system, which is widely regarded as one of ASN.1's unfortunate complexities.
I don't see this similarity at all. CBOR tags are for annotating a value with some additional metadata.

ASN.1 tags have an entirely different purpose. They're mostly used in order to disambiguate encodings of `sequences' (ASN.1's primary heterogeneous aggregate type) which contain optional elements. Under the usual ASN.1 encoding rules, optional things which aren't actually present are simply omitted. For example, if you have two optional things of the same type, one after the other, and one of them appears in an encoded message, you need some way of figuring out which one you have: and explicit tagging is that mechanism.

In a nutshell: CBOR tags tell you extra information about what a value means in isolation; ASN.1 tags tell you how a value fits into its surrounding context.

None of this is intended to be a defence of ASN.1.

In MessagePack, signed and unsigned integers have different typecodes. In CBOR, signed and unsigned positive integers have the same typecodes; negative integers have a different set of typecodes. This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.
I think you've completely misunderstood MessagePack here. To be fair, the specification is remarkably poor, and I had to source-dive some implementations and dig into the bug tracker. (This in itself is a reason to use CBOR.)

Let's deal with CBOR first, because it's rather simpler. To represent a nonnegative integer x, you write a prefix that says `nonnegative integer of some length', and then the value of x. To represent a negative integer x, you write a different prefix that says `negative integer of some length', and then the value of -(x + 1). Every integer in the half-open interval [-2^64, 2^64) can be represented in exactly one of these two basic ways (but implementations still have a number of different-length encoding variants to choose between).
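A decoding sketch of that rule (illustrative Python, handling only the short argument forms):

```python
def cbor_decode_int(buf):
    """Decode a CBOR integer from its head (short argument forms only):
    major type 0 gives n, major type 1 gives -1 - n."""
    major, info = buf[0] >> 5, buf[0] & 0x1f
    if info < 24:
        n = info                                # argument in the initial byte
    elif info == 24:
        n = buf[1]                              # 1-byte argument
    elif info == 25:
        n = int.from_bytes(buf[1:3], "big")     # 2-byte argument
    else:
        raise ValueError("4- and 8-byte argument forms omitted in this sketch")
    if major == 0:
        return n
    if major == 1:
        return -1 - n
    raise ValueError("not an integer head")

# 0x39 0x01 0xf3 is major type 1 with argument 499, i.e. the integer -500.
print(cbor_decode_int(bytes([0x39, 0x01, 0xf3])))   # -500
```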

The MessagePack spec lists in its `Overview' section

  • `positive fixint', or maybe `fixnum', representing [0, 128);
  • `negative fixint', or maybe `fixnum', representing [-128, 0);
  • `uint 8', `uint 16', `uint 32', `uint 64', where `uint N' represents [0, 256^N); and
  • `int 8', `int 16', `int 32', `int 64', where `int N' represents [-(256^N)/2, (256^N)/2).


The `Type system' section lists `Integer', without subdividing it into `signed' and `unsigned'. There's a remark
a value of an Integer object is limited from -(2^63) upto (2^64)-1
though the lower bound is achieved only using the `int 64' encoding, and the upper bound achieved only using `uint 64'. This suggests that the various `int N' and `uint N' formats are thought of as encoding different, but overlapping, subranges of a unified `integer' type. On the other hand, under `int format family', we have the gnomic remark

  • 0XXXXXXX is 8-bit unsigned integer
  • 111YYYYY is 8-bit signed integer
This is followed by some box diagrams which say things like
        uint 32 stores a 32-bit big-endian unsigned integer
        +--------+--------+--------+--------+--------+
        |  0xce  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
        +--------+--------+--------+--------+--------+
and
        int 16 stores a 16-bit big-endian signed integer
        +--------+--------+--------+
        |  0xd1  |ZZZZZZZZ|ZZZZZZZZ|
        +--------+--------+--------+
and that's all there is. There's no explanation of what the ZZZZ means in a `signed' integer. From source diving, it seems that this is two's complement. See also their issue #269 (https://github.com/msgpack/msgpack/issues/269).

There's no further exposition on whether a nonnegative `signed' integer is the same as, or distinct from, an `unsigned' integer with the same numerical value.

The `msgpack' project on Github hosts a number of implementations in different languages. Their Java implementation (https://github.com/msgpack/msgpack-java/blob/develop/msgpack-core/src/main/java/org/msgpack/core/MessagePacker.java#L547) certainly uses `uint N' types for (necessarily signed) nonnegative Java integers. The Python decoder is complicated, but bottoms out at `callbacks (https://github.com/msgpack/msgpack-python/blob/master/msgpack/unpack.h#L85)' which just make Python integers, erasing the `signed'/`unsigned' distinction.

This ambiguity was raised in issue #164 (https://github.com/msgpack/msgpack/issues/164) though the original submitter closed that bug before anything apparently was done about it.

A bug (https://github.com/msgpack/msgpack-c/issues/247) was raised against the C implementation which failed to maintain the distinction. A patch was prepared, but then abandoned because the existing behaviour was thought to be better.

I think this was the right decision. A system which worked the way you suggest would be very inconvenient to use from dynamic languages such as Python or Perl, whose natural integer representations erase this apparently important type distinction.

Your specific criticism of CBOR here seems bizarre.

This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.
The latter is clearly no more onerous than the range checking necessary to select the right encoding format. MessagePack recommends using the shortest acceptable encoding variant, so has no benefit here:

If an object can be represented in multiple possible output formats, serializers SHOULD use the format which represents the data in the smallest number of bytes.
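A sketch of such a writer's range checking (illustrative Python; formats up to 16 bits only, with typecodes from the MessagePack spec):

```python
def msgpack_pack_int(x):
    """Pick the shortest MessagePack encoding for an integer, as the
    spec's SHOULD recommends (this sketch stops at the 16-bit formats)."""
    if 0 <= x < 0x80:
        return bytes([x])                                   # positive fixint
    if -32 <= x < 0:
        return bytes([x & 0xff])                            # negative fixint
    if 0 <= x < 0x100:
        return bytes([0xcc, x])                             # uint 8
    if -0x80 <= x < 0x80:
        return bytes([0xd0, x & 0xff])                      # int 8
    if 0 <= x < 0x10000:
        return bytes([0xcd]) + x.to_bytes(2, "big")         # uint 16
    if -0x8000 <= x < 0x8000:
        return bytes([0xd1]) + (x & 0xffff).to_bytes(2, "big")  # int 16
    raise ValueError("wider formats omitted in this sketch")

# -33 doesn't fit a negative fixint, so it takes the int 8 form.
print(msgpack_pack_int(-33).hex())   # d0df
```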

The former claim seems even more absurd. Indeed, I must assume that this is a typo, since a CBOR decoder expecting a signed integer doesn't care about the sign of the incoming data -- ex hypothesi it will handle either. A decode expecting a nonnegative integer must of course check the sign of the incoming value -- in the same way that a decoder expecting value that fits in 32 bits must check that it hasn't been given a 43-bit value, and the latter is something a MessagePack application must also deal with.

Your conclusions seem mostly sound, but don't address the significant advantage that CBOR at least has a fairly clear, well-written specification. MessagePack's `specification' is barely anything of the sort.
From: [identity profile] mdw [distorted.org.uk]
You seem to be suggesting that the spec is ambiguous as to whether a uint16 and an int16 should be treated as the same type, but that interpretation is explicitly contradicted by the type system set out right at the top of the document. You seem to have noticed, but then weirdly don't seem to regard this as sufficiently conclusive. The parts of the spec that you seem to regard as casting doubt are encoding descriptions.

Hardly. I cited the actual description of the 'Type system', which is actually at the top of the document, and which mentions no such distinction -- indeed, it doesn't mention signedness at all. The first suggestion that there might be a distinction to be drawn is in 'Formats: Overview'. There's a difference between a value's type and its representation, so the latter certainly shouldn't be considered to override the former. (On the one hand, Lisp systems will select different representations for integers depending on their size and context[1]. On the other hand, an ISO8859-1 string and a raw octet vector have the same representation, but are probably best considered as different types.)

Anyway, I'm suggesting that the spec is ambiguous precisely because there are multiple interpretations of it, out there in the world. You formed one; I formed another, and I have justified my interpretation with extensive citations, not only of the specification itself, but of bug-tracker discussions about it, and 'blessed' implementations. There are also implementations which support your interpretation, which just reinforces my point that the specification is bad.

Range checking is not necessary to select the right encoding format. Choosing the shortest encoding is not mandatory (neither is it in CBOR)

Indeed it's not mandatory, but it is strongly recommended. It's literally the only occurrence of RFC2119 SHOUTING in the document.

One might dismiss this as an irrelevant optimisation and say that a decoder must be able to handle all kinds of integer. But (i) the straightforward encoding can provide a fast path and (ii) in a system where the writers are cooperating to use a supported subset, a limited decoder can be a lot simpler than the corresponding limited CBOR decoder here.

This is a fair point.

I did not see any ambiguity in the MessagePack specification when I read it, and I still don't having seen your comments.

How do you reconcile this position with the observed fact that the 'msgpack-c' implementation (among a number of others) erases the distinction between signed and unsigned integers?

Certainly it is not written in the very defensive style common to modern standards bodies [...]. CBOR is written that way and the result is extremely turgid prose.

Adding the words 'two's complement' occasionally will not turn this document into the terrible Vroomfondel standardese of modern RFCs.

[1] It's true that Common Lisp, for example, partitions its integer type into fixnum for small-magnitude integers and bignum for large-magnitude integers. But everything is more subtle in actual implementations, which will store a value unboxed in a machine word or register, even if it's slightly too large to fit in a fixnum with the usual type tagging, if type and range inference shows that this will work.

(no subject)

Date: 2021-10-17 11:48 am (UTC)
From: (Anonymous)

I didn't know about all that politics and argument. It's worth knowing; thank you for the review. But I'll compare only the technical details.

I've implemented a binary JSON-like format for storing, transferring and processing log events, and I took CBOR as a base. More precisely, I threw away all the predefined extension types and used only my own. I agree, the standard extension tags are quite messy. Many of them are useless or rarely needed.

Why do I like it more than MsgPack?

Reader and writer implementation is simpler.

It has the same tag-length structure throughout (the tag is always 3 bits, and the length is always encoded the same way), so I can use the same function to encode and decode the headers of all types.
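A sketch of that single header routine (illustrative Python, following RFC 7049's head layout, shorter argument forms only):

```python
def cbor_head(major, n):
    """Encode any CBOR item head: the same routine serves integers,
    string/array/map lengths, and tags, because the initial byte is
    always a 3-bit major type plus a 5-bit argument form."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + n.to_bytes(2, "big")
    raise ValueError("4- and 8-byte argument forms omitted in this sketch")

# One function, three different kinds of header:
print(cbor_head(0, 10).hex())    # 0a - the integer 10
print(cbor_head(4, 3).hex())     # 83 - an array of length 3
print(cbor_head(6, 1).hex())     # c1 - tag 1
```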

Indefinite-length is useful.

Logger can encode arguments without knowing how many of them follow. Processor can easily modify messages in flight.

Extension tags are simpler.

CBOR has just one simple wrapper with a tag, which is easy to encode and decode. MsgPack has 8 tags with different semantics, and these tags are not even contiguous.

CBOR can carry the same CBOR format under an extension tag, with all the base types available. MsgPack allows only a byte array, so one loses the ability to parse, interpret and process that extension data in a general way.

Extension tags are extremely useful. Although all data is encoded in the same simple types, you can easily differentiate {a plain int, a time, a duration, an enum}, {plain bytes, an id, a hash, some other binary format}, {an error message and a string}, {a stack trace and an array of maps} and so on. So if a user of my logger adds any of that typed data to the event arguments, any UI or processing script can interpret it specially.

One other small nice point.

Floats can be encoded as decimals, so that they can take 1 or 2 bytes instead of the minimal 5.

Ian Jackson