diziet | I think you're wrong on several of the technical aspects

IMO binary formats like this will (just like JSON) be used by a next layer which always has an idea of what the data means, including (where the data is a binary blob) what encoding it is in etc. So these tags are not useful.

I'm not sure how true this is. In a simple system, at least, the sender and (intended) recipient of a message can be expected to know what the various pieces mean. But there are more players than this, and some of them may lack detailed information about the underlying protocol. Indeed, much of the attraction of data encodings like JSON, MessagePack, and CBOR comes from the fact that messages can be processed, to a significant extent (though, to be sure, imperfectly) by programs which don't understand their high-level semantics. For example, it's possible to write a tool which dumps a MessagePack or CBOR message in a useful human-readable format, for debugging purposes, say.

For example, I guess that the Base64 tags you mention are there specifically to support a protocol-agnostic conversion into JSON. This doesn't seem like a good idea to me (the reverse conversion seems very difficult, in general), but it presumably made some sense to someone.

I remember designing an encoding scheme of this kind once. It ended up more similar to CBOR than MessagePack in its basic shape. Anyway, I eventually -- and against my better judgement -- added an `annotation' feature very similar to CBOR's tags. The primary motivation for this, as I recall, was to support a debugging dump which could -- optionally! (I was very clear on this point) -- redact `secret' data from the dump. But, for this to be done in a protocol-agnostic way, there must be some way to identify the values which should be redacted. Hence the annotations. Of course, once the feature was added, more uses for it were identified. Maybe one of those was a good idea.

(You're going to ask me why I didn't use one of the existing things. I designed this three years before MessagePack was a thing. Besides, there are features I wanted that neither of these things provides.)

The tags are uncomfortably similar to the ASN.1 tag system, which is widely regarded as one of ASN.1's unfortunate complexities.

I don't see this similarity at all. CBOR tags are for annotating a value with some additional metadata.

ASN.1 tags have an entirely different purpose. They're mostly used in order to disambiguate encodings of `sequences' (ASN.1's primary heterogeneous aggregate type) which contain optional elements. Under the usual ASN.1 encoding rules, optional things which aren't actually present are simply omitted. For example, if you have two optional things of the same type, one after the other, and one of them appears in an encoded message, you need some way of figuring out which one you have: and explicit tagging is that mechanism.

In a nutshell: CBOR tags tell you extra information about what a value means in isolation; ASN.1 tags tell you how a value fits into its surrounding context.

None of this is intended to be a defence of ASN.1.

In MessagePack, signed and unsigned integers have different typecodes. In CBOR, signed and unsigned positive integers have the same typecodes; negative integers have a different set of typecodes. This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.

I think you've completely misunderstood MessagePack here. To be fair, the specification is remarkably poor, and I had to source-dive some implementations and dig into the bug tracker. (This in itself is a reason to use CBOR.)

Let's deal with CBOR first, because it's rather simpler. To represent a nonnegative integer x, you write a prefix that says `nonnegative integer of some length', and then the value of x. To represent a negative integer x, you write a different prefix that says `negative integer of some length', and then the value of -(x + 1). Every integer in the half-open interval [-2^64, 2^64) can be represented in exactly one of these two basic ways (but implementations still have a number of different-length encoding variants to choose between).
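The scheme just described can be sketched in a few lines of Python. This is my own illustration, not taken from any particular library: the function names are made up, it picks the shortest length variant, and the decoder assumes the input is complete.

```python
def cbor_encode_int(x: int) -> bytes:
    """Encode an integer in [-2**64, 2**64) as a CBOR major type 0 or 1 item."""
    if not -(1 << 64) <= x < (1 << 64):
        raise ValueError("out of range for CBOR's basic integer types")
    if x >= 0:
        major, arg = 0, x         # major type 0: nonnegative, stores x
    else:
        major, arg = 1, -(x + 1)  # major type 1: negative, stores -(x + 1)
    if arg < 24:                  # small arguments live in the initial byte
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < (1 << (8 * size)):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise AssertionError("unreachable: range was checked above")


def cbor_decode_int(data: bytes) -> int:
    """Decode a CBOR integer item at the start of data."""
    major, info = data[0] >> 5, data[0] & 0x1F
    if major not in (0, 1):
        raise ValueError("not an integer item")
    if info < 24:
        arg = info
    elif info <= 27:
        size = 1 << (info - 24)   # 1, 2, 4 or 8 bytes follow
        arg = int.from_bytes(data[1:1 + size], "big")
    else:
        raise ValueError("invalid additional information for an integer")
    return arg if major == 0 else -1 - arg
```

Note that the sign test in the encoder is a single comparison made before the length selection, which has to examine the value anyway.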

The MessagePack spec lists in its `Overview' section

`positive fixint', or maybe `fixnum', representing [0, 128);
`negative fixint', or maybe `fixnum', representing [-32, 0);
`uint 8', `uint 16', `uint 32', `uint 64', where `uint N' represents [0, 2^N); and
`int 8', `int 16', `int 32', `int 64', where `int N' represents [-(2^N)/2, (2^N)/2).

The `Type system' section lists `Integer', without subdividing it into `signed' and `unsigned'. There's a remark

a value of an Integer object is limited from -(2^63) upto (2^64)-1

though the lower bound is achieved only using the `int 64' encoding, and the upper bound achieved only using `uint 64'. This suggests that the various `int N' and `uint N' formats are thought of as encoding different, but overlapping, subranges of a unified `integer' type. On the other hand, under `int format family', we have the gnomic remark

0XXXXXXX is 8-bit unsigned integer
111YYYYY is 8-bit signed integer

This is followed by some box diagrams which say things like

        uint 32 stores a 32-bit big-endian unsigned integer
        +--------+--------+--------+--------+--------+
        |  0xce  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
        +--------+--------+--------+--------+--------+

and


        int 16 stores a 16-bit big-endian signed integer
        +--------+--------+--------+
        |  0xd1  |ZZZZZZZZ|ZZZZZZZZ|
        +--------+--------+--------+

and that's all there is. There's no explanation of what the ZZZZ means in a `signed' integer. From source diving, it seems that this is two's complement. See also their issue #269 (https://github.com/msgpack/msgpack/issues/269).

There's no further exposition on whether a nonnegative `signed' integer is the same as, or distinct from, an `unsigned' integer with the same numerical value.
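For concreteness, here is a minimal Python sketch of an integer decoder under the reading I arrived at: `signed' payloads are big-endian two's complement, and the result is a plain integer regardless of which format carried it. The function name is mine, and the two's-complement interpretation is the assumption from source diving, not something the spec states.

```python
import struct

# Format byte -> struct format for the payload that follows it.
_INT_FORMATS = {
    0xCC: ">B", 0xCD: ">H", 0xCE: ">I", 0xCF: ">Q",  # uint 8/16/32/64
    0xD0: ">b", 0xD1: ">h", 0xD2: ">i", 0xD3: ">q",  # int 8/16/32/64
}


def msgpack_decode_int(data: bytes) -> int:
    """Decode a MessagePack integer at the start of data into a Python int."""
    first = data[0]
    if first <= 0x7F:        # positive fixint, 0XXXXXXX
        return first
    if first >= 0xE0:        # negative fixint, 111YYYYY, read as a signed byte
        return first - 0x100
    fmt = _INT_FORMATS.get(first)
    if fmt is None:
        raise ValueError("not an integer format byte")
    size = struct.calcsize(fmt)
    (value,) = struct.unpack(fmt, data[1:1 + size])
    return value
```

With this decoder, the bytes ce 00 00 00 05 (a `uint 32'), d1 00 05 (an `int 16') and the single byte 05 all come out as the same integer 5, which is exactly the erasure of the `signed'/`unsigned' distinction at issue.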

The `msgpack' project on GitHub hosts a number of implementations in different languages. Their Java implementation (https://github.com/msgpack/msgpack-java/blob/develop/msgpack-core/src/main/java/org/msgpack/core/MessagePacker.java#L547) certainly uses `uint N' types for (necessarily signed) nonnegative Java integers. The Python decoder is complicated, but bottoms out at `callbacks (https://github.com/msgpack/msgpack-python/blob/master/msgpack/unpack.h#L85)' which just make Python integers, erasing the `signed'/`unsigned' distinction.

This ambiguity was raised in issue #164 (https://github.com/msgpack/msgpack/issues/164) though the original submitter closed that bug, apparently before anything was done about it.

A bug (https://github.com/msgpack/msgpack-c/issues/247) was raised against the C implementation which failed to maintain the distinction. A patch was prepared, but then abandoned because the existing behaviour was thought to be better.

I think this was the right decision. A system which worked the way you suggest would be very inconvenient in dynamic languages such as Python or Perl, which couldn't use their natural integer representations without erasing apparently important type distinctions.

Your specific criticism of CBOR here seems bizarre.

This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.

The latter is clearly no more onerous than the range checking necessary to select the right encoding format. MessagePack recommends using the shortest acceptable encoding variant, so has no benefit here:

If an object can be represented in multiple possible output formats, serializers SHOULD use the format which represents the data in the smallest number of bytes.
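To show how much value inspection that SHOULD already implies, here is a hedged Python sketch of a MessagePack integer writer which follows it. The function name is mine; I assume that nonnegative values go in the `uint N' formats (as the Java implementation does) and that negative payloads are two's complement.

```python
import struct

_UNSIGNED = ((0xCC, ">B", 1 << 8), (0xCD, ">H", 1 << 16),
             (0xCE, ">I", 1 << 32), (0xCF, ">Q", 1 << 64))
_SIGNED = ((0xD0, ">b", 1 << 7), (0xD1, ">h", 1 << 15),
           (0xD2, ">i", 1 << 31), (0xD3, ">q", 1 << 63))


def msgpack_encode_int(x: int) -> bytes:
    """Encode x in the shortest MessagePack integer format that holds it."""
    if 0 <= x < 128:            # positive fixint, 0XXXXXXX
        return bytes([x])
    if -32 <= x < 0:            # negative fixint, 111YYYYY (five value bits)
        return bytes([x & 0xFF])
    if x >= 0:
        for code, fmt, limit in _UNSIGNED:
            if x < limit:
                return bytes([code]) + struct.pack(fmt, x)
    else:
        for code, fmt, limit in _SIGNED:
            if x >= -limit:
                return bytes([code]) + struct.pack(fmt, x)
    raise ValueError("out of range for MessagePack integers")
```

The writer has to compare the value against a ladder of bounds whatever happens; one extra comparison against zero, as CBOR requires, is lost in the noise.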

The former claim seems even more absurd. Indeed, I must assume that this is a typo, since a CBOR decoder expecting a signed integer doesn't care about the sign of the incoming data -- ex hypothesi it will handle either. A decoder expecting a nonnegative integer must of course check the sign of the incoming value -- in the same way that a decoder expecting a value that fits in 32 bits must check that it hasn't been given a 43-bit value, and the latter is something a MessagePack application must also deal with.
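Put another way, the sign check and the width check are the same kind of check, applied to the integer after decoding and independent of the wire format. A minimal Python sketch, with a helper name of my own invention:

```python
def check_range(value: int, *, signed: bool, bits: int) -> int:
    """Return value if it fits the expected integer type, else raise."""
    if signed:
        lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    else:
        lo, hi = 0, 1 << bits
    if not lo <= value < hi:
        kind = "signed" if signed else "unsigned"
        raise OverflowError(f"{value} does not fit in a {bits}-bit {kind} integer")
    return value
```

An application expecting an unsigned 32-bit field would call something like `check_range(v, signed=False, bits=32)` on whatever the decoder produced, whether the bytes were CBOR or MessagePack; a negative value and a 43-bit value are rejected by the same comparison.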

Your conclusions seem mostly sound, but don't address the significant advantage that CBOR at least has a fairly clear, well-written specification. MessagePack's `specification' is barely anything of the sort.