diziet: (Default)

I have just released Otter 1.0.0.

Recap: what is Otter

Otter is my game server for arbitrary board games. Unlike most online game systems, it does not know (nor does it need to know) the rules of the game you are playing. Instead, it lets you and your friends play with common tabletop/boardgame elements such as hands of cards, boards, and so on. So it’s something like a “tabletop simulator” (but it does not have any 3D, or a physics engine, or anything like that).

There are provided game materials and templates for Penultima, Mao, and card games in general.

Otter also supports uploadable game bundles, which allows users to add support for additional games - and this can be done without programming.

For more information, see the online documentation. There is a longer intro and some screenshots in my 2021 introductory blog post about Otter.

Releasing 1.0.0

I’m calling this release 1.0.0 because I think I can now say that its quality, reliability and stability is suitable for general use. In particular, Otter now builds on Stable Rust, which makes it a lot easier to install and maintain.

Switching web framework, and async Rust

I switched Otter from the Rocket web framework to Actix. There are things to prefer about both systems, and I still have a soft spot for Rocket. But ultimately I needed a framework which was fully released and supported for use with Stable Rust.

There are few if any Rust web frameworks that are not async. This is rather a shame. Async Rust is a considerably more awkward programming environment than ordinary non-async Rust. I don’t want to digress into a litany of complaints, but suffice it to say that while I really love Rust, my views on async Rust are considerably more mixed.

Future plans

In the near future I plan to add a couple of features to better support some particular games: currency-like resources, and a better UI for dice-like randomness.

In the longer term, Otter’s installation and account management arrangements are rather unsophisticated and un-webby. There is not currently any publicly available instance for you to try it out without installing it on a machine of your own. There aren’t even any provided binaries: you must build Otter yourself. I hope to be able to improve this situation, but it involves dealing with cloud CI and containers and so on, which can all be rather unpleasant.

Users on chiark will find an instance of Otter there.

diziet: (Default)

Last week I received (finally) my Fairphone 4, supplied with a de-googled operating system, which I had ordered from the E Foundation’s shop in December. (I am very hard on hardware and my venerable Fairphone 2 is really on its last legs.)

I expect to have full control over the software on any computing device I own which is as complicated, capable, and therefore, hazardous, as a mobile phone. Unfortunately the Eos image (they prefer to spell it “/e/ os”, srsly!) doesn’t come with a way to get root without taking fairly serious measures including unlocking the bootloader. Unlocking the bootloader wouldn’t be desirable for me but I can’t live without root. So.

I started with these helpful instructions: https://forum.xda-developers.com/t/fairphone-4-root.4376421/

I found the whole process a bit of a trial, and I thought I would write down what I did. But, it’s not straightforward, at least for someone like me who only has a dim understanding of all this Android stuff. Unfortunately, due to the number of missteps and restarts, what I actually did is not really a sensible procedure. So here is a retcon of a process I think will work:

Unlock the bootloader

The E Foundation provide instructions for unlocking the bootloader on a stock FP4, here https://doc.e.foundation/devices/FP4/install and they seem applicable to the “Murena” phone supplied with Eos pre-installed, too.

NB that unlocking the bootloader wipes the phone. So we do it first.

So:

  1. Power on the phone, with no SIM installed
  2. You get a welcome screen.
  3. Skip all things on startup including wifi
  4. Go to the very end of the settings, tap a gazillion times on the phone’s version until you’re a developer
  5. In the developer settings, allow usb debugging
  6. In the developer settings, allow oem bootloader unlocking
  7. Connect a computer via a USB cable, say yes on phone to USB debugging
  8. adb reboot bootloader
  9. The phone will reboot into a texty kind of screen, the bootloader
  10. fastboot flashing unlock
  11. The phone will reboot, back to the welcome screen
  12. Repeat steps 3-9 (maybe not all are necessary)
  13. fastboot flashing unlock_critical
  14. The phone will reboot, back to the welcome screen

Note that although you are running fastboot, you must run this command with the phone in “bootloader” mode, not “fastboot” (aka “fastbootd”) mode. If you run fastboot flashing unlock from fastboot you just get a “don’t know what you’re talking about”. I found conflicting instructions on what kind of Vulcan nerve pinches could be used to get into which boot modes, and had poor experiences with those. adb reboot bootloader always worked reliably for me.

Some docs say to run fastboot oem unlock; I used flashing. Maybe this depends on the Android tools version.

Initial privacy prep and OTA update

We want to update the supplied phone OS. The build mine shipped with is too buggy to run Magisk, the application we are going to use to root the phone. (With the pre-installed phone OS, Magisk crashes at the “patch boot image” step.) But I didn’t want to let the phone talk to Google, even for the push notifications registration.

  1. From the welcome screen, skip all things except location, date, time. Notably, do not set up wifi
  2. In settings, “microg” section
    1. turn off cloud messaging
    2. turn off google safetynet
    3. turn off google registration (NB you must do this after the other two, because their sliders become dysfunctional after you turn google registration off)
    4. turn off both location modules
  3. In settings, location section, turn off allowed location for browser and magic earth
  4. Now go into settings and enable wifi, giving it your wifi details
  5. Tell the phone to update its operating system. This is a big download.

Install Magisk, the root manager

(As a starting point I used these instructions https://www.xda-developers.com/how-to-install-magisk/ and a lot of random forum posts.)

You will need the official boot.img. Bizarrely there doesn’t seem to be a way to obtain this from the phone. Instead, you must download it. You can find it by starting at https://doc.e.foundation/devices/FP4/install which links to https://images.ecloud.global/stable/FP4/. At the time of writing, the most recent version, whose version number seemed to correspond to the OS update I installed above, was IMG-e-0.21-r-20220123158735-stable-FP4.zip.

  1. Download the giant zipfile to your computer
  2. Unzip it to extract boot.img
  3. Copy the file to your phone’s “storage”. Eg, via adb: with the phone booted into the main operating system, using USB debugging, adb push boot.img /storage/self/primary/Download.
  4. On the phone, open the browser, and enter https://f-droid.org. Click on the link to install f-droid. You will need to enable installing apps from the browser (follow the provided flow to the settings, change the setting, and then use Back, and you can do the install). If you wish, you can download the f-droid apk separately on a computer, and verify it with pgp.
  5. Using f-droid, install Magisk. You will need to enable installing apps from f-droid. (I installed Magisk from f-droid because 1. I was going to trust f-droid anyway 2. it has a shorter URL than Magisk’s.)
  6. Open the Magisk app. Tell Magisk to install (Magisk, not the app). There will be only one option: patch boot file. Tell it to patch the boot.img file from before.
  7. Transfer the magisk_patched-THING.img back to your computer (eg via adb pull).
  8. adb reboot bootloader
  9. fastboot boot magisk_patched-THING.img (again, NB, from bootloader mode, not from fastboot mode)
  10. In Magisk you’ll see it shows as installed. But it’s not really; you’ve just booted from an image with it. Ask to install Magisk with “Direct install”.

After you have done all this, I believe that each time you do an over-the-air OS update, you must, between installing the update and rebooting the phone, ask Magisk to “Install to inactive slot (after OTA)”. Presumably if you forget you must do the fastboot boot dance again.

After all this, I was able to use tsu in Termux. There’s a strange behaviour with the root prompt you get apropos Termux’s request for root; I found that it definitely worked if Termux wasn’t the foreground app…

You have to leave the bootloader unlocked. However, as I understand it, the phone’s encryption will still prevent an attacker from hoovering the data out of your phone. The bootloader lock is to prevent someone tricking you into entering the decryption passkey into a trojaned device.

Other things to change

  • Probably, after you’re done with this, disable installing apps from the Browser. I will install Signal before doing that, since that’s not in f-droid because of mutual distrust between the f-droid and Signal folks. The permission is called “Install unknown apps”.

  • Turn off “instant apps” aka “open links in apps even if the app is not installed”. OMG WTF BBQ.

  • Turn off “wifi scanning even if wifi off”. WTF.

  • I turned off storage manager auto delete, on the grounds that I didn’t know what the phone might think of as “having been backed up”. I can manage my own space use, thanks very much.

There are probably other things to change. I have not yet transferred my Signal account from my old phone. It is possible that Signal will require me to re-enable the google push notifications, but I hope that having disabled them in microg it will be happy to use its own system, as it does on my old phone.

diziet: (Default)

The EU Digital Covid Certificate scheme is a format for (digitally signed) vaccination status certificates. Not only EU countries participate - the UK is now a participant in this scheme.

I am currently on my way to go skiing in the French Alps. So I needed a certificate that would be accepted in France. AFAICT the official way to do this is to get the “international” certificate from the NHS, and take it to a French pharmacy who will convert it into something suitably French. (AIUI the NHS “international” barcode is the same regardless of whether you get it via the NHS website, the NHS app, or a paper letter. NB that there is one barcode per vaccine dose so you have to get the right one - probably that means your booster since there’s a 9 month rule!)

I read on a forum somewhere that you could use the French TousAntiCovid app to convert the barcode. So I thought I would try that. The TousAntiCovid app is Free Software and on F-Droid, so I was happy to install and use it for this.

I also used the French TAC Verify app to check to see what barcodes were accepted. (I found an official document addressed to French professionals recommending this as an option for verifying the status of visitors to one’s establishment.) Unfortunately this involves a googlified phone, but one could use a burner phone or ask a friend who’s bitten that bullet already.

I discovered that, indeed:

  • My own NHS QR code is not accepted by TAC Verify
  • My own NHS QR code can be loaded into TousAntiCovid, and added to my “wallet” in the app
  • If I get TousAntiCovid to display that certificate, it shows a visually different QR code which TAC Verify accepts

This made me curious.

I used a QR code reader to decode both barcodes. The decodings were identical! A long string of guff starting HC1:. AIUI it is an encoded JWT. But there was a difference in the framing: Binary Eye reported that the NHS barcode used error correction level “M” (medium, aka 15%). The TousAntiCovid barcode used level “L” (low, 7%).

I had my QR code software regenerate a QR code at level “M” for the data from the TousAntiCovid code. The result was a QR code which is identical (pixel-wise) to the one from the NHS.

So the only difference is the error correction level. Curiously, both “L” (low, generated by TousAntiCovid, accepted by TAC Verify) and “M” (medium, generated by the NHS, rejected by TAC Verify) are lower than the “Q” (25%) recommended by what I think is the specification.

This is all very odd. But the upshot is that I think you can convert the NHS “international” barcode into something that should work in France simply by passing it through any QR code software to re-encode it at error correction level “L” (7%). But if you’re happy to use the TousAntiCovid app it’s probably a good way to store them.
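For concreteness, here is a minimal sketch of that conversion in Rust, using the qrcode crate. (This is not what I actually did; I’m assuming that crate’s current API, and the HC1: string is a stand-in for your own decoded barcode.)

    use qrcode::{EcLevel, QrCode};

    fn main() {
        // Stand-in for the payload decoded from the NHS barcode.
        let payload = "HC1:...";

        // Re-encode at error correction level L (7%), as TousAntiCovid does.
        let code = QrCode::with_error_correction_level(payload.as_bytes(), EcLevel::L)
            .expect("payload too large for a QR code");

        // Render as text; a real program would write out an image instead.
        println!("{}", code.render::<char>().quiet_zone(true).build());
    }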

I guess I’ll find out when I get to France if the converted NHS barcodes work in real establishments. Thanks to the folks behind sanipasse.fr for publishing some helpful background info and operating a Free Software backed public verification service.

Footnote

To compare the QR codes pixelwise, I roughly cropped the NHS PDF image using a GUI tool, and then on each of the two images used pnmcrop (to trim the border), pnmscale (to rescale the one-pixel-per-pixel output from Binary Eye) and pnmarith -difference to compare them (producing a pretty squirgly image showing just the pixel edges due to antialiasing).

diziet: (Default)

tl;dr: Faithfully following upstream semver, in Debian package dependencies, is a bad idea.

Introduction

I have been involved in Debian for a very long time. And I’ve been working with Rust for a few years now. Late last year I had cause to try to work on Rust things within Debian.

When I did, I found it very difficult. The Debian Rust Team were very helpful. However, the workflow and tooling require very large amounts of manual clerical work - work which it is almost impossible to do correctly since the information required does not exist. I had wanted to package a fairly straightforward program I had written in Rust, partly as a learning exercise. But, unfortunately, after I got stuck in, it looked to me like the effort would be wildly greater than I was prepared for, so I gave up.

Since then I’ve been thinking about what I learned about how Rust is packaged in Debian. I think I can see how to fix some of the problems. Although I don’t want to go charging in and try to tell everyone how to do things, I felt I ought at least to write up my ideas. Hence this blog post, which may become the first of a series.

This post is going to be about semver handling. I see problems with other aspects of dependency handling and source code management and traceability as well, and of course if my ideas find favour in principle, there are a lot of details that need to be worked out, including some kind of transition plan.

How Debian packages Rust, and build vs runtime dependencies

Today I will be discussing almost entirely build-dependencies; Rust doesn’t (yet?) support dynamic linking, so built Rust binaries don’t have Rusty dependencies.

However, things are a bit confusing because even the Debian “binary” packages for Rust libraries contain pure source code. So for a Rust library package, “building” the Debian binary package from the Debian source package does not involve running the Rust compiler; it’s just file-copying and format conversion. The library’s Rust dependencies do not need to be installed on the “build” machine for this.

So I’m mostly going to be talking about Depends fields, which are Debian’s way of talking about runtime dependencies, even though they are used only at build-time. The way this works is that some ultimate leaf package (which is supposed to produce actual executable code) Build-Depends on the libraries it needs, and those Depends on their under-libraries, so that everything needed is installed.

What do dependencies mean and what are they for anyway?

In systems where packages declare dependencies on other packages, it generally becomes necessary to support “versioned” dependencies. In all but the most simple systems, this involves an ordering (or similar) on version numbers and a way for a package A to specify that it depends on certain versions of B.

Both Debian and Rust have this. Rust upstream crates have version numbers and can specify their dependencies according to semver. Debian’s dependency system can represent that.

So it was natural for the designers of the scheme for packaging Rust code in Debian to simply translate the Rust version dependencies to Debian ones. However, while the two dependency schemes seem equivalent in the abstract, their concrete real-world semantics are totally different.

These different package management systems have different practices and different meanings for dependencies. (Interestingly, the Python world also has debates about the meaning and proper use of dependency versions.)

The epistemological problem

Consider some package A which is known to depend on B. In general, it is not trivial to know which versions of B will be satisfactory. I.e., whether a new B, with potentially-breaking changes, will actually break A.

Sometimes tooling can be used which calculates this (eg, the Debian shlibdeps system for runtime dependencies) but this is unusual - especially for build-time dependencies. Which versions of B are OK can normally only be discovered by a human consideration of changelogs etc., or by having a computer try particular combinations.

Few ecosystems with dependencies, in the Free Software community at least, make an attempt to precisely calculate the versions of B that are actually required to build some A. So it turns out that there are three cases for a particular combination of A and B: it is believed to work; it is known not to work; and it is not known whether it will work.

And, I am not aware of any dependency system that has an explicit machine-readable representation for the “unknown” state, so that they can say something like “A is known to depend on B; versions of B before v1 are known to break; version v2 is known to work”. (Sometimes statements like that can be found in human-readable docs.)

That leaves two possibilities for the semantics of a dependency A depends B, version(s) V..W: Precise: A will definitely work if B matches V..W, and Optimistic: We have no reason to think B breaks with any of V..W.

At first sight the latter does not seem useful, since how would the package manager find a working combination? Taking Debian as an example, which uses optimistic version dependencies, the answer is as follows: the primary information about what package versions to use is not so much the dependencies as which Debian release is being targeted. (Other systems using optimistic version dependencies could use the date of the build, i.e. use only packages that are “current”.)

The two approaches compare as follows:

People involved in version management:
  • Precise: package developers; downstream developers/users.
  • Optimistic: package developers; downstream developers/users; distribution QA and release managers.

Package developers declare versions V and dependency ranges V..W so that:
  • Precise: it definitely works.
  • Optimistic: a wide range of B can satisfy the declared requirement.

The principal version data used by the package manager:
  • Precise: only dependency versions.
  • Optimistic: contextual, eg releases - set(s) of packages available.

Version dependencies are for:
  • Precise: selecting working combinations (out of all that ever existed).
  • Optimistic: sequencing (ordering) of updates; QA.

Expected use pattern by a downstream:
  • Precise: downstream can combine any declared-good combination.
  • Optimistic: use a particular release of the whole system; mixing-and-matching requires additional QA and remedial work.

Downstreams are protected from breakage by:
  • Precise: pessimistically updating versions and dependencies whenever anything might go wrong.
  • Optimistic: whole-release QA.

A substantial deployment will typically contain:
  • Precise: multiple versions of many packages.
  • Optimistic: a single version of each package, except where there are actual incompatibilities which are too hard to fix.

Package updates are driven by:
  • Precise: top-down; the depending package updates the declared metadata.
  • Optimistic: bottom-up; the depended-on package is updated in the repository for the work-in-progress release.

So, while Rust and Debian have systems that look superficially similar, they contain fundamentally different kinds of information. Simply translating the Rust version dependencies directly into Debian doesn’t work.

What is currently done by the Debian Rust Team is to manually patch the dependency specifications, to relax them. This is very labour-intensive, and there is little automation supporting either decisionmaking or actually applying the resulting changes.

What to do

Desired end goal

To update a Rust package in Debian that many things depend on, one need simply update that package.

Debian’s sophisticated build and CI infrastructure will try building all the reverse-dependencies against the new version. Packages that actually fail against the new dependency are flagged as suffering from release-critical problems.

Debian Rust developers then update those other packages too. If the problems turn out to be too difficult, it is possible to roll back.

If a problem with a depending package is not resolved in a timely fashion, priority is given to updating core packages, and the depending package falls by the wayside (since it is empirically unmaintainable, given available effort).

There is no routine manual patching of dependency metadata (or of anything else).

Radical proposal

Debian should not precisely follow upstream Rust semver dependency information. Instead, Debian should optimistically try the combinations of packages that we want to have. The resulting breakages will be discovered by automated QA; they will have to be fixed by manual intervention of some kind, but usually, simply updating the depending package will be sufficient.

This no longer ensures (unlike the upstream Rust scheme) that the result is expected to build and work if the dependencies are satisfied. But as discussed, we don’t really need that property in Debian. More important is the new property we gain: that we are able to mix and match versions that we find work in practice, without a great deal of manual effort.

Or to put it another way, in Debian we should do as a Rust upstream maintainer does when they do the regular “update dependencies for new semvers” task: we should update everything, see what breaks, and fix those.

(In theory a Rust upstream package maintainer is supposed to do some additional checks or something. But the practices are not standardised and any checks one does almost never reveal anything untoward, so in practice I think many Rust upstreams just update and see what happens. The Rust upstream community has other mechanisms - often, reactive ones - to deal with any problems. Debian should subscribe to those same information sources, eg RustSec.)

Nobbling cargo

Somehow, when cargo is run to build Rust things against these Debian packages, cargo’s dependency system will have to be overridden so that the version of the package that is actually selected by Debian’s package manager is used by cargo without complaint.

We probably don’t want to change the Rust version numbers of Debian Rust library packages, so this should be done by either presenting cargo with an automatically-massaged Cargo.toml where the dependency version restrictions are relaxed, or by using a modified version of cargo which has special option(s) to relax certain dependencies.
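For instance, the massaging might look like this (an illustrative sketch with a made-up crate name, not the output of any actual tooling):

    # Upstream Cargo.toml says, semver-precisely:
    #     base-lib = "2.3.1"
    # The massaged version accepts whatever Debian has decided to ship:
    [dependencies]
    base-lib = ">=2"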

Handling breakage

Rust packages in Debian should already be provided with autopkgtests so that ci.debian.net will detect build breakages. Build breakages will stop the updated dependency from migrating to the work-in-progress release, Debian testing.

To resolve this, and allow forward progress, we will usually upload a new version of the dependency containing an appropriate Breaks, and either file an RC bug against the depending package, or update it. This can be done after the upload of the base package.

Thus, resolution of breakage due to incompatibilities will be done collaboratively within the Debian archive, rather than ad-hoc locally. And it can be done without blocking.

My proposal prioritises the ability to make progress in the core, over stability and in particular over retaining leaf packages. This is not Debian’s usual approach but given the Rust ecosystem’s practical attitudes to API design, versioning, etc., I think the instability will be manageable. In practice fixing leaf packages is not usually really that hard, but it’s still work and the question is what happens if the work doesn’t get done. After all, we always have a shortage of effort - and we probably still will, even if we get rid of the makework clerical work of patching dependency versions everywhere (so that usually no work is needed on depending packages).

Exceptions to the one-version rule

There will have to be some packages that we need to keep multiple versions of. We won’t want to update every depending package manually when this happens. Instead, we’ll probably want to set a version number split: rdepends which want version <X will get the old one.

Details - a sketch

I’m going to sketch out some of the details of a scheme I think would work. But I haven’t thought this through fully. This is still mostly at the handwaving stage. If my ideas find favour, we’ll have to do some detailed review and consider a whole bunch of edge cases I’m glossing over.

The dependency specification consists of two halves: the depending .deb‘s Depends (or, for a leaf package, Build-Depends), and the base .deb‘s Version and perhaps Breaks and Provides.

Even though libraries vastly outnumber leaf packages, we still want to avoid updating leaf Debian source packages simply to bump dependencies.

Dependency encoding proposal

Compared to the existing scheme, I suggest we implement the dependency relaxation by changing the depended-on package, rather than the depending one.

So we retain roughly the existing semver translation for Depends fields. But we drop all local patching of dependency versions.

Into every library source package we insert a new Debian-specific metadata file declaring the earliest version that we uploaded. When we translate a library source package to a .deb, the “binary” package build adds Provides for every previous version.
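Sketching the idea with a hypothetical crate foo (package names simplified for illustration; the real Debian Rust naming scheme is more elaborate):

    # Illustrative only - not the real Debian Rust naming scheme
    Package: librust-foo-dev
    Version: 2.3.0-1
    Provides: librust-foo-2.3-dev, librust-foo-2.2-dev, librust-foo-2.1-dev

Here 2.1 would be the earliest version we uploaded, so a depending package whose translated dependency asks for, say, librust-foo-2.2-dev is simply offered the 2.3.0 package.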

The effect is that when one updates a base package, the usual behaviour is to simply try to use it to satisfy everything that depends on that base package. The Debian CI will report the build or test failures of all the depending packages which the API changes broke.

We will have a choice, then:

Breakage handling - update broken depending packages individually

If there are only a few packages that are broken, for each broken dependency, we add an appropriate Breaks to the base binary package. (The version field in the Breaks should be chosen narrowly, so that it is possible to resolve it without changing the major version of the dependency, eg by making a minor source change.)

We can then do one of the following:

  • Update the dependency from upstream, to a version which works with the new base. (Assuming there is one.) This should be the usual response.

  • Fix the dependency source code so that it builds and works with the new base package. If this wasn’t just a backport of an upstream change, we should send our fix upstream. (We should prefer to update the whole package, rather than to backport an API adjustment.)

  • File an RC bug against the dependency (which will eventually trigger autoremoval), or preemptively ask for the Debian release managers to remove the dependency from the work-in-progress release.

Breakage handling - declare new incompatible API in Debian

If the API changes are widespread and many dependencies are affected, we should represent this by changing the in-Debian-source-package metadata to arrange for fewer Provides lines to be generated - withdrawing the Provides lines for earlier APIs.

Hopefully examination of the upstream changelog will show what the main compat break is, and therefore tell us which Provides we still want to retain.

This is like declaring Breaks for all the rdepends. We should do it if many rdepends are affected.

Then, for each rdependency, we must choose one of the responses in the bullet points above. In practice this will often be a mass bug filing campaign, or large update campaign.

Breakage handling - multiple versions

Sometimes there will be a big API rewrite in some package, and we can’t easily update all of the rdependencies because the upstream ecosystem is fragmented and the work involved in reconciling it all is too substantial.

When this happens we will bite the bullet and include multiple versions of the base package in Debian. The old version will become a new source package with a version number in its name.

This is analogous to how key C/C++ libraries are handled.

Downsides of this scheme

The first obvious downside is that assembling some arbitrary set of Debian Rust library packages, that satisfy the dependencies declared by Debian, is no longer necessarily going to work. The combinations that Debian has tested - Debian releases - will work, though. And at least, any breakage will affect only people building Rust code using Debian-supplied libraries.

Another less obvious problem is that because there is no such thing as Build-Breaks (in a Debian binary package), the per-package update scheme may result in no way to declare that a particular library update breaks the build of a particular leaf package. In other words, old source packages might no longer build when exposed to newer versions of their build-dependencies, taken from a newer Debian release. This is a thing that already happens in Debian, with source packages in other languages, though.

Semver violation

I am proposing that Debian should routinely compile Rust packages against dependencies in violation of the declared semver, and ship the results to Debian’s millions of users.

This sounds quite alarming! But I think it will not in fact lead to shipping bad binaries, for the following reasons:

The Rust community strongly values safety (in a broad sense) in its APIs. An API which is merely capable of insecure (or other seriously bad) use is generally considered to be wrong. For example, such situations are regarded as vulnerabilities by the RustSec project, even if there is no suggestion that any actually-broken caller source code exists, let alone that actually-broken compiled code is likely.

The Rust community also values alerting programmers to problems. Nontrivial semantic changes to APIs are typically accompanied not merely by a semver bump, but also by changes to names or types, precisely to ensure that broken combinations of code do not compile.

Or to look at it another way, in Debian we would simply be doing what many Rust upstream developers routinely do: bump the versions of their dependencies, and throw it at the wall and hope it sticks. We can mitigate the risks the same way a Rust upstream maintainer would: when updating a package we should of course review the upstream changelog for any gotchas. We should look at RustSec and other upstream ecosystem tracking and authorship information.

Difficulties for another day

As I said, I see some other issues with Rust in Debian.

  • I think the library “feature flag” encoding scheme is unnecessary. I hope to explain this in a future essay.

  • I found Debian’s approach to handling the source code for its Rust packages quite awkward; and, it has some troubling properties. Again, I hope to write about this later.

  • I get the impression that updating rustc in Debian is a very difficult process. I haven’t worked on this myself and I don’t feel qualified to have opinions about it. I hope others are thinking about how to make things easier.

Thanks all for your attention!

diziet: (Default)

I have accepted a job with the Tor Project.

I joined XenSource to work on Xen in late 2007, as XenSource was being acquired by Citrix. So I have been at Citrix for about 14 years. I have really enjoyed working on Xen. There have been a variety of great people. I'm very proud of some of the things we built and achieved. I'm particularly proud of being part of a community that has provided the space for some of my excellent colleagues to really grow.

But the opportunity to go and work on a project I am so ideologically aligned with, and to work with Rust full-time, is too good to resist. That Tor is mostly written in C is quite terrifying, and I'm very happy that I'm going to help fix that!

diziet: (Default)

Rust is definitely in the news. I'm definitely on the bandwagon. (To me it feels like I've been wanting something like Rust for many years.) There're a huge number of intro tutorials, and of course there's the Rust Book.

A friend observed to me, though, that while there's a lot of "write your first simple Rust program" there's a dearth of material aimed at the programmer who already knows a dozen diverse languages, and is familiar with computer architecture, basic type theory, and so on. Or indeed, for the impatient and confident reader more generally. I thought I would have a go.

Rust for the Polyglot Programmer is the result.

Compared to much other information about Rust, Rust for the Polyglot Programmer is:

  • Dense: I assume a lot of starting knowledge. Or to look at it another way: I expect my reader to be able to look up and digest non-Rust-specific words or concepts.

  • Broad: I cover not just the language and tools, but also the library ecosystem, development approach, community ideology, and so on.

  • Frank: much material about Rust has a tendency to gloss over or minimise the bad parts. I don't do that. That also frees me to talk about strategies for dealing with the bad parts.

  • Non-neutral: I'm not afraid to recommend particular libraries, for example. I'm not afraid to extol Rust's virtues in the areas where it does well.

  • Terse, and sometimes shallow: I often gloss over what I see as unimportant or fiddly details; instead I provide links to appropriate reference materials.

After reading Rust for the Polyglot Programmer, you won't know everything you need to know to use Rust for any project, but should know where to find it.

Thanks are due to Simon Tatham, Mark Wooding, Daniel Silverstone, and others, for encouragement, and helpful reviews including important corrections. Particular thanks to Mark Wooding for wrestling pandoc and LaTeX into producing a pretty good-looking PDF. Remaining errors are, of course, mine.

Comments are welcome of course, via the Dreamwidth comments or Salsa issue or MR. (If you're making a contribution, please indicate your agreement with the Developer Certificate of Origin.)

edited 2021-09-29 16:58 UTC to fix Salsa link target, and 17:01 and 17:21 for minor grammar fixes

diziet: (Default)

This post is about some changes recently made to Rust's ErrorKind, which aims to categorise OS errors in a portable way.

Audiences for this post

  • The educated general reader interested in a case study involving error handling, stability, API design, and/or Rust.
  • Rust users who have tripped over these changes. If this is you, you can cut to the chase and skip to How to fix.

Background and context

Error handling principles

Handling different errors differently is often important (although, sadly, often neglected). For example, if a program tries to read its default configuration file, and gets a "file not found" error, it can proceed with its default configuration, knowing that the user hasn't provided a specific config.

If it gets some other error, it should probably complain and quit, printing the message from the error (and the filename). Otherwise, if the network fileserver is down (say), the program might erroneously run with the default configuration and do something entirely wrong.

Rust's portability aims

The Rust programming language tries to make it straightforward to write portable code. Portable error handling is always a bit tricky. One of Rust's facilities in this area is std::io::ErrorKind which is an enum which tries to categorise (and, sometimes, enumerate) OS errors. The idea is that a program can check the error kind, and handle the error accordingly.

That these ErrorKinds are part of the Rust standard library means that to get this right, you don't need to delve down and get the actual underlying operating system error number, and write separate code for each platform you want to support. You can check whether the error is ErrorKind::NotFound (or whatever).

Because ErrorKind is so important in many Rust APIs, some code which isn't really doing an OS call can still have to provide an ErrorKind. For this purpose, Rust provides a special category ErrorKind::Other, which doesn't correspond to any particular OS error.

Rust's stability aims and approach

Another thing Rust tries to do is keep existing code working. More specifically, Rust tries to:

  1. Avoid making changes which would contradict the previously-published documentation of Rust's language and features.
  2. Tell you if you accidentally rely on properties which are not part of the published documentation.

By and large, this has been very successful. It means that if you write code now, and it compiles and runs cleanly, it is quite likely that it will continue to work properly in the future, even as the language and ecosystem evolves.

This blog post is about a case where Rust failed to do (2), above, and, sadly, it turned out that several people had accidentally relied on something the Rust project definitely intended to change. Furthermore, it was something which needed to change. And the new (corrected) way of using the API is not so obvious.

Rust enums, as relevant to io::ErrorKind

(Very briefly:)

When you have a value which is an io::ErrorKind, you can compare it with specific values:

    if error.kind() == ErrorKind::NotFound { ...
  
But in Rust it's more usual to write something like this (which you can read like a switch statement):
    match error.kind() {
      ErrorKind::NotFound => use_default_configuration(),
      _ => panic!("could not read config file {}: {}", &file, &error),
    }
  

Here _ means "anything else". Rust insists that match statements are exhaustive, meaning that each one covers all the possibilities. So if you left out the line with the _, it wouldn't compile.

Rust enums can also be marked non_exhaustive, which is a declaration by the API designer that they plan to add more kinds. This has been done for ErrorKind, so the _ is mandatory, even if you write out all the possibilities that exist right now: this ensures that if new ErrorKinds appear, they won't stop your code compiling.
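For illustration, this is roughly what such a declaration looks like, and what the compiler does about it (my sketch, not the stdlib source):

    #[non_exhaustive]
    pub enum ErrorKind {
        NotFound,
        PermissionDenied,
        // ... many more, with more to come ...
    }

    // In a downstream crate, a match without `_` does not compile,
    // even if its arms cover every variant that exists today:
    // error[E0004]: non-exhaustive patterns: `_` not covered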

Improving the error categorisation

The set of error categories stabilised in Rust 1.0 was too small. It missed many important kinds of error. This makes writing error-handling code awkward. In any case, we expect to add new error categories occasionally. I set about trying to improve this by proposing new ErrorKinds. This obviously needed considerable community review, which is why it took about 9 months.

The trouble with Other and tests

Rust has to assign an ErrorKind to every OS error, even ones it doesn't really know about. Until recently, it mapped all errors it didn't understand to ErrorKind::Other - reusing the category for "not an OS error at all".

Serious people who write serious code like to have serious tests. In particular, testing error conditions is really important. For example, you might want to test your program's handling of disk full, to make sure it didn't crash, or corrupt files. You would set up some contraption that would simulate a full disk. And then, in your tests, you might check that the error was correct.

But until very recently (still now, in Stable Rust), there was no ErrorKind::StorageFull. You would get ErrorKind::Other. If you were diligent you would dig out the OS error code (and check for ENOSPC on Unix, corresponding Windows errors, etc.). But that's tiresome. The more obvious thing to do is to check that the kind is Other.

Obvious but wrong. ErrorKind is non_exhaustive, implying that more error kinds will appear, and, naturally, these would more finely categorise previously-Other OS errors.

Unfortunately, the documentation note

“Errors that are Other now may move to a different or a new ErrorKind variant in the future.”

was only added in May 2020. So the wrongness of the “obvious” approach was, itself, not very obvious. And even with that docs note, there was no compiler warning or anything.

The unfortunate result is that there is a body of code out there in the world which might break any time an error that was previously Other becomes properly categorised. Furthermore, there was nothing stopping new people writing new obvious-but-wrong code.

Chosen solution: Uncategorized

The Rust developers wanted an engineered safeguard against the bug of assuming that a particular error shows up as Other. They chose the following solution:

There is a new ErrorKind::Uncategorized, which is now used for all OS errors for which there isn’t a more specific categorisation. The fallback translation of unknown errors was changed from Other to Uncategorized.

This is de jure justified by the fact that this enum has always been marked non_exhaustive. But in practice because this bug wasn't previously detected, there is such code in the wild. That code now breaks (usually, in the form of failing test cases). Usually when Rust starts to detect a particular programming error, it is reported as a new warning, which doesn't break anything. But that's not possible here, because this is a behavioural change.

The new ErrorKind::Uncategorized is marked unstable. This makes it impossible to write code on Stable Rust which insists that an error comes out as Uncategorized. So, one cannot now write code that will break when new ErrorKinds are added. That's the intended effect.

The downside is that this does break old code, and, worse, it is not as clear as it should be what the fixed code looks like.

Alternatives considered and rejected by the Rust developers

Not adding more ErrorKinds

This was not tenable. The existing set is already too small, and error categorisation is in any case expected to improve over time.

Just adding ErrorKinds as had been done before

This would mean occasionally breaking test cases (or, possibly, production code) when an error that was previously Other becomes categorised. The broken code would have been “obvious”, but de jure wrong, just as it is now. So this option amounts to expecting this broken code to continue to be written, and continuing to break it occasionally.

Somehow using Rust's Edition system

The Rust language has a system to allow language evolution, where code declares its Edition (2015, 2018, 2021). Code from multiple editions can be combined, so that the ecosystem can upgrade gradually.

It's not clear how this could be used for ErrorKind, though. Errors have to be passed between code with different editions. If those different editions had different categorisations, the resulting programs would have incoherent and broken error handling.

Also some of the schemes for making this change would mean that new ErrorKinds could only be stabilised about once every 3 years, which is far too slow.

How to fix code broken by this change

Most main-line error handling code already has a fallback case for unknown errors. Simply replacing any occurrence of Other with _ is right.
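In terms of the earlier example, and hedging on the exact shape of your code, the change is typically just:

    match error.kind() {
        ErrorKind::NotFound => use_default_configuration(),
        // Previously perhaps: ErrorKind::Other => panic!(...),
        _ => panic!("could not read config file {}: {}", &file, &error),
    }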

How to fix thorough tests

The tricky problem is tests. Typically, a thorough test case wants to check that the error is "precisely as expected" (as far as the test can tell). Now that unknown errors come out as an unstable Uncategorized variant that's not so easy. If the test is expecting an error that is currently not categorised, you want to write code that says "if the error is any of the recognised kinds, call it a test failure".

What does “any of the recognised kinds” mean here? It doesn’t mean any of the kinds recognised by the version of the Rust stdlib that is actually in use. That set might get bigger. When the test is compiled and run later, perhaps years later, the error in this test case might indeed be categorised. What you actually mean is “the error must not be any of the kinds which existed when the test was written”.

IMO therefore the right solution for such a test case is to cut and paste the current list of stable ErrorKinds into your code. This will seem wrong at first glance, because the list in your code and in Rust can get out of step. But when they do get out of step you want your version, not the stdlib's. So freezing the list at a point in time is precisely right.

You probably only want to maintain one copy of this list, so put it somewhere central in your codebase's test support machinery. Periodically, you can update the list deliberately - and fix any resulting test failures.
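A sketch of what I mean (the list here is my understanding of the kinds that were stable around the time of writing; the whole point is that you freeze your own copy against your own rustc):

    use std::io::ErrorKind;

    /// Kinds we knew about when this list was frozen. Do NOT casually
    /// extend this when the stdlib grows: review the test failures first.
    pub fn recognised_when_frozen(kind: ErrorKind) -> bool {
        use ErrorKind::*;
        matches!(kind,
            NotFound | PermissionDenied | ConnectionRefused | ConnectionReset
                | ConnectionAborted | NotConnected | AddrInUse | AddrNotAvailable
                | BrokenPipe | AlreadyExists | WouldBlock | InvalidInput
                | InvalidData | TimedOut | WriteZero | Interrupted
                | Unsupported | UnexpectedEof | OutOfMemory | Other)
    }

A test case expecting a currently-uncategorised error can then assert !recognised_when_frozen(err.kind()).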

Unfortunately this approach is not suggested by the documentation. In theory you could work all this out yourself from first principles, given even the situation prior to May 2020, but it seems unlikely that many people have done so. In particular, cutting and pasting the list of recognised errors would seem very unnatural.

Conclusions

This was not an easy problem to solve well. I think Rust has done a plausible job given the various constraints, and the result is technically good.

It is a shame that this change to make the error handling stability more correct caused the most trouble for the most careful people who write the most thorough tests. I also think the docs could be improved.

edited shortly after posting, and again 2021-09-22 16:11 UTC, to fix HTML slips

diziet: (Default)

tl;dr

dgit clone sourcepackage gets you the source code, as a git tree, in ./sourcepackage. cd into it and dpkg-buildpackage -uc -b.

Do not use: "VCS" links on official Debian web pages like tracker.debian.org; "debcheckout"; searching Debian's gitlab (salsa.debian.org). These are good for Debian experts only.

If you use Debian's "official" source git repo links you can easily build a package without Debian's patches applied.[1] This can even mean missing security patches. Or maybe it can't even be built in a normal way (or at all).

OMG WTF BBQ, why?

It's complicated. There is History.

Debian's "most-official" centralised source repository is still the Debian Archive, which is a system based on tarballs and patches. I invented the Debian source package format in 1992/3 and it has been souped up since, but it's still tarballs and patches. This system is, of course, obsolete, now that we have modern version control systems, especially git.

Maintainers of Debian packages have invented ways of using git anyway, of course. But this is not standardised. There is a bewildering array of approaches.

The most common approach is to maintain a git tree containing a pile of *.patch files, which are then often maintained using quilt. Yes, really, many Debian people are still using quilt, despite having git! There is machinery for converting this git tree containing a series of patches, to an “official” source package. If you don’t use that machinery, and just build from git, nothing applies the patches.

[1] This post was prompted by a conversation with a friend who had wanted to build a Debian package, and didn’t know to use dgit. They had got the source from salsa via a link on tracker.d.o, and built .debs without Debian’s patches. This is not a theoretical unsoundness, but a very real practical risk.

Future is not very bright

In 2013 at the Debconf in Vaumarcus, Joey Hess, myself, and others, came up with a plan to try to improve this which we thought would be deployable. (Previous attempts had failed.) Crucially, this transition plan does not force change onto any of Debian's many packaging teams, nor onto people doing cross-package maintenance work. I worked on this for quite a while, and at a technical level it is a resounding success.

Unfortunately there is a big limitation. At the current stage of the transition, to work at its best, this replacement scheme hopes that maintainers who update a package will use a new upload tool. The new tool fits into their existing Debian git packaging workflow and has some benefits, but it does make things more complicated rather than less (like any transition plan must, during the transitional phase). When maintainers don't use this new tool, the standardised git branch seen by users is a compatibility stub generated from the tarballs-and-patches. So it has the right contents, but useless history.

The next step is to allow a maintainer to update a package without dealing with tarballs-and-patches at all. This would be massively more convenient for the maintainer, so an easy sell. And of course it links the tarballs-and-patches to the git history in a proper machine-readable way.

We held a "git packaging requirements-gathering session" at the Curitiba Debconf in 2019. I think the DPL's intent was to try to get input into the git workflow design problem. The session was a great success: my existing design was able to meet nearly everyone's needs and wants. The room was obviously keen to see progress. The next stage was to deploy tag2upload. I spoke to various key people at the Debconf and afterwards in 2019 and the code has been basically ready since then.

Unfortunately, deployment of tag2upload is mired in politics. It was blocked by a key team because of unfounded security concerns; positive opinions from independent security experts within Debian were disregarded. Of course it is always hard to get a team to agree to something when it's part of a transition plan which treats their systems as an obsolete setup retained for compatibility.

Current status

If you don’t know about Debian’s git packaging practices (eg, you have no idea what “patches-unapplied packaging branch without .pc directory” means), and don’t want to learn about them, you must use dgit to obtain the source of Debian packages. There is a lot more information and detailed instructions in dgit-user(7).

Hopefully either the maintainer did the best thing, or, if they didn't, you won't need to inspect the history. If you are a Debian maintainer, you should use dgit push-source to do your uploads. This will make sure that users of dgit will see a reasonable git history.

edited 2021-09-15 14:48 Z to fix a typo

diziet: (Default)

tl;dr: Please recommend me a high-level Rust server-side web framework which is sync and does not plan to move to an async api.

Why

Async Rust gives somewhat higher performance. But it is considerably less convenient and ergonomic than using threads for concurrency. Much work is ongoing to improve matters, but I doubt async Rust will ever be as easy to write as sync Rust.

"Even" sync multithreaded Rust is very fast (and light on memory use) compared to many things people write web apps in. The vast majority of web applications do not need the additional performance (which is typically a percentage, not a factor).

So it is rather disappointing to find that all the review articles I read, and all the web framework authors, seem to have assumed that async is the inevitable future. There should be room for both sync and async. Please let universal use of async not be the inevitable future!

What

I would like a web framework that provides a sync API (something like Rocket 0.4's API would be ideal) and will remain sync. It should probably use (async) hyper underneath.

So far I have not found one single web framework on crates.io that neither is already async nor suggests that its authors intend to move to an async API. Some review articles I found even excluded sync frameworks entirely!

Answers in the comments please :-).

diziet: (Default)

tl;dr:
With these two crazy proc-macros you can hand out multiple (perhaps mutable) references to suitable subsets/views of the same struct.

Why

In Otter I have adopted a style where I try to avoid giving code mutable access that doesn't need it, and try to make mutable access come with some code structures to prevent "oh I forgot a thing" type mistakes. For example, mutable access to a game state is only available in contexts that have to return a value for the updates to send to the players. This makes it harder to forget to send the update.

But there is a downside. The game state is inside another struct, an Instance, and much code needs (immutable) access to it. I can't pass both &Instance and &mut GameState because one is inside the other.
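A minimal sketch of the conflict (names invented for illustration; Otter’s real types are bigger):

    struct GameState { /* ... */ }
    struct Instance { gs: GameState, /* many other fields */ }

    fn do_update(gs: &mut GameState, inst: &Instance) { /* ... */ }

    fn caller(inst: &mut Instance) {
        do_update(&mut inst.gs, inst);
        // error[E0502]: cannot borrow `*inst` as immutable
        // because it is also borrowed as mutable
    }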

My workaround involves passing separate references to the other fields of Instance, leading to some functions taking far too many arguments. 14 in one case. (They're all different types so argument ordering mistakes just result in compiler errors talking about arguments 9 and 11 having wrong types, rather than actual bugs.)

I felt this problem was purely a restriction arising from limitations of the borrow checker. I thought it might be possible to improve on it. Weeks passed and the question gradually wormed its way into my consciousness. Eventually, I tried some experiments. Encouraged, I persisted.

What and how

partial-borrow is a Rust library which solves this problem. You sprinkle #[derive(PartialBorrow)] and partial!(...) and then you can pass a reference which grants mutable access to only some of the fields. You can also pass a reference through which some fields are inaccessible. You can even split a single mut reference into multiple compatible references, for example granting mut access to mutually-nonoverlapping subsets.

The core type is Struct__Partial (for some Struct). It is a zero-sized type, but we prevent anyone from constructing one. Instead we magic up references to it, always ensuring that they have the same address as some Struct. The fields of Struct__Partial are also ZSTs that exist only as references, and they Deref to the actual field (subject to compile-time borrow compatibility checking).

Soundness and testing

partial-borrow is primarily a nontrivial procedural macro which autogenerates reams of unsafe. Of course I think it's sound, but I thought that the last two times before I added a test which demonstrated otherwise. So it might be fairer to say that I have tried to make it sound and that I don't know of any problems...

Reasoning about the correctness of macro-generated code is not so easy. One problem is that there is nowhere good to put the kind of soundness arguments you would normally add near uses of unsafe.

I decided to solve this by annotating an instance of the macro output. There's a not very complicated script using diff3 to help fold in changes if the macro output changes - merge conflicts there mean a possible re-review of the argument text. Of course I also have test cases that run with miri, and test cases for expected compiler errors for uses that need to be forbidden for soundness.

But this is quite hairy and I'm worried that it might be rather "my first insane unsafe contraption". Also the pointer/reference trickery is definitely subtle, and depends heavily on knowing what Rust's aliasing and pointer provenance rules really are. Stacked Borrows is not entirely trivial to reason about in fiddly corner cases.

So for now I have only called it 0.1.0 and left a note in the docs. I haven't actually made Otter use it yet but that's the rather more boring software integration part, not the fun "can I do this mad thing" part so I will probably leave that for a rainy day. Possibly a rainy day after someone other than me has looked at partial-borrow (preferably someone who understands Stacked Borrows...).

Fun!

This was great fun. I even enjoyed writing the docs.

The proc-macro programming environment is not entirely straightforward and there are a number of things to watch out for. For my first non-adhoc proc-macro this was, perhaps, ambitious. But you don't learn anything without trying...

edited 2021-09-02 16:28 UTC to fix a typo

diziet: (Default)

Summary

I have just tagged nailing-cargo/1.0.0. nailing-cargo is a wrapper around the Rust build tool cargo. nailing-cargo can:

  • Work around cargo’s problem with totally unpublished local crates (by temporarily massaging your Cargo.toml)
  • Make it easier to work in Rust without giving the Rust environment (and every author of every one of your dependencies!) control of your main account. (Privsep, with linkfarming where needed.)
  • Tweak a few defaults.

Background and history

It's not really possible to make a nontrivial Rust project without using cargo. But the build process automatically downloads and executes code from crates.io, which is a minimally-curated repository. I didn't want to expose my main account to that.

And, at the time, I was working on a project for which I was also writing a library as a dependency, and I found that cargo couldn't cope with this unless I were to commit (to my git repository) the path (on my local laptop) of my dependency.
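
The shape of the problem, as a hypothetical Cargo.toml fragment (the crate name and path are made up): a totally unpublished crate can only be named by a path dependency, and that machine-specific path then wants to live in the version-controlled Cargo.toml.

    # Hypothetical fragment illustrating the problem.  The path is
    # specific to one machine, but Cargo.toml is tracked in git:
    [dependencies]
    mylibrary = { path = "/home/me/src/mylibrary" }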

I filed some bugs, including about the unpublished crate problem. But also, I was stubborn enough to try to find a workaround that didn't involve committing junk to my git history. The result was a short but horrific shell script.

I wrote about this at the time (March 2019).

Over the last few years the difficulties I have with cargo have remained unresolved. I found my interactions with upstream rather discouraging. It didn't seem like I would get anywhere by trying to help improve cargo to better support my needs.

So instead I have gradually improved nailing-cargo. It is now a Perl script. It is rather less horrific, and has proper documentation (sorry, JS needed because GitLab).

Why Perl ?

Rust would have been my language of choice. But I wanted to avoid a chicken-and-egg situation. When you're doing privsep, nailing-cargo has to run in your more privileged environment. I wanted something easy to get going with.

nailing-cargo has to contain a TOML parser; and I found a small one, TOML-Tiny, which was good enough as a starting point, and small enough I could bundle it as a git subtree. Perl is nicely fast to start up (nailing-cargo --- true runs in about 170ms on my laptop), and it is easy to write a Perl script that will work on pretty much any Perl installation.

Still unsolved: embedding cargo in another build system

A number of my projects contain a mixture of Rust code with other languages. Unfortunately, nailing-cargo doesn't help with the problems which arise trying to integrate cargo into another build system.

I generally resort to find runes for finding Rust source files that might influence cargo, and stamp files for seeing if I have run it recently enough; and I simply live with the fact that cargo sometimes builds more stuff than I needed it to.

Future

There are a number of ways nailing-cargo could be improved.

Notably, the need to overwrite your actual Cargo.toml is very annoying, even if nailing-cargo puts it back afterwards. A big problem with this is that it means that nailing-cargo has to take a lock while your cargo rune runs. This effectively prevents using nailing-cargo with long-running processes, notably editor integrations like rls and racer. I could perhaps solve this with more linkfarm-juggling, but that wouldn't help in-tree builds and it's hard to keep things up to date.

I am considering using LD_PRELOAD trickery or maybe bwrap(1) to "implement" the alternative Cargo.toml feature which was rejected by cargo upstream in 2019 (and again in April when someone else asked).

Currently there is no support for using sudo for out-of-tree privsep. This should be easy to add, but it needs someone who uses sudo to want it (and to test it!). The documentation has some other discussion of limitations, some of which aren't too hard to improve.

Patches welcome!

diziet: (Default)

I have just disconnected from irc.freenode.net for the last time. You should do the same. The awful new de facto operators are using user numbers as a public justification for their behaviour. Specifically, I recommend that you:

  • Move your own channels to Libera or OFTC
  • If you have previously been known to be generally around on Freenode, connect to Libera (the continuity network set up by the Freenode staff), and register your usual nick(s) there.
  • Disconnect from freenode so that you don't count as one of their users.

Note that mentioning libera in the channel topic of your old channels on freenode is likely to get your channel forcibly taken over by the new de facto operators of freenode. They won't tolerate you officially directing people to the competition.

I did an investigation and writeup of this situation for the Xen Project. It's a little out of date - it doesn't have the latest horrible behaviours from the new regime - but I think it is worth pasting it here:

Message-ID: <24741.12566.639691.461134@mariner.uk.xensource.com>
From: Ian Jackson <iwj@xenproject.org>
To: xen-devel@lists.xenproject.org
CC: community.manager@xenproject.org
Subject: IRC networks
Date: Wed, 19 May 2021 16:39:02 +0100

Summary:

We have for many years used the Freenode IRC network for real-time
chat about Xen.  Unfortunately, Freenode is undergoing a crisis.

There is a dispute between, on the one hand, Andrew Lee, and on the
other hand, all (or almost all) Freenode volunteer staff.  We must
make a decision.

I have read all the publicly available materials and asked around with
my contacts.  My conclusions:

 * We do not want to continue to use irc.freenode.*.
 * We might want to use libera.chat, but:
 * Our best option is probably to move to OFTC https://www.oftc.net/


Discussion:

Firstly, my starting point.

I have been on IRC since at least 1993.  Currently my main public
networks are OFTC and Freenode.

I do not have any personal involvement with public IRC networks.  Of
the principals in the current Freenode dispute, I have only heard of
one, who is a person I have experience of in a Debian context but have
not worked closely with.

George asked me informally to use my knowledge and contacts to shed
light on the situation.  I decided that, having done my research, I
would report more formally and publicly here rather than just
informally to George.


Historical background:

 * Freenode has had drama before.  In about 2001 OFTC split off from
   Freenode after an argument over governance.  IIRC there was drama
   again in 2006.  A significant proportion of the Free Software world,
   including Debian, now use OFTC.  Debian switched in 2006.

Facts that I'm (now) pretty sure of:

 * Freenode's actual servers run on donated services; that is,
   the hardware is owned by those donating the services, and the
   systems are managed by Freenode volunteers, known as "staff".

 * The freenode domain names are currently registered to a limited
   liability company owned by Andrew Lee (rasengan).

 * At least 10 Freenode staff have quit in protest, writing similar
   resignation letters protesting about Andrew Lee's actions [1].  It
   does not appear that Andrew Lee has the public support of any
   Freenode staff.

 * Andrew Lee claims that he "owns" Freenode.[2]

 * A large number of channel owners for particular Free Software
   projects who previously used Freenode have said they will switch
   away from Freenode.

Discussion and findings on Freenode:

There is, as might be expected, some murk about who said what to whom
when, what promises were made and/or broken, and so on.  The matter
was also complicated by the leaking earlier this week of draft(s) of
(at least one of) the Freenode staffers' resignation letters.

Andrew Lee has put forward a position statement [2].  A large part of
the thrust of that statement is allegations that the current head of
Freenode staff, tomaw, "forced out" the previous head, christel.  This
allegation is strongly disputed by all those current (resigning)
Freenode staff I have seen comment.  In any case it does not seem to
be particularly germane; in none of my reading did tomaw seem to be
playing any kind of leading role.  tomaw is not mentioned in the
resignation letters.

Some of the links led to me to logs of discussions on #freenode.  I
read some of these in particular[3].  NB I haven't been able to verify
that these logs have not been tampered with.  Having said that and
taking the logs at face value, I found the rasengan writing there to
be disingenuous and obtuse.

Andrew Lee has been heavily involved in Bitcoin.  Bitcoin is a hive of
scum and villainy, a pyramid scheme, and an environmental disaster,
all rolled into one.  This does not make me think well of Lee.

Additionally, it seems that Andrew Lee has been involved in previous
governance drama involving a different IRC network, Snoonet.

I have come to the very firm conclusion that we should have nothing to
do with Andrew Lee, and avoid using services that he has some
effective control over.

Alternatives:

The departing Freenode staff are setting up a replacement,
"libera.chat".  This is operational but still suffering from teething
problems and of course has a significant load as it deals with an
influx of users on a new setup.

On the staff and trust question: As I say, I haven't heard of any of
the Freenode staff, with one exception.  Unfortunately the one
exception does not inspire confidence in me[4] - although NB that is
only one data point.

On the other hand, Debian has had many many years of drama-free
involvement with OFTC.  OFTC has a formal governance arrangement and
it is associated with Software in the Public Interest.  I notice that
the last few OFTC annual officer elections have been run partly by
Steve McIntyre.  Steve is a friend of mine (and he is a former Debian
Project Leader) and I take his involvement as a good sign.

I recommend that we switch to using OFTC as soon as possible.


Ian.


References:

Starting point for the resigning Freenode staff's side [1]:
  https://gist.github.com/joepie91/df80d8d36cd9d1bde46ba018af497409

Andrew Lee's side [2]:
  https://gist.github.com/realrasengan/88549ec34ee32d01629354e4075d2d48

[3]
https://paste.sr.ht/~ircwright/7e751d2162e4eb27cba25f6f8893c1f38930f7c4

[4] I won't give the name since I don't want to be shitposting.

diziet: (Default)

In April I wrote about releasing Otter, which has been one of my main personal projects during the pandemic.

Uploadable game materials

Otter comes with playing cards and a chess set, and some ancillary bits and bobs. Until now, if you wanted to play with something else, the only way was to read some rather frightening build instructions to add your pieces to Otter itself, or to dump a complicated structure of extra files into the server install.

Now that I have released Otter 0.6.0, you can upload a zipfile to the server. The format of the zipfile is even documented!

Otter development - I still love Rust

Working on Otter has been great fun, partly because it has involved learning lots of new stuff, and partly because it's mostly written in Rust. I love Rust; one of my favourite aspects is the way it's possible to design a program so that most of your mistakes become compile errors. This means you spend more time dealing with compiler errors and less time peering at a debugger trying to figure out why it has gone wrong.

Future plans - help wanted!

So far, Otter has been mostly a one-person project. I would love to have some help. There are two main areas where I think improvement is very important:
  • I want it to be much easier to set up an Otter server yourself, for you and your friends. Currently there are complete build instructions, but nowadays these things can be automated. Ideally it would be possible to set up an instance by button pushing and the payment of a modest sum to a cloud hosting provider.
  • Right now, the only way to set up a game, manage the players in it, reset the gaming table, and so on, is using the otter command line tool. It's not so bad a tool, but having something pointy clicky would make the system much more accessible. But GUIs of this kind are a lot of work.

If you think you could help with these, and playing around with a game server sounds fun, do get in touch.

For now, next on my todo list is to provide a nicely cooked git-forge-style ssh transport facility, so that you don't need a shell account on the server but can run the otter command line client tool locally.

diziet: (Default)

One of the things that I found most vexing about lockdown was that I was unable to play some of my favourite board games. There are online systems for many games, but not all. And online systems cannot support games like Mao where the players make up the rules as they go along.

I had an idea for how to solve this problem, and set about implementing it. The result is Otter (the Online Table Top Environment Renderer).

We have played a number of fun games of Penultima with it, and have recently branched out into Mao. The Otter is now ready to be released!

More about Otter

(cribbed shamelessly from the README)

Otter, the Online Table Top Environment Renderer, is an online game system.

But it is not like most online game systems. It does not know (nor does it need to know) the rules of the game you are playing. Instead, it lets you and your friends play with common tabletop/boardgame elements such as hands of cards, boards, and so on.

So it’s something like a “tabletop simulator” (but it does not have any 3D, or a physics engine, or anything like that).

This means that with Otter:

  • Supporting a new game, that Otter doesn’t know about yet, would usually not involve writing or modifying any computer programs.

  • If Otter already has the necessary game elements (cards, say) all you need to do is write a spec file saying what should be on the table at the start of the game. For example, most Whist variants that start with a standard pack of 52 cards are already playable.

  • You can play games where the rules change as the game goes along, or are made up by the players, or are too complicated to write as a computer program.

  • House rules are no problem, since the computer isn’t enforcing the rules - you and your friends are.

  • Everyone can interact with different items on the game table, at any time. (Otter doesn’t know about your game’s turn-taking, so doesn’t know whose turn it might be.)

Installation and usage

Otter is fully functional, but the installation and account management arrangements are rather unsophisticated and un-webby. And there is not currently any publicly available instance you can use to try it out.

Users on chiark will find an instance there.

Other people who are interested in hosting games (of Penultima or Mao, or other games we might support) will have to find a Unix host or VM to install Otter on, and will probably want help from a Unix sysadmin.

Otter is distributed via git, and is available on Salsa, Debian's gitlab instance.

There is documentation online.

Future plans

I have a number of ideas for improvement, which go off in many different directions.

Quite high up on my priority list is making it possible for players to upload and share game materials (cards, boards, pieces, and so on), rather than just using the ones which are bundled with Otter itself (or dumping files ad-hoc on the server). This will make it much easier to play new games. One big reason I wrote Otter is that I wanted to liberate boardgame players from the need to implement their game rules as computer code.

The game management and account management is currently done with a command line tool. It would be lovely to improve that, but making a fully-featured management web ui would be a lot of work.

Screenshots!


diziet: (Default)
There is a serious problem with Dreamwidth, which is impeding access for many RSS reader tools.

This started at around 0500 UTC on Wednesday morning, according to my own RSS reader cron job. A friend found #43443 in the DW ticket tracker, where a user of a minority web browser found they were blocked.

Local tests demonstrated that Dreamwidth had applied blocking by the HTTP User-Agent header, and were rejecting all user-agents not specifically permitted. Today, this rule has been relaxed and unknown user-agents are permitted. But user-agents for general http client libraries are still blocked.

I'm aware of three unresolved tickets about this: #43444 #43445 #43447

We're told there by a volunteer member of Dreamwidth's support staff that this has been done deliberately for "blocking automated traffic". I'm sure the volunteer is just relaying what they've been told by whoever is struggling to deal with what I suppose is probably a spam problem. But it's still rather unsatisfactory.

I have suggested in my own ticket that a good solution might be to apply the new block only to posting and commenting (eg, maybe, by applying it only to HTTP POST requests). If the problem is indeed spam then that ought to be good enough, and would still let RSS readers work properly.

I'm told that this new blocking has been done by "implementing" (actually, configuring or enabling) "some AWS rules for blocking automated traffic". I don't know what facilities AWS provides. This kind of helplessness is of course precisely the kind of thing that the Free Software movement is against and precisely the kind of thing that proprietary services like AWS produce.

I don't know if this blog entry will appear on planet.debian.org and on other people's readers and aggregators. I think it will at least be seen by other Dreamwidth users. I thought I would post here in the hope that other Dreamwidth users might be able to help get this fixed. At the very least other Dreamwidth blog owners need to know that many of their readers may not be seeing their posts at all.

If this problem is not fixed I will have to move my blog. One of the main points of having a blog is publishing it via RSS. RSS readers are of course based on general http client libraries and many if not most RSS readers have not bothered to customise their user-agent. Those are currently blocked.

diziet: (Default)
I have signed the open letter calling for RMS to be removed from all leadership positions, including the GNU Project, and for the whole FSF board to step down.

Here is what I wrote in my commit message:

    I published my first Free Software in 1989, under the GNU General
    Public Licence.  I remain committed to the ideal of software freedom,
    for everyone.
    
    When the CSAIL list incident blew up I was horrified to read the
    stories of RMS's victims.  We have collectively failed those people
    and I am glad to see that many of us are working to make a better
    community.
    
    I have watched with horror as RMS has presided over, and condoned,
    astonishingly horrible behaviour in many GNU Project discussion
    venues.
    
    The Free Software Foundation Board are doing real harm by reinstating
    an abuser.  I had hoped for and expected better from them.
    
    RMS's vision and ideals of software freedom have been inspiring to me.
    But his past behaviour and current attitudes mean he must not and can
    not be in a leadership position in the Free Software community.

Comments are disabled. Edited 2021-03-23T21:44Z to fix a typo in the commit message.

diziet: (Default)

tl;dr: Do not configure Mailman to replace the mail domains in From: headers. Instead, try out my small new program which can make your Mailman transparent, so that DKIM signatures survive.

Background and narrative

DKIM

NB: This explanation is going to be somewhat simplified. I am going to gloss over some details and make some slightly approximate statements.

DKIM is a new anti-spoofing mechanism for Internet email, intended to help fight spam. DKIM, paired with the DMARC policy system, has been remarkably successful at stemming the flood of joe-job spams. As usually deployed, DKIM works like this:

When a message is originally sent, the author's MUA sends it to the MTA for their From: domain for outward delivery. The From: domain mailserver calculates a cryptographic signature of the message, and puts the signature in the headers of the message.

Obviously not the whole message can be signed, since at the very least additional headers need to be added in transit, and sometimes headers need to be modified too. The signing MTA gets to decide what parts of the message are covered by the signature: they nominate the header fields that are covered by the signature, and specify how to handle the body.

A recipient MTA looks up the public key for the From: domain in the DNS, and checks the signature. If the signature doesn't match, depending on policy (originator's policy, in the DNS, and recipient's policy of course), typically the message will be treated as spam.

The originating site has a lot of control over what happens in practice. They get to publish a formal (DMARC) policy in the DNS which advises recipients what they should do with mails claiming to be from their site. As mentioned, they can say which headers are covered by the signature - including the ability to sign the absence of a particular header - so they can control which headers downstreams can get away with adding or modifying. And they can set a normalisation policy, which controls how precisely the message must match the one that they sent.
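
By way of illustration, here is roughly what a (made-up) DKIM-Signature header looks like: the h= tag lists the signed headers, and c= selects the normalisation ("canonicalisation") policy, simple being the strict variant discussed below.

    DKIM-Signature: v=1; a=rsa-sha256; d=example.org; s=mail2020;
            c=simple/simple; h=from:to:subject:date:message-id;
            bh=...base64 hash of the canonicalised body...;
            b=...base64 signature over the selected headers and bh...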

Mailman

Mailman is, of course, the extremely popular mailing list manager. There are a lot of things to like about it. I choose to run it myself not just because it's popular but also because it provides a relatively competent web UI, relatively competent email (un)subscription interfaces, decent bounce handling, and a pretty good set of moderation and posting access controls.

The Xen Project mailing lists also run on mailman. Recently we had some difficulties with messages sent by Citrix staff (including myself), to Xen mailing lists, being treated as spam. Recipient mail systems were saying the DKIM signatures were invalid.

This was in fact true. Citrix has chosen a fairly strict DKIM policy; in particular, they have chosen "simple" normalisation - meaning that signed message headers must precisely match in syntax as well as in a semantic sense. Examining the failing-DKIM messages showed that this was definitely a factor.

Applying my Opinions about email

My Bayesian priors tend to suggest that a mail problem involving corporate email is the fault of the corporate email. However in this case that doesn't seem true to me.

My starting point is that I think mail systems should not modify messages unnecessarily. None of the DKIM-breaking modifications made by Mailman seemed necessary to me. I have on previous occasions gone to corporate IT and requested quite firmly that things I felt were broken should be changed. But it seemed wrong to go to corporate IT and ask them to change their published DKIM/DMARC policy to accommodate a behaviour in Mailman which I didn't agree with myself. I felt that instead I should put (with my Xen Project hat on) my own house in order.

Getting Mailman not to modify messages

So, I needed our Mailman to stop modifying the headers. I needed it to not even reformat them. A brief look at the source code to Mailman showed that this was not going to be so easy. Mailman has a lot of features whose very purpose is to modify messages.

Personally, as I say, I don't much like these features. I think the subject line tags, CC list manipulations, and so on, are a nuisance and not really Proper. But they are definitely part of why Mailman has become so popular and I can definitely see why the Mailman authors have done things this way. But these features mean Mailman has to disassemble incoming messages, and then reassemble them again on output. It is very difficult to do that and still faithfully reassemble the original headers byte-for-byte in the case where nothing actually wanted to modify them. There are existing bug reports[1] [2] [3] [4]; I can see why they are still open.

Rejected approach: From:-mangling

This situation is hardly unique to the Xen lists. Many others have struggled with it. The best solution anyone seems to have come up with so far is to turn on a new Mailman feature which rewrites the From: header of the messages that go through it, to contain the list's domain name instead of the originator's.

I think this is really pretty nasty. It breaks normal use of email, such as reply-to-author. It is having Mailman do additional mangling of the message in order to solve the problems caused by other undesirable manglings!

Solution!

As you can see, I asked myself: I want Mailman not to modify messages at all; how can I get it to do that? Given the existing structure of Mailman - with a lot of message-modifying functionality - that would really mean adding a bypass mode. It would have to spot, presumably depending on config settings, that messages were not to be edited; and then, it would avoid disassembling and reassembling the message at all, and bypass the message modification stages. The message would still have to be parsed of course - it's just that the copy sent out ought to be pretty much the incoming message.

When I put it to myself like that I had a thought: couldn't I implement this outside Mailman? What if I took a copy of every incoming message, and then post-process Mailman's output to restore the original?

It turns out that this is quite easy and works rather well!

outflank-mailman

outflank-mailman is a 233-line script, plus documentation, installation instructions, etc.

It is designed to run from your MTA, on all messages going into, and coming from, Mailman. On input, it saves a copy of the message in a sqlite database, and leaves a note in a new Outflank-Mailman-Id header. On output, it does some checks, finds the original message, and then combines the original incoming message with carefully-selected headers from the version that Mailman decided should be sent.

This was deployed for the Xen Project lists on Tuesday morning and it seems to be working well so far.

If you administer Mailman lists, and fancy some new software to address this problem, please do try it out.

Matters arising - Mail filtering, DKIM

Overall I think DKIM is a helpful contribution to the fight against spam (unlike SPF, which is fundamentally misdirected and also broken). Spam is an extremely serious problem; most receiving mail servers experience more attempts to deliver spam than real mail, by orders of magnitude. But DKIM is not without downsides.

Inherent in the design of anything like DKIM is that arbitrary modification of messages by list servers is no longer possible. In principle it might be possible to design a system which tolerated modifications reasonable for mailing lists but it would be quite complicated and have to somehow not tolerate similar modifications in other contexts.

So DKIM means that lists can no longer add those unsubscribe footers to mailing list messages. The "new" way (RFC2369, July 1998) to do this is with the List-Unsubscribe header. Hopefully a good MUA will be able to deal with unsubscription semiautomatically, and I think by now an adequate MUA should at least display these headers by default.
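
Such a header looks something like this (a made-up example; RFC2369 allows one or more angle-bracketed URIs):

    List-Unsubscribe: <mailto:example-list-request@lists.example.org?subject=unsubscribe>,
     <https://lists.example.org/unsubscribe/example-list>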

Sender:

There are implications for recipient-side filtering too. The "traditional" correct way to spot mailing list mail was to look for Resent-To:, which can be added without breaking DKIM; the "new" (RFC2919, March 2001) correct way is List-Id:, likewise fine. But during the initial deployment of outflank-mailman I discovered that many subscribers were detecting that a message was list traffic by looking at the Sender: header. I'm told that some mail systems (apparently Microsoft's included) make it inconvenient to filter on List-Id.

Really, I think a mailing list ought not to be modifying Sender:. Given Sender:'s original definition and semantics, there might well be reasonable reasons for a mailing list posting to have a different From: and Sender:, and then the original Sender: ought not to be lost. And a mailing list's operation does not fit well into the original definition of Sender:. I suspect that list software likes to put in Sender: mostly for historical reasons; notably, a long time ago it was not uncommon for broken mail systems to send bounces to the Sender: header rather than the envelope sender (SMTP MAIL FROM).

DKIM makes this more of a problem. Unfortunately the DKIM specifications are vague about what headers one should sign, but they pretty much definitely include Sender: if it is present, and some materials encourage signing the absence of Sender:. The latter is Exim's default configuration when DKIM-signing is enabled.

Frankly there seems little excuse for systems to not readily support and encourage filtering on List-Id, 20 years later, but I don't want to make life hard for my users. For now we are running a compromise configuration: if there wasn't a Sender: in the original, take Mailman's added one. This will result in (i) misfiltering for some messages whose poster put in a Sender:, and (ii) DKIM failures for messages whose originating system signed the absence of a Sender:. I'm going to mine the db for some stats after it's been deployed for a week or so, to see which of these problems is worst and decide what to do about it.

Mail routing

For DKIM to work, messages being sent From: a particular mail domain must go through a system trusted by that domain, so they can be signed.

Most users tend to do this anyway: their mail provider gives them an IMAP server and an authenticated SMTP submission server, and they configure those details in their MUA. The MUA has a notion of "accounts" and according to the user's selection for an outgoing message, connects to the authenticated submission service (usually using TLS over the global internet).

Trad unix systems where messages are sent using the local sendmail or localhost SMTP submission (perhaps by automated systems, or perhaps by human users) are fine too. The smarthost can do the DKIM signing.

But this solution is awkward for a user of a trad MUA in what I'll call "alias account" setups: where a user has an address at a mail domain belonging to different people from those who run the system on which they run their MUA (perhaps even several such aliases for different hats). Traditionally this worked by the mail domain forwarding the incoming mail, and the user simply self-declaring their identity at the alias domain. Without DKIM there is nothing stopping anyone self-declaring their own From: line.

If DKIM is to be enabled for such a user (preventing people forging mail as that user), the user will have to somehow arrange that their trad unix MUA's outbound mail stream goes via their mail alias provider. For a single-user sending unix system this can be done with tolerably complex configuration in an MTA like Exim. For shared systems this gets more awkward and might require some hairy shell scripting etc.

edited 2020-10-01 21:22 and 21:35 and -02 10:50 +0100 to fix typos and 21:28 to linkify "my small program" in the tl;dr

diziet: (Default)

Any software system has underlying design principles, and any software project has process rules. But I seem to be seeing, more and more often, a pathological pattern where abstract and shakily-grounded broad principles, and even contrived and sophistic objections, are used to block sensible changes.

Today I will go through an example in detail, before ending with a plea:

PostgreSQL query planner, WITH [MATERIALIZED] optimisation fence

Background history

PostgreSQL has a sophisticated query planner which usually gets the right answer. For good reasons, the pgsql project has resisted providing lots of knobs to control query planning. But there are a few ways to influence the query planner, for when the programmer knows more than the planner.

One of these is the use of a WITH common table expression. In pgsql versions prior to 12, the planner would first make a plan for the WITH clause; and then, it would make a plan for the second half, counting the WITH clause's likely output as a given. So WITH acts as an "optimisation fence".

This was documented in the manual - not entirely clearly, but a careful reading of the docs reveals this behaviour:

The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards.

Users (authors of applications which use PostgreSQL) have been using this technique for a long time.

New behaviour in PostgreSQL 12

In PostgreSQL 12 upstream were able to make the query planner more sophisticated. In particular, it is now often capable of looking "into" the WITH common table expression. Much of the time this will make things better and faster.

But if WITH was being used for its side-effect as an optimisation fence, this change will break things: queries that ran very quickly in earlier versions might now run very slowly. Helpfully, pgsql 12 still has a way to specify an optimisation fence: specifying WITH ... AS MATERIALIZED in the query.
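
Concretely, with a made-up query: in pgsql 12 the planner may optimise through the first form; the second form restores the fence, but is a syntax error in versions before 12.

    -- PostgreSQL 12 may look "into" this CTE and optimise through it:
    WITH w AS (
        SELECT * FROM big_table WHERE expensive_predicate(col)
    )
    SELECT * FROM w WHERE id = 42;

    -- AS MATERIALIZED restores the pre-12 optimisation fence,
    -- but versions before 12 reject this syntax:
    WITH w AS MATERIALIZED (
        SELECT * FROM big_table WHERE expensive_predicate(col)
    )
    SELECT * FROM w WHERE id = 42;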

So far so good.

Upgrade path for existing users of WITH fence

But what about the upgrade path for existing users of the WITH fence behaviour? Such users will have to update their queries to add AS MATERIALIZED. This is a small change. Having to update a query like this is part of routine software maintenance and not in itself very objectionable. However, this change cannot be made in advance because pgsql versions prior to 12 will reject the new syntax.

So the users are in a bit of a bind. The old query syntax can be unusably slow with the new database, and the new syntax is rejected by the old database. Upgrading both the database and the application, in lockstep, is a flag day upgrade, which every good sysadmin will want to avoid.

A solution to this problem

Colin Watson proposed a very simple solution: make the earlier PostgreSQL versions accept the new MATERIALIZED syntax. This is correct since the new syntax specifies precisely the actual behaviour of the old databases. It has no deleterious effect on any users of older pgsql versions. It makes it possible to add the new syntax to the application, before doing the database upgrade, decoupling the two upgrades.

Colin Watson even provided an implementation of this proposal.

The solution is rejected by upstream

Unfortunately upstream did not accept this idea. You can read the whole thread yourself if you like. But in summary, the objections were (italic indicates literal quotes):

  • New features don't gain a backpatch. This is a project policy. Of course this is not a new feature, and if it is, an exception should be made. This was explained clearly in the thread.
  • I'm not sure the "we don't want to upgrade application code at the same time as the database" is really tenable. This is quite an astonishing statement, particularly given the multiple users who said they wanted to do precisely that.
  • I think we could find cases where we caused worse breaks between major versions. Paraphrasing: "We've done worse things in the past so we should do this bad thing too". WTF?
  • One disadvantage is that this will increase confusion for users, who'll get used to the behavior on 12, and then they'll get confused on older releases. This seems highly contrived. Surely the number of people who are likely to suffer this confusion is tiny. Providing the new syntax in old versions (including of course the appropriate changes to the docs everywhere) might well make such confusion less rather than more likely.
  • [Poster is concerned about] 11.6 and up allowing a syntax that 11.0-11.5 don't. People are likely to write code relying on this and then be surprised when it doesn't work on a slightly older server. And, similarly: we'll then have a lot more behavior differences between minor releases. Again this seems a contrived and unconvincing objection. As that first poster even notes: Still, is that so much different from cases where we fix a bug that prevented some statement from working? No, it isn't.
  • if we started looking we'd find many changes every year that we could justify partially or completely back-porting on similar grounds ... we'll certainly screw it up sometimes. This is a slippery slope argument. But there is no slippery slope: in particular, the proposed change does not change any of the substantive database logic, and the upstream developers will hardly have any difficulty rejecting future more risky backport proposals.
  • if you insist on using the same code with pre-12 and post-12 releases, this should be achievable (at least in most cases) by using the "offset 0" trick. What? First I had heard of it but this in fact turns out to be true! Read more about this, below...

I find these extremely unconvincing, even taken together. Many of them are very unattractive things to hear one's upstream saying.

At best they are a knee-jerk and inflexible application of very general principles. The authors of these objections seem to have lost sight of the fact that these principles have a purpose. When these kinds of software principles work against their purposes, they should be revised, or exceptions made.

At worst, it looks like a collective effort to find reasons - any reasons, no matter how bad - not to make this change.

The OFFSET 0 trick

One of the responses in the thread mentions OFFSET 0. As part of writing the queries in the Xen Project CI system, and preparing for our system upgrade, I had carefully read the relevant pgsql documentation. This OFFSET 0 trick was new to me.

But, now that I know the answer, it is easy to provide the right search terms and find, for example, this answer on stackmumble. Apparently adding a no-op OFFSET 0 to the subquery defeats the pgsql 12 query planner's ability to see into the subquery.
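
So the fence in my earlier example can be spelled in a way that versions both before and after 12 accept (again, a made-up query):

    -- The no-op pagination clause stops the planner pulling the
    -- subquery up into the outer query, before and after pgsql 12:
    SELECT * FROM (
        SELECT * FROM big_table WHERE expensive_predicate(col)
        OFFSET 0
    ) AS w
    WHERE id = 42;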

I think OFFSET 0 is the better approach since it's more obviously a hack showing that something weird is going on, and it's unlikely we'll ever change the optimiser behaviour around OFFSET 0 ... wheras hopefully CTEs will become inlineable at some point [CTEs did, in fact, become inlineable by default in PostgreSQL 12].

So in fact there is a syntax for an optimisation fence that is accepted by both earlier and later PostgreSQL versions. It's even recommended by pgsql devs. It's just not documented, and is described by pgsql developers as a "hack". Astonishingly, the fact that it is a "hack" is given as a reason to use it!

Well, I have therefore deployed this "hack". No doubt it will stay in our codebase indefinitely.

Please don't be like that!

I could come up with a lot more examples of other projects that have exhibited similar arrogance. It is becoming a plague! But every example is contentious, and I don't really feel I need to annoy a dozen separate Free Software communities. So I won't make a laundry list of obstructiveness.

If you are an upstream software developer, or a distributor of software to users (eg, a distro maintainer), you have a lot of practical power. In theory it is Free Software so your users could just change it themselves. But for a user or downstream, carrying a patch is often an unsustainable amount of work and risk. Most of us have patches we would love to be running, but which we haven't even written because simply running a nonstandard build is too difficult, no matter how technically excellent our delta.

As an upstream, it is very easy to get into a mindset of defending your code's existing behaviour, and to turn your project's guidelines into inflexible rules. Constant exposure to users who make silly mistakes, and rudely ask for absurd changes, can lead to core project members feeling embattled.

But there is no need for an upstream to feel embattled! You have the vast majority of the power over the software, and over your project communication fora. Use that power consciously, for good.

I can't say that arrogance will hurt you in the short term. Users of software with obstructive upstreams do not have many good immediate options. But we do have longer-term choices: we can choose which software to use, and we can choose whether to try to help improve the software we use.

After reading Colin's experience, I am less likely to try to help improve the experience of other PostgreSQL users by contributing upstream. It doesn't seem like there would be any point. Indeed, instead of helping the PostgreSQL community I am now using them as an example of bad practice. I'm only half sorry about that.

diziet: (Default)

tl;dr: Use MessagePack, rather than CBOR.

Introduction

I recently wanted to choose a binary encoding. This was for a project using Rust serde, so I looked at the list of formats there. I ended up reading about CBOR and MessagePack.

Both of these are binary formats for a JSON-like data model. Both of them are "schemaless", meaning you can decode them without knowing the structure. (This also provides some forwards compatibility.) They are, in fact, quite similar (although they are totally incompatible). This is no accident: CBOR is, effectively, a fork of MessagePack.

Both formats continue to exist and both are being used in new programs. I needed to make a choice but lacked enough information. I thought I would try to examine the reasons and nature of the split, and to make some kind of judgement about the situation. So I did a lot of reading [11]. Here are my conclusions.

History and politics

Between about 2010 and 2013 there was only MessagePack. Unfortunately, MessagePack had some problems. The biggest of these was that it lacked a separate string type. Strings were to be encoded simply as byte blocks. This caused serious problems for many MessagePack library implementors: for example, when decoding a MessagePack file the Python library wouldn't know whether to produce a Python bytes object, or a string. Straightforward data structures wouldn't round trip through MessagePack. [1] [2]

It seems that in late 2012 this came to the attention of someone with an IETF background. According to them, after unsatisfactory conversations with MessagePack upstream, they decided they would have to fork. They submitted an Internet-Draft for a partially-incompatible protocol [3] [4]. Little seemed to happen in the IETF until soon before the Orlando in-person IETF meeting in February 2013.[5]

These conversations sparked some discussion in the MessagePack issue tracker. There were long threads, including about process [1,2,4 ibid]. But there was also a useful technical discussion about proposed backward-compatible improvements to the MessagePack spec.[6] The prominent IETF contributor provided some helpful input in these discussions in the MessagePack community - but also pushed quite hard for a "tagging" system, a suggestion which was not accepted (see my technical analysis, below).

An improved MessagePack spec resulted, with string support, developed largely by the MessagePack community. It seems to have been available in useable form since mid-2013 and was officially published as canonical in August 2013.

Meanwhile a parallel process was pursued in the IETF, based on the IETF contributor's fork, with 11 Internet-Drafts from February[7] to September[8]. This seems to have continued even though the original technical reason for the fork - lack of string vs binary distinction - no longer applied. The IETF proponent expressed unhappiness about MessagePack's stewardship and process as much as they did about the technical details [4, ibid]. The IETF process culminated in the CBOR RFC[9].

The discussion on process questions between the IETF proponent and MessagePack upstream, in the MessagePack issue tracker [4, ibid] should make uncomfortable reading for IETF members. The IETF acceptance of CBOR despite clear and fundamental objections from MessagePack upstream[13] and indeed other respected IETF members[14], does not reflect well on the IETF. The much vaunted openness of the IETF process seems to have been rather one-sided. The IETF proponent here was an IETF Chair. Certainly the CBOR author was very well-spoken and constantly talked about politeness and cooperation and process; but what they actually did was very hostile. They accused the MessagePack community of an "us and them" attitude while simultaneously pursuing a forked specification!

The CBOR RFC does mention MessagePack in Appendix E.2. But not to acknowledge that CBOR was inspired by MessagePack. Rather, it does so to make a set of tendentious criticisms of MessagePack. Perhaps these criticisms were true when they were first written in an I-D but they were certainly false by the time the RFC was actually published, which occurred after the MessagePack improvement process was completely concluded, with a formal spec issued.

Since then both formats have existed in parallel. Occasionally people discuss which one is better, and sometimes it is alleged that "yes CBOR is the successor to MessagePack", which is not really fair.[9][10]

Technical differences

The two formats have a similar arrangement: initial byte which can encode small integers, or type and length, or type and specify a longer length encoding. But there are important differences. Overall, MessagePack is very significantly simpler.

Floating point

CBOR supports five floating point formats! Not only three sizes of IEEE754, but also decimal floating point, and bigfloats. This seems astonishing for a supposedly-simple format. (Some of these are supported via the semi-optional tag mechanism - see below.)

Indefinite strings and arrays

Like MessagePack, CBOR mostly precedes items with their length. But CBOR also supports "indefinite" strings, arrays, and so on, where the length is not specified at the beginning. The object (array, string, whatever) is terminated by a special "break" item. This seems to me to be a mistake. If you wanted the kind of application where MessagePack or CBOR would be useful, streaming sub-objects of unknown length is not that important. This possibility considerably complicates decoders.

CBOR tagging system

CBOR has a second layer of sort-of-type which can be attached to each data item. The set of possible tags is open-ended and extensible, but the CBOR spec itself gives tag values for: two kinds of date format; positive and negative bignums; decimal floats (see above); binary but expected to be encoded if converted to JSON (in base64url, base64, or base16); nestedly encoded CBOR; URIs; base64 data (two formats); regexps; MIME messages; and a special tag to make file(1) work.

In practice it is not clear how many of these are used, but a decoder must be prepared to at least discard them. The amount of additional spec complexity here is quite astonishing. IMO binary formats like this will (just like JSON) be used by a next layer which always has an idea of what the data means, including (where the data is a binary blob) what encoding it is in etc. So these tags are not useful.

These tags might look like a middle way between (i) extending the binary protocol with a whole new type such as an extension type (incompatible with old readers) and (ii) encoding your new kind of data in an existing type (leaving all readers who don't know the schema to print it as just integers or bytes or strings). But I think they are more trouble than they are worth.

The tags are uncomfortably similar to the ASN.1 tag system, which is widely regarded as one of ASN.1's unfortunate complexities.

MessagePack extension mechanism

MessagePack explicitly reserves some encoding space for users and for future extensions: there is an "extension type". The payload is an extension type byte plus some more data bytes; the data bytes are in a format to be defined by the extension type byte. Half of the possible extension byte values are reserved for future specification, and half are designated for application use. This is pleasingly straightforward.

(There is also one unused primary initial byte value, but that would be rejected by existing decoders and doesn't seem like a likely direction for future expansion.)

Minor other differences in integer encoding

The encodings of integers differ.

In MessagePack, signed and unsigned integers have different typecodes. In CBOR, signed and unsigned positive integers have the same typecodes; negative integers have a different set of typecodes. This means that a CBOR reader which knows it is expecting a signed value will have to do a top-bit-set check on the actual data value! And a CBOR writer must check the value to choose a typecode.

MessagePack reserves fewer shortcodes for small negative integers than for small positive integers.
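
To make this concrete, here is a hand-check of a few single-byte encodings, computed directly from the two specs (my own sketch; no serialisation library involved):

    // Hand-computed single-byte encodings from the two specs.
    fn msgpack_int(n: i8) -> Vec<u8> {
        match n {
            0..=127 => vec![n as u8],   // positive fixint: 0x00-0x7f
            -32..=-1 => vec![n as u8],  // negative fixint: 0xe0-0xff
            _ => unimplemented!("multi-byte encodings omitted"),
        }
    }

    fn cbor_int(n: i8) -> Vec<u8> {
        if n >= 0 {
            assert!(n <= 23, "multi-byte encodings omitted");
            vec![n as u8]                  // major type 0, value in low 5 bits
        } else {
            let m = (-1 - n as i16) as u8; // CBOR stores -1 - n
            assert!(m <= 23, "multi-byte encodings omitted");
            vec![0x20 | m]                 // major type 1
        }
    }

    fn main() {
        // Small positive integers come out identical:
        assert_eq!(msgpack_int(5), vec![0x05]);
        assert_eq!(cbor_int(5), vec![0x05]);
        // Small negative integers differ: MessagePack's negative fixint
        // is the two's-complement byte; CBOR uses a separate major type.
        assert_eq!(msgpack_int(-3), vec![0xfd]);
        assert_eq!(cbor_int(-3), vec![0x22]);
    }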

Conclusions and lessons

MessagePack seems to have been prompted into fixing the missing string type problem, but only by the threat of a fork. However, this fork went ahead even after MessagePack clearly accepted the need for a string type. MessagePack had a fixed protocol spec before the IETF did.

The continued pursuit of the IETF fork was ostensibly motivated by a disapproval of the development process, and in particular a sense that the IETF process was superior. However, it seems to me that the IETF process was abused by CBOR's proponent, who just wanted things their own way. I have seen claims by IETF proponents that the open decisionmaking system inherently produces superior results. However, in this case the IETF process produced a bad specification. To the extent that other IETF contributors had influence over the ultimate CBOR RFC, I don't think they significantly improved it. CBOR has been described as MessagePack bikeshedded by the IETF. That would have been bad enough, but I think it's worse than that. To a large extent CBOR is one person's NIH-induced bad design rubber stamped by the IETF. CBOR's problems are not simply matters of taste: it's significantly overcomplicated.

One lesson for the rest of us is that although being the upstream and nominally in charge of a project seems to give us a lot of power, it's wise to listen carefully to one's users and downstreams. Once people are annoyed enough to fork, the fork will have a life of its own.

Another lesson is that many of us should be much warier of the supposed moral authority of the IETF. Many IETF standards are awful (Oauth 2 [12]; IKE; DNSSEC; the list goes on). Sometimes (especially when network adoption effects are weak, as with MessagePack vs CBOR) better results can be obtained from a smaller group, or even an individual, who simply need the thing for their own uses.

Finally, governance systems of public institutions like the IETF need to be robust in defending the interests of outsiders (and hence of society at large) against eloquent insiders who know how to work the process machinery. Any institution which nominally serves the public good faces a constant risk of devolving into self-servingness. This risk gets worse the more powerful and respected the institution becomes.

References

  1. #13: First-class string type in serialization specification (MessagePack issue tracker, June 2010 - August 2013)
  2. #121: Msgpack can't differentiate between raw binary data and text strings (MessagePack issue tracker, November 2012 - February 2013)
  3. draft-bormann-apparea-bpack-00: The binarypack JSON-like representation format (IETF Internet-Draft, October 2012)
  4. #129: MessagePack should be developed in an open process (MessagePack issue tracker, February 2013 - March 2013)
  5. Re: JSON mailing list and BoF (IETF apps-discuss mailing list message from Carsten Bormann, 18 February 2013)
  6. #128: Discussions on the upcoming MessagePack spec that adds the string type to the protocol (MessagePack issue tracker, February 2013 - August 2013)
  7. draft-bormann-apparea-bpack-01: The binarypack JSON-like representation format (IETF Internet-Draft, February 2013)
  8. draft-bormann-cbor: Concise Binary Object Representation (CBOR) (IETF Internet-Drafts, May 2013 - September 2013)
  9. RFC 7049: Concise Binary Object Representation (CBOR) (October 2013)
  10. "MessagePack should be replaced with [CBOR] everywhere ..." (floatboth on Hacker News, 8th April 2017)
  11. Discussion with very useful set of history links (camgunz on Hacker News, 9th April 2017)
  12. OAuth 2.0 and the Road to Hell (Eran Hammer, blog posting from 2012, via Wayback Machine)
  13. Re: [apps-discuss] [Json] msgpack/binarypack (Re: JSON mailing list and BoF) (IETF list message from Sadyuki Furuhashi, 4th March 2013)
  14. "no apologies for complaining about this farce" (IETF list message from Phillip Hallam-Baker, 15th August 2013)
Edited 2020-07-14 18:55 to fix a minor formatting issue, and 2020-07-14 22:54 to fix two typos

diziet: (Default)
I have been convinced by the arguments that it's not nice to keep using the word master for the default git branch. Regardless of the etymology (which is unclear), some people say they have negative associations for this word. Changing this upstream in git is complicated on a technical level and, sadly, contested.

But git is flexible enough that I can make this change in my own repositories. Doing so is not even so difficult.

So:

Announcement

I intend to rename master to trunk in all repositories owned by my personal hat. To avoid making things very complicated for myself I will just delete refs/heads/master when I make this change. So there may be a little disruption to downstreams.

I intend to make this change everywhere eventually. But rather than front-loading the effort, I'm going to do this to repositories as I come across them anyway. That will allow me to update all the docs references, any automation, etc., at a point when I have those things in mind anyway. Also, doing it this way will allow me to focus my effort on the most active projects, and avoids me committing to a sudden large pile of fiddly clerical work. But: if you have an interest in any repository in particular that you want updated, please let me know so I can prioritise it.
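
For one repository the mechanics are roughly as follows (a sketch: I'm assuming the remote is called origin, and any server-side default-branch setting has to be updated separately):

    git branch -m master trunk        # rename the local branch
    git push origin trunk             # publish the new branch
    git push origin --delete master   # delete refs/heads/master
    # ...then fix anything that names "master": the remote HEAD /
    # default branch setting, docs references, automation, etc.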

Bikeshed

Why "trunk"? "Main" has been suggested elswewhere, and it is often a good replacement for "master" (for example, we can talk very sensibly about a disk's Main Boot Record, MBR). But "main" isn't quite right for the VCS case; for example a "main" branch ought to have better quality than is typical for the primary development branch.

Conversely, there is much precedent for "trunk". "Trunk" was used to refer to this concept by at least SVN, CVS, RCS and CSSC (and therefore probably SCCS) - at least in the documentation, although in some of these cases the command line API didn't have a name for it.

So "trunk" it is.

Aside: two other words - passlist, blocklist

People are (finally!) starting to replace "blacklist" and "whitelist". Seriously, why has it taken everyone this long?

I have been using "blocklist" and "passlist" for these concepts for some time. They are drop-in replacements. I have also heard "allowlist" and "denylist" suggested, but they are cumbersome and cacophonous.

Also "allow" and "deny" seem to more strongly imply an access control function than merely "pass" and "block", and the usefulness of passlists and blocklists extends well beyond access control: protocol compatibility and ABI filtering are a couple of other use cases.
