
If you are an email system administrator, you are probably using DKIM to sign your outgoing emails. You should be rotating the key regularly and automatically, and publishing old private keys. I have just released dkim-rotate 1.0; dkim-rotate is a tool to do this key rotation and publication.

If you are an email user, your email provider ought to be doing this. If this is not done, your emails are “non-repudiable”, meaning that if they are leaked, anyone (eg, journalists, haters) can verify that they are authentic, and prove that to others. This is not desirable (for you).



Recently, we managed to get secnet and hippotat into Debian. They are on track to go into Debian bookworm. This completes, in Debian, the set of VPN/networking tools that I (and other Greenend folks) have been using for many years.

The Sinister Greenend Organisation’s suite of network access tools consists mainly of:

  • secnet - VPN.
  • hippotat - IP-over-HTTP (workaround for bad networks)
  • userv ipif - user-created network interfaces

secnet

secnet is our very mature VPN system.

Its basic protocol idea is similar to that in Wireguard, but it’s much older. Differences from Wireguard include:

  • Comes with some (rather clumsy) provisioning tooling, supporting almost any desired virtual network topology. In the SGO we have a complete mesh of fixed sites (servers), and a number of roaming hosts (clients), each of which can have one or more sites as its home.

  • No special kernel drivers required. Everything is userspace.

  • An exciting “polypath” mode where packets are sent via multiple underlying networks in parallel, offering increased reliability for roaming hosts.

  • Portable to non-Linux platforms.

  • A much older, and less well audited, codebase.

  • Very flexible configuration arrangements, but things are also under-documented and to an extent under-productised.

  • Hasn’t been ported to phones/tablets.

secnet was originally written by Stephen Early, starting in 1996 or so. I inherited it some years ago and have been maintaining it since. It’s mostly written in C.

Hippotat

Hippotat is best described by copying the intro from the docs:

Hippotat is a system to allow you to use your normal VPN, ssh, and other applications, even in broken network environments that are only ever tested with “web stuff”.

Packets are parcelled up into HTTP POST requests, resembling form submissions (or JavaScript XMLHttpRequest traffic), and the returned packets arrive via the HTTP response bodies.

It doesn’t rely on TLS tunnelling so can work even if the local network is trying to intercept TLS. I recently rewrote Hippotat in Rust.
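
For illustration only, here is a tiny Python sketch of the general idea - this is not Hippotat’s actual wire protocol (nor its Rust implementation), and the endpoint URL is made up. Each HTTP round trip carries outbound packet data in the POST body, and the response body brings back whatever the server has queued for this client:

    import urllib.request

    SERVER = "http://proxy.example.org/hippotat"   # hypothetical endpoint

    def exchange(outbound: bytes) -> bytes:
        # Send queued outbound packet data as a POST; the response body
        # carries whatever packets the server has waiting for us.
        req = urllib.request.Request(
            SERVER,
            data=outbound,
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()

The real system is considerably more sophisticated, of course; this just shows the shape of the transport.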

userv ipif

userv ipif is one of the userv utilities.

It allows safe delegation of network routing to unprivileged users. The delegation is of a specific address range, so different ranges can be delegated to different users, and the authorised user cannot interfere with other traffic.

This is used in the default configuration of hippotat packages, so that an ordinary user can start up the hippotat client as needed.

On chiark userv-ipif is used to delegate networking to users, including administrators of allied VPN realms. So chiark actually runs at least 4 VPN-ish systems in production: secnet, hippotat, Mark Wooding’s Tripe, and still a few links managed by the now-superseded udptunnel system.

userv

userv ipif is a userv service. That is, it is a facility which uses userv to bridge a privilege boundary.

userv is perhaps my most under-appreciated program. userv can be used to straightforwardly bridge (local) privilege boundaries on Unix systems.

So for example it can:

  • Allow a sysadmin to provide a shell script to be called by unprivileged users, but which will run as root. sudo can do this too but it has quite a few gotchas, and you have to be quite careful how you use it - and its security record isn’t great either.

  • Form the internal boundary in a privilege-separated system service. So, for example, the hippotat client is a program you can run from the command line as a normal user, if the relevant network addresses have been delegated to you. On chiark, CGI programs run as the providing user - not using suexec (which I don’t trust), but via userv.

userv services can be defined by the called user, not only by the system administrator. This allows a user to reconfigure or divert a system-provided default implementation, and even allows users to define and implement ad-hoc services of their own. (Although, the system administrator can override user config.)

Acknowledgements

Thanks for the help I had in this effort.

In particular, thanks to Sean Whitton for encouragement, and the ftpmaster review; and to the Debian Rust Team for their help navigating the complexities of handling Rust packages within the Debian Rust Team workflow.


Debian does not officially support upgrading from earlier than the previous stable release: you’re not supposed to “skip” releases. Instead, you’re supposed to upgrade to each intervening major release in turn.

However, skipping intervening releases does, in fact, often work quite well. Apparently, this is surprising to many people, even Debian insiders. I was encouraged to write about it some more.

My personal experience

I have three conventionally-managed personal server systems (by which I mean systems which aren’t reprovisioned by some kind of automation). Of these at least two have been skip upgraded at least once:

The one I don’t think I’ve skip-upgraded (at least, not recently) is my house network manager (and now VM host), which I try to keep to a minimum in terms of functionality and which I keep quite up to date. It was crossgraded from i386 (32-bit) to amd64 (64-bit) fairly recently, which is a thing that Debian isn’t sure it supports. The crossgrade was done in a hurry and without any planning, prompted by Spectre et al suddenly requiring big changes to Xen. But it went well enough.

My home “does random stuff” server (media server, web cache, printing, DNS, backups etc.), has etckeeper records starting in 2015. I upgraded directly from jessie (Debian 8) to buster (Debian 10). I think it has probably had earlier skip upgrade(s): the oldest file in /etc is from December 1996 and I have been doing occasional skip upgrades as long as I can remember.

And of course there’s chiark, which is one of the oldest Debian installs in existence. I wrote about the most recent upgrade, where I went directly from jessie i386 ELTS (32-bit Debian 8) to bullseye amd64 (64-bit Debian 11). That was a very extreme case which required significant planning and pre-testing, since the package dependencies were in no way sufficient for the proper ordering. But, I don’t normally go to such lengths. Normally, even on chiark, I just edit the sources.list and see what apt proposes to do.

I often skip upgrade chiark because I tend to defer risky-looking upgrades partly in the hope of others fixing the bugs while I wait :-), and partly just because change is disruptive and amortising it is very helpful both to me and my users. I have some records of chiark’s upgrades from my announcements to users. As well as the recent “skip skip up cross grade, direct”, I definitely did a skip upgrade of chiark from squeeze (Debian 6) to jessie (Debian 8). It appears that the previous skip upgrade on chiark was rex (Debian 1.2) to hamm (Debian 2.0).

I don’t think it’s usual for me to choose to do a multi-release upgrade the “officially supported” way, in two (or more) stages, on a server. I have done that on systems with a GUI desktop setup, but even then I usually skip the intermediate reboot(s).

When to skip upgrade (and what precautions to take)

I’m certainly not saying that everyone ought to be doing this routinely. Most users with a Debian install that is older than oldstable probably ought to reinstall it, or do the two-stage upgrade.

Skip upgrading almost always runs into some kind of trouble (albeit, usually trouble that isn’t particularly hard to fix if you know what you’re doing).

However, officially supported non-skip upgrades go wrong too. Doing a two-or-more-releases upgrade via the intermediate releases can expose you to significant bugs in the intermediate releases, which were later fixed. Because Debian’s users and downstreams are cautious, and Debian itself can be slow, it is common for bugs to appear for one release and then be fixed only in the next. Paradoxically, this seems to be especially true with the kind of big and scary changes where you’d naively think the upgrade mechanisms would break if you skipped the release where the change first came in.

I would not recommend a skip upgrade to someone who is not a competent Debian administrator, with good familiarity with Debian package management, including use of dpkg directly to fix things up. You should have a mental toolkit of manual bug workaround techniques. I always make sure that I have rescue media (and in the case of a remote system, full remote access including ability to boot a different image), although I don’t often need it.

And, when considering a skip upgrade, you should be aware of the major changes that have occurred in Debian.

Skip upgrading is more likely to be a good idea with a complex and highly customised system: a fairly vanilla install is not likely to encounter problems during a two-stage update. (And, a vanilla system can be “upgraded” by reinstalling.)

I haven’t recently skip upgraded a laptop or workstation. I doubt I would attempt it; modern desktop software seems to take a much harder line about breaking things that are officially unsupported, and generally trying to force everyone into the preferred mold.

A request to Debian maintainers

I would like to encourage Debian maintainers to defer removing upgrade compatibility machinery until it is actually getting in the way, or has become hazardous, or many years obsolete.

Examples of the kinds of things which it would be nice to keep, and which do not usually cause much inconvenience to retain, are dependency declarations (particularly, alternatives), and (many) transitional fragments in maintainer scripts.

If you find yourself needing to either delete some compatibility feature, or refactor/reorganise it, I think it is probably best to delete it. If you modify it significantly, the resulting thing (which won’t be tested until someone uses it in anger) is quite likely to have bugs which make it go wrong more badly (or, more confusingly) than the breakage that would happen without it.

Obviously this is all a judgement call.

I’m not saying Debian should formally “support” skip upgrades, to the extent of (further) slowing down important improvements. Nor am I asking for any change to the routine approach to (for example) transitional packages (i.e. the technique for ensuring continuity of installation when a package name changes).

We try to make release upgrades work perfectly; but skip upgrades don’t have to work perfectly to be valuable. Retaining compatibility code can also make it easier to provide official backports, and it probably helps downstreams with different release schedules.

The fact that maintainers do in practice often defer removing compatibility code provides useful flexibility and options to at least some people. So it would be nice if you’d at least not go out of your way to break it.

Building on older releases

I would also like to encourage maintainers to provide source packages in Debian unstable that will still build on older releases, where this isn’t too hard and the resulting binaries might be basically functional.

Speaking personally, it’s not uncommon for me to rebuild packages from unstable and install them on much older releases. This is another thing that is not officially supported, but which often works well.

I’m not saying to contort your build system, or delay progress. You’ll definitely want to depend on a recentish debhelper. But, for example, retaining old build-dependency alternatives is nice. Retaining them doesn’t constitute a promise that it works - it just makes life slightly easier for someone who is going off piste.

If you know you have users on multiple distros or multiple releases, and wish to fully support them, you can go further, of course. Many of my own packages are directly buildable, or even directly installable, on older releases.


The problem I had - Mason, so, sadly, FastCGI

Since the update to current Debian stable, the website for YARRG (a play-aid for Puzzle Pirates which I wrote some years ago) started to occasionally return “Internal Server Error”, apparently due to bug(s) in some FastCGI libraries.

I was using FastCGI because the website is written in Mason, a Perl web framework, and I found that Mason CGI calls were slow. I’m using CGI - yes, trad CGI - via userv-cgi. Running Mason this way would “compile” the template for each HTTP request just when it was rendered, and then throw the compiled version away. The more modern approach of an application server doesn’t scale well to a system which has many web “applications” most of which are very small. The admin overhead of maintaining a daemon, and corresponding webserver config, for each such “service” would be prohibitive, even with some kind of autoprovisioning setup. FastCGI has an interpreter wrapper which seemed like it ought to solve this problem, but it’s quite inconvenient, and often flaky.

I decided I could do better, and set out to eliminate FastCGI from my setup. The result seems to be a success; once I’d done all the hard work of writing prefork-interp, I found the result very straightforward to deploy.

prefork-interp

prefork-interp is a small C program which wraps a script, plus a scripting language library to cooperate with the wrapper program. Together they achieve the following:

  • Startup cost of the script (loading modules it uses, precomputations, loading and processing of data files, etc.) is paid once, and reused for subsequent invocations of the same script.

  • Minimal intervention to the script source code:
    • one new library to import
    • one new call to make from that library, right after the script initialisation is complete
    • change to the #! line.
  • The new “initialisation complete” call turns the program into a little server (a daemon), and then returns once for each actual invocation, each time in a fresh grandchild process.
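
To make the shape of this concrete, here is a much-simplified Python sketch of the underlying pattern: pay the startup cost once, then serve each invocation from a freshly forked process over an AF_UNIX socket. This is not prefork-interp’s actual protocol or its Perl library; it omits the daemonisation, fd passing, stderr juggling and lifetime management described below, forks only one child per invocation rather than a grandchild, and the socket path and toy request format are invented.

    import os, signal, socket

    SOCK = "/tmp/prefork-demo.sock"            # invented rendezvous path

    def expensive_setup():
        # stands in for loading modules, precomputation, reading data files, ...
        return {"answer": 42}

    def main():
        state = expensive_setup()              # paid once, before any invocation
        signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # auto-reap finished children
        try:
            os.unlink(SOCK)
        except FileNotFoundError:
            pass
        srv = socket.socket(socket.AF_UNIX)
        srv.bind(SOCK)
        srv.listen(8)
        while True:
            conn, _ = srv.accept()
            if os.fork() == 0:                 # fresh child process per invocation
                srv.close()
                with conn, conn.makefile("rw") as f:
                    request = f.readline().strip()
                    f.write("%s -> %s\n" % (request, state["answer"]))
                os._exit(0)
            conn.close()                       # parent keeps only the listening socket

    if __name__ == "__main__":
        main()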

Features:

  • Seamless reloading on changes to the script source code (automatic, and configurable).

  • Concurrency limiting.

  • Options for distinguishing different configurations of the same script so that they get a server each.

  • You can run the same script standalone, as a one-off execution, as well as under prefork-interp.

  • Currently, a script-side library is provided for Perl. I’m pretty sure Python would be fairly straightforward.

Important properties not always satisfied by competing approaches:

  • Error output (stderr) and exit status from both phases of the script code execution faithfully reproduced to the calling context. Environment, arguments, and stdin/stdout/stderr descriptors, passed through to each invocation.

  • No polling, other than a long-term idle timeout, so good on laptops (or phones).

  • Automatic lifetime management of the per-script server, including startup and cleanup. No integration needed with system startup machinery: No explicit management of daemons, init scripts, systemd units, cron jobs, etc.

  • Useable right away without fuss for CGI programs but also for other kinds of program invocation.

  • (I believe) reliable handling of unusual states arising from failed invocations or races.

Swans paddling furiously

The implementation is much more complicated than the (apparent) interface.

I won’t go into all the details here (there are some terrifying diagrams in the source code if you really want), but some highlights:

We use an AF_UNIX socket (hopefully in /run/user/UID, but in ~ if not) for rendezvous. We can try to connect without locking, but we must protect the socket with a separate lockfile to avoid two concurrent restart attempts.
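
A minimal sketch of that dance, in Python rather than prefork-interp’s C, with invented paths and a hypothetical start_server() callback: connect optimistically, and only take the lock (and re-check) when the fast path fails.

    import fcntl, os, socket

    run_dir   = os.environ.get("XDG_RUNTIME_DIR") or os.path.expanduser("~")
    sock_path = os.path.join(run_dir, "prefork-demo.sock")   # invented name
    lock_path = sock_path + ".lock"

    def try_connect():
        s = socket.socket(socket.AF_UNIX)
        try:
            s.connect(sock_path)
            return s
        except (FileNotFoundError, ConnectionRefusedError):
            s.close()
            return None

    def connect_or_start(start_server):
        conn = try_connect()                  # fast path: no lock taken
        if conn:
            return conn
        with open(lock_path, "w") as lockf:   # lock is held until this block is left
            fcntl.flock(lockf, fcntl.LOCK_EX) # serialise concurrent restart attempts
            conn = try_connect()              # someone else may have just (re)started it
            if conn:
                return conn
            start_server()                    # hypothetical: daemonises and binds sock_path
            return try_connect()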

We want stderr from the script setup (pre-initialisation) to be delivered to the caller, so the script ought to inherit our stderr and then will need to replace it later. Twice, in fact, because the daemonic server process can’t have a stderr.

When a script is restarted for any reason, any old socket will be removed. We want the old server process to detect that and quit. (If it hung about, it would wait for the idle timeout; if this happened a lot - eg, a constantly changing set of services - we might end up running out of pids or something.) Spotting the socket disappearing, without polling, involves use of a library capable of using inotify (or the equivalent elsewhere). Choosing a C library to do this is not so hard, but portable interfaces to this functionality can be hard to find in scripting languages, and also we don’t want every language binding to have to reimplement these checks. So for this purpose there’s a little watcher process, and associated IPC.

When an invoking instance of prefork-interp is killed, we must arrange for the executing service instance to stop reading from its stdin (and, ideally, writing its stdout). Otherwise it’s stealing input from prefork-interp’s successors (maybe the user’s shell)!

Cleanup ought not to depend on positive actions by failing processes, so each element of the system has to detect failures of its peers by means such as EOF on sockets/pipes.

Obtaining prefork-interp

I put this new tool in my chiark-utils package, which is a collection of useful miscellany. It’s available from git.

Currently I make releases by uploading to Debian, where prefork-interp has just hit Debian unstable, in chiark-utils 7.0.0.

Support for other scripting languages

I would love Python to be supported. If any pythonistas reading this think you might like to help out, please get in touch. The specification for the protocol, and what the script library needs to do, is documented in the source code.

Future plans for chiark-utils

chiark-utils as a whole is in need of some tidying up of its build system and packaging.

I intend to try to do some reorganisation. Currently I think it would be better to organise the source tree more strictly, with a directory for each included facility, rather than grouping “compiled” and “scripts” together.

The Debian binary packages should be reorganised more fully according to their dependencies, so that installing a program will ensure that it works.

I should probably move the official git repo from my own git+gitweb to a forge (so we can have MRs and issues and so on).

And there should be a lot more testing, including Debian autopkgtests.

edited 2022-08-23 10:30 +01:00 to improve the formatting

Background

Internet email is becoming more reliant on DKIM, a scheme for having mail servers cryptographically sign emails. The Big Email providers have started silently spambinning messages that lack either DKIM signatures, or SPF. DKIM is arguably less broken than SPF, so I wanted to deploy it.

But it has a problem: if done in a naive way, it makes all your emails non-repudiable, forever. This is not really a desirable property - at least, not desirable for you, although it can be nice for someone who (for example) gets hold of leaked messages obtained by hacking mailboxes.

This problem was described at some length in Matthew Green’s article Ok Google: please publish your DKIM secret keys. Following links from that article does get you to a short script to achieve key rotation, but that script had a number of problems and wasn’t useable in my context.

dkim-rotate

So I have written my own software for rotating and revoking DKIM keys: dkim-rotate.

I think it is a good solution to this problem, and it ought to be deployable in many contexts (and readily adaptable to those it doesn’t already support).

Here’s the feature list taken from the README:

  • Leaked emails become unattestable (plausibly deniable) within a few days — soon after the configured maximum email propagation time.

  • Mail domain DNS configuration can be static, and separated from operational DKIM key rotation. Domain owner delegates DKIM configuration to mailserver administrator, so that dkim-rotate does not need to edit your mail domain’s zone.

  • When a single mail server handles multiple mail domains, only a single dkim-rotate instance is needed.

  • Supports situations where multiple mail servers may originate mails for a single mail domain.

  • DNS zonefile remains small; old keys are published via a webserver, rather than DNS.

  • Supports runtime (post-deployment) changes to tuning parameters and configuration settings. Careful handling of errors and out-of-course situations.

  • Convenient outputs: a standard DNS zonefile; easily parseable settings for the MTA; and, a directory of old keys directly publishable by a webserver.

Complications

It seems like it should be a simple problem. Keep N keys, and every day (or whatever), generate and start using a new key, and deliberately leak the oldest private key.

But, things are more complicated than that. Considerably more complicated, as it turns out.

I didn’t want the DKIM key rotation software to have to edit the actual DNS zones for each relevant mail domain. That would tightly entangle the mail server administration with the DNS administration, and there are many contexts (including many of mine) where these roles are separated.

The solution is to use DNS aliases (CNAME). But, now we need a fixed, relatively small, set of CNAME records for each mail domain. That means a fixed, relatively small set of key identifiers (“selectors” in DKIM terminology), which must be used in rotation.

We don’t want the private keys to be published via the DNS because that makes an ever-growing DNS zone, which isn’t great for performance; and, because we want to place barriers in the way of processes which might enumerate the set of keys we use (and the set of keys we have leaked) and keep records of what status each key had when. So we need a separate publication channel - for which a webserver was the obvious answer.

We want the private keys to be readily noticeable and findable by someone who is verifying an alleged leaked email dump, but to be hard to enumerate. (One part of the strategy for this is to leave a note about it, with the prospective private key url, in the email headers.)

The key rotation operations are more complicated than first appears, too. The short summary, above, neglects to consider the fact that DNS updates have a nonzero propagation time: if you change the DNS, not everyone on the Internet will experience the change immediately. So as well as a timeout for how long it might take an email to be delivered (ie, how long the DKIM signature remains valid), there is also a timeout for how long to wait after updating the DNS, before relying on everyone having got the memo. (This same timeout applies both before starting to sign emails with a new key, and before deliberately compromising a key which has been withdrawn and deadvertised.)

Updating the DNS, and the MTA configuration, are fallible operations. So we need to cope with out-of-course situations, where a previous DNS or MTA update failed. In that case, we need to retry the failed update, and not proceed with key rotation. We mustn’t start the timer for the key rotation until the update has been implemented.

The rotation script will usually be run by cron, but it might be run by hand, and when it is run by hand it ought not to “jump the gun” and do anything “too early” (ie, before the relevant timeout has expired). cron jobs don’t always run, and don’t always run at precisely the right time. (And there’s daylight saving time to consider, too.)

So overall, it’s not sufficient to drive the system via cron and have it proceed by one unit of rotation on each run.

And, hardest of all, I wanted to support post-deployment configuration changes, while continuing to keep the whole system operational. Otherwise, you have to bake in all the timing parameters right at the beginning and can’t change anything ever. So for example, I wanted to be able to change the email and DNS propagation delays, and even the number of selectors to use, without adversely affecting the delivery of already-sent emails, and without having to shut anything down.

I think I have solved these problems.

The resulting system is one which keeps track of the timing constraints, and the next event which might occur, on a per-key basis. It calculates, on each run, which key(s) can be advanced to the next stage of their lifecycle, and performs the appropriate operations. The regular key update schedule is then an emergent property of the config parameters and cron job schedule. (I provide some example config.)
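
As a rough illustration of that per-key approach, here is a Python sketch. It is not dkim-rotate’s actual code: the state names, field names and delay values are invented, and real life also has to handle failed DNS and MTA updates, which this ignores.

    import time
    from dataclasses import dataclass

    DNS_DELAY  = 2 * 24 * 3600      # assumed DNS propagation allowance, in seconds
    MAIL_DELAY = 7 * 24 * 3600      # assumed maximum email propagation time

    @dataclass
    class Key:
        selector: str
        state: str                  # "advertised" -> "signing" -> "withdrawn" -> "revealed"
        since: float                # when it entered that state

    def advance(key: Key, now: float) -> Key:
        # Move a key one lifecycle step forward, but only once its timeout has expired.
        # (The "signing" -> "withdrawn" step is driven by the rotation schedule itself,
        # not by a timeout, so it is not shown here.)
        waited = now - key.since
        if key.state == "advertised" and waited >= DNS_DELAY:
            return Key(key.selector, "signing", now)    # the world can now resolve its record
        if key.state == "withdrawn" and waited >= DNS_DELAY + MAIL_DELAY:
            return Key(key.selector, "revealed", now)   # no legitimate mail still depends on it
        return key                                      # nothing due yet; try again next run

    # each cron run might then do: keys = [advance(k, time.time()) for k in keys]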

Exim

Integrating dkim-rotate itself with Exim was fairly easy. The lsearch lookup function can be used to fish information out of a suitable data file maintained by dkim-rotate.

But a final awkwardness was getting Exim to make the right DKIM signatures, at the right time.

When making a DKIM signature, one must choose a signing authority domain name: who should we claim to be? (This is the “SDID” in DKIM terms.) A mailserver that handles many different mail domains will be able to make good signatures on behalf of many of them. It seems to me that the signing domain ought to be the mail domain in the From: header of the email. (The RFC doesn’t seem to be clear on what is expected.) Exim doesn’t seem to have anything builtin to do that.
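
For what it’s worth, extracting that domain is trivial outside the MTA; a hypothetical Python helper (not part of dkim-rotate, and not what Exim does internally) might look like:

    from email.utils import parseaddr

    def signing_domain(from_header: str) -> str:
        # use the domain of the address in the From: header as the SDID
        _, addr = parseaddr(from_header)
        return addr.rpartition("@")[2].lower()

    # signing_domain("Ian Jackson <ijackson@chiark.greenend.org.uk>")
    #   == "chiark.greenend.org.uk"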

And, you only want to DKIM-sign emails that are originated locally or from trustworthy sources. You don’t want to DKIM-sign messages that you received from the global Internet, and are sending out again (eg because of an email alias or mailing list). In theory, if you verify DKIM on all incoming emails, you could avoid being fooled into signing bad emails, but rejecting all non-DKIM-verified email would be a very strong policy decision. Again, Exim doesn’t seem to have ready-made machinery for this.

The resulting Exim configuration parameters run to 22 lines, and because they’re parameters to an existing config item (the smtp transport) they can’t even easily be deployed as a drop-in file via Debian’s “split config” Exim configuration scheme.

(I don’t know if the file written for Exim’s use by dkim-rotate would be suitable for other MTAs, but this part of dkim-rotate could easily be extended.)

Conclusion

I have today released dkim-rotate 0.4, which is the first public release for general use.

I have it deployed and working, but it’s new so there may well be bugs to work out.

If you would like to try it out, you can get it via git from Debian Salsa. (Debian folks can also find it freshly in Debian unstable.)


Two weeks ago I upgraded chiark from Debian jessie i386 to bullseye amd64, after nearly 30 years running Debian i386. This went really quite well, in fact!

Background

chiark is my “colo” - a server I run, which lives in a data centre in London. It hosts ~200 users with shell accounts, various websites and mailing lists, moderators for a number of USENET newsgroups, and countless other services. chiark’s internal setup is designed to enable my users to do a maximum number of exciting things with a minimum of intervention from me.

chiark’s OS install dates to 1993, when I installed Debian 0.93R5, the first version of Debian to advertise the ability to be upgraded without reinstalling. I think that makes it one of the oldest Debian installations in existence.

Obviously it’s had several new hardware platforms too. (There was a prior install of Linux on the initial hardware, remnants of which can maybe still be seen in some obscure corners of chiark’s /usr/local.)

chiark’s install is also at the very high end of the scale of installation complexity and customisation: reinstalling it completely would be an enormous amount of work. And it’s unique.

chiark’s upgrade history

chiark’s last major OS upgrade was to jessie (Debian 8, released in April 2015). That was in 2016. Since then we have been relying on Debian’s excellent security support posture, the Debian LTS project, and more recently Freexian’s Debian ELTS project, plus some local updates. The use of ELTS - which supports only a subset of packages - was particularly uncomfortable.

Additionally, chiark was installed with 32-bit x86 Linux (Debian i386), since that was what was supported and available at the time. But 32-bit is looking very long in the tooth.

Why do a skip upgrade

So, I wanted to move to the fairly recent stable release - Debian 11 (bullseye), which is just short of a year old. And I wanted to “crossgrade” (as it’s called) to 64-bit.

In the past, I have found I have had greater success by doing “direct” upgrades, skipping intermediate releases, rather than by following the officially-supported path of going via every intermediate release.

Doing a skip upgrade avoids exposure to any packaging bugs which were present only in intermediate release(s). Debian does usually fix bugs, but Debian has many cautious users, so it is not uncommon for bugs to be found after release, and then not be fixed until the next one.

A skip upgrade avoids the need to try to upgrade to already-obsolete releases (which can involve messing about with multiple snapshots from snapshot.debian.org). It is also significantly faster and simpler, which is important not only because it reduces downtime, but also because it removes opportunities (and reduces the time available) for things to go badly.

One downside is that sometimes maintainers aggressively remove compatibility measures for older releases. (And compatibility packages are generally removed quite quickly by even cautious maintainers.) That means that the sysadmin who wants to skip-upgrade needs to do more manual fixing of things that haven’t been dealt with automatically. And occasionally one finds compatibility problems that show up only when mixing very old and very new software, that no-one else has seen.

Crossgrading

Crossgrading is fairly complex and hazardous. It is well supported by the low level tools (eg, dpkg) but the higher-level packaging tools (eg, apt) get very badly confused.

Nowadays the system is so complex that downloading things by hand and manually feeding them to dpkg is impractical, other than as a very occasional last resort.

The approach, generally, has been to set the system up to “want to” be the new architecture, run apt in a download-only mode, and do the package installation manually, with some fixing up and retrying, until the system is coherent enough for apt to work.

This is the approach I took. (In current releases, there are tools that will help but they are only in recent releases and I wanted to go direct. I also doubted that they would work properly on chiark, since it’s so unusual.)

Peril and planning

Overall, this was a risky strategy to choose. The package dependencies wouldn’t necessarily express all of the sequencing needed. But it still seemed that if I could come up with a working recipe, I could do it.

I restored most of one of chiark’s backups onto a scratch volume on my laptop. With the LVM snapshot tools and chroots, I was able to develop and test a set of scripts that would perform the upgrade. This was a very effective approach: my super-fast laptop, with local caches of the package repositories, was able to do many “edit, test, debug” cycles.

My recipe made heavy use of snapshot.debian.org, to make sure that it wouldn’t rot between testing and implementation.

When I had a working scheme, I told my users about the planned downtime. I warned everyone it might take even 2 or 3 days. I made sure that my access arrangements to the data centre were in place, in case I needed to visit in person. (I have remote serial console and power cycler access.)

Reality - the terrible rescue install

My first task on taking the service down was to check that the emergency rescue installation worked: chiark has an ancient USB stick in the back, which I can boot to from the BIOS. The idea being that many things that go wrong could be repaired from there.

I found that that install was too old to understand chiark’s storage arrangements. mdadm tools gave very strange output. So I needed to upgrade it. After some experiments, I rebooted back into the main install, bringing chiark’s service back online.

I then used the main install of chiark as a kind of meta-rescue-image for the rescue-image. The process of getting the rescue image upgraded (not even to amd64, but just to something not totally ancient) was fraught. Several times I had to rescue it by copying files in from the main install outside. And, the rescue install was on a truly ancient 2G USB stick which was terribly terribly slow, and also very small.

I hadn’t done any significant planning for this subtask, because it was low-risk: there was little way to break the main install. Due to all these adverse factors, sorting out the rescue image took five hours.

If I had known how long it would take, at the beginning, I would have skipped it. 5 hours is more than it would have taken to go to London and fix something in person.

Reality - the actual core upgrade

I was able to start the actual upgrade in the mid-afternoon. I meticulously checked and executed the steps from my plan.

The terrifying scripts which sequenced the critical package updates ran flawlessly. Within an hour or so I had a system which was running bullseye amd64, albeit with many important packages still missing or unconfigured.

So I didn’t need the rescue image after all, nor to go to the datacentre.

Fixing all the things

Then I had to deal with all the inevitable fallout from an upgrade.

Notable incidents:

exim4 has a new tainting system

This is to try to help the sysadmin avoid writing unsafe string interpolations. (“Little Bobby Tables.”) This was done by Exim upstream in a great hurry as part of a security response process.

The new checks meant that the mail configuration did not work at all. I had to turn off the taint check completely. I’m fairly confident that this is correct, because I am hyper-aware of quoting issues and all of my configuration is written to avoid the problems that tainting is supposed to avoid.

One particular annoyance is that the approach taken for sqlite lookups makes it totally impossible to use more than one sqlite database. I think the sqlite quoting operator which one uses to interpolate values produces tainted output? I need to investigate this properly.

LVM now ignores PVs which are directly contained within LVs by default

chiark has LVM-on-RAID-on-LVM. This generally works really well.

However, there was one edge case where I ended up without the intermediate RAID layer. The result is LVM-on-LVM.

But recent versions of the LVM tools do not look at PVs inside LVs, by default. This is to help you avoid corrupting the state of any VMs you have on your system. I didn’t know that at the time, though. All I knew was that LVM was claiming my PV was “unusable”, and wouldn’t explain why.

I was about to start on a thorough reading of the 15,000-word essay that is the commentary in the default /etc/lvm/lvm.conf to try to see if anything was relevant, when I received a helpful tipoff on IRC pointing me to the scan_lvs option.

I need to file a bug asking for the LVM tools to explain why they have declared a PV unuseable.

apache2’s default config no longer read one of my config files

I had to do a merge (of my changes vs the maintainers’ changes) for /etc/apache2/apache2.conf. When doing this merge I failed to notice that the file /etc/apache2/conf.d/httpd.conf was no longer included by default. My merge dropped that line. There were some important things in there, and until I found this the webserver was broken.

dpkg --skip-same-version DTWT during a crossgrade

(This is not a “fix all the things” - I found it when developing my upgrade process.)

When doing a crossgrade, one often wants to say to dpkg “install all these things, but don’t reinstall things that have already been done”. That’s what --skip-same-version is for.

However, the logic had not been updated as part of the work to support multiarch, so it was wrong. I prepared a patched version of dpkg, and inserted it in the appropriate point in my prepared crossgrade plan.

The patch is now filed as bug #1014476 against dpkg upstream.

Mailman

Mailman is no longer in bullseye. It’s only available in the previous release, buster.

bullseye has Mailman 3 which is a totally different system - requiring basically, a completely new install and configuration. To even preserve existing archive links (a very important requirement) is decidedly nontrivial.

I decided to punt on this whole situation. Currently chiark is running buster’s version of Mailman. I will have to deal with this at some point and I’m not looking forward to it.

Python

Of course, that Mailman is Python 2. The Python project’s extremely badly handled transition includes a recommendation to change the meaning of #!/usr/bin/python from Python 2 to Python 3.

But Python 3 is a new language, barely compatible with Python 2 even in the most recent iterations of both, and it is usual to need to coinstall them.

Happily Debian have provided the python-is-python2 package to make things work sensibly, albeit with unpleasant imprecations in the package summary description.

USENET news

Oh my god. INN uses many non-portable data formats, which just depend on your C types. And there are complicated daemons, statically linked libraries which cache on-disk data, and much to go wrong.

I had numerous problems with this, and several outages and malfunctions. I may write about that on a future occasion.

(edited 2022-07-20 11:36 +01:00 and 2022-07-30 12:28+01:00 to fix typos)
