diziet: (Default)

Rust, and resistance to it in some parts of the Linux community, has been in my feed recently. One undercurrent seems to be the notion that Rust is woke (and should therefore be rejected as part of culture wars).

I’m going to argue that Rust, the language, is woke. So the opponents are right, in that sense. Of course, as ever, dissing something for being woke is nasty and fascist-adjacent.

diziet: (Default)

derive-deftly 1.0 is released.

derive-deftly is a template-based derive-macro facility for Rust. It has been a great success. Your codebase may benefit from it too!

Rust programmers will appreciate its power, flexibility, and consistency, compared to macro_rules; and its convenience and simplicity, compared to proc macros.

Programmers coming to Rust from scripting languages will appreciate derive-deftly’s convenient automatic code generation, which works as a kind of compile-time introspection.

diziet: (Default)

derive-deftly, the template-based derive-macro facility for Rust, has been a great success.

It’s coming up to time to declare a stable 1.x version. If you’d like to try it out, and have final comments / observations, now is the time.

diziet: (Default)

tl;dr:

If you are a Debian user who knows git, don’t work with Debian source packages. Don’t use apt source, or dpkg-source. Instead, use dgit and work in git.

Also, don’t use: “VCS” links on official Debian web pages, debcheckout, or Debian’s (semi-)official gitlab, Salsa. These are suitable for Debian experts only; for most people they can be beartraps. Instead, use dgit.

diziet: (Default)

Recently I completed a small project, including an embedded microcontroller. For me, using the popular Arduino IDE, and C, was a mistake. The experience with Rust was better, but still very exciting, and not in a good way.

Here follows the rant.

diziet: (Default)

If you are an email system administrator, you are probably using DKIM to sign your outgoing emails. You should be rotating the key regularly and automatically, and publishing old private keys. I have just released dkim-rotate 1.0; dkim-rotate is a tool to do this key rotation and publication.

If you are an email user, your email provider ought to be doing this. If this is not done, your emails are “non-repudiable”, meaning that if they are leaked, anyone (eg, journalists, haters) can verify that they are authentic, and prove that to others. This is not desirable (for you).

diziet: (Default)

Instructions

  1. Get the official installation image from the usual locations. I got the netinst CD image via BitTorrent.

  2. Boot from the image and go through the installation in the normal way.

    1. You may want to select an alternative desktop environment (and unselect GNOME). These steps have been tested with MATE.

    2. Stop when you are asked to remove the installation media and reboot.

  3. Press Alt + Right arrow to switch to the text VC. Hit return to activate the console and run the following commands (answering yes as appropriate):

    chroot /target bash
    apt-get install sysvinit-core elogind ntp dbus-x11
    apt-get autoremove
    exit

  4. Observe the output from the apt-get install. If your disk arrangements are unusual, that may generate some error messages from update-initramfs.

  5. Go back to the installer VC with Alt + Left arrow. If there were no error messages above, you may tell it to reboot.

  6. If there were error messages (for example, I found that if there was disk encryption, alarming messages were printed), tell the installer to go “Back”. Then ask it to “Install GRUB bootloader” (again). After that has completed, you may reboot.

  7. Enjoy your Debian system without systemd.

diziet: (Default)

tl;dr

Have you ever wished that you could write a new derive macro without having to mess with procedural macros?

Now you can!

derive-adhoc lets you write a #[derive] macro, using a template syntax which looks a lot like macro_rules!.

It’s still 0.x - so unstable, and maybe with sharp edges. We want feedback!

And, the documentation is still very terse. It doesn’t omit anything, but it is severely lacking in examples, motivation, and so on. It will suit readers who enjoy dense reference material.

diziet: (Default)

Recently, we managed to get secnet and hippotat into Debian. They are on track to go into Debian bookworm. This completes, in Debian, the set of VPN/networking tools that I (and other Greenend folks) have been using for many years.

The Sinister Greenend Organisation’s suite of network access tools consists mainly of:

  • secnet - VPN.
  • hippotat - IP-over-HTTP (workaround for bad networks)
  • userv ipif - user-created network interfaces

secnet

secnet is our very mature VPN system.

Its basic protocol idea is similar to that in Wireguard, but it’s much older. Differences from Wireguard include:

  • Comes with some (rather clumsy) provisioning tooling, supporting almost any desired virtual network topology. In the SGO we have a complete mesh of fixed sites (servers), and a number of roaming hosts (clients), each of which can have one or more sites as its home.

  • No special kernel drivers required. Everything is userspace.

  • An exciting “polypath” mode where packets are sent via multiple underlying networks in parallel, offering increased reliability for roaming hosts.

  • Portable to non-Linux platforms.

  • A much older, and less well audited, codebase.

  • Very flexible configuration arrangements, but things are also under-documented and to an extent under-productised.

  • Hasn’t been ported to phones/tablets.

secnet was originally written by Stephen Early, starting in 1996 or so. I inherited it some years ago and have been maintaining it since. It’s mostly written in C.

Hippotat

Hippotat is best described by copying the intro from the docs:

Hippotat is a system to allow you to use your normal VPN, ssh, and other applications, even in broken network environments that are only ever tested with “web stuff”.

Packets are parcelled up into HTTP POST requests, resembling form submissions (or JavaScript XMLHttpRequest traffic), and the returned packets arrive via the HTTP response bodies.

It doesn’t rely on TLS tunnelling so can work even if the local network is trying to intercept TLS. I recently rewrote Hippotat in Rust.

userv ipif

userv ipif is one of the userv utilities.

It allows safe delegation of network routing to unprivileged users. The delegation is of a specific address range, so different ranges can be delegated to different users, and the authorised user cannot interfere with other traffic.

This is used in the default configuration of hippotat packages, so that an ordinary user can start up the hippotat client as needed.

On chiark userv-ipif is used to delegate networking to users, including administrators of allied VPN realms. So chiark actually runs at least 4 VPN-ish systems in production: secnet, hippotat, Mark Wooding’s Tripe, and still a few links managed by the now-superseded udptunnel system.

userv

userv ipif is a userv service. That is, it is a facility which uses userv to bridge a privilege boundary.

userv is perhaps my most under-appreciated program. userv can be used to straightforwardly bridge (local) privilege boundaries on Unix systems.

So for example it can:

  • Allow a sysadmin to provide a shell script to be called by unprivileged users, but which will run as root. sudo can do this too but it has quite a few gotchas, and you have to be quite careful how you use it - and its security record isn’t great either.

  • Form the internal boundary in a privilege-separated system service. So, for example, the hippotat client is a program you can run from the command line as a normal user, if the relevant network addresses have been delegated to you. On chiark, CGI programs run as the providing user - not using suexec (which I don’t trust), but via userv.

userv services can be defined by the called user, not only by the system administrator. This allows a user to reconfigure or divert a system-provided default implementation, and even allows users to define and implement ad-hoc services of their own. (Although, the system administrator can override user config.)

Acknowledgements

Thanks for the help I had in this effort.

In particular, thanks to Sean Whitton for encouragement, and the ftpmaster review; and to the Debian Rust Team for their help navigating the complexities of handling Rust packages within the Debian Rust Team workflow.

diziet: (Default)

I have reviewed, updated and revised my short book about the Rust programming language, Rust for the Polyglot Programmer.

It now covers some language improvements from the past year (noting which versions of Rust they’re available in), and has been updated for changes in the Rust library ecosystem.

With (further) assistance from Mark Wooding, there is also a new table of recommendations for numerical conversion.

Recap about Rust for the Polyglot Programmer

There are many introductory materials about Rust. This one is rather different. Compared to much other information about Rust, Rust for the Polyglot Programmer is:

  • Dense: I assume a lot of starting knowledge. Or to look at it another way: I expect my reader to be able to look up and digest non-Rust-specific words or concepts.

  • Broad: I cover not just the language and tools, but also the library ecosystem, development approach, community ideology, and so on.

  • Frank: much material about Rust has a tendency to gloss over or minimise the bad parts. I don’t do that. That also frees me to talk about strategies for dealing with the bad parts.

  • Non-neutral: I’m not afraid to recommend particular libraries, for example. I’m not afraid to extol Rust’s virtues in the areas where it does well.

  • Terse, and sometimes shallow: I often gloss over what I see as unimportant or fiddly details; instead I provide links to appropriate reference materials.

After reading Rust for the Polyglot Programmer, you won’t know everything you need to know to use Rust for any project, but should know where to find it.

Comments are welcome of course, via the Dreamwidth comments or Salsa issue or MR. (If you’re making a contribution, please indicate your agreement with the Developer Certificate of Origin.)

edited 2022-12-20 01:48 to fix a typo
diziet: (Default)

tl;dr:

Ok-wrapping as needed in today’s Rust is a significant distraction, because there are multiple ways to do it. They are all slightly awkward in different ways, so are least-bad in different situations. You must choose a way for every fallible function, and sometimes change a function from one pattern to another.

Rust really needs #[throws] as a first-class language feature. Code using #[throws] is simpler and clearer.

Please try out withoutboats’s fehler. I think you will like it.

A recent personal experience in coding style

Ever since I read withoutboats’s 2020 article about fehler, I have been using it in most of my personal projects.

For Reasons I recently had a go at eliminating the dependency on fehler from Hippotat. So, I made a branch, deleted the dependency and imports, and started on the whack-a-mole with the compiler errors.

After about a half hour of this, I was starting to feel queasy.

After an hour I had decided that basically everything I was doing was making the code worse. And, bizarrely, I kept having to make individual decisions about what idiom to use in each place. I couldn’t face it any more.

After sleeping on the question I decided that Hippotat would be in Debian with fehler, or not at all. Happily the Debian Rust Team generously helped me out, so the answer is that fehler is now in Debian, so it’s fine.

For me this experience, of trying to convert Rust-with-#[throws] to Rust-without-#[throws] brought the Ok wrapping problem into sharp focus.

What is Ok wrapping? Intro to Rust error handling

(You can skip this section if you’re already a seasoned Rust programmer.)

In Rust, fallibility is represented by functions that return Result<SuccessValue, Error>: this is a generic type, representing either whatever SuccessValue is (in the Ok variant of the data-bearing enum) or some Error (in the Err variant). For example, std::fs::read_to_string, which takes a filename and returns the contents of the named file, returns Result<String, std::io::Error>.

This is a nice and typesafe formulation of, and generalisation of, the traditional C practice, where a function indicates in its return value whether it succeeded, and errors are indicated with an error code.

Result is part of the standard library and there are convenient facilities for checking for errors, extracting successful results, and so on. In particular, Rust has the postfix ? operator, which, when applied to a Result, does one of two things: if the Result was Ok, it yields the inner successful value; if the Result was Err, it returns early from the current function, returning an Err in turn to the caller.

This means you can write things like this:

    let input_data = std::fs::read_to_string(input_file)?;

and the error handling is pretty automatic. You get a compiler warning, or a type error, if you forget the ?, so you can’t accidentally ignore errors.
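For comparison, here is roughly what that ? does if you write it out by hand. (This is only a sketch; the real expansion also applies an error-type conversion via From.)

    fn read_input(input_file: &str) -> Result<String, std::io::Error> {
        let input_data = match std::fs::read_to_string(input_file) {
            Ok(v) => v,                  // yield the successful value
            Err(e) => return Err(e),     // or return early with the error
        };
        Ok(input_data)
    }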

But, there is a downside. When you are returning a successful outcome from your function, you must convert it into a Result. After all, your fallible function has return type Result<SuccessValue, Error>, which is a different type to SuccessValue. So, for example, inside std::fs::read_to_string, we see this:

        let mut string = String::new();
        file.read_to_string(&mut string)?;
        Ok(string)
    }

string has type String; fs::read_to_string must return Result<String, ..>, so at the end of the function we must return Ok(string). This applies to return statements, too: if you want an early successful return from a fallible function, you must write return Ok(whatever).

This is particularly annoying for functions that don’t actually return a nontrivial value. Normally, when you write a function that doesn’t return a value, you don’t write the return type. The compiler interprets this as syntactic sugar for -> (), ie, that the function returns (), the empty tuple, used in Rust as a dummy value in these kinds of situations. A block ({ ... }) whose last statement ends in a ; has type (). So, when you fall off the end of a function, the return value is (), without you having to write it. So you simply leave out the stuff in your program about the return value, and your function doesn’t have one (i.e. it returns ()).
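For example, an ordinary infallible function with no written return type looks like this (a made-up example):

    fn greet(name: &str) {             // no return type written: the function returns ()
        println!("hello, {name}");     // final statement ends in ';', so the block's value is ()
    }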

But, a function which either fails with an error, or completes successfully without returning anything, has return type Result<(), Error>. At the end of such a function, you must explicitly provide the success value. After all, if you just fall off the end of a block, it means the block has value (), which is not of type Result<(), Error>. So the fallible function must end with Ok(()), as we see in the example for std::fs::read_to_string.
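So a fallible function which returns nothing useful ends up looking something like this (again, a made-up example):

    fn remove_if_present(path: &std::path::Path) -> Result<(), std::io::Error> {
        if path.exists() {
            std::fs::remove_file(path)?;
        }
        Ok(())    // the "nothing" success value must still be wrapped explicitly
    }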

A minor inconvenience, or a significant distraction?

I think the need for Ok-wrapping on all success paths from fallible functions is generally regarded as just a minor inconvenience. Certainly the experienced Rust programmer gets very used to it. However, while trying to remove fehler’s #[throws] from Hippotat, I noticed something that is evident in codebases using “vanilla” Rust (without fehler) but which goes un-remarked.

There are multiple ways to write the Ok-wrapping, and the different ways are appropriate in different situations.

See the following examples, all taken from a real codebase. (And it’s not just me: I do all of these in different places, when I don’t have fehler available; but all these examples are from code written by others.)

Idioms for Ok-wrapping - a bestiary

Wrap just a returned variable binding

If you have the return value in a variable, you can write Ok(retval) at the end of the function, instead of just retval.

    pub fn take_until(&mut self, term: u8) -> Result<&'a [u8]> {
        // several lines of code
        Ok(result)
    }

If the returned value is not already bound to a variable, making a function fallible might mean choosing to bind it to a variable.

Wrap a nontrivial return expression

Even if it’s not just a variable, you can wrap the expression which computes the returned value. This is often done if the returned value is a struct literal:

    fn take_from(r: &mut Reader<'_>) -> Result<Self> {
        // several lines of code
        Ok(AuthChallenge { challenge, methods })
    }

Introduce Ok(()) at the end

For functions returning Result<()>, you can write Ok(()).

This is usual, but not ubiquitous, since sometimes you can omit it.

Wrap the whole body

If you don’t have the return value in a variable, you can wrap the whole body of the function in Ok(). Whether this is a good idea depends on how big and complex the body is.

    fn from_str(s: &str) -> std::result::Result<Self, Self::Err> {
        Ok(match s {
            "Authority" => RelayFlags::AUTHORITY,
            // many other branches
            _ => RelayFlags::empty(),
        })
    }

Omit the wrap when calling fallible sub-functions

If your function wraps another function call of the same return and error type, you don’t need to write the Ok at all. Instead, you can simply call the function and not apply ?.

You can do this even if your function selects between a number of different sub-functions to call:

    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        if flags::unsafe_logging_enabled() {
            std::fmt::Display::fmt(&self.0, f)
        } else {
            self.0.display_redacted(f)
        }
    }

But this doesn’t work if the returned error type isn’t the same, but needs the autoconversion implied by the ? operator.

Convert a fallible sub-function error with Ok( ... ?)

If the final thing a function does is chain to another fallible function, but with a different error type, the error must be converted somehow. This can be done with ?.

     fn try_from(v: i32) -> Result<Self, Error> {
         Ok(Percentage::new(v.try_into()?))
     }

Convert a fallible sub-function error with .map_err

Or, rarely, people solve the same problem by converting explicitly with .map_err:

     pub fn create_unbootstrapped(self) -> Result<TorClient<R>> {
         // several lines of code
         TorClient::create_inner(
             // several parameters
         )
         .map_err(ErrorDetail::into)
     }

What is to be done, then?

The fehler library is in excellent taste and has the answer. With fehler:

  • Whether a function is fallible, and what its error type is, is specified in one place. It is not entangled with the main return value type, nor with the success return paths.

  • So the success paths out of a function are not specially marked with error handling boilerplate. The end-of-function return value, and the expression after return, are automatically wrapped up in Ok. So the body of a fallible function is just like the body of an infallible one, except for places where error handling is actually involved.

  • Error returns occur through ? error chaining, and with a new explicit syntax for error return.

  • We usually talk about the error we are possibly returning, and avoid talking about Result unless we need to.

fehler provides:

  • An attribute macro #[throws(ErrorType)] to make a function fallible in this way.

  • A macro throw!(error) for explicitly failing.

This is precisely correct. It is very ergonomic.
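For illustration, here is the general shape of a fallible function written with fehler. (The function, its name, and the error messages are invented for this sketch.)

    use fehler::{throw, throws};
    use anyhow::{anyhow, Error as AE};

    #[throws(AE)]                      // the function actually returns Result<u16, AE>
    fn parse_port(s: &str) -> u16 {
        if s.is_empty() { throw!(anyhow!("port may not be empty")); }
        s.parse()?                     // ? chains and converts errors as usual; no Ok() needed
    }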

Consequences include:

  • One does not need to decide where to put the Ok-wrapping, since it’s automatic rather than explicitly written out.

  • Specifically, what idiom to adopt in the body (for example {write!(...)?;} vs {write!(...)} in a formatter) does not depend on whether the error needs converting, how complex the body is, and whether the final expression in the function is itself fallible.

  • Making an infallible function fallible involves only adding #[throws] to its definition, and ? to its call sites. One does not need to edit the body, or the return type.

  • Changing the error returned by a function to a suitably compatible different error type does not involve changing the function body.

  • There is no need for a local Result alias shadowing std::result::Result, which means that when one needs to speak of Result explicitly, the code is clearer.

Limitations of fehler

But, fehler is a Rust procedural macro, so it cannot get everything right. Sadly there are some wrinkles.

  • You can’t write #[throws] on a closure.

  • Sometimes you can get quite poor error messages if you have a sufficiently broken function body.

  • Code inside a macro call isn’t properly visible to fehler so sometimes return statements inside macro calls are untreated. This will lead to a type error, so isn’t a correctness hazard, but it can be a nuisance if you like other syntax extensions eg if_chain.

  • #[must_use] #[throws(Error)] fn obtain() -> Thing; ought to mean that Thing must be used, not the Result<Thing, Error>.

But, Rust-with-#[throws] is so much nicer a language than Rust-with-mandatory-Ok-wrapping, that these are minor inconveniences.

Please can we have #[throws] in the Rust language

This ought to be part of the language, not a macro library. In the compiler, it would be possible to get all the corner cases right. It would make the feature available to everyone, and it would quickly become idiomatic Rust throughout the community.

It is evident from reading writings from the time, particularly those from withoutboats, that there were significant objections to automatic Ok-wrapping. It seems to have become quite political, and some folks burned out on the topic.

Perhaps, now, a couple of years later, we can revisit this area and solve this problem in the language itself ?

“Explicitness”

An argument I have seen made against automatic Ok-wrapping, and, in general, against any kind of useful language affordance, is that it makes things less explicit.

But this argument is fundamentally wrong for Ok-wrapping. Explicitness is not an unalloyed good. We humans have only limited attention. We need to focus that attention where it is actually needed. So explicitness is good in situations where what is going on is unusual; or would otherwise be hard to read; or is tricky or error-prone. Generally: explicitness is good for things where we need to direct humans’ attention.

But Ok-wrapping is ubiquitous in fallible Rust code. The compiler mechanisms and type system almost completely defend against mistakes. All but the most novice programmer knows what’s going on, and the very novice programmer doesn’t need to. Rust’s error handling arrangements are designed specifically so that we can avoid worrying about fallibility unless necessary — except for the Ok-wrapping. Explicitness about Ok-wrapping directs our attention away from whatever other things the code is doing: it is a distraction.

So, explicitness about Ok-wrapping is a bad thing.

Appendix - examples showing that code with Ok-wrapping is worse than code using #[throws]

Observe these diffs, from my abandoned attempt to remove the fehler dependency from Hippotat.

I have a type alias AE for the usual error type (AE stands for anyhow::Error). In the non-#[throws] code, I end up with a type alias AR<T> for Result<T, AE>, which I think is more opaque — but at least that avoids typing out -> Result< , AE> a thousand times. Some people like to have a local Result alias, but that means that the standard Result has to be referred to as StdResult or std::result::Result.
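(Concretely, the two aliases amount to something like this:)

    type AE = anyhow::Error;
    type AR<T> = std::result::Result<T, AE>;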

(Each example is shown twice: first with fehler and #[throws], then in vanilla Rust with Result<> and mandatory Ok-wrapping.)

Return value clearer, error return less wordy:

With fehler and #[throws]:

    impl Parseable for Secret {
      #[throws(AE)]
      fn parse(s: Option<&str>) -> Self {
        let s = s.value()?;
        if s.is_empty() { throw!(anyhow!("secret value cannot be empty")) }
        Secret(s.into())
      }
      …

Vanilla Rust (mandatory Ok-wrapping):

    impl Parseable for Secret {
      fn parse(s: Option<&str>) -> AR<Self> {
        let s = s.value()?;
        if s.is_empty() { return Err(anyhow!("secret value cannot be empty")) }
        Ok(Secret(s.into()))
      }
      …

No need to wrap whole match statement in Ok( ):

With fehler and #[throws]:

      #[throws(AE)]
      pub fn client<T>(&self, key: &'static str, skl: SKL) -> T
      where T: Parseable + Default {
        match self.end {
          LinkEnd::Client => self.ordinary(key, skl)?,
          LinkEnd::Server => default(),
        }
        …

Vanilla Rust (mandatory Ok-wrapping):

      pub fn client<T>(&self, key: &'static str, skl: SKL) -> AR<T>
      where T: Parseable + Default {
        Ok(match self.end {
          LinkEnd::Client => self.ordinary(key, skl)?,
          LinkEnd::Server => default(),
        })
        …

Return value and Ok(()) entirely replaced by #[throws]:

With fehler and #[throws]:

    impl Display for Loc {
      #[throws(fmt::Error)]
      fn fmt(&self, f: &mut fmt::Formatter) {
        write!(f, "{:?}:{}", &self.file, self.lno)?;
        if let Some(s) = &self.section {
          write!(f, " ")?;
          …
        }
      }

Vanilla Rust (mandatory Ok-wrapping):

    impl Display for Loc {
      fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}:{}", &self.file, self.lno)?;
        if let Some(s) = &self.section {
          write!(f, " ")?;
          …
        }
        Ok(())
      }

Call to write! now looks the same as in more complex case shown above:

With fehler and #[throws]:

    impl Debug for Secret {
      #[throws(fmt::Error)]
      fn fmt(&self, f: &mut fmt::Formatter) {
        write!(f, "Secret(***)")?;
      }

Vanilla Rust (mandatory Ok-wrapping):

    impl Debug for Secret {
      fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Secret(***)")
      }

Much tiresome return Ok() noise removed:

With fehler and #[throws]:

    impl FromStr for SectionName {
      type Err = AE;
      #[throws(AE)]
      fn from_str(s: &str) -> Self {
        match s {
          "COMMON" => return SN::Common,
          "LIMIT" => return SN::GlobalLimit,
          _ => { }
        };
        if let Ok(n@ ServerName(_)) = s.parse() { return SN::Server(n) }
        if let Ok(n@ ClientName(_)) = s.parse() { return SN::Client(n) }
        …
        if client == "LIMIT" { return SN::ServerLimit(server) }
        let client = client.parse().context("client name in link section name")?;
        SN::Link(LinkName { server, client })
      }

Vanilla Rust (mandatory Ok-wrapping):

    impl FromStr for SectionName {
      type Err = AE;
      fn from_str(s: &str) -> AR<Self> {
        match s {
          "COMMON" => return Ok(SN::Common),
          "LIMIT" => return Ok(SN::GlobalLimit),
          _ => { }
        };
        if let Ok(n@ ServerName(_)) = s.parse() { return Ok(SN::Server(n)) }
        if let Ok(n@ ClientName(_)) = s.parse() { return Ok(SN::Client(n)) }
        …
        if client == "LIMIT" { return Ok(SN::ServerLimit(server)) }
        let client = client.parse().context("client name in link section name")?;
        Ok(SN::Link(LinkName { server, client }))
      }
edited 2022-12-18 19:58 UTC to improve, and 2022-12-18 23:28 to fix, formatting
diziet: (Default)

tl;dr:

Don’t write a Rust linked list library: they are hard to do well, and usually useless.

Use VecDeque, which is great. If you actually need more than VecDeque can do, use one of the handful of libraries that actually offer a significantly more useful API.

If you are writing your own data structure, check if someone has done it already, and consider slotmap or generational-arena (or maybe Rc/Arc).

Survey of Rust linked list libraries

I have updated my Survey of Rust linked list libraries.

Background

In 2019 I was writing plag-mangler, a tool for planar graph layout.

I needed a data structure. Naturally I looked for a library to help. I didn’t find what I needed, so I wrote rc-dlist-deque. However, on the way I noticed an inordinate number of linked list libraries written in Rust. Almost all of these had no real reason for existing. Even the one in the Rust standard library is useless.

Results

Now I have redone the survey. The results are depressing. In 2019 there were 5 libraries which, in my opinion, were largely useless. In late 2022 there are now thirteen linked list libraries that ought probably not ever to be used. And, a further eight libraries for which there are strictly superior alternatives. Many of these have the signs of projects whose authors are otherwise competent: proper documentation, extensive APIs, and so on.

There is one new library which is better for some applications than those available in 2019. (I’m referring to generational_token_list, which makes a plausible alternative to dlv-list which I already recommended in 2019.)

Why are there so many poor Rust linked list libraries ?

Linked lists and Rust do not go well together. But (and I’m guessing here) I presume many people are taught in programming school that a linked list is a fundamental data structure; people are often even asked to write one as a teaching exercise. This is a bad idea in Rust. Or maybe they’ve heard that writing linked lists in Rust is hard and want to prove they can do it.

Double-ended queues

One of the main applications for a linked list in a language like C is a queue, where you put items in at one end, and take them out at the other. The Rust standard library has a data structure for that, VecDeque.

Five of the available libraries:

  • Have an API which is a subset of that of VecDeque: basically, pushing and popping elements at the front and back.
  • Have worse performance for most applications than VecDeque,
  • Are less mature, less available, less well tested, etc., than VecDeque, simply because VecDeque is in the Rust Standard Library.

For these you could, and should, just use VecDeque instead.
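For a plain queue, VecDeque is as simple as this (a trivial made-up example):

    use std::collections::VecDeque;

    fn main() {
        let mut queue = VecDeque::new();
        queue.push_back("first job");                        // enqueue at the back
        queue.push_back("second job");
        assert_eq!(queue.pop_front(), Some("first job"));    // dequeue at the front
    }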

The Cursor concept

A proper linked list lets you identify and hold onto an element in the middle of the list, and cheaply insert and remove elements there.

Rust’s ownership and borrowing rules make this awkward. One idea that people have many times reinvented and reimplemented, is to have a Cursor type, derived from the list, which is a reference to an element, and permits insertion and removal there.

Eight libraries have implemented this in the obvious way. However, there is a serious API limitation:

To prevent a cursor being invalidated (e.g. by deletion of the entry it points to) you can’t modify the list while the cursor exists. You can only have one cursor (that can be used for modification) at a time.

The practical effect of this is that you cannot retain cursors. You can make and use such a cursor for a particular operation, but you must dispose of it soon. Attempts to do otherwise will see you losing a battle with the borrow checker.

If that’s good enough, then you could just use a VecDeque and use array indices instead of the cursors. It’s true that deleting or adding elements in the middle involves a lot of copying, but your algorithm is O(n) even with the single-cursor list libraries, because it must first walk the cursor to the desired element.
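Here is a sketch of that approach, using a plain index where the cursor would have been (illustrative only):

    use std::collections::VecDeque;

    fn main() {
        let mut q: VecDeque<&str> = VecDeque::from(["a", "b", "d"]);
        q.insert(2, "c");          // insert at position 2; later elements are shifted up
        q.remove(0);               // delete by index; again O(n) copying
        assert!(q.iter().eq(["b", "c", "d"].iter()));
    }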

Formally, I believe any algorithm using these exclusive cursors can be rewritten, in an obvious way, to simply iterate and/or copy from the start or end (as one can do with VecDeque) without changing the headline O() performance characteristics.

IMO the savings available from avoiding extra copies etc. are not worth the additional dependency, unsafe code, and so on, especially as there are other ways of helping with that (e.g. boxing the individual elements).

Even if you don’t find that convincing, generational_token_list and dlv_list are strictly superior since they offer a more flexible and convenient API and better performance, and rely on much less unsafe code.

Rustic approaches to pointers-to-and-between-nodes data structures

Most of the time a VecDeque is great. But if you actually want to hold onto (perhaps many) references to the middle of the list, and later modify it through those references, you do need something more. This is a specific case of a general class of problems where the naive approach (use Rust references to the data structure nodes) doesn’t work well.

But there is a good solution:

Keep all the nodes in an array (a Vec<Option<T>> or similar) and use the index in the array as your node reference. This is fast, and quite ergonomic, and neatly solves most of the problems. If you are concerned that bare indices might cause confusion, as newly inserted elements would reuse indices, add a per-index generation count.
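A minimal sketch of the generational-index idea (not the API of any particular library):

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct NodeRef { index: usize, generation: u64 }

    struct Arena<T> {
        slots: Vec<(u64, Option<T>)>,    // (generation, payload); None means deleted
    }

    impl<T> Arena<T> {
        fn get(&self, r: NodeRef) -> Option<&T> {
            match self.slots.get(r.index) {
                Some((generation, Some(value))) if *generation == r.generation => Some(value),
                _ => None,               // slot reused or deleted: the reference is stale
            }
        }
    }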

These approaches have been neatly packaged up in libraries like slab, slotmap, generational-arena and thunderdome. And they have been nicely applied to linked lists by the authors of generational_token_list and dlv-list.

The alternative for nodey data structures in safe Rust: Rc/Arc

Of course, you can just use Rust’s “interior mutability” and reference counting smart pointers, to directly implement the data structure of your choice.

In many applications, a single-threaded data structure is fine, in which case Rc and Cell/RefCell will let you write safe code, with cheap refcount updates and runtime checks inserted to defend against unexpected aliasing, use-after-free, etc.
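The shape this takes is roughly the following (a sketch, not taken from rc-dlist-deque):

    use std::cell::RefCell;
    use std::rc::{Rc, Weak};

    struct Node<T> {
        value: T,
        next: Option<Rc<RefCell<Node<T>>>>,    // strong reference to the next node
        prev: Option<Weak<RefCell<Node<T>>>>,  // weak reference backwards, to avoid a cycle
    }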

I took this approach in rc-dlist-deque, because I wanted each node to be able to be on multiple lists.

Rust’s package ecosystem demonstrating software’s NIH problem

The Rust ecosystem is full of NIH libraries of all kinds. In my survey, there are: five good options; seven libraries which are plausible, but just not as good as the alternatives; and fourteen others.

There is a whole rant I could have about how the whole software and computing community is pathologically neophilic. Often we seem to actively resist reusing ideas, let alone code; and are ignorant and dismissive of what has gone before. As a result, we keep solving the same problems, badly - making the same mistakes over and over again. In some subfields, working software, or nearly working software, is frequently replaced with something worse, maybe more than once.

One aspect of this is a massive cultural bias towards rewriting rather than reusing, let alone fixing and using.

Many people can come out of a degree, trained to be a programmer, and have no formal training in selecting and evaluating software; this is even though working effectively with computers requires making good use of everyone else’s work.

If one isn’t taught these skills (when and how to search for prior art, how to choose between dependencies, and so on) one must learn it on the job. The result is usually an ad-hoc and unsystematic approach, often dominated by fashion rather than engineering.

The package naming paradox

The more experienced and competent programmer is aware of all the other options that exist - after all they have evaluated other choices before writing their own library.

So they will call their library something like generational_token_list or vecdeque-stableix.

Whereas the novice straight out of a pre-Rust programming course just thinks what they are doing is the one and only obvious thing (even though it’s a poor idea) and hasn’t even searched for a previous implementation. So they call their package something obvious like “linked list”.

As a result, the most obvious names seem to refer to the least useful libraries.


Edited 2022-11-16 23:55 UTC to update numbers of libraries in various categories following updates to the survey (including updates prompted by feedback received after this post first published).
diziet: (Default)

Debian does not officially support upgrading from earlier than the previous stable release: you’re not supposed to “skip” releases. Instead, you’re supposed to upgrade to each intervening major release in turn.

However, skipping intervening releases does, in fact, often work quite well. Apparently, this is surprising to many people, even Debian insiders. I was encouraged to write about it some more.

My personal experience

I have three conventionally-managed personal server systems (by which I mean systems which aren’t reprovisioned by some kind of automation). Of these at least two have been skip upgraded at least once:

The one I don’t think I’ve skip-upgraded (at least, not recently) is my house network manager (and now VM host) which I try to keep to a minimum in terms of functionality and which I keep quite up to date. It was crossgraded from i386 (32-bit) to amd64 (64-bit) fairly recently, which is a thing that Debian isn’t sure it supports. The crossgrade was done in a hurry and without any planning, prompted by Spectre et al suddenly requiring big changes to Xen. But it went well enough.

My home “does random stuff” server (media server, web cache, printing, DNS, backups etc.), has etckeeper records starting in 2015. I upgraded directly from jessie (Debian 8) to buster (Debian 10). I think it has probably had earlier skip upgrade(s): the oldest file in /etc is from December 1996 and I have been doing occasional skip upgrades as long as I can remember.

And of course there’s chiark, which is one of the oldest Debian installs in existence. I wrote about the most recent upgrade, where I went directly from jessie i386 ELTS (32-bit Debian 8) to bullseye amd64 (64-bit Debian 11). That was a very extreme case which required significant planning and pre-testing, since the package dependencies were in no way sufficient for the proper ordering. But, I don’t normally go to such lengths. Normally, even on chiark, I just edit the sources.list and see what apt proposes to do.

I often skip upgrade chiark because I tend to defer risky-looking upgrades partly in the hope of others fixing the bugs while I wait :-), and partly just because change is disruptive and amortising it is very helpful both to me and my users. I have some records of chiark’s upgrades from my announcements to users. As well as the recent “skip skip up cross grade, direct”, I definitely did a skip upgrade of chiark from squeeze (Debian 6) to jessie (Debian 8). It appears that the previous skip upgrade on chiark was rex (Debian 1.2) to hamm (Debian 2.0).

I don’t think it’s usual for me to choose to do a multi-release upgrade the “officially supported” way, in two (or more) stages, on a server. I have done that on systems with a GUI desktop setup, but even then I usually skip the intermediate reboot(s).

When to skip upgrade (and what precautions to take)

I’m certainly not saying that everyone ought to be doing this routinely. Most users with a Debian install that is older than oldstable probably ought to reinstall it, or do the two-stage upgrade.

Skip upgrading almost always runs into some kind of trouble (albeit, usually trouble that isn’t particularly hard to fix if you know what you’re doing).

However, officially supported non-skip upgrades go wrong too. Doing a two-or-more-releases upgrade via the intermediate releases can expose you to significant bugs in the intermediate releases, which were later fixed. Because Debian’s users and downstreams are cautious, and Debian itself can be slow, it is common for bugs to appear for one release and then be fixed only in the next. Paradoxically, this seems to be especially true with the kind of big and scary changes where you’d naively think the upgrade mechanisms would break if you skipped the release where the change first came in.

I would not recommend a skip upgrade to someone who is not a competent Debian administrator, with good familiarity with Debian package management, including use of dpkg directly to fix things up. You should have a mental toolkit of manual bug workaround techniques. I should always make sure that I have rescue media (and in the case of a remote system, full remote access including ability to boot a different image), although I don’t often need it.

And, when considering a skip upgrade, you should be aware of the major changes that have occurred in Debian.

Skip upgrading is more likely to be a good idea with a complex and highly customised system: a fairly vanilla install is not likely to encounter problems during a two-stage update. (And, a vanilla system can be “upgraded” by reinstalling.)

I haven’t recently skip upgraded a laptop or workstation. I doubt I would attempt it; modern desktop software seems to take a much harder line about breaking things that are officially unsupported, and generally trying to force everyone into the preferred mold.

A request to Debian maintainers

I would like to encourage Debian maintainers to defer removing upgrade compatibility machinery until it is actually getting in the way, or has become hazardous, or many years obsolete.

Examples of the kinds of things which it would be nice to keep, and which do not usually cause much inconvenience to retain, are dependency declarations (particularly, alternatives), and (many) transitional fragments in maintainer scripts.

If you find yourself needing to either delete some compatibility feature, or refactor/reorganise it, I think it is probably best to delete it. If you modify it significantly, the resulting thing (which won’t be tested until someone uses it in anger) is quite likely to have bugs which make it go wrong more badly (or, more confusingly) than the breakage that would happen without it.

Obviously this is all a judgement call.

I’m not saying Debian should formally “support” skip upgrades, to the extent of (further) slowing down important improvements. Nor am I asking for any change to the routine approach to (for example) transitional packages (i.e. the technique for ensuring continuity of installation when a package name changes).

We try to make release upgrades work perfectly; but skip upgrades don’t have to work perfectly to be valuable. Retaining compatibility code can also make it easier to provide official backports, and it probably helps downstreams with different release schedules.

The fact that maintainers do in practice often defer removing compatibility code provides useful flexibility and options to at least some people. So it would be nice if you’d at least not go out of your way to break it.

Building on older releases

I would also like to encourage maintainers to provide source packages in Debian unstable that will still build on older releases, where this isn’t too hard and the resulting binaries might be basically functional.

Speaking personally, it’s not uncommon for me to rebuild packages from unstable and install them on much older releases. This is another thing that is not officially supported, but which often works well.

I’m not saying to contort your build system, or delay progress. You’ll definitely want to depend on a recentish debhelper. But, for example, retaining old build-dependency alternatives is nice. Retaining them doesn’t constitute a promise that it works - it just makes life slightly easier for someone who is going off piste.

If you know you have users on multiple distros or multiple releases, and wish to fully support them, you can go further, of course. Many of my own packages are directly buildable, or even directly installable, on older releases.

diziet: (Default)

I have released version 1.0.0 of Hippotat, my IP-over-HTTP system. To quote the README:

You’re in a cafe or a hotel, trying to use the provided wifi. But it’s not working. You discover that port 80 and port 443 are open, but the wifi forbids all other traffic.

Never mind, start up your hippotat client. Now you have connectivity. Your VPN and SSH and so on run over Hippotat. The result is not very efficient, but it does work.

Story

In early 2017 I was in a mountaintop cafeteria, hoping to do some work on my laptop. (For Reasons I couldn’t go skiing that day.) I found that the local wifi was badly broken: it had a severe port block. I had to use my port 443 SSH server to get anywhere. My usual arrangements punt everything over my VPN, which uses UDP of course, and I had to bodge several things. Using a web browser directly on the wifi worked normally, of course - otherwise the other guests would have complained. This was not the first experience like this I’d had, but this time I had nothing much else to do but fix it.

In a few furious hacking sessions, I wrote Hippotat, a tool for making my traffic look enough like “ordinary web browsing” that it gets through most stupid firewalls. That Python version of Hippotat served me well for many years, despite being rather shonky, extremely inefficient in CPU (and therefore battery) terms and not very productised.

But recently things have started to go wrong. I was using Twisted Python and there was what I think must be some kind of buffer handling bug, which started happening when I upgraded the OS (getting newer versions of Python and the Twisted libraries). The Hippotat code, and the Twisted APIs, were quite convoluted, and I didn’t fancy debugging it.

So last year I rewrote it in Rust. The new Rust client did very well against my existing servers. To my shame, I didn’t get around to releasing it.

However, more recently I upgraded the server hosts my Hippotat daemons run on to recent Debian releases. They started to be affected by the bug too, rendering my Rust client unuseable. I decided I had to deploy the Rust server code.

This involved some packaging work. Having done that, it’s time to release it: Hippotat 1.0.0 is out.

The package build instructions are rather strange

My usual approach to releasing something like this would be to provide a git repository containing a proper Debian source package. I might also build binaries, using sbuild, and I would consider actually uploading to Debian.

However, despite me taking a fairly conservative approach to adding dependencies to Hippotat, still a couple of the (not very unusual) Rust packages that Hippotat depends on are not in Debian. Last year I considered tackling this head-on, but I got derailed by difficulties with Rust packaging in Debian.

Furthermore, the version of the Rust compiler itself in Debian stable is incapable of dealing with recent versions of very many upstream Rust packages, because many packages’ most recent versions now require the 2021 Edition of Rust. Sadly, Rust’s package manager, cargo, has no mechanism for trying to choose dependency versions that are actually compatible with the available compiler; efforts to solve this problem have still not borne the needed fruit.

The result is that, in practice, currently Hippotat has to be built with (a) a reasonably recent Rust toolchain such as found in Debian unstable or obtained from Rust upstream; (b) dependencies obtained from the upstream Rust repository.

At least things aren’t completely terrible: Rustup itself, despite its alarming install rune, has a pretty good story around integrity, release key management and so on. And with the right build rune, cargo will check not just the versions, but the precise content hashes, of the dependencies to be obtained from crates.io, against the information I provide in the Cargo.lock file. So at least when you build it you can be sure that the dependencies you’re getting are the same ones I used myself when I built and tested Hippotat. And there’s only 147 of them (counting indirect dependencies too), so what could possibly go wrong?

Sadly the resulting package build system cannot work with Debian’s best tool for doing clean and controlled builds, sbuild. Under the circumstances, I don’t feel I want to publish any binaries.

diziet: (Default)

The problem I had - Mason, so, sadly, FastCGI

Since the update to current Debian stable, the website for YARRG, (a play-aid for Puzzle Pirates which I wrote some years ago), started to occasionally return “Internal Server Error”, apparently due to bug(s) in some FastCGI libraries.

I was using FastCGI because the website is written in Mason, a Perl web framework, and I found that Mason CGI calls were slow. I’m using CGI - yes, trad CGI - via userv-cgi. Running Mason this way would “compile” the template for each HTTP request just when it was rendered, and then throw the compiled version away. The more modern approach of an application server doesn’t scale well to a system which has many web “applications” most of which are very small. The admin overhead of maintaining a daemon, and corresponding webserver config, for each such “service” would be prohibitive, even with some kind of autoprovisioning setup. FastCGI has an interpreter wrapper which seemed like it ought to solve this problem, but it’s quite inconvenient, and often flaky.

I decided I could do better, and set out to eliminate FastCGI from my setup. The result seems to be a success; once I’d done all the hard work of writing prefork-interp, I found the result very straightforward to deploy.

prefork-interp

prefork-interp is a small C program which wraps a script, plus a scripting language library to cooperate with the wrapper program. Together they achieve the following:

  • Startup cost of the script (loading modules it uses, precomputations, loading and processing of data files, etc.) is paid once, and reused for subsequent invocations of the same script.

  • Minimal intervention to the script source code:
    • one new library to import
    • one new call to make from that library, right after the script initialisation is complete
    • change to the #! line.
  • The new “initialisation complete” call turns the program into a little server (a daemon), and then returns once for each actual invocation, each time in a fresh grandchild process.

Features:

  • Seamless reloading on changes to the script source code (automatic, and configurable).

  • Concurrency limiting.

  • Options for distinguishing different configurations of the same script so that they get a server each.

  • You can run the same script standalone, as a one-off execution, as well as under prefork-interp.

  • Currently, a script-side library is provided for Perl. I’m pretty sure Python would be fairly straightforward.

Important properties not always satisfied by competing approaches:

  • Error output (stderr) and exit status from both phases of the script code execution faithfully reproduced to the calling context. Environment, arguments, and stdin/stdout/stderr descriptors, passed through to each invocation.

  • No polling, other than a long-term idle timeout, so good on laptops (or phones).

  • Automatic lifetime management of the per-script server, including startup and cleanup. No integration needed with system startup machinery: No explicit management of daemons, init scripts, systemd units, cron jobs, etc.

  • Useable right away without fuss for CGI programs but also for other kinds of program invocation.

  • (I believe) reliable handling of unusual states arising from failed invocations or races.

Swans paddling furiously

The implementation is much more complicated than the (apparent) interface.

I won’t go into all the details here (there are some terrifying diagrams in the source code if you really want), but some highlights:

We use an AF_UNIX socket (hopefully in /run/user/UID, but in ~ if not) for rendezvous. We can try to connect without locking, but we must protect the socket with a separate lockfile to avoid two concurrent restart attempts.

We want stderr from the script setup (pre-initialisation) to be delivered to the caller, so the script ought to inherit our stderr and then will need to replace it later. Twice, in fact, because the daemonic server process can’t have a stderr.

When a script is restarted for any reason, any old socket will be removed. We want the old server process to detect that and quit. (If it hung about, it would wait for the idle timeout; if this happened a lot - eg, a constantly changing set of services - we might end up running out of pids or something.) Spotting the socket disappearing, without polling, involves use of a library capable of using inotify (or the equivalent elsewhere). Choosing a C library to do this is not so hard, but portable interfaces to this functionality can be hard to find in scripting languages, and also we don’t want every language binding to have to reimplement these checks. So for this purpose there’s a little watcher process, and associated IPC.

When an invoking instance of prefork-interp is killed, we must arrange for the executing service instance to stop reading from its stdin (and, ideally, writing its stdout). Otherwise it’s stealing input from prefork-interp’s successors (maybe the user’s shell)!

Cleanup ought not to depend on positive actions by failing processes, so each element of the system has to detect failures of its peers by means such as EOF on sockets/pipes.

Obtaining prefork-interp

I put this new tool in my chiark-utils package, which is a collection of useful miscellany. It’s available from git.

Currently I make releases by uploading to Debian, where prefork-interp has just hit Debian unstable, in chiark-utils 7.0.0.

Support for other scripting languages

I would love Python to be supported. If any pythonistas reading this think you might like to help out, please get in touch. The specification for the protocol, and what the script library needs to do, is documented in the source code.

Future plans for chiark-utils

chiark-utils as a whole is in need of some tidying up of its build system and packaging.

I intend to try to do some reorganisation. Currently I think it would be better to organise the source tree more strictly, with a directory for each included facility, rather than grouping “compiled” and “scripts” together.

The Debian binary packages should be reorganised more fully according to their dependencies, so that installing a program will ensure that it works.

I should probably move the official git repo from my own git+gitweb to a forge (so we can have MRs and issues and so on).

And there should be a lot more testing, including Debian autopkgtests.

edited 2022-08-23 10:30 +01:00 to improve the formatting
diziet: (Default)

Background

Internet email is becoming more reliant on DKIM, a scheme for having mail servers cryptographically sign emails. The Big Email providers have started silently spambinning messages that lack either DKIM signatures, or SPF. DKIM is arguably less broken than SPF, so I wanted to deploy it.

But it has a problem: if done in a naive way, it makes all your emails non-repudiable, forever. This is not really a desirable property - at least, not desirable for you, although it can be nice for someone who (for example) gets hold of leaked messages obtained by hacking mailboxes.

This problem was described at some length in Matthew Green’s article Ok Google: please publish your DKIM secret keys. Following links from that article does get you to a short script to achieve key rotation but it had a number of problems, and wasn’t useable in my context.

dkim-rotate

So I have written my own software for rotating and revoking DKIM keys: dkim-rotate.

I think it is a good solution to this problem, and it ought to be deployable in many contexts (and readily adaptable to those it doesn’t already support).

Here’s the feature list taken from the README:

  • Leaked emails become unattestable (plausibly deniable) within a few days — soon after the configured maximum email propagation time.

  • Mail domain DNS configuration can be static, and separated from operational DKIM key rotation. Domain owner delegates DKIM configuration to mailserver administrator, so that dkim-rotate does not need to edit your mail domain’s zone.

  • When a single mail server handles multiple mail domains, only a single dkim-rotate instance is needed.

  • Supports situations where multiple mail servers may originate mails for a single mail domain.

  • DNS zonefile remains small; old keys are published via a webserver, rather than DNS.

  • Supports runtime (post-deployment) changes to tuning parameters and configuration settings. Careful handling of errors and out-of-course situations.

  • Convenient outputs: a standard DNS zonefile; easily parseable settings for the MTA; and, a directory of old keys directly publishable by a webserver.

Complications

It seems like it should be a simple problem. Keep N keys, and every day (or whatever), generate and start using a new key, and deliberately leak the oldest private key.

But, things are more complicated than that. Considerably more complicated, as it turns out.

I didn’t want the DKIM key rotation software to have to edit the actual DNS zones for each relevant mail domain. That would tightly entangle the mail server administration with the DNS administration, and there are many contexts (including many of mine) where these roles are separated.

The solution is to use DNS aliases (CNAME). But, now we need a fixed, relatively small, set of CNAME records for each mail domain. That means a fixed, relatively small set of key identifiers (“selectors” in DKIM terminology), which must be used in rotation.

We don’t want the private keys to be published via the DNS because that makes an ever-growing DNS zone, which isn’t great for performance; and, because we want to place barriers in the way of processes which might enumerate the set of keys we use (and the set of keys we have leaked) and keep records of what status each key had when. So we need a separate publication channel - for which a webserver was the obvious answer.

We want the private keys to be readily noticeable and findable by someone who is verifying an alleged leaked email dump, but to be hard to enumerate. (One part of the strategy for this is to leave a note about it, with the prospective private key url, in the email headers.)

The key rotation operations are more complicated than first appears, too. The short summary, above, neglects to consider the fact that DNS updates have a nonzero propagation time: if you change the DNS, not everyone on the Internet will experience the change immediately. So as well as a timeout for how long it might take an email to be delivered (ie, how long the DKIM signature remains valid), there is also a timeout for how long to wait after updating the DNS, before relying on everyone having got the memo. (This same timeout applies both before starting to sign emails with a new key, and before deliberately compromising a key which has been withdrawn and deadvertised.)
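
As a rough worked example (the numbers are invented, not dkim-rotate’s defaults, and the real sequencing is more careful than simply adding the two delays): if DNS changes are assumed to propagate within a day, and emails to be delivered within three, then a key which has stopped signing mail should not be deliberately compromised until both windows have passed:

dns_propagation=$(( 1 * 86400 ))     # assume: 1 day for DNS caches to expire
email_propagation=$(( 3 * 86400 ))   # assume: 3 days for mail still in transit
last_signed=$(date -d '2022-08-01 00:00 UTC' +%s)
echo "do not leak before:" \
    "$(date -u -d @$(( last_signed + dns_propagation + email_propagation )))"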

Updating the DNS, and the MTA configuration, are fallible operations. So we need to cope with out-of-course situations, where a previous DNS or MTA update failed. In that case, we need to retry the failed update, and not proceed with key rotation. We mustn’t start the timer for the key rotation until the update has been implemented.

The rotation script will usually be run by cron, but it might be run by hand, and when it is run by hand it ought not to “jump the gun” and do anything “too early” (ie, before the relevant timeout has expired). cron jobs don’t always run, and don’t always run at precisely the right time. (And there’s daylight saving time, to consider, too.)

So overall, it’s not sufficient to drive the system via cron and have it proceed by one unit of rotation on each run.

And, hardest of all, I wanted to support post-deployment configuration changes, while continuing to keep the whole system operational. Otherwise, you have to bake in all the timing parameters right at the beginning and can’t change anything ever. So for example, I wanted to be able to change the email and DNS propagation delays, and even the number of selectors to use, without adversely affecting the delivery of already-sent emails, and without having to shut anything down.

I think I have solved these problems.

The resulting system is one which keeps track of the timing constraints, and the next event which might occur, on a per-key basis. It calculates, on each run, which key(s) can be advanced to the next stage of their lifecycle, and performs the appropriate operations. The regular key update schedule is then an emergent property of the config parameters and cron job schedule. (I provide some example config.)

Exim

Integrating dkim-rotate itself with Exim was fairly easy. The lsearch lookup function can be used to fish information out of a suitable data file maintained by dkim-rotate.
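
For example, Exim’s expansion tester can be used to poke at a lookup of this general shape (the file name and line format here are invented for illustration; they are not dkim-rotate’s actual output format):

cat >/tmp/dkim-demo.lsearch <<'END'
example.org: selector=d0; keyfile=/var/lib/dkim-demo/d0.example.org.pem
END
exim4 -be '${lookup{example.org}lsearch{/tmp/dkim-demo.lsearch}}'
# (Debian's binary is exim4; elsewhere it may just be exim.)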

But a final awkwardness was getting Exim to make the right DKIM signatures, at the right time.

When making a DKIM signature, one must choose a signing authority domain name: who should we claim to be? (This is the “SDID” in DKIM terms.) A mailserver that handles many different mail domains will be able to make good signatures on behalf of many of them. It seems to me that this domain ought to be the mail domain in the From: header of the email. (The RFC doesn’t seem to be clear on what is expected.) Exim doesn’t seem to have anything built in to do that.
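
The building blocks do exist in Exim’s string expansion language, though. For instance, the domain can be extracted from an address, which is easy to try with the expansion tester (in a real transport one would feed it $h_from: rather than a literal, and worry about unparseable or multi-address headers):

exim4 -be '${domain:${address:Ian Jackson <ijackson@example.org>}}'
# prints: example.org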

And, you only want to DKIM-sign emails that are originated locally or from trustworthy sources. You don’t want to DKIM-sign messages that you received from the global Internet, and are sending out again (eg because of an email alias or mailing list). In theory, if you verify DKIM on all incoming emails, you could avoid being fooled into signing bad emails, but rejecting all non-DKIM-verified email would be a very strong policy decision. Again, Exim doesn’t seem to have ready-made machinery for this.

The resulting Exim configuration parameters run to 22 lines, and because they’re parameters to an existing config item (the smtp transport) they can’t even easily be deployed as a drop-in file via Debian’s “split config” Exim configuration scheme.

(I don’t know if the file written for Exim’s use by dkim-rotate would be suitable for other MTAs, but this part of dkim-rotate could easily be extended.)

Conclusion

I have today released dkim-rotate 0.4, which is the first public release for general use.

I have it deployed and working, but it’s new so there may well be bugs to work out.

If you would like to try it out, you can get it via git from Debian Salsa. (Debian folks can also find it freshly in Debian unstable.)

diziet: (Default)

Two weeks ago I upgraded chiark from Debian jessie i386 to bullseye amd64, after nearly 30 years running Debian i386. This went really quite well, in fact!

Background

chiark is my “colo” - a server I run, which lives in a data centre in London. It hosts ~200 users with shell accounts, various websites and mailing lists, moderators for a number of USENET newsgroups, and countless other services. chiark’s internal setup is designed to enable my users to do a maximum number of exciting things with a minimum of intervention from me.

chiark’s OS install dates to 1993, when I installed Debian 0.93R5, the first version of Debian to advertise the ability to be upgraded without reinstalling. I think that makes it one of the oldest Debian installations in existence.

Obviously it’s had several new hardware platforms too. (There was a prior install of Linux on the initial hardware, remnants of which can maybe still be seen in some obscure corners of chiark’s /usr/local.)

chiark’s install is also at the very high end of the installation complexity, and customisation, scale: reinstalling it completely would be an enormous amount of work. And it’s unique.

chiark’s upgrade history

chiark’s last major OS upgrade was to jessie (Debian 8, released in April 2015). That was in 2016. Since then we have been relying on Debian’s excellent security support posture, the Debian LTS and more recently Freexian’s Debian ELTS projects, and some local updates. The use of ELTS - which supports only a subset of packages - was particularly uncomfortable.

Additionally, chiark was installed with 32-bit x86 Linux (Debian i386), since that was what was supported and available at the time. But 32-bit is looking very long in the tooth.

Why do a skip upgrade

So, I wanted to move to the fairly recent stable release - Debian 11 (bullseye), which is just short of a year old. And I wanted to “crossgrade” (as it’s called) to 64-bit.

In the past, I have found I have had greater success by doing “direct” upgrades, skipping intermediate releases, rather than by following the officially-supported path of going via every intermediate release.

Doing a skip upgrade avoids exposure to any packaging bugs which were present only in intermediate release(s). Debian does usually fix bugs, but Debian has many cautious users, so it is not uncommon for bugs to be found after release, and then not be fixed until the next one.

A skip upgrade avoids the need to try to upgrade to already-obsolete releases (which can involve messing about with multiple snapshots from snapshot.debian.org). It is also significantly faster and simpler, which is important not only because it reduces downtime, but also because it removes opportunities (and reduces the time available) for things to go badly.

One downside is that sometimes maintainers aggressively remove compatibility measures for older releases. (And compatibility packages are generally removed quite quickly by even cautious maintainers.) That means that the sysadmin who wants to skip-upgrade needs to do more manual fixing of things that haven’t been dealt with automatically. And occasionally one finds compatibility problems that show up only when mixing very old and very new software, that no-one else has seen.

Crossgrading

Crossgrading is fairly complex and hazardous. It is well supported by the low level tools (eg, dpkg) but the higher-level packaging tools (eg, apt) get very badly confused.

Nowadays the system is so complex that downloading things by hand and manually feeding them to dpkg is impractical, other than as a very occasional last resort.

The approach, generally, has been to set the system up to “want to” be the new architecture, run apt in a download-only mode, and do the package installation manually, with some fixing up and retrying, until the system is coherent enough for apt to work.
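
In very rough outline, the shape of it is something like this. This is not my actual recipe, and it is emphatically not something to paste into a live system:

dpkg --add-architecture amd64          # declare the architecture we want to become
dpkg --print-foreign-architectures     # check that dpkg agrees
apt-get update
# fetch, but do not yet install, the new-architecture core packages
apt-get --download-only install dpkg:amd64 apt:amd64 libc6:amd64
# then feed the downloaded .debs to dpkg by hand, in a carefully chosen
# order, fixing up and retrying until apt is willing to take over
dpkg --install /var/cache/apt/archives/dpkg_*_amd64.deb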

This is the approach I took. (In current releases, there are tools that will help but they are only in recent releases and I wanted to go direct. I also doubted that they would work properly on chiark, since it’s so unusual.)

Peril and planning

Overall, this was a risky strategy to choose. The package dependencies wouldn’t necessarily express all of the sequencing needed. But it still seemed that if I could come up with a working recipe, I could do it.

I restored most of one of chiark’s backups onto a scratch volume on my laptop. With the LVM snapshot tools and chroots, I was able to develop and test a set of scripts that would perform the upgrade. This was a very effective approach: my super-fast laptop, with local caches of the package repositories, was able to do many “edit, test, debug” cycles.
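
The testing loop looked roughly like this (the volume and mount point names are invented, and the real scripts did rather more checking):

lvcreate --snapshot --size 20G --name chiark-test /dev/vg-lab/chiark-restore
mkdir -p /mnt/chiark-test
mount /dev/vg-lab/chiark-test /mnt/chiark-test
chroot /mnt/chiark-test /bin/bash      # run the candidate upgrade steps in here
# ...inspect the results, then throw the whole experiment away:
umount /mnt/chiark-test
lvremove --yes /dev/vg-lab/chiark-test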

My recipe made heavy use of snapshot.debian.org, to make sure that it wouldn’t rot between testing and implementation.
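
Concretely, that means pointing apt at a fixed snapshot timestamp rather than at the moving archive, something like this (the timestamp is just an example):

cat >/etc/apt/sources.list.d/bullseye-snapshot.list <<'END'
deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20220701T000000Z/ bullseye main
END
apt-get update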

When I had a working scheme, I told my users about the planned downtime. I warned everyone it might take even 2 or 3 days. I made sure that my access arrangements to the data centre were in place, in case I needed to visit in person. (I have remote serial console and power cycler access.)

Reality - the terrible rescue install

My first task on taking the service down was to check that the emergency rescue installation worked: chiark has an ancient USB stick in the back, which I can boot to from the BIOS. The idea is that many things that go wrong could be repaired from there.

I found that that install was too old to understand chiark’s storage arrangements. mdadm tools gave very strange output. So I needed to upgrade it. After some experiments, I rebooted back into the main install, bringing chiark’s service back online.

I then used the main install of chiark as a kind of meta-rescue-image for the rescue-image. The process of getting the rescue image upgraded (not even to amd64, but just to something not totally ancient) was fraught. Several times I had to rescue it by copying files in from the main install outside. And, the rescue install was on a truly ancient 2G USB stick which was terribly terribly slow, and also very small.

I hadn’t done any significant planning for this subtask, because it was low-risk: there was little way to break the main install. Due to all these adverse factors, sorting out the rescue image took five hours.

If I had known how long it would take, at the beginning, I would have skipped it. 5 hours is more than it would have taken to go to London and fix something in person.

Reality - the actual core upgrade

I was able to start the actual upgrade in the mid-afternoon. I meticulously checked and executed the steps from my plan.

The terrifying scripts which sequenced the critical package updates ran flawlessly. Within an hour or so I had a system which was running bullseye amd64, albeit with many important packages still missing or unconfigured.

So I didn’t need the rescue image after all, nor to go to the datacentre.

Fixing all the things

Then I had to deal with all the inevitable fallout from an upgrade.

Notable incidents:

exim4 has a new tainting system

This is to try to help the sysadmin avoid writing unsafe string interpolations. (“Little Bobby Tables.”) This was done by Exim upstream in a great hurry as part of a security response process.

The new checks meant that the mail configuration did not work at all. I had to turn off the taint check completely. I’m fairly confident that this is correct, because I am hyper-aware of quoting issues and all of my configuration is written to avoid the problems that tainting is supposed to avoid.

One particular annoyance is that the approach taken for sqlite lookups makes it totally impossible to use more than one sqlite database. I think the sqlite quoting operator which one uses to interpolate values produces tainted output? I need to investigate this properly.

LVM now ignores PVs which are directly contained within LVs by default

chiark has LVM-on-RAID-on-LVM. This generally works really well.

However, there was one edge case where I ended up without the intermediate RAID layer. The result is LVM-on-LVM.

But recent versions of the LVM tools do not look at PVs inside LVs, by default. This is to help you avoid corrupting the state of any VMs you have on your system. I didn’t know that at the time, though. All I knew was that LVM was claiming my PV was “unusable”, and wouldn’t explain why.

I was about to start on a thorough reading of the 15,000-word essay that is the commentary in the default /etc/lvm/lvm.conf to try to see if anything was relevant, when I received a helpful tipoff on IRC pointing me to the scan_lvs option.
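
For the record, the setting lives in the devices section of /etc/lvm/lvm.conf, and its effective value can be queried directly rather than hunted for in the commentary:

lvmconfig devices/scan_lvs     # 0 means PVs inside LVs are ignored
# to change it, in /etc/lvm/lvm.conf:
#   devices {
#       scan_lvs = 1
#   }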

I need to file a bug asking for the LVM tools to explain why they have declared a PV unusable.

apache2’s default config no longer read one of my config files

I had to do a merge (of my changes vs the maintainers’ changes) for /etc/apache2/apache2.conf. When doing this merge I failed to notice that the file /etc/apache2/conf.d/httpd.conf was no longer included by default. My merge dropped that line. There were some important things in there, and until I found this the webserver was broken.
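
For reference, re-adding such an include is a one-liner of roughly this shape (plus a config check before reloading):

echo 'IncludeOptional /etc/apache2/conf.d/*.conf' >> /etc/apache2/apache2.conf
apache2ctl configtest && systemctl reload apache2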

dpkg --skip-same-version DTWT during a crossgrade

(This is not a “fix all the things” - I found it when developing my upgrade process.)

When doing a crossgrade, one often wants to say to dpkg “install all these things, but don’t reinstall things that have already been done”. That’s what --skip-same-version is for.

However, the logic had not been updated as part of the work to support multiarch, so it was wrong. I prepared a patched version of dpkg, and inserted it in the appropriate point in my prepared crossgrade plan.

The patch is now filed as bug #1014476 against dpkg upstream.
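
For concreteness, the kind of invocation where this bites is (paths illustrative):

dpkg --skip-same-version --install /var/cache/apt/archives/*_amd64.deb
# as I understand it, the pre-patch check compared only package name and
# version, so an already-installed foo:i386 at the same version could cause
# foo:amd64 to be skipped, which is the opposite of what a crossgrade needs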

Mailman

Mailman is no longer in bullseye. It’s only available in the previous release, buster.

bullseye has Mailman 3 which is a totally different system - requiring basically, a completely new install and configuration. To even preserve existing archive links (a very important requirement) is decidedly nontrivial.

I decided to punt on this whole situation. Currently chiark is running buster’s version of Mailman. I will have to deal with this at some point and I’m not looking forward to it.

Python

Of course, that Mailman is Python 2. The Python project’s extremely badly handled transition includes a recommendation to change the meaning of #!/usr/bin/python from Python 2 to Python 3.

But Python 3 is a new language, barely compatible with Python 2 even in the most recent iterations of both, and it is usual to need to coinstall them.

Happily Debian have provided the python-is-python2 package to make things work sensibly, albeit with unpleasant imprecations in the package summary description.
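
That is, after installing the compatibility package, scripts with a #!/usr/bin/python line keep getting Python 2:

apt-get install python-is-python2
readlink -f /usr/bin/python     # resolves to /usr/bin/python2.7 on bullseye, I believe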

USENET news

Oh my god. INN uses many non-portable data formats, which just depend on your C types. And there are complicated daemons, statically linked libraries which cache on-disk data, and much to go wrong.

I had numerous problems with this, and several outages and malfunctions. I may write about that on a future occasion.

(edited 2022-07-20 11:36 +01:00 and 2022-07-30 12:28+01:00 to fix typos)
diziet: (Default)

I have just released Otter 1.0.0.

Recap: what is Otter

Otter is my game server for arbitrary board games. Unlike most online game systems, it does not know (nor does it need to know) the rules of the game you are playing. Instead, it lets you and your friends play with common tabletop/boardgame elements such as hands of cards, boards, and so on. So it’s something like a “tabletop simulator” (but it does not have any 3D, or a physics engine, or anything like that).

There are provided game materials and templates for Penultima, Mao, and card games in general.

Otter also supports uploadable game bundles, which allows users to add support for additional games - and this can be done without programming.

For more information, see the online documentation. There are a longer intro and some screenshots in my 2021 introductory blog post about Otter.

Releasing 1.0.0

I’m calling this release 1.0.0 because I think I can now say that its quality, reliability and stability is suitable for general use. In particular, Otter now builds on Stable Rust, which makes it a lot easier to install and maintain.

Switching web framework, and async Rust

I switched Otter from the Rocket web framework to Actix. There are things to prefer about both systems, and I still have a soft spot for Rocket. But ultimately I needed a framework which was fully released and supported for use with Stable Rust.

There are few if any Rust web frameworks that are not async. This is rather a shame. Async Rust is a considerably more awkward programming environment than ordinary non-async Rust. I don’t want to digress into a litany of complaints, but suffice it to say that while I really love Rust, my views on async Rust are considerably more mixed.

Future plans

In the near future I plan to add a couple of features to better support some particular games: currency-like resources, and a better UI for dice-like randomness.

In the longer term, Otter’s installation and account management arrangements are rather unsophisticated and un-webby. There is not currently any publicly available instance for you to try it out without installing it on a machine of your own. There are not even any provided binaries: you must build Otter yourself. I hope to be able to improve this situation, but it involves dealing with cloud CI and containers and so on, which can all be rather unpleasant.

Users on chiark will find an instance of Otter there.
