Tuesday, November 23, 2021

Stability and versioning: Lock yourself in at your own peril.

I usually write about Rust, but today I want to discuss something broader: flexibility in software, particularly in design and architecture. To begin, let's focus briefly on versioning.

Versioning

Prior to semantic versioning, version numbering conventions used in software releases were nearly as numerous as the releases themselves. Some had implied meaning for releases with odd vs even numbers, some would skip numbers arbitrarily, some would use letters and other non-numeric characters, and some would have unique counting strategies. Semantic versioning provided some much-needed consistency.

Consistency, however, is not the ultimate goal. You might imagine some issues with consistency itself; is it better to have some inconsistencies (mix the good with the bad), or just be consistently bad? One of the most important goals for a software maintainer is to provide repeatability; in other words, consistency within your build process and testing infrastructure. Same word, just scoped differently. Versioning, when used appropriately, can help increase the repeatability of your builds and tests.

So then, let's define "appropriate versioning". If we follow SemVer closely, then it should not be possible to introduce breaking changes in our software that will negatively impact downstream dependents. If a mistake is made and a breaking change goes out in a version that is not a semantically breaking version, then we have to do the right thing and unpublish/yank the semantically incorrect version. That's all there really is to it. I'll leave you to read the SemVer spec for the precise rules to follow, but this is all that one needs to appropriately version their software.
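
To make that concrete, here is a small illustrative sketch; the crate and functions are made up, but the version bumps follow the SemVer rules described above:

// A hypothetical crate `geometry`, currently published as 1.2.3, with this
// public API:
pub fn area(width: u32, height: u32) -> u64 {
    u64::from(width) * u64::from(height)
}

// Adding a new public item is backwards compatible, so it ships as 1.3.0
// (a minor release).
pub fn perimeter(width: u32, height: u32) -> u64 {
    2 * (u64::from(width) + u64::from(height))
}

// Changing `area` to take `f64` instead of `u32` would break every caller,
// so it may only ship as 2.0.0 (a major release). If it accidentally goes
// out as 1.3.1, the right move is to yank 1.3.1 and republish as 2.0.0.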

Stability

This term, "stability", is ironically such a fragile concept. For something to be stable, it must be unchanging. For something to be unchangeable, it must be perfect. Very few things are universally perfect. I like to use science to illustrate this point: the sciences are ever-evolving; knowledge is not static but accumulates and is refined over time precisely because it didn't start out perfect. It is fitting that computer science is one of these fluidly changing organizations of knowledge.

When software engineers think of stability, they are usually looking at it in the context of any of the following layers of abstraction:

  • API stability. This is what SemVer addresses directly.
  • ABI stability. The idea that binaries compiled long ago (or by different toolchains) should work transparently with binaries compiled today.
  • Resilience. Resistance to changing behavior or breaking compatibility, but also resistance to logic bugs and the like.

Resilience in service-oriented software is a very interesting topic of ongoing research, but we can frame resilience in the terms described above: namely, resistance to logic bugs, since changes in behavior or interface are better addressed through tools like SemVer.

So, what do we know about logic bugs in stable interfaces? The first thing that comes to my mind is in the C standard library. It's our old friend, the null-terminated string and its wild band of merry havoc-wreaking unsafe functions! There is good reason this has been dubbed the most expensive one-byte mistake. Now just because C has null-terminated strings and its standard library supports them doesn't mean you have to use these abominations even if you write code purely in C. But the sheer support for them in the wider ecosystem sure does make it harder to integrate various libraries when you are using any of the safe alternatives. Not to mention the fragmentation that each of those contributes to.
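
To see how that ecosystem support bleeds into even the safe alternatives, here is a small self-contained sketch (in Rust) of the ceremony that std::ffi imposes just to cross between length-delimited and null-terminated strings:

use std::ffi::{CStr, CString};

fn main() {
    // Rust strings are length-delimited, so handing one to a C API means
    // an allocation, a copy, and a check for interior NUL bytes.
    let owned = CString::new("hello, world").expect("interior NUL byte");

    // Going the other way, we must verify that the bytes really are
    // null-terminated before treating them as a C string.
    let borrowed = CStr::from_bytes_with_nul(b"hello, world\0")
        .expect("missing or misplaced NUL terminator");

    assert_eq!(owned.as_c_str(), borrowed);
}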

We are left with a stable interface whose intractable design makes it almost useless in practice. It may as well not exist at all! De facto, we'd be better off without null-terminated strings. But in some sense, we are stuck with them for all time, and there is nothing that I, you, or anyone else can do about it. This is a highly important lesson about stabilization.

There is another term for stability, often used with critical connotation:

Ossification

No doubt, many will first encounter this term when trying to understand the QUIC protocol. It has been used in various circles for far longer, though.

If you read between the lines a little, ossification sounds like an objectively bad thing. It is outside of your control, and you cannot fix it. What could be worse than that? In my opinion, being the decision-maker in that case would be worse. I wouldn't want to be held accountable for saying "this is our call; it's set in stone and it can never be changed."

Stability comes packaged with this gremlin called ossification, and they are inseparable. What sounds like a good idea right now may turn out to be a bad idea later. It happened with null-terminated strings and nullable pointers, it happened with PHP and Python, and it will happen again. Maybe the next library or application you stabilize will turn out to be totally wrong/broken/inadequate.

You can't tell the future, but you will definitely be stuck with the past.

Outside of standard libraries that ship with programming languages and compilers, it is a pretty safe bet that being totally wrong/broken/inadequate is a recoverable situation to be in; you just patch your library or application when things need to be fixed and all is right, right? I think this actually depends on a few aspects:

  1. The software maintainer needs to be on top of maintenance. If they fall behind on making patches or merging PRs, then you're left with the maintenance burden yourself, often on software that you are not an expert in. It just happens that you depend on its functionality and you get stuck with the bill when the maintainer moves on to other projects.
  2. Other dependents need to maintain their software, too! That means every transitive dependency in the entire tree needs the same treatment and the same level of prioritization to receive patches in a timely manner.

To echo the sentiment of the Fastly article, as long as everyone keeps up their end of the bargain with maintenance, there isn't a whole lot that can go wrong. It's just an impractical task to herd all of those cats; that's the reality.

What, then, is there to do about standard libraries? Hopefully as little as possible! Considering the likelihood of getting a design or public interface wrong, and the resistance to change that comes with strong stability guarantees, it is in everyone's best interest that standard libraries be as small and self-contained as reasonably possible. Even some primitive types that are useful in general may be better off in a library that can be freely updated (in breaking fashion) by end users. Yes, I'm thinking of strings in C, again; but I'm also thinking about mutex poisoning in Rust; I'm thinking about Python's multiple XML interfaces and its asyncio module that is arguably poorly designed compared to curio and trio; I'm thinking about Go officially supporting largely irrelevant and insecure algorithms like RC4, DES, and MD5 (honorable mention to both pseudorandom number packages, because that's never going to confuse anyone).
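
To pick on just one of those examples: Rust's standard library bakes poisoning into Mutex's locking API, and every caller pays for that design decision whether they care about poisoning or not. A quick sketch:

use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0u32);

    // Every lock returns a Result, because a panic while holding the lock
    // "poisons" it. Most code just unwraps...
    *counter.lock().unwrap() += 1;

    // ...and code that genuinely doesn't care about poisoning still has to
    // opt out explicitly by recovering the guard from the error.
    let value = *counter
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());

    println!("counter = {}", value);
}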

Maybe it's better not to include everything in the standard library. We just end up hearing recommendations over and over about how you shouldn't use this feature or that function because it's unsafe/insecure/deprecated/there's a better alternative. Starting with a very small surface area seems like the right way to go. This is not a proposal to start small and grow gradually, though! I don't think a standard library should grow much, if at all.

Think about it this way: something like JSON might seem like an obvious choice for inclusion now, but in 40 years, is anyone honestly still going to use JSON? I mean, ok, fine. People are still using COBOL. But my point is that 80% of all developers are not COBOL devs. In four decades, sure, there will be some JSON stragglers just like there are some COBOL stragglers today, but I can't believe that it will remain popular considering how bad it is for so many reasons. (Let's face it! The only reason protobuf or something else hasn't entirely supplanted JSON is because web browsers have ossified their serialization format.) JSON will very probably end up like null-terminated strings: the bane of some seasoned developer's existence. They both seem just so innocent, don't they?

Be very cautious about what you choose to ossify.

Resilience

I'm not much of a words person, but when I research the things I write about, I often find myself in the thesaurus looking for a good synonym for whatever concept I'm trying to convey. One of the top synonyms given for "flexibility" was "resilience"! That was a fascinating discovery in itself, worthy of a paragraph or two. Flexibility implies resilience because flexible things bend rather than break. When something rigid is put under enough pressure, it has to break at some point. And breaking is not very resilient, is it?

This definition of "bend rather than break" applies in a number of relevant ways for software design. A service provider strives for resilience to downtime and outages. A programming language designer and compiler implementer strives for resilience to breaking user code. Interface designers strive for resilience to footguns and obvious logic bugs. In each case, the person wants the software (or user) to bend rather than break.

I like it. In fact, I'm willing to boldly claim that resilience is objectively desired. If that is the case, then its synonym, flexibility, should be just as objectively desired. Among the definitions of this particular word, you'll find "bend rather than break", "ease of adaptation or offering many options", and (my personal favorite) "the willingness to adjust one's thinking or behavior." Doesn't that last definition sound suspiciously like science and knowledge? It's no coincidence, and I'm not just playing word games. I'm pointing out the underlying truth that things must change. It's the second law of thermodynamics.

Any resilient system will necessarily resist ossification.

What are these quotes, by the way? Don't worry about it! I just made them up. If I used twitter, these are the kind of one-liners I would tweet. But I don't, so I don't. You have to read the longform, instead.

I have argued so far that the concept of "stability" has something of a duality. There's the notion that something is stable because it resists external change, and the notion that something is stable because it resists bugs; it resists stagnation. An example of the former is a strong interface stability guarantee from a library. An example of the latter is keeping your browser and OS up-to-date to fix bugs, known security vulnerabilities, and just to acquire new features. Or stated another way, the "it's not a bug, it's a feature" line of thinking is the bug.

Lock yourself in

Having set the stage for stability as a means to lock in a design or set it in stone, let's dive into that concept a bit more. There are some examples in package managers where "locking in" your dependency list is a normal best practice. npm has a package-lock file and a ci command to install packages from it. Cargo has a Cargo.lock file and installs packages from it by default (for the top-level crate only). Go modules fill the same basic role with go.mod and go.sum.

I bring this up because versioning dependencies has been converging on this concept of locking the precise version number of every package in the tree within many major package managers. And it's no surprise, because this is one of the best tools for providing repeatable builds and testing. Semantic versioning plays right into that, as I covered earlier.

There's just one problem. This is not the kind of "locking in" that I allude to in the title. I'm talking about locking yourself into a specific design that is hard to change later. I'm talking about being that decision-maker who owns the responsibility for putting a crappy middlebox on the Internet without ever updating it to support newer protocol revisions (or newer protocols, period). I'm talking about tying your hands by offering a stable API that just plain sucks. No, I take that back: I'm talking about tying your users' hands. Because it's not the author who is locking themselves into a contract; they are locking their users into a contract.

You may lock yourself in at your own peril, but you lock everyone else in with you at their peril.

This is something I take personal grievance with. Development philosophies like "worse is better" mean that everything I use on a daily basis is worse than it could otherwise be. The Internet is held together with bubblegum and string. Wirth's law is real, and it threatens our very existence. I'll admit, using Wirth's law to criticize Bitcoin is a bit cheeky. But sometimes you just have to be cheeky.

I've rambled on enough, and linked to about a dozen articles that you should read. I'll just leave you with one last thought: Change or die.

Tuesday, July 27, 2021

Mutable statics have scary superpowers! Do not use them

In Rust, it is well-known that static mut is dangerous. The docs for the static keyword have this to say about mutable statics:

If a static item is declared with the mut keyword, then it is allowed to be modified by the program. However, accessing mutable statics can cause undefined behavior in a number of ways, for example due to data races in a multithreaded context. As such, all accesses to mutable statics require an unsafe block.

This makes perfect sense, so far. But my program is single-threaded. I know it is single-threaded because it runs on bare metal with no_std. I also know that I will need global state to manage access to memory-mapped I/O, and the global state needs to be mutable so that I can replace it at runtime. I can do this with static mut because I do not have to worry about data races across multiple threads, right?

Well, that's the thought I initially had, anyway. And I believe I am not alone. But there is a lot of subtlety to static mut references that is problematic, even in a guaranteed single-threaded context. At first I believed it all had to do with the special 'static lifetime, but that is not the case. Rust By Example describes this lifetime as follows:

As a reference lifetime 'static indicates that the data pointed to by the reference lives for the entire lifetime of the running program. It can still be coerced to a shorter lifetime.

That also makes sense, and it's exactly what I want. In my case, the memory-mapped I/O that I wanted to control was a serial port. The primary goal was to reproduce something akin to standard I/O with println! and dbg! macros like those provided by the Rust standard library. A secondary goal was to make the sink configurable at runtime. This would allow the software to run on hardware with different serial interfaces, for example, without recompiling it for each distinct device.

The platform I was targeting is the Nintendo 64. Which, as you might guess, does not have your good old RS-232 serial port. It has a slot for game cartridges, and an expansion bus on the bottom which is basically just another cartridge port. It also has some controller ports on the front. Nothing really stands out as a plug-and-play means to get a serial console interface. As luck would have it, there are development cartridges that have an FTDI chip and USB port on them for exactly this purpose! The bad news is that each dev cart has its very own custom MMIO interface, and none of them are compatible.

Going back in time a bit, there are also debug interfaces on the original development hardware used in the 90's (and for a few years in the early 2000's) which use either parallel or SCSI ports. Some emulators support these interfaces because games (especially prototypes) actually print interesting things to them.

So we're in a situation where we want to write software for a platform (N64) that has a number of different serial interfaces available, but we don't want to build 3 or 4 different executables and put the burden on the user to figure out which one they need. And so here we are; we make the serial I/O implementation configurable at runtime. It can detect which interface is available, and that's what it uses. No sweat!

The core::fmt::Write trait

We want to use core::fmt::Write to make the implementation generic. This trait provides some useful methods for implementing the macros, and the core formatting machinery takes care of the details I don't want to think about. There are far better designs than what I came up with, but the initial implementation of my runtime-configurable I/O API stored a trait object (implementing fmt::Write) in a lazy_static, and the macros just needed a way to get a mutable reference to it. That's &'static mut dyn fmt::Write, mind you. But, I thought to myself, if I just hide all that inside the macro, nothing bad can happen!

What I didn't realize is that static mut has superpowers. I'll try my best to explain what these superpowers are, but please bear with me! I do not fully understand why these superpowers exist or what they are fully capable of. What I do know is that you never, ever, under any circumstance, want to anger them.

The 'static lifetime means that it lives forever (as far as the program is concerned) and making a static mutable means that you can only touch it in unsafe code, because threads and data races, right? Wrong! You need unsafe because the superpowers of static mut allow you to introduce instant Undefined Behavior. To explain, let's revisit what &mut means in the first place.

It's true that &mut allows you to change the contents of the thing behind it, but that's really just a side effect of the real purpose of &mut; it's an exclusive (unique) borrow. There can be only one.

All jokes aside, this is an invariant that you cannot break. Creating mutable aliases is instantly Undefined Behavior. Invoking Undefined Behavior gives the compiler license to do anything to your code. We really only want the compiler to do what we want it to do with our code, so let's just agree to avoid UB at all costs. To really nail this one down, consider what the Nomicon has to say about transmute:

  • Transmuting an & to &mut is UB.
    • Transmuting an & to &mut is always UB.
    • No you can't do it.
    • No you're not special.

There is some subtlety here that I won't go into, but the core of the issue is that this creates mutable aliases. Don't do it!

A starting point

There are some things that you will naturally begin to intuit while writing Rust after you have been using it for a while. It's helpful to recognize things that will cause the borrow checker to reject your code, for example. A quick internal borrow check in your brain can save a few seconds (or minutes, in extreme cases) of compile time just to be provided with an error message.

Let's begin with this bare minimum State struct:

#[derive(Copy, Clone, Debug)]
struct State {
    x: i32,
}

impl State {
    const fn new() -> Self {
        Self { x: 0 }
    }

    fn get_mut(&mut self) -> &mut i32 {
        &mut self.x
    }
}

There is nothing really special here. You can gain a mutable reference to the x field if you have mutable access to the struct. Standard stuff, so far. If you attempted to write code like the following, it might make you pause as your internal borrow checker raises a red flag:

fn main() {
    let mut state = State::new();

    let a = state.get_mut();
    let b = state.get_mut();

    *a = 42;

    println!("I have two i32s: {}, {}", a, b);
}

Importantly, it isn't the explicit dereference (*a = 42;) that trips the borrow checker, here. It's the println! macro. And the error message illustrates this perfectly:

error[E0499]: cannot borrow `state` as mutable more than once at a time
  --> src/main.rs:22:13
   |
21 |     let a = state.get_mut();
   |             ----- first mutable borrow occurs here
22 |     let b = state.get_mut();
   |             ^^^^^ second mutable borrow occurs here
23 | 
24 |     println!("I have two i32s: {}, {}", a, b);
   |                                         - first borrow later used here

This mutable aliasing issue is something I've become more aware of over time, so it's kind of silly for me to write examples like this. The demonstration does help prove my point about superpowers though.

Let's make it static

Using the same State struct from above, let's make some minor adjustments to the program that uses it, so that State becomes static:

static STATE: State = State::new();

fn main() {
    let a = STATE.get_mut();

    *a = 42;
    
    println!("I have one i32: {}", a);
}

Your internal borrow checker should raise another red flag, here. We made the state static, but not mutable, so we cannot compile this code:

error[E0596]: cannot borrow immutable static item `STATE` as mutable
  --> src/main.rs:20:17
   |
20 |         let a = STATE.get_mut();
   |                 ^^^^^ cannot borrow as mutable

Everything is as expected so far. We need to make the static state mutable so that we can borrow it mutably (exclusively, remember). And that also means we need to use the unsafe keyword. We're about to find out why it's actually unsafe. :)

static mut STATE: State = State::new();

fn main() {
    unsafe {
        let a = STATE.get_mut();

        *a = 42;
    
        println!("I have one i32: {}", a);
    }
}

Well, that's not so bad! My internal borrow checker can't spot anything wrong with that. And neither can the compiler's (much better) borrow checker:

I have one i32: 42

But now I want to get spicy! 🔥 I want two exclusive references to the same memory. Or maybe I don't, but it accidentally happens anyway.

static mut STATE: State = State::new();

fn main() {
    unsafe {
        let a = STATE.get_mut();
        let b = STATE.get_mut();

        *a = 42;
    
        println!("I have two i32s: {}, {}", a, b);
    }
}

My internal borrow checker says this cannot work, just like it didn't work before. And the Rust borrow checker is way better than mine, so surely it won't allow this!

I have two i32s: 42, 42

ಠ_ಠ



(╯°□°)╯︵ ┻━┻


To be fair, running this code under Miri does detect the Undefined Behavior. Miri is a very powerful tool, as are the sanitizers. Always use them when you are writing unsafe code! No exceptions. This is a huge relief, or at the very least it's better than nothing. Here's what Miri finds:

error: Undefined Behavior: no item granting write access to tag <2843> at alloc1 found in borrow stack.
  --> src/main.rs:23:9
   |
23 |         *a = 42;
   |         ^^^^^^^ no item granting write access to tag <2843> at alloc1 found in borrow stack.

Still, you may be caught off guard by this. Don't worry, it's a perfectly normal reaction. We didn't fundamentally change the State implementation, and yet putting it into a static mut changes its semantics. We might be given a hint about why this is if we desugar the get_mut method's function signature:

fn get_mut(&'static mut self) -> &'static mut i32

Go ahead, try it out! It compiles just the same; the 'static lifetime was simply elided. But wait, what if we keep this explicit 'static lifetime and go back to the safe program code that puts State on the stack?

error[E0597]: `state` does not live long enough
  --> src/main.rs:21:13
   |
21 |     let a = state.get_mut();
   |             ^^^^^----------
   |             |
   |             borrowed value does not live long enough
   |             argument requires that `state` is borrowed for `'static`
...
31 | }
   | - `state` dropped here while still borrowed

So it is true that the 'static lifetime is special. We cannot borrow State for 'static because State lives on the stack. We would have to move it to the heap and leak the reference to make it live for 'static if we wanted to call the method with this lifetime bound. Even then, we cannot create mutable aliases in safe code with leaked boxes, because unlike a static there is no global name to re-borrow from; each leaked Box yields exactly one reference.
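
Here is a small illustration, reusing the State type from above with the desugared &'static mut self signature for get_mut:

fn main() {
    // Box::leak hands back a `&'static mut State` in safe code, so the
    // `&'static mut self` version of `get_mut` can be called:
    let state: &'static mut State = Box::leak(Box::new(State::new()));

    let a = state.get_mut();
    *a = 42;

    println!("I have one i32: {}", a);

    // But there is no second, independent path to this State. A second
    // `state.get_mut()` here is rejected at compile time (the first call
    // already claimed `state` for 'static), so we cannot conjure up the
    // mutable aliasing that `static mut` hands out so freely.
}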

In the case of static mut, we're allowed to borrow with a 'static lifetime, but we're also allowed to exclusively borrow multiple times. If you could only exclusively borrow a static mut once, it would make the feature useless for many valid cases. And this is the tricky part that, at least in my mind, makes static mut such a nightmare: you have no choice but to uphold the invariant that mutable aliasing never occurs, but the compiler won't stop you from violating it. It is particularly insidious when your safe interface allows reentrancy, and that's how I hit this problem: nested dbg! macro calls are pretty common.

This is the superpower of static mut.

How did I fix this in my configurable serial I/O?

Well, I'm not entirely sure that I did, to be honest! But I did find a way for my macros to receive an exclusive borrow with a lifetime shorter than 'static, and I did that by not borrowing a static mut. Instead, I have statics with interior mutability (STDOUT and STDERR). I opted for spin::Mutex<RefCell<T>> to keep my uses of the unsafe keyword to a minimum, but it's also possible to store state with interior mutability in a static UnsafeCell. This is the recommended replacement according to Consider deprecation of UB-happy `static mut` · Issue #53639 · rust-lang/rust (github.com).
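
Roughly, the shape of the fix looks like this. It's a simplified sketch, assuming the spin crate, with a made-up SerialOut type standing in for the real device-specific writers (the real code stores trait objects):

use core::cell::RefCell;
use core::fmt::{self, Write};

use spin::Mutex;

// Stand-in for a device-specific writer; the real implementation would
// poke the memory-mapped serial interface.
struct SerialOut;

impl Write for SerialOut {
    fn write_str(&mut self, _s: &str) -> fmt::Result {
        Ok(())
    }
}

// An ordinary (immutable) static with interior mutability: no `static mut`,
// no unsafe, and the exclusive borrow lives only as long as the lock guard
// instead of being handed out for 'static.
static STDOUT: Mutex<RefCell<Option<SerialOut>>> = Mutex::new(RefCell::new(None));

fn print_args(args: fmt::Arguments) {
    let stdout = STDOUT.lock();
    if let Some(out) = stdout.borrow_mut().as_mut() {
        let _ = out.write_fmt(args);
    }
}

fn main() {
    // Configure the sink at runtime, then print through it.
    *STDOUT.lock().borrow_mut() = Some(SerialOut);
    print_args(format_args!("hello from {}", "a configurable sink"));
}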

I also had to jump through hoops to get trait objects working without Box, but that was another rabbit hole entirely. In the end, I learned a valuable lesson about static mut and how bad it is for your health (and mine!). There are ways to use it that are trivially safe, but reading through that GitHub issue ought to ruffle your feathers, even if I haven't already convinced you that static mut probably should not be used.