Forrest Jacobs

Rust to Go and back

I wrote two Discord bots relatively recently: a bot called systemctl-bot that lets you start and stop systemd units, and a bot called pipe-bot that posts piped messages. I set out to write both in Rust; one was a delight to write, but I ended up rewriting the other in Go in a fit of frustration. Here are a few reasons why:

Starting off too complex

pipe-bot, the program I successfully wrote in Rust, is very simple. It listens on standard input, then calls the Discord API based on each message:

  graph

  start(Start Discord client) --> stdin

  subgraph Loop
    stdin@{ shape: lean-r, label: "Wait for stdin" }

    stdin --> parse[Parse stdin]
    parse -->|Message| message[Send message]
    parse -->|Status update| status[Update status]
    parse -->|Else| log[Log error]
  end
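A minimal, synchronous sketch of that loop (the Command type and the prefix-based input format here are my invention, not pipe-bot's actual protocol):

```rust
use std::io::{self, BufRead};

// Hypothetical command set; pipe-bot's real input format may differ.
enum Command {
    Message(String),
    Status(String),
}

fn parse(line: &str) -> Option<Command> {
    if let Some(status) = line.strip_prefix("status ") {
        Some(Command::Status(status.to_string()))
    } else if !line.is_empty() {
        Some(Command::Message(line.to_string()))
    } else {
        None
    }
}

fn main() {
    // Wait for stdin, parse each line, and dispatch.
    for line in io::stdin().lock().lines() {
        match parse(&line.expect("failed to read stdin")) {
            Some(Command::Message(text)) => println!("send message: {text}"),
            Some(Command::Status(text)) => println!("update status: {text}"),
            None => eprintln!("could not parse line"),
        }
    }
}
```

The real bot replaces the println! calls with Discord API requests, but the shape of the loop is the same.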

However, I started with systemctl-bot, which monitors and controls systemd units, parses and shares a config file, reads async streams, and generally has weird edge cases. While it’s not overly complex, it’s a lot to get your head around when you’re also learning the borrow checker and async Rust.

  graph
  config(Parse config) --> start(Start Discord client) --> register(Register commands)
  register --> command
  start --> status
  allowed -.- config

  subgraph Command Loop
    command@{ shape: lean-r, label: "Command" } --> allowed{Is unit in config?}
    allowed --> |Not in config| log[Log error]
    allowed --> |In config| systemctl[Issue systemctl command]
    systemctl --> |Success| post[Send success message]
    systemctl --> |Failed| fail[Send failure message]
  end

  subgraph Status Loop
    status@{ shape: lean-r, label: "Unit Status Update" }
    status --> fus[Fetch units' statuses] --> uds[Update Discord status]
  end

Async Rust

I anticipated fighting with the borrow checker, but—oh boy!—it pales in comparison to writing and understanding async Rust. Since I was coming from the world of “””enterprise software”””, I was used to writing with a level of indirection to facilitate code reuse, unit testing, and refactoring. Rust, however, makes you pay for that indirection: the compiler has to track any state held across an .await point, so abstractions that carry more state, or more complex state, get expensive fast. Watch this video to hear someone much smarter than me explain why the current state of async Rust ain’t quite it yet:
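One concrete way to see that cost: every async fn compiles to a state machine, and anything alive across an .await point has to be stored inside it. This toy example (not from systemctl-bot, and the exact sizes are compiler-dependent) never even polls the futures; it just measures them:

```rust
use std::mem::size_of_val;

async fn small() {}

async fn big() {
    let buf = [0u8; 1024]; // live across the await, so stored in the future
    small().await;
    let _ = buf.len(); // keep buf alive past the await point
}

fn main() {
    // Neither future is polled; we only inspect the compiled state machines.
    println!("small: {} bytes", size_of_val(&small()));
    println!("big:   {} bytes", size_of_val(&big()));
}
```

The future returned by big is over a kilobyte because buf lives across the await. Layer trait objects and borrowed state on top of that and you can see why indirection-heavy designs get painful.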

Testing

Something possessed me to go full enterprise software sicko mode during the development of systemctl-bot and unit test every module to as close to 100% coverage as possible. I’m glad I did because it taught me more about generics and about Box, Rc, and Arc as I tried to find ways to mock dependencies, but it also taught me that this style of testing in Rust produces a huge glob of code that is painful to wrangle.

I decided to take a different approach while developing pipe-bot: I just mocked the outer edges of my program and let every test be an integration test. Any unit-level errors that mattered seemed to come up in these tests, and since my program was small it wasn’t difficult to identify the specific function where an error originated. I got 99% of the benefit of unit testing with 20% of the effort.
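A sketch of that approach (the Discord trait and handle_line are illustrative stand-ins, not pipe-bot's real interfaces): fake only the outer edge, here the Discord client, and drive the whole pipeline through it:

```rust
// Illustrative only: stand-ins for pipe-bot's real interfaces.
trait Discord {
    fn send_message(&mut self, text: &str);
    fn update_status(&mut self, text: &str);
}

// Test double that records calls instead of hitting the network.
#[derive(Default)]
struct FakeDiscord {
    sent: Vec<String>,
    statuses: Vec<String>,
}

impl Discord for FakeDiscord {
    fn send_message(&mut self, text: &str) {
        self.sent.push(text.to_string());
    }
    fn update_status(&mut self, text: &str) {
        self.statuses.push(text.to_string());
    }
}

// The "whole program" under test: parse one line and dispatch to Discord.
fn handle_line(line: &str, client: &mut dyn Discord) {
    match line.strip_prefix("status ") {
        Some(status) => client.update_status(status),
        None => client.send_message(line),
    }
}

fn main() {
    let mut fake = FakeDiscord::default();
    handle_line("hello", &mut fake);
    handle_line("status away", &mut fake);
    assert_eq!(fake.sent, vec!["hello"]);
    assert_eq!(fake.statuses, vec!["away"]);
}
```

One fake at the boundary exercises the parsing and dispatch logic together, with none of the Box/Rc/Arc gymnastics that per-module mocking required.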

Final thoughts

I enjoy Rust, but I respect Go. Rust is more fun to write, and the compiler’s strict checking is a superpower that ensures you don’t screw yourself up too badly. However, async Rust is a huge pain for me, and while Go is boring, sometimes it’s the ticket to complete a project.

Keeping NixOS systems up to date with GitHub Actions

Keeping my NixOS servers up to date was dead simple before I switched to flakes – I enabled system.autoUpgrade, and I was good to go. Trying the same with a shared flakes-based config introduced a few problems:

  1. I configured autoUpgrade to commit flake lock changes, but it ran as root. This created file permission issues since my user owned my NixOS config.
  2. Even when committing worked, each machine piled up slightly different commits waiting for me to upstream.

I could have fixed issue #1 by changing the owner, but fixing #2 required me to rethink the process. Instead of having each machine update its own lock file, I realized it would be cleaner to update the lock file upstream first, and then rebuild each server from upstream. Updating the lock file first ensures there’s only one version of history, and that makes it easier to reason about what is installed on each server.

Below is one method of updating the shared lock file before updating each server:

Updating flake.lock with GitHub Actions

The update-flake-lock GitHub Action updates your project’s flake lock file on a schedule. It essentially runs nix flake update --commit-lock-file and then opens a pull request. Add it to your NixOS config repository like this:

# /.github/workflows/main.yml

name: update-dependencies
on:
  workflow_dispatch: # allows manual triggering
  schedule:
    - cron: '0 6 * * *' # daily at 1 am EST/2 am EDT

jobs:
  update-dependencies:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: DeterminateSystems/nix-installer-action@v12
      - id: update
        uses: DeterminateSystems/update-flake-lock@v23

Add this step if you want to automatically merge the pull request:

      - name: Merge
        run: gh pr merge --auto "${{ steps.update.outputs.pull-request-number }}" --rebase
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
        if: ${{ steps.update.outputs.pull-request-number != '' }}

Pulling changes & rebuilding

Next, it’s time to configure NixOS to pull changes and rebuild. The configuration below adds two systemd services:

  • pull-updates pulls config changes from upstream daily at 4:40. It has a few guardrails: it ensures the local repository is on the main branch, and it only permits fast-forward merges. You’ll want to set serviceConfig.User to the user that owns the repository. If it succeeds, it kicks off rebuild.
  • rebuild rebuilds and switches to the new configuration, and reboots if required. It’s a simplified version of autoUpgrade’s script.

systemd.services.pull-updates = {
  description = "Pulls changes to system config";
  restartIfChanged = false;
  onSuccess = [ "rebuild.service" ];
  startAt = "04:40";
  path = [pkgs.git pkgs.openssh];
  script = ''
    test "$(git branch --show-current)" = "main"
    git pull --ff-only
  '';
  serviceConfig = {
    WorkingDirectory = "/etc/nixos";
    User = "user-that-owns-the-repo";
    Type = "oneshot";
  };
};

systemd.services.rebuild = {
  description = "Rebuilds and activates system config";
  restartIfChanged = false;
  path = [pkgs.nixos-rebuild pkgs.systemd];
  script = ''
    nixos-rebuild boot
    booted="$(readlink /run/booted-system/{initrd,kernel,kernel-modules})"
    built="$(readlink /nix/var/nix/profiles/system/{initrd,kernel,kernel-modules})"

    if [ "''${booted}" = "''${built}" ]; then
      nixos-rebuild switch
    else
      reboot now
    fi
  '';
  serviceConfig.Type = "oneshot";
};

There are many possible variations. For example, in my real config I split the pull service into separate fetch and merge services so I can fetch more frequently. You could also replace the GitHub action with a different scheduled script, or change the rebuild service to never (or always!) reboot.

Waiting on Tailscale

I restarted my server the other day, and I realized one of my systemd services failed to start on boot because the Tailscale IP address was not assignable:

# journalctl -u bad-bad-not-good.service
...
listen tcp 100.11.22.33:8080: bind: cannot assign requested address

This is easy enough to fix. The service should wait to start until after Tailscale is online, so let’s just add tailscaled.service to the service’s wants and after properties, reboot, and…

# journalctl -u bad-bad-not-good.service
...
listen tcp 100.11.22.33:8080: bind: cannot assign requested address

Huh. It turns out Tailscale comes up a bit before its IP address is available. I was tempted to add an ExecStartPre to my service to sleep for 1 second – gross! – but eventually I found systemd’s fabulous systemd-networkd-wait-online command, which exits when a given interface has an IP address. Call it with -i [interface name] and either -4 or -6 to wait for an IPv4 or IPv6 address.

Wrapping it up into a service gives you something like this:

# tailscale-online.service
[Unit]
Description=Wait for Tailscale to have an IPv4 address
Requisite=systemd-networkd.service
After=systemd-networkd.service
Conflicts=shutdown.target

[Service]
ExecStart=/usr/lib/systemd/systemd-networkd-wait-online -i tailscale0 -4
RemainAfterExit=true
Type=oneshot

Services using your Tailscale IP address can now depend on tailscale-online.service.
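For example, the failing service from earlier (bad-bad-not-good is just the placeholder name from above) might declare the dependency like this, either in its unit file or in a drop-in:

```ini
# bad-bad-not-good.service (excerpt)
[Unit]
# Don't start until tailscale0 has an IPv4 address
Wants=tailscale-online.service
After=tailscale-online.service
```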