You know how you sometimes get a glimmer of inspiration and decide to start a blog or a personal website? Yeah, it’s been 6 years since I last wrote anything. Who knows, maybe this time it will stick.

In any case, let’s talk about CI/CD pipeline runners!

Before we begin, I’m fully aware that not many people are bored enough to read the whole thing. I promise the next one will be shorter and more solution oriented. This is the long version of why I built GARM.

About 4 years ago (2022 relative to the time of this writing), I had the privilege of contributing a small fix to the Flatcar Linux project. The PR is nothing spectacular. The process of merging it, however, felt more manual than it needed to be. The reason was not the review process or the code itself, but the fact that the CI was set up inside a private network and was opaque to any external contributors.

There were several reasons for that particular CI setup, the most important being that the CI was building an entire operating system and then running a full suite of tests against the resulting OS build. As you can imagine, there are few general purpose CI/CD pipelines that are natively equipped to handle such a task.

However, that didn’t mean we couldn’t try to make it more transparent to external contributors such as myself at that time. So I started on a journey to try and implement their current CI in a set of GitHub workflows.

For the most part, the tooling needed to automate the process of building and testing the OS was already present in the public repository. What was missing was the glue that puts it all together in a GitHub workflow. That part was added in a subsequent PR. That PR is a jump ahead in time, though: it came after lessons were learned and research was done.

Which brings me to…

The unknowns

As with any research effort, we need to define a few things:

  • What are the current requirements for the existing build system?
  • What are the constraints imposed by the new build system you want to use?
  • What are the equivalent settings, applications and technologies that you need in the new build system? You want to move, but you also don’t want to lose critical functionality.
    • If you must lose functionality, is that functionality critical?
    • Can it be replaced by something else?
    • Is that replacement suitable?
  • We’re building an operating system that eventually ends up running in production environments:
    • Is it safe to build a pipeline that may run arbitrary unsafe code from untrusted parties?
    • Can you protect your CI/CD pipeline secrets?
    • Can you protect your CI/CD pipeline servers from bad actors that may want to compromise the host running the tests? Supply chain attacks are a reality you need to take into account, especially if you’re building an OS.

Flatcar Linux itself is a friendly fork of CoreOS, which itself is based on the Gentoo build system. The build system for Flatcar Linux uses Docker. That Docker requirement will be important later on when we go deeper into researching options for running in a CI.

Additionally, while building the OS and its packages can easily be done in Docker, testing it means we need to recreate the conditions one would have on a normal server, be it physical or virtual. That means access to device mapper for tests that require encrypted volumes, LVM, dmraid or any other storage solution that leverages device mapper and that Flatcar has a test for.

The first attempt

The first attempt was to just try and use GitHub runners, by cloning most of the functionality of the Jenkins pipeline. We had to do this anyway, regardless of whether we ended up using GitHub runners or not. The amazing folks over at Flatcar were kind enough to send me the Groovy scripts they used internally so I could replicate them in a workflow.

That was pretty simple to do, given that they already did all of the work and most of the CI pipeline code was already available in the scripts repository. All that was needed was the glue that put everything together in a coherent workflow.

Within a day, I had the first draft of the workflow. So, with much excitement, I triggered a workflow. The workflow kicked off, started cloning repositories, downloading the SDK and setting things up. When it got to the stage where packages were being emerged and built I was popping the Champagne bottle, calling my friends in celebration, life was good.

After an hour or so, the progress of the build was still not very far along. So all that excitement started to fade. After about 6 hours, GitHub decided to cancel the workflow as that was the default cutoff time for any workflow. The build had gotten only about 80% of the way there.

But locally, that same workflow would finish in slightly under 2 hours. And this was on a not particularly powerful desktop machine.

So what gives?

Well, as it turns out, at the time GitHub runners had about 6 GB of memory (which was fine) but only 2 vCPUs. And the build system is very CPU bound. All those packages that need to be compiled will eat up as much CPU as they can. The bottleneck was clear. The solution a bit more nuanced.

The path forward had a few options. We could either strip down the build to something that would finish faster, increase the timeout on the job above 6 hours (configurable in the workflow) or try to run the jobs on a beefier server.

Flatcar Linux is an immutable OS with a very small footprint. Stripping it down even more made no sense.

Increasing the timeout on the jobs was a better option, but if the build system only reached about 80% after 6 hours, how much longer would the rest of the build plus the functional tests take? Would the functional tests even run on systems that were this low spec? The functional tests needed to run the newly built OS image inside a VM, so we’d need to spin that up in a fresh VM or a bare metal machine. Doing that inside a VM that only had 2 vCPUs and potentially no nested virtualization was a fever dream at best. Moreover, even if by some miracle of the AllSpark we would manage to get that going, what would we do about ARM64 tests? And would a regular PR run even finish before the heat death of the universe?

So…then…what about getting beefier runners?

This seemed like the only real option. At that time (2022), GitHub did not have any option to use larger runners. The Flatcar Linux project, however, had a sponsorship from Equinix Metal which provided two bare metal servers: one AMD64 machine and one ARM64 machine. Both were server grade, with lots of CPU cores and RAM.

But we needed to run potentially multiple jobs in parallel on these machines. While they could totally do that, we’d still need to isolate the tests from one another. Moreover, each build run and test must not be tainted by a previous build/test run.

The requirements

So we identified the following set of issues and requirements:

  1. We need to run the build inside a Docker container.
  2. The build is heavy on the CPU and disk IO.
  3. We need to allow multiple PR test runs to run in parallel.
  4. The GitHub runners are too low spec to finish in a decent amount of time.
  5. We need to be able to spin up a VM and run the functional tests.
  6. We need to think about ARM64. At least for tests, as the build for ARM64 happens on AMD64.
  7. We need a clean, untainted environment on every build run and on every test run. So ephemeral runners were a must.
  8. We need to keep our pipeline secure. Assume every test run of untrusted code is adversarial. Any system that runs code from a PR should be considered compromised and unusable after a single run.
  9. Whatever we ended up building had to be easy to set up and easy to maintain. In an ideal situation, as close to zero maintenance as possible. Ops teams are spread thin as it is. Having them take ownership of another complex system is something nobody wants to even entertain.
  10. The thing needed to take care of the lifecycle of the runners without intervention. Essentially giving us the same experience GitHub would.
  11. Bonus points if it was IaaS agnostic. If there’s anything I learned in my many years doing infra, it’s that you can never rely on things staying the same, especially when it comes to infrastructure. Costs can shoot up, services can degrade, circumstances may obligate you to migrate to something else.
  12. Bonus points if we could manage runners on the repo level and the Org level.

That’s quite the list. And to think this all started from the simple goal of having more transparency in a PR. Kind of feels like it’s turning into a Hal fixing a lightbulb moment, doesn’t it?

The research

Anyway, as the default runners were not up to snuff, I started shopping around for a self-hosted solution that would fit the above criteria. As it turns out, there were quite a few options to choose from. Most notable out of all of them was Actions Runner Controller (ARC), which was later adopted by GitHub and became an officially supported project.

This was a great option, but it had one snag: it only gave us a container runtime, and one of our core requirements was to run the SDK inside a Docker container. While Docker-in-Docker (DinD) is a thing, to me this always felt like trying to bake a cake using a barbecue. Can you do it? Sure. Is there a better, easier and safer way? Definitely. I mean we started from bare metal servers in the 2000s, moved to virtual machines and the cloud in the early 2010s, then we moved to containers and now we’re trying to turn containers back into virtual machines. We’ve spent all this time and effort trying to run something in a container that was never meant to run in a container, when we already had virtual machines. But I digress…

Getting back to the matter at hand, we needed to run functional tests, which needed some device mapper access and required spinning up a VM. While you could probably hack k8s with an axe and achieve something that might work, all isolation goes out the window. Not something you want to do when running potentially untrusted code that tries to make its way inside an operating system that runs on many production environments.

Okay, what else could we use? Like anyone who searches for something like this without having a clue about what’s already out there, I stumbled across Johannes Nicolai’s awesome runners list. But after parsing the list, it became clear that most options were either tightly coupled to one IaaS, focused on container runtimes, had no autoscaling or only supported repo runners. Additionally, a lot of the available options needed the personal access token (PAT) to be available on the runners themselves. That was a security boundary I was unwilling to cross.

I didn’t want to propose something that might stop working in the future if the IaaS needed to be swapped out. Flatcar is a Microsoft project. While they did have a couple of bare metal servers on which we could set everything up, there was a non-zero chance that at some point they would like to move parts of that CI to Azure. So anything I decided to propose to the Flatcar project would need to be self-hostable and flexible enough to work on a new IaaS with minimal changes.

The decision

At this point I had two options:

  1. Slap something together using multiple projects and document the thing in the hopes that someone in the future won’t hunt me down out of frustration.
  2. Build something new.

I mean, the choice is clear, isn’t it? You can choose to use 2 or 3 established, well documented projects with broad community adoption, or you can delude yourself into thinking you can build something new that works for your use case. I mean, the best way to fix a standard is not with a new standard, right? But at the same time, we’re not going to let good sense stand in the way of your Dunning-Kruger.

Anyway, you probably guessed already that I went with option 2.

Introducing GARM - GitHub Actions Runner Manager

I come from an infrastructure background. I have been hosting websites, building distributed systems and clouds for more than 20 years. I was there when the cloud was born and my job title went from sysadmin to devops. And if you’ve not just worked with clouds for your entire career but also had the privilege of building one from scratch, you realize that almost all clouds, up to a point, are the same. The primitives are identical. You can spin up compute resources from an image, connect them to a network, run something at first boot and have something functional.

So in the end, what we needed to do to emulate how GitHub provided the default runners was to implement the compute bit and set up the runner for a repo, org or enterprise. We needed to somehow consume events from GitHub to know when a job needed a runner, spin up a runner on an IaaS, configure it and make it available to the job that requested it.

GARM - setting up the source of truth

At the time, GitHub had only one acceptably reliable way of consuming events, and that was through webhooks. You could set up a service that accepted POSTs from GitHub whenever something you cared about happened. For any autoscaler of runners, the hooks you care about are workflow_job webhooks. These will give you most of the info you need. There are some inexplicable omissions from the queued workflow payload, but I guess we can’t have it all.
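As a rough illustration of what consuming those hooks involves, here is a minimal Python sketch. It is not GARM's actual code. It relies on the X-Hub-Signature-256 header that GitHub signs deliveries with, the X-GitHub-Event header, and the action and workflow_job fields of the payload. The names JobEvent and parse_delivery are made up for this example, and the HTTP server that would call them is left out.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobEvent:
    """The few fields an autoscaler needs from a workflow_job delivery."""
    action: str    # e.g. "queued", "in_progress" or "completed"
    job_id: int
    labels: tuple  # the labels the job asked for via runs-on


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def parse_delivery(secret: bytes, event_name: str, signature_header: str,
                   body: bytes) -> Optional[JobEvent]:
    """Return a JobEvent for workflow_job deliveries, None for any other event.

    event_name is the value of the X-GitHub-Event header.
    """
    if not verify_signature(secret, body, signature_header):
        raise PermissionError("webhook signature mismatch")
    if event_name != "workflow_job":
        return None  # not something a runner autoscaler cares about
    payload = json.loads(body)
    job = payload["workflow_job"]
    return JobEvent(action=payload["action"], job_id=job["id"],
                    labels=tuple(job["labels"]))
```

A real service would call parse_delivery from its POST handler with the raw request body, since the signature is computed over the exact bytes GitHub sent.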

GARM would listen for jobs via webhook, record them and decide what to do with them later. With this solved, we needed to define how those requests would be serviced. But to understand how to manage those jobs, we needed to define a primitive that models the requirements for the runners of those jobs in a way that makes sense to both the user and the IaaS we eventually wanted to create the compute instances in.

GitHub allows jobs to target labels. The labels can be a simple scalar or an array of labels. The user can choose to describe the requirements of the job as succinctly or as verbosely as they prefer. While I would argue that a single descriptive, composite label is better than an array of narrowly scoped labels, this is not something an autoscaler should limit or be opinionated on.

Labels are meant to target runner types. To, say, target a runner that has a generous amount of memory and CPU you could use something like:

runs-on:
  - ubuntu
  - amd64
  - mem-64G
  - vcpus-16

or just:

runs-on: ubuntu-amd64-64G-16vcpu

GARM - modeling requirements

If we look at the above labels, we notice a few things:

  • The OS is defined
  • The CPU architecture is defined
  • The required memory and CPU are defined

So to model this in an autoscaler we would need to map most of that to something that an IaaS would understand. As it stands, most clouds will allow you to:

  • Boot a VM or container from an image. This usually dictates the CPU architecture as well.
  • Define resources for that compute instance. This is usually called a flavor, a shape or a VM type.

So the only real problem to solve before we could start on the IaaS providers was how to map the requirements to cloud primitives.

In GARM this is solved by defining pools of homogeneous runners. Pools allow you to define the labels a pool reacts to, the cloud the runners need to be created on and the characteristics of the compute instances that get created within a cloud.
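To make the pool idea concrete, here is a small Python sketch of how such a mapping could be modeled. It is an illustration only, not GARM's data model: the field names (provider, image, flavor, max_runners) and the pool names are invented. The matching rule assumed here is that a pool can serve a job when every label the job asks for is among the pool's labels.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class Pool:
    """A pool of identical runners (hypothetical fields, for illustration)."""
    name: str
    provider: str      # which IaaS the runners are created on
    image: str         # what the instance boots from; implies OS and architecture
    flavor: str        # the instance size: flavor, shape or VM type
    labels: frozenset  # the labels this pool reacts to
    max_runners: int = 5


def find_pool(job_labels: Iterable[str], pools: Sequence[Pool]) -> Optional[Pool]:
    """Return the first pool whose labels cover everything the job asked for."""
    wanted = set(job_labels)
    if not wanted:
        return None  # a job without labels cannot be matched to a pool
    for pool in pools:
        if wanted <= pool.labels:
            return pool
    return None


POOLS = [
    Pool("lxd-big", "lxd", "ubuntu:22.04", "16cpu-64g",
         frozenset({"ubuntu", "amd64", "mem-64G", "vcpus-16"})),
    Pool("azure-composite", "azure", "ubuntu-22.04", "Standard_D16s_v5",
         frozenset({"ubuntu-amd64-64G-16vcpu"})),
]
```

With this rule, both label styles behave the same way: the array form matches the first pool and the composite label matches the second.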

To tie it all together: a job comes in, we inspect the runs-on field of the job and look for a matching local pool. If we have a match, we record the job. A few seconds later, the pool worker consumes it and reacts by creating a new runner. The runner spins up, picks up the job and eventually finishes. Throughout this process, GitHub will send us 2 or 3 webhooks (depending on the outcome):

  • queued - a new job comes in
  • in_progress - the job has started running. This webhook may never come if the job was cancelled before it was picked up.
  • completed - this webhook is received when the job finishes, regardless of conclusion (success/failure/cancelled).
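The three webhooks above can be treated as a tiny state machine. The Python sketch below shows one way to record them. It is a simplified illustration, not GARM's implementation, and JobTracker is a hypothetical name. It assumes a job only ever moves forward, so a queued job may jump straight to completed (the cancelled case), and a delivery that arrives late or out of order never moves a job backwards.

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


# Jobs only move forward through these states.
_ORDER = {JobState.QUEUED: 0, JobState.IN_PROGRESS: 1, JobState.COMPLETED: 2}


class JobTracker:
    """Keeps the latest known state for every job seen via webhooks."""

    def __init__(self) -> None:
        self._jobs = {}

    def apply(self, job_id: int, action: str) -> JobState:
        """Record a webhook action and return the job's resulting state.

        Raises ValueError for actions other than the three handled here.
        """
        new_state = JobState(action)
        current = self._jobs.get(job_id)
        if current is None or _ORDER[new_state] > _ORDER[current]:
            self._jobs[job_id] = new_state
        return self._jobs[job_id]

    def pending(self) -> list:
        """IDs of jobs still waiting for a runner, in ascending order."""
        return sorted(job_id for job_id, state in self._jobs.items()
                      if state is JobState.QUEUED)
```

A pool worker could poll pending() every few seconds and create runners for whatever it finds there.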

GARM - transforming requirements to compute resources

So we have our source of truth for work items (webhooks), we have our primitives defined for managing runners, all we need is the glue between GitHub and the IaaS. Like I mentioned earlier, I come from an infrastructure background. I’ve always dealt with multiple clouds in the same way. If you had a workload you needed to run in multiple clouds, you would build a provider of some kind. This is nothing new. Many other projects like Juju, OpenTofu, even Kubernetes itself, do the same.

The goal however was to build something minimal that only spun up instances for our runners, while still allowing users to customize the compute instances to their needs. Sane defaults, power-user knobs for fine tuning. All of this wrapped in an API that fits with all IaaS providers when set in a pool of runners.

I wanted users to be able to mix and match their runners. You want to use your LXD/Incus cluster as the main workhorse for CPU intensive runners? Set up a pool that targets those providers. You want to spin up a couple of comically large instances in some cloud somewhere? Create a pool that targets EC2/Azure/GCP. You need instances with GPUs or FPGAs attached? Spin up a pool on your OpenStack cloud. All of this from one CLI, using one service that you can easily maintain.

But most importantly, I wanted to make something that is aimed at the ops professionals out there. They are the unsung heroes of the IT world. I wanted to allow anyone to extend GARM. Add any provider using any programming or scripting language that the teams using GARM were familiar with. The only requirement is that the provider is an executable and that it follows the interface defined by GARM.

While the current providers we wrote are all written in Go, that is not a hard requirement. Providers are simple executables. Not binaries, just executables. This was inspired by the CNI system in Kubernetes.

Whenever GARM needs to create, delete or just get details about a compute instance in an IaaS somewhere, it will call into an executable with some environment variables, and potentially standard input. The results of the operations GARM defines are expected as standard output.
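To give a feel for that contract, here is a toy provider sketched in Python. Everything specific in it is an assumption made for illustration: the environment variable names (EXAMPLE_COMMAND, EXAMPLE_INSTANCE_ID), the command names and the JSON fields are hypothetical and are not GARM's real interface, which is defined in the GARM documentation. What it does show is the general shape: the command arrives through the environment, parameters arrive on standard input, and the result is written to standard output.

```python
import json
import os
import sys


def run_provider(env, stdin, stdout) -> int:
    """Handle one invocation and return the process exit code.

    All names below are hypothetical placeholders, not GARM's real contract.
    """
    command = env.get("EXAMPLE_COMMAND", "")
    if command == "CreateInstance":
        spec = json.load(stdin)  # parameters arrive as JSON on standard input
        # A real provider would call the IaaS API here to boot the instance.
        instance = {"provider_id": "vm-" + spec["name"],
                    "name": spec["name"],
                    "status": "running"}
        json.dump(instance, stdout)  # the result goes to standard output
        return 0
    if command == "GetInstance":
        json.dump({"provider_id": env["EXAMPLE_INSTANCE_ID"],
                   "status": "running"}, stdout)
        return 0
    if command == "DeleteInstance":
        # A real provider would delete the instance; nothing is printed.
        return 0
    print(f"unknown command: {command!r}", file=sys.stderr)
    return 1


def main() -> None:
    """Entry point when the file is used as a standalone executable."""
    sys.exit(run_provider(os.environ, sys.stdin, sys.stdout))
```

To turn this into an actual executable, call main() at the bottom of the file and mark the file as executable. Keeping the logic in run_provider, with the environment and streams passed in, makes it easy to test without spawning a process.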

The early proof of concept Azure and OpenStack providers were actually written in Bash. As you can imagine, those providers are now severely out of date, but the basic concepts have not changed. You can still write your own provider using Bash.

So if you’ve deployed GARM today and at some point in the future a new IaaS becomes a requirement for your team, it is far easier to write a small Bash provider to hook into it than it is to change your entire stack.

GARM - setting the Dunning-Kruger fueled vision free on the world

As you can imagine, the initial release was much more limited in its scope and supported providers. The first release only had support for LXD. But the bones of the architecture have not changed much since that initial release. Providers are still written in the same way as they were 4 years ago.

But after a few weeks of work and a bunch of test runs, a PR was proposed to wire the GitHub workflow for the Flatcar pipeline to self-hosted GitHub runners managed by GARM. And I presented the solution in a “Flatcar Office hours” session.

Looking back, GARM started as an attempt to make a CI pipeline more transparent to external contributors. It eventually became a multi-cloud runner management platform supporting GitHub, GHES, and Gitea. I certainly didn’t set out to build that. But infrastructure projects often start that way: you solve one operational problem, discover three more behind it, and eventually realize you’ve built a product.

I owe the blog more posts about how to actually use it. That’s the next one.