Housekeeping: move adrs/ to docs/ and update status (#2443)
Co-authored-by: Francesco Renzi <rentziass@github.com>
docs/adrs/2022-10-17-runner-image.md (new file, 109 lines)
# ADR 2022-10-17: Produce the runner image for the scaleset client

**Date**: 2022-10-17

**Status**: Done

# Breaking Changes

We aim to provide a similar experience (as close as possible) between self-hosted and GitHub-hosted runners. To achieve this, we are making the following changes to align our self-hosted runner container image with the Ubuntu runners managed by GitHub.

Here are the changes:

- We created a USER `runner(1001)` and a GROUP `docker(123)`
- `sudo` has been installed on the image, and the `runner` user will be a passwordless sudoer.
- The runner binary is placed under `/home/runner/` and launched using `/home/runner/run.sh`
- The runner's work directory is `/home/runner/_work`
- `$HOME` will point to `/home/runner`
- The container image user will be `runner(1001)`

The latest Dockerfile can be found at: https://github.com/actions/runner/blob/main/images/Dockerfile

# Context

Users can bring their own runner images; the contract we require is:

- It must have a runner binary under `/actions-runner`, i.e. `/actions-runner/run.sh` exists
- The `WORKDIR` is set to `/actions-runner`
- If the user inside the container is root, the environment variable `RUNNER_ALLOW_RUNASROOT` should be set to `1`

The existing [ARC runner images](https://github.com/orgs/actions-runner-controller/packages?tab=packages&q=actions-runner) will not work with the new ARC mode out of the box for the following reasons:

- The current runner image requires the caller to pass runner configuration info, ex: URL and config token
- The current runner image has the runner binary under `/runner`, which violates the contract described above
- The current runner image requires a special entrypoint script in order to work around some volume mount limitations for setting up DinD.

Since we expose the raw runner PodSpec to our end users, they can modify the helm `values.yaml` to adjust the runner container to their needs.
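
For illustration, such an override might look like the sketch below; the `template` key and the image name are assumptions for this example, not the final chart schema.

```yaml
# values.yaml (sketch): point the runner container at a custom image that
# satisfies the contract above. Key names and image are illustrative.
template:
  spec:
    containers:
      - name: runner
        image: my-registry.example.com/my-runner:latest
        command: ["/home/runner/run.sh"]
```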
# Guiding Principles

- The image build is separated into two stages.

## The first stage (build)

- Reuses the same base image, so it is faster to build.
- Installs utilities needed to download assets (`runner` and `runner-container-hooks`).
- Downloads the runner and stores it in the `/actions-runner` directory.
- Downloads the runner-container-hooks and stores them in the `/actions-runner/k8s` directory.
- You can use build arguments to control the runner version, the target platform and the runner container hooks version.

Preview (the published runner image might vary):

```Dockerfile
FROM mcr.microsoft.com/dotnet/runtime-deps:6.0 as build

ARG RUNNER_ARCH="x64"
ARG RUNNER_VERSION=2.298.2
ARG RUNNER_CONTAINER_HOOKS_VERSION=0.1.3

RUN apt update -y && apt install curl unzip -y

WORKDIR /actions-runner
RUN curl -f -L -o runner.tar.gz https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-${RUNNER_ARCH}-${RUNNER_VERSION}.tar.gz \
    && tar xzf ./runner.tar.gz \
    && rm runner.tar.gz

RUN curl -f -L -o runner-container-hooks.zip https://github.com/actions/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
    && unzip ./runner-container-hooks.zip -d ./k8s \
    && rm runner-container-hooks.zip
```

## The main image

- Copies assets from the build stage to `/actions-runner`
- Does not provide an entrypoint. The entrypoint should be set within the container definition.

Preview:

```Dockerfile
FROM mcr.microsoft.com/dotnet/runtime-deps:6.0

WORKDIR /actions-runner
COPY --from=build /actions-runner .
```

## Example pod spec with an init container copying the assets

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: <name>
spec:
  containers:
    - name: runner
      image: <image>
      command: ["/runner/run.sh"]
      volumeMounts:
        - name: runner
          mountPath: /runner
  initContainers:
    - name: setup
      image: <image>
      command: ["sh", "-c", "cp -r /actions-runner/* /runner/"]
      volumeMounts:
        - name: runner
          mountPath: /runner
  volumes:
    - name: runner
      emptyDir: {}
```
docs/adrs/2022-10-27-runnerscaleset-lifetime.md (new file, 56 lines)
# ADR 2022-10-27: Lifetime of RunnerScaleSet on Service

**Date**: 2022-10-27

**Status**: Done

## Context

We have created the RunnerScaleSet object and APIs around it on the GitHub Actions service to better support any self-hosted runner auto-scale solution, like [actions-runner-controller](https://github.com/actions-runner-controller/actions-runner-controller).

The `RunnerScaleSet` object will represent a set of homogeneous self-hosted runners to the Actions service job routing system.

A `RunnerScaleSet` client (ARC) needs to communicate with the Actions service via HTTP long-poll using a certain protocol to get a workflow job successfully landed on one of its homogeneous self-hosted runners.

In this ADR, we discuss the following within the context of actions-runner-controller's new scaling mode:

- Who creates a RunnerScaleSet on the service, and how?
- Who deletes a RunnerScaleSet on the service, and how?
- What will happen to all the runners and jobs when the deletion happens?

## RunnerScaleSet creation

- The `AutoScalingRunnerSet` custom resource controller will create the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deployment.
- The creation is done via a REST API call to the Actions service: `POST _apis/runtime/runnerscalesets` (see the sketch after this list).
- The creation needs to use the runner registration token (admin).
- `RunnerScaleSet.Name` == `AutoScalingRunnerSet.metadata.Name`
- The created `RunnerScaleSet` will only have one label, which is the `RunnerScaleSet`'s name.
- The `AutoScalingRunnerSet` controller will store the `RunnerScaleSet.Id` as an annotation on the k8s resource for future lookup.
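
A rough sketch of the creation call; the host variable, auth header, and JSON shape are assumptions here, and `my-runner-set` is a hypothetical `AutoScalingRunnerSet` name:

```bash
# Create the RunnerScaleSet; per the rules above, its only label is its own name.
curl -X POST "${ACTIONS_SERVICE_URL}/_apis/runtime/runnerscalesets" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-runner-set"}'
```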

## RunnerScaleSet modification

- When the user patches an existing `AutoScalingRunnerSet`'s RunnerScaleSet-related property, ex: `runnerGroupName` or `runnerWorkDir`, the controller needs to make an HTTP PATCH call to the `_apis/runtime/runnerscalesets/2` endpoint in order to update the object on the service.
- We will put the deployed `AutoScalingRunnerSet` resource in an error state when the user tries to patch the resource with a different `githubConfigUrl`.
  > Basically, you can't move a deployed `AutoScalingRunnerSet` across GitHub entities, ex: repoA->repoB, repoA->OrgC, etc.
  > We evaluated blocking the change upfront instead of erroring at runtime, but we decided not to go down this route because it would force us to re-introduce admission webhooks (which require cert-manager).

## RunnerScaleSet deletion

- The `AutoScalingRunnerSet` custom resource controller will delete the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deletion.
  > `AutoScalingRunnerSet` deletion will contain several steps:
  >
  > - Stop the listener app so no new jobs come in and no more scaling up/down happens.
  > - Request scale down to 0
  > - Force stop all runners
  > - Wait for the scale down to 0
  > - Delete the `RunnerScaleSet` object from the service via REST API
- The deletion is done via a REST API call to the Actions service: `DELETE _apis/runtime/runnerscalesets/1` (see the sketch after this list).
- The deletion needs to use the runner registration token (admin).
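
And a similarly hedged sketch of the deletion call for a scale set with id `1`:

```bash
curl -X DELETE "${ACTIONS_SERVICE_URL}/_apis/runtime/runnerscalesets/1" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}"
```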

The user's `RunnerScaleSet` will be deleted from the service by `DormantRunnerScaleSetCleanupJob` if the particular `AutoScalingRunnerSet` has not connected to the service for the past 7 days. We have a similar rule for self-hosted runners.

## Jobs and Runners on deletion

- `RunnerScaleSet` deletion will be blocked if there is any job assigned to a runner within the `RunnerScaleSet`, which has to scale down to 0 before deletion.
- Any job that has been assigned to the `RunnerScaleSet` but hasn't been assigned to a runner within the `RunnerScaleSet` will get thrown back to the queue and wait for assignment again.
- Any offline runners within the `RunnerScaleSet` will be deleted from the service side.
docs/adrs/2022-11-04-crd-api-group-name.md (new file, 54 lines)
# ADR 2022-11-04: Technical details about the actions-runner-controller repository transfer

**Date**: 2022-11-04

**Status**: Done

# Context

As part of the ARC Private Beta: Repository Migration & Open Sourcing Process, we have decided to transfer the current [actions-runner-controller repository](https://github.com/actions-runner-controller/actions-runner-controller) into the [Actions org](https://github.com/actions).

**Goals:**

- A clear signal that GitHub will start taking over ARC and provide support.
- Since we are going to deprecate the existing auto-scale mode in ARC at some point, we want to have a clear separation between the legacy mode (not supported) and the new mode (supported).
- Avoid disrupting users as much as we can: existing ARC users will not notice any difference after the repository transfer; they can keep upgrading to newer versions of ARC and keep using the legacy mode.

**Challenges:**

- The original creator's name (`summerwind`) is all over the place, including some critical parts of ARC:
  - The k8s user resource API's full name is `actions.summerwind.dev/v1alpha1/RunnerDeployment`; renaming it to `actions.github.com` is a breaking change and will force users to rebuild their entire k8s cluster.
  - All docker images around ARC (controller + default runner) are published to [dockerhub/summerwind](https://hub.docker.com/u/summerwind)
  - The helm chart for ARC is currently hosted on [GitHub pages](https://actions-runner-controller.github.io/actions-runner-controller) for https://github.com/actions-runner-controller/actions-runner-controller, so moving the repository means we will break users who install ARC via the helm chart

# Decisions

## API group names for k8s custom resources: `actions.summerwind` or `actions.github`

- We will not rename any existing ARC resource API names after moving the repository under the Actions org (keep `summerwind` for the old stuff).
- Any new resource API we are going to add will be named properly under GitHub, ex: `actions.github.com/v1alpha1/AutoScalingRunnerSet`

Benefits:

- A clear separation from existing ARC:
  - Easy for the support engineer to triage incoming tickets and figure out whether we need to support the use case from the user
  - We won't break existing users when they upgrade to a newer version of ARC after the repository transfer

Based on the spike done by `@nikola-jokic`, we have confidence that we can host multiple resources with different API names under the same repository, and the published ARC controller can handle both resources properly.

## ARC Docker images

We will not start using the GitHub container registry for hosting ARC images (controller + runner images) right after the repository transfer.

But over time, we will start using GHCR for hosting those images along with our deprecation story.

## Helm chart

We will recreate the https://github.com/actions-runner-controller/actions-runner-controller repository after the repository transfer.

The recreated repository will only contain the helm chart assets, which will keep powering https://actions-runner-controller.github.io/actions-runner-controller so users can install ARC via Helm.

Long term, we will switch to hosting the helm chart on GHCR (OCI) instead of using GitHub Pages.

This will require a one-time change from our users: running
`helm repo remove actions-runner-controller` and `helm repo add actions-runner-controller oci://ghcr.io/actions`.
docs/adrs/2022-12-05-adding-labels-k8s-resources.md (new file, 89 lines)
# ADR 2022-12-05: Adding labels to our resources

**Date**: 2022-12-05

**Status**: Superseded [^1]

## Context

Users need to provide us with logs so that we can help support and troubleshoot their issues. We need a way for our users to filter and retrieve the logs we need.

## Proposal

A good start would be a catch-all label to get all logs that are
ARC-related: one of the [recommended labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/)
is `app.kubernetes.io/part-of`, and we can set that for all ARC components
to be `actions-runner-controller`.

Assuming standard logging, that would allow us to get all ARC logs by running

```bash
kubectl logs -l 'app.kubernetes.io/part-of=actions-runner-controller'
```

which would be very useful for development to begin with.

The proposal is to add these sets of labels to the pods ARC creates:

#### controller-manager

Labels to be set by the Helm chart:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/version: "x.x.x"
```

#### Listener

Labels to be set by the controller at creation:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: runner-scale-set-listener
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet

    # the following labels are to be extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```

#### Runner

Labels to be set by the controller at creation:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: runner
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet
    actions.github.com/runner-name: runner-name
    actions.github.com/runner-group-name: runner-group-name

    # the following labels are to be extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```

This would allow us to ask users:

> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/part-of=actions-runner-controller'?

Or, for example, if they're having problems specifically with runners:

> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/component=runner'?

This way users don't have to understand ARC's moving parts, but we still have a
way to target them specifically if we need to.
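
As a usage sketch combining the labels above (label values as proposed; namespace flags omitted):

```bash
# Logs from every runner pod of this ARC installation:
kubectl logs -l 'app.kubernetes.io/part-of=actions-runner-controller,app.kubernetes.io/component=runner'
```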

[^1]: Superseded by [ADR 2023-04-14](2023-04-14-adding-labels-k8s-resources.md)
docs/adrs/2022-12-27-pick-the-right-runner-to-scale-down.md (new file, 94 lines)
# ADR 2022-12-27: Pick the right runner to scale down

**Date**: 2022-12-27

**Status**: Done

## Context

- A custom resource `EphemeralRunnerSet` manages a set of `EphemeralRunner` custom resources.
- The `EphemeralRunnerSet` has `Replicas` in its `Spec`, and the responsibility of the `EphemeralRunnerSet_controller` is to reconcile a given `EphemeralRunnerSet` to have
  the same number of `EphemeralRunners` as `Spec.Replicas` defines.
- This means the `EphemeralRunnerSet_controller` will scale up the `EphemeralRunnerSet` by creating more `EphemeralRunners` in case `Spec.Replicas` is higher than
  the current number of `EphemeralRunners`.
- This also means the `EphemeralRunnerSet_controller` will scale down the `EphemeralRunnerSet` by finding existing `EphemeralRunners` to delete in case
  `Spec.Replicas` is less than the current number of `EphemeralRunners`.

This ADR is about how we can find the right existing `EphemeralRunner` to delete when we need to scale down.

## Current approach

1. The `EphemeralRunnerSet_controller` figures out how many `EphemeralRunners` it needs to delete, ex: scaling down from 10 to 2 means we need to delete 8 `EphemeralRunners`.

2. The `EphemeralRunnerSet_controller` finds all `EphemeralRunners` that are in the `Running` or `Pending` phase.

   > `Pending` means the `EphemeralRunner` is probably still being created and a runner has not yet been configured with the Actions service.
   > `Running` means the `EphemeralRunner` is created and a runner has probably been configured with the Actions service; the runner may be sitting there idle,
   > or may be actively running a workflow job. We don't have a clear answer for it from the ARC side. (The Actions service knows it for sure.)

3. The `EphemeralRunnerSet_controller` makes an HTTP DELETE request to the Actions service for each `EphemeralRunner` from the previous step and asks the Actions service to delete the runner via `RunnerId`.
   (The `RunnerId` is generated after the runner registers with the Actions service, and is stored on `EphemeralRunner.Status.RunnerId`.)

   > - The HTTP DELETE request looks like the following:
   >   `DELETE https://pipelines.actions.githubusercontent.com/WoxlUxJHrKEzIp4Nz3YmrmLlZBonrmj9xCJ1lrzcJ9ZsD1Tnw7/_apis/distributedtask/pools/0/agents/1024`
   >
   > The Actions service will return 2 types of responses:
   >
   > 1. 204 (No Content): The runner with Id 1024 has been successfully removed from the service, or the runner with Id 1024 doesn't exist.
   > 2. 400 (Bad Request) with a JSON body that contains an error message like `JobStillRunningException`: the service can't remove this runner at this point since it has been
   >    assigned to a job request; the client won't be able to remove the runner until the runner finishes its currently assigned job request.

4. The `EphemeralRunnerSet_controller` will ignore any deletion error from runners that are still running a job, and keep trying deletion until the number of `204` responses equals the number of
   `EphemeralRunners` it needs to delete.

## The problem with the current approach

In a busy `AutoScalingRunnerSet`, scale up and down may happen all the time as jobs are queued up and finished.

We will make way too many HTTP requests to the Actions service asking it to try to delete a certain runner, and rely on the exception from the service to figure out what to do next.

The runner deletion request is not cheap for the service; for synchronization, the `JobStillRunningException` is raised from the DB call for the request.

So we are wasting resources on both the Actions service (extra load on the database) and the actions-runner-controller (useless outgoing HTTP requests).

In the test ARC that I deployed to Azure, the ARC controller tried to delete RunnerId 12408 for `bbq-beets/ting-test` a total of 35 times within 10 minutes.

## Root cause

The `EphemeralRunnerSet_controller` doesn't know whether a given `EphemeralRunner` is actually running a workflow job or not
(it only knows the runner is configured at the service), so it can't filter out those `EphemeralRunners`.

## Additional context

The legacy ARC's custom resource allows the runner image to leverage the RunnerJobHook feature to update the status of the runner custom resource in K8S (mark the runner as running workflow run Id XXX).

This brings good value to users, as it provides some insight into which runner is running which job across all the runners in the cluster, and it looks pretty close to what we want in order to fix the [root cause](#root-cause).

However, the legacy ARC approach means the service account for running the runner pod needs to have elevated permission to update the custom resource;
this would be a big `NO` from a security point of view since we may not trust the code running inside the runner pod.

## Possible Solution

The nature of the k8s controller-runtime means we might reconcile the resource based on stale cache data.

I think our goals for the solution should be:

- Reduce wasteful HTTP requests on a scale-down as much as we can.
  - We can accept that we might make 1 or 2 wasteful requests to the Actions service, but we can't accept making 5/10+ of them.
- See if we can reach feature parity with what the RunnerJobHook supports without compromising on any security concerns.

Since the root cause of why the reconciliation can't skip an `EphemeralRunner` is that we don't know whether an `EphemeralRunner` is running a job,
a simple thought is: how about we somehow attach some info to the `EphemeralRunner` to indicate it's currently running a job?

How about we send this info from the service to the auto-scaling-listener via the existing HTTP long-poll
and let the listener patch the `EphemeralRunner.Status` to indicate it's running a job?

> The listener is normally in a separate namespace with elevated permission and it's something we can trust.

Changes:

- Introduce a new message type `JobStarted` (in addition to the existing `JobAvailable/JobAssigned/JobCompleted`) on the service side; the message is sent when a runner of the `RunnerScaleSet` gets assigned to a job.
  `RequestId`, `RunnerId`, and `RunnerName` will be included in the message.
- Add `RequestId (int)` to `EphemeralRunner.Status`; this will indicate which job the runner is running.
- The `AutoScalingListener` will use the payload of this new message to patch `EphemeralRunners/RunnerName/Status` with the `RequestId` (a sketch of the patched resource follows this list).
- When the `EphemeralRunnerSet_controller` tries to find `EphemeralRunners` to delete on a scale down, it will skip any `EphemeralRunner` that has `EphemeralRunner.Status.RequestId` set.
- In the future, we can expose more info in this `JobStarted` message and introduce more properties under `EphemeralRunner.Status` to reach feature parity with legacy ARC's RunnerJobHook.
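
For illustration, a patched `EphemeralRunner` might look roughly like this; field names, casing, and values are assumptions for the sketch, not the exact CRD schema:

```yaml
apiVersion: actions.github.com/v1alpha1
kind: EphemeralRunner
metadata:
  name: my-scale-set-runner-abc12   # hypothetical runner name
status:
  runnerId: 12408
  requestId: 2053   # set by the listener from the JobStarted message; scale-down skips this runner
```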
docs/adrs/2023-02-02-automate-runner-updates.md (new file, 42 lines)
# ADR 2023-02-02: Automate updating runner version

**Date**: 2023-02-02

**Status**: Done

## Context

When a new [runner](https://github.com/actions/runner) version is released, new
images need to be built in
[actions-runner-controller/releases](https://github.com/actions-runner-controller/releases).
This is currently started by the
[release-runners](https://github.com/actions/actions-runner-controller/blob/master/.github/workflows/release-runners.yaml)
workflow, although it only starts when the set of files containing the runner
version is updated (and this is currently done manually).

## Decision

We can have another workflow running on a cadence (hourly seems sensible), checking for new runner
releases and creating a PR that updates `RUNNER_VERSION` in:

- `.github/workflows/release-runners.yaml`
- `Makefile`
- `runner/Makefile`
- `test/e2e/e2e_test.go`

Once that PR is merged, the existing workflow will pick things up.
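
A rough sketch of what such a cadence workflow could look like; the workflow name, schedule, grep/sed patterns, and PR mechanics below are assumptions, not the final implementation:

```yaml
name: Update runner version
on:
  schedule:
    - cron: "0 * * * *" # hourly
  workflow_dispatch: {}

permissions:
  contents: write
  pull-requests: write

jobs:
  check-runner-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Open a PR if a newer runner version exists
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          latest="$(gh api repos/actions/runner/releases/latest --jq '.tag_name' | sed 's/^v//')"
          current="$(grep -oE 'RUNNER_VERSION \?= [0-9.]+' Makefile | awk '{print $3}')"  # hypothetical Makefile pattern
          if [ "$latest" = "$current" ]; then
            echo "Runner is up to date ($current)"
            exit 0
          fi
          branch="update-runner-to-${latest}"
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git checkout -b "$branch"
          # Update RUNNER_VERSION in the files listed above (sed patterns are illustrative).
          sed -i "s/${current}/${latest}/g" .github/workflows/release-runners.yaml Makefile runner/Makefile test/e2e/e2e_test.go
          git commit -am "Update runner version to ${latest}"
          git push origin "$branch"
          gh pr create --title "Update runner version to ${latest}" --body "Automated runner version bump."
```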

## Consequences

We don't have to add an extra step to the runner release process, nor a direct
dependency on ARC. Since images won't be built until the generated PR is merged,
we still have room to wait before triggering a build should there be any
problems with the runner release.

## Considered alternatives

We also considered firing the workflow to create the PR via
`repository_dispatch` as part of the release process of the runner itself, but we
discarded it because that would have required a PAT or a GitHub App with `repo`
scope within the Actions org and would have added a new direct dependency on the
runner side.
docs/adrs/2023-02-10-limit-manager-role-permission.md (new file, 138 lines)
# ADR 2023-02-10: Limit Permissions for Service Accounts in Actions-Runner-Controller

**Date**: 2023-02-10

**Status**: Done

## Context

- `actions-runner-controller` is a Kubernetes CRD (with controller) built using https://github.com/kubernetes-sigs/controller-runtime

- [controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) has a default cache-based k8s API `client.Reader` to make querying the k8s API server more efficient.

- The cache-based API client requires cluster-scope `list` and `watch` permission for any resource the controller may query.

- This document is scoped to the AutoscalingRunnerSet CRD and its controller only.

## Service accounts and their role binding in actions-runner-controller

There are 3 service accounts involved in a working `AutoscalingRunnerSet`-based `actions-runner-controller`:

1. Service account for each ephemeral runner Pod

   This should have the lowest privilege (no `RoleBinding` nor `ClusterRoleBinding`) by default; in the case of `containerMode=kubernetes`, it will get certain write permissions with a `RoleBinding` to limit the permission to a single namespace.

   > References:
   >
   > - ./charts/gha-runner-scale-set/templates/no_permission_serviceaccount.yaml
   > - ./charts/gha-runner-scale-set/templates/kube_mode_role.yaml
   > - ./charts/gha-runner-scale-set/templates/kube_mode_role_binding.yaml
   > - ./charts/gha-runner-scale-set/templates/kube_mode_serviceaccount.yaml

2. Service account for the AutoScalingListener Pod

   This has a `RoleBinding` to a single namespace with a `Role` that has permission to `PATCH` `EphemeralRunnerSet` and `EphemeralRunner`.

3. Service account for the controller manager

   Since the CRD controller is a singleton installed in the cluster that manages the CRD across multiple namespaces by default, the service account of the controller manager pod has a `ClusterRoleBinding` to a `ClusterRole` with broader permissions.

The current `ClusterRole` has the following permissions:

- Get/List/Create/Delete/Update/Patch/Watch on `AutoScalingRunnerSets` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `AutoScalingListeners` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `EphemeralRunnerSets` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `EphemeralRunners` (with `Status` and `Finalizer` sub-resources)

- Get/List/Create/Delete/Update/Patch/Watch on `Pods` (with `Status` sub-resource)
- **Get/List/Create/Delete/Update/Patch/Watch on `Secrets`**
- Get/List/Create/Delete/Update/Patch/Watch on `Roles`
- Get/List/Create/Delete/Update/Patch/Watch on `RoleBindings`
- Get/List/Create/Delete/Update/Patch/Watch on `ServiceAccounts`

> The full list can be found at: https://github.com/actions/actions-runner-controller/blob/facae69e0b189d3b5dd659f36df8a829516d2896/charts/actions-runner-controller-2/templates/manager_role.yaml

## Limit cluster role permission on Secrets

The cluster-scope `List Secrets` permission might be a blocker for adopting `actions-runner-controller` for certain customers, as they may have restrictions in their cluster that simply don't allow any service account to have cluster-scope `List Secrets` permission.

To help these customers and improve security for `actions-runner-controller` in general, we will try to limit the `ClusterRole` permissions of the controller manager's service account down to the following:

- Get/List/Create/Delete/Update/Patch/Watch on `AutoScalingRunnerSets` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `AutoScalingListeners` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `EphemeralRunnerSets` (with `Status` and `Finalizer` sub-resources)
- Get/List/Create/Delete/Update/Patch/Watch on `EphemeralRunners` (with `Status` and `Finalizer` sub-resources)

- List/Watch on `Pods`
- List/Watch on `Roles`
- List/Watch on `RoleBindings`
- List/Watch on `ServiceAccounts`

> We will change the default cache-based client to bypass the cache when reading `Secrets` and `ConfigMaps` (a ConfigMap is used when you configure `githubServerTLS`), so we can eliminate the need for cluster-scope `List` and `Watch` permissions on `Secrets`.

Introduce a new `Role` for the controller and a `RoleBinding` that binds the `Role` to the controller's `ServiceAccount` in the namespace the controller is deployed to. This role will grant the controller's service account the permissions required to work with `AutoScalingListeners` in the controller namespace:

- Get/Create/Delete on `Pods`
- Get on `Pods/status`
- Get/Create/Delete/Update/Patch on `Secrets`
- Get/Create/Delete/Update/Patch on `ServiceAccounts`

The `Role` and `RoleBinding` creation will happen during `helm install demo oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller`.

During `helm install demo oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller`, we will store the controller's service account info as labels on the controller `Deployment`.
Ex:

```yaml
actions.github.com/controller-service-account-namespace: {{ .Release.Namespace }}
actions.github.com/controller-service-account-name: {{ include "gha-runner-scale-set-controller.serviceAccountName" . }}
```

Introduce a new `Role` per `AutoScalingRunnerSet` installation and a `RoleBinding` that binds the `Role` to the controller's `ServiceAccount` in the namespace each `AutoScalingRunnerSet` is deployed to, with the following permissions:

- Get/Create/Delete/Update/Patch/List on `Secrets`
- Create/Delete on `Pods`
- Get on `Pods/status`
- Get/Create/Delete/Update/Patch on `Roles`
- Get/Create/Delete/Update/Patch on `RoleBindings`
- Get on `ConfigMaps`

The `Role` and `RoleBinding` creation will happen during `helm install demo oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set` to grant the controller's service account the permissions required to operate in the namespace the `AutoScalingRunnerSet` is deployed to.

The `gha-runner-scale-set` helm chart will try to find the `Deployment` of the controller using `helm lookup`, and get the service account info from the labels of the controller `Deployment` (`actions.github.com/controller-service-account-namespace` and `actions.github.com/controller-service-account-name`).

The `gha-runner-scale-set` helm chart will use this service account to properly render the `RoleBinding` template.
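
For illustration, the rendered `RoleBinding` could look roughly like the sketch below, assuming a controller service account `test-arc-gha-runner-scale-set-controller` in `arc-system` and an `AutoScalingRunnerSet` release named `demo` installed in `test-namespace`; the resource names here are illustrative, not the chart's actual output:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: demo-gha-runner-scale-set-manager-role-binding  # illustrative name
  namespace: test-namespace                             # namespace of the AutoScalingRunnerSet
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: demo-gha-runner-scale-set-manager-role          # the per-installation Role described above
subjects:
  - kind: ServiceAccount
    name: test-arc-gha-runner-scale-set-controller      # discovered via helm lookup, or provided explicitly
    namespace: arc-system
```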

The `gha-runner-scale-set` helm chart will also allow customers to explicitly provide the controller service account info, in case the `helm lookup` couldn't locate the right controller `Deployment`.

New sections in `values.yaml` of `gha-runner-scale-set`:

```yaml
## Optional controller service account that needs to have required Role and RoleBinding
## to operate this gha-runner-scale-set installation.
## The helm chart will try to find the controller deployment and its service account at installation time.
## In case the helm chart can't find the right service account, you can explicitly pass in the following values
## to help it finish the RoleBinding with the right service account.
## Note: if your controller is installed to only watch a single namespace, you have to pass these values explicitly.
controllerServiceAccount:
  namespace: arc-system
  name: test-arc-gha-runner-scale-set-controller
```

## Install ARC to only watch/react to resources in a single namespace

In case the user doesn't want to have any `ClusterRole`, they can choose to install the `actions-runner-controller` in a mode that only requires a `Role` with a `RoleBinding` in a particular namespace.

In this mode, the `actions-runner-controller` will only be able to watch the `AutoScalingRunnerSet` resource in a single namespace.

If you want to deploy multiple `AutoScalingRunnerSets` into different namespaces, you will need to install `actions-runner-controller` in this mode multiple times as well, and have each installation watch the namespace you want to deploy an `AutoScalingRunnerSet` to.

You will install `actions-runner-controller` with something like `helm install arc --namespace arc-system --set watchSingleNamespace=test-namespace oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller` (the `test-namespace` namespace needs to be created first).

You will deploy the `AutoScalingRunnerSet` with something like `helm install demo --namespace test-namespace oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set`.
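
Put together, a single-namespace setup might look like this sketch (release and namespace names are illustrative; the GitHub configuration values the runner scale set chart needs are omitted):

```bash
kubectl create namespace test-namespace

# Controller that only watches test-namespace
helm install arc --namespace arc-system --create-namespace \
  --set watchSingleNamespace=test-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# Runner scale set deployed into the watched namespace
helm install demo --namespace test-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```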

In this mode, you will end up with a manager `Role` that has all the Get/List/Create/Delete/Update/Patch/Watch permissions on the resources we need, and a `RoleBinding` that binds the `Role` to the controller `ServiceAccount` in both the watched single namespace and the controller namespace, ex: `test-namespace` and `arc-system` in the above example.

The downsides of this mode:

- When you have multiple controllers deployed, they will still use the same version of the CRD. So you will need to make sure every controller you deploy is the same version.
- You can't mix installations of `actions-runner-controller` in this mode (watchSingleNamespace) with the regular installation mode (watchAllClusterNamespaces) in your cluster.
docs/adrs/2023-03-17-workflow-improvements.md (new file, 84 lines)
# Improve ARC workflows for autoscaling runner sets

**Date**: 2023-03-17

**Status**: Done

## Context

In the [actions-runner-controller](https://github.com/actions/actions-runner-controller)
repository we essentially have two projects living side by side: the "legacy"
actions-runner-controller and the new one GitHub is supporting
(gha-runner-scale-set). To hasten progress we relied on existing workflows and
added some of our own (e.g. end-to-end tests). We have now got to a point where it's
sort of confusing what does what and why, not to mention the increased running
times of some of those workflows and some GHA-related flaky tests getting in the
way of legacy ARC and vice versa. The three main areas we want to cover are: Go
code, Kubernetes manifests / Helm charts, and E2E tests.

## Go code

At the moment we have three workflows that validate Go code:

- [golangci-lint](https://github.com/actions/actions-runner-controller/blob/34f3878/.github/workflows/golangci-lint.yaml):
  this is a collection of linters that currently runs on all PRs and pushes to
  master
- [Validate ARC](https://github.com/actions/actions-runner-controller/blob/01e9dd3/.github/workflows/validate-arc.yaml):
  this is a bit of a catch-all workflow; other than Go tests this also validates
  Kubernetes manifests and runs `go generate`, `go fmt` and `go vet`
- [Run CodeQL](https://github.com/actions/actions-runner-controller/blob/a095f0b66aad5fbc8aa8d7032f3299233e4c84d2/.github/workflows/run-codeql.yaml)

### Proposal

I think having one `Go` workflow that collects everything-Go would help a ton with
the reliability and understandability of what's going on. This shouldn't be limited
to the GHA-supported mode, as there are changes that, even if made outside the GHA
code base, could affect us (such as a dependency update).
This workflow should only run on changes to `*.go` files, `go.mod` and `go.sum`.
It should have these jobs, aiming to cover all existing functionality and
eliminate some duplication (a rough sketch follows the list):

- `test`: run all Go tests in the project. We currently use the `-short` and
  `-coverprofile` flags: while `-short` is used to skip [old ARC E2E
  tests](https://github.com/actions/actions-runner-controller/blob/master/test/e2e/e2e_test.go#L85-L87),
  `-coverprofile` is adding to the test time without really giving us any value
  in return. We should also start using `actions/setup-go@v4` to take advantage
  of caching (it would speed up our tests by a lot) or enable it on `v3` if we
  have a strong reason not to upgrade. We should keep ignoring our E2E tests too,
  as those will be run elsewhere (either use `Short` there too or ignore the
  package like we currently do). As a dependency for tests this needs to run
  `make manifests` first: we should fail there and then if there is a diff.
- `fmt`: we currently run `go fmt ./...` as part of `Validate ARC` but do
  nothing with the results. We should fail in case of a diff. We don't need
  caching for this job.
- `lint`: this corresponds to what's currently the `golangci-lint` workflow (this
  also covers `go vet`, which currently happens as part of `Validate ARC` too)
- `generate`: the current behaviour for this is actually quite risky: we
  generate our code in the `Validate ARC` workflow and use the results to run the
  tests, but we don't validate that up-to-date generated code is checked in. This
  job should run `go generate` and fail on a diff.
- `vulncheck`: **EDIT: this is covered by CodeQL** the Go team is maintaining [`govulncheck`](https://go.dev/blog/vuln), a tool that recursively
  analyzes all function calls in Go code and spots vulnerabilities on the call
  stack.
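
A rough sketch of what the consolidated workflow could look like; the job wiring, action versions and commands are assumptions here, not the final workflow:

```yaml
name: Go
on:
  push:
    branches: [master]
    paths: ["**.go", "go.mod", "go.sum"]
  pull_request:
    paths: ["**.go", "go.mod", "go.sum"]

# (a lint job using golangci-lint would sit alongside these)
jobs:
  fmt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      - run: go fmt ./... && git diff --exit-code

  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      - run: go generate ./... && git diff --exit-code

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      - run: make manifests && git diff --exit-code
      - run: go test -short ./...
```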

## Kubernetes manifests / Helm charts

We have [recently separated](https://github.com/actions/actions-runner-controller/commit/bd9f32e3540663360cf47f04acad26e6010f772e)
Helm chart validation, and we validate up-to-dateness of manifests as part of `Go
/ test`.

## End to end tests

These tests are giving us really good coverage and should be one of the main
actors when it comes to trusting our releases. Two improvements that could be
made here are:

- renaming the workflow to `GHA E2E`: since renaming our resources, the `gha`
  prefix has been used to identify things related to the mode GitHub supports,
  and these jobs strictly validate the GitHub mode _only_. Having a shorter name
  allows for more readability of the various scenarios (e.g. `GHA E2E /
  single-namespace-setup`).
- the tests currently monitor and validate the number of pods spawning during
  the workflow but not the outcome of the workflow. While it is not necessary to
  look at pod specifics, we should at least guarantee that the workflow can
  successfully conclude.
docs/adrs/2023-04-14-adding-labels-k8s-resources.md (new file, 89 lines)
# ADR 2023-04-14: Adding labels to our resources

**Date**: 2023-04-14

**Status**: Done [^1]

## Context

Users need to provide us with logs so that we can help support and troubleshoot their issues. We need a way for our users to filter and retrieve the logs we need.

## Proposal

A good start would be a catch-all label to get all logs that are
ARC-related: one of the [recommended labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/)
is `app.kubernetes.io/part-of`, and we can set that for all ARC components
to be `gha-runner-scale-set-controller`.

Assuming standard logging, that would allow us to get all ARC logs by running

```bash
kubectl logs -l 'app.kubernetes.io/part-of=gha-runner-scale-set-controller'
```

which would be very useful for development to begin with.

The proposal is to add these sets of labels to the pods ARC creates:

#### controller-manager

Labels to be set by the Helm chart:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set-controller
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/version: "x.x.x"
```

#### Listener

Labels to be set by the controller at creation:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set-controller
    app.kubernetes.io/component: runner-scale-set-listener
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet

    # the following labels are to be extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```

#### Runner

Labels to be set by the controller at creation:

```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: gha-runner-scale-set-controller
    app.kubernetes.io/component: runner
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet
    actions.github.com/runner-name: runner-name
    actions.github.com/runner-group-name: runner-group-name

    # the following labels are to be extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```

This would allow us to ask users:

> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/part-of=gha-runner-scale-set-controller'?

Or, for example, if they're having problems specifically with runners:

> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/component=runner'?

This way users don't have to understand ARC's moving parts, but we still have a
way to target them specifically if we need to.
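
As a usage sketch with the scale-set labels above (the scale set name `my-scale-set` is hypothetical):

```bash
# Logs from the runner pods of one specific scale set:
kubectl logs -l 'app.kubernetes.io/component=runner,actions.github.com/scale-set-name=my-scale-set'
```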

[^1]: Supersedes [ADR 2022-12-05](2022-12-05-adding-labels-k8s-resources.md)
docs/adrs/yyyy-mm-dd-TEMPLATE.md (new file, 18 lines)
# Title

<!-- ADR titles should typically be imperative sentences. -->

**Status**: (Proposed|Accepted|Rejected|Superseded|Deprecated)

## Context

_What is the issue or background knowledge necessary for future readers
to understand why this ADR was written?_

## Decision

_**What** is the change being proposed? **How** will it be implemented?_

## Consequences

_What becomes easier or more difficult to do because of this change?_