Commit Graph

100 Commits

Author SHA1 Message Date
Yusuke Kuoka
a6f0e0008f Make unregistration timeout and retry delay configurable in integration tests 2022-02-20 12:05:34 +00:00
Yusuke Kuoka
79a31328a5 Stop recreating ephemeral runner pod
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384
2022-02-20 04:42:19 +00:00
Yusuke Kuoka
3c16188371 Introduce consistent timeouts for runner unregistration and runner pod deletion
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
2022-02-20 04:36:35 +00:00
Tingluo Huang
0b9bef2c08 Try to unconfig runner before deleting the pod to recreate (#1125)
There is a race condition between ARC and GitHub service about deleting runner pod.

- The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
- The ARC delete the runner pod which cause the in-progress job to end up canceled.

To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod.
- `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
- `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.

TODO: I need to do some E2E tests to force the race condition to happen.

Ref #911
2022-02-19 21:22:31 +09:00
Yusuke Kuoka
a5ed6bd263 Fix RunerSet managed runner pods to terminate more gracefully (#1126)
Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460
2022-02-19 21:19:37 +09:00
Yusuke Kuoka
921f547200 fix: Do recreate runner pod on registration token update (#1087)
Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated.

This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
2022-02-19 21:18:00 +09:00
Daniel
8a73560dbc if a Volume is defined by the operator don't add another "work" volume. (#1015)
This allows providing a different `work` Volume.

This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage.

This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-2
spec:
  template:
    spec:
      dockerdContainerResources: {}
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      # this is to mount the docker in docker onto NVME disk
      dockerVolumeMounts:
      - mountPath: /var/lib/docker
        name: scratch
        subPathExpr: $(POD_NAME)-docker
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      volumeMounts:
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      dockerEnv:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      volumes:
      - hostPath:
          path: /mnt/disks/ssd0
        name: scratch
      - hostPath:
          path: /mnt/disks/ssd0
        name: work
      nodeSelector:
       cloud.google.com/gke-nodepool: runner-16-with-nvme
      ephemeral: false
      image: ""
      imagePullPolicy: Always
      labels:
        - runner-2
        - self-hosted
      organization: yourorganization

```
2022-01-07 10:01:40 +09:00
Callum Tait
ad48851dc9 feat: expose if docker is enabled and wait for docker to be ready (#962)
Resolves #897
Resolves #915
2021-12-29 10:23:35 +09:00
renovate[bot]
c64000e11c fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 (#740)
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0

* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade

* Regenerate manifests with the latest K8s dependencies

Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-17 09:06:55 +09:00
Felipe Galindo Sanchez
9bb21aef1f Add support for default image pull secret name (#921)
Resolves #896

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-15 09:29:31 +09:00
Pavel Smalenski
91102c8088 Add dockerEnv variable for RunnerDeployment (#912)
Resolves #878

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-14 17:13:24 +09:00
Felipe Galindo Sanchez
f0fccc020b refactor: split Reconciler from Reconcile in a few methods (#926)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-12 14:22:55 +09:00
Patrick Ellis
ea2dbc2807 Update go-github from v37 -> v39 (#925) 2021-12-11 21:43:40 +09:00
Yusuke Kuoka
898ad3c355 Work-around for offline+busy runners (#993)
Ref #911
2021-12-09 09:31:06 +09:00
apr-1985
0d3de9ee2a chore: correct logging typo (#904) 2021-10-22 09:03:23 +09:00
Maxim Pogozhiy
fce7d6d2a7 Add topologySpreadConstraints (#814) 2021-10-17 21:49:44 +01:00
Tristan Keen
5e3f89bdc5 Correct test to append docker container (#837)
Fixes #835
2021-09-24 09:18:20 +09:00
Tristan Keen
1eb135cace Correct default image logic 2021-09-14 17:00:57 +09:00
Rolf Ahrenberg
14564c7b8e Allow disabling /runner emptydir mounts and setting storage volume (#674)
* Allow disabling /runner emptydir mounts

* Support defining storage medium for emptydirs

* Fix typos
2021-07-15 06:29:58 +09:00
Sebastien Le Digabel
7f2795b5d6 Adding a default docker registry mirror (#689)
* Adding a default docker registry mirror

This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.

The change is backward compatible, if a runner has a docker registry
mirror specified, it will supersede the default one.
2021-07-15 06:20:08 +09:00
Yusuke Kuoka
f858e2e432 Add POC of GitHub Webhook Delivery Forwarder (#682)
* Add POC of GitHub Webhook Delivery Forwarder

* multi-forwarder and ctrl-c existing and fix for non-woring http post

* Rename source files

* Extract signal handling into a dedicated source file

* Faster ctrl-c handling

* Enable automatic creation of repo hook on startup

* Add support for forwarding org hook deliveries

* Set hook secret on hook creation via envvar (HOOK_SECRET)

* Fix org hook support

* Fix HOOK_SECRET for consistency

* Refactor to prepare for custom log position provider

* Refactor to extract inmemory log position provider

* Add configmap-based log position provider

* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder

* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg

* Refactor to extract logger initialization

* Add hookdeliveryforwarder README and bump go-github to unreleased ver
2021-07-14 10:18:55 +09:00
Yusuke Kuoka
6f130c2db5 Fix dockerdWithinRunnerContainer for Runner(Deployment) not working in the main branch (#696)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/674#issuecomment-878600993
2021-07-13 18:14:15 +09:00
Yusuke Kuoka
f19e7ea8a8 chore: Upgrade go-github to v36 (#681) 2021-07-04 17:43:52 +09:00
Yusuke Kuoka
acb906164b RunnerSet: Automatic-recovery from registration timeout and deregistration on pod termination (#652)
Ref #629
Ref #613
Ref #612
2021-06-24 20:39:37 +09:00
Yusuke Kuoka
8b90b0f0e3 Clean up import list (#645)
Resolves #644
2021-06-22 17:55:06 +09:00
Jonathan Gonzalez V
a277489003 Added support to enable and disable enableServiceLinks. (#628)
This option expose internally some `KUBERNETES_*` environment variables
that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will
try to connect to the Kubernetes cluster where the runner it's running.

This option it's set by default to `true` in any Kubernetes deployment.

Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>
2021-06-22 17:27:26 +09:00
Yusuke Kuoka
9e4dbf497c feat: RunnerSet backed by StatefulSet (#629)
* feat: RunnerSet backed by StatefulSet

Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens.

Resolves #613
Ref #612

* Upgrade controller-runtime to 0.9.0

* Bump Go to 1.16.x following controller-runtime 0.9.0

* Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup

* Fix startup failure due to missing LeaderElectionID

* Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered

* Allow force-updating statefulset

* Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false

* Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec

* Enable running acceptance tests against arbitrary kind cluster

* RunnerSet supports non-ephemeral runners only today

* fix: docker-build from root Makefile on intel mac

* fix: arch check fixes for mac and ARM

* ci: aligning test data format and patching checks

* fix: removing namespace in test data

* chore: adding more ignores

* chore: removing leading space in shebang

* Re-add metrics to org hra testdata

* Bump cert-manager to v1.1.1 and fix deploy.sh

Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
2021-06-22 17:10:09 +09:00
Jonah Back
8c42f99d0b feat: avoid setting privileged flag if seLinuxOptions is not null (#599)
Sets the privileged flag to false if SELinuxOptions are present/defined. This is needed because containerd treats SELinux and Privileged controls as mutually exclusive. Also see https://github.com/containerd/cri/blob/aa2d5a97c/pkg/server/container_create.go#L164.

This allows users who use SELinux for managing privileged processes to use GH Actions - otherwise, based on the SELinux policy, the Docker in Docker container might not be privileged enough. 

Signed-off-by: Jonah Back <jonah@jonahback.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-06-04 08:59:11 +09:00
Ameer Ghani
7523ea44f1 feat: allow specifying runtime class in runner spec (#580)
This allows using the `runtimeClassName` directive in the runner's spec.

One of the use-cases for this is Kata Containers, which use `runtimeClassName` in a pod spec as an indicator that the pod should run inside a Kata container. This allows us a greater degree of pod isolation.
2021-06-04 08:56:43 +09:00
Thejas N
588872a316 feat: allow ephemeral runner to be optional (#498)
- Adds `ephemeral` option to `runner.spec` 
    
    ```
      ....
      template:
         spec:
             ephemeral: false
             repository: mumoshu/actions-runner-controller-ci
      ....
    ```
- `ephemeral` defaults to `true`
- `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag
- Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`.

Resolves #457
2021-05-02 19:04:14 +09:00
Yusuke Kuoka
dbd7b486d2 feat: Support for scaling from/to zero (#465)
This is an attempt to support scaling from/to zero.

The basic idea is that we create a one-off "registration-only" runner pod on RunnerReplicaSet being scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding those.

GitHub Actions seems to immediately throw away the new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline` would prevent GitHub actions from failing jobs. But retaining `busy` or `idle` runners means that we need to keep runner pods running, which conflicts with our desired to scale to/from zero, hence we retain `offline` runners.

In this change, I enhanced the runnerreplicaset controller to create a registration-only runner on very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline`, until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.

Related to #447
2021-05-02 16:11:36 +09:00
Rolf Ahrenberg
6b77a2a5a8 feat: Docker registry mirror (#478)
Changes:

- Switched to use `jq` in startup.sh
- Enable docker registry mirror configuration which is useful when e.g. avoiding the Docker Hub rate-limiting

Check #478 for how this feature is tested and supposed to be used.
2021-04-25 14:04:01 +09:00
Manuel Jurado
37c2a62fa8 Allow to configure runner volume size limit (#436)
Enable the user to set a limit size on the volume of the runner to avoid some runner pod affecting other resources of the same cluster

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-04-18 13:56:59 +09:00
Agoney Garcia-Deniz
2e551c9d0a Add hostAliases to the runner spec (#456) 2021-04-17 17:04:52 +09:00
asoldino
b42b8406a2 Add dockerVolumeMounts (#439)
Resolves #435
2021-04-06 10:10:10 +09:00
Christoph Brand
9ed245c85e feature(controller): remove dockerd executable (#432) 2021-04-01 08:50:48 +09:00
Yusuke Kuoka
156e2c1987 Fix MTU configuration for dockerd (#421)
Resolves #393
2021-03-31 09:29:21 +09:00
Yusuke Kuoka
374105c1f3 Fix dindWithinRunnerContainer not to crash-loop runner pods (#419)
Apparently #253 broke dindWithinRunnerContainer completely due to the difference in how /runner volume is set up.
2021-03-25 10:23:36 +09:00
Yusuke Kuoka
bc6e499e4f Make logging more concise (#410)
This makes logging more concise by changing logger names to something like `controllers.Runner` to `actions-runner-controller.runner` after the standard `controller-rutime.controller` and reducing redundant logs by removing unnecessary requeues. I have also tweaked log messages so that their style is more consistent, which will also help readability. Also, runnerreplicaset-controller lacked useful logs so I have enhanced it.
2021-03-20 07:34:25 +09:00
Yusuke Kuoka
07f822bb08 Do include Runner controller in integration test (#409)
So that we could catch bugs in runner controller like seen in #398, #404, and #407.

Ref #400
2021-03-19 16:14:15 +09:00
Yusuke Kuoka
c424215044 Do recheck runner registration timely (#405)
Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded).

More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.
2021-03-19 11:02:47 +09:00
Yusuke Kuoka
3cccca8d09 Do patch runner status instead of update to reduce conflicts and avoid future bugs
Ref https://github.com/summerwind/actions-runner-controller/pull/398#issuecomment-801548375
2021-03-18 10:31:17 +09:00
Yusuke Kuoka
7a7086e7aa Make error logs more helpful 2021-03-18 10:26:21 +09:00
Yusuke Kuoka
3f23501b8e Reduce "No runner matching the specified labels was found" errors while runner replacement (#392)
We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered.

A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.
2021-03-16 10:52:30 +09:00
Brandon Kimbrough
2273b198a1 Add ability to set the MTU size of the docker in docker container (#385)
* adding abilitiy to set docker in docker MTU size

* safeguards to only set MTU env var if it is set
2021-03-12 08:44:49 +09:00
Johannes Nicolai
2d7fbbfb68 Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it will go in an 
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding 
runner objects exists in GitHub and the pod will get removed after 
registration timeout
* if the GitHub runner object already existed before - e.g. because a 
finalizer was not properly run as part of a partial Kubernetes crash, 
the runner will always stay in a running mode, even updating the 
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner 
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart, 
only do additional restart checks if no restart was already decided, 
making code a bit cleaner and saving GitHub API calls after each job 
completion
2021-02-22 10:08:04 +09:00
Yusuke Kuoka
7d024a6c05 Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration test. It turned out due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is being fixed with this change.
2021-02-16 18:51:33 +09:00
Johannes Nicolai
2623140c9a Make log message less scary (#311)
* the reconciliation loop is often much faster than the runner startup, 
so changing runner not found related messages to debug and also add the 
possibility that the runner just needs more time
2021-02-16 09:55:55 +09:00
Johannes Nicolai
bc8bc70f69 Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct to return true which never 
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which lead to 
unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
Johannes Nicolai
9c8d7305f1 Introduce pod deletion timeout and forcefully delete stuck pods (#307)
* if a k8s node becomes unresponsive, the kube controller will soft
delete all pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave the
last status and hence the runner controller will assume that everything
is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks
that none of its runners are currently busy and will not schedule any
further runners / pods, resulting in a broken  runner deployment until the
runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner
controller will try to reboot the runner and create a pod on a working
node
* use forceful deletion and requeue for pods that have been stuck for
more than one minute in terminating state
* gracefully handling race conditions if pod gets finally forcefully deleted within
2021-02-15 09:32:28 +09:00