Commit Graph

267 Commits

Author SHA1 Message Date
oreonl
e511401e51 fix: don't base64 decode secret strings (#1683) 2022-08-03 11:47:07 +09:00
Yusuke Kuoka
97404144eb Fix excessive runnerreplicaset update issue since 0.25.0 (#1650)
Fixes #1643
2022-07-17 19:43:24 +09:00
Felipe Galindo Sanchez
584745b67d Minor improvements for runner groups
- Add group in runners columns
- Add constant for runner group and labels
2022-07-15 09:47:25 +09:00
Yusuke Kuoka
8071ac7066 Remove github-api-cache-duration flag and code (#1631)
This removes the flag and code for the legacy GitHub API cache. We already migrated to fully use the new HTTP cache based API cache functionality which had been added via #1127 and available since ARC 0.22.0. Since then, the legacy one had been no-op and therefore removing it is safe.

Ref #1412
2022-07-12 20:37:24 +09:00
Yusuke Kuoka
618276e3d3 Enhance support for multi-tenancy (#1371)
This enhances every ARC controller and the various K8s custom resources so that the user can now configure a custom GitHub API credentials (that is different from the default one configured per the ARC instance).

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1067#issuecomment-1043716646
2022-07-12 09:45:00 +09:00
Yusuke Kuoka
1cfe1974c4 Add missing job-related permissions to runner pods with k8s container mode 2022-07-10 16:16:32 +09:00
Felipe Galindo Sanchez
11cb9b7882 feat: allow to discover runner statuses (#1268)
* feat: allow to discover runner statuses

* fix manifests

* Bump runner version to 2.289.1 which includes the hooks support

* Add feedback from review

* Update reference to newRunnerPod

* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific

* Fix additional TestNewRunnerPod test

* Cover additional feedback from review

* fix rbac manager role

* Add permissions to service account for container mode if not provided

* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-07-10 15:11:29 +09:00
everpcpc
fea1457f12 fix: annotate pod instead of label on UnregistrationFailureMessage (#1615)
The error message should go to annotation instead of pod label, since we only check annotations on autoscale:

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/autoscaling.go#L338-L342

Then there is no need to truncate or normalize the error message.
2022-07-09 11:45:05 +09:00
Yusuke Kuoka
bfc5ea4727 Fix a regression in webhook-based autoscaler (#1596)
The regression resulted in the webhook-based autoscaler be unable to find visible runner groups and therefore unable to scale up and down the target RunnerDeployment/RunnerSet at all when the webhook-based autoscaler was provided GitHub API credentials to enable the runner groups support. This fixes that.

The regression was introduced via #1578 which is not released yet. Users of existing ARC releases are therefore not affected.
2022-07-04 20:17:09 +09:00
Yusuke Kuoka
a517c1ff66 Fix old runner pods stuck in Terminating since #1579 (#1585)
Ref #1579
2022-06-29 22:02:42 +09:00
Yusuke Kuoka
8161136cbd Fix PercentageRunnersBusy scaling delay (#1579)
* Use a dedicated pod label to say it is a runner pod

Follow-up for #1546

* Fix PercentageRunnersBusy scaling delay

Ref #1374
2022-06-29 20:49:21 +09:00
Yusuke Kuoka
ddd417f756 Bump go-github to v45 (#1573)
* Bump go-github to v45

Ref #1402

* fixup! Bump go-github to v45
2022-06-29 06:34:58 +09:00
Thomas Boop
0386c0734c containerMode option to allow running jobs in k8's instead of docker (#1546)
* added containerMode=kubernetes env variables to the runner

* removed unused logging

* restored configs and charts

* restored makefile cert version and acceptance/run

* added workVolumeClaimTemplate in pod definition, including logic

* added claim template name based on the runner

* Apply suggestions from code review

update errors

* added concurrent cleanup before runner pod is deleted

* update manifests

* added retry after 30s if pod cleanup contains err

* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s

* style changes and added comments

* added izZero timestamp check for deleting runner-linked pods

* changed order of local variable to avoid copy if p is deleted

* removed docker from container mode k8s

* restored charts, config, makefile

* restored forked files back and not the ARC ones

* created PersistentVolume on containerMode k8s

* create pv only if storage class name is local-storage

* removed actions if storage class name is local-storage

* added service account validation if container mode kubernetes

* changed the coding style to match rest of the ARC

* added validation to the runnerdeployment webhook

* specified fields more precisely, added webhook validation to the replicaset as well

* remake manifests

* wraped delete runner-linked-pods in kube mode

* fixed empty line

* fixed import

* makefile changes for hooks

* added cleanup secrets

* create manifests

* docs

* update access modes

* update dockerfile

* nit changes

* fixed dockerfile

* rewrite allowing reuse for runners and runnersets

* deepcopy forgot to stage

* changed privileged

* make manifests

* partly moved to finalizer, still need to apply finalizer first

* finalizer added if env variable used in container mode exists

* bump runner version

* error message moved from Error to Info on cleanup pods/secrets

* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate

* Apply suggestions from code review

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* add hook version to cli, update to 0.1.2

* Apply suggestions from code review

* Update controllers/utils_test.go

* Update runner/Makefile

* Fix missing secret permission and the error handling

* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry

Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-28 14:12:40 +09:00
Yusuke Kuoka
af96de6184 Fix completed runner pod recreation not to be blocked after max out (#1568)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1477#issuecomment-1164154496
2022-06-28 13:50:07 +09:00
Sam Weston
bc7a3cab1b Add priorityClassName to CRDs (#1513)
* Add pod priorityClassName to controller and crds

* Add missing bits in bases directory

* Regenerate crds
2022-06-28 08:45:19 +09:00
Yusuke Kuoka
e2c8163b8c Make webhook-based scale race-free (#1477)
* Make webhook-based scale operation asynchronous

This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both ended up
scaling up the same horizontal runner autoscaler.

Ref #1321

* Fix typos

* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers

* Batch capacity reservation updates for efficient use of apiserver

* Fix potential never-ending HRA update conflicts in batch update

* Extract batchScaler out of webhook-based autoscaler for testability

* Fix log levels and batch scaler hang on start

* Correlate webhook event with scale trigger amount in logs

* Fix log message
2022-06-27 18:31:48 +09:00
Yusuke Kuoka
0ef9a22cd4 Fix confusing PV controller log (#1526)
Ref #1511
2022-06-14 08:35:04 +09:00
Yusuke Kuoka
9dd26168d6 Fix confusing logs from pv and pvc controllers (#1487)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212
2022-05-26 18:21:23 +09:00
Yusuke Kuoka
c7eea169ad test: Add test for runner with generic ephemeral volume as "work" (#1472)
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.
2022-05-25 10:37:23 +09:00
Yusuke Kuoka
63be0223ad fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" (#1471)
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"

While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error.

This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears.

Ref #1464

* e2e: Ensure work generic ephemeral volume to work as expected
2022-05-22 10:25:50 +09:00
Yusuke Kuoka
3e988afc09 test: add fuzzing to the test suite (#1463)
As a part of #1298, we add fuzzing based on Go test's fuzzing support to the test suite
2022-05-19 13:34:23 +01:00
Felipe Galindo Sanchez
78c01fd31d cleanup some unused code and minor refactors (#1274)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 18:38:32 +09:00
Callum Tait
65f7ee92a6 refactor: remove registration runner dead code (#1260)
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 #1207

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 11:23:39 +09:00
Yusuke Kuoka
b5194fd75a Enhance RunnerSet to optionally retain PVs accross restarts (#1340)
* Enhance RunnerSet to optionally retain PVs accross restarts

This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass.

Ref #1286

* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs

Ref #1286
2022-05-16 09:26:48 +09:00
Yusuke Kuoka
759349de11 fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395)
Ref #1369
2022-05-14 09:07:17 +01:00
Yusuke Kuoka
e46b90f758 fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420)
This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".

Ref #1369
2022-05-12 09:34:27 +01:00
Yusuke Kuoka
3a7e8c844b feat: Support arbitrarily setting privileged: true for runner container (#1383)
Resolves #1282
2022-05-12 09:25:51 +01:00
Yusuke Kuoka
dabbc99c78 refactor(controller): stop auto-setting RUNNER_FEATURE_FLAG_EPHEMERAL (#1385)
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.

Ref #1196
2022-05-11 11:42:55 +01:00
Yusuke Kuoka
4053ab3e11 Fix label support for TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1390)
In #1373 we made two mistakes:

- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.

The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.

Ref #1056
Ref #1373
2022-04-27 16:24:21 +01:00
Yusuke Kuoka
1c726ae20c chore: Add unit test for RunnerReconciler.newPod (#1382)
Adds some unit tests for the runner pod generation logic that is used internally by runner controller as preparation for #1282
2022-04-25 09:59:21 +09:00
Yusuke Kuoka
d6cdd5964c chore: Add unit test for newRunnerPod (#1381)
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282
2022-04-25 08:52:58 +09:00
Yusuke Kuoka
a622968ff2 feat: Add label support to TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1373)
This is an implementation for my intepretation of the "bronze" case proposed in #1056

Ref #1056
2022-04-24 14:41:34 +09:00
Yusuke Kuoka
1551f3b5fc Remove the default githubEvent: {} requiring a event to be defined (#1379)
Ref #1358
2022-04-24 13:37:26 +09:00
Yusuke Kuoka
3ba7179995 Do not enable TotalNumberOfQueuedAndInProgressWorkflowRuns by default (#1372)
Previously, omitting hra.spec.metrics at all resulted in ARC enabling the TotalNumberOfQueuedAndInProgressWorkflowRuns.
That turned out not a good idea so since this change it is not enabled by default.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/728
2022-04-24 13:36:42 +09:00
Callum Tait
24aae58dbc feat: default scale down flag (#963)
Resolves #899

Co-authored-by: Callum <callum@domain.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-04-20 11:09:09 +09:00
Jeff Billimek
13bfa2da4e Fix runner pod dnsConfig (#1227)
Fixes #1226
Fixes #1224

Signed-off-by: Jeff Billimek <jeff@billimek.com>
2022-04-20 10:55:20 +09:00
Patrick Ellis
7a5a6381c3 Add WorkflowJob to GitHubEventScaleUpTriggerSpec types (#922) 2022-04-20 09:59:08 +09:00
Siyuan Zhang
a37b4dfbe3 Fix scale down condition to exclude skipped (#1330)
* Fix scale down condition to exclude skipped
* Use fallthrough and break to let default handle the skipped case

Fixes #1326
2022-04-13 08:53:07 +09:00
Yusuke Kuoka
b09c54045a Prevent runners from stuck in Terminating when pod disappeared without standard termination process (#1318)
This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.

Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.

Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307
Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273
2022-04-08 10:17:33 +09:00
Felipe Galindo Sanchez
e7e48a77e4 Merge remote-tracking branch 'upstream/master' into patch-4 2022-04-04 09:54:08 -07:00
Yusuke Kuoka
631a70a35f Fix runner pod to be cleaned up earlier regardless of the sync period (#1299)
Ref #1291
2022-04-03 11:12:44 +09:00
Rolf Ahrenberg
1b327a0721 refactor: use const envvars (#1251) 2022-03-27 12:14:56 +01:00
Endre Karlson
ee7484ac91 Use container name to detect runner container in Pod 2022-03-23 12:39:58 +01:00
Felipe Galindo Sanchez
9657d3e5b3 Fix deleting a runner when pod was deleted
With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource
2022-03-22 14:44:50 -07:00
Yusuke Kuoka
4551309e30 Fix runners to not terminate before unregistration when scaling down
#1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.
2022-03-13 13:09:46 +00:00
Yusuke Kuoka
7123b18a47 chore: Log more variables when log level is -2 2022-03-13 13:04:28 +00:00
Yusuke Kuoka
cc55d0bd7d Let runnerdeployment controller log runnerreplicaset creation 2022-03-13 12:25:53 +00:00
Yusuke Kuoka
c612e87d85 fix: Let RunnerDeployment scale RunnerReplicaSet to zero before terminating it
so that hopefully RunnerDeployment can gracefully termiante older RunnerReplicaSet on update.
2022-03-13 12:18:22 +00:00
Yusuke Kuoka
326d6a1fe8 Fix the timing of Marking owner for unregistration completion log 2022-03-13 12:16:55 +00:00
Yusuke Kuoka
fa8ff70aa2 Add log when deletion timestamp is being set on owner object 2022-03-13 12:16:29 +00:00