Commit Graph

101 Commits

Yusuke Kuoka
4fa5315311 Fix possible flapping autoscale on runner update (#371)
Addresses https://github.com/summerwind/actions-runner-controller/pull/355#discussion_r587199428
2021-03-05 10:21:20 +09:00
Hiroshi Muraoka
11e58fcc41 Manage runner with label (#355)
* Update RunnerDeploymentSpec to have Selector field

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Update RunnerReplicaSetSpec to have Selector field

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Add CloneSelectorAndAddLabel to add Selector field

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Fix tests

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Use label to find RunnerReplicaSet/Runner (see the sketch after this list)

Signed-off-by: binoue <banji-inoue@cybozu.co.jp>

* Update controller-gen versions in CRD

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Update autoscaler to list Pods with labels

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Add debug log

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Modify RunnerDeployment tests

Signed-off-by: binoue <banji-inoue@cybozu.co.jp>

* Modify RunnerReplicaset test

Signed-off-by: binoue <banji-inoue@cybozu.co.jp>

* Modify integration test

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Use RunnerDeployment Template Labels as the default selector for backward compatibility

* Fix labeling

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Update func in Eventually to return (int, error)

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Update RunnerDeployment controller not to use label selector

Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com>

* Fix potential replicaset controller breakage on replicaset created before v0.17.0

* Fix errors on existing runner replica sets

* Ensure RunnerReplicaSet Spec Selector addition does not break controller

* Ensure RunnerDeployment Template.Spec.Labels change does result in template hash change

* Fix comment
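
For illustration, a minimal sketch of the label-based lookup this PR moves to, using the standard controller-runtime client; the namespace and selector labels are placeholders rather than the exact keys used by the controllers:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Illustrative sketch: list runner pods by label selector instead of walking
// owner references. The selector labels are placeholders, not the exact keys
// introduced by this PR.
func listRunnerPods(ctx context.Context, c client.Client, namespace string, selector map[string]string) (*corev1.PodList, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels(selector), // e.g. the RunnerReplicaSet's template labels
	); err != nil {
		return nil, err
	}
	return &pods, nil
}
```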

Co-authored-by: binoue <banji-inoue@cybozu.co.jp>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-03-05 10:15:39 +09:00
Yusuke Kuoka
584590e97c Use patch instead of update to alleviate HRA conflict on webhook (#358)
We sometimes see the integration test fail because runner replicas do not reach the expected number in a timely manner. After investigating a bit, this turned out to be caused by HRA updates from the webhook-based autoscaler and the HRA controller conflicting with each other. This changes the controllers to use Patch instead of Update to make such conflicts less likely.

I have also updated the HRA controller to use Patch when updating RunnerDeployment, too.

Overall, these changes should make webhook-based autoscaling more reliable due to fewer conflicts.
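
A rough sketch of the Update-to-Patch switch, assuming the project's api/v1alpha1 types and a controller-runtime client; the helper name is hypothetical:

```go
import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Illustrative sketch: send a merge patch containing only the diff instead of a
// full Update, so concurrent writers (the HRA controller and the webhook-based
// autoscaler) are less likely to hit "the object has been modified" conflicts.
func patchCapacityReservations(ctx context.Context, c client.Client, hra *v1alpha1.HorizontalRunnerAutoscaler, reservations []v1alpha1.CapacityReservation) error {
	before := hra.DeepCopy() // snapshot of the object as read from the API server
	hra.Spec.CapacityReservations = reservations
	return c.Patch(ctx, hra, client.MergeFrom(before))
}
```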
2021-02-26 10:17:09 +09:00
Yusuke Kuoka
d18884a0b9 Fix HRA expired cache entries not cleaned up (#357)
Fixes #356
2021-02-26 09:54:24 +09:00
Yusuke Kuoka
e9eef04993 Fix old HRA capacity reservations not cleaned up (#354)
Similar to #348 for #346, but for HRA.Spec.CapacityReservations, which is usually modified by the webhook-based autoscaler controller.
This patch tries to fix that by improving the webhook-based autoscaler controller to omit expired reservations when updating the HRA spec.
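
A sketch of the "omit expired reservations" step, assuming the HRA's CapacityReservation carries an ExpirationTime as in the project's API; the helper name is hypothetical:

```go
import (
	"time"

	v1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Illustrative sketch: keep only reservations that have not yet expired before
// writing the capacity reservations back to the HRA spec.
func activeReservations(reservations []v1alpha1.CapacityReservation, now time.Time) []v1alpha1.CapacityReservation {
	var active []v1alpha1.CapacityReservation
	for _, r := range reservations {
		if r.ExpirationTime.After(now) { // drop anything already expired
			active = append(active, r)
		}
	}
	return active
}
```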
2021-02-25 11:08:00 +09:00
Yusuke Kuoka
598dd1d9fe Fix incorrect DESIRED on `kubectl get hra` (#353)
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However, the value had not been taking capacityReservations into account, which resulted in an incorrect count when you used the webhook-based autoscaler or the capacityReservations API directly. This fixes that.
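
A sketch of the corrected DESIRED computation described here - the suggested replica count plus whatever the capacity reservations add; the helper name is hypothetical:

```go
import v1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"

// Illustrative sketch: DESIRED should reflect both the computed replica count and
// the replicas added via capacity reservations (e.g. by the webhook-based autoscaler).
func desiredReplicas(suggested int, reservations []v1alpha1.CapacityReservation) int {
	desired := suggested
	for _, r := range reservations {
		desired += r.Replicas
	}
	return desired
}
```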
2021-02-25 10:32:09 +09:00
Yusuke Kuoka
9890a90e69 Improve webhook-based autoscaler log (#352)
The controller had been writing confusing messages like the one below when the scale target was missing:

```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```

This fixes that, while also improving many of the messages written during reconciliation, so that the errors are more actionable.
2021-02-25 10:07:41 +09:00
Yusuke Kuoka
9da123ae5e Fix integration test flakiness (#351)
Ref https://github.com/summerwind/actions-runner-controller/pull/345#issuecomment-785015406
2021-02-25 09:30:32 +09:00
Yusuke Kuoka
022007078e Compact excessive error message on runnerreplicaset status update conflict (#350)
We occasionally see logs like the below:

```
2021-02-24T02:48:26.769Z ERROR Failed to update runner status {"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769Z ERROR controller-runtime.controller Reconciler error {"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```

which can be compacted into a one-liner, without the useless stack trace and without double-logging the same error from both the logger and the controller.
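
A sketch of the compacted handling, assuming the project's v1alpha1 types, controller-runtime, and logr; the helper name is hypothetical:

```go
import (
	"context"

	"github.com/go-logr/logr"
	kerrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Illustrative sketch: a status-update conflict is expected and retryable, so
// log one short line and requeue instead of returning the error, which would
// make controller-runtime log the same error again with a stack trace.
func updateStatusCompactly(ctx context.Context, r client.StatusClient, log logr.Logger, rs *v1alpha1.RunnerReplicaSet) (ctrl.Result, error) {
	if err := r.Status().Update(ctx, rs); err != nil {
		if kerrors.IsConflict(err) {
			log.Info("Retrying status update due to conflict")
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```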
2021-02-25 09:01:02 +09:00
Johannes Nicolai
31e5e61155 Log correct runner that was deleted (#349) 2021-02-25 08:38:55 +09:00
Yusuke Kuoka
e44e53b88e Fix failure while saving HRA status after running controller for a while (#348)
Fixes #346
2021-02-24 11:20:21 +09:00
Yusuke Kuoka
991535e567 Fix panic on webhook for user-owned repository (#344)
* Fix panic on webhook for user-owned repository

Follow-up for #282 and #334
2021-02-23 08:05:25 +09:00
Johannes Nicolai
2d7fbbfb68 Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it will go into an infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding runner object exists in GitHub, and the pod will get removed after the registration timeout
* if the GitHub runner object already existed before - e.g. because a finalizer was not properly run as part of a partial Kubernetes crash - the runner will stay in running mode forever, and even updating the registration token will not kill the problematic pod
* introducing a RunnerOffline error that can be handled in the runner controller and the replicaset controller (see the sketch below)
* as runners are offline when a pod is completed and marked for restart, only do the additional restart checks if no restart was already decided, making the code a bit cleaner and saving GitHub API calls after each job completion
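
A sketch of what such a RunnerOffline error could look like; the exact type name and fields in the codebase may differ:

```go
import (
	"errors"
	"fmt"
)

// Illustrative sketch of the "RunnerOffline" error described above.
type RunnerOfflineError struct {
	RunnerName string
}

func (e *RunnerOfflineError) Error() string {
	return fmt.Sprintf("runner %q is offline", e.RunnerName)
}

// Callers (the runner and replicaset controllers) can branch on it:
func isRunnerOffline(err error) bool {
	var offline *RunnerOfflineError
	return errors.As(err, &offline)
}
```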
2021-02-22 10:08:04 +09:00
Hidetake Iwata
b0e74bebab Fix index key to find HRA in GitHub webhook handler 2021-02-20 21:25:23 +09:00
Hidetake Iwata
dfbe53dcca Fix webhook payload in integration test 2021-02-20 21:08:23 +09:00
Yusuke Kuoka
ebc3970b84 Add integration test for autoscaling on check_run webhook event 2021-02-19 10:33:04 +09:00
Hidetake Iwata
1ddcf6946a Fix nil pointer error on received check_run event (#331)
* Reproduce nil pointer error on received check_run event

* Fix nil pointer error on received check_run event
2021-02-18 20:22:36 +09:00
Yusuke Kuoka
67f6de010b feat: Common runner labels configurable per controller (#327)
* feat: Common runner labels configurable per controller

Ref #321
2021-02-18 20:19:08 +09:00
Yusuke Kuoka
2fdf35ac9d Refactor integration test to use helpers (#320)
This should make the test code a bit more DRY and readable.
2021-02-17 10:23:35 +09:00
Yusuke Kuoka
eb2eaf8130 Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining completed jobs (#316)
I have heard from some users that they have hundreds of thousands of `status=completed` workflow runs in their repository, which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because the excessive paginated requests exhausted the GitHub API rate limit.

This fixes that by splitting the list-workflow-runs call in two - one for `queued` and one for `in_progress` - which raises the minimum number of API calls from 1 to 2 but lets it work regardless of the number of remaining `completed` workflow runs.
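
A sketch of the split into two status-filtered calls, using go-github (the module version path is assumed; pagination is omitted for brevity):

```go
import (
	"context"

	"github.com/google/go-github/v33/github"
)

// Illustrative sketch: query queued and in_progress runs separately, so the
// result no longer depends on how many completed runs the repository has accumulated.
func listActiveRuns(ctx context.Context, gh *github.Client, owner, repo string) ([]*github.WorkflowRun, error) {
	var all []*github.WorkflowRun
	for _, status := range []string{"queued", "in_progress"} {
		opts := &github.ListWorkflowRunsOptions{
			Status:      status,
			ListOptions: github.ListOptions{PerPage: 100},
		}
		runs, _, err := gh.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runs.WorkflowRuns...)
	}
	return all, nil
}
```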
2021-02-16 18:55:55 +09:00
Yusuke Kuoka
7d024a6c05 Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration tests. It turned out to be caused by https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is fixed by this change.
2021-02-16 18:51:33 +09:00
Yusuke Kuoka
434823bcb3 scale{Up,Down}Adjustment to add/remove constant number of replicas on scaling (#315)
* `scale{Up,Down}Adjustment` to add/remove a constant number of replicas on scaling (see the sketch below)

Ref #305

* Bump chart version
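
A sketch of the adjustment-vs-factor behaviour this adds; the field names follow the PR title and should be treated as illustrative:

```go
import "math"

// Illustrative sketch: with scaleUpAdjustment set, add a constant number of
// replicas; otherwise fall back to multiplying by the scale-up factor
// (the pre-existing behaviour).
func nextReplicasOnScaleUp(current, scaleUpAdjustment int, scaleUpFactor float64) int {
	if scaleUpAdjustment > 0 {
		return current + scaleUpAdjustment
	}
	return int(math.Ceil(float64(current) * scaleUpFactor))
}
```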
2021-02-16 17:16:26 +09:00
Yusuke Kuoka
f1db6af1c5 Add repository runners support for PercentageRunnersBusy-based autoscaling (#313)
Resolves #258
2021-02-16 12:44:51 +09:00
Johannes Nicolai
2623140c9a Make log message less scary (#311)
* the reconciliation loop is often much faster than the runner startup, so change the runner-not-found messages to debug level and also mention the possibility that the runner just needs more time
2021-02-16 09:55:55 +09:00
Johannes Nicolai
bc8bc70f69 Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct to return true, which never happened
* switched to a type check instead of an exact value check
* notRegistered was using a double negation in an if statement, which led to unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
Johannes Nicolai
9c8d7305f1 Introduce pod deletion timeout and forcefully delete stuck pods (#307)
* if a k8s node becomes unresponsive, the kube controller will soft-delete all its pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave its last status, and hence the runner controller will assume that everything is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks that none of its runners are currently busy and will not schedule any further runners / pods, resulting in a broken runner deployment until the runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner controller will try to reboot the runner and create a pod on a working node
* use forceful deletion and requeue for pods that have been stuck for more than one minute in the terminating state (see the sketch below)
* gracefully handle race conditions if the pod does get forcefully deleted within that window
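
A sketch of the forceful-deletion step, assuming a controller-runtime client; the helper name and exact timeout plumbing are illustrative:

```go
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Illustrative sketch: if a pod has been Terminating for longer than the deletion
// timeout, force-delete it so a replacement can be scheduled on a healthy node.
func forceDeleteIfStuck(ctx context.Context, c client.Client, pod *corev1.Pod, timeout time.Duration) error {
	if pod.DeletionTimestamp == nil {
		return nil // not being deleted at all
	}
	if time.Since(pod.DeletionTimestamp.Time) < timeout {
		return nil // still within the grace window; requeue and check again later
	}
	// A zero grace period removes the pod object even if its node is unresponsive.
	return c.Delete(ctx, pod, client.GracePeriodSeconds(0))
}
```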
2021-02-15 09:32:28 +09:00
Yusuke Kuoka
addcbfa7ee Fix runner registration timeout (#301)
Fixes #300
2021-02-12 10:00:20 +09:00
Yusuke Kuoka
bbb036e732 feat: Prevent blocking on transient runner registration failure (#297)
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself with GitHub within 10 minutes (currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, temporary compute resource shortage, etc.).

Formerly you had to manually detect and delete such pods, or even force-delete the corresponding runners, to unblock the controller.

With this enhancement, the controller deletes the pod automatically 10 minutes after its creation, which results in the controller creating another pod that might work.

Ref #288
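
A sketch of the registration-timeout check, assuming a controller-runtime client; the helper name is hypothetical, and how "registered" is determined (a lookup against the GitHub API) is elided:

```go
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const registrationTimeout = 10 * time.Minute // currently hard-coded, per the description above

// Illustrative sketch: if the runner has not registered itself with GitHub within
// the timeout, delete its pod so the controller creates a fresh one that might work.
func maybeRecreateUnregistered(ctx context.Context, c client.Client, pod *corev1.Pod, registered bool) error {
	if registered || time.Since(pod.CreationTimestamp.Time) < registrationTimeout {
		return nil // registered, or still within the grace window
	}
	return c.Delete(ctx, pod)
}
```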
2021-02-09 10:17:52 +09:00
Yusuke Kuoka
9301409aec fix: Paginate ListRepositoryWorkflowRuns (#295)
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it fetched and considered only the first 30 workflow runs at reconciliation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
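
A sketch of the pagination loop, using go-github (the module version path is assumed):

```go
import (
	"context"

	"github.com/google/go-github/v33/github"
)

// Illustrative sketch: follow resp.NextPage so more than the first page (30 runs
// by default) is taken into account when computing the desired replica count.
func listAllRuns(ctx context.Context, gh *github.Client, owner, repo string, opts *github.ListWorkflowRunsOptions) ([]*github.WorkflowRun, error) {
	var all []*github.WorkflowRun
	for {
		runs, resp, err := gh.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, opts)
		if err != nil {
			return nil, err
		}
		all = append(all, runs.WorkflowRuns...)
		if resp.NextPage == 0 {
			break // no more pages
		}
		opts.Page = resp.NextPage
	}
	return all, nil
}
```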
2021-02-09 10:13:53 +09:00
Yusuke Kuoka
ab1c39de57 feat: HorizontalRunnerAutoscaler Webhook server (#282)
* feat: HorizontalRunnerAutoscaler Webhook server

This introduces a webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period to add the missing runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. It tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event, and if it finds exactly one HRA, that is the scale target. If none, or two or more, targets are found among repository-wide runners, it does the same for organizational runners.
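
A sketch of the "exactly one match" rule described above; the real matching walks ScaleUpTriggers and the Repository/Organization fields, which is elided here:

```go
import (
	"errors"

	v1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// Illustrative sketch: only an unambiguous single match is used as the scale target.
func pickScaleTarget(matched []*v1alpha1.HorizontalRunnerAutoscaler) (*v1alpha1.HorizontalRunnerAutoscaler, error) {
	switch len(matched) {
	case 1:
		return matched[0], nil
	case 0:
		return nil, errors.New("no scale target found for this event")
	default:
		return nil, errors.New("found too many scale targets: it must be exactly one to avoid ambiguity")
	}
}
```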

Changes:

* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
2021-02-07 17:37:27 +09:00
Jesse Haka
28e80a2d28 Add support for enterprise runners (#290)
* Add support for enterprise runners

* update docs
2021-02-05 09:31:06 +09:00
Jonas Lergell
6c64ae6a01 Actually use 'dockerdContainerResources' to set resources on the dind container (#273) 2021-01-29 09:18:28 +09:00
Yusuke Kuoka
ace95d72ab Fix self-update failures due to /runner/externals mount (#253)
* Fix self-update failures due to /runner/externals mount

Fixes #252

* Tested Self-update Fixes (#269)

Adding fixes to #253 as confirmed and tested in https://github.com/summerwind/actions-runner-controller/issues/264#issuecomment-764549833 by @jolestar, @achedeuzot and @hfuss 🙇 🍻

Co-authored-by: Hayden Fuss <wifu1234@gmail.com>
2021-01-24 10:58:35 +09:00
Johannes Nicolai
94e8c6ffbf minReplicas <= desiredReplicas <= maxReplicas (#267)
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what (see the sketch below)
* before this change, if the number of runners was much larger than the maximum, the applied scale-down factor might still result in a desired value > maxReplicas
* if, due to resource constraints in the cluster, runners were permanently restarted, the number of runners could grow by more than the inverse scale-down factor until the next reconciliation round, resulting in a situation where the number of runners climbs even though it should actually go down
* by checking that desiredReplicas is always <= maxReplicas, infinite scale-up loops can be prevented
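
A sketch of the invariant being enforced:

```go
// Illustrative sketch: clamp the computed value so minReplicas <= desiredReplicas <= maxReplicas.
func clampReplicas(desired, min, max int) int {
	if desired < min {
		return min
	}
	if desired > max {
		return max
	}
	return desired
}
```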
2021-01-22 10:11:21 +09:00
ZacharyBenamram
48923fec56 Autoscaling: Percentage runners busy - remove magic number used for round up (#235)
* remove magic number for autoscaling

Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-12-15 14:38:01 +09:00
ZacharyBenamram
466b30728d Add "PercentageRunnersBusy" horizontal runner autoscaler metric type (#223)
* hpa scheme based off busy runners

* running make manifests

Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-12-13 08:48:19 +09:00
Yusuke Kuoka
dfffd3fb62 feat: EKS IAM Roles for Service Accounts for Runner Pods (#226)
One of the pod recreation conditions has been modified to use a hash of the runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS, which requires its admission webhook to mutate the runner pod with additional IRSA-related volumes, volume mounts, and env.

Resolves #200
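
A sketch of the spec-hash idea: hash only the desired runner spec, record it on the pod, and recreate the pod only when that hash changes, so webhook mutations to other pod fields no longer trigger restarts. The hashing scheme below is illustrative, not necessarily the one used by the controller:

```go
import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// Illustrative sketch: a stable hash of the desired runner spec.
func runnerSpecHash(spec interface{}) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(b)
	return fmt.Sprintf("%x", h.Sum32()), nil
}
```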
2020-12-08 17:56:06 +09:00
Juho Saarinen
f710a54110 Don't compare runner connection token in the restart-need check (#227)
Fixes #143
2020-12-08 08:48:35 +09:00
Erik Nobel
a2b335ad6a Github pkg: Bump github package to version 33 (#222) 2020-12-06 10:01:47 +09:00
Shinnosuke Sawada
be25715e1e Use TLS for secure docker connection (#192) 2020-11-30 08:57:33 +09:00
Reinier Timmer
ee8fb5a388 parametrized working directory (#185)
* parametrized working directory

* manifests v3.0
2020-11-25 08:55:26 +09:00
Erik Nobel
4e93879b8f [BUG?]: Create mountpoint for /externals/ (#203)
* runner/controller: Add externals directory mount point

* Runner: Create hack for moving content of /runner/externals/ dir

* Externals dir Mount: mount examples for '__e/node12/bin/node' not found error
2020-11-25 08:53:47 +09:00
Shinnosuke Sawada
4371de9733 add dockerEnabled option (#191)
Add a dockerEnabled option for users who do not need Docker and do not want to run a privileged container.
If `dockerEnabled == false`, the dind container does not run and there is no privileged container (see the sketch below).

Do the same as closed #96
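
A sketch of the behaviour described above - skip the privileged dind sidecar entirely when dockerEnabled is false (image and container names are placeholders):

```go
import corev1 "k8s.io/api/core/v1"

// Illustrative sketch: only append the privileged dind sidecar when Docker is enabled.
func runnerContainers(dockerEnabled bool) []corev1.Container {
	privileged := true
	containers := []corev1.Container{
		{Name: "runner", Image: "summerwind/actions-runner:latest"},
	}
	if dockerEnabled {
		containers = append(containers, corev1.Container{
			Name:            "docker",
			Image:           "docker:dind",
			SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
		})
	}
	return containers
}
```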
2020-11-16 09:41:12 +09:00
Shinnosuke Sawada
a4061d0625 gofmt ed 2020-11-12 09:20:06 +09:00
Shinnosuke Sawada
83857ba7e0 use tcp DOCKER_HOST instead of sharing docker.sock 2020-11-12 08:07:52 +09:00
Yusuke Kuoka
e613219a89 Fix token registration broken since v0.11.0 (#167)
Fixes #166
2020-11-11 09:38:05 +09:00
Dan Webb
dcf8524b5c Adds RUNNER_GROUP argument to the runner registration (#157)
* Adds RUNNER_GROUP argument to the runner registration

Adds the ability to register a runner to a predefined runner_group

Resolves #137

* Update README with runner group example

- Updates the README with instructions on how to add the runner to a group
- Fix code fencing for shell and yaml blocks in the README
- Use consistent bullet points (dash not asterisk)
2020-11-10 17:15:54 +09:00
Juho Saarinen
f2a2ab7ede Check token validity only when creating new pod (#159)
Fixes #143
2020-11-10 17:02:30 +09:00
Juho Saarinen
40c5050978 Added support for other than public GitHub URL (#146)
Refactoring a bit
2020-10-28 22:15:53 +09:00
Yusuke Kuoka
faaca10fba Rename Runner.Spec.dockerWithinRunnerContainer to docker"d"WithinRunnerContainer (#134)
* Rename Runner.Spec.dockerWithinRunnerContainer to dockerdWithinRunnerContainer

Ref https://github.com/summerwind/actions-runner-controller/pull/126#issuecomment-712501790
2020-10-21 21:32:40 +09:00