Commit Graph

305 Commits

Author SHA1 Message Date
Hidetake Iwata
56b4598d1d Fix helm template error when webhook server is enabled (#365)
* Fix include block in githubwebhook.deployment.yaml

* Fix include block in githubwebhook.secrets.yaml
2021-03-03 09:21:58 +09:00
Taehyun Kim
8f977dbe48 Fix various bugs in helm chart (#364)
* Fix wrong trim

* add missing MutatingWebhookConfiguration.webhooks[*].sideEffects

* fix missing admissionReviewVersions

* admissionregistration.k8s.io/v1 for kustomization manifests

* revert webhook config
actions-runner-controller-0.6.1
2021-03-03 09:21:20 +09:00
Yusuke Kuoka
9ae3551744 Remove unnecessary GitHub API calls (#363)
The controller made 2 extra, redundant calls to the List Workflow Runs API.

Ref #362
2021-03-02 10:55:30 +09:00
Rolf Ahrenberg
05ad3f5469 Set default python (#361) 2021-03-01 09:45:13 +09:00
callum-tait-pbx
9c7372a8e0 docs: styling fixes (#359)
* docs: styling fixes

* docs: grammar fixes
2021-03-01 09:44:35 +09:00
Yusuke Kuoka
584590e97c Use patch instead of update to alleviate HRA conflict on webhook (#358)
We sometimes see the integration test fail because runner replicas do not reach the expected number in a timely manner. After some investigation, this turned out to be caused by conflicting HRA updates from the webhook-based autoscaler and the HRA controller. This changes the controllers to use Patch instead of Update to make conflicts less likely.

I have also updated the HRA controller to use Patch when updating RunnerDeployment.

Overall, these changes should make webhook-based autoscaling more reliable due to fewer conflicts.
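
The conflict described in this message can be illustrated with a small model of optimistic concurrency. This is only a sketch and not the controller-runtime client: `store`, `update`, and `patchReservations` are hypothetical names, and the model assumes that an update sends the whole object together with the resource version read earlier, while a patch sends only the changed field.

```go
package main

import (
	"errors"
	"fmt"
)

// hra is a simplified, hypothetical stand-in for a HorizontalRunnerAutoscaler.
type hra struct {
	ResourceVersion int
	MinReplicas     int
	Reservations    int
}

var errConflict = errors.New("the object has been modified")

// store is a toy API server holding a single object.
type store struct{ obj hra }

func (s *store) get() hra { return s.obj }

// update replaces the whole object and fails if the caller's copy is stale.
func (s *store) update(o hra) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// patchReservations changes one field only, so it does not depend on the
// resource version the caller read earlier.
func (s *store) patchReservations(n int) {
	s.obj.Reservations = n
	s.obj.ResourceVersion++
}

func main() {
	s := &store{obj: hra{ResourceVersion: 1, MinReplicas: 1}}

	// Two writers read the same version of the object.
	webhookCopy := s.get()
	controllerCopy := s.get()

	controllerCopy.MinReplicas = 2
	fmt.Println("controller update:", s.update(controllerCopy))

	// The webhook's copy is now stale, so a full update is rejected.
	webhookCopy.Reservations = 3
	fmt.Println("webhook update:", s.update(webhookCopy))

	// A patch of the single field succeeds and keeps the controller's change.
	s.patchReservations(3)
	fmt.Println("after patch:", s.get().MinReplicas, s.get().Reservations)
}
```

In this model the patch never conflicts because it carries no resource version. Whether a real patch behaves that way depends on how the patch is built, so treat the sketch as an explanation of the idea rather than of the actual change.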
2021-02-26 10:17:09 +09:00
Yusuke Kuoka
d18884a0b9 Fix HRA expired cache entries not cleaned up (#357)
Fixes #356
2021-02-26 09:54:24 +09:00
callum-tait-pbx
f987571b64 Improve docs (#303) 2021-02-26 09:32:18 +09:00
Taehyun Kim
450e384c4c Update helm chart (#343)
* add replicaCount

* Add authSecret.existingSecret

* set image.tag null by default

* implement ingress for githubwebhook server

* fix deprecated and secretName template

* backward compat .authSecret.enabled

* existingSecret for github webhook secret

* use secretName template

* set default secret names

* do not use app version based image tag

* create and name variable for secrets
actions-runner-controller-0.6.0
2021-02-26 09:26:51 +09:00
Yusuke Kuoka
e9eef04993 Fix old HRA capacity reservations not cleaned up (#354)
Similar to #348 for #346, but for HRA.Spec.CapacityReservations usually modified by the webhook-based autoscaler controller.
This patch tries to fix that by making the webhook-based autoscaler controller omit expired reservations when it updates the HRA spec.
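
Omitting expired reservations on write can be sketched as a simple filter. The `reservation` type, its field names, and `dropExpired` are assumptions made for this illustration; they are not taken from the actual HRA types.

```go
package main

import (
	"fmt"
	"time"
)

// reservation is a simplified, hypothetical capacity reservation entry.
type reservation struct {
	ExpirationTime time.Time
	Replicas       int
}

// dropExpired returns only the reservations that have not expired at now.
func dropExpired(rs []reservation, now time.Time) []reservation {
	kept := make([]reservation, 0, len(rs))
	for _, r := range rs {
		if r.ExpirationTime.After(now) {
			kept = append(kept, r)
		}
	}
	return kept
}

// sampleNow is a fixed reference time so the example is deterministic.
var sampleNow = time.Date(2021, 2, 25, 11, 0, 0, 0, time.UTC)

// sampleReservations returns one expired and one still valid reservation.
func sampleReservations() []reservation {
	return []reservation{
		{ExpirationTime: sampleNow.Add(-time.Minute), Replicas: 1},
		{ExpirationTime: sampleNow.Add(time.Minute), Replicas: 2},
	}
}

func main() {
	kept := dropExpired(sampleReservations(), sampleNow)
	fmt.Println(len(kept), kept[0].Replicas)
}
```

Running the filter every time the spec is written keeps the list from growing without bound, which is the behaviour this commit message describes.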
2021-02-25 11:08:00 +09:00
Yusuke Kuoka
598dd1d9fe Fix incorrect DESIRED on `kubectl get hra` (#353)
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However, the value had not been taking capacityReservations into account, which resulted in an incorrect count when you used the webhook-based autoscaler or the capacityReservations API directly. This fixes that.
2021-02-25 10:32:09 +09:00
Yusuke Kuoka
9890a90e69 Improve webhook-based autoscaler log (#352)
The controller had been writing confusing messages like the below on missing scale target:

```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity. {"scaleTargets": ""}
```

This fixes that, while improving many kinds of messages written during reconciliation, so that the error messages are more actionable.
2021-02-25 10:07:41 +09:00
Yusuke Kuoka
9da123ae5e Fix integration test flakiness (#351)
Ref https://github.com/summerwind/actions-runner-controller/pull/345#issuecomment-785015406
2021-02-25 09:30:32 +09:00
Johannes Nicolai
4d4137aa28 Avoid zombie runners that missed token expiration by a bit (#345)
* if a new runner pod was scheduled to start up right before a
registration token expired, it will not get a new registration token and goes into
an infinite update loop (until #341 kicks in)
* if registration tokens get updated a little before they actually
expire, pods that are just starting up are far more likely to get a working token
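
The second point can be sketched as a check that treats a token as expired slightly before its real expiry. The name `needsRefresh` and the ten-minute `refreshBuffer` are assumptions for illustration; the commit message does not state the margin that was used.

```go
package main

import (
	"fmt"
	"time"
)

// refreshBuffer is an assumed safety margin, not a value taken from the commit.
const refreshBuffer = 10 * time.Minute

// needsRefresh reports whether a token expiring at expiresAt should be
// replaced at time now. The last refreshBuffer of the token's lifetime is
// treated as already expired.
func needsRefresh(expiresAt, now time.Time) bool {
	return !now.Before(expiresAt.Add(-refreshBuffer))
}

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2021, 2, 25, 9, 0, 0, 0, time.UTC)

func main() {
	fmt.Println(needsRefresh(demoNow.Add(time.Hour), demoNow))     // false: plenty of lifetime left
	fmt.Println(needsRefresh(demoNow.Add(5*time.Minute), demoNow)) // true: inside the buffer
	fmt.Println(needsRefresh(demoNow.Add(-time.Minute), demoNow))  // true: already expired
}
```

With a margin like this, a pod that reads the token shortly before the nominal expiry still has time to register with it.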
2021-02-25 09:07:49 +09:00
Yusuke Kuoka
022007078e Compact excessive error message on runnerreplicaset status update conflict (#350)
We occasionally see logs like the below:

```
2021-02-24T02:48:26.769Z ERROR Failed to update runner status {"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769Z ERROR controller-runtime.controller Reconciler error {"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```

which can be compacted into a one-liner, without the useless stack trace and without double-logging the same error from the logger and the controller.
2021-02-25 09:01:02 +09:00
Johannes Nicolai
31e5e61155 Log correct runner that was deleted (#349) 2021-02-25 08:38:55 +09:00
Aditya Purandare
1d1453c5f2 Fix user used for dind runner group permissions (#337) 2021-02-24 19:06:52 +09:00
Yusuke Kuoka
e44e53b88e Fix failure while saving HRA status after running controller for a while (#348)
Fixes #346
2021-02-24 11:20:21 +09:00
Yusuke Kuoka
398791241e Fix runner release workflow to do docker-push (#347)
Apparently I mistakenly removed the `push` option from the workflow in #323, which resulted in the new runner build #323 not being pushed. This fixes that.
2021-02-24 11:08:33 +09:00
Yusuke Kuoka
991535e567 Fix panic on webhook for user-owned repository (#344)
* Fix panic on webhook for user-owned repository

Follow-up for #282 and #334
2021-02-23 08:05:25 +09:00
Johannes Nicolai
2d7fbbfb68 Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it will go into an
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding
runner object exists in GitHub and the pod will get removed after
registration timeout
* if the GitHub runner object already existed before - e.g. because a 
finalizer was not properly run as part of a partial Kubernetes crash, 
the runner will always stay in a running mode, even updating the 
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner 
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart, 
only do additional restart checks if no restart was already decided, 
making code a bit cleaner and saving GitHub API calls after each job 
completion
2021-02-22 10:08:04 +09:00
Yusuke Kuoka
dd0b9f3e95 Merge pull request #340 from int128/integration-test-check-run
Fix index key to find HRA in GitHub webhook handler
2021-02-22 09:49:54 +09:00
Yusuke Kuoka
7cb2bc84c8 Merge pull request #334 from summerwind/integration-test-check-run
Add integration test for autoscaling on check_run webhook event
2021-02-22 09:38:07 +09:00
Hidetake Iwata
b0e74bebab Fix index key to find HRA in GitHub webhook handler 2021-02-20 21:25:23 +09:00
Hidetake Iwata
dfbe53dcca Fix webhook payload in integration test 2021-02-20 21:08:23 +09:00
Yusuke Kuoka
ebc3970b84 Add integration test for autoscaling on check_run webhook event 2021-02-19 10:33:04 +09:00
Hidetake Iwata
1ddcf6946a Fix nil pointer error on received check_run event (#331)
* Reproduce nil pointer error on received check_run event

* Fix nil pointer error on received check_run event
2021-02-18 20:22:36 +09:00
Yusuke Kuoka
cfbaad38c8 Merge pull request #328 from int128/fix-port-name-length
Changes:

1. Fix length of github-webhook-server port name
2. Add a cluster role binding for github-webhook-server
3. Remove --enable-leader-election from github-webhook-server
actions-runner-controller-0.5.2
2021-02-18 20:20:39 +09:00
Yusuke Kuoka
67f6de010b feat: Common runner labels configurable per controller (#327)
* feat: Common runner labels configurable per controller

Ref #321
2021-02-18 20:19:08 +09:00
Hidetake Iwata
2db608879a Remove --enable-leader-election from github-webhook-server 2021-02-18 16:51:47 +09:00
Hidetake Iwata
2c4a6ca90b Add cluster role binding for github-webhook-server 2021-02-18 16:49:24 +09:00
Hidetake Iwata
829bf20449 Fix length of github-webhook-server port name 2021-02-18 16:42:15 +09:00
Reinier Timmer
be13322816 Update runner to 2.277.1 (#322)
* Update runner to 2.277.1

* Update build-and-release-runners.yml

* integration test condition

Don't run integration tests when only updating the runner image

* fixup! integration test condition

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-02-18 09:29:53 +09:00
Johannes Nicolai
7f4a76a39b Also log into DockerHub for release event (#326)
* so far, only push events would trigger the DockerHub login step
* hence, attempts to release would fail because of a permission problem (tested locally)
* adding OR condition to also login in case a release got published
2021-02-18 08:54:44 +09:00
callum-tait-pbx
0fce761686 fix: add truncate to ensure service kinds have valid names (#325)
* fix: adding truncate for service kinds

* chore : bumping chart version
actions-runner-controller-0.5.1
2021-02-18 08:43:48 +09:00
Yusuke Kuoka
c88ff44518 Fix wip.yml workflow for building controller canary tags (#323)
In #306 we seem to have accidentally updated the wrong workflow, which was for runner builds. This updates the one for the controller.

Resolves #302
2021-02-18 08:42:24 +09:00
Yusuke Kuoka
2fdf35ac9d Refactor integration test to use helpers (#320)
This should make the test code a bit more DRY and readable.
2021-02-17 10:23:35 +09:00
Johannes Nicolai
6cce3fefc5 Add project to awesome-runners list (#319) 2021-02-17 09:14:42 +09:00
Yusuke Kuoka
eb2eaf8130 Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining completed jobs (#316)
I have heard from a user that they have hundreds of thousands of `status=completed` workflow runs in their repository, which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because the excessive paginated requests hit the GitHub API rate limit.

This fixes that by splitting the list-workflow-runs call into two, one for `queued` and one for `in_progress`. This raises the minimum number of API calls from 1 to 2, but allows it to work regardless of the number of remaining `completed` workflow runs.
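
The idea of one filtered query per status can be sketched as follows. `listRuns`, `totalQueuedAndInProgress`, and `fakeLister` are hypothetical names used only for this illustration, and the fake lister stands in for a real client that would page through the GitHub API.

```go
package main

import "fmt"

// listRuns is a hypothetical function type returning the number of workflow
// runs with the given status. A real implementation would call the GitHub API.
type listRuns func(status string) (int, error)

// totalQueuedAndInProgress issues one filtered query per status of interest,
// so runs with status "completed" are never fetched.
func totalQueuedAndInProgress(list listRuns) (int, error) {
	total := 0
	for _, status := range []string{"queued", "in_progress"} {
		n, err := list(status)
		if err != nil {
			return 0, fmt.Errorf("listing %s runs: %w", status, err)
		}
		total += n
	}
	return total, nil
}

// fakeLister returns a listRuns backed by fixed counts and records every
// status it was asked for.
func fakeLister(counts map[string]int, calls *[]string) listRuns {
	return func(status string) (int, error) {
		*calls = append(*calls, status)
		return counts[status], nil
	}
}

func main() {
	var calls []string
	counts := map[string]int{"queued": 3, "in_progress": 2, "completed": 100000}
	total, err := totalQueuedAndInProgress(fakeLister(counts, &calls))
	fmt.Println(total, err, calls)
}
```

The number of completed runs has no effect on the result or on the number of calls, which is the property this commit message is after.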
2021-02-16 18:55:55 +09:00
callum-tait-pbx
7bf712d0d4 fix: duplicate name attribute (#318) 2021-02-16 18:52:08 +09:00
Yusuke Kuoka
7d024a6c05 Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration test. It turned out to be due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is fixed by this change.
2021-02-16 18:51:33 +09:00
Yusuke Kuoka
434823bcb3 scale{Up,Down}Adjustment to add/remove constant number of replicas on scaling (#315)
* `scale{Up,Down}Adjustment` to add/remove constant number of replicas on scaling

Ref #305

* Bump chart version
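
Adding or removing a constant number of replicas can be sketched as below. The function name `adjust` and the clamping to a minimum and maximum are assumptions for illustration; the commit message only states that a constant number of replicas is added or removed.

```go
package main

import "fmt"

// adjust returns the next replica count after applying delta, where a
// positive delta scales up and a negative delta scales down. Clamping the
// result to [minReplicas, maxReplicas] is an assumption of this sketch.
func adjust(current, delta, minReplicas, maxReplicas int) int {
	next := current + delta
	if next < minReplicas {
		return minReplicas
	}
	if next > maxReplicas {
		return maxReplicas
	}
	return next
}

func main() {
	fmt.Println(adjust(3, 2, 1, 10))  // scale up by a constant 2
	fmt.Println(adjust(9, 2, 1, 10))  // scale up, capped at the maximum
	fmt.Println(adjust(2, -3, 1, 10)) // scale down, floored at the minimum
}
```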
actions-runner-controller-0.5.0
2021-02-16 17:16:26 +09:00
Yusuke Kuoka
35d047db01 Fix enterprise runners misusing cached token (#314)
Follow-up for #290
2021-02-16 12:56:52 +09:00
Yusuke Kuoka
f1db6af1c5 Add repository runners support for PercentageRunnersBusy-based autoscaling (#313)
Resolves #258
2021-02-16 12:44:51 +09:00
Hidetake Iwata
4f3f2fb60d Add metrics for GitHub API rate limit (#312) 2021-02-16 09:58:09 +09:00
Johannes Nicolai
2623140c9a Make log message less scary (#311)
* the reconciliation loop is often much faster than the runner startup, 
so changing runner not found related messages to debug and also add the 
possibility that the runner just needs more time
2021-02-16 09:55:55 +09:00
Johannes Nicolai
1db9d9d574 Use ARM64 compatible kube-rbac-proxy from upstream (#310)
* as pointed out in #281, the currently used image for the
kube-rbac-proxy - gcr.io/kubebuilder/kube-rbac-proxy:v0.4.1 - does not
have an ARM64 image
* hence, trying to use the standard deployment manifest / helm chart will
fail on ARM64 systems
* replaced image with quay.io/brancz/kube-rbac-proxy:v0.8.0 which is the 
latest version from the upstream maintainer 
(https://github.com/brancz/kube-rbac-proxy/blob/master/Makefile#L13)
* successfully tested on both AMD64 and ARM64 clusters
* fixes #281
2021-02-16 09:55:03 +09:00
callum-tait-pbx
d046350240 chore: bumping helm chart semantically (#296)
* chore: bumping helm chart semantically

* chore: removing the app version config
actions-runner-controller-0.4.0
2021-02-16 09:45:56 +09:00
callum-tait-pbx
cca4d249e9 feat: create workflow for runner releases (#306) 2021-02-16 09:42:28 +09:00
Johannes Nicolai
bc8bc70f69 Fix rate limit and runner registration logic (#309)
* errors.Is only returns true when all members of the struct match, which
never happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which led to
unregistering runners after the registration timeout
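
The difference between the exact value check and the type check can be sketched with the standard `errors` package. `rateLimitError` is a hypothetical error type invented for this illustration; it is not the type used by the controller or by any GitHub client library.

```go
package main

import (
	"errors"
	"fmt"
)

// rateLimitError is a hypothetical error whose fields differ per occurrence.
type rateLimitError struct {
	Remaining int
	Message   string
}

func (e *rateLimitError) Error() string { return e.Message }

// isRateLimitedByValue mimics the fragile check: errors.Is compares err
// against one specific target value, so a freshly built target never matches.
func isRateLimitedByValue(err error) bool {
	return errors.Is(err, &rateLimitError{})
}

// isRateLimitedByType matches on the error's type, whatever its field values.
func isRateLimitedByType(err error) bool {
	var target *rateLimitError
	return errors.As(err, &target)
}

// sampleErr wraps a rate limit error the way a caller might.
func sampleErr() error {
	return fmt.Errorf("list runners: %w",
		&rateLimitError{Remaining: 0, Message: "API rate limit exceeded"})
}

func main() {
	err := sampleErr()
	fmt.Println(isRateLimitedByValue(err)) // false
	fmt.Println(isRateLimitedByType(err))  // true
}
```

`errors.As` walks the wrapped error chain and succeeds as soon as it finds an error assignable to the target type, which is why it is the usual choice when only the kind of error matters.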
2021-02-15 09:36:49 +09:00