Commit Graph

261 Commits

Author SHA1 Message Date
Hidetake Iwata
4f3f2fb60d Add metrics for GitHub API rate limit (#312) 2021-02-16 09:58:09 +09:00
Johannes Nicolai
2623140c9a Make log message less scary (#311)
* the reconciliation loop is often much faster than the runner startup, 
so changing runner not found related messages to debug and also add the 
possibility that the runner just needs more time
2021-02-16 09:55:55 +09:00
Johannes Nicolai
1db9d9d574 Use ARM64 compatible kube-rbac-proxy from upstream (#310)
* as pointed out in #281 the currently used image for the 
kube-rbac-proxy - "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.1" - does not 
have an ARM64 image
* hence, trying to use the standard deployment manifest / helm chart will 
fail on ARM64 systems
* replaced image with quay.io/brancz/kube-rbac-proxy:v0.8.0 which is the 
latest version from the upstream maintainer 
(https://github.com/brancz/kube-rbac-proxy/blob/master/Makefile#L13)
* successfully tested on both AMD64 and ARM64 clusters
* fixes #281
2021-02-16 09:55:03 +09:00
callum-tait-pbx
d046350240 chore: bumping helm chart semantically (#296)
* chore: bumping helm chart semantically

* chore: removing the app version config
actions-runner-controller-0.4.0
2021-02-16 09:45:56 +09:00
callum-tait-pbx
cca4d249e9 feat: create workflow for runner releases (#306) 2021-02-16 09:42:28 +09:00
Johannes Nicolai
bc8bc70f69 Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct to return true which never 
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which led to 
unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
Johannes Nicolai
34c6c3d9cd Pod eviction policy examples (crashed nodes) (#308)
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once #307 gets merged
2021-02-15 09:33:01 +09:00
Johannes Nicolai
9c8d7305f1 Introduce pod deletion timeout and forcefully delete stuck pods (#307)
* if a k8s node becomes unresponsive, the kube controller will soft
delete all pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave the
last status and hence the runner controller will assume that everything
is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks
that none of its runners are currently busy and will not schedule any
further runners / pods, resulting in a broken runner deployment until the
runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner
controller will try to reboot the runner and create a pod on a working
node
* use forceful deletion and requeue for pods that have been stuck for
more than one minute in terminating state
* gracefully handling race conditions if pod gets finally forcefully deleted within
2021-02-15 09:32:28 +09:00
Yusuke Kuoka
addcbfa7ee Fix runner registration timeout (#301)
Fixes #300
2021-02-12 10:00:20 +09:00
Yusuke Kuoka
bbb036e732 feat: Prevent blocking on transient runner registration failure (#297)
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes (currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub down, temporary compute resource shortage, etc).

Formerly you had to manually detect and delete such pods or even force-delete corresponding runners to unblock the controller.

Since this enhancement, the controller deletes the pod automatically 10 minutes after pod creation, which results in the controller creating another pod that might work.

Ref #288
2021-02-09 10:17:52 +09:00
Yusuke Kuoka
9301409aec fix: Paginate ListRepositoryWorkflowRuns (#295)
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it fetched and considered only the first 30 workflow runs at reconciliation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
2021-02-09 10:13:53 +09:00
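The pagination fix in 9301409aec (#295) amounts to looping until the API reports no further page. The sketch below shows that loop against a fake in-memory API; `page`, `listAll` and `fakeFetch` are hypothetical names, and the "next page is 0 when there is none" convention is an assumption modelled on how paginated GitHub clients commonly report it.

```go
package main

import "fmt"

// page is a hypothetical response: one batch of workflow runs plus the
// number of the next page, where 0 means there is no further page.
type page struct {
	Runs     []string
	NextPage int
}

// listAll keeps requesting pages until the fetcher reports no next page,
// instead of stopping after the first batch.
func listAll(fetch func(pageNum int) (page, error)) ([]string, error) {
	var all []string
	for pageNum := 1; pageNum != 0; {
		p, err := fetch(pageNum)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Runs...)
		pageNum = p.NextPage
	}
	return all, nil
}

// fakeFetch simulates an API holding total runs served perPage at a time.
func fakeFetch(total, perPage int) func(int) (page, error) {
	return func(pageNum int) (page, error) {
		start := (pageNum - 1) * perPage
		if start >= total {
			return page{}, nil
		}
		end := start + perPage
		if end > total {
			end = total
		}
		runs := make([]string, 0, end-start)
		for i := start; i < end; i++ {
			runs = append(runs, fmt.Sprintf("run-%d", i+1))
		}
		next := 0
		if end < total {
			next = pageNum + 1
		}
		return page{Runs: runs, NextPage: next}, nil
	}
}

func main() {
	runs, err := listAll(fakeFetch(65, 30))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(runs)) // 65, not just the first 30
}
```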
Yusuke Kuoka
ab1c39de57 feat: HorizontalRunnerAutoscaler Webhook server (#282)
* feat: HorizontalRunnerAutoscaler Webhook server

This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period to add the missing runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.

Changes:

* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
2021-02-07 17:37:27 +09:00
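The target selection described in ab1c39de57 (#282) can be sketched as a two-stage lookup: take the single matching HRA among repository-wide runners, otherwise try the organizational ones. The `hra` struct and both functions below are simplified, hypothetical stand-ins; the real matching inspects the `ScaleUpTriggers` of each HorizontalRunnerAutoscaler rather than a plain list of event names.

```go
package main

import "fmt"

// hra is a simplified, hypothetical stand-in for a HorizontalRunnerAutoscaler
// that only records which GitHub events it reacts to.
type hra struct {
	Name   string
	Events []string
}

// singleMatch returns the only candidate that reacts to the event.
// Zero matches or more than one match both count as "no target".
func singleMatch(candidates []hra, event string) (hra, bool) {
	var found []hra
	for _, c := range candidates {
		for _, e := range c.Events {
			if e == event {
				found = append(found, c)
				break
			}
		}
	}
	if len(found) == 1 {
		return found[0], true
	}
	return hra{}, false
}

// findScaleTarget tries repository-wide runners first and falls back to
// organizational runners when the first stage is empty or ambiguous.
func findScaleTarget(repoWide, orgWide []hra, event string) (hra, bool) {
	if target, ok := singleMatch(repoWide, event); ok {
		return target, true
	}
	return singleMatch(orgWide, event)
}

func main() {
	repoWide := []hra{
		{Name: "repo-a", Events: []string{"push"}},
		{Name: "repo-b", Events: []string{"push"}},
	}
	orgWide := []hra{{Name: "org", Events: []string{"push", "check_run"}}}

	target, ok := findScaleTarget(repoWide, orgWide, "push")
	fmt.Println(target.Name, ok) // org true
}
```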
alex-mozejko
a4350d0fc2 bug-fix: patched dir owned by runner (#284)
* bug-fix: patched dir owned by runner

* always build with latest runner version

* Revert "always build with latest runner version"

This reverts commit e719724ae9fe92a12d4a087185cf2a2ff543a0dd.

* Also patch dindrunner.Dockerfile

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-02-07 17:21:10 +09:00
callum-tait-pbx
2146c62c9e chore: bumping Helm chart version (#294)
* chore: adding mac rubbish to gitignore

* chore: bumping chart version
actions-runner-controller-0.3.1
2021-02-07 16:46:19 +09:00
Jesse Haka
28e80a2d28 Add support for enterprise runners (#290)
* Add support for enterprise runners

* update docs
v0.17.0
2021-02-05 09:31:06 +09:00
Tom Bamford
831db9ee2a Added github.sha to DockerHub push (#286)
* Added GITHUB.RUN_NUMBER to DockerHub push

* switch run_number to sha on docker tag

* re-add mutable tags for backwards compatibility

* truncate to short SHA (7 chars)

* behaviour workaround

* use ENV to define sha_short

* use ::set-output to define sha_short

* bump action
2021-02-04 09:29:32 +09:00
Donovan Muller
4d69e0806e Update GitHub runner version (#280) 2021-02-02 14:06:08 +09:00
Donovan Muller
d37cd69e9b feat/helm: Bump appVersion to 0.6.1 release (#272)
* feat/helm: Bump appVersion to 0.6.1 release

* Also bump chart version to trigger a new chart release

Co-authored-by: Yusuke Kuoka <c-ykuoka@zlab.co.jp>
actions-runner-controller-0.2.1
2021-01-29 09:29:43 +09:00
Yusuke Kuoka
a2690aa5cb Update README.md
Follow-up for #275
2021-01-29 09:29:26 +09:00
Clément
da020df0fd docs: fix install installation method (#275) 2021-01-29 09:28:34 +09:00
Jonas Lergell
6c64ae6a01 Actually use 'dockerdContainerResources' to set resources on the dind container (#273) 2021-01-29 09:18:28 +09:00
Yusuke Kuoka
42c7d0489d chart: Bump to 0.2.0 actions-runner-controller-0.2.0 2021-01-25 09:14:49 +09:00
Donovan Muller
b3bef6404c Add support for additional environment variables (#271) 2021-01-25 09:00:03 +09:00
David Young
1127c447c4 Add GitHub Actions to publish helm chart (#257)
* Add chart workflows (#1)

* Add chart workflows

* Fix publishing step in CI

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Update CI on push-to-master (#3)

* Put helm installation step in the correct CI job

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Put helm installation step in the correct CI job (#4)

* Update on-push-master-publish-chart.yml

* Remove references to certmanager dependency

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Add ability to customize kube-rbac-proxy image

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Only install cert-manager if we're going to spin up KinD

Signed-off-by: David Young <davidy@funkypenguin.co.nz>
actions-runner-controller-0.1.2
2021-01-24 15:37:01 +09:00
Yusuke Kuoka
ace95d72ab Fix self-update failures due to /runner/externals mount (#253)
* Fix self-update failures due to /runner/externals mount

Fixes #252

* Tested Self-update Fixes (#269)

Adding fixes to #253 as confirmed and tested in https://github.com/summerwind/actions-runner-controller/issues/264#issuecomment-764549833 by @jolestar, @achedeuzot and @hfuss 🙇 🍻

Co-authored-by: Hayden Fuss <wifu1234@gmail.com>
v0.16.1
2021-01-24 10:58:35 +09:00
Johannes Nicolai
42493d5e01 Adding --name-space parameter in example (#259)
* when setting a GitHub Enterprise server URL without a namespace, an error occurs: "error: the server doesn't have a resource type "controller-manager""
* setting default namespace "actions-runner-system" makes the example work out of the box
2021-01-22 10:12:04 +09:00
Johannes Nicolai
94e8c6ffbf minReplicas <= desiredReplicas <= maxReplicas (#267)
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what
* before this change, if the number of runners was much larger than the max number, the applied scale down factor might still result in a desired value > maxReplicas
* if, because of resource constraints in the cluster, runners were permanently restarted, the number of runners could go up by more than the reverse scale down factor until the next reconciliation round, resulting in a situation where the number of runners climbs even though it should actually go down
* by checking whether the desiredReplicas is always <= maxReplicas, infinite scaling up loops can be prevented
2021-01-22 10:11:21 +09:00
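The invariant from 94e8c6ffbf (#267) is a plain clamp. A minimal sketch, with a hypothetical function name and plain integers in place of the controller's actual fields:

```go
package main

import "fmt"

// clampReplicas is a hypothetical helper that forces the computed value into
// the range minReplicas <= result <= maxReplicas, whatever the scaling
// factors produced.
func clampReplicas(desired, minReplicas, maxReplicas int) int {
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(clampReplicas(12, 1, 5)) // 5
	fmt.Println(clampReplicas(0, 1, 5))  // 1
	fmt.Println(clampReplicas(3, 1, 5))  // 3
}
```

Applying the clamp as the very last step means a scale down factor applied to an oversized runner count can no longer yield a value above maxReplicas.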
callum-tait-pbx
563c79c1b9 feat/helm: add manager secret to Helm chart (#254)
* feat: adding manager secret to Helm

* fix: correcting secret data format

* feat: adding in common labels

* fix: updating default values to have config

The auth config needs to be commented out by default as we don't want to deploy both configs empty. This may break stuff, so we want the user to actively uncomment the auth method they want instead

* chore: updating default format of cert

* chore: wording
2021-01-22 10:03:25 +09:00
Johannes Nicolai
cbb41cbd18 Updating custom container example (#260)
* use latest instead of outdated version
* use sudo for package install (required)
* use sudo for package meta data removal (required)
2021-01-22 09:57:42 +09:00
Johannes Nicolai
64a1a58acf GitHub runner groups have to be created first (#261)
* in contrast to runner labels, GitHub runner groups are not automatically created
2021-01-22 09:52:35 +09:00
Reinier Timmer
524cf1b379 Update runner to v2.275.1 (#239) 2020-12-18 08:38:39 +09:00
ZacharyBenamram
0dadddfc7d Update README for "PercentageRunnersBusy" HRA metric type (#237)
* adding readme for new hpa scheme

* callum's comments

Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
v0.16.0
2020-12-17 10:21:27 +09:00
ZacharyBenamram
48923fec56 Autoscaling: Percentage runners busy - remove magic number used for round up (#235)
* remove magic number for autoscaling

Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-12-15 14:38:01 +09:00
ZacharyBenamram
466b30728d Add "PercentageRunnersBusy" horizontal runner autoscaler metric type (#223)
* hpa scheme based off busy runners

* running make manifests

Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-12-13 08:48:19 +09:00
callum-tait-pbx
c13704d7e2 feat: custom labels (#231)
Co-authored-by: Callum Tait <callum.tait@PBXUK-HH-05772.photobox.priv>
2020-12-13 08:33:04 +09:00
callum-tait-pbx
fb49bbda75 feat: adding helm config for dind sidecar (#232)
Co-authored-by: Callum Tait <callum.tait@PBXUK-HH-05772.photobox.priv>
2020-12-13 08:31:24 +09:00
Reinier Timmer
8d6f77e07c Remove beta GitHub client implementations (#228) 2020-12-10 09:08:51 +09:00
Yusuke Kuoka
dfffd3fb62 feat: EKS IAM Roles for Service Accounts for Runner Pods (#226)
One of the pod recreation conditions has been modified to use hash of runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS that requires its admission webhook to mutate the runner pod to have additional, IRSA-related volumes, volume mounts and env.

Resolves #200
v0.15.0
2020-12-08 17:56:06 +09:00
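The idea behind dfffd3fb62 (#226), comparing a hash of the runner spec rather than the live pod, can be sketched as follows. Everything here is an assumption made for illustration: the `runnerSpec` fields, the choice of SHA-256 over JSON, and the function names are not taken from the controller, which may hash and store the value differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// runnerSpec is a hypothetical, heavily reduced runner specification.
type runnerSpec struct {
	Image  string   `json:"image"`
	Labels []string `json:"labels"`
}

// specHash returns a stable hash of the declared spec.
func specHash(s runnerSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// needsRecreate compares the hash recorded when the pod was created with the
// hash of the current spec. Fields that an admission webhook adds to the pod
// itself never enter the comparison, so they cannot trigger a restart.
func needsRecreate(recordedHash string, current runnerSpec) (bool, error) {
	h, err := specHash(current)
	if err != nil {
		return false, err
	}
	return h != recordedHash, nil
}

func main() {
	spec := runnerSpec{Image: "runner:v1", Labels: []string{"linux"}}
	recorded, err := specHash(spec)
	if err != nil {
		panic(err)
	}

	unchanged, _ := needsRecreate(recorded, spec)
	fmt.Println(unchanged) // false

	spec.Image = "runner:v2"
	changed, _ := needsRecreate(recorded, spec)
	fmt.Println(changed) // true
}
```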
Juho Saarinen
f710a54110 Don't compare runner connection token at restart need check (#227)
Fixes #143
2020-12-08 08:48:35 +09:00
Yusuke Kuoka
85c29a95f5 runner: Add support for ruby/setup-ruby (#224)
It turned out previous versions of runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One notable example of such an action is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby).

This change adds the support for those actions, by setting up AGENT_TOOLSDIRECTORY and installing libyaml-dev within runner images.
2020-12-06 11:53:38 +09:00
Erik Nobel
a2b335ad6a Github pkg: Bump github package to version 33 (#222) v0.14.0 2020-12-06 10:01:47 +09:00
Tom Bamford
56c57cbf71 ci: Replace deprecated crazy-max buildx action to use alternative docker actions (#197)
Deprecated action `crazy-max/setup-buildx-action@v1` has been replaced with:
  `docker/setup-qemu-action@v1`
  `docker/setup-buildx-action@v1`
  `docker/login-action@v1`
  `docker/build-push-action@v2`

See: https://github.com/crazy-max/ghaction-docker-buildx
2020-12-06 10:00:10 +09:00
Ahmad Hamade
837563c976 Adding priorityClassName to helm chart (#215)
* Adding priorityClassName to helm chart and README file

* removed README and revert chart version
2020-11-30 09:04:25 +09:00
ZacharyBenamram
df99f394b4 Remove 10 minute buffer to token expiration (#214)
Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-11-30 09:03:27 +09:00
Shinnosuke Sawada
be25715e1e Use TLS for secure docker connection (#192) 2020-11-30 08:57:33 +09:00
Yusuke Kuoka
4ca825eef0 Publish runner images for v2.274.2
Ref #212
2020-11-27 08:49:58 +09:00
Yusuke Kuoka
e5101554b3 Fix release workflow to not use add-path
Fixes #208
v0.13.0 v0.13.1
2020-11-26 08:39:03 +09:00
Reinier Timmer
ee8fb5a388 parametrized working directory (#185)
* parametrized working directory

* manifests v3.0
2020-11-25 08:55:26 +09:00
Erik Nobel
4e93879b8f [BUG?]: Create mountpoint for /externals/ (#203)
* runner/controller: Add externals directory mount point

* Runner: Create hack for moving content of /runner/externals/ dir

* Externals dir Mount: mount examples for '__e/node12/bin/node' not found error
2020-11-25 08:53:47 +09:00
Shinnosuke Sawada
6ce6737f61 add dockerEnabled document (#193)
Follow-up for #191
2020-11-17 09:31:34 +09:00