# Summary
- add lifecycle, terminationGracePeriodSeconds, and loadBalancerSource ranges to metrics server
- these were missed when copying from the other webhook server
- original PR adding them to the other webhook server is here https://github.com/actions/actions-runner-controller/pull/2305
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
In response to https://github.com/actions/actions-runner-controller/issues/2212 , the ARC helm chart is missing ClusterRoleBinding and ClusterRole for the ActionsMetricsServer resulting on missing permissions.
This also fix the labels of the ActionsMetricsServer Service as it is selected by the ServiceMonitor with those labels.
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Some requests send method in lowercase (verified with curl and as a default for AWS ALB health check requests), but Go HTTP library constant MethodGet is in upper.
I observed that 100% of canceled jobs in my runner pool were not causing scale down events. This PR fixes that.
The problem was caused by #2119.
#2119 ignores certain webhook events in order to fix#2118. However, #2119 overdoes it and filters out valid job cancellation events. This PR uses stricter filtering and add visibility for future troubleshooting.
<details><summary>Example cancellation event</summary>
This is the redacted top portion of a valid cancellation event my runner pool received and ignored.
```json
{
"action": "completed",
"workflow_job": {
"id": 12848997134,
"run_id": 4738060033,
"workflow_name": "slack-notifier",
"head_branch": "auto-update/slack-notifier-0.5.1",
"run_url": "https://api.github.com/repos/nuru/<redacted>/actions/runs/4738060033",
"run_attempt": 1,
"node_id": "CR_kwDOB8Xtbc8AAAAC_dwjDg",
"head_sha": "55bada8f3d0d3e12a510a1bf34d0c3e169b65f89",
"url": "https://api.github.com/repos/nuru/<redacted>/actions/jobs/12848997134",
"html_url": "https://github.com/nuru/<redacted>/actions/runs/4738060033/jobs/8411515430",
"status": "completed",
"conclusion": "cancelled",
"created_at": "2023-04-19T00:03:12Z",
"started_at": "2023-04-19T00:03:42Z",
"completed_at": "2023-04-19T00:03:42Z",
"name": "build (arm64)",
"steps": [
],
"check_run_url": "https://api.github.com/repos/nuru/<redacted>/check-runs/12848997134",
"labels": [
"self-hosted",
"arm64"
],
"runner_id": 0,
"runner_name": "",
"runner_group_id": 0,
"runner_group_name": ""
},
```
</details>
Starting ARC v0.27.2, we've changed the `docker.sock` path from `/var/run/docker.sock` to `/var/run/docker/docker.sock`. That resulted in breaking some container-based actions due to the hard-coded `docker.sock` path in various places.
Even `actions/runner` seem to use `/var/run/docker.sock` for building container-based actions and for service containers?
Anyway, this fixes that by moving the sock file back to the previous location.
Once this gets merged, users stuck at ARC v0.27.1, previously upgraded to 0.27.2 or 0.27.3 and reverted back to v0.27.1 due to #2519, should be able to upgrade to the upcoming v0.27.4.
Resolves#2519Resolves#2538
#2490 has been happening since v0.27.2 for non-dind runners based on Ubuntu 20.04 runner images. It does not affect Ubuntu 22.04 non-dind runners(i.e. runners with dockerd sidecars) and Ubuntu 20.04/22.04 dind runners(i.e. runners without dockerd sidecars). However, presuming many folks are still using Ubuntu 20.04 runners and non-dind runners, we should fix it.
This change tries to fix it by defaulting to the docker group id 1001 used by Ubuntu 20.04 runners, and use gid 121 for Ubuntu 22.04 runners. We use the image tag to see which Ubuntu version the runner is based on. The algorithm is so simple- we assume it's Ubuntu-22.04-based if the image tag contains "22.04".
This might be a breaking change for folks who have already upgraded to Ubuntu 22.04 runners using their own custom runner images. Note again; we rely on the image tag to detect Ubuntu 22.04 runner images and use the proper docker gid- Folks using our official Ubuntu 22.04 runner images are not affected. It is a breaking change anyway, so I have added a remedy-
ARC got a new flag, `--docker-gid`, which defaults to `1001` but can be set to `121` or whatever gid the operator/admin likes. This can be set to `--docker-gid=121`, for example, if you are using your own custom runner image based on Ubuntu 22.04 and the image tag does not contain "22.04".
Fixes#2490
* Update controller package names to match the owning API group name
* feedback.
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
Breaks up the ARC documentation into several smaller articles.
`@vijay-train` and `@martin389` put together the plan for this update, and I've just followed it here.
In these updates:
- The README has been updated to include more general project information, and link to each new article.
- The `detailed-docs.md` file has been broken up into multiple articles, and then deleted.
- The Actions Runner Controller Overview doc has been renamed to `about-arc.md`.
Any edits to content beyond generally renaming headers or fixing typos is out of scope for this PR, but will be made in the future.
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
* Changed folder structure to allow multi group registration
* included actions.github.com directory for resources and controllers
* updated go module to actions/actions-runner-controller
* publish arc packages under actions-runner-controller
* Update charts/actions-runner-controller/docs/UPGRADING.md
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* runner: Expose dind runner dockerd logs via stdout/stderr
We've been letting supervisord to run dockerd within the dind runner container presuming it would avoid producing zombie processes. However we used dumb-init to wrap supervisord to wrap dockerd. In this picture supervisord might be unnecessary and dumb-init is actually a correct pid 0 for containers.
Rmoving supervisord removes this unnecessary complexity, while saving a little memory, and more importantly logs from dockerd is exposed via stdout/stderr of the container for easy access from kubectl-logs, fluentd, and so on.
* Add workflow job metrics to Github webhook server
* Fix handling of workflow_job.Conclusion
* Make the prometheus metrics exporter for the workflow jobs a dedicated application
* chart: Add support for deploying actions-metrics-server
* A few improvements to make it easy to cover in E2E
* chart: Add missing actionsmetrics.service.yaml
* chart: Do not modify actionsMetricsServer.replicaCount
* chart: Add documentation for actionsMetrics and actionsMetricsServer
Co-authored-by: Colin Heathman <cheathman@benchsci.com>
* feat: dind-rootless 22.04 runner
* runner: Bring back packages needed by rootlesskit
* e2e: Update E2E buildvars with ubuntu 22.04 dockerfiles
* feat: use new uid for runner user
* e2e: Make it possible to inject ubuntu version via envvar for actiosn-runner-dind image
* doc: Use fsGroup=1001 for IRSA on Ubuntu 22.04 runner
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* ci: prepare ci for multiple runners
* chore: rename dockerfiles
* chore: sup multiple os in makefile
* chore: changes to support multiple versions
* chore: remove test for TARGETPLATFORM
* chore: fixes and add individual targets
* ci: add latest tag back in
* ci: remove latest suffix tag
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
The runner graceful stop timeout has never been propagated to the dind sidecar due to configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop can respect the graceful stop timeout.
Related to #1759
This fixes the said issue that I found while I was running a series of E2E tests to test other features and pull requestes I have recently contributed.
* Use quote on caBundle values for the webhook deployment
* Drop unrecognized --log-format arg on the manager container
* Update custom cert docs with the default san/secret names
* Revert "Drop unrecognized --log-format arg on the manager container"
This reverts commit d76dd67317.
* Fixes etcd for macos.
The older version of etcd packaged in kubebuilder 2.3.2 for Darwin
throws a stack trace upon attempted startup.
This retrieves the latest version of etcd from coreos and installs
that instead; this works on all OSes.
I removed some redundancy in the Makefile around test dependency
retrieval, too.
* Capture further OS specific test command tweaks.
* feat: clean and add docker-compose
* feat: make docker compose download arch aware
* fix: use new ARG name
* fix: correct case in url
* ci: add some debug output to workflow
* ci: add ARG for docker
* fix: various fixes
* chore: more alignment changes
* chore: use /usr/bin over /usr/local/bin
* chore: more logical order
* fix: add recursive flag
* chore: actions/runner stuff with actions/runner
* ci: bump checkout to latest
* fix: rootless build
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Setting SecurityContext.Privileged bit to false, which is default,
prevents GKE from admitting Windows pods. Privileged bit is not
supported on Windows.
* ci: Fix runner builds but not pushes for forks
I noticed that our runners workflow is failing on docker-login due to that a pull request workflow job from a fork does not have access to repo secrets.
https://github.com/malachiobadeyi/actions-runner-controller/actions/runs/3390463793/jobs/5634638183
Can we try this, so that hopefully it suppresses docker-login for pull requests from forks?
* Update .github/workflows/runners.yaml
* fixup! Update .github/workflows/runners.yaml
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* fixup! fixup! Update .github/workflows/runners.yaml
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
By default `sudo` drops all environment variables and executes its commands with a clean environment. This is by design, but for the `DEBIAN_FRONTEND` environment variable it is not what we want, since it results in installers being interactive. This adds the `env_keep` instruction to `/etc/sudoers` to keep `DEBIAN_FRONTEND` with its `noninteractive` value, and thus pass it on to commands that care about it. Note that this makes no difference in our builds, because we are running them directly as `root`. However, for users of our image this is going to make a difference, since they start out as `runner` and have to use `sudo`.
Co-authored-by: Fleshgrinder <fleshgrinder@users.noreply.github.com>
* Allow to set docker default address pool
* fixup! Allow to set docker default address pool
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Revert unnecessary chart ver bump
* Update docs for DOCKER_DEFAULT_ADDRESS_POOL_*
* Fix the dockerd default address pool scripts to actually work as probably intended
* Update the E2E testdata runnerdeployment to accomodate the new docker default addr pool options
* Correct default dockerd addr pool doc
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
Co-authored-by: Claudio Vellage <claudio.vellage@pm.me>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* 1770 update log format and add runID and Id to worflow logs
update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook
use logging package
remove unused modules
add setup name to setuplog
add flag to change log format
change flag name to enableProdLogConfig
move log opts to logger package
remove empty else and reset timeEncoder
update flag description
use get function to handle nil
rename flag and update logger function
Update main.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update controllers/horizontal_runner_autoscaler_webhook.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update logging/logger.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
copy log opt per each NewLogger call
revert to use autoscaler.log
update flag descript and remove unused imports
add logFormat to readme
rename setupLog to logger
make fmt
* Fix E2E along the way
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix permission issue when you use PV for rootless dind cache
This fixes the said issue I have found while testing #1759.
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.
It turned out too hard to debug configuration issues on the rootless dind daemon as it was not writing any logs to stdout/stderr of the container. This fixes that, so that any rootless dind configuration or startup errors are visible in e.g. the kubectl-logs output.
Specifically, it was always downloading the linux version
no matter the platform. So I moved the OS detection to be early
in the makefile and verified that I can now download the darwin
version of shellcheck.
* Let it be a bug only when it's reproducible with official runner image
A custom runner image can break runners and ARC in interesting ways. Probably it's better to clearly state that ARC is not guaranteed to work with every custom runner image in the wild.
* Update .github/ISSUE_TEMPLATE/bug_report.yml
* Add workflow for validating runner scripts with shellcheck
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort to a major enhancement to the runner entrypoints being made in #1759.
This change is a counterpart of #1852. #1852 enables you to easily run shellcheck locally. This enables you to automatically run shellcheck on every pull request.
Currently, any shellcheck error does not result in failing the workflow job. Once we addressed all the shellcheck findings, we can flip the fail_on_error option to true and let jobs start failing on pull requests that introduce invalid or suspicious bash code.
so that ARC respect the registration timeout, terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT(#1759) when the runner pod was terminated externally too early after its creation
While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timeout of 10 minutes.
You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
* Update detailed-docs.md
Quick start instructions are now inline in README.md. Updating the link to point to the `Getting Started` section of README.md
* Update Actions-Runner-Controller-Overview.md
Quick start instructions are now inline in README.md. Updating the link to point to the `Getting Started` section of README.md
* Delete QuickStartGuide.md
This guide is no longer needed as the details of it are merged with README.md.
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759.
This commit adds the makefile target to run shellformat locally, so that any contributor can use it before submitting a pull request.
I am about to revisit #1517, #1454, #1561, and #1560 as a part of our on-going effort for a major enhancement to the runner entrypoints being made in #1759.
This change updates and reintroduces #1517 contributed by @CASABECI in a way it becomes applicable to today's code-base.
Both fields can be useless when the reporter thought only one or two lines of the respective logs are relevant and it turned out we had to see another line later.
To avoid such a situation I'd like to change the field labels to include `Whole` so that it looks like `Whole Runner Pod Logs`, and ask to not omit the logs in the field description.
Honestly, I'm a bit tired of seeing issues filed with the default title "Bug" that are seemingly not related to real bugs!
I'm emptying it so that the reporter is more encourage to write up a sentence to describe the problem and they hopefully notice along the way that there are places to diagnose before considering it as a bug.
* Update with recent changes from README.md
* Update README.md
This is the final phase of simplifying README.md
- include only Getting started` steps ( These steps were earlier reviewed as part of /docs/QuickStartGuide.md)
- links to a detailed documentation ( the detailed documentation is a copy of the current README.md)
Once this is merged, any new detailed docs should be captured in /docs/detailed-docs.md
* Update detailed-docs.md
Redo the change made in #1873
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Fixes: #942
helm charts for actions runner, currently its having only RunnerDeployment and Autoscaler resources.
Looks like deployment order is important here, facing the below issue if Autoscaler deployed first and then autoscaling not working as expected.
```
2022-04-21T12:13:08Z DEBUG controllers.webhookbasedautoscaler RunnerDeployment not found with scale target ref name test-actions-runner for hra test-actions-runner-autoscaler
```
Helm doesn't support [ordering](https://github.com/helm/helm/issues/8439) for custom resources. So using List to overcome this issue, didn't use helm chart hooks for ordering since its not [tracked](https://helm.sh/docs/topics/charts_hooks/#hook-resources-are-not-managed-with-corresponding-releases) after creation.
Co-authored-by: Josh Feierman <joshua.feierman@warnermedia.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* feat: Add container to propagate host network MTU
Some network environments use non-standard MTU values. In these
situations, the `DockerMTU` setting might be used to specify the MTU
setting for the `bridge` network created by Docker. However, when the
Github Actions workflow creates networks, it doesn't propagate the
`bridge` network MTU which can lead to `connection reset by peer`
messages.
To overcome this, I've created a new docker image called
`summerwind/actions-runner-mtu` that shims the docker binary in order to
propagate the MTU setting to networks created by Github workflows.
This is a follow-up on the discussion in
(#1046)[https://github.com/actions-runner-controller/actions-runner-controller/issues/1046]
and uses a separate image since there might be some unintended
side-effects with this approach.
* fixup! feat: Add container to propagate host network MTU
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This introduces a linter to PRs to help with code reviews and code hygiene. I've also gone ahead and fixed (or ignored) the existing lints.
I've only setup the default linters right now. There are many more options that are documented at https://golangci-lint.run/.
The GitHub Action should add appropriate annotations to the lint job for the PR. Contributors can also lint locally using `make lint`.
I think we should at least recommend the reporter read the troubleshooting guide before submitting a bug report. It would give them more chances to realize it isn't a bug in some cases.
Even though we have a bug report form, we've seen so many issues being written in freestyle and I think every occurrence of it resulted in needing multiple hops for us to gather the information necessary for diagnosing the issue. I hope every bug report is more actionable. Now we disable blank issues so that the reporter is likely to choose the bug report form or Discussions depending on their situation.
* [Docs] Move into docs folder and other minor fixes
* Move into [docs] folder
* Fix minor formatting
* Create detailed-docs.md
Moving current Readme contents into a separate detailed-docs.md file. Added one additional section "Getting Started" to have the getting started link. Rest of the file is as is from the current Readme.md
This file would be linked from a new simplified Readme (to be raised a separate PR)
* Changed Dockerfile to get the Enviroment variable from the github actions workflow and pass it to the main.go file
Added a function in main.go to fetch the enviroment varible and to have a fallback if the env variable isnt there
Added a test for the version to use for this branch only
* Update test-version.yaml
* Update test-version.yaml
* Removed the test because its not needed when we push upstream
* Moved the version print in main.go to the Log codeblock as requested by toast-gear
Added version as issue#1161 requests.
Decided to use a docker tag structure for the userAgent string, with : being a seperator of the name and version
* Used ldflags instead like mumoshu recommended
Changed Dockerfile to use $VERSION from the workflow
Added version.go and the build package
Removed the getVersion function as we can just get the value directly
* Used ldflags instead like mumoshu recommended
Changed Dockerfile to use $VERSION from the workflow
Added version.go and the build package
Removed the getVersion function as we can just get the value directly
* * Removed the default from the go code (set it as N/A)
* Changed version from latest to dev inside makefile
* Added buildarg for version to the dockerfile in the makerfile
* Added VERSION with default dev value as arg inside dockerfile
* Cleaned up inside dockerfile
* Fix failing test
* Fix possible missing VERSION in the ARC UA suffix due to missing build arg in docker-build-push step
Co-authored-by: S8338C <viktor.lindgren@seb.se>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* [Doc] Create ARC overview doc
The purpose of this doc is a starting point overview to ARC, with links to Quick start guide within.
* Update Actions-Runner-Controller-Overview.md
Fixed some formatting
* Update for minor formatting
Fixed links to include quotes, where missing. Added spaces after periods, where missing.
* Updated links to the QuickStart guide
* Updated Images and scaling sections
Updated the following based on PR feedback
- `The Runner container image` now calls out more explicitly the recommended way to install additional software
- `Scaling runners - dynamically with Pull Driven ScalingScaling runners - dynamically with Pull Driven Scaling` - Removed mentions of `TotalNumberOfQueuedAndInProgressWorkflowRuns` as its not fully implemented
* Apply suggestions from code review
Incorporated review feedback from @andyfeller, @sethrylan, @debuger24 and @mumoshu. Thank you all.
Co-authored-by: Andy Feller <andyfeller@github.com>
Co-authored-by: Rahul Kumar <rahulcomp24@gmail.com>
Co-authored-by: Seth Rylan Gainey <sethrylan@github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Apply suggestions from code review
Add more detailed config for PercentageRunnersBusy metric
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Updated link text to "Pull Driven Scaling"
Co-authored-by: Andy Feller <andyfeller@github.com>
Co-authored-by: Rahul Kumar <rahulcomp24@gmail.com>
Co-authored-by: Seth Rylan Gainey <sethrylan@github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Add prometheus metrics for autoscaling
* Add desc for prometheus-metrics
* FIX: Typo
* Remove replicas_desired_before in metrics
* Remove Num prefix in metricws
* Create QuickStartGuide.md
Creating a new Quickstart guide that captures simple onboarding instructions. The intent is for first time users to be able to follow this guide and get their environment running and try out ARC. A link to this guide would be added to the repo readme once this PR get merged.
* Update QuickStartGuide.md
Fixed a typo - removed "$" from codeblock "$ kubectl apply -f runnerdeployment.yaml"
* Update QuickStartGuide.md
Eliminated need to specify PAT in Custom_values.yaml. Instead passing as parameter while installing helm chart. This eliminates need to store PAT in a file and also eliminates a setup step.
* Fixed minor typos
Fixed types identified by @nebuk89
* Minor formatting in links and periods.
Fixed formatting to include space after period and commas. Fixed formatting on some links to include quotes
Changed from `Kuernetes` to `Kubernetes` in the **Multitenancy** chapter.
By the way why not use [the vale-action](https://github.com/errata-ai/vale-action) to automate linting in the markdown files? If you'd like I can probably find some time to do it. Just a small token of appreciation for an awesome project!
* chart: Remove support for extensions/v1beta1 and networking.k8s.io/v1beta1
`networking.k8s.io/v1` has been available since v1.19.
As of today, AWS EKS supports v1.19+ and Oracle Cloud supports v1.20+. GKE and AKS supports v1.21+. The upstream Kubernetes project maintains v1.22+.
So it should be safe to remove it now.
* fixup! chart: Remove support for extensions/v1beta1 and networking.k8s.io/v1beta1
This removes the flag and code for the legacy GitHub API cache. We already migrated to fully use the new HTTP cache based API cache functionality which had been added via #1127 and available since ARC 0.22.0. Since then, the legacy one had been no-op and therefore removing it is safe.
Ref #1412
* feat: allow to discover runner statuses
* fix manifests
* Bump runner version to 2.289.1 which includes the hooks support
* Add feedback from review
* Update reference to newRunnerPod
* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific
* Fix additional TestNewRunnerPod test
* Cover additional feedback from review
* fix rbac manager role
* Add permissions to service account for container mode if not provided
* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This contains apparently enough changes to the current E2E test code to make it runnable against remote Kubernetes clusters. I was actually able to make the test passing against my AWS EKS based test clusters with these changes. You still need to trigger it manually from a local checkout of the ARC repo today. But this might be the foundation for automated E2E tests against major cloud providers.
I encountered this once while E2E testing ARC with K8s 1.22 and cert-manager 1.1.1. The K8s version is too high / The cert-manager is too low so you generally need to fix either. In a standard scenario, it should be more feasible and meaningful to upgrade cert-manager to a recent enough version that supports the new Kubernetes version.
I've been manually setting up Argo Tunnel to expose the webhook server while running E2E tests so that I can cover the webhook-based autoscaling. This automates the setup process so that we can automatiaclly bring up and down cloudflared before/after the test run, so that it can be a part of our upcoming automated E2E test.
The regression resulted in the webhook-based autoscaler be unable to find visible runner groups and therefore unable to scale up and down the target RunnerDeployment/RunnerSet at all when the webhook-based autoscaler was provided GitHub API credentials to enable the runner groups support. This fixes that.
The regression was introduced via #1578 which is not released yet. Users of existing ARC releases are therefore not affected.
* Cover ARC upgrade in E2E test
so that we can make it extra sure that you can upgrade the existing installation of ARC to the next and also (hopefully) it is backward-compatible, or at least it does not break immediately after upgrading.
* Consolidate E2E tests for RS and RD
* Fix E2E for RD to pass
* Add some comment in E2E for how to release disk consumed after dozens of test runs
* chore: move HOME to more logical place
* chore: don't break the PATH
* chore: don't break the PATH
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
The primary goal of this change is to let the tester know about the config difference between the explicitly configured ephemeral work volume vs the automatically configured work volume with workVolumeClaimTemplate+containerMode=kubernetes.
* added troubleshooting solution
* added error example
* added entry to the pages index
* sorted
Co-authored-by: Mike Joseph <mike@Mikes-MacBook-Pro-5618.local>
Co-authored-by: Mike Joseph <mike@efrontier.com>
* added containerMode=kubernetes env variables to the runner
* removed unused logging
* restored configs and charts
* restored makefile cert version and acceptance/run
* added workVolumeClaimTemplate in pod definition, including logic
* added claim template name based on the runner
* Apply suggestions from code review
update errors
* added concurrent cleanup before runner pod is deleted
* update manifests
* added retry after 30s if pod cleanup contains err
* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s
* style changes and added comments
* added izZero timestamp check for deleting runner-linked pods
* changed order of local variable to avoid copy if p is deleted
* removed docker from container mode k8s
* restored charts, config, makefile
* restored forked files back and not the ARC ones
* created PersistentVolume on containerMode k8s
* create pv only if storage class name is local-storage
* removed actions if storage class name is local-storage
* added service account validation if container mode kubernetes
* changed the coding style to match rest of the ARC
* added validation to the runnerdeployment webhook
* specified fields more precisely, added webhook validation to the replicaset as well
* remake manifests
* wraped delete runner-linked-pods in kube mode
* fixed empty line
* fixed import
* makefile changes for hooks
* added cleanup secrets
* create manifests
* docs
* update access modes
* update dockerfile
* nit changes
* fixed dockerfile
* rewrite allowing reuse for runners and runnersets
* deepcopy forgot to stage
* changed privileged
* make manifests
* partly moved to finalizer, still need to apply finalizer first
* finalizer added if env variable used in container mode exists
* bump runner version
* error message moved from Error to Info on cleanup pods/secrets
* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate
* Apply suggestions from code review
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* add hook version to cli, update to 0.1.2
* Apply suggestions from code review
* Update controllers/utils_test.go
* Update runner/Makefile
* Fix missing secret permission and the error handling
* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Make webhook-based scale operation asynchronous
This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both ended up
scaling up the same horizontal runner autoscaler.
Ref #1321
* Fix typos
* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers
* Batch capacity reservation updates for efficient use of apiserver
* Fix potential never-ending HRA update conflicts in batch update
* Extract batchScaler out of webhook-based autoscaler for testability
* Fix log levels and batch scaler hang on start
* Correlate webhook event with scale trigger amount in logs
* Fix log message
This overhaul turns it into a shellcheck valid script with explicit error handling for all possible situations I could think of. This change takes https://github.com/actions-runner-controller/actions-runner-controller/pull/1409 into account and things can be merged in any order. There are a few important changes here to the logic:
- The wait logic for checking if docker comes up was fundamentally flawed because it checks for the PID. Docker will always come up and thus become visible in the process list, just to immediately die when it encounters an issue, after which supervisor starts it again. This means that our check so far is flaky due to the `sleep 1` it might encounter a PID, or it might not, and the existence of the PID does not mean anything. The `docker ps` check we have in the `entrypoint.sh` script does not suffer from this as it checks for a feature of docker and not a PID. I thus entirely removed the PID check, and instead I am handing things over to our `entrypoint.sh` script by setting the environment variables correctly.
- This change has an influence on the `docker0` interface MTU configuration, because the interface might or might not exist after we started docker. Hence, I changed this to a time boxed loop that tries for one minute to set up the interface's MTU. In case the command fails we log an error and continue with the run.
- I changed the entire MTU handling by validating its value before configuring it, logging an error and continuing without if it is set incorrectly. This ensures that we are not going to send our users on a bug hunt.
- The way we started supervisord did not make much sense to me. It sends itself into the background automatically, there is no need for us to do so with Bash.
The decision to not fail on errors but continue is a deliberate choice, because I believe that running a build is more important than having a perfectly configured system. However, this strategy might also hide issues for all users who are not properly checking their logs. It also makes testing harder. Hence, we could change all error conditions from graceful to panicking. We should then align the exit codes across `startup.sh` and `entrypoint.sh` to ensure that every possible error condition has its own unique error code for easy debugging.
* ci: align pipeline files and setups
* ci: more changes
* ci: various changes
* ci: fix setup-helm action ref
* ci: better pipeline name
* ci: more format aligning
* ci: more format aligning
* ci: better job name
* ci: supports multiple languages
* ci: better pipeline and job names
* ci: do a verb-noun thing for consistency
* ci: use 'arc' when talking holistically
* ci: add caching scope
* ci: put canary in a scope
* ci: fix syntax error
* ci: better pipeline and job names
* ci: better job name
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
* Fix example manifests for webhook based scaling
I tried running these on my k8s cluster and I got some easy to fix errors, so I am committing them here.
* Fix example manifests for webhook autoscaling with workflow_jobs
* Fix the explation on how to setup webhooks on your cluster
* Replace unclear comment with actual code examples
There was a comment instructing users to add minReplicas and
maxReplicas to all the HRA yamls, so I just removed it and added
these attributes to the yamls themselves for clarity.
* Make clear that using the ingress example is just a suggestion
* Apply some text improvements suggested by @mumoshu
* Update examples so the webhook server is exposed on a NodePort
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Remove an unnecessary field from one the examples
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Apply suggestion from @mumoshu
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Remove namespace fields from webhook autoscaler examples
This change was suggested by @mumoshu
* Apply final suggestion from @mumoshu
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
A small improvement to our E2E test suite which allows you to set `ARC_E2E_NO_CLEANUP=whatever` to let it prevent the kind cluster cleanup on successful test run, so that you can rerun it without waiting for the new kind cluster to come up.
* doc: Use RunnerSet to retain various cache
In relation to #1286 and as a follow-up for #1340
* docs: clarify client vs daemon
* docs: better wording
* Separate RunnerSet examples for docker iimage layer caching
* Revert changes on testdata as it is going to be added via #1471 instead
* Update README.md
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* fixup! Update README.md
* Remove the outdated RunnerSet limitation
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"
While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error.
This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears.
Ref #1464
* e2e: Ensure work generic ephemeral volume to work as expected
Renamed the runner dockerfiles so that we have proper syntax highlighting for them, as well as a consistent way to map from the image name to the dockerfile. Added a `.dockerignore` file to avoid uploading things to the daemon that we never use.
* chart: Add extraPaths to Ingress of GitHub Webhook Server
* Update charts/actions-runner-controller/templates/githubwebhook.ingress.yaml
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Prefix the toYaml expression to remove the extra newline before extra paths
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859#1207
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs accross restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass.
Ref #1286
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
Give up pinning deps with commit IDs because PRs were unreviewable due to missing changelog and it sends PRs for every commit to the master/main branch of the deps, which is undesired. We only need updates for tagged releases!
* chore: Add signrel command for sigining arc release assets
I used this command to sign assets for the recent releases to comply with the recommendation of 5758364c82/docs/checks.md (signed-releases)
Ref #1298
* Implement signrel subcommands for listing tags and signing assets, with docs
* chore: more initialisation info to help debug
* chore: clearer flag description
* chore: use actual english
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
This is intended to fix#1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".
Ref #1369
* docs: runner cleanup instructions
* docs: add delay in job allocation
* docs: fix broken link
* docs: reorganise into categories
* docs: align the format across the doc
* docs: remove code comment
* docs: add a tools section
* docs: add a short description to each section
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.
Ref #1196
* runner: Remove the ability to use the deprecated `--once` flag
Ref #1196
* runner: Ability to opt-out of using --ephemeral
Although we are going to eventually remove the ability to use the legacy --once flag as proposed in #1196, there might be folks still using legacy GHES versions 3.2 or earlier.
This commit removes the existing feature flag to opt-in for --ephemeral, while adding another feature flag RUNNER_FEATURE_FLAG_ONCE to opt-in for --once so that folks stuck in legacy GHES versions
can still use ARC.
Since this change every user starts using --ephemeral by default. If they see any issues on legacy GHES instance, RUNNER_FEATURE_FLAG_ONCE=true can be set to opt-in to keep using --once, which gives one more ARC release until they upgrade their GHES instance.
But beware, we won't support legacy GHES instances forever as it's going to be a maintenance nightmare. Please upgrade!
Ref #1196
I believe this helps us focus on relatively more important issues like critical bug reports and highly-requested feature requests.
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
> chart test is failing due to `flag provided but not defined: -default-scale-down-delay` which seems to come from the fact that we still use ARC 0.22.3 for chart testing.
>
> Probably we'd better figure out how to test it against both the latest release version of ARC and the canary version of ARC?
>
> Or just test it against the canary version so that it won't fail when the chart depends on features that are available only in the canary version of ARC? 🤔
yup, lets get this merged though so we can do a release today
so that we can hopefully get enough information to diagnose the issue in case it's really a bug report, or it goes to Discussions in case it's a question.
In #1373 we made two mistakes:
- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.
The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.
Ref #1056
Ref #1373
* refactor: remove legacy build and use buildkit
* refactor: add runner version to root makefie
* refactor: enable buildkit for runner make build
* refactor: ignore runner makefile in ci
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
It's probably worth highlighting it's ver 0.X.X by design and that breaking changes are possible until we move it to 1.0.0
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282
Without that field, GKE 1.21 refuses to create the CRD
with an error message that conversionReviewVersions is mandatory.
conversionReviewVersions is a required field when creating apiextensions.k8s.io/v1 custom resource definitions.
Webhooks are required to support at least one ConversionReview version understood by the current and previous API server.
See https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/_print/#webhook-request-and-response
The [ListRunnerGroup API](https://docs.github.com/en/rest/reference/actions#list-self-hosted-runner-groups-for-an-organization) now add a new query parameter `visible_to_repository`.
We were doing `N+1` lookup when trying to find which runner group can be used for job from a certain repository.
- List all runner groups
- Loop through all groups to check repository access for each of them via [API](https://docs.github.com/en/rest/reference/actions#list-repository-access-to-a-self-hosted-runner-group-in-an-organization)
The new query parameter `visible_to_repository` should allow us to get the runner groups with access in one call.
Limitation:
- The new query parameter is only supported in GitHub.com, which means anyone who uses ARC in GitHub Enterprise Server won't get this.
- I am working on a PR to update `go-github` library to support the new parameter, but it will take a few weeks for a newer `go-github` to be released, so in the meantime, I am duplicating the implementation in ARC as well to support the new query parameter.
* Improved Bash Logger
This is a first step towards having robust Bash scripts in the runner images. The changes _could_ be considered breaking, depending on our backwards compatibility definition.
* Fixed Log Formatting Issues
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Bumps the chart version along with the controller version.
We bump the patch number for the chart as the release for the controller is a patch release.
That's the same handling as we've done in the previous version ecc8b4472a and #1300
As always, be sure to upgrade CRDs before updating the controller version!
Otherwise it can break in interesting ways.
This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.
Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.
Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions.
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307
Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273
* chore: bump chart to latest
Bumps the chart version along with the controller version.
We bump the patch number for the chart as the release for the controller is a patch release.
That's the same handling as we've done in the previous version ecc8b4472a
As always, be sure to upgrade CRDs before updating the controller version!
Otherwise it can break in interesting ways.
* docs: expand on CRD upgrade requirement
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource
* feat: copy dotfiles from asset to service dir
* Fixed `UNITTEST` Condition
* Load `/etc/environment`
See https://github.com/actions/runner/issues/1703 for context on this change.
docs: various changes in preparation for 0.22.0 release
- Move RunnerSets to a more predominant location in the docs
- Clean up a few bits
- Highlight the deprecation and removal timeline for the `--once` flag
- Renamed ephemeral runners section to something more logical (persistent runners). Static runners were an option however the word static is awkward as it's sort of tied up with autoscaling and the `Runner` kind so Persistent was chosen instead.
- Update upgrade docs to use `replace` instead of `apply`
Fix runner{set,deployment} rollouts and static runner scaling
I was testing static runners as a preparation to cut the next release of ARC, v0.22.0, and found several problems that I thought worth being fixed.
In particular, this PR fixes static runners reliability issues in two means.
c4b24f8366 fixes the issue that ARC gives up retrying RemoveRunner calls too early, especially on static runners, that resulted in static runners to often get terminated prematurely while running jobs.
791634fb12 fixes the issue that ARC was unable to scale up any static runners when the corresponding desired replicas number in e.g. RunnerDeployment gets updated. It was caused by a bug in the mechanism that is intended to prevent ephemeral runners from being recreated in unwanted circumstances, mistakenly triggered also for static runners.
Since #1179, RunnerDeployment was not able to gracefully terminate old RunnerReplicaSet on update. c612e87 fixes that by changing RunnerDeployment to firstly scale old RunnerReplicaSet(s) down to zero and waits for sync, and set the deletion timestamp only after that. That way, RunnerDeployment can ensure that all the old RunnerReplicaSets that are being deleted are already scaled to zero passing the standard unregister-and-then-delete runner termination process.
It revealed a hidden bug in #1179 that sometimes the scale-to-zero-before-runnerreplicaset-termination does not work as intended. 4551309 fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.
#1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.
It turned out that #1179 broke static runners in a way it is no longer able to scale up at all when the desired replicas is updated.
This fixes that by correcting a certain short-circuit that is intended only for ephemeral runners to not mistakenly triggered for static runners.
The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.
As it turned out to be the biggest release ever, I was afraid I might not be able to write a summary of changes that communicates well. Here is my attempt. Please review and leave any comments so that we can be more confident in this release. Thank you!
I found that #1179 was unable to finish rollout of an RunnerDeployment update(like runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
Use --ephemeral by default
Every runner is now --ephemeral by default.
Note that this works by ARC setting the RUNNER_FEATURE_FLAG_EPHEMERAL envvar to true by default. Previously you had to explicitly set it to true otherwise the runner was passed --once which is known to various race conditions.
It's worth noting that the very confusing and related configuration, ephemeral: true, which creates --once runners instead of static(or persistent) runners had been the default since many months ago. So, this should be the only change needed to make every runner ephemeral without any explicit configuration.
You can still fall back to static(persistent) runners by setting ephemeral: false, and to --once runners by setting RUNNER_FEATURE_FLAG_EPHEMERAL to "false". But I don't think there're many reasons to do so.
Ref #1189
The version of `bradleyfalzon/ghinstallation` which is used to enable GitHub App authentication turned out to add an extra header `application/vnd.github.machine-man-preview+json` to every HTTP request. That revealed an edge-case in our HTTP cache layer `gregjones/httpcache` that results it to not serve responses from cache when it should.
There were two problems. One was that it does not support multi-valued header and it only looked for the first value for each header, and another is that it does not support any http.RoundTripper implementation that modifies HTTP request headers in a RoundTrip function call.
I fixed it in my fork of httpcache, which is hosted at https://github.com/actions-runner-controller/httpcache.
The relevant commits are:
- 70d975e77d
- 197a8a3546
This can be considered as a follow-up for #1127, which turned out to have enabled the cache only for the case that ARC uses PAT for authentication.
Since this fix, the cache is also enabled when ARC authenticates as a GitHub App.
Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.
This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.
The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
* Remove legacy GitHub API cache of HRA.Status.CachedEntries
We migrated to the transport-level cache introduced in #1127 so not only this is useless, it is harder to deduce which cache resulted in the desired replicas number calculated by HRA.
Just remove the legacy cache to keep it simple and easy to understand.
* Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well
* Fix integration test
The TimeEncoder for zap seems to have been set to EpochTimeEncoder which is the default and it was not very readable. Changing it to a TimeEncoderOfLayout(time.RFC3339) for readability.
Another benefit of doing this is the ts format is now consistent with various timestamps ARC put into pod and other custom resource annotations.
* fix(chart): allow to use basic auth when authSecret.create is false
When secret is created outside of the ARC chart using authSecret.create=false and basicAuth,
the controller fails as we're not including the basic password as environment variable as
the password value won't be inside the helm values.
This PR includes both environment variables for consistent regardless if
those are set or not similar as the rest of the other auth options (e.g
app_id, private key, etc)
* chart: Add back the conditional block for .Values.authSecret.github_basicauth_username
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
While testing #1179, I discovered that ARC sometimes stop resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods.
This seems to happen when ARC missed receiving a workflow_job completion event but it has no way to decide if it is either (1) something went wrong on ARC or (2) a loadbalancer in the middle or GitHub or anything not ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it.
In this change, I added a hard-coded 10 minutes timeout(can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet to restart runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrived after the timeout, it will decrease the desired replicas number that results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.
Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.
RunnerSet is StatefulSet-based and RunnerSet/Runner is Pod-based so it had been hard to unify the implementation although they look very similar in many aspects.
This change finally resolves that issue, by first introducing a library that implements the generic logic that is used to reconcile RunnerSet, then adding an adapter that can be used to let the generic logic manage runner pods via Runner, instead of via StatefulSet.
Follow-up for #1127, #1167, and 1178
This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.
I can't find any requests made by user agent `actions-runner-controller` in GitHub.com's telemetry in the past 7 days.
Turns out we only set user agent `actions-runner-controller` if we are configured to use BasicAuth which is not the case for most customers I think.
I update the code a little bit to make sure it always set `actions-runner-controller` as UserAgent for the GitHub HttpClient in ARC.
A further step would be somehow baking the ARC release version into the UserAgent as well.
Enhances ARC(both the controller-manager and github-webhook-server) to cache any GitHub API responses with HTTP GET and an appropriate Cache-Control header.
Ref #920
## Cache Implementation
`gregjones/httpcache` has been chosen as a library to implement this feature, as it is as recommended in `go-github`'s documentation:
https://github.com/google/go-github#conditional-requests
`gregjones/httpcache` supports a number of cache backends like `diskcache`, `s3cache`, and so on:
https://github.com/gregjones/httpcache#cache-backends
We stick to the built-in in-memory cache as a starter. Probably this will never becomes an issue as long as various HTTP responses for all the GitHub API calls that ARC makes, list-runners, list-workflow-jobs, list-runner-groups, etc., doesn't overflow the in-memory cache.
`httpcache` has an known unfixed issue that it doesn't update cache on chunked responses. But we assume that the APIs that we call doesn't use chunked responses. See #1503 for more information on that.
## Ephemeral runner pods are no longer recreated
The addition of the cache layer resulted in a slow down of a scale-down process and a trade-off between making the runner pod termination process fragile to various race conditions(shorter grace period before runner deletion) or delaying runner pod deletion depending on how long the grace period is(longer grace period). A grace period needs to be at least longer than 60s (which is the same as cache duration of ListRunners API) to not prematurely delete a runner pod that was just created.
But once I disabled automatic recreation of ephemeral runner pod, it turned out to be no more of an issue when it's being scaled via workflow_job webhook.
Ephemeral runner resources are still automatically added on demand by RunnerDeployment via RunnerReplicaSet(I've added `EffectiveTime` fields to our CRDs but that's an implementation detail so let's omit). A good side-effect of disabling ephemeral runner pod recreations is that ARC will no longer create redundant ephemeral runners when used with webhook-based autoscaler.
Basically, autoscaling still works as everyone might expect. It's just better than before overall.
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
so that you can run `kubectl logs` on controller pods without the specifying the container name.
It is especially useful when you want to run kubectl-logs on all ARC pods across controller-manager and github-webhook-server like:
```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller
```
That was previously impossible due to that the selector matches pods from both controller-manager and github-webhook-server and kubectl does not provide a way to specify container names for respective pods.
The log level -3 is the minimum log level that is supported today, smaller than debug(-1) and -2(used to log some HRA related logs).
This commit adds a logging HTTP transport to log HTTP requests and responses to that log level.
It implements http.RoundTripper so that it can log each HTTP request with useful metadata like `from_cache` and `ratelimit_remaining`.
The former is set to `true` only when the logged request's response was served from ARC's in-memory cache.
The latter is set to X-RateLimit-Remaining response header value if and only if the response was served by GitHub, not by ARC's cache.
This will cache any GitHub API responses with correct Cache-Control header.
`gregjones/httpcache` has been chosen as a library to implement this feature, as it is as recommended in `go-github`'s documentation:
https://github.com/google/go-github#conditional-requests
`gregjones/httpcache` supports a number of cache backends like `diskcache`, `s3cache`, and so on:
https://github.com/gregjones/httpcache#cache-backends
We stick to the built-in in-memory cache as a starter. Probably this will never becomes an issue as long as various HTTP responses for all the GitHub API calls that ARC makes, list-runners, list-workflow-jobs, list-runner-groups, etc., doesn't overflow the in-memory cache.
`httpcache` has an known unfixed issue that it doesn't update cache on chunked responses. But we assume that the APIs that we call doesn't use chunked responses. See #1503 for more information on that.
Ref #920
There is a race condition between ARC and GitHub service about deleting runner pod.
- The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
- The ARC delete the runner pod which cause the in-progress job to end up canceled.
To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod.
- `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
- `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.
TODO: I need to do some E2E tests to force the race condition to happen.
Ref #911
Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated.
This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
* chart: Allow using different secrets for controller-manager and gh-webhook-server
As it is entirely possible to do so because they are two different K8s deployments. It may provide better scalability because then each component gets its own GitHub API quota.
Some of logs like `HRA keys indexed for HRA` were so excessive that it made testing and debugging the githubwebhookserver harder. This tries to fix that.
We had to manually remove the secret first to update the GitHub credentials used by the controller, which was cumbersome.
Note that you still need to recreate the controller pods and the gh webhook server pods to let them remount the recreated secret.
This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.
Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012
If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories
`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:
1. Search for **repository** HRAs - if so it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.
b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Add env variable to configure `disablupdate` flag
* Write test for entrypoint disable update
* Rename flag, update docs for DISABLE_RUNNER_UPDATE
* chore: bump runner version in makefile
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
This allows providing a different `work` Volume.
This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage.
This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: runner-2
spec:
template:
spec:
dockerdContainerResources: {}
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# this is to mount the docker in docker onto NVME disk
dockerVolumeMounts:
- mountPath: /var/lib/docker
name: scratch
subPathExpr: $(POD_NAME)-docker
- mountPath: /runner/_work
name: work
subPathExpr: $(POD_NAME)-work
volumeMounts:
- mountPath: /runner/_work
name: work
subPathExpr: $(POD_NAME)-work
dockerEnv:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumes:
- hostPath:
path: /mnt/disks/ssd0
name: scratch
- hostPath:
path: /mnt/disks/ssd0
name: work
nodeSelector:
cloud.google.com/gke-nodepool: runner-16-with-nvme
ephemeral: false
image: ""
imagePullPolicy: Always
labels:
- runner-2
- self-hosted
organization: yourorganization
```
Adding handling of paginated results when calling `ListWorkflowJobs`. By default the `per_page` is 30, which potentially would return 30 queued and 30 in_progress jobs.
This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed.
Problem: I did not find any support for pagination in the Github fake client, and have not been able to test this (as I have not been able to push an image to an environment where I can verify this).
If anyone is able to help out verifying this PR, i would really appreciate it.
Resolves#990
GitHub currently has some limitations w.r.t permissions management on
runner groups as they all require org admin, however at our company
we're using runner groups to serve different internal teams (with
different permissions), thus we needed to deploy a custom proxy API with
our internal authentication to provide who has access to certain APIs
depending on the repository/runner group on a given org/enterprise
This change just allows to optionally send the GitHub API calls to an alternate custom
proxy URL instead of cloud github (github.com) or an enterprise URL with
basic authentication
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected).
This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups.
This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before
The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling.
This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture).
The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels.
code tested, work with `["self-hosted"]` and `["self-hosted","anotherLabel"]`
Fixes#951
A lot of people have issues with private GKE clusters and it seems they are all solved by setting up a firewall policy. I think it would be relevant to include this in a troubleshooting-section since so many people are searching around issues for it. I myself just spent most of my day trying to figure it out.
Issues where this is the solution:
* #293
* #335
* #68
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0
* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade
* Regenerate manifests with the latest K8s dependencies
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
When false the chart deployment template will not add GITHUB_*
environment variables to the manager container. In addition, the `volume`
and `volumeMount` for the secret will also be omitted from the
deployment manifest.
Signed-off-by: Piaras Hoban <phoban01@gmail.com>
* chore(deps): update quay.io/brancz/kube-rbac-proxy docker tag to v0.11.0
* chore(deps): update quay.io/brancz/kube-rbac-proxy make tag to v0.11.0
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* doc: Add GitHub App Permission for Workflow Job Webhook
I think we've missed documenting about the app permission for receiving workflow_job events via the app webhook
* docs: clarification on webhook events
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* docs: minor correction to autoscaling description
* docs: make pull metrics plural
* docs: remove redundant words
* docs: make plural and expand anti-flapping
The README had some examples of RunnerDeployment with wrong/outdated
spec.
This caused some pain when trying to create them based on the examples.
=> Fixed using the spec declared in: api/v1alpha1/runner_types.go
* docs: clean up of autoscaling section
* docs: clarifying anti-flapping
* docs: more improvements
* docs: more improvements
* docs: adding duration details and cavaets
* docs: smaller english and better structure
* docs: use consistent wording
* docs: adding limitation cavaet for RunnerSets
* docs: correct helm uprgade order
* docs: lines helm upgrade command with help switch
* docs: use existing limitations section
* docs: fix table of headers and contents
* docs: add link to runnersets on first mention
* docs: adding runnerset limitation
* chore: use new enterprise permission for PAT
* docs: bump example deploy to latest version
* docs: adding oauth apps link
* docs: adding cavaet to the oauth apps doc
* Revert "chore: support app ids as int or strings (#869)"
This reverts commit 0a3d2b686e.
* docs: adding some comments to the code
* docs: adding comment to the chart values
* Create optional serviceAnnotations value in helm chart
* update annotation key
* update annotation key - webhook service
* fix README.md
* docs: using consistent tense
* docs: making the code comments more generic
* feat: Have githubwebhook monitor a single namespace
When using `scope.singleNamespace: true` in Helm, expected behaviour is
that the github webhook server behaves the same way as the controller.
The current behaviour is that the webhook server monitors all the
namespaces.
* Changing the chart README.md to reflect the scope
The documentation now mentions that both the controller and the github
webhook server will make use of the `scope.watchNamespace` field if
`scope.singleNamespace` is set to `true`.
Co-authored-by: Sebastien Le Digabel <sebastien.ledigabel@skyscanner.net>
* Fix bug related to label matching.
Add start of test framework for Workflow Job Events
Signed-off-by: Aidan Jensen <aidan@artificial.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* docs: updating to reflect the latest release
* docs: further updates
* docs: format updates
* docs: correcting title sizes
* docs: correcting links
* docs: correcting the grammar of multi controllers
* docs: slightly less awkward English
I have a dedicated GitHub organization and a private repository to run this E2E test. After a few fixes included in this change, it has successfully passed.
This was something that was missing in #707.
Adding a new test to make sure the ephemeral feature flag from upstream
is set up correctly by the script.
The unit tests are simulating a run for entrypoint. It creates some
dummy config.sh and runsvc.sh and makes sure the logic behind
entrypoint.sh is correct.
Unfortunately the entrypoint.sh contains some sections that are not
mockable so I had to put some logic in there too.
Testing includes for now:
- the normal scenario
- the normal non-ephemeral scenario
- the configuration failure scenario
Also tested the entrypoint.sh on a real runner, still works as expected.
Adding a basic retry loop during configuration. If configuration fails,
the runner will just straight into a retry loop and will continuously
fail until it dies after a while.
This change will retry 10 times and will exit if the configuration
wasn't successful.
Also, changed the logging format, adding a bit of color in the event of
success or failure.
This fixes the below error that occurs in `make release`:
```
kustomize build config/default > release/actions-runner-controller.yaml
Error: accumulating resources: accumulation err='accumulating resources from '../webhook': '/home/mumoshu/p/actions-runner-controller/config/webhook' must resolve to a file': recursed accumulation of path '/home/mumoshu/p/actions-runner-controller/config/webhook': accumulating resources: accumulation err='accumulating resources from 'manifests.v1beta1.yaml': evalsymlink failure on '/home/mumoshu/p/actions-runner-controller/config/webhook/manifests.v1beta1.yaml' : lstat /home/mumoshu/p/actions-runner-controller/config/webhook/manifests.v1beta1.yaml: no such file or directory': evalsymlink failure on '/home/mumoshu/p/actions-runner-controller/config/webhook/manifests.v1beta1.yaml' : lstat /home/mumoshu/p/actions-runner-controller/config/webhook/manifests.v1beta1.yaml: no such file or directory
make: *** [Makefile:156: release] Error 1
```
Ref #144
This fixes the below error on installing the chart:
```
Error: UPGRADE FAILED: error validating "": error validating data: [ValidationError(MutatingWebhookConfiguration.webhooks[0]): missing required field "admissionReviewVersions" in io.k8s.api.admissionregistration.v1.MutatingWebhook, ValidationError(MutatingWebhookConfiguration.webhooks[1]): missing required field "admissionReviewVersions" in io.k8s.api.admissionregistration.v1.MutatingWebhook, ValidationError(MutatingWebhookConfiguration.webhooks[2]): missing required field "admissionReviewVersions" in io.k8s.api.admissionregistration.v1.MutatingWebhook, ValidationError(MutatingWebhookConfiguration.webhooks[3]): missing required field "admissionReviewVersions" in io.k8s.api.admissionregistration.v1.MutatingWebhook]
```
Ref #144
This add support for two upcoming enhancements on the GitHub side of self-hosted runners, ephemeral runners, and `workflow_jow` events. You can't use these yet.
**These features are not yet generally available to all GitHub users**. Please take this pull request as a preparation to make it available to actions-runner-controller users as soon as possible after GitHub released the necessary features on their end.
**Ephemeral runners**:
The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller).
`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.
To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.
Please read the section `Ephemeral Runners` in the updated version of our README for more information.
Note that ephemeral runners is not released on GitHub yet. And `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub!
**`workflow_job` events**:
`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale.
Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events has included `labels`, you had to configure an HRA to only match relevant `check_run` events. It wasn't trivial.
In contrast, a `workflow_job` event payload contains `labels` of runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by `labels` included in the webhook payload. So all you need to use webhook-based autoscale will be to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet.
Note that the current implementation of `workflow_job` support works in two ways, increment, and decrement. An increment happens when the webhook server receives` workflow_job` of `queued` status. A decrement happens when it receives `workflow_job` of `completed` status. The latter is used to make scaling-down faster so that you waste money less than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut `.
Please read the section `Example 3: Scale on each `workflow_job` event` in the updated version of our README for more information on its usage.
* Adding a default docker registry mirror
This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.
The change is backward compatible, if a runner has a docker registry
mirror specified, it will supersede the default one.
Add `shortNames` to kube api-resource CRDs. Short-names make it easier when interacting/troubleshooting api-resources with kubectl.
We have tried to follow the naming convention similar to what K8s uses which should help with avoiding any naming conflicts as well. For example:
* `Deployment` has a shortName of deploy, so added rdeploy for `runnerdeployment`
* `HorizontalPodAutoscaler` has a shortName of hpa, so added hra for `HorizontalRunnerAutoscaler`
* `ReplicaSets` has a shortName of rs, so added rrs for `runnerreplicaset`
Co-authored-by: abhinav454 <43758739+abhinav454@users.noreply.github.com>
* Add POC of GitHub Webhook Delivery Forwarder
* multi-forwarder and ctrl-c existing and fix for non-woring http post
* Rename source files
* Extract signal handling into a dedicated source file
* Faster ctrl-c handling
* Enable automatic creation of repo hook on startup
* Add support for forwarding org hook deliveries
* Set hook secret on hook creation via envvar (HOOK_SECRET)
* Fix org hook support
* Fix HOOK_SECRET for consistency
* Refactor to prepare for custom log position provider
* Refactor to extract inmemory log position provider
* Add configmap-based log position provider
* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder
* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg
* Refactor to extract logger initialization
* Add hookdeliveryforwarder README and bump go-github to unreleased ver
Improves #664 by adding annotations to the server's service. Beyond general applications, we use these annotations within my own projects to configure various LB values.
* Optional override of runner image in chart
This commit adds the option to override the actions runner image. This
allows running the controller in environments where access to Dockerhub
is restricted.
It uses the parameter [--runner-image](https://github.com/actions-runner-controller/actions-runner-controller/blob/master/main.go#L89) from the controller.
The default value is set as a constant
[here](acb906164b/main.go (L40)).
The default value for the chart is the same.
* Fixing actionsRunner name
... to actionsRunnerRepositoryAndTag for consistency.
* Bumping chart to v0.12.5
Previously the E2E test suite covered only RunnerSet. This refactors the existing E2E test code to extract the common test structure into a `env` struct and its methods, and use it to write two very similar tests, one for RunnerSet and another for RunnerDeployment.
There's no reason to create a non-working secret by default. If someone wants to deploy the secrets via the chart they will need to do some config regardless so they might as well also set the create flag
Enhances out existing E2E test suite to additionally support triggering two or more concurrent workflow jobs and verifying all the results, so that you can ensure the runners managed by the controller are able to handle jobs reliably when loaded.
This enhances the E2E test suite introduced in #658 to also include the following steps:
- Install GitHub Actions workflow
- Trigger a workflow run via a git commit
- Verify the workflow run result
In the workflow, we use `kubectl create cm --from-literal` to create a configmap that contains an unique test ID. In the last step we obtain the configmap from within the E2E test and check the test ID to match the expected one.
To install a GitHub Actions workflow, we clone a GitHub repository denoted by the TEST_REPO envvar, progmatically generate a few files with some Go code, run `git-add`, `git-commit`, and then `git-push` to actually push the files to the repository. A single commit containing an updated workflow definition and an updated file seems to run a workflow derived to the definition introduced in the commit, which was a bit surpirising and useful behaviour.
At this point, the E2E test fully covers all the steps for a GitHub token based installation. We need to add scenarios for more deployment options, like GitHub App, RunnerDeployment, HRA, and so on. But each of them would worth another pull request.
This is the initial version of our E2E test suite which is currently a subset of the acceptance test suite reimplemented in Go.
To run it, pass `-run ^TestE2E$` to `go test`, without `-short`, like `go test -timeout 600s -run ^TestE2E$ github.com/actions-runner-controller/actions-runner-controller/test/e2e -v`.
`make test` is modified to pass `-short` to `go test` by default to skip E2E tests.
The biggest benefit of rewriting the acceptance test in Go turned out to be the fact that you can easily rerun each step- a go-test "subtest"- individually from your IDE, for faster turnaround. Both VS Code and IntelliJ IDEA/GoLand are known to work.
In the near future, we will add more steps to the suite, like actually git-comminting some Actions workflow and pushing some commit to trigger a workflow run, and verify the workflow and job run results, and finally run it on our `test` workflow to fully automated E2E testing. But that s another story.
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.
It defaults to `RunnerDeployment` for backward compatibility.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
name: myhra
spec:
scaleTargetRef:
kind: RunnerSet
name: myrunnerset
```
Ref #629
Ref #613
Ref #612
This option expose internally some `KUBERNETES_*` environment variables
that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will
try to connect to the Kubernetes cluster where the runner it's running.
This option it's set by default to `true` in any Kubernetes deployment.
Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>
* feat: RunnerSet backed by StatefulSet
Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens.
Resolves#613
Ref #612
* Upgrade controller-runtime to 0.9.0
* Bump Go to 1.16.x following controller-runtime 0.9.0
* Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup
* Fix startup failure due to missing LeaderElectionID
* Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered
* Allow force-updating statefulset
* Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false
* Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec
* Enable running acceptance tests against arbitrary kind cluster
* RunnerSet supports non-ephemeral runners only today
* fix: docker-build from root Makefile on intel mac
* fix: arch check fixes for mac and ARM
* ci: aligning test data format and patching checks
* fix: removing namespace in test data
* chore: adding more ignores
* chore: removing leading space in shebang
* Re-add metrics to org hra testdata
* Bump cert-manager to v1.1.1 and fix deploy.sh
Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
To prevent people from writing related and unrelated things to already closed issues. As a maitainer, that kind of situation only makes it harder to effectively provide user support. Please create another issue with concrete description of "your issue" and the reproduction steps, rather than commenting "me too" on unrelated issues!
Sets the privileged flag to false if SELinuxOptions are present/defined. This is needed because containerd treats SELinux and Privileged controls as mutually exclusive. Also see https://github.com/containerd/cri/blob/aa2d5a97c/pkg/server/container_create.go#L164.
This allows users who use SELinux for managing privileged processes to use GH Actions - otherwise, based on the SELinux policy, the Docker in Docker container might not be privileged enough.
Signed-off-by: Jonah Back <jonah@jonahback.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This allows using the `runtimeClassName` directive in the runner's spec.
One of the use-cases for this is Kata Containers, which use `runtimeClassName` in a pod spec as an indicator that the pod should run inside a Kata container. This allows us a greater degree of pod isolation.
* docs: adding upgrade notes for Helm
* chore: adding new ignore
* docs: add in cmd to check for stuck runners
* docs: better format
* docs: removing superfluous steps
* docs: moved location of docs
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
Adds a column to help the operator see if they configured HRA.Spec.ScheduledOverrides correctly, in a form of "next override schedule recognized by the controller":
```
$ k get horizontalrunnerautoscaler
NAME MIN MAX DESIRED SCHEDULE
actions-runner-aos-autoscaler 0 5 0
org 0 5 0 min=0 time=2021-05-21 15:00:00 +0000 UTC
```
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/484
This fixes human-readable output of `kubectl get` on `runnerdeployment`, `runnerreplicaset`, and `runner`.
Most notably, CURRENT and READY of runner replicasets are now computed and printed correctly. Runner deployments now have UP-TO-DATE and AVAILABLE instead of READY so that it is consistent with columns of K8s deployments.
A few fixes has been also made to runner deployment and runner replicaset controllers so that those numbers stored in Status objects are reliably updated and in-sync with actual values.
Finally, `AGE` columns are added to runnerdeployment, runnerreplicaset, runnner to make that more visible to users.
`kubectl get` outputs should now look like the below examples:
```
# Immediately after runnerdeployment updated/created
$ k get runnerdeployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
example-runnerdeploy 0 0 0 0 8d
org-runnerdeploy 5 5 5 0 8d
# A few dozens of seconds after update/create all the runners are registered that "available" numbers increase
$ k get runnerdeployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
example-runnerdeploy 0 0 0 0 8d
org-runnerdeploy 5 5 5 5 8d
```
```
$ k get runnerreplicaset
NAME DESIRED CURRENT READY AGE
example-runnerdeploy-wnpf6 0 0 0 61m
org-runnerdeploy-fsnmr 2 2 0 8m41s
```
```
$ k get runner
NAME ENTERPRISE ORGANIZATION REPOSITORY LABELS STATUS AGE
example-runnerdeploy-wnpf6-registration-only actions-runner-controller/mumoshu-actions-test Running 61m
org-runnerdeploy-fsnmr-n8kkx actions-runner-controller ["mylabel 1","mylabel 2"] 21s
org-runnerdeploy-fsnmr-sq6m8 actions-runner-controller ["mylabel 1","mylabel 2"] 21s
```
Fixes#490
`PercentageRunnersBusy`, in combination with a secondary `TotalInProgressAndQueuedWorkflowRuns` metric, enables scale-from-zero for PercentageRunnersBusy.
Please see the new `Autoscaling to/from 0` section in the updated documentation about how it works.
Resolves#522
This adds the initial version of ScheduledOverrides to HorizontalRunnerAutoscaler.
`MinReplicas` overriding should just work.
When there are two or more ScheduledOverrides, the earliest one that matched is activated. Each ScheduledOverride can be recurring or one-time. If you have two or more ScheduledOverrides, only one of them should be one-time. And the one-time override should be the earliest item in the list to make sense.
Tests will be added in another commit. Logging improvements and additional observability in HRA.Status will also be added in yet another commits.
Ref #484
Adds two types `RecurrenceRule` and `Period` and one function `MatchSchedule` as the foundation for building the upcoming ScheduledOverrides feature.
Ref #484
- Adds `ephemeral` option to `runner.spec`
```
....
template:
spec:
ephemeral: false
repository: mumoshu/actions-runner-controller-ci
....
```
- `ephemeral` defaults to `true`
- `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag
- Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`.
Resolves#457
Previously any non-go changes resulted in `make docker-build` rerunning time-consufming `go build`. This fixes that by adding clearly unnecessary files .dockerignore
- You can now use `make acceptance/run` to run only a specific acceptance test case
- Add note about Ubuntu 20.04 users / snap-provided docker
- Add instruction to run Ginkgo tests
- Extract acceptance/load from acceptance/kind
- Make `acceptance/pull` not depend on `docker-build`, so that you can do `make docker-build acceptance/load` for faster image reload
This is an attempt to support scaling from/to zero.
The basic idea is that we create a one-off "registration-only" runner pod on RunnerReplicaSet being scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding those.
GitHub Actions seems to immediately throw away the new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline` would prevent GitHub actions from failing jobs. But retaining `busy` or `idle` runners means that we need to keep runner pods running, which conflicts with our desired to scale to/from zero, hence we retain `offline` runners.
In this change, I enhanced the runnerreplicaset controller to create a registration-only runner on very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline`, until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.
Related to #447
* Fix acceptance helm test not using newly built controller image
* Locally build runner image instead of pulling it
* Revert runner controller image pull policy to always
and add a line to the test deployment to use IfNotPresent
* Change runner repository from summerwind/action-runner to the owner of actions-runner-controller.
Also fix some Makefile formatting.
* Undo renaming acceptance/pull to docker-pull
* Some env var cleanup
Rename USERNAME to DOCKER_USER(is still used for github too tho)
Add RUNNER_NAME var(defaults to $DOCKER_USER/actions-runner)
Add TEST_REPO(defaults to $DOCKER_USER/actions-runner-controller)
Changes:
- Switched to use `jq` in startup.sh
- Enable docker registry mirror configuration which is useful when e.g. avoiding the Docker Hub rate-limiting
Check #478 for how this feature is tested and supposed to be used.
* chore: adding Helm app version back
* chore: removing redundant values entry
* chore: bumping to newer version
* chore: bumping app version to latest
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Enable the user to set a limit size on the volume of the runner to avoid some runner pod affecting other resources of the same cluster
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Cache docker images locally
Cache dind, runner, and kube-rbac-proxy docker image on the host and copy onto the kind node instead of downloading it to the node directly.
* Also cache certmanager docker images
Images for `actions-runner:v${VERSION}` and `actions-runner:latest` tags are upgraded to Ubuntu 20.04.
If you would like not to upgrade Ubuntu in the runner image in the future, migrate to new tags suffixed with `-ubuntu-20.04` like`actions-runner:v${VERSION}-ubuntu-20.04`.
We also keep publishing the existing Ubuntu 18.04 images with new `actions-runner:v${VERSION}-ubuntu-18.04` tags. Please use it when it turned out that you had workflows dependent on Ubuntu 18.04.
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Prevents arguments from being split when e.g. the RUNNER_GROUP variable contains spaces (which is legit. One can create such groups in GitHub).
I've seen that all workers with group names that contain no spaces can register successfully, while all workers with groups that contain spaces will not register.
Furthermore, I suppose also other chars can be used here to inject arbitrary commands in an unsupported way via e.g. pipe symbol.
Quoting the vars correctly should prevent that and allow for e.g. group names and runner labels with spaces and other bash reserved characters.
This makes logging more concise by changing logger names to something like `controllers.Runner` to `actions-runner-controller.runner` after the standard `controller-rutime.controller` and reducing redundant logs by removing unnecessary requeues. I have also tweaked log messages so that their style is more consistent, which will also help readability. Also, runnerreplicaset-controller lacked useful logs so I have enhanced it.
As part of #282, I have introduced some caching mechanism to avoid excessive GitHub API calls due to the autoscaling calculation involving GitHub API calls is executed on each Webhook event.
Apparently, it was saving the wrong value in the cache- The value was one after applying `HRA.Spec.{Max,Min}Replicas` so manual changes to {Max,Min}Replicas doesn't affect RunnerDeployment.Spec.Replicas until the cache expires. This isn't what I had wanted.
This patch fixes that, by changing the value being cached to one before applying {Min,Max}Replicas.
Additionally, I've also updated logging so that you observe which number was fetched from cache, and what number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and what was the final number used as the desired-replicas(after applying {Min,Max}Replicas).
Follow-up for #282
Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded).
More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.
We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered.
A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.
This allows you to trigger autoscaling depending on check_run names(i.e. actions job names). If you are willing to differentiate scale amount only for a specific job, or want to scale only on a specific job, try this.
PercentageRunnerBusy seems to have regressed since #355 due to that RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with empty selector.
Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-795200205
* so far, runner group parameter is only set for runners with org scope
* now set group for enterprise runners as well
* removed null check for org scope as either org or enterprise will be set
We sometimes see that integration test fails due to runner replicas not meeting the expected number in a timely manner. After investigating a bit, this turned out to be due to that HRA updates on webhook-based autoscaler and HRA controller are conflicting. This changes the controllers to use Patch instead of Update to make conflicts less likely to happen.
I have also updated the hra controller to use Patch when updating RunnerDeployment, too.
Overall, these changes should make the webhook-based autoscaling more reliable due to less conflicts.
* add replicaCount
* Add authSecret.existingSecret
* set image.tag null by default
* implement ingress for githubwebhook server
* fix deprecated and secretName template
* backward compat .authSecret.enabled
* existingSecret for github webhook secret
* use secretName template
* set default secret names
* do not use app version based image tag
* create and name variable for secrets
Similar to #348 for #346, but for HRA.Spec.CapacityReservations usually modified by the webhook-based autoscaler controller.
This patch tries to fix that by improving the webhook-based autoscaler controller to omit expired reservations on updating HRA spec.
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However the value had been not taking capacityReservations into account, which resulted in showing incorrect count when you used webhook-based autoscaler, or capacityReservations API directly. This fixes that.
The controller had been writing confusing messages like the below on missing scale target:
```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```
This fixes that, while improving many kinds of messages written while reconcilation, so that the error message is more actionable.
* if a new runner pod was just scheduled to start up right before a
registration expired, it will not get a new registration token and go in
an infinite update loop (until #341) kicks in
* if registzration tokens got updated a little bit before they actually
expired, just starting up pods will way more likely get a working token
We occasionally see logs like the below:
```
2021-02-24T02:48:26.769ZERRORFailed to update runner status{"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769ZERRORcontroller-runtime.controllerReconciler error{"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```
which can be compacted into one-liner, without the useless stack trace, without double-logging the same error from the logger and the controller.
Apparently I have mistakenly removed `push` option from the workflow in #323 which resulted in new runner build #323 not being pushed. This fixes that.
* if a runner pod starts up with an invalid token, it will go in an
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding
runner objects exists in GitHub and the pod will get removed after
registration timeout
* if the GitHub runner object already existed before - e.g. because a
finalizer was not properly run as part of a partial Kubernetes crash,
the runner will always stay in a running mode, even updating the
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart,
only do additional restart checks if no restart was already decided,
making code a bit cleaner and saving GitHub API calls after each job
completion
Changes:
1. Fix length of github-webhook-server port name
2. Add a cluster role binding for github-webhook-server
3. Remove --enable-leader-election from github-webhook-server
* Update runner to 2.277.1
* Update build-and-release-runners.yml
* integration test condition
Don't run integration tests when only updating the runner image
* fixup! integration test condition
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* so far, only push events would trigger the DockerHub login step
* hence, attempts to release would fail because of a permission problem (tested locally)
* adding OR condition to also login in case a release got published
I have heard from some user that they have hundred thousands of `status=completed` workflow runs in their repository which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because of GitHub API rate limit due to excessive paginated requests.
This fixes that by separating list-workflow-runs calls to two - one for `queued` and one for `in_progress`, which can make the minimum API call from 1 to 2, but allows it to work regardless of number of remaining `completed` workflow runs.
* the reconciliation loop is often much faster than the runner startup,
so changing runner not found related messages to debug and also add the
possibility that the runner just needs more time
* as pointed out in #281 the currently used image for the
kube-rbac-proxy - gcr.io/kubebuilder/kube-rbac-proxy:v0.4.1" - does not
have an ARM64 image
* hence, trying to use the standard deployment manifest / helm char will
fail on ARM64 systems
* replaced image with quay.io/brancz/kube-rbac-proxy:v0.8.0 which is the
latest version from the upstream maintainer
(https://github.com/brancz/kube-rbac-proxy/blob/master/Makefile#L13)
* successfully tested on both AMD64 and ARM64 clusters
* fixes#281
* errors.Is compares all members of a struct to return true which never
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which lead to
unregistering runners after the registration timeout
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once #307 gets merged
* if a k8s node becomes unresponsive, the kube controller will soft
delete all pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave the
last status and hence the runner controller will assume that everything
is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks
that none of its runners are currently busy and will not schedule any
further runners / pods, resulting in a broken runner deployment until the
runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner
controller will try to reboot the runner and create a pod on a working
node
* use forceful deletion and requeue for pods that have been stuck for
more than one minute in terminating state
* gracefully handling race conditions if pod gets finally forcefully deleted within
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes(currently hard-coded).
It should alleviate #288 in case the root cause is some kind of transient failures(network unreliability, GitHub down, temporarly compute resource shortage, etc).
Formerly you had to manually detect and delete such pods or even force-delete corresponding runners to unblock the controller.
Since this enhancement, the controller does the pod deletion automatically after 10 minutes after pod creation, which result in the controller create another pod that might work.
Ref #288
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it only fetched and considered only the first 30 workflow runs at the reconcilation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.
This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* bug-fix: patched dir owned by runner
* always build with latest runner version
* Revert "always build with latest runner version"
This reverts commit e719724ae9fe92a12d4a087185cf2a2ff543a0dd.
* Also patch dindrunner.Dockerfile
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Added GITHUB.RUN_NUMBER to DockerHub push
* switch run_number to sha on docker tag
* re-add mutable tags for backwards compatability
* truncate to short SHA (7 chars)
* behaviour workaround
* use ENV to define sha_short
* use ::set-output to define sha_short
* bump action
* feat/helm: Bump appVersion to 0.6.1 release
* Also bump chart version to trigger a new chart release
Co-authored-by: Yusuke Kuoka <c-ykuoka@zlab.co.jp>
* Add chart workflows (#1)
* Add chart workflows
* Fix publishing step in CI
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Update CI on push-to-master (#3)
* Put helm installation step in the correct CI job
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Put helm installation step in the correct CI job (#4)
* Update on-push-master-publish-chart.yml
* Remove references to certmanager dependency
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Add ability to customize kube-rbac-proxy image
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Only install cert-manager if we're going to spin up KinD
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* when setting a GitHub Enterprise server URL without a namespace, an error occurs: "error: the server doesn't have a resource type "controller-manager"
* setting default namespace "actions-runner-system" makes the example work out of the box
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what
* before this change, if the number of runners was much larger than the max number, the applied scale down factor might still result in a desired value > maxReplicas
* if for resource constraints in the cluster, runners would be permanently restarted, the number of runners could go up more than the reverse scale down factor until the next reconciliation round, resulting in a situation where the number of runners climbs up even though it should actually go down
* by checking whether the desiredReplicas is always <= maxReplicas, infinite scaling up loops can be prevented
* feat: adding maanger secret to Helm
* fix: correcting secret data format
* feat: adding in common labels
* fix: updating default values to have config
The auth config needs to be commented out by default as we don't want to deploy both configs empty. This may break stuff, so we want the user to actively uncomment the auth method they want instead
* chore: updating default format of cert
* chore: wording
One of the pod recreation conditions has been modified to use hash of runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS that requires its admission webhook to mutate the runner pod to have additional, IRSA-related volumes, volume mounts and env.
Resolves#200
It turned out previous versions of runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One of notable examples of such actions is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby).
This change adds the support for those actions, by setting up AGENT_TOOLSDIRECTORY and installing libyaml-dev within runner images.
description:Please check all the boxes below before submitting
options:
- label:I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
required:true
- label:I am using charts that are officially provided
- type:input
id:controller-version
attributes:
label:Controller Version
description:Refers to semver-like release tags for controller versions. Any release tags prefixed with `gha-runner-scale-set-` are releases associated with this API group
placeholder:ex. 0.6.1
validations:
required:true
- type:dropdown
id:deployment-method
attributes:
label:Deployment Method
description:Which deployment method did you use to install ARC?
options:
- Helm
- Kustomize
- ArgoCD
- Other
validations:
required:true
- type:checkboxes
id:checks
attributes:
label:Checks
description:Please check all the boxes below before submitting
options:
- label:This isn't a question or user support case (For Q&A and community support, go to [Discussions](https://github.com/actions/actions-runner-controller/discussions)).
required:true
- label:I've read the [Changelog](https://github.com/actions/actions-runner-controller/blob/master/docs/gha-runner-scale-set-controller/README.md#changelog) before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
required:true
- type:textarea
id:reproduction-steps
attributes:
label:To Reproduce
description:"Steps to reproduce the behavior"
render:markdown
placeholder:|
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
validations:
required:true
- type:textarea
id:actual-behavior
attributes:
label:Describe the bug
description:Also tell us, what did happen?
placeholder:A clear and concise description of what happened.
validations:
required:true
- type:textarea
id:expected-behavior
attributes:
label:Describe the expected behavior
description:Also tell us, what did you expect to happen?
placeholder:A clear and concise description of what the expected behavior is.
validations:
required:true
- type:textarea
id:additional-context
attributes:
label:Additional Context
render:yaml
description:|
Provide `values.yaml` files that are relevant for this issue. PLEASE REDACT ANY INFORMATION THAT SHOULD NOT BE PUBLICALY AVAILABLE, LIKE GITHUB TOKEN FOR EXAMPLE.
placeholder:|
PLEASE REDACT ANY INFORMATION THAT SHOULD NOT BE PUBLICALY AVAILABLE, LIKE GITHUB TOKEN FOR EXAMPLE.
validations:
required:true
- type:textarea
id:controller-logs
attributes:
label:Controller Logs
description:"NEVER EVER OMIT THIS! Include complete logs from `actions-runner-controller`'s controller-manager pod."
render:shell
placeholder:|
PROVIDE THE LOGS VIA A GIST LINK (https://gist.github.com/), NOT DIRECTLY IN THIS TEXT AREA
name:Bug Report (actions.summerwind.net API group)
description:File a bug report for actions.summerwind.net API group
title:"<Please write what didn't work for you here>"
labels:["bug","needs triage","community"]
body:
- type:checkboxes
id:read-troubleshooting-guide
attributes:
label:Checks
description:Please check all the boxes below before submitting
options:
- label:I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
required:true
- label:I'm not using a custom entrypoint in my runner image
required:true
- type:input
id:controller-version
attributes:
label:Controller Version
description:Refer to semver-like release tags for controller versions. Any release tags prefixed with `actions-runner-controller-` are for chart releases
placeholder:ex. 0.18.2 or git commit ID
validations:
required:true
- type:input
id:chart-version
attributes:
label:Helm Chart Version
description:Run `helm list` and see what's shown under CHART VERSION. Any release tags prefixed with `actions-runner-controller-` are for chart releases
placeholder:ex. 0.11.0
- type:input
id:cert-manager-version
attributes:
label:CertManager Version
description:Run `kubectl get po -o yaml $CERT_MANAGER_POD` and see the image tag, or run `helm list` and see what's shown under APP VERSION for your cert-manager Helm release.
placeholder:ex. 1.8
- type:dropdown
id:deployment-method
attributes:
label:Deployment Method
description:Which deployment method did you use to install ARC?
options:
- Helm
- Kustomize
- ArgoCD
- Other
validations:
required:true
- type:textarea
id:cert-manager
attributes:
label:cert-manager installation
description:Confirm that you've installed cert-manager correctly by answering a few questions
placeholder:|
- Did you follow https://github.com/actions/actions-runner-controller#installation? If not, describe the installation process so that we can reproduce your environment.
- Are you sure you've installed cert-manager from an official source?
(Note that we won't provide user support for cert-manager itself. Make sure cert-manager is fully working before testing ARC or reporting a bug
validations:
required:true
- type:checkboxes
id:checks
attributes:
label:Checks
description:Please check all the boxes below before submitting
options:
- label:This isn't a question or user support case (For Q&A and community support, go to [Discussions](https://github.com/actions/actions-runner-controller/discussions). It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
required:true
- label:I've read [releasenotes](https://github.com/actions/actions-runner-controller/tree/master/docs/releasenotes) before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
required:true
- label:My actions-runner-controller version (v0.x.y) does support the feature
required:true
- label:I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
required:true
- label:I've migrated to the workflow job webhook event (if you using webhook driven scaling)
required:true
- type:textarea
id:resource-definitions
attributes:
label:Resource Definitions
description:"Add copy(s) of your resource definition(s) (RunnerDeployment or RunnerSet, and HorizontalRunnerAutoscaler. If RunnerSet, also include the StorageClass being used)"
render:yaml
placeholder:|
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: example
spec:
#snip
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
name: example
spec:
#snip
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: example
provisioner: ...
reclaimPolicy: ...
volumeBindingMode: ...
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
name:
spec:
#snip
validations:
required:true
- type:textarea
id:reproduction-steps
attributes:
label:To Reproduce
description:"Steps to reproduce the behavior"
render:markdown
placeholder:|
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
validations:
required:true
- type:textarea
id:actual-behavior
attributes:
label:Describe the bug
description:Also tell us, what did happen?
placeholder:A clear and concise description of what happened.
validations:
required:true
- type:textarea
id:expected-behavior
attributes:
label:Describe the expected behavior
description:Also tell us, what did you expect to happen?
placeholder:A clear and concise description of what the expected behavior is.
validations:
required:true
- type:textarea
id:controller-logs
attributes:
label:Whole Controller Logs
description:"NEVER EVER OMIT THIS! Include logs from `actions-runner-controller`'s controller-manager pod. Don't omit the parts you think irrelevant!"
render:shell
placeholder:|
PROVIDE THE LOGS VIA A GIST LINK (https://gist.github.com/), NOT DIRECTLY IN THIS TEXT AREA
To grab controller logs:
# Set NS according to your setup
NS=actions-runner-system
# Grab the pod name and set it to $POD_NAME
kubectl -n $NS get po
kubectl -n $NS logs $POD_NAME > arc.log
validations:
required:true
- type:textarea
id:runner-pod-logs
attributes:
label:Whole Runner Pod Logs
description:"Include logs from runner pod(s). Please don't omit the parts you think irrelevant!"
render:shell
placeholder:|
PROVIDE THE WHOLE LOGS VIA A GIST LINK (https://gist.github.com/), NOT DIRECTLY IN THIS TEXT AREA
To grab the runner pod logs:
# Set NS according to your setup. It should match your RunnerDeployment's metadata.namespace.
NS=default
# Grab the name of the problematic runner pod and set it to $POD_NAME
If any of the containers are getting terminated immediately, try adding `--previous` to the kubectl-logs command to obtain logs emitted before the termination.
validations:
required:true
- type:textarea
id:additional-context
attributes:
label:Additional Context
description:|
Add any other context about the problem here.
Tip: You can attach images or log files by clicking this area to highlight it and then dragging files in.
*A clear and concise description of what you want to happen.*
Note: Feature requests to integrate vendor specific cloud tools (e.g. `awscli`, `gcloud-sdk`, `azure-cli`) will likely be rejected as the Runner image aims to be vendor agnostic.
### Why is this needed?
*A clear and concise description of any alternative solutions or features you've considered.*
### Additional context
*Add any other context or screenshots about the feature request here.*
This is the template of actions-runner-controller's release notes.
Whenever a new release is made, I start by manually copy-pasting this template onto the GitHub UI for creating the release.
I then walk-through all the changes, take sometime to think abount best one-sentence explanations to tell the users about changes, write it all,
and click the publish button.
If you think you can improve future release notes in any way, please do submit a pull request to change the template below.
Note that even though it looks like a Go template, I don't use any templating to generate the changelog.
It's just that I'm used to reading and intepreting Go template by myself, not a computer program :)
**Title**:
```
v{{ .Version }}: {{ .TitlesOfImportantChanges }}
```
**Body**:
```
**CAUTION:** If you're using the Helm chart, beware to review changes to CRDs and do manually upgrade CRDs! Helm installs CRDs only on installing a chart. It doesn't automatically upgrade CRDs. Otherwise you end up with troubles like #427, #467, and #468. Please refer to the [UPGRADING](charts/actions-runner-controller/docs/UPGRADING.md) docs for the latest process.
This release includes the following changes from contributors. Thank you!
- @{{ .GitHubUser }} fixed {{ .Feature }} to not break when ... (#{{ .PullRequestNumber }})
// Try to find the workflow run triggered by the previous step using the workflow_dispatch event.
// - Find recently create workflow runs in the test repository
// - For each workflow run, list its workflow job and see if the job's labels contain `inputs.arc-name`
// - Since the inputs.arc-name should be unique per e2e workflow run, once we find the job with the label, we find the workflow that we just triggered.
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms))
}
const owner = '${{inputs.repo-owner}}'
const repo = '${{inputs.repo-name}}'
const workflow_id = '${{inputs.workflow-file}}'
let workflow_run_id = 0
let workflow_job_id = 0
let workflow_run_html_url = ""
let count = 0
while (count++<12) {
await sleep(10 * 1000);
let listRunResponse = await github.rest.actions.listWorkflowRuns({
This document is the single source of truth for how to contribute to the code base.
Feel free to browse the [open issues](https://github.com/actions/actions-runner-controller/issues) or file a new one, all feedback is welcome!
By reading this guide, we hope to give you all of the information you need to be able to pick up issues, contribute new features, and get your work
reviewed and merged.
## Before contributing code
We welcome code patches, but to make sure things are well coordinated you should discuss any significant change before starting the work. The maintainers ask that you signal your intention to contribute to the project using the issue tracker. If there is an existing issue that you want to work on, please let us know so we can get it assigned to you. If you noticed a bug or want to add a new feature, there are issue templates you can fill out.
When filing a feature request, the maintainers will review the change and give you a decision on whether we are willing to accept the feature into the project.
For significantly large and/or complex features, we may request that you write up an architectural decision record ([ADR](https://github.blog/2020-08-13-why-write-adrs/)) detailing the change.
Please use the [template](/docs/adrs/yyyy-mm-dd-TEMPLATE) as guidance.
<!--
TODO: Add a pre-requisite section describing what developers should
install in order get started on ARC.
-->
## How to Contribute a Patch
Depending on what you are patching depends on how you should go about it.
Below are some guides on how to test patches locally as well as develop the controller and runners.
When submitting a PR for a change please provide evidence that your change works as we still need to work on improving the CI of the project.
Some resources are provided for helping achieve this, see this guide for details.
### Developing the Controller
Rerunning the whole acceptance test suite from scratch on every little change to the controller, the runner, and the chart would be counter-productive.
To make your development cycle faster, use the below command to update deploy and update all the three:
```shell
# Let assume we have all other envvars like DOCKER_USER, GITHUB_TOKEN already set,
# The below command will (re)build `actions-runner-controller:controller1` and `actions-runner:runner1`,
# load those into kind nodes, and then rerun kubectl or helm to install/upgrade the controller,
# and finally upgrade the runner deployment to use the new runner image.
#
# As helm 3 and kubectl is unable to recreate a pod when no tag change,
# you either need to bump VERSION and RUNNER_TAG on each run,
# or manually run `kubectl delete pod $POD` on respective pods for changes to actually take effect.
# Makefile
VERSION=controller1 \
RUNNER_TAG=runner1 \
make acceptance/pull acceptance/kind docker-buildx acceptance/load acceptance/deploy
```
If you've already deployed actions-runner-controller and only want to recreate pods to use the newer image, you can run:
```shell
# Makefile
NAME=$DOCKER_USER/actions-runner-controller \
make docker-build acceptance/load &&\
kubectl -n actions-runner-system delete po $(kubectl -n actions-runner-system get po -ojsonpath={.items[*].metadata.name})
```
Similarly, if you'd like to recreate runner pods with the newer runner image you can use the runner specific [Makefile](runner/Makefile) to build and / or push new runner images
```shell
# runner/Makefile
NAME=$DOCKER_USER/actions-runner make \
-C runner docker-{build,push}-ubuntu &&\
(kubectl get po -ojsonpath={.items[*].metadata.name}| xargs -n1 kubectl delete po)
```
### Developing the Runners
#### Tests
A set of example pipelines (./acceptance/pipelines) are provided in this repository which you can use to validate your runners are working as expected.
When raising a PR please run the relevant suites to prove your change hasn't broken anything.
#### Running Ginkgo Tests
You can run the integration test suite that is written in Ginkgo with:
```shell
make test-with-deps
```
This will firstly install a few binaries required to setup the integration test environment and then runs `go test` to start the Ginkgo test.
If you don't want to use `make`, like when you're running tests from your IDE, install required binaries to `/usr/local/kubebuilder/bin`.
That's the directory in which controller-runtime's `envtest` framework locates the binaries.
go test -v -run TestAPIs github.com/actions/actions-runner-controller/controllers/actions.summerwind.net
```
To run Ginkgo tests selectively, set the pattern of target test names to `GINKGO_FOCUS`.
All the Ginkgo test that matches `GINKGO_FOCUS` will be run.
```shell
GINKGO_FOCUS='[It] should create a new Runner resource from the specified template, add a another Runner on replicas increased, and removes all the replicas when set to 0'\
go test -v -run TestAPIs github.com/actions/actions-runner-controller/controllers/actions.summerwind.net
```
### Running End to End Tests
> **Notes for Ubuntu 20.04+ users**
>
> If you're using Ubuntu 20.04 or greater, you might have installed `docker` with `snap`.
>
> If you want to stick with `snap`-provided `docker`, do not forget to set `TMPDIR` to somewhere under `$HOME`.
> Otherwise `kind load docker-image` fail while running `docker save`.
> See <https://kind.sigs.k8s.io/docs/user/known-issues/#docker-installed-with-snap> for more information.
To test your local changes against both PAT and App based authentication please run the `acceptance` make target with the authentication configuration details provided:
```shell
# This sets `VERSION` envvar to some appropriate value
. hack/make-env.sh
DOCKER_USER=*** \
GITHUB_TOKEN=*** \
APP_ID=*** \
PRIVATE_KEY_FILE_PATH=path/to/pem/file \
INSTALLATION_ID=*** \
make acceptance
```
#### Rerunning a failed test
When one of tests run by `make acceptance` failed, you'd probably like to rerun only the failed one.
It can be done by `make acceptance/run` and by setting the combination of `ACCEPTANCE_TEST_DEPLOYMENT_TOOL=helm|kubectl` and `ACCEPTANCE_TEST_SECRET_TYPE=token|app` values that failed (note, you just need to set the corresponding authentication configuration in this circumstance)
In the example below, we rerun the test for the combination `ACCEPTANCE_TEST_DEPLOYMENT_TOOL=helm ACCEPTANCE_TEST_SECRET_TYPE=token` only:
```shell
DOCKER_USER=*** \
GITHUB_TOKEN=*** \
ACCEPTANCE_TEST_DEPLOYMENT_TOOL=helm \
ACCEPTANCE_TEST_SECRET_TYPE=token \
make acceptance/run
```
#### Testing in a non-kind cluster
If you prefer to test in a non-kind cluster, you can instead run:
```shell
KUBECONFIG=path/to/kubeconfig \
DOCKER_USER=*** \
GITHUB_TOKEN=*** \
APP_ID=*** \
PRIVATE_KEY_FILE_PATH=path/to/pem/file \
INSTALLATION_ID=*** \
ACCEPTANCE_TEST_SECRET_TYPE=token \
make docker-build acceptance/setup \
acceptance/deploy \
acceptance/tests
```
### Code conventions
Before shipping your PR, please check the following items to make sure CI passes.
- Run `go mod tidy` if you made changes to dependencies.
- Format the code using `gofmt`
- Run the `golangci-lint` tool locally.
- We recommend you use `make lint` to run the tool using a Docker container matching the CI version.
### Opening the Pull Request
Send PR, add issue number to description
## Helm Version Changes
In general we ask you not to bump the version in your PR.
The maintainers will manage releases and publishing new charts.
## Testing Controller Built from a Pull Request
We always appreciate your help in testing open pull requests by deploying custom builds of actions-runner-controller onto your own environment, so that we are extra sure we didn't break anything.
It is especially true when the pull request is about GitHub Enterprise, both GHEC and GHES, as [maintainers don't have GitHub Enterprise environments for testing](docs/about-arc.md#github-enterprise-support).
The process would look like the below:
- Clone this repository locally
- Checkout the branch. If you use the `gh` command, run `gh pr checkout $PR_NUMBER`
- Run `NAME=$DOCKER_USER/actions-runner-controller VERSION=canary make docker-build docker-push` for a custom container image build
- Update your actions-runner-controller's controller-manager deployment to use the new image, `$DOCKER_USER/actions-runner-controller:canary`
Please also note that you need to replace `$DOCKER_USER` with your own DockerHub account name.
## Release process
Only the maintainers can release a new version of actions-runner-controller, publish a new version of the helm charts, and runner images.
All release workflows have been moved to [actions-runner-controller/releases](https://github.com/actions-runner-controller/releases) since the packages are owned by the former organization.
### Workflow structure
Following the migration of actions-runner-controller into GitHub actions, all the workflows had to be modified to accommodate the move to a new organization. The following table describes the workflows, their purpose and dependencies.
| gha-e2e-tests.yaml | (gha) E2E Tests | Tests the Autoscaling Runner Set mode end to end. Coverage is restricted to this mode. Legacy modes are not tested. |
| go.yaml | Format, Lint, Unit Tests | Formats, lints and runs unit tests for the entire codebase. |
| arc-publish.yaml | Publish ARC Image | Uploads release/actions-runner-controller.yaml as an artifact to the newly created release and triggers the [build and publication of the controller image](https://github.com/actions-runner-controller/releases/blob/main/.github/workflows/publish-arc.yaml) |
| global-publish-canary.yaml | Publish Canary Images | Builds and publishes canary controller container images for both new and legacy modes. |
| gha-publish-chart.yaml | (gha) Publish Helm Charts | Packages and publishes charts/gha-runner-scale-set-controller and charts/gha-runner-scale-set charts (OCI to GHCR) |
| arc-release-runners.yaml | Release ARC Runner Images | Triggers [release-runners.yaml](https://github.com/actions-runner-controller/releases/blob/main/.github/workflows/release-runners.yaml) which will build and push new runner images used with the legacy ARC modes. |
| global-run-codeql.yaml | Run CodeQL | Run CodeQL on all the codebase |
| global-run-first-interaction.yaml | First Interaction | Informs first time contributors what to expect when they open a new issue / PR |
| global-run-stale.yaml | Run Stale Bot | Closes issues / PRs without activity |
| arc-update-runners-scheduled.yaml | Runner Updates Check (Scheduled Job) | Polls [actions/runner](https://github.com/actions/runner) and [actions/runner-container-hooks](https://github.com/actions/runner-container-hooks) for new releases. If found, a PR is created to publish new runner images |
| arc-validate-chart.yaml | Validate Helm Chart | Run helm chart validators for charts/actions-runner-controller |
| gha-validate-chart.yaml | (gha) Validate Helm Charts | Run helm chart validators for charts/gha-runner-scale-set-controller and charts/gha-runner-scale-set charts |
| arc-validate-runners.yaml | Validate ARC Runners | Run validators for runners |
#### Releasing legacy actions-runner-controller image and helm charts
1. Start by making sure the master branch is stable and all CI jobs are passing
2. Create a new release in <https://github.com/actions/actions-runner-controller/releases> (Draft a new release)
3. Bump up the `version` and `appVersion` in charts/actions-runner-controller/Chart.yaml - make sure the `version` matches the release version you just created. (Example: <https://github.com/actions/actions-runner-controller/pull/2577>)
4. When the workflows finish execution, you will see:
1. A new controller image published to: <https://github.com/actions-runner-controller/actions-runner-controller/pkgs/container/actions-runner-controller>
2. Helm charts published to: <https://github.com/actions-runner-controller/actions-runner-controller.github.io/tree/master/actions-runner-controller> (the index.yaml file is updated)
When a new release is created, the [Publish ARC Image](https://github.com/actions/actions-runner-controller/blob/master/.github/workflows/arc-publish.yaml) workflow is triggered.
| `runner_version` | The version of the [actions/runner](https://github.com/actions/runner) to use | `2.300.2` |
| `docker_version` | The version of docker to use | `20.10.12` |
| `runner_container_hooks_version` | The version of [actions/runner-container-hooks](https://github.com/actions/runner-container-hooks) to use | `0.2.0` |
| `sha` | The commit sha from [actions/actions-runner-controller](https://github.com/actions/actions-runner-controller) to be used to build the runner images. This will be provided to `actions/checkout`& used to tag the container images | Empty string. |
| `push_to_registries` | Whether to push the images to the registries. Use false to test the build | false |
#### Release gha-runner-scale-set-controller image and helm charts
1. Make sure the master branch is stable and all CI jobs are passing
1. Prepare a release PR (example: <https://github.com/actions/actions-runner-controller/pull/2467>)
1. Bump up the version of the chart in: charts/gha-runner-scale-set-controller/Chart.yaml
2. Bump up the version of the chart in: charts/gha-runner-scale-set/Chart.yaml
1. Make sure that `version`, `appVersion` of both charts are always the same. These versions cannot diverge.
3. Update the quickstart guide to reflect the latest versions: docs/preview/gha-runner-scale-set-controller/README.md
4. Add changelog to the PR as well as the quickstart guide
1. Merge the release PR
1. Manually trigger the [(gha) Publish Helm Charts](https://github.com/actions/actions-runner-controller/actions/workflows/gha-publish-chart.yaml) workflow
1. Manually create a tag and release in [actions/actions-runner-controller](https://github.com/actions/actions-runner-controller/releases) with the format: `gha-runner-scale-set-x.x.x` where the version (x.x.x) matches that of the Helm chart
| `ref` | The branch, tag or SHA to cut a release from. | default branch |
| `release_tag_name` | The tag of the controller image. This is not a git tag. | canary |
| `push_to_registries` | Push images to registries. Use false to test the build process. | false |
| `publish_gha_runner_scale_set_controller_chart` | Publish new helm chart for gha-runner-scale-set-controller. This will push the new OCI archive to GHCR | false |
| `publish_gha_runner_scale_set_chart` | Publish new helm chart for gha-runner-scale-set. This will push the new OCI archive to GHCR | false |
#### Release actions/runner image
A new runner image is built and published to <https://github.com/actions/runner/pkgs/container/actions-runner> whenever a new runner binary has been released. There's nothing to do here.
#### Canary releases
We publish canary images for both the legacy actions-runner-controller and gha-runner-scale-set-controller images.
# We use -count=1 instead of `go clean -testcache`
# See https://terratest.gruntwork.io/docs/testing-best-practices/avoid-test-caching/
.PHONY:e2e
e2e:
go test -count=1 -v -timeout 600s -run '^TestE2E$$' ./test/e2e
# Upload release file to GitHub.
github-release:release
ghr ${VERSION} release/
# find or download controller-gen
# download controller-gen if necessary
# Find or download controller-gen
#
# Note that controller-gen newer than 0.4.1 is needed for https://github.com/kubernetes-sigs/controller-tools/issues/444#issuecomment-680168439
# Otherwise we get errors like the below:
# Error: failed to install CRD crds/actions.summerwind.dev_runnersets.yaml: CustomResourceDefinition.apiextensions.k8s.io "runnersets.actions.summerwind.dev" is invalid: [spec.validation.openAPIV3Schema.properties[spec].properties[template].properties[spec].properties[containers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property, spec.validation.openAPIV3Schema.properties[spec].properties[template].properties[spec].properties[initContainers].items.properties[ports].items.properties[protocol].default: Required value: this property is in x-kubernetes-list-map-keys, so it must have a default or be a required property]
#
# Note that controller-gen newer than 0.6.0 is needed due to https://github.com/kubernetes-sigs/controller-tools/issues/448
# Otherwise ObjectMeta embedded in Spec results in empty on the storage.
[GitHub Actions](https://github.com/features/actions) is a very useful tool for automating development. GitHub Actions jobs are run in the cloud by default, but you may want to run your jobs in your environment. [Self-hosted runner](https://github.com/actions/runner) can be used for such use cases, but requires the provisioning and configuration of a virtual machine instance. Instead if you already have a Kubernetes cluster, it makes more sense to run the self-hosted runner on top of it.
Actions Runner Controller (ARC) is a Kubernetes operator that orchestrates and scales self-hosted runners for GitHub Actions.
**actions-runner-controller** makes that possible. Just create a *Runner* resource on your Kubernetes, and it will run and operate the self-hosted runner for the specified repository. Combined with Kubernetes RBAC, you can also build simple Self-hostedrunners as a Service.
With ARC, you can create runner scale sets that automatically scale based on the number of workflows running in your repository, organization, or enterprise. Because controlled runners can be ephemeral and based on containers, new runner instances can scale up or down rapidly and cleanly. For more information about autoscaling, see ["Autoscaling with self-hosted runners."](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners)
## Installation
You can set up ARC on Kubernetes using Helm, then create and run a workflow that uses runner scale sets. For more information about runner scale sets, see ["Deploying runner scale sets with Actions Runner Controller."](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#runner-scale-set)
## People
actions-runner-controller uses [cert-manager](https://cert-manager.io/docs/installation/kubernetes/) for certificate management of Admission Webhook. Make sure you have already installed cert-manager before you install. The installation instructions for cert-manager can be found below.
Actions Runner Controller (ARC) is an open-source project currently developed and maintained in collaboration with the GitHub Actions team, external maintainers @mumoshu and @toast-gear, various [contributors](https://github.com/actions/actions-runner-controller/graphs/contributors), and the [awesome community](https://github.com/actions/actions-runner-controller/discussions).
- [Installing cert-manager on Kubernetes](https://cert-manager.io/docs/installation/kubernetes/)
If you think the project is awesome and is adding value to your business, please consider directly sponsoring [community maintainers](https://github.com/sponsors/actions-runner-controller) and individual contributors via GitHub Sponsors.
Install the custom resource and actions-runner-controller itself. This will create actions-runner-system namespace in your Kubernetes and deploy the required resources.
In case you are already the employer of one of contributors, sponsoring via GitHub Sponsors might not be an option. Just support them in other means!
See [the sponsorship dashboard](https://github.com/sponsors/actions-runner-controller) for the former and the current sponsors.
### Github Enterprise support
## Getting Started
If you use either Github Enterprise Cloud or Server (and have recent enought version supporting Actions), you can use **actions-runner-controller** with those, too. Authentication works same way as with public Github (repo and organization level).
To give ARC a try with just a handful of commands, Please refer to the [Quickstart guide](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller).
```shell
kubectl set env deploy controller-manager -c manager GITHUB_ENTERPRISE_URL=<GHEC/S URL>
```
For an overview of ARC, please refer to [About ARC](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller)
[Enterprise level](https://docs.github.com/en/enterprise-server@2.22/actions/hosting-your-own-runners/adding-self-hosted-runners#adding-a-self-hosted-runner-to-an-enterprise) runners are not working yet as there's no API definition for those.
With the introduction of [autoscaling runner scale sets](https://github.com/actions/actions-runner-controller/discussions/2775), the existing [autoscaling modes](./docs/automatically-scaling-runners.md) are now legacy. The legacy modes have certain use cases and will continue to be maintained by the community only.
## Setting up authentication with GitHub API
For further information on what is supported by GitHub and what's managed by the community, please refer to [this announcement discussion.](https://github.com/actions/actions-runner-controller/discussions/2775)
There are two ways for actions-runner-controller to authenticate with the GitHub API:
### Documentation
1. Using GitHub App.
2. Using Personal Access Token.
ARC documentation is available on [docs.github.com](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller).
Regardless of which authentication method you use, the same permissions are required, those permissions are:
- Repository: Administration (read/write)
- Repository: Actions (read)
- Organization: Self-hosted runners (read/write)
### Legacy documentation
The following documentation is for the legacy autoscaling modes that continue to be maintained by the community
**NOTE: It is extremely important to only follow one of the sections below and not both.**
- [Quickstart guide](/docs/quickstart.md)
- [About ARC](/docs/about-arc.md)
- [Installing ARC](/docs/installing-arc.md)
- [Authenticating to the GitHub API](/docs/authenticating-to-the-github-api.md)
- [Deploying alternative runners](/docs/deploying-alternative-runners.md)
- [Monitoring and troubleshooting](/docs/monitoring-and-troubleshooting.md)
### Using GitHub App
## Contributing
You can create a GitHub App for either your account or any organization. If you want to create a GitHub App for your account, open the following link to the creation page, enter any unique name in the "GitHub App name" field, and hit the "Create GitHub App" button at the bottom of the page.
We welcome contributions from the community. For more details on contributing to the project (including requirements), please refer to "[Getting Started with Contributing](https://github.com/actions/actions-runner-controller/blob/master/CONTRIBUTING.md)."
- [Create GitHub Apps on your account](https://github.com/settings/apps/new?url=http://github.com/summerwind/actions-runner-controller&webhook_active=false&public=false&administration=write&actions=read)
## Troubleshooting
If you want to create a GitHub App for your organization, replace the `:org` part of the following URL with your organization name before opening it. Then enter any unique name in the "GitHub App name" field, and hit the "Create GitHub App" button at the bottom of the page to create a GitHub App.
- [Create GitHub Apps on your organization](https://github.com/organizations/:org/settings/apps/new?url=http://github.com/summerwind/actions-runner-controller&webhook_active=false&public=false&administration=write&organization_self_hosted_runners=write&actions=read)
You will see an *App ID* on the page of the GitHub App you created as follows, the value of this App ID will be used later.
When the installation is complete, you will be taken to a URL in one of the following formats, the last number of the URL will be used as the Installation ID later (For example, if the URL ends in `settings/installations/12345`, then the Installation ID is `12345`).
Finally, register the App ID (`APP_ID`), Installation ID (`INSTALLATION_ID`), and downloaded private key file (`PRIVATE_KEY_FILE_PATH`) to Kubernetes as Secret.
From an account that has `admin` privileges for the repository, create a [personal access token](https://github.com/settings/tokens) with `repo` scope. This token is used to register a self-hosted runner by *actions-runner-controller*.
Self-hosted runners in GitHub can either be connected to a single repository, or to a GitHub organization (so they are available to all repositories in the organization). This token is used to register a self-hosted runner by *actions-runner-controller*.
For adding a runner to a repository, the token should have `repo` scope. If the runner should be added to an organization, the token should have `admin:org` scope. Note that to use a Personal Access Token, you must issue the token with an account that has `admin` privileges (on the repository and/or the organization).
Open the Create Token page from the following link, grant the `repo` and/or `admin:org` scope, and press the "Generate Token" button at the bottom of the page to create the token.
- [Create personal access token](https://github.com/settings/tokens/new)
Register the created token (`GITHUB_TOKEN`) as a Kubernetes secret.
- Manage a set of runners with `RunnerDeployment`.
### Repository runners
To launch a single self-hosted runner, you need to create a manifest file includes *Runner* resource as follows. This example launches a self-hosted runner with name *example-runner* for the *summerwind/actions-runner-controller* repository.
```yaml
# runner.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:Runner
metadata:
name:example-runner
spec:
repository:summerwind/actions-runner-controller
env:[]
```
Apply the created manifest file to your Kubernetes.
```shell
$ kubectl apply -f runner.yaml
runner.actions.summerwind.dev/example-runner created
```
You can see that the Runner resource has been created.
You can also see that the runner pod has been running.
```shell
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
example-runner 2/2 Running 0 1m
```
The runner you created has been registered to your repository.
<imgwidth="756"alt="Actions tab in your repository settings"src="https://user-images.githubusercontent.com/230145/73618667-8cbf9700-466c-11ea-80b6-c67e6d3f70e7.png">
Now you can use your self-hosted runner. See the [official documentation](https://help.github.com/en/actions/automating-your-workflow-with-github-actions/using-self-hosted-runners-in-a-workflow) on how to run a job with it.
### Organization Runners
To add the runner to an organization, you only need to replace the `repository` field with `organization`, so the runner will register itself to the organization.
```yaml
# runner.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:Runner
metadata:
name:example-org-runner
spec:
organization:your-organization-name
```
Now you can see the runner on the organization level (if you have organization owner permissions).
### RunnerDeployments
There are `RunnerReplicaSet` and `RunnerDeployment` that corresponds to `ReplicaSet` and `Deployment` but for `Runner`.
You usually need only `RunnerDeployment` rather than `RunnerReplicaSet` as the former is for managing the latter.
```yaml
# runnerdeployment.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:RunnerDeployment
metadata:
name:example-runnerdeploy
spec:
replicas:2
template:
spec:
repository:mumoshu/actions-runner-controller-ci
env:[]
```
Apply the manifest file to your cluster:
```shell
$ kubectl apply -f runner.yaml
runnerdeployment.actions.summerwind.dev/example-runnerdeploy created
```
You can see that 2 runners have been created as specified by `replicas: 2`:
`RunnerDeployment` can scale the number of runners between `minReplicas` and `maxReplicas` fields, depending on pending workflow runs.
In the below example, `actions-runner` checks for pending workflow runs for each sync period, and scale to e.g. 3 if there're 3 pending jobs at sync time.
The scale out performance is controlled via the manager containers startup `--sync-period` argument. The default value is 10 minutes to prevent unconfigured deployments rate limiting themselves from the GitHub API. The period can be customised in the `config/default/manager_auth_proxy_patch.yaml` patch for those that are building the solution via the kustomize setup.
Additionally, the autoscaling feature has an anti-flapping option that prevents periodic loop of scaling up and down.
By default, it doesn't scale down until the grace period of 10 minutes passes after a scale up. The grace period can be configured by setting `scaleDownDelaySecondsAfterScaleUp`:
When using default runner, runner pod starts up 2 containers: runner and DinD (Docker-in-Docker). This might create issues if there's `LimitRange` set to namespace.
```yaml
# dindrunnerdeployment.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:RunnerDeployment
metadata:
name:example-dindrunnerdeploy
spec:
replicas:2
template:
spec:
image:summerwind/actions-runner-dind
dockerdWithinRunnerContainer:true
repository:mumoshu/actions-runner-controller-ci
env:[]
```
This also helps with resources, as you don't need to give resources separately to docker and runner.
## Additional tweaks
You can pass details through the spec selector. Here's an eg. of what you may like to do:
```yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:RunnerDeployment
metadata:
name:actions-runner
namespace:default
spec:
replicas:2
template:
spec:
nodeSelector:
node-role.kubernetes.io/test:""
tolerations:
- effect:NoSchedule
key:node-role.kubernetes.io/test
operator:Exists
repository:mumoshu/actions-runner-controller-ci
image:custom-image/actions-runner:latest
imagePullPolicy:Always
resources:
limits:
cpu:"4.0"
memory:"8Gi"
requests:
cpu:"2.0"
memory:"4Gi"
# If set to false, there are no privileged container and you cannot use docker.
dockerEnabled:false
# If set to true, runner pod container only 1 container that's expected to be able to run docker, too.
# image summerwind/actions-runner-dind or custom one should be used with true -value
dockerdWithinRunnerContainer:false
# Valid if dockerdWithinRunnerContainer is not true
dockerdContainerResources:
limits:
cpu:"4.0"
memory:"8Gi"
requests:
cpu:"2.0"
memory:"4Gi"
sidecarContainers:
- name:mysql
image:mysql:5.7
env:
- name:MYSQL_ROOT_PASSWORD
value:abcd1234
securityContext:
runAsUser:0
# if workDir is not specified, the default working directory is /runner/_work
# this setting allows you to customize the working directory location
# for example, the below setting is the same as on the ubuntu-18.04 image
workDir:/home/runner/work
```
## Runner labels
To run a workflow job on a self-hosted runner, you can use the following syntax in your workflow:
```yaml
jobs:
release:
runs-on:self-hosted
```
When you have multiple kinds of self-hosted runners, you can distinguish between them using labels. In order to do so, you can specify one or more labels in your `Runner` or `RunnerDeployment` spec.
```yaml
# runnerdeployment.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:RunnerDeployment
metadata:
name:custom-runner
spec:
replicas:1
template:
spec:
repository:summerwind/actions-runner-controller
labels:
- custom-runner
```
Once this spec is applied, you can observe the labels for your runner from the repository or organization in the GitHub settings page for the repository or organization. You can now select a specific runner from your workflow by using the label in `runs-on`:
```yaml
jobs:
release:
runs-on:custom-runner
```
Note that if you specify `self-hosted` in your workflow, then this will run your job on _any_ self-hosted runner, regardless of the labels that they have.
## Runner Groups
Runner groups can be used to limit which repositories are able to use the GitHub Runner at an Organisation level.
To add the runner to the group `NewGroup`, specify the group in your `Runner` or `RunnerDeployment` spec.
```yaml
# runnerdeployment.yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:RunnerDeployment
metadata:
name:custom-runner
spec:
replicas:1
template:
spec:
group:NewGroup
```
## Software installed in the runner image
The GitHub hosted runners include a large amount of pre-installed software packages. For Ubuntu 18.04, this list can be found at <https://github.com/actions/virtual-environments/blob/master/images/linux/Ubuntu1804-README.md>
The container image is based on Ubuntu 18.04, but it does not contain all of the software installed on the GitHub runners. It contains the following subset of packages from the GitHub runners:
- Basic CLI packages
- git (2.26)
- docker
- build-essentials
The virtual environments from GitHub contain a lot more software packages (different versions of Java, Node.js, Golang, .NET, etc) which are not provided in the runner image. Most of these have dedicated setup actions which allow the tools to be installed on-demand in a workflow, for example: `actions/setup-java` or `actions/setup-node`
If there is a need to include packages in the runner image for which there is no setup action, then this can be achieved by building a custom container image for the runner. The easiest way is to start with the `summerwind/actions-runner` image and installing the extra dependencies directly in the docker image:
```shell
FROM summerwind/actions-runner:v2.169.1
RUN sudo apt update -y \
&& apt install YOUR_PACKAGE
&& rm -rf /var/lib/apt/lists/*
```
You can then configure the runner to use a custom docker image by configuring the `image` field of a `Runner` or `RunnerDeployment`:
```yaml
apiVersion:actions.summerwind.dev/v1alpha1
kind:Runner
metadata:
name:custom-runner
spec:
repository:summerwind/actions-runner-controller
image:YOUR_CUSTOM_DOCKER_IMAGE
```
## Common Errors
### invalid header field value
```json
2020-11-12T22:17:30.693ZERRORcontroller-runtime.controllerReconcilererror{"controller":"runner","request":"actions-runner-system/runner-deployment-dk7q8-dk5c9","error":"failed to create registration token: Post \"https://api.github.com/orgs/$YOUR_ORG_HERE/actions/runners/registration-token\": net/http: invalid header field value \"Bearer $YOUR_TOKEN_HERE\\n\" for key Authorization"}
```
**Solutions**<br/>
Your base64'ed PAT token has a new line at the end, it needs to be created without a `\n` added
*`echo -n $TOKEN | base64`
* Create the secret as described in the docs using the shell and documeneted flags
# Developing
If you'd like to modify the controller to fork or contribute, I'd suggest using the following snippet for running
Although the situation can change over time, as of writing this sentence, the benefits of using `actions-runner-controller` over the alternatives are:
-`actions-runner-controller` has the ability to autoscale runners based on number of pending/progressing jobs (#99)
-`actions-runner-controller` is able to gracefully stop runners (#103)
-`actions-runner-controller` has ARM support
We are very happy to help you with any issues you have. Please refer to the "[Troubleshooting](https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md)" section for common issues.
GitHub takes the security of our software products and services seriously, including all of the open source code repositories managed through our GitHub organizations, such as [GitHub](https://github.com/GitHub).
Even though [open source repositories are outside of the scope of our bug bounty program](https://bounty.github.com/index.html#scope) and therefore not eligible for bounty rewards, we will ensure that your finding gets passed along to the appropriate maintainers for remediation.
## Reporting Security Issues
If you believe you have found a security vulnerability in any GitHub-owned repository, please report it to us through coordinated disclosure.
**Please do not report security vulnerabilities through public GitHub issues, discussions, or pull requests.**
Instead, please send an email to opensource-security[@]github.com.
Please include as much of the information listed below as you can to help us better understand and resolve the issue:
* The type of issue (e.g., buffer overflow, SQL injection, or cross-site scripting)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
## Policy
See [GitHub's Safe Harbor Policy](https://docs.github.com/en/github/site-policy/github-bug-bounty-program-legal-safe-harbor#1-safe-harbor-terms)
* [Unable to scale to zero with TotalNumberOfQueuedAndInProgressWorkflowRuns](#unable-to-scale-to-zero-with-totalnumberofqueuedandinprogressworkflowruns)
* [Slow / failure to boot dind sidecar (default runner)](#slow--failure-to-boot-dind-sidecar-default-runner)
## Tools
A list of tools which are helpful for troubleshooting
* [Multi pod and container log tailing for Kubernetes `stern`](https://github.com/stern/stern)
## Installation
Troubeshooting runbooks that relate to ARC installation problems
### InternalError when calling webhook: context deadline exceeded
**Problem**
This issue can come up for various reasons like leftovers from previous installations or not being able to access the K8s service's clusterIP associated with the admission webhook server (of ARC).
Post "https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment?timeout=10s": context deadline exceeded
```
**Solution**
First we will try the common solution of checking webhook leftovers from previous installations:
1. ```bash
kubectl get validatingwebhookconfiguration -A
kubectl get mutatingwebhookconfiguration -A
```
2. If you see any webhooks related to actions-runner-controller, delete them:
If that didn't work then probably your K8s control-plane is somehow unable to access the K8s service's clusterIP associated with the admission webhook server:
1. You're running apiserver as a binary and you didn't make service cluster IPs available to the host network.
2. You're running the apiserver in the pod but your pod network (i.e. CNI plugin installation and config) is not good so your pods(like kube-apiserver) in the K8s control-plane nodes can't access ARC's admission webhook server pod(s) in probably data-plane nodes.
Another reason could be due to GKEs firewall settings you may run into the following errors when trying to deploy runners on a private GKE cluster:
To fix this, you may either:
1. Configure the webhook to use another port, such as 443 or 10250, [each of
which allow traffic by default](https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules).
```sh
# With helm, you'd set `webhookPort` to the port number of your choice
# See https://github.com/actions/actions-runner-controller/pull/1410/files for more information
2. Set up a firewall rule to allow the master node to connect to the default
webhook port. The exact way to do this may vary, but the following script
should point you in the right direction:
```sh
# 1) Retrieve the network tag automatically given to the worker nodes
# NOTE: this only works if you have only one cluster in your GCP project. You will have to manually inspect the result of this command to find the tag for the cluster you want to target
"error": "failed to create registration token: Post \"https://api.github.com/orgs/$YOUR_ORG_HERE/actions/runners/registration-token\": net/http: invalid header field value \"Bearer $YOUR_TOKEN_HERE\\n\" for key Authorization"
}
```
**Solution**
Your base64'ed PAT token has a new line at the end, it needs to be created without a `\n` added, either:
* `echo -n $TOKEN | base64`
* Create the secret as described in the docs using the shell and documented flags
### Helm chart install failure: certificate signed by unknown authority
**Problem**
```text
Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": x509: certificate signed by unknown authority
```
Apparently, it's failing while `helm` is creating one of resources defined in the ARC chart and the cause was that cert-manager's webhook is not working correctly, due to the missing or the invalid CA certficate.
You'd try to tail logs from the `cert-manager-cainjector` and see it's failing with an error like:
I0703 03:32:12.339197 1 request.go:645] Throttling request took 1.047437675s, request: GET:https://10.96.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
E0703 03:32:13.143790 1 start.go:151] cert-manager/ca-injector "msg"="Error registering certificate based controllers. Retrying after 5 seconds." "error"="no matches for kind \"MutatingWebhookConfiguration\" in version \"admissionregistration.k8s.io/v1beta1\""
Error: error registering secret controller: no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1"
```
**Solution**
Your cluster is based on a new enough Kubernetes of version 1.22 or greater which does not support the legacy `admissionregistration.k8s.io/v1beta1` API anymore, and your `cert-manager` is not up-to-date hence it's still trying to use the leagcy Kubernetes API.
In many cases, it's not an option to downgrade Kubernetes. So, just upgrade `cert-manager` to a more recent version that does have have the support for the specific Kubernetes version you're using.
See <https://cert-manager.io/docs/installation/supported-releases/> for the list of available cert-manager versions.
## Operations
Troubeshooting runbooks that relate to ARC operational problems
### Stuck runner kind or backing pod
**Problem**
Sometimes either the runner kind (`kubectl get runners`) or it's underlying pod can get stuck in a terminating state for various reasons. You can get the kind unstuck by removing its finaliser using something like this:
**Solution**
Remove the finaliser from the relevent runner kind or pod
# Get all pods that are stuck terminating and remove the finalizer
$ kubectl -n get pods | grep Terminating | awk {'print $1'} | xargs kubectl patch pod -p '{"metadata":{"finalizers":null}}'
```
_Note the code assumes you have already selected the namespace your runners are in and that they
are in a namespace not shared with anything else_
### Delay in jobs being allocated to runners
**Problem**
ARC isn't involved in jobs actually getting allocated to a runner. ARC is responsible for orchestrating runners and the runner lifecycle. Why some people see large delays in job allocation is not clear however it has been confirmed https://github.com/actions/actions-runner-controller/issues/1387#issuecomment-1122593984 that this is caused from the self-update process somehow.
**Solution**
Disable the self-update process in your runner manifests
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: example-runnerdeployment-with-sleep
spec:
template:
spec:
...
env:
- name: DISABLE_RUNNER_UPDATE
value: "true"
```
### Runner coming up before network available
**Problem**
If you're running your action runners on a service mesh like Istio, you might
have problems with runner configuration accompanied by logs like:
```text
....
runner Starting Runner listener with startup type: service
runner Started listener process
runner An error occurred: Not configured
runner Runner listener exited with error code 2
runner Runner listener exit with retryable error, re-launch runner in 5 seconds.
....
```
This is because the `istio-proxy` has not completed configuring itself when the
configuration script tries to communicate with the network.
More broadly, there are many other circumstances where the runner pod coming up first can cause issues.
**Solution**
> Added originally to help users with older istio instances.
> Newer Istio instances can use Istio's `holdApplicationUntilProxyStarts` attribute ([istio/istio#11130](https://github.com/istio/istio/issues/11130)) to avoid having to delay starting up the runner.
> Please read the discussion in [#592](https://github.com/actions/actions-runner-controller/pull/592) for more information.
You can add a delay to the runner's entrypoint script by setting the `STARTUP_DELAY_IN_SECONDS` environment variable for the runner pod. This will cause the script to sleep X seconds, this works with any runner kind.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: example-runnerdeployment-with-sleep
spec:
template:
spec:
...
env:
- name: STARTUP_DELAY_IN_SECONDS
value: "5"
```
### Outgoing network action hangs indefinitely
**Problem**
Some random outgoing network actions hangs indefinitely. This could be because your cluster does not give Docker the standard MTU of 1500, you can check this out by running `ip link` in a pod that encounters the problem and reading the outgoing interface's MTU value. If it is smaller than 1500, then try the following.
**Solution**
Add a `dockerMTU` key in your runner's spec with the value you read on the outgoing interface. For instance:
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: github-runner
namespace: github-system
spec:
replicas: 6
template:
spec:
dockerMTU: 1400
repository: $username/$repo
env: []
```
If the issue still persists, you can set the `ARC_DOCKER_MTU_PROPAGATION` to propagate the host MTU to networks created
by the GitHub Runner. For instance:
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: github-runner
namespace: github-system
spec:
replicas: 6
template:
spec:
dockerMTU: 1400
repository: $username/$repo
env:
- name: ARC_DOCKER_MTU_PROPAGATION
value: "true"
```
You can read the discussion regarding this issue in
### Unable to scale to zero with TotalNumberOfQueuedAndInProgressWorkflowRuns
**Problem**
HRA doesn't scale the RunnerDeployment to zero, even though you did configure HRA correctly, to have a pull-based scaling metric `TotalNumberOfQueuedAndInProgressWorkflowRuns`, and set `minReplicas: 0`.
**Solution**
You very likely have some dangling workflow jobs stuck in `queued` or `in_progress` as seen in [#1057](https://github.com/actions/actions-runner-controller/issues/1057#issuecomment-1133439061).
Manually call [the "list workflow runs" API](https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository), and [remove the dangling workflow job(s)](https://docs.github.com/en/rest/actions/workflow-runs#delete-a-workflow-run).
### Slow / failure to boot dind sidecar (default runner)
**Problem**
If you noticed that it takes several minutes for sidecar dind container to be created or it exits with with error just after being created it might indicate that you are experiencing disk performance issue. You might see message `failed to reserve container name` when scaling up multiple runners at once. When you ssh on kubernetes node that problematic pods were scheduled on you can use tools like `atop`, `htop` or `iotop` to check IO usage and cpu time percentage used on iowait. If you see that disk usage is high (80-100%) and iowaits are taking a significant chunk of you cpu time (normally it should not be higher than 10%) it means that performance is being bottlenecked by slow disk.
**Solution**
The solution is to switch to using faster storage, if you are experiencing this issue you are probably using HDD storage. Switching to SSD storage fixed the problem in my case. Most cloud providers have a list of storage options to use just pick something faster that your current disk, for on prem clusters you will need to invest in some SSDs.
### Dockerd no space left on device
**Problem**
If you are running many containers on your runner you might encounter an issue where docker daemon is unable to start new containers and you see error `no space left on device`.
**Solution**
Add a `dockerVarRunVolumeSizeLimit` key in your runner's spec with a higher size limit (the default is 1M) For instance:
# To prevent `CustomResourceDefinition.apiextensions.k8s.io "runners.actions.summerwind.dev" is invalid: metadata.annotations: Too long: must have at most 262144 bytes`
# Needed to report test success by crating a cm from within workflow job step
- apiGroups:[""]
resources:["configmaps"]
verbs:["create","delete"]
---
apiVersion:rbac.authorization.k8s.io/v1
kind:ClusterRole
metadata:
name:runner-status-updater
rules:
- apiGroups:["actions.summerwind.dev"]
resources:["runners/status"]
verbs:["get","update","patch"]
---
apiVersion:v1
kind:ServiceAccount
metadata:
name:${RUNNER_SERVICE_ACCOUNT_NAME}
namespace:${NAMESPACE}
---
# To verify it's working, try:
# kubectl auth can-i --as system:serviceaccount:default:runner get pod
# If incomplete, workflows and jobs would fail with an error message like:
# Error: Error: The Service account needs the following permissions [{"group":"","verbs":["get","list","create","delete"],"resource":"pods","subresource":""},{"group":"","verbs":["get","create"],"resource":"pods","subresource":"exec"},{"group":"","verbs":["get","list","watch"],"resource":"pods","subresource":"log"},{"group":"batch","verbs":["get","list","create","delete"],"resource":"jobs","subresource":""},{"group":"","verbs":["create","delete","get","list"],"resource":"secrets","subresource":""}] on the pod resource in the 'default' namespace. Please contact your self hosted runner administrator.
# Error: Process completed with exit code 1.
apiVersion:rbac.authorization.k8s.io/v1
# This role binding allows "jane" to read pods in the "default" namespace.
# You need to already have a Role named "pod-reader" in that namespace.
# Fix the following no space left errors with rootless-dind runners that can happen while running buildx build:
# ------
# > [4/5] RUN go mod download:
# ------
# ERROR: failed to solve: failed to prepare yxsw8lv9hqnuafzlfta244l0z: mkdir /home/runner/.local/share/docker/vfs/dir/yxsw8lv9hqnuafzlfta244l0z/usr/local/go/src/cmd/compile/internal/types2/testdata: no space left on device
# Error: Process completed with exit code 1.
#
volumeMounts:
- name:rootless-dind-work-dir
# Omit the /share/docker part of the /home/runner/.local/share/docker as
# that part is created by dockerd.
mountPath:/home/runner/.local
readOnly:false
# See https://github.com/actions/actions-runner-controller/issues/2123
# Be sure to omit the "aliases" field from the config.json.
# Otherwise you may encounter nasty errors like:
# $ docker build
# docker: 'buildx' is not a docker command.
# See 'docker --help'
# due to the incompatibility between your host docker config.json and the runner environment.
# That is, your host dockcer config.json might contain this:
# "aliases": {
# "builder": "buildx"
# }
# And this results in the above error when the runner does not have buildx installed yet.
- name:docker-config
mountPath:/home/runner/.docker/config.json
subPath:config.json
readOnly:true
- name:docker-config-root
mountPath:/home/runner/.docker
volumes:
- name:rootless-dind-work-dir
ephemeral:
volumeClaimTemplate:
spec:
accessModes:["ReadWriteOnce"]
storageClassName:"${NAME}-rootless-dind-work-dir"
resources:
requests:
storage:3Gi
- name:docker-config
# Refer to .dockerconfigjson/.docker/config.json
secret:
secretName:docker-config
items:
- key:.dockerconfigjson
path:config.json
- name:docker-config-root
emptyDir:{}
#
# Non-standard working directory
#
# workDir: "/"
# # Uncomment the below to enable the kubernetes container mode
# # See https://github.com/actions/actions-runner-controller#runner-with-k8s-jobs
# Valid only when dockerdWithinRunnerContainer=false
# - name: docker
# # PV-backed runner work dir
# volumeMounts:
# - name: work
# mountPath: /runner/_work
# # Cache docker image layers, in case dockerdWithinRunnerContainer=false
# - name: var-lib-docker
# mountPath: /var/lib/docker
# # image: mumoshu/actions-runner-dind:dev
# # For buildx cache
# - name: cache
# mountPath: "/home/runner/.cache"
# For fixing no space left error on rootless dind runner
- name:rootless-dind-work-dir
# Omit the /share/docker part of the /home/runner/.local/share/docker as
# that part is created by dockerd.
mountPath:/home/runner/.local
readOnly:false
# Comment out the ephemeral work volume if you're going to test the kubernetes container mode
# volumes:
# - name: work
# ephemeral:
# volumeClaimTemplate:
# spec:
# accessModes:
# - ReadWriteOnce
# storageClassName: "${NAME}-runner-work-dir"
# resources:
# requests:
# storage: 10Gi
# Fix the following no space left errors with rootless-dind runners that can happen while running buildx build:
# ------
# > [4/5] RUN go mod download:
# ------
# ERROR: failed to solve: failed to prepare yxsw8lv9hqnuafzlfta244l0z: mkdir /home/runner/.local/share/docker/vfs/dir/yxsw8lv9hqnuafzlfta244l0z/usr/local/go/src/cmd/compile/internal/types2/testdata: no space left on device
# Error: Process completed with exit code 1.
#
volumes:
- name:rootless-dind-work-dir
ephemeral:
volumeClaimTemplate:
spec:
accessModes:["ReadWriteOnce"]
storageClassName:"${NAME}-rootless-dind-work-dir"
resources:
requests:
storage:3Gi
volumeClaimTemplates:
- metadata:
name:vol1
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage:10Mi
storageClassName:${NAME}
## Dunno which provider supports auto-provisioning with selector.
## At least the rancher local path provider stopped with:
## waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator
# selector:
# matchLabels:
# runnerset-volume-id: ${NAME}-vol1
- metadata:
name:vol2
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage:10Mi
storageClassName:${NAME}
# selector:
# matchLabels:
# runnerset-volume-id: ${NAME}-vol2
- metadata:
name:var-lib-docker
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage:10Mi
storageClassName:${NAME}-var-lib-docker
- metadata:
name:cache
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage:10Mi
storageClassName:${NAME}-cache
- metadata:
name:runner-tool-cache
# It turns out labels doesn't distinguish PVs across PVCs and the
# end result is PVs are reused by wrong PVCs.
# The correct way seems to be to differentiate storage class per pvc template.
# Set actions-runner-controller settings for testing
logLevel:"-4"
imagePullSecrets:[]
image:
# This needs to be an empty array rather than a single-item array with empty name.
# Otherwise you end up with the following error on helm-upgrade:
# Error: UPGRADE FAILED: failed to create patch: map: map[] does not contain declared merge key: name && failed to create patch: map: map[] does not contain declared merge key: name
All additional docs are kept in the `docs/` folder, this README is solely for documenting the values.yaml keys and values
## Values
**_The values are documented as of HEAD, to review the configuration options for your chart version ensure you view this file at the relevant [tag](https://github.com/actions/actions-runner-controller/tags)_**
> _Default values are the defaults set in the charts `values.yaml`, some properties have default configurations in the code for when the property is omitted or invalid_
| `authSecret.name` | Set the name of the auth secret | controller-manager |
| `authSecret.annotations` | Set annotations for the auth Secret | |
| `authSecret.github_app_id` | The ID of your GitHub App. **This can't be set at the same time as `authSecret.github_token`** | |
| `authSecret.github_app_installation_id` | The ID of your GitHub App installation. **This can't be set at the same time as `authSecret.github_token`** | |
| `authSecret.github_app_private_key` | The multiline string of your GitHub App's private key. **This can't be set at the same time as `authSecret.github_token`** | |
| `authSecret.github_token` | Your chosen GitHub PAT token. **This can't be set at the same time as the `authSecret.github_app_*`** | |
| `authSecret.github_basicauth_username` | Username for GitHub basic auth to use instead of PAT or GitHub APP in case it's running behind a proxy API | |
| `authSecret.github_basicauth_password` | Password for GitHub basic auth to use instead of PAT or GitHub APP in case it's running behind a proxy API | |
| `dockerRegistryMirror` | The default Docker Registry Mirror used by runners. | |
| `hostNetwork` | The "hostNetwork" of the controller container | false |
| `dnsPolicy` | The "dnsPolicy" of the controller container | ClusterFirst |
| `image.repository` | The "repository/image" of the controller container | summerwind/actions-runner-controller |
| `image.tag` | The tag of the controller container | |
| `image.actionsRunnerRepositoryAndTag` | The "repository/image" of the actions runner container | summerwind/actions-runner:latest |
| `image.actionsRunnerImagePullSecrets` | Optional image pull secrets to be included in the runner pod's ImagePullSecrets | |
| `image.dindSidecarRepositoryAndTag` | The "repository/image" of the dind sidecar container | docker:dind |
| `image.pullPolicy` | The pull policy of the controller image | IfNotPresent |
| `metrics.serviceMonitor.enable` | Deploy serviceMonitor kind for for use with prometheus-operator CRDs | false |
| `metrics.serviceMonitor.interval` | Configure the interval that Prometheus should scrap the controller's metrics | 1m |
| `metrics.serviceMonitor.namespace | Namespace which Prometheus is running in | `Release.Namespace` (the default namespace of the helm chart). |
| `metrics.serviceMonitor.timeout` | Configure the timeout the timeout of Prometheus scrapping. | 30s |
| `metrics.serviceAnnotations` | Set annotations for the provisioned metrics service resource | |
| `metrics.port` | Set port of metrics service | 8443 |
| `metrics.proxy.enabled` | Deploy kube-rbac-proxy container in controller pod | true |
| `metrics.proxy.image.repository` | The "repository/image" of the kube-proxy container | quay.io/brancz/kube-rbac-proxy |
| `metrics.proxy.image.tag` | The tag of the kube-proxy image to use when pulling the container | v0.13.1 |
| `metrics.serviceMonitorLabels` | Set labels to apply to ServiceMonitor resources | |
| `imagePullSecrets` | Specifies the secret to be used when pulling the controller pod containers | |
| `fullnameOverride` | Override the full resource names | |
| `nameOverride` | Override the resource name prefix | |
| `serviceAccount.annotations` | Set annotations to the service account | |
| `serviceAccount.create` | Deploy the controller pod under a service account | true |
| `podAnnotations` | Set annotations for the controller pod | |
| `podLabels` | Set labels for the controller pod | |
| `serviceAccount.name` | Set the name of the service account | |
| `securityContext` | Set the security context for each container in the controller pod | |
| `podSecurityContext` | Set the security context to controller pod | |
| `service.annotations` | Set annotations for the provisioned webhook service resource | |
| `service.port` | Set controller service ports | |
| `service.type` | Set controller service type | |
| `topologySpreadConstraints` | Set the controller pod topologySpreadConstraints | |
| `nodeSelector` | Set the controller pod nodeSelector | |
| `resources` | Set the controller pod resources | |
| `affinity` | Set the controller pod affinity rules | |
| `podDisruptionBudget.enabled` | Enables a PDB to ensure HA of controller pods | false |
| `podDisruptionBudget.minAvailable` | Minimum number of pods that must be available after eviction | |
| `podDisruptionBudget.maxUnavailable` | Maximum number of pods that can be unavailable after eviction. Kubernetes 1.7+ required. | |
| `tolerations` | Set the controller pod tolerations | |
| `env` | Set environment variables for the controller container | |
| `priorityClassName` | Set the controller pod priorityClassName | |
| `scope.watchNamespace` | Tells the controller and the github webhook server which namespace to watch if `scope.singleNamespace` is true | `Release.Namespace` (the default namespace of the helm chart). |
| `scope.singleNamespace` | Limit the controller to watch a single namespace | false |
| `certManagerEnabled` | Enable cert-manager. If disabled you must set admissionWebHooks.caBundle and create TLS secrets manually | true |
| `runner.statusUpdateHook.enabled` | Use custom RBAC for runners (role, role binding and service account), this will enable reporting runner statuses | false |
| `admissionWebHooks.caBundle` | Base64-encoded PEM bundle containing the CA that signed the webhook's serving certificate | |
| `githubWebhookServer.logLevel` | Set the log level of the githubWebhookServer container | |
| `githubWebhookServer.logFormat` | Set the log format of the githubWebhookServer controller. Valid options are "text" and "json" | text |
| `githubWebhookServer.replicaCount` | Set the number of webhook server pods | 1 |
| `githubWebhookServer.useRunnerGroupsVisibility` | Enable supporting runner groups with custom visibility, you also need to set `githubWebhookServer.secret.enabled` to enable this feature. | false |
| `githubWebhookServer.enabled` | Deploy the webhook server pod | false |
| `githubWebhookServer.queueLimit` | Set the queue size limit in the githubWebhookServer | |
| `githubWebhookServer.secret.enabled` | Passes the webhook hook secret to the github-webhook-server | false |
| `githubWebhookServer.secret.name` | Set the name of the webhook hook secret | github-webhook-server |
| `githubWebhookServer.secret.github_webhook_secret_token` | Set the webhook secret token value | |
| `githubWebhookServer.imagePullSecrets` | Specifies the secret to be used when pulling the githubWebhookServer pod containers | |
| `githubWebhookServer.nameOverride` | Override the resource name prefix | |
| `githubWebhookServer.fullnameOverride` | Override the full resource names | |
| `githubWebhookServer.serviceAccount.create` | Deploy the githubWebhookServer under a service account | true |
| `githubWebhookServer.serviceAccount.annotations` | Set annotations for the service account | |
| `githubWebhookServer.serviceAccount.name` | Set the service account name | |
| `githubWebhookServer.podAnnotations` | Set annotations for the githubWebhookServer pod | |
| `githubWebhookServer.podLabels` | Set labels for the githubWebhookServer pod | |
| `githubWebhookServer.podSecurityContext` | Set the security context to githubWebhookServer pod | |
| `githubWebhookServer.securityContext` | Set the security context for each container in the githubWebhookServer pod | |
| `githubWebhookServer.resources` | Set the githubWebhookServer pod resources | |
| `githubWebhookServer.topologySpreadConstraints` | Set the githubWebhookServer pod topologySpreadConstraints | |
| `githubWebhookServer.nodeSelector` | Set the githubWebhookServer pod nodeSelector | |
| `githubWebhookServer.tolerations` | Set the githubWebhookServer pod tolerations | |
| `githubWebhookServer.affinity` | Set the githubWebhookServer pod affinity rules | |
| `githubWebhookServer.priorityClassName` | Set the githubWebhookServer pod priorityClassName | |
| `githubWebhookServer.terminationGracePeriodSeconds` | Set the githubWebhookServer pod terminationGracePeriodSeconds. Useful when using preStop hooks to drain/sleep. | `10` |
| `githubWebhookServer.lifecycle` | Set the githubWebhookServer pod lifecycle hooks | `{}` |
| `githubWebhookServer.service.type` | Set githubWebhookServer service type | |
| `githubWebhookServer.service.ports` | Set githubWebhookServer service ports | `[{"port":80, "targetPort:"http", "protocol":"TCP", "name":"http"}]` |
| `githubWebhookServer.service.loadBalancerSourceRanges` | Set githubWebhookServer loadBalancerSourceRanges for restricting loadBalancer type services | `[]` |
| `githubWebhookServer.ingress.enabled` | Deploy an ingress kind for the githubWebhookServer | false |
| `githubWebhookServer.ingress.annotations` | Set annotations for the ingress kind | |
| `githubWebhookServer.ingress.hosts` | Set hosts configuration for ingress | `[{"host": "chart-example.local", "paths": []}]` |
| `githubWebhookServer.ingress.tls` | Set tls configuration for ingress | |
| `githubWebhookServer.ingress.ingressClassName` | Set ingress class name | |
| `githubWebhookServer.podDisruptionBudget.enabled` | Enables a PDB to ensure HA of githubwebhook pods | false |
| `githubWebhookServer.podDisruptionBudget.minAvailable` | Minimum number of pods that must be available after eviction | |
| `githubWebhookServer.podDisruptionBudget.maxUnavailable` | Maximum number of pods that can be unavailable after eviction. Kubernetes 1.7+ required. | |
| `actionsMetricsServer.logLevel` | Set the log level of the actionsMetricsServer container | |
| `actionsMetricsServer.logFormat` | Set the log format of the actionsMetricsServer controller. Valid options are "text" and "json" | text |
| `actionsMetricsServer.enabled` | Deploy the actions metrics server pod | false |
| `actionsMetricsServer.secret.enabled` | Passes the webhook hook secret to the actions-metrics-server | false |
| `actionsMetricsServer.secret.name` | Set the name of the webhook hook secret | actions-metrics-server |
| `actionsMetricsServer.secret.github_webhook_secret_token` | Set the webhook secret token value | |
| `actionsMetricsServer.imagePullSecrets` | Specifies the secret to be used when pulling the actionsMetricsServer pod containers | |
| `actionsMetricsServer.nameOverride` | Override the resource name prefix | |
| `actionsMetricsServer.fullnameOverride` | Override the full resource names | |
| `actionsMetricsServer.serviceAccount.create` | Deploy the actionsMetricsServer under a service account | true |
| `actionsMetricsServer.serviceAccount.annotations` | Set annotations for the service account | |
| `actionsMetricsServer.serviceAccount.name` | Set the service account name | |
| `actionsMetricsServer.podAnnotations` | Set annotations for the actionsMetricsServer pod | |
| `actionsMetricsServer.podLabels` | Set labels for the actionsMetricsServer pod | |
| `actionsMetricsServer.podSecurityContext` | Set the security context to actionsMetricsServer pod | |
| `actionsMetricsServer.securityContext` | Set the security context for each container in the actionsMetricsServer pod | |
| `actionsMetricsServer.resources` | Set the actionsMetricsServer pod resources | |
| `actionsMetricsServer.topologySpreadConstraints` | Set the actionsMetricsServer pod topologySpreadConstraints | |
| `actionsMetricsServer.nodeSelector` | Set the actionsMetricsServer pod nodeSelector | |
| `actionsMetricsServer.tolerations` | Set the actionsMetricsServer pod tolerations | |
| `actionsMetricsServer.affinity` | Set the actionsMetricsServer pod affinity rules | |
| `actionsMetricsServer.priorityClassName` | Set the actionsMetricsServer pod priorityClassName | |
| `actionsMetricsServer.terminationGracePeriodSeconds` | Set the actionsMetricsServer pod terminationGracePeriodSeconds. Useful when using preStop hooks to drain/sleep. | `10` |
| `actionsMetricsServer.lifecycle` | Set the actionsMetricsServer pod lifecycle hooks | `{}` |
| `actionsMetricsServer.service.type` | Set actionsMetricsServer service type | |
| `actionsMetricsServer.service.ports` | Set actionsMetricsServer service ports | `[{"port":80, "targetPort:"http", "protocol":"TCP", "name":"http"}]` |
| `actionsMetricsServer.service.loadBalancerSourceRanges` | Set actionsMetricsServer loadBalancerSourceRanges for restricting loadBalancer type services | `[]` |
| `actionsMetricsServer.ingress.enabled` | Deploy an ingress kind for the actionsMetricsServer | false |
| `actionsMetricsServer.ingress.annotations` | Set annotations for the ingress kind | |
| `actionsMetricsServer.ingress.hosts` | Set hosts configuration for ingress | `[{"host": "chart-example.local", "paths": []}]` |
| `actionsMetricsServer.ingress.tls` | Set tls configuration for ingress | |
| `actionsMetricsServer.ingress.ingressClassName` | Set ingress class name | |
| `actionsMetrics.serviceMonitor.enable` | Deploy serviceMonitor kind for for use with prometheus-operator CRDs | false |
| `actionsMetrics.serviceMonitor.interval` | Configure the interval that Prometheus should scrap the controller's metrics | 1m |
| `actionsMetrics.serviceMonitor.namespace` | Namespace which Prometheus is running in. | `Release.Namespace` (the default namespace of the helm chart). |
| `actionsMetrics.serviceMonitor.timeout` | Configure the timeout the timeout of Prometheus scrapping. | 30s |
| `actionsMetrics.serviceAnnotations` | Set annotations for the provisioned actions metrics service resource | |
| `actionsMetrics.port` | Set port of actions metrics service | 8443 |
| `actionsMetrics.proxy.enabled` | Deploy kube-rbac-proxy container in controller pod | true |
| `actionsMetrics.proxy.image.repository` | The "repository/image" of the kube-proxy container | quay.io/brancz/kube-rbac-proxy |
| `actionsMetrics.proxy.image.tag` | The tag of the kube-proxy image to use when pulling the container | v0.13.1 |
| `actionsMetrics.serviceMonitorLabels` | Set labels to apply to ServiceMonitor resources | |
description:HorizontalRunnerAutoscaler is the Schema for the horizontalrunnerautoscaler
API
properties:
apiVersion:
description:'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
versions:
- additionalPrinterColumns:
- jsonPath:.spec.minReplicas
name:Min
type:number
- jsonPath:.spec.maxReplicas
name:Max
type:number
- jsonPath:.status.desiredReplicas
name:Desired
type:number
- jsonPath:.status.scheduledOverridesSummary
name:Schedule
type:string
kind:
description:'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type:string
metadata:
type:object
spec:
description:HorizontalRunnerAutoscalerSpec defines the desired state of
HorizontalRunnerAutoscaler
name:v1alpha1
schema:
openAPIV3Schema:
description:HorizontalRunnerAutoscaler is the Schema for the horizontalrunnerautoscaler API
properties:
maxReplicas:
description:MinReplicas is the maximum number of replicas the deployment
is allowed to scale
type:integer
metrics:
description:Metrics is the collection of various metric targets to
calculate desired number of runners
items:
properties:
repositoryNames:
description:RepositoryNames is the list of repository names to
be used for calculating the metric. For example, a repository
name is the REPO part of `github.com/USER/REPO`.
items:
type:string
type:array
type:
description:Type is the type of metric to be used for autoscaling.
The only supported Type is TotalNumberOfQueuedAndInProgressWorkflowRuns
type:string
type:object
type:array
minReplicas:
description:MinReplicas is the minimum number of replicas the deployment
is allowed to scale
type:integer
scaleDownDelaySecondsAfterScaleOut:
description:ScaleDownDelaySecondsAfterScaleUp is the approximate delay
for a scale down followed by a scale up Used to prevent flapping (down->up->down->...
loop)
type:integer
scaleTargetRef:
description:ScaleTargetRef sis the reference to scaled resource like
RunnerDeployment
apiVersion:
description:|-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type:string
kind:
description:|-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type:string
metadata:
type:object
spec:
description:HorizontalRunnerAutoscalerSpec defines the desired state of HorizontalRunnerAutoscaler
properties:
name:
capacityReservations:
items:
description:|-
CapacityReservation specifies the number of replicas temporarily added
to the scale target until ExpirationTime.
properties:
effectiveTime:
format:date-time
type:string
expirationTime:
format:date-time
type:string
name:
type:string
replicas:
type:integer
type:object
type:array
githubAPICredentialsFrom:
properties:
secretRef:
properties:
name:
type:string
required:
- name
type:object
type:object
maxReplicas:
description:MaxReplicas is the maximum number of replicas the deployment is allowed to scale
type:integer
metrics:
description:Metrics is the collection of various metric targets to calculate desired number of runners
items:
properties:
repositoryNames:
description:|-
RepositoryNames is the list of repository names to be used for calculating the metric.
For example, a repository name is the REPO part of `github.com/USER/REPO`.
items:
type:string
type:array
scaleDownAdjustment:
description:|-
ScaleDownAdjustment is the number of runners removed on scale-down.
You can only specify either ScaleDownFactor or ScaleDownAdjustment.
type:integer
scaleDownFactor:
description:|-
ScaleDownFactor is the multiplicative factor applied to the current number of runners used
to determine how many pods should be removed.
type:string
scaleDownThreshold:
description:|-
ScaleDownThreshold is the percentage of busy runners less than which will
trigger the hpa to scale the runners down.
type:string
scaleUpAdjustment:
description:|-
ScaleUpAdjustment is the number of runners added on scale-up.
You can only specify either ScaleUpFactor or ScaleUpAdjustment.
type:integer
scaleUpFactor:
description:|-
ScaleUpFactor is the multiplicative factor applied to the current number of runners used
to determine how many pods should be added.
type:string
scaleUpThreshold:
description:|-
ScaleUpThreshold is the percentage of busy runners greater than which will
trigger the hpa to scale runners up.
type:string
type:
description:|-
Type is the type of metric to be used for autoscaling.
It can be TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy.
type:string
type:object
type:array
minReplicas:
description:MinReplicas is the minimum number of replicas the deployment is allowed to scale
type:integer
scaleDownDelaySecondsAfterScaleOut:
description:|-
ScaleDownDelaySecondsAfterScaleUp is the approximate delay for a scale down followed by a scale up
Used to prevent flapping (down->up->down->... loop)
type:integer
scaleTargetRef:
description:ScaleTargetRef is the reference to scaled resource like RunnerDeployment
properties:
kind:
description:Kind is the type of resource being referenced
enum:
- RunnerDeployment
- RunnerSet
type:string
name:
description:Name is the name of resource being referenced
type:string
type:object
scaleUpTriggers:
description:|-
ScaleUpTriggers is an experimental feature to increase the desired replicas by 1
on each webhook requested received by the webhookBasedAutoscaler.
This feature requires you to also enable and deploy the webhookBasedAutoscaler onto your cluster.
Note that the added runners remain until the next sync period at least,
and they may or may not be used by GitHub Actions depending on the timing.
They are intended to be used to gain "resource slack" immediately after you
receive a webhook from GitHub, so that you can loosely expect MinReplicas runners to be always available.
This project makes extensive use of CRDs to provide much of its functionality. Helm unfortunately does not support [managing](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/) CRDs by design:
_The full breakdown as to how they came to this decision and why they have taken the approach they have for dealing with CRDs can be found in [Helm Improvement Proposal 11](https://github.com/helm/community/blob/main/hips/hip-0011.md)_
```
There is no support at this time for upgrading or deleting CRDs using Helm. This was an explicit decision after much
community discussion due to the danger for unintentional data loss. Furthermore, there is currently no community
consensus around how to handle CRDs and their lifecycle. As this evolves, Helm will add support for those use cases.
```
Helm will do an initial install of CRDs but it will not touch them afterwards (update or delete).
Additionally, because the project leverages CRDs so extensively you **MUST** run the matching controller app container with its matching CRDs i.e. always redeploy your CRDs if you are changing the app version.
Due to the above you can't just do a `helm upgrade` to release the latest version of the chart, the best practice steps are recorded below:
## Steps
1. Upgrade CRDs, this isn't optional, the CRDs you are using must be those that correspond with the version of the controller you are installing
```shell
# REMEMBER TO UPDATE THE CHART_VERSION TO RELEVANT CHART VERISON!!!!
CHART_VERSION=0.18.0
curl -L https://github.com/actions/actions-runner-controller/releases/download/actions-runner-controller-${CHART_VERSION}/actions-runner-controller-${CHART_VERSION}.tgz | tar zxv --strip 1 actions-runner-controller/crds
kubectl replace -f crds/
```
Note that in case you're going to create prometheus-operator `ServiceMonitor` resources via the chart, you'd need to deploy prometheus-operator-related CRDs as well.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.