Commit Graph

958 Commits

Author SHA1 Message Date
Yusuke Kuoka
b5d1a63bdf Enhance the acceptance runnerset yaml template for manual E2E (#1587)
The primary goal of this change is to let the tester know about the config difference between the explicitly configured ephemeral work volume vs the automatically configured work volume with workVolumeClaimTemplate+containerMode=kubernetes.
2022-06-29 22:15:50 +09:00
Yusuke Kuoka
6f3e23973d Bump E2E runner version to 2.294.0 (#1586)
so that every runner does not result in auto-updating itself on startup in E2E, which makes E2E take longer to complete.
2022-06-29 22:05:50 +09:00
Yusuke Kuoka
a517c1ff66 Fix old runner pods stuck in Terminating since #1579 (#1585)
Ref #1579
2022-06-29 22:02:42 +09:00
Yusuke Kuoka
9b28e633c1 Drop support for --once (#1580)
Ref #1196
2022-06-29 21:49:52 +09:00
Yusuke Kuoka
8161136cbd Fix PercentageRunnersBusy scaling delay (#1579)
* Use a dedicated pod label to say it is a runner pod

Follow-up for #1546

* Fix PercentageRunnersBusy scaling delay

Ref #1374
2022-06-29 20:49:21 +09:00
Nikola Jokic
a9ac5a1cbf extracted validations to a single point (#1582) 2022-06-29 20:32:00 +09:00
Callum Tait
d4f35cff4f ci: add paths to push trigger (#1583) 2022-06-29 20:30:07 +09:00
Yusuke Kuoka
f661249f07 Use the go-github impl of ListRunnerGroups with visible_to_repository (#1578)
Ref #1402
2022-06-29 09:53:03 +01:00
Mike
73e430ce54 Add a solution to InternalError webhook context timedout (#1558)
* added troubleshooting solution

* added error example

* added entry to the pages index

* sorted

Co-authored-by: Mike Joseph <mike@Mikes-MacBook-Pro-5618.local>
Co-authored-by: Mike Joseph <mike@efrontier.com>
2022-06-29 09:40:23 +09:00
renovate[bot]
858ef8979d chore(deps): update helm/kind-action action to v1.3.0 (#1532)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2022-06-29 09:05:26 +09:00
renovate[bot]
1ce0a183a6 chore(deps): update azure/setup-helm action to v3 (#1571)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2022-06-29 09:04:33 +09:00
renovate[bot]
63935d2053 fix(deps): update module github.com/stretchr/testify to v1.7.5 (#1510)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2022-06-29 08:39:09 +09:00
John Delivuk
fc63d6d26e Fix: Match Ingress API Version correctly. (#1541)
* Updating conditional to match the api version and kind

mend

* Updating conditional to match the api version and kind

mend
2022-06-29 08:30:11 +09:00
renovate[bot]
5ea08411e6 chore(deps): update dependency golang to v1.18.3 (#1509)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
2022-06-29 08:29:14 +09:00
Giuseppe Crinò
067ed2e5ec docs: fix logic explanation for scale down delay (#1562)
Signed-off: Giuseppe Crinò <giuscri@gmail.com>
2022-06-29 08:26:28 +09:00
renovate[bot]
d86bd2bcd7 fix(deps): update module sigs.k8s.io/controller-runtime to v0.12.2 (#1449)
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.12.2

* Regenerate manfiests with the updated k8s and controller-runtime deps

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-29 06:42:17 +09:00
Yusuke Kuoka
ddd417f756 Bump go-github to v45 (#1573)
* Bump go-github to v45

Ref #1402

* fixup! Bump go-github to v45
2022-06-29 06:34:58 +09:00
Thomas Boop
0386c0734c containerMode option to allow running jobs in k8's instead of docker (#1546)
* added containerMode=kubernetes env variables to the runner

* removed unused logging

* restored configs and charts

* restored makefile cert version and acceptance/run

* added workVolumeClaimTemplate in pod definition, including logic

* added claim template name based on the runner

* Apply suggestions from code review

update errors

* added concurrent cleanup before runner pod is deleted

* update manifests

* added retry after 30s if pod cleanup contains err

* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s

* style changes and added comments

* added izZero timestamp check for deleting runner-linked pods

* changed order of local variable to avoid copy if p is deleted

* removed docker from container mode k8s

* restored charts, config, makefile

* restored forked files back and not the ARC ones

* created PersistentVolume on containerMode k8s

* create pv only if storage class name is local-storage

* removed actions if storage class name is local-storage

* added service account validation if container mode kubernetes

* changed the coding style to match rest of the ARC

* added validation to the runnerdeployment webhook

* specified fields more precisely, added webhook validation to the replicaset as well

* remake manifests

* wraped delete runner-linked-pods in kube mode

* fixed empty line

* fixed import

* makefile changes for hooks

* added cleanup secrets

* create manifests

* docs

* update access modes

* update dockerfile

* nit changes

* fixed dockerfile

* rewrite allowing reuse for runners and runnersets

* deepcopy forgot to stage

* changed privileged

* make manifests

* partly moved to finalizer, still need to apply finalizer first

* finalizer added if env variable used in container mode exists

* bump runner version

* error message moved from Error to Info on cleanup pods/secrets

* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate

* Apply suggestions from code review

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* add hook version to cli, update to 0.1.2

* Apply suggestions from code review

* Update controllers/utils_test.go

* Update runner/Makefile

* Fix missing secret permission and the error handling

* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry

Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-28 14:12:40 +09:00
Yusuke Kuoka
af96de6184 Fix completed runner pod recreation not to be blocked after max out (#1568)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1477#issuecomment-1164154496
2022-06-28 13:50:07 +09:00
Arnaud
abb8615796 Webhook server configuration with kustomize (#1312)
* webhook server configuration with kustomize

* Update README.md

* Update README.md

* Update README.md

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-28 09:08:25 +09:00
Sam Weston
bc7a3cab1b Add priorityClassName to CRDs (#1513)
* Add pod priorityClassName to controller and crds

* Add missing bits in bases directory

* Regenerate crds
2022-06-28 08:45:19 +09:00
Yusuke Kuoka
e2c8163b8c Make webhook-based scale race-free (#1477)
* Make webhook-based scale operation asynchronous

This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both ended up
scaling up the same horizontal runner autoscaler.

Ref #1321

* Fix typos

* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers

* Batch capacity reservation updates for efficient use of apiserver

* Fix potential never-ending HRA update conflicts in batch update

* Extract batchScaler out of webhook-based autoscaler for testability

* Fix log levels and batch scaler hang on start

* Correlate webhook event with scale trigger amount in logs

* Fix log message
2022-06-27 18:31:48 +09:00
Callum Tait
84d16c1c12 revert: "Overhauled startup.sh Script (#1454)" (#1561)
This reverts commit 071898c96b.
2022-06-23 12:39:32 +01:00
Richard Fussenegger
071898c96b Overhauled startup.sh Script (#1454)
This overhaul turns it into a shellcheck valid script with explicit error handling for all possible situations I could think of. This change takes https://github.com/actions-runner-controller/actions-runner-controller/pull/1409 into account and things can be merged in any order. There are a few important changes here to the logic:

- The wait logic for checking if docker comes up was fundamentally flawed because it checks for the PID. Docker will always come up and thus become visible in the process list, just to immediately die when it encounters an issue, after which supervisor starts it again. This means that our check so far is flaky due to the `sleep 1` it might encounter a PID, or it might not, and the existence of the PID does not mean anything. The `docker ps` check we have in the `entrypoint.sh` script does not suffer from this as it checks for a feature of docker and not a PID. I thus entirely removed the PID check, and instead I am handing things over to our `entrypoint.sh` script by setting the environment variables correctly.
- This change has an influence on the `docker0` interface MTU configuration, because the interface might or might not exist after we started docker. Hence, I changed this to a time boxed loop that tries for one minute to set up the interface's MTU. In case the command fails we log an error and continue with the run.
- I changed the entire MTU handling by validating its value before configuring it, logging an error and continuing without if it is set incorrectly. This ensures that we are not going to send our users on a bug hunt.
- The way we started supervisord did not make much sense to me. It sends itself into the background automatically, there is no need for us to do so with Bash.

The decision to not fail on errors but continue is a deliberate choice, because I believe that running a build is more important than having a perfectly configured system. However, this strategy might also hide issues for all users who are not properly checking their logs. It also makes testing harder. Hence, we could change all error conditions from graceful to panicking. We should then align the exit codes across `startup.sh` and `entrypoint.sh` to ensure that every possible error condition has its own unique error code for easy debugging.
2022-06-23 09:37:01 +09:00
renovate[bot]
f24e2fa44e chore(deps): update dependency actions/runner to v2.294.0 2022-06-22 21:45:32 +00:00
Callum Tait
3c7d3d6b57 ci: hardcode dockerhub username (#1555) 2022-06-22 16:15:50 +01:00
Callum Tait
23f091d7fa ci: don't login on a pr (#1554)
* ci: don't login on a pr

Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
2022-06-22 16:03:36 +01:00
Callum Tait
667764e027 chore: suggest gist first (#1539) 2022-06-18 17:38:37 +09:00
Callum Tait
de693c4191 ci: runners trigger on push (#1549)
* ci: runners trigger on push

* ci: comments

* ci: comments
2022-06-18 17:34:40 +09:00
Callum Tait
510fc9c834 ci: add GitHub packages to arc release (#1525)
* ci: add GitHub packages to arc release

* ci: use restrictive permissions
2022-06-15 11:37:19 +09:00
Callum Tait
7fd5e24961 chore: bump chart to app 0.24.1 (#1531) actions-runner-controller-0.19.1 2022-06-15 11:34:55 +09:00
Yusuke Kuoka
9974b1a2b7 e2e: Enable buildx in more images (#1530) 2022-06-14 09:29:30 +01:00
Yusuke Kuoka
bd91b73fd9 chore: update bug_report.yml (#1529) 2022-06-14 09:29:06 +01:00
Callum Tait
a7ae910ee4 docs: add CRD disclaimer to bug report (#1516) 2022-06-14 09:42:31 +09:00
Callum Tait
2733c36d0e ci: publish controller canary to github packages (#1524)
* ci: publish controller canary to github packages

* ci: include image name
2022-06-14 09:10:13 +09:00
Yusuke Kuoka
0ef9a22cd4 Fix confusing PV controller log (#1526)
Ref #1511
v0.24.1
2022-06-14 08:35:04 +09:00
Renovate Bot
933b0c7888 chore(deps): update dependency actions/runner to v2.293.0 2022-06-13 17:09:29 +00:00
renovate[bot]
1b7ec33135 chore(deps): update actions/setup-python action to v4 (#1514)
Co-authored-by: Renovate Bot <bot@renovateapp.com>
2022-06-13 14:07:52 +01:00
Callum Tait
a62882d243 ci: fix permisions (#1512)
* ci: fix permisions

* chore: change to trigger build

* ci: add write permission to packages

* ci: remove conditionals for docker logins

* Update controllers/utils_test.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-09 10:25:56 +09:00
Callum Tait
0cd13fe51d ci: align pipeline files and setups (#1484)
* ci: align pipeline files and setups

* ci: more changes

* ci: various changes

* ci: fix setup-helm action ref

* ci: better pipeline name

* ci: more format aligning

* ci: more format aligning

* ci: better job name

* ci: supports multiple languages

* ci: better pipeline and job names

* ci: do a verb-noun thing for consistency

* ci: use 'arc' when talking holistically

* ci: add caching scope

* ci:  put canary in a scope

* ci: fix syntax error

* ci: better pipeline and job names

* ci: better job name

Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
2022-06-08 10:04:14 +09:00
Vinícius Garcia
01c8dc237e Fix example manifests for webhooks-based scaling (#1354)
* Fix example manifests for webhook based scaling

I tried running these on my k8s cluster and I got some easy to fix errors, so I am committing them here.

* Fix example manifests for webhook autoscaling with workflow_jobs

* Fix the explation on how to setup webhooks on your cluster

* Replace unclear comment with actual code examples

There was a comment instructing users to add minReplicas and
maxReplicas to all the HRA yamls, so I just removed it and added
these attributes to the yamls themselves for clarity.

* Make clear that using the ingress example is just a suggestion

* Apply some text improvements suggested by @mumoshu

* Update examples so the webhook server is exposed on a NodePort

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

* Remove an unnecessary field from one the examples

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

* Apply suggestion from @mumoshu

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

* Remove namespace fields from webhook autoscaler examples

This change was suggested by @mumoshu

* Apply final suggestion from @mumoshu

Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-07 08:33:09 +09:00
Yusuke Kuoka
7c4db63718 chart: Bump appVersion to 0.24.0 (#1505) actions-runner-controller-0.19.0 2022-06-03 22:01:35 +09:00
Yusuke Kuoka
3d88b9630a doc: Add "people" section (#1498)
Ref #1497
v0.24.0
2022-05-31 09:29:15 +09:00
Yusuke Kuoka
1152e6b31d Add release note for v0.24.0 (#1493) 2022-05-30 09:10:36 +09:00
renovate[bot]
ac27df8301 chore(deps): update dependency actions/runner to v2.292.0 (#1475)
Co-authored-by: Renovate Bot <bot@renovateapp.com>
2022-05-27 09:49:46 +09:00
Yusuke Kuoka
9dd26168d6 Fix confusing logs from pv and pvc controllers (#1487)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212
2022-05-26 18:21:23 +09:00
Yusuke Kuoka
18bfb28c0b e2e: ARC_E2E_NO_CLEANUP to prevent cleanup (#1470)
A small improvement to our E2E test suite which allows you to set `ARC_E2E_NO_CLEANUP=whatever` to let it prevent the kind cluster cleanup on successful test run, so that you can rerun it without waiting for the new kind cluster to come up.
2022-05-26 10:59:50 +09:00
renovate[bot]
84210e900b chore(deps): update actions/setup-python digest to fff15a2 (#1458)
Co-authored-by: Renovate Bot <bot@renovateapp.com>
2022-05-25 12:12:22 +09:00
Yusuke Kuoka
ef3313d147 doc: Use RunnerSet to retain various cache by leveraging PV (#1464)
* doc: Use RunnerSet to retain various cache

In relation to #1286 and as a follow-up for #1340

* docs: clarify client vs daemon

* docs: better wording

* Separate RunnerSet examples for docker iimage layer caching

* Revert changes on testdata as it is going to be added via #1471 instead

* Update README.md

Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>

* fixup! Update README.md

* Remove the outdated RunnerSet limitation

Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
2022-05-25 11:09:36 +09:00
Yusuke Kuoka
c7eea169ad test: Add test for runner with generic ephemeral volume as "work" (#1472)
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.
2022-05-25 10:37:23 +09:00