Commit Graph

326 Commits

Author SHA1 Message Date
Nuru
9bd4025e9c Stricter filtering of check run completion events (#2520)
I observed that 100% of canceled jobs in my runner pool were not causing scale down events. This PR fixes that.

The problem was caused by #2119. 

#2119 ignores certain webhook events in order to fix #2118. However, #2119 overdoes it and filters out valid job cancellation events. This PR uses stricter filtering and add visibility for future troubleshooting.

<details><summary>Example cancellation event</summary>

This is the redacted top portion of a valid cancellation event my runner pool received and ignored.

```json
{
  "action": "completed",
  "workflow_job": {
    "id": 12848997134,
    "run_id": 4738060033,
    "workflow_name": "slack-notifier",
    "head_branch": "auto-update/slack-notifier-0.5.1",
    "run_url": "https://api.github.com/repos/nuru/<redacted>/actions/runs/4738060033",
    "run_attempt": 1,
    "node_id": "CR_kwDOB8Xtbc8AAAAC_dwjDg",
    "head_sha": "55bada8f3d0d3e12a510a1bf34d0c3e169b65f89",
    "url": "https://api.github.com/repos/nuru/<redacted>/actions/jobs/12848997134",
    "html_url": "https://github.com/nuru/<redacted>/actions/runs/4738060033/jobs/8411515430",
    "status": "completed",
    "conclusion": "cancelled",
    "created_at": "2023-04-19T00:03:12Z",
    "started_at": "2023-04-19T00:03:42Z",
    "completed_at": "2023-04-19T00:03:42Z",
    "name": "build (arm64)",
    "steps": [

    ],
    "check_run_url": "https://api.github.com/repos/nuru/<redacted>/check-runs/12848997134",
    "labels": [
      "self-hosted",
      "arm64"
    ],
    "runner_id": 0,
    "runner_name": "",
    "runner_group_id": 0,
    "runner_group_name": ""
  },
```

</details>
2023-04-27 13:15:23 +09:00
Yusuke Kuoka
94c089c407 Revert docker.sock path to /var/run/docker.sock (#2536)
Starting ARC v0.27.2, we've changed the `docker.sock` path from `/var/run/docker.sock` to `/var/run/docker/docker.sock`. That resulted in breaking some container-based actions due to the hard-coded `docker.sock` path in various places.

Even `actions/runner` seem to use `/var/run/docker.sock` for building container-based actions and for service containers?

Anyway, this fixes that by moving the sock file back to the previous location.

Once this gets merged, users stuck at ARC v0.27.1, previously upgraded to 0.27.2 or 0.27.3 and reverted back to v0.27.1 due to #2519, should be able to upgrade to the upcoming v0.27.4.

Resolves #2519
Resolves #2538
2023-04-27 13:06:35 +09:00
Nikola Jokic
9859bbc7f2 Use build.Version to check if resource version is a mismatch (#2521)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2023-04-24 10:40:15 +02:00
Yusuke Kuoka
3e0bc3f7be Fix docker.sock permission error for non-dind Ubuntu 20.04 runners since v0.27.2 (#2499)
#2490 has been happening since v0.27.2 for non-dind runners based on Ubuntu 20.04 runner images. It does not affect Ubuntu 22.04 non-dind runners(i.e. runners with dockerd sidecars) and Ubuntu 20.04/22.04 dind runners(i.e. runners without dockerd sidecars). However, presuming many folks are still using Ubuntu 20.04 runners and non-dind runners, we should fix it.

This change tries to fix it by defaulting to the docker group id 1001 used by Ubuntu 20.04 runners, and use gid 121 for Ubuntu 22.04 runners. We use the image tag to see which Ubuntu version the runner is based on. The algorithm is so simple- we assume it's Ubuntu-22.04-based if the image tag contains "22.04".

This might be a breaking change for folks who have already upgraded to Ubuntu 22.04 runners using their own custom runner images. Note again; we rely on the image tag to detect Ubuntu 22.04 runner images and use the proper docker gid- Folks using our official Ubuntu 22.04 runner images are not affected. It is a breaking change anyway, so I have added a remedy-

ARC got a new flag, `--docker-gid`, which defaults to `1001` but can be set to `121` or whatever gid the operator/admin likes. This can be set to `--docker-gid=121`, for example, if you are using your own custom runner image based on Ubuntu 22.04 and the image tag does not contain "22.04".

Fixes #2490
2023-04-17 21:30:41 +09:00
Nikola Jokic
ba1ac0990b Reordering methods and constants so it is easier to look it up (#2501) 2023-04-12 09:50:23 +02:00
Nikola Jokic
b86af190f7 Extend manager roles to accept ephemeralrunnerset/finalizers (#2493) 2023-04-10 08:49:32 +02:00
Nikola Jokic
a804bf8b00 Add ImagePullPolicy to the AutoscalingListener, configurable through Manager env (#2477) 2023-04-04 19:07:20 +02:00
Nikola Jokic
5dea6db412 Fix helm uninstall cleanup by adding finalizers and cleaning them from the controller (#2433)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-04-03 21:06:12 +02:00
Stewart Thomson
a7ef871248 Check if appID and instID are non-empty before attempting to parseInt (#2463) 2023-04-03 09:06:59 +09:00
Milas Bowman
878c9b8b49 runner: Use Docker socket via shared emptyDir instead of TCP/mTLS (#2324)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-03-28 11:29:16 +09:00
Nikola Jokic
56e1c62ac2 Add labels to autoscaling runner set subresources to allow easier inspection (#2391)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-03-27 11:19:34 +02:00
Tingluo Huang
08acb1b831 Get RunnerScaleSet based on both RunnerGroupId and Name. (#2413) 2023-03-15 11:10:09 -04:00
Tingluo Huang
2bf83d0d7f Remove list/watch secrets permission from the manager cluster role. (#2276) 2023-03-14 09:23:14 -04:00
Tingluo Huang
261d4371b5 Update E2E test workflow. (#2395) 2023-03-14 09:00:07 -04:00
Nikola Jokic
babbfc77d5 Surface EphemeralRunnerSet stats to AutoscalingRunnerSet (#2382) 2023-03-13 16:16:28 +01:00
Tingluo Huang
d7b589bed5 Helm chart react changes for the new runner image. (#2348) 2023-03-10 11:18:21 +00:00
Francesco Renzi
c569304271 Add support for self-signed CA certificates (#2268)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-03-09 17:23:32 +00:00
Chris Patterson
41f2ca3ed9 Adding parameter to configure the runner set name. (#2279)
Co-authored-by: TingluoHuang <TingluoHuang@github.com>
2023-03-03 08:36:14 -05:00
Francesco Renzi
40c905f25d Simplify the setup of controller tests (#2352) 2023-03-02 18:55:49 +00:00
Nikola Jokic
2984de912c Split listener pod label to avoid long names issue (#2341) 2023-03-02 17:25:50 +01:00
Alex Williams
69abd51f30 Ensure that EffectiveTime is updated on webhook scale down (#2258)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-03-01 08:27:37 +09:00
Francesco Renzi
73e22a1756 Disable metrics serving in proxy tests (#2307) 2023-02-22 16:57:59 +00:00
Francesco Renzi
6b4250ca90 Add support for proxy (#2286)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
2023-02-21 17:33:48 +00:00
Nathan Klick
ced88228fc Resolves the erroneous webhook scale down due to check runs (#2119)
Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
2023-02-21 10:56:46 +09:00
Andrei Vydrin
44c06c21ce fix: case-insensitive webhook label matching (#2302)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-02-21 09:37:42 +09:00
Nikola Jokic
8e52a6d2cf EphemeralRunner: On cleanup, if pod is pending, delete from service (#2255)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-02-11 19:55:12 -05:00
Nikola Jokic
9990243520 Early return if finalizer does not exist to make it more readable (#2262) 2023-02-08 15:21:13 +01:00
Tingluo Huang
facae69e0b Remove un-required permissions for the manager-role of the new AutoScalingRunnerSet (#2260) 2023-02-07 12:37:09 -05:00
Nikola Jokic
c4297d25bb Avoid deleting scale set if annotation is not parsable or if it does not exist (#2239) 2023-02-03 17:27:31 +01:00
Tingluo Huang
1f4fe4681e Delete RunnerScaleSet on service when AutoScalingRunnerSet is deleted. (#2223) 2023-01-31 15:03:11 -05:00
Tingluo Huang
803818162c Allow update runner group for AutoScalingRunnerSet (#2216) 2023-01-27 09:27:52 -05:00
dependabot[bot]
219ba5b477 chore(deps): bump sigs.k8s.io/controller-runtime from 0.13.1 to 0.14.1 (#2132)
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-01-27 09:23:28 +09:00
Nikola Jokic
882bfab569 Renaming autoScaling to autoscaling in tests matching the convention (#2201) 2023-01-23 17:03:01 +01:00
Tingluo Huang
4932412cd6 Fix L0 test to make it more reliable. (#2178) 2023-01-19 07:33:04 -05:00
Tingluo Huang
bb61bb1342 Include extra user-agent for runners created by actions-runner-controller. (#2177) 2023-01-18 07:38:59 +09:00
Tingluo Huang
622eaa34f8 Introduce new preview auto-scaling mode for ARC. (#2153)
Co-authored-by: Cory Miller <cory-miller@github.com>
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Ava Stancu <AvaStancu@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
Co-authored-by: Francesco Renzi <rentziass@github.com>
Co-authored-by: Bassem Dghaidi <Link-@github.com>
2023-01-17 12:06:20 -05:00
Tingluo Huang
044c8ad4d5 Include actions-runner-controller in runner's User-Agent for better telemetry in Actions service. (#2155) 2023-01-15 09:35:56 +09:00
Tingluo Huang
eaa451df32 Update controller package names to match the owning API group name (#2150)
* Update controller package names to match the owning API group name

* feedback.

Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2023-01-13 08:24:11 +09:00
Yusuke Kuoka
bc4f4fee12 Fix various golangci-lint errors (#2147)
that we introduced via controller-runtime upgrade and via the removal of legacy pull-based scale triggers (#2001).
2023-01-13 07:14:36 +09:00
Nikola Jokic
aa6dab5a9a Changes to folder structure to allow multigroups and changed go mod name (#2105)
* Changed folder structure to allow multi group registration

* included actions.github.com directory for resources and controllers

* updated go module to actions/actions-runner-controller

* publish arc packages under actions-runner-controller

* Update charts/actions-runner-controller/docs/UPGRADING.md

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-12-28 09:38:34 +09:00
Callum Tait
418f719bdf chore: highlight watch namespace (#2087)
* chore: highlight watch namespace

* chore: wording

Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
2022-12-12 08:39:04 +09:00
Yusuke Kuoka
96a930bfd9 Fix runner pod to not stuck in Terminating when runner got deleted before pod scheduling (#2043)
This fixes the said issue that I found while I was running a series of E2E tests to test other features and pull requestes I have recently contributed.
2022-11-27 11:13:38 +09:00
Yusuke Kuoka
ae86b1a011 Use the patch API instead to prevent unnecessary field updates (#1998)
Fixes #1916
2022-11-22 12:09:24 +09:00
Yusuke Kuoka
86d7893d61 breaking: Make legacy webhook scale triggers no-op (#2001)
Ref #1607
2022-11-22 12:08:29 +09:00
Igor Sarkisov
8f374d561f Do not explicitly set Privileged to false. (#2009)
Setting SecurityContext.Privileged bit to false, which is default,
prevents GKE from admitting Windows pods.  Privileged bit is not
supported on Windows.
2022-11-15 11:29:37 +09:00
malachiobadeyi
fbdfe0df8c 1770 update log format and add additional fields to webhook server logs (#1771)
* 1770 update log format and add runID and Id to worflow logs

update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook

use logging package

remove unused modules

add setup name to setuplog

add flag to change log format

change flag name to enableProdLogConfig

move log opts to logger package

remove empty else and reset timeEncoder

update flag description

use get function to handle nil

rename flag and update logger function

Update main.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

Update controllers/horizontal_runner_autoscaler_webhook.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

Update logging/logger.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

copy log opt per each NewLogger call

revert to use autoscaler.log

update flag descript and remove unused imports

add logFormat to readme

 rename setupLog to logger

make fmt

* Fix E2E along the way

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-11-04 10:46:58 +09:00
Yusuke Kuoka
23c8fe4a8b Fix dead-lock when runner unregistration triggered before PV attachment (#1975)
This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.
2022-11-04 06:29:19 +09:00
Yusuke Kuoka
c74ad6195f Fix runners to do their best to gracefully stop on pod eviction (#1759)
Ref #1535
Ref #1581

Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-11-01 20:30:10 +09:00
Yusuke Kuoka
710e2fbc3a Prevent runner controller from recreating runner pod when pod was terminated externally (#1851) 2022-10-13 09:04:50 +09:00
Yusuke Kuoka
7ff5b7da8c Handle missing runner ID more gracefully (#1855)
so that ARC respect the registration timeout, terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT(#1759) when the runner pod was terminated externally too early after its creation

While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timeout of 10 minutes.

You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
2022-10-09 16:52:51 +09:00