Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204)

Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.

This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.

The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
This commit is contained in:
Yusuke Kuoka
2022-03-11 19:03:17 +09:00
committed by GitHub
parent 736a53fed6
commit 9628bb2937
3 changed files with 68 additions and 21 deletions

View File

@@ -18,6 +18,10 @@ const (
// AnnotationKeyUnregistrationCompleteTimestamp is the annotation that is added onto the pod once the previously started unregistration process has been completed.
AnnotationKeyUnregistrationCompleteTimestamp = annotationKeyPrefix + "unregistration-complete-timestamp"
// AnnotationKeyRunnerCompletionWaitStartTimestamp is the annotation that is added onto the pod when
// ARC decided to wait until the pod to complete by itself, without the need for ARC to unregister the corresponding runner.
AnnotationKeyRunnerCompletionWaitStartTimestamp = annotationKeyPrefix + "runner-completion-wait-start-timestamp"
// unregistarionStartTimestamp is the annotation that contains the time that the requested unregistration process has been started
AnnotationKeyUnregistrationStartTimestamp = annotationKeyPrefix + "unregistration-start-timestamp"