GitHub Actions Runner KillMode Analysis
Problem Statement
The question "is this a good idea?", asked about changing KillMode, comes down to whether the runner service's current systemd setting, KillMode=process, should be replaced with a different option.
Current Implementation
Systemd Service Configuration
- KillMode: process (only main process gets signal)
- KillSignal: SIGTERM
- TimeoutStopSec: 5min
Signal Handling Flow
- systemd sends SIGTERM to runsvc.sh (main process)
- runsvc.sh has trap: trap 'kill -INT $PID' TERM INT
- Converts SIGTERM → SIGINT and sends to Node.js runner process
- Node.js process handles graceful shutdown
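The forwarding step above can be sketched as a small POSIX shell wrapper. This illustrates the trap pattern quoted in the list and is not the contents of runsvc.sh; run_forwarding, child_pid, and status are hypothetical names, and the exit-status handling is deliberately simplified.

```shell
# run_forwarding CMD [ARGS...]
# Hypothetical helper: runs CMD in the background and re-sends any
# SIGTERM or SIGINT received by this shell to CMD as SIGINT.
run_forwarding() {
    "$@" &                      # start the child in the background
    child_pid=$!
    # Convert SIGTERM (and SIGINT) into SIGINT for the child.
    trap 'kill -INT "$child_pid" 2>/dev/null' TERM INT
    # wait returns early with a status above 128 when a trapped signal arrives.
    wait "$child_pid"
    status=$?
    trap - TERM INT
    if [ "$status" -gt 128 ]; then
        # Probably interrupted by the trap: wait again for the child's own exit.
        wait "$child_pid"
        status=$?
    fi
    return "$status"
}
```

One caveat worth keeping in mind: a non-interactive shell starts background children with SIGINT ignored, so the forwarded signal only takes effect in a child that installs its own SIGINT handler, which is what the flow above expects of the Node.js process.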
Analysis of Current Approach
Strengths
- Graceful Shutdown Control: Manual signal conversion allows proper Node.js shutdown handling
- Predictable Behavior: Only main process receives systemd signals
- Custom Logic: Allows for runner-specific shutdown procedures
- Signal Compatibility: SIGINT is more commonly handled by Node.js applications
Potential Issues
- Single Point of Failure: If runsvc.sh fails to forward signals, child processes are orphaned
- Complex Chain: More components in signal propagation path
- Process Tree Cleanup: Because systemd signals only the main process, deep process hierarchies may not be cleaned up as robustly as with a cgroup-wide kill
Orphan Process Context
The codebase reveals significant effort to handle orphan processes:
Evidence from Code Analysis
- JobExtension.cs: Dedicated orphan process cleanup mechanism
  - Tracks processes before/after job execution
  - Uses RUNNER_TRACKING_ID environment variable
  - Terminates orphan processes at job completion
- JobDispatcher.cs: Worker process orphan prevention
  - Explicit waits to prevent orphan worker processes
  - Handles "zombie worker" scenarios
- ProcessInvoker.cs: Process tree termination
  - Implements both Windows and Unix process tree killing
  - Signal escalation: SIGINT → SIGTERM → SIGKILL
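The SIGINT → SIGTERM → SIGKILL escalation can be illustrated for a single process with a short shell function. This is a sketch of the general technique only, not a port of ProcessInvoker.cs; stop_with_escalation and the per-signal grace period (in whole seconds) are assumptions made for the example.

```shell
# stop_with_escalation PID [GRACE_SECONDS]
# Hypothetical helper: sends SIGINT, then SIGTERM, then SIGKILL, polling
# for up to GRACE_SECONDS after each signal to see whether PID has exited.
stop_with_escalation() {
    target_pid=$1
    grace=${2:-5}
    for sig in INT TERM KILL; do
        # If the signal cannot be delivered, treat the process as already gone.
        kill -s "$sig" "$target_pid" 2>/dev/null || return 0
        waited=0
        while [ "$waited" -lt "$grace" ]; do
            kill -0 "$target_pid" 2>/dev/null || return 0
            sleep 1
            waited=$((waited + 1))
        done
    done
    # Succeed only if the process is no longer visible after SIGKILL.
    # (An unreaped zombie child still counts as visible to kill -0.)
    ! kill -0 "$target_pid" 2>/dev/null
}
```

For example, `stop_with_escalation "$pid" 10` would allow ten seconds after each signal before moving to the next one. It stops one process, not a tree; terminating descendants needs additional bookkeeping such as the tracking described above.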
Alternative KillMode Options
KillMode=control-group
Behavior: All processes in service's cgroup get SIGTERM, then SIGKILL after timeout
Pros:
- Robust cleanup of entire process tree
- Built-in systemd guarantees
- Simpler signal flow
- No dependency on runsvc.sh signal forwarding
Cons:
- Less control over shutdown sequence
- All processes get SIGTERM simultaneously
- May interrupt graceful shutdown of worker processes
KillMode=mixed
Behavior: Main process gets SIGTERM; the subsequent SIGKILL is sent to all remaining processes in the unit's control group
Pros:
- Combines benefits of both approaches
- Main process can handle graceful shutdown
- Systemd ensures process tree cleanup
- Fallback protection against orphan processes
Cons:
- More complex behavior
- Still depends on main process signal handling
Security and Reliability Considerations
Current Risks
- If runsvc.sh crashes before forwarding signals, Node.js process continues running
- Deep process trees from job execution may not be properly cleaned up
- Container processes might not receive proper termination signals
Reliability Improvements with control-group/mixed
- systemd guarantees process cleanup regardless of main process behavior
- Reduces risk of orphan processes surviving service shutdown
- More predictable behavior for administrators
Recommendation
Recommended Change: KillMode=mixed
Rationale:
- Maintains Graceful Shutdown: Main process (runsvc.sh) still receives SIGTERM first
- Adds Safety Net: systemd ensures cleanup if main process fails to handle signals
- Reduces Orphan Risk: Addresses the orphan process concerns evident in the codebase
- Better Process Tree Handling: More robust for complex job process hierarchies
- Container Compatibility: Better handling of containerized workloads
Implementation Impact
- Low Risk: Change only affects service shutdown behavior
- Backward Compatible: No changes to startup or normal operation
- Testable: Can be validated with process monitoring during service stops
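One low-impact way to trial the change is a systemd drop-in rather than an edit to the installed unit file. The function below is a minimal sketch under that assumption; write_killmode_dropin and the file name killmode.conf are hypothetical, and the target directory is passed in by the caller.

```shell
# write_killmode_dropin DROPIN_DIR MODE
# Hypothetical helper: writes a drop-in that overrides only KillMode,
# leaving every other setting of the unit untouched.
write_killmode_dropin() {
    dropin_dir=$1
    kill_mode=$2
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/killmode.conf" <<EOF
[Service]
KillMode=$kill_mode
EOF
}
```

On a real host the directory would be the unit's drop-in directory, /etc/systemd/system/<unit name>.d, where the unit name depends on how the runner was installed, and `systemctl daemon-reload` is needed afterwards for systemd to pick up the override. Removing the file and reloading reverts the change.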
Alternative Considerations
- KillMode=control-group could be considered if graceful shutdown proves problematic
- Current KillMode=process could remain if the signal forwarding is deemed reliable enough
Testing Recommendations
- Test service shutdown with various job types running
- Verify process cleanup with nested process trees
- Test container job termination scenarios
- Monitor for any regressions in graceful shutdown behavior
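The process-cleanup checks can be partly automated by looking at what remains in the service's cgroup after a stop. The sketch below assumes a cgroup v2 host, where each cgroup directory exposes its member PIDs in a cgroup.procs file; cgroup_leftovers is a hypothetical helper and the caller supplies the path to that file.

```shell
# cgroup_leftovers PROCS_FILE
# Hypothetical helper: prints the PIDs still listed in a cgroup.procs file.
# Returns 0 when nothing is left (or the cgroup is gone), 1 otherwise.
cgroup_leftovers() {
    procs_file=$1
    # A missing file means the cgroup was already removed.
    [ -e "$procs_file" ] || return 0
    leftovers=$(cat "$procs_file")
    [ -z "$leftovers" ] && return 0
    printf '%s\n' "$leftovers"
    return 1
}
```

The cgroup path for a unit can be read with `systemctl show -p ControlGroup --value <unit name>`. Running the check after `systemctl stop` under each KillMode, with different job types in flight, gives a direct comparison of which settings leave processes behind.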
Conclusion
Changing to KillMode=mixed would strike a good balance: it keeps the current graceful shutdown behavior and adds systemd's cleanup of the remaining processes in the control group. This addresses the orphan process concerns evident throughout the codebase while maintaining compatibility.