Production Runbook
Audience: SRE
For initial setup steps see Getting Started. For detailed symptom → diagnosis steps see Troubleshooting.
Day-2 Operations
Adding a Tenant
- Ensure the tenant namespace exists: `kubectl get namespace <namespace>`.
- Have the tenant create the GitHub App Secret in their namespace. See Getting Started §2.
- Have the tenant create the `ActionsGateway` CR. See Getting Started §3.
- Confirm the GMC has provisioned resources within ~30 seconds.
- Confirm the `Ready=True` condition on the `ActionsGateway` CR.
No cluster-admin involvement is required after initial GMC deployment.
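The final confirmation step can be scripted with `kubectl wait`. The sketch below assumes the CRD's resource name resolves as `actionsgateway` and that the `Ready` condition is published under `.status.conditions`. The function name and the 60-second default timeout are illustrative choices, not part of the gateway's tooling.

```shell
# Block until the ActionsGateway reports Ready=True, or fail after a timeout.
# Usage: wait_for_gateway <namespace> <gateway-name> [timeout]
# Assumption: `actionsgateway` is the resource name kubectl resolves for the CRD.
wait_for_gateway() {
  ns="$1"; name="$2"; timeout="${3:-60s}"
  kubectl wait --for=condition=Ready "actionsgateway/$name" \
    -n "$ns" --timeout="$timeout"
}
```

A non-zero exit means the condition did not become `True` within the timeout, which is the cue to move on to Troubleshooting.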
Adjusting Tenant Quota
The namespace ResourceQuota is platform-owned; it is not a field on the ActionsGateway CR. Edit the ResourceQuota object on the tenant namespace directly (or through your GitOps / tenant-operator stack, if that is what manages it).
The change takes effect immediately. Running jobs are not interrupted; the new quota applies on the next pod creation attempt. The gateway reads remaining quota and reacts to exhaustion (it fast-cancels and reruns quota-blocked jobs) but never writes the quota itself.
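As a minimal sketch of a direct edit, the function below raises a quota limit with `kubectl patch`. The limited resource (`pods`), the quota name, and the function name are assumptions for illustration; substitute whichever `spec.hard` entries your platform quota actually defines.

```shell
# Raise the pod-count limit on a platform-owned ResourceQuota.
# Usage: set_pod_quota <namespace> <quota-name> <pods>
# Assumption: the quota limits `pods`; other spec.hard keys follow the same shape.
set_pod_quota() {
  ns="$1"; quota="$2"; pods="$3"
  kubectl patch resourcequota "$quota" -n "$ns" --type merge \
    -p "{\"spec\":{\"hard\":{\"pods\":\"$pods\"}}}"
}
```

After patching, `kubectl describe resourcequota -n <namespace>` shows the new hard limit next to current usage. If GitOps manages the quota, make the change in the repository instead, or the next sync will revert it.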
Scaling maxListeners
Change maxListeners on the ActionsGateway CR. The GMC propagates the change to the RunnerGroup CR, and the AGC reconciles the new ceiling on its next reconcile cycle (a few seconds). No restart is needed.
Rotating GitHub App Credentials
See Getting Started — Rotating GitHub App Credentials for the full procedure.
In brief: create a new Secret with the new private key, then change spec.gitHubAppRef.name in the ActionsGateway CR to reference the new Secret. The GMC detects the Secret reference change and rolls the AGC Deployment. Do not update the existing Secret in-place; the GMC does not watch Secret contents, only the reference.
Alerting
Reference the SLO targets in Appendix A for threshold derivation.
The alerts below cover availability and SLO breaches. For abuse and compromise detection (eviction-retry loops, proxy slowloris, credential harvesting), see security-operations.md.
Which Metrics to Alert On
| Metric | Recommended threshold | Severity | Notes |
|---|---|---|---|
| `actions_gateway_token_refresh_errors_total` | rate > 1/hour per namespace | Page | Token expiry causes session failures within ~1 hour |
| `actions_gateway_renewjob_errors_total` | rate > 5/minute per namespace | Page | Sustained failures cancel running jobs |
| `actions_gateway_pod_creation_latency_seconds` p95 | > 15s | Ticket | SLO target from Appendix A |
| `actions_gateway_pod_creation_latency_seconds` p99 | > 60s | Page | Indicates scheduling stall or quota exhaustion |
| `actions_gateway_eviction_retries_exhausted_total` | rate > 0 | Ticket | Each increment requires a manual re-run |
| `actions_gateway_active_sessions` | = 0 for a RunnerGroup | Page | No listener polling; jobs queue indefinitely |
| `actions_gateway_reconcile_errors_total` | rate > 1/5min | Ticket | Persistent reconcile failure; resources may be stale |
| ActionsGateway condition `RateLimited=True` | duration > 10 minutes | Page | Installation is over API budget |
| Proxy HPA `TARGETS: <unknown>` | any | Ticket | HPA metric broken; autoscaling not working |
| AGC pod OOMKilled | any | Page | AGC has no active sessions while restarting |
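The thresholds in the table translate fairly directly into PromQL. The helper below is a sketch of three of them, under stated assumptions: the counters carry a `namespace` label, the latency metric is a Prometheus histogram exposing `_bucket` series, and the 5-minute rate window is an arbitrary starting point. The function name and alert keys are hypothetical; check the Observability reference for the real label set before adopting any expression.

```shell
# Print a PromQL expression for one of the alert thresholds in the table.
# Assumptions: counters carry a `namespace` label; the latency metric is a
# histogram with `_bucket` series; the [5m] window is an illustrative choice.
alert_expr() {
  case "$1" in
    token-refresh)
      # More than one token refresh error per hour, per namespace.
      printf '%s\n' 'sum by (namespace) (increase(actions_gateway_token_refresh_errors_total[1h])) > 1' ;;
    renewjob)
      # rate() is per second; multiply by 60 to compare against 5/minute.
      printf '%s\n' 'sum by (namespace) (rate(actions_gateway_renewjob_errors_total[5m])) * 60 > 5' ;;
    latency-p95)
      # p95 pod creation latency above the 15s SLO target.
      printf '%s\n' 'histogram_quantile(0.95, sum by (le, namespace) (rate(actions_gateway_pod_creation_latency_seconds_bucket[5m]))) > 15' ;;
    *)
      printf 'unknown alert: %s\n' "$1" >&2
      return 1 ;;
  esac
}
```

To try an expression against a Prometheus server, pass it to the HTTP query API, for example `curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(alert_expr renewjob)"`, where `PROM_URL` is a placeholder for your Prometheus base URL.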
Page-Worthy vs. Ticket-Worthy
Page (requires immediate response, typically < 15 minutes):
- active_sessions = 0 — no jobs can be acquired until fixed.
- renewjob_errors_total rate high — jobs will be cancelled.
- token_refresh_errors_total spiking — token will expire within ~1 hour.
- pod_creation_latency p99 > 60s — scheduling is stalled.
- RateLimited condition > 10 minutes — installation is over budget.
- AGC pod in OOMKilled / CrashLoopBackOff.
Ticket (respond within next business day):
- pod_creation_latency p95 > 15s — degraded but jobs are completing.
- eviction_retries_exhausted_total incrementing — jobs require manual re-run.
- reconcile_errors_total non-zero — investigate before it becomes a page.
- HPA metric unknown — autoscaling broken; proxy may not handle burst load.
SLO Breach Response
pod_creation_latency_seconds p95 > 15s
- Check for quota exhaustion: `kubectl describe resourcequota -n <namespace>`.
- Check for pending pods: `kubectl get pods -n <namespace> | grep Pending`.
- Describe a pending pod for scheduling events: `kubectl describe pod -n <namespace> <pod>`.
- If quota is exhausted: raise the platform-owned `ResourceQuota` on the namespace (`kubectl edit resourcequota -n <namespace> <quota-name>`) or wait for running pods to complete.
- If no schedulable nodes: check the node autoscaler or provision capacity.
- If PriorityClass is missing: create it. See Troubleshooting — Worker Pods Stuck Pending.
active_sessions Flatlining at Zero
- Check AGC pod status: `kubectl get pod -n <namespace> -l app=actions-gateway-controller`.
- Check AGC logs: `kubectl logs -n <namespace> deploy/actions-gateway-controller --tail=100`.
- Check RunnerGroup conditions: `kubectl get runnergroup -n <namespace> -o yaml`.
- If the pod is in `CrashLoopBackOff` or `Error`: see AGC CrashLoopBackOff in Troubleshooting.
- If the pod is running but sessions are zero: check for token errors (see Token Refresh Errors) and network connectivity (see Network Connectivity Failures).
jobs_acquired_total Stops Incrementing
- Verify jobs are actually queued: check the GitHub Actions UI for the repository.
- Check `active_sessions`: if zero, restore sessions first (see above).
- Check the `RateLimited` condition: if true, reduce session load or wait for the burst to subside.
- Check `message_poll_errors_total`: persistent poll errors indicate a broken GitHub connection.
- If sessions are active and there are no errors, the queue may simply be empty.
Incident Response
GitHub App Key Compromise
Immediate steps (< 5 minutes):
- Revoke the compromised private key in the GitHub App settings (Settings → Developer settings → GitHub Apps → `<app>` → Private keys → Revoke).
- The AGC's token refresh will fail within minutes of revocation; sessions will become invalid.
Restoration steps:
- Generate a new private key from the GitHub App settings page and download the `.pem` file.
- Create a new Secret with the new key.
- Update the `ActionsGateway` CR to reference the new Secret (`spec.gitHubAppRef.name`).
- Confirm the AGC Deployment has rolled and the new pod is healthy.
- Confirm `actions_gateway_token_refresh_errors_total` is no longer incrementing.
- Delete the old Secret once confirmed healthy, and delete the downloaded `.pem` file from disk (`shred -u <path>` on Linux, `rm -P <path>` on macOS). The key then lives only in the Kubernetes Secret.
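The Secret, CR, and rollout steps above can be sketched as one function. Treat it as an outline under assumptions rather than the documented procedure: the Secret key name `private-key` is a placeholder (Getting Started §2 defines the real key names and any other required fields, such as App identifiers), `actionsgateway` is assumed to be the resource name kubectl resolves, and the function name is hypothetical. Only `spec.gitHubAppRef.name` and the Deployment name come from this runbook.

```shell
# Create a fresh Secret from a new key, repoint the CR, and wait for the rollout.
# Usage: rotate_app_secret <namespace> <gateway-name> <new-secret-name> <pem-path>
# Assumption: `private-key` is a placeholder Secret key; use the names from
# Getting Started §2 and add any other fields that Secret requires.
rotate_app_secret() {
  ns="$1"; gateway="$2"; secret="$3"; pem="$4"
  kubectl create secret generic "$secret" -n "$ns" \
    --from-file=private-key="$pem" || return 1
  # Changing the reference is what triggers the GMC to roll the AGC Deployment.
  kubectl patch actionsgateway "$gateway" -n "$ns" --type merge \
    -p "{\"spec\":{\"gitHubAppRef\":{\"name\":\"$secret\"}}}" || return 1
  # The GMC may take a moment to update the Deployment; re-run this check if it
  # returns before a new pod has appeared.
  kubectl rollout status deploy/actions-gateway-controller -n "$ns" --timeout=120s
}
```

The function stops at the first failing step, so a failed Secret creation never repoints the CR at a Secret that does not exist.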
Scope assessment. The compromised key could have been used to acquire installation tokens (scoped to Actions: Read, Administration: Read). Check GitHub's audit log for unusual API activity from the App installation: Settings → Organizations → <org> → Audit log → filter by the App name.
AGC Total Failure
If the AGC pod is destroyed and cannot restart (e.g. node failure without rescheduling, OOM loop):
- In-flight jobs whose `renewjob` loop has lapsed will be cancelled by GitHub. There is no automatic recovery for these; they require manual re-run.
- Queued jobs (not yet acquired) will be redelivered by GitHub to the next healthy session within ~2 minutes of the AGC restarting.
- To force restart: `kubectl rollout restart deploy/actions-gateway-controller -n <namespace>`.
- Monitor `actions_gateway_active_sessions`; it should reach 1 per RunnerGroup within a few seconds of the pod starting.
State that persists: All RunnerGroup CRs, Secrets, and Kubernetes resources are durable. The AGC reconstructs all in-memory state (session registry, per-job renewers) from scratch on restart. The only non-recoverable state is in-flight job locks that expire during the blackout window.
GMC Total Failure
If the GMC pod is unavailable:
- Existing tenant gateways continue operating normally. The GMC is not in the data plane; it only responds to `ActionsGateway` CR changes. Provisioned AGCs, proxies, and RunnerGroups are not affected.
- New `ActionsGateway` CRs will not be provisioned until the GMC recovers.
- Spec changes to existing `ActionsGateway` CRs will not be reconciled until the GMC recovers.
- To restore: `kubectl rollout restart deploy/gmc-controller-manager -n gmc-system`.
- On recovery, the GMC reconciles all `ActionsGateway` CRs idempotently: it compares desired vs. actual state and only applies changes. No resources are duplicated or deleted.
On-Call Handoff Checklist
Before handing off to the next on-call:
- [ ] All `ActionsGateway` conditions `Ready=True` across active tenant namespaces.
- [ ] No sustained `RateLimited` conditions.
- [ ] `active_sessions` > 0 for all active RunnerGroups.
- [ ] `token_refresh_errors_total` rate is zero (or below 1/hour).
- [ ] `renewjob_errors_total` rate is zero.
- [ ] No pods in `CrashLoopBackOff` or `OOMKilled` state.
- [ ] No open incidents or unresolved pages.
- [ ] Any `eviction_retries_exhausted_total` increments from the shift are documented and re-runs are queued.
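The first checklist item lends itself to a quick scripted check. The sketch below lists every gateway whose `Ready` condition is not `True`, assuming the resource name resolves as `actionsgateway` and conditions live under `.status.conditions`. Both function names are hypothetical.

```shell
# Print "<namespace>/<name> <Ready status>" for every ActionsGateway in the cluster.
# Assumption: `actionsgateway` is the resource name kubectl resolves for the CRD.
list_gateway_readiness() {
  kubectl get actionsgateway -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# Print only the gateways that are not Ready=True; exit non-zero if any exist.
# A gateway with no Ready condition at all has an empty status and is reported too.
not_ready_gateways() {
  list_gateway_readiness | awk '$2 != "True" { print; bad = 1 } END { exit bad }'
}
```

An empty result with exit status 0 means the first box can be ticked; any printed line names a gateway to investigate before handoff.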
Reference Links
- Troubleshooting Guide — symptom → diagnosis → resolution for each failure mode
- Security Operations — abuse-detection alerts and compromise-response playbooks
- Observability — full metrics reference
- Getting Started — initial setup and credential rotation
- Appendix A — Capacity Targets & SLOs
- Appendix E — Capacity Planning