Upgrade and Rollback Procedures¶
Audience: Platform engineer
For upgrade strategy intent, see §2.6 of the architecture doc. For initial installation of the GMC, see install.md — this document covers day-2 upgrade and rollback.
The three independently versioned components — GMC, AGC, and worker image — each upgrade on their own cadence with separate procedures below.
Table of Contents¶
- Pre-Upgrade Validation Checklist
- Migration Notes
- BREAKING: spec.namespaceQuota removed — the ResourceQuota is now platform-owned
- Tenant namespaces now require the actions-gateway.github.com/tenant marker label
- Worker pods are now cleaned up automatically (one-time sweep recommended)
- AGC Deployment renamed from actions-gateway-agc to actions-gateway-controller
- GMC manager NetworkPolicy is now enabled by default
- GMC Upgrade
- GMC install and upgrade via Helm (recommended)
- Step 1: Upgrade the CRDs
- Step 2: Roll the GMC Deployment
- Step 3: Post-Upgrade Validation
- Rollback
- AGC Upgrade
- Per-Tenant Upgrade Procedure
- Rollback
- Proxy Upgrade
- Step 1: Pre-Upgrade Checks
- Step 2: Update the Proxy Image
- Step 3: Watch the Rollout
- Step 4: Post-Upgrade Validation
- Rollback
- Worker Image Upgrade
- Upgrade Procedure
- Canary Testing a New Worker Image
- Minimum Version Requirement
- Rollback
- Post-Upgrade Validation
- Zero-Downtime Configuration
Pre-Upgrade Validation Checklist¶
Before upgrading any component, confirm the system is healthy:
# 1. No active incidents or RateLimited conditions
kubectl get actionsgateway --all-namespaces
# 2. All AGC pods healthy
kubectl get pods --all-namespaces -l app=actions-gateway-controller
# 3. All proxy pools healthy
kubectl get pods --all-namespaces -l app=actions-gateway-proxy
# 4. No CrashLoopBackOff pods
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed | grep -v Terminating
# 5. No recent reconcile errors
# Metric: rate(actions_gateway_reconcile_errors_total[5m]) == 0
Also check the release notes for the new version before upgrading, particularly: - CRD schema changes (new required fields, removed fields, validation tightening). - Behavior changes that require configuration updates before the new binary takes effect.
Migration Notes¶
BREAKING: spec.namespaceQuota removed — the ResourceQuota is now platform-owned¶
This is a breaking CRD change, made pre-1.0 while the API can still break. The
spec.namespaceQuota field has been removed from the ActionsGateway CRD. The
namespace ResourceQuota (and any LimitRange) is now platform-owned: the
platform admin creates and manages it on the tenant namespace, and the gateway
operates within it but never creates or mutates it. The GMC's resourcequotas
write RBAC has been dropped (least privilege — Q122/Q130). The rationale: a
tenant-set quota is no real cap (the tenant could raise it in their own CR) and it
fought GitOps and tenant-operator stacks (Capsule, HNC, vCluster, kiosk) that
already own namespaces and quotas.
What you must do before (or as part of) the upgrade:
- Provision a platform-managed
ResourceQuotain each tenant namespace before the new GMC takes over — the gateway no longer creates one, so a namespace with no quota becomes uncapped. For each tenant that relied onspec.namespaceQuota, read the current values and create a standaloneResourceQuota:
# Inspect the GAG-managed quota the old GMC created (named "actions-gateway")
kubectl get resourcequota actions-gateway -n <tenant-namespace> -o yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: <tenant>-quota
namespace: <tenant-namespace>
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
pods: "50"
- Adopt or replace the orphaned GAG-created quota. A
ResourceQuotathe old GMC created carries anownerReferenceto theActionsGatewayCR, so it would be garbage-collected if the CR were ever deleted. Either adopt it by stripping that ownerReference (so it survives independently), or delete it and recreate a platform-managed one as in step 1:
# Adopt: drop the ownerReference so the quota is no longer GC-tied to the CR
kubectl patch resourcequota actions-gateway -n <tenant-namespace> \
--type=json -p='[{"op":"remove","path":"/metadata/ownerReferences"}]'
- Drop
namespaceQuotafrom yourActionsGatewaymanifests / GitOps. On upgrade the CRD's structural-schema pruning silently drops the now-unknown field from stored and re-applied CRs — applying a manifest that still setsnamespaceQuotais not rejected, the field is just ignored. Remove it from source so intent stays clear.
BREAKING: priorityTiers PriorityClasses now require a platform allowlist; per-tier preemptionPolicy removed¶
Two breaking CRD/admission changes, made pre-1.0 (Q132). Both concern
spec.runnerGroups[].priorityTiers:
-
The GMC validating webhook now rejects any
priorityClassNamenot on the platform allowlist. The allowlist is the new GMC--allowed-priority-classesflag (comma-separated class names) and is empty by default, so after upgrade everyActionsGatewaythat setspriorityTierswill be rejected on its next apply until you configure the flag. The rationale: a tenant-chosen, cluster-scopedPriorityClasswith the defaultPreemptLowerPrioritypolicy could evict other tenants' running worker pods — a cross-tenant isolation break. -
The per-tier
preemptionPolicyfield has been removed from thePriorityTierschema. It was never wired to worker pods (a no-op) and was a tenant-controlled preemption lever; preemption is now governed solely by the platform-createdPriorityClassobject. Structural-schema pruning silently drops the now-unknown field from stored/re-applied CRs (no apply rejection); remove it from source so intent stays clear.
Migration steps:
- Before rolling the GMC, decide which
PriorityClassnames tenants may use, ensure those classes exist (create them withpreemptionPolicy: Neverunless a tier is genuinely meant to preempt cross-tenant — see security-operations.md § Priority classes), and set--allowed-priority-classeson the GMC Deployment (or the chart values). - Audit existing CRs for the classes they reference:
kubectl get actionsgateway -A -o jsonpath='{range .items[*]}{range .spec.runnerGroups[*]}{range .spec.priorityTiers[*]}{.priorityClassName}{"\n"}{end}{end}{end}' | sort -u. Every name in that list must be on the allowlist or the next apply/reconcile of that CR is rejected. - Drop
preemptionPolicyfrom yourpriorityTiersmanifests / GitOps.
Tenant namespaces now require the actions-gateway.github.com/tenant marker label¶
The GMC's cluster-wide write grants are now gated by two ValidatingAdmissionPolicies
(both shipped by the Helm chart, gated on admissionPolicy.enabled):
namespace-psa-guard gates namespaces:patch, and gmc-tenant-resource-guard
gates create/update/delete of all tenant provisioning resources (Deployments,
Services, ServiceAccounts, RoleBindings, Roles, NetworkPolicies, HPAs, PDBs,
RunnerGroups, and Secret create/update). Both deny the GMC unless the target
namespace already carries actions-gateway.github.com/tenant: "true". Existing
tenant namespaces created before this change do not have the label, so after
upgrade the GMC cannot stamp their Pod Security Admission labels or provision any
resources in them, and each affected ActionsGateway will emit a
NamespaceMarkerMissing warning event.
Before (or immediately after) upgrading, label every existing tenant namespace:
# Label all namespaces that currently hold an ActionsGateway CR.
kubectl get actionsgateway -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' \
| sort -u \
| xargs -I{} kubectl label namespace {} actions-gateway.github.com/tenant=true --overwrite
For a phased rollout where you cannot label every namespace up front, temporarily set
both bindings' validationActions to [Audit] (instead of [Deny]) so denials are
logged but not enforced, label the namespaces, then restore [Deny] on each.
Worker pods are now cleaned up automatically (one-time sweep recommended)¶
AGC versions with Q95 delete completed worker pods after completedPodTTL
(default 5m) and stuck-Pending worker pods after pendingPodDeadline (default
10m), and stamp every new worker pod and job Secret with an OwnerReference to
its RunnerGroup. Two operator-visible consequences:
- Behaviour change: completed worker pods no longer linger indefinitely.
If your debugging workflow relied on inspecting old pods, raise
completedPodTTLon the affectedrunnerGroups[]entries (see tenant-onboarding). - One-time sweep: pods created by pre-upgrade AGC versions whose RunnerGroup still exists are reaped automatically after upgrade, but pods whose RunnerGroup was already deleted have no OwnerReference and no reconciler to reap them. Clean those up once per tenant namespace:
# Terminal worker pods left behind by pre-Q95 AGCs (label is stamped on all worker pods)
kubectl delete pods -n <tenant-namespace> \
-l app.kubernetes.io/name=actions-runner \
--field-selector 'status.phase!=Running,status.phase!=Pending'
AGC Deployment renamed from actions-gateway-agc to actions-gateway-controller¶
Deployments and resources created by the GMC are now named actions-gateway-controller
instead of actions-gateway-agc. After upgrading the GMC:
- The GMC creates a new
actions-gateway-controllerDeployment in each tenant namespace. - The old
actions-gateway-agcDeployment is left orphaned (still running but no longer managed). Remove it manually per tenant:
- Pods labelled
app=actions-gateway-agcbecomeapp=actions-gateway-controller. Update any Prometheus alerts, Grafana dashboards, or PodMonitor selectors that reference the old label before upgrading.
GMC manager NetworkPolicy is now enabled by default¶
The default install ships the GMC manager NetworkPolicy enabled
(networkPolicy.enabled=true). This flips the controller-manager pod to default-deny ingress and
admits its :8443 /metrics endpoint only from namespaces labelled
metrics: enabled. If your Prometheus runs in an unlabelled namespace, GMC
manager scrapes will start failing after upgrade. Label it before (or right
after) upgrading:
The validating-webhook port (9443) is re-allowed from any source, so CR
admission is unaffected. This change also adds a PodDisruptionBudget
(minAvailable: 1) and priorityClassName: system-cluster-critical to the
manager — no operator action required. Runtime NetworkPolicy enforcement depends
on your CNI; see observability.md. The Prometheus
ServiceMonitor remains opt-in behind the metrics.serviceMonitor.enabled
chart value.
GMC Upgrade¶
The GMC runs replicas: 2 with leader election. Only one replica reconciles at any time; leadership transfers seamlessly during a rolling update. In-flight reconciliations are idempotent — the new leader re-derives state and converges without producing duplicate resources.
The active replica releases its leader lease on graceful shutdown (--leader-elect-release-on-cancel, on by default), so during a rollout the standby takes over within one retry period (~2s) instead of waiting out the full lease (~15s). This is why the Deployment's short terminationGracePeriodSeconds: 10 introduces no reconcile gap. If you run on a slow or heavily loaded API server and see spurious leader-lease losses (the GMC restarting with "failed to renew lease"), widen the timing with the tunables below rather than disabling leader election:
| Flag | Default | Purpose |
|---|---|---|
--leader-elect-lease-duration |
15s |
How long a candidate waits before force-acquiring a stale lease. |
--leader-elect-renew-deadline |
10s |
How long the leader keeps retrying a renewal before stepping down. |
--leader-elect-retry-period |
2s |
Interval between election attempts (and the failover floor with release-on-cancel). |
--leader-elect-release-on-cancel |
true |
Release the lease on SIGTERM for fast failover. Leave on. |
The invariant lease-duration > renew-deadline > retry-period × 1.2 is validated at startup; a misordered set makes the GMC exit immediately with a message naming the offending flags.
GMC install and upgrade via Helm (recommended)¶
The shipped install artifact is the actions-gateway Helm chart, published and cosign-signed to the GHCR OCI registry (oci://ghcr.io/actions-gateway/charts/actions-gateway); the charts/actions-gateway/ source path is the dev/CI copy of the same chart. The chart is the sole install/upgrade vehicle — there is no kustomize path. For dev/CI iteration make deploy wraps helm install of the local chart with floating image tags.
General availability —
v1.0.0(chart version = release tag without the leadingv). Pin--version 1.0.0; copy the image digests from the release notes and verify the chart/image signatures before installing (see release.md § Verify the publish).
# First install (from the published, signed OCI chart)
helm install gag oci://ghcr.io/actions-gateway/charts/actions-gateway \
--version 1.0.0 \
--namespace gmc-system --create-namespace \
--set gmc.image.digest=sha256:<gmc> \
--set agc.image.digest=sha256:<agc> \
--set proxy.image.digest=sha256:<proxy>
# Upgrade in place to a newer published chart version (carries CRD field changes — see below)
helm upgrade gag oci://ghcr.io/actions-gateway/charts/actions-gateway \
--version <new-chart-version> --namespace gmc-system --reuse-values \
--set gmc.image.digest=sha256:<new-gmc>
# Roll back to the previous release
helm rollback gag --namespace gmc-system
Four upgrade-time behaviors are specific to this chart:
gmc.image.digestis required at render time. Bothhelm installandhelm upgradefail withgmc.image must be pinned by digest …when the release values carry no digest — e.g. a values file that omits it, or--reset-values.--reuse-values(as in the example above) carries the previously pinned digest forward; pass--set gmc.image.digest=sha256:<new-gmc>to move to the new release's image. See the troubleshooting runbook. Dev/test only:allowFloatingImageTags=trueopts out.- CRDs upgrade with the release. The
ActionsGatewayandRunnerGroupCRDs ship as templates undertemplates/crds/withhelm.sh/resource-policy: keep, not the chart-rootcrds/directory — Helm never upgrades resources incrds/. So ahelm upgradeapplies additive CRD field changes automatically, andhelm uninstallpreserves the CRDs (and every tenant'sActionsGateway/RunnerGroupobject) rather than cascade-deleting them. You do not run a separate CRD apply step. TheRunnerGroupCRD is sourced from the AGC authoritative copy. - The webhook cert path depends on
certManager.enabled. With the defaultcertManager.enabled=true, cert-manager issues and rotates the serving cert; nothing to do on upgrade. WithcertManager.enabled=false, the chart generates a self-signed serving cert and wires the webhookcaBundleitself. On an in-placehelm upgradethe chart reuses the existingwebhook-server-certSecret (it looks the Secret up), so the cert does not rotate; it only regenerates if that Secret is missing (a fresh install, or after you delete it to force rotation). Ahelm template(no cluster) cannot look the Secret up and therefore renders a fresh cert each time — expected for offline rendering only. - The
namespace-psa-guardandgmc-tenant-resource-guardbindings deny by default. If you are upgrading a cluster whose existing tenant namespaces are not yet labeledactions-gateway.github.com/tenant=true, label them before the upgrade (see the migration note above), or the GMC's namespace patches and all tenant-resource writes will be denied. To stage the rollout you can temporarily set both bindings toAuditby editingvalidationActionson eachValidatingAdmissionPolicyBinding, then flip them back toDenyonce the labels are in place.
helm upgrade rolls the GMC Deployment (and carries additive CRD field changes —
no separate CRD apply step). Watch the rollout:
The rolling update replaces one replica at a time. Leadership transfers before the old leader is deleted. The total rollout time is typically < 30 seconds.
Post-upgrade validation¶
# Confirm both replicas are on the new image
kubectl get pods -n gmc-system -o wide
# Confirm the GMC has re-elected a leader
kubectl get lease -n gmc-system
# Confirm no new reconcile errors appeared
# Metric: actions_gateway_reconcile_errors_total
# Spot-check one ActionsGateway CR
kubectl describe actionsgateway -n <namespace> <name>
Rollback¶
Roll back to the previously deployed release with helm rollback (see the Helm
section above):
helm rollback gag --namespace gmc-system
kubectl rollout status deploy/gmc-controller-manager -n gmc-system
helm rollback restores the prior release's values and manifests. CRDs carry
helm.sh/resource-policy: keep, so they are not rolled back automatically; if the
rollback targets a different CRD schema version, consult the release notes for any
CRD migration.
AGC Upgrade¶
The AGC runs replicas: 1. Every AGC upgrade incurs a brief drain window — the period between the old pod terminating and the new pod acquiring sessions. During this window:
- In-flight long polls are dropped. GitHub redelivers these jobs within ~2 minutes (GitHub's redelivery window).
- Per-job RenewJob loops are abandoned. Any job whose lock window (~10 minutes per renewal) expires before the new AGC starts will be cancelled by GitHub. These require manual re-run.
- Queued but unacquired jobs are redelivered after the session TTL expires (typically < 2 minutes).
Scheduling guidance. Schedule AGC upgrades during low-traffic periods (off-peak hours, weekends) when in-flight job count is minimal. If zero-downtime is required, accept that GitHub redelivery provides effective recovery for most jobs.
Per-Tenant Upgrade Procedure¶
Upgrade each tenant's AGC one at a time. If tenants are independent, you may parallelize across namespaces.
Step 1: Drain the AGC before upgrading (optional, reduces blackout window)
The AGC's SIGTERM handler calls DELETE /sessions for all open sessions before exiting, causing GitHub to immediately re-queue unacquired jobs rather than waiting for session TTL. To rely on this:
- Ensure
terminationGracePeriodSecondson the AGC Deployment is ≥ 30 seconds (the default). - Do not use
kubectl delete poddirectly — it sends SIGKILL without a grace period. Usekubectl rollout restartorkubectl set imageinstead.
Step 2: Update the AGC image
The GMC manages the AGC Deployment. To update the AGC image, update the GMC's configuration (Helm values or Kustomize overlay) with the new AGC image tag and re-deploy the GMC, which will then roll each tenant's AGC Deployment. Alternatively, patch per-namespace:
Step 3: Watch the rollout
Step 4: Confirm session recovery
# sessions should return to >= 1 per RunnerGroup within a few seconds of pod startup
# Metric: actions_gateway_active_sessions{namespace="<namespace>"}
# No new renewjob errors
# Metric: rate(actions_gateway_renewjob_errors_total{namespace="<namespace>"}[5m])
Step 5: Check for cancelled jobs
After the rollout, verify that jobs active during the restart have either completed or been redelivered. Check the GitHub Actions UI for any unexpectedly cancelled runs.
Rollback¶
kubectl rollout undo deploy/actions-gateway-controller -n <namespace>
kubectl rollout status deploy/actions-gateway-controller -n <namespace>
Then confirm sessions are re-established and job acquisition resumes.
Proxy Upgrade¶
The proxy pool is HPA-managed and stateless. Rolling updates are non-disruptive as long as the PodDisruptionBudget (minAvailable: 1) is respected during the rollout.
Step 1: Pre-Upgrade Checks¶
# Confirm the PodDisruptionBudget is in place
kubectl get pdb -n <namespace> actions-gateway-proxy
# Confirm current replica count
kubectl get deploy -n <namespace> actions-gateway-proxy
# Confirm the HPA is healthy (TARGETS should show a percentage, not <unknown>)
kubectl get hpa -n <namespace>
Step 2: Update the Proxy Image¶
The GMC manages the proxy Deployment. Update the proxy image via the GMC's Helm values or Kustomize overlay, then re-deploy the GMC (which will reconcile the updated image into all tenant proxy Deployments). To patch a single namespace:
Step 3: Watch the Rollout¶
The rolling update replaces one proxy pod at a time. Kubernetes honours the PodDisruptionBudget and only terminates a pod once its replacement is Ready.
In-flight CONNECT tunnels through the old proxy pod will be interrupted when that pod is terminated. The AGC and worker pods will reconnect through the remaining proxy pods automatically. For high-concurrency tenants, schedule the upgrade during a low-traffic window to minimise connection resets.
Step 4: Post-Upgrade Validation¶
# All proxy pods on the new image
kubectl get pods -n <namespace> -l app=actions-gateway-proxy -o wide
# HPA still computing utilization (not <unknown>)
kubectl get hpa -n <namespace>
# No spike in token or renewjob errors after the rollout
# Metrics: token_refresh_errors_total, renewjob_errors_total
Rollback¶
kubectl rollout undo deploy/actions-gateway-proxy -n <namespace>
kubectl rollout status deploy/actions-gateway-proxy -n <namespace>
Worker Image Upgrade¶
Worker image upgrades are non-disruptive: the new image takes effect on future jobs; running pods complete on the old image.
Upgrade Procedure¶
Update spec.runnerGroups[N].workerImage in the ActionsGateway CR:
kubectl edit actionsgateway -n <namespace> <name>
# Update spec.runnerGroups[N].workerImage to the new image digest
The GMC propagates the change to the RunnerGroup CR. The AGC starts using the new image on the next job acquisition. No restart required.
Production recommendation: pin to a digest, not a tag:
This ensures the exact same image is used for all jobs until explicitly changed, and enables unambiguous rollback.
Canary Testing a New Worker Image¶
To test a new image on a subset of jobs before rolling it out broadly:
- Add a second
RunnerGroupwith the new image and a distinct label (e.g.canary). - Update a subset of workflows to use
runs-on: [self-hosted, canary]. - Monitor job success rates. If healthy, update the main
RunnerGroupand remove the canary group.
Minimum Version Requirement¶
GitHub enforces a minimum runner version at session creation time. If the worker image contains a runner below this threshold, the session goroutine will receive a 400 Bad Request and surface a VersionTooOld condition on the RunnerGroup. Monitor actions_gateway_active_sessions and RunnerGroup conditions for this symptom after deploying an older image.
Rollback¶
Set workerImage back to the previous digest:
kubectl patch actionsgateway -n <namespace> <name> \
--type=json \
-p='[{"op":"replace","path":"/spec/runnerGroups/0/workerImage","value":"<previous-digest>"}]'
Post-Upgrade Validation¶
After any component upgrade:
# All ActionsGateway CRs healthy
kubectl get actionsgateway --all-namespaces
# Active sessions restored
# Metric: actions_gateway_active_sessions per namespace
# No spike in errors
# Metrics: token_refresh_errors_total, renewjob_errors_total, reconcile_errors_total
# Pod creation latency within SLO
# Metric: histogram_quantile(0.95, rate(actions_gateway_pod_creation_latency_seconds_bucket[5m]))
If a regression is detected within the first 15 minutes after an upgrade, roll back immediately rather than investigating in production. Investigate using a non-production environment.
Zero-Downtime Configuration¶
The GMC and worker image upgrades are non-disruptive. The AGC upgrade is the only component with a brief drain window. To minimize its impact:
- Time upgrades outside peak hours to reduce the number of in-flight jobs at risk.
- Rely on SIGTERM drain —
kubectl rollout restart(notdelete pod) gives the AGC time to callDELETE /sessionsbefore the pod exits, reducing the redelivery window from session TTL (minutes) to pod startup time (seconds). - Use a generous
terminationGracePeriodSeconds(≥ 30s). The AGC's SIGTERM handler is fast (a few hundred milliseconds for most tenants), but give it headroom for high-listener-count namespaces. - Accept the blackout as a known cost rather than attempting zero-downtime tricks. GitHub's 2-minute redelivery window means most jobs survive an AGC restart transparently; the risk window is only jobs whose
renewjoblock happens to expire during the restart (unlikely in practice for a < 5-second restart).