
Troubleshooting Guide

Audience: SRE, Platform engineer

Each section below covers a specific failure mode: symptoms, likely cause, diagnostic commands, and resolution steps.


How to Validate a Fresh Deployment

Run these checks immediately after deploying a new tenant gateway or upgrading existing components.

# 1. Check ActionsGateway status
kubectl get actionsgateway -n <namespace> -o yaml | grep -A 20 status:

# 2. Confirm the AGC pod is running
kubectl get deploy -n <namespace> actions-gateway-controller
kubectl logs -n <namespace> deploy/actions-gateway-controller --tail=50

# 3. Confirm the proxy pool is healthy
kubectl get deploy -n <namespace> actions-gateway-proxy
kubectl get hpa -n <namespace>

# 4. Confirm RunnerGroup resources exist
kubectl get runnergroup -n <namespace>

# 5. Check for condition errors
kubectl get actionsgateway -n <namespace> -o jsonpath='{.items[*].status.conditions}' | jq .

Expected state after a healthy deployment:

  • ActionsGateway condition Ready=True.
  • ActionsGateway condition AGCAvailable=True.
  • ActionsGateway condition ProxyAvailable=True.
  • AGC Deployment: READY 1/1.
  • Proxy Deployment: READY count >= minReplicas.
  • HPA: TARGETS showing a CPU percentage (not <unknown>).
  • Each RunnerGroup has at least one listener session (actions_gateway_active_sessions > 0).
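
The condition checks in this list can be scripted. The following is a minimal sketch, not part of the product: check_conditions is a hypothetical helper name, and it assumes the ActionsGateway conditions expose the usual type and status fields so they can be printed as Type=Status lines.

```shell
# check_conditions (hypothetical helper): reads "Type=Status" lines on stdin
# and reports every required condition that is not True.
# Returns 0 only when Ready, AGCAvailable and ProxyAvailable are all True.
check_conditions() {
  input=$(cat)
  failed=0
  for cond in Ready AGCAvailable ProxyAvailable; do
    if ! printf '%s\n' "$input" | grep -qx "${cond}=True"; then
      echo "NOT OK: ${cond}"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (commented out because it needs cluster access; ns and name are placeholders):
# kubectl get actionsgateway -n "$ns" "$name" \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#   | check_conditions
```

A non-zero exit status makes the helper usable as a gate in a post-deploy script.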

Helm Render Fails: gmc.image Must Be Pinned by Digest

Symptoms. helm install, helm upgrade, helm lint, or helm template of the actions-gateway chart fails immediately with:

Error: execution error at (actions-gateway/templates/deployment.yaml:...):
gmc.image must be pinned by digest: set gmc.image.digest=sha256:<64 hex digits>
(see docs/operations/install.md, "Pin images by digest").
DEV/TEST ONLY: set allowFloatingImageTags=true to allow a floating tag.

Cause. gmc.image.digest is empty in the release values. The chart enforces digest pinning of the GMC's own controller image at render time (secure by default): nothing at runtime validates the image the GMC itself runs from, so an empty digest must never silently fall back to a mutable :latest tag. Common ways to get here:

  • A fresh install without --set gmc.image.digest=sha256:<gmc>.
  • A helm upgrade with a values file (or --reset-values) that omits the digest. (--reuse-values carries the previously pinned digest forward.)
  • Offline rendering (helm template / helm lint) without supplying a digest.

Resolution.

  • Production: pin the digest published for the release you are installing (see release.md for where digests are recorded):
helm upgrade --install gag charts/actions-gateway \
  --namespace gmc-system \
  --set gmc.image.digest=sha256:<gmc> \
  --set agc.image.digest=sha256:<agc> \
  --set proxy.image.digest=sha256:<proxy>
  • Dev/test only: --set allowFloatingImageTags=true allows a floating tag for the GMC image and disables the GMC's startup digest check on the AGC/proxy images. Never use it in production.
  • Offline rendering: any well-formed digest satisfies the check, e.g. --set-string gmc.image.digest=sha256:1111111111111111111111111111111111111111111111111111111111111111.

Note the contrast with the AGC/proxy images: those are validated by the GMC at startup (a floating tag there crash-loops the GMC — see install.md § Pin images by digest), while the GMC's own image is validated by the chart at render time.
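
To catch a malformed digest before Helm does, the sha256:<64 hex digits> shape from the error message can be checked locally. This is a sketch only: is_pinned_digest is a hypothetical helper, and accepting lowercase hex only is an assumption, since the chart's exact validation rule is not shown here.

```shell
# is_pinned_digest (hypothetical helper): succeeds when the argument looks like
# sha256:<64 hex digits>. Lowercase-only hex is an assumption of this sketch.
is_pinned_digest() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Build a placeholder digest of 64 zeros, suitable only for offline rendering.
digest="sha256:$(printf '%064d' 0)"
if is_pinned_digest "$digest"; then
  echo "digest is well-formed"
else
  echo "digest is malformed" >&2
fi
```

A well-formed placeholder only satisfies the render-time check; production installs still need the digest published for the release.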


GMC Not Provisioning Tenant Resources

Symptoms. An ActionsGateway CR was applied but nothing has been created in the tenant namespace: no AGC Deployment, no proxy Deployment, no RunnerGroup resources.

Likely causes.

  • GMC pod is not running or not the leader.
  • GMC lacks permission to write to the tenant namespace (RBAC misconfiguration during initial GMC install).
  • The ActionsGateway CR failed admission validation (check for validation errors in kubectl apply output or Events).

Diagnostics.

# Check whether the GMC is running and has a leader
kubectl get lease -n gmc-system
kubectl get pods -n gmc-system

# Check GMC logs for reconcile errors
kubectl logs -n gmc-system deploy/gmc-controller-manager --tail=100 | grep -i error

# Check events on the ActionsGateway CR
kubectl describe actionsgateway -n <namespace> <name>

# Check the Ready condition
kubectl get actionsgateway -n <namespace> <name> -o jsonpath='{.status.conditions}' | jq .

Resolution.

  • If the GMC pod is not running, restore it from its Deployment.
  • If RBAC is missing, re-run helm upgrade --install of the chart (RBAC ships with it).
  • If the admission webhook is rejecting the CR, fix the CR spec and re-apply.
  • If a reconcile error is logged (e.g. failed to create Deployment), check the actions_gateway_reconcile_errors_total metric and read the full error from the GMC logs. Fix the underlying permissions or quota issue and the GMC's reconciler will retry.


Tenant Namespace Missing the Managed-Tenant Marker Label

Symptoms. An ActionsGateway never becomes Ready. kubectl describe shows a Warning event with reason NamespaceMarkerMissing, and the GMC log reports a Forbidden error stamping Pod Security Admission labels, citing the namespace-psa-guard admission policy. This is common immediately after upgrading a cluster whose tenant namespaces predate the policy (see Upgrade — Migration Notes).

Cause. The GMC's cluster-wide namespaces:patch grant is gated by the namespace-psa-guard ValidatingAdmissionPolicy, which denies the GMC any namespace that is not labelled actions-gateway.github.com/tenant: "true". The label confines the grant to managed tenants so a compromised GMC cannot relabel kube-system PSA (see Security §5.1/§5.3). The GMC never sets this label itself — a trusted administrator must apply it. The same marker also gates the gmc-tenant-resource-guard policy, which confines every tenant-resource write (Deployments, Secrets, RoleBindings, …) to marked namespaces; provisioning fails at the PSA-stamping step first, so NamespaceMarkerMissing is the signal you will see, but applying the label clears both gates.

Diagnostics.

# Confirm the warning event
kubectl describe actionsgateway -n <namespace> <name> | grep -A2 NamespaceMarkerMissing

# Check whether the marker label is present
kubectl get namespace <namespace> \
  -o jsonpath='{.metadata.labels.actions-gateway\.github\.com/tenant}'   # want: true

# Confirm both policies and their bindings are installed
kubectl get validatingadmissionpolicy gmc-namespace-psa-guard gmc-tenant-resource-guard
kubectl get validatingadmissionpolicybinding gmc-namespace-psa-guard-binding gmc-tenant-resource-guard-binding

Resolution. Apply the marker label as an administrator, then the GMC reconciler retries automatically:

kubectl label namespace <namespace> actions-gateway.github.com/tenant=true

If the GMC's ServiceAccount is installed under a non-default namespace or name, also confirm the policy's matchConditions username (system:serviceaccount:gmc-system:gmc-controller-manager) matches your install.


ActionsGateway Stuck Deleting (Teardown Blocked on a Failing Delete)

Symptoms. You deleted an ActionsGateway, but the CR does not disappear: kubectl get actionsgateway -n <namespace> still lists it with a non-empty metadata.deletionTimestamp, and kubectl describe shows a repeating Warning event with reason TeardownIncomplete. Some tenant resources (e.g. the AGC Deployment, RoleBinding, or a ServiceAccount) are still present in the namespace.

Cause. Teardown is fail-closed by design (Q125): the GMC keeps the actions-gateway.github.com/gmc-cleanup finalizer on the CR and requeues until it can confirm every owned resource is deleted (or already gone). If a delete keeps failing — most often an API-server error, or a Forbidden from an admission policy or revoked RBAC — the finalizer is retained on purpose so a live, credentialed AGC Deployment is never orphaned by a half-finished teardown. A NotFound is treated as success, so an already-deleted resource never blocks convergence.

Diagnostics.

# Confirm the CR is mid-deletion and which resources remain
kubectl get actionsgateway -n <namespace> <name> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
kubectl describe actionsgateway -n <namespace> <name> | grep -A3 TeardownIncomplete

# The event message names the namespace and the underlying error; also check the GMC log
kubectl logs -n gmc-system deploy/gmc-controller-manager --tail=50 | grep -i "delete resource during teardown"

Resolution. Fix the underlying delete failure: restore API-server health, re-grant the GMC the delete verb, or re-apply the actions-gateway.github.com/tenant=true marker label if the namespace lost it (the gmc-tenant-resource-guard policy gates DELETE too, so an unmarked namespace blocks teardown). The reconciler retries on its own backoff and removes the finalizer automatically once every delete is confirmed. Do not manually strip the finalizer to force the CR away: that re-introduces the orphaned-AGC failure mode the fail-closed behaviour exists to prevent. Clear the real delete error instead.


AGC CrashLoopBackOff or Not Acquiring Jobs

Symptoms. The AGC pod is restarting repeatedly, or it is running but actions_gateway_active_sessions is zero and actions_gateway_jobs_acquired_total is not incrementing even when jobs are queued.

Likely causes.

  • GitHub App Secret is missing, malformed, or contains an invalid private key.
  • GitHub App installationId or appId is wrong.
  • The proxy pool is not reachable from the AGC (network policy or proxy pod not ready).
  • The AGC binary was built with an incompatible runner version (GitHub returns 400 on session creation).

Diagnostics.

# Check pod status and restarts
kubectl get pod -n <namespace> -l app=actions-gateway-controller

# Check logs for startup errors
kubectl logs -n <namespace> deploy/actions-gateway-controller

# Check that the referenced Secret exists and has the right keys
kubectl get secret -n <namespace> <gitHubAppRef.name>
kubectl get secret -n <namespace> <gitHubAppRef.name> -o jsonpath='{.data}' | jq 'keys'
# Expected keys: appId, installationId, privateKey

# Test proxy reachability — the AGC image is distroless (no shell, no curl),
# so spawn an ephemeral curl pod in the same namespace and use the same proxy URL.
kubectl run nettest-$$ -n <namespace> --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --overrides='{"spec":{"automountServiceAccountToken":false,"containers":[{"name":"c","image":"curlimages/curl:latest","command":["sh","-c","curl -x https://actions-gateway-proxy:8080 -sI https://api.github.com"]}]}}'

# Check RunnerGroup conditions
kubectl get runnergroup -n <namespace> -o yaml | grep -A 10 conditions

# Check RunnerGroup events — the AGC emits Warning events for the common failures.
kubectl describe runnergroup -n <namespace> <name>
# Look for:
#   TokenUnavailable          — GitHub App installation token could not be fetched (Secret/appId/installationId).
#   AgentPoolError            — agent Secret provisioning (EnsureAgents) failed.
#   ListenerStartFailed       — listener goroutines could not be (re)started.
#   AgentDeregistrationFailed — agent Secret cleanup on scale-down/delete failed.
#   NoActiveSessions / ListenerActive — Ready condition transitions.

Resolution.

  • If the Secret is missing or has wrong keys, recreate it. See Getting Started — GitHub App Secret.
  • If the private key format is wrong, ensure it is a PEM-encoded RSA key starting with -----BEGIN RSA PRIVATE KEY-----. The Secret stringData.privateKey must contain the full key, including the header and footer lines.
  • If the runner version is outdated, update workerImage in the RunnerGroup spec (or the AGC's --worker-image flag). Watch for RunnerGroup conditions with reason VersionTooOld.
  • If appId or installationId is wrong, update the Secret.


RunnerGroup ActiveSessions Exceeds maxListeners

Symptoms. kubectl get runnergroup -n <namespace> -o jsonpath='{.items[*].status.activeSessions}' reports a value greater than the group's spec.maxListeners, typically climbing by one after each broker or network outage. GitHub shows more concurrent runner sessions for the group than the configured ceiling.

What happened. On AGC versions without the Q100 fix, a recoverable crash of the permanent baseline listener left the active count at zero for the duration of the restart backoff; a reconcile firing inside that window started a second permanent baseline on top of the pending restart. Permanent listeners are restarted after every recoverable exit and are exempt from the maxListeners ceiling, so each repeat of the race ratchets the session count up by one, indefinitely. Fixed versions make the multiplexer start idempotent, so the race cannot stack baselines.

Resolution.

  • Upgrade the AGC image to a version with the Q100 fix.
  • To clear excess listeners immediately on an affected version, restart the AGC Deployment (kubectl rollout restart deploy/actions-gateway-controller -n <namespace>). Listener sessions are in-memory; the restarted AGC re-creates exactly one baseline per RunnerGroup.
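
To spot affected groups quickly, the symptom check (status.activeSessions greater than spec.maxListeners) can be applied to every RunnerGroup at once. This is a minimal sketch: over_ceiling is a hypothetical helper, and the jsonpath in the usage comment assumes only the two fields named in the symptoms.

```shell
# over_ceiling (hypothetical helper): reads "name activeSessions maxListeners"
# lines on stdin and prints the name of every group whose active session
# count is above its ceiling. Lines with missing fields are skipped.
over_ceiling() {
  awk 'NF == 3 && ($2 + 0) > ($3 + 0) { print $1 }'
}

# Usage (commented out because it needs cluster access; ns is a placeholder):
# kubectl get runnergroup -n "$ns" \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.activeSessions}{" "}{.spec.maxListeners}{"\n"}{end}' \
#   | over_ceiling
```

An empty result means no group currently exceeds its ceiling.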


RunnerGroup Stops Serving Jobs With Stale Ready=True

Symptoms. A RunnerGroup stops servicing queued jobs even though the AGC pod is healthy, while status.activeSessions and the Ready=True condition still report the group as operational. kubectl get runnergroup -n <namespace> <name> -o jsonpath='{.status.activeSessions}' shows a stale nonzero value that does not match the (zero) sessions GitHub sees for the group.

What happened. The permanent baseline listener exited non-retriably — e.g. GitHub returned 401 Unauthorized on session creation for a credential it considers dead. The multiplexer does not auto-restart a non-retriable exit (that restart is reserved for recoverable crashes), so the in-memory listener count drops to zero. On AGC versions without the Q137 fix the RunnerGroup was only re-reconciled on a watch event (a RunnerGroup edit or a worker-pod lifecycle event) or the 10-hour informer resync, so with no such event the dead baseline — and the status written just before it died — could persist for hours.

Resolution.

  • Upgrade the AGC image to a version with the Q137 fix. Fixed versions requeue the RunnerGroup on a bounded interval while the listener count is below the desired ceiling, so the reconciler re-runs its zero-listener recovery and revives the baseline within seconds; status.activeSessions and Ready then track reality again.
  • To recover immediately on an affected version, trigger a reconcile by editing the RunnerGroup (e.g. a no-op annotation change) or restart the AGC Deployment (kubectl rollout restart deploy/actions-gateway-controller -n <namespace>); the restarted AGC re-creates one baseline per RunnerGroup from scratch.
  • If the baseline keeps exiting non-retriably after revival, the underlying credential or runner-version problem is real: check kubectl describe runnergroup for Degraded / Unauthorized / VersionTooOld conditions and resolve per the AGC CrashLoopBackOff or Not Acquiring Jobs section.


Listener Stalls for Minutes After a Black-Holed Broker Connection

Symptoms. One of a RunnerGroup's sessions stops picking up jobs for minutes at a stretch even though the AGC pod is healthy, the broker is reachable, and other sessions in the same group keep working. The stall typically follows a network event that silently drops an established connection — a firewall/NAT idle-timeout that discards packets without sending a RST, an egress-proxy failover, or a broker-side hang — so the long-poll's TCP connection is black-holed: accepted but never answered. actions_gateway_message_poll_errors_total{reason="timeout"} increments when an affected listener recovers.

What happened. The broker GetMessage long-poll holds the connection open for ~50s waiting for a job. On AGC versions without the Q108 fix the broker client had no response-header deadline, so a black-holed connection blocked the listener goroutine inside a single GetMessage call until the operating system's TCP timeout expired — minutes — during which that listener served no jobs. Fixed versions give the broker client a ResponseHeaderTimeout sized just above the 50s hold: a healthy long-poll is never cut short, but a black-holed connection is torn down a few seconds past the hold, classified as a benign "no message, retry", and the listener immediately opens a fresh long-poll.

Resolution.

  • Upgrade the AGC image to a version with the Q108 fix. No configuration is required; the bound is built in.
  • A steady stream of actions_gateway_message_poll_errors_total{reason="timeout"} after upgrade indicates the network is repeatedly black-holing broker connections (rather than wedging a listener). Investigate the egress path: proxy/NAT idle timeouts shorter than the 50s long-poll hold are the usual cause; raise the idle timeout above ~60s so healthy long-polls are not severed mid-hold.


Orphaned RunnerGroup After Removing It From the Spec

Symptoms. A runner group was removed from (or reordered within) spec.runnerGroups on an ActionsGateway, but a RunnerGroup for it still exists and keeps running listeners and worker pods. kubectl get runnergroup -n <namespace> lists more groups than the CR now declares:

# Owner-labelled RunnerGroups for a gateway vs. what the spec now declares
kubectl get runnergroup -n <namespace> -l actions-gateway/owner-name=<gateway-name>
kubectl get actionsgateway <gateway-name> -n <namespace> -o jsonpath='{range .spec.runnerGroups[*]}{.runnerLabels[0]}{"\n"}{end}'

What happened. On GMC versions without the Q101 fix, reconciliation only created/patched the groups currently in the spec and never deleted the ones removed — and because groups were keyed by list index, a remove or reorder could orphan a RunnerGroup CR that kept serving jobs until the entire ActionsGateway was deleted.

Resolution.

  • Upgrade the GMC to a version with the Q101 fix. Fixed versions reconcile spec.runnerGroups to the desired set: after applying the declared groups, the GMC prunes any owner-labelled RunnerGroup no longer in the spec, and keys pruning on owner labels (not list index) so a reorder never orphans a group. A subsequent reconcile (edit the CR, or wait for the next resync) cleans up any pre-existing orphans automatically.
  • To remove a stranded group immediately on an affected version, delete its RunnerGroup directly: kubectl delete runnergroup <name> -n <namespace>. The AGC's RunnerGroup cleanup stops its listeners and cascades to its worker pods. Confirm you are deleting an orphan (its runnerLabels are not in the current ActionsGateway spec), not a live group.


Proxy NetworkPolicy Has an Empty GitHub Allowlist

Symptoms. On a freshly provisioned tenant, all proxy egress to GitHub is silently dropped: curl through the proxy times out (no 502), the AGC cannot acquire jobs, and token refresh fails. The proxy NetworkPolicy exists but its ipBlock egress peers are empty.

Likely cause. The IP Range Reconciler's initial api.github.com/meta fetch failed or stalled at GMC startup. The cached ranges seed each proxy NetworkPolicy's ipBlock allowlist; until the first fetch lands, the allowlist is empty. The reconciler retries the initial fetch on a capped exponential backoff (1s→30s), so a transient outage normally self-heals within seconds — but a sustained inability to reach api.github.com from the GMC pod (egress firewall, DNS, or a long GitHub outage) leaves the allowlist empty until connectivity returns.
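
The retry timing helps estimate how long an empty allowlist can persist after connectivity returns. The sketch below prints one possible schedule. backoff_delay is a hypothetical helper, and the doubling factor is an assumption: only the 1s floor and 30s cap are stated above.

```shell
# backoff_delay (hypothetical helper): delay in seconds before retry number
# $1 (0-based), doubling from 1s and capped at 30s.
# The doubling factor is an assumption; only the 1s and 30s bounds are documented.
backoff_delay() {
  attempt=$1
  delay=1
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt 30 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 30 ]; then delay=30; fi
  echo "$delay"
}

# Print the first few retry delays.
for n in 0 1 2 3 4 5 6; do
  printf 'retry %s: wait %ss\n' "$n" "$(backoff_delay "$n")"
done
```

Under these assumptions the delay reaches the cap by the sixth attempt, so after connectivity is restored the next fetch should follow within about 30 seconds.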

Diagnostics.

# Inspect the proxy NetworkPolicy's GitHub ipBlock egress peers — empty means the cache never populated.
kubectl get networkpolicy -n <namespace> actions-gateway-proxy \
  -o jsonpath='{.spec.egress[*].to[*].ipBlock.cidr}'

# Look for retry warnings in the GMC log.
kubectl logs -n gmc-system deploy/gmc-controller-manager \
  | grep -i "GitHub IP-range"

Resolution.

  • Confirm the GMC pod itself can reach api.github.com (corporate egress firewall, DNS, or proxy in front of the cluster). The reconciler retries automatically; once connectivity is restored the next successful fetch patches every existing proxy NetworkPolicy.
  • If the tenant manages its own egress policy (Cilium/Calico FQDN rules), set spec.proxy.managedNetworkPolicy: false so the reconciler leaves the policy alone.


Worker Pods Stuck Pending

Symptoms. Jobs are acquired (actions_gateway_jobs_acquired_total increments) but worker pods remain in Pending state for more than 60 seconds. pod_creation_latency_seconds p95 exceeds the 15s SLO target.

Likely causes.

  • Namespace ResourceQuota is exhausted: no pod slot, CPU request, or memory request available.
  • No node has enough capacity for the pod's requested resources (GPU nodes may be at capacity).
  • PriorityClass referenced in priorityTiers does not exist.
  • Image pull is slow due to a large image on a cold node (expected; see SLO targets in Appendix A).

Diagnostics.

# Check quota usage
kubectl describe resourcequota -n <namespace>

# Describe a stuck pod to see the scheduling event
kubectl describe pod -n <namespace> <worker-pod-name>
# Look for: "Insufficient cpu", "Insufficient memory", "Insufficient nvidia.com/gpu",
#           "no nodes available to schedule pods", "didn't match Pod's node affinity/selector"

# Check whether the PriorityClass exists
kubectl get priorityclass <priorityClassName>

# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

Resolution.

  • If quota is exhausted: raise the platform-owned ResourceQuota on the namespace (kubectl edit resourcequota -n <namespace> <quota-name>) or reduce maxWorkers / last-tier threshold.
  • If no GPU nodes are available: check node autoscaler status or provision additional nodes.
  • If a PriorityClass is missing: create it (cluster-admin action) or remove the tier reference.
  • If image pull is slow (first job on a cold node): this is expected. If it exceeds the p99 SLO (60s), consider pre-pulling the image via a DaemonSet or enabling image streaming.

Deadline. A pod that stays Pending is not held forever: after pendingPodDeadline (default 10m, per-RunnerGroup) the AGC deletes it to free the concurrency-ceiling slot it was holding — see the next runbook section. Diagnose a stuck pod (kubectl describe pod) before the deadline reaps it, or raise pendingPodDeadline temporarily while debugging.


Worker Pod Reaped While Pending (WorkerPodStuckPending)

Symptoms. A Warning Event with reason WorkerPodStuckPending appears on the RunnerGroup (kubectl describe runnergroup -n <namespace>), actions_gateway_worker_pods_reaped_total{reason="pending_deadline"} increments, and the job the pod was created for is cancelled by GitHub (it never started, so its lock lapsed). The worker pod itself is gone.

What happened. The pod stayed Pending longer than the RunnerGroup's pendingPodDeadline (default 10m), so the AGC deleted it. A permanently Pending pod would otherwise hold one of the group's concurrency-ceiling slots forever — the ceiling counts Pending pods. The deadline is a capacity-protection mechanism, not a retry mechanism: the job is not re-queued automatically.

Likely causes.

  • workerImage (or the podTemplate container image) does not exist or is not pullable from the cluster (ErrImagePull / ImagePullBackOff).
  • podTemplate scheduling constraints (nodeSelector, tolerations, GPU resources) that no node satisfies.
  • Node autoscaler provisioning slower than the deadline (common for GPU node pools).

Diagnostics.

# The reap event names the deleted pod and the deadline that fired
kubectl get events -n <namespace> --field-selector reason=WorkerPodStuckPending

# Rate of reaps per group
# PromQL: rate(actions_gateway_worker_pods_reaped_total{reason="pending_deadline"}[1h])

# Reproduce the pull/scheduling failure before the next reap:
# trigger a job, then describe the new Pending pod within the deadline window
kubectl get pods -n <namespace> -l actions-gateway/runner-group=<group> -w
kubectl describe pod -n <namespace> <worker-pod-name>

Resolution.

  • Fix the unpullable image or unsatisfiable scheduling constraint: that is the root cause; the reap is the messenger.
  • If scheduling is legitimately slow (autoscaled GPU nodes), raise spec.pendingPodDeadline on the RunnerGroup (or the matching runnerGroups[] entry of the ActionsGateway CR) above the worst-case node-provisioning time, e.g. pendingPodDeadline: "30m".
  • Re-run the cancelled workflow from the GitHub UI once the cause is fixed.


Proxy Pool Not Scaling

Symptoms. The HPA for the proxy pool shows TARGETS: <unknown>/60% and the replica count does not increase under load.

Likely cause. resources.requests.cpu is unset or zero for proxy pods. The Kubernetes Horizontal Pod Autoscaler (HPA) computes CPU utilization as (current_cpu_usage / requested_cpu). If requests.cpu is unset or zero, that ratio is undefined, so the HPA reports <unknown> for the target metric and stops scaling entirely.
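
The utilization ratio can be worked through with concrete numbers. This is a simplified per-pod sketch (the real HPA aggregates across pods); cpu_utilization is a hypothetical helper, and the 6m usage figure is illustrative.

```shell
# cpu_utilization (hypothetical helper): percentage of requested CPU in use.
# $1 = current usage in millicores, $2 = requests.cpu in millicores.
# Prints "<unknown>" when the request is zero or unset, mirroring the HPA symptom.
cpu_utilization() {
  usage_m=$1
  request_m=${2:-0}
  if [ "$request_m" -le 0 ]; then
    echo "<unknown>"
  else
    echo "$((usage_m * 100 / request_m))%"
  fi
}

cpu_utilization 6 10   # 6m of usage against a 10m request
cpu_utilization 6 0    # no CPU request set
```

With a 10m request, 6m of usage lands exactly on a 60% target; with no request there is nothing to divide by, which is the <unknown> case.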

Diagnostics.

# Check HPA status
kubectl describe hpa -n <namespace> actions-gateway-proxy

# Check proxy pod resource requests
kubectl get pod -n <namespace> -l app=actions-gateway-proxy -o jsonpath='{.items[0].spec.containers[0].resources}'

# Check metrics-server is running
kubectl get pods -n kube-system -l k8s-app=metrics-server

Resolution.

Ensure spec.proxy.resources.requests.cpu is set to a non-zero value in the ActionsGateway spec. The default is 10m. If you explicitly set resources without including requests.cpu, the whole resources block is replaced and defaults are lost — set all four sub-fields explicitly:

proxy:
  resources:
    requests:
      cpu: "10m"
      memory: "32Mi"
    limits:
      cpu: "100m"
      memory: "64Mi"

After updating the spec, patch the proxy Deployment or trigger a rollout; the HPA will start computing utilization on the next metrics scrape cycle (~30s).


Proxy Tunnel Closed Mid-Stream — Idle or Lifetime Cap

Symptoms. A worker job logs a connection reset, EOF, or broken pipe from the GitHub SDK / curl / git, with no proxy 502 response. The proxy pod itself is healthy and serving other tunnels.

Likely cause. The proxy enforces two per-tunnel deadlines on the CONNECT relay (M-18, 2026-05-31):

  • Idle timeout — 5 minutes of no data in either direction. A long-poll against the GitHub API or a stalled SDK call hits this first.
  • Hard lifetime cap — 6 hours absolute, regardless of activity. A continuous artifact stream or Twirp log relay that exceeds this is torn down even with traffic flowing.

These are not bugs. They bound goroutine and file-descriptor exhaustion from slow or stuck clients. The healthy case (an actively-used GitHub API call) completes in seconds and does not trip either cap.

Diagnostics.

The proxy serves /metrics over mutual TLS on :8443 (not :8081, which now carries only the plaintext /healthz + /readyz probes). Scraping requires the per-tenant scraper client certificate the GMC publishes — see Metrics scrape returns a TLS / connection error for how to fetch the bundle. With the bundle written to scraper.crt / scraper.key / metrics-ca.crt:

ns=<namespace>
# Distribution of tunnel lifetimes; a heavy tail near 21600s (6h) or
# a spike at 300s (5m idle) indicates clients hitting the caps.
curl -s --cert scraper.crt --key scraper.key --cacert metrics-ca.crt \
  "https://actions-gateway-proxy.$ns.svc:8443/metrics" | \
  grep actions_gateway_proxy_tunnel_duration_seconds_bucket

# Active vs. total tunnels — healthy ratio is "active << total".
curl -s --cert scraper.crt --key scraper.key --cacert metrics-ca.crt \
  "https://actions-gateway-proxy.$ns.svc:8443/metrics" | \
  grep -E 'actions_gateway_proxy_connections_(active|total)'

Resolution.

For idle hits: examine the workflow step that stalls. Typical causes are a step that sleeps inside a long-running curl --connect-timeout 0, or a misconfigured webhook receiver. The fix is usually in the workflow, not the proxy.

For lifetime-cap hits: split very long-running uploads or streams across multiple HTTP requests. The 6h cap is a safety net for stuck connections; a legitimately-long single stream should be rare.

The defaults cannot be changed during an incident: there is no env-var knob today, so patching the proxy Deployment with environment overrides has no effect. The defaults are baked into the Server struct and require a code change to adjust. File a Queue item if a tenant repeatedly hits either cap on a legitimate workload.


Metrics scrape returns a TLS / connection error

Symptoms. Prometheus (or a manual curl) of a per-tenant proxy or AGC /metrics endpoint fails with one of:

  • remote error: tls: certificate required / bad certificate — no client cert, or one signed by the wrong CA.
  • connection refused on :8081/metrics — the metrics endpoint moved to :8443 (mTLS); :8081 now serves only /healthz + /readyz.
  • context deadline exceeded / no route — the scraper namespace is not labelled metrics: enabled, so the NetworkPolicy drops the connection before the handshake.

Cause. The proxy and AGC serve /metrics over mutual TLS on :8443 (Q69). A scraper must (1) connect from a namespace labelled metrics: enabled and (2) present a client certificate signed by the per-tenant metrics CA the GMC issues. Both halves are required.

Resolution.

  1. Label the monitoring namespace so the NetworkPolicy admits it:
    kubectl label namespace <prometheus-namespace> metrics=enabled
    
  2. Fetch the scraper client bundle the GMC publishes in each tenant namespace and point the scrape at :8443 with scheme: https:
    ns=<tenant-namespace>
    kubectl get secret actions-gateway-metrics-client -n "$ns" -o jsonpath='{.data.tls\.crt}' | base64 -d > scraper.crt
    kubectl get secret actions-gateway-metrics-client -n "$ns" -o jsonpath='{.data.tls\.key}' | base64 -d > scraper.key
    kubectl get secret actions-gateway-metrics-client -n "$ns" -o jsonpath='{.data.ca\.crt}'  | base64 -d > metrics-ca.crt
    curl -s --cert scraper.crt --key scraper.key --cacert metrics-ca.crt \
      "https://actions-gateway-proxy.$ns.svc:8443/metrics" | head
    
    Delete the extracted key file when finished. For a ServiceMonitor, mount the bundle and reference it from tlsConfig (cert/keySecret/ca).
  3. If the cert is rejected after a CA rotation, restart the proxy/AGC pods and re-fetch the client bundle. The GMC re-issues the whole bundle ~30 days before expiry, but pods read certs at startup, so running pods keep the old certs until restarted.

RateLimited Condition on ActionsGateway

Symptoms. kubectl get actionsgateway shows a RateLimited=True condition. actions_gateway_active_sessions is at or near the per-installation budget.

Likely cause. The GitHub App installation's API budget (15,000 GET /message requests/hour) is exhausted. This occurs when listener counts across all RunnerGroups burst toward the sum of their maxListeners ceilings and stay there for a sustained period.

SLO threshold. A RateLimited condition lasting more than 1 minute during non-peak hours indicates the installation is over budget. Durations exceeding 10 minutes during business hours should page on-call.

Diagnostics.

# Check the condition
kubectl get actionsgateway -n <namespace> <name> -o jsonpath='{.status.conditions}' | jq .

# Check active sessions vs. budget
# Budget: ~208 sessions (15000/hr ÷ 72 polls/session/hr)
# Metric: actions_gateway_active_sessions{namespace="<namespace>"}

# Check per-RunnerGroup maxListeners sum
kubectl get runnergroup -n <namespace> -o jsonpath='{.items[*].spec.maxListeners}'

Resolution.

  • If a burst is temporary and shorter than 10 minutes: no action required; the condition will clear as the burst subsides.
  • If maxListeners values are set higher than needed, reduce them.
  • If the tenant's RunnerGroup count × maxListeners sustainably exceeds the installation budget, shard to a second ActionsGateway CR with a new GitHub App installation. See Appendix E §E.6.
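
The budget arithmetic from the diagnostics (15000 requests/hour ÷ 72 polls/session/hour ≈ 208 sessions) can be compared against the worst-case listener total. This is a sketch under stated assumptions: session_budget, sum_listeners and check_budget are hypothetical helpers, and the 50/100/80 values are illustrative only.

```shell
# Figures taken from the diagnostics above: 15000 requests/hour per
# installation and 72 polls per session per hour.
HOURLY_BUDGET=15000
POLLS_PER_SESSION_PER_HOUR=72

# session_budget (hypothetical helper): whole sessions the budget can sustain.
session_budget() {
  echo $((HOURLY_BUDGET / POLLS_PER_SESSION_PER_HOUR))
}

# sum_listeners (hypothetical helper): sums whitespace-separated integers on
# stdin, the shape printed by the maxListeners jsonpath query.
sum_listeners() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# check_budget (hypothetical helper): compares a worst-case listener total
# with the session budget.
check_budget() {
  total=$1
  budget=$(session_budget)
  if [ "$total" -gt "$budget" ]; then
    echo "over budget: ${total} > ${budget}"
  else
    echo "within budget: ${total} <= ${budget}"
  fi
}

# Illustrative values: three RunnerGroups with maxListeners 50, 100 and 80.
check_budget "$(echo '50 100 80' | sum_listeners)"
```

In practice the input to sum_listeners would be the output of the kubectl get runnergroup query from the diagnostics. A total above the budget means a simultaneous burst to every ceiling can exhaust the installation's hourly allowance.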


GitHub App Secret Misconfiguration

Symptoms. AGC logs show errors like failed to create installation token, private key: RSA key parse error, or 401 Unauthorized. The ActionsGateway condition AGCAvailable=False with reason CredentialError.

Common misconfigurations.

| Error message | Likely cause |
| --- | --- |
| private key: RSA key parse error | PEM key has extra whitespace, missing newline, or wrong format (PKCS#8 instead of RSA PKCS#1). |
| 401 Unauthorized on token exchange | appId or installationId is wrong. |
| 404 Not Found on token exchange | The GitHub App is not installed in the target organization or the installationId does not match. |
| 422 Unprocessable Entity | The App lacks the Actions: Read and Administration: Read permissions. |

Diagnostics.

# Check Secret keys exist and are non-empty
kubectl get secret -n <namespace> <name> -o jsonpath='{.data.appId}' | base64 -d
kubectl get secret -n <namespace> <name> -o jsonpath='{.data.installationId}' | base64 -d
kubectl get secret -n <namespace> <name> -o jsonpath='{.data.privateKey}' | base64 -d | head -1
# Expected first line: -----BEGIN RSA PRIVATE KEY-----

# Verify the App ID and installation ID match the GitHub App
# GitHub UI: Settings → Developer settings → GitHub Apps → <app> → General (App ID)
# GitHub UI: Settings → Developer settings → GitHub Apps → <app> → Install App (Installation ID in URL)
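
The first-line check on the private key can be made explicit. The sketch below uses a hypothetical helper, classify_pem_header, that reads a PEM key on stdin and reports whether the header is the expected PKCS#1 form, the PKCS#8 form (-----BEGIN PRIVATE KEY-----), or something else, such as a header with stray leading whitespace.

```shell
# Hypothetical helper (illustrative name): classify a PEM key by its first line.
classify_pem_header() {
  # IFS= and -r keep leading whitespace and backslashes intact, so a header
  # with stray whitespace is reported as "unknown" rather than silently accepted.
  IFS= read -r first_line || true
  case "$first_line" in
    "-----BEGIN RSA PRIVATE KEY-----") echo "pkcs1" ;;
    "-----BEGIN PRIVATE KEY-----")     echo "pkcs8" ;;
    *)                                 echo "unknown" ;;
  esac
}

# Usage against a live cluster (assumes kubectl access; not run here):
#   kubectl get secret -n <namespace> <name> -o jsonpath='{.data.privateKey}' \
#     | base64 -d | classify_pem_header
```

A result of pkcs8 or unknown corresponds to the first row of the misconfiguration table; pkcs1 means the header is fine and the problem, if any, lies elsewhere in the key or in the IDs.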

Resolution. Re-create the Secret with correct values. To trigger a rolling update on the AGC Deployment after fixing the Secret, change gitHubAppRef.name in the ActionsGateway spec to reference the new Secret name (the GMC will roll the AGC Deployment automatically) or manually restart the Deployment:

kubectl rollout restart deploy/actions-gateway-controller -n <namespace>

See Getting Started — Rotating GitHub App Credentials for the full rotation procedure.


Token Refresh Errors Spiking

Symptoms. actions_gateway_token_refresh_errors_total is increasing. GitHub App installation tokens expire after one hour; if refresh fails, new sessions cannot be established once the token expires.

Likely causes.

  • GitHub API is temporarily unavailable or returning 5xx errors.
  • The GitHub App private key was rotated but the Secret was not updated.
  • Network path from AGC to GitHub API is down (proxy pool issue).

Diagnostics.

# Check the error rate
# Metric: rate(actions_gateway_token_refresh_errors_total[5m])

# Check AGC logs for the error detail
kubectl logs -n <namespace> deploy/actions-gateway-controller | grep "token refresh"

# Test connectivity to GitHub via the tenant proxy (AGC is distroless — use an
# ephemeral curl pod in the same namespace; it picks up the same NetworkPolicy egress).
kubectl run nettest-$$ -n <namespace> --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --overrides='{"spec":{"automountServiceAccountToken":false,"containers":[{"name":"c","image":"curlimages/curl:latest","command":["sh","-c","curl -x https://actions-gateway-proxy:8080 -sI https://api.github.com/app"]}]}}'

Resolution.

  • If GitHub is temporarily unavailable: the AGC's exponential back-off retry (5s → 60s cap) will recover automatically. Monitor until the error rate returns to zero.
  • If the private key was rotated: update the Secret. See "Rotating GitHub App Credentials" in Getting Started.
  • If the proxy is unreachable: see Proxy Pool Not Scaling and the network connectivity section below.
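
To estimate how long recovery can lag once GitHub is reachable again, it helps to see the retry delays written out. The sketch below is illustrative only: backoff_schedule is a hypothetical helper, and it assumes the delay doubles after each failed attempt. This guide states only the 5s starting delay and the 60s cap, so the multiplier is an assumption.

```shell
# Hypothetical helper (illustrative name). Assumes a doubling back-off; only the
# 5s start and the 60s cap come from this guide.
backoff_schedule() {
  # $1 = initial delay (s), $2 = cap (s), $3 = number of attempts
  delay=$1
  cap=$2
  attempts=$3
  i=0
  out=""
  while [ "$i" -lt "$attempts" ]; do
    out="$out${out:+ }$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}

backoff_schedule 5 60 6   # prints: 5 10 20 40 60 60
```

Under that assumption the retry interval reaches the cap by the fifth attempt, so once GitHub recovers the next refresh attempt is at most about a minute away.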

SLO. Token refresh errors should stay below 1 per hour per tenant. Above this rate, begin investigating immediately. In-flight sessions will fail at the next reconnection once the token expires (~1 hour).


RenewJob Failures Rising

Symptoms. actions_gateway_renewjob_errors_total is increasing. Jobs may start being cancelled by GitHub before completion.

Likely causes.

  • Network connectivity issues between the AGC and GitHub (via proxy).
  • GitHub API is temporarily unavailable.
  • The runner job lock window expired before the renewer could refresh (AGC was slow or restarting).

Diagnostics.

# Check recent error rate
# Metric: rate(actions_gateway_renewjob_errors_total[5m])

# Check AGC logs for renewal errors and job IDs
kubectl logs -n <namespace> deploy/actions-gateway-controller | grep "renewjob"

# Confirm the proxy pool is healthy
kubectl get pods -n <namespace> -l app=actions-gateway-proxy

Resolution.

  • Transient GitHub API errors: the renewer retries; monitor until the rate returns to zero.
  • Proxy pool unhealthy: fix the proxy pool (see Proxy Pool Not Scaling).
  • If the AGC restarted mid-job: jobs whose lock expired will have been cancelled by GitHub. These require manual re-run. Check actions_gateway_eviction_retries_exhausted_total for any jobs that were also evicted.

Each renewjob error is a warning, not an immediate job failure — GitHub grants ~10 minutes per renewal window. A single transient error on a long-running job is rarely fatal.


Sessions Stuck in 401/EOF GetMessage Loops (Tenant Throughput Decays to Zero)

Symptoms. On gateway versions without the Q114 self-heal (≤ the M4 validation build):

  • AGC logs fill with repeating GetMessage error ... decode response: EOF and later broker: unauthorized (HTTP 401) lines for the same session, backing off forever.
  • The repo/org runner list (gh api .../actions/runners) loses one runner after each completed job, and the registrations never come back.
  • RunnerGroup status.activeSessions decays over time; after roughly maxListeners completed jobs, queued workflow jobs wait forever even though the AGC pod is healthy.

Cause. GitHub deletes a JIT-registered runner record once it acquires a job (single-use runners). Pre-fix AGC versions keep polling the dead session with the dead agent's credentials instead of re-registering, so every completed job permanently burns one listener slot (M4 §12, bug 2).

Diagnostics.

# Repeating EOF/401 poll errors
kubectl logs -n <namespace> deploy/actions-gateway-controller | grep -E "decode response: EOF|unauthorized"

# Listener slots remaining
kubectl get runnergroup -n <namespace> -o jsonpath='{.items[*].status.activeSessions}'

# On fixed versions, recycles should be happening instead:
# Metric: rate(actions_gateway_agent_recycles_total[15m])  — roughly tracks job completions
# Metric: actions_gateway_agent_recycle_errors_total       — should stay flat
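
The raw activeSessions values are easier to read next to each group's name and ceiling. The sketch below uses a hypothetical helper, summarize_sessions, that takes comma-separated name,activeSessions,maxListeners lines and marks groups with no active sessions. The comma-separated layout and the DEPLETED label are choices made for this sketch; a group that is idle by design may also report zero.

```shell
# Hypothetical helper (illustrative name): summarize listener slots per RunnerGroup.
# stdin: one "<name>,<activeSessions>,<maxListeners>" line per group.
summarize_sessions() {
  awk -F, '{
    # An unset status.activeSessions arrives as an empty field; treat it as 0.
    active = ($2 == "") ? 0 : $2 + 0
    flag = (active == 0) ? " DEPLETED" : ""
    print $1 " " active "/" $3 flag
  }'
}

# Usage against a live cluster (assumes kubectl access; not run here):
#   kubectl get runnergroup -n <namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name},{.status.activeSessions},{.spec.maxListeners}{"\n"}{end}' \
#     | summarize_sessions
```

Running it a few times across several completed jobs shows whether the active count is shrinking toward zero, which is the decay pattern described in the symptoms.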

Resolution.

  • Upgrade to a gateway version with the Q114 self-heal. Fixed versions re-register each agent after every job (actions_gateway_agent_recycles_total{trigger="post_job"}) and heal stale sessions discovered after a restart (trigger="stale_session" / "startup"); no per-job operator action is needed.
  • Interim manual recovery on pre-fix versions: delete the RunnerGroup's agent Secrets and restart the AGC so it registers a fresh pool:

kubectl delete secret -n <namespace> -l actions-gateway/runner-group=<group>
kubectl rollout restart deploy/actions-gateway-controller -n <namespace>

Expect 409 Already exists registration errors for any agent that never ran a job — its record survives server-side under an ID the AGC no longer knows. Delete the survivor from GitHub first: find its ID with gh api '.../actions/runners?name=<group>-<index>', then gh api -X DELETE .../actions/runners/<id>. Fixed versions resolve this 409 automatically.

On fixed versions, a sustained rise of actions_gateway_agent_recycle_errors_total means the AGC cannot re-register agents (registration API unreachable, installation token failures, or revoked GitHub App runner-administration permission) — listener capacity shrinks until recycles succeed. Check AGC logs for recycle errors and verify the App's runner permissions.


Network Connectivity Failures

Symptoms. The AGC cannot reach GitHub through the proxy. Logs show connection refused, dial tcp: i/o timeout, or proxy: no response from proxy.

Likely causes.

  • The proxy pod is not running or not ready.
  • HTTP_PROXY/HTTPS_PROXY environment variables are incorrect (wrong Service name or port).
  • actions-gateway-workload NetworkPolicy is blocking the AGC-to-proxy egress path (e.g. proxy ClusterIP changed after a recreate and the rule wasn't reconciled).
  • actions-gateway-proxy NetworkPolicy is blocking the proxy's egress to GitHub (IP ranges stale or managedNetworkPolicy: false with no replacement rule).
  • actions-gateway-controller NetworkPolicy is missing: the AGC can't reach the K8s API server, so token refresh and webhook health checks fail before any GitHub traffic.

Diagnostics.

# Check proxy pod status
kubectl get pods -n <namespace> -l app=actions-gateway-proxy

# Verify the proxy Service exists and has endpoints
kubectl get svc -n <namespace> actions-gateway-proxy
kubectl get endpoints -n <namespace> actions-gateway-proxy

# Check the AGC container's HTTPS_PROXY env var (distroless — inspect spec, not the running process)
kubectl get pod -n <namespace> -l app=actions-gateway-controller \
  -o jsonpath='{range .items[0].spec.containers[?(@.name=="agc")].env[?(@.name=="HTTPS_PROXY")]}{.name}={.value}{"\n"}{end}'

# Test proxy connectivity using an ephemeral curl pod in the same namespace
kubectl run nettest-$$ -n <namespace> --rm -it --restart=Never \
  --image=curlimages/curl:latest \
  --overrides='{"spec":{"automountServiceAccountToken":false,"containers":[{"name":"c","image":"curlimages/curl:latest","command":["sh","-c","curl -v -x https://actions-gateway-proxy:8080 https://api.github.com 2>&1 | head -20"]}]}}'

# Check NetworkPolicy rules — there are three: workload, agc, proxy
kubectl get networkpolicy -n <namespace>
# Expected: actions-gateway-workload, actions-gateway-controller, actions-gateway-proxy
kubectl describe networkpolicy -n <namespace>

# Check the IP range refresh metric
# Metric: actions_gateway_ip_range_updates_total{namespace="<namespace>"}
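
The check for the three expected NetworkPolicies can be scripted so that a missing one stands out. The sketch below uses a hypothetical helper, missing_policies, that reads policy names on stdin and prints whichever of the three expected names is absent.

```shell
# Hypothetical helper (illustrative name): report which expected policies are absent.
# stdin: NetworkPolicy names, one per line.
missing_policies() {
  present=$(cat)
  for want in actions-gateway-workload actions-gateway-controller actions-gateway-proxy; do
    # -x matches whole lines only, -F treats the name as a fixed string.
    if ! printf '%s\n' "$present" | grep -qxF "$want"; then
      echo "$want"
    fi
  done
}

# Usage against a live cluster (assumes kubectl access; not run here):
#   kubectl get networkpolicy -n <namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | missing_policies
```

Empty output means all three policies exist; it says nothing about whether their rules are correct, which still needs the describe output above.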

Resolution.

  • If the proxy pod is down: check its logs and restart if necessary.
  • If the NetworkPolicy egress rules are stale: trigger a manual refresh by temporarily setting spec.proxy.managedNetworkPolicy: false and back to true, or wait for the 24-hour automatic refresh cycle. Check the GitHub meta API for current IP ranges: curl https://api.github.com/meta | jq .actions.
  • If the NO_PROXY list is missing the cluster service CIDR: update spec.proxy.noProxyCIDRs to include your cluster's service CIDR (see the noProxyCIDRs field documentation in §3.1).


AGC Cannot Reach the Kubernetes API Server (NetworkPolicy + post-DNAT port mismatch)

Symptoms. AGC logs show dial tcp 10.96.0.1:443: i/o timeout (or similar) when calling the K8s API server. The actions-gateway-controller NetworkPolicy appears to allow port 443, yet the connection is silently dropped. Most often surfaces in kind, but possible on any cluster where the kubernetes Service backends listen on a port other than 443.

Cause. NetworkPolicy enforcement evaluates packets after kube-proxy's DNAT. When a pod connects to kubernetes.default.svc (ClusterIP 10.96.0.1:443), kube-proxy DNATs the destination to the apiserver's actual Endpoints address — in kind, that's <node-ip>:6443. The policy controller sees the post-DNAT destination port (6443), and an NP rule that allows only port 443 doesn't match. This is the port-axis equivalent of the ipBlock: <ClusterIP>/32 trap that bit the proxy NP in PR #59.

Diagnostics.

# 1. Confirm the apiserver Endpoints port. If it's 6443, the AGC NP must allow 6443.
kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes \
  -o jsonpath='{.items[0].ports[0].port}{"\n"}'

# 2. Confirm the AGC NetworkPolicy actually allows both 443 and 6443.
kubectl get networkpolicy -n <namespace> actions-gateway-controller -o yaml \
  | yq '.spec.egress[].ports[].port' | sort -u

# 3. If the cluster uses kindnet / kube-network-policies, check the verdict log
#    on the node hosting the AGC pod. Look for lines like:
#      "Pod is not allowed to connect to port" pod="<ns>/<agc-pod>" port=6443
kubectl get pod -n <namespace> -l app=actions-gateway-controller \
  -o jsonpath='{.items[0].spec.nodeName}{"\n"}'
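# kubectl logs has no --field-selector flag, so resolve the kindnet pod on that node first.
kubectl get pod -n kube-system -l app=kindnet --field-selector spec.nodeName=<node-name> -o name
kubectl logs -n kube-system <kindnet-pod-name> --tail=200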
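
Steps 1 and 2 amount to a single comparison: is the post-DNAT backend port among the ports the policy allows? The sketch below makes that comparison with a hypothetical helper, np_allows_port; it checks port numbers only and ignores protocol and destination selectors.

```shell
# Hypothetical helper (illustrative name): succeed if the backend port is allowed.
np_allows_port() {
  # $1 = post-DNAT backend port (from the EndpointSlice)
  # remaining args = ports listed in the NetworkPolicy egress rules
  want=$1
  shift
  for p in "$@"; do
    if [ "$p" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

# Example: a policy that allows only 443 does not match a 6443 backend.
if np_allows_port 6443 443; then
  echo "allowed"
else
  echo "blocked: policy does not list port 6443"
fi
```

On a live cluster the first argument would come from the EndpointSlice query in step 1 and the remaining arguments from the yq output in step 2.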

Resolution. Ensure buildAGCNetworkPolicy allows both port 443 (production Service shape) and port 6443 (kind / Endpoints-on-6443 clusters). The shipped policy does this. If you see this on a custom build or a hand-edited NP, add the 6443 rule. The diagnosis writeup at docs/development/networkpolicy-port-matching.md has a minimal repro and the reasoning behind allowing both ports.

If you see the same symptom for an ingress-type rule or for a different Service whose backend port differs from the Service port, the same fix applies: list both ports, or omit the port restriction on that rule.


Worker Pod Runner.Worker Fails TLS Handshake With UntrustedRoot

Symptoms. Worker pod logs (look at the runner container) contain repeated lines like:

System.Security.Authentication.AuthenticationException: The remote certificate is invalid because of errors in the certificate chain: UntrustedRoot

emitted from JobExtension connectivity checks, ResultServer init, JobServerQueue log uploads, the GitHubActionsService log-blob fetch, or RunServer.CompleteJobAsync. The runner retries for ~3 minutes, then exits 1. The AGC then logs worker pod completed phase=Failed, renewjob starts returning 401 Not authorized for this job, and the GitHub workflow concludes cancelled even though the actual job steps may have run.

Cause. Runner.Worker's .NET HttpClient is validating the egress proxy's TLS cert and the worker pod's trust store does not include the cert-manager-issued self-signed CA that signed it. This is the worker-side mirror of the AGC's proxy-CA pinning (§5.2 "Cross-Tenant Proxy CA Trust"): the AGC mounts the CA explicitly so its outbound HTTPS works; worker pods must do the same.

The AGC's pod provisioner is supposed to project the per-tenant actions-gateway-proxy-tls Secret into every worker pod at /etc/actions-gateway/proxy-ca/tls.crt and set PROXY_CA_CERT_PATH so the worker entrypoint wrapper builds a combined SSL_CERT_FILE bundle before exec'ing Runner.Worker. UntrustedRoot means one of those steps did not happen.
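
The bundle-building step can be pictured roughly as follows. This is a sketch of the behaviour described above, not the shipped wrapper: build_ca_bundle and its three arguments are hypothetical, and the real entrypoint may locate the system bundle and export SSL_CERT_FILE differently. The two log strings are the ones this guide quotes for the wrapper.

```shell
# Hypothetical sketch of the wrapper's trust-store step (not the shipped script).
build_ca_bundle() {
  # $1 = system CA bundle, $2 = proxy CA cert path (PROXY_CA_CERT_PATH),
  # $3 = combined bundle to write (the caller would export it as SSL_CERT_FILE)
  if [ -s "$2" ]; then
    # Proxy CA is mounted and non-empty: append it to the system roots.
    cat "$1" "$2" > "$3"
    echo "proxy CA trust installed"
  else
    # Missing or empty file is tolerated as a no-op; no bundle is written.
    echo "no proxy CA cert mounted; skipping trust-store install"
  fi
}
```

The no-op branch is the one that produces UntrustedRoot later: the runner starts normally but has no trust in the proxy's certificate.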

Diagnostics.

# 1. Inspect a failed worker pod's spec — the Secret volume must exist.
kubectl get pod -n <namespace> <worker-pod-name> -o yaml \
  | yq '.spec.volumes[] | select(.name=="proxy-ca")'
# Expected: a Secret volume with secretName: actions-gateway-proxy-tls and items: [{key: tls.crt, path: tls.crt}]
# If empty: the AGC was deployed without PROXY_TLS_SECRET_NAME.

# 2. Confirm the AGC has the PROXY_TLS_SECRET_NAME env wired.
kubectl get pod -n <namespace> -l app=actions-gateway-controller \
  -o jsonpath='{range .items[0].spec.containers[?(@.name=="agc")].env[?(@.name=="PROXY_TLS_SECRET_NAME")]}{.name}={.value}{"\n"}{end}'
# Expected: PROXY_TLS_SECRET_NAME=actions-gateway-proxy-tls
# Empty means the GMC needs to roll the AGC Deployment (likely an upgrade across the 5h boundary).

# 3. Confirm the worker container's PROXY_CA_CERT_PATH env.
kubectl get pod -n <namespace> <worker-pod-name> -o yaml \
  | yq '.spec.containers[] | select(.name=="runner") | .env[] | select(.name=="PROXY_CA_CERT_PATH")'

# 4. Confirm the proxy TLS Secret exists and contains tls.crt.
kubectl get secret -n <namespace> actions-gateway-proxy-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -issuer -dates

Resolution.

  • If the worker pod has no proxy-ca volume: ensure the AGC was started with PROXY_TLS_SECRET_NAME=actions-gateway-proxy-tls (the GMC injects this automatically; if it's missing, the GMC needs to roll the AGC Deployment, e.g. by bumping ag.Spec or restarting the GMC).
  • If the volume is present but the wrapper logs nothing about proxy CA trust installed: check that PROXY_CA_CERT_PATH is set on the runner container and the mounted file is non-empty. An empty/missing file is tolerated as a no-op, which silently leaves the runner with no proxy trust. The wrapper log line no proxy CA cert mounted; skipping trust-store install distinguishes this case from a wrapper that ran the install successfully.
  • If the proxy TLS Secret is missing or the cert has expired: the GMC's cert-manager integration (§2.1 "Proxy Deployer") owns rotation; check the GMC's logs for issuer errors. As a fallback, deleting the Secret triggers reissuance.
  • If the issue persists after the volume and env are correct: confirm the proxy pod is presenting the cert signed by the CA in the Secret: kubectl exec into a curl pod in the same namespace and run `openssl s_client -connect actions-gateway-proxy:8080 -showcerts