Observability¶
Audience: SRE, Platform engineer
Both the GMC and AGC expose Prometheus metrics at :8080/metrics (no authentication required by default). The standard controller-runtime metrics server is used; additional built-in metrics (reconcile latency, work queue depth, etc.) are emitted automatically alongside the custom metrics below.
For SLO targets associated with these metrics, see Appendix A — Capacity Targets & SLOs.
Table of Contents¶
- Logging
- Distributed Tracing (AGC)
  - Enabling tracing
  - Enabling tracing on GMC-managed AGCs
- How to Access Metrics
  - Install-time scraping prerequisites (GMC manager)
- Full Metrics Reference
  - Proxy metrics
- Symptom → Metric Mapping
- Recommended Alert Rules
- SLO Recording Rules
- Grafana Dashboard
  - Suggested Panel Layout
  - Dashboard Variables
- Label Cardinality Warning
Logging¶
All four components — the GMC, the per-tenant AGC, the egress proxy, and the worker wrapper — emit structured JSON logs at info level by default, one JSON shape per process stream, ready to ship to a log aggregator (Loki, Elasticsearch, CloudWatch, etc.) without reformatting. No flag needs to be set in production; the JSON default is what the GMC-provisioned Deployments run with.
The controllers (GMC, AGC) take controller-runtime's standard zap flags. For local development, pass --zap-devel to switch to human-readable console logs at debug level, or use the finer-grained --zap-encoder / --zap-log-level flags (run a controller with --help for the full set). Application code paths that log through the Go standard library's log/slog are bridged onto the same zap logger, so --zap-log-level governs every line a controller emits — not just the manager's own — and the whole process shares one JSON schema.
The egress proxy and the worker wrapper are not controllers; they read their level from the LOG_LEVEL environment variable (info | debug, default info).
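Because every component writes one JSON object per line, a small script can tally lines by level before the logs ever reach an aggregator. The sketch below is illustrative only: the function name `count_levels` is hypothetical, and the `level` and `msg` keys are an assumption based on zap's usual production encoder; the proxy and worker wrapper may use different key names, so check a real line first.

```python
import json
from collections import Counter


def count_levels(lines, level_key="level"):
    """Tally JSON log lines by level; lines that are not JSON objects count as 'unparsed'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        if isinstance(record, dict):
            # Assumed key name; pass level_key=... if the stream uses another one.
            counts[str(record.get(level_key, "unknown"))] += 1
        else:
            counts["unparsed"] += 1
    return counts


# Illustrative sample lines, not captured from a real controller.
sample = [
    '{"level":"info","msg":"reconcile complete"}',
    '{"level":"error","msg":"token refresh failed"}',
    '{"level":"info","msg":"session opened"}',
    "panic: not json",
]
print(count_levels(sample))
```

To use it against a live stream, pipe `kubectl logs` output into a script that calls `count_levels(sys.stdin)`.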
Distributed Tracing (AGC)¶
The per-tenant AGC emits OpenTelemetry traces for its two hottest operational paths:
- RunnerGroup.Reconcile: one span per reconcile, attributed with runnergroup.namespace / runnergroup.name. Errors set the span status.
- Provisioner.provision: one span per acquired job (the job-to-pod path), with child spans stageJobSecret, countActivePods, createPod, and waitForCompletion. The root span carries runnergroup.*, plan.id, pod.name, active_pods, ceiling.held, priority_class, and the final pod.phase / pod.reason / duration_seconds. waitForCompletion is usually the long pole, so its child span tells you whether latency is in scheduling/runtime or in the controller.
Each reconcile and each job provision is its own root trace — there is no inbound trace context to continue, and the per-job spans run on the listener goroutines independently of the reconcile that started the pool.
Tracing is opt-in and off by default. With no OTLP endpoint configured the AGC installs no exporter and the spans are no-ops (near-zero cost), so production runs without tracing unless you point it at a collector.
Enabling tracing¶
The AGC reads the standard OpenTelemetry SDK environment variables — there is no bespoke flag. Tracing turns on as soon as an OTLP endpoint is configured:
| Variable | Effect |
|---|---|
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT | OTLP/gRPC collector address (e.g. otel-collector.observability:4317). Setting either one enables tracing. |
| OTEL_SDK_DISABLED=true | Hard kill switch: forces tracing off even when an endpoint is set. |
| OTEL_SERVICE_NAME / OTEL_RESOURCE_ATTRIBUTES | Override the default service.name (actions-gateway-agc) and add resource attributes. |
| OTEL_TRACES_SAMPLER, OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_TIMEOUT, … | All other knobs are the SDK's standard env vars. |
On shutdown the AGC flushes buffered spans (5 s budget) before exiting.
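The on/off rules above can be expressed as a small predicate, which is handy for checking a Deployment's env in CI. This is a sketch of the documented behaviour, not the AGC's actual code: the function name `tracing_enabled` is hypothetical, and treating `OTEL_SDK_DISABLED` as a case-insensitive `true` follows the OpenTelemetry convention for boolean variables rather than anything stated here.

```python
def tracing_enabled(env):
    """Return True when the documented rules would turn tracing on for this env mapping."""
    # The kill switch wins over any endpoint setting.
    if env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true":
        return False
    # Either the traces-specific or the generic OTLP endpoint enables tracing.
    endpoint = (
        env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
        or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    )
    return bool(endpoint)
```

Calling `tracing_enabled(os.environ)` inside a debug container gives a quick answer for the running process.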
Enabling tracing on GMC-managed AGCs¶
The GMC builds the AGC Deployment, so for a GMC-provisioned tenant you do not set these env vars by hand — you declare tracing on the ActionsGateway CR and the GMC translates spec.tracing into the standard OTEL_* env on the AGC Deployment:
apiVersion: actions-gateway.github.com/v1alpha1
kind: ActionsGateway
metadata:
  name: team-a
  namespace: team-a
spec:
  gitHubAppRef:
    name: team-a-github-app
  gitHubURL: https://github.com/team-a-org
  tracing:
    endpoint: https://otel-collector.observability:4317   # enables tracing
    sampler: parentbased_traceidratio                      # optional
    samplerArg: "0.1"                                      # optional: 10% of traces
    resourceAttributes:                                    # optional
      deployment.environment: prod
    # insecure: true   # only for a plaintext in-cluster collector; TLS is the default
| spec.tracing field | AGC env it sets | Notes |
|---|---|---|
| endpoint | OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Setting it is what enables tracing. Empty → no OTEL_* env, tracing stays off. |
| insecure | OTEL_EXPORTER_OTLP_TRACES_INSECURE | Defaults to false (TLS). Set true only for a plaintext in-cluster collector. |
| sampler | OTEL_TRACES_SAMPLER | One of always_on, always_off, traceidratio, parentbased_always_on, parentbased_always_off, parentbased_traceidratio (CRD-enforced enum). |
| samplerArg | OTEL_TRACES_SAMPLER_ARG | Ratio in [0,1] for the ratio-based samplers. |
| resourceAttributes | OTEL_RESOURCE_ATTRIBUTES | Rendered as a sorted key=value list. The AGC's own service.name/service.version take precedence. |
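To make the field-to-env mapping concrete, here is a minimal sketch of the translation in Python. It is not the GMC's implementation: `tracing_env` is a hypothetical name, emitting `OTEL_EXPORTER_OTLP_TRACES_INSECURE` only when `insecure` is true is an assumption, the comma separator follows the standard `OTEL_RESOURCE_ATTRIBUTES` format, and the service.name/service.version precedence is not modelled.

```python
def tracing_env(tracing):
    """Translate a spec.tracing mapping into OTEL_* env vars, following the mapping table."""
    endpoint = tracing.get("endpoint", "")
    if not endpoint:
        return {}  # empty endpoint: no OTEL_* env, tracing stays off
    env = {"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": endpoint}
    if tracing.get("insecure"):
        # Assumption: the variable is only emitted when insecure is true.
        env["OTEL_EXPORTER_OTLP_TRACES_INSECURE"] = "true"
    if tracing.get("sampler"):
        env["OTEL_TRACES_SAMPLER"] = tracing["sampler"]
    if tracing.get("samplerArg"):
        env["OTEL_TRACES_SAMPLER_ARG"] = str(tracing["samplerArg"])
    attrs = tracing.get("resourceAttributes") or {}
    if attrs:
        # Sorted by key so the rendered value is stable across reconciles.
        env["OTEL_RESOURCE_ATTRIBUTES"] = ",".join(
            f"{key}={value}" for key, value in sorted(attrs.items())
        )
    return env
```

Sorting the attributes matters for a controller: a stable rendering avoids spurious Deployment diffs when map iteration order changes.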
No auth headers via env. spec.tracing deliberately has no field for OTEL_EXPORTER_OTLP_HEADERS: those can carry bearer tokens, and this project keeps secrets out of environment variables (they leak into process listings and child processes). Authenticate the collector at the network layer instead: an in-cluster collector reached over the tenant's egress path, mutual TLS, or a service mesh.

Testing-only passthrough. The AGC_EXTRA_* mechanism (--allow-agc-extra-env on the GMC, then AGC_EXTRA_OTEL_EXPORTER_OTLP_ENDPOINT=… in the GMC pod env) still exists but is gated for tests only and is not for production use. When both are present, AGC_EXTRA_* wins (it is appended last). Prefer spec.tracing.
How to Access Metrics¶
Port forward (ad-hoc):
kubectl port-forward -n <namespace> deploy/actions-gateway-controller 8080:8080
curl http://localhost:8080/metrics
Prometheus operator (production):
Create a ServiceMonitor targeting the AGC and GMC services:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: actions-gateway
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: actions-gateway-controller
  endpoints:
    - port: metrics
      interval: 30s
The metrics port is named metrics in the Service spec.
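For ad-hoc checks it can help to read a single value out of the /metrics payload without a Prometheus server. The following is a simplified sketch of a parser for the Prometheus text exposition format: `parse_samples` is a hypothetical helper, it ignores timestamps and does not unescape label values, and the sample payload is illustrative rather than captured from a real controller.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) tuples from Prometheus text exposition output."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


# Illustrative payload only.
SAMPLE = """\
# HELP actions_gateway_active_sessions Currently open long-poll sessions.
# TYPE actions_gateway_active_sessions gauge
actions_gateway_active_sessions{namespace="team-a",runner_group="cpu-standard"} 1
actions_gateway_managed_gateways 3
"""

for name, labels, value in parse_samples(SAMPLE):
    print(name, labels, value)
```

With the port-forward above running, the same function can be fed the body returned by `urllib.request.urlopen("http://localhost:8080/metrics")`.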
Install-time scraping prerequisites (GMC manager)¶
The default GMC install ships with the manager NetworkPolicy enabled
(networkPolicy.enabled=true). Because the policy selects the manager pod, that pod
becomes default-deny on ingress, so its /metrics endpoint admits traffic only from
namespaces carrying the right label:
- Scraping the GMC manager metrics: label your Prometheus namespace
metrics: enabled, or no scrape will reach the manager.
The validating-webhook port (container 9443) is intentionally re-allowed from
any source — the kube-apiserver that calls it is not a pod in a labeled
namespace, so a source restriction there would silently break every
ActionsGateway admission (failurePolicy: Fail). The webhook is TLS +
caBundle authenticated, so the sensitive surface stays the metrics: enabled
restriction above. No namespace label is required for CR admission.
This applies to the GMC manager only. The per-tenant AGC and proxy metrics are governed by the per-tenant NetworkPolicies the GMC generates (the AGC NP already admits monitoring-namespace scrapes of the metrics port).
Runtime enforcement of these policies depends on the CNI; kindnet's
kube-network-policies does not drop all egress negatives (see the worker egress limitation in troubleshooting.md). The manager NP is verified by manifest review and is pending a Tier-A runtime check.
The ServiceMonitor integration, which provides out-of-the-box Prometheus
Operator scraping, stays opt-in behind the metrics.serviceMonitor.enabled chart
value (default false). It is left off by default because the
ServiceMonitor CRD only exists once the Prometheus Operator is installed, so
rendering it unconditionally would break helm install on clusters without it.
Full Metrics Reference¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| actions_gateway_active_sessions | Gauge | namespace, runner_group | Currently open long-poll sessions. One per RunnerGroup at steady state; rises toward maxListeners during bursts. |
| actions_gateway_jobs_acquired_total | Counter | namespace, runner_group | Jobs successfully acquired from the broker. |
| actions_gateway_job_acquisition_errors_total | Counter | namespace, reason | Acquisition failures. Reason values: already_claimed (benign race), delivery_window_expired (job redelivered), version_too_old, other. |
| actions_gateway_job_duration_seconds | Histogram | namespace, runner_group | Wall time from acquirejob success to worker pod terminal phase. |
| actions_gateway_pod_creation_latency_seconds | Histogram | namespace | Time from acquirejob to pod Scheduled event. Key SLO metric. |
| actions_gateway_token_refreshes_total | Counter | namespace | Successful GitHub App installation token refreshes. |
| actions_gateway_token_refresh_errors_total | Counter | namespace | Failed token refresh attempts. See SLO threshold below. |
| actions_gateway_renewjob_errors_total | Counter | namespace | Failed renewjob calls. Leading indicator for cancelled jobs. |
| actions_gateway_eviction_retries_total | Counter | namespace, runner_group | Jobs automatically re-queued after worker pod eviction. |
| actions_gateway_eviction_retries_exhausted_total | Counter | namespace, runner_group | Eviction retries exhausted; job requires manual re-run. |
| actions_gateway_worker_pods_reaped_total | Counter | namespace, runner_group, reason | Worker pods deleted by the lifecycle reaper. reason="completed_ttl" is routine cleanup after completedPodTTL; reason="pending_deadline" means a pod was stuck Pending past pendingPodDeadline and its job was cancelled. Each such reap also emits a WorkerPodStuckPending Warning Event on the RunnerGroup. |
| actions_gateway_message_poll_errors_total | Counter | namespace, reason | GetMessage errors (excludes empty polls and session expiry, which are normal). reason="rate_limited" is a 429; reason="timeout" is a black-holed long-poll the broker accepted but never answered, bounded by the client response-header deadline and retried (see Listener Stalls After a Black-Holed Broker Connection); reason="other" is any remaining transport/decode error. |
| actions_gateway_agent_recycles_total | Counter | namespace, runner_group, trigger | Single-use JIT agents re-registered. trigger="post_job" is routine (one per completed job); stale_session/startup mean a dead agent was detected and healed after the fact; reconcile_repair means a parked agent was repaired by the reconciler. |
| actions_gateway_agent_recycle_errors_total | Counter | namespace, runner_group | Failed agent re-registration attempts. Sustained growth shrinks listener capacity; see the runbook. |
| actions_gateway_reconcile_errors_total | Counter | controller, resource | GMC/AGC reconcile errors. Non-zero values deserve investigation. |
| actions_gateway_ip_range_updates_total | Counter | namespace | NetworkPolicy egress rule refreshes from GitHub meta API. |
| actions_gateway_managed_gateways | Gauge | (none) | Total ActionsGateway CRs currently managed by the GMC. |
Proxy metrics¶
The per-tenant egress proxy exposes its own metrics on its health/metrics
port (:8081, restricted by the L-8 NetworkPolicy — see
security.md L-8). Each proxy is a separate scrape
target; these metrics carry no intrinsic namespace label, so attach one
via the ServiceMonitor/scrape config if you need per-tenant attribution.
| Metric | Type | Labels | Description |
|---|---|---|---|
| actions_gateway_proxy_connections_active | Gauge | (none) | Currently open CONNECT tunnels. |
| actions_gateway_proxy_connections_total | Counter | (none) | Total CONNECT tunnels opened. |
| actions_gateway_proxy_dial_errors_total | Counter | (none) | Upstream dial failures (e.g. blocked-destination attempts). |
| actions_gateway_proxy_tunnel_duration_seconds | Histogram | (none) | Tunnel lifetime, observed at close. Buckets reach 21600s (the 6h absolute lifetime cap). |
For abuse/compromise detection built on these metrics (slowloris, eviction-retry loops, credential-harvesting), see security-operations.md.
Symptom → Metric Mapping¶
| Symptom | Metric(s) to check | Notes |
|---|---|---|
| Jobs are slow to start | pod_creation_latency_seconds p95/p99 | SLO: p95 ≤ 15s, p99 ≤ 60s |
| Jobs are randomly cancelled | renewjob_errors_total | Each sustained error risks a job cancellation |
| Jobs are not being acquired | active_sessions (should be ≥ 1 per RunnerGroup), job_acquisition_errors_total | Zero sessions = no polling |
| Jobs are queuing but not starting | active_sessions (OK) vs jobs_acquired_total not incrementing | Check RateLimited condition |
| Runner credentials are broken | token_refresh_errors_total | Spikes indicate Secret or GitHub App issue |
| Evictions causing re-runs | eviction_retries_total, eviction_retries_exhausted_total | Exhausted budget requires manual intervention |
| Throughput decaying job by job | agent_recycle_errors_total rising, active_sessions shrinking | Agent re-registration failing; see the runbook |
| Jobs cancelled without ever starting | worker_pods_reaped_total{reason="pending_deadline"} | Worker pod stuck Pending past the deadline. Fix the image/scheduling cause; see the runbook |
| Proxy autoscaling not working | HPA TARGETS showing <unknown> | requests.cpu not set on proxy pods |
| GMC/AGC reconcile broken | reconcile_errors_total | Non-zero sustained rate indicates operator issue |
Recommended Alert Rules¶
The following Prometheus alerting rules map to the SLO targets in Appendix A. Adjust thresholds to match your environment.
groups:
  - name: actions-gateway
    rules:
      # Page: no sessions means no job acquisition
      - alert: ActionsGatewayNoActiveSessions
        expr: |
          actions_gateway_active_sessions == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No active listener sessions for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "The AGC has no open long-poll sessions. Jobs queue indefinitely until sessions are restored."
      # Page: token refresh errors risk job failures within ~1 hour
      - alert: ActionsGatewayTokenRefreshErrors
        expr: |
          rate(actions_gateway_token_refresh_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GitHub App token refresh errors in {{ $labels.namespace }}"
          description: "Token refresh has been failing for 5+ minutes. Sessions will fail once the current token expires (~1 hour)."
      # Page: sustained renewjob failures will cancel running jobs
      - alert: ActionsGatewayRenewJobErrors
        expr: |
          rate(actions_gateway_renewjob_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RenewJob errors in {{ $labels.namespace }}"
          description: "RenewJob is failing at >0.1/s for 5+ minutes. Running jobs may be cancelled."
      # Page: p99 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod creation p99 latency SLO breach in {{ $labels.namespace }}"
          description: "p99 pod creation latency exceeds 60s SLO. Check quota and node capacity."
      # Ticket: p95 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod creation p95 latency degraded in {{ $labels.namespace }}"
          description: "p95 pod creation latency exceeds 15s SLO. Investigate quota and scheduling."
      # Ticket: eviction budget exhausted; manual re-run required
      - alert: ActionsGatewayEvictionRetriesExhausted
        expr: |
          increase(actions_gateway_eviction_retries_exhausted_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Eviction retry budget exhausted for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "A job's eviction retry budget has been exhausted. Manual re-run required."
      # Ticket: reconcile errors need investigation
      - alert: ActionsGatewayReconcileErrors
        expr: |
          rate(actions_gateway_reconcile_errors_total[5m]) > 0.033
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Reconcile errors in {{ $labels.controller }} for {{ $labels.resource }}"
          description: "Reconcile errors at >2/minute for 10+ minutes. Resources may be stale."
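The latency alerts depend on histogram_quantile, which estimates a quantile from cumulative bucket counts instead of reading it from raw samples. The Python sketch below follows the linear-interpolation approach Prometheus documents for classic histograms, in simplified form: it requires 0 < q ≤ 1, assumes the lowest bucket starts at 0, and skips Prometheus's other edge cases. The bucket bounds and counts are made-up illustration data, not the controller's actual buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile lies beyond the last finite bucket: report that bucket's bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts: 50 pods scheduled within 5s, 90 within 15s, 99 within 60s.
example = [(5, 50), (15, 90), (60, 99), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated within a bucket, its precision near the 15s and 60s thresholds depends on where the bucket boundaries fall.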
SLO Recording Rules¶
These recording rules pre-compute the metrics needed for burn-rate alerting against the SLO targets in Appendix A. Apply them alongside the alert rules above.
groups:
  - name: actions-gateway-slos
    interval: 30s
    rules:
      # Pod creation latency: p95 and p99 per namespace
      - record: actions_gateway:pod_creation_latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )
      - record: actions_gateway:pod_creation_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )
      # Job duration: p50 and p95 per namespace and runner_group
      - record: actions_gateway:job_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )
      - record: actions_gateway:job_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )
      # Token refresh errors over the last hour: compare against the <1/hr SLO
      - record: actions_gateway:token_refresh_errors:rate1h
        expr: |
          sum by (namespace) (
            increase(actions_gateway_token_refresh_errors_total[1h])
          )
      # Job acquisition success rate: fraction of acquisition attempts that succeed.
      # Aggregated by namespace only: job_acquisition_errors_total carries no
      # runner_group label, so a per-runner_group sum would never match the
      # jobs_acquired_total series and the rule would record nothing.
      - record: actions_gateway:job_acquisition_success_rate:rate5m
        expr: |
          sum by (namespace) (
            rate(actions_gateway_jobs_acquired_total[5m])
          )
          /
          (
            sum by (namespace) (
              rate(actions_gateway_jobs_acquired_total[5m])
            )
            +
            sum by (namespace) (
              rate(actions_gateway_job_acquisition_errors_total[5m])
            )
          )
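The arithmetic behind the last two rules is simple enough to check by hand. The sketch below restates it in Python with hypothetical helper names; the input numbers in the usage line are illustrative, and the `< 1` comparison mirrors the "<1/hr" SLO mentioned in the rule comment.

```python
def acquisition_success_rate(acquired_per_s, errors_per_s):
    """Fraction of acquisition attempts that succeeded, or None when there were no attempts."""
    attempts = acquired_per_s + errors_per_s
    if attempts == 0:
        return None  # PromQL would produce NaN for 0/0
    return acquired_per_s / attempts


def meets_token_refresh_slo(errors_last_hour):
    """True when the hourly error count stays under the <1/hr target."""
    return errors_last_hour < 1


# Illustrative rates: 0.9 acquisitions/s alongside 0.1 acquisition errors/s.
print(acquisition_success_rate(0.9, 0.1))
```

An idle tenant has no attempts at all, so dashboards built on the ratio should treat a missing value as "no data" rather than as a failure.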
Grafana Dashboard¶
The following panels cover the key health and performance signals. Use the recording rules above as data sources where applicable.
Suggested Panel Layout¶
Row 1 — Gateway Health (per namespace)
| Panel | Query | Visualization |
|---|---|---|
| Active sessions | actions_gateway_active_sessions | Stat / Time series |
| Jobs acquired/min | rate(actions_gateway_jobs_acquired_total[5m]) * 60 | Time series |
| Token refresh errors | rate(actions_gateway_token_refresh_errors_total[5m]) | Stat (threshold: >0 = red) |
| RenewJob errors | rate(actions_gateway_renewjob_errors_total[5m]) | Stat (threshold: >0 = yellow) |
Row 2 — Pod Creation Latency SLO
| Panel | Query | Visualization |
|---|---|---|
| p95 latency | actions_gateway:pod_creation_latency_seconds:p95 | Gauge (green <15s, yellow <60s, red ≥60s) |
| p99 latency | actions_gateway:pod_creation_latency_seconds:p99 | Gauge |
| Latency heatmap | rate(actions_gateway_pod_creation_latency_seconds_bucket[5m]) | Heatmap |
Row 3 — Job Throughput (per runner_group)
| Panel | Query | Visualization |
|---|---|---|
| Jobs acquired total | increase(actions_gateway_jobs_acquired_total[1h]) | Bar chart by runner_group |
| Job duration p50/p95 | actions_gateway:job_duration_seconds:p50 and actions_gateway:job_duration_seconds:p95 (one query each) | Time series |
| Eviction retries | increase(actions_gateway_eviction_retries_total[1h]) | Bar chart |
| Eviction budget exhausted | increase(actions_gateway_eviction_retries_exhausted_total[1h]) | Stat (threshold: >0 = red) |
Row 4 — Proxy and Quota
| Panel | Query | Visualization |
|---|---|---|
| Proxy replica count | kube_deployment_status_replicas_ready{deployment="actions-gateway-proxy"} | Time series |
| HPA desired vs. current | HPA metrics from kube_horizontalpodautoscaler_* | Time series |
| ResourceQuota usage | kube_resourcequota filtered by namespace | Bar gauge |
Row 5 — GMC Overview
| Panel | Query | Visualization |
|---|---|---|
| Managed gateways | actions_gateway_managed_gateways | Stat |
| Reconcile errors | rate(actions_gateway_reconcile_errors_total[5m]) | Time series by controller |
| IP range refreshes | increase(actions_gateway_ip_range_updates_total[24h]) | Stat |
Dashboard Variables¶
Add these template variables to make the dashboard multi-tenant:
- $namespace: label_values(actions_gateway_active_sessions, namespace). Allows filtering to a single tenant.
- $runner_group: label_values(actions_gateway_active_sessions{namespace="$namespace"}, runner_group). Allows filtering to a specific RunnerGroup.
Label Cardinality Warning¶
Metric labels are scoped to namespace and runner_group. To avoid label cardinality explosion:
- Do not use dynamically generated runner_group names (e.g. names incorporating PR numbers or commit SHAs). Each unique combination of namespace + runner_group creates a distinct time series; thousands of unique names will cause memory pressure in Prometheus.
- Stable, human-meaningful names like gpu-2x, cpu-standard, gpu-a100 are correct. These are configured in the ActionsGateway spec and should not change after initial setup.
- If you need per-workflow or per-repo attribution, use Prometheus recording rules or labels from job metadata, not RunnerGroup names.
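A rough back-of-the-envelope estimate shows why dynamic names hurt. The sketch below counts the series produced by a single histogram labelled by namespace and runner_group. The helper name is hypothetical, and the default of 11 finite buckets is an assumption (it matches the Prometheus client libraries' default bucket set); the real bucket count for these metrics may differ.

```python
def histogram_series(namespaces, runner_groups_per_namespace, finite_buckets=11):
    """Estimate time series for one histogram labelled by namespace and runner_group."""
    label_combinations = namespaces * runner_groups_per_namespace
    # Each combination exposes the finite buckets, the +Inf bucket, _sum and _count.
    series_per_combination = finite_buckets + 1 + 2
    return label_combinations * series_per_combination


# Stable names: 3 tenants with 4 runner groups each.
print(histogram_series(3, 4))
# Dynamic names: one tenant that minted 5000 per-PR runner groups.
print(histogram_series(1, 5000))
```

The same multiplication applies to every metric that carries the runner_group label, so the total grows with the number of such metrics as well.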