Observability

Audience: SRE, Platform engineer

Both the GMC and the AGC expose Prometheus metrics at :8080/metrics (no authentication required by default). They use the standard controller-runtime metrics server, so its built-in metrics (reconcile latency, work queue depth, etc.) are emitted automatically alongside the custom metrics below.

For SLO targets associated with these metrics, see Appendix A — Capacity Targets & SLOs.


Logging

All four components — the GMC, the per-tenant AGC, the egress proxy, and the worker wrapper — emit structured JSON logs at info level by default, one JSON shape per process stream, ready to ship to a log aggregator (Loki, Elasticsearch, CloudWatch, etc.) without reformatting. No flag needs to be set in production; the JSON default is what the GMC-provisioned Deployments run with.

The controllers (GMC, AGC) take controller-runtime's standard zap flags. For local development, pass --zap-devel to switch to human-readable console logs at debug level, or use the finer-grained --zap-encoder / --zap-log-level flags (run a controller with --help for the full set). Application code paths that log through the Go standard library's log/slog are bridged onto the same zap logger, so --zap-log-level governs every line a controller emits — not just the manager's own — and the whole process shares one JSON schema.

The egress proxy and the worker wrapper are not controllers; they read their level from the LOG_LEVEL environment variable (info | debug, default info).


Distributed Tracing (AGC)

The per-tenant AGC emits OpenTelemetry traces for its two hottest operational paths:

  • RunnerGroup.Reconcile — one span per reconcile, attributed with runnergroup.namespace / runnergroup.name. Errors set the span status.
  • Provisioner.provision — one span per acquired job (the job-to-pod path), with child spans stageJobSecret, countActivePods, createPod, and waitForCompletion. The root span carries runnergroup.*, plan.id, pod.name, active_pods, ceiling.held, priority_class, and the final pod.phase / pod.reason / duration_seconds. waitForCompletion is usually the long pole, so its child span tells you whether latency is in scheduling/runtime versus the controller.

Each reconcile and each job provision is its own root trace — there is no inbound trace context to continue, and the per-job spans run on the listener goroutines independently of the reconcile that started the pool.

Tracing is opt-in and off by default. With no OTLP endpoint configured the AGC installs no exporter and the spans are no-ops (near-zero cost), so production runs without tracing unless you point it at a collector.

Enabling tracing

The AGC reads the standard OpenTelemetry SDK environment variables — there is no bespoke flag. Tracing turns on as soon as an OTLP endpoint is configured:

| Variable | Effect |
| --- | --- |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT or OTEL_EXPORTER_OTLP_ENDPOINT | OTLP/gRPC collector address, given as a URL with a scheme (e.g. https://otel-collector.observability:4317). Setting either one enables tracing. |
| OTEL_SDK_DISABLED=true | Hard kill switch: forces tracing off even when an endpoint is set. |
| OTEL_SERVICE_NAME / OTEL_RESOURCE_ATTRIBUTES | Override the default service.name (actions-gateway-agc) and add resource attributes. |
| OTEL_TRACES_SAMPLER, OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_TIMEOUT, … | All other knobs are the SDK's standard env vars. |

On shutdown the AGC flushes buffered spans (5 s budget) before exiting.
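
For an AGC that you deploy yourself rather than through the GMC, the variables above go on the AGC container. The fragment below is a minimal sketch, assuming a collector Service named otel-collector in an observability namespace that serves TLS on port 4317. The container name agc is a placeholder; only the variable names come from this page and the OpenTelemetry SDK.

```yaml
# Fragment of a Deployment pod template (spec.template.spec); not a complete manifest.
containers:
  - name: agc   # placeholder container name
    env:
      # Setting an endpoint is what turns tracing on.
      - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
        value: "https://otel-collector.observability:4317"   # assumed collector address
      # Optional: sample roughly 10% of root traces.
      - name: OTEL_TRACES_SAMPLER
        value: "parentbased_traceidratio"
      - name: OTEL_TRACES_SAMPLER_ARG
        value: "0.1"
      # Uncomment to force tracing off without removing the endpoint.
      # - name: OTEL_SDK_DISABLED
      #   value: "true"
```

Removing the endpoint variable, or setting OTEL_SDK_DISABLED=true, returns the AGC to no-op spans.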

Enabling tracing on GMC-managed AGCs

The GMC builds the AGC Deployment, so for a GMC-provisioned tenant you do not set these env vars by hand — you declare tracing on the ActionsGateway CR and the GMC translates spec.tracing into the standard OTEL_* env on the AGC Deployment:

apiVersion: actions-gateway.github.com/v1alpha1
kind: ActionsGateway
metadata:
  name: team-a
  namespace: team-a
spec:
  gitHubAppRef:
    name: team-a-github-app
  gitHubURL: https://github.com/team-a-org
  tracing:
    endpoint: https://otel-collector.observability:4317  # enables tracing
    sampler: parentbased_traceidratio                    # optional
    samplerArg: "0.1"                                     # optional — 10% of traces
    resourceAttributes:                                   # optional
      deployment.environment: prod
    # insecure: true   # only for a plaintext in-cluster collector; TLS is the default

| spec.tracing field | AGC env it sets | Notes |
| --- | --- | --- |
| endpoint | OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Setting it is what enables tracing. Empty → no OTEL_* env, tracing stays off. |
| insecure | OTEL_EXPORTER_OTLP_TRACES_INSECURE | Defaults to false (TLS). Set true only for a plaintext in-cluster collector. |
| sampler | OTEL_TRACES_SAMPLER | One of always_on, always_off, traceidratio, parentbased_always_on, parentbased_always_off, parentbased_traceidratio (CRD-enforced enum). |
| samplerArg | OTEL_TRACES_SAMPLER_ARG | Ratio in [0,1] for the ratio-based samplers. |
| resourceAttributes | OTEL_RESOURCE_ATTRIBUTES | Rendered as a sorted key=value list. The AGC's own service.name/service.version take precedence. |

No auth headers via env. spec.tracing deliberately has no field for OTEL_EXPORTER_OTLP_HEADERS: those can carry bearer tokens, and this project keeps secrets out of environment variables (they leak into process listings and child processes). Authenticate the collector at the network layer instead — an in-cluster collector reached over the tenant's egress path, mutual TLS, or a service mesh.

Testing-only passthrough. The AGC_EXTRA_* mechanism (--allow-agc-extra-env on the GMC, then AGC_EXTRA_OTEL_EXPORTER_OTLP_ENDPOINT=… in the GMC pod env) still exists but is gated for tests only and not for production use. When both are present, AGC_EXTRA_* wins (it is appended last). Prefer spec.tracing.


How to Access Metrics

Port forward (ad-hoc):

kubectl port-forward -n <namespace> deploy/actions-gateway-controller 8080:8080
curl http://localhost:8080/metrics

Prometheus operator (production):

Create a ServiceMonitor targeting the AGC and GMC services:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: actions-gateway
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: actions-gateway-controller
  endpoints:
    - port: metrics
      interval: 30s

The metrics port is named metrics in the Service spec.

Install-time scraping prerequisites (GMC manager)

The default GMC install ships with the manager NetworkPolicy enabled (networkPolicy.enabled=true). Because the policy selects the manager pod, that pod becomes default-deny on ingress, and its /metrics endpoint admits traffic only from namespaces carrying the right label:

  • Scraping the GMC manager metrics: label your Prometheus namespace metrics: enabled, or no scrape will reach the manager:
    kubectl label namespace <prometheus-namespace> metrics=enabled
    

The validating-webhook port (container 9443) is intentionally re-allowed from any source — the kube-apiserver that calls it is not a pod in a labeled namespace, so a source restriction there would silently break every ActionsGateway admission (failurePolicy: Fail). The webhook is TLS + caBundle authenticated, so the sensitive surface stays the metrics: enabled restriction above. No namespace label is required for CR admission.
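
The ingress behaviour described above can be pictured as a NetworkPolicy of roughly the following shape. This is an illustrative sketch, not the manifest the chart renders: the policy name, the namespace placeholder, and the app: gmc-manager pod label are assumptions, while the ports and the metrics: enabled namespace label come from this page.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gmc-manager              # hypothetical name
  namespace: <gmc-namespace>
spec:
  podSelector:
    matchLabels:
      app: gmc-manager           # hypothetical label; check the chart's actual selector
  policyTypes:
    - Ingress                    # selecting the pod makes ingress default-deny
  ingress:
    # Metrics: only from namespaces labelled metrics=enabled.
    - from:
        - namespaceSelector:
            matchLabels:
              metrics: enabled
      ports:
        - protocol: TCP
          port: 8080
    # Webhook: no "from" clause, so any source is allowed.
    # The kube-apiserver is not a pod in a labelled namespace.
    - ports:
        - protocol: TCP
          port: 9443
```

Comparing the rendered policy (for example with kubectl get networkpolicy -o yaml) against this shape is a quick way to confirm which rule a blocked scrape is hitting.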

This applies to the GMC manager only. The per-tenant AGC and proxy metrics are governed by the per-tenant NetworkPolicies the GMC generates (the AGC NP already admits monitoring-namespace scrapes of the metrics port).

Runtime enforcement of these policies depends on the CNI; kindnet's kube-network-policies does not drop all of the egress traffic that the policies are meant to deny (see the worker egress limitation in troubleshooting.md). The manager NP has been verified by manifest review only and is still pending a Tier-A runtime check.

Out-of-the-box Prometheus Operator scraping through a ServiceMonitor stays opt-in, behind the metrics.serviceMonitor.enabled chart value (default false). It is off by default because the ServiceMonitor CRD only exists once the Prometheus Operator is installed, so rendering it unconditionally would break helm install on clusters without it.
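
As a sketch of the install-time settings discussed in this section, a values override might look like the following. It assumes the Prometheus Operator is already installed and that the Prometheus namespace carries the metrics=enabled label; the two keys are the chart values named above, and everything else about the chart is left at its defaults.

```yaml
# values override for the GMC chart (file name is up to you)
networkPolicy:
  enabled: true          # the default; manager ingress is default-deny
metrics:
  serviceMonitor:
    enabled: true        # default false; needs the ServiceMonitor CRD to exist
```

On a cluster without the Prometheus Operator, leave metrics.serviceMonitor.enabled at false so that helm install does not fail on the missing CRD.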


Full Metrics Reference

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| actions_gateway_active_sessions | Gauge | namespace, runner_group | Currently open long-poll sessions. One per RunnerGroup at steady state; rises toward maxListeners during bursts. |
| actions_gateway_jobs_acquired_total | Counter | namespace, runner_group | Jobs successfully acquired from the broker. |
| actions_gateway_job_acquisition_errors_total | Counter | namespace, reason | Acquisition failures. Reason values: already_claimed (benign race), delivery_window_expired (job redelivered), version_too_old, other. |
| actions_gateway_job_duration_seconds | Histogram | namespace, runner_group | Wall time from acquirejob success to worker pod terminal phase. |
| actions_gateway_pod_creation_latency_seconds | Histogram | namespace | Time from acquirejob to pod Scheduled event. Key SLO metric. |
| actions_gateway_token_refreshes_total | Counter | namespace | Successful GitHub App installation token refreshes. |
| actions_gateway_token_refresh_errors_total | Counter | namespace | Failed token refresh attempts. See SLO threshold below. |
| actions_gateway_renewjob_errors_total | Counter | namespace | Failed renewjob calls. Leading indicator for cancelled jobs. |
| actions_gateway_eviction_retries_total | Counter | namespace, runner_group | Jobs automatically re-queued after worker pod eviction. |
| actions_gateway_eviction_retries_exhausted_total | Counter | namespace, runner_group | Eviction retries exhausted; job requires manual re-run. |
| actions_gateway_worker_pods_reaped_total | Counter | namespace, runner_group, reason | Worker pods deleted by the lifecycle reaper. reason="completed_ttl" is routine cleanup after completedPodTTL; reason="pending_deadline" means a pod was stuck Pending past pendingPodDeadline and its job was cancelled. Each such reap also emits a WorkerPodStuckPending Warning Event on the RunnerGroup. |
| actions_gateway_message_poll_errors_total | Counter | namespace, reason | GetMessage errors (excludes empty polls and session expiry, which are normal). reason="rate_limited" is a 429; reason="timeout" is a black-holed long-poll the broker accepted but never answered, bounded by the client response-header deadline and retried (see Listener Stalls After a Black-Holed Broker Connection); reason="other" is any remaining transport/decode error. |
| actions_gateway_agent_recycles_total | Counter | namespace, runner_group, trigger | Single-use JIT agents re-registered. trigger="post_job" is routine (one per completed job); stale_session/startup mean a dead agent was detected and healed after the fact; reconcile_repair means a parked agent was repaired by the reconciler. |
| actions_gateway_agent_recycle_errors_total | Counter | namespace, runner_group | Failed agent re-registration attempts. Sustained growth shrinks listener capacity; see the runbook. |
| actions_gateway_reconcile_errors_total | Counter | controller, resource | GMC/AGC reconcile errors. Non-zero values deserve investigation. |
| actions_gateway_ip_range_updates_total | Counter | namespace | NetworkPolicy egress rule refreshes from GitHub meta API. |
| actions_gateway_managed_gateways | Gauge | (none) | Total ActionsGateway CRs currently managed by the GMC. |

Proxy metrics

The per-tenant egress proxy exposes its own metrics on its health/metrics port (:8081, restricted by the L-8 NetworkPolicy — see security.md L-8). Each proxy is a separate scrape target; these metrics carry no intrinsic namespace label, so attach one via the ServiceMonitor/scrape config if you need per-tenant attribution.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| actions_gateway_proxy_connections_active | Gauge | (none) | Currently open CONNECT tunnels. |
| actions_gateway_proxy_connections_total | Counter | (none) | Total CONNECT tunnels opened. |
| actions_gateway_proxy_dial_errors_total | Counter | (none) | Upstream dial failures (e.g. blocked-destination attempts). |
| actions_gateway_proxy_tunnel_duration_seconds | Histogram | (none) | Tunnel lifetime, observed at close. Buckets reach 21600s (the 6h absolute lifetime cap). |
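
Because the proxy metrics carry no namespace label of their own, per-tenant attribution has to be added at scrape time. One way to do that with a plain Prometheus scrape config is sketched below. It assumes the proxy pods carry an app: actions-gateway-proxy label (a guess based on the Deployment name used elsewhere on this page) and that the scraping Prometheus is admitted by the L-8 NetworkPolicy; the job name is arbitrary.

```yaml
scrape_configs:
  - job_name: actions-gateway-proxy          # arbitrary job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the proxy pods (label value is an assumption).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: actions-gateway-proxy
        action: keep
      # Keep only the health/metrics port.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8081"
        action: keep
      # Attach the tenant namespace to every scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

With the namespace label attached, queries such as sum by (namespace) (actions_gateway_proxy_connections_active) give a per-tenant view.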

For abuse/compromise detection built on these metrics (slowloris, eviction-retry loops, credential-harvesting), see security-operations.md.


Symptom → Metric Mapping

| Symptom | Metric(s) to check | Notes |
| --- | --- | --- |
| Jobs are slow to start | pod_creation_latency_seconds p95/p99 | SLO: p95 ≤ 15s, p99 ≤ 60s |
| Jobs are randomly cancelled | renewjob_errors_total | Each sustained error risks a job cancellation |
| Jobs are not being acquired | active_sessions (should be ≥ 1 per RunnerGroup), job_acquisition_errors_total | Zero sessions = no polling |
| Jobs are queuing but not starting | active_sessions (OK) vs jobs_acquired_total not incrementing | Check RateLimited condition |
| Runner credentials are broken | token_refresh_errors_total | Spikes indicate Secret or GitHub App issue |
| Evictions causing re-runs | eviction_retries_total, eviction_retries_exhausted_total | Exhausted budget requires manual intervention |
| Throughput decaying job by job | agent_recycle_errors_total rising, active_sessions shrinking | Agent re-registration failing; see the runbook |
| Jobs cancelled without ever starting | worker_pods_reaped_total{reason="pending_deadline"} | Worker pod stuck Pending past the deadline. Fix the image/scheduling cause; see the runbook |
| Proxy autoscaling not working | HPA TARGETS showing <unknown> | requests.cpu not set on proxy pods |
| GMC/AGC reconcile broken | reconcile_errors_total | Non-zero sustained rate indicates operator issue |

The following Prometheus alerting rules map to the SLO targets in Appendix A. Adjust thresholds to match your environment.

groups:
  - name: actions-gateway
    rules:

      # Page: no sessions means no job acquisition
      - alert: ActionsGatewayNoActiveSessions
        expr: |
          actions_gateway_active_sessions == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No active listener sessions for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "The AGC has no open long-poll sessions. Jobs queue indefinitely until sessions are restored."

      # Page: token refresh errors risk job failures within ~1 hour
      - alert: ActionsGatewayTokenRefreshErrors
        expr: |
          rate(actions_gateway_token_refresh_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GitHub App token refresh errors in {{ $labels.namespace }}"
          description: "Token refresh has been failing for 5+ minutes. Sessions will fail once the current token expires (~1 hour)."

      # Page: sustained renewjob failures will cancel running jobs
      - alert: ActionsGatewayRenewJobErrors
        expr: |
          rate(actions_gateway_renewjob_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RenewJob errors in {{ $labels.namespace }}"
          description: "RenewJob is failing at >0.1/s for 5+ minutes. Running jobs may be cancelled."

      # Page: p99 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod creation p99 latency SLO breach in {{ $labels.namespace }}"
          description: "p99 pod creation latency exceeds 60s SLO. Check quota and node capacity."

      # Ticket: p95 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod creation p95 latency degraded in {{ $labels.namespace }}"
          description: "p95 pod creation latency exceeds 15s SLO. Investigate quota and scheduling."

      # Ticket: eviction budget exhausted — manual re-run required
      - alert: ActionsGatewayEvictionRetriesExhausted
        expr: |
          increase(actions_gateway_eviction_retries_exhausted_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Eviction retry budget exhausted for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "A job's eviction retry budget has been exhausted. Manual re-run required."

      # Ticket: reconcile errors need investigation
      - alert: ActionsGatewayReconcileErrors
        expr: |
          rate(actions_gateway_reconcile_errors_total[5m]) > 0.033
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Reconcile errors in {{ $labels.controller }} for {{ $labels.resource }}"
          description: "Reconcile errors at >2/minute for 10+ minutes. Resources may be stale."
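
The symptom table also lists two signals that the rules above do not cover: pods reaped at the pending deadline and failing agent re-registration. The sketch below shows how they might be alerted on. The group name, alert names, windows, and severities are assumptions to tune for your environment; only the metric and label names come from the metrics reference.

```yaml
groups:
  - name: actions-gateway-extra              # hypothetical group name
    rules:

      # Ticket: a worker pod was reaped while stuck Pending, so its job was cancelled
      - alert: ActionsGatewayWorkerPodStuckPending     # hypothetical alert name
        expr: |
          increase(actions_gateway_worker_pods_reaped_total{reason="pending_deadline"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Worker pods stuck Pending for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "At least one worker pod was reaped at the pending deadline. Check image pulls and scheduling."

      # Ticket: agent re-registration keeps failing, so listener capacity shrinks
      - alert: ActionsGatewayAgentRecycleErrors        # hypothetical alert name
        expr: |
          rate(actions_gateway_agent_recycle_errors_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent recycle errors for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "Agent re-registration has been failing for 15+ minutes. See the runbook."
```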

SLO Recording Rules

These recording rules pre-compute the metrics needed for burn-rate alerting against the SLO targets in Appendix A. Apply them alongside the alert rules above.

groups:
  - name: actions-gateway-slos
    interval: 30s
    rules:

      # Pod creation latency — p95 and p99 per namespace
      - record: actions_gateway:pod_creation_latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )

      - record: actions_gateway:pod_creation_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )

      # Job duration: p50 and p95 per namespace and runner_group
      - record: actions_gateway:job_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )

      - record: actions_gateway:job_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )

      # Token refresh error rate (hourly) — compare against the <1/hr SLO
      - record: actions_gateway:token_refresh_errors:rate1h
        expr: |
          sum by (namespace) (
            increase(actions_gateway_token_refresh_errors_total[1h])
          )

      # Job acquisition success rate: fraction of acquisition attempts that succeed.
      # Aggregated per namespace only: job_acquisition_errors_total has no
      # runner_group label, so a per-group sum would never match the acquired series.
      - record: actions_gateway:job_acquisition_success_rate:rate5m
        expr: |
          sum by (namespace) (
            rate(actions_gateway_jobs_acquired_total[5m])
          )
          /
          (
            sum by (namespace) (
              rate(actions_gateway_jobs_acquired_total[5m])
            )
            +
            sum by (namespace) (
              rate(actions_gateway_job_acquisition_errors_total[5m])
            )
          )
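
Once the recording rules are loaded, alerts can reference the recorded series directly. The rule below is a sketch of that pattern for the token refresh SLO (fewer than 1 error per hour). The group name, alert name, for: window, and severity are assumptions.

```yaml
groups:
  - name: actions-gateway-slo-alerts         # hypothetical group name
    rules:

      # Ticket: hourly token refresh errors have reached the SLO limit
      - alert: ActionsGatewayTokenRefreshSLOBreach     # hypothetical alert name
        expr: |
          actions_gateway:token_refresh_errors:rate1h >= 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Token refresh errors at or above 1/hour in {{ $labels.namespace }}"
          description: "The recorded hourly count of token refresh errors has reached the SLO limit of fewer than 1 per hour."
```

Alerting on the recorded series keeps the alert expression short and guarantees that dashboards and alerts evaluate the same query.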

Grafana Dashboard

The following panels cover the key health and performance signals. Use the recording rules above as data sources where applicable.

Suggested Panel Layout

Row 1 — Gateway Health (per namespace)

| Panel | Query | Visualization |
| --- | --- | --- |
| Active sessions | actions_gateway_active_sessions | Stat / Time series |
| Jobs acquired/min | rate(actions_gateway_jobs_acquired_total[5m]) * 60 | Time series |
| Token refresh errors | rate(actions_gateway_token_refresh_errors_total[5m]) | Stat (threshold: >0 = red) |
| RenewJob errors | rate(actions_gateway_renewjob_errors_total[5m]) | Stat (threshold: >0 = yellow) |

Row 2 — Pod Creation Latency SLO

| Panel | Query | Visualization |
| --- | --- | --- |
| p95 latency | actions_gateway:pod_creation_latency_seconds:p95 | Gauge (green <15s, yellow <60s, red >60s) |
| p99 latency | actions_gateway:pod_creation_latency_seconds:p99 | Gauge |
| Latency heatmap | rate(actions_gateway_pod_creation_latency_seconds_bucket[5m]) | Heatmap |

Row 3 — Job Throughput (per runner_group)

| Panel | Query | Visualization |
| --- | --- | --- |
| Jobs acquired total | increase(actions_gateway_jobs_acquired_total[1h]) | Bar chart by runner_group |
| Job duration p50/p95 | actions_gateway:job_duration_seconds:p50/p95 | Time series |
| Eviction retries | increase(actions_gateway_eviction_retries_total[1h]) | Bar chart |
| Eviction budget exhausted | increase(actions_gateway_eviction_retries_exhausted_total[1h]) | Stat (threshold: >0 = red) |

Row 4 — Proxy and Quota

| Panel | Query | Visualization |
| --- | --- | --- |
| Proxy replica count | kube_deployment_status_replicas_ready{deployment="actions-gateway-proxy"} | Time series |
| HPA desired vs. current | HPA metrics from kube_horizontalpodautoscaler_* | Time series |
| ResourceQuota usage | kube_resourcequota filtered by namespace | Bar gauge |

Row 5 — GMC Overview

| Panel | Query | Visualization |
| --- | --- | --- |
| Managed gateways | actions_gateway_managed_gateways | Stat |
| Reconcile errors | rate(actions_gateway_reconcile_errors_total[5m]) | Time series by controller |
| IP range refreshes | increase(actions_gateway_ip_range_updates_total[24h]) | Stat |

Dashboard Variables

Add these template variables to make the dashboard multi-tenant:

  • $namespace: label_values(actions_gateway_active_sessions, namespace). Allows filtering to a single tenant.
  • $runner_group: label_values(actions_gateway_active_sessions{namespace="$namespace"}, runner_group). Allows filtering to a specific RunnerGroup.

Label Cardinality Warning

Metric labels are scoped to namespace and runner_group. To avoid label cardinality explosion:

  • Do not use dynamically generated runner_group names (e.g. names incorporating PR numbers or commit SHAs). Each unique combination of namespace + runner_group creates a distinct time series; thousands of unique names will cause memory pressure in Prometheus.
  • Stable, human-meaningful names like gpu-2x, cpu-standard, gpu-a100 are correct. These are configured in the ActionsGateway spec and should not change after initial setup.
  • If you need per-workflow or per-repo attribution, use Prometheus recording rules or labels from job metadata, not from RunnerGroup names.
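
To watch for the cardinality growth described above, the number of distinct runner_group values per namespace can be recorded and tracked over time. The rule below is a sketch: the group name and recorded series name are hypothetical, and it counts only RunnerGroups that currently expose actions_gateway_active_sessions.

```yaml
groups:
  - name: actions-gateway-cardinality        # hypothetical group name
    rules:

      # Distinct runner_group values per namespace
      - record: actions_gateway:runner_groups:count    # hypothetical series name
        expr: |
          count by (namespace) (
            count by (namespace, runner_group) (actions_gateway_active_sessions)
          )
```

A steadily climbing value for one namespace is a hint that RunnerGroup names are being generated dynamically.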