Observability¶

Audience: SRE, Platform engineer

Both the GMC and AGC expose Prometheus metrics at :8080/metrics (no authentication required by default). The standard controller-runtime metrics server is used; additional built-in metrics (reconcile latency, work queue depth, etc.) are emitted automatically alongside the custom metrics below.

For SLO targets associated with these metrics, see Appendix A — Capacity Targets & SLOs.

Table of Contents¶

Logging
Distributed Tracing (AGC)
Enabling tracing
Enabling tracing on GMC-managed AGCs
How to Access Metrics
Install-time scraping prerequisites (GMC manager)
Full Metrics Reference
Proxy metrics
Symptom → Metric Mapping
Recommended Alert Rules
SLO Recording Rules
Grafana Dashboard
Suggested Panel Layout
Dashboard Variables
Label Cardinality Warning

Logging¶

All four components — the GMC, the per-tenant AGC, the egress proxy, and the worker wrapper — emit structured JSON logs at info level by default, one JSON shape per process stream, ready to ship to a log aggregator (Loki, Elasticsearch, CloudWatch, etc.) without reformatting. No flag needs to be set in production; the JSON default is what the GMC-provisioned Deployments run with.

The controllers (GMC, AGC) take controller-runtime's standard zap flags. For local development, pass --zap-devel to switch to human-readable console logs at debug level, or use the finer-grained --zap-encoder / --zap-log-level flags (run a controller with --help for the full set). Application code paths that log through the Go standard library's log/slog are bridged onto the same zap logger, so --zap-log-level governs every line a controller emits — not just the manager's own — and the whole process shares one JSON schema.

The egress proxy and the worker wrapper are not controllers; they read their level from the LOG_LEVEL environment variable (info | debug, default info).
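
As an illustration of the two mechanisms above, the fragments below show how a debugging run might be configured. This is a sketch: it assumes you are running the containers from your own manifest (for example in local development), and the comments describing each container are illustrative rather than taken from the shipped Deployments. The flags and the `LOG_LEVEL` variable are the ones described above.

```yaml
# Controller container (GMC or AGC): controller-runtime zap flags.
args:
  - --zap-encoder=console     # human-readable output instead of JSON
  - --zap-log-level=debug     # also governs the bridged log/slog lines
---
# Egress proxy or worker wrapper container: not a controller, so no zap flags.
env:
  - name: LOG_LEVEL
    value: debug              # info | debug, default info
```

Passing `--zap-devel` alone is the shorter route for controllers, since it selects console output and debug level together.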

Distributed Tracing (AGC)¶

The per-tenant AGC emits OpenTelemetry traces for its two hottest operational paths:

RunnerGroup.Reconcile — one span per reconcile, attributed with runnergroup.namespace / runnergroup.name. Errors set the span status.
Provisioner.provision — one span per acquired job (the job-to-pod path), with child spans stageJobSecret, countActivePods, createPod, and waitForCompletion. The root span carries runnergroup.*, plan.id, pod.name, active_pods, ceiling.held, priority_class, and the final pod.phase / pod.reason / duration_seconds. waitForCompletion is usually the long pole, so its child span tells you whether latency is in scheduling/runtime versus the controller.

Each reconcile and each job provision is its own root trace — there is no inbound trace context to continue, and the per-job spans run on the listener goroutines independently of the reconcile that started the pool.

Tracing is opt-in and off by default. With no OTLP endpoint configured the AGC installs no exporter and the spans are no-ops (near-zero cost), so production runs without tracing unless you point it at a collector.

Enabling tracing¶

The AGC reads the standard OpenTelemetry SDK environment variables — there is no bespoke flag. Tracing turns on as soon as an OTLP endpoint is configured:

Variable	Effect
`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` or `OTEL_EXPORTER_OTLP_ENDPOINT`	OTLP/gRPC collector address (e.g. `otel-collector.observability:4317`). Setting either one enables tracing.
`OTEL_SDK_DISABLED=true`	Hard kill switch — forces tracing off even when an endpoint is set.
`OTEL_SERVICE_NAME` / `OTEL_RESOURCE_ATTRIBUTES`	Override the default `service.name` (`actions-gateway-agc`) and add resource attributes.
`OTEL_TRACES_SAMPLER`, `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_EXPORTER_OTLP_TIMEOUT`, …	All other knobs are the SDK's standard env vars.

On shutdown the AGC flushes buffered spans (5 s budget) before exiting.
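
For an AGC that you run yourself rather than through the GMC, the variables in the table could be set as in the following sketch. The collector address is borrowed from the examples in this page, and the `OTEL_SERVICE_NAME` override value is hypothetical.

```yaml
# Sketch: env on a hand-managed AGC container.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://otel-collector.observability:4317   # setting this enables tracing
  - name: OTEL_SERVICE_NAME
    value: actions-gateway-agc-staging                  # hypothetical; default is actions-gateway-agc
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                                        # sample roughly 10% of traces
  # Uncomment to force tracing off even though an endpoint is set:
  # - name: OTEL_SDK_DISABLED
  #   value: "true"
```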

Enabling tracing on GMC-managed AGCs¶

The GMC builds the AGC Deployment, so for a GMC-provisioned tenant you do not set these env vars by hand — you declare tracing on the ActionsGateway CR and the GMC translates spec.tracing into the standard OTEL_* env on the AGC Deployment:

apiVersion: actions-gateway.github.com/v1alpha1
kind: ActionsGateway
metadata:
  name: team-a
  namespace: team-a
spec:
  gitHubAppRef:
    name: team-a-github-app
  gitHubURL: https://github.com/team-a-org
  tracing:
    endpoint: https://otel-collector.observability:4317  # enables tracing
    sampler: parentbased_traceidratio                    # optional
    samplerArg: "0.1"                                     # optional — 10% of traces
    resourceAttributes:                                   # optional
      deployment.environment: prod
    # insecure: true   # only for a plaintext in-cluster collector; TLS is the default

`spec.tracing` field	AGC env it sets	Notes
`endpoint`	`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`	Setting it is what enables tracing. Empty → no `OTEL_*` env, tracing stays off.
`insecure`	`OTEL_EXPORTER_OTLP_TRACES_INSECURE`	Defaults to `false` (TLS). Set `true` only for a plaintext in-cluster collector.
`sampler`	`OTEL_TRACES_SAMPLER`	One of `always_on`, `always_off`, `traceidratio`, `parentbased_always_on`, `parentbased_always_off`, `parentbased_traceidratio` (CRD-enforced enum).
`samplerArg`	`OTEL_TRACES_SAMPLER_ARG`	Ratio in `[0,1]` for the ratio-based samplers.
`resourceAttributes`	`OTEL_RESOURCE_ATTRIBUTES`	Rendered as a sorted `key=value` list. The AGC's own `service.name`/`service.version` take precedence.
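
Applying the mapping table to the `team-a` example gives roughly the following env on the AGC container. This is a sketch derived from the table, not captured output: the ordering of entries, and whether `OTEL_EXPORTER_OTLP_TRACES_INSECURE` is rendered when `insecure` is left at its default, are not confirmed here.

```yaml
# Expected OTEL_* env for the team-a ActionsGateway (derived from the table).
env:
  - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
    value: https://otel-collector.observability:4317   # from spec.tracing.endpoint
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio                     # from spec.tracing.sampler
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"                                        # from spec.tracing.samplerArg
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: deployment.environment=prod                  # sorted key=value list
```

Inspecting the env of the generated AGC Deployment is the reliable way to confirm what the GMC actually rendered.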

No auth headers via env. spec.tracing deliberately has no field for OTEL_EXPORTER_OTLP_HEADERS: those can carry bearer tokens, and this project keeps secrets out of environment variables (they leak into process listings and child processes). Authenticate the collector at the network layer instead — an in-cluster collector reached over the tenant's egress path, mutual TLS, or a service mesh.

Testing-only passthrough. The AGC_EXTRA_* mechanism (--allow-agc-extra-env on the GMC, then AGC_EXTRA_OTEL_EXPORTER_OTLP_ENDPOINT=… in the GMC pod env) still exists but is gated for tests only and not for production use. When both are present, AGC_EXTRA_* wins (it is appended last). Prefer spec.tracing.

How to Access Metrics¶

Port forward (ad-hoc):

kubectl port-forward -n <namespace> deploy/actions-gateway-controller 8080:8080
curl http://localhost:8080/metrics

Prometheus operator (production):

Create a ServiceMonitor targeting the AGC and GMC services:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: actions-gateway
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: actions-gateway-controller
  endpoints:
    - port: metrics
      interval: 30s

The metrics port is named metrics in the Service spec.
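
For the ServiceMonitor above to find a target, the Service needs a label matching `spec.selector` and a port named `metrics`. The sketch below shows the shape such a Service would have. The Service name and the pod selector label are assumptions; the Service created by your install may differ, so treat this as a checklist rather than a manifest to apply.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: actions-gateway-controller      # assumed name
  namespace: <namespace>
  labels:
    app: actions-gateway-controller     # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: actions-gateway-controller     # assumed pod label
  ports:
    - name: metrics                     # referenced by endpoints[].port
      port: 8080
      targetPort: 8080
```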

Install-time scraping prerequisites (GMC manager)¶

The GMC install ships with the manager NetworkPolicy enabled by default (networkPolicy.enabled=true). Because the policy selects the manager pod, that pod becomes default-deny on ingress, and its /metrics endpoint admits traffic only from namespaces that carry the right label:

To scrape the GMC manager metrics, label your Prometheus namespace metrics: enabled; otherwise no scrape will reach the manager:
```
kubectl label namespace <prometheus-namespace> metrics=enabled
```

The validating-webhook port (container 9443) is intentionally re-allowed from any source — the kube-apiserver that calls it is not a pod in a labeled namespace, so a source restriction there would silently break every ActionsGateway admission (failurePolicy: Fail). The webhook is TLS + caBundle authenticated, so the sensitive surface stays the metrics: enabled restriction above. No namespace label is required for CR admission.
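
The two ingress rules described above can be pictured as the following NetworkPolicy. This is an illustrative sketch, not the manifest the chart renders: the policy name and the manager pod label are assumptions, and only the port numbers and the `metrics: enabled` namespace label come from this page.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gmc-manager                         # hypothetical name
  namespace: <gmc-namespace>
spec:
  podSelector:
    matchLabels:
      control-plane: controller-manager     # assumed manager pod label
  policyTypes:
    - Ingress
  ingress:
    # Metrics: only from namespaces labelled metrics=enabled.
    - from:
        - namespaceSelector:
            matchLabels:
              metrics: enabled
      ports:
        - protocol: TCP
          port: 8080
    # Webhook: any source, because the kube-apiserver is not a pod
    # in a labelled namespace.
    - ports:
        - protocol: TCP
          port: 9443
```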

This applies to the GMC manager only. The per-tenant AGC and proxy metrics are governed by the per-tenant NetworkPolicies the GMC generates (the AGC NP already admits monitoring-namespace scrapes of the metrics port).

Runtime enforcement of these policies depends on the CNI; kindnet's kube-network-policies does not block every egress flow that the policies deny (see the worker egress limitation in troubleshooting.md). The manager NetworkPolicy is verified by manifest review and is pending a Tier-A runtime check.

The ServiceMonitor integration, which provides out-of-the-box Prometheus Operator scraping, stays opt-in behind the metrics.serviceMonitor.enabled chart value (default false). It is left off by default because the ServiceMonitor CRD only exists once the Prometheus Operator is installed, so rendering it unconditionally would break helm install on clusters without it.
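
On a cluster that already runs the Prometheus Operator, the two chart values named in this section could be set in a values override such as the sketch below. Only these two keys are taken from this page; the rest of the chart's values are not shown.

```yaml
# values override (sketch)
networkPolicy:
  enabled: true          # chart default; keeps the manager default-deny on ingress
metrics:
  serviceMonitor:
    enabled: true        # default false; requires the ServiceMonitor CRD to exist
```

With the NetworkPolicy left enabled, the Prometheus namespace still needs the `metrics=enabled` label for scrapes to reach the manager.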

Full Metrics Reference¶

Metric	Type	Labels	Description
`actions_gateway_active_sessions`	Gauge	`namespace`, `runner_group`	Currently open long-poll sessions. One per RunnerGroup at steady state; rises toward `maxListeners` during bursts.
`actions_gateway_jobs_acquired_total`	Counter	`namespace`, `runner_group`	Jobs successfully acquired from the broker.
`actions_gateway_job_acquisition_errors_total`	Counter	`namespace`, `reason`	Acquisition failures. Reason values: `already_claimed` (benign race), `delivery_window_expired` (job redelivered), `version_too_old`, `other`.
`actions_gateway_job_duration_seconds`	Histogram	`namespace`, `runner_group`	Wall time from `acquirejob` success to worker pod terminal phase.
`actions_gateway_pod_creation_latency_seconds`	Histogram	`namespace`	Time from `acquirejob` to pod `Scheduled` event. Key SLO metric.
`actions_gateway_token_refreshes_total`	Counter	`namespace`	Successful GitHub App installation token refreshes.
`actions_gateway_token_refresh_errors_total`	Counter	`namespace`	Failed token refresh attempts. Compare against the <1/hr SLO (see SLO Recording Rules).
`actions_gateway_renewjob_errors_total`	Counter	`namespace`	Failed `renewjob` calls. Leading indicator for cancelled jobs.
`actions_gateway_eviction_retries_total`	Counter	`namespace`, `runner_group`	Jobs automatically re-queued after worker pod eviction.
`actions_gateway_eviction_retries_exhausted_total`	Counter	`namespace`, `runner_group`	Eviction retries exhausted; job requires manual re-run.
`actions_gateway_worker_pods_reaped_total`	Counter	`namespace`, `runner_group`, `reason`	Worker pods deleted by the lifecycle reaper. `reason="completed_ttl"` is routine cleanup after `completedPodTTL`; `reason="pending_deadline"` means a pod was stuck Pending past `pendingPodDeadline` and its job was cancelled — each such reap also emits a `WorkerPodStuckPending` Warning Event on the RunnerGroup.
`actions_gateway_message_poll_errors_total`	Counter	`namespace`, `reason`	`GetMessage` errors (excludes empty polls and session expiry — those are normal). `reason="rate_limited"` is a 429; `reason="timeout"` is a black-holed long-poll the broker accepted but never answered, bounded by the client response-header deadline and retried (see Listener Stalls After a Black-Holed Broker Connection); `reason="other"` is any remaining transport/decode error.
`actions_gateway_agent_recycles_total`	Counter	`namespace`, `runner_group`, `trigger`	Single-use JIT agents re-registered. `trigger="post_job"` is routine (one per completed job); `stale_session`/`startup` mean a dead agent was detected and healed after the fact; `reconcile_repair` means a parked agent was repaired by the reconciler.
`actions_gateway_agent_recycle_errors_total`	Counter	`namespace`, `runner_group`	Failed agent re-registration attempts. Sustained growth shrinks listener capacity — see the runbook.
`actions_gateway_reconcile_errors_total`	Counter	`controller`, `resource`	GMC/AGC reconcile errors. Non-zero values deserve investigation.
`actions_gateway_ip_range_updates_total`	Counter	`namespace`	`NetworkPolicy` egress rule refreshes from GitHub meta API.
`actions_gateway_managed_gateways`	Gauge	—	Total `ActionsGateway` CRs currently managed by the GMC.

Proxy metrics¶

The per-tenant egress proxy exposes its own metrics on its health/metrics port (:8081, restricted by the L-8 NetworkPolicy — see security.md L-8). Each proxy is a separate scrape target; these metrics carry no intrinsic namespace label, so attach one via the ServiceMonitor/scrape config if you need per-tenant attribution.

Metric	Type	Labels	Description
`actions_gateway_proxy_connections_active`	Gauge	—	Currently open CONNECT tunnels.
`actions_gateway_proxy_connections_total`	Counter	—	Total CONNECT tunnels opened.
`actions_gateway_proxy_dial_errors_total`	Counter	—	Upstream dial failures (e.g. blocked-destination attempts).
`actions_gateway_proxy_tunnel_duration_seconds`	Histogram	—	Tunnel lifetime, observed at close. Buckets reach 21600s (the 6h absolute lifetime cap).
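
Because the proxy metrics carry no `namespace` label of their own, per-tenant attribution has to come from the scrape side. The following is a minimal sketch of a plain Prometheus scrape job that does this with pod discovery. The job name and the `app` pod label value are assumptions; the port number is the one given above.

```yaml
scrape_configs:
  - job_name: actions-gateway-proxy         # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only proxy pods (label name and value are assumptions).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: actions-gateway-proxy
        action: keep
      # Keep only the health/metrics port.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8081"
        action: keep
      # Attach the tenant namespace as a target label.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

The scraping Prometheus must also be admitted by the NetworkPolicy that restricts the proxy's metrics port.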

For abuse/compromise detection built on these metrics (slowloris, eviction-retry loops, credential-harvesting), see security-operations.md.

Symptom → Metric Mapping¶

Symptom	Metric(s) to check	Notes
Jobs are slow to start	`pod_creation_latency_seconds` p95/p99	SLO: p95 ≤ 15s, p99 ≤ 60s
Jobs are randomly cancelled	`renewjob_errors_total`	Each sustained error risks a job cancellation
Jobs are not being acquired	`active_sessions` (should be ≥ 1 per RunnerGroup), `job_acquisition_errors_total`	Zero sessions = no polling
Jobs are queuing but not starting	`active_sessions` (OK) vs `jobs_acquired_total` not incrementing	Check `RateLimited` condition
Runner credentials are broken	`token_refresh_errors_total`	Spikes indicate Secret or GitHub App issue
Evictions causing re-runs	`eviction_retries_total`, `eviction_retries_exhausted_total`	Exhausted budget requires manual intervention
Throughput decaying job by job	`agent_recycle_errors_total` rising, `active_sessions` shrinking	Agent re-registration failing; see the runbook
Jobs cancelled without ever starting	`worker_pods_reaped_total{reason="pending_deadline"}`	Worker pod stuck Pending past the deadline — fix the image/scheduling cause; see the runbook
Proxy autoscaling not working	HPA TARGETS showing `<unknown>`	`requests.cpu` not set on proxy pods
GMC/AGC reconcile broken	`reconcile_errors_total`	Non-zero sustained rate indicates operator issue

Recommended Alert Rules¶

The following Prometheus alerting rules map to the SLO targets in Appendix A. Adjust thresholds to match your environment.

groups:
  - name: actions-gateway
    rules:

      # Page: no sessions means no job acquisition
      - alert: ActionsGatewayNoActiveSessions
        expr: |
          actions_gateway_active_sessions == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "No active listener sessions for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "The AGC has no open long-poll sessions. Jobs queue indefinitely until sessions are restored."

      # Page: token refresh errors risk job failures within ~1 hour
      - alert: ActionsGatewayTokenRefreshErrors
        expr: |
          rate(actions_gateway_token_refresh_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GitHub App token refresh errors in {{ $labels.namespace }}"
          description: "Token refresh has been failing for 5+ minutes. Sessions will fail once the current token expires (~1 hour)."

      # Page: sustained renewjob failures will cancel running jobs
      - alert: ActionsGatewayRenewJobErrors
        expr: |
          rate(actions_gateway_renewjob_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RenewJob errors in {{ $labels.namespace }}"
          description: "RenewJob is failing at >0.1/s for 5+ minutes. Running jobs may be cancelled."

      # Page: p99 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP99
        expr: |
          histogram_quantile(0.99,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod creation p99 latency SLO breach in {{ $labels.namespace }}"
          description: "p99 pod creation latency exceeds 60s SLO. Check quota and node capacity."

      # Ticket: p95 pod creation latency SLO breach
      - alert: ActionsGatewayPodCreationLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
          ) > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod creation p95 latency degraded in {{ $labels.namespace }}"
          description: "p95 pod creation latency exceeds 15s SLO. Investigate quota and scheduling."

      # Ticket: eviction budget exhausted — manual re-run required
      - alert: ActionsGatewayEvictionRetriesExhausted
        expr: |
          increase(actions_gateway_eviction_retries_exhausted_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Eviction retry budget exhausted for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "A job's eviction retry budget has been exhausted. Manual re-run required."

      # Ticket: reconcile errors need investigation
      - alert: ActionsGatewayReconcileErrors
        expr: |
          rate(actions_gateway_reconcile_errors_total[5m]) > 0.033
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Reconcile errors in {{ $labels.controller }} for {{ $labels.resource }}"
          description: "Reconcile errors at >2/minute for 10+ minutes. Resources may be stale."
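
The Symptom → Metric Mapping table also points at two metrics that the rules above do not cover. The sketch below shows how alerts for them might look. The group name, windows, thresholds, and severities are assumptions to tune for your environment, not SLO targets from Appendix A.

```yaml
groups:
  - name: actions-gateway-extra             # hypothetical group name
    rules:

      # Ticket: pods reaped while Pending mean jobs were cancelled before starting
      - alert: ActionsGatewayWorkerPodsStuckPending
        expr: |
          increase(actions_gateway_worker_pods_reaped_total{reason="pending_deadline"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Worker pods stuck Pending for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "At least one worker pod was reaped past pendingPodDeadline. Check image pulls and scheduling."

      # Ticket: failing agent re-registration shrinks listener capacity
      - alert: ActionsGatewayAgentRecycleErrors
        expr: |
          rate(actions_gateway_agent_recycle_errors_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent recycle errors for {{ $labels.runner_group }} in {{ $labels.namespace }}"
          description: "Agent re-registration has been failing for 15+ minutes. Throughput may decay job by job."
```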

SLO Recording Rules¶

These recording rules pre-compute the metrics needed for burn-rate alerting against the SLO targets in Appendix A. Apply them alongside the alert rules above.

groups:
  - name: actions-gateway-slos
    interval: 30s
    rules:

      # Pod creation latency — p95 and p99 per namespace
      - record: actions_gateway:pod_creation_latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )

      - record: actions_gateway:pod_creation_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (namespace, le) (
              rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])
            )
          )

      # Job duration: p50 and p95 per namespace and runner_group
      - record: actions_gateway:job_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )

      - record: actions_gateway:job_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (namespace, runner_group, le) (
              rate(actions_gateway_job_duration_seconds_bucket[5m])
            )
          )

      # Token refresh errors over the trailing hour (a count, via increase); compare against the <1/hr SLO
      - record: actions_gateway:token_refresh_errors:rate1h
        expr: |
          sum by (namespace) (
            increase(actions_gateway_token_refresh_errors_total[1h])
          )

      # Job acquisition success rate: fraction of acquisition attempts that succeed.
      # Aggregated by namespace only, because
      # actions_gateway_job_acquisition_errors_total has no runner_group label
      # and the two sides of the addition must carry matching labels.
      - record: actions_gateway:job_acquisition_success_rate:rate5m
        expr: |
          sum by (namespace) (
            rate(actions_gateway_jobs_acquired_total[5m])
          )
          /
          (
            sum by (namespace) (
              rate(actions_gateway_jobs_acquired_total[5m])
            )
            +
            sum by (namespace) (
              rate(actions_gateway_job_acquisition_errors_total[5m])
            )
          )
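
Once the recorded series exist, alert expressions can reference them directly instead of recomputing the quantiles. The sketch below shows two such alerts. The group name, alert names, and `for` durations are assumptions; the thresholds are the 15s p95 target and the <1/hr token refresh target mentioned in this page.

```yaml
groups:
  - name: actions-gateway-slo-alerts        # hypothetical group name
    rules:

      - alert: ActionsGatewayPodCreationP95Recorded
        expr: |
          actions_gateway:pod_creation_latency_seconds:p95 > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Recorded p95 pod creation latency above 15s in {{ $labels.namespace }}"

      - alert: ActionsGatewayTokenRefreshSLO
        expr: |
          actions_gateway:token_refresh_errors:rate1h >= 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token refresh errors at or above 1 per hour in {{ $labels.namespace }}"
```

These are simple threshold alerts on the recorded series. A multi-window burn-rate scheme would need additional recording rules over longer windows.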

Grafana Dashboard¶

The following panels cover the key health and performance signals. Where a recording rule above already computes a value, query the recorded series instead of recomputing it.

Suggested Panel Layout¶

Row 1 — Gateway Health (per namespace)

Panel	Query	Visualization
Active sessions	`actions_gateway_active_sessions`	Stat / Time series
Jobs acquired/min	`rate(actions_gateway_jobs_acquired_total[5m]) * 60`	Time series
Token refresh errors	`rate(actions_gateway_token_refresh_errors_total[5m])`	Stat (threshold: >0 = red)
RenewJob errors	`rate(actions_gateway_renewjob_errors_total[5m])`	Stat (threshold: >0 = yellow)

Row 2 — Pod Creation Latency SLO

Panel	Query	Visualization
p95 latency	`actions_gateway:pod_creation_latency_seconds:p95`	Gauge (green <15s, yellow <60s, red >60s)
p99 latency	`actions_gateway:pod_creation_latency_seconds:p99`	Gauge
Latency heatmap	`rate(actions_gateway_pod_creation_latency_seconds_bucket[5m])`	Heatmap

Row 3 — Job Throughput (per runner_group)

Panel	Query	Visualization
Jobs acquired total	`increase(actions_gateway_jobs_acquired_total[1h])`	Bar chart by runner_group
Job duration p50/p95	`actions_gateway:job_duration_seconds:p50`, `actions_gateway:job_duration_seconds:p95` (one query each)	Time series
Eviction retries	`increase(actions_gateway_eviction_retries_total[1h])`	Bar chart
Eviction budget exhausted	`increase(actions_gateway_eviction_retries_exhausted_total[1h])`	Stat (threshold: >0 = red)

Row 4 — Proxy and Quota

Panel	Query	Visualization
Proxy replica count	`kube_deployment_status_replicas_ready{deployment="actions-gateway-proxy"}`	Time series
HPA desired vs. current	HPA metrics from `kube_horizontalpodautoscaler_*`	Time series
ResourceQuota usage	`kube_resourcequota` filtered by namespace	Bar gauge

Row 5 — GMC Overview

Panel	Query	Visualization
Managed gateways	`actions_gateway_managed_gateways`	Stat
Reconcile errors	`rate(actions_gateway_reconcile_errors_total[5m])`	Time series by controller
IP range refreshes	`increase(actions_gateway_ip_range_updates_total[24h])`	Stat

Dashboard Variables¶

Add these template variables to make the dashboard multi-tenant:

$namespace — label_values(actions_gateway_active_sessions, namespace) — allows filtering to a single tenant
$runner_group — label_values(actions_gateway_active_sessions{namespace="$namespace"}, runner_group) — allows filtering to a specific RunnerGroup

Label Cardinality Warning¶

Metric labels are scoped to namespace and runner_group. To avoid label cardinality explosion:

Do not use dynamically generated runner_group names (e.g. names incorporating PR numbers or commit SHAs). Each unique combination of namespace + runner_group creates a distinct time series; thousands of unique names will cause memory pressure in Prometheus.
Stable, human-meaningful names like gpu-2x, cpu-standard, gpu-a100 are correct. These are configured in the ActionsGateway spec and should not change after initial setup.
If you need per-workflow or per-repo attribution, use Prometheus recording rules or labels from job metadata, not from RunnerGroup names.
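
One way to notice a cardinality problem early is an alert on the number of distinct `runner_group` values per namespace. The rule below is a sketch of that idea. The group name, alert name, threshold of 50, and `for` duration are assumptions, and the expression counts only RunnerGroups that currently export `actions_gateway_active_sessions`, so it will not see groups that have already been deleted.

```yaml
groups:
  - name: actions-gateway-cardinality       # hypothetical group name
    rules:
      - alert: ActionsGatewayRunnerGroupCardinality
        expr: |
          count by (namespace) (
            count by (namespace, runner_group) (actions_gateway_active_sessions)
          ) > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "More than 50 RunnerGroups exporting metrics in {{ $labels.namespace }}"
          description: "Check for dynamically generated runner_group names (PR numbers, commit SHAs)."
```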