7. Test Plan
Testing is structured in three layers. Each layer has a distinct scope, speed contract, and failure signal. All three layers run in CI; only unit and integration tests gate PRs. End-to-end tests run nightly against a staging cluster. Multi-tenant scenarios are explicitly covered at the integration and end-to-end layers, since tenant isolation is a correctness property of the system, not just a deployment concern.
7.1. Unit Tests
Scope: Pure logic within a single package — no network, no Kubernetes API, no real file I/O.
Speed contract: Full suite runs in under 30 seconds. Any test requiring a sleep or external call does not belong here.
Tooling: Standard go test ./... with testify/assert. Use go test -race in CI to catch goroutine data races.
What to cover:
- Broker API client — Request construction, header injection, and response parsing for `sessions`, `message`, `acquirejob`, and `renewjob`. Use `httptest.NewServer` to serve static JSON fixtures without hitting GitHub. Assert that `acquirejob` and `renewjob` use the `run_service_url` from the message body, not the broker URL.
- RenewJob loop — Verify the per-job goroutine calls `renewjob` at the correct 60-second interval, stops cleanly when the job completes, and handles a non-200 response from `renewjob` by surfacing an error without panicking.
- Rate-limit (429) backoff — Drive the broker API client against an `httptest` server that returns `429 Too Many Requests` with a `Retry-After: 30` header. Assert the client honors `Retry-After`, increments `actions_gateway_message_poll_errors_total{reason="rate_limited"}`, and falls back to exponential backoff capped at 5 minutes when the header is absent. Assert that sustained 429s for >10 minutes surface a `RateLimited` condition on the corresponding `RunnerGroup`.
- Token Manager — Use a fake clock to advance time to T-5 minutes before token expiry and assert the Token Manager fetches a new token before the old one expires. Assert that session goroutines reading the token during a refresh window get a valid (old or new) token and are never blocked. Assert that `actions_gateway_token_refresh_errors_total` increments on each failed refresh attempt; the alerting threshold for this metric is defined in docs/operations/observability.md (> 0 for 5 minutes triggers a page).
- Payload decryption — AES-256 decryption of the `TaskAgentMessage.Body` field. Test against a pre-generated key/ciphertext pair committed as a fixture. Test failure modes: wrong key, truncated payload, invalid base64.
- Session registry — Goroutine lifecycle management: spawn N sessions, verify N goroutines are running, scale down to M, verify M remain with no leaks. Use `goleak.VerifyNone(t)` (from `go.uber.org/goleak`) as a test cleanup hook — it identifies leaked goroutines by stack trace, making failures actionable. Do not use `runtime.NumGoroutine()` deltas, which include Go runtime goroutines and produce unreliable counts.
- Label-to-pod mapping — The logic that translates `RunnerGroup` runner labels to a target pod spec. Table-driven tests covering label matches, mismatches, defaults, and invalid configurations.
- AGC reconciler state machine — Unit-test the AGC reconciler's desired-vs-actual diffing logic with a fake `client.Client` (provided by `controller-runtime/pkg/client/fake`). Cover create, update, scale-up, scale-down, and delete transitions.
- GMC reconciler state machine — Unit-test the GMC reconciler with a fake client. For a given `ActionsGateway` spec, assert that the reconciler produces exactly the expected set of Kubernetes objects (ServiceAccount, Role, RoleBinding, NetworkPolicy, proxy Deployment, proxy Service, proxy PodDisruptionBudget, HPA, AGC Deployment) all within the CR's own namespace. The reconciler does not create a `ResourceQuota` — that is platform-owned (Q130); assert it leaves any pre-existing namespace quota untouched. Table-driven tests covering spec creation, proxy scaling bound changes, and deletion. For credential rotation specifically: assert that updating `gitHubAppRef.Name` causes the GMC to update the AGC Deployment's volume reference to the new Secret name (triggering a rollout) and does not mutate or delete the old Secret.
- HPA spec generation — For a range of `ProxyConfig` inputs (explicit values, all defaults, boundary values), assert the generated `HorizontalPodAutoscaler` has the correct `minReplicas`, `maxReplicas`, and `targetCPUUtilizationPercentage`. Assert that `minReplicas` is always ≥ 1 and ≤ `maxReplicas`. Assert proxy pods always have `resources.requests.cpu` set (required for HPA metric computation).
- Proxy env injection — Assert that the AGC Deployment spec produced by the reconciler contains `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` env vars. Assert `NO_PROXY` includes `kubernetes.default.svc.cluster.local` and the configured `noProxyCIDRs`. Assert the same three vars appear in the worker pod template.
- Status Conditions — Assert that the GMC sets the `Ready`, `ProxyAvailable`, and `AGCAvailable` conditions on `ActionsGatewayStatus` correctly as components become healthy or degrade. Assert conditions use standard `metav1.Condition` types compatible with `kubectl wait --for=condition=Ready`.
- Runner version rejection — Unit-test the session goroutine's handling of a `400 Bad Request` from `POST /sessions` containing a version-too-old message. Assert the goroutine surfaces the error as a `RunnerGroup` condition rather than silently retrying in a tight loop.
- GMC RBAC boundary assertions — Enumerate the generated ClusterRole rules and assert that no rule grants `*` verbs on `secrets`, `pods`, or `nodes` at the cluster level. This is a regression guard against accidental privilege escalation during development.
- `gitHubAppRef` namespace defaulting — Unit-test the defaulting logic: when `Namespace` is omitted, assert it resolves to the `ActionsGateway` CR's own namespace; when set explicitly, assert that value is used instead.
- Reserved namespace blocklist validation — Unit-test the admission webhook logic that rejects `ActionsGateway` CRs created in reserved namespaces. The static defaults are `kube-system`, `kube-public`, and `gmc-system`. The webhook also reads `POD_NAMESPACE` (downward API) at setup time and adds it to the set, so installs into a non-default namespace are protected. Tests cover the static defaults plus a custom-install namespace driven by the constructor.
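
The rate-limit item in the list above states two rules: honor `Retry-After` when it is present, otherwise back off exponentially with a 5-minute cap. The sketch below shows one way to isolate that decision in a pure function that a table-driven unit test can drive. The name `retryDelay`, the 1-second base delay, and the doubling factor are assumptions made for illustration; this plan specifies only the cap. The sketch handles the delay-seconds form of `Retry-After` and treats any other value, including the HTTP-date form, as absent.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const (
	baseBackoff = time.Second     // assumed starting delay; not specified in this plan
	maxBackoff  = 5 * time.Minute // cap stated in the rate-limit item
)

// retryDelay picks the wait before the next poll after a 429 response.
// retryAfter is the raw Retry-After header value ("" when absent), for
// example resp.Header.Get("Retry-After"). attempt counts consecutive 429s,
// starting at 0.
func retryDelay(retryAfter string, attempt int) time.Duration {
	// A parseable, non-negative number of seconds always wins.
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	// Fallback: double the base delay per attempt, never exceeding the cap.
	d := baseBackoff
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	fmt.Println(retryDelay("30", 3)) // header present: 30s
	fmt.Println(retryDelay("", 3))   // header absent: 1s doubled three times
	fmt.Println(retryDelay("", 20))  // header absent: capped at 5 minutes
}
```

Keeping the calculation free of timers and HTTP types means the unit test needs no sleep and no fake clock, which fits the 30-second speed contract for this layer.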
7.2. Integration Tests
Scope: Multiple components interacting with real infrastructure dependencies — a live Kubernetes API server and a stubbed GitHub backend. No actual GitHub network calls. No container image builds and no real Kubernetes scheduling — pods are created in the API server but never actually scheduled.
Speed contract: Full suite runs in under 5 minutes. Each test must complete in under 30 seconds. Tests run against a local envtest API server (from controller-runtime). kind is not used for this layer — it requires container builds and is slower than envtest.
Build tag: all integration test files carry //go:build integration. This keeps them out of the unit-test run (go test ./...) and requires an explicit -tags integration flag. CI runs them as a separate job after unit tests pass. Tests live under cmd/{agc,gmc}/internal/controller/integration/. See docs/development/testing.md for the run commands.
Why envtest, not the fake client: the fake client.Client cannot enforce CRD admission validation (CEL rules, x-kubernetes-validations), does not handle ownership references and garbage collection, and cannot test webhook behavior. envtest spins up a real kube-apiserver and etcd binary locally (no kubelet, no scheduler), so CRD schemas, admission webhooks, and status subresources all behave as in production.
Tooling: controller-runtime/pkg/envtest for the Kubernetes API surface. A shared stateful httptest fake broker under internal/brokertest/ for the GitHub broker — tests control it by enqueuing job messages on demand and asserting which sessions were deleted, rather than replaying static fixtures. Standard go test with testify and gomega.Eventually for eventually-consistent assertions. Ginkgo is not used; the integration tests follow the same testing-package style as the unit tests in this repo.
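
To make the "stateful fake" idea concrete, here is a minimal sketch of a broker double that tests steer by enqueuing messages and then interrogate for deleted sessions. It is not the code under internal/brokertest/; the type `fakeBroker`, its methods, the `GET /message` route shape, and the 204 "no job" response are all assumptions chosen for illustration. The sketch calls the handler through `httptest.NewRecorder`; a real test would more likely wrap it in `httptest.NewServer`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// fakeBroker is a hypothetical stateful test double for the GitHub broker.
type fakeBroker struct {
	mu      sync.Mutex
	queue   []string // job messages waiting to be delivered
	deleted []string // session IDs that received a DELETE
}

// Enqueue makes a job message available to the next poll.
func (b *fakeBroker) Enqueue(msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.queue = append(b.queue, msg)
}

// DeletedSessions returns a copy of the recorded session deletions.
func (b *fakeBroker) DeletedSessions() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return append([]string(nil), b.deleted...)
}

func (b *fakeBroker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case r.Method == http.MethodGet && r.URL.Path == "/message":
		if len(b.queue) == 0 {
			w.WriteHeader(http.StatusNoContent) // assumed "no job" response
			return
		}
		msg := b.queue[0]
		b.queue = b.queue[1:]
		io.WriteString(w, msg)
	case r.Method == http.MethodDelete && strings.HasPrefix(r.URL.Path, "/sessions/"):
		b.deleted = append(b.deleted, strings.TrimPrefix(r.URL.Path, "/sessions/"))
	default:
		http.NotFound(w, r)
	}
}

// runScenario enqueues one job, polls twice, then deletes a session.
func runScenario() (msg string, emptyCode int, deleted []string) {
	b := &fakeBroker{}
	b.Enqueue(`{"jobId":"job-1"}`)

	rec := httptest.NewRecorder()
	b.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/message", nil))
	msg = rec.Body.String()

	rec = httptest.NewRecorder()
	b.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/message", nil))
	emptyCode = rec.Code

	rec = httptest.NewRecorder()
	b.ServeHTTP(rec, httptest.NewRequest(http.MethodDelete, "/sessions/s-1", nil))
	return msg, emptyCode, b.DeletedSessions()
}

func main() {
	msg, code, deleted := runScenario()
	fmt.Println(msg, code, deleted)
}
```

Because the fake holds its own queue and deletion log, an assertion such as "every registered session was deleted" reads the fake's state directly instead of inferring it from replayed fixtures.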
What to cover:
- CRD install and validation — Install both `ActionsGateway` and `RunnerGroup` CRD schemas into `envtest`. Verify valid manifests are accepted. Verify invalid specs are rejected at admission: the namespace blocklist webhook rejects reserved names; CRD CEL rules reject `priorityTiers` in non-ascending threshold order; CRD CEL rules reject `maxWorkers` values that conflict with the last `priorityTiers` threshold; `runnerLabels` is rejected when empty (MinItems=1) or when an item contains whitespace/commas (per-item `Pattern`). The `securityProfile` no-downgrade guard is enforced by the GMC validating webhook (not CEL, since it reads `metadata.annotations`) and is unit-tested directly: an upgrade (baseline → restricted) and same-value update pass; a downgrade (restricted → baseline, or anything → `privileged`, including a dropped-field re-default to baseline) is rejected unless the `allow-profile-downgrade: "true"` annotation is present.
- GMC tenant provisioning — Create a namespace, then apply an `ActionsGateway` CR into it. Verify the GMC creates all expected child resources within that same namespace: ServiceAccount (AGC + worker), Role, RoleBinding, NetworkPolicy, proxy Deployment (with `resources.requests.cpu` set), proxy Service, proxy PodDisruptionBudget, HPA, AGC Deployment (with `HTTP_PROXY`, `HTTPS_PROXY`, and `NO_PROXY` set), and bootstrap RunnerGroups. Verify the GMC does not create or modify the namespace itself, and does not create a `ResourceQuota` (platform-owned — Q130). `TestGMC_TenantProvisioning_NoResourceQuotaCreated` asserts a pre-existing platform quota is left untouched. Assert `ActionsGatewayStatus.Conditions` includes the `ProxyAvailable` and `AGCAvailable` condition types. Note: because envtest does not schedule pods, `Deployment.status.readyReplicas` stays at 0 — the `Ready` condition will not become `True` and tests assert the non-Ready state is reported correctly rather than asserting `Ready=True`. Additional provisioning cases: `gitHubAppRef.namespace` omitted defaults to the CR's own namespace; `spec.proxy.noProxyCIDRs` is merged with (not replaced by) the mandatory cluster-internal exclusions; updating `gitHubAppRef.name` causes the AGC Deployment to reference the new Secret without deleting the old one.
- GMC tenant teardown — Delete an `ActionsGateway` CR and verify the GMC removes all associated resources, including the proxy Deployment, Service, PodDisruptionBudget, and HPA, without affecting any other tenant namespace. Assert that a second `ActionsGateway` CR remains fully intact. Also verify that re-applying the same CR after teardown brings all resources back cleanly.
- HPA bounds update — Update `spec.proxy.maxReplicas` on a live `ActionsGateway` CR and verify the GMC patches the HPA to reflect the new bound within one reconcile cycle.
- Proxy NetworkPolicy content — Verify the content of the generated `NetworkPolicy` in the API server: proxy pod egress includes the GitHub CIDR rules; AGC and worker pods have egress rules only to the proxy ClusterIP; DNS egress is always present. Verify that `spec.proxy.managedNetworkPolicy: false` suppresses the GitHub CIDR egress rules. Verify the `IPRangeReconciler` patches an existing policy when the fetched CIDR set changes. Note: envtest does not run a CNI plugin — these tests verify the `NetworkPolicy` spec content, not actual packet filtering.
- AGC RBAC scope enforcement — Provision an `ActionsGateway` CR so the GMC creates the `actions-gateway-controller` ServiceAccount and its Role/RoleBinding. Impersonate that ServiceAccount via `rest.ImpersonateConfig` and attempt to list resources in a different tenant's namespace. Assert the API server returns 403. Assert that listing in the same namespace returns 200.
- AGC reconciler end-to-end — Deploy a `RunnerGroup`, verify the AGC starts exactly one listener goroutine at rest (the permanent baseline). Enqueue jobs via the fake broker and verify additional goroutines spawn up to `.spec.maxListeners`. Verify idle goroutines shut down once the queue empties, leaving exactly one active listener. Update `.spec.maxListeners`, verify the new ceiling takes effect without restarting in-flight goroutines. Delete the resource, verify all goroutines exit and no agent Secrets or worker Pods are orphaned.
- Secret lifecycle — Verify that a Secret is created with the correct payload labels when a job is intercepted, scoped to the correct tenant namespace, and deleted after the pod terminates. In envtest, pod phase must be advanced manually (no kubelet) — tests set the pod status to `Succeeded` via the status subresource client.
- Event-driven pod completion — Verify the provisioner's `InformerPodWaiter` resolves a blocked session when the shared Pod informer observes the worker pod reach a terminal phase (and on pod deletion), rather than polling pod state on a timer. `TestInformerPodWaiter_RealInformer` runs a real manager cache against envtest, creates a pod, advances its status to `Succeeded` via the status subresource, and asserts `WaitForCompletion` returns promptly with the terminal phase.
- Worker-Pod watch re-triggers reconciliation — Verify the RunnerGroup controller watches the worker Pods its provisioner creates, so a Pod lifecycle event (create on job acquire, terminal-phase transition, eviction, or delete) re-triggers a reconcile and refreshes the RunnerGroup's status without waiting for the next spec change or cache resync. `TestAGC_Reconciler_WorkerPodEventTriggersReconcile` runs a real manager against envtest, lets the controller quiesce (its reconcile count stops increasing), injects a sentinel condition into the listener-condition channel, then creates a labelled worker Pod and asserts the resulting reconcile both increments the reconcile count and flushes the buffered condition into status. The watch is deliberately Pods-only — a Secret watch would establish a Secret informer and cache credential material, violating the W3/H-2 cache-isolation property.
- Pod provisioning — Verify that the AGC creates a Pod with the correct image, resource limits, volume mounts, and security context when a job payload is received from the fake broker. Verify controller-enforced invariants are applied unconditionally: `automountServiceAccountToken: false`, `serviceAccountName: actions-gateway-worker`, `HTTP_PROXY`/`HTTPS_PROXY`/`NO_PROXY` env vars injected with the provisioner's values. Verify `priorityTiers` tier assignment by pod count. Verify `maxWorkers` ceiling holds the third pod until an active pod completes. Verify the secure-by-default hardening: with `SECURITY_PROFILE` unset/`baseline` the pod gets pod-level `runAsNonRoot` + seccomp `RuntimeDefault` but no per-container privilege-escalation/capability floor; with `restricted` the per-container `allowPrivilegeEscalation:false` + drop-ALL floor is added; with `privileged` no SecurityContext defaults are stamped. Verify default `500m`/`1Gi` requests+limits are stamped when the tenant container omits them, and that explicit tenant SecurityContext/resource values are preserved (gap-fill only). Verify the recommended `app.kubernetes.io/*` labels are present.
- Failure recovery — Simulate a non-eviction pod failure (set pod status to `Failed` without `reason: Evicted`) and verify the AGC cleans up the associated Secret without leaking it and without triggering an automatic rerun. Simulate an eviction (set `reason: Evicted`) and verify the AGC calls the rerun API, increments `actions_gateway_eviction_retries_total`, and cleans up the Secret. Verify `maxEvictionRetries: 0` suppresses the rerun call and increments `actions_gateway_eviction_retries_exhausted_total` instead. Simulate a namespace ResourceQuota rejection on pod create and verify the provisioner retries up to `maxQuotaRetries` times, increments `actions_gateway_quota_retries_total` on each attempt, and increments `actions_gateway_quota_retries_exhausted_total` when the budget is exhausted. Verify `maxQuotaRetries: 0` returns an error immediately without touching any quota counter.
- SIGTERM session cleanup — Start an AGC against the fake broker, burst the listener count to N sessions. Cancel the reconciler's context (simulating SIGTERM). Assert the AGC issues `DELETE /sessions/{id}` for every registered session before the context fully unwinds, confirmed by the fake broker. Assert no goroutine leak via `goleak.VerifyNone`.
- Worker-pod reaper — Verify the RunnerGroup reconciler's reaper against a real apiserver: a worker pod advanced to a terminal phase via the status subresource is deleted once `completedPodTTL` elapses, driven by the real Pod watch and the reconciler's `RequeueAfter` timer (`TestAGC_Reaper_CompletedPodDeletedAfterTTL`); a pod that never schedules (envtest has no scheduler, so every pod is genuinely Pending) is deleted once `pendingPodDeadline` elapses, while a fresh Pending pod within the deadline survives (`TestAGC_Reaper_StuckPendingPodDeletedAfterDeadline`); and a pod created through the real provisioner path carries a controller `OwnerReference` to its RunnerGroup with the apiserver-assigned UID (`TestAGC_Reaper_WorkerPodHasOwnerRef`). The ownerRef cascade is not provable in envtest — there is no kube-controller-manager, so no GC controller runs — which is why the cascade hook is asserted here and the operational cleanup behaviour at Tier A.
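
The reaper item above combines two timers: `completedPodTTL` for terminal pods and `pendingPodDeadline` for pods stuck in Pending, with a requeue covering the time that remains. The following sketch shows that decision as a pure function. It is an illustration under assumptions, not the reconciler's code: the `workerPod` type, the `reapDecision` name, and measuring the Pending deadline from pod creation time are all hypothetical choices.

```go
package main

import (
	"fmt"
	"time"
)

type podPhase string

const (
	phasePending   podPhase = "Pending"
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// workerPod carries only the fields this sketch needs (hypothetical type).
type workerPod struct {
	phase      podPhase
	createdAt  time.Time
	finishedAt time.Time // zero unless the pod is terminal
}

// reapDecision reports whether the pod should be deleted now. When it
// should not, requeueAfter is the time left until its deadline (0 means
// no timer is needed, as for a Running pod).
func reapDecision(p workerPod, now time.Time, completedPodTTL, pendingPodDeadline time.Duration) (del bool, requeueAfter time.Duration) {
	var deadline time.Time
	switch p.phase {
	case phaseSucceeded, phaseFailed:
		deadline = p.finishedAt.Add(completedPodTTL)
	case phasePending:
		deadline = p.createdAt.Add(pendingPodDeadline) // assumed reference point
	default:
		return false, 0
	}
	if !now.Before(deadline) {
		return true, 0
	}
	return false, deadline.Sub(now)
}

// demo evaluates four pods against a 1-minute TTL and a 5-minute deadline.
func demo() []string {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	pods := []workerPod{
		{phase: phaseSucceeded, finishedAt: now.Add(-2 * time.Minute)},
		{phase: phasePending, createdAt: now.Add(-30 * time.Second)},
		{phase: phasePending, createdAt: now.Add(-10 * time.Minute)},
		{phase: phaseRunning, createdAt: now.Add(-time.Hour)},
	}
	var out []string
	for _, p := range pods {
		del, after := reapDecision(p, now, time.Minute, 5*time.Minute)
		out = append(out, fmt.Sprintf("%s delete=%t requeueAfter=%s", p.phase, del, after))
	}
	return out
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Passing `now` in as a parameter keeps the function deterministic, so the fresh-Pending and stuck-Pending cases can be checked without waiting for real time to elapse.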
7.3. End-to-End Tests
Scope: the full system deployed into a real Kubernetes cluster. GMC, AGC, and proxy binaries run as actual Pods. Proxy pods are scheduled and connected. Cert-manager issues TLS certificates for the admission webhook. NetworkPolicy is enforced by the CNI. HPA scaling is driven by metrics-server.
Tier structure. End-to-end tests split into three tiers along the "what's required to run them" axis:
| Tier | What it tests | GitHub required? | When it runs |
|---|---|---|---|
| A — Infrastructure | GMC provisioning, proxy scheduling, NetworkPolicy enforcement, HPA, PDB, RBAC, teardown, GMC restart | No | every merge to main |
| B — Lifecycle (fake broker) | AGC session polling, job acquisition, pod creation, eviction retry, SIGTERM cleanup | No | every merge to main |
| C — Real GitHub | Actual workflow dispatch, log streaming, RenewJob across renewal cycles, proxy egress IP routing | Yes (GitHub App credentials) | nightly + on-demand |
Tier A and Tier B run on a local kind cluster — fast enough for CI on every merge and for the inner dev loop. Tier C runs against a real GitHub App and a dedicated test repository; it requires E2E_GITHUB_APP_ID, E2E_GITHUB_APP_INSTALLATION_ID, E2E_GITHUB_APP_PRIVATE_KEY, E2E_GITHUB_ORG, and E2E_GITHUB_REPO to be set, and is skipped at runtime when any are missing.
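
The runtime skip for Tier C can be sketched as a small gate over the five variables named above. The helper name `missingTierCEnv` is hypothetical, and treating an empty value the same as an unset one is an assumption; the plan says only that the tier is skipped when any variable is missing.

```go
package main

import (
	"fmt"
	"os"
)

// tierCEnv lists the variables Tier C requires, as named in this section.
var tierCEnv = []string{
	"E2E_GITHUB_APP_ID",
	"E2E_GITHUB_APP_INSTALLATION_ID",
	"E2E_GITHUB_APP_PRIVATE_KEY",
	"E2E_GITHUB_ORG",
	"E2E_GITHUB_REPO",
}

// missingTierCEnv returns the required variables that are unset or empty.
// lookup has the same shape as os.LookupEnv so a test can pass a fake.
func missingTierCEnv(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range tierCEnv {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingTierCEnv(os.LookupEnv); len(missing) > 0 {
		fmt.Println("skipping Tier C; missing:", missing)
		return
	}
	fmt.Println("Tier C credentials present")
}
```

In a Ginkgo suite the same check would typically sit at the top of the Tier C container and call `Skip` with the list of missing names, so the skip reason is visible in the report.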
What kind adds over envtest integration tests:
| Capability | envtest (§7.2) | kind (Tier A/B) |
|---|---|---|
| CRD admission + CEL validation | ✅ | ✅ |
| Admission webhook with cert-manager TLS | ⚠️ requires manual cert workaround | ✅ |
| Real pod scheduling (kubelet present) | ❌ | ✅ |
| Container images actually pulled and run | ❌ | ✅ |
| NetworkPolicy enforcement (CNI) | ❌ | ✅ |
| HPA scaling (metrics-server required) | ❌ | ✅ |
| PDB enforcement during node drain | ❌ | ✅ |
| Proxy CONNECT tunnel actually relays bytes | ❌ | ✅ |
| GMC/AGC/proxy binaries running as real Pods | ❌ | ✅ |
| Deployment rollout behavior | ❌ | ✅ |
Speed contract: each Tier A/B test completes in under 3 minutes, and Tier A and Tier B together finish in roughly 30 minutes or less. Tier C tests take 2–15 minutes each; the 15-minute RenewJob test sets the upper bound. Cluster setup is a one-time BeforeSuite cost, not counted against individual test budgets.
Build tag: all e2e files carry //go:build e2e. Tests are excluded from both go test ./... and the integration test run. Two Tier-A tests (E2E_GMC_HPA_ScalesUpUnderLoad and E2E_GMC_PDB_PreventsEvictionBelowMinAvailable) carry the Ginkgo Label("local-only") and are excluded from CI via --label-filter '!local-only' because they depend on CPU-load timing that is flaky on 2-vCPU GitHub Actions runners. They pass reliably on a local machine. See docs/development/testing.md for run commands.
Cluster shape. A multi-node kind cluster (1 control-plane + 2 workers) is required for pod anti-affinity and PDB tests to be meaningful. kindnet is the default CNI; it accepts NetworkPolicy objects but its bundled kube-network-policies enforcer does not drop egress traffic for the negative cases, so the two runtime egress-negative specs (E2E_GMC_TenantProvisioning_WorkloadEgressBlockedToNonProxyPod, E2E_GMC_TenantProvisioning_WorkerCannotReachK8sAPI) self-skip on kindnet. The make e2e-cluster KIND_CNI=calico profile (see kind-iteration.md) installs Calico instead; on that cluster the negatives assert real packet drops, each paired with an unlabelled control pod that proves the destination is reachable so a drop is attributable to NetworkPolicy enforcement.
Fake GitHub for Tier B. Tier B replaces real GitHub with test/fakegithub/, a standalone HTTP server deployed into the cluster as a Deployment + Service. It implements the broker protocol with stateful behaviour (session registration, controllable job-message delivery, recorded acquire/renew/rerun calls), the runner registration API (generate-jitconfig issuing real JIT blobs, list-by-name, delete, 409 on name collision), and exposes a control API used by tests to inject jobs and assert on calls. It can also simulate GitHub's single-use JIT runner behaviour (Q114): with SINGLE_USE_RUNNERS=true or POST /control/singleuse?enabled=true[&owner=<prefix>], a job acquisition deletes the delivering session's runner record — the dead session then serves one empty 200 (the decode response: EOF signature) and 401s thereafter, and re-registering a surviving name returns 409 — reproducing the M4 §12 death spiral without real GitHub. It is off by default and scoped by session-owner prefix so one spec can opt in without affecting parallel suites. fakegithub also models GitHub's pool-wide opportunistic delivery (M1 Investigation C/D): a job whose session is recycled away before it is acquired is carried to the owner's pending pool and delivered to the next live session, so the post-job re-registration does not strand a job that races a session's recycle window. The AGC is pointed at the fake by setting AGC_EXTRA_* env vars on the GMC, which forwards them into the AGC Deployments it creates (gated by --allow-agc-extra-env=true, set only in e2e). See docs/development/kind-iteration.md for the env-var details.
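
The single-use behaviour described above is easier to follow as a sequence of status codes. The sketch below is a toy in-memory model of it, not the code in test/fakegithub/: every type and method name is hypothetical, the 201 returned for a successful registration is an assumption, and so is the final step, where a consumed name can be registered again because its record was deleted.

```go
package main

import "fmt"

// fakeGitHub models only the single-use runner behaviour (hypothetical type).
type fakeGitHub struct {
	singleUse bool
	runners   map[string]bool   // runner name -> registration record exists
	owner     map[string]string // session ID -> runner name
	dead      map[string]int    // dead session ID -> polls served since death
}

func newFakeGitHub(singleUse bool) *fakeGitHub {
	return &fakeGitHub{
		singleUse: singleUse,
		runners:   map[string]bool{},
		owner:     map[string]string{},
		dead:      map[string]int{},
	}
}

// Register returns 409 while a record for the name survives, else 201.
func (f *fakeGitHub) Register(name string) int {
	if f.runners[name] {
		return 409
	}
	f.runners[name] = true
	return 201 // assumed success code
}

// OpenSession binds a session ID to a registered runner.
func (f *fakeGitHub) OpenSession(id, runner string) { f.owner[id] = runner }

// Acquire simulates a job acquisition. In single-use mode it deletes the
// delivering session's runner record and marks the session dead.
func (f *fakeGitHub) Acquire(id string) {
	if !f.singleUse {
		return
	}
	delete(f.runners, f.owner[id])
	f.dead[id] = 0
}

// Poll returns 200 for a live session. A dead session gets one more 200
// (standing in for the empty body) and 401 on every poll after that.
func (f *fakeGitHub) Poll(id string) int {
	polls, isDead := f.dead[id]
	if !isDead {
		return 200
	}
	f.dead[id] = polls + 1
	if polls == 0 {
		return 200
	}
	return 401
}

// demo walks one runner through registration, acquisition, and death.
func demo() []int {
	f := newFakeGitHub(true)
	codes := []int{f.Register("runner-a"), f.Register("runner-a")}
	f.OpenSession("s-1", "runner-a")
	codes = append(codes, f.Poll("s-1"))
	f.Acquire("s-1")
	codes = append(codes, f.Poll("s-1"), f.Poll("s-1"), f.Poll("s-1"))
	codes = append(codes, f.Register("runner-a"))
	return codes
}

func main() {
	fmt.Println(demo())
}
```

Read left to right, the returned codes are: first registration, name collision, live poll, the one empty 200 after acquisition, two 401s, and a fresh registration of the consumed name.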
Tooling: Ginkgo-based suites under cmd/gmc/test/e2e/. A shared cmd/gmc/test/utils/ helper package wraps kubectl, kind, and the fakegithub port-forward. Tier C uses the GitHub REST API to dispatch workflows and poll runs to completion.
What to cover:
- Smoke test — single job, single tenant — Create a namespace, apply one `ActionsGateway` CR into it, dispatch a minimal workflow (`echo "hello"`), and assert the run completes green with correct log output in the GitHub UI. This is the merge gate.
- Parallel job execution — Dispatch a matrix workflow with 10 parallel jobs against a single tenant and assert all 10 complete successfully, verifying the session multiplexer handles concurrent polling without message collisions.
- Multi-tenant isolation — Provision two `ActionsGateway` CRs pointing to different GitHub repositories and different namespaces. Dispatch simultaneous jobs to both. Assert that each job runs in its own namespace, that no Secrets are visible across namespaces, and that one tenant's resource consumption does not affect the other's job throughput.
- Proxy egress isolation — Confirm via network observation that GitHub API calls and log stream traffic from both the AGC and worker pods exit through the tenant's proxy Service address, not directly through the cluster NAT. Assert no direct egress to GitHub IPs is observed from AGC or worker pods.
- Proxy HA under disruption — Cordon one node hosting a proxy pod and drain it. Assert the PodDisruptionBudget prevents eviction until another proxy pod is scheduled, and that in-flight jobs are not interrupted during the disruption.
- Tenant provisioning and deprovisioning — Create a namespace, apply an `ActionsGateway` CR, run a job successfully, then delete the CR. Assert all GMC-owned resources (proxy Deployment, proxy Service, HPA, PodDisruptionBudget, AGC Deployment, RBAC, NetworkPolicy) are removed but the namespace itself remains intact. The platform-owned `ResourceQuota` is not GMC-owned and must survive CR deletion (Q130). Re-apply the CR and verify a fresh gateway and proxy pool come up cleanly and can run jobs again.
- Job failure propagation — Dispatch a workflow with a deliberately failing step (`exit 1`) and assert the GitHub UI correctly reflects the failure status. Verify the worker pod exits non-zero and is still cleaned up within the tenant namespace.
- Worker pod lifecycle cleanup (Tier B, `E2E_AGC_WorkerPodLifecycle`) — Two tenants against fakegithub, run `Serial` (fakegithub session IDs carry no tenant identity, so the suite must not overlap other session-consuming suites). Tenant one uses a fast-exiting worker image and a short `completedPodTTL`: assert its completed worker pod and job Secret are deleted once the TTL elapses. Tenant two uses an unpullable worker image (reserved `.invalid` TLD) and a short `pendingPodDeadline`: assert the genuinely stuck-Pending pod is reaped after the deadline with a `WorkerPodStuckPending` Warning Event on the RunnerGroup. Also assert worker pods carry a controller `OwnerReference` to their RunnerGroup — this is the tier where the live GC controller makes that cascade real.
- Single-use JIT agent self-heal (Tier B, `E2E_AGC_SingleUseSelfHeal`) — With fakegithub's single-use simulation enabled (scoped to this spec's RunnerGroup by owner prefix), run `maxListeners + 1` sequential jobs. Each acquisition consumes its agent's runner record and kills its session; assert each consumed session is torn down and replaced by a fresh one, and that the final job — the one a pre-Q114 AGC could never serve, having burned every listener slot — still produces a worker pod.
- Cross-tenant Secret opacity — After two tenants have completed jobs, assert via namespace inspection that neither tenant's namespace contains Secrets belonging to the other. Assert that the AGC ServiceAccount for tenant A cannot read Secrets in tenant B's namespace.
- Resource cleanup under load — Dispatch 50 sequential jobs across 5 tenants and assert zero pod or Secret leaks remain after all runs complete. Checked by polling all tenant namespaces for residual resources 60 seconds post-completion.
- Proxy HPA scaling — Dispatch a sustained burst of 50 concurrent jobs against a single tenant with `spec.proxy.maxReplicas: 5` and `spec.proxy.minReplicas: 2`. Assert the HPA scales the proxy pool above `minReplicas` during the burst, and that replica count returns to `minReplicas` within 5 minutes of load subsiding. Assert no jobs are dropped during scale-up or scale-down.
- GMC restart resilience — Delete and re-create the GMC pod while tenants are active. Assert that in-flight jobs are not interrupted, the GMC correctly re-derives tenant state on restart, no duplicate resources are created during reconciliation, and the proxy HPAs remain intact.
- AGC restart resilience — Mid-run on a single tenant, delete and re-create the AGC pod. Assert in-flight jobs are not double-acquired, the AGC converges back to the correct goroutine count within one reconcile cycle, and all traffic continues routing through the proxy pool.
- RenewJob under long-running job — Dispatch a workflow that sleeps for 15 minutes. Assert the job completes successfully, confirming the RenewJob loop correctly kept the lock alive across multiple renewal cycles without GitHub cancelling the job.
- Rolling AGC upgrade — Start a steady stream of dispatched workflows against a single tenant, then patch the AGC Deployment image to a new tag mid-flight. Assert the upgrade completes the rolling update successfully, in-flight long polls drop and reconnect (with no duplicated job acquires observed via the broker's audit log), per-job RenewJob loops resume after the new pod starts, and total workflow success rate over the upgrade window stays above 95%. Assert that jobs whose lock expired during the blackout are redelivered and complete on retry.
- GitHub IP range reconciliation — Simulate a GitHub meta API response with a new IP range not present in the existing NetworkPolicy. Trigger a GMC reconcile cycle and assert the proxy pod NetworkPolicy egress rules are updated to include the new range. Assert that `spec.proxy.managedNetworkPolicy: false` suppresses the update.
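
The IP range reconciliation case above hinges on detecting a fetched range that the NetworkPolicy does not yet allow. A minimal sketch of that comparison follows. The function name `missingCIDRs` is hypothetical, the addresses are documentation ranges rather than real GitHub ranges, and the sketch compares normalized CIDR strings only; it does not detect a new range that is already covered by a wider existing one.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// missingCIDRs returns the fetched ranges that the policy does not yet
// contain, normalized to network form, de-duplicated, and sorted.
func missingCIDRs(policy, fetched []string) ([]string, error) {
	have := make(map[string]bool, len(policy))
	for _, c := range policy {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("policy CIDR %q: %w", c, err)
		}
		have[n.String()] = true
	}
	var missing []string
	for _, c := range fetched {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			return nil, fmt.Errorf("fetched CIDR %q: %w", c, err)
		}
		if key := n.String(); !have[key] {
			have[key] = true // also drops duplicates within fetched
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

func main() {
	policy := []string{"192.0.2.0/24"}
	fetched := []string{"192.0.2.7/24", "198.51.100.0/24", "198.51.100.0/24"}
	missing, err := missingCIDRs(policy, fetched)
	if err != nil {
		panic(err)
	}
	fmt.Println(missing) // a non-empty result means the policy needs a patch
}
```

An e2e assertion can reuse the same comparison in reverse: after the reconcile, the set of missing ranges computed from the live NetworkPolicy should be empty.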