6. Implementation Phasing & Delivery Milestones
The system is delivered across five milestones over roughly five weeks. Each milestone produces a deliverable and a verifiable success criterion; later milestones build on the artifacts of earlier ones (the probe binary becomes the AGC's polling implementation, the decrypted payload becomes the test fixture for the worker pod, and so on). Operators who prefer to leverage AI-assisted implementation can consult Appendix C for prompting guidance and a discussion of the trade-offs — that material is optional and orthogonal to the milestone structure itself.
| Phase | Days | Focus |
|---|---|---|
| 1 — API Probe | 1–4 | Wire protocol · Auth + decrypt · Broker fixtures |
| 2 — AGC Core | 5–10 | RunnerGroup CRD · Goroutine loop · AGC CRUD safe |
| 3 — Worker | 11–16 | Pipe wrapper · Dockerfile · E2E smoke test |
| 4 — GMC + Proxy | 17–22 | ActionsGateway CRD · GMC reconciler · Proxy + HPA deploy |
| 5 — Harden & Ship | 23–26 | Security policies · Multi-tenant load · 1000-session burst |
Milestone 1: Wire Protocol Probe (Days 1–4)
See docs/plan/milestone-1.md for the implementation plan and current status.
- Deliverable: A standalone Go binary under `cmd/probe/` that runs the full pre-execution sequence: authenticate via GitHub App credentials → `POST /sessions` → long-poll `GET /message` → `POST /acquirejob` on the `run_service_url` extracted from the message body → start a `renewjob` loop every 60 seconds. The probe prints the decrypted job payload to stdout and continues renewing until cancelled. The decrypted payload is committed as a fixture under `testdata/`.
- Success Criteria: The probe acquires a real job and renews its lock at least three times without GitHub cancelling it. The committed payload becomes the ground-truth fixture for all subsequent integration tests.
- Investigation (`AcknowledgeRunnerRequest`): The official runner source (`MessageListener.cs`) calls `AcknowledgeRunnerRequestAsync(runnerRequestId, sessionId)` after handing a job to the worker. This endpoint is not documented and its role is unclear: it may be the broker API's replacement for the old `DeleteMessage` call, or a no-op acknowledgment. The probe should attempt this call after a successful `acquirejob` and observe whether omitting it causes any downstream issue (e.g., the same job being redelivered, or session errors). If confirmed necessary, add it to the execution flow in §4.2 and §3.3.
- Risk investigation (egress IP variance): Before finishing this milestone, route the probe's broker API calls through a local two-pod proxy pool (two `httptest`-backed `CONNECT` proxies bound to different ports, simulating different egress IPs) and verify that `sessions`, `message`, `acquirejob`, and `renewjob` calls all succeed when each call lands on a different proxy. If any call fails or returns an unexpected status, pause and evaluate before proceeding to the proxy pool design in Milestone 4. The fallback options are `sessionAffinity: ClientIP` on the proxy Service (low effort) or per-goroutine proxy assignment (higher fidelity).
- Investigation (session reuse after `acquirejob`): The adaptive listener model in §2.2 requires that a goroutine can call `GET /message` again on the same `sessionId` immediately after a successful `POST /acquirejob`. Using the probe, acquire a job and then, without calling `DELETE /sessions`, immediately re-enter the `GET /message` long-poll on the same session. Confirm whether GitHub accepts the renewed poll and eventually delivers a second job, or returns a session error (e.g. 404, 410, or a protocol-level rejection). If session reuse is not permitted, the AGC must call `DELETE /sessions` and `POST /sessions` between each job, adding registration latency to every acquisition cycle. Document the observed behavior in §3.3 and adjust the Session Multiplexer implementation plan in Milestone 2 accordingly.
- Investigation (job delivery throttling by session count): Confirm whether GitHub's broker delivers jobs only to sessions that were registered before the job was queued, or whether it will deliver to any session that starts polling after the job arrives. To test: queue two jobs simultaneously with only one session registered, then register a second session mid-queue and observe whether both sessions receive a job. If GitHub throttles delivery to the registered session count at queue time, the adaptive spawn-on-acquire model may miss jobs during bursts (a job arrives while the replacement listener's `POST /sessions` call is in flight). If GitHub delivers opportunistically to any ready session, the model is safe. If throttling is confirmed, evaluate pre-spawning a small warm standby pool (2–3 sessions per RunnerGroup) as a mitigation, and update Appendix E accordingly.
Milestone 2: AGC Controller & Reconciler (Days 5–10)
See docs/plan/milestone-2.md for the implementation plan and current status.
- Deliverable: A deployable AGC scaffolded with `controller-runtime`/`kubebuilder` that reconciles `RunnerGroup` CRs into adaptive listener goroutine pools: one permanent goroutine at rest, spawning additional goroutines on demand up to `maxListeners` during bursts, with idle goroutines shutting down once the queue drains. Includes the Token Manager (mutex-protected installation token with T-5min proactive refresh), the per-job RenewJob loop, the `maxWorkers` simple pod-count ceiling (enforced without PriorityClass when `priorityTiers` is absent), and the polling implementation lifted from the Milestone 1 probe. Unit tests cover create/update/delete lifecycle transitions, goroutine spawn/kill/leak detection (via `goleak.VerifyNone`), idle-shutdown triggering after 50 consecutive empty 202 responses, and clock-based assertions that token refresh fires before expiry without interrupting in-flight goroutines.
- Success Criteria: Creating a `RunnerGroup`, scaling its `maxListeners`, and deleting it in a local `kind` cluster produces no goroutine leaks (verified via `pprof` and `goleak`) and no orphaned Kubernetes resources. The goroutine count at rest is exactly one per RunnerGroup regardless of the `maxListeners` setting. Token refresh completes without goroutines restarting.
Milestone 3: Worker Pod & Pipe Handoff (Days 11–16)
See docs/plan/milestone-3.md for the implementation plan and current status.
- Deliverable: A worker container image plus the pod-provisioning logic in the AGC: Dockerfile, entrypoint wrapper (Go binary that reads the mounted payload Secret and writes to Named Pipes), Secret mount logic, and the `AcquireJob` → pod-create handoff sequence. The Named Pipe handoff is the underdocumented part of this milestone. Start by validating the wrapper with the static decrypted payload from Milestone 1 before wiring it into pod creation, so the pipe semantics can be debugged without a live GitHub trigger in the loop.
- Success Criteria: A real workflow job dispatched from GitHub appears in the Actions UI with correct step output, timing, and a green checkmark. The worker container exits with code `0` on success, and both the pod and the job Secret are garbage-collected by the AGC.
Milestone 4: Gateway Manager Controller + Proxy (Days 17–22)
See docs/plan/milestone-4.md for the implementation plan and current status.
- Deliverable: A second operator (`cmd/gmc/`) sharing the repo with the AGC, reconciling `ActionsGateway` CRs into the full tenant resource set: ServiceAccount, Role, RoleBinding, NetworkPolicy, ResourceQuota, proxy Deployment, proxy Service, PodDisruptionBudget, HPA, AGC Deployment, and bootstrap RunnerGroups. Includes a minimal stateless Go `CONNECT` proxy implementation (HTTPS tunneling only, no TLS termination), the admission webhook that rejects CRs in reserved namespaces, and `HTTP_PROXY`/`HTTPS_PROXY`/`NO_PROXY` injection into both the AGC Deployment and the worker pod template. An RBAC test enumerates the GMC's generated ClusterRole rules and asserts no `*` verbs on `secrets`, `pods`, or `nodes` (regression guard against accidental escalation).
- Success Criteria: Applying two `ActionsGateway` CRs in a `kind` cluster produces two independent tenant namespaces, each with a running AGC and a proxy pool with at least `minReplicas` Ready. Deleting one CR tears down only that tenant's resources. Updating `spec.proxy.maxReplicas` causes the HPA to reflect the new bound within one reconcile cycle.
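The RBAC regression guard described above reduces to a small pure function over the generated rules. The sketch below uses a local `PolicyRule` struct as a stand-in for the `Verbs` and `Resources` fields of `rbacv1.PolicyRule`, so it compiles without Kubernetes dependencies. `wildcardViolations` is a hypothetical helper name, and treating a `*` resource as covering the guarded resources is an assumption of this sketch.

```go
package main

// PolicyRule is a local stand-in mirroring the Verbs and Resources fields of
// rbacv1.PolicyRule; a real test would range over the generated ClusterRole.
type PolicyRule struct {
	Verbs     []string
	Resources []string
}

// guarded lists the resources that must never be granted wildcard verbs.
var guarded = map[string]bool{"secrets": true, "pods": true, "nodes": true}

// contains reports whether list includes s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// wildcardViolations returns every guarded resource (or "*" resource entry,
// which implicitly covers the guarded ones) that some rule pairs with the
// "*" verb. An empty result means the guard passes.
func wildcardViolations(rules []PolicyRule) []string {
	var out []string
	for _, r := range rules {
		if !contains(r.Verbs, "*") {
			continue
		}
		for _, res := range r.Resources {
			if res == "*" || guarded[res] {
				out = append(out, res)
			}
		}
	}
	return out
}
```

A unit test would build the ClusterRole exactly as the GMC does, convert its rules, and fail if `wildcardViolations` returns anything. Subresources such as `pods/exec` are not matched here and would need an explicit decision.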
Milestone 5: Hardening & Load Testing (Days 23–26)
See docs/plan/milestone-5.md for the implementation plan and current status.
- Deliverable: Production Helm chart or Kustomize overlays with locked-down Pod Security Standards, per-tenant ResourceQuotas, optional gVisor `RuntimeClass` configuration (see Appendix B), hardened proxy pod specs (read-only root filesystem, no capabilities), and a multi-tenant load harness under `test/load/` that simulates multiple tenants in parallel against a staging cluster with the full GMC+AGC stack running.
- Success Criteria: 1,000 concurrent virtual runner sessions across 10 simulated tenants sustain a burst load with zero dropped messages, no cross-tenant resource visibility, and no deadlocked Go channels. The proxy HPA scales up under load and returns to `minReplicas` within 5 minutes of load subsiding. A `kube-bench` or `polaris` scan returns no critical findings.
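The counting side of the burst criterion (10 tenants, 1,000 sessions, zero drops) can be sketched as a small fan-out helper. Everything named here is hypothetical: `runBurst`, `sessionFunc`, and the `errDropped` sentinel are illustrative, and a real harness under `test/load/` would plug an actual session driver into the `run` hook.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// errDropped is a sentinel a session driver can return when a message that
// should have been delivered never arrived.
var errDropped = errors.New("message dropped")

// sessionFunc drives one virtual runner session for the given tenant and
// reports whether it completed. Hypothetical hook for the load harness.
type sessionFunc func(tenant, session int) error

// runBurst starts tenants*perTenant sessions concurrently, waits for all of
// them, and returns the per-tenant success counts plus the total number of
// failed sessions.
func runBurst(tenants, perTenant int, run sessionFunc) (ok []int64, failed int64) {
	ok = make([]int64, tenants)
	var wg sync.WaitGroup
	for t := 0; t < tenants; t++ {
		for s := 0; s < perTenant; s++ {
			wg.Add(1)
			go func(t, s int) {
				defer wg.Done()
				if err := run(t, s); err != nil {
					atomic.AddInt64(&failed, 1)
					return
				}
				atomic.AddInt64(&ok[t], 1)
			}(t, s)
		}
	}
	wg.Wait()
	return ok, failed
}
```

Calling `runBurst(10, 100, driver)` matches the shape of the success criterion: the run passes only if `failed` is zero and every tenant's count equals 100. Wrapping the call in a test with a deadline would also surface a deadlocked channel as a timeout rather than a hang.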