
Auto-Remediation Architecture

Overview

Auto-Remediation (T3.7) turns a finding from any of the scanner tiers into an Infrastructure-as-Code patch and routes it through the delivery channel you pick. Four modes are available, each suited to a different point in your rollout. Pick one per tenant on the Remediation Settings page in the dashboard:

  • Dry-run: patches are generated and audit-logged but never auto-applied. An operator marks each patch applied or dismisses it manually. (Default; safe.)
  • Pull Request: every new patch opens a PR (GitHub) or MR (GitLab) in the repository you nominate. Your reviewers merge.
  • Approval queue: patches land in a pending-approval inbox; an admin clicks Approve to promote them.
  • Auto-apply: patches that do not require review apply automatically (kubectl / terraform). Templates flagged requires_review still route through the approval queue.
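
The routing rules in the list above can be sketched as a small function. This is a minimal illustration, not the actual worker code: the mode strings follow the values shown in the Helm example later on this page, while the returned channel labels (audit_log_only, open_pr, approval_queue, apply) are hypothetical names chosen for the sketch.

```python
from enum import Enum


class Mode(Enum):
    DRY_RUN = "dry-run"
    PULL_REQUEST = "pr"
    APPROVAL = "approval"
    AUTO = "auto"


def route_patch(mode: Mode, requires_review: bool) -> str:
    """Return a (hypothetical) delivery-channel label for a freshly rendered patch."""
    if mode is Mode.DRY_RUN:
        return "audit_log_only"      # generated and logged, never applied
    if mode is Mode.PULL_REQUEST:
        return "open_pr"             # reviewers merge in the customer repo
    if mode is Mode.APPROVAL:
        return "approval_queue"      # an admin approves each patch
    # Auto-apply: templates flagged requires_review still need an admin.
    return "approval_queue" if requires_review else "apply"
```

For example, route_patch(Mode.AUTO, requires_review=True) falls back to the approval queue, which is the one case where auto-apply does not apply anything on its own.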

The same pipeline runs in three deployment shapes — pick whichever matches your environment trust model.


The three tiers

Tier 1 — SaaS-side generator (default; works out of the box)

The SaaS-side generator is a worker inside our core-backend service. It polls tier3_findings for new rows, looks up your tenant's remediation_settings, renders the patch, and writes it to remediation_patches. It keeps a strict zero-knowledge (ZK) posture: the SaaS side has no plaintext access to the encrypted finding payload, so patch bodies use REPLACE_ME placeholders for resource names, which your operator substitutes during review.

Data flow (Tier 1):

  1. Tentacle (per-node agent, customer nodes): detects IOCs / processes / syscalls, AES-GCM encrypts sensitive fields, and ships ciphertext only.
  2. Ingester (EchelonGraph SaaS): receives findings over gRPC + NATS JetStream.
  3. tier3_findings (Postgres): stores the finding with encrypted_payload BYTEA.
  4. Generator (EchelonGraph SaaS, core-backend worker): looks up remediation_settings, picks a template from the 11-rule catalogue, and renders the body with REPLACE_ME placeholders.
  5. remediation_patches (Postgres): the patch row is written with source = 'saas'.
  6. Remediation Center UI (dashboard): the operator reviews, then marks applied or dismisses.

Best for: every customer's first week. No infrastructure changes needed; review patches in the dashboard, copy bodies into your own change-management process.
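
To make the REPLACE_ME behaviour concrete, here is a minimal sketch of rendering a patch without plaintext context. The template body, the field name namespace, and both function names are assumptions made for illustration; the real 11-rule catalogue is not reproduced here.

```python
from string import Template

PLACEHOLDER = "REPLACE_ME"

# Hypothetical template; stands in for one entry of the real catalogue.
DEFAULT_DENY_TEMPLATE = Template(
    "apiVersion: networking.k8s.io/v1\n"
    "kind: NetworkPolicy\n"
    "metadata:\n"
    "  name: default-deny\n"
    "  namespace: $namespace\n"
    "spec:\n"
    "  podSelector: {}\n"
    "  policyTypes: [Ingress, Egress]\n"
)


def render_blind(template: Template, field_names: list[str]) -> str:
    """Render without plaintext: every resource-name field becomes REPLACE_ME."""
    return template.substitute({name: PLACEHOLDER for name in field_names})


def unresolved_placeholders(body: str) -> int:
    """Count placeholders the operator still has to substitute during review."""
    return body.count(PLACEHOLDER)
```

A review step could refuse to mark a patch as applied while unresolved_placeholders(body) is greater than zero, so a half-edited body never reaches change management.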

Tier 2 — Pull Request connector (your repo + PAT)

Same generator, but when Mode = Pull Request the worker opens a PR/MR in your repository as soon as the patch is rendered. The personal access token (PAT) lives in GCP Secret Manager under a deterministic name (remediation-{tenant_id}-{provider}-pat); only the secret resource name lives in our database. After the initial paste, the PAT is never sent back to the dashboard.

Data flow (Tier 2):

  1. Generator (EchelonGraph SaaS): with mode = pr, fetches the tenant's settings.
  2. remediation_settings (Postgres): the per-tenant config row holds mode, github_default_repo, github_api_base (for self-hosted GHE), and github_token_secret (the secret resource name).
  3. Secret Manager (GCP): secret remediation-{tenant}-{provider}-pat. The PAT plaintext is stored here only, rotation calls addVersion, and only the resource name is kept in Postgres.
  4. GitHubClient / GitLabClient (shared package, shared/pkg/remediation): TLS 1.2 floor with retry/backoff, 256 KiB body cap with panic recovery, and an APIBase that honours the self-hosted hostname.
  5. Your GitHub / GitLab (customer Git): github.com, github.<corp>.com/api/v3, or gitlab.<corp>.com/api/v4.

Best for: customers who already gate infrastructure changes through PR review. Patches arrive as merge requests with the rollback snippet pre-populated in the description.
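
The deterministic secret name described in this tier can be sketched as follows. The provider slugs github and gitlab and both function names are assumptions for illustration; the projects/{project}/secrets/{id} shape and the allowed secret ID characters follow Secret Manager's naming rules.

```python
import re

SUPPORTED_PROVIDERS = {"github", "gitlab"}  # assumed provider slugs


def pat_secret_id(tenant_id: str, provider: str) -> str:
    """Build the deterministic secret ID for a tenant's PAT."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    secret_id = f"remediation-{tenant_id}-{provider}-pat"
    # Secret Manager secret IDs allow letters, digits, hyphens and
    # underscores, up to 255 characters.
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,255}", secret_id):
        raise ValueError(f"invalid secret id: {secret_id!r}")
    return secret_id


def pat_secret_resource(project_id: str, tenant_id: str, provider: str) -> str:
    """Full resource name: the only token-related value kept in Postgres."""
    return f"projects/{project_id}/secrets/{pat_secret_id(tenant_id, provider)}"
```

Because the name is derived purely from the tenant and provider, the generator can locate the secret without the database ever holding the token itself.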

Tier 3 — Agent-side full path (master agent applies in your cluster)

The Master agent in your cluster runs the remediation engine locally. It scans Kubernetes for misconfigurations (e.g. a missing default-deny NetworkPolicy), renders the patch, optionally applies it via kubectl / terraform, and reports the outcome to SaaS for audit through the existing SubmitRemediationOutcome gRPC. SaaS only ever sees the audit row: no patch bodies and no resource names, unless your operator chose to apply, in which case the audit log reflects what actually ran.

Data flow (Tier 3):

Inside the customer K8s cluster (Master pod):

  1. K8s Scanner: periodically lists namespaces and NetworkPolicies.
  2. Engine.Remediate: renders the patch with plaintext context.
  3. Applier (kubectl / terraform): executes inside the customer network.
  4. GRPCAuditWriter: reports through the SubmitRemediationOutcome RPC.

Inside EchelonGraph SaaS:

  5. Ingester: validates and publishes to NATS.
  6. Processor consumer: upserts remediation_patches.
  7. remediation_patches (Postgres): rows carry source = 'agent' and applied_via = kubectl|terraform|pr.
  8. Remediation Center UI (dashboard): shows the row with a green AGENT badge.

Enable via Helm:

master:
  remediation:
    enabled: true
    mode: dry-run    # or approval / pr / auto
    autoApply: false
    pollInterval: 5m

Best for: regulated environments where sending plaintext resource data outside the cluster is unacceptable, or air-gapped clusters where the Master must apply patches itself. The dashboard shows agent-produced rows with a green AGENT badge so you can audit what ran.
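
The rule that SaaS only sees patch details when a patch was actually applied can be sketched as below. This is an illustration under stated assumptions: the RemediationOutcome type, the rule_id field, and the status value "generated" are hypothetical, while source = 'agent' and the applied_via values come from the data flow above. It does not reproduce the real SubmitRemediationOutcome message.

```python
from dataclasses import dataclass
from typing import Optional

APPLIED_VIA = {"kubectl", "terraform", "pr"}


@dataclass
class RemediationOutcome:
    rule_id: str
    status: str                        # "generated" (hypothetical) or "applied"
    source: str = "agent"
    applied_via: Optional[str] = None
    body: Optional[str] = None         # only populated when the patch ran


def build_outcome(rule_id: str, body: str,
                  applied_via: Optional[str] = None) -> RemediationOutcome:
    """Build the audit record the agent would report to SaaS."""
    if applied_via is None:
        # Nothing ran: report the audit row only, without body or names.
        return RemediationOutcome(rule_id=rule_id, status="generated")
    if applied_via not in APPLIED_VIA:
        raise ValueError(f"unknown applied_via: {applied_via!r}")
    # The operator chose to apply, so the record reflects what actually ran.
    return RemediationOutcome(rule_id=rule_id, status="applied",
                              applied_via=applied_via, body=body)
```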


Self-hosted Git Enterprise (Walmart, Coupang, JPMorgan, …)

Many enterprises don't put their infrastructure code on github.com. Common patterns:

  • GitHub Enterprise Server at github.<corp>.com (e.g. github.walmart.com, github.coupang.net)
  • GitLab self-hosted (Omnibus / Helm / Dedicated) at gitlab.<corp>.com
  • Bitbucket Data Center — not yet supported (open a feature request)

The Pull Request connector handles GHE and GitLab self-hosted natively. Each server exposes the same REST API as its public counterpart, just at a customer-controlled hostname:

| Platform | Public default | Self-hosted format |
|---|---|---|
| GitHub.com | https://api.github.com | |
| GitHub Enterprise Server | | https://<hostname>/api/v3 |
| GitLab.com | https://gitlab.com/api/v4 | |
| GitLab self-hosted | | https://<hostname>/api/v4 |

In Remediation Settings → GitHub connector (or GitLab connector), paste the API base URL into the optional API base URL field. Leave it blank for public GitHub.com / GitLab.com. Make sure the PAT you paste was issued by the same instance: a github.com token won't authenticate against github.walmart.com, and vice versa.
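
The blank-means-public rule for the API base URL field can be sketched as a small resolver. The function name and the https-only check are assumptions for this sketch; the default URLs are the public ones listed above.

```python
from urllib.parse import urlparse

PUBLIC_API_BASE = {
    "github": "https://api.github.com",
    "gitlab": "https://gitlab.com/api/v4",
}


def resolve_api_base(provider: str, configured: str = "") -> str:
    """Pick the API base: blank means the public service, else the given host."""
    configured = configured.strip()
    if not configured:
        return PUBLIC_API_BASE[provider]
    parsed = urlparse(configured)
    # Assumption: self-hosted instances must be reached over HTTPS.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"API base must be an https URL, got {configured!r}")
    return configured.rstrip("/")
```

For a GHE instance, resolve_api_base("github", "https://github.acme.com/api/v3/") yields the URL without the trailing slash, so request paths can be appended consistently.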

Required PAT scopes:

  • GitHub: repo (private repo + branch + commit + PR)
  • GitLab: api

Network reachability: SaaS-side Tier 2 calls these APIs from our Cloud Run egress. If your enterprise instance sits behind a corporate firewall and is not reachable from the public internet, you have two options:

  1. Allow-list our Cloud Run egress IPs (we can provide them under NDA).
  2. Switch to Tier 3 (agent-side). The Master pod runs inside your network and reaches your Git server directly, so no allow-listing of our egress IPs is needed.

Where secrets live

| Item | Storage | Plaintext exposed to SaaS? |
|---|---|---|
| GitHub / GitLab PAT | GCP Secret Manager, name remediation-{tenant}-{provider}-pat | Only when submitted (HTTPS), never at rest |
| Patch body (Tier 1 SaaS) | remediation_patches.body (Postgres) | Yes (REPLACE_ME placeholders only, no resource names) |
| Patch body (Tier 3 Agent) | remediation_patches.body (Postgres) | Yes; your operator chose to delegate apply, so the audit row reflects what ran |
| Customer infrastructure code | Your GitHub / GitLab | No; we open a PR and your reviewer merges |

Tokens are stored in Secret Manager rather than the database so that a database compromise does not expose them. The settings UI accepts the PAT once via HTTPS POST; the backend writes it via the Secret Manager v1 addVersion API and stores only the secret resource name in Postgres.
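
The write path described above (token into the secret store, resource name into the settings row) can be sketched with an in-memory stand-in for Secret Manager. Everything here is illustrative: InMemorySecretStore, save_pat, and the gitlab_token_secret column are hypothetical, and only github_token_secret and the secret naming pattern come from this page.

```python
class InMemorySecretStore:
    """Stand-in for Secret Manager: each secret keeps an ordered version list."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}

    def add_version(self, name: str, payload: str) -> int:
        """Append a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(payload)
        return len(self._versions[name])

    def access_latest(self, name: str) -> str:
        return self._versions[name][-1]


def save_pat(store: InMemorySecretStore, settings_row: dict,
             tenant: str, provider: str, pat: str) -> None:
    """Write the PAT to the secret store; keep only its name in the settings row."""
    name = f"remediation-{tenant}-{provider}-pat"
    store.add_version(name, pat)
    # The database row never holds the token, only a pointer to it.
    settings_row[f"{provider}_token_secret"] = name
```

Saving a second token for the same tenant adds a new version under the same name, so a reader that always asks for the latest version picks up the rotated token without any change to the settings row.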


Choosing a mode (decision tree)

  • Are you on day 1?
      • yes: dry-run (just observe + audit-log)
      • no: Have you tuned out the false-positive templates?
          • no: approval queue (admin gates each apply)
          • yes: Where do infra changes normally live?
              • in your repo: Pull Request (reviewers merge in GitHub / GitLab)
              • in the cluster: Tier 3 agent (Master applies in your cluster), running in approval mode (admin gate) or auto mode (zero-touch)
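
The decision tree can also be read as a short function. This is only a sketch of the questions above; the parameter names and the returned labels are hypothetical.

```python
def choose_mode(day_one: bool, templates_tuned: bool, infra_lives_in: str) -> str:
    """Walk the decision tree and return a suggested setup label."""
    if day_one:
        return "dry-run"          # just observe + audit-log
    if not templates_tuned:
        return "approval"         # an admin gates each apply
    if infra_lives_in == "repo":
        return "pr"               # reviewers merge in GitHub / GitLab
    if infra_lives_in == "cluster":
        return "tier3-agent"      # the Master applies in your cluster
    raise ValueError(f"expected 'repo' or 'cluster', got {infra_lives_in!r}")
```

For instance, a tenant past its first week with tuned templates and infrastructure code in Git gets choose_mode(False, True, "repo"), which suggests Pull Request mode.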

End-to-end walkthrough — first PR-mode patch

  1. Open Remediation Settings in the dashboard (admin role required).
  2. Choose Mode = Pull Request.
  3. Under GitHub connector, paste:
      • Default repository: acme-corp/infrastructure
      • Base branch: main
      • API base URL: leave blank for github.com, or paste https://github.acme.com/api/v3 for GHE
      • Personal access token: ghp_… (the dashboard never displays it again)
  4. Click Save settings. The backend writes the PAT to Secret Manager and stores the resource name in remediation_settings.
  5. Wait for the next finding (the validation cluster fires T3.6-IOC-DOMAIN every few seconds for testing).
  6. Within ~30 seconds, the generator polls tier3_findings, renders the patch, fetches the PAT from Secret Manager, and opens a PR.
  7. The Remediation Center row flips to pr_opened with a working PR # link.
  8. Your reviewer merges; the dashboard's per-row "Mark applied" button records the apply.

If anything fails (e.g. PAT lacks repo scope, branch already exists), the row flips to failed with the GitHub error message in error_message so you can debug from the drawer.
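
The two outcomes of the PR step, pr_opened on success and failed with error_message otherwise, can be sketched as follows. PullRequestError, deliver_as_pr, the pr_url key, and the two demo openers are hypothetical; only the status values and the error_message field come from this page.

```python
class PullRequestError(Exception):
    """Raised by a (hypothetical) Git client when the provider rejects the call."""


def deliver_as_pr(row: dict, open_pr) -> dict:
    """Try to open a PR for the patch row and record the result on the row."""
    try:
        pr_url = open_pr(row)
    except PullRequestError as exc:
        row["status"] = "failed"
        row["error_message"] = str(exc)   # shown in the drawer for debugging
    else:
        row["status"] = "pr_opened"
        row["pr_url"] = pr_url
    return row


# Demo openers standing in for a real GitHub / GitLab client.
def opener_ok(row: dict) -> str:
    return "https://git.example/pr"


def opener_branch_exists(row: dict) -> str:
    raise PullRequestError("branch already exists")
```

Keeping the provider's message verbatim on the row means the operator sees the same text the Git server returned, rather than a generic failure.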


Operational notes

  • Per-tenant matched_count: duplicate detections collapse onto one row; the row shows "matched 12× · last 2m ago" so you don't chase ghosts.
  • Rotation: paste a new PAT into the same field on Remediation Settings. The backend calls Secret Manager addVersion, so the generator always reads the latest version. Old versions remain accessible in Secret Manager for audit; you can disable them afterwards.
  • Rollback: every patch has a stored reverse snippet. The "Rollback" button on status='applied' rows runs the reverse (or, on Tier 1, hands you the snippet to run yourself).
  • Dismiss: false-positive patches go to status='dismissed' with a reason field, so subsequent re-detections don't clutter the list again.
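
The matched_count and dismiss behaviour can be sketched as an upsert keyed on the detection. The key (tenant, rule, resource), the last_matched_at field, and the initial status "pending" are assumptions made for this sketch, not the actual schema.

```python
from datetime import datetime, timezone
from typing import Optional


def record_detection(patches: dict, tenant_id: str, rule_id: str,
                     resource_key: str,
                     now: Optional[datetime] = None) -> dict:
    """Collapse repeated detections onto one row and bump its counter."""
    now = now or datetime.now(timezone.utc)
    key = (tenant_id, rule_id, resource_key)   # assumed dedup key
    row = patches.get(key)
    if row is None:
        row = {"status": "pending", "matched_count": 0}
        patches[key] = row
    row["matched_count"] += 1
    row["last_matched_at"] = now
    # The status is left alone, so a dismissed row stays dismissed on re-detection.
    return row
```

Twelve detections of the same finding produce a single row with matched_count equal to 12, which is what a "matched 12×" label would be rendered from.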