Auto-Remediation Architecture

Overview

Auto-Remediation (T3.7) turns a finding from any of the scanner tiers into an Infrastructure-as-Code patch and routes it through the delivery channel you pick. Four modes are useful at different points in your rollout. Pick one per tenant from the Remediation Settings page in the dashboard:

Dry-run — patches are generated and audit-logged, never auto-applied. Operator marks-applied or dismisses manually. (Default; safe.)
Pull Request — every new patch opens a PR (GitHub) or MR (GitLab) in the repository you nominate. Your reviewers merge.
Approval queue — patches land in a pending-approval inbox; an admin clicks Approve to promote.
Auto-apply — non-review-required patches apply automatically (kubectl / terraform). Templates flagged requires_review still route through the approval queue.
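
The routing rules in the list above can be sketched as a small function. This is an illustrative sketch only: the destination labels and the `route_patch` name are hypothetical, and the mode strings are borrowed from the Helm values shown later on this page.

```python
from enum import Enum


class Mode(Enum):
    DRY_RUN = "dry-run"
    PR = "pr"
    APPROVAL = "approval"
    AUTO = "auto"


def route_patch(mode: Mode, requires_review: bool) -> str:
    """Return a (hypothetical) destination label for a freshly rendered patch."""
    if mode is Mode.DRY_RUN:
        return "audit_log_only"        # generated and audit-logged, never applied
    if mode is Mode.PR:
        return "open_pull_request"     # PR (GitHub) or MR (GitLab)
    if mode is Mode.APPROVAL:
        return "approval_queue"        # an admin clicks Approve
    # Auto-apply: templates flagged requires_review still go to the queue.
    return "approval_queue" if requires_review else "auto_apply"
```

Note that only the Auto-apply branch consults `requires_review`; in the other three modes a human is already in the loop.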

The same pipeline runs in three deployment shapes — pick whichever matches your environment trust model.

The three tiers

Tier 1 — SaaS-side generator (default; works out of the box)

The SaaS-side generator is a worker inside our core-backend service. It polls tier3_findings for new rows, looks up your tenant's remediation_settings, renders the patch, and writes it to remediation_patches. Strict-ZK posture: SaaS has no plaintext access to the encrypted finding payload, so patch bodies use REPLACE_ME placeholders for resource names — your operator substitutes them during review.

🛰️ CUSTOMER NODES · Tentacle: per-node agent
• Detects IOCs / processes / syscalls
• AES-GCM encrypts sensitive fields
• Ships ciphertext only

📨 ECHELONGRAPH SAAS · Ingester: gRPC + NATS JetStream

🗄️ POSTGRES · tier3_findings: encrypted_payload BYTEA

⚙️ ECHELONGRAPH SAAS · Generator: core-backend worker
• Looks up remediation_settings
• Picks template from 11-rule catalogue
• Renders body with REPLACE_ME placeholders

🩹 POSTGRES · remediation_patches: source = 'saas'

🖥️ DASHBOARD · Remediation Center UI: operator reviews + marks-applied / dismisses

Best for: every customer's first week. No infrastructure changes needed; review patches in the dashboard, copy bodies into your own change-management process.
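
To make the REPLACE_ME behaviour of Tier 1 concrete, here is a minimal sketch of rendering a template without plaintext context. The template text, field names, and helper names are hypothetical; the real 11-rule catalogue is not shown on this page.

```python
from string import Template

PLACEHOLDER = "REPLACE_ME"

# Hypothetical template, for illustration only.
DEFAULT_DENY_TEMPLATE = Template(
    "apiVersion: networking.k8s.io/v1\n"
    "kind: NetworkPolicy\n"
    "metadata:\n"
    "  name: default-deny\n"
    "  namespace: $namespace\n"
    "spec:\n"
    "  podSelector: {}\n"
    "  policyTypes: [Ingress]\n"
)


def render_saas_side(template: Template, fields: list) -> str:
    """Render with no plaintext: every named field becomes REPLACE_ME."""
    return template.substitute({name: PLACEHOLDER for name in fields})


def has_unresolved(body: str) -> bool:
    """True while the operator still has placeholders to substitute."""
    return PLACEHOLDER in body
```

A check like `has_unresolved` could guard a change-management step so that a body is never applied with placeholders still in it.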

Tier 2 — Pull Request connector (your repo + PAT)

Same generator, but when Mode = Pull Request the worker opens a PR/MR in *your* repository as soon as the patch is rendered. The personal access token (PAT) lives in GCP Secret Manager under a deterministic name (remediation-{tenant_id}-{provider}-pat); only the secret resource name lives in our database. PATs never round-trip through the dashboard after the initial paste.
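
The deterministic secret name can be sketched as follows. The function names and the `project_id` parameter are assumptions for illustration; the `projects/{project}/secrets/{id}` form is the standard Secret Manager resource name, and the character check reflects Secret Manager's documented secret ID rules (letters, digits, hyphens, underscores, up to 255 characters).

```python
import re

# Secret Manager secret IDs: letters, digits, "-" and "_", at most 255 chars.
SECRET_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,255}")


def pat_secret_id(tenant_id: str, provider: str) -> str:
    """Build the deterministic secret ID remediation-{tenant_id}-{provider}-pat."""
    secret_id = f"remediation-{tenant_id}-{provider}-pat"
    if not SECRET_ID_PATTERN.fullmatch(secret_id):
        raise ValueError(f"invalid secret id: {secret_id!r}")
    return secret_id


def pat_secret_resource(project_id: str, tenant_id: str, provider: str) -> str:
    """Resource name of the secret; this string is all the database stores."""
    return f"projects/{project_id}/secrets/{pat_secret_id(tenant_id, provider)}"
```

Because the name is derived from the tenant and provider alone, rotation can add a new version to the same secret without touching the database row.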

⚙️ ECHELONGRAPH SAAS · Generator: mode = pr → fetch settings

🗄️ POSTGRES · remediation_settings: per-tenant config row
• mode
• github_default_repo
• github_api_base ← self-hosted GHE
• github_token_secret (resource name)

🔐 GCP · Secret Manager: remediation-{tenant}-{provider}-pat
• PAT plaintext stored here only
• addVersion on rotation
• Resource name only in PG

📦 SHARED PACKAGE · GitHubClient / GitLabClient: shared/pkg/remediation
• TLS 1.2 floor + retry/backoff
• Body cap 256 KiB + panic recovery
• APIBase honours self-hosted hostname

🌐 CUSTOMER GIT · YOUR GitHub / GitLab: github.com · github.<corp>.com/api/v3 · gitlab.<corp>.com/api/v4

Best for: customers who already gate infrastructure changes through PR review. Patches arrive as merge requests with the rollback snippet pre-populated in the description.
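
The client hardening listed for the shared package (TLS 1.2 floor, retry/backoff, 256 KiB body cap) can be illustrated with standard-library pieces. This is a sketch under assumptions, not the shared package itself: the backoff base and ceiling are invented values, and whether the cap applies to request or response bodies is not stated on this page.

```python
import ssl

MAX_BODY_BYTES = 256 * 1024  # 256 KiB cap from the list above


def tls_context() -> ssl.SSLContext:
    """Default verification settings, refusing anything below TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> list:
    """Exponential backoff schedule in seconds (base and cap are assumed values)."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def check_body_size(body: bytes) -> bytes:
    """Reject bodies larger than the cap before they are sent or parsed."""
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"body is {len(body)} bytes, cap is {MAX_BODY_BYTES}")
    return body
```

The context returned by `tls_context()` can be passed to `urllib.request.urlopen(..., context=ctx)` or an equivalent HTTP client.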

Tier 3 — Agent-side full path (master agent applies in your cluster)

The Master agent in your customer cluster runs the remediation engine locally. It scans Kubernetes for misconfigurations (e.g. a missing default-deny NetworkPolicy), renders the patch, optionally applies it via kubectl / terraform, and audit-reports the outcome to SaaS through the existing SubmitRemediationOutcome gRPC call. SaaS only ever sees the audit row: no patch bodies and no resource names, unless your operator chose to apply, in which case the audit log reflects what actually ran.
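
As an illustration of the kind of check the scanner performs, the sketch below finds namespaces that lack a default-deny ingress NetworkPolicy. It works on plain dictionaries shaped like Kubernetes API objects; the function names are hypothetical and the default-deny test is deliberately simplified (it only recognises an empty `podSelector` with no ingress rules).

```python
def is_default_deny(policy: dict) -> bool:
    """Simplified check: selects every pod and allows no ingress traffic."""
    spec = policy.get("spec", {})
    selects_all_pods = spec.get("podSelector") == {}
    # Simplification: treat an omitted policyTypes as ["Ingress"].
    policy_types = spec.get("policyTypes", ["Ingress"])
    denies_ingress = "Ingress" in policy_types and not spec.get("ingress")
    return selects_all_pods and denies_ingress


def namespaces_missing_default_deny(namespaces: list, policies: list) -> list:
    """Return namespaces with no default-deny policy, preserving input order."""
    covered = {
        p["metadata"]["namespace"] for p in policies if is_default_deny(p)
    }
    return [ns for ns in namespaces if ns not in covered]
```

Each namespace returned would become one finding for the engine to render a patch against, with the real namespace name available because the scan runs inside the cluster.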

Customer K8s Cluster

🧠 MASTER POD · K8s Scanner: periodic List of namespaces + NetworkPolicies

⚙️ MASTER POD · Engine.Remediate: renders patch with PLAINTEXT context

🛠️ MASTER POD · Applier (kubectl / terraform): executes inside customer network

📡 MASTER POD · GRPCAuditWriter: SubmitRemediationOutcome RPC

EchelonGraph SaaS

📨 ECHELONGRAPH SAAS · Ingester: validates + publishes to NATS

⚙️ ECHELONGRAPH SAAS · Processor consumer: upserts remediation_patches

🩹 POSTGRES · remediation_patches: source = 'agent', applied_via = kubectl|terraform|pr

🖥️ DASHBOARD · Remediation Center UI: green AGENT badge

Enable via Helm:

master:
  remediation:
    enabled: true
    mode: dry-run    # or approval / pr / auto
    autoApply: false
    pollInterval: 5m

Best for: regulated environments where exfiltration is unacceptable, or air-gapped clusters where the Master must apply patches itself. The dashboard shows agent-produced rows with a green AGENT badge so you can audit what ran.

Self-hosted Git Enterprise (Walmart, Coupang, JPMorgan, …)

Many enterprises don't put their infrastructure code on github.com. Common patterns:

GitHub Enterprise Server at github.<corp>.com (e.g. github.walmart.com, github.coupang.net)
GitLab self-hosted (Omnibus / Helm / Dedicated) at gitlab.<corp>.com
Bitbucket Data Center — not yet supported (open a feature request)

The Pull Request connector handles GHE and GitLab self-hosted natively. Each server exposes the same REST API as its public counterpart, just at a customer-controlled hostname:

Platform	Public default	Self-hosted format
GitHub.com	`https://api.github.com`	n/a
GitHub Enterprise Server	n/a	`https://github.<corp>.com/api/v3`
GitLab.com	`https://gitlab.com/api/v4`	n/a
GitLab self-hosted	n/a	`https://gitlab.<corp>.com/api/v4`

In Remediation Settings → GitHub connector (or GitLab), paste the API base URL into the optional API base URL field. Leave blank for public GitHub.com / GitLab.com. Make sure the PAT you paste was issued by the same instance — a github.com token won't authenticate against github.walmart.com and vice versa.
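
The blank-means-public rule can be sketched as a small URL builder. The helper names are hypothetical; the endpoint paths are the public REST routes for creating a pull request on GitHub (`/repos/{owner}/{repo}/pulls`) and a merge request on GitLab (`/projects/{url-encoded path}/merge_requests`).

```python
from urllib.parse import quote

PUBLIC_API_BASE = {
    "github": "https://api.github.com",
    "gitlab": "https://gitlab.com/api/v4",
}


def resolve_api_base(provider: str, api_base: str = "") -> str:
    """Blank field falls back to the public default; trailing slashes are dropped."""
    if provider not in PUBLIC_API_BASE:
        raise ValueError(f"unsupported provider: {provider!r}")
    return (api_base.strip() or PUBLIC_API_BASE[provider]).rstrip("/")


def pr_endpoint(provider: str, repo: str, api_base: str = "") -> str:
    """URL that a PR (GitHub) or MR (GitLab) creation request is POSTed to."""
    base = resolve_api_base(provider, api_base)
    if provider == "github":
        return f"{base}/repos/{repo}/pulls"
    # GitLab addresses a project by its URL-encoded full path.
    return f"{base}/projects/{quote(repo, safe='')}/merge_requests"
```

For example, `pr_endpoint("github", "acme-corp/infrastructure", "https://github.acme.com/api/v3")` targets the GHE instance, while omitting the third argument targets github.com.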

Required PAT scopes:

GitHub: repo (private repo + branch + commit + PR)
GitLab: api

Network reachability: SaaS-side Tier 2 calls these APIs from our Cloud Run egress. If your enterprise instance is behind a corporate firewall not reachable from the public internet, you have two options:

Allow-list our Cloud Run egress IPs (we can provide them under NDA).
Switch to Tier 3 (agent-side): the Master pod runs inside your network and reaches your Git server directly, so no allow-listing of our egress IPs is needed.

Where secrets live

Item	Storage	Plaintext exposed to SaaS?
GitHub / GitLab PAT	GCP Secret Manager, name `remediation-{tenant}-{provider}-pat`	Only in transit when you save it (HTTPS), never at rest
Patch body (Tier 1 SaaS)	`remediation_patches.body` (Postgres)	Yes (REPLACE_ME placeholders only — no resource names)
Patch body (Tier 3 Agent)	`remediation_patches.body` (Postgres)	Yes — your operator chose to delegate apply, so the audit row reflects what ran
Customer infrastructure code	YOUR GitHub / GitLab	No — we open a PR; your reviewer merges

Tokens are stored in Secret Manager rather than the database so that a database compromise does not expose them. The settings UI accepts the PAT once via HTTPS POST; the backend writes it via the Secret Manager v1 addVersion API and stores only the secret resource name in Postgres.
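
For reference, the addVersion call mentioned above has a simple REST shape: a POST to `{secret resource}:addVersion` with the payload base64-encoded. The sketch below only builds the request and does not send it; the function name is hypothetical, and authentication headers are omitted.

```python
import base64
import json


def add_version_request(secret_resource: str, pat: str) -> tuple:
    """Build (url, json_body) for Secret Manager v1 projects.secrets.addVersion."""
    url = f"https://secretmanager.googleapis.com/v1/{secret_resource}:addVersion"
    encoded = base64.b64encode(pat.encode("utf-8")).decode("ascii")
    body = json.dumps({"payload": {"data": encoded}})
    return url, body
```

After a successful call, only `secret_resource` needs to be persisted; the PAT itself never has to be written to Postgres.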

Choosing a mode (decision tree)

Are you on day 1?
  yes → dry-run (just observe + audit-log)
  no → Have you tuned out the false-positive templates?
    no → approval queue (admin gates each apply)
    yes → Where do infra changes normally live?
      in your repo → Pull Request (reviewers merge in GitHub / GitLab)
      in the cluster → Tier 3 agent (Master applies in your cluster)
        approval: admin gate
        auto: zero-touch
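
The same decision tree, written as a function for readers who prefer code. The parameter names and the `"repo"` / `"cluster"` values are illustrative choices, not product settings.

```python
def choose_mode(day_one: bool, templates_tuned: bool, changes_live_in: str) -> str:
    """Walk the decision tree above and return the recommended option."""
    if day_one:
        return "dry-run"                 # just observe + audit-log
    if not templates_tuned:
        return "approval queue"          # admin gates each apply
    if changes_live_in == "repo":
        return "Pull Request"            # reviewers merge in GitHub / GitLab
    if changes_live_in == "cluster":
        return "Tier 3 agent"            # Master applies in your cluster
    raise ValueError(f"unknown location: {changes_live_in!r}")
```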

End-to-end walkthrough — first PR-mode patch

Open Remediation Settings in the dashboard (admin role required).
Choose Mode = Pull Request.
Under GitHub connector, paste:

- Default repository: acme-corp/infrastructure
- Base branch: main
- API base URL: leave blank for github.com, or paste https://github.acme.com/api/v3 for GHE
- Personal access token: ghp_… (the dashboard never displays it again)

Click Save settings. The backend writes the PAT to Secret Manager and stores the resource name in remediation_settings.
Wait for the next finding (the validation cluster fires T3.6-IOC-DOMAIN every few seconds for testing).
Within ~30 seconds, the generator polls tier3_findings, renders the patch, fetches the PAT from Secret Manager, and opens a PR.
The Remediation Center row flips to pr_opened with a working PR # link.
Your reviewer merges; the dashboard's per-row "Mark applied" button records the apply.

If anything fails (e.g. PAT lacks repo scope, branch already exists), the row flips to failed with the GitHub error message in error_message so you can debug from the drawer.
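
As a rough picture of what the generator sends when it opens the PR, the sketch below builds a GitHub-style pull request payload with the rollback snippet in the description. The `title`, `head`, `base`, and `body` fields are the standard GitHub create-pull-request parameters; the function name, branch name, and description layout are assumptions.

```python
def build_pr_payload(title: str, head: str, base: str,
                     summary: str, rollback: str) -> dict:
    """Assemble the JSON payload for a GitHub 'create pull request' call."""
    # Indent the snippet by four spaces so Markdown renders it as code.
    indented = "\n".join("    " + line for line in rollback.splitlines())
    description = f"{summary}\n\nRollback:\n\n{indented}\n"
    return {"title": title, "head": head, "base": base, "body": description}
```

With the walkthrough values, `base` would be `main`, and `head` would be whatever branch the generator pushed the patch to.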

Operational notes

Per-tenant matched_count — duplicate detections collapse onto one row; the row shows "matched 12× · last 2m ago" so you don't chase ghosts.
Rotation — paste a new PAT into the same field on Remediation Settings; the backend calls Secret Manager addVersion so the latest version is always read by the generator. Old versions remain accessible in Secret Manager for audit; you can disable them post-hoc.
Rollback — every patch has a stored reverse snippet. The "Rollback" button on status='applied' rows runs the reverse (or, on Tier 1, hands you the snippet to run yourself).
Dismiss — false-positive patches go to status='dismissed' with a reason field so subsequent re-detections don't re-noise the list.
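
The matched_count behaviour can be sketched with an in-memory dictionary standing in for the table. The `fingerprint` key, the `open` status, and the function name are hypothetical; the real deduplication key is not described on this page.

```python
from datetime import datetime, timezone


def record_detection(rows: dict, tenant_id: str, fingerprint: str,
                     seen_at: datetime) -> dict:
    """Collapse repeat detections onto one row per (tenant, fingerprint)."""
    key = (tenant_id, fingerprint)
    row = rows.get(key)
    if row is None:
        row = {"matched_count": 0, "last_seen": seen_at, "status": "open"}
        rows[key] = row
    row["matched_count"] += 1
    row["last_seen"] = max(row["last_seen"], seen_at)
    # Status is left untouched, so a dismissed row stays dismissed.
    return row
```

Because a re-detection only bumps the counter and timestamp, a row already marked dismissed does not reappear as new work.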
