Remediation agent · Available on all plans

Auto-Remediator

Triage, diagnose, propose — execute with your approval.

The Auto-Remediator is Veirox's first-line responder. When an alert fires, it investigates the actual system state, writes a structured diagnosis, proposes a concrete fix, and requests your approval before executing. Every step is auditable. Nothing destructive happens without a human in the loop.

Does this match you?

Built for teams that want 24/7 response without a 24/7 on-call.

Good fit if you…

✓ Run production systems where seconds matter: APIs, payments, streaming
✓ Have documented runbooks in Markdown, Confluence, or Google Docs
✓ Want the agent to do the work, not just route alerts
✓ Require human approval on destructive changes (rollbacks, restarts, scale)
✓ Ship under audit and need every action cryptographically logged

Probably not yet if you…

! Have zero documented runbooks; the agent needs some procedural knowledge to remediate
! Need unattended remediation without any human approval (that's a future release)
! Run entirely in an air-gapped network without our Veirox Connect option

Lifecycle

Five stages, every time.

1

Ingest — webhook arrives

An alert fires from AlertManager, Grafana, Datadog, Sentry, or PagerDuty. Veirox verifies the signature, applies your routing rules, deduplicates repeats, and creates a session in <200 ms.

Uses: webhook ingress · alert correlation · HMAC verification
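
The signature check in this stage can be sketched as follows. This is a minimal illustration, assuming an HMAC-SHA256 hex digest over the raw request body; the secret, payload, and function name are hypothetical and not Veirox's actual implementation.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 hex digest of body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not reveal
    # how many leading characters of the signature were correct.
    return hmac.compare_digest(expected, signature_hex)


# Hypothetical shared secret and alert payload, for illustration only.
secret = b"example-shared-secret"
body = b'{"alert": "payment-api p99 > 1500 ms", "severity": "critical"}'
good = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, good))         # True
print(verify_signature(secret, body + b" ", good))  # False: body was altered
```

Verifying against the raw bytes, before any JSON parsing, matters because re-serialized JSON rarely matches the sender's byte sequence.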

2

Triage — read-only investigation

The agent reads system state — kubectl get/describe, cloud CLI calls, SQL EXPLAIN, recent deploys, trace data, log snippets. Nothing is modified. Your runbooks are pulled from the knowledge base by semantic match.

Uses: runbook retrieval · trace/log tools · read-only tool allowlist
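
A read-only tool allowlist like the one named above could look roughly like this. It is a sketch under stated assumptions: the allowlist contents and the `is_read_only` helper are hypothetical, and the check denies anything it does not recognize.

```python
import shlex

# Hypothetical allowlist: tool name -> subcommands treated as read-only.
READ_ONLY = {
    "kubectl": {"get", "describe", "logs", "top"},
    "helm": {"list", "status", "history"},
}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_META = set(";|&`$<>\n")


def is_read_only(command: str) -> bool:
    """Return True only if the tool and its subcommand are allowlisted."""
    if SHELL_META & set(command):
        return False  # refuse anything that looks like shell composition
    try:
        parts = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if len(parts) < 2:
        return False
    tool, subcommand = parts[0], parts[1]
    # Deny by default: unknown tools map to an empty set.
    return subcommand in READ_ONLY.get(tool, set())


print(is_read_only("kubectl get pods -n payments"))        # True
print(is_read_only("kubectl delete pod payment-api-7b"))   # False
print(is_read_only("kubectl get pods; kubectl delete x"))  # False
```

The sketch expects the subcommand directly after the tool name, so a command with leading flags is rejected rather than guessed at.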

3

Diagnose — structured finding

The agent writes a finding with severity, blast radius, suspected root cause, and evidence (log lines, metric screenshots, linked commits). The finding is persisted, searchable, and cross-linked to the alert and session.

Uses: findings database · change correlator · structured output
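
To make "structured finding" concrete, here is one possible shape for such a record. The field names and identifiers are assumptions chosen to mirror the description above, not the product's real schema; the values are borrowed from the sample run on this page.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """Hypothetical finding record; field names are illustrative."""
    severity: str
    blast_radius: str
    suspected_root_cause: str
    evidence: list[str] = field(default_factory=list)
    alert_id: str = ""    # cross-link to the originating alert
    session_id: str = ""  # cross-link to the investigation session


finding = Finding(
    severity="critical",
    blast_radius="payment-api (payments namespace)",
    suspected_root_cause="deploy v2.3.1 reduced DB_POOL_SIZE from 20 to 5",
    evidence=[
        "847 connection timeouts to payment-db in payment-api-7b logs",
        "4 of 5 pods ready, 1 restart in last 10 min",
    ],
    alert_id="alert-example",
    session_id="session-example",
)

# A plain dict serializes cleanly, which keeps the record searchable.
print(json.dumps(asdict(finding), indent=2))
```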

4

Propose — concrete fix with plan

The agent writes a specific remediation plan: exact commands, expected outcomes, rollback steps. It opens an approval request with the full reasoning attached — delivered to Slack, Telegram, Web, or email.

Uses: approvals engine · plan review · Slack/email notifications

5

Execute — after approval, on your rules

An operator approves in one tap. The agent executes with the same RBAC as your team, streams output live, and writes a closing entry with outcome and metric validation. If something deviates from the plan, the agent pauses and re-requests approval.

Uses: sandboxed execution · must_always/must_never rules · live session audit

Sample run

Payment API latency spike at 02:14.

Real transcript structure from a production Auto-Remediator session.

02:14:03

⚠ Alert received — payment-api p99 > 1500 ms (AlertManager, severity=critical)

02:14:05

● Investigating — fetched 1 runbook: "High API latency"

02:14:09

kubectl get pods -n payments · 4 of 5 ready · 1 restart in last 10 min

02:14:12

kubectl logs payment-api-7b · saw 847 connection timeouts to payment-db

02:14:17

Recent deploys: v2.3.1 merged 47 min ago — diff shows DB_POOL_SIZE 20 → 5

02:14:20

🔍 Root cause identified — deploy v2.3.1 shrunk connection pool; traffic is saturating it

02:14:22

💡 Recommend rollback to v2.3.0 · expected recovery in 90 s · no data migration risk

Approval request → @sre-oncall

Approve rollback · Revise plan · Reject

02:14:47

✓ Approved by alex@company.com

02:14:50

kubectl rollout undo deployment/payment-api · rolling back…

02:16:23

✓ Resolved — p99 back to 180 ms · incident closed · post-mortem draft saved

Total wall time: 2 min 20 s. Human time: the 25 s to tap Approve. Compare to your median MTTR.

Connected

What Auto-Remediator plugs into.

Ingress

Any webhook source

AlertManager, Grafana, Datadog, Sentry, PagerDuty, Stripe, or generic HMAC.

Investigation

Your infrastructure tools

kubectl, helm, cloud CLIs, database clients, HTTP APIs, or custom tool servers.

Knowledge

Your runbooks

Markdown files, Confluence pages, Google Docs: ingested once, retrieved on demand.

Change tracking

Git + deploy signals

GitHub, GitLab, Bitbucket — correlate alerts to commits and deploys within the incident window.
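
Correlating an alert with deploys inside the incident window can be sketched as a simple time filter. The `Deploy` type, the one-hour default window, the date, and the `checkout-web` service are hypothetical; a real correlator would pull deploy events from the Git provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    version: str
    deployed_at: datetime


def deploys_in_window(deploys, service, alert_time, window=timedelta(hours=1)):
    """Return deploys to `service` within `window` before the alert, newest first."""
    start = alert_time - window
    hits = [
        d for d in deploys
        if d.service == service and start <= d.deployed_at <= alert_time
    ]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)


# Hypothetical deploy history around an alert at 02:14:03.
alert_time = datetime(2024, 1, 1, 2, 14, 3)
deploys = [
    Deploy("payment-api", "v2.3.1", alert_time - timedelta(minutes=47)),
    Deploy("payment-api", "v2.3.0", alert_time - timedelta(days=3)),
    Deploy("checkout-web", "v9.0.0", alert_time - timedelta(minutes=10)),
]

suspects = deploys_in_window(deploys, "payment-api", alert_time)
print([d.version for d in suspects])  # ['v2.3.1']
```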

Approvals

Where your team already is

Slack native buttons, Telegram inline, WhatsApp, email, or the Veirox console.

Private infra

Veirox Connect tunnel

Reach private Kubernetes, on-prem DBs, air-gapped networks — no firewall rules.

Safety

Guardrails your security team will sign off on.

Tool allowlist

The agent can only invoke tools explicitly allowed for this project. No surprises, no unbounded shell.

`must_never` rules

Declare forbidden actions at the project level (e.g. "never drop tables", "never restart prod before noon"). Enforced at runtime.
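
One way to picture runtime enforcement of the two example rules is as predicates checked before any command runs. This is only a sketch: the `Rule` type, the matching logic, and the `prod` namespace name are assumptions, and the product's rule syntax may differ entirely.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class Rule:
    description: str
    violates: Callable[[str, datetime], bool]


def drops_table(command: str, now: datetime) -> bool:
    return re.search(r"\bdrop\s+table\b", command, re.IGNORECASE) is not None


def restarts_prod_before_noon(command: str, now: datetime) -> bool:
    # Assumes the production namespace is literally named "prod".
    return "rollout restart" in command and "-n prod" in command and now.hour < 12


MUST_NEVER = [
    Rule("never drop tables", drops_table),
    Rule("never restart prod before noon", restarts_prod_before_noon),
]


def first_violation(command: str, now: datetime) -> Optional[str]:
    """Return the description of the first rule the command breaks, if any."""
    for rule in MUST_NEVER:
        if rule.violates(command, now):
            return rule.description
    return None


morning = datetime(2024, 1, 1, 9, 0)
print(first_violation("kubectl rollout restart deployment/payment-api -n prod", morning))
print(first_violation("kubectl get pods -n prod", morning))  # None
```

String matching like this is deliberately simplistic; it shows where the check sits in the flow, not how robust rule evaluation should be written.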

Plan review before execution

Every destructive step is surfaced as an approval request with full reasoning. The human always sees the exact command before it runs.

Signed, time-bound approval links

Slack and email approvals use cryptographically signed links with TTL. No link sharing, no stale approvals.
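
A signed, time-bound approval token can be sketched with an HMAC over the approval ID and its expiry. The key, ID format, and function names are hypothetical, and the sketch covers only signature and TTL checks, not binding a link to a specific approver.

```python
import hashlib
import hmac
import time
from typing import Optional

SIGNING_KEY = b"example-signing-key"  # hypothetical; would live in a vault


def sign_approval(approval_id: str, expires_at: int) -> str:
    """Sign the approval ID together with its expiry timestamp."""
    message = f"{approval_id}.{expires_at}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def is_valid(approval_id: str, expires_at: int, signature: str,
             now: Optional[float] = None) -> bool:
    """Accept only an unexpired link whose signature matches."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False  # stale approval
    expected = sign_approval(approval_id, expires_at)
    return hmac.compare_digest(expected, signature)


issued_at = 1_000
expires_at = issued_at + 300  # 5-minute TTL, chosen for illustration
sig = sign_approval("apr-1", expires_at)

print(is_valid("apr-1", expires_at, sig, now=1_100))       # True
print(is_valid("apr-1", expires_at, sig, now=1_301))       # False: expired
print(is_valid("apr-1", expires_at + 60, sig, now=1_100))  # False: expiry tampered
```

Because the expiry is part of the signed message, extending the TTL in the URL invalidates the signature.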

Full session audit

Every tool call, model input, and decision is logged. Exportable as Markdown or PDF for compliance.

Secrets never touch model context

Credentials are referenced by friendly name. The raw value flows through the secrets vault, never the LLM.
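
The friendly-name pattern can be illustrated with placeholders that are resolved only at execution time and redacted from anything sent back to the model. The `{{secret:...}}` syntax, the in-memory vault, and the helper names are all assumptions made for this sketch.

```python
import re

# Hypothetical in-memory stand-in for a secrets vault.
VAULT = {"payment-db-password": "s3cr3t-value"}

# Hypothetical placeholder syntax: {{secret:friendly-name}}
PLACEHOLDER = re.compile(r"\{\{secret:([a-z0-9-]+)\}\}")


def resolve(template: str) -> str:
    """Swap placeholders for raw values just before a tool runs."""
    return PLACEHOLDER.sub(lambda m: VAULT[m.group(1)], template)


def redact(text: str) -> str:
    """Replace any raw secret in tool output with its friendly name."""
    for name, value in VAULT.items():
        text = text.replace(value, "{{secret:" + name + "}}")
    return text


# What the model sees and writes: names only.
model_visible_env = {"PGPASSWORD": "{{secret:payment-db-password}}"}

# What the executor passes to the tool process: resolved values.
runtime_env = {key: resolve(value) for key, value in model_visible_env.items()}

# Tool output is redacted before it re-enters model context.
print(redact("connected using password s3cr3t-value"))
```

An unknown friendly name raises `KeyError` in this sketch, which fails closed rather than running a command with a missing credential.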

Typical outcomes

What teams measure after deploying Auto-Remediator.

78%

of qualifying alerts resolved without waking a human

<3 min

median time from alert to resolution (vs. ~22 min manual)

100%

of destructive actions auditable with full reasoning

Measured across design-partner deployments. Your mileage depends on runbook coverage and alert quality.

Getting started

Rolled out in three steps.

1

Clone the starter

Open /console/tasks → New from template → "Auto-Remediator". Starter system prompt, tool allowlist, and approval policy ship out of the box.

2

Wire your alerts

Add a webhook from AlertManager, Grafana, Datadog, or PagerDuty. Point it at the task's ingest URL. See webhook setup.

3

Add your runbooks

Paste Markdown, upload a folder, or connect Confluence. The agent retrieves them by semantic match. Start with your top 3 alert types.

FAQ

Common questions.

Can it run fully autonomously, without human approval?

By default, no: every destructive action requires explicit approval. If you want no-approval automation for a narrow set of safe operations (e.g. clearing a cache or rotating a non-prod key), you can allowlist specific tool calls in the must_always rules. Most teams never turn this on in production.

What if our runbooks are wrong or outdated?

The agent tries the runbook step and compares the observed outcome to the expected one. If they diverge, it pauses and opens a finding with the discrepancy — so the next person to edit the runbook has the data to fix it.

How does it avoid doing the same "fix" to every alert?

The agent is not pattern-matching. It investigates each alert against the current system state before proposing anything, so two alerts on the same service at different times can (and often do) produce different remediation plans.

Will it work with our custom internal tools?

Yes — register them as custom tool servers (Model Context Protocol). The agent discovers them at startup and treats them like any first-party integration, with the same audit and approval pipeline.

Which plan do I need?

Available on all plans: Free for evaluation, Pro for production traffic, Business for RBAC, audit export, and a tunnel to private infrastructure. See pricing.

Clone Auto-Remediator in 60 seconds.

Starter prompt, tool allowlist, approval policy — all out of the box. Fire a test alert from the Test Console and watch it work.
