Triage, diagnose, propose — execute with your approval.
The Auto-Remediator is Veirox's first-line responder. When an alert fires, it investigates the actual system state, writes a structured diagnosis, proposes a concrete fix, and requests your approval before executing. Every step is auditable. Nothing destructive happens without a human in the loop.
Does this match you?
Good fit if you…
Probably not yet if you…
Lifecycle
An alert fires from AlertManager, Grafana, Datadog, Sentry, or PagerDuty. Veirox verifies the signature, applies your routing rules, deduplicates repeats, and creates a session in <200 ms.
Uses: webhook ingress · alert correlation · HMAC verification
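As a rough illustration of the signature check and deduplication in this step, here is a minimal sketch. It assumes an HMAC-SHA256 hex signature over the raw request body and a per-fingerprint time window; the names `verify_signature` and `Deduplicator` are hypothetical and not part of the Veirox API.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch position is not leaked.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class Deduplicator:
    """Suppress repeats of the same alert fingerprint inside a time window.

    Assumption: the window slides, so every repeat extends the suppression.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_seen: dict[str, float] = {}

    def is_duplicate(self, fingerprint: str, now: float) -> bool:
        last = self._last_seen.get(fingerprint)
        self._last_seen[fingerprint] = now
        return last is not None and now - last < self.window_seconds
```

A receiver built this way would reject any request whose signature fails before it looks at routing rules or creates a session.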
The agent reads system state — kubectl get/describe, cloud CLI calls, SQL EXPLAIN, recent deploys, trace data, log snippets. Nothing is modified. Your runbooks are pulled from the knowledge base by semantic match.
Uses: runbook retrieval · trace/log tools · read-only tool allowlist
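One way to picture a read-only tool allowlist is a deny-by-default check on the first two tokens of a command. The sketch below is an assumption about how such a gate could work, not Veirox's implementation; `READ_ONLY_COMMANDS` and `is_allowed` are hypothetical names.

```python
import shlex

# Hypothetical allowlist: (binary, subcommand) pairs treated as read-only.
READ_ONLY_COMMANDS = {
    ("kubectl", "get"),
    ("kubectl", "describe"),
    ("kubectl", "logs"),
    ("helm", "status"),
}


def is_allowed(command: str) -> bool:
    """Return True only if the command's first two tokens are allowlisted.

    Assumption: an allowed command is later executed as an argv list,
    never through a shell, so characters like ';' have no special meaning.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: refuse rather than guess.
        return False
    if len(tokens) < 2:
        return False
    return (tokens[0], tokens[1]) in READ_ONLY_COMMANDS
```

Anything not listed is refused, including read-only commands nobody has added yet. That is the intended trade-off of an allowlist.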
The agent writes a finding with severity, blast radius, suspected root cause, and evidence (log lines, metric screenshots, linked commits). The finding is persisted, searchable, and cross-linked to the alert and session.
Uses: findings database · change correlator · structured output
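The finding described in this step maps naturally onto a small structured record. The following sketch assumes the fields named above (severity, blast radius, suspected root cause, evidence) plus alert and session identifiers for cross-linking. The `Finding` class and its field names are illustrative, not the actual Veirox schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """Hypothetical shape of a persisted finding."""

    alert_id: str
    session_id: str
    severity: str
    blast_radius: str
    suspected_root_cause: str
    # Log lines, screenshot references, or commit links.
    evidence: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable for diffing and search.
        return json.dumps(asdict(self), sort_keys=True)
```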
The agent writes a specific remediation plan: exact commands, expected outcomes, rollback steps. It opens an approval request with the full reasoning attached — delivered to Slack, Telegram, Web, or email.
Uses: approvals engine · plan review · Slack/email notifications
An operator approves in one tap. The agent executes with the same RBAC as your team, streams output live, and writes a closing entry with outcome and metric validation. If something deviates from the plan, the agent pauses and re-requests approval.
Uses: sandboxed execution · must_always/must_never rules · live session audit
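The pause-on-deviation behavior in this step can be sketched as a loop that checks each step's observed result against the plan and stops at the first mismatch. This is a simplified model under stated assumptions: `PlanStep`, `deviates`, and `run_plan` are hypothetical names, and the `execute` callback stands in for whatever sandboxed runner is actually used.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """One approved step with the outcome the plan expects."""

    command: str
    expected_exit_code: int = 0
    expected_output_contains: str = ""


def deviates(step: PlanStep, exit_code: int, output: str) -> bool:
    """Return True if the observed result differs from the approved plan."""
    if exit_code != step.expected_exit_code:
        return True
    return step.expected_output_contains not in output


def run_plan(steps, execute):
    """Run steps in order and stop at the first deviation.

    `execute` takes a command string and returns (exit_code, output).
    Returns (steps_completed_as_expected, paused).
    """
    for index, step in enumerate(steps):
        exit_code, output = execute(step.command)
        if deviates(step, exit_code, output):
            # A real system would re-request approval here.
            return index, True
    return len(steps), False
```

No later step runs after a deviation, so the plan the operator approved is never executed against a system state it did not anticipate.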
Sample run
Real transcript structure from a production Auto-Remediator session.
Approval request → @sre-oncall
Total wall time: 2 min 20 s. Human time: the 25 s it took to tap Approve. Compare that with your median MTTR.
Connected
Ingress
AlertManager, Grafana, Datadog, Sentry, PagerDuty, Stripe, or generic HMAC.
Investigation
kubectl, helm, cloud CLIs, database clients, HTTP APIs, or custom tool servers.
Knowledge
Markdown files, Confluence pages, Google Docs: ingested once, retrieved on demand.
Change tracking
GitHub, GitLab, Bitbucket — correlate alerts to commits and deploys within the incident window.
Approvals
Slack native buttons, Telegram inline, WhatsApp, email, or the Veirox console.
Private infra
Reach private Kubernetes clusters, on-prem databases, and air-gapped networks without adding firewall rules.
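To make the change-tracking idea above concrete, correlating an alert with recent commits and deploys can be as simple as filtering changes by timestamp. This sketch assumes changes arrive as `(identifier, deployed_at)` pairs and that the incident window is a fixed lookback before the alert; `changes_in_window` is a hypothetical helper, not a Veirox function.

```python
from datetime import datetime, timedelta


def changes_in_window(changes, alert_time: datetime, lookback: timedelta):
    """Return changes deployed within `lookback` before the alert.

    `changes` is an iterable of (identifier, deployed_at) pairs.
    Results are ordered most recent first, since the latest change
    before an alert is usually the first suspect.
    """
    start = alert_time - lookback
    hits = [(cid, ts) for cid, ts in changes if start <= ts <= alert_time]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```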
Safety
The agent can only invoke tools explicitly allowed for this project. No surprises, no unbounded shell.
must_never rules
Declare forbidden actions at the project level (e.g. "never drop tables", "never restart prod before noon"). Runtime enforced.
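A runtime check for rules like the two examples above might look like the sketch below. It assumes each rule is a predicate over the proposed command and the current time. The rule functions, the `MUST_NEVER` table, and the string matching are all illustrative assumptions, and the real rule format may differ.

```python
import re
from datetime import datetime


def never_drop_tables(command: str, now: datetime) -> bool:
    """Violated when the command contains a DROP TABLE statement."""
    return re.search(r"\bdrop\s+table\b", command, re.IGNORECASE) is not None


def never_restart_prod_before_noon(command: str, now: datetime) -> bool:
    """Violated when a prod rollout restart is attempted before 12:00."""
    is_restart = "rollout restart" in command and "prod" in command
    return is_restart and now.hour < 12


# Hypothetical project-level rule table: rule name -> violation predicate.
MUST_NEVER = {
    "never drop tables": never_drop_tables,
    "never restart prod before noon": never_restart_prod_before_noon,
}


def violated_rules(command: str, now: datetime) -> list[str]:
    """Return the names of every must_never rule the command would break."""
    return [name for name, check in MUST_NEVER.items() if check(command, now)]
```

A command would only proceed to the approval stage when `violated_rules` returns an empty list.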
Every destructive step is surfaced as an approval request with full reasoning. The human always sees the exact command before it runs.
Slack and email approvals use cryptographically signed links with TTL. No link sharing, no stale approvals.
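Signed links with a TTL are commonly built by signing the approval ID, the intended approver, and an expiry timestamp together. The sketch below shows that general technique with HMAC-SHA256. It is not the Veirox link format, and `sign_approval` and `verify_approval` are hypothetical names.

```python
import hashlib
import hmac


def _message(approval_id: str, approver: str, expires_at: int) -> bytes:
    # Assumption: IDs never contain '|', so the joined fields are unambiguous.
    return f"{approval_id}|{approver}|{expires_at}".encode()


def sign_approval(secret: bytes, approval_id: str, approver: str, expires_at: int) -> str:
    """Return a hex signature binding the approval to one approver and expiry."""
    message = _message(approval_id, approver, expires_at)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_approval(secret: bytes, approval_id: str, approver: str,
                    expires_at: int, signature: str, now: int) -> bool:
    """Accept only an unexpired link presented by the approver it was signed for."""
    if now >= expires_at:
        return False  # Stale approval: the TTL has passed.
    expected = sign_approval(secret, approval_id, approver, expires_at)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Because the approver and expiry are part of the signed message, a forwarded link fails for a different user, and editing the expiry in the URL invalidates the signature.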
Every tool call, model input, and decision is logged. Exportable as Markdown or PDF for compliance.
Credentials are referenced by friendly name. The raw value flows through the secrets vault, never the LLM.
Typical outcomes
78%
of qualifying alerts resolved without waking a human
<3 min
median time from alert to resolution (vs. ~22 min manual)
100%
of destructive actions auditable with full reasoning
Measured across design-partner deployments. Your mileage depends on runbook coverage and alert quality.
Getting started
Open /console/tasks → New from template → "Auto-Remediator". Starter system prompt, tool allowlist, and approval policy ship out of the box.
Add a webhook from AlertManager, Grafana, Datadog, or PagerDuty. Point it at the task's ingest URL. See webhook setup.
Paste Markdown, upload a folder, or connect Confluence. The agent retrieves them by semantic match. Start with your top 3 alert types.
FAQ
By default, no: every destructive action requires explicit approval. If you want approval-free automation for a narrow set of safe operations (e.g. clearing a cache, rotating a non-prod key), you can allowlist specific tool calls in the must_always rules. Most teams never turn this on in production.
The agent tries the runbook step and compares the observed outcome to the expected one. If they diverge, it pauses and opens a finding with the discrepancy — so the next person to edit the runbook has the data to fix it.
The agent does not pattern-match. It investigates each alert against the current system state before proposing anything. Two alerts on the same service at different times can (and often do) produce different remediation plans.
Yes — register them as custom tool servers (Model Context Protocol). The agent discovers them at startup and treats them like any first-party integration, with the same audit and approval pipeline.
Available on all plans: Free for evaluation, Pro for production traffic, Business for RBAC, audit export, and a tunnel to private infra. See pricing.
Starter prompt, tool allowlist, approval policy — all out of the box. Fire a test alert from the Test Console and watch it work.