Robustness against errors and faults
Description
Article 15(3) — High-risk AI systems must be resilient to errors, faults, and inconsistencies, and technical redundancy solutions, such as backup or fail-safe plans, must be implemented.
⚠️ Risk Impact
AI systems fail in surprising ways under degraded conditions. Without redundancy and fail-safe measures, a transient infrastructure issue (a rate-limited API, a GPU out-of-memory error, a network partition) cascades into a customer-facing AI failure.
🔍 How EchelonGraph Detects This
EchelonGraph's Tier 1 Cloud Scanner automatically checks for this condition across all connected cloud accounts. Violations are flagged as high-severity findings with remediation guidance.
🔧 Remediation
Implement: (1) decline-to-answer thresholds when confidence is low, (2) circuit breakers for downstream dependencies, (3) graceful degradation to deterministic fallbacks, (4) chaos-engineering tests covering AI-specific failure modes.
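Item (2) above can be sketched as a minimal circuit breaker that wraps a downstream dependency and degrades to a deterministic fallback (item 3) when the dependency is unhealthy. This is an illustrative sketch, not EchelonGraph's implementation; the class name, parameters, and thresholds are assumptions:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then skips the primary dependency until `reset_after`
    seconds have elapsed (half-open retry afterwards)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, primary, fallback):
        # While open, route straight to the fallback until the cooldown
        # elapses, then allow one trial call against the primary.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: try the primary again
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # a success closes the breaker
        return result
```

Usage: wrap the live dependency call (e.g. a flight-data API) in `breaker.call(fetch_live, degraded_answer)`, where `degraded_answer` returns an explicit "service degraded, please retry" response rather than a confident guess.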
💀 Real-World Attack Scenario
A travel-booking AI relied on a live flight-data API. When the API began rate-limiting requests, the AI silently fell back to cached flight data and confidently quoted unavailable flights to customers for four hours. Refunds and customer-service costs: €380K. Article 15(3) violation: the system had no documented fail-safe for upstream API failure.
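The silent-fallback failure in this scenario can be prevented by bounding cache staleness and declining to answer beyond it. A minimal sketch, assuming a hypothetical `fetch_live` callable and a 5-minute freshness budget (both invented for illustration):

```python
import time

CACHE_TTL_SECONDS = 300  # hypothetical freshness budget for flight data

def get_flight_data(flight_id, cache, fetch_live):
    """Return live data, or cached data only while it is demonstrably
    fresh; otherwise decline rather than answer confidently from a
    stale cache."""
    try:
        data = fetch_live(flight_id)
        cache[flight_id] = (time.monotonic(), data)
        return {"status": "live", "data": data}
    except Exception:
        entry = cache.get(flight_id)
        if entry is not None:
            fetched_at, data = entry
            if time.monotonic() - fetched_at < CACHE_TTL_SECONDS:
                # Serve the cache, but label it so the UX can flag it.
                return {"status": "cached", "data": data}
        # Fail safe: decline to answer instead of quoting stale flights.
        return {"status": "unavailable", "data": None}
```

The key design choice is that the cached path is labeled, not silent: downstream code can surface "data may be out of date" to the customer, and the TTL turns an unbounded silent fallback into a bounded, documented one.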
💰 Cost of Non-Compliance
Article 15(3) robustness gap: fines of up to €15M or 3% of global annual turnover. Customer-trust impact of confident-but-wrong AI: an 11% churn spike in measured incidents (Edelman 2024).
📋 Audit Questions
1. What happens when your top AI system loses access to its primary data source?
2. What is the decline-to-answer threshold, and how was it set?
3. When was the last chaos-engineering test on AI infrastructure?
4. Show me a circuit breaker that fired in production in the last 30 days.
🏗️ Infrastructure as Code Fix
```hcl
# Assumes a Terraform provider that manages Prometheus alert rules and an
# `ai_circuit_breaker_state` metric exported by the model-serving layer.
resource "prometheus_alert_rule" "ai_circuit_breaker_open" {
  name = "ai_dependency_circuit_breaker_open"
  expr = "rate(ai_circuit_breaker_state{state=\"open\"}[5m]) > 0"
  for  = "5m"

  labels = {
    severity = "page"
  }

  annotations = {
    summary = "AI dependency circuit breaker opened — investigate dependency health"
  }
}
```
⚡ Common Pitfalls
- ⛔ Caching responses for resilience without invalidating the cache when the data source changes
- ⛔ Setting the decline-to-answer confidence threshold to 0.5 universally — too aggressive for some use cases, too lenient for others
- ⛔ Never testing the fallback path — it works in dev but breaks in prod under load
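The second pitfall can be avoided by calibrating a threshold per use case against the cost of a wrong answer. A sketch with hypothetical use-case names and threshold values (all invented for illustration):

```python
# Hypothetical per-use-case decline-to-answer thresholds, calibrated
# against each use case's cost of a wrong answer — not a universal 0.5.
DECLINE_THRESHOLDS = {
    "faq_chat": 0.40,        # low stakes: prefer answering
    "booking_quote": 0.80,   # money on the line: prefer declining
    "medical_triage": 0.95,  # safety-critical: decline aggressively
}

def answer_or_decline(use_case, confidence, answer):
    """Return the model answer only above the use case's threshold;
    otherwise return an explicit decline that routes to a human."""
    threshold = DECLINE_THRESHOLDS.get(use_case, 0.90)  # conservative default
    if confidence >= threshold:
        return {"answered": True, "text": answer}
    return {"answered": False,
            "text": "I can't answer this reliably; routing to a human agent."}
```

Unknown use cases deliberately fall back to a conservative default rather than the most permissive threshold, so a newly deployed system declines until it is explicitly calibrated.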
📈 Business Value
Robust AI systems retain customer trust through infra incidents. The difference between 'confidently wrong' and 'gracefully degraded' is the difference between a viral screenshot and a customer-success conversation.
⏱️ Effort Estimate
3–4 weeks per system to implement circuit breakers and chaos tests.
EchelonGraph ships AI-specific chaos scenarios and circuit-breaker patterns for KServe, Ray, and Seldon.
🔗 Cross-Framework References
Automate EU AI Act ART15-ROBUSTNESS compliance
EchelonGraph continuously monitors this control across all your cloud accounts.