Unbounded Consumption
Description
Adversarial, high-cost queries drain budget, exhaust capacity, or deny service. LLM inference is expensive, so a service that accepts unbounded queries invites economic attack.
⚠️ Risk Impact
A token-heavy query can cost 100-1000× as much as a benign query to serve. Unbounded-consumption attacks are economically asymmetric: a small attacker cost imposes a large defender cost.
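As a back-of-envelope illustration (the per-token price and token counts below are assumptions, not quoted rates), compare a ~200-token support reply with the roughly 65,000 output tokens of a 50,000-word essay:

# Back-of-envelope cost asymmetry; price and token counts are illustrative.
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # assumed USD per 1K output tokens
benign_cost = 200 / 1000 * PRICE_PER_1K_OUTPUT_TOKENS      # short support reply
attack_cost = 65_000 / 1000 * PRICE_PER_1K_OUTPUT_TOKENS   # 50,000-word essay
print(f"benign ${benign_cost:.4f} vs attack ${attack_cost:.2f}, "
      f"ratio {attack_cost / benign_cost:.0f}x")  # ~325x, inside the 100-1000x band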
🔍 How EchelonGraph Detects This
EchelonGraph's Tier 1 Cloud Scanner automatically checks for this condition across all connected cloud accounts. Violations are flagged as high-severity findings with remediation guidance.
🔧 Remediation
- Cap the per-request token budget.
- Rate-limit per principal.
- Apply a quota per customer.
- Reject queries above the token threshold and input above the size threshold.
- Monitor for cost-anomaly patterns.
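A minimal application-layer sketch of these controls, assuming a single process; the names (check_request, MAX_INPUT_BYTES) and the in-memory sliding window are illustrative, and a production deployment would back the per-principal counters with a shared store:

import time
from collections import defaultdict

MAX_INPUT_BYTES = 16_384    # reject input above the size threshold outright
MAX_OUTPUT_TOKENS = 4_096   # per-request token cap passed to the model
RATE_LIMIT = 30             # requests per principal per window
WINDOW_SECONDS = 60

_history: dict[str, list[float]] = defaultdict(list)

def check_request(principal: str, prompt: str) -> int:
    """Validate one request; return the output-token budget or raise."""
    if len(prompt.encode("utf-8")) > MAX_INPUT_BYTES:
        raise ValueError("input exceeds size threshold")
    now = time.monotonic()
    recent = [t for t in _history[principal] if now - t < WINDOW_SECONDS]
    if len(recent) >= RATE_LIMIT:
        raise PermissionError("rate limit exceeded for principal")
    recent.append(now)
    _history[principal] = recent
    return MAX_OUTPUT_TOKENS  # forward as max_tokens so generation is bounded

Passing the returned budget as the model's max_tokens parameter ensures no single request can run unbounded, regardless of what the prompt asks for.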
💀 Real-World Attack Scenario
An LLM-based customer-support chatbot was hit with a barrage of 'write a 50,000-word essay on X' queries. Inference cost spiked 800% over 4 hours. The team deployed emergency rate limits, and legitimate customer experience degraded during the response. Total infrastructure cost spike: $34K in 4 hours.
💰 Cost of Non-Compliance
Average AI cost-spike incident in 2024: $42K (Anyscale). Customer-experience degradation during an incident: an average 0.8-point NPS drop (Forrester).
📋 Audit Questions
1. What is the per-request token cap?
2. What is the rate limit per principal?
3. Show me the cost-anomaly detection alert rule.
4. When did a cost-spike alert last fire?
🏗️ Infrastructure as Code Fix
# Set a per-request token cap and a per-principal quota at the API gateway.
resource "google_api_gateway_api_config" "llm" {
  provider      = google-beta   # API Gateway resources live in the beta provider
  api           = google_api_gateway_api.llm_inference.api_id
  api_config_id = "v1"

  openapi_documents {
    document {
      # The OpenAPI document declares the limits: max_tokens = 4096 and a
      # quota of 1000 requests/day/principal.
      contents = filebase64("openapi-with-quota.yaml")
      path     = "openapi-with-quota.yaml"
    }
  }
}
resource "prometheus_alert_rule" "llm_cost_spike" {
name = "llm_cost_spike"
expr = "sum(rate(llm_token_total[5m])) > 2 * sum(rate(llm_token_total[24h] offset 1d))"
for = "10m"
labels = { severity = "page" }
}⚡ Common Pitfalls
- ⛔ No per-request token cap
- ⛔ Rate-limiting by IP only; attackers rotate IPs (see the sketch below)
- ⛔ No cost-anomaly alert, so the incident is discovered via billing rather than monitoring
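A toy illustration of the IP-rotation pitfall (synthetic traffic; the 30-request limit is arbitrary): an attacker spreading 500 requests across 50 source IPs stays under any per-IP limit, while a limiter keyed on the authenticated principal sees all of them:

from collections import Counter

# 500 attack requests rotated across 50 source IPs, all on one API key.
requests = [(f"10.0.0.{i % 50}", "attacker-api-key") for i in range(500)]

per_ip = Counter(ip for ip, _ in requests)
per_principal = Counter(key for _, key in requests)

print(max(per_ip.values()))         # 10  -> never trips a 30-request IP limit
print(max(per_principal.values()))  # 500 -> trips a principal-keyed limit at once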
📈 Business Value
Unbounded-consumption defence prevents the most frequent LLM operational incident of 2024. It is material for any LLM application with paid inference.
⏱️ Effort Estimate
1-2 weeks for token caps + rate limits + cost-anomaly monitoring
EchelonGraph monitors per-workload cost, alerts on anomalies, and applies automatic rate limits.
🔗 Cross-Framework References
Automate compliance with OWASP LLM Top 10 LLM10 (Unbounded Consumption)
EchelonGraph continuously monitors this control across all your cloud accounts.