🇪🇺 EU AI Act · ART10-DATA-GOV · Rule: EUAIA-10-001 · Severity: high

Training, validation, and testing data governance

Description

Article 10 requires that data sets used for training, validation, and testing meet quality criteria: relevance, representativeness, accuracy, and completeness, with their statistical properties documented.

⚠️ Risk Impact

Article 10 is the 'garbage in, garbage out' clause. Models trained on poor-quality data produce bad outcomes, which in turn become Article 9 and Article 15 violations. Article 10 evidence is foundational to defending against an enforcement action.

🔍 How EchelonGraph Detects This

EUAIA-10-001 (automated scanner rule)

EchelonGraph's Tier 1 Cloud Scanner automatically checks for this condition across all connected cloud accounts. Violations are flagged as high-severity findings with remediation guidance.

🖥️ Manual Verification

terminal
find ./datasets -name 'DATASHEET.md' -mtime +180 # flag stale datasheets

🔧 Remediation

Per dataset: document provenance, collection methodology, demographic distribution, known biases, train/validation/test splits, version, and a cryptographic hash. Use a 'datasheets for datasets' approach (Gebru et al. 2018 template).
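The remediation above can be bootstrapped with a small script that hashes the dataset and emits a datasheet skeleton for humans to complete. This is a minimal sketch: the template fields follow the Gebru et al. structure, but the helper names (`sha256_of`, `write_datasheet`) and the file layout are our own assumptions, not an EchelonGraph or standard API.

```python
import hashlib
from pathlib import Path

# Skeleton following the 'datasheets for datasets' structure; TODOs are for humans.
DATASHEET_TEMPLATE = """# Datasheet: {name}
- Version: {version}
- SHA-256: {sha256}
- Provenance: {provenance}
- Collection methodology: TODO
- Demographic distribution: TODO
- Known biases: TODO
- Train/validation/test splits: TODO
"""

def sha256_of(path: Path) -> str:
    """Hash the dataset file in 1 MiB chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_datasheet(dataset: Path, version: str, provenance: str) -> str:
    """Write a skeleton DATASHEET.md next to the dataset file and return its text."""
    content = DATASHEET_TEMPLATE.format(
        name=dataset.name,
        version=version,
        sha256=sha256_of(dataset),
        provenance=provenance,
    )
    (dataset.parent / "DATASHEET.md").write_text(content)
    return content
```

The hash pins the exact bytes the datasheet describes, so a stale datasheet is detectable by re-hashing.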

💀 Real-World Attack Scenario

A loan-underwriting AI was trained on 2017-2019 application data. The training set under-represented applicants under 30 by 40%. Production performance for that segment was 22 points lower than for older applicants. When a CFPB inquiry asked, 'What is the demographic distribution of your training data?', the team had no documented answer. The probe lasted 14 months and ended in a $4.5M settlement.
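The under-representation in this scenario is catchable before deployment with a share-vs-benchmark check. The sketch below is hypothetical: the group labels, benchmark shares, and the 12%-vs-20% example numbers are illustrative, not data from the case (though a 12% observed share against a 20% benchmark is the same 40% relative shortfall).

```python
from collections import Counter

def representation_gap(train_groups, benchmark):
    """Compare each group's share in the training data against a benchmark
    population; negative gap means the group is under-represented."""
    counts = Counter(train_groups)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - expected_share
        for group, expected_share in benchmark.items()
    }

# Hypothetical: under-30 applicants are 12% of training rows but 20% of applicants.
rows = ["under_30"] * 12 + ["30_plus"] * 88
gaps = representation_gap(rows, {"under_30": 0.20, "30_plus": 0.80})
# gaps["under_30"] == -0.08, i.e. 8 points under-represented (a 40% relative shortfall)
```

Running this per protected attribute at dataset build time, and recording the result in the datasheet, is exactly the documented answer the inquiry was looking for.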

💰 Cost of Non-Compliance

Article 10 non-compliance: fines up to €15M or 3% of worldwide annual turnover, whichever is higher. CFPB enforcement actions citing inadequate training data documentation: 9 in 2024 (avg $4.2M settlement).

📋 Audit Questions

  1. Show me the datasheet for your highest-stakes model's training data.
  2. Was the demographic distribution of the training data measured?
  3. What is the data provenance chain? Who collected it, and under what consent?
  4. How are datasets versioned, and how is reproducibility maintained?
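Audit question 4 (versioning and reproducibility) has a concrete answer: pin every dataset file to a content hash in a manifest, and verify the manifest before each training run. This is a sketch under our own assumptions; the manifest format and function names are illustrative, not a prescribed EU AI Act artefact.

```python
import hashlib
from pathlib import Path

def build_manifest(dataset_dir: Path) -> dict:
    """Pin every file under a dataset directory to its SHA-256,
    so a training run can be tied to exact dataset contents."""
    return {
        str(p.relative_to(dataset_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(dataset_dir.rglob("*"))
        if p.is_file()
    }

def verify_manifest(dataset_dir: Path, manifest: dict) -> list:
    """Return the files whose current contents no longer match the pinned hashes."""
    current = build_manifest(dataset_dir)
    return [f for f, digest in manifest.items() if current.get(f) != digest]
```

Committing the manifest alongside the datasheet gives auditors a verifiable chain from model version back to the exact training bytes.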

🎯 MITRE ATLAS Mapping

MITRE_ATLAS-AML.T0020 — Poison Training Data

🏗️ Infrastructure as Code Fix

main.tf
# One version-controlled DATASHEET.md per training dataset in the compliance repo
resource "github_repository_file" "datasheet" {
  for_each   = toset(var.training_datasets)
  repository = "compliance-docs"
  file       = "datasets/${each.key}/DATASHEET.md"
  content    = file("${path.module}/datasheets/${each.key}.md")
}

⚡ Common Pitfalls

  • Documenting only the headline dataset (e.g. 'we trained on Common Crawl') — missing every preprocessing decision
  • Using public datasets without verifying licence terms and consent provenance
  • Not measuring demographic distribution of training data because 'we don't collect that' — Article 10 expects you to know

📈 Business Value

Article 10 data governance prevents the 'we didn't know our data was biased' defence (which doesn't work). Reduces fair-lending and fair-housing enforcement exposure by 70%+.

⏱️ Effort Estimate

Manual

1-2 weeks per dataset for datasheet authoring

With EchelonGraph

EchelonGraph auto-generates datasheet templates from dataset metadata; tracks freshness

🔗 Cross-Framework References

GDPR-Art5 · OWASP_LLM-LLM04 · ISO42001-8.4

Automate EU AI Act ART10-DATA-GOV compliance

EchelonGraph continuously monitors this control across all your cloud accounts.

Start Free →