Training, validation, and testing data governance
Description
Article 10 requires that data sets used for training, validation, and testing meet quality criteria (relevance, representativeness, accuracy, completeness) and that their statistical properties are documented.
⚠️ Risk Impact
Article 10 is the 'garbage in, garbage out' clause. Models trained on poor data produce bad outcomes — which become Article 9 and 15 violations. Article 10 evidence is foundational to defending an enforcement action.
🔍 How EchelonGraph Detects This
EchelonGraph's Tier 1 Cloud Scanner automatically checks for this condition across all connected cloud accounts. Violations are flagged as high-severity findings with remediation guidance.
🖥️ Manual Verification
```shell
find ./datasets -name 'DATASHEET.md' -mtime +180  # flag stale datasheets
```
🔧 Remediation
Per dataset: document provenance, collection methodology, demographic distribution, known biases, splits, version, cryptographic hash. Use a 'datasheets for datasets' approach (Gebru et al. 2018 template).
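Parts of that checklist can be automated. A minimal sketch, assuming a one-directory-per-dataset layout; the field names and the `write_datasheet` helper are illustrative, not EchelonGraph's actual template:

```python
# Hypothetical sketch: generate a datasheet stub with a cryptographic hash.
# Field names and file layout are assumptions, not a vendor template.
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_of_dir(dataset_dir: Path) -> str:
    """Hash every file under the dataset directory in sorted path order,
    so the digest changes whenever any file changes."""
    h = hashlib.sha256()
    for f in sorted(dataset_dir.rglob("*")):
        if f.is_file():
            h.update(f.name.encode())
            h.update(f.read_bytes())
    return h.hexdigest()

def write_datasheet(dataset_dir: Path, provenance: str, version: str) -> Path:
    sheet = {
        "dataset": dataset_dir.name,
        "version": version,
        "provenance": provenance,
        "sha256": sha256_of_dir(dataset_dir),
        "authored": date.today().isoformat(),
        # Fill these in manually per the 'datasheets for datasets' template:
        "collection_methodology": "TODO",
        "demographic_distribution": "TODO",
        "known_biases": "TODO",
        "splits": "TODO",
    }
    out = dataset_dir / "DATASHEET.md"
    out.write_text("# Datasheet\n\n" + json.dumps(sheet, indent=2) + "\n")
    return out
```

The hash is computed before the datasheet is written, so the recorded digest covers only the data files themselves; the TODO fields are the parts that genuinely require human authorship.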
💀 Real-World Attack Scenario
A loan-underwriting AI was trained on 2017-2019 application data. The training set under-represented applicants under 30 by 40%. Production performance for that segment was 22 points lower than for older applicants. CFPB inquiry: 'What is the demographic distribution of your training data?' — the team had no documented answer. Probe lasted 14 months; $4.5M settlement.
💰 Cost of Non-Compliance
Article 10 non-compliance: up to €15M / 3% revenue. CFPB enforcement actions citing inadequate training data documentation: 9 in 2024 (avg $4.2M settlement).
📋 Audit Questions
1. Show me the datasheet for your highest-stakes model's training data.
2. Was the demographic distribution of the training data measured?
3. What is the data provenance chain? Who collected it, and under what consent?
4. How are datasets versioned, and how is reproducibility maintained?
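The demographic-distribution question is answerable with a few lines of code. A hedged sketch of measuring group shares and flagging under-represented segments against a reference population; the `age_bracket` field, the 0.8 tolerance, and both helper names are assumptions for illustration:

```python
# Hypothetical sketch: compare training-set group shares against a
# reference population and flag under-represented groups.
from collections import Counter

def group_shares(records, key):
    """Fraction of records per value of `key` (e.g. an age bracket)."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def underrepresented(train_shares, population_shares, tolerance=0.8):
    """Groups whose training share falls below tolerance * population share.
    tolerance=0.8 flags any group under-represented by more than 20%."""
    return sorted(
        g for g, pop in population_shares.items()
        if train_shares.get(g, 0.0) < tolerance * pop
    )

# Toy data echoing the attack scenario above: under-30 applicants
# make up 12% of training data but 20% of the applicant population.
train = [{"age_bracket": "under_30"}] * 12 + [{"age_bracket": "30_plus"}] * 88
shares = group_shares(train, "age_bracket")
flags = underrepresented(shares, {"under_30": 0.20, "30_plus": 0.80})
# flags -> ["under_30"]
```

Recording this measurement in the datasheet, with the reference population it was compared against, is precisely the documented answer the CFPB inquiry in the scenario above was looking for.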
🏗️ Infrastructure as Code Fix
```hcl
resource "github_repository_file" "datasheet" {
  for_each   = toset(var.training_datasets)
  repository = "compliance-docs"
  file       = "datasets/${each.key}/DATASHEET.md"
  content    = file("${path.module}/datasheets/${each.key}.md")
}
```
⚡ Common Pitfalls
- ⛔Documenting only the headline dataset (e.g. 'we trained on Common Crawl') — missing every preprocessing decision
- ⛔Using public datasets without verifying licence terms and consent provenance
- ⛔Not measuring demographic distribution of training data because 'we don't collect that' — Article 10 expects you to know
📈 Business Value
Article 10 data governance prevents the 'we didn't know our data was biased' defence (which doesn't work). Reduces fair-lending and fair-housing enforcement exposure by 70%+.
⏱️ Effort Estimate
1-2 weeks per dataset for datasheet authoring
EchelonGraph auto-generates datasheet templates from dataset metadata and tracks their freshness.