Roadblock · Artificial Intelligence · Open

Training data quality and curation

The quality, composition, and provenance of training data fundamentally determine model capabilities and limitations. Training on synthetic data risks model collapse when models are repeatedly trained on their own outputs. Benchmark contamination, where evaluation items leak into the training corpus, undermines evaluation reliability. The 'data wall' hypothesis suggests that high-quality human-generated text on the open web may be approaching exhaustion. Principled data mixing strategies, decontamination methods, and quality filtering at web scale are critical but remain under-studied compared to architectural research.
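As one illustration of what a decontamination method can look like, the sketch below drops any training document that shares a word-level n-gram with a benchmark item. It is a minimal example under stated assumptions, not a method described on this page: the function names (`ngrams`, `build_benchmark_index`, `decontaminate`), the whitespace tokenization, and the default `n=8` are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(docs: Iterable[str], benchmark_texts: Iterable[str], n: int = 8) -> List[str]:
    """Keep only documents that share no n-gram with the benchmark (assumed criterion)."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in docs if ngrams(doc, n).isdisjoint(index)]


if __name__ == "__main__":
    benchmark = ["What is the capital of France? The answer is Paris."]
    docs = [
        "Trivia dump: what is the capital of france? the answer is paris. More text.",
        "An unrelated document about gardening and soil quality in spring.",
    ]
    for doc in decontaminate(docs, benchmark, n=5):
        print(doc)
```

Exact n-gram matching is simple and cheap to index, but it misses paraphrased or reformatted benchmark items, and a small `n` will flag common phrases as contamination. Web-scale pipelines would also need a more careful tokenizer and a memory-efficient index than the in-memory set assumed here.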

Recent papers / Artificial Intelligence

Uncertainty analysis in digital twins and integration of aleatory uncertainties for virtual entity models

June 10, 2026 · openalex

G-SENSE: Generalized Sensorless External Force Estimation for Humanoid Robots via Centroidal Dynamics

June 10, 2026 · openalex