Sustainable Green Computing and Carbon-Aware Artificial Intelligence
June 10, 2026 (openalex)
The quality, composition, and provenance of training data fundamentally determine a model's capabilities and limitations. Training on synthetic data risks model collapse when models are repeatedly trained on their own outputs. Benchmark contamination, in which test items leak into the training corpus, undermines the reliability of evaluation. The 'data wall' hypothesis holds that the supply of high-quality human-generated text on the open web may be approaching exhaustion. Principled data mixing strategies, decontamination methods, and web-scale quality filtering are critical yet remain under-studied relative to architectural research.
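As a rough illustration of what a decontamination method can look like, the sketch below drops any training document that shares a word-level n-gram with a benchmark item. It is a minimal example under stated assumptions, not a method described in this text: the function names, the whitespace tokenizer, and the default window of 8 tokens are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs in any benchmark item."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(document: str, index: Set[NGram], n: int) -> bool:
    """A document is flagged if it shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(index)


def decontaminate(corpus: Iterable[str],
                  benchmark_texts: Iterable[str],
                  n: int = 8) -> List[str]:
    """Keep only the corpus documents with no n-gram overlap (n is an assumed default)."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in corpus if not is_contaminated(doc, index, n)]


if __name__ == "__main__":
    benchmark = ["What is the capital of France and why is it famous"]
    corpus = [
        "Quiz answer key: what is the capital of France and why is it famous today",
        "The weather in Lyon was mild throughout the spring season this year",
    ]
    for kept in decontaminate(corpus, benchmark):
        print(kept)
```

Exact n-gram matching is cheap and easy to audit, but it misses paraphrased or reformatted benchmark items. Shorter windows catch more leaks at the cost of more false positives.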