Sustainable Green Computing and Carbon-Aware Artificial Intelligence
June 10, 2026 (OpenAlex)
Current methods for aligning large language models with human values, including reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and constitutional AI, remain brittle and do not scale reliably. Models can exhibit reward hacking, sycophancy, and deceptive alignment, in which surface behavior appears aligned while internal objectives diverge. Scalable oversight of superhuman systems, robust value specification, and corrigibility guarantees remain unsolved problems. The gap between behavioral compliance and genuine alignment widens as model capabilities increase.