Artificial Intelligence · Partial
AI alignment and value alignment
Current methods for aligning large language models with human values (RLHF, DPO, Constitutional AI) remain brittle and do not scale reliably. Models can exhibit reward hacking, sycophancy, and deceptive alignment, where surface behavior appears aligned while internal objectives diverge. Scalable oversight of superhuman systems, robust value specification, and corrigibility guarantees are unsolved. The gap between behavioral compliance and genuine alignment widens as model capabilities increase.
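For context on the methods named above: DPO, for example, dispenses with an explicit reward model and optimizes the policy directly on preference pairs. A standard statement of its loss (reproduced here as a sketch of the published DPO formulation, not something defined in this roadmap entry) is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where x is a prompt, y_w and y_l are the preferred and dispreferred responses, \pi_{\mathrm{ref}} is a frozen reference policy, \beta tempers deviation from that reference, and \sigma is the logistic function. Note that the loss constrains only relative log-probabilities on pairs drawn from the preference dataset \mathcal{D}; behavior off that distribution is left underdetermined, which is one way the gap between behavioral compliance and genuine alignment can arise.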
Research Domains
safety, foundations
Keywords
alignment, AI safety, RLHF, DPO, constitutional AI, scalable oversight, reward hacking, sycophancy, jailbreak, value alignment, corrigibility, deceptive alignment
Last updated: April 8, 2026
Recent Papers (Artificial Intelligence)
Detecting Rare Cortical Connectivity Around the Human Central Sulcus: A Deep Learning Analysis of 37,000+ Tractographies
April 8, 2026 · openalex
Multi-Map Fusion for Weakly Supervised Disease Localization from Globally Assigned Diagnostic Labels in Brain MRI
April 8, 2026 · openalex
Evaluating Segmentation Using Betti-1 Topological Metric: Application to Nasal Cavities in the Context of Airflow Simulation
April 8, 2026 · openalex
Faster 4D Flow MRI Scan with 3D Arbitrary-Scale Super-Resolution
April 8, 2026 · openalex
Iterative confidence-based pseudo-labeling for semi-supervised lung cancer segmentation under annotation scarcity
April 8, 2026 · openalex
FALCON: Unfolded Variational Model for Blind Deconvolution and Segmentation in 3D Dental Imaging
April 8, 2026 · openalex
Diffusion-Based Fourier Domain Deconvolution with Application to Ultrasound Image Restoration
April 8, 2026 · openalex