
AI alignment

From Wikipedia, the free encyclopedia

AI conformance to the intended objective


In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.^[1]

It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.^[1]^[2] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).^[1]^[3]
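The gap between a proxy goal and the intended objective can be illustrated with a toy sketch. It is not drawn from the cited sources: the action names, the reward numbers, and the helper functions `best_by_proxy` and `best_by_intent` are all hypothetical, chosen only to show how a system that maximizes a measurable proxy (here, an approval score) can select an action the designer did not want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    proxy_reward: float    # what the designer can measure, e.g. human approval
    intended_value: float  # what the designer actually wants; unseen by the agent


# Hypothetical actions and scores, invented purely for illustration.
ACTIONS = [
    Action("solve the task correctly", proxy_reward=0.8, intended_value=1.0),
    Action("give a convincing but wrong answer", proxy_reward=0.9, intended_value=-0.5),
    Action("do nothing", proxy_reward=0.0, intended_value=0.0),
]


def best_by_proxy(actions):
    """Return the action a pure proxy-maximizer would choose."""
    return max(actions, key=lambda a: a.proxy_reward)


def best_by_intent(actions):
    """Return the action the designer would have chosen."""
    return max(actions, key=lambda a: a.intended_value)


chosen = best_by_proxy(ACTIONS)
wanted = best_by_intent(ACTIONS)
print(f"proxy-optimal action: {chosen.name}")
print(f"intended action:      {wanted.name}")
```

In this sketch the proxy-maximizer picks the action that merely appears good, because the proxy score rewards appearance rather than correctness. Real systems are far more complex, but the selection pressure is of the same kind.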

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or survival, because such strategies help them achieve their assigned final goals.^[1]^[4]^[5] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.^[6]^[7] Empirical research in 2024 showed that advanced large language models (LLMs) such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or prevent them from being changed.^[8]^[9]

Today, some of these issues affect existing commercial systems such as LLMs,^[10]^[11]^[12] robots,^[13] autonomous vehicles,^[14] and social media recommendation engines.^[10]^[5]^[15] Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities.^[16]^[3]^[2]

Many prominent AI researchers and the leadership of major AI companies have argued or asserted that AI is approaching human-like (AGI) and superhuman cognitive capabilities (ASI), and could endanger human civilization if misaligned.^[17]^[5] These include "AI Godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind.^[18]^[19]^[20] These risks remain debated.^[21]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.^[22] Other subfields of AI safety include robustness, monitoring, and capability control.^[23] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.^[23] Alignment research has connections to interpretability research,^[24]^[25] (adversarial) robustness,^[22] anomaly detection, calibrated uncertainty,^[24] formal verification,^[26] preference learning,^[27]^[28]^[29] safety-critical engineering,^[30] game theory,^[31] algorithmic fairness,^[22]^[32] and social sciences.^[33]^[34]



Objectives in AI

Alignment problem

Specification gaming and side effects

Pressure to deploy unsafe systems

Risks from advanced misaligned AI

Development of advanced AI

Power-seeking

Existential risk (x-risk)

Research problems and approaches

Learning human values and preferences

Scalable oversight

Honest AI

Alignment faking

Power-seeking and instrumental strategies

Emergent goals

Embedded agency

Principal-agent problems

Conservatism

Public policy

Dynamic nature of alignment

See also

Footnotes

References

Further reading

External links