
Deep reinforcement learning


From Wikipedia, the free encyclopedia


Machine learning that combines deep learning and reinforcement learning


Deep reinforcement learning (deep RL) is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs (e.g. every pixel rendered to the screen in a video game) and decide what actions to perform to optimize an objective (e.g. maximizing the game score). Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision,^[1] education, transportation, finance and healthcare.^[2]

Overview


Deep learning


Depiction of a basic artificial neural network

Deep learning is a form of machine learning that transforms a set of inputs into a set of outputs via an artificial neural network. Deep learning methods, often using supervised learning with labeled datasets, have been shown to solve tasks that involve handling complex, high-dimensional raw input data (such as images) with less manual feature engineering than prior methods, enabling significant progress in several fields including computer vision and natural language processing. In the past decade, deep RL has achieved remarkable results on a range of problems, from single and multiplayer games such as Go, Atari games, and Dota 2 to robotics.^[3]

Reinforcement learning


Reinforcement learning is a process in which an agent learns to make decisions through trial and error. This problem is often modeled mathematically as a Markov decision process (MDP), where an agent at every timestep is in a state $s$, takes action $a$, receives a scalar reward and transitions to the next state $s'$ according to environment dynamics $p(s'|s,a)$. The agent attempts to learn a policy $\pi(a|s)$, or map from observations to actions, in order to maximize its returns (expected sum of rewards). In reinforcement learning (as opposed to optimal control) the algorithm only has access to the dynamics $p(s'|s,a)$ through sampling.
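As a rough illustration of this interaction loop, the sketch below samples transitions from a tiny, made-up two-state MDP and estimates a policy's return by averaging over sampled episodes. The transition table, the discount factor `gamma`, the finite horizon, and all function names are assumptions chosen for the example, not part of any standard library.

```python
import random

# Hypothetical two-state, two-action MDP used only for illustration.
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def step(state, action, rng):
    """Sample (next_state, reward) from the dynamics p(s'|s,a)."""
    outcomes = TRANSITIONS[state][action]
    u = rng.random()
    cumulative = 0.0
    for prob, next_state, reward in outcomes:
        cumulative += prob
        if u < cumulative:
            return next_state, reward
    # Guard against floating-point round-off in the cumulative sum.
    _, next_state, reward = outcomes[-1]
    return next_state, reward


def rollout(policy, horizon, gamma, rng, start_state=0):
    """Run one episode and return its discounted sum of rewards."""
    state, total, discount = start_state, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = step(state, action, rng)
        total += discount * reward
        discount *= gamma  # gamma is an assumed discount factor
    return total


def estimate_return(policy, episodes=2000, horizon=20, gamma=0.9, seed=0):
    """Monte Carlo estimate of the expected return, using samples only."""
    rng = random.Random(seed)
    returns = [rollout(policy, horizon, gamma, rng) for _ in range(episodes)]
    return sum(returns) / episodes
```

Note that `estimate_return` never reads `TRANSITIONS` directly; it only calls `step`, mirroring the point that an RL algorithm sees the dynamics through sampling alone.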

Deep reinforcement learning


In many practical decision-making problems, the states $s$ of the MDP are high-dimensional (e.g., images from a camera or the raw sensor stream from a robot), so the problem cannot be solved by traditional RL algorithms. Deep reinforcement learning algorithms incorporate deep learning to solve such MDPs, often representing the policy $\pi(a|s)$ or other learned functions as a neural network and developing specialized algorithms that perform well in this setting.
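To make "representing the policy as a neural network" concrete, here is a minimal sketch of a one-hidden-layer network that maps an observation vector to action probabilities. The layer sizes, the tanh activation, the weight initialization, and the function names are all illustrative assumptions; practical systems would typically use a deep-learning framework and, for image inputs, convolutional layers.

```python
import math
import random


def init_policy(obs_dim, hidden_dim, n_actions, seed=0):
    """Create small random weights for a two-layer policy network."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, obs_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(n_actions, hidden_dim), "b2": [0.0] * n_actions,
    }


def affine(weights, bias, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def policy_probs(params, obs):
    """Return pi(a|s) for every discrete action a, given observation s."""
    hidden = [math.tanh(h) for h in affine(params["w1"], params["b1"], obs)]
    logits = affine(params["w2"], params["b2"], hidden)
    # Softmax with the maximum subtracted for numerical stability.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Training would adjust `params` so that actions leading to higher returns become more probable; the sketch covers only the forward pass.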

History


Along with rising interest in neural networks beginning in the mid-1980s, interest grew in deep reinforcement learning, where a neural network is used in reinforcement learning to represent policies or value functions. Because in such a system the entire decision-making process from sensors to motors in a robot or agent involves a single neural network, it is also sometimes called end-to-end reinforcement learning.^[4] One of the first successful applications of reinforcement learning with neural networks was TD-Gammon, a computer program developed in 1992 for playing backgammon.^[5] Four inputs were used for the number of pieces of a given color at a given location on the board, totaling 198 input signals. With zero knowledge built in, the network learned to play the game at an intermediate level by self-play and TD($\lambda$).

Seminal textbooks by Sutton and Barto on reinforcement learning,^[6] Bertsekas and Tsitsiklis on neuro-dynamic programming,^[7] and others^[8] advanced knowledge and interest in the field.

Katsunari Shibata's group showed that various functions emerge in this framework,^[9]^[10]^[11] including image recognition, color constancy, sensor motion (active recognition), hand-eye coordination and hand reaching movement, explanation of brain activities, knowledge transfer, memory,^[12] selective attention, prediction, and exploration.^[10]^[13]

Starting around 2012, the so-called deep learning revolution led to increased interest in using deep neural networks as function approximators across a variety of domains. This led to renewed interest among researchers in using deep neural networks to learn the policy, value, and/or Q functions present in existing reinforcement learning algorithms.

Beginning around 2013, DeepMind showed impressive learning results using deep RL to play Atari video games.^[14]^[15] The computer player was a neural network trained using a deep RL algorithm, a deep version of Q-learning they termed deep Q-networks (DQN), with the game score as the reward. They used a deep convolutional neural network to process four stacked frames of 84×84 pixels as input. All 49 games were learned using the same network architecture and with minimal prior knowledge, outperforming competing methods on almost all the games and performing at a level comparable or superior to a professional human game tester.^[15]

Deep reinforcement learning reached another milestone in 2015 when AlphaGo,^[16] a computer program trained with deep RL to play Go, became the first computer Go program to beat a human professional Go player without handicap on a full-sized 19×19 board. In a subsequent project in 2017, AlphaZero improved performance on Go while also demonstrating that the same algorithm could learn to play chess and shogi at a level competitive or superior to existing computer programs for those games; this was improved again in 2019 with MuZero.^[17] Separately, another milestone was achieved in 2019 by researchers from Carnegie Mellon University, who developed Pluribus, a computer program to play poker that was the first to beat professionals at multiplayer games of no-limit Texas hold 'em. OpenAI Five, a program for playing five-on-five Dota 2, beat the previous world champions in a demonstration match in 2019.

Deep reinforcement learning has also been applied to many domains beyond games. In robotics, it has been used to let robots perform simple household tasks^[18] and solve a Rubik's cube with a robot hand.^[19]^[20] Deep RL has also found sustainability applications, where it has been used to reduce energy consumption at data centers.^[21] Deep RL for autonomous driving is an active area of research in academia and industry.^[22] Loon explored deep RL for autonomously navigating their high-altitude balloons.^[23]

Algorithms


Various techniques exist to train policies to solve tasks with deep reinforcement learning algorithms, each having their own benefits. At the highest level, there is a distinction between model-based and model-free reinforcement learning, which refers to whether the algorithm attempts to learn a forward model of the environment dynamics.

In model-based deep reinforcement learning algorithms, a forward model of the environment dynamics is estimated, usually by supervised learning using a neural network. Then, actions are obtained by using model predictive control with the learned model. Since the true environment dynamics will usually diverge from the learned dynamics, the agent re-plans often when carrying out actions in the environment. The actions selected may be optimized using Monte Carlo methods such as the cross-entropy method, or a combination of model-learning with model-free methods.
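The plan-then-replan loop can be sketched as follows. This is only an illustration under strong assumptions: the "learned" model is a hand-written one-dimensional stand-in rather than a trained neural network, the true environment is taken to match it exactly, and the cost function, action bounds, and hyperparameters are invented for the example.

```python
import random
import statistics


def model(state, action):
    """Stand-in for a learned forward model: a 1-D point moved by the action."""
    return state + action


def cost(state, goal):
    """Squared distance to the goal (an assumed task cost)."""
    return (state - goal) ** 2


def clip(action, low=-1.0, high=1.0):
    return max(low, min(high, action))


def plan_cem(state, goal, rng, horizon=5, population=64, elites=8, iterations=5):
    """Cross-entropy method: return the first action of the refined plan."""
    means = [0.0] * horizon
    stds = [1.0] * horizon
    for _ in range(iterations):
        scored = []
        for _ in range(population):
            plan = [rng.gauss(m, s) for m, s in zip(means, stds)]
            s, total = state, 0.0
            for a in plan:
                s = model(s, clip(a))      # imagine the outcome with the model
                total += cost(s, goal)
            scored.append((total, plan))
        scored.sort(key=lambda pair: pair[0])
        best = [plan for _, plan in scored[:elites]]
        # Refit the sampling distribution to the lowest-cost plans.
        for t in range(horizon):
            column = [plan[t] for plan in best]
            means[t] = statistics.fmean(column)
            stds[t] = statistics.pstdev(column) + 1e-3
    return clip(means[0])


def run_mpc(start=0.0, goal=3.0, steps=10, seed=0):
    """Model predictive control: execute one action, then re-plan."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        action = plan_cem(state, goal, rng)
        state = state + action  # true dynamics; assumed equal to the model here
    return state
```

Only the first action of each plan is executed before planning again, which is what lets the controller recover when the learned model and the real dynamics disagree.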

In model-free deep reinforcement learning algorithms, a policy $\pi(a|s)$ is learned without explicitly modeling the forward dynamics. A policy can be optimized to maximize returns by directly estimating the policy gradient,^[24] but this estimate suffers from high variance, making it impractical for use with function approximation in deep RL. Subsequent algorithms have been developed for more stable learning and have been widely applied.^[25]^[26] Another class of model-free deep reinforcement learning algorithms relies on dynamic programming, inspired by temporal difference learning and Q-learning. In discrete action spaces, these algorithms usually learn a neural network Q-function $Q(s,a)$ that estimates the future returns from taking action $a$ in state $s$.^[14] In continuous spaces, these algorithms often learn both a value estimate and a policy.^[27]^[28]^[29]
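The temporal-difference update behind Q-learning can be illustrated with a small tabular example. The five-state chain environment, the epsilon-greedy exploration rule, and the hyperparameters below are assumptions made for the sketch; a deep variant would replace the table `q` with a neural network trained on the same kind of target.

```python
import random

N_STATES = 5       # hypothetical chain 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def env_step(state, action):
    """Deterministic chain dynamics with reward 1 on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning; returns the learned table q[state][action]."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state])                  # exploit, random tie-break
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = env_step(state, action)
            # Bootstrapped target: reward plus discounted best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

After training, the greedy action in every non-terminal state should be "right", since that is the shortest path to the rewarding state.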

Research


Deep reinforcement learning is an active area of research, with several lines of inquiry.

Exploration


An RL agent must balance the exploration/exploitation tradeoff: the problem of deciding whether to pursue actions that are already known to yield high rewards or explore other actions in order to discover higher rewards. RL agents usually collect data with some type of stochastic policy, such as a Boltzmann distribution in discrete action spaces or a Gaussian distribution in continuous action spaces, inducing basic exploration behavior. The idea behind novelty-based, or curiosity-driven, exploration is giving the agent a motive to explore unknown outcomes in order to find the best solutions. This is done by "modify[ing] the loss function (or even the network architecture) by adding terms to incentivize exploration".^[30] An agent may also be aided in exploration by utilizing demonstrations of successful trajectories, or by reward-shaping, which gives an agent intermediate rewards that are customized to fit the task it is attempting to complete.^[31]
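A Boltzmann (softmax) policy over discrete actions might look like the sketch below. It assumes the agent already has a numeric preference or Q-value estimate per action; the `temperature` parameter and the function names are illustrative choices rather than a fixed convention.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Turn per-action value estimates into action probabilities."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, temperature, rng):
    """Draw an action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A high temperature makes the distribution close to uniform (more exploration), while a low temperature concentrates probability on the highest-valued action (more exploitation).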

Off-policy reinforcement learning


An important distinction in RL is the difference between on-policy algorithms, which require evaluating or improving the policy that collects data, and off-policy algorithms, which can learn a policy from data generated by an arbitrary policy. Generally, value-function based methods such as Q-learning are better suited for off-policy learning and have better sample efficiency: the amount of data required to learn a task is reduced because data is re-used for learning. At the extreme, offline (or "batch") RL considers learning a policy from a fixed dataset without additional interaction with the environment.
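One common way off-policy methods re-use data is a replay buffer, which stores past transitions and serves random minibatches to the learner. The text above does not prescribe this structure; the class below, including its name, fields, and fixed capacity, is an assumed minimal sketch.

```python
import random
from collections import deque, namedtuple

# One stored interaction with the environment.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size store of transitions collected by any behavior policy."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)  # oldest items are evicted first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a uniformly sampled minibatch without replacement."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Because sampled transitions may come from older policies, the learner has to be off-policy for this reuse to be valid. In the offline setting the buffer would be filled once from a fixed dataset and never appended to again.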

Inverse reinforcement learning


Inverse RL refers to inferring the reward function of an agent given the agent's behavior. Inverse reinforcement learning can be used for learning from demonstrations (or apprenticeship learning) by inferring the demonstrator's reward and then optimizing a policy to maximize returns with RL. Deep learning approaches have been used for various forms of imitation learning and inverse RL.^[32]

Goal-conditioned reinforcement learning


Another active area of research is in learning goal-conditioned policies, also called contextual or universal policies $\pi(a|s,g)$, that take in an additional goal $g$ as input to communicate a desired aim to the agent.^[33] Hindsight experience replay is a method for goal-conditioned RL that involves storing and learning from previous failed attempts to complete a task.^[34] While a failed attempt may not have reached the intended goal, it can serve as a lesson for how to achieve the unintended result through hindsight relabeling.
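Hindsight relabeling can be sketched as rewriting a stored trajectory so that the state it actually ended in is treated as the goal. The version below uses the trajectory's final state as the substitute goal, which is one possible relabeling strategy; the dictionary layout, the sparse reward function, and the names are assumptions for illustration.

```python
def sparse_reward(state, goal):
    """Assumed goal-reaching reward: 0 when the goal is met, -1 otherwise."""
    return 0.0 if state == goal else -1.0


def relabel_with_achieved_goal(trajectory, reward_fn=sparse_reward):
    """Return a copy of the trajectory whose goal is the state actually reached.

    Each step is a dict with keys: state, action, next_state, goal, reward.
    """
    if not trajectory:
        return []
    achieved = trajectory[-1]["next_state"]
    relabeled = []
    for step in trajectory:
        new_step = dict(step, goal=achieved)  # copy, then swap in the new goal
        new_step["reward"] = reward_fn(step["next_state"], achieved)
        relabeled.append(new_step)
    return relabeled
```

Both the original and the relabeled transitions can then be stored for training, so that an episode that failed at its intended goal still provides a successful example for the goal it happened to reach.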

Multi-agent reinforcement learning


Many applications of reinforcement learning do not involve just a single agent, but rather a collection of agents that learn together and co-adapt. These agents may be competitive, as in many games, or cooperative, as in many real-world multi-agent systems. Multi-agent reinforcement learning studies the problems introduced in this setting.

Generalization


The promise of using deep learning tools in reinforcement learning is generalization: the ability to operate correctly on previously unseen inputs. For instance, neural networks trained for image recognition can recognize that a picture contains a bird even if they have never seen that particular image or even that particular bird. Since deep RL allows raw data (e.g. pixels) as input, there is a reduced need to predefine the environment, allowing the model to be generalized to multiple applications. With this layer of abstraction, deep reinforcement learning algorithms can be designed in a way that allows them to be general, so that the same model can be used for different tasks.^[35] One method of increasing the ability of policies trained with deep RL to generalize is to incorporate representation learning.

Deep RL for financial decision-making

There has been a growing body of research on using deep RL for financial problems, particularly portfolio optimization. Traditional approaches like modern portfolio theory (MPT) rely on mean-variance optimization to balance risk and return. However, they often lack the adaptability required in volatile markets. Deep RL, on the other hand, reframes the problem as a dynamic decision-making process using frameworks like Markov decision processes (MDPs) or partially observed Markov decision processes (POMDPs).

This approach allows a deep RL agent to continuously interact with the market, making decisions to maximize long-term returns based on evolving data. Essential components of deep RL models, such as state and action spaces, reward functions, and policy optimization techniques, play a crucial role in this adaptability. Models like deep deterministic policy gradient (DDPG) and proximal policy optimization (PPO) stand out for their application in continuous action spaces and their potential in handling the complexity of financial markets.^[36]^[37]^[38]

Implementation of deep RL in the domain of financial problems remains an evolving area of research.

References


^ Le, Ngan; Rathour, Vidhiwar Singh; Yamazaki, Kashu; Luu, Khoa; Savvides, Marios (2022-04-01). "Deep reinforcement learning in computer vision: a comprehensive survey". Artificial Intelligence Review. 55 (4): 2733–2819. arXiv:2108.11510. doi:10.1007/s10462-021-10061-9. ISSN 1573-7462.
^ Francois-Lavet, Vincent; Henderson, Peter; Islam, Riashat; Bellemare, Marc G.; Pineau, Joelle (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning. 11 (3–4): 219–354. arXiv:1811.12560. Bibcode:2018arXiv181112560F. doi:10.1561/2200000071. ISSN 1935-8237. S2CID 54434537.
^ Graesser, Laura. "Foundations of Deep Reinforcement Learning: Theory and Practice in Python". Open Library Telkom University. Retrieved 2023-07-01.
^ Hassabis, Demis (March 11, 2016). Artificial Intelligence and the Future (Speech).
^ Tesauro, Gerald (March 1995). "Temporal Difference Learning and TD-Gammon". Communications of the ACM. 38 (3): 58–68. doi:10.1145/203330.203343. S2CID 8763243.
^ Sutton, Richard; Barto, Andrew (1998). Reinforcement Learning: An Introduction. MIT Press.
^ Bertsekas, Dimitri; Tsitsiklis, John (September 1996). Neuro-Dynamic Programming. Athena Scientific. ISBN 1-886529-10-8.
^ Miller, W. Thomas; Werbos, Paul; Sutton, Richard (1990). Neural Networks for Control.
^ Shibata, Katsunari; Okabe, Yoichi (1997). Reinforcement Learning When Visual Sensory Signals are Directly Given as Inputs (PDF). International Conference on Neural Networks (ICNN) 1997. Archived from the original (PDF) on 2020-12-09. Retrieved 2020-12-01.
^ Shibata, Katsunari; Iida, Masaru (2003). Acquisition of Box Pushing by Direct-Vision-Based Reinforcement Learning (PDF). SICE Annual Conference 2003. Archived from the original (PDF) on 2020-12-09. Retrieved 2020-12-01.
^ Shibata, Katsunari (March 7, 2017). "Functions that Emerge through End-to-End Reinforcement Learning". arXiv:1703.02239 [cs.AI].
^ Utsunomiya, Hiroki; Shibata, Katsunari (2008). Contextual Behavior and Internal Representations Acquired by Reinforcement Learning with a Recurrent Neural Network in a Continuous State and Action Space Task (PDF). International Conference on Neural Information Processing (ICONIP) '08. Archived from the original (PDF) on 2017-08-10. Retrieved 2020-12-14.
^ Shibata, Katsunari; Kawano, Tomohiko (2008). Learning of Action Generation from Raw Camera Images in a Real-World-like Environment by Simple Coupling of Reinforcement Learning and a Neural Network (PDF). International Conference on Neural Information Processing (ICONIP) '08. Archived from the original (PDF) on 2020-12-11. Retrieved 2020-12-01.
^ Mnih, Volodymyr; et al. (December 2013). Playing Atari with Deep Reinforcement Learning (PDF). NIPS Deep Learning Workshop 2013.
^ Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. PMID 25719670. S2CID 205242740.
^ Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; Driessche, George van den; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc; Dieleman, Sander; Grewe, Dominik; Nham, John; Kalchbrenner, Nal; Sutskever, Ilya; Lillicrap, Timothy; Leach, Madeleine; Kavukcuoglu, Koray; Graepel, Thore; Hassabis, Demis (28 January 2016). "Mastering the game of Go with deep neural networks and tree search". Nature. 529 (7587): 484–489. Bibcode:2016Natur.529..484S. doi:10.1038/nature16961. ISSN 0028-0836. PMID 26819042. S2CID 515925.
^ Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; Lillicrap, Timothy; Silver, David (23 December 2020). "Mastering Atari, Go, chess and shogi by planning with a learned model". Nature. 588 (7839): 604–609. arXiv:1911.08265. Bibcode:2020Natur.588..604S. doi:10.1038/s41586-020-03051-4. PMID 33361790. S2CID 208158225.
^ Levine, Sergey; Finn, Chelsea; Darrell, Trevor; Abbeel, Pieter (January 2016). "End-to-end training of deep visuomotor policies" (PDF). JMLR. 17. arXiv:1504.00702.
^ "OpenAI - Solving Rubik's Cube With A Robot Hand". OpenAI. 5 January 2021.
^ OpenAI; et al. (2019). Solving Rubik's Cube with a Robot Hand. arXiv:1910.07113.
^ "DeepMind AI Reduces Google Data Centre Cooling Bill by 40%". DeepMind. 14 May 2024.
^ "Machine Learning for Autonomous Driving Workshop @ NeurIPS 2021". NeurIPS 2021. December 2021.
^ Bellemare, Marc; Candido, Salvatore; Castro, Pablo; Gong, Jun; Machado, Marlos; Moitra, Subhodeep; Ponda, Sameera; Wang, Ziyu (2 December 2020). "Autonomous navigation of stratospheric balloons using reinforcement learning". Nature. 588 (7836): 77–82. Bibcode:2020Natur.588...77B. doi:10.1038/s41586-020-2939-8. PMID 33268863. S2CID 227260253.
^ Williams, Ronald J (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning". Machine Learning. 8 (3–4): 229–256. doi:10.1007/BF00992696. S2CID 2332513.
^ Schulman, John; Levine, Sergey; Moritz, Philipp; Jordan, Michael; Abbeel, Pieter (2015). Trust Region Policy Optimization. International Conference on Machine Learning (ICML). arXiv:1502.05477.
^ Schulman, John; Wolski, Filip; Dhariwal, Prafulla; Radford, Alec; Klimov, Oleg (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
^ Lillicrap, Timothy; Hunt, Jonathan; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2016). Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR). arXiv:1509.02971.
^ Mnih, Volodymyr; Puigdomenech Badia, Adria; Mirza, Mehdi; Graves, Alex; Harley, Tim; Lillicrap, Timothy; Silver, David; Kavukcuoglu, Koray (2016). Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning (ICML). arXiv:1602.01783.
^ Haarnoja, Tuomas; Zhou, Aurick; Levine, Sergey; Abbeel, Pieter (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning (ICML). arXiv:1801.01290.
^ Reizinger, Patrik; Szemenyei, Márton (2019-10-23). "Attention-Based Curiosity-Driven Exploration in Deep Reinforcement Learning". ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3542–3546. arXiv:1910.10840. doi:10.1109/ICASSP40776.2020.9054546. ISBN 978-1-5090-6631-5. S2CID 204852215.
^ Wiewiora, Eric (2010), "Reward Shaping", in Sammut, Claude; Webb, Geoffrey I. (eds.), Encyclopedia of Machine Learning, Boston, MA: Springer US, pp. 863–865, doi:10.1007/978-0-387-30164-8_731, ISBN 978-0-387-30164-8, retrieved 2020-11-16
^ Wulfmeier, Markus; Ondruska, Peter; Posner, Ingmar (2015). "Maximum Entropy Deep Inverse Reinforcement Learning". arXiv:1507.04888 [cs.LG].
^ Schaul, Tom; Horgan, Daniel; Gregor, Karol; Silver, David (2015). Universal Value Function Approximators. International Conference on Machine Learning (ICML).
^ Andrychowicz, Marcin; Wolski, Filip; Ray, Alex; Schneider, Jonas; Fong, Rachel; Welinder, Peter; McGrew, Bob; Tobin, Josh; Abbeel, Pieter; Zaremba, Wojciech (2018). Hindsight Experience Replay. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1707.01495.
^ Packer, Charles; Gao, Katelyn; Kos, Jernej; Krähenbühl, Philipp; Koltun, Vladlen; Song, Dawn (2019-03-15). "Assessing Generalization in Deep Reinforcement Learning". arXiv:1810.12282 [cs.LG].
^ Jiang, Yifu; Olmo, Jose; Atwi, Majed (2024-09-01). "Deep reinforcement learning for portfolio selection". Global Finance Journal. 62: 101016. doi:10.1016/j.gfj.2024.101016. ISSN 1044-0283.
^ Choudhary, Himanshu; Orra, Arishi; Sahoo, Kartik; Thakur, Manoj (2025-05-26). "Risk-Adjusted Deep Reinforcement Learning for Portfolio Optimization: A Multi-reward Approach". International Journal of Computational Intelligence Systems. 18 (1): 126. doi:10.1007/s44196-025-00875-8. ISSN 1875-6883.
^ Avramelou, Loukia; Nousi, Paraskevi; Passalis, Nikolaos; Tefas, Anastasios (2024-03-15). "Deep reinforcement learning for financial trading using multi-modal features". Expert Systems with Applications. 238: 121849. doi:10.1016/j.eswa.2023.121849. ISSN 0957-4174.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Deep_reinforcement_learning&oldid=1323869799"
