Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7188)
Abstract
In the framework of Markov Decision Processes, we consider the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We describe a systematic approach for adapting on-policy least-squares learning algorithms from the literature (LSTD [5], LSPE [15], FPKF [7], and GPTD [8]/KTD [10]) to off-policy learning with eligibility traces. This leads to two known algorithms, LSTD(λ) and LSPE(λ) [21], and suggests new extensions of FPKF and GPTD/KTD. We describe their recursive implementation, discuss their convergence properties, and illustrate their behavior experimentally. Overall, our study suggests that the state-of-the-art LSTD(λ) [21] remains the best least-squares algorithm.
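To make the setting concrete, the following is a minimal sketch of one common recursive off-policy LSTD(λ) variant: a linear value estimate updated from a single behavior-policy trajectory, with importance-sampling-corrected eligibility traces and a Sherman–Morrison rank-one update of the inverse matrix. The function name, the regularization constant `eps`, and the exact placement of the importance ratios are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def off_policy_lstd_lambda(transitions, n_features, gamma=0.99, lam=0.5, eps=1e-3):
    """Recursive off-policy LSTD(lambda) sketch.

    transitions: iterable of (phi, rho, reward, phi_next), where phi and
      phi_next are feature vectors of the current and next state, and
      rho = pi(a|s) / mu(a|s) is the importance ratio of the target policy
      pi over the behavior policy mu for the action actually taken.
    Returns theta such that V(s) is approximated by theta . phi(s).
    """
    C = np.eye(n_features) / eps      # running estimate of A^{-1} (regularized)
    b = np.zeros(n_features)          # running estimate of the b vector
    z = np.zeros(n_features)          # eligibility trace
    rho_prev = 1.0
    for phi, rho, reward, phi_next in transitions:
        # off-policy trace: past features are discounted by gamma*lambda and
        # reweighted by the importance ratio of the previous action
        z = gamma * lam * rho_prev * z + phi
        d = phi - gamma * rho * phi_next
        # Sherman-Morrison rank-one update of C = (A + z d^T)^{-1}
        Cz = C @ z
        C -= np.outer(Cz, d @ C) / (1.0 + d @ Cz)
        b += rho * reward * z
        rho_prev = rho
    return C @ b
```

Maintaining the inverse recursively keeps the per-sample cost quadratic in the number of features rather than re-solving a linear system at every step; this is the sense in which the algorithms studied here are "recursive".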
References
Antos, A., Szepesvári, C., Munos, R.: Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path. In: Lugosi, G., Simon, H.U. (eds.) COLT 2006. LNCS (LNAI), vol. 4005, pp. 574–588. Springer, Heidelberg (2006)
Baird, L.C.: Residual Algorithms: Reinforcement Learning with Function Approximation. In: ICML (1995)
Bertsekas, D.P., Yu, H.: Projected Equation Methods for Approximate Solution of Large Linear Systems. J. Comp. and Applied Mathematics 227(1), 27–50 (2009)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
Boyan, J.A.: Technical Update: Least-Squares Temporal Difference Learning. Machine Learning 49(2-3), 233–246 (2002)
Bradtke, S.J., Barto, A.G.: Linear Least-Squares algorithms for temporal difference learning. Machine Learning 22(1-3), 33–57 (1996)
Choi, D., Van Roy, B.: A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. DEDS 16, 207–239 (2006)
Engel, Y.: Algorithms and Representations for Reinforcement Learning. Ph.D. thesis, Hebrew University (2005)
Geist, M., Pietquin, O.: Eligibility Traces through Colored Noises. In: ICUMT (2010)
Geist, M., Pietquin, O.: Kalman Temporal Differences. JAIR 39, 483–532 (2010)
Geist, M., Pietquin, O.: Parametric Value Function Approximation: a Unified View. In: ADPRL (2011)
Kearns, M., Singh, S.: Bias-Variance Error Bounds for Temporal Difference Updates. In: COLT (2000)
Maei, H.R., Sutton, R.S.: GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Conference on Artificial General Intelligence (2010)
Munos, R.: Error Bounds for Approximate Policy Iteration. In: ICML (2003)
Nedić, A., Bertsekas, D.P.: Least Squares Policy Evaluation Algorithms with Linear Function Approximation. DEDS 13, 79–110 (2003)
Precup, D., Sutton, R.S., Singh, S.P.: Eligibility Traces for Off-Policy Policy Evaluation. In: ICML (2000)
Ripley, B.D.: Stochastic Simulation. Wiley & Sons (1987)
Scherrer, B.: Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In: ICML (2010)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press (1998)
Tsitsiklis, J., Van Roy, B.: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5), 674–690 (1997)
Yu, H.: Convergence of Least-Squares Temporal Difference Methods under General Conditions. In: ICML (2010)
Author information
Authors and Affiliations
INRIA, MAIA Project-Team, Nancy, France
Bruno Scherrer
Supélec, IMS Research Group, Metz, France
Matthieu Geist
Editor information
Editors and Affiliations
NICTA and the Australian National University, 7 London Circuit, ACT 2601, Canberra, Australia
Scott Sanner
Research School of Computer Science, Australian National University, ACT 0200, Canberra, Australia
Marcus Hutter
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Scherrer, B., Geist, M. (2012). Recursive Least-Squares Learning with Eligibility Traces. In: Sanner, S., Hutter, M. (eds) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science, vol. 7188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29946-9_14
Download citation
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29945-2
Online ISBN: 978-3-642-29946-9
eBook Packages: Computer Science, Computer Science (R0)