==== Temporal difference methods ====
{{Main|Temporal difference learning}}

The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic, as it might prevent convergence. Most current algorithms do this, giving rise to the class of ''generalized policy iteration'' algorithms. Many [[Actor-critic algorithm|''actor-critic'' methods]] belong to this category.

The second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is to use Sutton's [[temporal difference]] (TD) methods, which are based on the recursive [[Bellman equation]].<ref>{{cite thesis|last = Sutton|first = Richard S.|title = Temporal Credit Assignment in Reinforcement Learning|degree = PhD|publisher = University of Massachusetts, Amherst, MA|url = http://incompleteideas.net/sutton/publications.html#PhDthesis|author-link = Richard S. Sutton|year = 1984|access-date = 2017-03-29|archive-date = 2017-03-30|archive-url = https://web.archive.org/web/20170330002227/http://incompleteideas.net/sutton/publications.html#PhDthesis|url-status = dead}}</ref>{{sfn|Sutton|Barto|2018|loc=[http://incompleteideas.net/sutton/book/ebook/node60.html §6. Temporal-Difference Learning]}} The computation in TD methods can be incremental (the memory is updated after each transition and the transition is then discarded) or batch (the transitions are collected and the estimates are computed once from the whole batch). Batch methods, such as the least-squares temporal difference method,<ref>{{cite journal | doi = 10.1023/A:1018056104778 | last1 = Bradtke | first1 = Steven J. | author-link1 = Steven J. Bradtke | last2 = Barto | first2 = Andrew G. | author-link2 = Andrew G. Barto | title = Linear least-squares algorithms for temporal difference learning | journal = Machine Learning | volume = 22 | pages = 33–57 | year = 1996 | citeseerx = 10.1.1.143.857 | s2cid = 20327856 }}</ref> may make better use of the information in the samples, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue.

Another problem specific to TD methods comes from their reliance on the recursive Bellman equation. Most TD methods have a so-called <math>\lambda</math> parameter <math>(0\le \lambda\le 1)</math> that can continuously interpolate between Monte Carlo methods, which do not rely on the Bellman equations, and the basic TD methods, which rely entirely on the Bellman equations. This can be effective in mitigating this issue.
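As an illustrative sketch (not drawn from the cited sources), the following Python code shows tabular TD(<math>\lambda</math>) policy evaluation with accumulating eligibility traces. The Gym-style environment interface (<code>env.reset</code>, <code>env.step</code>) and all parameter values are assumptions made for the example; setting <math>\lambda = 0</math> recovers one-step TD(0), while <math>\lambda \to 1</math> approaches a Monte Carlo estimate of the return.

<syntaxhighlight lang="python">
import collections

def td_lambda_evaluation(env, policy, num_episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    Assumes a Gym-style environment: env.reset() -> state and
    env.step(action) -> (next_state, reward, done, info), and a
    callable policy(state) -> action.  These are assumptions for
    the sketch, not part of the cited sources.
    """
    V = collections.defaultdict(float)           # state-value estimates
    for _ in range(num_episodes):
        traces = collections.defaultdict(float)  # eligibility traces
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # One-step TD error based on the recursive Bellman equation.
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            traces[state] += 1.0                 # accumulating trace
            # Incremental update: every previously visited state is
            # credited in proportion to its decayed eligibility, after
            # which the transition itself can be discarded.
            for s in list(traces):
                V[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam
            state = next_state
    return V
</syntaxhighlight>

Accumulating traces are used here for simplicity; replacing and dutch traces are common variants of the same scheme.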