The mathematics behind TD

The temporal difference (TD) model (Sutton & Barto, 1990) is an extension of the ideas underlying the RW model (Rescorla & Wagner, 1972). Most notably, the TD model abandons the construct of a “trial”, favoring instead time-based formulations. Also notable is the introduction of eligibility traces, which allow the model to bridge temporal gaps and deal with the credit assignment problem.

Implementation note: As of calmr version 0.6.2, stimulus representation in TD is based on complete serial compounds (i.e., time-specific stimulus elements entirely discriminable from each other), and the eligibility traces are of the replacing type.

General note: There are several descriptions of the TD model out there; however, all of the ones I found were opaque when it comes to implementation. Hence, the following description of the model focuses on implementation details.

1 - Maintaining stimulus representations

TD maintains stimulus traces as eligibility traces. The eligibility of stimulus \(i\) at time \(t\), \(e_i^t\), is given by:

\[\tag{Eq. 1}e_i^t = e_i^{t-1} \sigma \gamma + x_i^t\]

where \(\sigma\) and \(\gamma\) are decay and discount parameters, respectively, and \(x_i^t\) is the activation of stimulus \(i\) at time \(t\) (1 or 0 for present and absent stimuli, respectively).
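
To make Eq. 1 concrete, here is a minimal Python sketch of a single trace update. It is not calmr's code: the function name `update_eligibility` and the parameter values are assumptions chosen for illustration. The sketch follows Eq. 1 literally, so it does not cap the trace the way a replacing trace would.

```python
def update_eligibility(e_prev, x, sigma, gamma):
    """Eq. 1: decay each previous trace by sigma * gamma, then add the current activation."""
    return [e * sigma * gamma + xi for e, xi in zip(e_prev, x)]


# Hypothetical parameter values, chosen only for illustration.
sigma, gamma = 0.8, 0.95

# A single element that is active on the first time step and absent afterwards.
e = [0.0]
history = []
for x in ([1.0], [0.0], [0.0]):
    e = update_eligibility(e, x, sigma, gamma)
    history.append(e[0])

print(history)
```

After the element switches off, its trace shrinks by a factor of \(\sigma \gamma\) on every step, which is what lets later events still be credited to it.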

Internally, \(e_i\) is represented as a vector of length \(d\), where \(d\) is the number of stimulus compounds (not in the general sense of the word compound, but in terms of complete serial compounds, or CSC). For example, a 2s stimulus in a model with a time resolution of 0.5s will have \(d = 4\), and the second entry in that vector represents the eligibility of the compound active after the stimulus has been present for 1s.
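
The example above can be sketched in a few lines of Python. The helper names `n_compounds` and `csc_activation` are hypothetical, as is the choice to index compounds by a 1-based step number. This illustrates the representation only, not how calmr builds it.

```python
def n_compounds(duration, resolution):
    """Number of time-specific elements (d) for a stimulus of the given duration."""
    return round(duration / resolution)


def csc_activation(d, step):
    """One-hot vector of length d with the element for the 1-based `step` switched on.

    Steps outside 1..d (before onset or after offset) give an all-zero vector.
    """
    return [1.0 if k == step else 0.0 for k in range(1, d + 1)]


d = n_compounds(2.0, 0.5)            # a 2s stimulus at a 0.5s resolution
second = csc_activation(d, step=2)   # stimulus has been present for 1s
after = csc_activation(d, step=5)    # stimulus is no longer present

print(d, second, after)
```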

Similarly, \(x_i^t\) entails the specific compound of stimulus \(i\) at time \(t\), and not the general activation of \(i\) at that time. For example, suppose two 2s stimuli, \(A\) and \(B\), are presented with an overlap of 1s, with \(A\)’s onset occurring first. Can you guess what stimulus compounds will be active at \(t = 2\) with a time resolution of 0.5s?¹

2 - Generating expectations

The TD model generates stimulus expectations² based on the presented stimuli, not on the strength of eligibility traces. The expectation of stimulus \(j\) at time \(t\), \(V_j^t\), is given by:

\[\tag{Eq. 2}V_j^t = w_j^{t'} x^t = \sum_i^K w_{i,j}^t x_i^t\]

where \(w_j^t\) is a matrix of stimulus weights at time \(t\) pointing towards \(j\), \('\) denotes transposition, and \(w_{i,j}\) is an entry in a square matrix holding the association from \(i\) to \(j\). As with the eligibility traces above, the entries in each matrix are the weights of specific stimulus compounds.

Internally, \(w_j^t\) is constructed on a trial-by-trial, step-by-step basis, depending on the stimulus compounds active at the time.
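
As a rough illustration of Eq. 2, the sketch below computes an expectation as a weighted sum over the currently active compounds. The three-element layout (two compounds of a stimulus A plus a US), the weight values, and the function name `expectation` are all assumptions made for the example.

```python
def expectation(w, x, j):
    """Eq. 2: V_j = sum over i of w[i][j] * x[i]."""
    return sum(w[i][j] * x[i] for i in range(len(x)))


# Hypothetical elements: index 0 = A (step 1), 1 = A (step 2), 2 = US.
# w[i][j] holds the association from element i to element j.
w = [
    [0.0, 0.0, 0.2],
    [0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0],
]

x = [0.0, 1.0, 0.0]           # only the second compound of A is active
v_us = expectation(w, x, j=2)

print(v_us)
```

Because the sum runs over activations rather than eligibilities, only the compound that is present right now contributes to the expectation, no matter how strong the lingering traces of earlier compounds are.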

3 - Learning associations

True to its name, the TD model updates associations based on a temporally discounted prediction of upcoming stimuli. This temporal difference error term is given by:

\[\tag{Eq. 3}\delta_j^t = \lambda_j^t + \gamma V_j^t - V_j^{t-1}\]

where \(\lambda_j^t\) is the value of stimulus \(j\) at time \(t\), which also determines the asymptote for stimulus weights towards \(j\).
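
A few hand-picked cases may help build intuition for Eq. 3. The function name `td_error` and the numeric values below are assumptions for illustration only.

```python
def td_error(lam, v_now, v_prev, gamma):
    """Eq. 3: delta = lambda + gamma * V(t) - V(t - 1)."""
    return lam + gamma * v_now - v_prev


gamma = 0.95  # hypothetical discount value

# The stimulus arrives but nothing predicted it: positive error.
unexpected = td_error(lam=1.0, v_now=0.0, v_prev=0.0, gamma=gamma)

# The stimulus arrives and was fully expected on the previous step: no error.
expected = td_error(lam=1.0, v_now=0.0, v_prev=1.0, gamma=gamma)

# The stimulus was expected but is omitted: negative error.
omitted = td_error(lam=0.0, v_now=0.0, v_prev=1.0, gamma=gamma)

print(unexpected, expected, omitted)
```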

The temporal difference error term is used to update \(w\) via:

\[\tag{Eq. 4}w_{i,j}^t \leftarrow w_{i,j}^t + \alpha_i \beta(x_j^t) \delta_j^t e_i^t\]

where \(\alpha_i\) is a learning rate parameter for stimulus \(i\), and \(\beta(x_j)\) is a function that returns one of two learning rate parameters (\(\beta_{on}\) or \(\beta_{off}\)) depending on whether \(j\) is being presented or not at time \(t\).
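
Reading Eq. 4 as an in-place assignment, one update step could be sketched as follows. The function name `update_weights`, the two-element layout (a CS and a US), and every numeric value are assumptions for the example, not calmr's implementation.

```python
def update_weights(w, e, delta, x, alpha, beta_on, beta_off):
    """Eq. 4, applied in place to every association from i to j."""
    for i in range(len(w)):
        for j in range(len(w[i])):
            # beta depends on whether the target element j is present right now.
            beta = beta_on if x[j] > 0 else beta_off
            w[i][j] += alpha[i] * beta * delta[j] * e[i]
    return w


# Hypothetical elements: index 0 = CS, index 1 = US.
w = [[0.0, 0.0], [0.0, 0.0]]
e = [0.5, 0.0]        # the CS is gone but its trace lingers
x = [0.0, 1.0]        # the US is being presented
delta = [0.0, 1.0]    # the US was not predicted
alpha = [0.2, 0.2]

update_weights(w, e, delta, x, alpha, beta_on=0.4, beta_off=0.1)

print(w)
```

Only the CS-to-US weight changes here: it is the only pair with both a non-zero eligibility for \(i\) and a non-zero error for \(j\).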

4 - Generating responses

As with many associative learning models, the transformation between stimulus expectations and responding is left unspecified, in the hands of the user. The TD model does not return a response vector, but it suffices to assume that responding is the identity function on the expected stimulus values, as follows:

\[\tag{Eq. 5}r_j^t = V_j^t\]

References

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–69). Appleton-Century-Crofts.

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. W. Moore (Eds.), Learning and computational neuroscience (pp. 497–537). MIT Press.
