# Human Choice and Reinforcement Learning (2)

# 1. Answer to post 1

In the previous post, I reviewed the Rescorla-Wagner updating (Delta) rule and its contemporary instantiation. At the end, I asked the following question:

**How should you change the learning rate so that the expected win rate is always the average of all past outcomes?**

We will go over the answer to this question before progressing to the use of the Delta rule in modeling human choice. To begin, refer back to the Delta rule written in the following form:

\[Pr(win)_{t+1} = (1 - \beta) \cdot Pr(win)_{t} + \beta \cdot \lambda_{t}\]

Here, we see that in the Delta rule the expected win probability for the next trial is equal to the *exponentially weighted moving average* of the past expectation and the current outcome. It is easy to show this through a visualization of the expectation over time. For example, imagine that we have a vector of outcomes \(\lambda = [1,0,0,1,1,1,0,1,1,1]\), where 0 and 1 represent losing and winning slot machine rolls, respectively. Note that in this example, the placement of these outcomes within the vector \(\lambda\) indicates their temporal order (i.e. \(\lambda_{1}=1\), \(\lambda_{2}=0\), etc.). Now, if we set an arbitrary learning rate such as \(\beta = 0.05\), what is the expected win rate after iterating through outcomes \(\lambda\)? The R code below demonstrates the use of the Delta rule and an alternative exponential weighting scheme–which takes the form of a *power series*–to determine the expectation on each trial:

```
# For pretty plots
library(ggplot2)
# Assign lambda (lambda[1] == first trial)
lambda <- c(1,0,0,1,1,1,0,1,1,1)
# Set learning rate
beta <- 0.05
### Iterative prediction error (Delta rule) approach ###
# Function that iterates the Rescorla-Wagner rule
# This function is slightly modified from the last post
# to ensure that that final expectation is stored
rw_update <- function(lambda, beta, init) {
# To store expected value for each trial
Pr_win <- vector(length=length(lambda)+1)
# Set initial value
Pr_win[1] <- init
for (t in 1:(length(lambda))) {
Pr_win[t+1] <- Pr_win[t] + beta * (lambda[t] - Pr_win[t])
}
return(Pr_win)
}
# With initial expectation = 0, iterate Delta rule
delta_results <- rw_update(lambda = lambda,
beta = beta,
init = 0)[-1]
# ^
# Remove initial value (0)
### Power series approach ###
# Direct exponential weighting (saving all expectations)
power_ser <- NULL
for (i in 1:10) {
power_ser[i] <- beta * sum((1-beta)^(0:(i-1))*lambda[i:1])
}
### Comparison of both approaches ###
all(round(delta_results, 8) == round(power_ser, 8))
```

`## [1] TRUE`

```
# data.frame for ggplot
all_data <- stack(data.frame(delta = delta_results,
pow_s = power_ser))
# Add trial
all_data[["trial"]] <- rep(1:10, 2)
names(all_data)[2] <- "Approach"
# Visually
p <- ggplot(all_data, aes(x = trial, y = values, color = Approach)) +
geom_line() +
facet_grid(facets = "Approach ~ .") +
ggtitle("Comparison of approaches") +
xlab("Trial Number") +
ylab("Expected Win Probability")
p
```

As you can see in the plots, both the Delta rule and the power series approach yield the same exact expectations when iterated for each trial. However, the power series form requires each past observation while the Delta rule only requires the last expectation–this feature makes the Delta rule form of the equation much more plausible as a processes that people may use to estimate the value of a choice. This is because the computational cost does not increase with the number of past observations.

Through this example, those familiar with economics or signal processing may find the Delta rule familiar. Essentially, we can think of the Delta rule as a smoothing function or a high- or low-pass filter–albeit in the time as opposed to frequency domain–which effectively attenuates the effect of past or current outcomes, respectively. What makes this specific form interesting is again the fact that it can be iterated (i.e. it is recursive), making it a realistic approximation to the computations performed by the brain when estimating some value.

With the above intuitions in mind, we will now get back to the question. How do we change the learning rate to ensure that the current expectation is always the simple average of all past outcomes? Since the above example showed that the Delta rule is really just a moving average where past outcomes are given exponentially decreasing weights, our goal is to make all outcomes equally represented. In other words, we want to weight past and current outcomes equally–this is a *cumulative moving average*. Using our slot machine example, the cumulative moving average formula (in its most common form) is written as follows:

\[Pr(win)_{t+1} = \frac{\lambda_{t} + (t-1) \cdot Pr(win)_{t}}{t}\]

We can re-write the above equation into one that is more familiar to us:

\[Pr(win)_{t+1} = (1-\frac{1}{t}) \cdot Pr(win)_{t} + \frac{1}{t} \cdot \lambda_{t} \]

In the above form, you can see that the cumulative moving average can be computed using the Delta rule by setting the learning rate to \(\frac{1}{t}\), where \(t\) is the trial number. Looking at the equation, you will notice that as \(t\) increases, the weight \((1-\frac{1}{t})\) placed on the past probability estimate \(Pr(win)_{t}\) becomes larger while the weight \(\frac{1}{t}\) on the current outcome \(\lambda_{t}\) shrinks. This behavior ensures that past outcomes are not discounted at a higher rate than current ones.

# 2. Choice mechanisms

With the Delta rule, we can approximate how people with a certain learning rate may update their expectations about an outcome on a trial-by-trial basis, but how does this translate to choice? In the slot machine example, we only had a single choice (i.e. pull the lever and see if you win), so this question never applied to us. **But what if we have 2 slot machines and we want to select the one that will win most frequently?** In this case, we can use the Delta rule to update win probability expectations for each slot machine separately, but what do we do with these values after they are computed? Below, I will describe three different methods that can be used to translate expected values to choice.

## 2.1 Greedy choice

Greedy choice is simple–choose the option with the highest expected value on each trial (i.e. pick the *greedy* option):

\[c_{t} = \underset{s \in S}{argmax}(V(s_{t}))\] where \(c_{t}\) indicates the choice made on trial \(t\) and \(V(s_{t})\) represents the value associated with Slot machine \(s\) on trial \(t\). Applying this logic to our example, this would equate to choosing the slot machine with the highest expected win probability. While this may sound like a good idea, it is important to remember that we do not know what the true win rate is for either slot machine. Instead, we estimate it after each outcome. With this in mind, a simple example (below) shows why the greedy choice method fails in practical applications.

Imagine you are choosing between 2 slot machines, where machine A (\(Slot_{A}\)) has a true win rate of 0.9, and machine B (\(Slot_{B}\)) has a true win rate of 0.5. Obviously, \(Slot_{A}\) is a better option, but this is something you need to learn by making a choice and updating your expectations. Assuming that you start off thinking that each slot machine has a win rate of 0 (which is typical in human decision making models), your first choice will be a random selection between the equivalent options. In our example, imagine that you randomly choose \(Slot_{B}\), the slot machine spins, and then you win! Great, so now (regardless of your learning rate), you will update your expectation of the win rate for \(Slot_{B}\) to a positive, non-zero value. On the next trial, you are greedy, so you again choose \(Slot_{B}\)–it has a higher expected win rate than \(Slot_{A}\). In fact, you will continue to choose \(Slot_{B}\) on each trial without ever considering \(Slot_{A}\)! In this case, even though \(Slot_{A}\) it the optimal choice, you never allow yourself to *explore* alternative choices. Here, we come across a classical problem in reward-learning paradigms–** the exploration-exploitation tradeoff**. The crux of this problem is this: how do you know that a “good” choice is better than other choices that you have yet to explore? Think of it like choosing a job. I chose to study psychology, and I continue to exploit this choice (i.e. I am not exploring other education). How do I know that psychology is for me, though? It is possible that I would have gained more from a computer science degree, but I did not explore that option. In the same way, this compromise exists in simple reinforcement learning paradigms such as choosing the best slot machine.
Below, we will explore two methods that address the exploration-exploitation problem.

## 2.2 \(\epsilon\)-Greedy choice

The \(\epsilon\)-greedy method is a simple extention of the greedy method above. Instead of always choosing the option with the highest expected value, we sometimes (with probability \(\epsilon\)) randomly choose another option:

\[c_{t} = \cases{ \underset{s \in S}{argmax}(V(s_{t})) & \text{with } Pr(1 - \epsilon) \cr \text{random choice} & \text{with } Pr(\epsilon) } \]

By choosing a random option with \(Pr(\epsilon)\), we can avoid getting stuck choosing the non-optimal slot machine. While this method solves our dilemma, it suffers another problem–when randomly selecting options, it gives equal probabilities to each option. Intuitively, a better method would be to choose options *probabilistically with respect to their expected values* (i.e. give high probability to relatively good options and *vice-versa*).

## 2.3 Softmax choice

Also known as the Luce choice rule, the softmax allows choices to be probabilistic with weights respective to expected value:

\[Pr(c_{t} = s \in S) = \frac{e^{V_{s}(t)}}{\sum_{s = 1}^S e^{V_{s}(t)}}\]

where \(e^{V_{s}(t)}\) is the exponentiated expected value of slot machine \(s\) on trial \(t\), and \(\sum_{s = 1}^S e^{V_{s}(t)}\) is the summation of the exponentiated expected value of both slot machines (there are 2 in our example). When the expected values are entered, the softmax equation returns a probability of selecting each slot machine which we can then use to make an actual choice. In practice, the softmax function is used most often in decision making research–moving forward, we will use the softmax choice mechanism to model human decision making.

# 3. Implementation

We now have a full model describing each of the following steps:

- Evaluating an outcome,
- Updating previous representations of choice options, and
- Generating a probability of selecting each choice on the next trial.

This model is simple, but it provides the basic building blocks for more complex models that are found in neuroscience, cognitive science, and decision making literature today. In the next post, we will explore various methods which can be used to estimate the *free parameters* in the model (e.g. the learning rate) when all we have are the person’s choices.