
This is a living document. It is being regularly updated. We will be discussing all these ideas in class in coming days. To access the code I am posting, clone the repo with
```
git clone https://cs.bc.edu/mctague/t/2025/rl/backgammon.git
```
I recommend writing your code in separate files which import the files I post there, so that you can git pull any modifications and additions I make without affecting your code.
You will create a sequence of increasingly sophisticated RL agents to play backgammon. You will be able to compare the abilities of your own agents (and your classmates’) by pitting them against each other in tournament.
Carefully review the rules of the game. We will follow all rules except we will not use a doubling cube.
You will implement everything from scratch in Python. To make this practical, we will need to use Numba to accelerate some aspects (like legal move enumeration), and high performance libraries for GPU-based linear algebra (JAX) and neural networks (Flax). You can use PyTorch instead if you like.
Each of your agents will learn from scratch, by playing many games against itself. This will require a powerful computer, which you will access through NSF ACCESS.
You will create code to store and load your fully trained agents to and from disk, so that you can transmit them to other computers for further evaluation, and ultimately to submit them with the rest of your work.
Backgammon has roughly \(10^{20}\) states. (A back-of-the-envelope estimate based on stars and bars would be \(\binom{40}{15}^2\approx 1.6\times 10^{21}\), which turns out to overcount by about a factor of 8.) This places backgammon beyond the reach of tabular methods. So all of your agents will have to learn to generalize across states.
We represent the current player as \(+1\) (white) or \(-1\) (black), the current dice roll as a (sorted) length-2 array of 8-bit ints, and the board as a NumPy array of 8-bit ints with the layout
| index | 0 | 1 | 2 | … | i | … | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|
| contents | white bar | point 1 | point 2 | … | point i | … | point 24 | black bar | white off | black off |
The last two ints are redundant: they encode the number of checkers that have been borne off by each player. White checkers are encoded with positive ints, black with negative (even on the black bar and in black off). The absence of a checker is encoded with a 0.
Warning: This appears to be a nonstandard encoding, but it allows for cleaner code than standard approaches. To compute the number of the current player’s checkers at point i, one simply computes state[i]*player. If this is negative, then the opponent has pieces there.
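For example, here is a tiny (hedged) illustration of the encoding; the checker placements are arbitrary, not a legal position:

```python
import numpy as np

# Arbitrary illustration of the encoding -- not a legal backgammon position.
state = np.zeros(28, dtype=np.int8)
state[6], state[13] = 5, -5   # five white checkers on point 6, five black checkers on point 13
state[25] = -1                # one black checker on the black bar (negative even here)

player = -1                   # black to move
print(state[13] * player)     #  5: the current player (black) has five checkers on point 13
print(state[6] * player)      # -5: negative, so the opponent holds point 6
```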
The rules governing which moves are legal in backgammon are somewhat complex and need to be implemented correctly and efficiently. On each turn, a player must use as many of the dice as possible, and when either die can be used but not both, the player must use the higher die. To bear off, all of a player’s checkers must be in their home quadrant, and a player can use a die roll \(r\) to bear off a checker closer than \(r\) to the bear-off area only if the player has no checkers further from the bear-off area. (See the rules linked to above for a full description of all rules.)
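As one illustration of the bear-off condition, here is a hedged sketch for white. It assumes, purely for this example, that white's home quadrant is points 1–6 and that white bears off moving toward index 0; the conventions in the course repo may differ.

```python
import numpy as np

def white_can_bear_off(state, point, die):
    """Sketch of the bear-off rule for white under the assumed point convention."""
    if state[0] > 0:                       # white checker(s) still on the bar
        return False
    if np.any(state[7:25] > 0):            # white checkers outside the home quadrant
        return False
    if state[point] <= 0:                  # no white checker on this point
        return False
    if die == point:                       # exact roll always bears off
        return True
    if die > point:                        # larger roll: only if no checker further from the edge
        return not np.any(state[point + 1:7] > 0)
    return False                           # die < point: the checker must move within the board
```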
2-ply search for move selection. In the planning phase, the agent considers each legal action \(a\) in the current state (the state includes the player’s current dice roll). For each action \(a\), the agent considers all 21 possible opponent dice rolls \(d_j\), leading to 21 possible states \(S_{a,j}\). The agent considers all possible moves \(a_{opp}\) by the opponent in state \(S_{a,j}\) and applies its own state-value function to each resulting state, \(V(S_{a,j,a_{opp}})\). The agent assumes the opponent makes the move \(a_{opp}\) which minimizes this value (and does not itself perform a 2-ply search). As a formula,
\[ a^* = \mathrm{argmax}_a \sum_j P(d_j) \left[ \min_{a_{opp}} V(S_{a,j,a_{opp}}) \right] \]
Programming tip: When a value \(V(S_{a,j,a_{opp}})=\hat{v}(S_{a,j,a_{opp}},w_t)\) is computed on a GPU (which will certainly be the case for your second agent, when \(\hat{v}(\cdot,w_t)\) is a neural network), the state \(S_{a,j,a_{opp}}\) needs to be transferred from the CPU to GPU, and then the computed value needs to be transferred back to the CPU. These transfers have high latency but high bandwidth, and the GPU is optimized for doing many calculations \(\hat{v}(\cdot,w_t)\) in parallel. So to obtain good performance, you should build up an array of all states \(S_{a,j,a_{opp}}\) appearing in the RHS of the formula, transfer this array to the GPU all at once, compute all their values in parallel, and then transfer the array of values back to the CPU. Back on the CPU, compute each sum of the RHS in parallel, using multiple CPU cores. (A more sophisticated and higher performance technique would be to do all of these calculations, including legal move enumeration, on the GPU.)
Another tip: For many but not all moves, the order of submoves does not matter, leading to the same afterstate. For efficiency, 2-ply search should branch on and evaluate only unique afterstates.
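Putting the formula and both tips together, here is a hedged sketch of 2-ply selection with a single batched GPU evaluation. `legal_moves` and `dice_rolls` are hypothetical helpers (move enumeration lives elsewhere in your code), `value_fn(params, batch)` is assumed to evaluate an `(N, 28)` batch of boards from the current player's perspective, and turns with no legal move are glossed over.

```python
import numpy as np
import jax.numpy as jnp

def two_ply_select(state, player, roll, params, value_fn, legal_moves, dice_rolls):
    """Select the player's move by 2-ply search with one CPU->GPU->CPU round trip."""
    candidates = legal_moves(state, player, roll)    # player's afterstates
    leaves, slices = [], []                          # slices[a] = [(start, stop, prob), ...]
    for after in candidates:
        per_roll = []
        for opp_roll, prob in dice_rolls():          # 21 distinct opponent rolls with probabilities
            replies = legal_moves(after, -player, opp_roll)
            # keep only unique opponent afterstates (submove order often doesn't matter)
            unique = list({r.tobytes(): r for r in replies}.values())
            start = len(leaves)
            leaves.extend(unique)
            per_roll.append((start, len(leaves), prob))
        slices.append(per_roll)

    # one transfer to the GPU, one batched evaluation, one transfer back
    values = np.asarray(value_fn(params, jnp.asarray(np.stack(leaves))))

    scores = np.zeros(len(candidates))
    for a, per_roll in enumerate(slices):
        for start, stop, prob in per_roll:
            if stop > start:                         # opponent picks the reply minimizing our value
                scores[a] += prob * values[start:stop].min()
    return candidates[int(np.argmax(scores))]
```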
This first agent approximates the state value function \(V=\hat{v}(\cdot,w)\) using a matrix which acts on a handcrafted vector of state features:
This approach constrains the agent to finding a linear combination of human-designed features. These features are based on human strategies for the game. The advantage is that the strategy the agent finds is highly interpretable: humans can easily inspect and understand how the agent works. The disadvantage is that humans have to do the work of designing and coding these features, and it is difficult or impossible for the agent to transcend human understanding and discover a strategy not already envisioned by humans.
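To make the shape of this concrete, here is a hedged sketch with a handful of illustrative features (not the feature set you are required to design) and a plain linear value estimate:

```python
import numpy as np

def handcrafted_features(state, player):
    """A few illustrative hand-designed features; a real feature vector will be much richer."""
    pts = state[1:25] * player                 # positive wherever the current player has checkers
    bar = state[0] if player == 1 else -state[25]
    off = state[26] if player == 1 else -state[27]
    return np.array([
        1.0,                                   # bias term
        bar,                                   # checkers on the bar
        off,                                   # checkers already borne off
        np.sum(pts == 1),                      # blots (single, exposed checkers)
        np.sum(pts >= 2),                      # made points
    ], dtype=np.float32)

def linear_value(state, player, w):
    """Minimal sketch of a linear approximation: v-hat(s, w) = w . phi(s)."""
    return float(w @ handcrafted_features(state, player))
```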
First implement this as an agent playing a single game, learning online, updating \(V\) move-by-move.
Next, create a vectorized version of the environment in which the agent plays a whole batch of games in parallel. (State is a vector of states, action is a vector of actions, reward is a vector of rewards. When one of the games in the batch ends, start a new one in its place.) Then create a vectorized version of your agent which plays a batch of games in parallel, learning online, updating \(V\) move-by-move, but where these “moves” are batches of moves. The update target is the average of the TD(0) targets across the batch. This will reduce variance of the stochastic gradient update, improving convergence. We will use this vectorized approach in all following agents.
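A minimal sketch of that batched TD(0) update for the linear agent, assuming the feature vectors for each parallel game have already been extracted:

```python
import numpy as np

def batched_td0_update(w, feats, next_feats, rewards, done, alpha, gamma):
    """feats, next_feats: (B, d) arrays for the B parallel games; rewards, done: (B,) arrays.
    Averages the TD(0) updates across the batch."""
    v      = feats @ w
    v_next = np.where(done, 0.0, next_feats @ w)       # do not bootstrap past a terminal state
    delta  = rewards + gamma * v_next - v               # per-game TD(0) errors
    return w + alpha * np.mean(delta[:, None] * feats, axis=0)   # grad of a linear model is feats
```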
Optionally, implement TD(\(\lambda\)), \(\lambda \in [0.7, 0.9]\). This will improve your agent, and help you prepare for the next agent.
Here \(w_t\) and \(z_t\) are the weight and eligibility trace vectors at time \(t\):
\[ \begin{align*} z_{-1} &= 0 \\ g_t &= \nabla \hat{v}(S_t,w_t) \\ z_t &= \gamma\lambda z_{t-1} + g_t \\ \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\ w_{t+1} &= w_t + \alpha \delta_t z_t \\ \end{align*} \]
When learning in batches, \(z_t\), \(g_t\), and \(\delta_t\) are computed and stored separately for each parallel game, but a single weight vector is updated with the summed contribution \(w_{t+1}=w_t+\alpha \sum_i \delta_t^{(i)} z_t^{(i)}\).
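A hedged sketch of this batched TD(\(\lambda\)) update for the linear agent; resetting a game's trace when it terminates is bookkeeping I am assuming, not something specified above:

```python
import numpy as np

def batched_td_lambda_update(w, z, feats, next_feats, rewards, done, alpha, gamma, lam):
    """z: (B, d) eligibility traces, one per parallel game; returns the updated (w, z)."""
    g      = feats                                      # gradient of a linear v-hat is the features
    z      = gamma * lam * z + g                        # per-game trace update
    v      = feats @ w
    v_next = np.where(done, 0.0, next_feats @ w)
    delta  = rewards + gamma * v_next - v
    w      = w + alpha * np.sum(delta[:, None] * z, axis=0)
    z      = np.where(done[:, None], 0.0, z)            # assumed: reset traces of finished games
    return w, z
```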
You get faster convergence to a better fixed point using true online TD(\(\lambda\)) [van Seijen-Sutton 2014], [van Seijen-Mahmood-Pilarski-Machado-Sutton 2016]:
\[ \begin{align*} z_{-1} &= 0 \\ g_t &= \nabla \hat{v}(S_t,w_t) \\ z_t &= \gamma\lambda z_{t-1} + g_t - \underbrace{\alpha\gamma\lambda(z_{t-1}\cdot g_t) g_t}_{\text{online correction term}} \\ \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\ w_{t+1} &= w_t + \alpha \delta_t z_t + \underbrace{\alpha\big(\hat{v}(S_t,w_t) - g_t\cdot w_{t-1}\big)(z_t-g_t)}_{\text{online correction term}} \\ \end{align*} \]
(Here \(g_t\cdot w_{t-1}\) is the value of \(S_t\) under the previous weights; for a linear approximator it equals the bootstrap value \(\hat{v}(S_t,w_{t-1})\) already computed at the previous step, so cache it rather than recomputing it.)
Feature coding. Put each player’s checkers on separate feature planes. Encode the state of the 24 points across 15 planes in total (each of 2.–8. determines two planes, one for each player, while 1. determines just one plane):
and encode the auxiliary features
Network architecture. Feed these features into the following neural network architecture:
```
x_out = x_in + Conv(ReLU(LayerNorm(Conv(ReLU(LayerNorm(x_in))))))
```

Learn policy \(\pi_\theta(\cdot|\cdot)\) and value \(\hat{v}(\cdot,w)\) networks in tandem (actor/critic). During 2-ply search, use the policy to prune the search tree to, say, the top 5 moves (for both player and opponent). This agent will be less thorough in planning, so for a given value function it will be slightly weaker. But it will plan faster, and will therefore be able to play more games against itself in any given training period, obtaining a better value function.
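As a concrete illustration of the residual block that Agents 2 and 3 share, here is a minimal Flax sketch; the channel count and kernel size are illustrative choices, an initial convolution projecting the feature planes to that width is assumed, and a PyTorch version is equally acceptable:

```python
import flax.linen as nn

class ResidualBlock(nn.Module):
    """x_out = x_in + Conv(ReLU(LayerNorm(Conv(ReLU(LayerNorm(x_in))))))"""
    channels: int = 64                     # illustrative width

    @nn.compact
    def __call__(self, x):                 # x: (batch, 24, channels), planes along the 24 points
        h = nn.Conv(self.channels, kernel_size=(3,), padding="SAME")(nn.relu(nn.LayerNorm()(x)))
        h = nn.Conv(self.channels, kernel_size=(3,), padding="SAME")(nn.relu(nn.LayerNorm()(h)))
        return x + h
```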
Network architecture. Take Agent 2’s network, which already has a value head (critic), and add to it:
To use this policy head to select the 5 or so most promising moves, begin by masking out the illegal submoves from the grid. Among the remaining submoves, identify the one with the largest logit, then collect all legal moves beginning with that submove. If there are fewer than 5 such moves, continue to the submove with the second-largest logit and add all legal moves beginning with it. Continue like this until you have accumulated at least 5 moves (unless there are fewer than 5 legal moves in total).
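A hedged sketch of that pruning step; it assumes (hypothetically) that a move is a tuple of submoves, that a submove is an integer index into the flattened logit grid, and that a submove counts as legal when some legal move begins with it:

```python
import numpy as np

def promising_moves(submove_logits, legal_moves, k=5):
    """Collect roughly the k most promising moves from the policy head's logits."""
    masked = np.full_like(submove_logits, -np.inf)
    for move in legal_moves:                       # unmask only legal first submoves
        masked[move[0]] = submove_logits[move[0]]
    chosen = []
    for s in np.argsort(-masked):                  # largest logit first
        if not np.isfinite(masked[s]):
            break                                  # only illegal submoves remain
        chosen += [m for m in legal_moves if m[0] == s]
        if len(chosen) >= k:
            break
    return chosen
```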
During competition, perform 2-ply search pruned to use just these most promising moves (for both the player and opponent), and select the move among them having the optimal expected value, just as Agents 1 and 2 did.
During training, use this pruned 2-ply search to estimate the value of the current state, and save this value \(V(S_t)\) to the replay buffer. But then, to ensure exploration, sample a move randomly according to the softmax distribution given by the policy head’s logits: randomly sample the first submove (using the “Log-Sum-Exp trick” for numerical stability, after masking out illegal submoves), then feed the resulting intermediate state back into the policy network, and sample the second submove (after masking out illegal submoves), and so on, until obtaining a legal move \(A_\text{sampled}\) along with the probability \(\pi_\text{old}\) of its having been sampled by the current policy network (which will later become “old”). This is the action the agent actually takes, and it records all this information in the replay buffer, which consists of parallel arrays
\[ (S_t, A_\text{sampled}, \pi_\text{old}, R_{t+1}, V(S_t)) \]
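A hedged sketch of sampling a single submove this way; chaining (summing) the per-submove log-probabilities along the sampled move gives \(\log \pi_\text{old}\) for that move:

```python
import numpy as np

def sample_submove(logits, legal_mask, rng):
    """Sample one submove from the masked softmax and return (index, log-probability)."""
    masked = np.where(legal_mask, logits, -np.inf)
    shifted = masked - masked.max()                        # log-sum-exp trick for stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    probs = np.exp(log_probs)
    idx = rng.choice(len(logits), p=probs / probs.sum())   # renormalize away float error
    return idx, log_probs[idx]

# usage: rng = np.random.default_rng(0); idx, logp = sample_submove(logits, legal_mask, rng)
```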
The agent learns offline with repeated passes over this replay buffer. During the first pass, this looks like on-policy learning, but with each additional pass the policy drifts further and further from the policy which generated the data, so it starts to look more and more like off-policy learning. For this reason we need to use importance sampling. However, we clip the importance sampling ratios to keep the policy from drifting too far from the original. In this way, we can get higher data efficiency (multiple passes over the data) without letting fluke events in the data distort our policy too much. After a few passes over the data, generate fresh data using the improved policy. (Either discard the old data, or keep it and update all the \(\pi_\text{old}\) and \(V(S_t)\) entries using the improved policy and value networks, which will make the old, earnestly played games look more and more like erratic, highly exploratory games.)
PPO is similar in spirit to the REINFORCE Monte Carlo policy gradient algorithm with a baseline, but with a variance-reducing substitute for the Monte Carlo return \(G_t\): once the episode is generated, estimate the advantage of each action \[a_\pi(S_t,A_t)=q_\pi(S_t,A_t)-v_\pi(S_t)\] (that is, the state-action value with the state value as a variance-reducing baseline); specifically, compute the generalized advantage estimate
\[ \hat{A}_t = \sum_{n=0}^\infty (\gamma\lambda)^n \delta_{t+n} \]
where \(\delta_t=R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD(0)-error. The parameter \(\lambda\) lets you dial between the Monte Carlo estimate \(G_t-V(S_t)\) of the advantage (for \(\lambda=1\), low bias, high variance) and the TD(0) estimate \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) of the advantage (for \(\lambda=0\), high bias, low variance). Add back the baseline to get a variance-reduced estimate of the return,
\[G^\text{target}_t = \hat{A}_t+V(S_t) . \]
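A hedged sketch of computing \(\hat{A}_t\) and \(G^\text{target}_t\) with the usual backward recursion; the `dones` bookkeeping for games that terminate inside the buffer is an assumption:

```python
import numpy as np

def gae(rewards, values, dones, gamma, lam):
    """rewards, dones: (T,) arrays; values: (T+1,) array with a bootstrap value appended.
    Returns the advantages A-hat_t and the value targets G_t = A-hat_t + V(S_t)."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv, adv + values[:T]
```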
Compute and store these estimates in a parallel array, and train the value network \(\hat{v}(\cdot,w)\) to predict them, by minimizing the loss
\[ \mathcal{L}^\text{VF}(w) = \hat{\mathbf{E}}_t \big( \hat{v}(S_t,w) - G^\text{target}_t \big)^2 \]
where \(\hat{\mathbf{E}}_t\) means the average across the batch.
As for training the policy, recall that by the Policy Gradient Theorem,
\[ \nabla J(\theta) = \nabla v_{\pi_\theta}(S_0) = \nabla \mathbf{E}_{\pi_\theta} \big[ \hat{A}_t \ln \pi_\theta(A_t|S_t) \big]. \]
So if \(\hat{A}_t\) were on-policy data, we could approximate the gradient \(\nabla J(\theta)\) by (automatically) computing the gradient of the loss function
\[ \mathcal{L}^\text{on}(\theta) = \hat{\mathbf{E}}_t \big[ \hat{A}_t \ln \pi_\theta(A_t|S_t) \big]. \]
But since \(\hat{A}_t\) is off-policy data, we could try to correct for that by using the loss function
\[ \mathcal{L}^\text{off}(\theta) = \hat{\mathbf{E}}_t \Big[ \rho_t(\theta) \hat{A}_t \Big] \]
where \(\rho_t(\theta)=\frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}\) is the importance sampling ratio, since then (because \(\nabla x = x \nabla \ln x\)),
\[ \nabla \mathcal{L}^\text{off}(\theta) = \hat{\mathbf{E}}_t \Big[ \rho_t(\theta) \hat{A}_t \; \nabla \ln \pi_\theta(A_t|S_t) \Big]. \]
(Multiplying \(\hat{A}_t\) by the importance sampling ratio for just the first action \(A_t\), rather than by the \(\lambda\)-weighted ratios along the entire trajectory, is a judicious simplification.) However, to keep the policy from moving too far, we instead maximize the clipped surrogate objective
\[ \mathcal{L}^\text{CLIP}(\theta) = \hat{\mathbf{E}}_t \Big[ \min\Big( \; \rho_t(\theta)\hat{A}_t, \quad \text{clip}\big(\rho_t(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_t \; \Big) \Big] \]
where
\[ \text{clip}(x,a,b) = \begin{cases} b & x\ge b\\ x & a<x<b \\ a & x\le a \end{cases} \]
(see [Schulman-Wolski-Dhariwal-Radford-Klimov 2017]). In this way \(\epsilon\) defines a trust region (how far the new policy is allowed to deviate from the old one). The effect is to clip the gradient, so that the network is never subjected to large, violent updates. (This is the “proximal” in PPO.)
Since for efficiency we have made the value and policy networks share a single ResNet backbone, the value and policy weights \(w\) and \(\theta\) actually share components. Regard them as a single vector \(\phi\) (the parameters of the combined network) and perform unified (“end-to-end”) updates by maximizing the sum \[ \mathcal{L}(\phi) = \mathcal{L}^\text{CLIP}(\phi) - c_1 \mathcal{L}^{VF}(\phi) + c_2 H(\pi_\phi) \] where \(c_1\) and \(c_2\) are positive hyperparameters and \[ H(\pi_\phi) = - \sum_m \pi_\phi(m) \log \pi_\phi(m) \] is a Shannon entropy bonus, added to encourage the policy to explore. (This should be the entropy for the distribution over legal moves \(m\), so mask out illegal moves before computing the entropy.)
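Putting the three terms together, here is a minimal JAX sketch of the combined objective; the clip range \(\epsilon\) and the weights \(c_1, c_2\) shown are illustrative defaults, not prescribed values:

```python
import jax.numpy as jnp

def ppo_objective(new_logp, old_logp, advantages, value_preds, value_targets,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """Returns L(phi) = L^CLIP - c1 * L^VF + c2 * H; maximize it (or minimize its negation)."""
    rho = jnp.exp(new_logp - old_logp)                    # importance sampling ratios
    clipped = jnp.clip(rho, 1.0 - eps, 1.0 + eps)
    l_clip = jnp.mean(jnp.minimum(rho * advantages, clipped * advantages))
    l_vf = jnp.mean((value_preds - value_targets) ** 2)   # value loss
    return l_clip - c1 * l_vf + c2 * jnp.mean(entropy)    # entropy bonus encourages exploration
```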
[One might argue that importance sampling should be used when training the value network as well, but this would add destabilizing variance to the signal used to train the policy.]
Programming tip: Insert both players’ data into the replay buffer, reflecting the opponent’s position so that it appears from a first-person perspective. (But be sure to compute each player’s generalized advantage estimates separately.) The data within the replay buffer is highly correlated, because the states appearing within a single game are highly correlated (for both players) and these states appear nearby in the buffer. You can decrease the variance of your gradient updates by passing over the replay buffer in randomized minibatches: form an array of all possible indices into the replay buffer, shuffle it, and then sample the data in minibatches determined by nonoverlapping windows of indices in this shuffled array.
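A minimal sketch of that shuffled-minibatch pass:

```python
import numpy as np

def shuffled_minibatches(buffer_size, minibatch_size, rng):
    """Yield nonoverlapping windows of a shuffled index array: one full pass over the buffer."""
    order = rng.permutation(buffer_size)
    for start in range(0, buffer_size, minibatch_size):
        yield order[start:start + minibatch_size]

# usage: for idx in shuffled_minibatches(len(states), 256, np.random.default_rng(0)):
#            minibatch = (states[idx], actions[idx], old_logps[idx], advantages[idx], targets[idx])
```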
This agent replaces 2-ply search with Monte Carlo Tree Search (MCTS) making use of a stochastic world model.