
This is a living document. It is being regularly updated. We will be discussing all these ideas in class in coming days. To access the code I am posting, clone the repo with
```
git clone https://cs.bc.edu/mctague/t/2025/rl/backgammon.git
```
I recommend writing your code in separate files which import the files I post there, so that you can git pull any modifications and additions I make without affecting your code.
You will create a sequence of increasingly sophisticated RL agents to play backgammon. You will be able to compare the abilities of your own agents (and your classmates’) by pitting them against each other in tournament.
Carefully review the rules of the game. We will follow all rules except we will not use a doubling cube.
You will implement everything from scratch in Python. To make this practical, we will need to use Numba to accelerate some aspects (like legal move enumeration), and high performance libraries for GPU-based linear algebra (JAX) and neural networks (Flax). You can use PyTorch instead if you like.
Each of your agents will learn from scratch, by playing many games against itself. This will require a powerful computer, which you will access through NSF ACCESS.
You will create code to store and load your fully trained agents to and from disk, so that you can transmit them to other computers for further evaluation, and ultimately to submit them with the rest of your work.
Backgammon has roughly \(10^{20}\) states. (A back-of-the-envelope estimate based on stars and bars would be \(\binom{40}{15}^2\approx 1.6\times 10^{21}\), which turns out to overcount by about a factor of 8.) This places backgammon beyond the reach of tabular methods. So all of your agents will have to learn to generalize across states.
We represent the current player as \(+1\) (white) or \(-1\) (black), the current dice roll as a (sorted) length-2 array of 8-bit ints, and the board as a NumPy array of 8-bit ints with the layout
| index | 0 | 1 | 2 | … | i | … | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|
| contents | white bar | point 1 | point 2 | … | point i | … | point 24 | black bar | white off | black off |
The last two ints are redundant: they encode the number of checkers that have been borne off by each player. White checkers are encoded with positive ints, black with negative (even on the black bar and in black off). The absence of a checker is encoded with a 0.
Warning: This appears to be a nonstandard encoding, but it allows for cleaner code than standard approaches. To compute the number of the current player’s checkers at point i, one simply computes state[i]*player. If this is negative, then the opponent has pieces there.
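For example, here is a tiny (hedged) illustration of the encoding; the checker placements are arbitrary, not a legal position:

```python
import numpy as np

# Arbitrary illustration of the encoding -- not a legal backgammon position.
state = np.zeros(28, dtype=np.int8)
state[6], state[13] = 5, -5   # five white checkers on point 6, five black checkers on point 13
state[25] = -1                # one black checker on the black bar (negative even here)

player = -1                   # black to move
print(state[13] * player)     #  5: the current player (black) has five checkers on point 13
print(state[6] * player)      # -5: negative, so the opponent holds point 6
```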
The rules governing which moves are legal in backgammon are somewhat complex and need to be implemented correctly and efficiently. On each turn, a player must use as many of the dice as possible, and when either die can be used but not both, the player must use the higher die. To bear off, all of a player’s checkers must be in their home quadrant, and a player can use a die roll \(r\) to bear off a checker closer than \(r\) to the bear-off area only if the player has no checkers further from the bear-off area. (See the rules linked to above for a full description of all rules.)
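As one illustration of the bear-off condition, here is a hedged sketch for white. It assumes, purely for this example, that white's home quadrant is points 1–6 and that white bears off moving toward index 0; the conventions in the course repo may differ.

```python
import numpy as np

def white_can_bear_off(state, point, die):
    """Sketch of the bear-off rule for white under the assumed point convention."""
    if state[0] > 0:                       # white checker(s) still on the bar
        return False
    if np.any(state[7:25] > 0):            # white checkers outside the home quadrant
        return False
    if state[point] <= 0:                  # no white checker on this point
        return False
    if die == point:                       # exact roll always bears off
        return True
    if die > point:                        # larger roll: only if no checker further from the edge
        return not np.any(state[point + 1:7] > 0)
    return False                           # die < point: the checker must move within the board
```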
2-ply search for move selection. In the planning phase, the agent considers each legal action \(a\) in the current state (the state includes the player’s current dice roll). For each action \(a\), the agent considers all 21 possible opponent dice rolls \(d_j\), leading to 21 possible states \(S_{a,j}\). The agent considers all possible moves \(a_{opp}\) by the opponent in state \(S_{a,j}\) and applies its own state-value function to each resulting state, \(V(S_{a,j,a_{opp}})\). The agent assumes the opponent makes the move \(a_{opp}\) which minimizes this value (and does not itself perform a 2-ply search). As a formula,
\[ a^* = \mathrm{argmax}_a \sum_j P(d_j) \left[ \min_{a_{opp}} V(S_{a,j,a_{opp}}) \right] \]
Programming tip: When a value \(V(S_{a,j,a_{opp}})=\hat{v}(S_{a,j,a_{opp}},w_t)\) is computed on a GPU (which will certainly be the case for your second agent, when \(\hat{v}(\cdot,w_t)\) is a neural network), the state \(S_{a,j,a_{opp}}\) needs to be transferred from the CPU to GPU, and then the computed value needs to be transferred back to the CPU. These transfers have high latency but high bandwidth, and the GPU is optimized for doing many calculations \(\hat{v}(\cdot,w_t)\) in parallel. So to obtain good performance, you should build up an array of all states \(S_{a,j,a_{opp}}\) appearing in the RHS of the formula, transfer this array to the GPU all at once, compute all their values in parallel, and then transfer the array of values back to the CPU. Back on the CPU, compute each sum of the RHS in parallel, using multiple CPU cores. (A more sophisticated and higher performance technique would be to do all of these calculations, including legal move enumeration, on the GPU.)
Another tip: For many but not all moves, the order of submoves does not matter, leading to the same afterstate. For efficiency, 2-ply search should branch on and evaluate only unique afterstates.
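Putting the formula and both tips together, here is a hedged sketch of 2-ply selection with a single batched GPU evaluation. `legal_moves` and `dice_rolls` are hypothetical helpers (move enumeration lives elsewhere in your code), `value_fn(params, batch)` is assumed to evaluate an `(N, 28)` batch of boards from the current player's perspective, and turns with no legal move are glossed over.

```python
import numpy as np
import jax.numpy as jnp

def two_ply_select(state, player, roll, params, value_fn, legal_moves, dice_rolls):
    """Select the player's move by 2-ply search with one CPU->GPU->CPU round trip."""
    candidates = legal_moves(state, player, roll)    # player's afterstates
    leaves, slices = [], []                          # slices[a] = [(start, stop, prob), ...]
    for after in candidates:
        per_roll = []
        for opp_roll, prob in dice_rolls():          # 21 distinct opponent rolls with probabilities
            replies = legal_moves(after, -player, opp_roll)
            # keep only unique opponent afterstates (submove order often doesn't matter)
            unique = list({r.tobytes(): r for r in replies}.values())
            start = len(leaves)
            leaves.extend(unique)
            per_roll.append((start, len(leaves), prob))
        slices.append(per_roll)

    # one transfer to the GPU, one batched evaluation, one transfer back
    values = np.asarray(value_fn(params, jnp.asarray(np.stack(leaves))))

    scores = np.zeros(len(candidates))
    for a, per_roll in enumerate(slices):
        for start, stop, prob in per_roll:
            if stop > start:                         # opponent picks the reply minimizing our value
                scores[a] += prob * values[start:stop].min()
    return candidates[int(np.argmax(scores))]
```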
This first agent approximates the state value function \(V=\hat{v}(\cdot,w)\) using a matrix which acts on a handcrafted vector of state features:
This approach constrains the agent to finding a linear combination of human-designed features. These features are based on human strategies for the game. The advantage is that the strategy the agent finds is highly interpretable: humans can easily inspect and understand how the agent works. The disadvantage is that humans have to do the work of designing and coding these features, and it is difficult or impossible for the agent to transcend human understanding and discover a strategy not already envisioned by humans.
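To make the shape of this concrete, here is a hedged sketch with a handful of illustrative features (not the feature set you are required to design) and a plain linear value estimate:

```python
import numpy as np

def handcrafted_features(state, player):
    """A few illustrative hand-designed features; a real feature vector will be much richer."""
    pts = state[1:25] * player                 # positive wherever the current player has checkers
    bar = state[0] if player == 1 else -state[25]
    off = state[26] if player == 1 else -state[27]
    return np.array([
        1.0,                                   # bias term
        bar,                                   # checkers on the bar
        off,                                   # checkers already borne off
        np.sum(pts == 1),                      # blots (single, exposed checkers)
        np.sum(pts >= 2),                      # made points
    ], dtype=np.float32)

def linear_value(state, player, w):
    """Minimal sketch of a linear approximation: v-hat(s, w) = w . phi(s)."""
    return float(w @ handcrafted_features(state, player))
```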
First implement this as an agent playing a single game, learning online, updating \(V\) move-by-move.
Next, create a vectorized version of the environment in which the agent plays a whole batch of games in parallel. (State is a vector of states, action is a vector of actions, reward is a vector of rewards. When one of the games in the batch ends, start a new one in its place.) Then create a vectorized version of your agent which plays a batch of games in parallel, learning online, updating \(V\) move-by-move, but where these “moves” are batches of moves. The update target is the average of the TD(0) targets across the batch. This will reduce variance of the stochastic gradient update, improving convergence. We will use this vectorized approach in all following agents.
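A minimal sketch of that batched TD(0) update for the linear agent, assuming the feature vectors for each parallel game have already been extracted:

```python
import numpy as np

def batched_td0_update(w, feats, next_feats, rewards, done, alpha, gamma):
    """feats, next_feats: (B, d) arrays for the B parallel games; rewards, done: (B,) arrays.
    Averages the TD(0) updates across the batch."""
    v      = feats @ w
    v_next = np.where(done, 0.0, next_feats @ w)       # do not bootstrap past a terminal state
    delta  = rewards + gamma * v_next - v               # per-game TD(0) errors
    return w + alpha * np.mean(delta[:, None] * feats, axis=0)   # grad of a linear model is feats
```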
Optionally, implement TD(\(\lambda\)), \(\lambda \in [0.7, 0.9]\). This will improve your agent, and help you prepare for the next agent.
Here \(w_t\) and \(z_t\) are the weight and eligibility trace vectors at time \(t\):
\[ \begin{align*} z_{-1} &= 0 \\ g_t &= \nabla \hat{v}(S_t,w_t) \\ z_t &= \gamma\lambda z_{t-1} + g_t \\ \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\ w_{t+1} &= w_t + \alpha \delta_t z_t \\ \end{align*} \]
When learning in batches, \(z_t\), \(g_t\), and \(\delta_t\) are computed and stored separately for each parallel game, but a single weight vector is updated with the summed contribution \(w_{t+1}=w_t+\alpha \sum_i \delta_t^{(i)} z_t^{(i)}\).
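A hedged sketch of this batched TD(\(\lambda\)) update for the linear agent; resetting a game's trace when it terminates is bookkeeping I am assuming, not something specified above:

```python
import numpy as np

def batched_td_lambda_update(w, z, feats, next_feats, rewards, done, alpha, gamma, lam):
    """z: (B, d) eligibility traces, one per parallel game; returns the updated (w, z)."""
    g      = feats                                      # gradient of a linear v-hat is the features
    z      = gamma * lam * z + g                        # per-game trace update
    v      = feats @ w
    v_next = np.where(done, 0.0, next_feats @ w)
    delta  = rewards + gamma * v_next - v
    w      = w + alpha * np.sum(delta[:, None] * z, axis=0)
    z      = np.where(done[:, None], 0.0, z)            # assumed: reset traces of finished games
    return w, z
```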
You get faster convergence to a better fixed point using true online TD(\(\lambda\)) [van Seijen-Sutton 2014], [van Seijen-Mahmood-Pilarski-Machado-Sutton 2016]:
\[ \begin{align*} z_{-1} &= 0 \\ g_t &= \nabla \hat{v}(S_t,w_t) \\ z_t &= \gamma\lambda z_{t-1} + g_t - \underbrace{\alpha\gamma\lambda(z_{t-1}\cdot g_t) g_t}_{\text{online correction term}} \\ \delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t) \\ w_{t+1} &= w_t + \alpha \delta_t z_t + \underbrace{\alpha\big(\hat{v}(S_t,w_t) - g_t\cdot w_{t-1}\big)(z_t-g_t)}_{\text{online correction term}} \\ \end{align*} \]
(Here \(g_t\cdot w_{t-1}\) is the value of \(S_t\) under the previous weights; for a linear approximator it equals the bootstrap value \(\hat{v}(S_t,w_{t-1})\) already computed at the previous step, so cache it rather than recomputing it.)
Feature coding. Put each player’s checkers on separate feature planes. Encode the state of the 24 points across 15 planes in total (each of 2.–8. determines two planes, one for each player, while 1. determines just one plane):
and encode the auxiliary features
Network architecture. Feed these features into the following neural network architecture:
```
x_out = x_in + Conv(ReLU(LayerNorm(Conv(ReLU(LayerNorm(x_in))))))
```

Learn policy \(\pi_\theta(\cdot|\cdot)\) and value \(\hat{v}(\cdot,w)\) networks in tandem (actor/critic). During 2-ply search, use the policy to prune the search tree to, say, the top 5 moves (for both player and opponent). This agent will be less thorough in planning, so for a given value function it will be slightly weaker. But it will plan faster, and will therefore be able to play more games against itself in any given training period, obtaining a better value function.
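As a concrete illustration of the residual block that Agents 2 and 3 share, here is a minimal Flax sketch; the channel count and kernel size are illustrative choices, an initial convolution projecting the feature planes to that width is assumed, and a PyTorch version is equally acceptable:

```python
import flax.linen as nn

class ResidualBlock(nn.Module):
    """x_out = x_in + Conv(ReLU(LayerNorm(Conv(ReLU(LayerNorm(x_in))))))"""
    channels: int = 64                     # illustrative width

    @nn.compact
    def __call__(self, x):                 # x: (batch, 24, channels), planes along the 24 points
        h = nn.Conv(self.channels, kernel_size=(3,), padding="SAME")(nn.relu(nn.LayerNorm()(x)))
        h = nn.Conv(self.channels, kernel_size=(3,), padding="SAME")(nn.relu(nn.LayerNorm()(h)))
        return x + h
```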
Network architecture. Take Agent 2’s network, which already has a value head (critic), and add to it:
To use this policy head to select the 5 or so most promising moves, begin by masking out the illegal submoves from the grid. Among the remaining submoves, identify the one with the largest logit, then collect all legal moves beginning with that submove. If there are fewer than 5 such moves, continue to the submove with the second-largest logit and add all legal moves beginning with it. Continue like this until you have accumulated at least 5 moves (unless there are fewer than 5 legal moves in total).
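A hedged sketch of that pruning step; it assumes (hypothetically) that a move is a tuple of submoves, that a submove is an integer index into the flattened logit grid, and that a submove counts as legal when some legal move begins with it:

```python
import numpy as np

def promising_moves(submove_logits, legal_moves, k=5):
    """Collect roughly the k most promising moves from the policy head's logits."""
    masked = np.full_like(submove_logits, -np.inf)
    for move in legal_moves:                       # unmask only legal first submoves
        masked[move[0]] = submove_logits[move[0]]
    chosen = []
    for s in np.argsort(-masked):                  # largest logit first
        if not np.isfinite(masked[s]):
            break                                  # only illegal submoves remain
        chosen += [m for m in legal_moves if m[0] == s]
        if len(chosen) >= k:
            break
    return chosen
```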
During competition, perform 2-ply search pruned to use just these most promising moves (for both the player and opponent), and select the move among them having the optimal expected value, just as Agents 1 and 2 did.
During training, use this pruned 2-ply search to estimate the value of the current state, and save this value \(V(S_t)\) to the replay buffer. But then, to ensure exploration, sample a move randomly according to the softmax distribution given by the policy head’s logits: randomly sample the first submove (using the “Log-Sum-Exp trick” for numerical stability, after masking out illegal submoves), then feed the resulting intermediate state back into the policy network, and sample the second submove (after masking out illegal submoves), and so on, until obtaining a legal move \(A_\text{sampled}\) along with the probability \(\pi_\text{old}\) of its having been sampled by the current policy network (which will later become “old”). This is the action the agent actually takes, and it records all this information in the replay buffer, which consists of parallel arrays
\[ (S_t, A_\text{sampled}, \pi_\text{old}, R_{t+1}, V(S_t)) \]
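A hedged sketch of sampling a single submove this way; chaining (summing) the per-submove log-probabilities along the sampled move gives \(\log \pi_\text{old}\) for that move:

```python
import numpy as np

def sample_submove(logits, legal_mask, rng):
    """Sample one submove from the masked softmax and return (index, log-probability)."""
    masked = np.where(legal_mask, logits, -np.inf)
    shifted = masked - masked.max()                        # log-sum-exp trick for stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    probs = np.exp(log_probs)
    idx = rng.choice(len(logits), p=probs / probs.sum())   # renormalize away float error
    return idx, log_probs[idx]

# usage: rng = np.random.default_rng(0); idx, logp = sample_submove(logits, legal_mask, rng)
```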
The agent learns offline with repeated passes over this replay buffer. During the first pass, this looks like on-policy learning, but with each additional pass the policy drifts further and further from the policy which generated the data, so it starts to look more and more like off-policy learning. For this reason we need to use importance sampling. However, we clip the importance sampling ratios to keep the policy from drifting too far from the original. In this way, we can get higher data efficiency (multiple passes over the data) without letting fluke events in the data distort our policy too much. After a few passes over the data, generate fresh data using the improved policy. (Either discard the old data, or keep it and update all the \(\pi_\text{old}\) and \(V(S_t)\) entries using the improved policy and value networks, which will make the old, earnestly played games look more and more like erratic, highly exploratory games.)
PPO is similar in spirit to the REINFORCE Monte Carlo policy gradient algorithm with a baseline, but with a variance-reducing substitute for the Monte Carlo return \(G_t\): once the episode is generated, estimate the advantage of each action \[a_\pi(S_t,A_t)=q_\pi(S_t,A_t)-v_\pi(S_t)\] (that is, the state-action value with the state value as a variance-reducing baseline); specifically, compute the generalized advantage estimate
\[ \hat{A}_t = \sum_{n=0}^\infty (\gamma\lambda)^n \delta_{t+n} \]
where \(\delta_t=R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is the TD(0)-error. The parameter \(\lambda\) lets you dial between the Monte Carlo estimate \(G_t-V(S_t)\) of the advantage (for \(\lambda=1\), low bias, high variance) and the TD(0) estimate \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) of the advantage (for \(\lambda=0\), high bias, low variance). Add back the baseline to get a variance-reduced estimate of the return,
\[G^\text{target}_t = \hat{A}_t+V(S_t) . \]
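A hedged sketch of computing \(\hat{A}_t\) and \(G^\text{target}_t\) with the usual backward recursion; the `dones` bookkeeping for games that terminate inside the buffer is an assumption:

```python
import numpy as np

def gae(rewards, values, dones, gamma, lam):
    """rewards, dones: (T,) arrays; values: (T+1,) array with a bootstrap value appended.
    Returns the advantages A-hat_t and the value targets G_t = A-hat_t + V(S_t)."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv, adv + values[:T]
```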
Compute and store these estimates in a parallel array, and train the value network \(\hat{v}(\cdot,w)\) to predict them, by minimizing the loss
\[ \mathcal{L}^\text{VF}(w) = \hat{\mathbf{E}}_t \big( \hat{v}(S_t,w) - G^\text{target}_t \big)^2 \]
where \(\hat{\mathbf{E}}_t\) means the average across the batch.
As for training the policy, recall that by the Policy Gradient Theorem,
\[ \nabla J(\theta) = \nabla v_{\pi_\theta}(S_0) = \nabla \mathbf{E}_{\pi_\theta} \big[ \hat{A}_t \ln \pi_\theta(A_t|S_t) \big]. \]
So if \(\hat{A}_t\) were on-policy data, we could approximate the gradient \(\nabla J(\theta)\) by (automatically) computing the gradient of the loss function
\[ \mathcal{L}^\text{on}(\theta) = \hat{\mathbf{E}}_t \big[ \hat{A}_t \ln \pi_\theta(A_t|S_t) \big]. \]
But since \(\hat{A}_t\) is off-policy data, we could try to correct for that by using the loss function
\[ \mathcal{L}^\text{off}(\theta) = \hat{\mathbf{E}}_t \Big[ \rho_t(\theta) \hat{A}_t \Big] \]
where \(\rho_t(\theta)=\frac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}\) is the importance sampling ratio, since then (because \(\nabla x = x \nabla \ln x\)),
\[ \nabla \mathcal{L}^\text{off}(\theta) = \hat{\mathbf{E}}_t \Big[ \rho_t(\theta) \hat{A}_t \; \nabla \ln \pi_\theta(A_t|S_t) \Big]. \]
(Multiplying \(\hat{A}_t\) by the importance sampling ratio for just the first action \(A_t\), rather than by the \(\lambda\)-weighted ratios along the entire trajectory, is a judicious simplification.) However, to keep the policy from moving too far, we instead maximize the clipped surrogate objective
\[ \mathcal{L}^\text{CLIP}(\theta) = \hat{\mathbf{E}}_t \Big[ \min\Big( \; \rho_t(\theta)\hat{A}_t, \quad \text{clip}\big(\rho_t(\theta),1-\epsilon,1+\epsilon\big)\hat{A}_t \; \Big) \Big] \]
where
\[ \text{clip}(x,a,b) = \begin{cases} b & x\ge b\\ x & a<x<b \\ a & x\le a \end{cases} \]
(see [Schulman-Wolski-Dhariwal-Radford-Klimov 2017]). In this way \(\epsilon\) defines a trust region (how far the new policy is allowed to deviate from the old one). The effect is to clip the gradient, so that the network is never subjected to large, violent updates. (This is the “proximal” in PPO.)
Since for efficiency we have made the value and policy networks share a single ResNet backbone, the value and policy weights \(w\) and \(\theta\) actually share components. Regard them as a single vector \(\phi\) (the parameters of the combined network) and perform unified (“end-to-end”) updates by maximizing the sum \[ \mathcal{L}(\phi) = \mathcal{L}^\text{CLIP}(\phi) - c_1 \mathcal{L}^{VF}(\phi) + c_2 H(\pi_\phi) \] where \(c_1\) and \(c_2\) are positive hyperparameters and \[ H(\pi_\phi) = - \sum_m \pi_\phi(m) \log \pi_\phi(m) \] is a Shannon entropy bonus, added to encourage the policy to explore. (This should be the entropy for the distribution over legal moves \(m\), so mask out illegal moves before computing the entropy.)
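Putting the three terms together, here is a minimal JAX sketch of the combined objective; the clip range \(\epsilon\) and the weights \(c_1, c_2\) shown are illustrative defaults, not prescribed values:

```python
import jax.numpy as jnp

def ppo_objective(new_logp, old_logp, advantages, value_preds, value_targets,
                  entropy, eps=0.2, c1=0.5, c2=0.01):
    """Returns L(phi) = L^CLIP - c1 * L^VF + c2 * H; maximize it (or minimize its negation)."""
    rho = jnp.exp(new_logp - old_logp)                    # importance sampling ratios
    clipped = jnp.clip(rho, 1.0 - eps, 1.0 + eps)
    l_clip = jnp.mean(jnp.minimum(rho * advantages, clipped * advantages))
    l_vf = jnp.mean((value_preds - value_targets) ** 2)   # value loss
    return l_clip - c1 * l_vf + c2 * jnp.mean(entropy)    # entropy bonus encourages exploration
```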
[One might argue that importance sampling should be used when training the value network as well, but this would add destabilizing variance to the signal used to train the policy.]
Programming tip: Insert both players’ data into the replay buffer, reflecting the opponent’s position so that it appears from a first-person perspective. (But be sure to compute each player’s generalized advantage estimates separately.) The data within the replay buffer is highly correlated, because the states appearing within a single game are highly correlated (for both players) and these states appear nearby in the buffer. You can decrease the variance of your gradient updates by passing over the replay buffer in randomized minibatches: form an array of all possible indices into the replay buffer, shuffle it, and then sample the data in minibatches determined by nonoverlapping windows of indices in this shuffled array.
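A minimal sketch of that shuffled-minibatch pass:

```python
import numpy as np

def shuffled_minibatches(buffer_size, minibatch_size, rng):
    """Yield nonoverlapping windows of a shuffled index array: one full pass over the buffer."""
    order = rng.permutation(buffer_size)
    for start in range(0, buffer_size, minibatch_size):
        yield order[start:start + minibatch_size]

# usage: for idx in shuffled_minibatches(len(states), 256, np.random.default_rng(0)):
#            minibatch = (states[idx], actions[idx], old_logps[idx], advantages[idx], targets[idx])
```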
This agent replaces 2-ply search with Monte Carlo Tree Search (MCTS) making use of a stochastic world model.