The Colonel Blotto game was proposed by Borel in 1921 . In this game, two players fight on \(n\) simultaneous and independent castles of different values. Each player has a finite budget of soldiers, \(B\), to split over the castles, and for each castles \(k\), the value \(w_k\), will be awarded to the player that allocates more power there. Each player aims to win the game. The game is called symmetric if all players have the same budget, homogenous if all castles have the same value, continuous if allocations can be arbitrary positive real numbers, and discrete if allocations are restricted to be whole numbers. In this project, we consider the 2-person, symmetric, discrete Colonel Blotto game. Moreover, for each \(k = 1, 2, \dots, n\), we let \(w_k = k\). For example, here is one potential match: $$\begin{array}[!htbp] \centering \begin{array}{|c||c|c|c|c|c|c|c|c|c|c||} \hline & & & \\[-8pt] {} & C1 & C2 & C3 & C4 & C5 & C6 & C7 & C8 & C9 & C10 \\[8pt] \hline \hline & & & \\[-8pt] Alice & 10 & 10 & 10 & 10 & 20 & 0 & 0 & 0 & 20 & 20 \\[3pt] \hline \\[-8pt] Carol & 20 & 0 & 10 & 5 & 10 & 5 & 10 & 20 & 0 & 0 \\[3pt] \hline \end{array} \end{array}$$ In the example match above, Alice wins castles 2, 4, and 5, for a total of 11 points. Carol wins castles 1, 6, 7, and 8 — and then castles 9 and 10 — for a total of 41 points. (No one wins castles 3, since that was a tie. Alice does not win castles 9 or 10 because those were awarded to Carol after she won castles 6 through 8.). Well played, Carol!
We are motivated by two primary research questions. Given a fraction of allocations \(t = \frac{p}{B}\), what is the probability \(P(t)\) that Player 1 wins? How might we construct a player that will learn to maximize its chances of winning? The main results from our explorations are as follows.
Here, we consider the Colonel Blotto game with 100 troops and 2 castles. Above is a plot of Player 1's probability of winning the game based on their allocation to castle 1.
A plot of Player 1's probability of winning a Colonel Blotto game with 81 soldiers and 3 castles based on their allocations to castle 1 and 2.
Now, we consider how a Q-Learning agent performs against a Random agent in Colonel Blotto by reproducing results by Noel and, at the same time, we should realize the work above in our results. We set up a Colonel Blotto game with 3 fields and 100 soldiers per player. This provides a total of 5,151 possible actions, or strategies. The winner of the game will receive a reward of \(r = 1\), the loser will receive a reward of \(r = -1\), and no reward is received in the event of a tie. A single game constitutes a single episode in reinforcement learning. We count how many games each player has won over time and show the results. Our reinforcement learning agent uses the Q-Learning algorithm for approximating the optimal policy. We use \(\alpha = 0.1\) as the learning rate and \(\gamma = 1\) as the discount factor. We also set \(\epsilon = 0.2\) as the exploration rate for the agent. For the opponent, we use a Random agent that will select one of the 5,151 possible actions at random with equal probability for all of them. The experiments were run using the OpenSpiel environment simulator in Google Colab. A link to the project can be found here. We ran 100,000 games of Colonel Blotto to see the performance of the reinforcement learning agent and show the results below.
A plot of the number of games won over 100,000 episodes for both the RL and Random agent.
We also dive deeper and look at the Q-values that the agent has assigned to each of the 5,151 possible actions, to better understand what strategy it has arrived at. We show the top 10 and the bottom 10 actions based on their Q-value scores below. $$\begin{array} \centering \begin{array}{|cc|} \hline & & \\[-8pt] \text{Top 10 Actions} & \text{Bottom 10 Actions} \\[8pt] \hline \hline & & \\[-8pt] (44,26,30) & (0,1,99) \\[3pt] \hline \\[-8pt] (34,31,35) & (4,94,2) \\[3pt] \hline \\[-8pt] (68,20,12) & (6,88,6) \\[3pt] \hline \\[-8pt] (3,48,49) & (8,1,91) \\[3pt] \hline \\[-8pt] (37,57,6) & (77,15,8) \\[3pt] \hline \\[-8pt] (25,35,40) & (84,13,3) \\[3pt] \hline \\[-8pt] (47,41,12) & (96,2,2) \\[3pt] \hline \\[-8pt] (64,31,5) & (81,6,13) \\[3pt] \hline \\[-8pt] (31,0,69) & (5,73,22) \\[3pt] \hline \\[-8pt] (33,21,46) & (19,1,80) \\[3pt] \hline \end{array} \end{array} $$ As can be seen, the actions with the highest and lowest Q-values follow from our main results.
It is important to distinguish game theory (the study of interdependent choice and action) from decision theory (the study of choice in contexts where the agent is choosing independently of other agents) and social choice theory (the study of collective decision making). This distinction allows the reader to realize game theory as a tool to study human behavior, and, in particular, ethical behavior. This is because, in game theory, each agent has their own description of which states of the world they like and they act in an attempt to bring about these states of the world. These actions can include both good and bad things happening to other agents. The objective of this section is to offer a brief sketch as to one way that game theory is relevant to ethics.
Before doing so, it remains necessary to examine the concept of utility. Utility is prevalently used in game theory to model an agent’s interests. It quantifies an agent’s degree of preference across a set of available alternatives, and describes how these preferences change when an agent faces uncertainty about which alternative he will receive. This is performed through a utility function, which maps states of the world to real numbers. These numbers are interpreted as measures of an agent’s level of happiness in the given states. Now that we have formally defined self-interest, we can discuss morality.
Edna Ullmann-Margalit’s The Emergence of Norms argues that morality can be understood as a system of norms that enable agents to coordinate their actions in order to achieve mutually advantageous outcomes in situations where the pursuit of self-interest would normally prevent this. She takes the example of two artillery men facing a choice to either stay and use their weapons and fight against the enemy or flee for their safety. If both decide to stay and face the enemy, they may have a high chance of being injured in the attack, but there is a high probability that they will stop the enemy from advancing further. However, if one artillery man stays and the second one flees, the brave artillery man is certain to die in battle while the other one will escape safely. So, each artillery man has a choice of either fleeing or staying and fighting.
So, for each individual artillery man it would be an advantage to flee, despite the actions of the other. From the point of view of utility maximization, both artillery men would be better off if they stayed in their respective positions and faced the enemy, even though this may seem paradoxical, showing that the outcome of individual rational action is being suboptimal. Ullmann-Margalit argues that in this example rationality appears to point to outcomes that are not optimal, morality binds the individual gunners to stay and fight, to face the enemy and defend their positions, so it seems as if the function of morality is to assist us in preventing the failures of rationality.
We only applied a basic neural network to predict a strategy's probability of winning the game and trained a reinforcement learning agent to play the game. We did not spend enough time on exploring and improving their performance. With more time, we would test more on the parameters and neural network layers.
We would like to characterize the Colonel Blotto game with four castles and ultimately generalize to any number of castles.
For next time, we would like to engage with more literature, concretely. This would be for the sake of familiarizing ourselves with terminology, formalizing our own work, and furthering our understanding of the ideas of others already in this space. Moreover, this would allow us to explore connections from game theory to theoretical computer science, probability theory, statistical theory, and economics.
Lastly, we would start coding earlier, so that we could have more time exploring and improving the final results. This would have allowed us to attempt to run our code on Pomona's servers.
Given \(t = (t_1, t_2, t_3, t_4)\), with \(t_1 + t_2 + t_3 + t_4 = 1\), we expect \(P(t)\) is approximately a polynomial of degree 3 when written in terms of \(t_4\). We are in search of the best polynomial to fit the data.
Here, we consider the Colonel Blotto game with 20 troops and 4 castles. Above is a plot of Player 1's probability of winning the game based on their allocation to castle 4. Note that we get this behavior of multiple scatterpoints because \(P(t)\) is a function of more than one variable.
[Bor53] Emile Borel. “The Theory of Play and Integral Equations with Skew Symmetric Kernels”. In: Econometrica 21.1 (1953), pp. 97–100. issn: 00129682, 14680262. url: http://www.jstor.org/ stable/1906946 (visited on 04/29/2022).
[HV21] Keith Hankins and Peter Vanderschraaf. Game theory and ethics. Sept. 2021. url: https:// plato.stanford.edu/entries/game-ethics/.
[Jay21] Siddhartha Jayanti. “Nash Equilibria of The Multiplayer Colonel Blotto Game on Arbitrary Measure Spaces”. In: CoRR abs/2104.11298 (2021). arXiv: 2104.11298. url: https://arxiv. org/abs/2104.11298.
[Lan+19] Marc Lanctot et al. OpenSpiel: A Framework for Reinforcement Learning in Games. 2019. doi: 10.48550/ARXIV.1908.09453. url: https://arxiv.org/abs/1908.09453.
[Noe22] Joseph Christian G. Noel. Reinforcement Learning Agents in Colonel Blotto. 2022. doi: 10.48550/ARXIV.2204.02785. url: https://arxiv.org/abs/2204.02785.