# Using Reinforcement Learning To Learn To Play Tic-Tac-Toe

About a year ago I set myself the goal of writing an algorithm that could learn to play tic-tac-toe.
I didn't want to tell the algorithm what the rules of the game are,
nor did I want it to try and use some kind of calculation to look ahead at possible moves which might lead to a win from
the current state of the board. I wanted the algorithm to "learn" how to play, by playing against itself.
I knew nothing about machine learning, so I spent a bit of time learning about neural networks but didn't get very far
and convinced myself that neural networks wouldn't work and I put the project to the back of my mind.
A few months ago at a Christmas party, I bumped into an
acquaintance, JP, and I ended up telling him about my goal. He suggested reinforcement learning and a few days later
I started reading about it. It didn't take long before I was trying to understand Q-learning and failing fast by getting
lost in the maths. So with my limited knowledge I went about iteratively designing an algorithm to "learn" to play.
Here is the result, which you can play against interactively by clicking on the board or buttons. It contains no code
telling it what good or bad moves are and the AI algorithm knows nothing about the rules of the game - it's simply told what the
outcome of each move is (unfinished, won, drawn, lost). It doesn't even know that the aim is to get three in a row.
All the AI algorithm knows is that X and O make alternate moves and that a square can only contain an X or an O or be emtpy.

The "clear" button clears the board and resets the game. The "reset stats" button clears the score board below the buttons. The "play autonomously" button causes it to start playing against itself and continue learning - it swaps side every hundred games and will occasionally explore randomly, according to its exploration strategy (see below). The "swap roles" button causes it to change sides. The "explore" button can be used to turn off exploration during games against humans.

Tic-tac-toe is played by successively placing 'X's and 'O's (markers) on the board until one side has three in a row, vertically, horizontally or diagonally. To start with, each cell on the board is empty (modelled as a null in Javascript). As the game progresses, the state of the board, a.k.a. "board state", changes. As an example, here is the board state after one move by X in the middle of the top row of the board:

Each state can be converted into a "pattern" which is built by concatenating the cell values from left to right starting with the top row and ending with the bottom row, delimiting the rows with a pipe character. Empty cells are represented using a hyphen ('-'). So the example above becomes: "-X-|---|---". The patterns are used as keys, which map to a list of possible moves from that board state (like a dictionary). To make things simple, I chose that X would always start regardless of whether it was the AI or the human moving first. For each possible move, the algorithm stores the final result that was achieved - a win for X, a win for O or a draw. After the first move, there are eight possible moves, after the second, only seven, etc. Results are recorded in JSON similar to the following, and here is an example showing what the algorithm has learned for the pattern where no move has yet been made:

Each element in the array represents a move which can be made from the board state represented by the key, with 'i' representing the vertical position, and 'j' the horizontal position. So 'xWinRewards' records how well X has done from this board state, in this case an empty board, after moving to the position indicated by i and j. 'oWinRewards' records how well player O has done, and 'drawRewards' shows how likely a draw is. The example above was taken after a hundred thousand games were played. It shows how the algorithm has learned that X is most likely to win if it starts in the centre of the board (position i=1, j=1). It also shows how moving to the centre is more likely to result in a draw than a win for the second player O, because it has gained more rewards for draws than for O winning. You can also see that far fewer games were played where the first player opened in the bottom right (2, 2) because it has gained less rewards for that move.

Notice the term "reward". The numbers recorded above are not the number of games, rather they are rewards (or points) that are given at the end of the game, when the algorithm walks through each move of the game, retroactively awarding points for each move. First it builds the pattern and then it looks that pattern up in its memory. It then adds a reward to the relevant "next move" which was made in the current game, adding it to the relevant field in the object shown above. Take for example the board after an initial move by X to the centre of the board. Let's say that O won later in the game, after moving to position 2,2 as their first move. So the pattern which is looked up is "---|-X-|---". In that memory location, the algorithm finds the object with i=2 and j=2. It adds a reward to the field named "oWinRewards". The algorithm then moves on to the next move which took place in the game and does the same thing for the pattern "---|-X-|--O", since that pattern represents the game after the second move.

Initially I used a reward of just one point. It soon became clear that the algorithm wasn't learning properly and it was still making "stupid" decisions, for example it wouldn't complete two in a row in order to win immediately and would prefer to move elsewhere rather than taking the victory. The reason is that in the beginning it only moves randomly (it has to explore since it has no experience to exploit, see below) and sometimes it would miss the chance to win immediately, but then win later in a game anyway. It learned that it could skip an immediate win and still be successful. The solution was to increase the reward for winning shorter games, so that the incentive was higher.

The next problem I encountered was that the algorithm was giving too much weight to moves which it had already encountered many times, even if a different move was actually better. It resulted in the algorithm doing what appeared to be silly moves, and not really seeming to learn. The solution was to chose the next move based on relative rewards, rather than absolute rewards. That way, the total number of times a move was tried was no longer relevant. Rather the chances of winning were what made the algorithm chose the next move. I later learned that there are different strategies here - you can base the choice of the next move on the average reward, or the best one that has ever been achieved, and the strategy relates to the reinforcement learning algorithm being used (for example Q-learning apparently uses the best reward rather than an average one).

Another factor which influenced how the algorithm learned was whether it learned against a player who always moved randomly or against itself. It became a stronger player when learning against itself, than against a random monkey. That makes sense because if you've only ever had to respond to random moves, you won't have learned how to play someone who knows how to win tic-tac-toe, where more skill is required.

Finally, the last important factor in learning is the trade off between exploration (to try a random move which might lead to learning something new) and exploitation (using existing knowledge to choose a "good" move based on experience). Just like a human, the algorithm starts by only exporing randomly, because it doesn't know better, but the more games it plays, the less it explores and the more it choses moves based on experience. The method used to decide whether to explore or not has quite an effect on how quickly and how well the algorithm learns.

So does the algorithm really learn? Let's take a look by asking some questions.

Yes, because it knows that starting in the centre gives the best chances of a win. The rewards for the possible moves before any player has had a turn (pattern "---|---|---|"), ordered by reward with the best move at the top and the worst at the bottom are as follows:

Remember, reward is calculated relatively rather than absolutely, and rewards given at the end of the game are higher if the game is shorter. Note that the above calculation also weights the rewards, so that the algorithm prefers winning over drawing, and drawing over losing.

If you were wondering whether or not the centre is indeed the best opening move, please read my previous article which shows that contrary to common belief, the centre is a better opening move than a corner. Wikipedia, xkcd and Google all get it wrong.

Here are the rewards for the next moves after an opening in the centre (pattern "---|-X-|---|"):

My previous article and other sources like Wikipedias tic-tac-toe article and Martin Gardners book entitled Hexaflexagons and Other Mathematical Diversions agree that the best counter to an opening move in the centre is a move to a corner. The results above agree because all four corners have higher rewards (again shown on the right side) than the edges.

Yes, it knows how to win instantly, for example with the pattern "OX-|-X-|O--", which looks as follows, where the winning move is the middle of the bottom row (2,1):

The rewards for that pattern are:

Consider the pattern "XO-|-O-|--X" which looks like this:

It's Xs turn and the algorithm would need to block the bottom middle (2, 1) in order to stop O winning on their next move. Here are the rewards for that pattern:

The algorithm knows that doing a move at (2,1) is best and avoids the loss. It also knows that X has never won from here and can only draw by stopping the win for O.

Yes! The algorithm is clever! But not clever enough to read Wikipedia 😀

It's simply learned through playing, that these moves give the highest rewards. Here is what Wikipedia says:

If you look at the interactive board at the top of this article, you will see a line which says "Unique games seen: 4330". In my previous article I confirmed that there are 255,168 unique games, of which only 29,600 are not classed as "stupid" because of missed wins or blocks. So if there are so many games, why has the algorithm only encountered less than 2% of them and why is it still able to play so well and why has it learned as shown above?

We can get a list of all patterns where the algorithm hasn't experienced every possible next move, by opening the Javascript console for the interactive board (right click on the board and select "inspect", in Chrome), and running the following snippet:

The algorithm has only experienced one move from there, although there are actually 5.

Pattern "X--|OX-|---":

So if I am so impressed, am I scared of AI? Not really. It took me a long time to tune this algorithm so that it played reasonably and there are many dimensions which can be tuned:

Just a couple of closing remarks now:

What are the next steps? I have spent a lot of time on this little project. I've learned a lot. One of my original thoughts was that a true test would be to introduce a new game, say Three Men's Morris and see if the same algorithm could learn to play that too. One day... but first it's time for something different...

Finally a few links that I used along the way:

The "clear" button clears the board and resets the game. The "reset stats" button clears the score board below the buttons. The "play autonomously" button causes it to start playing against itself and continue learning - it swaps side every hundred games and will occasionally explore randomly, according to its exploration strategy (see below). The "swap roles" button causes it to change sides. The "explore" button can be used to turn off exploration during games against humans.

Tic-tac-toe is played by successively placing 'X's and 'O's (markers) on the board until one side has three in a row, vertically, horizontally or diagonally. To start with, each cell on the board is empty (modelled as a null in Javascript). As the game progresses, the state of the board, a.k.a. "board state", changes. As an example, here is the board state after one move by X in the middle of the top row of the board:

j=0 | j=1 | j=2 | |

i=0 | X | ||

i=1 | |||

i=2 |

Each state can be converted into a "pattern" which is built by concatenating the cell values from left to right starting with the top row and ending with the bottom row, delimiting the rows with a pipe character. Empty cells are represented using a hyphen ('-'). So the example above becomes: "-X-|---|---". The patterns are used as keys, which map to a list of possible moves from that board state (like a dictionary). To make things simple, I chose that X would always start regardless of whether it was the AI or the human moving first. For each possible move, the algorithm stores the final result that was achieved - a win for X, a win for O or a draw. After the first move, there are eight possible moves, after the second, only seven, etc. Results are recorded in JSON similar to the following, and here is an example showing what the algorithm has learned for the pattern where no move has yet been made:

const memory = { "---|---|---" : [ {i: 1, j: 1, xWinRewards: 185510, oWinRewards: 20028, drawRewards: 60161}, {i: 2, j: 2, xWinRewards: 1390, oWinRewards: 1910, drawRewards: 379}, ... ], ... };

Each element in the array represents a move which can be made from the board state represented by the key, with 'i' representing the vertical position, and 'j' the horizontal position. So 'xWinRewards' records how well X has done from this board state, in this case an empty board, after moving to the position indicated by i and j. 'oWinRewards' records how well player O has done, and 'drawRewards' shows how likely a draw is. The example above was taken after a hundred thousand games were played. It shows how the algorithm has learned that X is most likely to win if it starts in the centre of the board (position i=1, j=1). It also shows how moving to the centre is more likely to result in a draw than a win for the second player O, because it has gained more rewards for draws than for O winning. You can also see that far fewer games were played where the first player opened in the bottom right (2, 2) because it has gained less rewards for that move.

Notice the term "reward". The numbers recorded above are not the number of games, rather they are rewards (or points) that are given at the end of the game, when the algorithm walks through each move of the game, retroactively awarding points for each move. First it builds the pattern and then it looks that pattern up in its memory. It then adds a reward to the relevant "next move" which was made in the current game, adding it to the relevant field in the object shown above. Take for example the board after an initial move by X to the centre of the board. Let's say that O won later in the game, after moving to position 2,2 as their first move. So the pattern which is looked up is "---|-X-|---". In that memory location, the algorithm finds the object with i=2 and j=2. It adds a reward to the field named "oWinRewards". The algorithm then moves on to the next move which took place in the game and does the same thing for the pattern "---|-X-|--O", since that pattern represents the game after the second move.

Initially I used a reward of just one point. It soon became clear that the algorithm wasn't learning properly and it was still making "stupid" decisions, for example it wouldn't complete two in a row in order to win immediately and would prefer to move elsewhere rather than taking the victory. The reason is that in the beginning it only moves randomly (it has to explore since it has no experience to exploit, see below) and sometimes it would miss the chance to win immediately, but then win later in a game anyway. It learned that it could skip an immediate win and still be successful. The solution was to increase the reward for winning shorter games, so that the incentive was higher.

The next problem I encountered was that the algorithm was giving too much weight to moves which it had already encountered many times, even if a different move was actually better. It resulted in the algorithm doing what appeared to be silly moves, and not really seeming to learn. The solution was to chose the next move based on relative rewards, rather than absolute rewards. That way, the total number of times a move was tried was no longer relevant. Rather the chances of winning were what made the algorithm chose the next move. I later learned that there are different strategies here - you can base the choice of the next move on the average reward, or the best one that has ever been achieved, and the strategy relates to the reinforcement learning algorithm being used (for example Q-learning apparently uses the best reward rather than an average one).

Another factor which influenced how the algorithm learned was whether it learned against a player who always moved randomly or against itself. It became a stronger player when learning against itself, than against a random monkey. That makes sense because if you've only ever had to respond to random moves, you won't have learned how to play someone who knows how to win tic-tac-toe, where more skill is required.

Finally, the last important factor in learning is the trade off between exploration (to try a random move which might lead to learning something new) and exploitation (using existing knowledge to choose a "good" move based on experience). Just like a human, the algorithm starts by only exporing randomly, because it doesn't know better, but the more games it plays, the less it explores and the more it choses moves based on experience. The method used to decide whether to explore or not has quite an effect on how quickly and how well the algorithm learns.

So does the algorithm really learn? Let's take a look by asking some questions.

**1) Does the algorithm know where the best starting location is?**Yes, because it knows that starting in the centre gives the best chances of a win. The rewards for the possible moves before any player has had a turn (pattern "---|---|---|"), ordered by reward with the best move at the top and the worst at the bottom are as follows:

{i: 1, j: 1, xWinRewards: 185510, oWinRewards: 20028, drawRewards: 60161} // => 72.0 {i: 2, j: 2, xWinRewards: 1390, oWinRewards: 1910, drawRewards: 379} // => 38.3 {i: 0, j: 0, xWinRewards: 1169, oWinRewards: 1526, drawRewards: 447} // => 38.1 {i: 2, j: 0, xWinRewards: 1291, oWinRewards: 2068, drawRewards: 419} // => 34.7 {i: 0, j: 2, xWinRewards: 1191, oWinRewards: 1920, drawRewards: 495} // => 33.9 {i: 1, j: 2, xWinRewards: 1155, oWinRewards: 2300, drawRewards: 402} // => 30.4 {i: 0, j: 1, xWinRewards: 1122, oWinRewards: 2372, drawRewards: 400} // => 29.2 {i: 1, j: 0, xWinRewards: 844, oWinRewards: 1678, drawRewards: 491} // => 29.0 {i: 2, j: 1, xWinRewards: 1079, oWinRewards: 2292, drawRewards: 474} // => 28.7The reward that is shown on the right of each line, which is used when choosing the next move to make, is calculated as follows:

winRewards = player === X ? possibleMove.xWinRewards : possibleMove.oWinRewards; drawRewards = possibleMove.drawRewards; lossRewards = player === X ? possibleMove.oWinRewards : possibleMove.xWinRewards; total = winRewards + drawRewards + lossRewards; ... finalReward = (100*(winRewards/total)) + (10*(drawRewards/total)) + (-1*(lossRewards/total));

Remember, reward is calculated relatively rather than absolutely, and rewards given at the end of the game are higher if the game is shorter. Note that the above calculation also weights the rewards, so that the algorithm prefers winning over drawing, and drawing over losing.

If you were wondering whether or not the centre is indeed the best opening move, please read my previous article which shows that contrary to common belief, the centre is a better opening move than a corner. Wikipedia, xkcd and Google all get it wrong.

**2) Does it know how to respond to an opening move in the centre?**Here are the rewards for the next moves after an opening in the centre (pattern "---|-X-|---|"):

{i: 2, j: 0, xWinRewards: 26920, oWinRewards: 3712, drawRewards: 9333} // => 11.5 {i: 0, j: 0, xWinRewards: 67866, oWinRewards: 8622, drawRewards: 35873} // => 10.8 {i: 0, j: 2, xWinRewards: 21060, oWinRewards: 2616, drawRewards: 5856} // => 10.8 {i: 2, j: 2, xWinRewards: 21014, oWinRewards: 2574, drawRewards: 6730} // => 10.6 {i: 0, j: 1, xWinRewards: 12096, oWinRewards: 668, drawRewards: 603} // => 5.4 {i: 2, j: 1, xWinRewards: 12502, oWinRewards: 664, drawRewards: 595} // => 5.2 {i: 1, j: 2, xWinRewards: 11850, oWinRewards: 592, drawRewards: 582} // => 4.9 {i: 1, j: 0, xWinRewards: 11994, oWinRewards: 580, drawRewards: 589} // => 4.8

My previous article and other sources like Wikipedias tic-tac-toe article and Martin Gardners book entitled Hexaflexagons and Other Mathematical Diversions agree that the best counter to an opening move in the centre is a move to a corner. The results above agree because all four corners have higher rewards (again shown on the right side) than the edges.

**3) Does it know how to win on the next move when it nearly has three in a row?**Yes, it knows how to win instantly, for example with the pattern "OX-|-X-|O--", which looks as follows, where the winning move is the middle of the bottom row (2,1):

j=0 | j=1 | j=2 | |

i=0 | O | X | |

i=1 | X | ||

i=2 | O |

The rewards for that pattern are:

{i: 2, j: 1, xWinRewards: 7104, oWinRewards: 0, drawRewards: 0} // => 100.0 {i: 1, j: 0, xWinRewards: 224, oWinRewards: 0, drawRewards: 2} // => 99.2 {i: 1, j: 2, xWinRewards: 16, oWinRewards: 8, drawRewards: 0} // => 66.3 {i: 0, j: 2, xWinRewards: 0, oWinRewards: 48, drawRewards: 0} // => -1.0 {i: 2, j: 2, xWinRewards: 0, oWinRewards: 56, drawRewards: 0} // => -1.0So indeed, it knows the reward is highest for (2,1) and will move there to win. It also knows that it doesn't need to block the opponent's imminent win on the left edge (1, 0).

**4) Does it know how to block the opponent from winning on their next turn?**Consider the pattern "XO-|-O-|--X" which looks like this:

j=0 | j=1 | j=2 | |

i=0 | X | O | |

i=1 | O | ||

i=2 | X |

It's Xs turn and the algorithm would need to block the bottom middle (2, 1) in order to stop O winning on their next move. Here are the rewards for that pattern:

1: {i: 2, j: 1, xWinRewards: 0, oWinRewards: 4, drawRewards: 4} // => 4.5 0: {i: 0, j: 2, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0 2: {i: 2, j: 0, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0 3: {i: 1, j: 2, xWinRewards: 0, oWinRewards: 16, drawRewards: 0} // => -1.0 4: {i: 1, j: 0, xWinRewards: 0, oWinRewards: 32, drawRewards: 0} // => -1.0

The algorithm knows that doing a move at (2,1) is best and avoids the loss. It also knows that X has never won from here and can only draw by stopping the win for O.

**5) Does it know about the strategy on Wikipedias tic-tac-toe page which shows how O can best counter opening moves by X?**Yes! The algorithm is clever! But not clever enough to read Wikipedia 😀

It's simply learned through playing, that these moves give the highest rewards. Here is what Wikipedia says:

Player O must always respond to a corner opening with a center mark,...Try it out on the interactive board at the top of this article! The algorithm does this in three out of the four corner openings, strangely not when starting in top right. I guess it needs a little more practice on that opening move.

...and to a center opening with a corner mark.Yes, the algorithm does this.

An edge opening must be answered either with a center mark, a corner mark next to the X, or an edge mark opposite the X.For the bottom and right edges it responds by taking the centre cell. For the left edge it responds by taking the bottom left corner, i.e. next to starting position. For the top, it fails - I guess it needs some more practise here too.

**Since we are pretty much able to answer "yes" to these five questions, I claim that it has indeed learned how to play tic-tac-toe.**If you like that and think it's impressive, read on, because it gets better.If you look at the interactive board at the top of this article, you will see a line which says "Unique games seen: 4330". In my previous article I confirmed that there are 255,168 unique games, of which only 29,600 are not classed as "stupid" because of missed wins or blocks. So if there are so many games, why has the algorithm only encountered less than 2% of them and why is it still able to play so well and why has it learned as shown above?

We can get a list of all patterns where the algorithm hasn't experienced every possible next move, by opening the Javascript console for the interactive board (right click on the board and select "inspect", in Chrome), and running the following snippet:

Object.keys(model.patterns).forEach(function(k){ let len = k.split('').filter(function(l){ return l === 'X' || l === O;}).length; if(model.patterns[k] && model.patterns[k].possibleNextMoves && (model.patterns[k].possibleNextMoves.length + len !== 9)) { console.log(k); } });2927 patterns have missing solutions, i.e. the large majority of them. Let's randomly take one and have a look, e.g. "X--|OXO|---":

j=0 | j=1 | j=2 | |

i=0 | X | ||

i=1 | O | X | O |

i=2 |

The algorithm has only experienced one move from there, although there are actually 5.

{i: 0, j: 1, xWinRewards: 8, oWinRewards: 0, drawRewards: 0}The move to (0, 1) is actually a "stupid" move, because it misses the immediate win which it would get if it moved to (2, 2). But if we take a step back, Os last move was also "stupid" as it should have blocked that potential win. If we look at the rewards for the two patterns where O makes its second move to take the left or right edge, we see the following.

Pattern "X--|OX-|---":

{i: 2, j: 2, xWinRewards: 98, oWinRewards: 12, drawRewards: 21} // => 10.7 {i: 2, j: 0, xWinRewards: 16, oWinRewards: 0, drawRewards: 0} // => 0.0 {i: 1, j: 2, xWinRewards: 8, oWinRewards: 0, drawRewards: 0} // => 0.0Pattern "X--|-XO|---":

{i: 2, j: 2, xWinRewards: 164, oWinRewards: 148, drawRewards: 4} // => 46.5The algorithm has learned to block the potential win by X, and so it will never ever go down the path which resulted in the above game again. Moreover, if we try and recreate this pattern using the interactive board at the top of this article, we find that we can't because the algorithm has learned much better moves. If the algorithm starts, taking the centre, and a human follows by taking the left edge, the algorithm doesn't take the top left corner, because by taking the top edge it can force either an immediate win or a fork and so it wins in all cases. Equally, if a human starts as X and takes the top left corner, the algorithm will take the centre because it is a better defence, and so the board layout above will never happen. Finally, if a human starts in the centre, the algorithm will take the top right corner rather than an edge, because taking the corner is a better defence to an opening in the centre than an edge is. The algorithm only ever came across the pattern above during the exploration phase of its learning cycle. It has found the strongest moves starting from the opening and as such it doesn't need to bother learning about games where it ends up in a weaker position. Humans do the same if we think about chess - games where we move say the king with our second move don't make sense so we don't keep them in our toolbox of best moves.

**I personally find it quite impressive how the algorithm has homed in on optimal games and as such reduced its need to know every possible game**. Originally I questioned whether or not this type of machine learning was "brute force" because it needs to play so many games before becoming strong. I assumed it needed to play many many unique games, and that after "learning" it was just going to use its massive memory to work out what to do. But it isn't actually playing that many games, it just takes time to work out which of the 4,330 games that it knows about are the best ones. If it needed to encounter all 29,600 games which don't involve stupid moves, or indeed more than that number, then I think one could claim that the algorithm uses brute force in order to become strong. But because that isn't the case, I am confident in saying that this algorithm isn't using brute force to learn.**Having done the analysis above, I am a lot more impressed by the algorithm than I was the first time it beat me, which was quite an impressive moment in itself.**So if I am so impressed, am I scared of AI? Not really. It took me a long time to tune this algorithm so that it played reasonably and there are many dimensions which can be tuned:

- Exploration: how long should it explore for and when should it no longer expore? If it explores too long, it just learns to play randomly. If it doesn't explore enough, my teenage son can easily beat it when he uses some unexpected moves.
- Rewards: the question is really "what is the best move", and defining what "best" even means can be tricky - see my previous article. In the case of this algorithm, the rewards are two dimensional, because not only do they depend on whether the game is won/drawn/lost, but also how quickly, so that it learns to not miss immediate wins.
- Experience: how many games does it need to play until it becomes good? I wasn't able to answer "yes" to my five questions until it had played 100,000 games.

Just a couple of closing remarks now:

- The algorithm presented here learns by playing against itself. I did play around with training it against more sophisticated players to start with, and it seemed to improve the rate at which it learned. But I find it quite cool that I was able to remove that and get it to learn against itself. It's effectively taking the algorithm with no knowledge, throwing it into a dark room and saying "hey, go play, and I'll tell you the result of each game, you don't need to know the rules, you can learn them along the way". It learns the rules and the strategies to become a strong player!
- The algorithm can be classed as reinforcement learning. But what about it's subclass, for example is it a q-learning algorithm? Not quite.
- The title of this article could have been "Using *Deep* Reinforcement Learning To Learn To Play Tic-Tac-Toe", which would have required using a neural network to recognise patterns, rather than just matching known board patterns exactly. I played with this a little but couldn't train a network which worked at all. I believe that the problem is that unlike matching letters or images, similar looking patterns in tic-tac-toe or other games like chess, can actually have exact opposite outcomes, one leading to a win and the other leading to a loss. Imagine a game of chess where a king is moved by just one position so that the opponent could put it into checkmate - the patterns would be very similar, the outcomes very different. This is a different problem to classifying hand written letters. But I am really a novice when it comes to neural networks, so it's best if I don't comment further.

What are the next steps? I have spent a lot of time on this little project. I've learned a lot. One of my original thoughts was that a true test would be to introduce a new game, say Three Men's Morris and see if the same algorithm could learn to play that too. One day... but first it's time for something different...

Finally a few links that I used along the way:

- https://puzzling.stackexchange.com/questions/30/what-is-the-optimal-first-move-in-tic-tac-toe
- https://medium.com/machine-learning-for-humans/reinforcement-learning-6eacf258b265
- http://karpathy.github.io/2016/05/31/rl/
- https://www.oreilly.com/ideas/reinforcement-learning-explained
- https://datascience.stackexchange.com/questions/28915/is-this-a-q-learning-algorithm-or-just-brute-force
- https://www.cse.unsw.edu.au/~cs9417ml/RL1/introduction.html
- https://medium.com/@shiyan/get-a-taste-of-reinforcement-learning-implement-a-tic-tac-toe-agent-deda5617b2e4
- https://deeplearning4j.org/deepreinforcementlearning

# The best opening move in a game of tic-tac-toe

As part of a machine learning project, I had to understand tic-tac-toe better, and so I have written an algorithm
which a) finds all the possible unique games and b) gathers statistical information about those games.

Based on Wikipedia's tic-tac-toe article, consider a board with the nine positions numbered as follows:

Assume X always starts. As an example, take the game where X moves top left, followed by O moving top right, then X going middle left followed by O going top middle. These first four moves can be written down as "1342". The game could continue and once completed could be written as "134258769". It's not a perfect game because the first player misses a few opportunities to win and in the end it's a draw.

Every possible combination of moves making up unique games of tic-tac-toe are hence found somewhere between the numbers 123456789 and 999999999 (although probably iterating up to 987654321 suffices). Most of the numbers are illegitimate because each cell is only allowed to be filled once, so for example the number 222222222 does not represent a valid combination. In order to find every valid combination we simply start with that lowest number and iterate up to nine nines, attempting to determine if each number is a valid combination and if it is, record the results of the game.

In order to determine if a combination is valid, we use the following Javascript snippet, which checks that the nine digit number does not contains zeros and contains all of the digits 1 to 9:

A game is analysed using the following algorithm. The empty board is built by creating a two dimensional array of objects using indexes

The full algorithm is available on GitHub and you can run it live in your browser by clicking on this link: stats.html, which takes a few minutes to run.

That page records the outcome of the game from the perspective of the first move. "cornerXWins5" shows how many times X won games that were only 5 moves long by starting in a corner. As part of updating statistics, the results are grouped into games which are:

First of all, notice how there are 255,168 unique games (shown near the bottom of the page after computation is complete). This correlates with other results.

The following table is copied from that page and consolidates these raw results to show how often the starting player can win when opening in the relevant location.

For example, the first player wins in 14652 of the 255168 unique games if they start in the top left. This number comes from adding the number of corner wins after 5, 7 and 9 moves (X doesn't win after 6 or 8 moves, rather O does), for each group of game (0+992+1344 from "stats", 0+1984+0 from "fork stats" and 720+19776+33792 from "stupid stats"), and importantly dividing by four, because there are four corners - they are shared out over the four corners of the table above. Results for the edges are also divided by four, but since there is only one centre cell the result is left undivided.

This table demonstrates that the best location to start is in fact the centre of the board, because starting there wins the most amount of games. This is confirmed by Azahar Machwes blog article Tic Tac Toe: Statistical Analysis with arbitrary Grid Size (albeit for different reasons than he gives) which states:

For example, the first player loses in 10176 of the 255168 unique games when they start in the middle of the top row. This also demonstrates that the opening move for the first player, which leads to the least number of games lost, is also the centre. Finally we can consider how many games are drawn when the first player opens in the relevant cell:

The first player draws least when opening in the centre.

But because we are considering every possible game, this statement does depend upon the players being random in their moves and not necessarily moving so that they win as soon as possible or block an opponent from winning on their next turn. For that reason, the results have been grouped so that we can also analyse what happens when a more clever player plays. Let us disregard all "stupid" games where a move to win instantly is missed or a move to block the opponent winning next is missed. We are now considering a more "perfect player". They are only "more" perfect, rather than exactly "perfect", because the algorithm used to ignore stupid games works as follows. First it checks to see if the player is missing a chance to win immediately. If they are, they count as stupid. Second, it checks to see if the opponent could win in their next move, and if they can, and the current player does not block them, they count as stupid too. Now consider the game "1234567", as it has some special properties. After "12345" the board looks like this:

O, the second player, needs to move to position 7 or 9 in order to attempt to block the fork created when X moved to position 5. Of course any move which O makes is futile, and as such, the move to 6 which O is about to make does not count as stupid, because there is nothing they can do at this stage to avoid losing. Perhaps Os move to position 4 should have counted as stupid, because it didn't block X from getting a fork in their next move. The algorithm doesn't go that far, and as such this game is marked as being non-stupid and won with a fork by X in 7 moves. Since it isn't as good as it could be, I describe the players as more perfect rather than perfect.

So, when considering these more perfect players, the number of games won by the first player moving to the relevant cell then becomes:

The centre no longer leads to the most number of games won, rather starting on an edge improves the chances of a victory. The number of games lost by the first player moving to the relevant cell is:

The centre is still the starting position which results in the least number of losses for the first player. Although the edge has become the starting position which results in the most number of wins (see above), it is also the starting position resulting in the most number of games lost. Draw statistics for these more perfect players look like this:

Starting in the centre still results in the least number of draws. The results of the last three tables are a little conflicting - is it best to start in the centre and avoid a draw or start on an edge which leads to most victories but also to most losses? One way of answering this question is to consolidate the three tables by combining the results using a "reward function" - a formula to sum the individual results which can be weighted if required.

Imagine writing an artificial intelligence algorithm which had to determine the best cell to move to. It might be trained to favour a win over a draw and a draw over a loss. The reward function could be:

The result in this case is that a first move to the centre is still the best. Even if we change the weightings, it isn't possible to tune the reward function so that the corner becomes the best opening move.

Maybe it depends on how "best" is defined. The popular webcomic xkcd posted a complete map of optimal tic-tac-toe moves, perhaps in 2010, which is apparently based on the work titled "Fractal Images of Formal Systems", by Paul St. Denis and Patrick Grim, which shows a more complete map, because it does not show just the top left corner where X starts in the top left. For some reason, only the top left corner of their image was reproduced by xkcd. A few people have asked why this is on the xkcd forum and the best explanation I could find is from someone using the name Algrokoz, who states:

Martin Gardner, an American popular mathematics and science writer wrote the following in the 1988 edition of his book Hexaflexagons and Other Mathematical Diversions

His logic however does seem to be similar to that of the xkcd comment. Interestingly, if we consider what happens when X starts in the centre and O takes a corner, there are also 24 forks by which X can win the game, and this too is the best counter move from the point of view of forks.

If we consider only games won by a fork (see the "fork stats" section on the page which calculates them), then we find that starting in a corner results in 496 (1984/4) forks, compared to only 320 when starting in the centre. Finally we have a situation where it is indeed better to start in the corner.

But is it correct to just consider games that end in a fork? I don't believe it is possible to force a player into a situation where they will be confronted with a fork. Think about the game presented earlier, "1234567", where a fork could have been avoided by O moving to the centre (5) instead of 4. If a player is clever enough to avoid a fork then the game will certainly end in a draw. And so we are talking about players who know how to avoid losing from the start. There is no "best" place to go in order to avoid a draw, when perfect players are playing - Wikipedia and my own analysis shows that when perfect players play, the only possible result is a draw. An optimal opening position can surely only exist when playing a player who might make a mistake or a move which is less than ideal. In that case it makes no sense to analyse only games which end in a fork and we end up having to also consider games that end otherwise. As we have already seen, a corner opening is not the best.

Let's return quickly to the argument about the number of places the second player can move to. Maybe Gardners argument is the same, that in total, because there are four corners and just one centre, there are more moves available to the opponent to avoid a fork, when opening in the centre. But in the same quote he has reduced the number of opening moves to just three, because of board symmetries, so such a conclusion wouldn't be logical.

The problem with Gardners statement (and the graphic published by xkcd) is that the Wikipedia tic-tac-toe page uses them to make the following claim:

Why is all of this so important? I've been working on training a machine learning algorithm to play tic-tac-toe and I was surprised how it learned to open in the centre rather than a corner. I assumed Wikipedia was correct... As a result of this study I shall now endevour to get the Wikipedia page corrected, before a generation of children wrongly learn that opening in a corner is the best move!

Copyright ©2018, Ant Kutschera

Based on Wikipedia's tic-tac-toe article, consider a board with the nine positions numbered as follows:

j=0 | j=1 | j=2 | |

i=0 | 1 | 2 | 3 |

i=1 | 4 | 5 | 6 |

i=2 | 7 | 8 | 9 |

Assume X always starts. As an example, take the game where X moves top left, followed by O moving top right, then X going middle left followed by O going top middle. These first four moves can be written down as "1342". The game could continue and once completed could be written as "134258769". It's not a perfect game because the first player misses a few opportunities to win and in the end it's a draw.

Every possible combination of moves making up unique games of tic-tac-toe are hence found somewhere between the numbers 123456789 and 999999999 (although probably iterating up to 987654321 suffices). Most of the numbers are illegitimate because each cell is only allowed to be filled once, so for example the number 222222222 does not represent a valid combination. In order to find every valid combination we simply start with that lowest number and iterate up to nine nines, attempting to determine if each number is a valid combination and if it is, record the results of the game.

In order to determine if a combination is valid, we use the following Javascript snippet, which checks that the nine digit number does not contains zeros and contains all of the digits 1 to 9:

//gameAsString contains the current number, e.g. '123456798' let isValidGame = true; for(idx = 1; idx < 10; idx++) { if(gameAsString.indexOf('' + idx) === -1 //only valid if every digit from 1-9 is found || gameAsString.charAt(idx-1) === '0') //only 1-9 are valid { isValidGame = false; break; } }If a number does represent a valid combination, the next step is to iterate through each digit starting at the left, to work out when the game is finished (it can be won after just 5 moves, or it can require all 9 moves, ending in a win or a draw). For example numbers starting with '12437' represent combinations completed after just 5 moves. Although there are 9999 such valid combinations, the game is only recorded once and all other combinations of numbers which start with 12437 are discarded.

A game is analysed using the following algorithm. The empty board is built by creating a two dimensional array of objects using indexes

`i`

and `j`

as shown above.
An element in an array of cells represents a cell on the board and has attributes 'i', 'j' and 'v' which records
the contents of the cell, either null, 'X' or 'O', depending on whether a player has moved in that cell.
let board = buildEmtpyBoard(); let moves = ''; for(idx = 0; idx < gameAsString.length; idx++){ let c = gameAsString[idx]; moves += c; //initially e.g. '1', then '12', etc. let coords = convertIndexToCoordinates(parseInt(c)); //turns e.g. '3' into [0,2] if(board[coords[0]][coords[1]].v) throw new Error("cell " + i + "," + j + " is already selected by " + board[coords[0]][coords[1]].v); let player = idx % 2 === 0 ? X : O; let rival = player === X ? O : X; ... //move board[coords[0]][coords[1]].v = player; let winner = checkFinished(board); if(winner){ let o = uniqueGames[moves]; if(!o){ ... //it doesnt exist. update stats updateStats(gameAsString, winner, idx, relevantStats); uniqueGames[moves] = {}; //record this game so it isn't harvested again } break; } }The function

`checkFinished(board)`

simply checks whether or not the game is over (a draw, or a player has three in a row),
and returns 'DRAW', 'X' or 'O' depending upon the outcome of the game.
'uniqueGames' is an object used to track whether or not the current game has already been seen. Using the example above
the combination '12437' will be seen 9999 times, but to all intents and purposes it is just one game of tic-tac-toe
so we discard all subsequent sightings of it.
If the game is finished and has not yet been seen, the statistics are updated.
The full algorithm is available on GitHub and you can run it live in your browser by clicking on this link: stats.html, which takes a few minutes to run.

That page records the outcome of the game from the perspective of the first move. "cornerXWins5" shows how many times X won games that were only 5 moves long by starting in a corner. As part of updating statistics, the results are grouped into games which are:

- "stupid" - the player has missed winning in the current move or missed stopping the opponent from winning in the next move; grouped under "stupid stats"
- won with a "fork" - the winning player had more than one position to move on the board in order to win, when they won; grouped under "fork stats"
- all other games; grouped under "stats"

First of all, notice how there are 255,168 unique games (shown near the bottom of the page after computation is complete). This correlates with other results.

The following table is copied from that page and consolidates these raw results to show how often the starting player can win when opening in the relevant location.

14652 | 14232 | 14652 |

14232 | 15648 | 14232 |

14652 | 14232 | 14652 |

For example, the first player wins in 14652 of the 255168 unique games if they start in the top left. This number comes from adding the number of corner wins after 5, 7 and 9 moves (X doesn't win after 6 or 8 moves, rather O does), for each group of game (0+992+1344 from "stats", 0+1984+0 from "fork stats" and 720+19776+33792 from "stupid stats"), and importantly dividing by four, because there are four corners - they are shared out over the four corners of the table above. Results for the edges are also divided by four, but since there is only one centre cell the result is left undivided.

This table demonstrates that the best location to start is in fact the centre of the board, because starting there wins the most amount of games. This is confirmed by Azahar Machwes blog article Tic Tac Toe: Statistical Analysis with arbitrary Grid Size (albeit for different reasons than he gives) which states:

For the first movers the best starting square by far is 5 (right in the middle and opens up 4 winning lines – most of any starting square, all other squares open up to 3!). Second best option then becomes the corners (1, 3, 7, 9).A similar result is shown when considering how often the first player loses after they start in one of the following locations (this is done by consolidating results of games that end after 6 or 8 moves, i.e. when the second player wins, beating the first player):

7896 | 10176 | 7896 |

10176 | 5616 | 10176 |

7896 | 10176 | 7896 |

For example, the first player loses in 10176 of the 255168 unique games when they start in the middle of the top row. This also demonstrates that the opening move for the first player, which leads to the least number of games lost, is also the centre. Finally we can consider how many games are drawn when the first player opens in the relevant cell:

5184 | 5184 | 5184 |

5184 | 4608 | 5184 |

5184 | 5184 | 5184 |

The first player draws least when opening in the centre.

**As such, the centre is the best position for the first player to start, as it leads to a) the most number of games won and b) the least number of draws and c) the least number of games lost.**But because we are considering every possible game, this statement does depend upon the players being random in their moves and not necessarily moving so that they win as soon as possible or block an opponent from winning on their next turn. For that reason, the results have been grouped so that we can also analyse what happens when a more clever player plays. Let us disregard all "stupid" games where a move to win instantly is missed or a move to block the opponent winning next is missed. We are now considering a more "perfect player". They are only "more" perfect, rather than exactly "perfect", because the algorithm used to ignore stupid games works as follows. First it checks to see if the player is missing a chance to win immediately. If they are, they count as stupid. Second, it checks to see if the opponent could win in their next move, and if they can, and the current player does not block them, they count as stupid too. Now consider the game "1234567", as it has some special properties. After "12345" the board looks like this:

X | O | X |

O | X | |

O, the second player, needs to move to position 7 or 9 in order to attempt to block the fork created when X moved to position 5. Of course any move which O makes is futile, and as such, the move to 6 which O is about to make does not count as stupid, because there is nothing they can do at this stage to avoid losing. Perhaps Os move to position 4 should have counted as stupid, because it didn't block X from getting a fork in their next move. The algorithm doesn't go that far, and as such this game is marked as being non-stupid and won with a fork by X in 7 moves. Since it isn't as good as it could be, I describe the players as more perfect rather than perfect.

So, when considering these more perfect players, the number of games won by the first player moving to the relevant cell then becomes:

1080 | 1240 | 1080 |

1240 | 480 | 1240 |

1080 | 1240 | 1080 |

The centre no longer leads to the most number of games won, rather starting on an edge improves the chances of a victory. The number of games lost by the first player moving to the relevant cell is:

304 | 608 | 304 |

608 | 128 | 608 |

304 | 608 | 304 |

The centre is still the starting position which results in the least number of losses for the first player. Although the edge has become the starting position which results in the most number of wins (see above), it is also the starting position resulting in the most number of games lost. Draw statistics for these more perfect players look like this:

1668 | 2144 | 1668 |

2144 | 816 | 2144 |

1668 | 2144 | 1668 |

Starting in the centre still results in the least number of draws. The results of the last three tables are a little conflicting - is it best to start in the centre and avoid a draw or start on an edge which leads to most victories but also to most losses? One way of answering this question is to consolidate the three tables by combining the results using a "reward function" - a formula to sum the individual results which can be weighted if required.

Imagine writing an artificial intelligence algorithm which had to determine the best cell to move to. It might be trained to favour a win over a draw and a draw over a loss. The reward function could be:

reward = (100*oddsOfWinning) + (10*oddsOfDrawing) + (-1*oddsOfLosing)The table of rewards for perfect play, using the above weightings, is:

1,509,144 | 1,464,864 | 1,509,144 |

1,464,864 | 1,605,264 | 1,464,864 |

1,509,144 | 1,464,864 | 1,509,144 |

The result in this case is that a first move to the centre is still the best. Even if we change the weightings, it isn't possible to tune the reward function so that the corner becomes the best opening move.

Maybe it depends on how "best" is defined. The popular webcomic xkcd posted a complete map of optimal tic-tac-toe moves, perhaps in 2010, which is apparently based on the work titled "Fractal Images of Formal Systems", by Paul St. Denis and Patrick Grim, which shows a more complete map, because it does not show just the top left corner where X starts in the top left. For some reason, only the top left corner of their image was reproduced by xkcd. A few people have asked why this is on the xkcd forum and the best explanation I could find is from someone using the name Algrokoz, who states:

Opening in the corner is optimal because of the number of chances it gives O to not play optimally, not because of total potential win scenarios.That means a corner opening gives the opponent less places to go in order to play optimally, whatever "optimally" means. Assuming that the best counter to an opening in a corner is the centre, and assuming that the best counter to an opening in the centre is a corner, then yes, one could argue that there are four corners and just one centre, so it is better to start in a corner, because a random opponent has a one in eight chance of finding the best counter move. Compare this to opening in the centre where a random opponent has a four in eight chance of finding the best counter move. We will examine this logic in a bit...

Martin Gardner, an American popular mathematics and science writer wrote the following in the 1988 edition of his book Hexaflexagons and Other Mathematical Diversions

If played "rationally" by both sides, the game must end in a draw. The only chance of winning is to catch an unwary opponent in a "trap" where a row can be scored on the next move in two ways, only one of which can be blocked. Of the three possible opening plays- a corner, the center or a side box- the strongest opening is the corner, because the opponent can avoid being trapped at the next move only by one of the eight possible choices: the center. Conversely, center opening traps can be blocked only by seizing a corner.The "traps" which he is talking about are what Wikipedia and I refer to as forks - plays which result in a winning opportunity in more than one cell, forcing the opponent to lose. In his statement Gardner is saying that the only place the opponent can avoid a fork, after an opening in a corner, is the centre. If we analyse just games that end in a fork, and search for what happens after two moves we find that when X starts in a corner and O takes the centre, there are still 24 games that X can win with a fork, so I am not sure that Gardners statement is correct. However of all the other second moves, countering with the centre results in the least amount of possible forks later in the game (this is easy to calculate by grouping game results by their two opening moves, see the "second move analysis" shown on the statistics page after the computation completes).

His logic however does seem to be similar to that of the xkcd comment. Interestingly, if we consider what happens when X starts in the centre and O takes a corner, there are also 24 forks by which X can win the game, and this too is the best counter move from the point of view of forks.

If we consider only games won by a fork (see the "fork stats" section on the page which calculates them), then we find that starting in a corner results in 496 (1984/4) forks, compared to only 320 when starting in the centre. Finally we have a situation where it is indeed better to start in the corner.

But is it correct to just consider games that end in a fork? I don't believe it is possible to force a player into a situation where they will be confronted with a fork. Think about the game presented earlier, "1234567", where a fork could have been avoided by O moving to the centre (5) instead of 4. If a player is clever enough to avoid a fork then the game will certainly end in a draw. And so we are talking about players who know how to avoid losing from the start. There is no "best" place to go in order to avoid a draw, when perfect players are playing - Wikipedia and my own analysis shows that when perfect players play, the only possible result is a draw. An optimal opening position can surely only exist when playing a player who might make a mistake or a move which is less than ideal. In that case it makes no sense to analyse only games which end in a fork and we end up having to also consider games that end otherwise. As we have already seen, a corner opening is not the best.

Let's return quickly to the argument about the number of places the second player can move to. Maybe Gardners argument is the same, that in total, because there are four corners and just one centre, there are more moves available to the opponent to avoid a fork, when opening in the centre. But in the same quote he has reduced the number of opening moves to just three, because of board symmetries, so such a conclusion wouldn't be logical.

The problem with Gardners statement (and the graphic published by xkcd) is that the Wikipedia tic-tac-toe page uses them to make the following claim:

The first player, who shall be designated "X", has 3 possible positions to mark during the first turn... Player X can win or force a draw from any of these starting marks; however, playing the corner gives the opponent the smallest choice of squares which must be played to avoid losing[16]. This makes the corner the best opening move for X, when the opponent is not a perfect player.Reference 16 is the Gardner book that I reference above. The last sentence in the Wikipedia quote is problematic because of the phrase

*"when the opponent is not a perfect player"*. As argued above, if players are not perfect, we need to consider at least games ending in a fork and otherwise, but probably also some games which involve stupid moves. As shown in this article, an opening in the centre is better than one in a corner.Why is all of this so important? I've been working on training a machine learning algorithm to play tic-tac-toe and I was surprised how it learned to open in the centre rather than a corner. I assumed Wikipedia was correct... As a result of this study I shall now endevour to get the Wikipedia page corrected, before a generation of children wrongly learn that opening in a corner is the best move!

Copyright ©2018, Ant Kutschera