Tic Tac Toe

This page is a demo of methods to computationally solve a simple game. Tic-Tac-Toe is a good example because it's easy to see what moves are good or bad and to debug when a strategy is not working. I started with Ultimate Tic-Tac-Toe, but since I'm not very good at that myself, I decided to implement each algorithm with regular tic-tac-toe first to make sure I could get them working correctly.

Note that for the 1-player modes, when you play for the first time the server may have to start up, which takes about 20 seconds. After that startup, the AI opponents should respond within a second.

Also, on this page, X always plays first, and in the 1-player modes each game alternates between you playing as X and the computer playing as X.

There are a few simple opponents to play against, which I created as baselines:

2-Player mode: Play against a friend (or yourself) on the same device.
1P vs Random: Play against an opponent that chooses moves uniformly at random from the available spaces.
1P vs Perfect: The opponent that always makes the optimal move. I hardcoded this myself.
1P vs Beatable: The (hardcoded) opponent that can be beaten with the right strategy when you're X. See if you can figure this one out. Many humans can be beaten with the same strategy, even if they think they know how to play tic-tac-toe well.

There are also a few interesting AI opponents: the minimax algorithm, deep Q-networks and (coming soon) Monte Carlo tree search and policy gradient-based methods like PPO. These are explained in the sections below. Generally, minimax "looks a few moves ahead" and finds the path to the best outcome, whereas the rest are reinforcement learning (RL) methods, which means they have learned by playing games against themselves and other opponents and improving over time.

If you want to try any of them out for yourself, the code is available at https://github.com/SixtusTheSixth/tic-tac-toe.

Minimax is a recursive algorithm for choosing optimal moves in a two-player (zero-sum) game. It relies on the assumption that your opponent will always make the best move available to them, so you should choose the move that is best assuming worst-case opponent behavior.

The game tree

Any game state, here a configuration of the board, can be thought of as a node in a tree. Its children are the positions reachable by each legal move, their children are the positions reachable one move later, and so on down to terminal states, or positions where the game is over. Terminal states have known scores: in this implementation, +10 for a win, -10 for a loss, and -5 for a draw. (Why -5 and not 0 for a draw? A draw is better than a loss but worse than a win, so giving it a negative value causes the algorithm to avoid draws when a win is available. We'll revisit score design more carefully in the RL sections, where it turns out to be more important.)

From non-terminal positions, scores propagate upward by alternating logic: on your turn, you pick the child with the maximum score (you want to win); on the opponent's turn, they pick the child with the minimum score (they want you to lose). The root's score is therefore the outcome of optimal play from both sides.

function minimax(board, is_my_turn):
            if game_over(board):
                return terminal_score(board)   # +10, -10, or -5

            scores = [minimax(make_move(board, mv), not is_my_turn)
                    for mv in legal_moves(board)]

            return max(scores) if is_my_turn else min(scores)

Tic-tac-toe has at most 9 × 8 × 7 × … × 1 = 362,880 possible move sequences, but because the game can end before all squares are filled, the actual tree is considerably smaller, around 255,000 nodes. That's small enough to solve completely from the very first move in less than a second.

Negamax: a cleaner formulation

The two-branch structure above (maximize on my turn, minimize on theirs) is a bit redundant. In a zero-sum game, my score and my opponent's score always sum to zero: score(position, me) = -score(position, opponent). This means minimizing the opponent's score is the same as maximizing the negation of their score. Negamax exploits this: always maximize, but negate the child's returned score before comparing. The two cases collapse into one:

function negamax(board, cur_player):
            if game_over(board):
                return score_from_cur_player_perspective(board)

            return max(
                # negate child score because it's from the other player's perspective
                -negamax(make_move(board, mv), other_player)
                for mv in legal_moves(board)
            )

You can see this directly in the implementation: score = -childScore on every recursive return, and the function always takes the maximum. Beyond being more concise, negamax makes alpha-beta pruning (next section) easier to implement correctly, since there's only one case to handle.

One practical shortcut

Even though tic-tac-toe is small enough to search exhaustively from move one, the implementation uses a cutoff: N_AB = 8, meaning negamax only runs for the last 8 moves of the game. When the board is empty, the algorithm plays a random corner instead. This might look like a lazy optimization, but it mirrors a genuinely important constraint in harder games — chess, Go, and most real-world problems are far too large to search to the end from any non-trivial position. Playing a corner first is also a strong choice in its own right: it creates more forking threats than a center opening and puts more pressure on the opponent.

The 1P vs Perfect mode above uses a hardcoded optimal strategy (not minimax), which you can verify is unbeatable. The 1P vs Minimax mode uses the implementation described here — also unbeatable in practice for tic-tac-toe, but by search rather than by hard-coded rules.

Minimax is correct, but it does redundant work. Consider a position where you've already found a move that guarantees a score of, say, +5. While exploring a different branch, you find that it gives the opponent a response that scores -8 from your perspective. You can stop exploring that branch immediately—you already have something better, so you'd never choose this line regardless of what else it contains. Alpha-beta pruning formalizes this reasoning and applies it at every node in the tree.

Alpha and beta

At any point in the search, we track two bounds:

Alpha (α): the best score the current player can already guarantee from moves examined so far—a floor on what we'll accept.
Beta (β): the best score the opponent can already guarantee—from our perspective, a ceiling on what we can reach given best play by the opponent.

We pass both bounds into each recursive call. If we ever find a move that scores above β, we can prune: the opponent has a better option elsewhere in the tree and would never let this position be reached. Similarly, any move that scores at or below α can't improve on what we've already found and is skipped.

function negamax_ab(board, cur_player, α, β):
            if game_over(board):
                return score_from_cur_player_perspective(board)

            for mv in legal_moves(board):
                score = -negamax_ab(make_move(board, mv), other_player, -β, -α)

                if score >= β: return score   # beta cutoff: prune remaining moves
                α = max(α, score)           # tighten the floor

            return α

Notice that when recursing, the bounds are passed as -β, -α (negated and swapped). This is the negamax convention: since the child's score is negated before returning, the child's floor and ceiling are the negations of the parent's ceiling and floor.

How much does it help?

In the worst case (moves always examined in the worst order), alpha-beta does no better than plain minimax and visits the same O(b^d) nodes, where b is the branching factor and d is the depth. In the best case, when the best move is always examined first, it cuts the effective branching factor from b to roughly √b, which means you can search twice as deep for the same computational cost. In practice, with reasonable move ordering, the improvement is closer to the best case than the worst.

For tic-tac-toe this doesn't matter much, since the tree is small enough either way. But this is the same algorithm that made chess engines tractable long before neural networks entered the picture, and it's still used in combination with learned evaluation functions in modern engines. The version here is already sufficient to make the minimax player unbeatable: try it above.

The ceiling constraint in practice

One subtlety worth noting: the implementation code uses lowerBound and upperBound rather than the conventional α/β naming, initialized to -9 and +9 at the root (just outside the range of achievable scores). A score above the upper bound causes an immediate cutoff and returns the full move sequence accumulated so far; a score below the lower bound is simply skipped. This is standard — the naming is cosmetic.

Coming soon.

0. Demo Explanation

1. Minimax

2. Optimization: Alpha-Beta Pruning

3. RL Basics & Q-Learning

4. Deep Q-Networks