Codenames Helper

In the game Codenames, you need to find a word that is associated with one or more given words. Here, you can input one to six words and press "Get Words", and ten candidate connecting words will appear (it should take a few seconds). In my experience, usually one or two are actually helpful.

The program works by representing each word's meaning as a vector, and finding the words whose vectors maximize the product of cosine similarities with each of your input words' vectors (see the further explanation below). The output words are in decreasing order of relevance.

Explanation:

This project uses word vectors, which are a bunch of numbers that are supposed to represent the meaning of a word.

How does that work in practice? For example, letting k be the vector for the word 'king', m the vector for 'man', w for 'woman', and q for 'queen', we might have:

k - m + w ≈ q.

Also, the Euclidean distance between k and q should be much less than between k and w, and the cosine similarity (see below) between k and q should be much greater than between k and w.
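To make this concrete, here is a minimal sketch of these similarity measures in Python with NumPy (the `vecs` dictionary in the comments is hypothetical, standing in for loaded GloVe vectors):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vectors' magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, assuming `vecs` maps words to their vectors:
# k, m, w, q = vecs["king"], vecs["man"], vecs["woman"], vecs["queen"]
# np.linalg.norm((k - m + w) - q)   # small distance: the analogy roughly holds
# cosine_similarity(k, q)           # should be much greater than...
# cosine_similarity(k, w)
```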

I'm not going to go into the details of how these vectors are found here. I'll just say that nowadays, word vectors tend to be derived from the first layer of large language models, and before that there were two popular (related) statistical word vector models, word2vec and GloVe.

I use pre-trained GloVe vectors here. Specifically, I'm starting with 50-dimensional vectors trained on a 6 billion-word dataset with a vocabulary of ~400,000 (from the official website). I then take the 40,000 most common words, removing non-words like "42" or "," as well as some foreign words and proper nouns that can't be used as clues in Codenames. I did have to manually add some multi-word phrases (e.g. "ice cream", "new york") that appear in Codenames but not in the GloVe vocabulary, which only contains single words. I estimated their vectors with some combination of the vectors of the individual words: for example, the vector for "ice cream" is the average of the vectors for "ice" and "cream".
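Roughly, the vocabulary construction looks like the sketch below. This is illustrative, not my exact code: it relies on the official GloVe files listing words from most to least frequent, and the `isalpha` filter is only a stand-in for the manual curation described above.

```python
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe text file: each line is a word followed by its vector."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

vecs = load_glove("glove.6B.50d.txt")  # file from the official GloVe site

# The file is ordered by corpus frequency, so keep the first 40,000 entries
# that look like real words (this skips tokens like "42" and ",").
vocab = {}
for word, vec in vecs.items():
    if len(vocab) >= 40_000:
        break
    if word.isalpha():
        vocab[word] = vec

# Multi-word Codenames terms aren't in GloVe; estimate them by averaging.
vocab["ice cream"] = (vecs["ice"] + vecs["cream"]) / 2
vocab["new york"] = (vecs["new"] + vecs["york"]) / 2
```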

As described in the GloVe paper (on page 8), I'm calculating similarity scores between two words by taking the cosine similarity between their word vectors (= dot product divided by the product of the vectors' magnitudes). Cosine similarity is the right measure here because the raw vectors vary quite a bit in magnitude. As a shortcut, I first normalize the vectors so that their magnitudes are 1, which means that the cosine similarity is just the dot product.
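Continuing the sketch above, the normalization step might look like this (note that axis=1 normalizes each word's vector, not each feature; see the update at the bottom of this page):

```python
import numpy as np

words = list(vocab)                      # vocab from the sketch above
M = np.stack([vocab[w] for w in words])  # one row per word

# Normalize each row to unit length; with unit vectors, the cosine
# similarity between word i and word j is just the dot product M[i] @ M[j].
M = M / np.linalg.norm(M, axis=1, keepdims=True)
```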

To find the best suggestions for multiple input words, I take the cosine similarity of each word in the vocabulary with each of the input words. Then I take the geometric mean of these cosine similarities, and call that the word's relevance score. The geometric mean penalizes words that are very close to one of the input words but far from the others. The algorithm then just returns the 10 words with the highest relevance scores, excluding the input words themselves.
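A sketch of that scoring step, building on the code above (the clip is my own guard against non-positive similarities, which would break the geometric mean; it's an assumption, not necessarily how the real implementation handles that case):

```python
import numpy as np

def suggest(inputs: list[str], k: int = 10) -> list[str]:
    """Rank vocabulary words by the geometric mean of their cosine
    similarities to the input words (M holds unit-length row vectors)."""
    idx = [words.index(w) for w in inputs]
    sims = M @ M[idx].T                   # shape: (vocab size, number of inputs)
    sims = np.clip(sims, 1e-9, None)      # assumption: floor non-positive sims
    scores = sims.prod(axis=1) ** (1.0 / len(inputs))
    best = np.argsort(scores)[::-1]
    return [words[i] for i in best if words[i] not in inputs][:k]
```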

I use one more optimization. In the base Codenames game, there is a limited vocabulary of 400 words, so I can pre-compute the top 1000 or 3000 closest words to each of these, along with their cosine similarities. Then, when two real Codenames words are given as input, I can just find the intersection of those two sets of similar words and order the intersection by the geometric mean of the similarity scores. This is much faster than calculating cosine similarities with all 40,000 words in the GloVe vocabulary, but in practice both approaches seem fast enough that most of the lag is due to the network delay in querying my server, where the computation is performed.
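The precomputation and lookup might look like this (again a sketch building on the code above; `codenames_words`, standing for the 400-word list, is a hypothetical name):

```python
TOP_N = 1000  # or 3000

# For each official Codenames word, store its TOP_N nearest neighbors.
nearest = {}
for w in codenames_words:
    sims = M @ M[words.index(w)]
    top = np.argsort(sims)[::-1][1:TOP_N + 1]   # index 0 is the word itself
    nearest[w] = {words[i]: float(sims[i]) for i in top}

def suggest_fast(a: str, b: str, k: int = 10) -> list[str]:
    """Intersect the two neighbor sets; rank by geometric mean of sims."""
    common = nearest[a].keys() & nearest[b].keys()
    ranked = sorted(common,
                    key=lambda w: (nearest[a][w] * nearest[b][w]) ** 0.5,
                    reverse=True)
    return ranked[:k]
```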

In the future, it would not be difficult to incorporate avoiding the opposing team's words and the assassin word, but the program as-is struggles with more than two words, so I don't think that would be very useful to implement. It would also be very interesting to input a set of 8 or 9 words (as at the beginning of a game) and find the partition into sets of 2, 3, or 4 that maximizes the relatedness within each set; a brute-force sketch of that idea follows.
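Purely hypothetically, that could look something like this, reusing M and words from the sketches above (`team_words` stands for the 8 or 9 input words, and the scoring here is an arbitrary choice, not something I've tested):

```python
from itertools import combinations

def partitions(items, sizes=(2, 3, 4)):
    """Yield every partition of `items` into parts of the allowed sizes."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for size in sizes:
        for others in combinations(rest, size - 1):
            part = (first,) + others
            remaining = [x for x in rest if x not in others]
            for tail in partitions(remaining, sizes):
                yield [part] + tail

def set_score(part):
    """Average pairwise cosine similarity within one set of words."""
    pairs = list(combinations(part, 2))
    return sum(M[words.index(a)] @ M[words.index(b)] for a, b in pairs) / len(pairs)

best = max(partitions(team_words), key=lambda p: sum(map(set_score, p)))
```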

In the meantime, enjoy! As I'm writing this, I am realizing that many people on the Internet have had the same idea, but it was still cool to put together something that is personally useful.

Update 4/26: This page works properly now. Previously, I was using PyScript to load a larger vocabulary of ~73,300 GloVe vectors into the user's browser and then run the similarity algorithm there, which took a couple hundred megabytes of the user's RAM and was rather slow. I've now offloaded the computation onto a Hugging Face Space, which I query through a proxy hosted on Netlify in order to more securely store my Hugging Face private token. I was also previously normalizing the vectors along the wrong axis: I was normalizing each feature instead of making the length of each vector 1. This came from my misreading the GloVe paper and made the results somewhat less useful.
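For concreteness, the difference between the two normalizations in NumPy (the matrix here is just a random placeholder):

```python
import numpy as np

M = np.random.rand(40_000, 50)  # placeholder word-vector matrix, one row per word

# The bug: axis=0 normalizes each feature column across all words.
wrong = M / np.linalg.norm(M, axis=0, keepdims=True)

# The fix: axis=1 normalizes each word vector (each row) to unit length.
right = M / np.linalg.norm(M, axis=1, keepdims=True)
```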