SAT2Vec: Word2Vec Versus SAT Analogies
Word embeddings, like Word2Vec and GloVe, have proved to be a powerful way of representing text for machine learning algorithms. The idea behind these methods is relatively simple: words that are close to each other in the training text should be close to each other in the vector space. Of course, you could achieve this by having all the words in the exact same spot, but that wouldn’t form a useful model, so there is a second requirement: words that are not close to each other in the text should not be close to each other in the vector space.
Analogies
This simple algorithm produces some neat features, the coolest of which is that directions in the vector space carry semantic meaning. The canonical example of this is that the analogy King : Man :: Queen : Woman holds true mathematically in the vector space as follows:
King − Man = Queen − Woman
This is more often rewritten as:
King − Man + Woman = Queen
This shows that one of the directions in our model is “gender”! These analogies exist for other concepts as well, for example:
Paris − France + Japan = Tokyo
And (as shown by my friend and colleague Patrick Callier):
Workin − Working + Going = Goin
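If you want to try analogies like these yourself, gensim makes it a one-liner once the vectors are loaded. Here is a minimal sketch, assuming the Google News vectors have been downloaded locally (the filename is illustrative):

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (the path is illustrative)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "King - Man + Woman = ?": positive terms are added, negative terms subtracted
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The top result should be something like ('queen', 0.71)
```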
Word2Vec is to the SAT as?
So with all these analogies embedded in the model, I started thinking back to when analogies were most prevalent in my life: SAT college entrance exam preparation! How would the model fare if asked to complete a few SAT analogies?
To find out, I grabbed Google’s pretrained Word2Vec model, which was trained on Google News, and then scraped 36 practice SAT analogies with answers from various websites. Once I had the analogies, I calculated the difference between the vectors for each pair of words. For example, for King : Man :: Queen : Woman, I would calculate the vector difference King − Man and also Queen − Woman. I then computed the cosine distance between the vector from the prompt pair of words and the vectors from the potential answer pairs, and ranked the answers from lowest to highest distance. If the model performed well on an analogy, the correct pair would be the lowest distance from the prompt pair; otherwise it would be further down the ranking.
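In code, that ranking step might look roughly like this (a sketch rather than the exact notebook code, reusing the `model` loaded above and an illustrative helper `pair_vector`):

```python
from scipy.spatial.distance import cosine

def pair_vector(a, b):
    # Difference vector representing the relationship in the pair a : b
    return model[a] - model[b]

prompt_vec = pair_vector("authenticity", "counterfeit")
choices = [("reliability", "erratic"), ("mobility", "energetic"),
           ("argument", "contradictory"), ("reserve", "reticent"),
           ("anticipation", "solemn")]

# Rank answer pairs by cosine distance to the prompt pair (lowest distance first)
ranked = sorted(choices, key=lambda pair: cosine(prompt_vec, pair_vector(*pair)))
for a, b in ranked:
    print(f"{a} : {b}  ->  {cosine(prompt_vec, pair_vector(a, b)):.3f}")
```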
You can find the Jupyter Notebook used to run the model here (rendered on GitHub). You will need the analogies data and the pretrained model. The model has been stripped down to contain only the words that appear in the analogies, to save space.
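For reference, paring a set of word vectors down to a small vocabulary can be done along these lines (a sketch assuming gensim 4.x; the word list and output filename are placeholders):

```python
from gensim.models import KeyedVectors

# Keep only words that actually appear in the analogy questions (placeholder subset)
needed = ["authenticity", "counterfeit", "reliability", "erratic"]

small = KeyedVectors(vector_size=model.vector_size)
small.add_vectors(needed, [model[w] for w in needed])
small.save_word2vec_format("sat_word2vec_subset.bin", binary=True)
```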
Results
Here is an example where the model picked the right answer, that is, where the correct answer (in bold) is ranked first:
| authenticity : counterfeit | Distance |
|---|---|
| **reliability : erratic** | 0.758 |
| mobility : energetic | 0.977 |
| argument : contradictory | 0.997 |
| reserve : reticent | 1.009 |
| anticipation : solemn | 1.049 |
Note that reliability : erratic was the word pair with the lowest distance, that is, the model predicted that it was the correct answer. Just as ‘counterfeit’ implies lack of authenticity, so ‘erratic’ implies lack of reliability. The model did in fact succeed in its prediction.
However, the model often failed, as it does for the prompt paltry : significance:
| paltry : significance | Distance |
|---|---|
| austere : landscape | 0.803 |
| redundant : discussion | 0.829 |
| **banal : originality** | 0.861 |
| oblique : familiarity | 0.895 |
| opulent : wealth | 0.984 |
Here the correct answer (in bold) is ranked third. Overall, the model ranked the correct answer first about 20% of the time. The distribution of answers is as follows:

[Figure: distribution of the correct answer’s rank across the 36 analogies]
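The tally behind those numbers might look something like this (a sketch; `questions` is an assumed list of (prompt, choices, correct) tuples, and `pair_vector` is the helper defined above):

```python
from collections import Counter
from scipy.spatial.distance import cosine

def rank_of_correct(prompt, choices, correct):
    # 1-based rank of the correct pair when choices are sorted by distance
    prompt_vec = pair_vector(*prompt)
    ranked = sorted(choices, key=lambda pair: cosine(prompt_vec, pair_vector(*pair)))
    return ranked.index(correct) + 1

ranks = [rank_of_correct(prompt, choices, correct)
         for prompt, choices, correct in questions]
print("Ranked first:", ranks.count(1) / len(ranks))   # about 20% of the time
print("Rank distribution:", Counter(ranks))
```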
So our model isn’t getting into Berkeley anytime soon; maybe it should try applying to Stanford instead? (Go Bears!)
The model’s answers for all 36 analogies can be found here.