SAT2Vec: Word2Vec Versus SAT Analogies
Word embeddings, like Word2Vec and GloVe, have proved to be a powerful way of representing text for machine learning algorithms. The idea behind these methods is relatively simple: words that are close to each other in the training text should be close to each other in the vector space. Of course, you could achieve this by having all the words in the exact same spot, but that wouldn’t form a useful model, so there is a second requirement: words that are not close to each other in the text should not be close to each other in the vector space.
Analogies
This simple algorithm produces some neat features, the coolest of which is that directions in the vector space carry semantic meaning. The canonical example of this is that the analogy King : Man :: Queen : Woman holds true mathematically in the vector space as follows:
King − Man = Queen − Woman
This is more often rewritten as:
King − Man + Woman = Queen
This shows that one of the directions in our model is “gender”! These analogies exist for other concepts as well, for example:
Paris − France + Japan = Tokyo
And (as shown by my friend and colleague Patrick Callier):
Workin − Working + Going = Goin
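If you want to try analogies like these yourself, gensim makes it a one-liner once the vectors are loaded. Here is a minimal sketch, assuming the Google News vectors have been downloaded locally (the filename is illustrative):

```python
from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (the path is illustrative)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "King - Man + Woman = ?": positive terms are added, negative terms subtracted
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The top result should be something like ('queen', 0.71)
```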
Word2Vec is to the SAT as?
So with all these analogies embedded in the model, I started thinking back to when analogies were most prevalent in my life: SAT college entrance exam preparation! How would the model fare if asked to complete a few SAT analogies?
To find out, I grabbed Google’s pretrained Word2Vec model, which was trained on Google News, and then scraped 36 practice SAT analogies with answers from various websites. Once I had the analogies, I calculated the difference between the vectors for each pair of words. For example, for King : Man :: Queen : Woman, I would calculate the vector difference King − Man and also Queen − Woman. I then computed the cosine distance between the vector from the prompt pair of words and the vectors from the potential answer pairs, and ranked the answers from lowest to highest distance. If the model performed well on an analogy, the correct pair would be the lowest distance from the prompt pair; otherwise it would be further down the ranking.
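In code, that ranking step might look roughly like this (a sketch rather than the exact notebook code, reusing the `model` loaded above and an illustrative helper `pair_vector`):

```python
from scipy.spatial.distance import cosine

def pair_vector(a, b):
    # Difference vector representing the relationship in the pair a : b
    return model[a] - model[b]

prompt_vec = pair_vector("authenticity", "counterfeit")
choices = [("reliability", "erratic"), ("mobility", "energetic"),
           ("argument", "contradictory"), ("reserve", "reticent"),
           ("anticipation", "solemn")]

# Rank answer pairs by cosine distance to the prompt pair (lowest distance first)
ranked = sorted(choices, key=lambda pair: cosine(prompt_vec, pair_vector(*pair)))
for a, b in ranked:
    print(f"{a} : {b}  ->  {cosine(prompt_vec, pair_vector(a, b)):.3f}")
```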
You can find the Jupyter Notebook used to run the model here (rendered on GitHub). You will need the analogies data and the pretrained model. The model has been stripped down to contain only the words that appear in the analogies, to save space.
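For reference, paring a set of word vectors down to a small vocabulary can be done along these lines (a sketch assuming gensim 4.x; the word list and output filename are placeholders):

```python
from gensim.models import KeyedVectors

# Keep only words that actually appear in the analogy questions (placeholder subset)
needed = ["authenticity", "counterfeit", "reliability", "erratic"]

small = KeyedVectors(vector_size=model.vector_size)
small.add_vectors(needed, [model[w] for w in needed])
small.save_word2vec_format("sat_word2vec_subset.bin", binary=True)
```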
Results
Here is an example where the model picked the right answer, that is, where the correct answer (in bold) is ranked first:
| authenticity : counterfeit | Distance |
|---|---|
| **reliability : erratic** | 0.758 |
| mobility : energetic | 0.977 |
| argument : contradictory | 0.997 |
| reserve : reticent | 1.009 |
| anticipation : solemn | 1.049 |
Note that reliability : erratic was the word pair with the lowest distance, that is, the model predicted that it was the correct answer. Just as ‘counterfeit’ implies lack of authenticity, so ‘erratic’ implies lack of reliability. The model did in fact succeed in its prediction.
However, the model often failed, as it does for the prompt paltry : significance:
| paltry : significance | Distance |
|---|---|
| austere : landscape | 0.803 |
| redundant : discussion | 0.829 |
| **banal : originality** | 0.861 |
| oblique : familiarity | 0.895 |
| opulent : wealth | 0.984 |
Here the correct answer (in bold) is ranked third. Overall, the model ranked the correct answer first about 20% of the time. The distribution of answers is as follows:

[Figure: distribution of the correct answer’s rank across the 36 analogies]
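The tally behind those numbers might look something like this (a sketch; `questions` is an assumed list of (prompt, choices, correct) tuples, and `pair_vector` is the helper defined above):

```python
from collections import Counter
from scipy.spatial.distance import cosine

def rank_of_correct(prompt, choices, correct):
    # 1-based rank of the correct pair when choices are sorted by distance
    prompt_vec = pair_vector(*prompt)
    ranked = sorted(choices, key=lambda pair: cosine(prompt_vec, pair_vector(*pair)))
    return ranked.index(correct) + 1

ranks = [rank_of_correct(prompt, choices, correct)
         for prompt, choices, correct in questions]
print("Ranked first:", ranks.count(1) / len(ranks))   # about 20% of the time
print("Rank distribution:", Counter(ranks))
```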
So our model isn’t getting into Berkeley anytime soon; maybe it should try applying to Stanford instead? (Go Bears!)
The model’s answers for all 36 analogies can be found here.