Understanding Word Vectors and Implementing Skip-gram with Negative Sampling

1. Introduction: How to represent words

Any machine learning model needs to be fed data: models learn and extract knowledge from the data they are given. Data come in various forms such as images, videos or text, but whatever the type, they all have to be transformed into numerical values so that computers can understand and process them. For example, a grayscale image is seen by a computer as a matrix of pixels, where each pixel is represented by an integer ranging from 0 (black) to 255 (white) depending on its intensity. A color image, on the other hand, stores three values or channels for each pixel, where each channel measures the intensity in one dimension of the color space (red, green and blue), and is hence perceived as a 3D array of integers.

So, how do we enable a computer to apprehend text data? Computers only see numbers, so we must transform text into numbers. We first break the text into smaller units, similar to what we do with pixels for images. A natural choice is to use an integer to represent each word; a dictionary mapping each word to an integer (and vice versa) is then maintained to encode and decode words. However, unlike the pixels in images, which signify intensity in three different color dimensions, an integer can hardly tell us anything about the word it represents, such as its meaning, its senses, or its difference from and similarity to other words. If a single number is not sufficient to express a word, how about using a vector of many numbers, where each number encodes some aspect of the word's meaning? In fact, this is the most prominent way to represent words for meaning-related models today. The vectors used to represent words are hence called word vectors, word embeddings or word representations.

The earliest work using vectors to represent words as points in a high-dimensional space is that of Osgood et al. (1957) on affective meanings or connotations, the aspects of a word's meaning that are related to emotions, sentiments, opinions or evaluations. Osgood et al. defined three dimensions over which a word can be expressed. These dimensions are:

  • valence: the pleasantness of the stimulus
  • arousal: the intensity of emotion provoked by the stimulus
  • dominance: the degree of control exerted by the stimulus

             Valence   Arousal   Dominance
courageous    8.05      5.5       7.38
music         7.67      5.57      6.5
heartbreak    2.45      5.65      3.58
cub           6.71      3.95      4.24
life          6.68      5.59      5.89

(source: Chapter 6: Vector Semantics - Speech and Language Processing. Dan Jurafsky and James H. Martin.)

However, three dimensions are unlikely to convey the many aspects of a word's meaning. So how should we build a computational model that captures different aspects of word meaning and distributes them over as many dimensions as we want? First, we need to define what a word's meaning is. Linguists in the 1950s came up with the distributional hypothesis, which emphasizes the role of context in word meaning.

Words that occur in similar context tend to have similar meanings.

It suggests that we can define a word by the environment or distribution in which it occurs in language use. This hypothesis is the fundamental concept lying at the heart of almost all word representation learning models. In fact, vector models based on the distributional hypothesis are now the standard way to encode the meaning of words. There are two commonly used families of models: count-based methods and Word2vec. I will briefly describe these two approaches in Part 2 and Part 3. The detailed implementation of Skip-gram with Negative Sampling (SGNS), one of the Word2vec algorithms, is presented in Part 4.

Once again, I want to stress the importance of context to a word's meaning.

You shall know a word by the company it keeps. (J. R. Firth 1957)

2. Count-based methods

The models using count-based methods are generally based on a co-occurrence matrix i.e. a matrix that counts how often words appear together.

(1) Word-Document matrix
  • Each row $i$ represents a word in the dictionary and each column $j$ represents a document.
  • Each cell $(i, j)$ counts the number of times the word in row $i$ appears in document $j$.
  • Matrix dimension: $V \times M$, where $V$ is the vocabulary size and $M$ is the number of documents.
  • Word vectors: Row vectors with dimension $M$.
  • The context chosen to define a word: documents.
(2) Window-based co-occurrence matrix
  • Each row $i$ represents a word and each column $j$ also represents a word in the dictionary.
  • Each cell $(i, j)$ counts the number of times a (target) word in row $i$ and a (context) word in column $j$ co-occur in a window of $m$ words around the target word.
  • Matrix dimension: $V \times V$, where $V$ is the vocabulary size.
  • Word vectors: Row or column vectors with dimension $V$.
  • The context chosen to define a word: a window of $2m$ words (or fewer, depending on the word's position) around the target word. A minimal construction sketch follows below.
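
To make this concrete, here is a minimal sketch that builds a window-based co-occurrence matrix with NumPy; the toy corpus, the window size and variable names such as word2id are made up purely for illustration and do not come from the rest of this post.

import numpy as np

# Toy tokenized corpus (illustrative only)
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]

# Vocabulary: word <-> integer index
vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

m = 2                                  # window size (hyperparameter)
F = np.zeros((V, V), dtype=np.int64)   # co-occurrence counts

for sent in corpus:
    for i, target in enumerate(sent):
        # context = up to m words on each side of the target word
        for j in range(max(0, i - m), min(len(sent), i + m + 1)):
            if j != i:
                F[word2id[target], word2id[sent[j]]] += 1

# Row F[word2id["like"]] is the (sparse) count vector representing "like"
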
(3) Term Frequency - Inverse Document Frequency (Tf-idf) weighted vectors

The simple raw frequency used in word-document and word-word matrices has limitations in measuring the association between words when it comes to very frequent words like the, a, an, it, they, etc. Such ubiquitous words do not carry discriminative information and are therefore deemed unimportant. Term frequency - inverse document frequency (tf-idf) is an algorithm that adjusts a word's frequency based on the following two terms:

  • term frequency (tf): the frequency of a word ($t$) in a document ($d$). We use $\log_{10}$ of the count to downweight the raw frequency of a word: for example, a word has $tf=2$ if it appears 10 times and $tf=3$ if it appears 100 times. $$\begin{equation*} tf_{t, d} = \begin{cases} 1 + \log_{10} \text{count}(t, d) & \text{if count}(t, d) > 0\\ 0 & \text{otherwise} \end{cases} \end{equation*}$$

  • inverse document frequency (idf): inverse of the document frequency ($df_t$), which is the number of documents a word $t$ occurs in. It gives higher weights to words that occur only in a few documents since these terms may carry useful information to distinguish the documents from other documents in the collection. $$\begin{equation*} idf_t = \log_{10} \left(\frac{N}{df_t} \right) \end{equation*}$$

The tf-idf weighting of a word $t$ in document $d$, denoted $w_{t, d}$, is the product of the two terms $tf_{t,d}$ and $idf_t$: $$\begin{equation*} w_{t, d} = tf_{t, d} \times idf_t \end{equation*}$$
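
To illustrate the two formulas, the sketch below computes tf-idf weights for a tiny collection of documents with NumPy; the toy documents and the names (docs, word2id) are illustrative only.

import numpy as np

# Toy collection of tokenized documents (illustrative only)
docs = [["sweet", "sorrow", "sweet"],
        ["good", "nurse", "good", "good"],
        ["sweet", "nurse"]]

vocab = sorted({w for d in docs for w in d})
word2id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), len(docs)

# Raw counts: count(t, d)
counts = np.zeros((V, N))
for j, d in enumerate(docs):
    for w in d:
        counts[word2id[w], j] += 1

# tf_{t,d} = 1 + log10(count(t, d)) if the count is positive, else 0
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1 + np.log10(counts[nonzero])

# idf_t = log10(N / df_t), where df_t = number of documents containing t
df = np.count_nonzero(counts, axis=1)
idf = np.log10(N / df)

# w_{t,d} = tf_{t,d} * idf_t
tfidf = tf * idf[:, None]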

(4) Positive Pointwise Mutual Information (PPMI) weighted matrix

PPMI is an alternative weighting function to tf-idf. It is based on PMI, a measure of association used in information theory. The intuition is that the association between two words $w$ and $c$ can be measured by how often they actually co-occur, compared with how often they would be expected to co-occur purely by chance if they were independent. It is common to keep only positive values of PMI, as negative ones are not intuitively interpretable. $$\begin{equation*} PPMI(w,c) = \text{max} \left (\log_2 \frac{P(w,c)}{P(w)P(c)}, 0 \right) \end{equation*}$$

Assume we have a co-occurrence matrix $F$ with target words in rows and context words in columns, and let $f_{ij}$ denote the element at row $i$ and column $j$, i.e. the number of times word $w_i$ occurs in context $c_j$. The PPMI value of word $w_i$ with context $c_j$ is then computed as follows: $$\begin{equation*} P(w,c) = p_{ij} = \frac{f_{ij}}{\sum_{i}\sum_{j} f_{ij}} \end{equation*}$$ $$\begin{align*} P(w) = p_{i*} = \frac{\sum_{j}f_{ij}}{\sum_{i}\sum_{j} f_{ij}} ; \quad P(c) = p_{*j} = \frac{\sum_{i}f_{ij}}{\sum_{i}\sum_{j} f_{ij}} \end{align*}$$ $$\begin{equation*} PPMI_{ij} = \text{max} \left (\log_2 \frac{p_{ij}}{p_{i*}p_{*j}}, 0 \right) \end{equation*}$$
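
Here is a small NumPy sketch of these formulas that turns a raw co-occurrence matrix $F$ (for instance the window-based one built earlier) into a PPMI-weighted matrix; it follows the equations above directly and is not an optimized implementation.

import numpy as np

def ppmi_matrix(F):
    """Turn a raw co-occurrence count matrix F (V x V) into a PPMI matrix."""
    F = F.astype(np.float64)
    total = F.sum()                          # sum_i sum_j f_ij
    p_ij = F / total                         # P(w, c)
    p_i = p_ij.sum(axis=1, keepdims=True)    # P(w): row marginals
    p_j = p_ij.sum(axis=0, keepdims=True)    # P(c): column marginals

    # PMI = log2( p_ij / (p_i * p_j) ); cells with zero counts give -inf or nan
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi = np.nan_to_num(pmi, nan=0.0, neginf=0.0, posinf=0.0)

    return np.maximum(pmi, 0.0)              # keep only positive values

# Example: ppmi = ppmi_matrix(F) for the toy matrix F built above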

3. Word2vec

Word vectors constructed using count-based methods are highly sparse: their dimension is the vocabulary size $V$ and they contain many zero elements, since a word normally co-occurs with only a small fraction of the other words in the dictionary. Although algorithms exist to handle sparse matrices efficiently, the high dimensionality of these word vectors adds complexity to the models that use them as inputs. One may think of performing dimensionality reduction with Singular Value Decomposition (SVD); however, SVD would be very costly (quadratic cost) since the vocabulary size $V$ is large (typically $10^4$ to $10^6$).

Instead of directly computing all co-occurrence counts, iteration-based methods learn to encode the probability of a word given its context, one iteration at a time. The idea is to train a model whose parameters are the word vectors themselves on some loss function; iteratively, we update the parameters to reduce the loss and eventually learn our word representation vectors. Word2vec, proposed by Mikolov et al. (2013), is an efficient method based on this principle for learning word embeddings. Two algorithms are presented in the Word2vec paper: continuous bag-of-words (CBOW) and skip-gram.

(1) Continuous bag-of-words (CBOW)

CBOW tries to predict a center word from its surrounding context. Let $w_i$ denote word $i$ of the vocabulary. The parameters of our model consist of two matrices $\V \in \RR^{n \times V}$ and $\U \in \RR^{V \times n}$, where $n$ is the embedding dimension. $v_i$, the $i$-th column of $\V$, is the input vector of word $w_i$; $u_i$, the $i$-th row of $\U$, is the output vector of word $w_i$.

In the CBOW model, the context vector $\hat{v}$ is obtained by averaging the input vectors of the context words: $\hat{v} = \frac{v_{c-m} + \dots + v_{c-1} + v_{c+1} + \dots + v_{c+m}}{2m}$. We want to maximize the probability of the center word given its context window of size $m$, i.e. $P(u_c|\hat{v})$, which is equivalent to minimizing the negative logarithm of that probability.

The probability of one word given another is computed by applying the softmax to the dot product of their two vectors. The dot product gives a score: the higher the score, the closer the two vectors, and hence the two words, are to each other. The softmax operator turns these scores into a probability distribution over the vocabulary.

The objective function for one center word is as follows: $$\begin{align*} \text{minimize } J &= - \log P(w_c|w_{c-m}, \dots , w_{c-1}, w_{c+1}, \dots, w_{c+m}) \\\\ &= - \log P(u_c|\hat{v}) \\\\ &= - \log \frac{\exp(u_c ^\top \hat{v})}{\sum_{j=1}^{V} \exp(u_j ^\top \hat{v})} \\\\ &= -u_c ^\top \hat{v} + \log \sum_{j=1}^{V} \exp(u_j ^\top \hat{v}) \\\\ \end{align*}$$

We compute the derivatives of the loss function $J$ with respect to the parameters and update $\U$ and $\V$ iteratively; the learned parameters are the word vectors themselves.
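
To make the CBOW forward pass and gradients concrete, here is a minimal NumPy sketch for a single training example; the matrices V_in and U_out play the roles of $\V$ and $\U$ above, and the sizes and word indices are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
V_size, n = 1000, 50                              # vocabulary size and embedding dimension
V_in = rng.normal(scale=0.1, size=(n, V_size))    # input vectors as columns (the matrix V)
U_out = rng.normal(scale=0.1, size=(V_size, n))   # output vectors as rows (the matrix U)

context_ids = [3, 17, 42, 99]                     # indices of the 2m context words
center_id = 7                                     # index of the center word

# Context vector: average of the context words' input vectors
v_hat = V_in[:, context_ids].mean(axis=1)         # shape (n,)

# Softmax over the scores u_j^T v_hat for every word j in the vocabulary
scores = U_out @ v_hat                            # shape (V_size,)
scores -= scores.max()                            # numerical stability
probs = np.exp(scores) / np.exp(scores).sum()

# Loss: negative log-probability of the true center word
loss = -np.log(probs[center_id])

# Gradients of the softmax cross-entropy
d_scores = probs.copy()
d_scores[center_id] -= 1.0
d_U = np.outer(d_scores, v_hat)                   # dJ/dU, shape (V_size, n)
d_v_hat = U_out.T @ d_scores                      # dJ/d v_hat; each context word's input
                                                  # vector receives d_v_hat / len(context_ids)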

(2) Skip-gram

Opposite to CBOW, skip-gram tries to predict the context words from a center word. Using notation similar to CBOW and based on the same principle, the skip-gram objective for one center word is formulated as follows: $$\begin{align*} \text{minimize } J &= - \log P(w_{c-m}, \dots , w_{c-1}, w_{c+1}, \dots, w_{c+m}|w_c) \\\\ &= - \log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j}|w_c) \\\\ &= - \log \prod_{j=0, j \neq m}^{2m} P(u_{c-m+j}|v_c) \\\\ &= - \log \prod_{j=0, j \neq m}^{2m} \frac{\exp (u_{c-m+j}^\top v_c)}{\sum_{k=1}^{V} \exp(u_k ^\top v_c)} \\\\ &= - \sum_{j=0, j \neq m}^{2m} u_{c-m+j}^\top v_c + 2m \log \sum_{k=1}^{V} \exp(u_k ^\top v_c) \end{align*}$$

Similar to CBOW, we update the parameters using stochastic gradient descent to obtain the word vectors.

4. My Implementation of Skip-gram with Negative Sampling (SGNS)

The authors of Word2vec proposed negative sampling, a method to speed up training and improve model performance, in the paper Distributed Representations of Words and Phrases and their Compositionality (Tomas Mikolov et al.). Now, let me walk through my implementation of skip-gram with negative sampling, for which I derived the gradients and built the algorithm from scratch using Numpy. You can refer to my code in the following Github repository for further details.

The main idea of negative sampling is that instead of looping over the entire vocabulary to compute the loss function, we take just a few negative examples drawn from a noise distribution whose probabilities match the word frequencies. We also use the sigmoid function instead of the softmax, and the loss function to optimize changes accordingly. In the next sections I describe the implementation steps, the loss function, the optimization process, the crucial part of the training loop, and how my model performs.

(1) Implementation steps

Below are the steps for implementing SGNS, based on Chapter 6: Vector Semantics of the book Speech and Language Processing by Dan Jurafsky and James H. Martin.

  1. Obtain positive examples from the neighboring context of a target word
  2. Obtain negative examples by randomly sampling words in the lexicon from a unigram distribution (built from word frequencies); a sampling sketch is given after this list
  3. Train the model by optimizing the loss function
  4. Use the regression weights as the embedding vectors
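
As a hedged sketch of steps 1 and 2, the snippet below extracts (target, positive) pairs from a window around each word and draws negative samples from a unigram distribution; all names (word_ids, unigram_probs, K) are illustrative, and the exponent of 0.75 applied to the counts follows the original Word2vec paper rather than a detail stated in this post.

import numpy as np

rng = np.random.default_rng(0)

def positive_pairs(word_ids, m):
    """Yield (target, positive) index pairs from a window of m words on each side."""
    for i, t in enumerate(word_ids):
        for j in range(max(0, i - m), min(len(word_ids), i + m + 1)):
            if j != i:
                yield t, word_ids[j]

def make_unigram_probs(counts, power=0.75):
    """Noise distribution: unigram counts raised to a power and renormalized."""
    probs = np.asarray(counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(unigram_probs, K):
    """Draw K negative word indices from the noise distribution."""
    return rng.choice(len(unigram_probs), size=K, p=unigram_probs)

# Example with made-up data: a 6-word sentence over a 10-word vocabulary
sentence = [0, 3, 7, 2, 3, 9]
counts = [50, 3, 8, 20, 1, 4, 2, 9, 5, 6]
unigram_probs = make_unigram_probs(counts)
for t, p in positive_pairs(sentence, m=2):
    negatives = sample_negatives(unigram_probs, K=5)
    # (t, p) is a positive pair; `negatives` holds K negative samples for it
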
(2) Loss function

Let $V$ denote the total number of words in our vocabulary and $d$ denote the embedding dimension (which is a hyperparameter). The parameters of the model consist of two matrices: $\W\in\RR^{d\times V}$ and $\C\in\RR^{V\times d}$. For any word (index) $t$ let $\w_t,\c_t$ respectively denote the $t^\text{th}$ column of $\W$ and the $t^\text{th}$ row of $\C$. The loss function over the whole training set can be expressed as:

$$\begin{equation} L(\thetab) = \sum_{(t,p) \in +} - \log \frac{1}{1 + \exp(-\w_t^\top \c_p)} + \sum_{(t,n) \in -} - \log \frac{1}{1 + \exp(\w_t^\top \c_n)}, \label{eq:loss} \end{equation}$$ which can be understood as the negative log-likelihood of the data. The signs $+$ and $-$ respectively denote the set of positive and negative training instances.

Let me devote some time to explaining how the loss function in \eqref{eq:loss} is derived. $t$, $c$, $p$ and $n$ denote a target word, a word in the target word's context, a positive sample (a word that belongs to the context) and a negative sample (a word that does not belong to the context), respectively.

Recall that the similarity between two vectors can be measured by their dot product: the higher the dot product, the more similar, or closer, the two vectors are. This score can be turned into a probability using the logistic or sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$. For example, the probability that two words $t$ and $c$ are similar is computed as $\frac{1}{1 + \exp(-\w_t^\top\w_c)}$, where $\w_t$ and $\w_c$ denote the vectors of words $t$ and $c$ respectively. The higher this probability, the more probable it is that word $t$ is similar to word $c$.
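
The training snippet later in this post calls a sigmoid helper; a minimal, numerically stable NumPy version (my own sketch, not necessarily the one used in the repository) could look like this:

import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid 1 / (1 + exp(-x)), applied elementwise."""
    x = np.asarray(x, dtype=np.float64)
    z = np.exp(-np.abs(x))   # exp only sees non-positive arguments, so it cannot overflow
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))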

The goal of our learning algorithm is to:

  • (1) Maximize the similarity between a target word $t$ and positive words $p$ from the real context of $t$, i.e. maximize $P(+ | t,p)$.
  • (2) Minimize the similarity between a target word $t$ and negative words $n$ drawn by random sampling from a pool of negative samples, i.e. minimize $P(- | t,n)$.

These two goals correspond to the two summands in \eqref{eq:loss} respectively. More specifically:

  • For pairs in the positive set $(t, p) \in +$: instead of maximizing $P(+|t,p)$, we minimize $-\log P(+|t,p)$, i.e. minimize $- \log \frac{1}{1 + \exp(-\w_t^\top \c_p)}$.
  • For pairs in the negative set $(t, n) \in -$: minimizing $P(-|t,n)$ is equivalent to maximizing $1 - P(-|t,n) = 1 - \frac{1}{1+\exp(-\w_t^\top \c_n)} = \frac{1}{1 + \exp(\w_t^\top \c_n)}$. Thus instead of minimizing $P(-|t,n)$, we minimize $-\log \frac{1}{1 + \exp(\w_t^\top \c_n)}$.

One strong and important assumption of skip-gram is that all context words are independent, which allows us to multiply the probabilities of the word pairs. Taking logarithms then converts the multiplication into an addition, as above. This yields the two summands over all pairs in the positive set and all pairs in the negative set in \eqref{eq:loss}.

(3) Optimization process

The parameters of the model are $\thetab = (\W, \C)$ and the objective is to minimize $L$ with respect to $\W$ and $\C$. A common solution is to use stochastic gradient descent, where in each iteration we minimize a local loss function with respect to a positive pair $(t,p)$: $$\begin{equation}\label{eq:loss-tp} L_{(t,p)} = -\log \frac{1}{1 + \exp(-\w_t^\top \c_p)} + \sum_{n \in \cN(t,p)} - \log \frac{1}{1 + \exp(\w_t^\top \c_n)}, \end{equation}$$ where $\cN(t,p)$ denotes the set of negative samples with respect to $t$, generated for the positive sample $p$ (for one positive sample, $K$ negative samples are generated). With simple calculus, I obtain the following partial derivatives: $$\begin{align} \frac{\partial{L_{(t,p)}}}{\partial {\w_t}} &= - s_p\c_p + \sum_{n \in \cN(t,p)} s_n\c_n, \\ \frac{\partial{L_{(t,p)}}}{\partial{\c_p}} &= - s_p\w_t, \\ \frac{\partial{L_{(t,p)}}}{\partial{\c_n}} &= s_n\w_t \quad\forall n \in \cN(t,p) \end{align}$$ where I have denoted $$\begin{equation} s_p = \frac{1}{1 + \exp(\w_t^\top \c_p)},\quad s_n = \frac{1}{1 + \exp(-\w_t^\top \c_n)}. \end{equation}$$ Once we have computed the above derivatives, we can apply gradient descent updates as follows: $$\begin{align} \w_t &\gets \w_t - \alpha \frac{\partial{L_{(t,p)}}}{\partial {\w_t}},\label{eq:descent-wt}\\ \c_p &\gets \c_p - \alpha \frac{\partial{L_{(t,p)}}}{\partial{\c_p}},\label{eq:descent-cp}\\ \c_n &\gets \c_n - \alpha \frac{\partial{L_{(t,p)}}}{\partial{\c_n}},\label{eq:descent-cn} \end{align}$$ where $\alpha$ is the step-size (which is another hyperparameter).
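
For completeness, here is one way to obtain the first derivative above (a derivation of my own, using the sigmoid notation $\sigma$ from earlier). Writing the local loss as $L_{(t,p)} = -\log \sigma(\w_t^\top \c_p) - \sum_{n \in \cN(t,p)} \log \sigma(-\w_t^\top \c_n)$ and using the identity $\frac{d}{dx} \log \sigma(x) = 1 - \sigma(x) = \sigma(-x)$, we get $$\begin{align*} \frac{\partial}{\partial \w_t} \left[ -\log \sigma(\w_t^\top \c_p) \right] &= -\big(1 - \sigma(\w_t^\top \c_p)\big)\, \c_p = -\frac{1}{1 + \exp(\w_t^\top \c_p)}\, \c_p = -s_p \c_p, \\ \frac{\partial}{\partial \w_t} \left[ -\log \sigma(-\w_t^\top \c_n) \right] &= \big(1 - \sigma(-\w_t^\top \c_n)\big)\, \c_n = \frac{1}{1 + \exp(-\w_t^\top \c_n)}\, \c_n = s_n \c_n, \end{align*}$$ and summing these terms gives $\frac{\partial L_{(t,p)}}{\partial \w_t} = -s_p\c_p + \sum_{n \in \cN(t,p)} s_n\c_n$. The derivatives with respect to $\c_p$ and $\c_n$ follow the same pattern with $\w_t$ in place of the context vectors.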

(4) Implementation of training loop

An epoch of the training loop can be sketched as follows:

  • For each word $t$:
    • For each positive sample $p$ with respect to $t$:
      • Get the set $\cN(t,p)$ of negative samples (w.r.t. $t$).
      • Compute the partial derivatives $\frac{\partial{L_{(t,p)}}}{\partial {\w_t}}, \frac{\partial{L_{(t,p)}}}{\partial{\c_p}}$ and $\frac{\partial{L_{(t,p)}}}{\partial{\c_n}}$.
      • Update $\w_t, \c_p$ and $\c_n$ using \eqref{eq:descent-wt}, \eqref{eq:descent-cp} and \eqref{eq:descent-cn}.
      • Compute the loss value \eqref{eq:loss-tp} and add to the accumulated loss.

Below is the Python code of the inner loop. In the code, I used the variable $s_{neg}$ to denote a vector of all $s_n$, i.e. $\set{s_n}_{n \in \cN(t,p)}$. For further detail please refer to the Github code.

# context vector of the positive sample and the negative ones
cp = C[p, :]                        # basic indexing: cp is a view, updating it modifies C in place
C_neg = C[negative_samples, :]      # fancy indexing: C_neg is a copy of the selected rows

# intermediate values that are helpful
sp = sigmoid(-np.dot(cp, wt))       # s_p = 1 / (1 + exp(w_t^T c_p))
s_neg = sigmoid(np.dot(C_neg, wt))  # s_n = 1 / (1 + exp(-w_t^T c_n)) for all negatives

# Compute partial derivatives
dwt = -sp*cp + np.dot(s_neg, C_neg)
dcp = -sp*wt
dC_neg = np.outer(s_neg, wt)

# Gradient descent updates
wt -= stepsize*dwt
cp -= stepsize*dcp
C_neg -= stepsize*dC_neg
C[negative_samples, :] = C_neg      # write the updated copy back into C

loss = -np.log(sigmoid(np.dot(cp, wt))) \
      + np.sum(-np.log(sigmoid(-np.dot(C_neg, wt))))
loss_epoch += loss

Another possible implementation can consist of only one loop:

  • For each word $t$:
    • Get the set of $\cP(t)$ of positive samples (w.r.t. $t$).
    • Get the set $\cN(t)$ of negative samples (w.r.t. $t$).
    • Compute partial derivatives, gradient descent updates, and loss computation (as before).

This was indeed my first attempt, but I then realized that it is not memory efficient (although it can be faster).

I obtained the word vectors from the matrix $\W$. We could also get the word embeddings by averaging the two matrices ($\W$ and $\C^\top$) or by concatenating them.
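
Concretely, with the shapes used in this post ($\W \in \RR^{d \times V}$ storing target vectors as columns and $\C \in \RR^{V \times d}$ storing context vectors as rows), the three options could look like the sketch below, where the random matrices merely stand in for trained parameters:

import numpy as np

rng = np.random.default_rng(0)
d, V = 100, 5000                              # embedding dimension and vocabulary size
W = rng.normal(size=(d, V))                   # stands in for the trained target matrix
C = rng.normal(size=(V, d))                   # stands in for the trained context matrix

emb_w = W.T                                   # option 1: target vectors only, shape (V, d)
emb_avg = (W.T + C) / 2.0                     # option 2: average of target and context vectors
emb_cat = np.concatenate([W.T, C], axis=1)    # option 3: concatenation, shape (V, 2d)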

(5) Model performance

My SGNS model was trained on the Billion Word Corpus and achieved a correlation of 0.9 with SimLex-999, a human-annotated word similarity dataset that is a standard resource for evaluating word representation learning models.
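
For reference, evaluation against a similarity dataset such as SimLex-999 is typically done by scoring each word pair with the cosine similarity of its embeddings and computing the rank (Spearman) correlation with the human ratings; the sketch below assumes a word2id mapping, an embedding matrix emb with one row per word, and the dataset's pairs and gold scores already loaded, none of which are specified in this post.

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(pairs, gold_scores, emb, word2id):
    """Spearman correlation between model similarities and human ratings.

    pairs: list of (word1, word2) tuples; gold_scores: matching list of floats.
    Pairs containing out-of-vocabulary words are skipped.
    """
    model_scores, gold = [], []
    for (w1, w2), score in zip(pairs, gold_scores):
        if w1 in word2id and w2 in word2id:
            model_scores.append(cosine(emb[word2id[w1]], emb[word2id[w2]]))
            gold.append(score)
    return spearmanr(model_scores, gold).correlation

# Example: correlation = evaluate_similarity(pairs, gold_scores, emb, word2id)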

References

  1. Efficient estimation of word representations in vector space. Mikolov, Tomas, et al.
  2. Distributed Representations of Words and Phrases and their Compositionality. Mikolov, Tomas, et al.
  3. Speech and Language Processing, (3rd ed. draft). Dan Jurafsky and James H. Martin.
  4. Lecture notes CS224N: Deep Learning for NLP. Francois Chaubard, Rohit Mundra, Richard Socher.