Words as Vectors

The vector space model is well known in information retrieval, where each document is represented as a vector. The vector components represent the weight, or importance, of each word in the document. The similarity between two documents is then computed using the cosine similarity measure.
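As a minimal sketch of the document-vector idea, the snippet below builds term vectors from raw word counts and compares two short documents with cosine similarity. Using raw counts as the weights is an assumption made here for brevity (other weighting schemes, such as TF-IDF, are common), and the helper names `term_vector` and `cosine_similarity` are illustrative.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a document to a sparse vector of raw term counts (assumed weighting)."""
    return Counter(text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)

doc_a = term_vector("the dog saw a cat")
doc_b = term_vector("the dog chased the cat")
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))  # 4 / sqrt(35), about 0.6761
```

The two documents share "the", "dog" and "cat", so the similarity is well above zero but below one.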

Although the idea of representing words as vectors has been around for some time, interest in word embedding, the family of techniques that map words to vectors, has soared recently. One driver has been Tomáš Mikolov’s Word2vec algorithm, which uses a large amount of text to create high-dimensional (50 to 300 dimensions) representations of words that capture relationships between words without external annotations. Such representations seem to capture many linguistic regularities. For example, the vector operation vec(‘Paris’) – vec(‘France’) + vec(‘Italy’) yields a vector that approximates vec(‘Rome’).
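The arithmetic behind the analogy can be illustrated with a toy example. The vectors below are hand-made 2-D values chosen so that the analogy works exactly; they are not learned embeddings. The function `analogy` is a hypothetical helper, and it uses squared Euclidean distance for simplicity, whereas word embedding tools commonly rank candidates by cosine similarity.

```python
# Hand-made 2-D toy vectors, NOT learned embeddings (an assumption for illustration).
# Axis 0 loosely encodes "which country", axis 1 encodes "is a capital city".
toy_vectors = {
    "France": [1.0, 0.0], "Paris": [1.0, 1.0],
    "Italy": [2.0, 0.0], "Rome": [2.0, 1.0],
    "Germany": [3.0, 0.0], "Berlin": [3.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

    def squared_distance(word):
        return sum((q - w) ** 2 for q, w in zip(query, vectors[word]))

    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)

result = analogy("Paris", "France", "Italy", toy_vectors)
print(result)  # Rome
```

With real embeddings the result of the vector operation only approximates the target vector, which is why a nearest-neighbour search is needed.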

Word2vec uses a fully connected neural network with a single hidden layer, as shown below. The neurons in the hidden layer are all linear neurons. The input layer has as many neurons as there are words in the training vocabulary. The hidden layer size is set to the dimensionality of the resulting word vectors. The output layer is the same size as the input layer. Thus, if the vocabulary for learning word vectors consists of V words and N is the dimension of the word vectors, the input-to-hidden-layer connections can be represented by a matrix WI of size V×N, with each row representing a vocabulary word. In the same way, the connections from the hidden layer to the output layer can be described by a matrix WO of size N×V. In this case, each column of WO represents a word from the given vocabulary. The input to the network is encoded using the “1-out-of-V” representation, meaning that only one input line is set to one and the rest are set to zero.

[Figure: Word2vec network with V input neurons, N hidden neurons and V output neurons]

To get a better handle on how Word2vec works, consider the training corpus having the following sentences:

“the dog saw a cat”, “the dog chased the cat”, “the cat climbed a tree”

The corpus vocabulary has eight words. Once ordered alphabetically, each word can be referenced by its index. For this example, our neural network will have eight input neurons and eight output neurons. Let us assume that we decide to use three neurons in the hidden layer. This means that WI and WO will be 8×3 and 3×8 matrices, respectively. Before training begins, these matrices are initialized to small random values, as is usual in neural network training. For the sake of illustration, let us assume WI and WO are initialized to the following values:
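The vocabulary and the matrix sizes for this corpus can be checked with a short sketch. The variable names are illustrative, and the indices here are 0-based, whereas the text counts positions from one.

```python
corpus = ["the dog saw a cat", "the dog chased the cat", "the cat climbed a tree"]

# Collect the distinct words and order them alphabetically.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}  # 0-based indices

V = len(vocabulary)  # vocabulary size: width of the input and output layers
N = 3                # hidden layer size: dimensionality of the word vectors

print(vocabulary)
print("WI shape:", (V, N), "WO shape:", (N, V))
```

The sorted vocabulary is a, cat, chased, climbed, dog, saw, the, tree, so "cat" is the second word and "climbed" the fourth.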

WI = [Figure: initial values of the 8×3 matrix WI]

WO = [Figure: initial values of the 3×8 matrix WO]

Suppose we want the network to learn the relationship between the words “cat” and “climbed”. That is, the network should show a high probability for “climbed” when “cat” is presented at the input. In word embedding terminology, the word “cat” is referred to as the context word and the word “climbed” is referred to as the target word. In this case, the input vector X will be [0 1 0 0 0 0 0 0]t. Notice that only the second component of the vector is 1. This is because the input word is “cat”, which holds the second position in the sorted list of corpus words. Given that the target word is “climbed”, the target vector will be [0 0 0 1 0 0 0 0]t.
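The 1-out-of-V encoding of the two words can be sketched as follows; `one_hot` is an illustrative helper name, not part of any library.

```python
vocabulary = ["a", "cat", "chased", "climbed", "dog", "saw", "the", "tree"]

def one_hot(word, vocabulary):
    """1-out-of-V encoding: 1 at the word's position, 0 everywhere else."""
    return [1 if entry == word else 0 for entry in vocabulary]

x = one_hot("cat", vocabulary)           # context (input) word
target = one_hot("climbed", vocabulary)  # target (output) word
print(x)       # [0, 1, 0, 0, 0, 0, 0, 0]
print(target)  # [0, 0, 0, 1, 0, 0, 0, 0]
```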

With the input vector representing “cat”, the output at the hidden layer neurons can be computed as

Ht = XtWI = [-0.490796 -0.229903 0.065460]

It should not surprise us that the vector H of hidden neuron outputs is identical to the second row of the WI matrix, because of the 1-out-of-V representation. So the function of the input-to-hidden-layer connections is basically to copy the input word's vector to the hidden layer. Carrying out similar manipulations for the hidden-to-output layer, the activation vector for the output layer neurons can be written as

HtWO = [0.100934  -0.309331  -0.122361  -0.151399   0.143463  -0.051262  -0.079686   0.112928]
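The two matrix products can be sketched in plain Python. Only the “cat” row of WI is quoted in the text, so the remaining weights below are random placeholders (an assumption); the output activations printed by this sketch will therefore differ from the numbers above. The helper `vec_mat` is an illustrative name.

```python
import random

V, N = 8, 3
CAT = 1  # 0-based index of "cat" in the alphabetically sorted vocabulary

rng = random.Random(0)
# Placeholder weights: the article's full matrices are not reproduced here.
WI = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
WO = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
WI[CAT] = [-0.490796, -0.229903, 0.065460]  # the one row quoted in the text

def vec_mat(vector, matrix):
    """Row vector times matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

x = [1 if i == CAT else 0 for i in range(V)]  # one-hot input for "cat"
h = vec_mat(x, WI)            # hidden layer output, Ht = Xt WI
activations = vec_mat(h, WO)  # output layer activations, Ht WO

print(h)                 # the same numbers as WI[CAT]
print(len(activations))  # 8
```

Because the input is one-hot, the first product never needs to be computed in practice: looking up the row `WI[CAT]` gives the same result.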

Since the goal is to produce probabilities for the words in the output layer, Pr(wordk|wordcontext) for k = 1, ..., V, that reflect their next-word relationship with the context word at the input, we need the neuron outputs in the output layer to sum to one. Word2vec achieves this by converting the activation values of the output layer neurons to probabilities using the softmax function. Thus, the output of the k-th neuron is computed by the following expression, where activation(n) represents the activation value of the n-th output layer neuron:

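$$y_k = \Pr(\text{word}_k \mid \text{word}_{\text{context}}) = \frac{\exp(\text{activation}(k))}{\sum_{n=1}^{V} \exp(\text{activation}(n))}$$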

Thus, the probabilities for eight words in the corpus are:

0.143073   0.094925   0.114441   0.111166   0.149289   0.122874   0.119431   0.144800
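These probabilities can be reproduced from the activation vector with a few lines of Python. Subtracting the largest activation before exponentiating is a common numerical-stability habit added here; it does not change the result.

```python
import math

def softmax(activations):
    """Exponentiate each activation and normalize so the outputs sum to one."""
    peak = max(activations)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

activations = [0.100934, -0.309331, -0.122361, -0.151399,
               0.143463, -0.051262, -0.079686, 0.112928]
probabilities = softmax(activations)
print([round(p, 6) for p in probabilities])
```

Up to rounding in the last digit, the printed values agree with the eight probabilities listed above. Note that dividing the raw activations by their sum, without the exponential, does not give probabilities, since the activations can be negative.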

The fourth probability, 0.111166, is for the chosen target word “climbed”. Given the target vector [0 0 0 1 0 0 0 0]t, the error vector for the output layer is easily computed by subtracting the probability vector from the target vector. Once the error is known, the weights in the matrices WO and WI can be updated using backpropagation. Thus, training can proceed by presenting different context-target word pairs from the corpus. In essence, this is how Word2vec learns relationships between words and, in the process, develops vector representations for the words in the corpus.
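The error vector and one weight update can be sketched as follows. The post does not spell out the loss function, learning rate or update formulas, so this sketch assumes the standard cross-entropy loss with softmax outputs and plain gradient descent with a learning rate of 0.1. Apart from the quoted “cat” row of WI, the weights are random placeholders, and `forward` and `sgd_step` are illustrative names.

```python
import math
import random

def softmax(activations):
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def forward(WI, WO, word):
    """Hidden vector and output probabilities for a one-hot input word."""
    h = WI[word][:]  # the one-hot input selects one row of WI
    a = [sum(h[i] * WO[i][k] for i in range(len(h))) for k in range(len(WO[0]))]
    return h, softmax(a)

def sgd_step(WI, WO, context, target, lr=0.1):
    """One update for a (context, target) pair; returns the error vector."""
    h, p = forward(WI, WO, context)
    error = [(1.0 if k == target else 0.0) - p[k] for k in range(len(p))]
    # Error propagated back to the hidden layer, using WO before it changes.
    hidden_error = [sum(WO[i][k] * error[k] for k in range(len(error)))
                    for i in range(len(h))]
    for i in range(len(h)):
        for k in range(len(error)):
            WO[i][k] += lr * h[i] * error[k]    # hidden-to-output weights
        WI[context][i] += lr * hidden_error[i]  # only the input word's row moves
    return error

# Error vector for the article's numbers: target minus probabilities.
article_probs = [0.143073, 0.094925, 0.114441, 0.111166,
                 0.149289, 0.122874, 0.119431, 0.144800]
article_error = [t - p for t, p in zip([0, 0, 0, 1, 0, 0, 0, 0], article_probs)]

V, N, CAT, CLIMBED = 8, 3, 1, 3
rng = random.Random(0)
WI = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
WO = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
WI[CAT] = [-0.490796, -0.229903, 0.065460]  # the row quoted in the text

_, before = forward(WI, WO, CAT)
error = sgd_step(WI, WO, CAT, CLIMBED)
_, after = forward(WI, WO, CAT)
print(before[CLIMBED], after[CLIMBED])  # the second value is larger
```

After the step, the probability assigned to “climbed” given “cat” goes up slightly. The actual Word2vec tool avoids the full softmax during training, relying on techniques such as hierarchical softmax or negative sampling instead.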

Continuous Bag of Words (CBOW) Learning

The description and architecture above are meant for learning relationships between pairs of words. In the continuous bag of words model, the context for a given target word is represented by multiple words. For example, we could use “cat” and “tree” as context words for “climbed” as the target word. This calls for a modification to the neural network architecture. The modification, shown below, consists of replicating the input-to-hidden-layer connections C times, where C is the number of context words, and adding a divide-by-C operation in the hidden layer neurons. [An alert reader pointed out that the figure below might lead some readers to think that CBOW learning uses several input matrices. It does not. It is the same matrix, WI, that receives multiple input vectors representing different context words.]

[Figure: CBOW network with C context-word inputs, all connected to the hidden layer through the same matrix WI]

With the above configuration for C context words, each coded using the 1-out-of-V representation, the hidden layer output is the average of the word vectors corresponding to the context words at the input. The output layer remains the same, and training is done in the manner discussed above.
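The averaging step can be sketched as below, using “cat” and “tree” as the context words. The weights are random placeholders (an assumption), and `cbow_hidden` is an illustrative name. A single WI matrix is shared by all context words.

```python
import random

V, N = 8, 3
CAT, TREE = 1, 7  # 0-based indices in the sorted vocabulary

rng = random.Random(0)
# Placeholder weights; one shared WI serves every context word.
WI = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

def cbow_hidden(WI, context_indices):
    """Hidden layer output: the average of the context words' rows of WI."""
    C = len(context_indices)
    return [sum(WI[w][j] for w in context_indices) / C
            for j in range(len(WI[0]))]

h = cbow_hidden(WI, [CAT, TREE])
print(h)
```

With a single context word the function reduces to the row lookup of the simple word-pair model.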

Skip-Gram Model

The skip-gram model reverses the use of target and context words. In this case, the target word is fed at the input, the hidden layer remains the same, and the output layer of the neural network is replicated multiple times to accommodate the chosen number of context words. Taking the example of “cat” and “tree” as context words and “climbed” as the target word, the input vector in the skip-gram model would be [0 0 0 1 0 0 0 0]t, while the two output layers would have [0 1 0 0 0 0 0 0]t and [0 0 0 0 0 0 0 1]t as target vectors, respectively. Instead of producing one vector of probabilities, two such vectors would be produced for the current example. The error vector for each output layer is produced in the manner discussed above. However, the error vectors from all output layers are summed to adjust the weights via backpropagation. This ensures that the weight matrix WO for each output layer remains identical throughout training.
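The summed error for the skip-gram example can be sketched as follows. The weights are random placeholders (an assumption). Since every output layer shares the same WO, each one produces the same probability vector, and only the target vectors differ.

```python
import math
import random

def softmax(activations):
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

V, N = 8, 3
CAT, CLIMBED, TREE = 1, 3, 7  # 0-based indices in the sorted vocabulary

rng = random.Random(0)
WI = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
WO = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

h = WI[CLIMBED]  # the input word "climbed" selects its row of WI
activations = [sum(h[i] * WO[i][k] for i in range(N)) for k in range(V)]
p = softmax(activations)  # shared by every output layer

# One error vector per context word, summed before the weights are adjusted.
summed_error = [0.0] * V
for context in (CAT, TREE):
    for k in range(V):
        target_k = 1.0 if k == context else 0.0
        summed_error[k] += target_k - p[k]
print(summed_error)
```

The entries for “cat” and “tree” come out as 1 minus twice their probability, while every other entry is minus twice its probability.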

Above, I have tried to present a simplified view of Word2vec. In practice, many other details are important for completing training in a reasonable amount of time. At this point, one may ask the following questions:

1. Are there other methods for generating vector representations of words? The answer is yes, and I will describe another method in my next post.

2. What are some of the uses/advantages of words as vectors? Again, I plan to answer this in upcoming posts.

36 thoughts on “Words as Vectors”

  1. Hi, first of all, I would like to apologize you for disturbing on this. I am very new to machine learning and NLP field and I found that your example here are very useful for my practice as I can keep tracking my computation to see if I go correctly or not for a computing step. If possible, and not troubling you much, could you explain further about how can we calculate the backpropagation of Softmax function after we get an error? Again, I’m sorry for troubling you and would be very grateful if you could help me understand this til the end. Thank you.
    (So my code that following your calculation was stop at Error = Y – probability which is equal to [-0.14307333, -0.0949255, -0.11444132, 0.8888341, -0.14928925, -0.12287422, -0.11943087, -0.14479961])


    1. The backpropagation algorithm is well known and is part of most machine learning packages. Please check any machine learning book for the details. Let me know if you still have difficulty. Thanks for visiting my blog.


  2. I want to implement word2vec for the Vietnamese language, but I’m confused about the pre-trained vectors. For English I used Google News-vectors-negative300.bin.gz (about 3.4GB) as pre-trained vectors and it works well. For Vietnamese, should I create the pre-trained vectors myself?
    How can I make pre-trained vectors like Google News-vectors-negative300.bin.gz? I tried converting Google News-vectors-negative300.bin to text format, and the result looks like this:

    3000000 300
    0.001129 -0.000896 0.000319 0.001534 0.001106 -0.001404 -0.000031 -0.000420 -0.000576 0.001076 -0.001022 -0.000618 -0.000755 0.001404 -0.001640 -0.000633 0.001633 -0.001007 -0.001266 0.000652 -0.000416 -0.001076 0.001526 -0.000275 0.000140 0.001572 0.001358 -0.000832 -0.001404 0.001579 0.000254 -0.000732 -0.000105 -0.001167 0.001579

    how to change a letter or word into the form above ??


  3. Thanks for this great explanation. I’ve been really looking for something like that, i’m not ANN expert but this post made me understand at least!! 🙂


  4. Hello. I have used CBOW(n-gram) model with scikit-learn package for classify Bangla Lanuage data. But i need implementation of skip gram on my thesis project which scikit-learn don’t have. After searching google and different blog i come to know that Word2Vec has implementation of skip gram. That’s where i am confused . Do word2vec use skip gram model or CBOW model? if use skip gram can you provide me any resource ??? it will be a great help for me. Thanks in advance


  5. Hi, may I know how did u calculate to get these values please?

    I understand this part:
    Ht = XtWI = [-0.490796 -0.229903 0.065460]
    HtWO = [0.100934 -0.309331 -0.122361 -0.151399 0.143463 -0.051262 -0.079686 0.112928]

    but from here how do i derived to here:

    Thus, the probabilities for eight words in the corpus are:

    0.143073 0.094925 0.114441 0.111166 0.149289 0.122874 0.119431 0.144800

    Please help me. Thanks


    1. The probabilities are calculated by taking the ratio of the exponentiated output of each output node to the sum of the exponentiated outputs of all output nodes. This is shown in the post, just above the probability values, via the softmax function.


  6. I tried to calculate the result softmax and got a different result.
    where is the mistake?

    0.100934 + -0.309331 + -0.122361 + -0.151399 + 0.143463 + -0.051262 + -0.079686 + 0.112928 = -0.356714

    0.100934 / -0.356714 = -0.282954972
    -0.309331 / -0.356714 = 0.867168095
    -0.122361 / -0.356714 = 0.343022702
    -0.151399 / -0.356714 = 0.424426852
    0.143463 / -0.356714 = -0.402179337
    -0.051262 / -0.356714 = 0.143706162
    -0.079686 / -0.356714 = 0.223389046
    0.112928 / -0.356714 = -0.316578548



    1. You are not performing exponentiation. That is why you are getting different numbers, some positive and some negative. With exponentiation, as given in the formula in the post, all numbers will be positive and between 0 and 1.


  7. Hi Krishan,
    very nice explanation of word2vec procedure 🙂 but still I cannot understand what is the final word representation vector? is it the vector of probabilities over dictionary? if not, than how one can extract the final result?
    Thanks for answering 🙂


    1. The final word representation vectors are read from WI matrix at the end of the training. Each row gives a word vector. Thus, the i-th row will provide vector representation for the i-th word in the dictionary.


      1. But for this case you will get 3 values(probabilities) for each word. Do we need to add them ? If yes, then value in front of Climbed word should be high. is it ?


  8. Hi Krishan,

    Very intuitive explanation. Thanks.

    I have watched Stanford DP for NLP (http://web.stanford.edu/class/cs224n/syllabus.html) course, which explain the word2vec by modeling the probability of observing a word knowing the context
    $$p(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W} \exp(u_w^T v_c)}$$
    Here it stresses we should use 2 vectors to represent a word. I don’t know why we need to use this model until I see your post. The 2 vectors are actually hidden units and weight vectors from hidden unit to output layer.



  9. Very nice post!
    “That is, the network should show a high probability for “climbed” when “cat” is inputted to the network. ” this should be from the corpus, in another words, from training data. So it should be used to calculate the error? however i didnt see you use this to calculate the error. the whole procedure above should be the same if we dunt have the corpus but just a vocabulary with the eight words?

    thx and look fwd ur reply.


  10. It is being used to calculate the error. Please take a look at “The probability in bold is for the chosen target word “climbed”. Given the target vector [0 0 0 1 0 0 0 0 ]t, the error vector for the output layer is easily computed by subtracting the probability vector from the target vector. Once the error is known, the weights in the matrices WO and WI can be updated using backpropagation.” The only thing is that I haven’t detailed this part of the process.


  11. Dear Sir,

    I am Sivashankari. I am able to understand theory concepts. But I am unable to do the evaluation of my understanding level with any concrete example. Is there any solved example with the dataset( 10 sentences).


  12. Good article.
    I’d like to point out one thing about the figure in CBOW section which has possibility to mislead readers.
    You indicated many WI matric in the figure per one context word. The figure seems to describe there are different WI matric to be trained for each input word.
    However, WI is the final target matrix (word embeddings) which we ultimately want to get.
    Strictly speaking it would be ONE WI matrix which is connected to many context word input vectors and many hidden layer neurons for better understanding.
    Please let me know if you have different opinions.


    1. Thanks for liking the article. All WI matrices are identical. There is no separate index to different matrices. Also, the article states “The modification, shown below, consists of replicating the input to hidden layer connections C times, the number of context words, and adding a divide by C operation in the hidden layer neurons.”, which implies identical WI matrices.


      1. I am not talking about your knowledge or how correctly you understand the model. (Please Read the comment once again.) I am talking about the figure itself. As mentioned in the comment, the figure could mislead readers since you *implied identical WI matrices* in the article.
        I thought that this blog is for education/sharing purpose. If so, the article should provide knowledge as clear as possible for others. In terms of that, this article has a room for improvement. However, from your reply you are just saying that you know what I know. It’s your choice whether you improve the quality of article. Thanks.


  13. Hallo Krishan,
    i did not understand how you get the the activation vector for output layer neurons
    HtWO = [0.100934 -0.309331 -0.122361 -0.151399 0.143463 -0.051262 -0.079686 0.112928]
    Can you please explain this step?


    1. The hidden layer output (H) is calculated earlier. The weight matrix WO is known/defined earlier. So the output is simply the product of the transpose of the hidden layer output and WO.

