2: Word Embeddings
A large proportion of the efforts of the research and commercial community in NLP has been focused on processing data in the best way possible for downstream applications. The methods that we have covered so far are the result of a continuous evolution in preserving information for machine learning inference. One of the largest breakthroughs in NLP has been the invention of word embeddings.
The basic assumption behind word embeddings is that every word is defined largely by its context. That is, words that mean roughly the same thing appear in similar contexts. Thus, if you take a sufficiently large (and diverse) dataset, it is possible to define each word based on its distance from the others. You can imagine a giant cloud where each individual word is a droplet surrounded by similar words, and where distance reflects similarity (in this context, cosine distance is used). How this works is visualised in Visualisation of word2vec. Note how, if we color the vectors based on their values, similar vectors (like “king” and “queen”) share similar patterns, while something like “water” looks very different.
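To make the notion of distance concrete, here is a minimal sketch of cosine similarity on three made-up 3-dimensional vectors. The numbers are invented for illustration and do not come from a trained model; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for illustration only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "water": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_water = cosine_similarity(toy_vectors["king"], toy_vectors["water"])

# Cosine distance is 1 minus cosine similarity
print("king vs queen:", round(sim_king_queen, 3))
print("king vs water:", round(sim_king_water, 3))
```

Vectors that point in a similar direction score close to 1, while unrelated ones score close to 0.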
At the end of the embedding process every word is represented by an n-dimensional vector (i.e. “coordinates” in the cloud) that can be used as an input to a machine learning model. With some linear algebra we can replace spam and ham SMS messages with their respective embeddings, and feed those to a machine learning classifier, such as a Random Forest (even though deep learning methods are often much more accurate in this context).
For some domain-specific use cases (let’s say you have a very specific dataset, such as in the medical domain, that contains a lot of abbreviations that are specific to that case and are not normally encountered in other corpora) it makes sense to create your own word embeddings. But for most NLP applications it makes sense to take advantage of pre-trained models, for example those computed on Wikipedia or other similar datasets, that are available for free.
Blended learning: TensorFlow Word2vec
TensorFlow is one of the two most popular open-source deep learning libraries, supported by a few companies and originating at Google. Its authors also provide an excellent tutorial on how to use TensorFlow to create your own embeddings. We will cover how to do this in this module, but if you want a deeper understanding it is useful to have a look at that tutorial, available here (45 minutes).
Blended learning: Illustrated Word2vec
Many of the topics covered in data science can be abstract and hard to grasp at first, and some learners find them easier to understand visually. Have a look at an excellent illustrated and detailed guide to word embeddings with word2vec here (1 hour).
Making custom word embeddings with Fasttext
Fasttext is an open-source NLP library from Facebook. Since it is a product of a large company which deals with huge quantities of text data, this software is built for performance, usability and scale.
While Fasttext also contains modules for other NLP tasks, such as classification and language detection, for the purposes of this tutorial we will focus on creating custom word embeddings.
We will also use another open-source library, called gensim, to work with those word embeddings. This library can also be used for other NLP tasks, such as unsupervised learning (e.g. Topic Modeling), but that will be covered in a separate section.
Blended learning: Fasttext paper
In order to gain complete understanding it makes sense to invest time in reading the fundamental original work (there are a multitude of papers on the topic, but just a few of them are very influential and inspire the further research).
The original Fasttext research paper from Facebook is one such paper, and you can read it here (1 hour).
import fasttext
import pandas as pd
from gensim.models import FastText as fText
from gensim.models.fasttext import load_facebook_vectors
As a first step for training our embeddings we should store the original .csv in a format that is suitable for fasttext. This format is a plain .txt file, with one column containing the text (one entry per line), and no headings or index.
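As a rough sketch of the target format, the snippet below writes a few made-up texts to a plain file with one entry per line. The texts, the helper name `to_training_lines` and the temporary path are all hypothetical; the point is only that embedded newlines and missing entries need handling so that each review ends up on exactly one line.

```python
import os
import tempfile

# Hypothetical review texts standing in for the reviewText column
texts = [
    "Great service, the technician arrived on time.",
    "Payment was easy.\nWould use again.",  # contains an embedded newline
    None,                                    # missing review
]

def to_training_lines(entries):
    """Yield one cleaned line per non-empty entry."""
    for entry in entries:
        if entry is None:
            continue
        # Collapse all whitespace (newlines, tabs) into single spaces
        line = " ".join(str(entry).split())
        if line:
            yield line

out_path = os.path.join(tempfile.mkdtemp(), "training_data.txt")
with open(out_path, "w", encoding="utf-8") as handle:
    for line in to_training_lines(texts):
        handle.write(line + "\n")

# Read the file back to check its layout
with open(out_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```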
reviews_data = pd.read_csv("../data/reviews_data.csv")
# Write with the default mode ("w") so that re-running this cell does not append duplicate lines
reviews_data["reviewText"].to_csv("../data/reviews_data_embeddings_training_data.txt", header=False, index=False, sep="\t")
Now that we have the data prepared, we can start to train the model. For this we use the train_unsupervised function.
model = fasttext.train_unsupervised("../data/reviews_data_embeddings_training_data.txt", model="skipgram")
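The skipgram option trains the model to predict the words that surround each word. As a rough illustration of that idea (not of fastText's internals, which also use subword information and negative sampling), the sketch below lists the (centre, context) pairs for a hypothetical sentence and a window of 2. The function name `skipgram_pairs` and the sentence are made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every token within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

# Hypothetical sentence, split on whitespace for simplicity
sentence = "the service guy fixed the appliance".split()
pairs = skipgram_pairs(sentence, window=2)

for centre, context in pairs[:6]:
    print(centre, "->", context)
```

Words that keep appearing as each other's context end up with similar vectors, which is the assumption described at the start of this module.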
Now that we have the model, we should store it for future use.
model.save_model("models/reviews_vec.vec")
There are different options that can be used to load those embeddings, but for our use case we will use the gensim package.
fastText_wv = load_facebook_vectors("models/reviews_vec.vec")
Now let’s have a look at what we can actually do with those embeddings, and whether they make sense. One thing we can try is to find the words whose embeddings are most similar to that of a specific word; let’s try service:
fastText_wv.most_similar("service")
[('guy', 0.9153076410293579),
('visit', 0.9117871522903442),
('$75', 0.8935893177986145),
('paying', 0.8484381437301636),
('Weve', 0.8340294361114502),
('4-5', 0.8236839771270752),
('months', 0.7747606635093689),
('per', 0.7211363315582275),
('routed', 0.6958457231521606),
('appliance', 0.6802433729171753)]
Those are sorted by cosine similarity. What does the vector of an individual word look like?
Blended learning: Cosine similarity
Read more about cosine similarity and the math behind it here (1 hour).
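A ranking like the one returned by most_similar can be reproduced by hand. The sketch below does so on four made-up 4-dimensional vectors; it illustrates the idea and is not gensim's implementation, and the vocabulary and numbers are invented.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented for illustration
vocab = ["service", "visit", "paying", "banana"]
vectors = np.array([
    [0.9, 0.1, 0.3, 0.0],
    [0.8, 0.2, 0.4, 0.1],
    [0.5, 0.6, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.9],
])

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    # After L2-normalising the rows, a dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # indices from most to least similar
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

ranking = most_similar("service")
for neighbour, score in ranking:
    print(f"{neighbour}: {score:.3f}")
```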
Exercise
Imagine your task is to build a tool that sorts job candidates based on their similarity. How could you approach this problem with custom word embeddings? Write up a pseudo-code description of your approach.
fastText_wv["service"]
array([ 0.03369893, -0.30342257, -0.31287557, 0.06328347, -0.15422097,
0.6900311 , 0.2902713 , -0.04522721, 0.12707095, -0.10173076,
-0.35954508, -0.37574846, 0.3420938 , -0.14806445, -0.06105737,
-0.6152239 , 0.48375046, 0.31523395, -0.41773838, -0.25677437,
0.39742705, -0.00882626, 0.24042241, 0.60639435, -0.6874998 ,
0.20789383, 0.5564708 , -0.48442906, 0.19938357, -0.42489457,
-0.25001803, 0.12958974, -0.5063852 , 0.17509942, -1.0968775 ,
-0.57158804, -0.36259872, -0.23189294, -0.19346552, 0.5832819 ,
0.18999015, -0.06430916, 0.7494559 , -0.56587875, 0.3883682 ,
0.6603875 , 0.4528623 , -0.6583048 , 0.18912114, 0.07246456,
-0.1075282 , 0.09732103, -0.32490987, 0.68329686, -0.12884144,
0.17648453, -0.48592442, -0.1735929 , -0.40970507, 0.40856415,
-0.49658212, -0.58721876, 0.01796373, 0.6114106 , -0.43736 ,
-0.50215924, 0.786839 , 0.77078503, 0.09432894, -0.40565056,
-0.80805415, 0.42209527, -0.24033622, -0.50208396, -0.7655079 ,
-0.56293756, -0.25264573, -0.6587055 , -0.39461687, -0.3533211 ,
0.17465983, -1.3165748 , -0.05236346, -1.1126766 , -0.01922854,
-0.40551388, -0.03952708, 0.77595365, 0.1453283 , 0.34026647,
1.0555615 , -0.71291816, 0.7598502 , 0.4585296 , -0.720232 ,
0.0925275 , -0.6108507 , 0.2989141 , -0.51312596, 0.31303704],
dtype=float32)
Additional information
By using any pre-trained model, we are relying on the assumption that the data it was trained upon is representative. Of course, this is not always the case, and word embeddings are no exception. To get some context on how bias can creep into this technology as well, read this blog post from Google Developers explaining the issue.
As we imagined, this is a vector of length 100.
Word Embeddings Visualisation with t-SNE
We could keep manually investigating whether those embeddings make sense, but is there a better way? One thing we can do to get a higher-level overview of the quality of our embeddings, and perhaps even to find some hidden patterns, is to visualise these vectors. This is not a simple task, since the vectors are n-dimensional (in our case 100). Visualising them on a 2-D or 3-D plane, which humans can understand, while roughly preserving the relationships between them requires a special algorithm.
One option is called t-SNE, which is accessible from the scikit-learn package. Let’s give it a try.
Additional information
Read the t-SNE paper here (1 hour)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# set figure sizes for plots
from pylab import rcParams
rcParams['figure.figsize'] = 20, 20
# Stack the vectors of all vocabulary words into one matrix
# (gensim 3.x API; in gensim >= 4.0 use `index_to_key` instead of `vocab`)
X = fastText_wv[fastText_wv.vocab]
Exercise
t-SNE is not the only method that can be used for visualising high-dimensional data. Can you find and describe another one?
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
The code below will plot the results in 2-D, while also annotating 300 words (we could annotate more, of course, but the readability of the plot might suffer from too much text).
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
# Annotate only the first 300 words to keep the plot readable
# (gensim 3.x API; in gensim >= 4.0 use `index_to_key` instead of `vocab`)
words = list(fastText_wv.vocab)[0:300]
for i, word in enumerate(words):
    plt.annotate(word, xy=(X_tsne[i, 0], X_tsne[i, 1]))
plt.show()
Now we can see that there are some clusters of words that are starting to form, and hopefully those make sense.
Using pre-trained word embeddings
Now that we have learned how to create our own word embeddings, let’s learn how to use pre-trained ones. As a first step we should download them. A good set is available here; to download it, use this command:
wget http://nlp.stanford.edu/data/glove.6B.zip
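Depending on the speed of your internet connection it might take a while. Once the download finishes, unzip the archive so that glove.6B.300d.txt sits in the data folder. Let’s first load the packages we need: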
from tqdm import tqdm
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from keras.utils import np_utils
import pickle
Using TensorFlow backend.
stop_words = set(stopwords.words('english'))
le = LabelEncoder()
X = reviews_data["reviewText"]
y = reviews_data["overall"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
This is where we load the embeddings from the raw file into a dictionary.
embeddings_index = {}
# Each line holds a word followed by its 300 coefficients, separated by spaces
with open('../data/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
400000it [00:35, 11264.77it/s]
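To see the file layout without downloading anything, the sketch below parses two made-up lines in the same "word followed by coefficients" format from an in-memory string. The words and values are invented, and only 3 coefficients are used per line instead of 300.

```python
import io
import numpy as np

# Two made-up lines in the "word v1 v2 ..." layout; the real file has 300 values per line
fake_glove_file = io.StringIO(
    "service 0.12 -0.40 0.33\n"
    "payment 0.05 0.27 -0.18\n"
)

toy_index = {}
for line in fake_glove_file:
    # The first field is the word, the rest are its coefficients
    word, *coefficients = line.split()
    toy_index[word] = np.asarray(coefficients, dtype="float32")

print(sorted(toy_index))
print(toy_index["service"].shape)
```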
Blended learning: GloVe
Read the GloVe paper (1 hour)
The following function does some processing and algebra to convert a piece of text into a 300-dimensional vector. Credit for this function goes to Abhishek Thakur (see Additional information below).
Additional information: How to Approach Any ML Problem on Kaggle
Abhishek has a great tutorial, called “Approaching (Almost) any NLP Problem on Kaggle”. It provides a useful summary of a lot of the things we have learned so far, so it would be useful for you to go through. You can find it as a Kaggle Kernel here.
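def sent2vec(s):
    # Tokenise, then drop stop words and non-alphabetic tokens
    words = word_tokenize(s)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    # Collect the embedding of every word that GloVe knows about
    M = []
    for w in words:
        try:
            M.append(embeddings_index[w])
        except KeyError:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    # With no known words the sum is a scalar, so fall back to a zero vector
    if type(v) != np.ndarray:
        return np.zeros(300)
    # Scale the summed vector to unit length
    return v / np.sqrt((v ** 2).sum())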
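The arithmetic in sent2vec is easier to follow on a tiny example. The sketch below is a simplified stand-in, not the function above: it splits on whitespace instead of using the nltk tokeniser, and the 3-dimensional embeddings, the stop-word list and the name `toy_sent2vec` are all invented for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings and stop-word list, for illustration only
toy_embeddings = {
    "great":   np.array([1.0, 0.0, 0.0]),
    "service": np.array([0.0, 2.0, 0.0]),
}
toy_stop_words = {"the", "was"}

def toy_sent2vec(text, dim=3):
    """Sum the vectors of known, non-stop words and scale the result to unit length."""
    words = [w for w in text.split() if w.isalpha() and w not in toy_stop_words]
    known = [toy_embeddings[w] for w in words if w in toy_embeddings]
    if not known:
        # No usable word: fall back to a zero vector
        return np.zeros(dim)
    v = np.sum(known, axis=0)
    return v / np.sqrt((v ** 2).sum())

vec = toy_sent2vec("the service was great")
print(vec, np.linalg.norm(vec))
print(toy_sent2vec("xyzzy 123"))
```

Because the result is scaled to unit length, a long text and a short text with the same mix of words map to the same vector.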
And finally we can use this function to replace the words of our data with embeddings:
X_train_glove = [sent2vec(x) for x in tqdm(X_train)]
X_test_glove = [sent2vec(x) for x in tqdm(X_test)]
100%|██████████| 1525/1525 [00:04<00:00, 306.51it/s] 100%|██████████| 752/752 [00:02<00:00, 308.10it/s]
Next, there are a few additional transformations that we need to apply before the data is ready for machine learning. The most important one is to scale the data. The difference between scaling and standardisation is shown in Scaling and Normalization.
Blended learning: Scaling and normalisation
Read more about scaling and normalisation here (30 minutes).
X_train_glove = np.array(X_train_glove)
X_test_glove = np.array(X_test_glove)
scl = StandardScaler()
X_train_glove_scl = scl.fit_transform(X_train_glove)
X_test_glove_scl = scl.transform(X_test_glove)
y_train_enc = np_utils.to_categorical(y_train)
y_test_enc = np_utils.to_categorical(y_test)
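to_categorical turns integer labels into one-hot rows and, when the number of classes is not given, uses the largest label plus one as the number of columns. The numpy sketch below mimics that rule on made-up labels. It assumes that the overall column holds star ratings from 1 to 5, which is not shown in this tutorial; under that assumption the encoded matrix has six columns and column 0 is never set.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Numpy sketch of one-hot encoding for non-negative integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1  # largest label plus one
    # Row i of the identity matrix is the one-hot vector for label i
    return np.eye(num_classes, dtype="float32")[labels]

ratings = [5, 1, 3]  # hypothetical star ratings
encoded = one_hot(ratings)
print(encoded.shape)
print(encoded)
```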
Exercise
Why do we in this case use fit_transform on the train data, and just transform on the test data?
Exercise
Why do we need to scale the data for deep learning?
Let’s check: is one entry in the data really a 300-dimensional vector?
X_train_glove[0]
array([-2.32894737e-02, 3.10668293e-02, -4.39745151e-02, -5.64447939e-02,
1.02594390e-03, 1.12070525e-02, 3.20235104e-03, 7.77156604e-03,
2.61046309e-02, -4.52869862e-01, -4.11277637e-04, 6.31472934e-03,
1.52215995e-02, -1.23579139e-02, -2.16242392e-02, 2.32162979e-02,
-4.43424024e-02, -3.73341748e-03, 2.22783145e-02, 1.40877916e-02,
1.95223875e-02, 2.82965936e-02, 4.11917940e-02, 3.25049497e-02,
-3.15855332e-02, 2.13159509e-02, 7.44692143e-03, 1.43622817e-03,
3.66777889e-02, 2.16975231e-02, 3.37503478e-02, 6.12806417e-02,
-3.37418206e-02, 4.43321886e-03, -2.50328988e-01, 5.24989665e-02,
-4.48994078e-02, 1.51936822e-02, -3.56097259e-02, 5.47690243e-02,
-2.48268750e-02, -5.78602822e-03, -2.56457441e-02, -2.95194350e-02,
1.33604351e-02, 4.97801453e-02, 5.30099310e-02, 1.33737382e-02,
-2.12662108e-02, 3.18849273e-02, 1.54437823e-02, 7.23906280e-03,
1.78525653e-02, -3.21954787e-02, -1.27164163e-02, 4.35023718e-02,
-5.51021798e-03, 5.87610248e-03, 2.05822475e-03, 3.60975154e-02,
3.46094072e-02, -1.36471484e-02, 7.25326240e-02, 2.62633171e-02,
-2.78102467e-04, -6.12288713e-02, 2.73836013e-02, 4.06630039e-02,
2.48716865e-02, 1.71409454e-02, 3.17198448e-02, -2.50090007e-02,
4.84977383e-03, 7.07255900e-02, 5.59879001e-03, 1.39085650e-02,
-1.90250054e-02, 1.02943247e-02, -4.13045734e-02, -5.57916388e-02,
-2.31829099e-02, -4.50426601e-02, 5.89222200e-02, -2.77780909e-02,
-6.66003861e-03, 5.11157978e-03, 3.09338840e-03, 3.72455567e-02,
-2.70911288e-02, 9.61873494e-03, 4.42472706e-03, 6.35673478e-02,
-3.58307697e-02, -4.89802547e-02, -1.47702394e-03, -2.47533228e-02,
-8.09888393e-02, -1.15977796e-02, 4.87653241e-02, -1.12257764e-01,
-1.96868237e-02, 4.07871231e-02, -5.16530126e-02, -5.55814803e-02,
-2.43531330e-03, -8.12497921e-03, 2.42574383e-02, 2.44612861e-02,
-8.50794762e-02, 4.00520451e-02, -3.55438553e-02, -3.82371247e-02,
-2.40447130e-02, -4.06305753e-02, -2.11961772e-02, 5.62526882e-02,
-2.53995825e-02, 4.45758924e-02, 1.15506537e-02, -4.56452072e-02,
1.99450012e-02, -7.51915351e-02, 6.62240535e-02, 1.82880778e-02,
-3.40038096e-04, 2.03797668e-02, 1.19775673e-02, 3.51503752e-02,
2.16680523e-02, 1.71746742e-02, 5.21519780e-02, 6.38001934e-02,
3.20830978e-02, 2.48942710e-02, -8.17613583e-03, 3.68397944e-02,
-8.21215007e-03, -5.26166800e-03, 6.91096531e-03, 1.57503244e-02,
-9.40139312e-03, 2.46435567e-03, -9.52579826e-03, -1.54498499e-02,
-1.01902738e-01, -4.87458194e-04, 1.87341031e-02, 3.90754128e-03,
1.48044731e-02, 2.11387649e-02, 9.60999541e-03, -1.12161099e-03,
-3.13932658e-03, -7.34202117e-02, 8.19091648e-02, -1.74888745e-02,
1.03314398e-02, -3.97811867e-02, 1.97458398e-02, 2.63384972e-02,
1.50753334e-02, -8.03321972e-02, -2.95290332e-02, -3.53005603e-02,
2.90658865e-02, 3.72234546e-02, 1.19341612e-02, 2.80832220e-02,
6.01811446e-02, 1.19202295e-02, 3.54915066e-03, 3.24240364e-02,
-1.06825657e-01, 3.88127839e-04, 4.20327717e-03, -3.37486668e-03,
-2.91836057e-02, 5.72584048e-02, 6.08320814e-03, 4.61310707e-03,
4.08669226e-02, -7.09671155e-03, 5.91302142e-02, 2.67652869e-02,
-2.04260610e-02, -3.78775001e-02, 7.66344145e-02, 2.55392268e-02,
5.41477613e-02, 3.66090983e-03, -5.87164331e-03, 4.62315045e-02,
1.66062042e-02, 3.03641874e-02, 1.10087590e-02, -2.91239731e-02,
-6.49839118e-02, -1.66033804e-02, 3.49712931e-03, -6.31746128e-02,
2.16965958e-01, 3.01345438e-02, 7.06991330e-02, 1.92122553e-02,
4.82732318e-02, 3.37176882e-02, -1.00389728e-02, -1.08275819e-03,
-2.77451742e-02, -1.83033315e-03, -2.00415570e-02, -1.24214292e-02,
6.32454008e-02, 5.12878038e-03, 4.00201827e-02, -2.05802638e-03,
2.77434774e-02, -4.43505403e-03, -2.10391052e-04, -2.60170270e-02,
3.74035910e-02, 4.92131477e-03, -3.15390006e-02, 4.66568768e-03,
1.83680877e-02, -1.80903282e-02, -4.40934859e-03, -7.00278580e-03,
1.51425274e-02, -1.86301786e-02, 4.11494710e-02, -2.14782730e-02,
-9.28951660e-04, -7.16459155e-02, 4.10312563e-02, -1.95521768e-03,
-2.33021681e-03, 1.08297542e-02, -1.95159577e-02, -4.89295460e-03,
3.12124155e-02, -2.53360881e-03, 3.55091505e-02, 2.45698523e-02,
-1.52898744e-01, -6.18621632e-02, 5.06714843e-02, 3.20858248e-02,
-3.98602709e-03, -2.43617557e-02, 2.57377047e-02, -4.57251929e-02,
-9.34164785e-03, -7.73427039e-02, 1.02578342e-01, 1.84447393e-02,
-4.07262985e-03, -2.34674048e-02, -2.68124905e-03, 2.92026065e-03,
-5.93280606e-03, -7.20642880e-02, -1.93936769e-02, 2.46707699e-03,
6.85578771e-03, -2.85296992e-04, -5.18337488e-02, 6.10110164e-03,
1.05954679e-02, 8.88415705e-03, 6.38094265e-03, -1.39176007e-02,
3.14134266e-03, 2.20535174e-02, -2.66812649e-02, 2.31649298e-02,
-5.59711337e-01, 1.06887231e-02, 7.14323344e-03, -4.45074076e-03,
-4.74225245e-02, 2.58071860e-03, 2.32525598e-02, 4.23671827e-02,
-2.06068270e-02, 5.04491441e-02, 1.84423141e-02, 2.06283039e-05,
-1.46646611e-02, -2.02254001e-02, -6.08140416e-03, -1.26735009e-02,
-4.31094901e-04, 2.02291682e-02, 3.83833423e-02, 2.13016500e-03,
2.24654120e-03, -6.17554523e-02, 6.78942259e-03, 4.26397696e-02])
Indeed it is. We are now ready to feed this data to a machine learning model, which we will do in the tutorial on Deep Learning in NLP. Now let’s finally export this data for later use. Using pickle is a quick and convenient way to achieve this.
pickle.dump(X_train_glove_scl, open("../data/X_train_glove_scl.pkl", "wb"))
pickle.dump(y_train_enc, open("../data/y_train_enc.pkl", "wb"))
pickle.dump(X_test_glove_scl, open("../data/X_test_glove_scl.pkl", "wb"))
pickle.dump(y_test, open("../data/y_test.pkl", "wb"))
pickle.dump(y_test_enc, open("../data/y_test_enc.pkl", "wb"))
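As a minimal sketch of the full round trip, the snippet below dumps a small made-up object and loads it back, using with blocks so that the file handles are closed reliably. The payload and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical payload standing in for the arrays exported above
data = {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [1, 0]}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# `with` closes the file even if dumping fails part-way
with open(path, "wb") as handle:
    pickle.dump(data, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == data)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.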
Blended learning: Python object serialisation
Read about object serialisation in Python in the official Pickle documentation (30 minutes).
Exercise
Try the built-in pandas methods for object serialisation.
Exercise
The creation of custom word embeddings becomes very useful for domain-specific datasets, where a lot of named entity disambiguation is also required. This is very often the case in the medical domain. Use what you have learned here to create your own embeddings on a medical text dataset, for example: https://www.kaggle.com/tboyle10/medicaltranscriptions
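spaCy word vectors
We could also use the embeddings available in spacy. Note that the small model used below (en_core_web_sm) does not ship with static word vectors, which is why every token is reported as out-of-vocabulary; the larger models (for example en_core_web_md) include them.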
import spacy
nlp = spacy.load("en_core_web_sm")
tokens = nlp("dog cat banana afskfsd")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
dog True 19.266302 True
cat True 19.220264 True
banana True 17.748499 True
afskfsd True 20.882006 True
Additional information
Read about the state of the art universal word and sentence embeddings here (1 hour).
Let’s have a look at an individual vector:
tokens[0].vector
array([ 0.99822044, -0.8781611 , -0.9599147 , -0.8802022 , 1.4011143 ,
-1.4729911 , -1.4483004 , 2.3529506 , 1.4696705 , 4.1085796 ,
4.661976 , 2.9604769 , 4.635996 , -0.84563375, 0.9116936 ,
-1.1318729 , -0.92072326, 1.4788682 , -1.4155934 , -2.4691365 ,
-2.422693 , 0.87474394, -0.7867575 , -1.8145221 , 0.7019544 ,
-1.6173346 , -1.8799448 , -4.580726 , 1.8491042 , -0.32686716,
4.730577 , 0.57223386, 0.7283193 , -0.3618081 , -3.2380333 ,
-0.6483809 , 3.613314 , 0.42308074, -0.49508172, 0.74843705,
3.9148026 , 2.307486 , 0.8387308 , -1.3754001 , -1.1304648 ,
2.155499 , -1.6760478 , -1.2141752 , -1.2715306 , -1.6553345 ,
-0.16264206, -1.3702772 , -1.4764215 , -0.7934667 , -1.8332126 ,
0.7584114 , 5.2410417 , -0.38271964, 0.6616207 , -1.6419213 ,
1.3823496 , -0.98040056, -0.2352474 , -0.4358213 , 0.9533407 ,
-1.0267448 , -1.1989149 , 0.44146824, -1.9613253 , 0.13480479,
-1.6117922 , 1.7746817 , 0.45083126, 1.5055051 , -3.2233422 ,
-0.6783832 , -1.4262296 , -1.3069394 , -0.26983237, 0.42622554,
4.113386 , -2.8683515 , -1.6826023 , -1.4485264 , 0.9647131 ,
2.2337825 , -0.9116962 , -1.5483243 , 1.0004147 , -1.803612 ,
-2.236876 , 0.6904143 , -2.5448341 , 2.2533112 , -0.562313 ,
3.0456731 ], dtype=float32)
We can also compute the similarity between those vectors.
print(tokens[0].similarity(tokens[1]))
0.4805915
Additional information
Word embeddings playground from the University of Turku: http://bionlp-www.utu.fi/wv_demo/
Portfolio Projects
Portfolio Project
Custom word embeddings can be extremely useful when applied to specific domains. In most other cases pre-trained embeddings, such as the ones obtained from the gensim library, can be good enough.
One domain that is very specialised, and where information retrieval and abbreviation disambiguation are an issue, is healthcare. For this project, download the medical transcription data from Kaggle and use it to create word embeddings.
Portfolio Project
By using functions such as most_similar we can easily debug a word embeddings model. Still, such work requires programming skills, and oftentimes the data scientist might not be the best-placed person to debug this information (think about the first portfolio project here, the medical dataset); it might require more domain knowledge.
In order to get around this problem, and make the embeddings accessible to a non-technical user, you can create a graphical interface (see diagram below).
Glossary
- GloVe
Global Vectors for Word Representation