4: Unsupervised Learning

Introduction

Having a labeled dataset is a huge luxury in data science. More often than not we have to work with datasets that have no labels, and still try to find patterns and automate what we can. Manually labeling a dataset is time-consuming, and sometimes genuinely hard - imagine a dataset of legal documents, where it might take a team of lawyers several hours to correctly annotate or label a single piece of text.

In those cases we cannot use the methods described in the previous chapter on supervised learning, but there are still a few algorithms we can use to do something productive. Those methods fall in the space of “unsupervised learning”. Pretty much all of them use some kind of clustering technique under the hood, which finds clusters in the data that we can then inspect and perhaps label (this would then be called semi-supervised learning).

In this tutorial we will cover one of the most common unsupervised learning methods in NLP: latent Dirichlet allocation (LDA). For this we will use the gensim library, which you have already seen in the chapter on word embeddings.

Fig. 8 LDA overview (image: https://ars.els-cdn.com/content/image/1-s2.0-S0164121218302103-gr6.jpg)

import pandas as pd
import gensim
import pyLDAvis.gensim as gensimvis
import pyLDAvis

First we start off by loading our already processed dataset (tokenization, stop word removal, stemming and lemmatization have been applied).

# Load the preprocessed reviews and drop rows with missing text
reviews_data = pd.read_csv("../data/reviews_data.csv")
reviews_data = reviews_data[["text_processed"]].dropna()

# Split each document into a list of tokens
lda_data = reviews_data["text_processed"].tolist()
lda_data = [d.split() for d in lda_data]

# Map each unique token to an integer id
dictionary = gensim.corpora.Dictionary(lda_data)

# Shorthand for the LDA model class
Lda = gensim.models.ldamodel.LdaModel

# Encode every document as a bag-of-words: a list of (token_id, count) pairs
doc_term_matrix = [dictionary.doc2bow(doc) for doc in lda_data]
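
To make the bag-of-words representation concrete, you can peek at the encoding of a single document. This check is purely illustrative and not part of the original pipeline:

# Each entry in a document is a (token_id, count) pair
print(doc_term_matrix[0][:10])
# Map the ids back to the actual tokens for readability
print([(dictionary[i], count) for i, count in doc_term_matrix[0][:10]])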

In the following code we specify the most important argument for unsupervised learning - the number of topics (clusters) that we expect to see in the data. As you can imagine this argument is quite subjective, and choosing it is part of a trial-and-error process in which we manually inspect the results and judge which number gives a good separation (a sketch of one way to make this comparison more systematic follows the training code below).

Blended learning

There are several other techniques used for unsupervised learning on text. The two most important ones that you should be aware of are non-negative matrix factorization (NMF) and singular value decomposition (SVD).

Those are covered in an excellent tutorial from FastAI: Topic Modeling with SVD & NMF (1 hour).

# Train the LDA model on the bag-of-words corpus
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
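
As a minimal sketch of how the trial-and-error process described above can be made more systematic, you can train one model per candidate topic count and compare topic coherence scores. The candidate values and passes=10 below are arbitrary illustrative choices:

from gensim.models import CoherenceModel

for k in [2, 3, 5, 8]:
    candidate = Lda(doc_term_matrix, num_topics=k, id2word=dictionary, passes=10)
    # Higher c_v coherence usually indicates more interpretable topics
    cm = CoherenceModel(model=candidate, texts=lda_data, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())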

Exercise

Play around with the different parameters of the topic model. How does the result change? Which parameters bring the most benefit?
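
If you want to contrast LDA with the alternatives mentioned in the blended learning box above, here is a minimal NMF sketch. It assumes scikit-learn is installed (the chapter itself only uses gensim), and the parameter values are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Reuse the preprocessed review texts from above
texts = reviews_data["text_processed"].tolist()

# NMF is typically run on a TF-IDF matrix rather than raw counts
vectorizer = TfidfVectorizer(max_features=5000)
tfidf = vectorizer.fit_transform(texts)

nmf = NMF(n_components=3, random_state=42)
nmf.fit(tfidf)

# Print the top words per component, analogous to print_topics for LDA
# (use get_feature_names() on scikit-learn versions older than 1.0)
terms = vectorizer.get_feature_names_out()
for idx, component in enumerate(nmf.components_):
    top = component.argsort()[-3:][::-1]
    print(idx, [terms[i] for i in top])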

Now that we have a trained model we can inspect it to see what kind of clusters it has found, and which words are the key contributing factors for each of them.

print(ldamodel.print_topics(num_topics=3, num_words=3))
[(0, '0.066*"i" + 0.025*"dryer" + 0.023*"clean"'), (1, '0.043*"i" + 0.043*"vent" + 0.023*"clean"'), (2, '0.043*"i" + 0.041*"rod" + 0.021*"use"')]

And finally, if we want a more interactive way to visualise the results, we can use the nice pyLDAvis library, which allows exactly that:

vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
# Uncomment to render the interactive visualisation in a notebook
# pyLDAvis.display(vis_data)
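
If you are working outside a notebook, pyLDAvis can also write the visualisation to a standalone HTML file (the filename here is arbitrary):

pyLDAvis.save_html(vis_data, "lda_vis.html")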

Exercise

If you are using a model training and testing framework (such as mlflow, for example)1, what metric would you choose to log and benchmark against, and why?

Additional information: HuggingFace and Transformers

A tremendous achievement in the field of NLP in recent years has been the advent of transformer models (such as BERT). These have complemented word embeddings in achieving state-of-the-art accuracy on a variety of tasks, and they are starting to be more widely adopted in industry as well.

They are not the focus of this module, but there is a variety of tools and open source packages available to assist you. One organisation that has done a lot to lower the learning curve is HuggingFace, and its GitHub page is worth looking at.
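
As a small taste of how low that learning curve now is, here is a minimal sketch using the transformers package. The package is assumed to be installed, the example sentence is made up, and the pipeline downloads a pretrained model on first use:

from transformers import pipeline

# Uses a default pretrained sentiment model
classifier = pipeline("sentiment-analysis")
print(classifier("The vent cleaning kit worked perfectly."))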

Exercise

At the beginning of this section we had a look at several data annotation tools. As an exercise, try setting up one of them and annotating enough data to build a classifier.

Portfolio projects

Portfolio Project: Healthcare

For this portfolio project you will be revisiting a dataset that you have already worked on - the medical transcriptions data from Kaggle. It could be extremely valuable to see whether there are higher-level patterns in the data - perhaps groups of patients? Such a model could then be used to assign a new patient to a cluster and help improve their diagnosis.
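
Assigning an unseen document to learned topics is straightforward in gensim. The sketch below reuses this chapter's reviews model purely to illustrate the API; for the portfolio project you would train on the medical transcriptions instead, and new_text here is made up:

# Preprocess the new document the same way as the training data
new_text = "the vent was easy to clean with the rod"
new_bow = dictionary.doc2bow(new_text.split())

# Returns a list of (topic_id, probability) pairs
print(ldamodel.get_document_topics(new_bow))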

Portfolio project: United Nations Security Council

This second dataset is also one that we are familiar with (from the data pre-processing section). Can we use it to try to find patterns in the data? One initial hypothesis might be that the speech transcripts will cluster neatly into the same number of general clusters as there are members of the Security Council. Can we prove or disprove this theory?

Glossary

LDA

latent Dirichlet allocation

SVD

Singular value decomposition

NMF

Non-negative matrix factorization

Footnotes


1. More on this in the section on MLOps.