Welcome

Description

A large portion of the data generated today is unstructured text. Making this data accessible for analysis and useful for data science requires skills in Natural Language Processing (NLP).

NLP methods can be applied to the following:

  • detecting spam

  • translation

  • text to speech

  • speech to text

  • part of speech tagging

  • named entity recognition

  • natural language generation

  • optical character recognition

  • question answering

  • chatbots

  • sentiment analysis

  • topic modeling

  • disambiguation

  • relationship extraction

  • text summarization

  • coreference resolution

Additional information

Topics covered

  • text data processing methods (e.g. tokenization, lemmatization, bag of words, tf-idf) and the associated packages NLTK and spaCy

  • part of speech tagging

  • word embeddings

  • deep learning for text
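
To make the first topic concrete, here is a minimal pure-Python sketch of tokenization, bag of words and tf-idf. It does not use the NLTK or spaCy APIs; the function names (tokenize, bag_of_words, tf_idf) and the two sample sentences are made up for illustration, and the weighting is one plain, unsmoothed tf-idf variant, so library implementations may give different numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = bag_of_words(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Illustrative sample data, not taken from the course materials.
docs = ["The cat sat on the mat", "The dog sat on the log"]
print(bag_of_words(tokenize(docs[0])))
print(tf_idf(docs)[0])
```

Terms that occur in every document, such as "the" in the sample sentences, get a weight of zero under this variant, while terms specific to one document get a positive weight.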

Learning objectives

At the end of this module students should be able to conduct Natural Language Processing work on real-world large-scale projects with little to no supervision.

Prework

This module assumes some prior programming and data skills. The assumptions are listed below, together with resources to read through to make sure you are properly prepared.

Assumptions and the corresponding resources:

  • Basic Python skills: Fundamentals of Python

  • PyData stack: PyData overview

How to use the materials

There are several different elements that help guide you in using the materials. Let’s have a look.

Blended learning

These materials are curated third-party resources that are mandatory to go through. They can be course platforms, articles, academic papers or blog posts. The estimated time needed to go through them is also specified so you can plan your effort.

Videos

Videos are added to the sidebar, together with their duration. These materials are mandatory, are hand-picked by the author, and often go deeper into the topic.

Additional information

This is what additional information boxes look like. This content is optional, but often provides extra value for the highly motivated.

Exercises

Exercises are mandatory hands-on tasks that make sure you understand the content completely.

Portfolio Projects

A portfolio project is a crucial element of the curriculum, since it allows the student to demonstrate complete understanding of the material. Projects are often larger, more open-ended and more creative than the exercises. The goal is for the student to complete the project and publicly demonstrate their new expertise on the topic to potential future employers and/or colleagues.

Setup instructions

For more information about virtual environments and the command line, go to the module Development Environment for Data Science.

Create a virtual environment in the root folder:

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate

At this point the terminal prompt should include the word venv, indicating the activated environment.

To check that the Python interpreter in use is the one from the virtual environment, and not the system one, type:

which python

Another thing you can do is check which packages are installed:

pip freeze

This should return no results, since this is a fresh environment. Now we are ready to install the dependencies:

pip install -r requirements.txt

And to see that everything worked fine, we can check the installed packages again:

pip freeze

The output should now include the packages listed in the requirements.txt dependencies file, along with any dependencies they pull in. We are good to go!
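
The same checks can also be made from inside Python. The following is a minimal sketch using only the standard library: it tests whether the running interpreter belongs to a virtual environment and reports which required packages are not installed. The helper names (in_virtualenv, normalize, requirement_names, missing_packages) are hypothetical, the parser handles only simple requirement lines, and the package names in the demo call are illustrative and not taken from this module's requirements.txt.

```python
import re
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def normalize(name):
    """Normalize a package name (lowercase, runs of '-', '_', '.' become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_names(lines):
    """Extract package names from simple requirements lines (simplified parser)."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        # Cut at the first version specifier, extra, marker or space.
        names.append(normalize(re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]))
    return names


def missing_packages(lines):
    """Return the required names that are not installed for this interpreter."""
    installed = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            installed.add(normalize(name))
    return [name for name in requirement_names(lines) if name not in installed]


print("virtual environment active:", in_virtualenv())
# Illustrative requirement lines, not the real contents of requirements.txt.
print(requirement_names(["numpy==1.26.0", "spaCy>=3.0  # NLP", "scikit_learn"]))
```

To run the check against the real file, the lines of requirements.txt can be passed to missing_packages, for example by opening the file and passing the file object. An empty list means every listed package was found.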

In order to build the book you need to go to the materials/code folder and run jupyter-book:

cd materials/code
jupyter-book build .

The book will then be built in the _build/html folder, where you can open the HTML files in your browser.

Note: to exit the virtual environment, type deactivate.