Welcome
Description
A large portion of the data generated today is unstructured text. To make this data accessible for analysis and useful for data science, skills in Natural Language Processing (NLP) are required.
NLP methods can be applied to the following:
detecting spam
translation
text to speech
speech to text
part of speech tagging
named entity recognition
natural language generation
optical character recognition
question answering
chatbots
sentiment analysis
topic modeling
disambiguation
relationship extraction
text summarization
coreference resolution
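As a toy illustration of the first task in the list above, here is a minimal keyword-matching spam detector. Real spam filters use statistical or neural classifiers; the keyword list and threshold here are made up purely for the example.

```python
# Toy spam detection via keyword matching (illustrative only).
SPAM_KEYWORDS = {"free", "winner", "prize", "click"}

def looks_like_spam(message: str, threshold: int = 2) -> bool:
    """Flag a message as spam if it contains `threshold` or more spam keywords."""
    words = {w.strip(".,!?").lower() for w in message.split()}
    return len(words & SPAM_KEYWORDS) >= threshold

print(looks_like_spam("Click now, you are a WINNER of a FREE prize!"))  # True
print(looks_like_spam("Meeting moved to 3pm, see you there."))          # False
```

In practice such rules are brittle; the point is only to show that each task above reduces to mapping raw text to a structured decision or label.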
Topics covered
text data processing methods (e.g. tokenization, lemmatization, bag of words, tf-idf) and associated packages such as NLTK and spaCy
part of speech tagging
word embeddings
deep learning for text
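To give a feel for the first topic above, here is a stdlib-only sketch of tokenization, bag of words, and tf-idf. In the module itself these are done with NLTK and spaCy; the naive whitespace tokenizer and tiny corpus below are assumptions made for the example.

```python
# Minimal sketches of tokenization, bag of words, and tf-idf.
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer (real libraries handle punctuation, casing, etc.)."""
    return text.lower().split()

def bag_of_words(tokens: list[str]) -> Counter:
    """Count how often each token occurs in a document."""
    return Counter(tokens)

def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    """tf-idf = term frequency in the document * inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

corpus = [tokenize(t) for t in ["the cat sat", "the dog sat", "the cat ran"]]
print(bag_of_words(corpus[0]))                    # Counter({'the': 1, 'cat': 1, 'sat': 1})
print(round(tf_idf("cat", corpus[0], corpus), 3)) # 0.135
```

Note how tf-idf down-weights words that appear everywhere: "the" occurs in every document, so its idf (and hence its tf-idf) is zero.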
Learning objectives
At the end of this module students should be able to conduct Natural Language Processing work on real-world large-scale projects with little to no supervision.
Prework
This module makes a few assumptions about your existing programming and data skills. Below are those assumptions, along with resources to read through to make sure you are properly prepared.
| Assumptions | Resource |
|---|---|
| Basic Python skills | Fundamentals of Python |
| pyData stack | |
How to use the materials
There are several different elements that help guide you in using the materials. Let’s have a look.
Blended learning
These materials are curated third-party resources that are mandatory to work through. They can be course platforms, articles, academic papers, or blog posts. The estimated time to complete each is specified so you can plan your effort.
Videos
Videos are added to the sidebar, together with their duration. These materials are mandatory, often go deeper into the topic, and are hand-picked by the author.
Additional information
This is what additional information boxes look like. This content is optional, but often provides additional value for the highly motivated.
Exercises
Exercises are mandatory hands-on tasks that make sure you understand the content completely.
Portfolio Projects
A portfolio project is a crucial element of the curriculum, since it allows the student to demonstrate complete understanding of the material. Projects are often more open-ended, larger, and more creative than the exercises. The goal is for the student to complete the project and publicly demonstrate their new expertise on the topic to potential future employers and/or colleagues.
Setup instructions
For more information about virtual environments and the command line, go to the module Development Environment for Data Science.
Create a virtual environment in the root folder:
python3 -m venv venv
Activate the virtual environment:
source venv/bin/activate
At this point the prompt in the terminal should include the word venv, indicating the activated environment.
You can verify this by typing:
which python
to check that the Python interpreter in use is the one from the virtual environment, and not the system one. You can also check which packages are installed:
pip freeze
This should return no results, since this is a fresh environment. Now we are ready to install the dependencies:
pip install -r requirements.txt
And to see that everything worked, we can check the installed packages again:
pip freeze
This should return the same packages as listed in the requirements.txt dependencies file. We are good to go!
To build the book, go to the materials/code folder and run jupyter-book:
cd materials/code
jupyter-book build .
The book will be built in the _build/html folder; open the html files there in your browser.
Note: if you want to exit the virtual environment, type deactivate.