👨‍🏫 Week 09 - Managing your data science workflow

DS105 - Data for Data Science

Author

Dr. Jon Cardoso-Silva

Published

22 November 2022

What about Natural Language Processing (NLP)?

I had planned a small introduction to spaCy, a package in Python that handles text as data, but we will not have the time to talk about NLP today!

Unstructured data (text, audio & image) was always planned for next week anyway. So join us in Week 10 for an intro to text pre-processing, when I will focus on tokenization: the process of breaking sentences into words and counting/describing them (without any for loops!).

We will continue to explore text data in Week 11, when Prof. Ken Benoit will deliver a talk about text mining applications.

🖥️ Part I - Conda environments (45-50 min)

In the first part of the lecture, we will explore conda environments together (bring your 💻laptops, and perhaps also coffee!).

And in the labs, you will practice sharing these environments via GitHub and using GitHub more effectively as a team. You will also learn about the Pull Request process for teams.

Note

💡You will be assessed on knowledge of these best practices as part of your final project. Check the Project Marking Criteria: Source Code | Organisation and Source Code | Collaboration.

Lecture Notes

This is an interactive lecture. I will teach you about conda environments, and we will compare how everyone’s conda and Python settings differ.

🤔 What is in my `conda`?

Step 1.

Which version of Python do you have? Open your terminal and type:

python --version

Compare your version to those of your colleagues.

Step 2.

Which packages do you have installed in conda? Open your terminal and type:

conda list

You should see a list of all the packages in your default conda environment, whether you installed them yourself or they came installed by default.

What version of jupyterlab do you have installed? What about pandas? Compare the version of your packages to those of your colleagues. Are there any differences?
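The same question can also be asked from inside Python. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example, and it only sees packages that ship Python package metadata, so its answers may not cover everything that `conda list` shows.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not found."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions of the two packages mentioned above (output depends on your machine).
for name in ["jupyterlab", "pandas"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run it with the same Python you use for the course; a different interpreter may report different versions.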

Step 3.

Let’s generate some data! Run the command below in the terminal to save the output of conda list to a text file. Replace my username with yours:

conda list >> conda_list_jonjoncardoso.txt

Step 4.

I will ask you to upload the file you created to Slack.

Step 5.

I will then combine all of our data and we will explore the discrepancies in the versions of packages we are all likely to use.
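One way this comparison could be done is sketched below. It is not necessarily the approach used in the lecture: the function names, the student labels and the sample text are all made up, and the parser assumes the usual `conda list` layout (commented header lines, then whitespace-separated name and version columns).

```python
from typing import Dict


def parse_conda_list(text: str) -> Dict[str, str]:
    """Map package name -> version from the text printed by `conda list`."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]  # columns: name, version, build, channel
    return packages


def find_discrepancies(lists_by_user: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return the packages whose version differs, or is missing, between users."""
    all_packages = set()
    for packages in lists_by_user.values():
        all_packages.update(packages)
    discrepancies = {}
    for name in sorted(all_packages):
        versions = {user: pkgs.get(name, "missing") for user, pkgs in lists_by_user.items()}
        if len(set(versions.values())) > 1:
            discrepancies[name] = versions
    return discrepancies


# Made-up sample data standing in for two uploaded files.
student_a = """# Name   Version   Build   Channel
numpy     1.22.3    py39_0
pandas    1.4.2     py39_0
"""
student_b = """# Name   Version   Build   Channel
numpy     1.22.3    py310_0
pandas    1.5.1     py310_0
tqdm      4.64.1    pyhd_0   conda-forge
"""

parsed = {"student_a": parse_conda_list(student_a), "student_b": parse_conda_list(student_b)}
for package, versions in find_discrepancies(parsed).items():
    print(package, versions)
```

With real files, each sample string would be replaced by the contents of one uploaded text file.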

⚙️ How do we “fix” everyone’s environment?

Step 6.

In the terminal, cd to the directory where you keep all the files of your project.

Step 7.

Create a conda environment

💡 Useful link: Managing conda environments

conda create --prefix .\venv python=3.10

(On macOS/Unix, write the path as ./venv instead of .\venv.)

Step 8.

Activate the environment:

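macOS/Unix Users

source activate ./venv

Windows

activate .\venv

On recent versions of conda, `conda activate ./venv` (or `conda activate .\venv` on Windows) is the recommended form on every platform.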
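A quick way to confirm that activation worked is to ask Python where it is running from. The sketch below is only an illustration, with made-up function names: it relies on `sys.prefix` pointing at the environment folder, which is the case for a conda environment created with `--prefix`.

```python
import sys
from pathlib import Path


def describe_interpreter() -> dict:
    """Collect where the running Python lives and which version it is."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
    }


def runs_inside(env_dir: str) -> bool:
    """True if the running interpreter's prefix is `env_dir` (for example ./venv)."""
    return Path(sys.prefix).resolve() == Path(env_dir).resolve()


print(describe_interpreter())
print("Running inside ./venv:", runs_inside("./venv"))
```

Start Python from the project directory after activating the environment. If the last line prints False, the terminal is probably still using another environment.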

Step 9.

What’s different about this conda environment?

conda list

Step 10.

Create a new file called requirements.txt and paste the following there:

matplotlib==3.5.3 # version required for plotnine
plotnine==0.10.1 # Python version of ggplot2

numpy>=1.22
pandas==1.4.2
scikit-learn==1.1.3

### UTILS
jupyterlab==3.4.2
tqdm==4.62.0

Step 11.

Try to install it with conda:

conda install --file requirements.txt

Why can’t we install all packages?

Step 12.

Install it with pip:

macOS/Unix Users

conda install pip
which pip

Windows

conda install pip
where.exe pip

Then:

pip install -r requirements.txt

Step 13.

What does the conda environment look like now?

conda list
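Besides reading the `conda list` output by eye, the pins in requirements.txt can be compared with what is actually installed. The sketch below is a simplified illustration: the function names are made up, it only understands `==` and `>=` with purely numeric versions, and it assumes requirements.txt is in the current directory. A real project would more likely rely on pip itself or on a dedicated version-parsing library.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path
from typing import List, Tuple

# Matches lines such as "pandas==1.4.2" or "numpy>=1.22" (simplified on purpose).
PIN_PATTERN = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(==|>=)\s*([0-9][0-9A-Za-z.]*)$")


def parse_requirements(text: str) -> List[Tuple[str, str, str]]:
    """Return (name, operator, version) for each pinned line in a requirements file."""
    pins = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline and full-line comments
        match = PIN_PATTERN.match(line) if line else None
        if match:
            pins.append(match.groups())
    return pins


def as_numbers(v: str) -> Tuple[int, ...]:
    """Turn '1.22.4' into (1, 22, 4); non-numeric parts are ignored."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())


def satisfies(installed: str, operator: str, wanted: str) -> bool:
    """Check one installed version against one pin."""
    if operator == "==":
        return installed == wanted
    return as_numbers(installed) >= as_numbers(wanted)


def report(text: str) -> None:
    """Print, for each pin, what is installed and whether it matches."""
    for name, operator, wanted in parse_requirements(text):
        try:
            installed = version(name)
            status = "OK" if satisfies(installed, operator, wanted) else "MISMATCH"
        except PackageNotFoundError:
            installed, status = "not installed", "MISSING"
        print(f"{name}{operator}{wanted}: {installed} [{status}]")


requirements_file = Path("requirements.txt")  # assumed location
if requirements_file.exists():
    report(requirements_file.read_text())
```

Any line marked MISMATCH or MISSING is worth comparing with the output of `conda list`.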

☕ Coffee Break (10 min)

Use this time to chat, stretch, drink some coffee or just relax for a bit by yourself.

🖥️ Part II - Databases (45-50 min)

Databases: what are they? What is SQL? And how do you connect to a database directly through pandas? This content was initially planned for 🗓️ Week 08, but we didn’t have the time for it.

Useful links

Relational Database Management System (RDBMS)
- Oracle: What is a Relational Database?
- Google: What is a Relational Database?
Famous Open-source RDBMS:
- MySQL
- PostgreSQL
SQL:
- Good step-by-step SQL Tutorial
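To give a feel for what a relational table and an SQL query look like, here is a minimal sketch. It uses SQLite purely as an assumption, because the `sqlite3` module ships with Python and needs no server; the lecture itself may use a different RDBMS such as MySQL or PostgreSQL. The table, column names and rows are made up for illustration.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory database with one small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE packages (name TEXT, version TEXT, student TEXT)")
    rows = [
        ("pandas", "1.4.2", "student_a"),
        ("pandas", "1.5.1", "student_b"),
        ("numpy", "1.22.3", "student_a"),
        ("numpy", "1.22.3", "student_b"),
    ]
    conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# SQL describes *what* we want: how many distinct versions exist per package.
QUERY = """
SELECT name, COUNT(DISTINCT version) AS n_versions
FROM packages
GROUP BY name
ORDER BY name
"""


def count_versions(conn: sqlite3.Connection) -> list:
    """Run the query and return its rows as a list of tuples."""
    return conn.execute(QUERY).fetchall()


conn = build_demo_database()
print(count_versions(conn))
conn.close()
```

With pandas installed, the same query and connection can be passed to `pandas.read_sql_query(QUERY, conn)` (before the connection is closed) to get the result as a DataFrame.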
