👨‍🏫 Week 09 - Managing your data science workflow

DS105 - Data for Data Science

Author

Dr. Jon Cardoso-Silva

Published

22 November 2022

I had planned a small introduction to spaCy, a Python package for working with text as data, but we will not have time to talk about NLP today!

Unstructured data (text, audio & image) was always planned for next week anyway. So join us in Week 10 for an intro to text pre-processing, where I will focus on tokenization: the process of breaking sentences into words and counting/describing them (without any for loops!).

We will continue to explore the topic of text data in Week 11, when Prof. Ken Benoit will come and deliver a talk about text mining applications.

🖥️ Part I - Conda environments (45-50 min)

In the first part of the lecture, we will explore conda environments together (bring your 💻 laptops, and perhaps also coffee!).

In the labs, you will practice sharing these environments via GitHub and using GitHub more effectively as a team. You will learn about the Pull Request process for teams.

Note

💡 You will be assessed on knowledge of these best practices as part of your final project. Check the Project Marking Criteria: Source Code | Organisation and Source Code | Collaboration.


Lecture Notes

This is an interactive lecture. I will teach you about conda environments, and we will compare how everyone's conda and Python settings differ.

🤔 What is in my conda?

Step 1.

Which version of Python do you have? Open your terminal and type:

python --version

Compare your version to those of your colleagues.
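If you would rather check from inside Python (for example, in a notebook cell), the sketch below prints the same information. The helper name `python_version_summary` is made up for this illustration.

```python
import platform
import sys


def python_version_summary() -> str:
    """Return the running interpreter's version as 'major.minor.patch'."""
    return platform.python_version()


# The (major, minor) pair is usually what matters when comparing with colleagues
print("Python version:", python_version_summary())
print("Major and minor:", sys.version_info[:2])
```

Two people can both "have Python" and still differ in the minor version, which is often enough for a package to behave differently.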

Step 2.

Which packages do you have in conda? Open your terminal and type:

conda list

You should see a list of all the packages in your default conda environment, whether you installed them yourself or they came installed by default.

What version of jupyterlab do you have installed? What about pandas? Compare the versions of your packages with those of your colleagues. Are there any differences?

Step 3.

Let's generate some data! Run the command below in the terminal to save the output of conda list to a text file (note that >> appends to the file if it already exists). Replace my username with yours:

conda list >> conda_list_jonjoncardoso.txt

Step 4.

I will ask you to upload the file you created to Slack.

Step 5.

I will then combine all of our data and we will explore the discrepancies in the versions of packages we are all likely to use.
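One way the combining step could be done is sketched below. It is only an illustration, not necessarily how it will be done in the lecture: the function names are made up, and it assumes the files follow the `conda_list_<username>.txt` naming from Step 3, are saved as UTF-8, and contain the usual `conda list` layout (commented header lines, then name, version, build and channel columns).

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import Path


def parse_conda_list(text: str) -> dict[str, str]:
    """Map package name -> version from the text printed by `conda list`."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        parts = line.split()
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]  # columns: name, version, build, channel
    return versions


def load_conda_lists(folder: str) -> dict[str, dict[str, str]]:
    """Read every conda_list_<username>.txt in `folder`, keyed by username."""
    prefix = "conda_list_"
    envs = {}
    for path in sorted(Path(folder).glob(prefix + "*.txt")):
        user = path.stem[len(prefix):]
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal
        text = path.read_text(encoding="utf-8", errors="replace")
        envs[user] = parse_conda_list(text)
    return envs


def find_discrepancies(envs: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return only the packages that appear with more than one version."""
    by_package = defaultdict(dict)
    for user, packages in envs.items():
        for name, version in packages.items():
            by_package[name][user] = version
    return {
        name: dict(users)
        for name, users in by_package.items()
        if len(set(users.values())) > 1
    }
```

Calling `find_discrepancies(load_conda_lists("."))` in the folder holding the downloaded files would return, for each package with conflicting versions, who has which version.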

⚙️ How do we "fix" everyone's environment?

Step 6.

In the terminal, cd to the directory where you keep all the files of your project.

Step 7.

Create a conda environment:

💡 Useful link: Managing conda environments

conda create --prefix .\venv python=3.10

On macOS/Linux, write the path with a forward slash instead: ./venv

Step 8.

Activate the environment:

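On macOS/Linux:

source activate ./venv

On Windows:

activate .\venv

With recent versions of conda, conda activate followed by the same path should work on both.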
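To confirm that the activation worked, you can ask Python where it is running from. This is a minimal sketch: the helper `interpreter_is_inside` is made up for this example, and it assumes you start Python from the project folder that contains `venv`.

```python
import sys
from pathlib import Path


def interpreter_is_inside(env_dir: str) -> bool:
    """True if the running Python's prefix is the given environment folder."""
    return Path(sys.prefix).resolve() == Path(env_dir).resolve()


print("Interpreter:", sys.executable)
print("Environment prefix:", sys.prefix)
# Assumption: the current working directory is the project folder
print("Running from ./venv?", interpreter_is_inside("venv"))
```

If the last line prints False, the terminal is probably still using the default conda environment.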

Step 9.

What's different about this conda environment?

conda list

Step 10.

Create a new file called requirements.txt and paste the following there:

matplotlib==3.5.3 # version required for plotnine
plotnine==0.10.1 # Python version of ggplot2

numpy>=1.22
pandas==1.4.2
scikit-learn==1.1.3

### UTILS
jupyterlab==3.4.2
tqdm==4.62.0
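To make the structure of this file explicit, here is a rough sketch of how its lines break down into a package name, a comparison operator and a version. `==` pins one exact version, while `>=` accepts that version or anything newer. The parser is a simplification written for this example; it is not how pip itself reads the file, and it ignores lines without a version specifier.

```python
import re

REQUIREMENTS_TEXT = """\
matplotlib==3.5.3 # version required for plotnine
plotnine==0.10.1 # Python version of ggplot2

numpy>=1.22
pandas==1.4.2
scikit-learn==1.1.3

### UTILS
jupyterlab==3.4.2
tqdm==4.62.0
"""

# name, operator, version (a simplified pattern, assumed sufficient for this file)
REQUIREMENT = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(==|>=|<=|~=|!=|<|>)\s*([^\s#]+)")


def parse_requirements(text):
    """Return a list of (name, operator, version) tuples, ignoring comments."""
    pins = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop everything after '#'
        if not line:
            continue  # blank line or comment-only line
        match = REQUIREMENT.match(line)
        if match:
            pins.append(match.groups())
    return pins


for name, operator, version in parse_requirements(REQUIREMENTS_TEXT):
    print(f"{name:15} {operator} {version}")
```

Note that only numpy uses `>=`, so it is the one package whose installed version may legitimately differ between teammates.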

Step 11.

Try to install the packages with conda:

conda install --file requirements.txt

Why can't we install all packages?

Step 12.

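Install the packages with pip instead. First make sure pip is installed inside the environment, then check which pip your terminal picks up.

On macOS/Linux:

conda install pip
which pip

On Windows:

conda install pip
where.exe pip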

Then:

pip install -r requirements.txt
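After the install finishes, you may want to double-check that the exact pins were honoured. The sketch below uses the standard library's `importlib.metadata` to look up installed versions. The function `check_exact_pins` is made up for this example, and it only handles `==` pins (the numpy `>=` line is left out on purpose).

```python
from importlib import metadata


def check_exact_pins(pins, get_version=metadata.version):
    """Return {name: (wanted, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    `get_version` can be swapped out, which makes the function easy to test.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    # Exact pins copied from requirements.txt
    wanted = {
        "matplotlib": "3.5.3",
        "plotnine": "0.10.1",
        "pandas": "1.4.2",
        "scikit-learn": "1.1.3",
        "jupyterlab": "3.4.2",
        "tqdm": "4.62.0",
    }
    print(check_exact_pins(wanted))
```

An empty dictionary means every pinned package is installed at the requested version in the environment that Python is running from.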

Step 13.

What does the conda environment look like now?

conda list

☕ Coffee Break (10 min)

Use this time to chat, stretch, drink some coffee or just relax for a bit by yourself.

🖥️ Part II - Databases (45-50 min)

Databases: what are they, what is SQL, and how do you connect to a database directly through pandas? This content was originally planned for 🗓️ Week 08, but we did not have time for it.

Follow the steps

Step 14.

Download and install DB Browser for SQLite

Step 15.

Download this sample database, called the Chinook Database

Step 16.

Import the Chinook Database using DB Browser for SQLite
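Once the database file exists, the same data can also be reached from Python. The sketch below uses the standard library's `sqlite3` module. The helper names are made up for this example, and the demo runs on a tiny in-memory stand-in table rather than the real Chinook file, so the rows shown are placeholders.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(conn, table):
    """Count the rows of a table, refusing names that are not real tables."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Stand-in data so the sketch runs without the Chinook file.
# With the real file you would connect to its path instead of ":memory:".
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
demo.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("Placeholder Artist A",), ("Placeholder Artist B",)],
)
print(list_tables(demo))
print(count_rows(demo, "Artist"))
demo.close()
```

A connection like this can also be handed to pandas, for instance with `pandas.read_sql_query("SELECT * FROM Artist", conn)`, to get the query result as a DataFrame.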