🛣️ LSE DS202W 2025: Week 01 - Lab Roadmap

Introduction to Python - Understanding the Basics

Author

The DS202 Team

Published

20 Jan 2025

An introduction to programming in Python

Welcome to the first DS202 lab!

The main goal of this lab is to review some fundamental Python concepts that you’ll need throughout this course.

🥅 Learning Objectives

  • Configure your working environment for the course, including Python, VS Code, and Jupyter Notebooks.
  • Start using markdown to write your notes and code for lab activities.
  • Practice consulting and reading the documentation to understand how a function works.
  • Review fundamental Python concepts such as for loops, list comprehensions and functions.
About Using ChatGPT and Similar AI Tools

We’re not against you using AI assistants like ChatGPT in this course. However, for this particular lab, we recommend doing it on your own (searching online is fine, though). This task is your chance to see where you stand with your Python programming skills. If you use an AI assistant now, you won’t get a clear picture of your abilities.

If you get stuck, consider asking your instructor or classmates for help. You can also consult the documentation for the functions and packages you’re using, which is a valuable skill in this field. Don’t worry, you’ll have plenty of chances to use AI help in later labs.

Downloading the student notebook

Click the button below to download the student notebook.

If you run into problems setting up Visual Studio Code to run Python from within Quarto (ask your class teacher for help with that!), you can download the following notebook instead:

Importing libraries

Python makes excellent use of user-contributed libraries. Installing a library can be tricky and depends on what kind of computer you have. After installing Python, open Windows PowerShell or the macOS Terminal and try the following commands:

py --version
python --version
python3 --version

Whichever command prints the version of Python that is installed is the one you should use to install packages. For example, if py works, you install seaborn like so:

py -m pip install seaborn
Note

If you have installed Anaconda, you should use conda to do library installs rather than pip. That’s because conda comes bundled with Anaconda and handles Python library dependencies and conflicts better than pip does. Only use pip if the library is not available on a conda channel.

So instead of the command above, use the following command:

conda install seaborn

The previous command uses the default conda channel to install seaborn, but some packages are available on alternative conda channels, e.g. conda-forge or plotly.

To install a package that is available on an alternative conda channel, use this command:

conda install -c channel-name package-name

where channel-name is the name of the channel and package-name is the name of the package.

For example, if I wanted to install altair (another data visualisation package not used in this lab), I would use the command:

conda install -c conda-forge altair
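Once a package has been installed by either route, it can be useful to confirm that Python can actually find it. The following is a minimal sketch using only the standard library; is_installed is a hypothetical helper name introduced here for illustration, not part of the course materials.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Version of the interpreter that is running this code
print(sys.version_info[:3])

# Check a few of the packages mentioned in this lab
for name in ["numpy", "pandas", "seaborn"]:
    print(name, "installed:", is_installed(name))
```

If a package shows as not installed, the most likely cause is that it was installed for a different Python than the one running the notebook.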

We will import two central libraries, namely numpy and pandas. Python lets you give a library an alias, which you then use as a prefix when calling functions from that library. We use np for numpy and pd for pandas as these are the aliases commonly used in the community. As we will be making extensive use of lets_plot (which is Python’s equivalent to ggplot2 in R), we import all of its functions by using the * wildcard.

import numpy as np
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()

👉 NOTE: If we wanted to load specific functions rather than all of them, we could replace the * wildcard with the names of the functions required, separated by commas.
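As a rough illustration of importing specific names, the sketch below uses the standard library’s math module as a stand-in (an assumption made so the example runs without extra installs). The same pattern would apply to lets_plot, as shown in the comment.

```python
# Import only the names we need, separated by commas.
# math is used here purely as a stand-in for any library.
from math import pi, sqrt

# The same pattern for lets_plot would look like this:
# from lets_plot import ggplot, aes, geom_point

radius = 2
area = pi * radius**2   # pi is available directly, without a math. prefix
root = sqrt(16)

print(area)
print(root)
```

Importing specific names keeps it clear where each function comes from, at the cost of a longer import line.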

Lists

Lists are a core part of the Python language and are used to store collections of items.

We use [] to create a list of comma-separated values.

Typically, a single list will contain items that share a data type. We use = to assign four three-item lists to four objects: x, y, z and s.

x = [22, 30, 42]
y = [6.02214, 3.14159, 6.674]
z = [False, True, False]
s = ["Darth Vader", "Luke Skywalker", "Han Solo"]

print(x,y,z,s)

However, we can also store different data types such as integers, floats, booleans, and strings in a single list. w contains all four data types we’ve discussed.

w = [40, 5.2, False, "Big Foot"]
print(w)

To confirm this, we can use type on each item in w:

print(type(w[0]), type(w[1]), type(w[2]), type(w[3]))

To pull out certain items from a list, we can refer to their index. Python uses zero-based indexing, so in w the 1st element has index 0, the 2nd has index 1, the 3rd has index 2, and the 4th has index 3. If we want to find "Big Foot", this is how we do it in Python:

w[3]

To change elements in the list, we can reference the list and the element and set it equal to a new value.

print(s)
s[2] = "Princess Leia"
print(s)
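The indexing rules above can be sketched in a few lines. This example also shows negative indices and what happens when an index is out of range, which the text does not cover, so treat those parts as an optional extra.

```python
s = ["Darth Vader", "Luke Skywalker", "Han Solo"]

first = s[0]      # zero-based indexing: index 0 is the first item
last = s[-1]      # negative indices count back from the end
n_items = len(s)  # number of items in the list

# Lists are mutable, so assignment changes s in place
s[2] = "Princess Leia"
print(s)

# Valid indices are 0, 1 and 2, so index 3 raises an error
try:
    s[3]
except IndexError as err:
    print("IndexError:", err)
```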

Dictionaries

Another central data structure in Python is the dictionary, which expresses key / value pairs. Suppose we wanted to collect data on infant mortality rate (per 100,000) in Bangladesh, Pakistan and India - we can use dictionaries to collect this data.

We use {} to build dictionaries. Keys and values are separated by : and entries are separated by ,.

infant_mortality_dict = {"Bangladesh":20.755,"Pakistan":51,"India":26.619}
infant_mortality_dict

To find out the infant mortality rate of Pakistan, we simply reference the key to display the value like so:

infant_mortality_dict["Pakistan"]

If we want to create a new entry in our dictionary, we can partly rely on the same syntax we use to reference already existing entries, only this time, we set the key of the new entry equal to its value.

infant_mortality_dict["Sri Lanka"] = 5.6
infant_mortality_dict
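A few other dictionary operations tend to come up alongside the ones above. The sketch below uses the same values as the example; "Nepal" is a hypothetical missing key chosen only to show how lookups behave when a key is absent.

```python
infant_mortality_dict = {"Bangladesh": 20.755, "Pakistan": 51, "India": 26.619}

# Test whether a key exists before looking it up
has_india = "India" in infant_mortality_dict

# .get returns None (or a default you supply) instead of raising KeyError
nepal_rate = infant_mortality_dict.get("Nepal")

# Loop over key / value pairs together
for country, rate in infant_mortality_dict.items():
    print(country, rate)

# Keys keep the order in which they were inserted
countries = list(infant_mortality_dict.keys())
print(countries)
```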

Arrays

An array is another common data structure in Python, provided by the numpy library. Arrays are composed of a series of values of the same type, e.g. string, integer, float, boolean (i.e. True and False) or complex.

We employ np.array to turn the lists we created earlier into arrays.

👉 NOTE: Arrays are great as (a) they serve as more compact (and thus efficient) versions of lists and (b) we can use numpy to transform these arrays or calculate summary statistics.

Here are some examples of arrays:

x = np.array(x)
y = np.array(y)
z = np.array(z)
s = np.array(s)

print(x,y,z,s)

Try creating an array from w:

w = np.array(w)
w

🗣️ CLASSROOM DISCUSSION: What happened?

Remember that arrays can only contain values that share the same type. w contains an integer, a float, a boolean, and a string, so numpy has converted every element to a string, for the simple reason that you can express an integer, float, or boolean as a string but not, for example, an arbitrary string as a float.
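One way to see this conversion directly is to inspect each array’s dtype attribute. This is a small sketch; the variable names are chosen here for illustration.

```python
import numpy as np

ints = np.array([22, 30, 42])                    # all integers
mixed_numbers = np.array([40, 5.2])              # integer + float
mixed = np.array([40, 5.2, False, "Big Foot"])   # includes a string

# dtype reports the single type shared by every element
print(ints.dtype)
print(mixed_numbers.dtype)
print(mixed.dtype)

# Every element of mixed is now a string, including the boolean
print(mixed)
```

Mixing an integer with a float gives a float array, while adding a string forces everything to become a string.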

You can use arrays of integers or floats in arithmetic operations or in other computations. For example:

print(np.sort(y))

We can create new arrays by combining previously made arrays. We cannot add the elements of two lists together without using a for loop (more on this later), but because x and y are now numpy arrays, arithmetic is applied element by element.

v = (x**2)/2 - y*3 + 5
print(v)
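The difference between lists and arrays under + can be checked side by side. This sketch reuses the values of x and y from earlier under new names so that it runs on its own.

```python
import numpy as np

x_list = [22, 30, 42]
y_list = [6.02214, 3.14159, 6.674]

# With lists, + joins the two lists end to end
joined = x_list + y_list
print(len(joined))

# With arrays, + adds matching elements
x_arr = np.array(x_list)
y_arr = np.array(y_list)
summed = x_arr + y_arr
print(summed)

# The same element-by-element rule applies to the longer expression
v = (x_arr**2) / 2 - y_arr * 3 + 5
print(v)
```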

We can calculate any quantity of interest we like in Python. Let’s calculate the mean of y using the mean function.

y_mean = np.mean(y)
print(y_mean.round(2))

Note that when printing the average, we included .round(2). We (sneakily) used a method which rounded our average to 2 decimal places. print(y_mean) will give you the correct answer but with an unnecessary number of decimal places.

Sequences

Let’s create a sequence of numbers from 1 to 10.

a = np.arange(1,10,1)
print(a)

Where did 10 go? The np.arange function starts the sequence at the first number (inclusive) and stops one step before the second number (exclusive), so we need to set the stop value to 11.

a = np.arange(1,11,1)
print(a)
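The exclusive stop value works the same way for other step sizes and for Python’s built-in range. A short sketch, with variable names chosen for illustration:

```python
import numpy as np

# The step defaults to 1, so these two calls are equivalent
a = np.arange(1, 11)
a_explicit = np.arange(1, 11, 1)

# A step of 2: the stop value 10 is still excluded
evens = np.arange(0, 10, 2)

# The built-in range follows the same start/stop convention
same_as_range = list(range(1, 11))

print(a)
print(evens)
print(same_as_range)
```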

Suppose we want to create a sequence of numbers from -3 to 3 by increments of 0.5.

# Answer here

You can append the elements of two arrays! We create a list of two arrays and employ np.concatenate to do so. We can sort our new array using the sort method.

a_combined = np.concatenate([a,a])
print(a_combined)
a_combined.sort()
print(a_combined)
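Note that the sort method changes the array in place, whereas np.sort (used earlier) returns a sorted copy. The sketch below uses a shorter array than the one above so the difference is easy to read.

```python
import numpy as np

a = np.arange(1, 4)
a_combined = np.concatenate([a, a])
before = a_combined.copy()           # keep a copy of the unsorted values

# np.sort returns a new array and leaves a_combined untouched
sorted_copy = np.sort(a_combined)
print(a_combined, sorted_copy)

# The .sort() method sorts in place and returns None
result = a_combined.sort()
print(result, a_combined)
```

Assigning the result of a_combined.sort() to a variable is a common slip, since that variable ends up holding None.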

For loops

Suppose we want to create a list of strings which tells us the year from 2020 to 2024. This can be achieved using the following lines of code:

["The year is " + str(2020),
 "The year is " + str(2021),
 "The year is " + str(2022),
 "The year is " + str(2023),
 "The year is " + str(2024)]

👉 NOTE: We write "The year is " + str(2020) because + cannot concatenate objects of different types: the integer 2020 must first be converted to a string with str.
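To see why the conversion is needed, the sketch below triggers the error on purpose and then shows two ways around it. The f-string version is an alternative not used elsewhere in this lab.

```python
year = 2020

# Adding a string and an integer raises a TypeError
try:
    "The year is " + year
except TypeError as err:
    print("TypeError:", err)

# Option 1: convert the integer to a string first
with_str = "The year is " + str(year)

# Option 2: an f-string converts the value for us
with_fstring = f"The year is {year}"

print(with_str)
print(with_fstring)
```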

As you quickly see, this is rather tedious since you copy the same code chunk over and over again. Rather than doing this, you could use a for loop to write repetitive parts of code.

Using a for loop, the code above transforms into:

# Instantiate a new list
year_strings = []

# Create a range of years
years = range(2020,2025)

# Use the for loop to append items to the list
for year in years:
    year_strings.append("The year is " + str(year))

# Show the output
year_strings

The best way to understand this loop is as follows: for each year in the sequence 2020 to 2024, Python executes the code chunk year_strings.append("The year is " + str(year)). Once the for loop has executed the code chunk for every year in the range, the loop stops and Python moves on to the first instruction after the loop block.

List comprehensions

A more compact way of looping over elements in a list is by using list comprehensions.

["The year is " + str(year) for year in years]

👉 NOTE: The differences between for loops and list comprehensions may seem trivial with this example. A list comprehension still processes the elements one at a time rather than in parallel, but it builds the list in a single expression and usually runs somewhat faster than the equivalent for loop with append. True vectorisation, where an operation is applied to a whole collection at once, is what numpy arrays provide.
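The loop and the comprehension can be compared directly to confirm they build the same list. The final lines add an optional filter, which is an extra feature of comprehensions beyond what the text describes.

```python
years = range(2020, 2025)

# Version 1: for loop with append
loop_result = []
for year in years:
    loop_result.append("The year is " + str(year))

# Version 2: list comprehension
comprehension_result = ["The year is " + str(year) for year in years]

# Both approaches produce identical lists
print(loop_result == comprehension_result)

# A comprehension can also keep only the items that pass a condition
even_years = [year for year in years if year % 2 == 0]
print(even_years)
```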

Using functions and list comprehensions

Functions

In this lab, we’ve encountered and used quite a few different pre-made functions, but sometimes you need to write your own function, i.e. your own reusable set of instructions, to tackle your data.

A function is simply a code block that performs a specific task (which can be more or less complex), e.g. calculating a sum.

You should think of writing a function whenever you’ve copied and pasted a block of code more than twice.

You give your function a (meaningful) name (name_of_the_function), define its arguments (arguments), i.e. the parameters it needs to perform its task, and put some content in the function (function_content), which defines how the function should use those arguments to perform the task.
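The general shape described above might look like the following sketch. add_numbers is a hypothetical example chosen to match the "calculating a sum" task mentioned earlier.

```python
# General shape of a function definition:
#
# def name_of_the_function(arguments):
#     function_content
#     return result


def add_numbers(first, second):
    """Return the sum of two numbers."""
    total = first + second   # the function content
    return total             # the value handed back to the caller


print(add_numbers(2, 3))
```

The indented lines form the body of the function, and return marks the value that the function produces.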

Here is an example of how we can turn our "The year is " + str([year object]) expression into a function:


def year_string(year):
    return "The year is " + str(year)

Use the function inside of a list comprehension.

[year_string(year) for year in range(2020,2025)]

👉 NOTE: The combination of user-defined functions and list comprehensions is extremely powerful, as it is compact, readable and often slightly faster than an equivalent for loop. As a result, we are going to be using this combination a lot during this course!

👥 WORK IN PAIRS/GROUPS: Create a list of strings that tell us the infant mortality rate in a given country

  • Create an array of countries using countries = infant_mortality_dict.keys().
  • Define a function that builds the string for a given country, for example for Bangladesh:
    • The infant mortality rate is 20.755 per 100k in Bangladesh.
  • Apply the function to the countries created in the first step.
# Create an array of countries


# Define a function


# Use function in a list comprehension

Data frames

Another very common data structure is the data frame. Below, we have a very simple example of world population (in billions) over several decades, with data from the World Bank. In this case, we will:

  • Enter a list of values.
  • Create variable labels using a Python dictionary.
  • Convert this dictionary to a Pandas dataframe.
years = [1960, 1970, 1980, 1990, 2000, 2010, 2020]
pop = [3.03, 3.69, 4.44, 5.29, 6.14, 6.97, 7.82]

pop_dict = {"year": years, "population": pop}

pop_df = pd.DataFrame(pop_dict)
print(pop_df)

   year  population
0  1960        3.03
1  1970        3.69
2  1980        4.44
3  1990        5.29
4  2000        6.14
5  2010        6.97
6  2020        7.82
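Data frames like pop_df expose attributes and methods for inspecting and subsetting the data. The sketch below rebuilds the same data frame so that it runs on its own and shows a few common ones; the selection here is illustrative, not a list of what the course will use.

```python
import pandas as pd

pop_df = pd.DataFrame({
    "year": [1960, 1970, 1980, 1990, 2000, 2010, 2020],
    "population": [3.03, 3.69, 4.44, 5.29, 6.14, 6.97, 7.82],
})

# Attributes describe the data frame: (rows, columns) and column names
print(pop_df.shape)
print(pop_df.columns.tolist())

# Select one column and call a method on it
mean_pop = pop_df["population"].mean()
print(round(mean_pop, 2))

# Keep only the rows that satisfy a condition
recent = pop_df[pop_df["year"] >= 2000]
print(recent)
```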

Pandas data frames have their own methods and attributes that we will be leveraging for data cleaning and feature engineering. They can also be used to create useful graphs. While there are many graphical libraries in Python, we will be using lets_plot, a Python analogue to R’s ggplot2 package, because its syntax is very flexible. Furthermore, the syntax in both languages is nearly identical, which means you can transfer your graphing skills over to R easily.

(
    ggplot(pop_df, aes("year", "population"))
    + geom_point()
    + geom_line(linetype = "dashed")
    + scale_x_continuous(breaks=np.arange(1960,2030,10))
    + theme(panel_grid_major_x=element_blank())
    + labs(x = "Year", y = "Population (In Billions)", 
           title = "The world's population increased from 3 to 8 billion from 1960 to 2020!")
)

Here is the (rough) equivalent plot using seaborn and matplotlib.

# Import the relevant packages using their common aliases
import matplotlib.pyplot as plt
import seaborn as sns

# Create the plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=pop_df, x="year", y="population", linestyle="--", marker="o")

# Customize x-axis breaks
plt.xticks(np.arange(1960, 2030, 10))

# Add grid and remove major grid lines for x-axis
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.grid(axis="x", visible=False)

# Add labels and title
plt.xlabel("Year")
plt.ylabel("Population (In Billions)")
plt.title("The world's population increased from 3 to 8 billion from 1960 to 2020!")

# Show the plot
plt.show()

👉 NOTE: We used ChatGPT to translate the above code. Generative AI tools are pretty incredible in and of themselves and can serve as a game changer for coding. However, we recommend that you use them judiciously - we want you to be able to understand all the code you write!