LSE DS202W 2025: Week 01 - Lab Roadmap
Introduction to Python - Understanding the Basics
An introduction to programming in Python
Welcome to the first DS202 lab!
The main goal of this lab is to review some fundamental Python concepts that you'll need throughout this course.
Learning Objectives
- Configure your working environment for the course, including Python, VS Code, and Jupyter Notebooks.
- Start using markdown to write your notes and code for lab activities.
- Practice consulting and reading the documentation to understand how a function works.
- Review fundamental Python concepts such as `for` loops, list comprehensions and functions.
We're not against you using AI assistants like ChatGPT in this course. However, for this particular lab, we recommend doing it on your own (searching online is fine, though). This task is your chance to see where you stand with your Python programming skills. If you use an AI assistant now, you won't get a clear picture of your abilities.
If you get stuck, consider asking your instructor or classmates for help. You can also consult the documentation for the functions and packages you're using, which is a valuable skill in this field. Don't worry, you'll have plenty of chances to use AI help in later labs.
Downloading the student notebook
Click the button below to download the student notebook.
If you have trouble setting up Visual Studio Code to run Python from within Quarto (ask your class teacher for help with that!), you can download the following notebook instead:
Importing libraries
Python makes excellent use of user-contributed libraries. Installing a library can be tricky and can depend on what kind of computer you have. After installing Python, open Windows PowerShell or the macOS Terminal and try each of the following commands:
py --version
python --version
python3 --version
Whichever command prints the version of Python that is installed is the one you should use to install packages. For example, if `py` works, you install `seaborn` like so:
py -m pip install seaborn
If you have installed Anaconda, you should use `conda` to install libraries rather than `pip`. That's because `conda` comes bundled with Anaconda and handles Python library dependencies and conflicts better than `pip` does. Only use `pip` if the library is not available on a `conda` channel.
So instead of the command above, use the following command:
conda install seaborn
The previous command uses the default `conda` channel to install `seaborn`, but some packages are available on alternative `conda` channels, e.g. `conda-forge` or `plotly`.
To install a package that is available on an alternative `conda` channel, use this command:
conda install -c channel-name package-name
where `channel-name` is the name of the channel and `package-name` is the name of the package. For example, to install `altair` (another data visualisation package not used in this lab), you would use the command:
conda install -c conda-forge altair
We will import two central libraries, namely `numpy` and `pandas`. Python lets you give a library an alias when you import it, so you can refer to the library by a shorter name when calling its functions. We use `np` for `numpy` and `pd` for `pandas` as they are the commonly used aliases in the community. As we will be making extensive use of `lets_plot` (which is Python's equivalent to `ggplot2` in R), we import all of its functions by using the `*` placeholder.
import numpy as np
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()
NOTE: If we wanted to load specific functions as opposed to all functions, we can replace the `*` placeholder with the names of the functions required, separated by commas.
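As a minimal sketch of importing specific names, the example below uses the standard library's `math` module so that it runs without installing anything extra. The same pattern should carry over to `lets_plot`, for instance `from lets_plot import ggplot, aes, geom_point`, although the names you need depend on the plot you are building.

```python
# Import only the names we need instead of everything with *
from math import pi, sqrt

# The imported names are available directly, without a module prefix
radius = sqrt(16)
area = pi * radius ** 2  # area of a circle with that radius

print(radius, round(area, 2))
```

Listing the names explicitly makes it easier to see where each function in your code comes from.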
Lists
Lists are a core part of the Python language and are used to store collections of items.
We use `[]` to create a list of comma-separated values.
Typically, a single list will contain items that share a data type. We use `=` to assign four three-item lists to four respective objects, `x`, `y`, `z` and `s`.
x = [22, 30, 42]
y = [6.02214, 3.14159, 6.674]
z = [False, True, False]
s = ["Darth Vader", "Luke Skywalker", "Han Solo"]

print(x, y, z, s)
However, we can also store different data types such as integers, floats, booleans, and strings in a single list. `w` contains all four data types we've discussed.
w = [40, 5.2, False, "Big Foot"]
print(w)
To confirm this, we can use `type` on each item in `w`:
print(type(w[0]), type(w[1]), type(w[2]), type(w[3]))
To pull out certain items in a list, we can refer to their index. Python uses zero-based indexing, so in `w` the 1st element has index 0, the 2nd index 1, the 3rd index 2, and the 4th index 3. If we want to find "Big Foot", this is how we do it in Python:
w[3]
To change elements in the list, we can reference the list and the element and set it equal to a new value.
print(s)
s[2] = "Princess Leia"
print(s)
Dictionaries
Another central data structure in Python is the dictionary, which expresses key / value pairs. Suppose we wanted to collect data on infant mortality rate (per 100,000) in Bangladesh, Pakistan and India - we can use dictionaries to collect this data.
We use `{}` to build dictionaries. Keys and values are separated by `:` and entries are separated by `,`.
infant_mortality_dict = {"Bangladesh": 20.755, "Pakistan": 51, "India": 26.619}
infant_mortality_dict
To find out the infant mortality rate of Pakistan, we simply reference the key to display the value like so:
infant_mortality_dict["Pakistan"]
If we want to create a new entry in our dictionary, we can partly rely on the same syntax we use to reference already existing entries, only this time, we set the key of the new entry equal to its value.
infant_mortality_dict["Sri Lanka"] = 5.6
infant_mortality_dict
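The dictionary operations above can be sketched in one self-contained block. The `.get()` and `.items()` calls are additions for illustration, and "Nepal" is a hypothetical missing key chosen only to show what happens when a key is absent.

```python
# Same figures as in the lab text
infant_mortality_dict = {"Bangladesh": 20.755, "Pakistan": 51, "India": 26.619}

# Add a new entry by assigning to a key that does not exist yet
infant_mortality_dict["Sri Lanka"] = 5.6

# .get() returns a fallback value instead of raising KeyError for a missing key
nepal_rate = infant_mortality_dict.get("Nepal", None)

# .items() yields (key, value) pairs, which is handy inside a loop
for country, rate in infant_mortality_dict.items():
    print(country, rate)

print(len(infant_mortality_dict), nepal_rate)
```

Referencing a missing key with square brackets would raise a `KeyError`, so `.get()` can be a safer choice when you are not sure a key exists.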
Arrays
An array is another common data structure in Python. Arrays are composed of a series of values of the same type, e.g. string, integer, float, boolean (i.e. `True` and `False`), complex or raw bytes.
We employ `np.array` to turn the lists we created earlier into arrays.
NOTE: Arrays are great as (a) they serve as more compact (and thus more efficient) versions of lists and (b) we can use `numpy` to transform these arrays or calculate summary statistics.
Here are some examples of arrays:
x = np.array(x)
y = np.array(y)
z = np.array(z)
s = np.array(s)

print(x, y, z, s)
Try creating an array from `w`:
w = np.array(w)
w
CLASSROOM DISCUSSION: What happened?
Remember that arrays can only contain values that share the same type. `w` contains integers, floats, booleans, and strings, so NumPy has converted every element to a string, for the simple reason that you can express an integer, float, or boolean as a string but not, for example, an arbitrary string as a float.
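To see this conversion directly, here is a small sketch that inspects the resulting `dtype`. The variable names `w_array` and `numeric_array` are chosen for illustration and do not appear in the lab notebook.

```python
import numpy as np

# A mixed list: integer, float, boolean and string
w = [40, 5.2, False, "Big Foot"]
w_array = np.array(w)

# Every element is now a string; dtype.kind "U" means Unicode text
print(w_array, w_array.dtype.kind)

# Without the string, NumPy settles on float instead
numeric_array = np.array([40, 5.2, False])
print(numeric_array, numeric_array.dtype)
```

In both cases NumPy picks a single type that can represent every element of the input.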
You can use arrays of integers or numerics in arithmetic operations or in computations. For example:
print(np.sort(y))
We can create new vectors by combining previously made vectors. We cannot add the elements of two lists together without using a for loop (more on this later), which is why we converted `x` and `y` into arrays using `numpy`: arithmetic on arrays is applied element by element.
v = (x**2)/2 - y*3 + 5
print(v)
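The contrast between lists and arrays can be sketched as follows, using the same values as `x` and `y` above. The names `x_list`, `y_list` and `joined` are introduced here only for illustration.

```python
import numpy as np

x_list = [22, 30, 42]
y_list = [6.02214, 3.14159, 6.674]

# With plain lists, + joins the two lists end to end
joined = x_list + y_list  # six items, no arithmetic performed

# With arrays, operators work element by element
x = np.array(x_list)
y = np.array(y_list)
v = (x**2) / 2 - y * 3 + 5

print(joined)
print(v)
```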
We can calculate any quantity of interest we like in Python. Let's calculate the mean of `y` using the `mean` function.
y_mean = np.mean(y)
print(y_mean.round(2))
Note that when calculating the average, we included `.round(2)`. We (sneakily) used a method which rounded our average to 2 decimal places. `print(y_mean)` will give you the correct answer but with an unnecessary number of decimal places.
Sequences
Let's create a sequence of numbers from 1 to 10.
a = np.arange(1,10,1)
print(a)
Where did 10 go? `np.arange` starts the sequence at the first number (inclusive) and stops one step before the second number, which is excluded.
a = np.arange(1,11,1)
print(a)
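As a brief sketch of the same rule with a fractional step, the example below shows that the stop value is excluded there too. The names `one_to_ten` and `quarters` are illustrative.

```python
import numpy as np

# The stop value is excluded, so this ends at 10
one_to_ten = np.arange(1, 11, 1)

# A fractional step works the same way: the stop value 1 is excluded
quarters = np.arange(0, 1, 0.25)

print(one_to_ten)
print(quarters)
```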
Suppose we want to create a sequence of numbers from -3 to 3 by increments of 0.5.
# Answer here
You can append the elements of two arrays! We create a list of two arrays and employ `np.concatenate` to do so. We can sort our new array using the `sort` method.
a_combined = np.concatenate([a,a])
print(a_combined)
a_combined.sort()
print(a_combined)
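This lab uses both the `np.sort` function and the `.sort()` method, and the two behave slightly differently. The sketch below, on a shorter array than the one in the lab, illustrates the difference. The names `sorted_copy`, `a_in_place` and `result` are illustrative.

```python
import numpy as np

a = np.arange(1, 4)
a_combined = np.concatenate([a, a])

# np.sort returns a sorted copy and leaves the original untouched
sorted_copy = np.sort(a_combined)

# The .sort() method sorts the array in place and returns None
a_in_place = a_combined.copy()
result = a_in_place.sort()

print(a_combined, sorted_copy, a_in_place, result)
```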
For loops
Suppose we want to create a list of strings which tells us the year from 2020 to 2024. This can be achieved using the following lines of code:
["The year is " + str(2020),
 "The year is " + str(2021),
 "The year is " + str(2022),
 "The year is " + str(2023),
 "The year is " + str(2024)]
NOTE: We write `"The year is " + str(2020)` because `+` cannot concatenate two objects of different types, here a string and an integer. `str()` converts the number to a string first.
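A short sketch of what goes wrong without `str()`, and how the conversion fixes it. The `try`/`except` block is used here only so the example runs to the end. The names `message` and `fixed_message` are illustrative.

```python
# Joining a string and an integer directly raises a TypeError
try:
    message = "The year is " + 2020
except TypeError as error:
    message = None
    print("TypeError:", error)

# Converting the number with str() first makes the concatenation work
fixed_message = "The year is " + str(2020)
print(fixed_message)
```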
As you can quickly see, this is rather tedious since you copy the same code chunk over and over again. Rather than doing this, you could use a `for` loop to handle the repetitive parts of the code.
Using a `for` loop, the code above transforms into:
# Instantiate a new list
year_strings = []

# Create a range of years
years = range(2020, 2025)

# Use the for loop to append items to the list
for year in years:
    year_strings.append("The year is " + str(year))

# Show the output
year_strings
The best way to understand this loop is as follows: "For each year that is in the sequence 2020 to 2024, you execute the code chunk `year_strings.append("The year is " + str(year))`." Once the `for` loop has executed the code chunk for every year in the sequence, the loop stops and Python moves on to the first instruction after the loop block.
List comprehensions
A more compact way of looping over elements in a list is by using list comprehensions.
["The year is " + str(year) for year in years]
NOTE: The differences between `for` loops and list comprehensions may seem trivial with this example, but list comprehensions are more concise and are usually somewhat faster than an equivalent `for` loop that calls `append`, because they avoid the overhead of a method call on every iteration. They still process one element at a time, though: they are not vectorised or run in parallel in the way that `numpy` array operations are.
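To make the relationship concrete, the sketch below builds the same list both ways and checks that the results match. The names `loop_result` and `comprehension_result` are illustrative.

```python
years = range(2020, 2025)

# Version 1: for loop with append
loop_result = []
for year in years:
    loop_result.append("The year is " + str(year))

# Version 2: list comprehension
comprehension_result = ["The year is " + str(year) for year in years]

# Both approaches build exactly the same list
print(loop_result == comprehension_result)
```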
Using functions and list comprehensions
Functions
In this lab, we've encountered and used quite a few different pre-made functions, but sometimes you just need to write your own function, i.e. your own set of reproducible instructions, to tackle your data.
A function is simply a code block that performs a specific task (which can be more or less complex), such as calculating a sum.
You should think of writing a function whenever you've copied and pasted a block of code more than twice.
You give your function a (meaningful) name (`name_of_the_function`) and define its arguments (`arguments`), i.e. the parameters it needs to perform the task it is supposed to perform. You then define how the function should deal with those arguments to perform its task in the function body (`function_content`).
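The general shape described above can be sketched as follows. The function `add_numbers` and its arguments `a` and `b` are hypothetical names chosen for illustration, picking up the "calculating a sum" example.

```python
# General shape:
# def name_of_the_function(arguments):
#     function_content

def add_numbers(a, b):
    """Return the sum of two numbers."""
    total = a + b
    return total

print(add_numbers(2, 3))
```

The indented lines form the function body, and `return` hands the result back to whoever called the function.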
Here is an example of how we can turn our `"The year is " + str(year)` expression into a function:
def year_string(year):
    return "The year is " + str(year)
Use the function inside of a list comprehension.
[year_string(year) for year in range(2020,2025)]
NOTE: The combination of user-defined functions and list comprehensions is extremely powerful as it is compact and typically a little faster than building the same list with a `for` loop. As a result, we are going to be using this combination a lot during this course!
WORK IN PAIRS/GROUPS: Create a list of strings that tell us the infant mortality rate in a given country
- Create a collection of countries using `countries = infant_mortality_dict.keys()`.
- Define a function that builds the string for a given country. For example, for Bangladesh:
  - The infant mortality rate is 20.755 per 100k in Bangladesh.
- Apply the function to the collection of countries created in the first step.
# Create an array of countries
# Define a function
# Use function in a list comprehension
Data frames
Another very common data type is a data frame. Below, we have a very simple example of world population over several decades with data from the World Bank. In this case, we will:
- Enter a list of values.
- Create variable labels using a Python dictionary.
- Convert this dictionary to a Pandas dataframe.
years = [1960, 1970, 1980, 1990, 2000, 2010, 2020]
pop = [3.03, 3.69, 4.44, 5.29, 6.14, 6.97, 7.82]

pop_dict = {"year": years, "population": pop}

pop_df = pd.DataFrame(pop_dict)
print(pop_df)
   year  population
0  1960        3.03
1  1970        3.69
2  1980        4.44
3  1990        5.29
4  2000        6.14
5  2010        6.97
6  2020        7.82
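With the data frame created, a few commonly used attributes and methods can be sketched as follows. This is only an illustration of the kind of operations pandas offers, and the column name `change_since_last` is hypothetical.

```python
import pandas as pd

pop_df = pd.DataFrame({
    "year": [1960, 1970, 1980, 1990, 2000, 2010, 2020],
    "population": [3.03, 3.69, 4.44, 5.29, 6.14, 6.97, 7.82],
})

# Attributes describe the data frame
n_rows, n_cols = pop_df.shape

# Methods compute something from it
largest_population = pop_df["population"].max()

# New columns can be derived from existing ones
# (change_since_last is a hypothetical column name used for illustration)
pop_df["change_since_last"] = pop_df["population"].diff()

print(n_rows, n_cols, largest_population)
print(pop_df.head(3))
```

The first row of the new column is missing (`NaN`) because there is no earlier decade to compare against.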
Pandas data frames have their own methods and attributes that we will be leveraging for data cleaning and feature engineering. They can also be used to create useful graphs. While there are many graphical libraries in Python, we will be using `lets_plot`, which provides a Python analogue to R's `ggplot2` package and shares its flexible, layered syntax. Furthermore, the syntax in both programming languages is nearly identical, which means you can transfer your graphing skills over to R easily.
(
    ggplot(pop_df, aes("year", "population"))
    + geom_point()
    + geom_line(linetype="dashed")
    + scale_x_continuous(breaks=np.arange(1960, 2030, 10))
    + theme(panel_grid_major_x=element_blank())
    + labs(x="Year", y="Population (In Billions)",
           title="The world's population increased from 3 to 8 billion from 1960 to 2020!")
)
Here is the (rough) equivalent plot using `seaborn` and `matplotlib`.
# Import the relevant packages using their common aliases
import matplotlib.pyplot as plt
import seaborn as sns

# Create the plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=pop_df, x="year", y="population", linestyle="--", marker="o")

# Customize x-axis breaks
plt.xticks(np.arange(1960, 2030, 10))

# Add grid and remove major grid lines for x-axis
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.grid(axis="x", visible=False)

# Add labels and title
plt.xlabel("Year")
plt.ylabel("Population (In Billions)")
plt.title("The world's population increased from 3 to 8 billion from 1960 to 2020!")

# Show the plot
plt.show()
NOTE: We used ChatGPT to translate the `lets_plot` code above into `seaborn` and `matplotlib`. Generative AI tools are pretty incredible in and of themselves and can serve as a game changer for coding. However, we recommend that you use them judiciously - we want you to be able to understand all the code you write!