🗓️ Week 01
Welcome to the course

LSE DS202 – Data Science for Social Scientists

24 Jan 2025

Who we are

Your lecturer

Photo of Ghita Berrada
Dr. Ghita Berrada
Assist. Prof. (Education)
LSE Data Science Institute
📧 E-mail
lecturer
course convenor
  • PhD in Computer Science (University of Twente, Netherlands)
  • Background: Engineering, Databases, Health Informatics, ML for cybersecurity
  • Formerly Research Associate at King’s College London and the University of Edinburgh (School of Informatics)

decision support systems
machine learning applications
databases
provenance
ethical AI/XAI

Teaching Assistants

Photo of Tabtim Duenger
Tabtim Duenger
Data Scientist
The Economist
MSc in Applied Social Data Science (LSE)
📧 E-mail
guest teacher
Photo of Andreas Stöffelbauer
Andreas Stöffelbauer
Data Scientist
Microsoft
MSc in Data Science (LSE)
📧 E-mail
guest teacher
Photo of Stuart Bramwell
Dr Stuart Bramwell
Guest Lecturer
Data Science Institute
DPhil in Politics (Oxford University)
📧 E-mail
guest teacher
Photo of Yassine Lahna
Yassine Lahna
Data Scientist
MSc in Statistical Science (Oxford University)
📧 E-mail
guest teacher

Support Sessions

Photo of Sara Luxmoore
Sara Luxmoore
Research Officer
LSE Data Science Institute and LSE Cities
📧 E-mail
  • 🦸🏻‍♀️ Runs weekly drop-in sessions for all DSI courses!

DS202W Weekly Drop-in sessions:

  • Typically every Wednesday from 10.00 to 11.30am in COL.1.06 (Visualisation studio), but check announcements and calendar invites for updates.

Administrative Support

Photo of Kevin Kittoe
Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

  • if you cannot find the lecture recording on Moodle
  • when you need an extension for an assignment
    (👉 check LSE’s extension policy)
  • to request a class group change
    (you will be asked to provide a reason for this)
  • to inform us of any other issues that may affect your studies

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • DSI is the hub for LSE’s interdisciplinary collaboration in data science
  • ⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Latest events:

🗓️ Data Science across industries (03 December 2024- 4.00 to 5.30pm)

Machine learning is transforming large parts of the economy, and data scientists have the opportunity to apply their skills in an incredibly broad variety of domains. The technical field is progressing rapidly and professional roles are continuously developing as companies navigate successive waves of technological and economic change. Data scientists must therefore craft skill paths which balance focus on rapid learning with capabilities complementing their domain, organisations and wider industry.

Drawing on his experience from startups, consulting and tech, Christian Svalesen, Senior Machine Learning Engineer at SoundCloud will provide insights into what data science roles and projects can involve across industries. He will share advice on how students can prepare and develop through their professional journey.

🗓️ Navigating Data Science from Academia to Media, and Beyond (23 October 2024 - 4.30 to 6pm)

With the rise in adoption of AI/ML technologies and the increasing demand for data-driven decision-making, data science has become a vital component across many industries, including media. As data science transforms the media landscape - enhancing content personalisation, optimising conversion strategies, and improving audience engagement, it is also becoming an increasingly popular tool for addressing complex business challenges. Navigating a role in this field can be both exciting and challenging.

Tabtim Duenger, Senior Data Scientist at The Economist, and Riya Chhikara, Data Scientist at The Economist, both LSE graduates, will offer insights into their paths to entering the field of data science. They will discuss their experiences in landing their first roles, negotiating their functions and responsibilities within the media sector, and how they use these experiences and networks to continue guiding their careers.

Industry “field trips”

Visit at Lloyds (2023)

Who are you?

Programme Freq
General Course 14
BSc in Economics 11
BSc in Politics and Data Science 6
BSc in Psychological and Behavioural Science 3
BSc in Philosophy, Politics and Economics 2
BSc in Economics and Economic History 1
BSc in International Social and Public Policy and Economics 1
BSc in Politics and Economics 1
BSc in Politics and International Relations 1

Year Count
1 17
2 10
3 11
4 2

Who are you? (cont.)

What is this course about?

Course Brief

What is this course about?

  • Focus: learn and understand the most fundamental machine learning algorithms
  • No neural networks, no deep learning, no large-scale data
  • How: practical use of machine learning techniques and their metrics, applied to relevant data sets
  • Some but not a lot of theory, math proofs and derivations
  • Lots of coding, examples and exercises

🎯 Learning Objectives

  • Understand the fundamentals of the data science approach, with an emphasis on social scientific analysis and the study of the social, political, and economic worlds;
  • Understand how classical methods such as regression analysis or principal components analysis can be treated as machine learning approaches for prediction or data mining.
  • Know how to fit and apply supervised machine learning models for classification and prediction.
  • Know how to evaluate and compare fitted models, and improve model performance.
  • Use applied computer programming, including the hands-on use of programming through course exercises.
  • Apply the methods learned to real data through hands-on exercises.
  • Integrate the insights from data analytics into knowledge generation and decision-making.
  • Understand an introductory framework for working with natural language (text) data using techniques of machine learning.
  • Learn how data science methods have been applied to a particular domain of study (applications).

📚 Course Structure

  • How will this course be taught?

  • How do I prepare for this course?

🧑🏻‍💻 Labs (90 min each week)

  • Purpose: introduce new concepts and tools which will only be explored in more detail in the lectures
    • Why? So you can come to the lectures with good questions!
  • Typically:
    • your class teacher might give you some context about the new tools/algorithms
    • you will be given time to work on something by yourself
    • there will be moments to share your interpretation of the results of algorithms with the classroom
  • You have to attend the lab you are enrolled in. You can’t switch on the day

Important

There might be some preparatory work to do before each lab!

Always check Moodle/the webpage at least a day before coming to the lab.

More about 🧑🏻‍💻 Labs (90 min each week)

Each week, you will have a roadmap of what to do.

The roadmap will typically contain the following elements:

Type of activity Description
🧑🏻‍🏫 TEACHING MOMENT Your class teacher deserves your full attention
🎯 ACTION POINTS Time to follow the steps in the roadmap. Try it for a bit, but if you get stuck, call your class teacher.
👥 IN PAIRS/GROUPS You will benefit from completing that task with your peers more than doing it alone
🗣️ CLASSROOM DISCUSSION Your class teacher will facilitate a discussion about the task
📝 SUBMISSION Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

👩🏻‍🏫 Lectures (2 hours per week)

  • The first sessions will have slides, but mostly, it will be live coding
  • Feel free to code along with your lecturer
  • Pair/group exercises and discussions to interpret results
  • Bring a laptop if you can! (💡 you can borrow one from the library)
  • Recorded sessions will be available on Moodle on the next working day

Programming

  • Programming Language: Python
  • Integrated Development Environment (IDE) option: VS Code
    • VS Code is a general-purpose IDE that works well with many programming languages. The flip side is that it requires a bit more configuration.

Pre-requisites and assumptions

We assume that you have some basic knowledge of:

  1. Descriptive Statistics
     • If you took ST102, you should be fine.
  2. Some linear algebra
     • Nothing crazy, mostly matrix operations (simpler than MA107)
  3. Programming
     • It’s ok if you are new to Python, but do reserve some extra hours in the first weeks to practice the basics.

Teaching Philosophy


  • My teaching approach is grounded in empiricism.
  • I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
  • In summary: learning by doing (or, said more bluntly 😂, learning by trial and error) serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?


Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

  • Frequently, we will present you with tasks that involve new concepts before diving into the corresponding theory or background knowledge.
    • For example, I might ask you to consult the pandas or scikit-learn or numpy or Python documentation instead of explaining it directly.
  • Reasoning: letting your ‘struggles’ guide the learning process.
  • 👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
    • You are very much encouraged to help and learn from each other.
  • If this course is too easy for you, try to apply its concepts to your own data sets or to more complex problems and bring us your questions.
  • If you feel this teaching style is not working, drop us an e-mail or discuss it during office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

There are three official positions at LSE:

Position 1: No authorised use of generative AI in assessment. (Unless your Department or course convenor indicates otherwise, the use of AI tools for grammar and spell-checking is not included in the full prohibition under Position 1.)

Position 2: Limited authorised use of generative AI in assessment.

Position 3: Full authorised use of generative AI in assessment.
👉 This is the position we adopt in this course

Source: School position on generative AI, LSE Website, September 2024

Our policy in this course

  • You can use AI tools during lectures, labs, and for your assignments.
    • Except when the lecturer or class teachers expressly ask you not to use them.
  • When using AI tools for assignments, you must acknowledge their use and tell us how you used them.
    • Examples:

      I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.

      I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

  • We see many students using ChatGPT during lectures, labs, and assessments.
  • Frankly, most university instructors are clueless as to whether this is helping or hindering your learning.
  • So we did some research to try to figure out:
    • How are students using generative AI tools in their studies?
    • What are the benefits and drawbacks of using generative AI tools?

Participating Courses:

  • DS105W (Data for Data Science)
  • DS202W (Data Science for Social Scientists)
  • ST456 (Deep Learning)
  • PP422 (Data Science for Public Policy)

The GENIAL project

You can read more about the GENIAL project on the project page.

What we have learned so far:

We haven’t fully analysed the data yet (lots of it!⛰️) but here’s what we can say for now about the good and bad aspects of using generative AI tools in education:

  • Good: The students who made the most resourceful use of GenAI remained in control of their learning. They often gave the chatbots a lot of context (“I want to perform web scraping of this website with the library scrapy, the code must contain functions – no classes – and I want to save the data in a CSV file.”) and would always check the code/output generated by GenAI against the course materials or reputable sources. They were able to identify when the AI was suggesting something that was not correct or not following best practices and would never blindly accept the AI’s suggestions.

  • Bad: If you don’t master a subject, GenAI can make you feel like you do. This pattern was frequent, for example, among students who had gaps in their understanding of programming concepts. They would ask the AI to generate code for them, and the AI would produce code that seemed to work but that generated the incorrect response or was so complex, it was virtually impossible to edit.

Read more about it in our preprint:

Dorottya Sallai, Jonathan Cardoso-Silva, Marcos E. Barreto, Francesca Panero, Ghita Berrada, and Sara Luxmoore. “Approach Generative AI Tools Proactively or Risk Bypassing the Learning Process in Higher Education”, LSE Public Policy Review, 3(3), p. 7, 2024.

☕️ Time for a break

Image created with DALL·E2. Prompt: “Cat drinking tea in a classroom, Renoir style.”

Our first proper lecture will start in a few minutes.

What really is data science? + Python tips

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The academic possibilities

  • Humans and machines nowadays generate A LOT of data ALL THE TIME
  • It has become very cheap to collect and store this data
  • This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

  • How do rumours spread?
  • How can we predict unemployment rates accurately?

New questions enabled by new data/new technologies:

  • Is social media a threat to democracy/public order?
  • Is generative AI a threat to the job market?

We hope that in this reformulated version of the DS202 course, you will learn how to tackle similar questions that are relevant to your field of study.

You might ask:

“How is data science any different from what I have learned in other stats courses?”

Data Science and Social Science

👉 Traditional Statistics in the social sciences: the goal is typically explanation

👉 Data science: the focus is frequently put more on data exploration and prediction

  • Data science is heavily influenced by computer science and engineering
  • There is a strong emphasis on computational efficiency and scalability (due to big data)
  • Many of the algorithms and methods you will learn in this course can be used in both contexts (explanation vs prediction)
    • We will try to highlight the differences in these approaches throughout the course

The Data Science Workflow

Workflow diagram: Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

It is often said that 80% of the time and effort spent on a data science project goes to the data preparation stages of the workflow above (gathering, storing, cleaning and building the dataset).

This course is mostly about the ‘20%’ stage. Most of the data we will give you is already clean and ready to be modeled with machine learning.



Next week, we will discuss together what it means for a machine to learn something.


But first, a word about programming skills 👉

Let’s get more technical

  • Python vs R

Python vs R

A few stats

  • Python ranked number 1 in the TIOBE Programming Community index of January 2025 (rating of 23.28%, a change of +9.32% compared to January 2024), whereas R ranked number 18 (rating of 1.00%, a change of +0.27%).
  • Python tops two of the three IEEE Spectrum 2024 rankings of programming languages: the “Spectrum” ranking (languages in active use among typical IEEE members and working software engineers) and the “Trending” ranking (languages that are in the zeitgeist). It comes second in the “Jobs” ranking (languages that are in demand by employers). R sits at rank 20 in the “Spectrum” ranking, rank 17 in the “Trending” one and rank 21 in the “Jobs” ranking (see details about the rankings here and the rankings’ methodology here).
  • The PYPL PopularitY of Programming Language index, created by analyzing how often language tutorials are searched on Google, also has Python at the top in January 2025 (share of 29.8%, a +1.7% increase compared to last year). R ranks at number 6 (share of 4.63%, no increase since last year).

Python vs R

Python
  • Python is a general-purpose programming language
  • It is used for web development, scientific computing, data science, advanced machine learning tools (deep learning), etc.
R
  • R is more niche. It is a programming language created for statistical computing
  • You can do many other things with R, but it is mostly used for statistics and general data science (except for heavy Machine Learning)

Some Python basics

Data types

  • In R, you assign a variable using the operator <- :
var <- 2
  • Some basic data types:
var <- "value" # A string. Single quotes are OK too

var <- 2.2     # A double (aka numeric)
var <- 2       # Also a double! 😱

# Want an integer? You have to be explicit:
var <- as.integer(2)

  • Whereas in Python, assignments are done with = :
var = 2
  • The Python equivalent:
var = "value" # A string. Single quotes are OK too
# Python also has triple-quoted strings (triple double quotes!!!), which can span several lines; the line breaks are kept in the string
var = """I want to write a
         sentence without caring for line breaks"""

var = 2.2     # A float
var = 2       # An int (🏅)
var = float(2) # A float

In Python, less is more! Always be explicit when using the greedier data types…
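To check these types yourself, you can use the built-in type function. The snippet below is only a small sketch; the variable names are made up for illustration and do not come from the course materials.

```python
# Illustrative variable names (assumptions, not from the slides)
an_int = 2
a_float = 2.2
converted = float(an_int)   # explicit conversion from int to float
a_string = "value"

print(type(an_int).__name__)     # int
print(type(a_float).__name__)    # float
print(type(converted).__name__)  # float
print(type(a_string).__name__)   # str

# Mixing an int and a float in arithmetic gives a float
total = an_int + a_float
print(type(total).__name__)      # float
```

Unlike in R, a bare 2 stays an integer in Python until you convert it or combine it with a float.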

Python basics

Python lists

  • We can put basic data types (i.e. strings, integers, floats) in collections of data (e.g. lists, dictionaries, tuples)
# I am creating a list of integers
l = [1, 2, 3, 4]
l

returns:

[1, 2, 3, 4]

You could use the append method to add elements to a list:

l.append(5) # You're adding an element to the list l in-place
l

returns:

[1, 2, 3, 4, 5]

You could also use the extend method to add several elements in one go:

l.extend([6,7,8]) # You're adding elements to the list l in-place
l

returns:

[1, 2, 3, 4, 5, 6, 7, 8]

Yet another way to add elements to a list is as follows:

l2=[5,7,9] #defining a list as usual
l2+=[8,0,4] #adding elements to the list
l2

returns:

[5, 7, 9, 8, 0, 4]
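A common source of confusion is what happens when you pass a list to append rather than extend. The sketch below (with made-up list names a, b and c) contrasts the three approaches shown above:

```python
a = [1, 2, 3]
a.append([4, 5])      # the whole list becomes ONE new (nested) element
print(a)              # [1, 2, 3, [4, 5]]

b = [1, 2, 3]
b.extend([4, 5])      # each item is added separately
print(b)              # [1, 2, 3, 4, 5]

c = [1, 2, 3]
c += [4, 5]           # for lists, += behaves like extend (in-place)
print(c)              # [1, 2, 3, 4, 5]
```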

Python basics

Python lists (cont.)

  • Lists don’t need to contain a single data type
mixed_type=[2.0,7,"bananas"] # this line is entirely valid!
  • You can define lists of lists and lists of lists of lists, etc…
nested_list=[[1,2,3],[4,5,6]] #An example of nested list
  • You can access the elements of lists and nested lists as follows:
l=[1, 2, 3, 4, 5, 6, 7, 8]
l[0]

returns:

1
nested_list=[[1, 2, 3], [4, 5, 6]]
nested_list[0][-1]
nested_list[1][1]

returns:

3
5
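Beyond single-element access, lists also support negative indices and slicing. Slicing is not shown in the examples above, so treat the following as an optional extra sketch that reuses the same lists:

```python
l = [1, 2, 3, 4, 5, 6, 7, 8]

print(l[-1])     # 8 (last element)
print(l[1:4])    # [2, 3, 4] (start index included, stop index excluded)
print(l[:3])     # [1, 2, 3] (first three elements)
print(l[::2])    # [1, 3, 5, 7] (every second element)

nested_list = [[1, 2, 3], [4, 5, 6]]
print(nested_list[1][:2])  # [4, 5] (slice of the second inner list)
```

Remember that Python indices start at 0, whereas R indices start at 1.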

Python basics

Other types of data collections

  • Aside from lists, you also have tuples
my_tuple=(1,2,3)
my_tuple

returns

(1, 2, 3)
my_tuple[1]

returns

2

What do you think is the difference here?

Tuples are immutable!

Is there a way to update tuples? Yes, by building a new tuple and reassigning the variable!

First method

my_tuple+=(9,) #creates a new tuple with the extra element and reassigns my_tuple

Second method

#another way of adding an element to the tuple
temp=list(my_tuple) #convert the tuple to a list
temp+=[9] #append the element to the list
my_tuple=tuple(temp) #convert the list back to a tuple
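To see what "immutable" means in practice, here is a minimal sketch. The names alias and assignment_worked are introduced purely for illustration:

```python
my_tuple = (1, 2, 3)

# Item assignment fails because tuples are immutable
try:
    my_tuple[0] = 99
    assignment_worked = True
except TypeError as err:
    assignment_worked = False
    print("TypeError:", err)

# += builds a NEW tuple and rebinds the name; the old object is untouched
alias = my_tuple
my_tuple += (9,)
print(my_tuple)           # (1, 2, 3, 9)
print(alias)              # (1, 2, 3)
print(my_tuple is alias)  # False
```

So "updating" a tuple really means creating a new one, whereas l.append(...) on a list changes the existing object.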

Python basics

Other types of data collections

Aside from lists and tuples, you also have dictionaries and other more complex data collection types (for these, see the documentation).

A Python dictionary is a collection of key-value pairs, where each key corresponds to its associated value. For example:

# Me trying to do something complicated
my_silly_dictionary={"first_name":"Jane","last_name":"Doe","city":"London"}
my_silly_dictionary

returns:

{'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London'}

You can access dictionary elements as follows:

my_silly_dictionary["first_name"]

returns

'Jane'

You can add an element to the dictionary as follows:

my_silly_dictionary["country"]="UK"
my_silly_dictionary

returns

{'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London', 'country': 'UK'}
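A few other dictionary operations come up often. The sketch below uses a made-up dictionary called person (not from the slides) to show safe lookups with get, membership tests, looping over key-value pairs, and removing a key:

```python
person = {"first_name": "Jane", "last_name": "Doe", "city": "London"}

# Square brackets raise a KeyError for a missing key; .get returns a default
print(person.get("country"))             # None
print(person.get("country", "unknown"))  # unknown
print("city" in person)                  # True

# Loop over key-value pairs
for key, value in person.items():
    print(key, "->", value)

# Update an existing value and remove a key
person["city"] = "Manchester"
removed = person.pop("last_name")
print(removed)   # Doe
print(person)    # {'first_name': 'Jane', 'city': 'Manchester'}
```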

Python basics

Some basic operations

  • If you run:
type(my_silly_dictionary)
type(var)

you will get the type of your Python object.

The above returns:

<class 'dict'>
<class 'float'> #since var=float(2.0) 

You can get the length of a collection of data (list, dictionary, tuple) with the len function.

len(l) #l=[1, 2, 3, 4, 5, 6, 7, 8]
len(my_tuple) #my_tuple=(1,2,3)
len(my_silly_dictionary) #my_silly_dictionary={'first_name': 'Jane', 'last_name': 'Doe', 'city': 'London'}

returns

8
3
3

Python basics

Sometimes, we need to perform operations repeatedly

We have loops (for or while loops):

# for loop
result = []
for i in range(100000):
    result.append(i * 2)

# while loop
result = []
i = 0
while i < 100000:
    result.append(i * 2)
    i += 1

(Note that Python needs indentation and you absolutely can’t mix tabs and spaces!)

And you have list comprehensions (as well as dictionary comprehensions)

result = [i * 2 for i in range(100000)]

# dictionary that associates a number with its square
squares = {x: x**2 for x in range(10)}
print(squares)

returns

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}

# code that produces a dictionary that only keeps the even values from the original dictionary
original_dict = {"a": 1, "b": 2, "c": 3, "d": 4}
filtered_dict = {k: v for k, v in original_dict.items() if v % 2 == 0}
print(filtered_dict)

returns

{'b': 2, 'd': 4}
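If comprehensions look cryptic at first, it can help to read them next to the equivalent loop. The following sketch uses a made-up list of words and builds the same result both ways:

```python
words = ["data", "science", "for", "social", "scientists"]

# Loop version: keep words longer than 4 characters, in upper case
long_words_loop = []
for w in words:
    if len(w) > 4:
        long_words_loop.append(w.upper())

# Comprehension version: same logic in one line
long_words_comp = [w.upper() for w in words if len(w) > 4]

print(long_words_comp)  # ['SCIENCE', 'SOCIAL', 'SCIENTISTS']

# Dictionary comprehension: map each word to its length
word_lengths = {w: len(w) for w in words}
print(word_lengths)
```

The pattern is always [expression for item in iterable if condition], with the if part being optional.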

Python basics

Custom functions definition

def my_function(x):
    return x + 1
my_function(2)

In R, the return keyword exists, but it is optional. Whatever is at the last line of the function will be returned.

my_function <- function(x) {
  x + 1
}

Python basics

Custom functions definition

Let’s define functions based on the loops and list comprehension from before. We’ll do some code profiling!

import cProfile

def for_loop_example():
    result = []
    for i in range(100000):
        result.append(i * 2)

def while_loop_example():
    result = []
    i = 0
    while i < 100000:
        result.append(i * 2)
        i += 1

def list_comprehension_example():
    result = [i * 2 for i in range(100000)]

# Profile each function
print("Profiling for loop:")
cProfile.run("for_loop_example()")

print("\nProfiling while loop:")
cProfile.run("while_loop_example()")

print("\nProfiling list comprehension:")
cProfile.run("list_comprehension_example()")

Python basics

Results from the loops and list comprehension profiling

Profiling for loop:
         100004 function calls in 0.022 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.013    0.013    0.021    0.021 <python-input-66>:3(for_loop_example)
        1    0.001    0.001    0.022    0.022 <string>:1(<module>)
        1    0.000    0.000    0.022    0.022 {built-in method builtins.exec}
   100000    0.008    0.000    0.008    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



Profiling while loop:
         100004 function calls in 0.021 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.014    0.014    0.020    0.020 <python-input-66>:8(while_loop_example)
        1    0.001    0.001    0.021    0.021 <string>:1(<module>)
        1    0.000    0.000    0.021    0.021 {built-in method builtins.exec}
   100000    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



Profiling list comprehension:
         4 function calls in 0.003 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.002    0.002    0.002    0.002 <python-input-66>:15(list_comprehension_example)
        1    0.001    0.001    0.003    0.003 <string>:1(<module>)
        1    0.000    0.000    0.003    0.003 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
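cProfile records every function call, so the 100,000 append calls carry extra measurement overhead in the loop versions. As a complementary check, here is a sketch using the standard timeit module, which measures wall-clock time without per-call tracing. The parameter n and the choice of 20 repetitions are arbitrary assumptions; exact timings will differ on your machine.

```python
import timeit

def for_loop_example(n=100_000):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

def list_comprehension_example(n=100_000):
    return [i * 2 for i in range(n)]

# Run each function 20 times and report the total elapsed time
loop_time = timeit.timeit(for_loop_example, number=20)
comp_time = timeit.timeit(list_comprehension_example, number=20)

print(f"for loop:           {loop_time:.3f} s for 20 runs")
print(f"list comprehension: {comp_time:.3f} s for 20 runs")
```

Both functions return the same list, so any timing difference comes from how the list is built rather than from what it contains.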

pandas and scikit-learn (briefly)

  • Python has a base set of functions and libraries that come with the installation of the language (e.g. os, collections, math, etc.; see the Python documentation for details)

  • The pandas, numpy and scikit-learn libraries are not part of the standard Python libraries, but they are very popular and very actively maintained packages.

  • These packages contain most of the functionality needed to handle datasets, manipulate them (pandas mainly), perform statistical operations on them and apply machine learning models on them

  • These are the libraries we will rely on most in this course.

Note to R users

Think of pandas as the counterpart of R’s tidyverse and, to some extent, of scikit-learn as the counterpart of tidymodels (and perhaps caret).

A touch of pandas

Example: reading a csv file

import pandas as pd
my_data=pd.read_csv("my_file.csv")

Example: selecting columns

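# accessing a column in the dataframe
my_data['col']
my_data.col #note that for this to work, the column name "col" must be a valid Python name (no spaces) and must not clash with an existing DataFrame attribute or method
# accessing several columns at once (pass a list of column names)
my_data[['col1','col2']]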
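Here is a self-contained sketch of column selection that you can run without a CSV file. The data and column names are made up for illustration only:

```python
import pandas as pd

# Made-up data, for illustration only
my_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "home city": ["London", "Leeds", "Bath"],
})

ages = my_data["age"]               # one column name -> a Series
also_ages = my_data.age             # attribute access works for simple names
subset = my_data[["name", "age"]]   # list of names -> a DataFrame
city = my_data["home city"]         # names with spaces need square brackets

print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
print(subset.shape)           # (3, 2)
```

Note the double square brackets when selecting several columns: the outer pair does the selection and the inner pair is a regular Python list.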

A touch of pandas

  • In pandas, you can either call several functions one after the other or use method chaining:

Without method chaining

df = pd.read_csv('data.csv')
df = df.fillna(...)
df = df.query('some_condition')
df['new_column'] = pd.cut(...)  # cut is a pandas function (pd.cut), not a DataFrame method
df = df.pivot_table(...)
df = df.rename(...)

With method chaining

df = (
    pd.read_csv('data.csv')
    .fillna(...)
    .query('some_condition')
    .assign(new_column=lambda d: pd.cut(...))  # the lambda receives the dataframe built so far in the chain
    .pivot_table(...)
    .rename(...)
)
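The snippets above use ... as placeholders, so they will not run as written. Below is a small runnable sketch of the same chaining idea. The data, the bin edges and the labels are all invented for illustration, and filling missing ages with 0 is just a device to demonstrate fillna followed by query.

```python
import pandas as pd

# Made-up data with one missing age
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, None, 35, 40],
})

tidy = (
    raw
    .fillna({"age": 0})                      # replace missing ages with 0
    .query("age > 0")                        # keep rows with a known age
    .assign(age_group=lambda d: pd.cut(      # bin ages into labelled groups
        d["age"], bins=[0, 30, 100], labels=["30 or under", "over 30"]
    ))
    .rename(columns={"name": "first_name"})  # rename a column
    .reset_index(drop=True)                  # renumber the rows from 0
)
print(tidy)
```

Inside assign, the lambda is what lets you refer to the intermediate dataframe, which has no name of its own in the middle of a chain.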

A touch of pandas

Example: filtering rows

Filtering when the values are integers

df.query("col==2") #case where the column type is an integer

Filtering when the values are strings


df.query("col=='python'") #case where the column type is a string
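For comparison, the sketch below shows query next to the equivalent boolean-mask syntax, plus how to refer to a Python variable inside a query string with the @ prefix. The dataframe, its column names (lang, score) and the threshold are made up for illustration.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "lang": ["python", "r", "python", "julia"],
    "score": [2, 5, 7, 9],
})

by_query = df.query("lang == 'python'")   # string values need quotes inside the query
by_mask = df[df["lang"] == "python"]      # equivalent boolean-mask version

# Refer to a Python variable inside query with the @ prefix
threshold = 5
high = df.query("score >= @threshold")

print(by_query)
print(high)
```

Both styles return the same rows; query tends to read more naturally inside a method chain.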

Example: concatenating dataframes

Say we have two random datasets:

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30]
})

df2 = pd.DataFrame({
    "Name": ["Charlie", "David"],
    "Age": [35, 40]
})

If we want to concatenate both dataframes vertically (i.e. Name and Age remain the columns and the rows are stacked), then:

# Concatenate the DataFrames
result = pd.concat([df1, df2], ignore_index=True)

print(result)

which returns

       Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40

Coming Up

  • Next Week’s Lab: Prepare for hands-on exercises in pandas.
  • If you are a former DS105 student: Explore ME204 for code and exercises in the DS105 style.

References

Roser, Max, Hannah Ritchie, and Edouard Mathieu. 2023. “Technological Change.” Our World in Data.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3). https://doi.org/10.1214/10-STS330.