🗓️ Week 01 – Day 01
Introduction

08 Jul 2024

Introductions

Who we are: your lecturer

**Dr. Jon Cardoso-Silva**
Assistant Professor of Data Science (Education)
LSE Data Science Institute
🌐 jonjoncardoso.github.io

nlp

network analysis

optimisation

data science workflow

generative AI for education

machine learning applications

Recent projects: VIMuRe

VIMuRe (social network analysis)

📦 Python and R packages available

De Bacco, Caterina, Martina Contisciani, Jonathan Cardoso-Silva, Hadiseh Safdari, Gabriela Lima Borges, Diego Baptista, Tracy Sweet, et al. 2023. “Latent Network Models to Account for Noisy, Multiply Reported Social Network Data.” Journal of the Royal Statistical Society Series A: Statistics in Society, February, qnac004.

We developed a Bayesian statistical model to uncover the ‘true’ underlying network behind the social network ties reported by individuals.

This is applied to a type of tie elicitation called double-sampling.
The model can detect, among other things, when a person tends to under- or over-report their ties.

Fig. 9. Example of networks estimated by baseline methods and VIMuRe for one Karnataka village (tie type ‘Visit’). (a) Union (recip. = 0.93), (b) intersection (recip. = 0.88), and (c) VIMuRe (recip. = 0.49).

Recent projects: LSE Course Selection Pathways

Recent projects: Emojis and Political Identities

What can the emojis you use in your social media profile reveal about your political values? 🤔

Collaboration with:

Sara Luxmoore* (Incoming PhD student at UC Berkeley)
Pedro Ramaciotti (Research Scientist @ médialab Sciences Po)

Recent Projects: Generative AI in Education

Joint project with Dr Marcos Barreto (LSE Statistics)

CONTEXT

In higher education, students are increasingly using Generative AI tools such as ChatGPT, GitHub Copilot, Bard, and Bing AI to enhance their learning experience. These tools offer personalised and immediate assistance with tasks such as summarising literature, brainstorming, and writing code and text, even though some outputs may be limited in transparency and accuracy. Some educators feel encouraged to incorporate these tools into their teaching and assessments to support students, but there is still limited evidence on how effective generative AI tools are at improving learning outcomes.

This focus group aims to fill that gap and explore the practical applications of these tools and their role in enhancing, specifically, programming skills and critical thinking.

OBJECTIVES

Surveying participants and the academic community for their experiences and expectations related to generative AI.
Reviewing literature on generative AI tools in Education.
Identifying suitable tools for data science/quantitative courses.
Testing selected tools against reference examples and establishing assessment metrics.
Implementing and validating case studies.
Producing evidence to support peers and inform policy decisions.

Who we are: your class teacher

Alexander Soldatkin
DPhil Candidate
Oxford School of Global and Area Studies
University of Oxford
Guest Teacher at the LSE DSI
📧 E-mail
class teacher

His research relies on novel ways of using, combining, and automating open data from finance and beyond, such as creating natural language processing algorithms to extract and disambiguate named entities from text. He is currently investigating applications of network analysis in banking and implementing them in his project.

Prior to his DPhil, Alex completed a dual MSc in Public Administration and Government from LSE and Peking University, graduating with distinction and receiving a Prize for Best Dissertation from the Department of Government at LSE, where he researched factionalism and regionalism in Russian banking using geospatial analysis and text mining.

The Data Science Institute

This course is offered by the LSE Data Science Institute (DSI).
DSI is the hub for LSE’s interdisciplinary collaboration in data science
⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

How to get involved?

Our regular courses

DSI offers accessible introductions to Data Science:

DS101

Fundamentals of
Data Science

🎯 Focus:
theoretical concepts of data science

📂 How:
reflections through reading and writing

DS105

Data for
Data Scientists

🎯 Focus:
collection and handling of real data

📂 How:
hands-on coding exercises and a group project

DS202

Data Science for
Social Scientists

🎯 Focus:
fundamental machine learning algorithms

📂 How:
practical use of ML techniques and metrics

Who are you?

What we mean by Data Science

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The mythical unicorn 🦄

knows everything about statistics

able to communicate insights perfectly

fully understands businesses like no one

is a fluent computer programmer

In reality…

We are all jugglers 🤹

Everyone brings a different skill set.
We need multi-disciplinary teams.
Good data scientists know a bit of everything.
- They are not fluent in everything
- They understand their strengths and weaknesses
- They know when and where to interface with others

The Data Science Workflow

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

The Data Science Workflow

And this is what this course is about! You will learn some of the most common tools used during this process.

We add the term data engineering to the name of this course for this very reason.

A few words on Data Engineering

The meme is real

The struggle is real.
by u/ali_azg in r/dataengineering

The field is vast

But remember the unicorn 🦄! You don’t need to be an expert in all of these tools.

DataEngineering 2021 in one pic
by u/Legitimate-Cry2837 in dataengineering
Let’s zoom in 🔎 here.

Data Engineer jobs - Part I

Data Engineer jobs - Part II

The aspects of data engineering we will cover in this course:

Data collection via programming
Best practices in:
- Data formatting (tidy data)
- Software engineering (neat, replicable code)
Data pre-processing (with R and SQL)
Data products (dataviz and interactive dashboards)
Modern reporting (Quarto markdown)

Course syllabus

Let’s navigate our website:

lse-dsi.github.io/ME204

Time for coffee ☕

After the break:

The toolbox 🧰 we will be using
base R vs tidyverse
Let’s get you all set up!

The toolbox 🧰

Python vs R ??
- You can do pretty much the same in both languages
- Python is more widely adopted in industry, whereas R is more popular in academia
This course is intended to be language-agnostic. I can teach you the same skills in both languages
- BUT, I will open it up to YOU to decide which language we will use
- We must reach a collective decision tomorrow morning

Python vs R

Python

Pull data from webpages & APIs:
- selenium, scrapy, requests
Reshape data
- pandas, numpy, scipy
Plotting data
- matplotlib, plotnine, altair
Share & Report
- GitHub Markdown
- Jupyter

R

Pull data from webpages & APIs:
- RSelenium, httr, rvest
Reshape data
- tidyverse (all packages)
Plotting data
- ggplot2
Share & Report
- GitHub Markdown
- RMarkdown, knitr, Quarto

👨‍💻 Hands-on

⚙️ Let’s get you all set up with the software we will be using in this course
Then I will show you some base R vs tidyverse comparisons

Software

Required

R
RStudio
R packages:
- tidyverse

Optional today (will be required later in the course)

R packages:
- xml2
- httr2
- lintr
- testthat
- covr
- usethis
- shiny
Git
Quarto markdown

💡 Summary of RStudio tips

Use R projects to keep your work more organised.
Use the R console directly when experimenting or learning a new package.
Use R scripts when dealing with longer code.
- Don’t forget to give your scripts a meaningful name!
- Comment your code! You won’t remember what you did in a few weeks.
Acquire the habit of sourcing your scripts.
- This will make your code more reproducible
- It will also make it easier to debug
- Though it’s ok to use Ctrl+Enter when you know what you’re doing!
When writing reports, use RMarkdown
Write functions to make your code more readable and reusable
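
As a minimal sketch of the last two habits (sourcing scripts and writing functions), the snippet below writes a tiny script to a temporary folder and then sources it. The file name `clean_survey.R` and the helper `clean_names()` are hypothetical, invented only for this illustration; in a real R project the script would simply live in your project folder.

```r
# Hypothetical script location; in practice this would be a file in your R project
script_path <- file.path(tempdir(), "clean_survey.R")

# Write a small script that defines one reusable, clearly named function
writeLines(c(
  "# Turn messy column labels into lower-case, underscore-separated names",
  "clean_names <- function(x) {",
  "  tolower(gsub(' +', '_', trimws(x)))",
  "}"
), script_path)

# Sourcing runs the whole script from top to bottom in a predictable order
source(script_path)

# The function defined in the script is now available in the session
cleaned <- clean_names(c(" First Name", "Age  Group"))
print(cleaned)
```

Because the whole file is run in one go, anyone who sources the script ends up with the same objects, which is what makes the workflow easier to reproduce and debug.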

General R tips

Give variables and functions meaningful names
- You won’t know what x means in a few weeks
Use ? to access the documentation of a function
Use ?? to search the help pages by keyword when you don’t know a function’s exact name
Use help(package = "package_name") to list all functions in a package

Pro-tips

R operations are typically vectorised
- Vectorised operations are executed element-wise
Loops (for or while) can be very inefficient in R
- When writing base R code, try to go for apply functions instead
You can use dput() to share your data with others
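
The pro-tips above can be sketched with a small example. The `celsius` vector, the `to_fahrenheit()` helper, and the `small_df` data frame are made up for illustration; only base R functions are used.

```r
# Hypothetical example data: five temperatures in Celsius
celsius <- c(12.5, 18.0, 21.3, 9.8, 15.0)

# Vectorised: the arithmetic is applied to every element at once
fahrenheit_vec <- celsius * 9 / 5 + 32

# Equivalent explicit loop, shown only for comparison
fahrenheit_loop <- numeric(length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit_loop[i] <- celsius[i] * 9 / 5 + 32
}

# apply-family alternative: call a function on each element and
# declare that each call returns a single number
to_fahrenheit <- function(x) x * 9 / 5 + 32
fahrenheit_apply <- vapply(celsius, to_fahrenheit, numeric(1))

# All three approaches give the same values
print(all.equal(fahrenheit_vec, fahrenheit_loop))
print(all.equal(fahrenheit_vec, fahrenheit_apply))

# dput() prints R code that recreates an object, handy for sharing a
# small reproducible sample of your data in a forum post or an e-mail
small_df <- data.frame(id = 1:3, city = c("London", "Paris", "Rome"))
dput(small_df)
```

Pasting the text printed by `dput()` into another R session rebuilds the same data frame, so the recipient does not need your original file.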

The parts of an R function

Source: (Grolemund 2014, sec. 2.5)
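
As a rough sketch of the usual anatomy of an R function (a name, arguments, default values, a body, and the last expression as the return value), consider the hypothetical `rescale()` function below. It is not taken from the cited book; it is only meant to label the parts.

```r
# Name: rescale
# Arguments: x, lower, upper
# Default values: lower = 0 and upper = 1
rescale <- function(x, lower = 0, upper = 1) {
  # Body: everything between the curly braces
  scaled <- (x - min(x)) / (max(x) - min(x))

  # The value of the last expression is what the function returns
  lower + scaled * (upper - lower)
}

# Calling the function with the defaults, then overriding one of them
unit_scale <- rescale(c(2, 4, 6))
percent_scale <- rescale(c(2, 4, 6), upper = 100)
print(unit_scale)
print(percent_scale)

# R lets you inspect the parts of a function programmatically
arg_names <- names(formals(rescale))
print(arg_names)
print(body(rescale))
```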

base R vs tidyverse

Embrace the pipe %>% mindset!
- It’s actually more intuitive than you think
Refer to (Tavares 2018) for a side-by-side comparison of base R and tidyverse
Keep the dplyr cheat sheet handy
- Head over to (Posit 2023) and scroll down to find and download the “Data Transformation with dplyr” cheatsheet
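
To give a flavour of the comparison, here is a small sketch that answers the same question twice, once in base R and once with dplyr and the pipe. It assumes the dplyr package is installed (it comes with the tidyverse) and uses the built-in `mtcars` dataset; the `wt > 3` cut-off is arbitrary and chosen only for illustration.

```r
library(dplyr)

# Question: among the heavier cars, what is the average mpg per number of cylinders?

# Base R: subset the rows first, then aggregate with a formula
base_heavy <- mtcars[mtcars$wt > 3, ]
base_means <- aggregate(mpg ~ cyl, data = base_heavy, FUN = mean)

# tidyverse: the pipe lets you read the steps from top to bottom
tidy_means <- mtcars %>%
  filter(wt > 3) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

print(base_means)
print(tidy_means)

# Both approaches produce the same numbers
print(all.equal(base_means$mpg, tidy_means$mpg))
```

The base R version needs an intermediate object, while the piped version reads as a sequence of verbs (filter, group, summarise), which is the mindset the slide refers to.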

⚙️ Hands-on

Let’s install the software we will be using in this course
- R, RStudio, and the tidyverse package
- VS Code, Python, Anaconda, and the pandas package

References

Davenport, Thomas. 2020. “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.55546b4a.

Grolemund, Garrett. 2014. Hands-on Programming with R. First edition. Sebastopol, CA: O’Reilly. https://rstudio-education.github.io/hopr/.

Posit. 2023. “Posit Cheatsheets (R, RStudio, Tidyverse and More).” Posit. https://posit.co/resources/cheatsheets/.

Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. 1st edition. Beijing ; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.

Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.

Tavares, Hugo. 2018. “Syntax Equivalents: Base R Vs Tidyverse.” Data Carpentry Extras. https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html.
