🗓️ Week 01 – Day 01
Introduction

LSE ME204

Dr Jon Cardoso-Silva

LSE Data Science Institute

10 Jul 2023

What we will cover today:

  • Introductions
  • What we mean by Data Science
  • A few words on Data Engineering
  • Course syllabus
  • The toolbox 🧰
  • 👨‍💻 Hands-on

Introductions

Who we are: your lecturer

Photo of Jon Cardoso-Silva
Dr. Jon Cardoso-Silva
Assistant Professor of Data Science (Education)
LSE Data Science Institute
🌐 jonjoncardoso.github.io

nlp

text mining

optimisation

data science workflow

relationship-rich education

generative AI for education

machine learning applications

Recent projects: VIMuRe

VIMuRe (social network analysis)

📦 Python and R packages available

De Bacco, Caterina, Martina Contisciani, Jonathan Cardoso-Silva, Hadiseh Safdari, Gabriela Lima Borges, Diego Baptista, Tracy Sweet, et al. 2023. “Latent Network Models to Account for Noisy, Multiply Reported Social Network Data.” Journal of the Royal Statistical Society Series A: Statistics in Society, February, qnac004.

We developed a Bayesian statistical model to uncover the ‘true’ underlying network behind the social network ties reported by individuals.

  • This is applied to a type of tie elicitation called double-sampling.
  • The model can detect when a person tends to under- or over-report their ties.
  • Among other things…

Fig. 9. Example of networks estimated by baseline methods and VIMuRe for one Karnataka village (tie type ‘Visit’). (a) Union (recip.=0.93), (b) intersection (recip.=0.88), and (c) VIMuRe (recip.=0.49).

Recent projects: LSE Course Selection Pathways

Recent projects: Emojis and Political Identities

What can the emojis you use in your social media profile reveal about your political values? 🤔

Collaboration with:

  • Sara Luxmoore* (Incoming PhD student at UC Berkeley)
  • Pedro Ramaciotti (Research Scientist @ médialab Sciences Po)

Read more about it here

Recent Projects: Generative AI in Education

Joint project with Dr Marcos Barreto (LSE Statistics)

CONTEXT

In higher education, students are increasingly using Generative AI tools like ChatGPT, GitHub Copilot, Bard, and Bing AI to enhance their learning experience. These tools offer personalised and immediate assistance for tasks such as summarising literature, brainstorming, and writing code and text, even though some outputs may be limited in transparency and accuracy. Some educators feel encouraged to incorporate these tools into their teaching and assessments to support students, but there is still limited evidence on how effective these generative AI tools are in improving learning outcomes.

This focus group aims to fill that gap and explore the practical applications of these tools and their role in enhancing, specifically, programming skills and critical thinking.

OBJECTIVES

  • Surveying participants and the academic community for their experiences and expectations related to generative AI.
  • Reviewing literature on generative AI tools in Education.
  • Identifying suitable tools for data science/quantitative courses.
  • Testing selected tools against reference examples and establishing assessment metrics.
  • Implementing and validating case studies.
  • Producing evidence to support peers and inform policy decisions

Who we are: your class teacher

Photo of Mahsa Dalirrooy-Fard
Mahsa Dalirrooy-Fard
PhD Candidate at LSE Mathematics
📧 m.dalirrooy-fard@lse.ac.uk

mathematics

graph theory

machine learning

software engineering

combinatorial optimisation

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • DSI is the hub for LSE’s interdisciplinary collaboration in data science.
  • ⏭️ Let’s see a few activities that might be of interest to you

Sign up for DSI events at lse.ac.uk/DSI/Events

CIVICA Seminar Series

🔗 Link

How to get involved?

Sign up for the DSI Newsletter

Our regular courses

DSI offers accessible introductions to Data Science:

DS101

Fundamentals of
Data Science

🎯 Focus:
theoretical concepts of data science

📂 How:
reflections through reading and writing

DS105

Data for
Data Scientists

🎯 Focus:
collection and handling of real data

📂 How:
hands-on coding exercises and a group project

DS202

Data Science for
Social Scientists

🎯 Focus:
fundamental machine learning algorithms

📂 How:
practical use of ML techniques and metrics

Who are you?

Let’s generate some data with mentimeter!

What we mean by Data Science

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

(Shah 2020, chap. 1) - Emphasis and emojis are of my own making.

The mythical unicorn 🦄

knows everything about statistics

able to communicate insights perfectly

fully understands businesses like no one

is a fluent computer programmer

Of course, such a person does not exist!

See (Davenport 2020) for a more in-depth discussion about this

In reality…

We are all jugglers 🤹

  • Everyone brings a different skill set.
  • We need multi-disciplinary teams.
  • Good data scientists know a bit of everything.
    • They are not fluent in all things
    • They understand their strengths and weaknesses
    • They know when and where to interface with others

See (Schutt and O’Neil 2013, chap. 1) for more on this.

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

⚠️ Note that this is a simplified version of what happens in a data science project.
In practice, the process is not linear, and many feedback loops exist.

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → End
Exploratory data analysis → Machine learning → Obtain insights → Communicate results

And this is what this course is about! You will learn some of the most common tools used during this process.

We add the term data engineering to the name of this course for this very reason.

A few words on Data Engineering

The meme is real


The struggle is real.
by u/ali_azg in r/dataengineering

Data engineering doesn’t get as much attention as data science, but it’s an ESSENTIAL skill if you want to work with data.

The field is vast

But remember the unicorn 🦄! You don’t need to be an expert in all of these tools.


DataEngineering 2021 in one pic
by u/Legitimate-Cry2837 in dataengineering
Let’s zoom in 🔎 here.

Data Engineer jobs - Part I

Data Engineer jobs - Part II

The aspects of data engineering we will cover in this course:

  • Data collection via programming
  • Best practices in:
    • Data formatting (tidy data)
    • Software engineering (neat, replicable code)
  • Data pre-processing (with R and SQL)
  • Data products (dataviz and interactive dashboards)
  • Modern reporting (Quarto markdown)

👉 This will provide you with a comprehensive understanding of the entire process, making it relatively easy for you to adapt to any other programming language or tools you may encounter in your industry or academic career.

Course syllabus

Let’s navigate our website:

lse-dsi.github.io/ME204

ME204’s favicon was created using DALL-E 2

Time for coffee ☕

After the break:

  • The toolbox 🧰 we will be using
  • base R vs tidyverse

The toolbox 🧰

  • Python vs R ??
    • You can do pretty much the same in both languages
    • Python is more widely adopted in industry, whereas R is more popular in academia
    • In this course we will adopt R

Python vs R

Python

  • Pull data from webpages & APIs:
    • selenium, scrapy, requests
  • Reshape data
    • pandas, numpy, scipy
  • Plotting data
    • matplotlib, plotnine, altair
  • Share & Report
    • GitHub Markdown
    • Jupyter

R

  • Pull data from webpages & APIs:
    • RSelenium, httr, rvest
  • Reshape data
    • tidyverse (all packages)
  • Plotting data
    • ggplot2
  • Share & Report
    • GitHub Markdown
    • RMarkdown, knitr, Quarto

How should we share code?

GitHub!

Use GitHub for everything related to your project!

  • You will learn to set up GitHub for your own code in the 🗓️ Week 01 Day 03 lab.

Important

Don’t share code via e-mail, Dropbox, Google Drive, or anything like that!

It is a bad practice. Things get messy very quickly.

  • Remember that you have some time to develop your projects.

👨‍💻 Hands-on

  • ⚙️ Let’s get you all set up with the software we will be using in this course
  • Then I will show you some base R vs tidyverse comparisons

The following slides summarise my talking points.

Software

Required

  • R (link)
  • RStudio
  • R packages:
    • tidyverse

Optional today (will be required later in the course)

  • R packages:
    • xml2
    • httr2
    • lintr
    • testthat
    • covr
    • usethis
    • shiny
  • Git
  • Quarto markdown

💡 Summary of RStudio tips

  • Use R projects to keep your work more organised.
  • Use the R console directly when experimenting or learning a new package.
  • Use R scripts when dealing with longer code.
    • Don’t forget to give your scripts a meaningful name!
    • Comment your code! You won’t remember what you did in a few weeks.
  • Acquire the habit of sourcing your scripts.
    • This will make your code more reproducible
    • It will also make it easier to debug
    • Though it’s ok to use Ctrl+Enter when you know what you’re doing!
  • When writing reports, use RMarkdown
  • Write functions to make your code more readable and reusable
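
To make the tips on functions and sourcing concrete, here is a minimal sketch. The function name fahrenheit_to_celsius is made up for illustration, and a temporary file stands in for a real, meaningfully named script in your R project.

```r
# Create a small script on disk. In practice this would be a file in your
# R project with a meaningful name (a temporary file is used here only so
# that the example is self-contained).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# Convert temperatures from Fahrenheit to Celsius",
  "fahrenheit_to_celsius <- function(temp_f) {",
  "  (temp_f - 32) * 5 / 9",
  "}"
), script_path)

# source() runs the whole script from top to bottom, so the result does not
# depend on which lines happened to be run by hand
source(script_path)

# The function defined in the script is now available
fahrenheit_to_celsius(c(32, 212))
```

Running the whole file each time, instead of selected lines, is what makes a script easier to reproduce and debug.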

General R tips

  • Give variables and functions meaningful names
    • You won’t know what x means in a few weeks
  • Use ? to access the documentation of a function
  • Use ?? to search the help pages by keyword when you don’t know a function’s exact name
  • Use help(package = "package_name") to list all functions in a package
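
As a quick illustration of the help shortcuts above, the lines below can be typed in the R console. The topics chosen (mean, "standard deviation", the stats package) are arbitrary examples.

```r
# Open the documentation page for a function you already know by name
?mean

# Search the installed help pages when you only know a keyword or phrase
??"standard deviation"

# List the documented functions and datasets of a package
help(package = "stats")
```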

Pro-tips

  • R operations are typically vectorised
    • Vectorised operations are executed element-wise
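  • Explicit loops (for or while) are often slower than vectorised code in R, especially when they grow an object one element at a time
    • When writing base R code, try the apply family (lapply(), sapply(), vapply()) instead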
  • You can use dput() to share your data with others
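
The pro-tips above can be sketched with a few lines of base R. The variable names and numbers are made up for illustration.

```r
# Vectorised: the division is applied to every element at once
heights_cm <- c(150, 165, 180)
heights_m <- heights_cm / 100

# The same result with an explicit loop: more code for the same outcome
heights_m_loop <- numeric(length(heights_cm))
for (i in seq_along(heights_cm)) {
  heights_m_loop[i] <- heights_cm[i] / 100
}

# An apply-family function: run the same function on each list element
# without writing the loop yourself. vapply() also checks that every
# result is a single number.
samples <- list(c(1, 5, 3), c(10, 2), c(7, 7))
spreads <- vapply(samples, function(x) max(x) - min(x), numeric(1))

# dput() prints R code that recreates an object, which is handy for
# pasting a small piece of data into a message or a forum question
dput(heights_cm)
```

Whoever receives the dput() output can paste it straight into their own console to rebuild the object.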

You can also consult the ‘Hands-on Programming with R’ book (Grolemund 2014) as a reference.

The parts of an R function

Source: (Grolemund 2014, sec. 2.5)
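
Since the figure is not reproduced here, the sketch below labels the parts of a function in comments. It is loosely inspired by the dice-rolling examples in that book, but the function name and arguments are illustrative choices, not a copy of the original figure.

```r
# name      arguments, each with a default value
roll_dice <- function(faces = 1:6, n_dice = 2) {
  # body: the code that runs every time the function is called
  rolled <- sample(faces, size = n_dice, replace = TRUE)
  # the value of the last expression is what the function returns
  sum(rolled)
}

# Call with the defaults: two six-sided dice
roll_dice()

# Override the defaults by naming the arguments
roll_dice(faces = 1:20, n_dice = 3)
```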

base R vs tidyverse

  • Embrace the pipe %>% mindset!

    • It’s actually more intuitive than you think
  • Refer to (Tavares 2018) for a side-by-side comparison of base R and tidyverse:

    Tavares, Hugo. 2018. “Syntax Equivalents: Base R Vs Tidyverse.” Data Carpentry Extras.

  • Keep the dplyr cheat sheet handy

    • Head over to (Posit 2023) and scroll down to find and download the “Data Transformation with dplyr” cheatsheet
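
To give a flavour of the comparison, here is a small sketch using the built-in mtcars dataset. It assumes the dplyr package (part of the tidyverse) is installed. The task itself, keeping four-cylinder cars and sorting them by fuel efficiency, is an arbitrary example.

```r
library(dplyr)

# base R: subset rows and columns with [ , ], then reorder with order()
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
base_result <- four_cyl[order(four_cyl$mpg, decreasing = TRUE), ]

# tidyverse: each step is a verb, and the pipe passes the result along
tidy_result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp) %>%
  arrange(desc(mpg))

head(base_result, 3)
head(tidy_result, 3)
```

Both versions select the same cars. The piped version reads top to bottom in the order the steps happen, which is the mindset the slide refers to.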

References

Davenport, Thomas. 2020. “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.55546b4a.
Grolemund, Garrett. 2014. Hands-on Programming with R. First edition. Sebastopol, CA: O’Reilly. https://rstudio-education.github.io/hopr/.
Posit. 2023. “Posit Cheatsheets (R, RStudio, Tidyverse and More).” Posit. https://posit.co/resources/cheatsheets/.
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. 1st edition. Beijing ; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Tavares, Hugo. 2018. “Syntax Equivalents: Base R Vs Tidyverse.” Data Carpentry Extras. https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html.
