🗓️ Week 01 – Day 01
Introduction

LSE ME204

08 Jul 2024

Introductions

Who we are: your lecturer

Dr. Jon Cardoso-Silva
Assistant Professor of Data Science (Education)
LSE Data Science Institute
🌐 jonjoncardoso.github.io

NLP

network analysis

optimisation

data science workflow

generative AI for education

machine learning applications

Recent projects: VIMuRe

VIMuRe (social network analysis)

📦 Python and R packages available

De Bacco, Caterina, Martina Contisciani, Jonathan Cardoso-Silva, Hadiseh Safdari, Gabriela Lima Borges, Diego Baptista, Tracy Sweet, et al. 2023. “Latent Network Models to Account for Noisy, Multiply Reported Social Network Data.” Journal of the Royal Statistical Society Series A: Statistics in Society, February, qnac004.

We developed a Bayesian statistical model to uncover the ‘true’ underlying network behind the social network ties reported by individuals.

  • This is applied to a type of tie elicitation called double-sampling.
  • The model can detect when a person tends to under- or over-report their ties.
  • among other things…

Fig. 9. Example of networks estimated by baseline methods and VIMuRe for one Karnataka village (tie type ‘Visit’). (a) Union (recip. = 0.93), (b) intersection (recip. = 0.88), and (c) VIMuRe (recip. = 0.49).

Recent projects: LSE Course Selection Pathways

Recent projects: Emojis and Political Identities

What can the emojis you use in your social media profile reveal about your political values? 🤔

Collaboration with:

Recent Projects: Generative AI in Education

Joint project with Dr Marcos Barreto (LSE Statistics)

CONTEXT

In higher education, students are increasingly using Generative AI tools such as ChatGPT, GitHub Copilot, Bard, and Bing AI to enhance their learning experience. These tools offer personalised and immediate assistance with tasks such as summarising literature, brainstorming, and writing code and text, although their outputs can be limited in transparency and accuracy. Some educators feel encouraged to incorporate these tools into their teaching and assessments to support students, but there is still limited evidence on how effective generative AI tools are at improving learning outcomes.

This focus group aims to fill that gap and explore the practical applications of these tools and their role in enhancing, specifically, programming skills and critical thinking.

OBJECTIVES

  • Surveying participants and the academic community for their experiences and expectations related to generative AI.
  • Reviewing literature on generative AI tools in Education.
  • Identifying suitable tools for data science/quantitative courses.
  • Testing selected tools against reference examples and establishing assessment metrics.
  • Implementing and validating case studies.
  • Producing evidence to support peers and inform policy decisions.

Who we are: your class teacher

Alexander Soldatkin
DPhil Candidate
Oxford School of Global and Area Studies
University of Oxford
Guest Teacher at the LSE DSI

His research relies on novel ways of using, combining, and automating open data from finance and beyond, such as creating natural language processing algorithms to extract and disambiguate named entities from text. He is currently investigating applications of network analysis in banking and implementing them in his project.

Prior to his DPhil, Alex completed a dual MSc in Public Administration and Government from LSE and Peking University, graduating with distinction and receiving a Prize for Best Dissertation from the Department of Government at LSE, where he researched factionalism and regionalism in Russian banking using geospatial analysis and text mining.

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • The DSI is the hub for LSE’s interdisciplinary collaboration in data science.
  • ⏭️ Let’s see a few activities that might be of interest to you.

CIVICA Seminar Series

How to get involved?

Sign up for the DSI Newsletter

Our regular courses

DSI offers accessible introductions to Data Science:

DS101: Fundamentals of Data Science

  • 🎯 Focus: theoretical concepts of data science
  • 📂 How: reflections through reading and writing

DS105: Data for Data Scientists

  • 🎯 Focus: collection and handling of real data
  • 📂 How: hands-on coding exercises and a group project

DS202: Data Science for Social Scientists

  • 🎯 Focus: fundamental machine learning algorithms
  • 📂 How: practical use of ML techniques and metrics

Who are you?

What we mean by Data Science

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The mythical unicorn 🦄

knows everything about statistics

able to communicate insights perfectly

fully understands businesses like no one

is a fluent computer programmer

In reality…

We are all jugglers 🤹

  • Everyone brings a different skill set.
  • We need multi-disciplinary teams.
  • Good data scientists know a bit of everything.
    • They are not fluent in all things
    • They understand their strengths and weaknesses
    • They know when and where to interface with others

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End


It is often said that 80% of the time and effort spent on a data science project goes into the data-preparation tasks of this workflow: gathering, storing, cleaning, and building the dataset.

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → End

(In this version of the diagram, Machine learning → Obtain insights → Communicate results still follow Exploratory data analysis, but the path to End runs directly from Exploratory data analysis.)

And this is what this course is about! You will learn some of the most common tools used during this process.

We add the term data engineering to the name of this course for this very reason.
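A minimal sketch of the first five workflow steps, to make them concrete. It is written in Python using only the standard library, which is an assumption (the course may well run in R), and the records and function names are hypothetical, not course material:

```python
import csv
import io
import statistics

# Hypothetical raw records, standing in for whatever a real source returns.
RAW_ROWS = [
    {"city": " London ", "temp_c": "18.5"},
    {"city": "Leeds", "temp_c": ""},  # missing measurement
    {"city": "York", "temp_c": "16.0"},
    {"city": "Bath ", "temp_c": "19.5"},
]


def gather():
    """Gather data: here we simply return the toy records."""
    return RAW_ROWS


def store(rows):
    """Store it somewhere: write the rows as CSV text (kept in memory for this sketch)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["city", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def clean(csv_text):
    """Clean & pre-process: trim whitespace and drop rows with no measurement."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["temp_c"].strip() == "":
            continue
        cleaned.append({"city": row["city"].strip(), "temp_c": row["temp_c"].strip()})
    return cleaned


def build(rows):
    """Build a dataset: convert text fields to proper types."""
    return [{"city": r["city"], "temp_c": float(r["temp_c"])} for r in rows]


def explore(dataset):
    """Exploratory data analysis: a few summary statistics."""
    temps = [r["temp_c"] for r in dataset]
    return {"n": len(temps), "mean": statistics.mean(temps), "max": max(temps)}


dataset = build(clean(store(gather())))
summary = explore(dataset)
print(summary)
```

Each step is a separate function so that it can be rerun or swapped on its own. In a real project, storing would write to a file or database rather than to an in-memory string.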

A few words on Data Engineering

The meme is real


The struggle is real.
by u/ali_azg in r/dataengineering

The field is vast

But remember the unicorn 🦄! You don’t need to be an expert in all of these tools.


DataEngineering 2021 in one pic
by u/Legitimate-Cry2837 in r/dataengineering
Let’s zoom in 🔎 here.

Data Engineer jobs - Part I

Data Engineer jobs - Part II

The aspects of data engineering we will cover in this course:

  • Data collection via programming
  • Best practices in:
    • Data formatting (tidy data)
    • Software engineering (neat, replicable code)
  • Data pre-processing (with R and SQL)
  • Data products (dataviz and interactive dashboards)
  • Modern reporting (Quarto markdown)
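To give a flavour of what "tidy data" means (each variable in its own column, each observation in its own row), here is a small sketch. It assumes Python with only the standard library. The table, column names, and the `to_tidy` helper are all hypothetical:

```python
# Hypothetical "wide" table: one row per country, one column per year.
wide = [
    {"country": "Brazil", "2022": 10, "2023": 12},
    {"country": "Kenya", "2022": 7, "2023": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, "country", "year", "cases")
for row in tidy:
    print(row)
```

The wide version hides the variable "year" in the column headers. The tidy version makes it an ordinary column, which is the shape that filtering, grouping, and plotting tools usually expect.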

Course syllabus

Let’s navigate our website:

lse-dsi.github.io/ME204

Time for coffee ☕

After the break:

  • The toolbox 🧰 we will be using
  • base R vs tidyverse
  • Let’s get you all set up!

The toolbox 🧰

  • Python vs R?
    • You can do pretty much the same in both languages
    • Python is more widely adopted in industry, whereas R is more popular in academia
  • This course is intended to be language-agnostic: I can teach you the same skills in either language.
    • BUT, I will open it up to YOU to decide which language we will use
    • We must reach a collective decision tomorrow morning

Python vs R

Python

How should we share code?

GitHub!

Use GitHub for everything related to your project!

  • You will learn to set up GitHub for your own code in the 🗓️ Week 01 Day 03 lab.

Important

Don’t share code via e-mail, Dropbox, Google Drive, or anything like that!

It is bad practice: things get messy very quickly.

👨‍💻 Hands-on

  • ⚙️ Let’s get you all set up with the software we will be using in this course
  • Then I will show you some base R vs tidyverse comparisons

Software

Required

Optional today (will be required later in the course)

💡 Summary of RStudio tips

  • Use R projects to keep your work more organised.
  • Use the R console directly when experimenting or learning a new package.
  • Use R scripts when dealing with longer code.
    • Don’t forget to give your scripts a meaningful name!
    • Comment your code! You won’t remember what you did in a few weeks.
  • Acquire the habit of sourcing your scripts.
    • This will make your code more reproducible
    • It will also make it easier to debug
    • Though it’s ok to use Ctrl+Enter when you know what you’re doing!
  • When writing reports, use RMarkdown
  • Write functions to make your code more readable and reusable

General R tips

  • Give variables and functions meaningful names
    • You won’t know what x means in a few weeks
  • Use ? to access the documentation of a function
  • Use ?? to search the documentation when you don’t know a function’s exact name
  • Use help(package = "package_name") to list all functions in a package

Pro-tips

  • R operations are typically vectorised
    • Vectorised operations are executed element-wise
  • Loops (for or while) can be very inefficient in R
    • When writing base R code, try to go for apply functions instead
  • You can use dput() to share your data with others

The parts of an R function

Source: (Grolemund 2014, sec. 2.5)

base R vs tidyverse

  • Embrace the pipe %>% mindset!

    • It’s actually more intuitive than you think
  • Refer to (Tavares 2018) for a side-by-side comparison of base R and tidyverse:

    Tavares, Hugo. 2018. “Syntax Equivalents: Base R Vs Tidyverse.” Data Carpentry Extras.

  • Keep the dplyr cheat sheet handy

    • Head over to (Posit 2023) and scroll down to find and download the “Data Transformation with dplyr” cheatsheet.

⚙️ Hands-on

  • Let’s install the software we will be using in this course
    • R, RStudio, and the tidyverse package
    • VS Code, Python, Anaconda, and the pandas package

References

Davenport, Thomas. 2020. “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.55546b4a.
Grolemund, Garrett. 2014. Hands-on Programming with R. First edition. Sebastopol, CA: O’Reilly. https://rstudio-education.github.io/hopr/.
Posit. 2023. “Posit Cheatsheets (R, RStudio, Tidyverse and More).” Posit. https://posit.co/resources/cheatsheets/.
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. 1st edition. Beijing ; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Tavares, Hugo. 2018. “Syntax Equivalents: Base R Vs Tidyverse.” Data Carpentry Extras. https://tavareshugo.github.io/data_carpentry_extras/base-r_tidyverse_equivalents/base-r_tidyverse_equivalents.html.