LSE ME204 (2023)

Data Engineering for the Social World

Week 01

🗓️ Day 01
(Mon 10 Jul)

🧑‍🏫 Lecture

The Data Science Toolbox

💻 Lab

Recap of base R and tidyverse fundamentals

📖 Revise

  • Ensure you have R and RStudio installed on your computer
  • Read (Grolemund 2014, chap. 2) if you need a refresher on R basics
  • Ensure you have installed tidyverse
  • Download the dplyr cheatsheet (Cetinkaya-Rundel 2023) and keep it handy for reference.
  • Revisit base R vs tidyverse syntax equivalence
  • Skim the textbook references mentioned in the slides to find out more about the topics covered in the lecture.
  • Take note of any concepts that might remain unclear to you even after the lab and bring them to the next lecture.

🗓️ Day 02
(Tue 11 Jul)

🧑‍🏫 Lecture

Data types and common file formats
(CSV, JSON, XML, YAML, and more)

💻 Lab

Manipulating XML files

📖 Revise

Do a bit of self-checking:

Would you be able to explain in simple terms what the following dplyr functions do?

  • select()
  • filter()
  • arrange()
  • summarise()
  • mutate()

If the answer is no, refer to Wickham and Grolemund (2016, chap. 5), which is available for free online. The dplyr documentation and the dplyr cheatsheet are also good references.
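
As a self-check aid, here is a minimal sketch of the five verbs applied to a small data frame. The `students` data and its column names are made up for illustration and do not come from the course materials. The sketch assumes dplyr is installed and R is version 4.1 or later (for the `|>` pipe).

```r
library(dplyr)

# A tiny, made-up data frame (hypothetical values, for illustration only)
students <- data.frame(
  name  = c("Ana", "Ben", "Chloe", "Dev"),
  group = c("A", "A", "B", "B"),
  score = c(72, 58, 90, 65)
)

# select(): keep only the named columns
name_score <- students |> select(name, score)

# filter(): keep only the rows that satisfy a condition
passed <- students |> filter(score >= 60)

# arrange(): reorder rows, here from highest to lowest score
ranked <- students |> arrange(desc(score))

# mutate(): add (or overwrite) a column computed from existing ones
with_prop <- students |> mutate(score_prop = score / 100)

# summarise(): collapse rows into summary values, here one row per group
group_means <- students |>
  group_by(group) |>
  summarise(mean_score = mean(score))
```

Printing each result next to `students` is a quick way to check whether your one-sentence explanation of the verb matches what it actually does.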

XML

  • Revisit the code your class instructor shared in Part 2 of the lab (🧑🏻‍🏫 TEACHING MOMENT).
  • Do you really get what’s going on? If not, ask questions on Slack! Don’t be shy, you’re likely not the only one who’s confused. We’ll go through the questions at the beginning of the next lecture.
  • Revisit your own code for the rest of the lab. Try to figure out the reason behind each line of code.

🗓️ Day 03
(Wed 12 Jul)

🧑‍🏫 Lecture

The Internet and the Web
(Add HTTP, HTML, CSS, DOM, JS to your vocabulary)

💻 Lab

GitHub and Markdown

📖 Revise

  • Check out this episode of the Babbage podcast from The Economist, featuring an interview with the creator of the TCP/IP protocol. Learn how the protocol came to life and gain insights into the relationship between AI and the Internet today.

  • Check out this recent WIRED interview with Sir Tim Berners-Lee, the inventor of the World Wide Web.

🗓️ Day 04
(Thu 13 Jul)

🧑‍🏫 Lecture

Web Scraping
Time to collect data from the Web!

💻 Lab

Web Scraping with the rvest package

📣 Assignment Reveal

The midterm assignment will be revealed during the lecture! Deadline: Tue 18 Jul, 11:59 p.m. BST.

🗓️ Day 05
(Fri 14 Jul)
There are no lectures or labs on Fridays!

📟 Slack support

  • Reach out on the public channels on Slack if you have any questions.
  • Enhance your skills by helping your peers on Slack.
  • (Jon will be monitoring the public Slack channels on Thursday afternoon/Friday morning)

🧑‍💼 Office Hours

  • Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
    • No need to book, but please be patient if there are other students ahead of you in the queue.
    • A typical office hour session lasts 10-15 minutes.

📝 Assignment

  • Work on your midterm assignment so you can enjoy the weekend!

Week 02

🗓️ Day 01
(Mon 17 Jul)

🧑‍🏫 Lecture

Neat functions and tidy data, testing and debugging

💻 Lab

The art of refactoring code

📝 Assignment

  • Complete your midterm assignment if you haven’t done so already.

🗓️ Day 02
(Tue 18 Jul)

🧑‍🏫 Lecture

API queries and JSON tricks

💻 Lab

Working with JSON data

📝 Assignment

  • Today is the deadline for your midterm assignment!

📖 Revise

The best way to learn how to handle API queries and JSON data is to practice! If you’ve done the exercises in the morning and completed everything in the lab later, you should be in good shape. If you’re still unsure about some of the concepts, here is an extra exercise to keep you busy:

  • Using the Core REST API of Wikimedia, identify the API endpoint that allows you to search for articles by content.
  • Use this API endpoint to search for articles in the English Wikipedia related to ‘London School of Economics’ and save the output as a tidy data frame.
  • Using the relevant API endpoint, retrieve the content of each of these articles and save it as a tidy data frame. Pay close attention to the source column of the output. What does it tell you?

(We don’t have solutions to this exercise, but feel free to reach out to us to get feedback on your code!)
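
If nested JSON is the sticking point, the sketch below may help as a warm-up before the exercise. It parses a made-up JSON string (not a real Wikimedia response) and flattens the nested fields into ordinary columns. Using the jsonlite package is an assumption here, since the lab may have relied on a different tool, and the field names `pages`, `id`, `title` and `meta` are hypothetical.

```r
library(jsonlite)

# Made-up JSON for illustration only; real API responses will differ
json_text <- '{
  "pages": [
    {"id": 1, "title": "Alpha", "meta": {"lang": "en"}},
    {"id": 2, "title": "Beta",  "meta": {"lang": "en"}}
  ]
}'

# fromJSON() turns the array of records into a data frame;
# the nested "meta" object becomes a data-frame column
parsed <- fromJSON(json_text)

# flatten() expands nested data-frame columns into regular columns
# (the nested field shows up as "meta.lang")
pages <- jsonlite::flatten(parsed$pages)

str(pages)
```

The same two steps (parse, then flatten) are a reasonable starting point for an API response, although real responses usually need extra work, such as picking the right list element or handling missing fields.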

🗓️ Day 03
(Wed 19 Jul)

🧑‍🏫 Lecture

Reshaping data for visualisation

💻 Lab

Data visualisation with ggplot2

📖 Revise

  • The best way to revise is to practise your creativity skills with the Camden crime dataset on your own. Can you come up with interesting questions to ask of the data? Can you answer them with ggplot2?

  • R graphics cookbook: practical recipes for visualizing data by Winston Chang (Chang 2018) (freely available online)

  • ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (Wickham 2016) (freely available online)

🗓️ Day 04
(Thu 20 Jul)

🧑‍🏫 Lecture

Interactive Dashboards

💻 Lab

Creating a shiny dashboard

🦸🏻‍♀️ Super tech support

Start working on your project

📣 Assignment Reveal

The requirements for your final project will be revealed during the lecture! Deadline: Mon 31 Jul, 11:59 p.m. BST.
(The deadline was originally Wed 2 Aug. We have to submit your grades to the Summer School by 3 Aug, so we need to move it forward a bit.)

🗓️ Day 05
(Fri 21 Jul)
There are no lectures or labs on Fridays!

📟 Slack support

  • Reach out on the public channels on Slack if you have any questions.
  • Enhance your skills by helping your peers on Slack.

🧑‍💼 Office Hours

There won’t be any office hours this week.

Week 03

🗓️ Day 01
(Mon 24 Jul)

🧑‍🏫 Lecture

Intro to databases

💻 Lab

Introduction to SQLite and dbplyr

🗓️ Day 02
(Tue 25 Jul)

🧑‍🏫 Lecture

More data reshaping
(inner, left and right joins, pivot_longer, pivot_wider and more)

💻 Lab

Data reshaping (joins and pivots)

🗓️ Day 03
(Wed 26 Jul)

🧑‍🏫 Lecture

Basic text mining

💻 Lab

Text mining with stringr (regex) and quanteda

🗓️ Day 04
(Thu 27 Jul)

🧑‍🏫 Lecture

Managing your data pipeline
(Project-oriented workflows, automation, continuous integration, containerisation, and more)

🦸🏻‍♀️ Super tech support

Get help with your project

🗓️ Day 05
(Fri 28 Jul)
There are no lectures or labs on Fridays!

📟 Slack support

  • Reach out on the public channels on Slack if you have any questions.
  • Enhance your skills by helping your peers on Slack.
  • (Jon will be monitoring the public Slack channels on Thursday afternoon/Friday morning)

🧑‍💼 Office Hours

We won’t be running office hours on Friday.

References

Cetinkaya-Rundel, Mine. 2023. “Data Transformation with Dplyr :: Cheatsheet.” RStudio. https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf.
Chang, Winston. 2018. R Graphics Cookbook: Practical Recipes for Visualizing Data. Second edition. Beijing ; Boston: O’Reilly. https://r-graphics.org/.
Grolemund, Garrett. 2014. Hands-on Programming with R. First edition. Sebastopol, CA: O’Reilly. https://rstudio-education.github.io/hopr/.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. New York, NY: Springer Science+Business Media, LLC. https://ggplot2-book.org/.
Wickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2023. “Writing SQL with Dbplyr: Why You Might Use Dbplyr Instead of Writing SQL Yourself.” Tutorial. Dbplyr Package in R. https://dbplyr.tidyverse.org/articles/sql.html.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol [CA]: O’Reilly. https://r4ds.had.co.nz/.
Wondrasek, James, Katharina Brunner, and Kirill Müller. 2020. “Introduction to DBI.” Software. DBI Package in R. https://dbi.r-dbi.org/articles/dbi.