LSE ME204 (2023)
Data Engineering for the Social World
Week 01
🗓️ Day 01
(Mon 10 Jul)
🧑🏫 Lecture
The Data Science Toolbox
💻 Lab
Recap of base R and tidyverse fundamentals
📖 Revise
Click to see if you’re caught up
- Ensure you have R and RStudio installed on your computer
- Read (Grolemund 2014, chap. 2) if you need a refresher on R basics
- Ensure you have installed tidyverse
- Download the dplyr cheatsheet (Cetinkaya-Rundel 2023) and keep it handy for reference.
- Revisit base R vs tidyverse syntax equivalence
- Skim the textbook references mentioned in the slides to find out more about the topics covered in the lecture.
- Take note of any concepts that might remain unclear to you even after the lab and bring them to the next lecture.
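If the base R vs tidyverse equivalence still feels abstract, here is a minimal sketch of the same task written both ways. The data frame `df` and its columns are made up purely for illustration; they do not come from the lecture or lab materials.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
df <- data.frame(
  name  = c("a", "b", "c"),
  value = c(1, 5, 10)
)

# Base R: subset rows and columns with [ , ], then add a column with $
base_result <- df[df$value > 2, c("name", "value")]
base_result$double <- base_result$value * 2

# tidyverse: the same steps expressed as a pipeline of dplyr verbs
tidy_result <- df %>%
  filter(value > 2) %>%
  select(name, value) %>%
  mutate(double = value * 2)
```

Both versions should keep the same two rows and compute the same `double` column; the main difference is readability once the chain of steps gets longer.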
🗓️ Day 02
(Tue 11 Jul)
🧑🏫 Lecture
Data types and common file formats
(CSV, JSON, XML, YAML, and more)
💻 Lab
Manipulating XML files
📖 Revise
Click to see if you’re caught up
Do a bit of self-checking:
Would you be able to explain in simple terms what the following dplyr functions do?
select()
filter()
arrange()
summarise()
mutate()
If the answer is no, refer to (Wickham and Grolemund 2016, chap. 5) (this book is available for free online). You can also refer to the dplyr documentation and the dplyr cheatsheet.
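As a quick self-test, the sketch below applies each of the five verbs to a tiny data frame. The `scores` data (students, groups and marks) is invented for illustration only; try predicting each result before running the line.

```r
library(dplyr)

# Hypothetical toy data, invented for this example
scores <- tibble(
  student = c("Ana", "Ben", "Cai", "Dee"),
  group   = c("A", "A", "B", "B"),
  mark    = c(62, 75, 58, 81)
)

picked  <- select(scores, student, mark)          # keep only the named columns
passed  <- filter(scores, mark >= 60)             # keep only rows meeting a condition
ordered <- arrange(scores, desc(mark))            # reorder rows, highest mark first
scaled  <- mutate(scores, mark_pct = mark / 100)  # add a new column from existing ones

# summarise() collapses rows into one row per group
by_group <- scores %>%
  group_by(group) %>%
  summarise(mean_mark = mean(mark))
```

If you can describe what each of `picked`, `passed`, `ordered`, `scaled` and `by_group` contains without printing it, you are in good shape.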
XML
- Revisit the code your class instructor shared on Part 2 of the lab (🧑🏻🏫 TEACHING MOMENT).
- Do you really get what’s going on? If not, ask questions on Slack! Don’t be shy, you’re likely not the only one who’s confused. We’ll go through the questions at the beginning of the next lecture.
- Revisit your own code for the rest of the lab. Try to figure out the reason behind each line of code.
🗓️ Day 03
(Wed 12 Jul)
🧑🏫 Lecture
The Internet and the Web
(Add HTTP, HTML, CSS, DOM, JS to your vocabulary)
💻 Lab
GitHub and Markdown
📖 Revise
Click here for suggestions of content
Check out this episode of the Babbage podcast from The Economist featuring an interview with the creator of the TCP/IP protocol. Learn about the fascinating journey of how the protocol came to life and gain insights into the relationship between AI and the Internet nowadays:
Check out this recent WIRED interview with Sir Tim Berners-Lee, the inventor of the World Wide Web:
🗓️ Day 04
(Thu 13 Jul)
🧑🏫 Lecture
Web Scraping
Time to collect data from the Web!
💻 Lab
Web Scraping with the rvest package
📣 Assignment Reveal
The midterm assignment will be revealed during the lecture! Deadline: Tue 18 Jul, 11:59 p.m. BST.
(Fri 14 Jul)
📟 Slack support
- Reach out on the public channels on Slack if you have any questions.
- Enhance your skills by helping your peers on Slack.
- (Jon will be monitoring the public Slack channels on Thursday afternoon/Friday morning)
🧑💼 Office Hours
- Attend office hours on Friday morning (from 10am to 12pm) if you need additional assistance.
- No need to book, but please be patient if there are other students ahead of you in the queue.
- A typical office hour session lasts 10-15 minutes.
📝 Assignment
- Work on your midterm assignment so you can enjoy the weekend!
Week 02
🗓️ Day 01
(Mon 17 Jul)
🧑🏫 Lecture
Neat functions and tidy data, testing and debugging
💻 Lab
The art of refactoring code
📝 Assignment
- Complete your midterm assignment if you haven’t done so already.
🗓️ Day 02
(Tue 18 Jul)
🧑🏫 Lecture
API queries and JSON tricks
💻 Lab
Working with JSON data
📝 Assignment
- Today is the deadline for your midterm assignment!
📖 Revise
Click to see if you’re caught up
The best way to learn how to handle API queries and JSON data is to practice! If you’ve done the exercises in the morning and completed everything in the lab later, you should be in good shape. If you’re still unsure about some of the concepts, here is an extra exercise to keep you busy:
- Using the Core REST API of Wikimedia, identify the API endpoint that allows you to search for articles by content.
- Use this API endpoint to search for articles in the English Wikipedia related to ‘London School of Economics’ and save the output as a tidy data frame.
- Using the relevant API endpoint, retrieve the content of each of these articles and save it as a tidy data frame. Pay close attention to the source column of the output. What does it tell you?
(We don’t have solutions to this exercise, but feel free to reach out to us to get feedback on your code!)
🗓️ Day 03
(Wed 19 Jul)
🧑🏫 Lecture
Reshaping data for visualisation
💻 Lab
Data visualisation with ggplot2
📖 Revise
Click to see if you’re caught up
The best way to revise is to explore the Camden crime dataset on your own. Can you come up with interesting questions to ask of the data? Can you answer them with ggplot2?
- R graphics cookbook: practical recipes for visualizing data by Winston Chang (Chang 2018) (freely available online)
- ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham (Hadley 2016) (freely available online)
🗓️ Day 04
(Thu 20 Jul)
🧑🏫 Lecture
Interactive Dashboards
💻 Lab
Creating a shiny dashboard
🦸🏻♀️ Super tech support
Start working on your project
📣 Assignment Reveal
The requirements for your final project will be revealed during the lecture! Deadline: ~~Wed 2 Aug~~ Mon 31 July, 11:59 p.m. BST.
(We have to submit your grades to the Summer School by 3rd August, so we need to move the deadline forward a bit)
(Fri 21 Jul)
📟 Slack support
- Reach out on the public channels on Slack if you have any questions.
- Enhance your skills by helping your peers on Slack.
🧑💼 Office Hours
There won’t be any office hours this week.
Week 03
🗓️ Day 01
(Mon 24 Jul)
🧑🏫 Lecture
Intro to databases
💻 Lab
Introduction to SQLite and dbplyr
📖 Revise
Click to see if you’re caught up
- Read more about supported DBMSs and the DBI package (Wondrasek, Brunner, and Müller 2020)
- Why use dbplyr instead of pure SQL? (Wickham, Girlich, and Ruiz 2023)
- When should I use SQLite?
- SQLite is faster than your filesystem
- Slow SQLite querying? Read this blogpost about SQLite performance tuning
- What are the limits of SQLite?
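To get a feel for what dbplyr does on top of DBI, here is a minimal sketch that assumes the RSQLite and dbplyr packages are installed. It uses an in-memory SQLite database and R's built-in `mtcars` dataset rather than the lab data, so the table name and query are illustrative only.

```r
library(DBI)
library(dplyr)

# Create a throwaway SQLite database that lives in memory
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into the database as a table
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference: no rows are pulled into R yet
cars_db <- tbl(con, "mtcars")

# Write dplyr code; dbplyr translates it to SQL behind the scenes
query <- cars_db %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(query)         # inspect the SQL that dbplyr generated
result <- collect(query)  # run the query and bring the result into R

dbDisconnect(con)
```

The design point is that the filtering and averaging happen inside the database; only the one-row result travels back to R when you call `collect()`.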
🗓️ Day 02
(Tue 25 Jul)
🧑🏫 Lecture
More data reshaping
(inner, left and right joins, pivot_longer, pivot_wider and more)
💻 Lab
Data reshaping (joins and pivots)
📖 Revise
Click to see related resources
- (Wickham and Grolemund 2016, chap. 12.3) – Pivoting section of the R for Data Science book
- (Wickham and Grolemund 2016, chap. 13) – Relational Data chapter of the R for Data Science book
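The pivots and joins covered today can be sketched on two tiny tables. All values below are made up for illustration (they are not real borough statistics), and the table and column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical "wide" table: one column per year
wide <- tibble(
  borough = c("Camden", "Islington"),
  `2021`  = c(10, 20),
  `2022`  = c(12, 18)
)

# pivot_longer(): one row per borough-year combination
long <- pivot_longer(
  wide,
  cols      = c(`2021`, `2022`),
  names_to  = "year",
  values_to = "count"
)

# pivot_wider(): back to one column per year
back <- pivot_wider(long, names_from = year, values_from = count)

# A second hypothetical table that only partly overlaps with the first
population <- tibble(
  borough    = c("Camden", "Hackney"),
  population = c(100, 200)
)

left  <- left_join(wide, population, by = "borough")   # all rows of wide; NA where no match
inner <- inner_join(wide, population, by = "borough")  # only boroughs present in both tables
right <- right_join(wide, population, by = "borough")  # all rows of population
```

Comparing the number of rows in `left`, `inner` and `right` is a quick way to check you understand which table each join treats as the reference.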
🗓️ Day 03
(Wed 26 Jul)
🧑🏫 Lecture
Basic text mining
💻 Lab
Text mining with stringr (regex) and quanteda
🗓️ Day 04
(Thu 27 Jul)
🧑🏫 Lecture
Managing your data pipeline
(Project-oriented workflows, automation, continuous integration, containerisation, and more)
🦸🏻♀️ Super tech support
Get help with your project
(Fri 28 Jul)
📟 Slack support
- Reach out on the public channels on Slack if you have any questions.
- Enhance your skills by helping your peers on Slack.
- (Jon will be monitoring the public Slack channels on Thursday afternoon/Friday morning)
🧑💼 Office Hours
- ~~If you are still in London, you can attend office hours on Friday afternoon (2 p.m. - 5 p.m.)~~ (We won’t be running office hours on Friday.)