🗣️ Week 08 Lecture

More Data Reshaping Techniques and Introduction to Databases

Author

Dr Jon Cardoso-Silva

Published

12 March 2025

🥅 Learning Goals
By the end of this lecture, you should be able to: i) Complete your understanding of data reshaping techniques with explode() and melt(), ii) Understand the fundamentals of relational databases, iii) Store and retrieve data using SQLite, iv) Design database schemas appropriate for your Reddit data analysis.

Last updated: 13 March 2025, 15:45


📍 Time and Location: Thursday, 13 March 2025 from 4-6 pm at MAR.1.04

This week’s lecture will complete our exploration of data reshaping techniques and introduce you to databases. Both will be essential for your ✍️ Mini-Project 2. We will build directly on last week’s techniques while adding new tools to your data science toolkit.

📓 Interactive Puzzle-Based Learning Continues

Following the success of last week’s format, we will continue with our puzzle-based learning approach:

  1. New Puzzles, Same Format: We will tackle two new puzzles focused on techniques that mirror the exact challenges you will face when working with Reddit data in your ✍️ Mini-Project 2.

    • Puzzle 3 (about explode()): “The Reddit Tags Challenge”

    • Puzzle 4 (about melt()): “The Reddit Engagement Metrics Challenge”

  2. Competitive Element Continues: The tote bag competition continues! Teams with the best solutions will earn points towards winning the Data Science Institute’s tote bags. 🎁

    For this reason, I will not share the puzzles with you until the lecture starts.

📋 Preparation

Before the lecture

  • Review the techniques we covered in Week 07 (Split-Apply-Combine and JSON Normalisation)
  • Ensure you can access Nuvolos
  • Bring your laptop to participate in the interactive puzzles
  • Get a head start on Mini-Project 2: Consider creating a Reddit account and setting up API credentials (instructions on Moodle)

🎬 Lecture Material

The lecture will be structured in two main parts:

Part 1: Completing Data Reshaping Techniques

We will finish our exploration of data reshaping techniques with two powerful pandas functions:

  1. DataFrame.explode(): Expanding list elements into separate rows.

  2. DataFrame.melt(): Transforming wide data into long format.
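To make the two functions concrete before the lecture, here is a minimal sketch on made-up data. It is not one of the lecture puzzles: the `posts` DataFrame, its column names and its values are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the lecture's dataset):
# each post has a list of tags and two engagement metrics.
posts = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "tags": [["python", "pandas"], ["sql"]],
    "upvotes": [10, 3],
    "num_comments": [4, 1],
})

# explode(): one row per list element; post_id is repeated for each tag.
tags_long = posts[["post_id", "tags"]].explode("tags").reset_index(drop=True)

# melt(): wide -> long; each metric column becomes a (metric, value) pair.
metrics_long = posts.melt(
    id_vars="post_id",
    value_vars=["upvotes", "num_comments"],
    var_name="metric",
    value_name="value",
)

print(tags_long)
print(metrics_long)
```

After `explode()`, post `a1` occupies two rows (one per tag). After `melt()`, each post has one row per metric, which is usually the shape that grouping and plotting functions expect.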

Part 2: Introduction to Databases

After a short break, we will dive into databases:

  1. Database Fundamentals: Understanding relational databases and their advantages. Why use databases instead of CSV/JSON files?

  2. Working with SQLite: A lightweight database perfect for your projects. Creating database connections, storing pandas DataFrames in SQLite, and querying data from SQLite.

  3. Database Design for Reddit Data: Practical examples of how to structure your data. Creating appropriate tables for posts, comments, and subreddits, and establishing relationships between tables.
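As a rough preview of the three topics above, the sketch below creates a small SQLite database, stores pandas DataFrames in it and queries across tables. It is an illustration under assumptions, not the schema we will use in the lecture: the table names, column names and sample rows are hypothetical, and the database lives in memory so that no file is created.

```python
import sqlite3
import pandas as pd

# In-memory database for illustration; a file path such as "reddit.db" works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical schema: one table per entity, linked by foreign keys.
conn.executescript("""
CREATE TABLE subreddits (
    subreddit_id TEXT PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE posts (
    post_id      TEXT PRIMARY KEY,
    subreddit_id TEXT NOT NULL REFERENCES subreddits(subreddit_id),
    title        TEXT,
    score        INTEGER
);
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    post_id    TEXT NOT NULL REFERENCES posts(post_id),
    body       TEXT
);
""")

# Made-up rows standing in for data collected from an API.
subreddits = pd.DataFrame({"subreddit_id": ["s1"], "name": ["datascience"]})
posts = pd.DataFrame({
    "post_id": ["p1", "p2"],
    "subreddit_id": ["s1", "s1"],
    "title": ["Intro to melt", "SQLite tips"],
    "score": [12, 7],
})
comments = pd.DataFrame({
    "comment_id": ["c1", "c2", "c3"],
    "post_id": ["p1", "p1", "p2"],
    "body": ["Nice!", "Thanks", "Useful"],
})

# Store the DataFrames; parent tables go first so the foreign keys are satisfied.
subreddits.to_sql("subreddits", conn, if_exists="append", index=False)
posts.to_sql("posts", conn, if_exists="append", index=False)
comments.to_sql("comments", conn, if_exists="append", index=False)

# Query across the relationships: number of comments per post.
query = """
SELECT s.name AS subreddit, p.title, COUNT(c.comment_id) AS n_comments
FROM posts AS p
JOIN subreddits AS s ON s.subreddit_id = p.subreddit_id
LEFT JOIN comments AS c ON c.post_id = p.post_id
GROUP BY p.post_id
ORDER BY n_comments DESC, p.post_id
"""
result = pd.read_sql_query(query, conn)
print(result)

conn.close()
```

Keeping posts, comments and subreddits in separate tables means each fact is stored once, and a JOIN recombines them only when a question needs it. That is one of the main advantages over a single wide CSV file.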

📥 Lecture Notebooks

The lecture notebooks will be available here and on Nuvolos at the start of the lecture.

Download the notebooks for today’s lecture:

📋 MINI-PROJECT 2 CONNECTION:

Today’s puzzles are specifically designed to prepare you for the Reddit Engagement Analysis project:

  • The techniques we will cover are exactly what you will need to process and analyse Reddit API data
  • The database skills will help you efficiently store and query the data you collect
  • The visualisation approaches will directly translate to creating compelling visuals for your project report

📥 Post-Lecture Actions

  1. Review the Jupyter notebooks from today’s lecture
  2. Set up your Reddit API access if you haven’t already (instructions on Moodle)
  3. Start exploring potential subreddits for your ✍️ Mini-Project 2
  4. Practise with the sample solutions notebook
  5. Use the #help channel on Slack if you need clarification or assistance

📚 Additional Resources