๐Ÿ—ฃ๏ธ Week 08 Lecture

More Data Reshaping Techniques and Introduction to Databases

Author

Dr Jon Cardoso-Silva

Published

12 March 2025

🥅 Learning Goals

By the end of this lecture, you should be able to:

  • Complete your understanding of data reshaping techniques with explode() and melt()
  • Understand the fundamentals of relational databases
  • Store and retrieve data using SQLite
  • Design database schemas appropriate for your Reddit data analysis
Response rate: 8% (5 out of 63 DS105W students have completed the course evaluation survey)

Could you assist us in achieving the 75% mark? Course evaluation surveys are extremely important (https://moodle.lse.ac.uk/mod/forum/discuss.php?d=385499#course-evaluation-survey) for us, the instructors of DS105.

Click here to provide your official feedback on this course. 💡 Note: Please assess all the instructors you have interacted with (and Jon also counts as your teacher!).

Last updated: 13 March 2025, 15:45

๐Ÿ“Time and Location: Thursday, 13 March 2025 from 4-6 pm at MAR.1.04

This week’s lecture will complete our exploration of data reshaping techniques and introduce you to databases. Both will be essential for your ✍️ Mini-Project 2. We will build directly on last week’s techniques while adding new tools to your data science toolkit.

📓 Interactive Puzzle-Based Learning Continues

Following the success of last week’s format, we will continue with our puzzle-based learning approach:

  1. New Puzzles, Same Format: We will tackle two new puzzles focused on techniques that mirror the exact challenges you will face when working with Reddit data in your ✍️ Mini-Project 2.

    • Puzzle 3 (about explode()): “The Reddit Tags Challenge”

    • Puzzle 4 (about melt()): “The Reddit Engagement Metrics Challenge”

  2. Competitive Element Continues: The tote bag competition continues! Teams with the best solutions will earn points towards winning Data Science Institute tote bags. 🎁

    For this reason, I will not share the puzzles with you until the lecture starts.

📋 Preparation

Before the lecture

  • Review the techniques we covered in Week 07 (Split-Apply-Combine and JSON Normalisation)
  • Ensure you can access Nuvolos
  • Bring your laptop to participate in the interactive puzzles
  • Get a head start on Mini-Project 2: Consider creating a Reddit account and setting up API credentials (instructions on Moodle)

🎬 Lecture Material

The lecture will be structured in two main parts:

Part 1: Completing Data Reshaping Techniques

We will finish our exploration of data reshaping techniques with two powerful pandas functions:

  1. DataFrame.explode(): Expanding list elements into separate rows.

  2. DataFrame.melt(): Transforming wide data into long format.
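As a rough illustration of what these two functions do, here is a minimal sketch on toy data. The column names (`post_id`, `tags`, `upvotes`, `comments`) and the values are made up for this example; they are not the lecture puzzles, and real Reddit API data will look different.

```python
import pandas as pd

# Hypothetical toy data: each post carries a list of tags.
posts = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "tags": [["python", "pandas"], ["sql"]],
})

# explode(): one row per list element; the other columns are repeated.
tags_long = posts.explode("tags", ignore_index=True)

# Hypothetical wide data: one column per engagement metric.
metrics = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "upvotes": [120, 45],
    "comments": [30, 12],
})

# melt(): wide to long, giving one row per (post, metric) pair.
metrics_long = metrics.melt(
    id_vars="post_id",
    value_vars=["upvotes", "comments"],
    var_name="metric",
    value_name="count",
)

print(tags_long)
print(metrics_long)
```

In this sketch, `tags_long` has three rows (two for post `a1`, one for `b2`), and `metrics_long` has four rows, one for each combination of post and metric.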

Part 2: Introduction to Databases

After a short break, we will dive into databases:

  1. Database Fundamentals: Understanding relational databases and their advantages. Why use databases instead of CSV/JSON files?

  2. Working with SQLite: A lightweight database perfect for your projects. Creating database connections, storing pandas DataFrames in SQLite, and querying data from SQLite.

  3. Database Design for Reddit Data: Practical examples of how to structure your data. Creating appropriate tables for posts, comments, and subreddits, and establishing relationships between tables.
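The database steps listed above could look something like the sketch below. It is one possible design under stated assumptions, not the schema used in the lecture notebooks: the table layout, column names, and sample rows are all hypothetical, and an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

import pandas as pd

# In-memory database for illustration; passing a file path instead would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to

# Hypothetical schema: three related tables linked by foreign keys.
conn.executescript("""
CREATE TABLE subreddits (
    subreddit_id TEXT PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE posts (
    post_id      TEXT PRIMARY KEY,
    subreddit_id TEXT NOT NULL REFERENCES subreddits(subreddit_id),
    title        TEXT,
    score        INTEGER
);
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    post_id    TEXT NOT NULL REFERENCES posts(post_id),
    body       TEXT
);
""")

# Made-up sample rows, held as pandas DataFrames.
subreddits = pd.DataFrame({"subreddit_id": ["t5_1"], "name": ["datascience"]})
posts = pd.DataFrame({
    "post_id": ["a1", "b2"],
    "subreddit_id": ["t5_1", "t5_1"],
    "title": ["First post", "Second post"],
    "score": [120, 45],
})
comments = pd.DataFrame({
    "comment_id": ["c1", "c2", "c3"],
    "post_id": ["a1", "a1", "b2"],
    "body": ["Nice", "Thanks", "Agreed"],
})

# Store the DataFrames; parent tables go first so the foreign keys resolve.
for name, df in [("subreddits", subreddits), ("posts", posts), ("comments", comments)]:
    df.to_sql(name, conn, if_exists="append", index=False)

# Query across related tables and read the result back into pandas.
query = """
SELECT p.post_id, p.title, COUNT(c.comment_id) AS n_comments
FROM posts AS p
LEFT JOIN comments AS c ON c.post_id = p.post_id
GROUP BY p.post_id, p.title
ORDER BY p.post_id
"""
result = pd.read_sql_query(query, conn)
print(result)

conn.close()
```

Using `if_exists="append"` keeps the table definitions created by `CREATE TABLE`, including the keys; `if_exists="replace"` would drop them and let pandas recreate the table without those constraints.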

📥 Lecture Notebooks

The lecture notebooks will be available here and on Nuvolos at the start of the lecture.

Download the notebooks for today’s lecture:

📋 MINI-PROJECT 2 CONNECTION:

Today’s puzzles are specifically designed to prepare you for the Reddit Engagement Analysis project:

  • The techniques we will cover are exactly what you will need to process and analyse Reddit API data
  • The database skills will help you efficiently store and query the data you collect
  • The visualisation approaches will directly translate to creating compelling visuals for your project report

📥 Post-Lecture Actions

  1. Review the Jupyter notebooks from today’s lecture
  2. Set up your Reddit API access if you haven’t already (instructions on Moodle)
  3. Start exploring potential subreddits for your ✍️ Mini-Project 2
  4. Practise with the sample solutions notebook
  5. Use the #help channel on Slack if you need clarification or assistance

📚 Additional Resources