ℹ️ Course Information

All you need to know about DS205 (2024/25)

Author
Image created with the AI embedded in MS Designer using the prompt 'abstract green and blue icon depicting the advanced stages of data wrangling, API design, and scalable pipelines for sustainability-focused data engineering.'

Welcome to LSE DS205 - Advanced Data Manipulation, an LSE Data Science Institute course. This is where you’ll learn to master advanced data engineering techniques and tackle real-world challenges in collaboration with the Transition Pathway Initiative Centre.


What is DS205 about?

DS205 is designed to advance your data manipulation and engineering skills to a professional-grade level. The course emphasises automation, efficiency, and scalability in data workflows, while maintaining a focus on ethical and practical implications in real-world applications.

You will engage in activities like:

  • Building APIs for effective data sharing.
  • Developing web scrapers to gather data when APIs are unavailable.
  • Cleaning and structuring complex datasets.
  • Applying modern NLP tools and data processing pipelines.

🥅 Intended Learning Outcomes

By the end of this course, you should be able to:

  • Use pandas to automate and optimise data cleaning and data processing workflows.
  • Build APIs using FastAPI for structured data retrieval.
  • Develop scalable web scraping workflows with the Scrapy and Selenium packages.
  • Proficiently use GitHub workflows for professional collaboration and version control.
  • Apply NLP techniques for retrieval of information from text data.
  • Apply appropriate pre-trained deep learning models from the HuggingFace library to unstructured data.
  • Collaborate effectively on shared codebases and contribute to projects with a real-world impact.

👥 Our Team

Name: Dr Jon Cardoso-Silva
Links: LSE, GitHub, LinkedIn, 📧
Role: Assistant Professor (Education)
LSE Data Science Institute
Office Hours: 🗓️ Thursdays, 11am-1pm (StudentHub)
Current Focus: Leading DS105/DS205 development and researching GenAI in education

COURSE LEADER | LECTURER

Alexander Soldatkin
DPhil Candidate
Oxford School of Global and Area Studies
📧

CLASS TEACHER

Dr Barry Ledeatte
AI Learning Consultant
Also teaches DS105W
📧

TEACHING SUPPORT

Sara Luxmoore
Research Officer
LSE Data Science Institute and LSE Cities
📧

TEACHING SUPPORT

Terry Zhou
3rd-Year BSc in Politics and Data Science
Undergraduate Research Assistant at DSI 1
(tz1211?)

CODE MAINTAINER

Kevin Kittoe
Teaching & Assessment Administrator
📧

Handles course access, submissions, extensions and admin queries.

ADMIN

Our Industry Partners

The Transition Pathway Initiative Centre evaluates companies’ readiness for transition to a low-carbon economy. Their work involves extensive analysis of messy, unstructured data.

👉🏻 Everything you produce in this course has the potential to help TPI automate their data processing workflows.

Key Collaborators:

Valentin Jahn
Deputy Director Research & Operations

Sylvan Lutz
Policy Officer – ASCOR Analyst

📟 Communication Channels

Throughout the Winter Term, beyond regular class sessions, you can contact teaching staff via Slack, office hours, or dedicated support sessions:

  • Slack: Our primary hub for daily course discussions, resource sharing, and quick questions in #help. We prioritise questions posted to public channels over direct messages.

  • 🆘 Weekly Support Sessions: Every Wednesday, 12:00 pm - 2:00 pm in person at the DSI Visualisation Studio (COL.1.06). Led by Sara Luxmoore. No booking required—just drop in for help with exercises or technical issues.

  • 🧑🏻‍💼 Office Hours: Book 15-minute slots via StudentHub:

    • Jon: Thursdays, 11:00 am - 1:00 pm
    • Alex: Wednesdays, 3:00 pm - 5:00 pm
    • Barry: Fridays, 2:00 pm - 4:00 pm
  • 📧 Email: For formal requests (extensions, class changes), contact (managed by Kevin).

Contact Hours

Here’s our weekly schedule of support and teaching activities:

| Day       | Activity         | Time          | Staff                | Type                    |
|-----------|------------------|---------------|----------------------|-------------------------|
| Monday    | Lecture          | 10:00 - 12:00 | Dr Jon Cardoso-Silva | 🗣️ In-person            |
| Monday    | Slack Support    | 13:30 - 15:00 | Dr Jon Cardoso-Silva | 💬 Online               |
| Tuesday   | Slack Support    | 11:30 - 12:30 | Dr Jon Cardoso-Silva | 💬 Online               |
| Tuesday   | Labs             | Afternoon     | Alex Soldatkin       | 💻 In-person            |
| Wednesday | Drop-in Sessions | 12:00 - 14:00 | Sara Luxmoore        | 🛟 In-person (COL.1.06) |
| Wednesday | Office Hours     | 15:00 - 17:00 | Alex Soldatkin       | 👥 In-person            |
| Thursday  | Office Hours     | 11:00 - 13:00 | Dr Jon Cardoso-Silva | 👥 In-person            |
| Friday    | Office Hours     | 14:00 - 16:00 | Dr Barry Ledeatte    | 👥 In-person            |
| Friday    | Slack Support    | 12:00 - 13:30 | Dr Barry Ledeatte    | 💬 Online               |

Key to Icons:

  • 👥 In-person: Face-to-face interaction in designated office space
  • 💬 Online: Support via Slack channels
  • 🛟 Drop-in: Flexible support sessions—no booking required
  • 🧑‍🏫 In-person Lecture / 💻 In-person Labs: Formal teaching sessions

⌚️ Class Details

📅 Lecture

  • ⏰ Monday: 10:00 - 12:00
  • 📍 KSW.1.01
  • 👤 Dr Jon Cardoso-Silva

📅 Class Group 1

  • ⏰ Tuesday: 15:00 - 16:30
  • 📍 OLD.1.20
  • 👤 Alex Soldatkin

📅 Class Group 2

  • ⏰ Tuesday: 16:30 - 18:00
  • 📍 OLD.1.20
  • 👤 Alex Soldatkin

Teaching Format: We meet weekly for lectures on Monday mornings, followed by hands-on lab sessions on Tuesday afternoons. These sessions run throughout term, except for Reading Week (Week 6).

✍️ Assessment Structure

“How will I be assessed in this course?”

Your grade in this course consists of two main components: COURSEWORK (worth 60%) and GROUP PROJECT (worth 40%). Only these two components show up in your student record, but in practice they are made up of several smaller parts:

  • 20% Individual ✍️ Problem Set 1: Web Scraping & API Development
    Release: ~Week 04
    Due: 5 March 2025, 8pm
  • 40% Individual ✍️ Problem Set 2: RAG System Implementation
    Release: ~Week 06
    Due: 26 March 2025, 8pm
  • 40% Group Work 👥 Final Project: TPI Data Pipeline Development
    Details: Spring Term
    Due: May/June 2025 (TBC)
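
To see how the smaller parts roll up into the two recorded components, here is a minimal Python sketch. The weights come from the breakdown above; the function names and the example marks are hypothetical, and the simple weighted average is an assumption for illustration only (it does not account for any rounding or classification rules the School may apply).

```python
# Weights from the assessment breakdown (percent of the final grade).
WEIGHTS = {
    "problem_set_1": 20,  # individual, counts towards COURSEWORK
    "problem_set_2": 40,  # individual, counts towards COURSEWORK
    "final_project": 40,  # group work, counts towards GROUP PROJECT
}

# Which parts make up each component on the student record.
COMPONENTS = {
    "COURSEWORK": ["problem_set_1", "problem_set_2"],
    "GROUP PROJECT": ["final_project"],
}


def component_weights() -> dict[str, int]:
    """Sum the weights of the parts that make up each recorded component."""
    return {
        name: sum(WEIGHTS[part] for part in parts)
        for name, parts in COMPONENTS.items()
    }


def final_mark(marks: dict[str, float]) -> float:
    """Weighted average of marks on a 0-100 scale (assumed, not official)."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[part] * marks[part] for part in WEIGHTS) / total_weight


# Hypothetical marks, for illustration only.
example_marks = {"problem_set_1": 70, "problem_set_2": 65, "final_project": 60}

print(component_weights())
print(final_mark(example_marks))
```

The first line printed confirms that the two problem sets add up to the 60% COURSEWORK component, with the remaining 40% coming from the GROUP PROJECT.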

Weekly formative exercises in Weeks 01-04 will prepare you for the summative assessments. These include hands-on practice with GitHub workflows, API development, and web scraping techniques.

🤖 The use of Generative AI in this course

By students

In this course, we adopt Position 3: Full authorised use of generative AI, as per the LSE’s positions on Generative AI.

This means you can make unrestricted use of Generative AI (GenAI) tools like ChatGPT, GitHub Copilot, Grammarly AI, etc. for all aspects of the course, including assessments.

We view GenAI as a double-edged sword when it comes to learning. We find that students who know the course content well make resourceful use of GenAI and learn a lot from it. However, students who are behind or not engaging with the material can easily over-rely on GenAI (without realising it) and miss out on the learning experience. This typically becomes evident when we mark the assessments, as the style of the submitted code deviates significantly from the coding style used in the course.

By teachers

We use GenAI extensively to clean up the format of lecture notes and the layout of the course materials. AI chatbots also help us brainstorm ideas for a lecture. For example, we might give Claude or ChatGPT the lecture notes from previous years, along with the learning outcomes of the course and notes on what worked and what didn’t last time, and ask it to produce an improved version of the lecture notes.

This helps us focus on the content and pedagogy, while the AI takes care of the formatting. GenAI serves as a tool to bring existing thoughts to life, not as a lazy way to generate new content.

When marking:

When marking assignments, we usually find GenAI productive for providing more detailed and useful feedback to our students, based on our free-text notes.

Here’s how we typically structure feedback:

  1. Initial Feedback: We write a ‘brain dump’ of notes while reviewing your work (e.g., running your code, evaluating edge cases). This is in free text format.
  2. Structuring: We give Claude or ChatGPT a list of the common mistakes we expect, a feedback template, and snippets of solutions and links. We then pass in our free-text notes, and the tool returns feedback in the Markdown format we expect.
  3. Final Review: We read and further review the formatted notes to ensure feedback aligns with the course standards and is actionable.

In other words, we use large language models for what they do best: language formatting and structuring.

These tools help save time on structuring feedback while maintaining the personal touch and pedagogical value.
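
As a rough illustration of the structuring step described above, the sketch below assembles a single prompt from free-text notes, a list of common mistakes, and a feedback template. It is not the teaching team's actual tooling: the function name, the prompt wording, and the example inputs are all hypothetical, and no call to Claude or ChatGPT is made here.

```python
def build_feedback_prompt(
    notes: str, common_mistakes: list[str], template: str
) -> str:
    """Combine a marker's raw notes with shared context into one prompt."""
    # Render the common mistakes as a Markdown bullet list.
    mistakes = "\n".join(f"- {mistake}" for mistake in common_mistakes)
    return (
        "Rewrite the marker's notes as feedback in Markdown.\n"
        "Use only what the notes say; do not add new judgements.\n\n"
        f"## Common mistakes to look for\n{mistakes}\n\n"
        f"## Feedback template\n{template}\n\n"
        f"## Marker's notes\n{notes}\n"
    )


# Hypothetical inputs, for illustration only.
prompt = build_feedback_prompt(
    notes="Scraper works but breaks on empty pages; README is clear.",
    common_mistakes=["No handling of missing data", "Hard-coded file paths"],
    template="### What went well\n### What to improve",
)
print(prompt)
```

Keeping the marker's notes as the only source of judgement in the prompt mirrors the point made above: the model is asked to format and structure, and the final review remains with the marker.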

Footnotes

  1. Terry has worked at the DSI for the past two years as an undergraduate research assistant and has experience in web scraping, API development and RAG systems. He will help with code review and with managing the integration with TPI.↩︎