πŸ—“οΈ Week 01
Welcome to the Course + Food Data Exploration

DS205 – Advanced Data Manipulation

19 Jan 2026

Welcome to DS205!

Course Lead

Dr Jon Cardoso-Silva πŸ“§
@jonjoncardoso
Assistant Professor
LSE Data Science Institute


Current Focus:

  • PhD in Computer Science
  • Experienced in software engineering, data science and data engineering
  • Leading DS105 and DS205 course development
  • Investigating the impact of GenAI impact on higher education ( GENIAL project)

Office Hours:
Wednesdays, 2:00 pm - 5:00 pm
Book via StudentHub

Recent Recognition:
LSESU Teaching Award for Feedback & Communication (2023)

Teaching Support

Dr Barry Ledeatte πŸ“§
AI Learning Consultant
Also teaches DS101A

CLASS TEACHER

Administrative Support

Kevin Kittoe
Teaching & Assessment Administrator (DSI)

ADMINISTRATIVE SUPPORT

Contact Kevin (πŸ“§ DSI.ug@lse.ac.uk) for:

  • Course access issues
  • Assignment submissions
  • Extension requests
  • Administrative queries

Key Information:

  • All extension requests must follow LSE’s extension policy
  • Email response time: 24-48 hours working days
  • Include β€˜[DS205]’ in email subject lines

What can you expect to learn in DS205?

Why this course exists:

  • To incorporate more advanced concepts that could not fit in DS105.
  • To build production-ready solutions for real-world problems.
  • Prepare you for professional-level software development for data manipulation.

πŸ₯… Intended Learning Outcomes

By the end of this course, you should be able to:

  • Use appropriate Python packages (numpy, scipy, pandas, etc.) to automate and optimise data cleaning and data processing workflows.
  • Build APIs using FastAPI for structured data retrieval.
  • Develop scalable web scraping workflows with the Scrapy and Selenium packages.
  • Proficiently use GitHub workflows for professional collaboration and version control.
  • Apply NLP techniques for retrieval of information from text data.
  • Apply appropriate pre-trained deep learning models from the HuggingFace library to unstructured data.
  • Collaborate effectively on shared codebases and contribute to projects with a real-world impact.

✍️ Assessment Structure

20% Individual ✍️ Problem Set 1:
Web Scraping &
API Development
Release: Week 02
Due: ⏰ Week 06
(4 weeks!)
40% Individual ✍️ Problem Set 2:
RAG System Implementation
Release: ~Week 07
Due: ⏰ Week 10
(~20 days)
40% Group Work πŸ‘₯ Final Project:
RAG System Development
πŸ“¦ a real capstone project you will develop with/for the TPI Centre
Release: Week 11
Due: ⏰ 21 May 2026
(~2 months!)

Build as you learn: In the πŸ’» Weekly Labs from W02-W04, your goal will be to adapt what you would have learned in the Monday lecture, only to a different data source.

πŸ’Ό What’s at Stake When You Write Code?

Think of your code as a β€˜public good’ rather than a personal project.

Throughout this course, you’ll write code knowing that someone else will need to understand it, use it, and build upon it. This is to mirror real-world software development of data products and data pipelines where code is read far more often than it’s written.

Aim for code that is:

  • Neat and concise
  • Self-explanatory
  • Well-documented (when not self-explanatory)

All of your assignments (even individual ones) will have a peer component

You’ll experience this cycle repeatedly:

  1. You will write code for yourself, knowing an unknown peer will inherit it (I’ll randomly allocate pairs)
  2. You will inherit code from a peer and complete their work
  3. You will learn in practice what makes code maintainable through direct experience
  4. We write and update a CODE_OF_CONDUCT throughout the course to absorb these lessons and make them explicit.

What kind of data will we be working with?

This year, we’re exploring two thematic arcs:

πŸ₯— Weeks 01–05:
Real Food vs
Ultra-Processed Food
- Start with familiar, accessible data from an API (Open Food Facts)
- Collect unstructured data from the Web (web scraping principles)
- Write data pipelines that are reproducible and well-documented.

🌍 Weeks 07–11:
Food Producers’ Corporate Sustainability Assessments
- Progress to highly unstructured data (corporate statements, reports, etc.)
- Extract relevant information from unstructured data using NLP techniques
- Work with data used and produced by our collaborators at the TPI Centre.

Pre-Requisites

To take this course, you must have taken DS105, ST101, EC1B1 or equivalent (for General Course students)

I expect you to already know:

  • Python data types
    (int, float, str, bool, list, dict)
  • Python collections (list, dict)
  • Flow control (if, for, while)
  • Custom functions (def)
  • Essential data manipulation with pandas
    (indexing, slicing, filtering, grouping, aggregating, some reshaping)
  • How to use the Terminal/Command Line
  • How to use Jupyter Notebooks

What would be great if you already knew:

(but we will cover them in the course)

  • The requests package to collect data from the Internet
  • The essentials of Git/GitHub for version control
    (repository, commit, push, pull, clone, branch, merge, some conflict resolution, etc.)

πŸ“‘ Key Information

πŸ“Ÿ Communication

  • Slack is our main point of contact. The invitation link will be available on Moodle.
  • πŸ“§ Email: Reserved for formal requests (extensions, appeals)
  • πŸ‘₯ Office Hours: Book via StudentHub
  • πŸ†˜ Drop-in Support: See course website for details

Don’t like your laptop for coding?

We have a dedicated cloud environment on Nuvolos
Visit the Nuvolos - First Time Access to learn how to get access to the DS205 environment.



πŸ““ Check the full syllabus

Read the syllabus for week-by-week information on how we will cover the course content and assessments.

Coffee Break β˜•

After the break:

  • A exploratory data analysis (EDA) of food data
  • Systematic data inspection (live coding demo)
  • Setting up a GitHub repository for your course notes
  • How we will use β€œvibe coding” tools in this course
  • What awaits you in the πŸ’» W01 lab

Setting up your coding environment

We will be writing a lot of code throughout this course and we will be using VS Code and Jupyter notebooks for most of our work. We have a dedicated Nuvolos workspace for you to use. However, if you prefer to work locally on your own machine, you will have to make sure you have a few tools installed.

Development Environment

You have two options for your coding environment:

Option 1: Nuvolos Platform
(Strongly recommended)

🎯 ACTION POINT:

Access our Nuvolos - First Time guide.

This cloud environment comes pre-configured with all required tools. It also includes an AI code editor: GitHub Copilot (similar to OpenAI’s Codex, Cursor, Claude Code, and Google’s Antigravity).

Option 2: Local Setup
(Prone to bugs we won’t be able to help with)

You will need to install the following tools:

πŸ“ GitHub Repository Setup

Why Version Control Matters

Establishing good habits early prepares you for:

  • Collaborative Problem Sets (peer collaboration model)
  • Professional software development workflows
  • Tracking your learning progress
  • Organised course notes and reflections

Your Personal Repository

Create a repository for course notes:

πŸ’‘ TIP: Try to use Markdown to write your notes. Choose a pattern of filename convention that works for you. Here are two potential examples:

ds205-notes/
β”œβ”€β”€ README.md
β”œβ”€β”€ week01/
β”‚   β”œβ”€β”€ lecture-notes.md
β”‚   β”œβ”€β”€ lab-reflections.md
β”‚   └── questions.md
β”œβ”€β”€ week02/
└── [subsequent weeks]
ds205-notes/
β”œβ”€β”€ README.md
β”œβ”€β”€ w01-lecture-notes.md
β”œβ”€β”€ w01-lab-reflections.md
β”œβ”€β”€ w01-questions.md
β”œβ”€β”€ w02-lecture-notes.md
β”œβ”€β”€ w02-lab-reflections.md
β”œβ”€β”€ w02-questions.md
└── [subsequent weeks]

Getting Started

  1. Create repository on GitHub (private recommended)
  2. Clone locally: git clone https://github.com/YOUR_USERNAME/ds205-notes.git
  3. Create folder structure
  4. Initial commit and push

πŸ‘¨πŸ»β€πŸ’» Live Coding

The Notebook

We’ll work through: W01-NB01-Lecture-Open-Food.ipynb (find it on Nuvolos). Today’s demonstration shows the complete pipeline from data collection to interactive visualisation:

  1. Data Collection: Collect 1000+ bread products from Open Food Facts API
  2. Systematic Inspection: Best practices when inspecting unfamiliar data
  3. Feature Engineering: Transform nested JSON to analysis-ready format
  4. Advanced Analysis: UMAP clustering reveals nutritional patterns
  5. Interactive Visualisation: Plotly visualisation with NOVA classification

NOTE: I don’t expect you to already know what UMAP is (we’ll talk about it in Week 08) but I expect you to be able to understand most of the code I’ll show.

Pay close attention when I’m demonstrating and take note of any code snippets or Python concepts that you don’t understand.

What You’ll See

  • How pagination works for large datasets
  • Systematic data inspection methodology
  • Feature extraction from nested structures
  • Preview of Week 08+ techniques (dimensionality reduction)
  • Interactive visualisations

I will skip a few details because they will be covered in the lab tomorrow.

Preparing for the Lab

What Awaits You

  • Hands-on practice with Open Food Facts API
  • Systematic data inspection exercises
  • DataFrame manipulation and analysis
  • Visualisation with seaborn

Questions?

  • Reach out on Slack if you need help!
  • Book office hours with me (on StudentHub) if you are unsure of your Python skills.