📝 W06 Summative

2023/24 Winter Term

Author

Dr. Jon Cardoso-Silva

⏲️ Due Date:

21 February 2024 at 5 pm UK time

If you submit after this date without an authorised extension, you will receive a late submission penalty.

Did you have an extenuating circumstance and need an extension? Send an e-mail to 📧

🎯 Main Objectives:

Practice some more Git/GitHub
Practice web scraping in Python

⚖️ Assignment Weight:

30%

This assignment is worth 30% of your final grade.

📚 Preparation

Read carefully. Some details are different from the previous assignments.

Find the GitHub assignment link: Go to our Slack workspace’s #announcements channel. We don’t share the link publicly. It is a private repository for enrolled students only.
Accept the assignment: Click the link, sign in to GitHub, and click the green button.
Access your private repository: You’ll be redirected to a new private repository named ds105w-2024-w06-summative-<yourusername>.
Clone your repository: Clone your repository to your computer. A folder ds105w-2024-w06-summative-<yourusername> will appear on your computer. We will refer to this folder as <github-repo-folder> in the rest of this document.

Install the required packages:

pip install numpy pandas requests scrapy

Mandatory requirement: keep your AI chat logs. If you used AI tools for assistance with this assignment, it’s essential to retain a log of your chat interactions. We will collect your chat logs just before releasing your grades (after grading and de-anonymising your assignment).
- For GENIAL participants: Submit your logs via the standard weekly forms. We’ll take it from there.
- If you have chosen not to participate in the study, we’ll contact you via e-mail when the time comes to submit your logs.
  
  Please Note: There will be no penalties for using or abstaining from using AI tools.

7. Learn how to get help:

Pose questions in the public channels on Slack.
Attend the drop-in sessions facilitated by our colleagues at the Digital Skills Lab.
Refer to the 🙋 Getting Help section below for guidance on collaborating with peers without compromising the originality of your work.

📋 Tasks

The questions below primarily build on principles taught at the 👨🏻‍🏫 W04 Lecture & 👨🏻‍🏫 W05 Lecture as well as the 💻 W05 Lab.

What do we actually want from you?

📝 Task 0: Setup

Open <github-repo-folder>/notebooks/NB01 - Data Collection.ipynb and add your candidate number to the first markdown cell.
Inside the # ⚙️ SETUP section, replace <your candidate number> with your actual candidate number.
Run the code cell inside the Discover your unique assignment section to discover the web page you will be scraping.
Push it to GitHub
- git add
- git commit -m "Setting up my assignment"
- git push

📝 Task 1: Collecting menus (20 marks)

Every LSE department website is different, but they all have a horizontal menu at the top somewhere, typically with links like ‘About’, ‘Research’, ‘People’, etc.

Your task here is to create a little dictionary of the menu of the department’s website you were assigned to using the requests and scrapy libraries.

Add a new section on NB01 - Data Collection.ipynb, add a H1 header and call it Task 1 - Collecting Menus.
Write code to scrape the menu of the department’s website you were assigned to. Collect the links and the text of the menu items.

Write the code to save the scraped data as a Python dictionary called menu with the following structure:

{
    "About": "https://www.lse.ac.uk/...",
    "Research": "https://www.lse.ac.uk/...",
    "People": "https://www.lse.ac.uk/...",
    ...
}

Save it to a JSON file at <github-repo-folder>/data/menu.json.
Commit & push as frequently as you want.

📝 Task 2: Collecting boxes (50 marks)

Not all, but most LSE department webpages have a few ‘content boxes’, typically with some text and a figure. For example, see below a highlighted content box from the LSE Data Science Institute research webpage:

Let’s organise the content of these boxes into a dataframe.

Add a new section on NB01 - Data Collection.ipynb, add a H1 header and call it Task 2 - Scraping the Schedule.
Read the menu.json file you created in Task 1 as a dictionary.

(Sure, in real life, this is unnecessary, given that you already have the dictionary in your memory. But we want to see your file reading skills!)

Scrape all content boxes from each link in your menu dictionary. Extract the title, subtitle, text, image URL, and alt title from each box using Python data structures.

For example, a dictionary representation of the content box highlighted above would be:

{
    "title": "Research Project",
    "subtitle": "(De)constructing health & wellbeing through data"
    "text": "Exploring ways to improve the mapping of spatial health inequalities in England",
    "image_url": "/Cities/Assets/Images/research-page-images/DSI-Deconstructing-health-and-wellbeing/DSI-photo-a-747x560.jpg",
    "image_alt": "DSI-photo-a-747x560"
}

Organise this data into a single dataframe called df_boxes with the following columns

department	page_title	page_url	box_title	box_subtitle	box_text	box_image_url	box_image_alt
Data Science Institute	Research	https://www.lse.ac.uk/dsi/research	Research Project	(De)constructing health & wellbeing through data	Exploring ways to improve the mapping of spatial health inequalities in England	/Cities/Assets/Images/research-page-images/DSI-Deconstructing-health-and-wellbeing/DSI-photo-a-747x560.jpg	DSI-photo-a-747x560
…	…	…	…	…	…	…	…
…	…	…	…	…	…	…	…

👉 Note: The first column is the same for all rows. This is the department you were assigned to.

Leave cells empty whenever the information is not available.

Save the dataframe as a CSV file to <github-repo-folder>/data/boxes.csv.
Commit & push as frequently as you want.

📝 Task 3: Collecting the images (30 marks)

Now, let’s download the images from the boxes you scraped in Task 2.

Add a new section on NB01 - Data Collection.ipynb, add a H1 header and call it Task 3 - Downloading Images.
Read the menu dictionary from the menu.json file and the df_boxes.csv files again.
Use the Python os library to create a folder called <github-repo-folder>/data/images to store the images.
Still using the os library, add sub-folders for each menu page. For example, if you have a menu item called “Research”, you should create a sub-folder called <github-repo-folder>/data/images/research.
Using the requests library, download the images from the box_image_url column of the df_boxes dataframe and save them to the appropriate sub-folder.
Commit & push as frequently as you want.

Wrapping up

Edit the README.md file to include a brief description of your work and the structure of your repository.

✔️ How we will grade your work

We will run your notebook and check if it produces the expected JSON, CSV, and image files and that they are in the right places.
We will check if your notebook (markdown and code) is well-organised, efficient, and doesn’t throw any errors.
In line with typical LSE expectations around course grades, you will score around 70% if you get everything right. More marks are only awarded for exceptional work beyond what we have taught you.

How to get help and how to collaborate with others

🙋 Getting help

You can post general coding questions on Slack but should not reveal code that is part of your solution.

For example, you can ask:

“What is a good way to extract just the middle part of a string?”
“How do I convert a string to a date?”

You are allowed to share ‘aesthetic’ elements of your code, if they are not part of the core of the solution. For example, suppose you find a really cool new way to write a for loop. You can share that on Slack.

If we find that you posted something on Slack that violates this principle without realising, you won’t be penalised for it - don’t worry, but we will delete your message and let you know.

👯 Collaborating with others

You are allowed to discuss the assignment with others, work alongside each other, and help each other. However, you are not allowed to share code or to copy code from others. Pretty much the same rules as above. Copying code from others is plagiarism and will be dealt with accordingly.

Aim to strike a balance: you want to be able to talk to others about programming in general without getting too specific about the assignment. This is a good way to ensure your submission has your personal touch (more likely to earn more marks) and avoid getting into trouble.

🤖 Using AI help?

You can use Generative AI tools such as ChatGPT when doing this research and search online for help. If you do use it, however minimal use you made, you are asked to report in the final section of your notebook, explaining which tools and the extent to which you used it and to keep the chat logs that show evidence of how you used it.

If you are a GENIAL participant, things are easier for you. Just don’t forget to submit your logs via the standard weekly forms.
If you don’t participate in the study, we’ll contact you via e-mail when the time comes to submit your logs.
To see examples of how to report the use of AI tools, see 🤖 Our Generative AI policy.