✍️ Summative Problem Set 02 | W05-W07

DS105 - Data for Data Science

Author

Anton Boichenko & Dr. Jon Cardoso-Silva

Published

26 October 2022

Update 6 November 2022

The deadline has been extended to 14 November 2022.

Welcome to the second summative assessment of this course!

This time you will continue exploring the world of web-scraping. We hope you will have fun completing this one, as we tried to make it as entertaining for you as possible.

Things to know before you start:

  1. Deadline: you have until 14 November 2022, 23:59 UK time to complete your solutions and share the single screenshot via Moodle.
  2. You will be granted a maximum of 100 points for the whole assignment. You will see how much each task is right next to the tasks’ names.
  3. This assessment is worth 15% of your final grade.
  4. Read the instructions carefully and make sure you follow them.

P0: Only for you

This assignment has individual tasks for each student. You will find them on your cloud machine profile.

🎯 ACTION POINTS

  1. Go to your cloud machine user.
  2. Locate the file called summative_04.txt and download it.

This file contains your unique assignment.

Tip
  • You do not need to write or test your scripts on the cloud.

  • In fact, we recommend you access the cloud only to get your custom assignment and to submit your responses later. It will make your life easier.

  • Try prototyping your solutions by using a computational notebook such as Jupyter Notebook, R Notebooks or Google Colab. It is easier to work with code that way, and you can add pieces of text to remind yourself what each part of your code does.

  • Take a look at πŸ”– Week 05 - Appendix for util links

P1: Loving food (30 points)

In this part, you will be collecting data from the BBC Good Food website. You will be collecting recipes for various dishes.

🎯 ACTION POINTS

  1. Write the code that executes what is asked for in Task 1 in summative_04.txt.
  2. Save the code in the file called BBC_food_scrapping (the file extension will depend on the language you use). Make sure you provide your code with comments.

The instructions on submitting your code and data are provided in P4: Upload your solutions below.

P2: Getting the news (50 points)

🎯 ACTION POINTS

  1. Register for the NYT Article Search API and acquire an API key.
  2. Explore the documentation of the API to complete further steps.
  3. Extract the titles of articles, their publishing dates and links to them following the instructions in summative_04.txt.
  4. Plot the average size of the articles (measured in the number of words) for each month in your period.
  5. Save the code in the file called NYT_API_collection (the extension of the file will depend on the language you use). Make sure you provide your code with comments.

P3: Simple English (20 points)

🎯 ACTION POINTS

  1. Take the first 20 articles that you have scrapped in the previous exercise.

  2. Go to this dictionary API and explore how to use it for the next tasks.

  3. Using this API, create a JSON file containing the original 20 article titles and their phonetic representations. For instance, if the title is β€œData Science”, your JSON record should look like this:

    {"Data Science": "ˈdaetΙ™ ˈsaΙͺΙ™ns"}
  4. Save all the phonetic representations and the original titles in the file called NYT_phonetic.json.

P4: Upload your solutions

🎯 ACTION POINTS

  1. Connect to your user on the cloud machine.
  2. Create a folder called summative04 on your user.
  3. Upload all the created files (both your code and the scrapped data) to the folder.
  4. On the terminal, go to summative04 and type ls -lth. Take a screenshot of the terminal and post it to Moodle.
    If you do not complete this stage, we will not mark your assessment.
How to get help?
  • It is okay to team up with your group/class colleagues to work on the problems together.
  • It is ok to use Slack to share links to useful content
    • Share things like β€œTip: I had to convert a JSON file to a dataframe and I used this code”
  • It is also ok to ask generic programming-related questions publicly on Slack. For example, you can ask questions like:
    • β€œHow do I write a loop to convert a list of text to numbers? (R)” or

    • β€œDoes anyone know how to take just the first 20 items on a list ? (Python)” or

    • β€œHow do I do … in Jupyter Notebooks?”

    • β€œCan I do … in Python without installing any other package?”

    • β€œI am having trouble authenticating with the API. When I write

      this small piece of code

      it doesn’t return anything. Anyone else had the same problem?”

  • What we cannot accept:
    • sharing your entire script with others β€” but it is ok to share small pieces of code to ask for help, like the type of code people share on Stackoverflow
    • asking others to do your work for you (LSE regulations on plagiarism)