💻 Week 04 - Class Roadmap (90 min)

2023/24 Winter Term

Author

Published

09 February 2024

Figure 1: Images dreamed up by the Stable Diffusion algorithm ¹. Prompt: “boxplots, abstract digital art”

📚 Learning Objectives

Welcome to the fourth class of the course! This week, it’s time for hands-on practice. We will learn how to use Zotero to manage our references. And we will learn a language called Quarto Markdown, which will allow us to create documents that are both reproducible and interactive.

Prior to this class, you should have installed a bunch of software needed for this tutorial: Anaconda (Python), Quarto, VSCode and Zotero. If this is not the case, head over here for instructions.

🛣️ Roadmap

Step 1: Collecting and saving academic references (~20min)

Familiarizing yourselves with Zotero

To tackle this part of the class, you need a Zotero account and the Zotero browser extension and/or the Zotero desktop application installed. If not, check Step 4 of this guide for instructions.

Log in to your Zotero account.
To keep things tidy, we recommend creating collections of references within Zotero, e.g. one dedicated to the references for the DS101W module. Head to this page to see how to create a collection in Zotero; create a new collection under your account and give it an easily recognizable name.
Using the Zotero extension in your browser, save a few relevant references to the new collection you created (check out the 📔 syllabus for some suggestions). Have a look at this tutorial to see how you can add references via your web browser in Zotero.
Download the PDF article below:

Head to the section “Adding PDFs and Other Files” of this tutorial and add this PDF to your Zotero collection.

Download the zip archive below:

Place the archive in the folder where you want its contents to go and unzip it. Use this guide to add the metadata of the PDFs from the unzipped archive to your Zotero collection.

Creating a bibliography reference file to include in a Quarto document

In this part of the class, we are going to use Zotero to create a bibliography reference file that will help us include references in the Quarto document we will be editing and formatting later in the class (see Step 2).

In Zotero, add the following articles to a new collection:
- https://academic.oup.com/biomet/article/63/3/581/270932?login=true
- https://hbr.org/2019/10/gdp-is-not-a-measure-of-human-well-being
- https://www.imf.org/external/pubs/ft/fandd/2017/03/coyle.htm
- Make sure each entry includes the name of the author(s), (article) title, journal/magazine, number, issue, page range, and DOI. Make the information as complete as possible.
Right-click on the collection and select “Export Collection”. Choose the BibLaTeX format, name the file references.bib and save it in the same directory where you’ll be putting the skeleton Quarto document from the next step.

This concludes the Zotero part of the session. We’ll now move on to Quarto Markdown practice.

Step 2: Formatting documents with Quarto Markdown (~60 min)

To tackle this part, you need to have Anaconda, Quarto and VSCode installed. If this is not the case, check Steps 1 to 3 from this guide for instructions.

Open a VSCode workspace (see how here)
Download the .zip archive below that contains a skeleton Quarto Markdown (.qmd) file and place the .qmd file in your workspace. Quarto Markdown files have a .qmd extension.

Open the skeleton Markdown file in VSCode.

You will see a “chunk” of Python code that loads all the libraries you need to finish the task.

Note

You will see that there is a series of short headings and sentences without any formatting. We are going to turn this into an appropriately formatted document with three sections.

There are many ways to format documents in Quarto; refer to this page as a user guide throughout.

Formatting document sections - Section 1 (~10 min)

Let’s start by formatting the first section of the document, Section 1: Understanding missing data mechanisms.

🎯 ACTION POINTS:

Note

To check the results of the changes you make to your .qmd file:

open a terminal in VSCode
go to the folder where your .qmd file is located using the cd command, e.g. if your DS101W_class4_skeleton.qmd file is in /users/lovelace/Documents (i.e. the full path to your file is /users/lovelace/Documents/DS101W_class4_skeleton.qmd), type cd /users/lovelace/Documents on the terminal, followed by the Enter key. Note that cd takes the folder, not the file, as its argument.
then, on the terminal, type quarto preview DS101W_class4_skeleton.qmd --no-browser. Because of the --no-browser option, Quarto does not open a browser tab by itself; it prints a local URL in the terminal, which you can open in your browser to see how your document would look once rendered in HTML format.

⚠️ The quarto preview command does not render your document yet (i.e. no output HTML is created). To render your document (once you are done modifying it!), run quarto render DS101W_class4_skeleton.qmd on the terminal. A DS101W_class4_skeleton.html file is then created in the same folder as your .qmd file; you can open this new file in your favorite browser (e.g. Firefox, Chrome).

Turn the line “Understanding missing data mechanisms” into a section header.
Add the following to the YAML header:
```
bibliography: references.bib
```
Click on references.bib to see the reference Zotero created for the Rubin article.
Cite the article directly in the text using @ArticleReference or [@ArticleReference], where ArticleReference stands for the citation key of the article in references.bib.
Turn the paragraph about missing data into bullet points.

Formatting your document - Section 2 - Part 1 (~20 min)

Now, let’s format the second section of the document, Data Analysis Case Study: GDP (OECD data).

🎯 ACTION POINTS:

Turn the line “Data Analysis Case Study: GDP (OECD data)” into a section header and the lines “Trends in GDP per capita, 1972 to 2022”, “GDP table for selected countries”, “GDP barplot for selected countries” and “Criticism of the GDP measure” into subsection headers.
Go to the OECD website:

https://data.oecd.org/gdp/gross-domestic-product-gdp.htm

Download the GDP (dollars/per capita) table (CSV file) from 1972 to 2022 (full indicator data version) and place the CSV file in the same folder as your Quarto file.

Let’s write some simple Python code. Go to the skeleton Python code at the top of the file.

After the line from plotnine.themes import theme, theme_bw, write a line of code to read the csv file you’ve just downloaded into a data frame.
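As a minimal sketch of this step, the file can be read with pandas, assuming pandas is among the libraries loaded by the skeleton. The helper name load_gdp_table is hypothetical, and the check on the LOCATION and TIME columns assumes the CSV uses the column names that appear in the code later in this class.

```python
import pandas as pd

# Columns used by the filtering code later in the class (assumed to be
# present in the downloaded OECD CSV file).
EXPECTED_COLUMNS = {"LOCATION", "TIME"}


def load_gdp_table(csv_path):
    """Read the downloaded CSV file into a pandas data frame (hypothetical helper)."""
    df = pd.read_csv(csv_path)
    # Fail early with a clear message if the file does not look as expected.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df
```

In the skeleton itself, a single line such as df = pd.read_csv("your_file.csv") would be enough, with your_file.csv replaced by the name of the file you downloaded. The later code in this class expects the data frame to be called df.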

Check out the following documentation:
- https://quarto.org/docs/get-started/computations/rstudio.html#inline-code
Now use inline Python code to replace the ellipses (…) with the dollar/capita amounts using the avg_gdp_per_year function.

🍵 Break (5-10 min)

Formatting your document - Section 2 - Part 2 (~30 min)

We’ll now continue formatting the second section of the document, Data Analysis Case Study: GDP (OECD data), and turn our attention to its last three subsections, i.e. GDP table for selected countries, GDP barplot for selected countries and Criticism of the GDP measure.

We want to create a table that shows the GDP between 2018 and 2020 for the following countries: the UK, India, China, Spain, the US, Italy and France.

In the skeleton Markdown file, you’ll see the following piece of code:

```{python}
#| eval : FALSE
#| echo : FALSE
l=['GBR','IND','CHN','ESP','USA','ITA','FRA']
years=range(2018,2021,1)
df.query('LOCATION.isin(@l)').query('TIME.isin(@years)')
```

This code will allow you to print out (after a slight modification) the part of the original GDP table where the countries are one of the UK (code GBR), India (code IND), China (code CHN), Spain (code ESP), the US (code USA), Italy (code ITA) and France (code FRA) between the years 2018 and 2020.
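To make the logic of the snippet above more explicit, here is an equivalent sketch that uses boolean masks instead of query. It is only an illustration, not the code from the skeleton: the function name select_gdp_rows and its parameters are hypothetical, and it assumes the data frame has the LOCATION and TIME columns used above.

```python
import pandas as pd

# ISO codes of the countries of interest, as listed in the skeleton code.
COUNTRY_CODES = ["GBR", "IND", "CHN", "ESP", "USA", "ITA", "FRA"]


def select_gdp_rows(df, codes=COUNTRY_CODES, first_year=2018, last_year=2020):
    """Keep the rows for the given countries and years (hypothetical helper)."""
    # True for rows whose country code is in the list.
    in_countries = df["LOCATION"].isin(codes)
    # True for rows whose year lies in [first_year, last_year], bounds included.
    in_years = df["TIME"].between(first_year, last_year)
    # Keep only the rows that satisfy both conditions.
    return df[in_countries & in_years]
```

Calling select_gdp_rows(df) should return the same rows as the two chained query calls, since range(2018,2021,1) covers the years 2018, 2019 and 2020.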

As the code is currently written in the skeleton file, the code is not executed and does not return anything.

For the code to be executed, replace the FALSE in the #| eval : FALSE line with TRUE; the result of the execution then appears in the rendered document. To also display the code itself next to its result, replace the FALSE in the #| echo : FALSE line with TRUE.

Once you have executed the code snippet, write a Markdown table (see how here) that shows the GDP of the UK, India, China, Spain, the US, Italy and France between 2018 and 2020, and add it to the section GDP table for selected countries. Don’t forget to add a caption to the table. Once you have added the table, change the options in the Python code snippet back to their original values, i.e. #| eval : FALSE and #| echo : FALSE, so that only the Markdown table you have written remains visible.
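If you would like a starting point for the Markdown table rather than typing every cell by hand, the sketch below builds a pipe table from the filtered data frame. It is one possible approach, not part of the skeleton: the function name gdp_markdown_table is hypothetical, and the code assumes the GDP figures are stored in a column named Value, which you should check against your own CSV file.

```python
import pandas as pd


def gdp_markdown_table(gdp, caption):
    """Return a Markdown pipe table: one row per country, one column per year."""
    # Reshape from long format (one row per country and year) to wide format.
    # "Value" is an assumed column name for the GDP figures.
    wide = gdp.pivot(index="LOCATION", columns="TIME", values="Value")
    header = "| Country | " + " | ".join(str(year) for year in wide.columns) + " |"
    separator = "|" + "---|" * (len(wide.columns) + 1)
    rows = [
        f"| {country} | " + " | ".join(f"{value:,.0f}" for value in values) + " |"
        for country, values in wide.iterrows()
    ]
    # A line starting with ":" after a blank line is read as the table caption.
    return "\n".join([header, separator, *rows, "", f": {caption}"])
```

Printing the returned string gives text that can be pasted into the .qmd file and then edited, for instance to replace the country codes with full country names.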

Next, we’ll move to the GDP barplot for selected countries subsection of the skeleton Markdown file. Under this subsection, you’ll see a Python code snippet (not executable for now) that creates a dataframe containing the 2020 GDP per capita data for the same set of countries as earlier. Execute the code to see the content of this dataframe. We now want to create a barplot to visualize this content. Within the code snippet provided, use the plotnine library to write the code for a barplot that shows the 2020 GDP per capita for each of the countries selected previously (download the tutorial below to see how to draw barplots in plotnine, or check the plotnine documentation).

Once you’re done with your plot, export it to an image file (e.g. a .png file) as shown here and include it in the subsection as a figure (follow the instructions in this guide to do this). Once you have included your figure, make sure the code snippet you used to generate it is no longer executable (i.e. change the eval option value back to FALSE so that the line reads #| eval : FALSE).

In the subsection Criticism of the GDP measure, cite the two GDP-related references that you added to references.bib earlier, using the [@ArticleReference] notation. Look on the web for an image that illustrates the content of the section and insert it in the subsection in the same way you inserted the barplot you created previously.
When you are done editing your .qmd file, render it using the quarto render command to generate an HTML file. Open the HTML output and check that everything is in order.

Footnotes

Read more about Stable Diffusion here.