💻 Lab 02 – Manipulating XML files

Week 01 – Day 02 – Lab Roadmap (90 min)

Published: 11 July 2023

📋 LAB DIFFICULTY: 😅 HARDER THAN YESTERDAY

It involves practising with a new data format (XML) as well as interacting with the documentation of an unfamiliar package (xml2).

🥅 Objectives

  • Learn how to use the functions from the xml2 package.
  • Discover how to navigate the documentation of an unfamiliar package (xml2, as well as dplyr).
  • Learn how to think about custom data types.
  • Learn how to create an XML object from a data frame.
  • Practise creating for loops in R.
  • Practise creating custom R functions to work on data.

📚 Key Resources

📋 Lab Tasks

Today we will continue to work with the Tesco Groceries data set. The point of today’s lab is to help you familiarise yourself with transforming data from a data frame to another format and back.

Important

Today, our focus will be on XML because soon (🗓️ Week 01 | Day 04), you’ll be diving into web scraping. Since most websites use HTML, which shares similarities with XML, the skills you learn today will come in handy when extracting data from the web.

Part 1: ⚙️ Setup (10 min)

  1. Open the RStudio project you created yesterday and create a new R script. Save the script as lab02.R.

  2. We will need the following libraries today:

    library(tidyverse)
    library(xml2)

    Or, if you prefer to load each tidyverse package individually:

    library(readr)
    library(dplyr)
    library(xml2)

  3. Load the same df data frame you created yesterday.

💡 TIP: To keep your code organised, create a dedicated section in your lab02.R script to store the code for each part of this lab. Here’s an example of how it could be structured:

#### PART 1: ⚙️ Setup ####

# Your code for Part 1 goes here (imports)

#### PART 2: Working with XML ####

By separating the code into sections, it will be easier to navigate and find specific parts of your script later on.

Part 2: Working with XML (20 min)

🧑🏻‍🏫 TEACHING MOMENT

  • Your instructor will show you how to convert a single row of data into XML and save it to a file.

  • Follow along in your own RStudio and ask clarifying questions if you are unsure about anything.

  • By the end, you will have produced a file with the extension .xml whose content looks like the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
        <area area_id="E01000001">
            <fat>9.02807973835848</fat>
            <saturate>3.7293430929761</saturate>
            <salt>0.556402429528114</salt>
            <protein>5.38504905777922</protein>
            <sugar>9.65265534963392</sugar>
            <carb>16.237019155895</carb>
            <fibre>1.67400716399314</fibre>
            <alcohol>0.347539336551938</alcohol>
        </area>
    </data>
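One possible way to build a file like the one above with xml2 is sketched below. It is a minimal sketch, not necessarily the code your instructor will write: the one-row data frame `row` is a stand-in holding only three of the nutrient columns (values copied from the sample output), and `out_file` is a hypothetical output path.

```r
library(xml2)

# Stand-in for a single row of df (assumption: only three nutrient columns shown)
row <- data.frame(
  area_id  = "E01000001",
  fat      = 9.02807973835848,
  saturate = 3.7293430929761,
  salt     = 0.556402429528114,
  stringsAsFactors = FALSE
)

# Hypothetical output path; in the lab you would save inside your project folder
out_file <- file.path(tempdir(), "single_row.xml")

# Create a document whose root element is <data>
doc <- xml_new_root("data")

# Named arguments become attributes, so this yields <area area_id="E01000001">
area <- xml_add_child(doc, "area", area_id = row$area_id)

# Unnamed arguments become the text of the new node, e.g. <fat>9.028...</fat>
for (nutrient in c("fat", "saturate", "salt")) {
  xml_add_child(area, nutrient, as.character(row[[nutrient]]))
}

write_xml(doc, out_file)
```

Opening `out_file` in a text editor should show one `<area>` element nested inside `<data>`, with one child element per nutrient.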

🗣️ QUESTIONS TO THE CLASSROOM:

  • What would the XML look like if we had multiple rows of data?

Now let’s get you something to do!

Part 3: Multiple Rows of Data (30 min)

🎯 ACTION POINTS:

  1. Now add code to your lab02.R script to produce a file large_sample.xml that contains the first 10 rows of the df data frame.

  2. Open the file in a text editor and inspect the contents. Does it make sense?
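If you get stuck on the action points above, the sketch below shows one way a for loop could add one `<area>` element per row. It assumes a small stand-in `df` with two rows and three nutrient columns (values taken from the samples in this lab), so `min(10, nrow(df))` is used to avoid asking for more rows than exist. The path `out_file` is hypothetical.

```r
library(xml2)

# Stand-in for df (assumption: the real df has more rows and more columns)
df <- data.frame(
  area_id  = c("E01000001", "E01000002"),
  fat      = c(9.02807973835848, 8.41439397561554),
  saturate = c(3.7293430929761, 3.4108180731533),
  salt     = c(0.556402429528114, 0.501482787473101),
  stringsAsFactors = FALSE
)

# Every column except the identifier is treated as a nutrient
nutrient_cols <- setdiff(names(df), "area_id")
n_rows <- min(10, nrow(df))

doc <- xml_new_root("data")

# Outer loop: one <area> per row; inner loop: one child per nutrient column
for (i in seq_len(n_rows)) {
  area <- xml_add_child(doc, "area", area_id = df$area_id[i])
  for (col in nutrient_cols) {
    xml_add_child(area, col, as.character(df[[col]][i]))
  }
}

# Hypothetical path; in the lab, write to "large_sample.xml" in your project
out_file <- file.path(tempdir(), "large_sample.xml")
write_xml(doc, out_file)
```

Using `seq_len(n_rows)` rather than `1:n_rows` keeps the loop safe even when the data frame is empty.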

🧑🏻‍🏫 TEACHING MOMENT

  • Your instructor will show you how to read the XML file you just created back into R. Follow along in your own RStudio and ask clarifying questions if you are unsure about anything.
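As a rough preview of the reverse direction, the sketch below reads a small XML file and rebuilds a data frame from it. It is only one possible approach: the file is written from a hard-coded string so the example runs on its own, and the names `xml_path` and `df_back` are hypothetical.

```r
library(xml2)

# Write a tiny XML file so the example is self-contained (hypothetical path)
xml_path <- file.path(tempdir(), "read_back_demo.xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<data>',
  '  <area area_id="E01000001"><fat>9.02807973835848</fat><salt>0.556402429528114</salt></area>',
  '  <area area_id="E01000002"><fat>8.41439397561554</fat><salt>0.501482787473101</salt></area>',
  '</data>'
), xml_path)

doc <- read_xml(xml_path)

# XPath "//area" selects every <area> element in the document
areas <- xml_find_all(doc, "//area")

# xml_attr() reads the attribute; xml_find_first() picks one child per area,
# and xml_double() converts its text to a number
df_back <- data.frame(
  area_id = xml_attr(areas, "area_id"),
  fat     = xml_double(xml_find_first(areas, "./fat")),
  salt    = xml_double(xml_find_first(areas, "./salt")),
  stringsAsFactors = FALSE
)

print(df_back)
```

Each XPath expression starting with `./` is evaluated relative to an `<area>` node, which is why the columns line up row by row.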

Part 4: Hierarchical XML (30 min)

Given what you practised above, you should be able to create an even more complex hierarchical XML structure.

🎯 ACTION POINTS:

  1. Modify your code so that each <area> element wraps its values in a <nutrients> tag, giving your XML file the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <area area_id="E01000001">
    <nutrients>
      <fat>9.02807973835848</fat>
      <saturate>3.7293430929761</saturate>
      <salt>0.556402429528114</salt>
      <protein>5.38504905777922</protein>
      <sugar>9.65265534963392</sugar>
      <carb>16.237019155895</carb>
      <fibre>1.67400716399314</fibre>
      <alcohol>0.347539336551938</alcohol>
    </nutrients>
  </area>
  <area area_id="E01000002">
    <nutrients>
      <fat>8.41439397561554</fat>
      <saturate>3.4108180731533</saturate>
      <salt>0.501482787473101</salt>
      <protein>5.32376284962945</protein>
      <sugar>7.88136983026529</sugar>
      <carb>14.154434616304</carb>
      <fibre>1.65491274205115</fibre>
      <alcohol>0.611192565144633</alcohol>
    </nutrients>
  </area>
  ...
</data>
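One way to get the extra level of nesting is to move the per-row work into a custom function, as sketched below. The helper name `add_area()` is hypothetical, and the two-row `df` is a stand-in with only two nutrient columns (values taken from the structure above).

```r
library(xml2)

# Stand-in for df (assumption: the real df has more rows and columns)
df <- data.frame(
  area_id = c("E01000001", "E01000002"),
  fat     = c(9.02807973835848, 8.41439397561554),
  salt    = c(0.556402429528114, 0.501482787473101),
  stringsAsFactors = FALSE
)
nutrient_cols <- setdiff(names(df), "area_id")

# Hypothetical helper: adds <area><nutrients>...</nutrients></area> under `parent`
add_area <- function(parent, row, nutrient_cols) {
  area <- xml_add_child(parent, "area", area_id = row$area_id)
  nutrients <- xml_add_child(area, "nutrients")
  for (col in nutrient_cols) {
    xml_add_child(nutrients, col, as.character(row[[col]]))
  }
  invisible(area)
}

doc <- xml_new_root("data")
for (i in seq_len(nrow(df))) {
  add_area(doc, df[i, ], nutrient_cols)
}

# Print the document to the console to inspect the structure
cat(as.character(doc))
```

Because `xml_add_child()` returns the node it just created, the returned `<nutrients>` node can be used directly as the parent for the next level.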

🏑 Bonus Task

  1. Create a new R script called lab02_bonus.R and save it in the same directory as your lab02.R script.

  2. Write code to produce an XML file that contains an even deeper hierarchical structure, containing all nutrient statistics. That is, your XML should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <area area_id="E01000001">
    <nutrient-statistics>
      <fat>
        <fat>9.02807973835848</fat>
        <fat_std>13.5937419587009</fat_std>
        <fat_ci95>0.332558201593766</fat_ci95>
        <fat_perc2.5>0</fat_perc2.5>
        <fat_perc25>0.3</fat_perc25>
        <fat_perc50>1.8</fat_perc50>
        <fat_perc75>14.2</fat_perc75>
        <fat_perc97.5>42.1</fat_perc97.5>
      </fat>
      <saturate>
        <saturate>3.7293430929761</saturate>
        <saturate_std>6.75160281412371</saturate_std>
        <saturate_ci95>0.165171657411316</saturate_ci95>
        <saturate_perc2.5>0</saturate_perc2.5>
        <saturate_perc25>0.1</saturate_perc25>
        <saturate_perc50>0.8</saturate_perc50>
        <saturate_perc75>4</saturate_perc75>
        <saturate_perc97.5>21.4</saturate_perc97.5>
      </saturate>
      ...
    </nutrient-statistics>
  </area>
  ...
</data>
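A possible starting point for the bonus task is sketched below. It assumes, based on the tag names in the structure above, that each statistic column is named after its nutrient followed by an underscore and a suffix (for example `fat_std`). The helper `cols_for()` and the one-row stand-in `df` with only two statistics per nutrient are hypothetical.

```r
library(xml2)

# Stand-in for df (assumption: the real df has many more statistic columns)
df <- data.frame(
  area_id      = "E01000001",
  fat          = 9.02807973835848,
  fat_std      = 13.5937419587009,
  saturate     = 3.7293430929761,
  saturate_std = 6.75160281412371,
  stringsAsFactors = FALSE
)
nutrients <- c("fat", "saturate")

# Hypothetical helper: columns equal to the nutrient name or starting with "<nutrient>_"
cols_for <- function(nutrient, all_cols) {
  all_cols[all_cols == nutrient | startsWith(all_cols, paste0(nutrient, "_"))]
}

doc <- xml_new_root("data")
for (i in seq_len(nrow(df))) {
  area  <- xml_add_child(doc, "area", area_id = df$area_id[i])
  stats <- xml_add_child(area, "nutrient-statistics")
  for (nutrient in nutrients) {
    # One grouping element per nutrient, e.g. <fat> ... </fat>
    group <- xml_add_child(stats, nutrient)
    for (col in cols_for(nutrient, names(df))) {
      xml_add_child(group, col, as.character(df[[col]][i]))
    }
  }
}

cat(as.character(doc))
```

With the real data you would check that no nutrient name is a prefix of an unrelated column before relying on this matching rule.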