💻 Lab 02 – Manipulating XML files
Week 01 – Day 02 - Lab Roadmap (90 min)
LAB DIFFICULTY: HARDER THAN YESTERDAY
It involves practising with a new data format (XML) as well as interacting with the documentation of an unfamiliar package (xml2).
🥅 Objectives
- Learn how to use the functions from the xml2 package.
- Discover how to navigate the documentation of an unfamiliar package (xml2, as well as dplyr).
- Learn how to think of custom data types.
- Learn how to create an XML object from a data frame.
- Practice creating for loops in R.
- Practice creating custom R functions to work on data.
Key Resources
Lab Tasks
Today we will continue to work with the Tesco Groceries data set. The point of today's lab is to help you familiarise yourself with transforming data from a data frame to another format and back.
Today, our focus will be on XML because soon (🗓️ Week 01 | Day 04), you'll be diving into web scraping. Since most websites use HTML, which shares similarities with XML, the skills you learn today will come in handy when extracting data from the web.
Part 1: ⚙️ Setup (10 min)
Open the RStudio project you created yesterday and create a new R script. Save the script as lab02.R.
We will need the following libraries today:
library(tidyverse)
library(xml2)
Or, if you prefer to load each tidyverse package individually:
library(readr)
library(dplyr)
library(xml2)
Load the same df data frame you created yesterday.
💡 TIP: To keep your code organised, create a dedicated section in your lab02.R script to store the code for each part of this lab. Here's an example of how it could be structured:
#### PART 1: ⚙️ Setup ####
# Your code for Part 1 goes here (imports)
#### PART 2: Working with XML ####
By separating the code into sections, it will be easier to navigate and find specific parts of your script later on.
Part 2: Working with XML (20 min)
🧑🏻‍🏫 TEACHING MOMENT
Your instructor will show you how to convert a single row of data into XML and save it to a file.
Follow along in your own RStudio and ask clarifying questions if you are unsure about anything.
By the end, you will have produced a file with the extension .xml whose content looks like the following:
<?xml version="1.0" encoding="UTF-8"?>
<data>
  <area area_id="E01000001">
    <fat>9.02807973835848</fat>
    <saturate>3.7293430929761</saturate>
    <salt>0.556402429528114</salt>
    <protein>5.38504905777922</protein>
    <sugar>9.65265534963392</sugar>
    <protein>5.38504905777922</protein>
    <carb>16.237019155895</carb>
    <fibre>1.67400716399314</fibre>
    <alcohol>0.347539336551938</alcohol>
  </area>
</data>
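As a reference while you follow along, here is a minimal sketch of one way to build that structure with xml2. It is not necessarily the approach your instructor will take. The `row` data frame is a toy stand-in for a single row of `df` (only three nutrients, with values copied from the XML above), and the output path is a hypothetical temporary file.

```r
library(xml2)

# Toy stand-in for one row of `df` (assumption: the real `df` has an
# `area_id` column plus one column per nutrient)
row <- data.frame(
  area_id = "E01000001",
  fat = 9.02807973835848,
  saturate = 3.7293430929761,
  salt = 0.556402429528114,
  stringsAsFactors = FALSE
)

# Create the <data> root, then an <area> child carrying the ID as an attribute
doc <- xml_new_root("data")
area <- xml_add_child(doc, "area")
xml_set_attr(area, "area_id", row$area_id)

# Every column except the ID becomes a child element of <area>
for (nutrient in setdiff(names(row), "area_id")) {
  xml_add_child(area, nutrient, as.character(row[[nutrient]]))
}

# Hypothetical output path; point this at a file in your project instead
out_file <- file.path(tempdir(), "sample.xml")
write_xml(doc, out_file)

# Print the file so you can compare it with the expected content
cat(readLines(out_file), sep = "\n")
```

Passing the value as an unnamed argument to `xml_add_child()` sets the text of the new node, which is why each nutrient value is converted to a character string first.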
🗣️ QUESTIONS TO THE CLASSROOM:
- What would the XML look like if we had multiple rows of data?
Now let's get you something to do!
Part 3: Multiple Rows of Data (30 min)
🎯 ACTION POINTS:
- Now add code to your lab02.R script to produce a file large_sample.xml that contains the first 10 rows of the df data frame.
- Open the file in a text editor and inspect the contents. Does it make sense?
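If you get stuck, the sketch below shows one possible shape for the solution: a small custom function that adds one <area> node, called from a for loop. Try the task yourself first. The helper name `add_area`, the toy data frame `toy_df` (two rows, with values copied from this page), and the temporary output path are all assumptions made for illustration; your own code should use `df` and write to your project folder.

```r
library(xml2)

# Toy stand-in for `df`, kept small so the example runs on its own
toy_df <- data.frame(
  area_id = c("E01000001", "E01000002"),
  fat = c(9.02807973835848, 8.41439397561554),
  salt = c(0.556402429528114, 0.501482787473101),
  stringsAsFactors = FALSE
)

# Hypothetical helper: append one <area> node for row `i` of `data`
add_area <- function(parent, data, i, id_col = "area_id") {
  area <- xml_add_child(parent, "area")
  xml_set_attr(area, id_col, data[[id_col]][i])
  for (col in setdiff(names(data), id_col)) {
    xml_add_child(area, col, as.character(data[[col]][i]))
  }
  invisible(area)
}

doc <- xml_new_root("data")

# Use at most the first 10 rows (the toy data only has 2)
n_rows <- min(10, nrow(toy_df))
for (i in seq_len(n_rows)) {
  add_area(doc, toy_df, i)
}

# Hypothetical output location; in the lab this would be large_sample.xml
out_file <- file.path(tempdir(), "large_sample.xml")
write_xml(doc, out_file)
```

Wrapping the per-row logic in a function keeps the loop body to a single line and makes it easier to change the XML structure later.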
🧑🏻‍🏫 TEACHING MOMENT
- Your instructor will show you how to read the XML file you just created back into R. Follow along in your own RStudio and ask clarifying questions if you are unsure about anything.
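For reference, here is a hedged sketch of one way to turn <area> nodes back into a data frame. It may differ from what your instructor demonstrates. To keep it self-contained, it parses a short inline XML string rather than your file (`read_xml()` also accepts a file path), and `area_to_row` is a hypothetical helper name.

```r
library(xml2)

# Small inline document with the same shape as the lab file
xml_string <- '<data>
  <area area_id="E01000001"><fat>9.02807973835848</fat><salt>0.556402429528114</salt></area>
  <area area_id="E01000002"><fat>8.41439397561554</fat><salt>0.501482787473101</salt></area>
</data>'
doc <- read_xml(xml_string)

# XPath query: every <area> node anywhere in the document
areas <- xml_find_all(doc, "//area")

# Hypothetical helper: convert one <area> node into a one-row data frame
area_to_row <- function(area) {
  children <- xml_children(area)
  values <- as.list(as.numeric(xml_text(children)))
  names(values) <- xml_name(children)
  data.frame(
    area_id = xml_attr(area, "area_id"),
    values,
    stringsAsFactors = FALSE
  )
}

# Collect one row per area, then stack the rows
rows <- list()
for (i in seq_along(areas)) {
  rows[[i]] <- area_to_row(areas[[i]])
}
df_back <- do.call(rbind, rows)
print(df_back)
```

Note that XML stores everything as text, so the values have to be converted back to numbers explicitly with `as.numeric()`.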
Part 4: Hierarchical XML (30 min)
Given what you practised above, you should be able to create an even more complex hierarchical XML structure.
🎯 ACTION POINTS:
- Modify your code so your XML has a <nutrients> tag and your XML file has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<data>
  <area area_id="E01000001">
    <nutrients>
      <fat>9.02807973835848</fat>
      <saturate>3.7293430929761</saturate>
      <salt>0.556402429528114</salt>
      <protein>5.38504905777922</protein>
      <sugar>9.65265534963392</sugar>
      <protein>5.38504905777922</protein>
      <carb>16.237019155895</carb>
      <fibre>1.67400716399314</fibre>
      <alcohol>0.347539336551938</alcohol>
    </nutrients>
  </area>
  <area area_id="E01000002">
    <nutrients>
      <fat>8.41439397561554</fat>
      <saturate>3.4108180731533</saturate>
      <salt>0.501482787473101</salt>
      <protein>5.32376284962945</protein>
      <sugar>7.88136983026529</sugar>
      <protein>5.32376284962945</protein>
      <carb>14.154434616304</carb>
      <fibre>1.65491274205115</fibre>
      <alcohol>0.611192565144633</alcohol>
    </nutrients>
  </area>
  ...
</data>
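The only structural change from the flat version is one extra node between <area> and the nutrient values. A minimal sketch for a single area, assuming just two nutrients with values copied from the structure above:

```r
library(xml2)

doc <- xml_new_root("data")
area <- xml_add_child(doc, "area")
xml_set_attr(area, "area_id", "E01000001")

# The extra level: nutrient values hang off <nutrients>, not <area>
nutrients <- xml_add_child(area, "nutrients")

# Toy values for illustration (a subset of the columns in `df`)
values <- c(fat = 9.02807973835848, saturate = 3.7293430929761)
for (name in names(values)) {
  xml_add_child(nutrients, name, as.character(values[[name]]))
}

# Serialise the document to a string and print it
cat(as.character(doc))
```

Because `xml_add_child()` returns the node it just created, you can keep a handle on <nutrients> and attach further children to it.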
💡 Bonus Task
Create a new R script called lab02_bonus.R and save it in the same directory as your lab02.R script.
Write code to produce an XML file that contains an even deeper hierarchical structure, containing all nutrient statistics. That is, your XML should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
  <area area_id="E01000001">
    <nutrient-statistics>
      <fat>
        <fat>9.02807973835848</fat>
        <fat_std>13.5937419587009</fat_std>
        <fat_ci95>0.332558201593766</fat_ci95>
        <fat_perc2.5>0</fat_perc2.5>
        <fat_perc25>0.3</fat_perc25>
        <fat_perc50>1.8</fat_perc50>
        <fat_perc75>14.2</fat_perc75>
        <fat_perc97.5>42.1</fat_perc97.5>
      </fat>
      <saturate>
        <saturate>3.7293430929761</saturate>
        <saturate_std>6.75160281412371</saturate_std>
        <saturate_ci95>0.165171657411316</saturate_ci95>
        <saturate_perc2.5>0</saturate_perc2.5>
        <saturate_perc25>0.1</saturate_perc25>
        <saturate_perc50>0.8</saturate_perc50>
        <saturate_perc75>4</saturate_perc75>
        <saturate_perc97.5>21.4</saturate_perc97.5>
      </saturate>
      ...
    </nutrient-statistics>
  </area>
  ...
</data>
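One possible starting point, if you want a hint: group the columns by nutrient name and nest each group under its own node. The sketch below assumes that the statistic columns are named with the nutrient as a prefix (for example fat_std), as the tag names above suggest. The toy `row` and the `nutrient_names` vector are illustrative and cover only two nutrients and two statistics each.

```r
library(xml2)

# Toy stand-in for one row of `df`, with values copied from the XML above
row <- data.frame(
  area_id = "E01000001",
  fat = 9.02807973835848,
  fat_std = 13.5937419587009,
  saturate = 3.7293430929761,
  saturate_std = 6.75160281412371,
  stringsAsFactors = FALSE
)

# Assumption: you know (or can derive) the list of nutrient names
nutrient_names <- c("fat", "saturate")

doc <- xml_new_root("data")
area <- xml_add_child(doc, "area")
xml_set_attr(area, "area_id", row$area_id)
stats <- xml_add_child(area, "nutrient-statistics")

for (nutrient in nutrient_names) {
  # One grouping node per nutrient, e.g. <fat> ... </fat>
  group <- xml_add_child(stats, nutrient)

  # Columns that belong to this nutrient: the mean itself plus "<nutrient>_*"
  is_match <- names(row) == nutrient |
    startsWith(names(row), paste0(nutrient, "_"))
  for (col in names(row)[is_match]) {
    xml_add_child(group, col, as.character(row[[col]]))
  }
}

cat(as.character(doc))
```

Matching on the prefix followed by an underscore prevents a nutrient from picking up columns that merely start with the same letters.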