🗓️ Week 01 – Day 02: Data types and common file formats

CSV, JSON, XML, YAML, and more

Published

11 July 2023

Today, we will look at the most common data types and file formats used in data science. More specifically, we will cover:

Distinguishing structured vs unstructured vs semi-structured data
Primitive types, objects, and data frames
Overview of standard file formats: CSV, JSON and XML
(If time allows) An overview of less common but useful formats: YAML, Parquet, Avro, ORC, and more
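
To make the first three topics a little more concrete, here is a minimal sketch in base R. It is not taken from the slides: the variable names (`record`, `csv_text`, `df`) and the values are made up purely for illustration.

```r
# Primitive (atomic) types: every value in R has an underlying type
typeof(42L)      # "integer"
typeof(3.14)     # "double"
typeof("hello")  # "character"
typeof(TRUE)     # "logical"

# A list can hold values of different types and can be nested.
# This is roughly how semi-structured data (e.g. JSON) looks once it is in R.
record <- list(
  name   = "Ana",            # made-up value
  scores = c(72, 91),        # a vector inside the list
  tags   = list("lab", "w01") # a nested list
)

# A data frame is structured (tabular) data: rows are observations,
# and each column holds values of a single type.
# Here a tiny CSV is parsed from a string instead of a file.
csv_text <- "name,group,score\nAna,A,72\nBen,B,58"
df <- read.csv(text = csv_text)

str(df)  # shows the type that read.csv() inferred for each column
```

The same `read.csv()` call works with a file path in place of `text = ...`; the string is used here only so the example runs without any external file.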

👨‍🏫 Lecture Slides

Either click on the slide area below or click here to view it in fullscreen. Use the arrow keys on your keyboard to navigate the slides.

🎥 Looking for lecture recordings? You can only find the links to those on Moodle.

📖 Recommended readings & Revision

Today’s lab is very important: we will need the XML-style data manipulation skills it covers in 🗓️ Week 01 – Day 04 (Web Scraping), so take a moment to check that you are on the right track.

Do a bit of self-checking:

Would you be able to explain in simple terms what the following dplyr functions do?

select()
filter()
arrange()
summarise()
mutate()

If the answer is no, refer to (Wickham and Grolemund 2016, chap. 5) (this book is available for free online). You can also refer to the dplyr documentation and the dplyr cheatsheet.
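
As a quick reminder before you open the book, the sketch below runs each of the five verbs once. It assumes the dplyr package is installed; the `students` data frame and its values are invented for this example and are not part of the lab material.

```r
library(dplyr)

# A small, made-up data frame used only for illustration
students <- data.frame(
  name  = c("Ana", "Ben", "Chloe", "Dev"),
  group = c("A", "B", "A", "B"),
  score = c(72, 58, 91, 65)
)

# select(): keep only the named columns
selected <- select(students, name, score)

# filter(): keep only the rows that satisfy a condition
passing <- filter(students, score >= 65)

# arrange(): reorder rows, here from highest to lowest score
ranked <- arrange(students, desc(score))

# mutate(): add a new column computed from existing ones
flagged <- mutate(students, passed = score >= 60)

# summarise(): collapse many rows into a summary,
# here one row per group thanks to group_by()
by_group <- students %>%
  group_by(group) %>%
  summarise(mean_score = mean(score))

print(by_group)
```

Try to predict what each of `selected`, `passing`, `ranked`, `flagged` and `by_group` contains before printing it. If a prediction is off, that is the verb to revisit.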

XML

Revisit the code your class instructor shared in Part 2 of the lab (🧑🏻‍🏫 TEACHING MOMENT).
Do you really get what’s going on? If not, ask questions on Slack! Don’t be shy, you’re likely not the only one who’s confused. We’ll go through the questions at the beginning of the next lecture.
Revisit your own code for the rest of the lab. Try to figure out the reason behind each line of code.
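
If you want a small, self-contained example to experiment with while revising, the sketch below may help. It is not the code from the lab: it assumes the xml2 package (the lab may have used a different one), and the XML snippet, tag names and values are all made up for illustration.

```r
library(xml2)

# A tiny, made-up XML document; read_xml() accepts literal XML as a string
doc <- read_xml('
<courses>
  <course code="C1"><title>Example Course A</title><weeks>3</weeks></course>
  <course code="C2"><title>Example Course B</title><weeks>6</weeks></course>
</courses>')

# xml_find_all() takes an XPath expression and returns every matching node
courses <- xml_find_all(doc, ".//course")

# xml_attr() reads an attribute from each node
codes <- xml_attr(courses, "code")

# xml_find_first() looks inside each <course> for its first matching child,
# and xml_text() extracts the text between the tags
titles <- xml_text(xml_find_first(courses, "./title"))
weeks  <- as.integer(xml_text(xml_find_first(courses, "./weeks")))

# Combine the pieces into a data frame, one row per <course>
course_df <- data.frame(code = codes, title = titles, weeks = weeks)
print(course_df)
```

The pattern of finding nodes with XPath, then pulling out text or attributes, is the same one you will rely on when scraping HTML pages.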

Optional

If you are aiming for an A+ in this course, you should attempt the 🏡 Bonus Task. Bring questions to the next lecture if you get stuck.

📟 Communication

Don’t understand why your code is not working? Ask questions on the public channels on Slack.
I will go through the questions posted in the public channels on Slack at the beginning of each lecture.

References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol [CA]: O’Reilly. https://r4ds.had.co.nz/.