Week 01 – Day 02: Data types and common file formats
CSV, JSON, XML, YAML, and more
Today, we will look at the most common data types and file formats used in data science. More specifically, we will cover:
- Distinguishing structured, semi-structured, and unstructured data
- Primitive types, objects, and data frames
- Overview of standard file formats: CSV, JSON and XML
- (if time allows) an overview of less common but useful formats: YAML, Parquet, Avro, ORC, and more
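To make the topics above more concrete, here is a minimal sketch of the same two records stored as CSV, JSON, and XML, each read into an R data frame. It assumes the jsonlite and xml2 packages are installed; the records and variable names are made up for illustration and are not taken from the lecture or the lab.

```r
# Assumption: the jsonlite and xml2 packages are installed.
library(jsonlite)
library(xml2)

# CSV: a flat table, one record per line (structured data)
csv_string <- "name,age\nAda,36\nAlan,41"
from_csv <- read.csv(text = csv_string, stringsAsFactors = FALSE)

# JSON: an array of objects with named fields (semi-structured data)
json_string <- '[{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]'
from_json <- fromJSON(json_string)

# XML: nested elements, queried here with XPath (semi-structured data)
xml_string <- "<people>
  <person><name>Ada</name><age>36</age></person>
  <person><name>Alan</name><age>41</age></person>
</people>"
doc <- read_xml(xml_string)
from_xml <- data.frame(
  name = xml_text(xml_find_all(doc, "//person/name")),
  age = as.integer(xml_text(xml_find_all(doc, "//person/age"))),
  stringsAsFactors = FALSE
)

# Inspect the column types of each result
str(from_csv)
str(from_json)
str(from_xml)
```

Note that XML stores everything as text, so the sketch converts `age` to an integer by hand, whereas `read.csv()` and `fromJSON()` infer the column types on their own.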
👨‍🏫 Lecture Slides
Either click on the slide area below or click here to view the slides in fullscreen. Use your keyboard to navigate the slides.
🎥 Looking for lecture recordings? The links are available only on Moodle.
Recommended readings & Revision
Today's lab is very important, as we'll need the same XML-like data manipulation skills in Week 01 – Day 04 (Web Scraping), so do take a moment to check you're on the right track.
Do a bit of self-checking:
Would you be able to explain in simple terms what the following dplyr functions do?
- select()
- filter()
- arrange()
- summarise()
- mutate()
If the answer is no, refer to Wickham and Grolemund (2016, chap. 5); the book is available for free online. You can also refer to the dplyr documentation and the dplyr cheatsheet.
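If you would like to check your answers by running code, here is a minimal sketch that applies each of the five functions to a toy data frame. It assumes the dplyr package is installed; the `students` data and its column names are hypothetical and not taken from the lab.

```r
# Assumption: the dplyr package is installed.
library(dplyr)

# Hypothetical toy data, invented for this sketch
students <- tibble(
  name   = c("Ada", "Alan", "Grace", "Linus"),
  course = c("DS", "DS", "CS", "CS"),
  mark   = c(72, 58, 81, 65)
)

# select(): keep only the named columns
names_and_marks <- select(students, name, mark)

# filter(): keep only the rows that satisfy a condition
passing <- filter(students, mark >= 60)

# arrange(): reorder the rows, here from highest to lowest mark
ranked <- arrange(students, desc(mark))

# mutate(): add a new column computed from existing ones
flagged <- mutate(students, passed = mark >= 60)

# summarise(): collapse all rows into a single summary row
overall <- summarise(students, mean_mark = mean(mark))

print(ranked)
print(overall)
```

Try predicting the number of rows and columns in each result before printing it; if your prediction is off, that is a good cue to revisit the readings above.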
XML
- Revisit the code your class instructor shared in Part 2 of the lab (🧑🏻‍🏫 TEACHING MOMENT).
- Do you really get what's going on? If not, ask questions on Slack! Don't be shy: you're likely not the only one who's confused. We'll go through the questions at the beginning of the next lecture.
- Revisit your own code for the rest of the lab. Try to figure out the reason behind each line of code.
Optional
If you are aiming for an A+ in this course, you should attempt the 💡 Bonus Task. Bring questions to the next lecture if you get stuck.
Communication
- Don't understand why your code is not working? Ask questions on the public channels on Slack.
- I will go through the questions posted in the public channels on Slack at the beginning of each lecture.