🧑‍🏫 Week 01 Lecture

Course Logistics & Introduction to Data Science Tools

Published: 02 October 2024

Image created with the AI embedded in MS Designer using the prompt 'abstract salmon pink light blue icon depicting the metaphysical experience of cleaning up, reshaping, pivoting, and manipulating data in search of the purest insights in data science.'

Welcome to our very first lecture! 🎉

Below, you will find the schedule, as well as the written notes for the lecture.

📃 Schedule

๐Ÿ“Location: Thursday 3 October 2024, 4 pm - 6 pm at CLM.5.02

โš ๏ธ The location changed to CLM.5.02!

This first lecture will have two parts:

  1. Course Logistics. (4 pm - 5 pm) Here, we will go over the ℹ️ Course Information together. We will discuss the structure of the course, the teaching philosophy, the topics we will cover each week, how you will be assessed, how to contact us, how to get help, as well as the course policy on the use of Generative AI tools.

  2. Introduction to Data Science Tools. (5 pm - 6 pm) In this part, I will describe the main tools used in professional data science projects (Python, Jupyter Notebooks, Terminal, Git, and more). We will also revisit the concepts covered in the 📝 W01 Formative Exercise and connect them to what's to come next week.

📋 Preparation

At DS105, we don't think teaching is restricted to the classroom. We believe learning happens anytime, anywhere, and that frequent human-to-human communication is key to a successful learning experience.

Our lectures are one place where that communication happens. More than just getting new information, this is the place to consolidate your understanding of the self-study material we shared with you by asking questions and engaging in discussions.

The best way to come prepared for this first one is to browse the two links below:

โ„น๏ธ Course Information

Visit the course info page to read everything you need to know about the course, including the topics we will cover each week, how you will be assessed, how to contact us, how to get help, and the course policies.

Bring your questions to the lecture!

๐Ÿ“ W01 Formative Exercise

Completing the exercise will help you arrive well-prepared for the first lecture. This session will be most effective if you have already attempted the exercise and have questions about it.

In particular, what was most confusing/challenging about this exercise?

๐Ÿ“ Lecture Notes

📋 TAKE NOTE:

  • You won't find "slides for studying" in this course. I do use slides in my lectures, but they serve as a visual aid to help me organise my thoughts.

  • The studying material is in the written notes below.

  • Let me know if you want me to add notes on any specific topic or expand on something you might want to revisit later.

The lecture covered two main topics. Scroll down to find the notes for each one.

TOPIC 1: A Typical Data Science Workflow

When you work with data, you usually follow a process that goes from gathering data to communicating insights (or deploying a data-driven solution). This process is definitely not linear; you will find yourself going back and forth between steps. Still, it helps to think of it as a sequence of steps:

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

Where to find wild data?

Many countries, organisations and researchers offer their data for free to the public, and there are plenty of open data portals online where you can browse and download datasets.

However, very frequently, you need to collect your own data. This can be done by:

  • APIs: Many websites offer Application Programming Interfaces (APIs), tools that allow you to download data directly from their servers. The API maintainer will typically provide documentation and specify how you can access the data and, importantly, how much you can download.

โš ๏ธ Warning: APIs can be a business model for some companies. They might charge you for access to their data.

  • Web scraping: Extracting data directly from websites (e.g., Wikipedia)

โš ๏ธ Warning: Never collect personal data! It is inappropriate and against the law in many countries. For example, the Data Protection Act 2018, the UKโ€™s version of the European legislation General Data Protection Regulation (GDPR), regulates the collection and processing of personal data.

  • Private databases: Organisations usually have a large amount of data stored in their internal databases. If you are employed by or working at an organisation and have the necessary permissions, you can directly access these databases and retrieve the data.

  • Logs: Many apps you use keep logs of your activity (think of your browser history or the number of steps you took today). You can treat those logs as data and analyse them.

  • Surveys: These are a common way to collect data in the social sciences. You can create surveys and collect data from people. Surveys alone are often not enough to make generalisations about a population, but they can be a good starting point, and you can combine them with other data sources.

  • Sensors: Many devices have sensors that collect data. For example, your phone has a GPS sensor that tracks your location.

How we gather & store data

When I talk about "gathering data", I mean reading data from a chosen source before storing or analysing it, whereas storing data refers to saving that data to a file. In other words, if you gather data using the Terminal or Python but haven't saved it anywhere, it will disappear as soon as you close the Terminal or the Python shell.

Sometimes, you donโ€™t need to gather data because the data you collected already comes in a file. In this case, you just need to read it.
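To make the gather/store distinction concrete, here is a minimal Python sketch. The values and the filename are made up purely for illustration:

```python
# Gathering without storing: the data lives only in this Python variable.
# If you close the Python shell now, it is gone.
measurements = [12.3, 15.7, 14.1]

# Storing: write the data to a file so it survives after the session ends.
with open("measurements.txt", "w") as f:   # hypothetical filename
    for value in measurements:
        f.write(f"{value}\n")
```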

All of the above will become clearer once we start collecting data in the 💻 Week 02 Lab and later in the 👨🏻‍🏫 Week 07 Lecture.

One could say there are three main ways to think about how we gather and store data:

  1. APIs & Websites: You can use the curl command in the Terminal or the requests library in Python to download data from the web.

📋 NOTE:

  • We will play with curl in 💻 Week 02 Lab next week.
  • We will probably study APIs and learn about the requests library in 👨🏻‍🏫 Week 07 Lecture.
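To give you a taste of what is to come, here is a minimal sketch of the requests approach. The URL and parameters below are placeholders, not a real endpoint from the course; the labs will use their own data sources.

```python
import requests  # third-party library: pip install requests

# Placeholder endpoint -- swap in a real API URL from the documentation of the service you use.
url = "https://api.example.com/v1/observations"

response = requests.get(url, params={"limit": 10})  # query parameters, if the API accepts any
response.raise_for_status()                         # stop early if the request failed

data = response.json()  # most APIs respond with JSON
print(type(data))
```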
  2. Plain text files: If your data comes from a piece of software (a log file, for example), or you downloaded it from a website, it is probably already stored in a file. Similarly, when you collect data from the Internet, it is always a good idea to store it in a file that you can read later.

You can use Python's standard open() function to read data from files 1. Still, if the data is in a more structured format, you might prefer to use the pandas library in Python.

There are many standards for storing data in files, and the most common ones are TXT (plain text), CSV (Comma Separated Values)2, and JSON (JavaScript Object Notation).

📋 NOTE: We will start to use the pandas library in the 👨🏻‍🏫 Week 03 Lecture once we are confident with our Terminal + basic Python skills.
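As a preview, here is a minimal sketch of both file-reading approaches. The filenames are hypothetical:

```python
import pandas as pd

# Plain-text approach: read the raw lines yourself with open().
with open("downloads.log") as f:           # hypothetical log file
    lines = f.readlines()
print(f"The log has {len(lines)} lines")

# Structured approach: let pandas parse a CSV straight into a DataFrame.
df = pd.read_csv("survey_responses.csv")   # hypothetical CSV file
print(df.head())
```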

  3. Databases: There are many libraries in Python that allow you to connect to databases. The most robust and popular one that caters to a wide range of databases is the SQLAlchemy library, but sometimes you might need a library specific to a particular database.

📋 NOTE: We will study databases later, in the 👨🏻‍🏫 Week 08 Lecture.
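If you are curious, here is a minimal sketch of how pandas and SQLAlchemy can talk to a SQLite database. The filename and table are made up for illustration; this is not the exact setup we will use in Week 08.

```python
import pandas as pd
from sqlalchemy import create_engine

# A SQLite database is just a single file on disk (hypothetical filename here).
engine = create_engine("sqlite:///ds105_example.db")

# Write a small DataFrame to a table, then read it back with a SQL query.
df = pd.DataFrame({"country": ["UK", "Brazil"], "population_m": [67, 214]})
df.to_sql("countries", engine, if_exists="replace", index=False)

print(pd.read_sql("SELECT * FROM countries", engine))
```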

How we clean & pre-process data

Even when your data is stored somewhere, you will find that no matter how organised you were when you collected it, it will always need some cleaning and pre-processing.

Cleaning data is the process of removing or correcting errors in the data. This can be as simple as removing a row with missing data or as complex as correcting the spelling of a word in a text.

Pre-processing data, on the other hand, refers to transforming the data into a format more suitable for analysis. For example, sometimes, we need to remove some columns from a dataset, normalise the data so that all the columns have the same scale, or transform the data to make visualisation easier.

โญ๏ธ In this course, we focus a lot of energy on this stage. โญ๏ธ

We will learn to clean and pre-process data efficiently using Python and the pandas library.
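Here is a minimal sketch of what that can look like, using a tiny made-up dataset:

```python
import pandas as pd

# A tiny made-up dataset with the kinds of problems you will meet in practice.
df = pd.DataFrame({
    "city":   ["London", "london ", "Paris", None],
    "temp_c": [12.5, 12.5, None, 9.0],
})

# Cleaning: fix inconsistent text and drop rows with missing values.
df["city"] = df["city"].str.strip().str.title()
df = df.dropna()

# Pre-processing: derive a new column on a more convenient scale.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32
print(df)
```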

How we build a dataset

You might not know it yet, but even after you have cleaned and pre-processed it, your data might not be in an easy-to-analyse format. A convention called tidy data makes it easier to analyse data across different programming languages and tools.

โญ๏ธ Creating tidy datasets is another fundamental aspect of this course. โญ๏ธ

We will learn how to build tidy datasets using Python and the pandas library, starting in the 👨🏻‍🏫 Week 03 Lecture.

Eventually, you will learn how to save your (tidy) data to a database. We will explore the SQLite database, starting in the 👨🏻‍🏫 Week 08 Lecture.
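To give you a first glimpse of what "tidy" means in practice, here is a minimal pandas sketch that reshapes a made-up "wide" table into a tidy "long" one:

```python
import pandas as pd

# A "wide" table: one column per year is easy to read but awkward to analyse.
wide = pd.DataFrame({
    "country": ["UK", "Brazil"],
    "2022": [2.1, 2.9],
    "2023": [1.8, 3.1],
})

# Tidy ("long") format: one row per observation, one column per variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp_growth")
print(tidy)
```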

How we do exploratory data analysis (EDA) & collect insights

Exploratory Data Analysis (EDA) is the process of analysing data to summarise its main characteristics. In this course, we will take a curiosity-driven approach that is more visual than mathematical. It will be closer to data journalism than to statistics. Our focus is not machine learning but understanding our data and communicating what we see to others. We won't teach, for example, hypothesis testing or statistical inference.

The default Python library for EDA is pandas but, specifically when it comes to data visualisation, popular choices are the matplotlib and seaborn libraries.

👉 IMPORTANT! In this course, we will go rogue! Instead of those two popular choices, we will teach you a bit about the programming language R and the popular ggplot package. We won't stay with R, but all the visualisations you create in this course will have to be made using the lets-plot package, a Python version of R's ggplot.
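As a preview of the grammar-of-graphics style you will use, here is a minimal lets-plot sketch with a made-up dataset (the column names and values are purely illustrative):

```python
import pandas as pd
from lets_plot import LetsPlot, ggplot, aes, geom_point  # pip install lets-plot

LetsPlot.setup_html()  # needed once per Jupyter Notebook session

# A tiny made-up dataset just to show the ggplot-style syntax.
df = pd.DataFrame({"hours_studied": [1, 2, 3, 4, 5],
                   "score":         [52, 60, 65, 72, 80]})

plot = ggplot(df, aes(x="hours_studied", y="score")) + geom_point()
plot.show()  # in a notebook, the chart also renders if `plot` is the last line of a cell
```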

By "collecting insights," we mean confirming what we suspected about the data, discovering new patterns, finding outliers, and understanding how different variables relate to each other. It's more of an art and a craft than a science. We'll first discuss this in the 👨🏻‍🏫 Week 03 Lecture, but insight discovery and curiosity-driven analysis will be constant throughout the course.

How we do machine learning

Machine Learning and statistical inference comprise a set of tools that allow you to make inferences and predictions based on data using algorithms. This is a very exciting part of data science, but it is outside the scope of this course. You might want to check our sister course, DS202 - Data Science for Social Scientists, where we cover the fundamentals of machine learning. For the record, in Python, the most popular library for machine learning is scikit-learn.

How we communicate results

This is the final step of a data science project cycle. You've found something interesting in your data and need to spread the word. This can be done in many ways, but the most common ones are:

  • Reports and presentations: Rather than writing .docx files, we will teach you how to write reports in Markdown-powered documents using Jupyter Notebooks or websites.

  • Dashboards: You can create a dashboard with your findings. We will teach you how to create dashboards using the streamlit library in Python in the 👨🏻‍🏫 Week 10 Lecture.

We will also try to give you some tips on how to tailor your charts and reports to different audiences.
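To show how little code a first dashboard needs, here is a minimal streamlit sketch. The data is made up; save the code as app.py and run `streamlit run app.py`:

```python
import pandas as pd
import streamlit as st

st.title("My first DS105 dashboard")

# A tiny made-up dataset; in a real project you would load your tidy dataset here.
df = pd.DataFrame({"week": [1, 2, 3, 4], "submissions": [40, 55, 48, 60]})

week = st.slider("Show weeks up to:", min_value=1, max_value=4, value=4)
st.line_chart(df[df["week"] <= week].set_index("week"))
```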

TOPIC 2: Making sense of files and folders

It's really important to understand how to navigate your computer's folders and files using paths and directories. Even if you're great at Python, pandas, databases, and visualisations, you'll still struggle (and you won't understand why) if you don't know your way around your computer's file system and how paths to files and directories work. Understanding how data and files are stored on your computer is key to working effectively with them.

To truly master programming, learn how to master the command line first!

This is why in Weeks 01 & 02 we will focus on practising the use of the Terminal (I might sprinkle a few Python commands here and there, but the focus is still on the Terminal). If you did the 📝 W01 Formative Exercise, you already have an idea of what I'm talking about, but let's go through it step-by-step.

Let me start from the beginning:

What is an Operating System?

To fully understand the Terminal, you need to understand what an Operating System is. An Operating System is THE main piece of code running on your computer: it manages what you see on the screen, what you can do with the keyboard and mouse, and how you interact with the hardware. The Operating System (OS) is the one that sends commands to the hardware on your behalf whenever you click on a file or do something in an app on your computer.

It will help to think of a computer as a system made up of these parts:

  • hardware: provides the basic computing resources for the system
  • application programs: define how these resources are used
  • operating system: controls the hardware and coordinates its use among the various application programs for the various users
  • user: a person or a bot (a computer script) that requests actions from the computer.

Typically, these components interact in a layered way, with the user at the top, the application programs in the middle, and the operating system at the bottom:

User → Application Programs (compilers, web browsers, development kits, etc.) → Operating System → Computer Hardware (CPU, memory, I/O devices, etc.)

Why bother with this?

Image created with DALL·E via Bing Chat AI bot. Prompt: "a gigantic wooden question mark looms above the big ben, ultra-realistic awesome painting"

  • It is improbable you will ever need to interact with the kernel (the core of the OS) directly.
  • But we often need to install custom software to perform some data analysis.
    • This software might not come from the Apple or Microsoft app stores.
    • Those are things you have to install "manually".

Common Operating Systems

The most common Operating Systems are:

  • Windows: the most popular OS for personal computers worldwide.

  • macOS: the Operating System of Apple computers.

  • Linux (and its many distributions). It is what YOU will use next week when we access the Nuvolos Cloud Platform.

A few brief notes about the popular Operating Systems:

GNU/Linux

Linux and macOS share a common ancestor: an old operating system called UNIX. Back in the 70s, UNIX was the Operating System of choice for many universities and research institutions. When you interact with the Terminal today, you get a sense of what it was like to operate a UNIX system back in the 70s, when the notion of a graphical user interface was still a dream.

What UNIX System III looked like.

UNIX was a proprietary system 🤑, and many people wanted to have a free version of it. This led to the birth of one of the most influential operating systems: GNU/Linux, or simply Linux. Although Linux is not used much on personal computers or laptops, it is the preferred OS for cloud-based applications, including data science.

🔗 Learn more about Linux: RedHat - Understanding Linux

Android, the most popular OS for phones worldwide, is based on Linux!

macOS

  • macOS is the Operating System of Apple computers.
  • It is a hybrid system. It has a free, open-source component called Darwin, but it also includes proprietary, closed-source components.
  • iOS, Apple's mobile operating system, is also based on Darwin.
  • Darwin is based on BSD UNIX, a derivative of the original UNIX system.

Windows

  • Windows has its own history.
  • Microsoft and IBM co-developed its predecessor, the OS/2 operating system.
  • But then, Microsoft took its own path and developed its own versions of the OS: Windows NT, Windows 95, Windows 98, Windows 2000, Windows XP, Windows Vista, Windows 7, etc.
  • Windows' popularity can be traced to the success of the Office suite.

Files & Filesystems

Image created with DALL·E via Bing Chat AI bot. Prompt: "robots sorting and shelving physical files in folders. Circular tables, white room, pops of color, modern, cosy, clean flat design"

  • Ultimately, everything in a computer is just a bunch of 0s and 1s
  • A file format is a set of conventions that allows us to extract information from those 0s and 1s.
  • A directory, or folder, is a place where many files are stored. It is a way to organise files.
    • Think of them as shelves 🗄️

Files are stored in a hierarchical structure called a filesystem. This structure is similar to a tree, with the root directory 3 at the top, and subdirectories branching out from it.

This kind of hierarchical structure is present in all modern Operating Systems (Windows, macOS, Linux, etc.). In theory, it can contain an unlimited number of sub-directories and files.

The UNIX directory tree

In macOS, as well as in Linux, the directory structure typically looks like this:

/
├── bin
├── dev
├── etc
├── home
│   └── jonathan
│       ├── Documents
│       │   └── Workspace
│       │       └── DS105A
│       ├── Images
│       ├── Videos
│       └── Downloads
├── lib
├── mnt
├── proc
├── root
├── sbin
├── tmp
├── usr
│   ├── bin
│   ├── include
│   └── lib
└── var
    ├── log
    ├── mail
    ├── spool
    └── tmp

On Windows, although there is a hierarchical structure, it is very different from the UNIX structure. Windows has a drive letter system, where each drive is a separate filesystem. The most common drives are C:\, D:\, E:\, etc. Here is what a typical Windows filesystem looks like (starting from the C:\ drive):

C:\
├── Program Files
├── Program Files (x86)
├── Users
│   └── Jonathan
│       ├── AppData
│       │   ├── Local
│       │   ├── LocalLow
│       │   └── Roaming
│       ├── Documents
│       ├── Downloads
│       ├── Images
│       ├── Videos
│       └── Workspace
│           └── DS105A
├── Temp
└── Windows
    └── System32
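One practical consequence: in Python, you can build paths in a way that works on any of these filesystems. Here is a minimal sketch using the standard-library pathlib module; the folder names are just an example borrowed from the trees above:

```python
from pathlib import Path

# Your home directory, wherever the OS puts it:
# e.g. /home/jonathan on Linux, /Users/jonathan on macOS, C:\Users\Jonathan on Windows.
home = Path.home()

# Build a path to a (hypothetical) course folder without worrying about / vs \
course_dir = home / "Workspace" / "DS105A"
print(course_dir)

# List what is inside your current working directory.
for item in Path.cwd().iterdir():
    print(item.name)
```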

Can we get back to the Terminal?

A terminal, also called a command prompt or the command line, is a window that gives you a very direct line to the core of your computer. It is a text-based interface to the computer: there are no graphics (images/video) in the terminal, only text.

👉 Instead of the usual click-and-drag way of using the computer, you have to type a command to change directories, another to open a file, another to move a file, etc.

Crucially, you need to know what to type! Clicking randomly to see what happens won't work.

Image of a Terminal where someone is checking all the apps their computer is running. Source: Gortu at English Wikipedia

Shell

Typically, the Terminal runs a program (app) called the shell. Sometimes I will use the terms interchangeably, but if we were to be pedantic, the Terminal is the window, and the shell is the program that runs inside it.

The shell awaits, interprets, processes, executes, and responds to commands typed in by the user.

There are many shells, each with its own features. Here are some popular Linux shells:

  • sh or the Bourne shell: developed at AT&T labs in the 70s by a guy named Stephen Bourne.
  • bash or the Bourne again shell: very popular, compatible with sh shell scripts.
  • Our 🖥️ labs will focus on bash
  • ksh or the Korn shell: provides enhancements over sh and is also compatible with bash.
  • csh and tcsh: shells that have a syntax similar to the programming language C.

👉 IMPORTANT! Next week, everyone will use the bash shell on our Nuvolos Cloud Platform.

Windows CMD vs PowerShell

Windows has its own thing going on. There are two main shells on Windows these days:

  • CMD: the old shell, which is still around for compatibility reasons.
  • PowerShell: the newer shell, which is more powerful and has more features.

What the CMD looks like

What PowerShell looks like

Although we do give support to PowerShell in this course, we will focus on bash in the labs, as it is the most common shell in the Linux world.

Now what?

Well, the actual usage of the Terminal is a bit more complex than what I can explain here. The best way to learn the commands is with practice, so I recommend doing (or revisiting) the 📝 W01 Formative Exercise.

Footnotes

  1. more on that in the upcoming 📝 W02 Formative Practice↩︎

  2. more on that in the upcoming 📝 W02 Formative Practice↩︎

  3. Remember that from the 📝 W01 Formative Exercise?↩︎