🗓️ Week 01
Welcome to the course

DS105W – Data for Data Science

18 Jan 2024

Who we are

Your lecturer

Photo of Dr Jon Cardoso Silva
Dr. Jon Cardoso-Silva
Assistant Professor of Data Science (Education)
LSE Data Science Institute
📧 E-mail
course convenor, lecturer

  • PhD in Computer Science (King’s College London)
  • Background: Engineering, Bio & Health Informatics
  • Former Lead Data Scientist

networks
optimisation
software engineering
machine learning applications
the impact of generative AI for education

Teaching Assistants

Photo of Sara Luxmoore
Sara Luxmoore
GENIAL project
Digital Skills Lab
📧 E-mail
guest teacher
Photo of Alex Soldatkin
Alexander Soldatkin
DPhil Candidate
Oxford School of Global and Area Studies
University of Oxford
📧 E-mail
guest teacher
Photo of Mustafa Can Ozkan
Mustafa Can Ozkan
PhD Candidate
SpaceTimeLab
University College London (UCL)
📧 E-mail
guest teacher

Administrative Support

Photo of Kevin Kittoe
Kevin Kittoe
Teaching and Learning Administrator
LSE Data Science Institute
📧 E-mail

Write an e-mail to Kevin:

  • if you cannot find the lecture recording on Moodle
  • when you need an extension for an assignment
    (👉 check LSE’s extension policy)
  • to request a class group change
    (you will be asked to provide a reason for this)
  • to inform us of any other issues that may affect your studies

The Data Science Institute

  • This course is offered by the LSE Data Science Institute (DSI).
  • DSI is the hub for LSE’s interdisciplinary collaboration in data science
  • ⏭️ Let’s see a few activities that might be of interest to you

CIVICA Seminar Series

Careers in Data Science

Hear from alumni or industry experts about their career paths and how they got to where they are today.

Upcoming event:

🗓️ Keeping London Moving with Data (28 February, 4-5.30pm)

A talk about life in the data world at TfL. Jemima, Graduate Data Scientist at Transport for London (TfL), will talk about her experience as a Data Science Graduate in our inaugural programme. Lauren Sager Weinstein, Chief Data Officer at Transport for London (TfL), will talk about how she’s leading TfL’s data strategy, and how all the components of data careers (data scientists, data developers, data product managers, and data users) can come together to deliver on our data vision: to empower our people to make better decisions with data.

Industry “field trips”

Who are you?

Programme                                       Count
BSc in Economics                                   31
General Course                                      8
BSc in International Social and Public Policy       2
BSc in Politics and Data Science                    2
BSc in Politics and Economics                       2
BSc in Psychological and Behavioural Science        2
BSc in Sociology                                    2
BSc in Economic History                             1
BSc in Economics and Economic History               1
BSc in Finance                                      1
BSc in Philosophy                                   1
BSc in Politics                                     1

Year    Count
1          35
2          13
3           5
4           1

Who are you? (cont.)

What is this course about?

DS105 – Data for Data Science

📑 Course Brief

  • Focus: learn how to collect and handle so-called “real data”

  • How: hands-on coding exercises and a group project

🎯 Learning Objectives

  • Create terminal commands to effectively navigate the file system and execute programs
  • Analyse and categorize various data types and identify prevalent data formats
  • Use Markup Language (XML) and Markdown format proficiently for document and web page formatting
  • Interpret and adhere to international standards for common data types
  • Assess data quality, implement data cleaning procedures, and troubleshoot common data corruption issues

🎯 Learning Objectives (cont.)

  • Use web scraping and APIs to retrieve data from Internet sources
  • Demonstrate comprehension of database concepts and fundamentals
  • Combine and link data from disparate sources
  • Utilize GitHub, based on the git version control system, for collaborative and version control purposes
  • Use markdown to create reports of data analysis
  • Combine a mix of markdown, HTML and CSS to maintain and customise simple websites

How will it work?

👨🏻‍🏫 THE LECTURES

  • 2-hour sessions but I promise it won’t be tedious
  • On 🗓️ Thursdays 16:00-18:00
  • Only the first couple of lectures have slides
  • From Week 03, we will use Jupyter Notebooks and/or GitHub repositories
  • Bring your laptop and follow along

💻 THE LABS

  • 90 minutes of practice on the lecture’s topic
  • On 🗓️ Mondays and Tuesdays
    (check your class group)

You will encounter these icons:

  • 👨🏻‍🏫 TEACHING MOMENT: Your class teacher deserves your undivided attention
  • 🎯 ACTION POINTS: Follow the instructions and complete the exercises (alone or in groups)
  • 🗣️ CLASS DISCUSSION: Discuss your findings with your classmates (mediated by class teacher)
  • 📝 SUBMISSION: Submit your work

👉 Now, let’s navigate our Moodle page to see the 📓 Syllabus and to talk about ✍️ Assessments & Feedback.

Teaching Philosophy


  • My teaching approach is grounded in empiricism.
  • I see learning as a transformative process, something that leads to change, which is best facilitated by active, experience-focused, and exploration-driven activities.
  • In summary: learning by doing serves as the cornerstone of this course.

Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person climbing a mountain of books, with each book representing a different topic or skill. The person is holding a magnifying glass and a compass, and is looking for new paths and discoveries.”

What does that mean in practice?


Image created with DALL·E via Bing Chat AI bot. Prompt: “An illustration of a person trying to solve a puzzle with pieces that have different symbols and formulas on them. The person is looking at a screen that shows the 📋 Getting Ready guide and has a smile on their face.”

  • Occasionally, I’ll present you with tasks before diving into the corresponding theory or background knowledge.
    • For example: asking you to follow the challenging steps in the 📋 Getting Ready guide even before our first lecture!
  • Reasoning: letting your ‘struggles’ guide the learning process.
  • 👉 allow yourself to make silly mistakes and to ask ‘dumb questions’.
    • But if you feel this is not working, drop me an e-mail or come to my office hours (see 📟 Communication)

AI tools in this course

Do you use ChatGPT, GitHub Copilot, or other AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “An image that shows a classroom where people have their pet AI bot on their desks, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical cute bot. Each student has their own.”

LSE Policy on AI tools

LSE takes challenges to academic integrity and to the value of its degrees with the utmost seriousness. The School has detailed regulations and processes for ensuring academic integrity in summative work.

Unless Departments provide otherwise in guidance on the authorised use of generative AI, its use in summative and formative assessment is prohibited. Departmental Teaching Committees are strongly encouraged to define what constitutes authorised use of Generative AI tools (if any) for students taking courses in their Department. Where they do so, they must clearly communicate this to colleagues, and to students.

Source: LSE (2023) (Emphasis added)

Our policy in this course

  • You can use AI tools during lectures, labs, and for your assignments.
    • Except when the lecturer or class teachers expressly ask you not to use them.
  • When using AI tools for assignments, you must acknowledge their use and tell us how you used them.
    • Examples:

      I used ChatGPT to provide an initial solution to Question X. The code ran and worked fine, but as it was not efficient to the standards of vectorisation taught in the course, I had to edit the code myself to fix the issue.

      I had GitHub Copilot autocomplete on when writing the code for Question X. The code produced was unnecessarily long and didn’t use the pd.merge command I learned in Week 08, so I went back and edited it.

What do you think of generative AI tools?

Image created with DALL·E via Bing Chat AI bot. Prompt: “A university student typing on their laptop. The student has a pet AI bot on their desk, next to their laptops. The AI bot is ChatGPT but it has been disguised as some sort of mechanical bot. Clean, flat design, photo. Friend or foe?”

The GENIAL project

  • We see many students using ChatGPT during lectures, labs, and assessments.
  • Frankly, most university instructors are clueless as to whether this is helping or hindering your learning.
  • So we are doing some research to try to figure out:
    • How are students using generative AI tools in their studies?
    • What are the benefits and drawbacks of using generative AI tools?

Participating Courses:

  • DS105W (Data for Data Science)
  • DS202W (Data Science for Social Scientists)
  • ST456 (Deep Learning)
  • PP422 (Data Science for Public Policy)

The GENIAL project

How will it work:

  1. Create a ChatGPT 3.5 (or Google Bard) account if you don’t have one already.

  2. Open a new ‘chat window’ inside your selected chatbot and tell the AI: ‘I will use this chat for all things related to DS105W’

  3. Use this same chat window whenever you feel like using a generative AI tool during the course.

The GENIAL project

  • Every student is participating in the study 👉
    • But you can opt out at any time.

☕️ Time for a break

Image created with DALL·E via Bing Chat AI bot. Prompt: “robots enjoying a coffee break. Circular tables, white room, pops of color, modern, cosy, clean flat design.”

Our first proper lecture will start in a few minutes.

🧰 The Data Science Toolbox and the Terminal

In the meantime, fill out the form for the GENIAL project:

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → Machine learning → Obtain insights → Communicate results → End

It is often said that 80% of the time and effort spent on a data science project goes to the abovementioned tasks.

The Data Science Workflow

Start → Gather data → Store it somewhere → Clean & pre-process → Build a dataset → Exploratory data analysis → End
Exploratory data analysis → Machine learning → Obtain insights → Communicate results

And this is what this course is about! You will learn some of the most common tools used during this process.
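To make the first steps of the workflow concrete, here is a minimal Python sketch that chains gather, store, clean, build and a tiny exploratory summary. Everything in it is an assumption made for illustration: the data is a made-up, hard-coded CSV, "storage" is just an in-memory file, and the function names are hypothetical rather than part of any course material.

```python
import csv
import io
import statistics

# Made-up raw data, standing in for something collected from a survey or an API.
RAW_CSV = """city,temperature_c
London,11.5
Paris,
london,12.5
Lisbon,18.0
"""


def gather():
    """Gather data: in this sketch we simply return the hard-coded text."""
    return RAW_CSV


def store(raw):
    """Store it somewhere: an in-memory 'file' instead of a real disk or database."""
    return io.StringIO(raw)


def clean(stored):
    """Clean & pre-process: drop rows with missing values, normalise city names."""
    rows = []
    for row in csv.DictReader(stored):
        if not row["temperature_c"]:
            continue  # skip rows where the measurement is missing
        rows.append(
            {
                "city": row["city"].strip().title(),
                "temperature_c": float(row["temperature_c"]),
            }
        )
    return rows


def build(rows):
    """Build a dataset: group the temperatures by city."""
    dataset = {}
    for row in rows:
        dataset.setdefault(row["city"], []).append(row["temperature_c"])
    return dataset


def explore(dataset):
    """Exploratory data analysis: the mean temperature per city."""
    return {city: statistics.mean(temps) for city, temps in dataset.items()}


summary = explore(build(clean(store(gather()))))
print(summary)
```

Notice that most of the code deals with getting the data into shape, not with analysing it.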

The toolbox 🧰

The data science dilemma: Python or R?

  • Python is preferred in the industry (but I 💙 R)
  • R is preferred in academia, especially in the social sciences
  • I find R’s tidyverse to be more intuitive than Python’s pandas
  • This course focuses on Python, but I will share R resources as well
  • In your assignments, you can use either Python or R

Python vs R

Python

Many people struggle with programming because they don’t understand what is going on under the hood.

👉 This is why we spend the first weeks of this course learning and practising with the terminal and file systems.

The Terminal

To truly master programming, learn how to master the command line first

The Terminal

  • A terminal (also called a command prompt or command line) is a screen or a window that lets you access the Operating System’s input and output.
  • There are no graphics (images/video) in the terminal, only text.

Shell

  • Typically, the terminal runs a program (app) called the shell.
  • The shell awaits, interprets, processes, executes, and responds to commands typed in by the user.
  • There are many shells, each has its own features.
  • Popular Linux shells:
    • sh or the Bourne shell: developed at AT&T’s Bell Labs in the 1970s by Stephen Bourne.
    • bash or the Bourne again shell: very popular, compatible with sh shell scripts.
      • Our 🖥️ labs will focus on bash
    • ksh or the Korn shell: provides enhancements over sh, remains backward compatible with it, and shares many features with bash.
    • csh and tcsh: shells that have a syntax similar to the programming language C.
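The "await, interpret, execute, respond" cycle described above can be imitated in a few lines of Python. This is only a toy sketch and not how bash is implemented: `run_command` is a hypothetical helper, and the command it runs is the current Python interpreter, chosen so that the example does not depend on which shell or Operating System you have.

```python
import shlex
import subprocess
import sys


def run_command(line):
    """Handle one line of input the way a very simple shell might."""
    args = shlex.split(line)  # interpret: split the typed line into words
    result = subprocess.run(args, capture_output=True, text=True)  # execute
    return result.stdout  # respond: hand the program's output back


# Build a command line that asks Python itself to print 2 + 3.
command = f'{shlex.quote(sys.executable)} -c "print(2 + 3)"'
print(run_command(command))
```

A real shell adds a lot on top of this (variables, pipes, redirection, scripting), which is what the labs explore with bash.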

Windows CMD vs PowerShell

  • Windows has its own thing.
  • For historical reasons, there are two main terminals/shells on Windows these days:

CMD

PowerShell

Files & Filesystems

What are files?

Image created with DALL·E via Bing Chat AI bot. Prompt: “robots sorting and shelving physical files in folders. Circular tables, white room, pops of color, modern, cosy, clean flat design”

  • Ultimately, everything in a computer is just a bunch of 0s and 1s
  • A file is a set of conventions that allows us to extract information from those 0s and 1s.
    • This might only become clearer after tomorrow’s lab and next week’s lecture.
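As a small illustration of "0s and 1s plus a convention", the Python sketch below writes a three-character text file, reads it back as raw bytes, prints those bytes as bits, and then applies one convention (UTF-8 text) to recover the message. The file name `hello.txt` is made up, and the file lives in a temporary folder that is deleted afterwards.

```python
import tempfile
from pathlib import Path

# Work inside a temporary folder so nothing on your computer is touched.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "hello.txt"  # hypothetical file name
    path.write_text("Hi!", encoding="utf-8")

    raw_bytes = path.read_bytes()  # what is actually stored
    bits = " ".join(f"{byte:08b}" for byte in raw_bytes)  # same bytes as 0s and 1s
    text = raw_bytes.decode("utf-8")  # apply the convention: "this is UTF-8 text"

print(bits)  # 01001000 01101001 00100001
print(text)  # Hi!
```

The same bytes read under a different convention (say, as pixels of an image) would mean something else entirely, or nothing at all.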

Hierarchical directory structure

  • This kind of hierarchical structure is still present in all modern Operating Systems (Windows, macOS, Linux, etc.)
  • A directory, or folder, is a place where many files are stored
    • Think of them as shelves 🗄️
  • In theory, it can contain infinite sub-directories and files

UNIX directory tree

In macOS as well as in Linux, the directory structure typically looks like this:

/
├── bin
├── dev
├── etc
├── home
│   └── jonathan
│       ├── Documents
│       │   └── Workspace
│       │       └── lse-ds105-course-notes
│       ├── Images
│       ├── Videos
│       └── Downloads
├── lib
├── mnt
├── proc
├── root
├── sbin
├── tmp
├── usr
│   ├── lib
│   ├── bin
│   └── include
└── var
    ├── log
    ├── mail
    ├── spool
    └── tmp
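One branch of the tree above (home, jonathan, Documents, Workspace, lse-ds105-course-notes) can be written as a single path. The sketch below uses Python's `pathlib` to take that path apart again. It uses `PurePosixPath`, which only manipulates the text of the path, so the folders do not need to exist on your computer; this is an illustration of the hierarchy, not a replacement for the terminal demo.

```python
from pathlib import PurePosixPath

# The path is taken from the example tree; PurePosixPath never touches the disk.
path = PurePosixPath("/home/jonathan/Documents/Workspace/lse-ds105-course-notes")

parts = path.parts            # every level, starting from the root "/"
parent = path.parent          # the directory that contains the last one
ancestors = list(path.parents)  # all enclosing directories, up to the root

# Print one level per line, indented by depth, to mimic the tree.
for depth, part in enumerate(parts):
    print("  " * depth + part)

print("Parent directory:", parent)
print("Number of enclosing directories:", len(ancestors))
```

Each step down the indentation is one more sub-directory, which is all a path like this encodes.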

Now for a little demo ⏭️

Where Git Bash’s root directory is

Operating Systems (⏳)

Let’s go even deeper into the rabbit hole 🐇

What Operating Systems (OS) do

  • A computer can be divided into four parts:
    • hardware — provides the basic computing resources for the system
    • application programs — define how these resources are used
    • operating system — controls the hardware and coordinates its use among the various application programs for the various users
    • user — a person or a bot (a computer script) that requests actions from the computer.

User → Application Programs (compilers, web browsers, development kits, etc.) → Operating System → Computer Hardware (CPU, memory, I/O devices, etc.)
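A Python script is itself an application program, so it has to go through the Operating System to learn anything about the machine. The short sketch below asks the OS a few questions using the standard `os` and `platform` modules. The variable names are just illustrative, and the values printed will differ from computer to computer.

```python
import os
import platform

system_name = platform.system()  # e.g. 'Linux', 'Darwin' (macOS) or 'Windows'
process_id = os.getpid()         # the number the OS assigned to this running program
cpu_count = os.cpu_count()       # how many CPUs the OS reports (can be None)
working_dir = os.getcwd()        # the directory this program is currently "in"

print("Operating System:", system_name)
print("Process ID:", process_id)
print("CPUs reported:", cpu_count)
print("Working directory:", working_dir)
```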

Insight into operating systems


“An operating system is similar to a government. Like a government, it performs no useful function by itself. It simply provides an environment within which other programs can do useful work.”

(Silberschatz, Galvin, and Gagne 2005, chap. 1)


  • If this sounds a bit vague, it is because it is!
  • It is actually tricky to specify which programs are part of the OS and which ones are not.
  • Let’s try to define what an OS is anyway ⏭️

Definition of OS

  • The OS is the one program running at all times on the computer.
    • This is usually also called the kernel
  • There might be other programs running alongside the OS.
    • For example, the Terminal
  • 📱 Mobile computers usually have more “additional” software alongside the kernel, which we call the middleware.
    • These applications support multimedia, graphics, internal app databases, etc.

Why bother with this?

Image created with DALL·E via Bing Chat AI bot. Prompt: “a gigantic wooden question mark looms above the big ben, ultra-realistic awesome painting”

  • It is improbable you will ever need to interact with the kernel directly.
  • But, we often need to install custom software to perform some data analysis
    • This software might not come from the Apple App Store or the Microsoft Store.
    • Those are things you have to install “manually.”

Tip

Let’s face it. You will always encounter puzzling ⚠️ error messages when programming, no matter how senior or skilled you are.

Understanding a little about how everything is tied together will help you get to the core of the problem more quickly.

History of Operating Systems

History

  • In the early days of modern computing, when computers were not accessible to everyone, software (applications) typically came with their source code open.
  • Open source means you can read precisely which instructions the computer will follow when running.
  • As the industry grew, most software companies released only the binaries: a type of file you can only execute, not read as if it were text.
    • This includes Operating Systems! ⏭️

A computer from the 1950s
(Computer History Museum n.d.)

UNIX

  • UNIX was the first big Operating System, developed at Bell Labs and AT&T
  • It aimed to be simple and easy to port to any hardware architecture
  • But, it required a license
  • In the late 1980s and early 1990s, a group of hackers and activists developed free & open source alternatives to UNIX.

What UNIX System III looks like.

GNU/Linux

  • This led to the birth of one of the most influential operating systems: GNU/Linux, or simply Linux.
  • Android, the most popular OS for phones worldwide, is based on Linux.
  • Two people were instrumental to the development of Linux
    • Richard Stallman
    • Linus Torvalds

Note

GNU stands for “GNU’s Not Unix”. Computer nerds love a recursive joke.

A picture of Richard Stallman A picture of Linus Torvalds

macOS

  • macOS is the Operating System of Apple computers
  • It is a hybrid system. It has a free, open-source component called Darwin, but it also includes proprietary, closed-source components.
  • iOS, Apple’s mobile operating system, is also based on Darwin
  • Darwin is based on BSD UNIX, a derivative of the original UNIX system.

Windows

  • Windows has its own history.
  • Microsoft and IBM co-developed its predecessor, the OS/2 operating system.
  • But then Microsoft took its own path and developed its own versions of the OS: Windows NT, Windows 95, Windows 98, Windows 2000, Windows XP, Windows Vista, Windows 7, etc.
  • Windows’ popularity can be traced to the success of the Office suite

Virtualization

  • Virtualization is a technology that creates the illusion that you are running a separate private computer.
  • You decide how much of your CPU/RAM/Hard drive to share with the virtual machine

Emulators & Virtual Machines

  • You can install an emulator to run Windows inside macOS (and vice versa)
    • Provided you own a licence to install the other OS
  • You can share files to and from the virtual machine inside the emulator, but the internal machine will “think” it is a separate computer.

Note

  • In the 🖥️ labs in 🗓️ Week 03, you will access a virtual machine that lives in the cloud
  • Example of commercial virtualization software

Windows Subsystem for Linux (WSL)

  • In an attempt to entice Linux users (especially developers), Microsoft added a Linux emulator to Windows named “Windows Subsystem for Linux”
  • You install your preferred Linux distribution
    • Ubuntu is one of the most popular

Tip

  • Our 🖥️ labs in Weeks 1 & 2 will focus on Linux/UNIX-like commands.

References

Computer History Museum. n.d. “1950 Timeline of Computer History.” 1950 Timeline of Computer History. Accessed September 16, 2022. https://www.computerhistory.org/timeline/1950/.
Ebrahim, Mokhtar, and Andrew Mallett. 2018. Mastering Linux Shell Scripting: A Practical Guide to Linux Command-Line, Bash Scripting, and Shell Programming. 2nd ed. Birmingham: Packt Publishing.
LSE. 2023. “LSE Short-Term Guidance for Teachers on Artificial Intelligence, Assessment and Academic Integrity in Preparation for the 2022-23 Assessment Period.” https://info.lse.ac.uk/staff/divisions/Eden-Centre/Assets-EC/Documents/AI-web-expansion-Feb-23/Updated-Guidance-for-staff-on-AI-A-AI-March-15-2023.Final.pdf.
Pelz, Oliver. 2018. Fundamentals of Linux: Explore the Essentials of the Linux Command Line. Birmingham: Packt Publishing Ltd.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.
Silberschatz, Abraham, Peter B. Galvin, and Greg Gagne. 2005. Operating System Concepts. 7th ed. Hoboken, NJ: J. Wiley & Sons.