DS105 Data for Data Science

🗓️ Week 01 - Part I: Structure of the course

9/30/22

Who we are

The Data Science Institute


Activities of interest to you:

Our courses

The DSI offers accessible introductions to Data Science:

DS101

Fundamentals of
Data Science

🎯 Focus:
theoretical concepts of data science

📂 How:
reflections through reading and writing

DS105

Data for
Data Science

🎯 Focus:
collection and handling of real data

📂 How:
hands-on coding exercises and a group project

DS202

Data Science for
Social Scientists

🎯 Focus:
fundamental machine learning algorithms

📂 How:
practical use of ML techniques and metrics

Your lecturer



Dr. Jonathan Cardoso-Silva

  • PhD in Computer Science
  • Background: Computer Science, Engineering, Data Science
  • Research:
    • Networks
    • Optimisation
    • Machine Learning applications
    • Data Science Workflow

Teaching Assistants

Anton Boichenko
Guest Teacher at the DSI
Product Developer at Decoded
MSc in Applied Social Data Science (LSE)

Mustafa Can Ozkan
Guest Teacher at the DSI
PhD cand. in the Spacetime Lab (UCL)
MSc in Transport (Imperial/UCL)

Dr. Stuart Bramwell
ESRC Postdoctoral Fellow
Department of Methodology
PhD in Politics (Oxford)

Xiaowei Gao
Guest Teacher at the DSI
PhD cand. in the Spacetime Lab (UCL)
MSc in Data Science (KCL)

Yijun Wang
Guest Teacher at the DSI
PhD cand. in Health Informatics (KCL)
MSc in Data Science (KCL)

Who are you

Programme                                                    Freq
BSc in Economics                                               11
BSc in Politics and Data Science                                5
BSc in Politics and Economics                                   4
General Course                                                  4
BSc in Philosophy and Economics                                 2
BSc in International Social and Public Policy with Politics     1
BSc in Mathematics, Statistics and Business                     1
BSc in Philosophy, Logic and Scientific Method                  1
BSc in Philosophy, Politics and Economics                       1
BSc in Politics                                                 1

Degree Programme vs Year of Study

BSc in Economics - Course Selection Options

Learning Objectives

This course will cover the fundamentals of data, with the aim of understanding:

  • how data is generated,
  • how it is collected,
  • how it must be transformed for use and storage,
  • how it is stored, and
  • the ways it can be retrieved and communicated.

Learning Objectives (cont.)

The course will also cover:

  • workflow management of individual and collaborative data science projects
  • setup and tools for typical data pre-processing (data transformation and data cleaning)
    • frequently the starting point and most time-consuming part of any data science project.

Structure of this course

Syllabus

Intro
    Introduction and key tools for data scientists (Week 01)
Behind the scenes
    The Terminal: navigating the command line (Week 02)
    The Cloud: accessing and getting data in and out (Week 03)
    The Internet: protocols + scraping + APIs (Week 04)
Working with data
    The nature and shape of data (Week 05)
    Tabular data: dataframes and databases (Week 07)
    Unstructured data (text, audio & image) (Week 08)
    Text as data, regex and sentiment analysis (Week 09)
Applications
    Topic modelling & document similarities (Week 10)
    Data viz with the grammar of graphics (Week 11)

Structure of lectures 👨🏻‍🏫

Our lectures will be split into two parts, with a break in between:

  • Part I (~ 50 min): Traditional exposition of theoretical content
  • break (~ 10 min): Grab coffee or relax 🧘
  • Part II (~ 50 min): Live demo
    • Typically, demonstration of terminal usage or Jupyter notebooks
    • Feel free to follow along on your own laptop.

Structure of classes 👩‍💻

  • Students will work on weekly, structured problem sets in the staff-led class sessions.
  • Tips to get the most out of classes:
    • Bring your own laptop 💻 (most tablets are not suitable for programming)
    • Do the recommended reading before the class
    • Attempt to replicate the examples demonstrated in the live demo during the lecture

Class groups



Group 01

  • 📆 Fridays
  • ⌚ 09:00 — 10:30
  • 📍 32L.G.06

Group 02

  • 📆 Fridays
  • ⌚ 12:00 — 13:30
  • 📍 NAB.LG.03

Group 03

  • 📆 Fridays
  • ⌚ 16:00 — 17:30
  • 📍 KSW.1.02

Assessments 📔

The breakdown of assessment for this course will be as follows:

Assessments - Problem sets (25%)

  • These will involve a mix of coding tasks and elements of self-assessment (similar to problem sets we will solve in the labs)
  • You will have until the day before the following class to submit your response
  • Summative problem sets will be released in:
    • Week 03 - worth 10% of final mark
    • Week 04 - worth 15% of final mark

Assessments - Group presentations (35%)

  • You will form groups prior to Reading Week
    • Pitch your ideas for APIs/datasets in Week 04
    • Form the groups in Week 05
  • Group presentations:
    • Week 08 - worth 15% of final mark
    • Week 11 - worth 20% of final mark

Assessments - Final project (40%)

  • Each group will produce a webpage of their project
  • Description of data, research questions, challenges, statistics and simple plots
  • Think of it as a portfolio project!
  • Submission deadline: Lent Term
    • Exact date to be confirmed
    • (end of Jan/2023 - beginning of Feb/2023)
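
As a quick check of the assessment breakdown above, the sketch below confirms that the five components add up to 100% and shows how a weighted final mark could be computed. The component names, the `final_mark` helper and the 0-100 marking scale are assumptions made for illustration; the slides do not say how marks are actually combined.

```python
# Assessment weights as listed on the slides (percent of final mark).
# The dictionary keys are hypothetical labels, not official names.
WEIGHTS = {
    "problem_set_week03": 10,
    "problem_set_week04": 15,
    "presentation_week08": 15,
    "presentation_week11": 20,
    "final_project": 40,
}


def final_mark(marks):
    """Weighted average of component marks, each assumed to be on a 0-100 scale."""
    missing = set(WEIGHTS) - set(marks)
    if missing:
        raise ValueError(f"missing marks for: {sorted(missing)}")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS) / 100


# The weights should cover the whole mark: 25% + 35% + 40%.
assert sum(WEIGHTS.values()) == 100

# Hypothetical marks, for illustration only.
example_marks = {
    "problem_set_week03": 60,
    "problem_set_week04": 80,
    "presentation_week08": 70,
    "presentation_week11": 50,
    "final_project": 65,
}
print(final_mark(example_marks))  # 64.5
```

Because the final project carries 40% of the mark, it moves the result more than both problem sets combined.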

Office hours

  • It is probably a good idea to book office hours if:
    • you struggled with a technical or theoretical aspect of a problem set in the previous week,
    • you have queries about careers in data science,
    • you want guidance on how to apply data science to other things you are studying outside this course.
  • Come prepared. You only have 15 minutes.
  • Ask for help sooner rather than later.
  • Book slots via StudentHub up to 12 hours in advance.

Communication

  • Join our Slack group (more info here)
  • Use the public Slack channels to talk and to share links and content (or memes) with your colleagues.
  • Our teaching team will dedicate some time during the week to answering questions and other interactions on Slack.
  • Reserve 📧 e-mail for formal requests: extensions, deferrals, etc.
    • No need to e-mail to tell us you will miss a class, for example.

Any questions?

How did we get here?

This abundance of data is strongly associated with the dramatic changes in technology in the past few decades.

St. Peter’s Basilica at the Vatican on 📅 19 April 2005, when Ratzinger was elected the 265th pope.

St. Peter’s Basilica at the Vatican on 📅 13 March 2013, when Pope Francis was elected the 266th pope.

We changed how we consume music 🎧

We changed how we consume video 🎞️

Smartphones 📱 are a very recent thing

We spend a lot more time connected

… and our social media habits keep on changing

The possibilities

  • Humans and machines nowadays generate A LOT of data ALL THE TIME
  • It has become cheap to collect and store this data
  • This abundance of data opens up new possibilities for research & policy-making

New data to answer old questions:

How do rumours spread?

New questions enabled by new data:

Is social media a threat to democracy?

What’s next

After our 10-min break ☕:

  • Given all this, what do we mean by data science?
  • A tale of unicorns
  • Approaching the ocean of data: the concept of data wrangling
  • The data science toolkit
  • What to expect of the rest of this course

References

Fischer-Baum, Reuben. 2017. “What ‘Tech World’ Did You Grow up In?” Washington Post. https://www.washingtonpost.com/graphics/2017/entertainment/tech-generations/.
Kolawole, Emi. 2013. “About Those 2005 and 2013 Photos of the Crowds in St. Peter’s Square.” Washington Post. http://wapo.st/WKKTMh.