LSE DS105 (2023)

Data for Data Scientists

Check this page every week for updated guidance on how to study for the course.

Introduction

The first week is all about setting up your computer and getting familiar with the tools we will use in the course.

🗓️ Week 01
25 Sep 2023 - 29 Sep 2023

🧑‍🏫 Lecture

The Data Science Toolbox and the Terminal

💻 Lab

Set up your computer and meet the Terminal

📖 Readings

Indicative

Recommended

  • 📃 Academic Article: “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists” (Davenport 2020)

Behind the Scenes

Weeks 02 and 03 are about the underlying technologies that power the data science tools we will use later in the course. Don’t underestimate the importance of these topics. If you master them, you will be a lot more productive in the long run.

🗓️ Week 02
02 Oct 2023 - 06 Oct 2023

🧑‍🏫 Lecture

Operating Systems and the ☁️ Cloud

💻 Lab

Running commands on a remote computer

📣 Assignment Reveal

Release of Problem Set 01: Shell scripting (10%).
Details: ✍️ W03 Summative

📖 Readings

Practice some more with the terminal:

Recommended

  • 💻 Tutorial: “Using Z Shell on Macs” (Hartl 2020)
  • 💻 Tutorial: “Install Ubuntu on WSL2 on Windows 11” (Canonical 2022)
  • 💻 Tutorial: “What is Windows Subsystem for Linux” (Microsoft 2022)
  • 📄 Blog post: “What is Ubuntu?” (Abubakar 2021)
  • 📃 Academic article: “Ten simple rules for getting started with command-line bioinformatics” (Brandies and Hogg 2021)

Go deeper

🗓️ Week 03
09 Oct 2023 - 13 Oct 2023

🆘 Drop-in sessions

We will host drop-in sessions in early Week 03 to support you with Problem Set 01.

⏲️ Deadline

Submit your Problem Set 01 via Moodle by the day before the lecture.

🧑‍🏫 Lecture

Original title: Data types, File formats, Git and markdown
Revised title: Git, GitHub & Markdown

💻 Lab

Git tutorial + handling your first Git conflict

✍️ Formative

Practice Python for and while loops.
Submit on GitHub Classroom for feedback.
More details in the lecture.

📖 Readings

Indicative

Recommended

  • 📖 Software Documentation: “python requests library” (requests 2023)
  • 📖 Software Documentation: “Data types — NumPy v1.24 Manual” (numpy 2022b)
  • 📖 Software Documentation: “Intro to data structures — Pandas v1.5.3 Manual” (pandas 2022b)
  • 📖 Software Documentation: “How do I select a subset of a DataFrame? — Pandas v1.5.3 Manual” (pandas 2022a)

Go Deeper

  • 📖 Software Documentation: “Data type objects (dtype) — NumPy v1.24 Manual” (numpy 2022a)

Collecting Data

In the next few weeks, we will spend some time learning how to collect data from the web. This is a crucial skill for data scientists!

🗓️ Week 04
16 Oct 2023 - 20 Oct 2023

🧑‍🏫 Lecture

Original title: The Internet and the World Wide Web
Revised title: Data types, File formats & Python tricks

💻 Lab

Web Scraping in Python using the requests and scrapy libraries

📣 Assignment Reveal

Release of Problem Set 02: Web Scraping (20%).
Details: ✍️ W05 Summative

🗓️ Week 05
23 Oct 2023 - 27 Oct 2023

⏲️ Deadline

Submit your Problem Set 02 via GitHub Classroom by the day before the lecture.

🧑‍🏫 Lecture

Web APIs and principles of data collection

💻 Lab

Collecting data from APIs in Python using the requests library

📣 Assignment Reveal

Release of Problem Set 03: Web APIs (30%).
Deadline: W07
Details: TBA during the lecture.

🗓️ Week 06
30 Oct 2023 - 03 Nov 2023

🆘 Drop-in sessions

There is no lecture or lab this week. Instead, we will hold drop-in sessions to help you with Problem Set 03. The exact times and dates will be announced in the Week 05 lecture.

Cleaning and reshaping data

Here we reach the core of the course. We will spend a lot of time learning how to clean and reshape data.

🗓️ Week 07
06 Nov 2023 - 10 Nov 2023

⏲️ Deadline

Submit your Problem Set 03 via GitHub Classroom by the day before the lecture.

🧑‍🏫 Lecture

Data summarisation and the grammar of graphics

💻 Lab

  • Dataviz with plotnine
  • Form your groups for the project in the lab

📣 Assignment Reveal (Formative)

For Week 08, each group will have to:

  • Write and sign a ‘team contract’
  • Prepare a 10-minute pitch of their project idea.
    Details: TBA during the lecture.

🗓️ Week 08
13 Nov 2023 - 17 Nov 2023

⏲️ Deadline

Submit your team contracts via GitHub Classroom by the day of the lecture.

🧑‍🏫 Lecture

Databases & data pivoting
Pre-processing and grouping data with pandas: a groupby-apply tutorial

💻 Lab

🗣️ GROUP PRESENTATIONS (formative)

✍️ Formative

This is a group assignment we will do during the lecture.

Practice using GitHub as a team to collaborate on a data reshaping task.

🗓️ Week 09
20 Nov 2023 - 24 Nov 2023

🧑‍🏫 Lecture

Conda environments, databases and join operations

💻 Lab

GitHub Issues & Pull Requests

📣 Assignment Reveal

Groups must start preparing a group presentation for Week 11.
Details: TBA during the lecture.

Applications

In the final two weeks, the focus is on setting up your projects. The lectures focus on practical applications and tips that closely resemble the problems you are facing in your projects.

For example, if several groups are struggling with merging data from two different data sources, I select a dataset that requires this operation and show you how to do it. If groups are not struggling with anything in particular, I have some content prepared on text mining and network analysis.

🗓️ Week 10
27 Nov 2023 - 01 Dec 2023

🧑‍🏫 Lecture

Applications I: Text Mining/Network Analysis

💻 Lab

🦸🏻‍♂️ Super Tech Support

  • We will use the lab to help you with your projects.

🗓️ Week 11
04 Dec 2023 - 08 Dec 2023

🧑‍🏫 Lecture

Applications II: Text Mining/Network Analysis

💻 Lab

🗣️ GROUP PRESENTATIONS (15%)

Final Steps (Winter Term)

After the end of the Autumn Term, you will have to submit your final project (25%). The deadline is in Week 04 of the Winter Term. More details about the requirements of the final project, as well as drop-in sessions, will be announced in the Autumn Term.

References

Abubakar, Mohammed. 2021. “What Is Ubuntu?” Blog post. How-To Geek. https://www.howtogeek.com/763775/what-is-ubuntu/.
beautifulSoup. 2023. “Beautiful Soup Documentation — Beautiful Soup 4.9.0 Documentation.” https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Brandies, Parice A., and Carolyn J. Hogg. 2021. “Ten Simple Rules for Getting Started with Command-Line Bioinformatics.” PLOS Computational Biology 17 (2): e1008645. https://doi.org/10.1371/journal.pcbi.1008645.
Canonical. 2022. “Install Ubuntu on WSL2 on Windows 11 with GUI Support.” Tutorial. Ubuntu. https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-11-with-gui-support.
Davenport, Thomas. 2020. “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.55546b4a.
Duckett, Jon. 2014. HTML & CSS: Design and Build Websites. Indianapolis, Indiana: John Wiley & Sons Inc.
Ebrahim, Mokhtar, and Andrew Mallett. 2018. Mastering Linux Shell Scripting: A Practical Guide to Linux Command-Line, Bash Scripting, and Shell Programming. 2nd ed. Birmingham: Packt Publishing.
Hartl, Michael. 2020. “Using Z Shell on Macs with the Learn Enough Tutorials.” Online Course. Learn Enough News & Blog. https://news.learnenough.com/macos-bash-zshell.
Microsoft. 2022. “What Is Windows Subsystem for Linux.” Tutorial. What Is Windows Subsystem for Linux. https://docs.microsoft.com/en-us/windows/wsl/about.
numpy. 2022a. “Data Type Objects (Dtype) — NumPy V1.24 Manual.” https://numpy.org/doc/1.24/reference/arrays.dtypes.html#arrays-dtypes.
numpy. 2022b. “Data Types — NumPy V1.24 Manual.” https://numpy.org/doc/1.24/user/basics.types.html.
pandas. 2022a. “How Do I Select a Subset of a DataFrame? — Pandas 1.5.3 Documentation.” https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html.
pandas. 2022b. “Intro to Data Structures — Pandas 1.5.3 Documentation.” https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/dsintro.html#dsintro.
Pelz, Oliver. 2018. Fundamentals of Linux: Explore the Essentials of the Linux Command Line. Birmingham: Packt Publishing Ltd.
requests. 2023. “Requests: HTTP for Humans™ — Requests Documentation.” https://requests.readthedocs.io/en/v3.0.0/.
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. 1st edition. Beijing ; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom ; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.