🗓️ Week 01 - Part II
Data Science toolbox

DS105 Data for Data Science 🖥️ 🤹

9/30/22

What do we mean by data science?

Data science is…

“[…] a field of study and practice that involves the collection, storage, and processing of data in order to derive important 💡 insights into a problem or a phenomenon.

Such data may be generated by humans (surveys, logs, etc.) or machines (weather data, road vision, etc.),

and could be in different formats (text, audio, video, augmented or virtual reality, etc.).”

The mythical unicorn 🦄

knows everything about statistics

able to communicate insights perfectly

understands businesses like no one else

is a fluent computer programmer

In reality…

We are all jugglers 🤹

  • Everyone brings a different skill set.
  • We need multi-disciplinary teams.
  • Good data scientists know a bit of everything.
    • They are not fluent in all things
    • They understand their strengths and weaknesses
    • They know when and where to interface with others

The Data Science Workflow

[Diagram: Start -> Gather data -> Store it somewhere -> Clean & pre-process -> Build a dataset -> Exploratory data analysis -> Machine learning -> Obtain insights -> Communicate results -> End]

The Data Science Workflow

[Diagram: Start -> Gather data -> Store it somewhere -> Clean & pre-process -> Build a dataset -> Exploratory data analysis -> Machine learning -> Obtain insights -> Communicate results -> End, with a subset of the tasks highlighted]

It is often said that 80% of the time and effort spent on a data science project goes to the tasks highlighted above.

The Data Science Workflow

[Diagram: Start -> Gather data -> Store it somewhere -> Clean & pre-process -> Build a dataset -> Exploratory data analysis -> End. A second branch continues from Exploratory data analysis -> Machine learning -> Obtain insights -> Communicate results.]

And this is what this course is about! You will learn some of the most common tools used during this process.
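To make the first half of the workflow concrete, here is a minimal sketch of the steps from gathering data to exploratory analysis, using only the Python standard library. The function names, the temperature records, and the file name are all hypothetical and chosen for illustration; a real project would collect its data from an API or a web page instead of a hard-coded list.

```python
import json
import statistics
import tempfile
from pathlib import Path


def gather():
    """Stand-in for data collection (hypothetical, hard-coded records)."""
    return [
        {"city": "London", "temp_c": "14.5"},
        {"city": "london ", "temp_c": "16.0"},  # inconsistent spelling
        {"city": "Paris", "temp_c": None},      # missing reading
        {"city": "Paris", "temp_c": "18.5"},
    ]


def store(records, path):
    """Save the raw data untouched so later steps can be re-run."""
    path.write_text(json.dumps(records), encoding="utf-8")
    return path


def clean(path):
    """Drop incomplete records and normalise names and types."""
    raw = json.loads(path.read_text(encoding="utf-8"))
    return [
        {"city": r["city"].strip().title(), "temp_c": float(r["temp_c"])}
        for r in raw
        if r["temp_c"] is not None
    ]


def build_dataset(rows):
    """Reshape the cleaned rows into one list of temperatures per city."""
    dataset = {}
    for row in rows:
        dataset.setdefault(row["city"], []).append(row["temp_c"])
    return dataset


def explore(dataset):
    """A first look at the data: count and mean per city."""
    return {
        city: {"n": len(temps), "mean": statistics.mean(temps)}
        for city, temps in dataset.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    raw_path = store(gather(), Path(tmp) / "raw.json")
    summary = explore(build_dataset(clean(raw_path)))

print(summary)
```

Note how much of the code deals with storing, cleaning, and reshaping before any analysis happens, even in a toy example like this one.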

The toolbox 🧰

  • Python or R?
    • Use the programming language you feel more comfortable with.
    • When you form groups, discuss whether you will adopt a single language or use a mix.
    • It is okay to mix languages if your group is well coordinated.

How should we share code?

GitHub!

Use GitHub for everything related to your project!

  • You will learn to set up GitHub for your own code in the 🗓️ Week 05 lab.
  • You will learn how to work effectively as a team on GitHub in the 🗓️ Week 09 lab.

Important

Don’t share code via e-mail, Dropbox, Google Drive or anything like that!

It is a bad practice as things get messy very quickly.

Where do I get data?

Tip: Data is Plural

Data is Plural is a weekly newsletter run by BuzzFeed’s data editor, 🧑 Jeremy Singer-Vine. People send him interesting, funny, or odd datasets, and he shares them in the newsletter. The newsletter has a website, and the full list of datasets is kept in a Google Doc.

Final project requirements

  • Your main data source must be collected:
    • using an API, or
    • by web scraping
  • That is, you cannot use static datasets.
    • The point of this course is for you to get past the technical barrier of collecting and handling data.
  • We will give you more details during the term.
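As a rough illustration of the two collection routes, the sketch below extracts the same dataset titles from an API-style JSON response and from an HTML page. Both payloads, the `dataset` CSS class, and all function and class names are hypothetical. To keep the example runnable offline, the payloads are inline strings; in practice you would first download them over the network.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response body: structured JSON, ready to parse.
API_RESPONSE = (
    '{"results": [{"title": "Dataset A", "downloads": 120}, '
    '{"title": "Dataset B", "downloads": 45}]}'
)

# Hypothetical web page: the same information buried in HTML markup.
HTML_PAGE = """
<ul>
  <li class="dataset">Dataset A</li>
  <li class="dataset">Dataset B</li>
  <li class="advert">Buy now!</li>
</ul>
"""


def titles_from_api(body):
    """API route: parse the JSON and pick the field we need."""
    return [item["title"] for item in json.loads(body)["results"]]


class DatasetTitleParser(HTMLParser):
    """Scraping route: collect the text of <li class="dataset"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "dataset") in attrs:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False


def titles_from_html(page):
    parser = DatasetTitleParser()
    parser.feed(page)
    return parser.titles


print(titles_from_api(API_RESPONSE))
print(titles_from_html(HTML_PAGE))
```

The API route is shorter because the data already arrives structured, whereas scraping requires you to describe where the data sits in the page and to filter out everything else.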

What’s Next?

  • Next week we will introduce you to The Terminal
  • Join our Slack group if you haven’t done so yet.
  • Use the time before our first lab to revisit basic programming skills.
  • Head over to the 🔖 Week 01 - Appendix page for:
    • Indicative & recommended reading
    • Programming Resources

Thank you

References

Davenport, Thomas. 2020. “Beyond Unicorns: Educating, Classifying, and Certifying Business Data Scientists.” Harvard Data Science Review 2 (2). https://doi.org/10.1162/99608f92.55546b4a.
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. First edition. Beijing; Sebastopol: O’Reilly Media. https://ebookcentral.proquest.com/lib/londonschoolecons/detail.action?docID=1465965.
Shah, Chirag. 2020. A Hands-on Introduction to Data Science. Cambridge, United Kingdom; New York, NY, USA: Cambridge University Press. https://librarysearch.lse.ac.uk/permalink/f/1n2k4al/TN_cdi_askewsholts_vlebooks_9781108673907.