{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LSE Data Science Institute | DS105A (2023/24) | Week 04\n",
    "\n",
    "# 🗓️ Week 04 – Data types, File formats, and live coding\n",
    "\n",
    "Theme: Behind the scenes\n",
    "\n",
    "**DATE:** 19 October 2023\n",
    "\n",
    "**AUTHOR:** [@jonjoncardoso](https://jonjoncardoso.github.io)\n",
    "\n",
    "------------------------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ⚙️ Setup\n",
    "\n",
    "Let's start by importing the libraries we will use today:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "💡 We can run **Terminal** comands inside Jupyter notebook!!!! All you need to do is to add a `!` before the command. The code will no longer be interpreted as Python code, but as a Terminal command.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -lh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd data && ls -lh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But be careful! Even if you `cd`, you will not change the directory of the notebook. The `pwd` will always remain the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. We need good datasets\n",
    "\n",
    "\n",
    "> 📒 A **dataset** is simply a collection of data.\n",
    "\n",
    "- But just having a collection of data is not enough...\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We need <ins>organised</ins> datasets \n",
    "\n",
    "Otherwise, we won't be at our most productive in subsequent stages of the data science workflow. \n",
    "\n",
    "Soon, you will start collecting data. You will notice first-hand how messy data can be."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<svg width=\"960\" height=\"384\" viewBox=\"0.00 0.00 1104.31 425.00\" xmlns=\"http://www.w3.org/2000/svg\" xlink=\"http://www.w3.org/1999/xlink\" style=\"; max-width: none; max-height: none; display: block; margin: auto auto auto auto\">\n",
    "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(36 389)\"> <polygon fill=\"white\" stroke=\"transparent\" points=\"-36,36 -36,-389 1068.31,-389 1068.31,36 -36,36\"></polygon> <!-- start --> <g id=\"node1\" class=\"node\">\n",
    "<title>\n",
    "start\n",
    "</title>\n",
    "<polygon fill=\"#f2eaf6\" stroke=\"#f2eaf6\" points=\"73.98,-353 8.43,-353 8.43,-301 73.98,-301 73.98,-353\"></polygon> <text text-anchor=\"middle\" x=\"41.21\" y=\"-321\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Start</text> </g> <!-- gather --> <g id=\"node2\" class=\"node\">\n",
    "<title>\n",
    "gather\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"14\" y=\"-232\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Gather</text> <text text-anchor=\"middle\" x=\"41.21\" y=\"-208\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">data &nbsp;</text> </g> <!-- start&#45;&gt;gather --> <g id=\"edge8\" class=\"edge\">\n",
    "<title>\n",
    "start-&gt;gather\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M41.21,-300.76C41.21,-292.69 41.21,-283.47 41.21,-274.35\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"44.71,-274.16 41.21,-264.16 37.71,-274.16 44.71,-274.16\"></polygon> </g> <!-- store --> <g id=\"node3\" class=\"node\">\n",
    "<title>\n",
    "store\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"134.54\" y=\"-232\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Store it &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> <text text-anchor=\"middle\" x=\"186.21\" y=\"-208\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">somewhere</text> </g> <!-- gather&#45;&gt;store --> <g id=\"edge1\" class=\"edge\">\n",
    "<title>\n",
    "gather-&gt;store\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M82.76,-226C91.36,-226 100.68,-226 110.08,-226\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"110.29,-229.5 120.29,-226 110.29,-222.5 110.29,-229.5\"></polygon> <text text-anchor=\"middle\" x=\"101.48\" y=\"-233.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- clean --> <g id=\"node4\" class=\"node\">\n",
    "<title>\n",
    "clean\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"304.6\" y=\"-232\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Clean &amp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> <text text-anchor=\"middle\" x=\"358.21\" y=\"-208\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">pre-process</text> </g> <!-- store&#45;&gt;clean --> <g id=\"edge2\" class=\"edge\">\n",
    "<title>\n",
    "store-&gt;clean\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M252.21,-226C261.4,-226 270.9,-226 280.28,-226\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"280.38,-229.5 290.38,-226 280.38,-222.5 280.38,-229.5\"></polygon> <text text-anchor=\"middle\" x=\"271.24\" y=\"-233.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- build --> <g id=\"node5\" class=\"node\">\n",
    "<title>\n",
    "build\n",
    "</title>\n",
    "<polygon fill=\"#8431a6\" stroke=\"transparent\" points=\"556.04,-264 464.38,-264 464.38,-188 556.04,-188 556.04,-264\"></polygon> <text text-anchor=\"start\" x=\"478.54\" y=\"-232\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#fcfcfc\">Build a </text> <text text-anchor=\"middle\" x=\"510.21\" y=\"-208\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#fcfcfc\">dataset</text> </g> <!-- clean&#45;&gt;build --> <g id=\"edge3\" class=\"edge\">\n",
    "<title>\n",
    "clean-&gt;build\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M426.2,-226C435.62,-226 445.21,-226 454.37,-226\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"454.51,-229.5 464.51,-226 454.51,-222.5 454.51,-229.5\"></polygon> <text text-anchor=\"middle\" x=\"445.18\" y=\"-233.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- eda --> <g id=\"node6\" class=\"node\">\n",
    "<title>\n",
    "eda\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"452.44\" y=\"-119\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Exploratory &nbsp;&nbsp;&nbsp;</text> <text text-anchor=\"middle\" x=\"510.21\" y=\"-95\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">data analysis</text> </g> <!-- build&#45;&gt;eda --> <g id=\"edge4\" class=\"edge\">\n",
    "<title>\n",
    "build-&gt;eda\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M510.21,-187.73C510.21,-179.31 510.21,-170.26 510.21,-161.48\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"513.71,-161.36 510.21,-151.36 506.71,-161.36 513.71,-161.36\"></polygon> </g> <!-- ml --> <g id=\"node7\" class=\"node\">\n",
    "<title>\n",
    "ml\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"634.22\" y=\"-119\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Machine</text> <text text-anchor=\"middle\" x=\"669.21\" y=\"-95\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">learning</text> </g> <!-- eda&#45;&gt;ml --> <g id=\"edge5\" class=\"edge\">\n",
    "<title>\n",
    "eda-&gt;ml\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M581.8,-113C591.14,-113 600.63,-113 609.75,-113\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"609.85,-116.5 619.85,-113 609.85,-109.5 609.85,-116.5\"></polygon> <text text-anchor=\"middle\" x=\"601.1\" y=\"-120.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- insight --> <g id=\"node8\" class=\"node\">\n",
    "<title>\n",
    "insight\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"770.49\" y=\"-119\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Obtain &nbsp;&nbsp;</text> <text text-anchor=\"middle\" x=\"805.21\" y=\"-95\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">insights</text> </g> <!-- ml&#45;&gt;insight --> <g id=\"edge6\" class=\"edge\">\n",
    "<title>\n",
    "ml-&gt;insight\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M718.31,-113C727.26,-113 736.69,-113 745.91,-113\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"746.18,-116.5 756.18,-113 746.18,-109.5 746.18,-116.5\"></polygon> <text text-anchor=\"middle\" x=\"737.34\" y=\"-120.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- communicate --> <g id=\"node9\" class=\"node\">\n",
    "<title>\n",
    "communicate\n",
    "</title>\n",
    "<text text-anchor=\"start\" x=\"906.11\" y=\"-119\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">Communicate</text> <text text-anchor=\"middle\" x=\"962.21\" y=\"-95\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">results &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- insight&#45;&gt;communicate --> <g id=\"edge7\" class=\"edge\">\n",
    "<title>\n",
    "insight-&gt;communicate\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M854.02,-113C862.88,-113 872.33,-113 881.82,-113\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"882.1,-116.5 892.1,-113 882.1,-109.5 882.1,-116.5\"></polygon> <text text-anchor=\"middle\" x=\"873.02\" y=\"-120.2\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"14.00\"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</text> </g> <!-- end --> <g id=\"node10\" class=\"node\">\n",
    "<title>\n",
    "end\n",
    "</title>\n",
    "<polygon fill=\"#f2eaf6\" stroke=\"#f2eaf6\" points=\"989.21,-38 935.21,-38 935.21,0 989.21,0 989.21,-38\"></polygon> <text text-anchor=\"middle\" x=\"962.21\" y=\"-13\" font-family=\"Helvetica,Arial,sans-serif\" font-size=\"20.00\" fill=\"#8431a6\">End</text> </g> <!-- communicate&#45;&gt;end --> <g id=\"edge9\" class=\"edge\">\n",
    "<title>\n",
    "communicate-&gt;end\n",
    "</title>\n",
    "<path fill=\"none\" stroke=\"#8431a6\" stroke-width=\"4\" d=\"M962.21,-74.85C962.21,-65.94 962.21,-56.57 962.21,-48.15\"></path> <polygon fill=\"#8431a6\" stroke=\"#8431a6\" stroke-width=\"4\" points=\"965.71,-48.09 962.21,-38.09 958.71,-48.09 965.71,-48.09\"></polygon> </g> </g>\n",
    "</svg>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🌐 Click [here](https://opendata.camden.gov.uk/Crime-and-Criminal-Justice/On-Street-Crime-In-Camden-Map/893b-tp33) to view the **Open Street Crime Map** data for Camden, London. We will use this resource as an example to explore different file formats."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## File format: CSV\n",
    "\n",
    "Let's start by exporting the crime data as a CSV file.\n",
    "\n",
    "Place the downloaded file in the `./data` folder of your project (create this folder if it doesn't exist yet)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We downloaded a CSV file. The [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) suggests we can use the `read_csv()` function to read it:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime = pd.read_csv(\"./data/On_Street_Crime_In_Camden_Map.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🗣️ **CLASSROOM DISCUSSION:** What does the message above mean? Is it an error? How can I know?\n",
    "\n",
    "<details><summary>Click here to see a possible answer</summary>\n",
    "\n",
    "To understand the error above, we need to understand the concept of `dtype` (data type). When reading the file, the 🐼 pandas library tries to infer a lot of things about the data. \n",
    "\n",
    "Since the file ends with `.csv`, a **structured data format**, 🐼 pandas assumes a few things:\n",
    "\n",
    "1. this is a **C**omma-**S**eparated **V**alues file, which is a fancy way to say that the data is separated by commas (`,`).\n",
    "\n",
    "2. on top of the commas, the data is also separated by **new lines** (`\\n`)\n",
    "\n",
    "3. each line represents a **row** of data, an individual **observation**\n",
    "\n",
    "4. the very first row is special. It contains the **column names**. Each column name is separated by, again, a comma (`,`)\n",
    "\n",
    "5. _the data in each column is of the same **type**._\n",
    "\n",
    "💡 The message above is not an Error, but a Warning and it seems connected with Assumption 5. 🐼 pandas is trying to infer the data type of each column, it fails to do so but it still manages to read the file.\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can confirm that the file was read by checking out the first 5 rows of the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How do I get a list of column names**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do you want to know how many columns are there? Use the function `len()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(df_street_crime.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What if I just want to see a few columns?**\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "selected_columns = [\"Category\", \"Street ID\", \"Street Name\", \"Outcome Category\", \"Outcome Date\", \"Location\"]\n",
    "\n",
    "# Save to illustrate how CSV looks like in raw format\n",
    "df_street_crime[selected_columns].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's save this reduced dataset as a CSV file**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime[selected_columns].to_csv(\"./data/reduced_crime_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What does the dataset looks like in its true form? (Not when read by Pandas)\n",
    "\n",
    "Let's use the `head` command to see the first 5 lines of the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head ./data/reduced_crime_data.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get more info about what is stored in the `df_street_crime` variable by using the `info()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🗣️ **CLASSROOM DISCUSSION:** What part of the message above is related to the types of data stored in each column? And what does it mean?\n",
    "\n",
    "<details><summary>Click here to read a possible answer</summary>\n",
    "\n",
    "The `Dtype` column shows the data type of each column.\n",
    "\n",
    "What is funny about the above is that pretty much all columns are of type `object`. This is 🐼 pandas way of saying \"I don't know what data type this column is, I will treat it as text\".\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Type: Integer\n",
    "\n",
    "An integer is a whole number (not a fraction) that can be positive, negative, or zero.\n",
    "\n",
    "At a lower level, at the computer's memory, an integer is represented by a **limited** sequence of bits (zeros and ones). I say limited because the number of bits used to represent an integer is fixed. See the `int64` notation under the 'Street ID' column? This means that the computer is using **64 individual bits** to represent each integer in that column.\n",
    "\n",
    "\n",
    "<details><summary>Click here to see how Integers work</summary>\n",
    "\n",
    "\n",
    "Suppose we only had 8 bits to represent an integer. The drawing below represents a single number, encoded in eight rectangular boxes. Each box represents a bit and they are colored in red if they have the number 1 and light grey if 0:\n",
    "\n",
    "<span style=\"background-color: #e5e5e5; display:inline-block; width:800px; padding:0; box-sizing: border-box;vertical-align:bottom;\">\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;background-color: #C63C4A; width:100px; color:white; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                1\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;background-color: #C63C4A; width:100px; color:white; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                1\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;width:100px; color:#212121; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                0\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;width:100px; color:#212121; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                0\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;width:100px; color:#212121; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                0\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;width:100px; color:#212121; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                0\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;background-color: #C63C4A; width:100px; color:white; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                1\n",
    "            </span>\n",
    "            <span style=\"float:left; font-size:3em;font-weight:bold;width:100px; color:#212121; border: 1px gray solid; padding-top:1em;padding-bottom:1em; box-sizing: border-box;text-align:center;\">\n",
    "                0\n",
    "            </span>\n",
    "</span>\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's check a single column: 'Street ID'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are all the different integer sizes supported?\n",
    "\n",
    "- We would have to look at [numpy's documentation on Integer types](https://numpy.org/doc/stable/reference/arrays.scalars.html#integer-types). Pandas uses numpy under the hood to store data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What would be the most adequate integer type for the `Street ID` column?**\n",
    "\n",
    "<details><summary>Click here to read spoilers</summary>\n",
    "\n",
    "- Check the min and max of the column with the code:\n",
    "\n",
    "    ```python\n",
    "    df_street_crime['<column-name>'].min()\n",
    "    ```\n",
    "    \n",
    "    ```python\n",
    "    df_street_crime['<column-name>'].max()\n",
    "    ```\n",
    "\n",
    "- Learn how to convert a pandas column to a different data type using [.astype()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🗣️ **What do you think would happen if I tried to conver this column to `np.int8` instead??**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Type: [Floating Point types](https://numpy.org/doc/stable/reference/arrays.scalars.html#floating-point-types)\n",
    "\n",
    "💡 Tip: '[Why is my addition/multiplication not working??](https://stackoverflow.com/a/52931828/843365)'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime['Easting']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Type: [Text/strings(object)](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)\n",
    "\n",
    "Without getting into the specifics of how 🐼 pandas represents text, it is important that you learn about encodings!\n",
    "\n",
    "That is because, text is of course also represented as a sequence of bits. But how do we know which sequence of bits represents which character?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### ASCII table\n",
    "\n",
    "- The ASCII table is one of early standards\n",
    "- It encodes **characters** using 7-bits. Therefore, it can represent a total of 128 unique characters\n",
    "\n",
    "Here is a snippet of the ASCII table:\n",
    "\n",
    "<details style=\"margin-left:2em;font-size:0.75em;\"><summary> Click here to see Part I of the ASCII table</summary>\n",
    "\n",
    "|   Dec |   Binary | Char   | Description               |\n",
    "|------:|---------:|:-------|:--------------------------|\n",
    "|     0 |  0000000 | NUL    | Null                      |\n",
    "|     1 |  0000001 | SOH    | Start of Header           |\n",
    "|     2 |  0000010 | STX    | Start of Text             |\n",
    "|     3 |  0000011 | ETX    | End of Text               |\n",
    "|     4 |  0000100 | EOT    | End of Transmission       |\n",
    "|     5 |  0000101 | ENQ    | Enquiry                   |\n",
    "|     6 |  0000110 | ACK    | Acknowledge               |\n",
    "|     7 |  0000111 | BEL    | Bell                      |\n",
    "|     8 |  0001000 | BS     | Backspace                 |\n",
    "|     9 |  0001001 | HT     | Horizontal Tab            |\n",
    "|    10 |  0001010 | LF     | Line Feed                 |\n",
    "|    11 |  0001011 | VT     | Vertical Tab              |\n",
    "|    12 |  0001100 | FF     | Form Feed                 |\n",
    "|    13 |  0001101 | CR     | Carriage Return           |\n",
    "|    14 |  0001110 | SO     | Shift Out                 |\n",
    "|    15 |  0001111 | SI     | Shift In                  |\n",
    "|    16 |  0010000 | DLE    | Data Link Escape          |\n",
    "|    17 |  0010001 | DC1    | Device Control 1          |\n",
    "|    18 |  0010010 | DC2    | Device Control 2          |\n",
    "|    19 |  0010011 | DC3    | Device Control 3          |\n",
    "|    20 |  0010100 | DC4    | Device Control 4          |\n",
    "|    21 |  0010101 | NAK    | Negative Acknowledge      |\n",
    "|    22 |  0010110 | SYN    | Synchronize               |\n",
    "|    23 |  0010111 | ETB    | End of Transmission Block |\n",
    "|    24 |  0011000 | CAN    | Cancel                    |\n",
    "|    25 |  0011001 | EM     | End of Medium             |\n",
    "|    26 |  0011010 | SUB    | Substitute                |\n",
    "|    27 |  0011011 | ESC    | Escape                    |\n",
    "|    28 |  0011100 | FS     | File Separator            |\n",
    "|    29 |  0011101 | GS     | Group Separator           |\n",
    "|    30 |  0011110 | RS     | Record Separator          |\n",
    "|    31 |  0011111 | US     | Unit Separator            |\n",
    "\n",
    "\n",
    "</details>\n",
    "\n",
    "<br>\n",
    "\n",
    "<details style=\"margin-left:2em;font-size:0.75em;\"><summary> Click here to see Part II of the ASCII table</summary>\n",
    "\n",
    "|   Dec |   Binary | Char   | Description       |\n",
    "|------:|---------:|:-------|:------------------|\n",
    "|    32 |  0100000 | space  | Space             |\n",
    "|    33 |  0100001 | !      | exclamation mark  |\n",
    "|    34 |  0100010 | \"      | double quote      |\n",
    "|    35 |  0100011 | #      | number            |\n",
    "|    36 |  0100100 | $      | dollar            |\n",
    "|    37 |  0100101 | %      | percent           |\n",
    "|    38 |  0100110 | &      | ampersand         |\n",
    "|    39 |  0100111 | '      | single quote      |\n",
    "|    40 |  0101000 | (      | left parenthesis  |\n",
    "|    41 |  0101001 | )      | right parenthesis |\n",
    "|    42 |  0101010 | *      | asterisk          |\n",
    "|    43 |  0101011 | +      | plus              |\n",
    "|    44 |  0101100 | ,      | comma             |\n",
    "|    45 |  0101101 | -      | minus             |\n",
    "|    46 |  0101110 | .      | period            |\n",
    "|    47 |  0101111 | /      | slash             |\n",
    "|    48 |  0110000 | 0      | zero              |\n",
    "|    49 |  0110001 | 1      | one               |\n",
    "|    50 |  0110010 | 2      | two               |\n",
    "|    51 |  0110011 | 3      | three             |\n",
    "|    52 |  0110100 | 4      | four              |\n",
    "|    53 |  0110101 | 5      | five              |\n",
    "|    54 |  0110110 | 6      | six               |\n",
    "|    55 |  0110111 | 7      | seven             |\n",
    "|    56 |  0111000 | 8      | eight             |\n",
    "|    57 |  0111001 | 9      | nine              |\n",
    "|    58 |  0111010 | :      | colon             |\n",
    "|    59 |  0111011 | ;      | semicolon         |\n",
    "|    60 |  0111100 | <      | less than         |\n",
    "|    61 |  0111101 | =      | equality sign     |\n",
    "|    62 |  0111110 | >      | greater than      |\n",
    "|    63 |  0111111 | ?      | question mark     |\n",
    "\n",
    "</details>\n",
    "\n",
    "Click [here](https://www.ascii-code.com/) to see the full thing.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Other encoding standards\n",
    "\n",
    "ASCII is not the only standard. There are other ways to encode text using binary.\n",
    "\n",
    "Below is a non-comprehensive list of other text (**encoding**):\n",
    "\n",
    "- Unicode\n",
    "- UTF-8 (Most common of all)\n",
    "- UTF-16\n",
    "- UTF-32\n",
    "- ISO-8859-1\n",
    "- Latin-1\n",
    "\n",
    "🔗 [Here](https://docs.python.org/3/library/codecs.html#standard-encodings) is the full list of encoding standards supported by Python by default.\n",
    "\n",
    "\n",
    "💡 You might have come across encoding mismatches before if you ever opened a file and the text looked like this:\n",
    "\n",
    "> \"NestlÃ© and MÃ¶tley CrÃ¼e\"\n",
    "\n",
    "Where it should have read \n",
    "\n",
    "> \"Nestlé and  Mötley Crüe\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Emojis are text 😃 !\n",
    "\n",
    "See the complete list on the [Unicode website](https://unicode.org/emoji/charts/full-emoji-list.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#As far as I know, this CSV file is free of non-standard characters, \n",
    "# so changing the encoding to something else should not change much.\n",
    "# After all, most standards are supersets of ASCII.\n",
    "pd.read_csv(\"./data/On_Street_Crime_In_Camden_Map.csv\", encoding=\"ascii\", encoding_errors=\"strict\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data type: [Categorical](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)\n",
    "\n",
    "<details><summary>Click here to read spoilers</summary>\n",
    "\n",
    "- How to show what is inside a column\n",
    "- How to show just the unique values of a column (`df['<column-name>']unique()`)\n",
    "- Count the number of unique values in a column (`df['<column-name>'].nunique()`)\n",
    "- Count the number of times each unique value appears in a column (`df['<column-name>'].value_counts()`)\n",
    "\n",
    "<details><summary>Hidden pro-tip</summary>\n",
    "\n",
    "Here is how you can create a plot showing the percentage of each unique value in a column:\n",
    "\n",
    "```python\n",
    "df_street_crime['Category'].value_counts(normalize=True).apply(lambda x: x*100).plot(kind='bar')\n",
    "```\n",
    "\n",
    "</details>\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime['Category']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Converting to a Categorical data type.**\n",
    "\n",
    "When data is clearly categorical (the hint for this one is in the name of the column), we can convert it to a Categorical data type. This might also make our life easier when we want to do some analysis on this column.\n",
    "\n",
    "Here are the reasons the 🐼 pandas documentation on [Categorical Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) says you should use it:\n",
    "\n",
    "> The categorical data type is useful in the following cases:\n",
    ">\n",
    "> - A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.\n",
    ">\n",
    "> - The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.\n",
    ">\n",
    "> - As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime['Category'] = pd.Categorical(df_street_crime['Category'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_street_crime.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Show that we can convert the `datetime` column to a `datetime` object:\n",
    "df[\"Outcome Date\"] = pd.to_datetime(df[\"Outcome Date\"])\n",
    "\n",
    "# Make `Outcome Category` a categorical column\n",
    "df[\"Outcome Category\"] = df[\"Outcome Category\"].astype(\"category\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Outcome Category\"].cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Outcome Date\"].dt.day_name()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_as_list = df.astype(str).values.tolist()\n",
    "\n",
    "with open(\"data/sample.txt\", \"w\") as f:\n",
    "    f.write(\"[\")\n",
    "    for item_list in df_as_list:\n",
    "        f.write(\"[\")\n",
    "        f.write(\"','\".join(item_list))\n",
    "        f.write(\"'],\\n\")\n",
    "    f.write(\"]\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## File format: JSON\n",
    "\n",
    "Say we download the JSON version of the data. How should we read it as a Data Frame?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "filepath = Path(\"./data/rows.json\")\n",
    "\n",
    "df = pd.read_json(filepath)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The path is correct but pandas could not parse things correctly. \n",
    "\n",
    "We have to find a workaround:\n",
    "\n",
    "👉 read the JSON file as a Python dictionary using the `json` library instead"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "with open(\"data/rows.json\", \"r\") as read_file:\n",
    "    data = json.load(read_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the data type of the variable `data`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many keys are there in the dictionary?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the keys?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conclusion: the data is the same but it is structured in a completely different way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The actual data is in the `data` key\n",
    "len(data[\"data\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here is how we could convert it to a DataFrame\n",
    "pd.DataFrame(data[\"data\"]).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where did the column names go???\n",
    "\n",
    "<details><summary>Read spoilers here</summary>\n",
    "\n",
    "After much trial and error, I found out that the column names are stored in the `meta` key of the dictionary. But I have to go deeper:\n",
    "\n",
    "```python\n",
    "data[\"meta\"][\"view\"][\"columns\"]\n",
    "```\n",
    "\n",
    "</details>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is there a way to **normalise** this data?\n",
    "\n",
    "Normalise = flatten the data structure\n",
    "\n",
    "<details><summary>Read spoilers here</summary>\n",
    "\n",
    "```python\n",
    "df_columns = pd.io.json.json_normalize(data[\"meta\"][\"view\"][\"columns\"])\n",
    "df_columns.head(6)\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Putting it all together:\n",
    "\n",
    "<details><summary>See solution</summary>\n",
    "\n",
    "```python\n",
    "column_names = df_columns[\"name\"].tolist()\n",
    "\n",
    "df_crime = pd.DataFrame(data[\"data\"], columns=column_names)\n",
    "\n",
    "selected_columns = [\"Category\", \"Street ID\", \"Street Name\", \"Outcome Category\", \"Outcome Date\", \"Location\"]\n",
    "\n",
    "print(df_crime[selected_columns].head().to_markdown(index=False))\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_names = df_columns[\"name\"].tolist()\n",
    "\n",
    "df_crime = pd.DataFrame(data[\"data\"], columns=column_names)\n",
    "\n",
    "selected_columns = [\"Category\", \"Street ID\", \"Street Name\", \"Outcome Category\", \"Outcome Date\", \"Location\"]\n",
    "\n",
    "print(df_crime[selected_columns].head().to_markdown(index=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are back to where we were with the CSV file!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python tricks\n",
    "\n",
    "I will construct this section with you during the lecture.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "86a4f56bb4e17d4e79f8e80b0f18a6d84d252069e5d9eb74ea21d8503d143f4e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}