{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "LSE Data Science Institute | ME204 (2023/24) | Week 03 Day 01\n", "\n", "# 🗓️ Week 03 – Day 01: Intro to Databases and SQL\n", "\n", "\n", "MORNING NOTEBOOK\n", "\n", "**DATE:** 22 July 2024\n", "\n", "**AUTHOR:** Alex Soldatkin & Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io)\n", "\n", "-----\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ⚙️ Setup\n", "\n", "You can use your `base` environment or create a new one for this notebook. \n", "\n", "These are the libraries you need to install:\n", "\n", "
\n", "\n", "```bash\n", "pip install pandas numpy lets-plot numerize tqdm sqlalchemy jupysql\n", "```\n", "\n", "💡 **NOTE**: There are a few new ones today: the [`jupysql`](https://jupysql.ploomber.io/en/latest/intro.html) and [`sqlalchemy`](https://www.sqlalchemy.org/) packages.\n", "\n", "
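\n", "\n", "If you want to confirm that `sqlalchemy` can talk to SQLite before moving on (we load `jupysql` further down with `%load_ext sql`), here is a minimal, optional check. This is only a sketch: the `../data/test_connection.db` path is a throwaway example and is not used anywhere else in this notebook.\n", "\n", "```python\n", "# Quick check that the new packages import and can talk to SQLite\n", "import pandas as pd\n", "import sqlalchemy\n", "\n", "# create_engine accepts a SQLite URL; SQLite creates the file if it does not exist yet\n", "engine = sqlalchemy.create_engine('sqlite:///../data/test_connection.db')\n", "\n", "# A trivial query that works on any SQLite database\n", "print(pd.read_sql('SELECT sqlite_version() AS version', engine))\n", "```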
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**IMPORTS:**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sqlite3\n", "\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from tqdm.notebook import tqdm, trange\n", "tqdm.pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CONSTANTS:**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Adjust it according to your data folder\n", "DATA_FOLDER = os.path.join('../data/')\n", "\n", "WAITROSE_FOLDER = os.path.join(DATA_FOLDER, 'waitrose')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext sql\n", "%config SqlMagic.autocommit=True # for engines that do not support autommit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**READ THE RAW DATA**\n", "\n", "Read the bunch of CSV files we got from scraping the Waitrose website." ] }, { "cell_type": "code", "execution_count": 233, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id type on-offer \\\n", "0 497633 G True \n", "1 627645 G True \n", "2 475915 G True \n", "3 28863 G False \n", "4 22482 G False \n", "... ... ... ... \n", "1290 16521 G False \n", "1291 16118 G True \n", "1292 12467 G True \n", "1293 7932 G False \n", "1294 5156 G False \n", "\n", " image-url \\\n", "0 https://ecom-su-static-prod.wtrecom.com/images... \n", "1 https://ecom-su-static-prod.wtrecom.com/images... \n", "2 https://ecom-su-static-prod.wtrecom.com/images... \n", "3 https://ecom-su-static-prod.wtrecom.com/images... \n", "4 https://ecom-su-static-prod.wtrecom.com/images... \n", "... ... \n", "1290 https://ecom-su-static-prod.wtrecom.com/images... \n", "1291 https://ecom-su-static-prod.wtrecom.com/images... \n", "1292 https://ecom-su-static-prod.wtrecom.com/images... \n", "1293 https://ecom-su-static-prod.wtrecom.com/images... \n", "1294 https://ecom-su-static-prod.wtrecom.com/images... \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/finish-... \n", "1 https://www.waitrose.com/ecom/products/finish-... \n", "2 https://www.waitrose.com/ecom/products/finish-... \n", "3 https://www.waitrose.com/ecom/products/essenti... \n", "4 https://www.waitrose.com/ecom/products/essenti... \n", "... ... \n", "1290 https://www.waitrose.com/ecom/products/watergu... \n", "1291 https://www.waitrose.com/ecom/products/red-bul... \n", "1292 https://www.waitrose.com/ecom/products/red-bul... \n", "1293 https://www.waitrose.com/ecom/products/pixley-... \n", "1294 https://www.waitrose.com/ecom/products/evian-s... \n", "\n", " name size \\\n", "0 Finish Quantum All in One Lemon Dishwasher Tab... 64Each \n", "1 Finish Ultimate Plus All in One Lemon Dishwash... 48Each \n", "2 Finish Ultimate All in One Lemon Dishwasher Ta... 36s \n", "3 Essential White Ultra Soft Bathroom Tissue 9x190 sheets \n", "4 Essential Thick Bleach Citrus 750ml \n", "... ... ... \n", "1290 Watergull Orchards Apple Juice Cox 750ml \n", "1291 Red Bull Energy Drink 4x250ml \n", "1292 Red Bull Sugarfree 4x250ml \n", "1293 Pixley Berries Blackcurrant Cordial 500ml \n", "1294 Evian Still Mineral Water Sportscap 4x75cl \n", "\n", " item-price price-per-unit offer-description \\\n", "0 10.00 15.6p each Save 1/3. Was £15.00 \n", "1 9.00 18.8p each save £6.50. Was £15.50 \n", "2 8.00 22.2p each Save 1/3. Was £12.50 \n", "3 4.00 23.4p/100 sheets NaN \n", "4 0.75 £1/litre NaN \n", "... ... ... ... \n", "1290 2.65 35.3p/100ml NaN \n", "1291 4.00 40p/100ml save £1.50. Was £5.50 \n", "1292 4.00 40p/100ml save £1.25. Was £5.25 \n", "1293 4.75 95p/100ml NaN \n", "1294 3.80 12.7p/100ml NaN \n", "\n", " category \n", "0 Household \n", "1 Household \n", "2 Household \n", "3 Household \n", "4 Household \n", "... ... 
\n", "1290                     Tea, Coffee & Soft Drinks  \n", "1291                     Tea, Coffee & Soft Drinks  \n", "1292                     Tea, Coffee & Soft Drinks  \n", "1293                     Tea, Coffee & Soft Drinks  \n", "1294                     Tea, Coffee & Soft Drinks  \n", "\n", "[25378 rows x 11 columns]" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Run this very long and ugly cell to get the full dataset\n", "all_files = [os.path.join(WAITROSE_FOLDER, file) for file in os.listdir(WAITROSE_FOLDER)\n", "             if file.endswith('.csv')]\n", "\n", "# Read every individual file and concatenate them into a single DataFrame using pandas concat\n", "df = pd.concat((pd.read_csv(file) for file in all_files))\n", "\n", "df = df.drop_duplicates()\n", "\n", "df = df.drop(columns=['data-product-name'])\n", "df = (\n", "    df.rename(columns={\n", "        'data-product-type': 'type',\n", "        'data-product-index': 'index',\n", "        'data-product-id': 'id',\n", "        'data-product-on-offer': 'on-offer',\n", "        'product-page': 'page',\n", "        'product-name': 'name',\n", "        'product-size': 'size',\n", "    })\n", ")\n", "\n", "def clean_item_price(item_price: str):\n", "    \"\"\"\n", "    Converts a Waitrose price string (e.g. '£1.50', '75p' or a range such as '£2.20-£2.80') into a float in pounds.\n", "\n", "    Parameters:\n", "    item_price (str): The item price as a string.\n", "\n", "    Returns:\n", "    float: The cleaned item price in pounds.\n", "    \"\"\"\n", "    if pd.isna(item_price):\n", "        return np.nan\n", "    \n", "    item_price = item_price.replace(' each est.', '')\n", "    \n", "    if '-' in item_price:\n", "        # Price ranges become the average of the two endpoints\n", "        item_prices = item_price.replace('£', '').split('-')\n", "        item_prices = [float(price) for price in item_prices]\n", "        item_price = np.mean(item_prices)\n", "    else:\n", "        if '£' in item_price:\n", "            item_price = item_price.replace('£', '')\n", "            item_price = float(item_price)\n", "        elif 'p' in item_price:\n", "            item_price = item_price.replace('p', '')\n", "            item_price = float(item_price) / 100  # convert pence to pounds\n", "    \n", "    return float(item_price)\n", "\n", "df['id'] = df['id'].astype('int32')\n", "df['cleaned-item-price'] = df['item-price'].apply(lambda price: clean_item_price(price))\n", "\n", "selected_rows = df['item-price'].astype(str).str.contains('-')\n", "\n", "# The 'size' column of the selected rows all have the same kind of format: (2.2kg-2.8kg).\n", "df.loc[selected_rows, 'size'] = df.loc[selected_rows, 'size'].str.split('-').str[0].str.replace('(', '')\n", "\n", "df['item-price'] = df['cleaned-item-price']\n", "df.drop(columns=['cleaned-item-price', 'index'], inplace=True)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**CONNECT TO THE DATABASE**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [], "source": [ "conn = sqlite3.connect(os.path.join(DATA_FOLDER, 'supermarket.db'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SOLUTION 01: NO FORMAL KEY CONSTRAINTS (easier to implement but less ideal)\n", "\n", "\n", "
\n", "\n", "

💡 Read this!

\n", "\n", "In this solution, we will create multiple tables so that we can practise linking them back together (with `pandas` and with SQL joins) later. I will not use primary or foreign keys. Instead, I will derive the separate data frames from the main one, making sure that there are no duplicates left, before I dump them into the database.\n", "\n", "I believe this might be easier for those who are just starting with SQL, with the caveat that it is not best practice for databases. Ideally, you want to enforce constraints (primary and foreign keys) to ensure data integrity -- and we cannot get that with this approach.\n", "
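\n", "\n", "To make the idea of linking these tables back together concrete, here is a minimal sketch of the `pandas` equivalent of the SQL `LEFT JOIN` we will run later in this notebook. It assumes the `products` and `categories` data frames that we build below; the `products_with_categories` name is just an illustration:\n", "\n", "```python\n", "# pandas equivalent of: products p LEFT JOIN categories c ON p.id = c.product_id\n", "# Each product row is repeated once per category it belongs to\n", "products_with_categories = products.merge(\n", "    categories,\n", "    left_on='id',\n", "    right_on='product_id',\n", "    how='left',\n", "    suffixes=('', '_category')  # both frames have a 'name' column\n", ")\n", "```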
\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "sql" } }, "source": [ "### Table 01: `products`\n", "\n", "
\n", "\n", "It turns out that each product `id` maps to a single product page, so we do not need a composite key like (id, name, size) as we have been using in the past. This saves us a lot of time and effort. We can just use the `id` as the primary key for the `products` table.\n", "\n", "To confirm that:\n", "\n", "```python\n", "(\n", "    df\n", "    .groupby(['id'])\n", "    .apply(lambda x: pd.Series({'num_unique_pages': x['page'].nunique()}))\n", "    .sort_values('num_unique_pages', ascending=False)\n", "    .query('num_unique_pages > 1')\n", ")\n", "```\n", "\n", "
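\n", "\n", "That query only confirms that each `id` maps to a single `page`. Once the `products` data frame is built below, it is worth running a stricter check too, because any `id` that still maps to more than one row will violate a `PRIMARY KEY` constraint if we enforce one later (as we will see at the end of this notebook). A sketch, assuming `products` has been created as in the next code cell:\n", "\n", "```python\n", "# ids that still appear in more than one (deduplicated) product row;\n", "# if this is not empty, inserting into a table with id as PRIMARY KEY will fail\n", "duplicated_ids = products.loc[products['id'].duplicated(keep=False), 'id']\n", "print(duplicated_ids.nunique(), 'ids map to more than one row')\n", "```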
\n", "\n", "\n", "We know that we **want** to exercise the linking of tables, so we will not include the `category` column in this table. We will create a separate table for that. That also applies for the information about prices and offers.\n", "\n", "This leaves us with the following columns for the `products` table:\n", "\n", "- `id` (int): the unique identifier for the product (as determined by Waitrose)\n", "- `name` (str): the name of the product\n", "- `size` (str): the size of the product\n", "- `page` (str): the URL of the product's page on the Waitrose website\n", "- `image-url` (str): the URL of the product's image\n", "- `type` (str): the type of the product (which, to be honest, I don't know what it means)\n", "\n", "👉 **These are all the attributes that are specific to the product itself and not to its price or offer or category.**\n", "\n", "Let's build the dataframe for the `products` table." ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "While the original dataset has 25378 rows, the cleaned dataset has 16065 rows, which means that 9313 rows were removed.\n" ] } ], "source": [ "selected_columns = ['id', 'name', 'size', 'page', 'image-url', 'type']\n", "\n", "products = df[selected_columns].copy().drop_duplicates()\n", "\n", "print(f\"While the original dataset has {len(df)} rows, the cleaned dataset has {len(products)} rows, which means that {len(df) - len(products)} rows were removed.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is in the `products` dataframe?" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name size \\\n", "0 497633 Finish Quantum All in One Lemon Dishwasher Tab... 64Each \n", "1 627645 Finish Ultimate Plus All in One Lemon Dishwash... 48Each \n", "2 475915 Finish Ultimate All in One Lemon Dishwasher Ta... 36s \n", "3 28863 Essential White Ultra Soft Bathroom Tissue 9x190 sheets \n", "4 22482 Essential Thick Bleach Citrus 750ml \n", "... ... ... ... \n", "1285 29679 Jacksons Fairtrade Sencha Green Tea Bags 20 50g \n", "1290 16521 Watergull Orchards Apple Juice Cox 750ml \n", "1291 16118 Red Bull Energy Drink 4x250ml \n", "1292 12467 Red Bull Sugarfree 4x250ml \n", "1293 7932 Pixley Berries Blackcurrant Cordial 500ml \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/finish-... \n", "1 https://www.waitrose.com/ecom/products/finish-... \n", "2 https://www.waitrose.com/ecom/products/finish-... \n", "3 https://www.waitrose.com/ecom/products/essenti... \n", "4 https://www.waitrose.com/ecom/products/essenti... \n", "... ... \n", "1285 https://www.waitrose.com/ecom/products/jackson... \n", "1290 https://www.waitrose.com/ecom/products/watergu... \n", "1291 https://www.waitrose.com/ecom/products/red-bul... \n", "1292 https://www.waitrose.com/ecom/products/red-bul... \n", "1293 https://www.waitrose.com/ecom/products/pixley-... \n", "\n", " image-url type \n", "0 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1 https://ecom-su-static-prod.wtrecom.com/images... G \n", "2 https://ecom-su-static-prod.wtrecom.com/images... G \n", "3 https://ecom-su-static-prod.wtrecom.com/images... G \n", "4 https://ecom-su-static-prod.wtrecom.com/images... G \n", "... ... ... \n", "1285 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1290 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1291 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1292 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1293 https://ecom-su-static-prod.wtrecom.com/images... G \n", "\n", "[16065 rows x 6 columns]" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save it to the database." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16065" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products.to_sql('products', conn, if_exists='replace', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test that it worked:" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_num_products\n", "0 16065" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql('SELECT COUNT(*) AS total_num_products FROM products', conn)" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name size \\\n", "0 497633 Finish Quantum All in One Lemon Dishwasher Tab... 64Each \n", "1 627645 Finish Ultimate Plus All in One Lemon Dishwash... 48Each \n", "2 475915 Finish Ultimate All in One Lemon Dishwasher Ta... 36s \n", "3 28863 Essential White Ultra Soft Bathroom Tissue 9x190 sheets \n", "4 22482 Essential Thick Bleach Citrus 750ml \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/finish-... \n", "1 https://www.waitrose.com/ecom/products/finish-... \n", "2 https://www.waitrose.com/ecom/products/finish-... \n", "3 https://www.waitrose.com/ecom/products/essenti... \n", "4 https://www.waitrose.com/ecom/products/essenti... \n", "\n", " image-url type \n", "0 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1 https://ecom-su-static-prod.wtrecom.com/images... G \n", "2 https://ecom-su-static-prod.wtrecom.com/images... G \n", "3 https://ecom-su-static-prod.wtrecom.com/images... G \n", "4 https://ecom-su-static-prod.wtrecom.com/images... G " ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# See what the table looks like, get just the first 5 rows\n", "pd.read_sql('SELECT * FROM products LIMIT 5', conn)" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "sql" } }, "source": [ "### Table 2: `categories` (DIFFICULT BUT GOOD LEARNING OPPORTUNITY)\n", "\n", "Previously, I showed you an example where we created a very simple `categories` table with only two columns: `id` and `name`. This is common practice in databases when you need to keep track of additional information about the categories. Say, for example, you want to keep track of the number of products in each category, when they started being sold, when they became inactive, etc. In that scenario, I would ideally create a third table, just to map `category_id` to `product_id`. This is also called a [**many-to-many relationship**](https://www.geeksforgeeks.org/relationships-in-sql-one-to-one-one-to-many-many-to-many/).\n", "\n", "Since we don't have additional info on the categories themselves, I will simplify things and create just one table that relates to `categories`. In this table, each product can be associated with multiple categories. This is a [**one-to-many relationship**](https://www.geeksforgeeks.org/relationships-in-sql-one-to-one-one-to-many-many-to-many/).\n", "\n", "This table will have the following columns:\n", "\n", "- `product_id` (int): the unique identifier for the product\n", "- `category_name` (int): the name of the category\n", "\n", "That's it! 
By moving it to a separate table, we keep the `products` table free of duplicates and make it easier to update the categories for a product.\n", "\n", "Let's write the code to create this table:" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "# Build the categories table: one row per (product, category) pair\n", "categories = (\n", "    df[['id', 'category']].drop_duplicates()\n", "    .rename(columns={'category': 'name', 'id': 'product_id'})\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time to save it:" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25377" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories.to_sql('categories', conn, if_exists='replace', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Test that it works**" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id name\n", "0 497633 Household\n", "1 627645 Household\n", "2 475915 Household\n", "3 28863 Household\n", "4 22482 Household" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql('SELECT * FROM categories LIMIT 5', conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What if I want to get the categories of a product under this new schema?" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name size \\\n", "0 497633 Finish Quantum All in One Lemon Dishwasher Tab... 64Each \n", "1 627645 Finish Ultimate Plus All in One Lemon Dishwash... 48Each \n", "2 475915 Finish Ultimate All in One Lemon Dishwasher Ta... 36s \n", "3 28863 Essential White Ultra Soft Bathroom Tissue 9x190 sheets \n", "4 28863 Essential White Ultra Soft Bathroom Tissue 9x190 sheets \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/finish-... \n", "1 https://www.waitrose.com/ecom/products/finish-... \n", "2 https://www.waitrose.com/ecom/products/finish-... \n", "3 https://www.waitrose.com/ecom/products/essenti... \n", "4 https://www.waitrose.com/ecom/products/essenti... \n", "\n", " image-url type category \n", "0 https://ecom-su-static-prod.wtrecom.com/images... G Household \n", "1 https://ecom-su-static-prod.wtrecom.com/images... G Household \n", "2 https://ecom-su-static-prod.wtrecom.com/images... G Household \n", "3 https://ecom-su-static-prod.wtrecom.com/images... G Everyday Value \n", "4 https://ecom-su-static-prod.wtrecom.com/images... G Household " ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We select the products and join them with the categories table to get the category names\n", "pd.read_sql(\n", "'''\n", "SELECT\n", " p.*,\n", " c.name AS category\n", "FROM\n", " products p\n", "LEFT JOIN\n", " categories c\n", "ON\n", " p.id = c.product_id\n", "LIMIT 5\n", "''', conn)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "☝️ Notice how the products appear multiple times if they have more than one category. This is the expected behavior, as we are linking the two tables together with a simple JOIN operation.\n", "\n", "**Here's an alternative way that preserves the unique products:**" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name size \\\n", "0 1023 Taylors of Harrogate Yorkshire Gold 160 Bags 500g \n", "1 1029 Nestlé Box Bowls Variety 6 Portion Pack 210g \n", "2 1048 Kellogg's Corn Flakes Breakfast Cereal Large Pack 670g \n", "3 1051 Kellogg's Crunchy Nut Breakfast Cereal Large Pack 840g \n", "4 1052 Kellogg's Crunchy Nut Breakfast Cereal 460g \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/taylors... \n", "1 https://www.waitrose.com/ecom/products/nestle-... \n", "2 https://www.waitrose.com/ecom/products/kellogg... \n", "3 https://www.waitrose.com/ecom/products/kellogg... \n", "4 https://www.waitrose.com/ecom/products/kellogg... \n", "\n", " image-url type \\\n", "0 https://ecom-su-static-prod.wtrecom.com/images... G \n", "1 https://ecom-su-static-prod.wtrecom.com/images... G \n", "2 https://ecom-su-static-prod.wtrecom.com/images... G \n", "3 https://ecom-su-static-prod.wtrecom.com/images... G \n", "4 https://ecom-su-static-prod.wtrecom.com/images... G \n", "\n", " categories \n", "0 Tea, Coffee & Soft Drinks \n", "1 Food Cupboard \n", "2 Dietary & Lifestyle,Food Cupboard \n", "3 Dietary & Lifestyle,Food Cupboard \n", "4 Dietary & Lifestyle,Food Cupboard " ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We group by the `product_id` and use GROUP_CONCAT to concatenate all the category names into a single string\n", "pd.read_sql(\n", "'''\n", "SELECT\n", " p.*,\n", " GROUP_CONCAT(c.name) AS categories\n", "FROM\n", " products p\n", "LEFT JOIN\n", " categories c\n", "ON\n", " p.id = c.product_id\n", "GROUP BY\n", " p.id\n", "LIMIT 5\n", "''', conn)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table 3: `prices`\n", "\n", "The `prices` table will have the following columns:\n", "\n", "- `product_id` (int): the unique identifier for the product\n", "- `item-price` (float): the price of the product\n", "- `price-per-unit` (float): the price per unit of the product\n", "\n", "The process is very similar to what we did for the `categories` table:" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id item-price price-per-unit\n", "314 1023 7.10 £1.42/100g\n", "2403 1029 2.30 £1.10/100g\n", "513 1048 2.75 41p/100g\n", "794 1051 5.00 59.5p/100g\n", "805 1052 2.75 59.8p/100g" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prices = (\n", " df[['id', 'item-price', 'price-per-unit']].drop_duplicates()\n", " .sort_values('id')\n", " .rename(columns={'id': 'product_id'})\n", ")\n", "prices.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save it:" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16065" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prices.to_sql('prices', conn, if_exists='replace', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test that the table was created correctly" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id item-price price-per-unit\n", "0 1023 7.10 £1.42/100g\n", "1 1029 2.30 £1.10/100g\n", "2 1048 2.75 41p/100g\n", "3 1051 5.00 59.5p/100g\n", "4 1052 2.75 59.8p/100g" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql('SELECT * FROM prices LIMIT 5', conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each product in the `products` table, recover their prices and categories." ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name size \\\n", "0 1023 Taylors of Harrogate Yorkshire Gold 160 Bags 500g \n", "1 1029 Nestlé Box Bowls Variety 6 Portion Pack 210g \n", "2 1048 Kellogg's Corn Flakes Breakfast Cereal Large Pack 670g \n", "3 1051 Kellogg's Crunchy Nut Breakfast Cereal Large Pack 840g \n", "4 1052 Kellogg's Crunchy Nut Breakfast Cereal 460g \n", "\n", " page \\\n", "0 https://www.waitrose.com/ecom/products/taylors... \n", "1 https://www.waitrose.com/ecom/products/nestle-... \n", "2 https://www.waitrose.com/ecom/products/kellogg... \n", "3 https://www.waitrose.com/ecom/products/kellogg... \n", "4 https://www.waitrose.com/ecom/products/kellogg... \n", "\n", " image-url type product_id \\\n", "0 https://ecom-su-static-prod.wtrecom.com/images... G 1023 \n", "1 https://ecom-su-static-prod.wtrecom.com/images... G 1029 \n", "2 https://ecom-su-static-prod.wtrecom.com/images... G 1048 \n", "3 https://ecom-su-static-prod.wtrecom.com/images... G 1051 \n", "4 https://ecom-su-static-prod.wtrecom.com/images... G 1052 \n", "\n", " item-price price-per-unit categories \n", "0 7.10 £1.42/100g Tea, Coffee & Soft Drinks \n", "1 2.30 £1.10/100g Food Cupboard \n", "2 2.75 41p/100g Dietary & Lifestyle,Food Cupboard \n", "3 5.00 59.5p/100g Dietary & Lifestyle,Food Cupboard \n", "4 2.75 59.8p/100g Dietary & Lifestyle,Food Cupboard " ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql(\n", "'''\n", "SELECT\n", " p.*,\n", " pr.*,\n", " GROUP_CONCAT(DISTINCT c.name) AS categories\n", "FROM\n", " products p\n", "LEFT JOIN\n", " prices pr\n", "ON\n", " p.id = pr.product_id\n", "LEFT JOIN\n", " categories c\n", "ON\n", " p.id = c.product_id\n", "GROUP BY p.id\n", "LIMIT 5\n", "''', conn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table 4: `offers`\n", "\n", "The `offers` table will have the following columns:\n", "\n", "- `product_id` (int): the unique identifier for the product \n", "- `offer-description` (str): the offer on the product\n", "\n", "\n", "**⚠️ IMPORTANT**: I don't need to store a `is_on_offer` column in the `products` table. This is redundant information, as I can get this information from the `offers` table. When joining the two tables, I can check if the `description` column is not null." ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id offer-description\n", "513 1048 save 45p. Was £3.20\n", "805 1052 save 75p. Was £3.50\n", "863 1066 save 70p. Was £3.45\n", "1316 1138 save £1.30. Was £3.80\n", "1248 1182 save 85p. Was £1.85\n", "... ... ...\n", "171 988412 Introductory Offer.Will be £6.00\n", "265 990127 Introductory Offer.Will be £3.65\n", "257 995459 New Line\n", "249 999031 Introductory Offer.Will be £1.20\n", "1625 999167 Introductory Offer.Will be £2.00\n", "\n", "[4656 rows x 2 columns]" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "offers = (\n", " df[df['on-offer'] == True][['id', 'offer-description']].drop_duplicates()\n", " .sort_values('id')\n", " .rename(columns={'id': 'product_id'})\n", ")\n", "offers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By filtering to only the products that are on offer, I get fewer rows in the `offers` table, which is good for performance." ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4656" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "offers.to_sql('offers', conn, if_exists='replace', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select **only the products that are on offer**, use the `INNER JOIN` operation:" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name \\\n", "0 497633 Finish Quantum All in One Lemon Dishwasher Tab... \n", "1 627645 Finish Ultimate Plus All in One Lemon Dishwash... \n", "2 475915 Finish Ultimate All in One Lemon Dishwasher Ta... \n", "3 14407 Andrex Supreme Quilts Toilet Roll \n", "4 71469 Andrex Classic Clean Toilet Roll \n", "... ... ... \n", "4654 99326 Complan Original \n", "4655 82540 Fentimans Dandelion & Burdock \n", "4656 54537 Nescafe Dolce Gusto Lungo Coffee Pods 16's \n", "4657 16118 Red Bull Energy Drink \n", "4658 12467 Red Bull Sugarfree \n", "\n", " offer-description \n", "0 Save 1/3. Was £15.00 \n", "1 save £6.50. Was £15.50 \n", "2 Save 1/3. Was £12.50 \n", "3 Add 2 for £10 \n", "4 save £1.65. Was £6.60 \n", "... ... \n", "4654 Add 2 for £10 \n", "4655 save 65p. Was £1.65 \n", "4656 Add 2 for £7.50 \n", "4657 save £1.50. Was £5.50 \n", "4658 save £1.25. Was £5.25 \n", "\n", "[4659 rows x 3 columns]" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql(\n", "'''\n", "SELECT\n", " p.id,\n", " p.name,\n", " o.`offer-description`\n", "FROM\n", " products p\n", "INNER JOIN\n", " offers o\n", "ON \n", " p.id = o.product_id\n", "''', conn\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add a `is-on-offer` column to the `products` table, use the `LEFT JOIN` operation and the `CASE` statement:" ] }, { "cell_type": "code", "execution_count": 194, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id name \\\n", "0 497633 Finish Quantum All in One Lemon Dishwasher Tab... \n", "1 627645 Finish Ultimate Plus All in One Lemon Dishwash... \n", "2 475915 Finish Ultimate All in One Lemon Dishwasher Ta... \n", "3 28863 Essential White Ultra Soft Bathroom Tissue \n", "4 22482 Essential Thick Bleach Citrus \n", "... ... ... \n", "16060 29679 Jacksons Fairtrade Sencha Green Tea Bags 20 \n", "16061 16521 Watergull Orchards Apple Juice Cox \n", "16062 16118 Red Bull Energy Drink \n", "16063 12467 Red Bull Sugarfree \n", "16064 7932 Pixley Berries Blackcurrant Cordial \n", "\n", " offer-description is-on-offer \n", "0 Save 1/3. Was £15.00 1 \n", "1 save £6.50. Was £15.50 1 \n", "2 Save 1/3. Was £12.50 1 \n", "3 None 0 \n", "4 None 0 \n", "... ... ... \n", "16060 None 0 \n", "16061 None 0 \n", "16062 save £1.50. Was £5.50 1 \n", "16063 save £1.25. Was £5.25 1 \n", "16064 None 0 \n", "\n", "[16065 rows x 4 columns]" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add is-on-offer column to the products table if there's no offer for a product, set it to False\n", "pd.read_sql(\n", "'''\n", "SELECT\n", " p.id,\n", " p.name,\n", " o.`offer-description`,\n", " CASE\n", " WHEN o.`offer-description` IS NULL THEN False\n", " ELSE True\n", " END AS `is-on-offer`\n", "FROM\n", " products p\n", "LEFT JOIN\n", " offers o\n", "ON\n", " p.id = o.product_id\n", "''', conn\n", ")\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " type name \\\n", "sql \n", "CREATE TABLE \"products\" (\\n\"id\" INTEGER,\\n \"na... table products \n", "CREATE TABLE \"categories\" (\\n\"product_id\" INTEG... table categories \n", "CREATE TABLE \"prices\" (\\n\"product_id\" INTEGER,\\... table prices \n", "CREATE TABLE \"offers\" (\\n\"product_id\" INTEGER,\\... table offers \n", "\n", " tbl_name rootpage \n", "sql \n", "CREATE TABLE \"products\" (\\n\"id\" INTEGER,\\n \"na... products 3 \n", "CREATE TABLE \"categories\" (\\n\"product_id\" INTEG... categories 2 \n", "CREATE TABLE \"prices\" (\\n\"product_id\" INTEGER,\\... prices 10 \n", "CREATE TABLE \"offers\" (\\n\"product_id\" INTEGER,\\... offers 1310 " ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that all the tables are created correctly\n", "\n", "pd.read_sql('SELECT * FROM sqlite_master', conn).set_index('sql')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SOLUTION 2: MORE FORMAL KEY CONSTRAINTS (more complex but ideal)\n", "\n", "The way to create the data frames that will eventually populate the database is the same as before. The difference is that we will now create the database schema with primary and foreign keys.\n", "\n", "Therefore, we cannot use the `if_exists='replace'` parameter in the `to_sql` method. Otherwise, we will just be creating new tables every time we run the code." ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Connecting to 'sqlite:///supermarket.db'" ], "text/plain": [ "Connecting to 'sqlite:///supermarket.db'" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# For this I'd rather use the jupysql package\n", "%sql sqlite:///../data/supermarket.db --alias supermarket" ] }, { "cell_type": "code", "execution_count": 221, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "data": { "text/html": [ "Running query in 'sqlite:///supermarket.db'" ], "text/plain": [ "Running query in 'sqlite:///supermarket.db'" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ "++\n", "||\n", "++\n", "++" ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "-- RECREATE THE products TABLE\n", "\n", "DROP TABLE IF EXISTS products;\n", "\n", "CREATE TABLE \"products\" (\n", " \"id\" INTEGER PRIMARY KEY,\n", " \"name\" VARCHAR(100),\n", " \"size\" VARCHAR(30),\n", " \"page\" VARCHAR(300),\n", " \"image-url\" VARCHAR(300),\n", " \"type\" CHAR(1)\n", ");\n", "\n", "-- RECREATE THE categories TABLE\n", "\n", "DROP TABLE IF EXISTS categories;\n", "\n", "CREATE TABLE \"categories\" (\n", " \"product_id\" INTEGER,\n", " \"name\" VARCHAR(50),\n", " PRIMARY KEY (\"product_id\", \"name\"),\n", " FOREIGN KEY (\"product_id\") REFERENCES \"products\" (\"id\")\n", ");\n", "\n", "-- RECREATE THE prices TABLE\n", "\n", "DROP TABLE IF EXISTS prices;\n", "\n", "CREATE TABLE \"prices\" (\n", " \"product_id\" INTEGER,\n", " \"item-price\" REAL,\n", " \"price-per-unit\" TEXT,\n", " FOREIGN KEY (\"product_id\") REFERENCES \"products\" (\"id\")\n", ");\n", "\n", "-- RECREATE THE offers TABLE\n", "\n", "DROP TABLE IF EXISTS offers;\n", "\n", "CREATE TABLE \"offers\" (\n", " \"product_id\" INTEGER,\n", " \"offer-description\" TEXT,\n", " FOREIGN KEY (\"product_id\") REFERENCES \"products\" (\"id\")\n", ")\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "sql" } }, "source": [ "**Populate the database**\n", "\n", "When we try to populate the database, we get an error saying that some ids are duplicated!!\n", "\n", "This is, of course, annoying because we now have to fix the problem but good because otherwise, we would have inserted the data into the database with duplicate ids." ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "vscode": { "languageId": "sql" } }, "outputs": [ { "ename": "IntegrityError", "evalue": "UNIQUE constraint failed: products.id", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIntegrityError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[222], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mproducts\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mproducts\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mappend\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/util/_decorators.py:333\u001b[0m, in \u001b[0;36mdeprecate_nonkeyword_arguments..decorate..wrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(args) \u001b[38;5;241m>\u001b[39m num_allow_args:\n\u001b[1;32m 328\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[1;32m 329\u001b[0m msg\u001b[38;5;241m.\u001b[39mformat(arguments\u001b[38;5;241m=\u001b[39m_format_argument_list(allow_args)),\n\u001b[1;32m 330\u001b[0m \u001b[38;5;167;01mFutureWarning\u001b[39;00m,\n\u001b[1;32m 331\u001b[0m 
stacklevel\u001b[38;5;241m=\u001b[39mfind_stack_level(),\n\u001b[1;32m 332\u001b[0m )\n\u001b[0;32m--> 333\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/core/generic.py:3087\u001b[0m, in \u001b[0;36mNDFrame.to_sql\u001b[0;34m(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)\u001b[0m\n\u001b[1;32m 2889\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[1;32m 2890\u001b[0m \u001b[38;5;124;03mWrite records stored in a DataFrame to a SQL database.\u001b[39;00m\n\u001b[1;32m 2891\u001b[0m \n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 3083\u001b[0m \u001b[38;5;124;03m[(1,), (None,), (2,)]\u001b[39;00m\n\u001b[1;32m 3084\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m \u001b[38;5;66;03m# noqa: E501\u001b[39;00m\n\u001b[1;32m 3085\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mpandas\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mio\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m sql\n\u001b[0;32m-> 3087\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43msql\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3088\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3089\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3090\u001b[0m \u001b[43m \u001b[49m\u001b[43mcon\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3091\u001b[0m \u001b[43m \u001b[49m\u001b[43mschema\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mschema\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3092\u001b[0m \u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mif_exists\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3093\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3094\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex_label\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex_label\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3095\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunksize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3096\u001b[0m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3097\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3098\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/io/sql.py:842\u001b[0m, in \u001b[0;36mto_sql\u001b[0;34m(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method, engine, **engine_kwargs)\u001b[0m\n\u001b[1;32m 837\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mNotImplementedError\u001b[39;00m(\n\u001b[1;32m 838\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mframe\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m argument should be either a Series or a 
DataFrame\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 839\u001b[0m )\n\u001b[1;32m 841\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m pandasSQL_builder(con, schema\u001b[38;5;241m=\u001b[39mschema, need_transaction\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m) \u001b[38;5;28;01mas\u001b[39;00m pandas_sql:\n\u001b[0;32m--> 842\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mpandas_sql\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mto_sql\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 843\u001b[0m \u001b[43m \u001b[49m\u001b[43mframe\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 844\u001b[0m \u001b[43m \u001b[49m\u001b[43mname\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 845\u001b[0m \u001b[43m \u001b[49m\u001b[43mif_exists\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mif_exists\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 846\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 847\u001b[0m \u001b[43m \u001b[49m\u001b[43mindex_label\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindex_label\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 848\u001b[0m \u001b[43m \u001b[49m\u001b[43mschema\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mschema\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 849\u001b[0m \u001b[43m \u001b[49m\u001b[43mchunksize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 850\u001b[0m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 851\u001b[0m \u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmethod\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 852\u001b[0m \u001b[43m \u001b[49m\u001b[43mengine\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mengine\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 853\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mengine_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 854\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/io/sql.py:2851\u001b[0m, in \u001b[0;36mSQLiteDatabase.to_sql\u001b[0;34m(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method, engine, **engine_kwargs)\u001b[0m\n\u001b[1;32m 2841\u001b[0m table \u001b[38;5;241m=\u001b[39m SQLiteTable(\n\u001b[1;32m 2842\u001b[0m name,\n\u001b[1;32m 2843\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 2848\u001b[0m dtype\u001b[38;5;241m=\u001b[39mdtype,\n\u001b[1;32m 2849\u001b[0m )\n\u001b[1;32m 2850\u001b[0m table\u001b[38;5;241m.\u001b[39mcreate()\n\u001b[0;32m-> 2851\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mtable\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minsert\u001b[49m\u001b[43m(\u001b[49m\u001b[43mchunksize\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmethod\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/io/sql.py:1119\u001b[0m, in \u001b[0;36mSQLTable.insert\u001b[0;34m(self, chunksize, method)\u001b[0m\n\u001b[1;32m 1116\u001b[0m \u001b[38;5;28;01mbreak\u001b[39;00m\n\u001b[1;32m 1118\u001b[0m chunk_iter \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mzip\u001b[39m(\u001b[38;5;241m*\u001b[39m(arr[start_i:end_i] \u001b[38;5;28;01mfor\u001b[39;00m arr 
\u001b[38;5;129;01min\u001b[39;00m data_list))\n\u001b[0;32m-> 1119\u001b[0m num_inserted \u001b[38;5;241m=\u001b[39m \u001b[43mexec_insert\u001b[49m\u001b[43m(\u001b[49m\u001b[43mconn\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mkeys\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mchunk_iter\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1120\u001b[0m \u001b[38;5;66;03m# GH 46891\u001b[39;00m\n\u001b[1;32m 1121\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m num_inserted \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n", "File \u001b[0;32m~/miniconda3/lib/python3.12/site-packages/pandas/io/sql.py:2547\u001b[0m, in \u001b[0;36mSQLiteTable._execute_insert\u001b[0;34m(self, conn, keys, data_iter)\u001b[0m\n\u001b[1;32m 2545\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_execute_insert\u001b[39m(\u001b[38;5;28mself\u001b[39m, conn, keys, data_iter) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m \u001b[38;5;28mint\u001b[39m:\n\u001b[1;32m 2546\u001b[0m data_list \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mlist\u001b[39m(data_iter)\n\u001b[0;32m-> 2547\u001b[0m \u001b[43mconn\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mexecutemany\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43minsert_statement\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnum_rows\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata_list\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 2548\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m conn\u001b[38;5;241m.\u001b[39mrowcount\n", "\u001b[0;31mIntegrityError\u001b[0m: UNIQUE constraint failed: products.id" ] } ], "source": [ "products.to_sql('products', conn, if_exists='append', index=False)" ] }, { "cell_type": "code", "execution_count": 253, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamesize
4867459NaNNaN
23667459No.1 Royal Deeside Still Water750ml
34069460Waitrose Pineapple Fruit Juice1litre
4969460NaNNaN
4885201NaNNaN
3585201Essential LeeksTypical weight 0.21kg
3485247Essential CauliflowerEach
4885247NaNNaN
3685513Duchy Organic Carrots650g
4985513NaNNaN
4686535Waitrose Cherry Vine Tomatoes270g
4986535NaNNaN
56522314Waitrose Red Leicester Cheese & Sriracha Burge...140g
48522314NaNNaN
25522601Essential Pedal Bin Liners Tie Handles30s
24522601NaNNaN
48722909NaNNaN
723722909John Lewis Bumble Bee Napkin 33cm20s
59750319Waitrose 2 Lemon & Lime Daisy Cheesecakes2x76g
49750319NaNNaN
\n", "
" ], "text/plain": [ " id name \\\n", "48 67459 NaN \n", "236 67459 No.1 Royal Deeside Still Water \n", "340 69460 Waitrose Pineapple Fruit Juice \n", "49 69460 NaN \n", "48 85201 NaN \n", "35 85201 Essential Leeks \n", "34 85247 Essential Cauliflower \n", "48 85247 NaN \n", "36 85513 Duchy Organic Carrots \n", "49 85513 NaN \n", "46 86535 Waitrose Cherry Vine Tomatoes \n", "49 86535 NaN \n", "56 522314 Waitrose Red Leicester Cheese & Sriracha Burge... \n", "48 522314 NaN \n", "25 522601 Essential Pedal Bin Liners Tie Handles \n", "24 522601 NaN \n", "48 722909 NaN \n", "723 722909 John Lewis Bumble Bee Napkin 33cm \n", "59 750319 Waitrose 2 Lemon & Lime Daisy Cheesecakes \n", "49 750319 NaN \n", "\n", " size \n", "48 NaN \n", "236 750ml \n", "340 1litre \n", "49 NaN \n", "48 NaN \n", "35 Typical weight 0.21kg \n", "34 Each \n", "48 NaN \n", "36 650g \n", "49 NaN \n", "46 270g \n", "49 NaN \n", "56 140g \n", "48 NaN \n", "25 30s \n", "24 NaN \n", "48 NaN \n", "723 20s \n", "59 2x76g \n", "49 NaN " ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Investigate it\n", "\n", "# What are the duplicated ids?\n", "duplicated_ids = products['id'][products['id'].duplicated()].tolist()\n", "\n", "# Get the products with the duplicated ids\n", "products[products['id'].isin(duplicated_ids)].sort_values('id')[['id', 'name', 'size']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This error is somewhat easy to fix! The duplicated entries are essentially **scraping errors**. We can just drop those that contain NaN values in the `name` column." ] }, { "cell_type": "code", "execution_count": 259, "metadata": {}, "outputs": [], "source": [ "# Drop the rows with NaN values in the `name` column\n", "products = products.dropna(subset=['name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try again:" ] }, { "cell_type": "code", "execution_count": 260, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16054" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products.to_sql('products', conn, if_exists='append', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IT WORKED! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Add the categories table**" ] }, { "cell_type": "code", "execution_count": 261, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25377" ] }, "execution_count": 261, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories.to_sql('categories', conn, if_exists='append', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Add the `prices` table to the database**" ] }, { "cell_type": "code", "execution_count": 263, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16065" ] }, "execution_count": 263, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prices.to_sql('prices', conn, if_exists='append', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Add the `offers` table to the database**" ] }, { "cell_type": "code", "execution_count": 264, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4656" ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "offers.to_sql('offers', conn, if_exists='append', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "✅ AND WE'RE DONE! WE MADE SURE THAT EVERYTHING IS IN THE DATABASE AND THAT THE DATA IS CONSISTENT." 
] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }