DS101 Lab 2: Importing and exporting structured data to and from Python

Author

The DS101A Teaching Team

Below is the script that was showcased during the week 2 class. You can try it for yourselves on Google Colab.

Click on the button below to download the source file (.ipynb notebook) as well as data file (.csv) that were used to create this page, if you ever want to try the demo for yourselves:

Tracking socio-economic outcomes in the BRICS

Working with Python using Jupyter notebooks

Jupyter notebooks are a useful scratchpad for working with data.

The notebook structure means code is entered and prcessed inside individual cells, rather than as entire (end-to-end) script. Jupyter works with Python, and Google Colab is a cloud store for Jupyter notebooks attached to a backend that can run your code – without having to worry about the details of installing many packages (and their dependencies).

  • Place the cursor inside a cell to make edits
  • Press shift+ENTER to run the code in a cell
  • That progresses the cursor to the next cell

Just follow the instructions above each cell. Ask for help with any technical problems.

Outline

We undertake the following steps:

  • Source information
  • Gather data
  • Storage
  • Load
  • Clean
  • Pre-process
  • Build a dataset
  • EDA
  • Processing
  • Visualisation
  • Export

Getting started with Python

Loading libraries

Powerful languages like Python and R thrive on the ecosystem of user-created packages.

We import (or, “load”) a few of these packages here:

  • numpy – gives python more efficient way to work with arrays
  • pandas – gives a way to work with (row and column) labelled arrays

These are very common modules to use whenever you want to work with datasets.

import numpy as np
import pandas as pd
from plotnine import *

Sourcing data

On the left-handside, create a new folder DS101A_data and drop your WDI_metrics.csv file here

Loading data

Loading data into Python is fairly straightforward. We use a file path which tells Python where to look for the data followed by a / separator followed by the name of the file (WDI_metrics.csv).

Python provides the os module (Operating System Interface) – with dozens of functions for interacting with the operating system: https://docs.python.org/3/tutorial/stdlib.html

It’s a part of the standard library (i.e. the core of Python), and it mirrors the way that you would work with your own operating system if your were doing that by hand (or from the command line).

“cwd” means “current working directory”

To find out what your working directory is, we can import the os module and use the function: os.getcwd().

import os
os.getcwd()
'/home/runner/work/DS101/DS101/2024/autumn-term/weeks/week02'

We are going to load our data into a python module called pandas – it is specialised for handling tables of data called “dataframes

  • In this case, our variable is called wb_metrics, and that hold our dataframe
wb_metrics = pd.read_csv("DS101A_data/WDI_metrics.csv", na_values="..").dropna(subset = "Series Code")

Dataset exploration

What does this data look like?

Just “print” the dataframe

  • Replace this code with wb_metrics.head()
  • Try using different integers inside the brackets:
  • e.g. wb_metrics.head(1)
wb_metrics.head(3)
Series Name Series Code Country Name Country Code 2000 [YR2000] 2001 [YR2001] 2002 [YR2002] 2003 [YR2003] 2004 [YR2004] 2005 [YR2005] 2006 [YR2006] 2007 [YR2007] 2008 [YR2008] 2009 [YR2009] 2010 [YR2010] 2011 [YR2011] 2012 [YR2012] 2013 [YR2013] 2014 [YR2014] 2015 [YR2015]
0 Population, total SP.POP.TOTL Afghanistan AFG 19542982.0 19688632.0 21000256.0 22645130.0 23553551.0 24411191.0 25442944.0 25903301.0 26427199.0 27385307.0 28189672.0 29249157.0 30466479.0 31541209.0 32716210.0 33753499.0
1 Population, total SP.POP.TOTL Albania ALB 3089027.0 3060173.0 3051010.0 3039616.0 3026939.0 3011487.0 2992547.0 2970017.0 2947314.0 2927519.0 2913021.0 2905195.0 2900401.0 2895092.0 2889104.0 2880703.0
2 Population, total SP.POP.TOTL Algeria DZA 30774621.0 31200985.0 31624696.0 32055883.0 32510186.0 32956690.0 33435080.0 33983827.0 34569592.0 35196037.0 35856344.0 36543541.0 37260563.0 38000626.0 38760168.0 39543154.0

Pandas allows us to view different portion of our dataframe, using e.g. .iloc[]:

  • The first integer is the row
  • The second integer is the column

We can specify rows an columns using named labels too.

Here we select column subsets using the “slice” notation – try changing the values

wb_metrics.iloc[0:4, 0:10]
Series Name Series Code Country Name Country Code 2000 [YR2000] 2001 [YR2001] 2002 [YR2002] 2003 [YR2003] 2004 [YR2004] 2005 [YR2005]
0 Population, total SP.POP.TOTL Afghanistan AFG 19542982.0 19688632.0 21000256.0 22645130.0 23553551.0 24411191.0
1 Population, total SP.POP.TOTL Albania ALB 3089027.0 3060173.0 3051010.0 3039616.0 3026939.0 3011487.0
2 Population, total SP.POP.TOTL Algeria DZA 30774621.0 31200985.0 31624696.0 32055883.0 32510186.0 32956690.0
3 Population, total SP.POP.TOTL American Samoa ASM 58230.0 58324.0 58177.0 57941.0 57626.0 57254.0
wb_metrics.shape
(11067, 20)
wb_metrics.columns
Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
       '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
       '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
       '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]', '2011 [YR2011]',
       '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]', '2015 [YR2015]'],
      dtype='object')
wb_metrics.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11067 entries, 0 to 11066
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series Name    11067 non-null  object 
 1   Series Code    11067 non-null  object 
 2   Country Name   11067 non-null  object 
 3   Country Code   11067 non-null  object 
 4   2000 [YR2000]  7943 non-null   float64
 5   2001 [YR2001]  7816 non-null   float64
 6   2002 [YR2002]  7915 non-null   float64
 7   2003 [YR2003]  8084 non-null   float64
 8   2004 [YR2004]  8214 non-null   float64
 9   2005 [YR2005]  8324 non-null   float64
 10  2006 [YR2006]  8376 non-null   float64
 11  2007 [YR2007]  8432 non-null   float64
 12  2008 [YR2008]  8455 non-null   float64
 13  2009 [YR2009]  8524 non-null   float64
 14  2010 [YR2010]  8597 non-null   float64
 15  2011 [YR2011]  8630 non-null   float64
 16  2012 [YR2012]  8646 non-null   float64
 17  2013 [YR2013]  8633 non-null   float64
 18  2014 [YR2014]  8725 non-null   float64
 19  2015 [YR2015]  8443 non-null   float64
dtypes: float64(16), object(4)
memory usage: 1.8+ MB

Readability

The column labels are not great, in particular, they contain square brackets [ and ] which could easily be mistaken for (what are known as) operators within python: python thinks a symbol like this means you are working with “lists”

Let’s get rid of these and make our code more readable for human beings in the process

colnames = ["series_name", "series_code", "country_name", "country_code"]
colnames = colnames + ["yr_" + str(x) for x in range(2000, 2016)]
wb_metrics.columns = colnames

Column names

print(wb_metrics.columns)
Index(['series_name', 'series_code', 'country_name', 'country_code', 'yr_2000',
       'yr_2001', 'yr_2002', 'yr_2003', 'yr_2004', 'yr_2005', 'yr_2006',
       'yr_2007', 'yr_2008', 'yr_2009', 'yr_2010', 'yr_2011', 'yr_2012',
       'yr_2013', 'yr_2014', 'yr_2015'],
      dtype='object')

Fix to reflect the categorical nature of the datatype

In the “series name”, “series code” and “country code” columns. We will stick with strings for “country name” for now

wb_metrics_cat = wb_metrics.copy()
wb_metrics_cat[["series_name", "series_code", "country_code"]] = wb_metrics_cat[["series_name", "series_code", "country_code"]].astype("category")
wb_metrics_cat[["country_name"]] = wb_metrics_cat[["country_name"]].astype("string")
wb_metrics_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11067 entries, 0 to 11066
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   series_name   11067 non-null  category
 1   series_code   11067 non-null  category
 2   country_name  11067 non-null  string  
 3   country_code  11067 non-null  category
 4   yr_2000       7943 non-null   float64 
 5   yr_2001       7816 non-null   float64 
 6   yr_2002       7915 non-null   float64 
 7   yr_2003       8084 non-null   float64 
 8   yr_2004       8214 non-null   float64 
 9   yr_2005       8324 non-null   float64 
 10  yr_2006       8376 non-null   float64 
 11  yr_2007       8432 non-null   float64 
 12  yr_2008       8455 non-null   float64 
 13  yr_2009       8524 non-null   float64 
 14  yr_2010       8597 non-null   float64 
 15  yr_2011       8630 non-null   float64 
 16  yr_2012       8646 non-null   float64 
 17  yr_2013       8633 non-null   float64 
 18  yr_2014       8725 non-null   float64 
 19  yr_2015       8443 non-null   float64 
dtypes: category(3), float64(16), string(1)
memory usage: 1.6 MB

Note how much smaller the dataframe has become (~10%)

What metrics are in the data set?

The function unique() creates a list from a repeating list of entries.

The following gives us a way to see the unique entries in a column. In this case, each entry is associated with a series, so now we know all the series within the table:

all_series = wb_metrics["series_name"].unique()
print(all_series)
['Population, total' 'Population growth (annual %)'
 'Surface area (sq. km)'
 'Poverty headcount ratio at national poverty lines (% of population)'
 'GNI, Atlas method (current US$)'
 'GNI per capita, Atlas method (current US$)'
 'GNI, PPP (current international $)'
 'GNI per capita, PPP (current international $)'
 'Income share held by lowest 20%'
 'Life expectancy at birth, total (years)'
 'Fertility rate, total (births per woman)'
 'Adolescent fertility rate (births per 1,000 women ages 15-19)'
 'Contraceptive prevalence, any method (% of married women ages 15-49)'
 'Births attended by skilled health staff (% of total)'
 'Mortality rate, under-5 (per 1,000 live births)'
 'Prevalence of underweight, weight for age (% of children under 5)'
 'Immunization, measles (% of children ages 12-23 months)'
 'Primary completion rate, total (% of relevant age group)'
 'School enrollment, secondary (% gross)'
 'School enrollment, primary and secondary (gross), gender parity index (GPI)'
 'Prevalence of HIV, total (% of population ages 15-49)'
 'Forest area (sq. km)'
 'Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)'
 'Energy use (kg of oil equivalent per capita)'
 'CO2 emissions (metric tons per capita)'
 'Electric power consumption (kWh per capita)' 'GDP (current US$)'
 'GDP growth (annual %)' 'Inflation, GDP deflator (annual %)'
 'Agriculture, forestry, and fishing, value added (% of GDP)'
 'Industry (including construction), value added (% of GDP)'
 'Exports of goods and services (% of GDP)'
 'Imports of goods and services (% of GDP)'
 'Gross capital formation (% of GDP)'
 'Revenue, excluding grants (% of GDP)'
 'Start-up procedures to register a business (number)'
 'Market capitalization of listed domestic companies (% of GDP)'
 'Military expenditure (% of GDP)'
 'Mobile cellular subscriptions (per 100 people)'
 'High-technology exports (% of manufactured exports)'
 'Merchandise trade (% of GDP)'
 'Net barter terms of trade index (2015 = 100)'
 'External debt stocks, total (DOD, current US$)'
 'Total debt service (% of GNI)' 'Net migration'
 'Personal remittances, paid (current US$)'
 'Foreign direct investment, net inflows (BoP, current US$)'
 'Net ODA received per capita (current US$)'
 'GDP per capita (current US$)'
 'Foreign direct investment, net (BoP, current US$)'
 'Inflation, consumer prices (annual %)']

Select a subset of rows, by series

By category

gdp_0 = wb_metrics_cat[ wb_metrics_cat['series_name'] == 'GDP (current US$)' ]
gdp_0.head(10)
series_name series_code country_name country_code yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015
5642 GDP (current US$) NY.GDP.MKTP.CD Afghanistan AFG 3.521418e+09 2.813572e+09 3.825701e+09 4.520947e+09 5.224897e+09 6.203257e+09 6.971758e+09 9.747886e+09 1.010930e+10 1.241615e+10 1.585667e+10 1.780510e+10 1.990733e+10 2.014642e+10 2.049713e+10 1.913422e+10
5643 GDP (current US$) NY.GDP.MKTP.CD Albania ALB 3.480355e+09 3.922101e+09 4.348068e+09 5.611496e+09 7.184686e+09 8.052077e+09 8.896075e+09 1.067732e+10 1.288135e+10 1.204421e+10 1.192693e+10 1.289076e+10 1.231983e+10 1.277622e+10 1.322815e+10 1.138685e+10
5644 GDP (current US$) NY.GDP.MKTP.CD Algeria DZA 5.479040e+10 5.941340e+10 6.151610e+10 7.348226e+10 9.191368e+10 1.070466e+11 1.230843e+11 1.424827e+11 1.803838e+11 1.503173e+11 1.777851e+11 2.183319e+11 2.271437e+11 2.297014e+11 2.389427e+11 1.874939e+11
5645 GDP (current US$) NY.GDP.MKTP.CD American Samoa ASM NaN NaN 5.120000e+08 5.240000e+08 5.090000e+08 5.000000e+08 4.930000e+08 5.180000e+08 5.600000e+08 6.750000e+08 5.730000e+08 5.700000e+08 6.400000e+08 6.380000e+08 6.430000e+08 6.730000e+08
5646 GDP (current US$) NY.GDP.MKTP.CD Andorra AND 1.432606e+09 1.548266e+09 1.764280e+09 2.366942e+09 2.900245e+09 3.161084e+09 3.459338e+09 3.957625e+09 4.102319e+09 3.688976e+09 3.449926e+09 3.629134e+09 3.188653e+09 3.193513e+09 3.271686e+09 2.789881e+09
5647 GDP (current US$) NY.GDP.MKTP.CD Angola AGO 9.129595e+09 8.936079e+09 1.528559e+10 1.781270e+10 2.355206e+10 3.697090e+10 5.238103e+10 6.526642e+10 8.853867e+10 7.030720e+10 8.379947e+10 1.117897e+11 1.280529e+11 1.323391e+11 1.359668e+11 9.049642e+10
5648 GDP (current US$) NY.GDP.MKTP.CD Antigua and Barbuda ATG 9.005519e+08 8.775148e+08 8.979889e+08 9.479556e+08 1.026181e+09 1.143715e+09 1.303548e+09 1.487300e+09 1.557541e+09 1.386444e+09 1.298256e+09 1.281337e+09 1.327107e+09 1.325426e+09 1.378830e+09 1.437756e+09
5649 GDP (current US$) NY.GDP.MKTP.CD Argentina ARG 2.842038e+11 2.686968e+11 9.772400e+10 1.275870e+11 1.646579e+11 1.987371e+11 2.325573e+11 2.875305e+11 3.615580e+11 3.329765e+11 4.236274e+11 5.301581e+11 5.459824e+11 5.520251e+11 5.263197e+11 5.947493e+11
5650 GDP (current US$) NY.GDP.MKTP.CD Armenia ARM 1.911564e+09 2.118468e+09 2.376335e+09 2.807061e+09 3.576615e+09 4.900470e+09 6.384452e+09 9.206301e+09 1.166204e+10 8.647937e+09 9.260286e+09 1.014211e+10 1.061932e+10 1.112146e+10 1.160951e+10 1.055334e+10
5651 GDP (current US$) NY.GDP.MKTP.CD Aruba ABW 1.873453e+09 1.896457e+09 1.961844e+09 2.044112e+09 2.254831e+09 2.360017e+09 2.469783e+09 2.677641e+09 2.843025e+09 2.553793e+09 2.453597e+09 2.637859e+09 2.615208e+09 2.727850e+09 2.790850e+09 2.962907e+09

By string

gdp = wb_metrics.query("series_name == 'GDP (current US$)'")
gdp.head(5)
series_name series_code country_name country_code yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015
5642 GDP (current US$) NY.GDP.MKTP.CD Afghanistan AFG 3.521418e+09 2.813572e+09 3.825701e+09 4.520947e+09 5.224897e+09 6.203257e+09 6.971758e+09 9.747886e+09 1.010930e+10 1.241615e+10 1.585667e+10 1.780510e+10 1.990733e+10 2.014642e+10 2.049713e+10 1.913422e+10
5643 GDP (current US$) NY.GDP.MKTP.CD Albania ALB 3.480355e+09 3.922101e+09 4.348068e+09 5.611496e+09 7.184686e+09 8.052077e+09 8.896075e+09 1.067732e+10 1.288135e+10 1.204421e+10 1.192693e+10 1.289076e+10 1.231983e+10 1.277622e+10 1.322815e+10 1.138685e+10
5644 GDP (current US$) NY.GDP.MKTP.CD Algeria DZA 5.479040e+10 5.941340e+10 6.151610e+10 7.348226e+10 9.191368e+10 1.070466e+11 1.230843e+11 1.424827e+11 1.803838e+11 1.503173e+11 1.777851e+11 2.183319e+11 2.271437e+11 2.297014e+11 2.389427e+11 1.874939e+11
5645 GDP (current US$) NY.GDP.MKTP.CD American Samoa ASM NaN NaN 5.120000e+08 5.240000e+08 5.090000e+08 5.000000e+08 4.930000e+08 5.180000e+08 5.600000e+08 6.750000e+08 5.730000e+08 5.700000e+08 6.400000e+08 6.380000e+08 6.430000e+08 6.730000e+08
5646 GDP (current US$) NY.GDP.MKTP.CD Andorra AND 1.432606e+09 1.548266e+09 1.764280e+09 2.366942e+09 2.900245e+09 3.161084e+09 3.459338e+09 3.957625e+09 4.102319e+09 3.688976e+09 3.449926e+09 3.629134e+09 3.188653e+09 3.193513e+09 3.271686e+09 2.789881e+09

Cleaning

From here, we need to get the data into a consistent and “tidy” format. For the first chunk, we remove unnecessary variables.

gdp = gdp.drop(columns = ["series_name", "series_code", "country_code"])
gdp
country_name yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015
5642 Afghanistan 3.521418e+09 2.813572e+09 3.825701e+09 4.520947e+09 5.224897e+09 6.203257e+09 6.971758e+09 9.747886e+09 1.010930e+10 1.241615e+10 1.585667e+10 1.780510e+10 1.990733e+10 2.014642e+10 2.049713e+10 1.913422e+10
5643 Albania 3.480355e+09 3.922101e+09 4.348068e+09 5.611496e+09 7.184686e+09 8.052077e+09 8.896075e+09 1.067732e+10 1.288135e+10 1.204421e+10 1.192693e+10 1.289076e+10 1.231983e+10 1.277622e+10 1.322815e+10 1.138685e+10
5644 Algeria 5.479040e+10 5.941340e+10 6.151610e+10 7.348226e+10 9.191368e+10 1.070466e+11 1.230843e+11 1.424827e+11 1.803838e+11 1.503173e+11 1.777851e+11 2.183319e+11 2.271437e+11 2.297014e+11 2.389427e+11 1.874939e+11
5645 American Samoa NaN NaN 5.120000e+08 5.240000e+08 5.090000e+08 5.000000e+08 4.930000e+08 5.180000e+08 5.600000e+08 6.750000e+08 5.730000e+08 5.700000e+08 6.400000e+08 6.380000e+08 6.430000e+08 6.730000e+08
5646 Andorra 1.432606e+09 1.548266e+09 1.764280e+09 2.366942e+09 2.900245e+09 3.161084e+09 3.459338e+09 3.957625e+09 4.102319e+09 3.688976e+09 3.449926e+09 3.629134e+09 3.188653e+09 3.193513e+09 3.271686e+09 2.789881e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5854 Virgin Islands (U.S.) NaN NaN 3.262000e+09 3.443000e+09 3.797000e+09 4.428000e+09 4.484000e+09 4.784000e+09 4.244000e+09 4.201000e+09 4.324000e+09 4.223000e+09 4.089000e+09 3.738000e+09 3.565000e+09 3.663000e+09
5855 West Bank and Gaza 4.313600e+09 4.003700e+09 3.555800e+09 3.968000e+09 4.603100e+09 5.125700e+09 5.348300e+09 5.815700e+09 7.310400e+09 8.085700e+09 9.681500e+09 1.118610e+10 1.220840e+10 1.351550e+10 1.398970e+10 1.397240e+10
5856 Yemen, Rep. 9.679317e+09 9.852991e+09 1.069343e+10 1.177753e+10 1.386763e+10 1.673157e+10 1.906314e+10 2.165053e+10 2.691086e+10 2.513028e+10 3.090675e+10 3.272642e+10 3.540133e+10 4.041523e+10 4.322859e+10 4.244449e+10
5857 Zambia 3.600632e+09 4.094441e+09 4.193850e+09 4.901870e+09 6.221110e+09 8.331870e+09 1.275686e+10 1.405696e+10 1.791086e+10 1.532834e+10 2.026556e+10 2.345952e+10 2.550306e+10 2.803724e+10 2.714102e+10 2.125122e+10
5858 Zimbabwe 6.689958e+09 6.777385e+09 6.342116e+09 5.727592e+09 5.805598e+09 5.755215e+09 5.443896e+09 5.291950e+09 4.415703e+09 9.665793e+09 1.204166e+10 1.410192e+10 1.711485e+10 1.909102e+10 1.949552e+10 1.996312e+10

217 rows × 17 columns

Change the ‘shape’ of our table to suit our analysis – using melt()

This is a fairly sophisticated feature of the pandas API that ‘pivots’ a table into a new structure and layout

To get data into tidy format, we need to stack the variables with the yr_ prefix on top of one another. We also need to convert the year column to an integer so that we can plot GDP across different countries.

gdp_long = pd.melt(gdp, id_vars = "country_name", var_name = "year", value_name = "gdp")
gdp_long["year"] = gdp_long["year"].str.replace("yr_", "").astype(int)
print(gdp_long)
               country_name  year           gdp
0               Afghanistan  2000  3.521418e+09
1                   Albania  2000  3.480355e+09
2                   Algeria  2000  5.479040e+10
3            American Samoa  2000           NaN
4                   Andorra  2000  1.432606e+09
...                     ...   ...           ...
3467  Virgin Islands (U.S.)  2015  3.663000e+09
3468     West Bank and Gaza  2015  1.397240e+10
3469            Yemen, Rep.  2015  4.244449e+10
3470                 Zambia  2015  2.125122e+10
3471               Zimbabwe  2015  1.996312e+10

[3472 rows x 3 columns]

Creating a graph of BRICS GDP

brics = ["Brazil", "Russian Federation", "India", "China", "South Africa"]
gdp_brics = gdp_long.query("country_name in @brics")
#gdp_brics["gdp"] = gdp_brics["gdp"] / 1e12
gdp_brics.loc[:, "gdp"] = gdp_brics["gdp"].astype(float) / 1e12
print(gdp_brics)
            country_name  year        gdp
26                Brazil  2000   0.655448
41                 China  2000   1.211332
88                 India  2000   0.468396
160   Russian Federation  2000   0.259710
176         South Africa  2000   0.151753
...                  ...   ...        ...
3281              Brazil  2015   1.802212
3296               China  2015  11.061573
3343               India  2015   2.103588
3415  Russian Federation  2015   1.363482
3431        South Africa  2015   0.346710

[80 rows x 3 columns]
/tmp/ipykernel_8975/973961371.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Aggregation by ‘group’

We want a summary statistic for each ‘group’

  • Summary statistics on each year
  • sum
  • mean
total_gdp_brics = gdp_brics.groupby("year", as_index = False)["gdp"].sum()
mean_gdp_brics = gdp_brics.groupby("year", as_index = False)["gdp"].mean()

Visualisation – “Visual Storytelling”

Visualise to observe any patterns – ‘seeing’ as a form of understanding

  • What happens between 2005 and 2010?
(
 ggplot()
 + geom_line(total_gdp_brics, aes("year", "gdp"), size = 2, linetype = "dashed")
 + theme_minimal()
 + labs(x = "Year", y = "GDP (trillions USD)",
        title = "The BRICS are growing",
        subtitle = "Total GDP from 2000 to 2015")
 )

(
 ggplot()
 + geom_line(gdp_brics, aes("year", "gdp", colour = "country_name"))
 + geom_line(mean_gdp_brics, aes("year", "gdp"), size = 2, linetype = "dashed")
 + theme_minimal()
 + labs(x = "Year", y = "GDP (trillions USD)", colour = "",
        title = "The lion's share of GDP growth in BRICS is from China",
        caption = "Note: black dotted line represents average growth")
 )

Converting BRICS 2015 data to JSON

A convenience function to produce JSON from tables

gdp_brics
country_name year gdp
26 Brazil 2000 0.655448
41 China 2000 1.211332
88 India 2000 0.468396
160 Russian Federation 2000 0.259710
176 South Africa 2000 0.151753
... ... ... ...
3281 Brazil 2015 1.802212
3296 China 2015 11.061573
3343 India 2015 2.103588
3415 Russian Federation 2015 1.363482
3431 South Africa 2015 0.346710

80 rows × 3 columns

gdp_brics_2015 = gdp_brics.query("year == 2015").reset_index(drop = True).drop(columns = 'year')
gdp_brics_2015[["country_name"]] = gdp_brics_2015[["country_name"]].astype("string")
gdp_brics_2015[["gdp"]] = gdp_brics_2015[["gdp"]].astype("float")
print(gdp_brics_2015.info())

import json
gdp_brics_2015_json = gdp_brics_2015.to_json()

from pprint import pprint as pp
pp(gdp_brics_2015_json, depth=2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country_name  5 non-null      string 
 1   gdp           5 non-null      float64
dtypes: float64(1), string(1)
memory usage: 208.0 bytes
None
('{"country_name":{"0":"Brazil","1":"China","2":"India","3":"Russian '
 'Federation","4":"South '
 'Africa"},"gdp":{"0":1.8022122069,"1":11.0615726186,"2":2.10358836,"3":1.3634821822,"4":0.3467097905}}')

JSON output

JSON notation has the same formatting structure as a Python ‘dictionary

{
  "country_name": {
    "0": "Brazil",
    "1": "China",
    "2": "India",
    "3":"Russian Federation",
    "4":"South Africa"},
  "gdp":{"0":1.8022122069,
        "1":11.0615726186,
        "2":2.10358836,
        "3":1.3634821822,
        "4":0.3467097905}
}

Re-formatted

gdp_brics_2015_json = gdp_brics_2015.to_json(orient = "index")


from pprint import pprint as pp
pp(gdp_brics_2015_json, depth=2)
('{"0":{"country_name":"Brazil","gdp":1.8022122069},"1":{"country_name":"China","gdp":11.0615726186},"2":{"country_name":"India","gdp":2.10358836},"3":{"country_name":"Russian '
 'Federation","gdp":1.3634821822},"4":{"country_name":"South '
 'Africa","gdp":0.3467097905}}')
{
  "0": {
    "country_name": "Brazil",
    "gdp": 1.8022122069
  },
  "1": {
    "country_name": "China",
    "gdp": 11.0615726186
  },
  "2": {
    "country_name": "India",
    "gdp": 2.10358836
  },
  "3": {
    "country_name":"Russian Federation","gdp":1.3634821822},
  "4":{"country_name":"South Africa","gdp":0.3467097905}
}
# print(
#       gdp_brics.query("year == 2015").reset_index(drop = True).drop(columns = "year").to_json()
#       )

Converting BRICS 2010 data to HTML

A convenience function to produce HTML tables (you can embed these into your own web reports)

html = gdp_brics.query("year == 2010").reset_index(drop = True).drop(columns = "year").to_html()
html_op = os.path.join('DS101A_data', 'gdp_brics_2010.html')

with open(html_op, 'w') as html_file:
    html_file.write(html)

print(html)
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>country_name</th>
      <th>gdp</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Brazil</td>
      <td>2.208838</td>
    </tr>
    <tr>
      <th>1</th>
      <td>China</td>
      <td>6.087192</td>
    </tr>
    <tr>
      <th>2</th>
      <td>India</td>
      <td>1.675616</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Russian Federation</td>
      <td>1.524917</td>
    </tr>
    <tr>
      <th>4</th>
      <td>South Africa</td>
      <td>0.417364</td>
    </tr>
  </tbody>
</table>
g7 = ["Canada", "France", "Germany", "Italy", "Japan", "United Kingdom", "United States"]
gdp_g7 = gdp_long.query("country_name in @g7")
#gdp_g7["gdp"] = gdp_g7["gdp"].astype(float) / 1e12
gdp_g7.loc[:,"gdp"] = gdp_g7["gdp"].astype(float) / 1e12
print(gdp_g7)
        country_name  year        gdp
35            Canada  2000   0.744773
67            France  2000   1.365640
72           Germany  2000   1.947982
95             Italy  2000   1.146677
97             Japan  2000   4.968359
...              ...   ...        ...
3327         Germany  2015   3.357586
3350           Italy  2015   1.836638
3352           Japan  2015   4.444931
3460  United Kingdom  2015   2.927911
3461   United States  2015  18.295019

[112 rows x 3 columns]
/tmp/ipykernel_8975/273775842.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
total_gdp_g7 = gdp_g7.groupby("year", as_index = False)["gdp"].sum()
mean_gdp_g7 = gdp_g7.groupby("year", as_index = False)["gdp"].mean()
(
 ggplot()
 + geom_line(total_gdp_g7, aes("year", "gdp"), size = 2, linetype = "dashed")
 + theme_minimal()
 + labs(x = "Year", y = "GDP (trillions USD)",
        title = "The G7 are mostly growing",
        subtitle = "Total GDP from 2000 to 2015")
 )

(
 ggplot()
 + geom_line(gdp_g7, aes("year", "gdp", colour = "country_name"))
 + geom_line(mean_gdp_g7, aes("year", "gdp"), size = 2, linetype = "dashed")
 + theme_minimal()
 + labs(x = "Year", y = "GDP (trillions USD)", colour = "",
        title = "The lion's share of GDP growth in G7 is from US,\n with the rest flatlining or in a downward slope!",
        caption = "Note: black dotted line represents average growth")
 )