LSE Data Science Institute | ME204 (2023/24) | Week 01 Day 03

# 💻 Week 01 – Day 03: Dataviz Practice

<span style="color:#E69F25;font-weight: 500 !important">(AFTERNOON NOTEBOOK)</span>

**DATE:** 10 July 2024

**AUTHOR:** [Alexander Soldatkin](https://www.area-studies.ox.ac.uk/people/alexander-soldatkin) (edited by Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io))

-----

## ⚙️ Setup

- Ensure the Python kernel has the necessary libraries: `numpy`, `pandas`, `matplotlib`, `lets-plot` and `numerize` (plus `statsmodels` and `plotly`, which are imported in sections 3 and 4).
- Ensure the `bakery.csv` file is in the `data` folder.

**Imports**

(It is good practice to import ALL the libraries you will be using at the start of your notebook.)

In [4]:
import numpy as np
import pandas as pd

from numerize import numerize as nz

from lets_plot import * # This imports all of lets-plot's ggplot2-style functions
LetsPlot.setup_html()

⬇️ **Downloading and reading the Data**

In today's lab, we will use a different dataset called Gapminder. It has data on life expectancy, GDP per capita, and population by country and year, among other variables.


In [5]:
# Note that `pd.read_csv()` can also take in a URL as an argument. 
gapminder = pd.read_csv('https://raw.githubusercontent.com/kirenz/datasets/master/gapminder.csv')

# We need to convert the year column to a date object
gapminder.year = pd.to_datetime(gapminder.year, format='%Y')
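
To see what the `format='%Y'` conversion does without downloading anything, here is a minimal sketch on a hand-made frame. The `toy` DataFrame is a hypothetical stand-in (its rows copy values that appear in the Gapminder output later in this notebook); only `pd.to_datetime` and the `.dt` accessor are real pandas API.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Gapminder table
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

# format='%Y' tells pandas that each value is a four-digit year,
# so 1952 becomes the timestamp 1952-01-01
toy["year"] = pd.to_datetime(toy["year"], format="%Y")

print(toy["year"].dtype)             # a datetime64 dtype
print(toy["year"].dt.year.tolist())  # the plain integer years again
```

The `.dt.year` accessor is handy whenever you need the integer year back, for example when filtering or faceting by year.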

# 1. Intro to Grammar of Graphics: Gapminder Data

In the morning session, you had a brief introduction to `lets-plot` and the layered grammar of graphics approach to building visualizations. Now, let's look at these concepts a bit more formally.

<iframe width="560" height="315" src="https://www.youtube.com/embed/hVimVzgtD6w?si=Mq7_Gnr8T07o_qIM&amp;start=210" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

<iframe width="560" height="315" src="https://www.youtube.com/embed/aJQmVZSAqlc?si=kKNQNwEs_DYcyJRd&amp;start=16" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Let's start layering 

Observe what changes in the code below when we add layers to the plot. If you just define the data and the aesthetics, you get `No layers in plot`.

In [6]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             )    
)

### Add the first layer: `geom_point()`

Now let's add a `geom_point()` layer to the plot, which is a simple scatter plot. 

In [75]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
        geom_point(alpha=0.5)        
)

### Change the scales

That's better! But the x-axis is hard to read because GDP per capita differs vastly across countries. Let's add `scale_x_log10()` to the plot to make it more readable.

In [76]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
        geom_point(alpha=0.5) +
        scale_x_log10() 
)
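
Why a log scale helps can be checked with plain arithmetic. The sketch below takes three 1952 GDP-per-capita values that appear in the Gapminder tables later in this notebook and compares their positions on a linear and on a log10 axis. The variable names are made up for illustration.

```python
import math

# 1952 GDP per capita for three countries (values from the Gapminder data)
gdp = {
    "Afghanistan": 779.445314,
    "Argentina": 5911.315053,
    "United States": 13990.48208,
}

# On a linear axis, position is proportional to the raw value
largest = max(gdp.values())
linear_share = {name: value / largest for name, value in gdp.items()}

# On a log10 axis, position is proportional to log10(value),
# so equal ratios (for example, a tenfold increase) take up equal distances
log_pos = {name: math.log10(value) for name, value in gdp.items()}

for name in gdp:
    print(f"{name:15s} linear share: {linear_share[name]:.2f}  log10: {log_pos[name]:.2f}")
```

On the linear axis the poorest country sits very close to zero, while on the log axis the three values are spread over little more than one unit.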

### Add some break points

Better still. The axis would be easier to read with explicit break points, though.

In [77]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            )
)

We can also make the break labels human-readable by formatting them with the `nz.numerize()` function, which comes from the separate `numerize` package rather than from `lets_plot`.

In [78]:
(
    ggplot(data = gapminder, mapping = aes(
                                x = 'gdpPercap', 
                                y = 'lifeExp', 
                                color = 'continent', 
                                size = 'pop')
             ) +
            #  add country name to hover
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            )
)
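
The break values above are typed twice, once for `breaks` and once for `labels`. Here is a minimal sketch of one way to avoid that, assuming a small hand-written formatter. `short_dollars` is a hypothetical helper, not part of `numerize` or `lets_plot`, and it only handles the whole-thousands range used here.

```python
# Define the breaks once and derive the labels from them
gdp_breaks = [1000, 10000, 25000, 50000, 100000]


def short_dollars(value: int) -> str:
    """Format a whole number of dollars as e.g. '$25K' (hypothetical helper)."""
    if value >= 1000 and value % 1000 == 0:
        return f"${value // 1000}K"
    return f"${value}"


gdp_labels = [short_dollars(b) for b in gdp_breaks]
print(gdp_labels)  # ['$1K', '$10K', '$25K', '$50K', '$100K']
```

Both lists could then be passed to the scale as `scale_x_log10(breaks=gdp_breaks, labels=gdp_labels)`, which keeps the two arguments in sync.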

### Explore different themes

Let's add on a theme and try changing it around to see what works best.

In [79]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
            #  add country name to hover
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            ) +
        theme_minimal()        
)

In [80]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
            #  add country name to hover
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            ) +
        theme_grey()       
)

### Fix the labels

Finally, add some labels to replace the default variable names.

In [81]:
(
    ggplot(data = gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
            #  add country name to hover
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            ) +
        theme_minimal() +
        labs(
            # title = 'Life Expectancy vs. GDP per Capita', # Why could we do without this?
            x = 'GDP per Capita',
            y = 'Life Expectancy',
            color = 'Continent',
            size = 'Population' 
        )
        
)

# 2. A few words on (tidy) data formats

## Untidy data

Real-world data are often in a ghastly Excel format that is reasonable for humans but awful for computers. Behold [(source)](https://www.linkedin.com/pulse/untidy-government-data-assault-democracy-peter-king/): 

![Governments, investment banks, and consultancies are notorious for this](https://media.licdn.com/dms/image/C5112AQGpCucou5GlOA/article-cover_image-shrink_720_1280/0/1576703880301?e=2147483647&v=beta&t=w6aAfQaujgqUl1E0dJV3qwPh3LkPybgwojYtqAquKkA)

Another one ([source](https://github.com/andrewheiss/datavizs24.classes.andrewheiss.com/blob/main/slides/img/03/untidy-example.png)):

![](https://github.com/andrewheiss/datavizs24.classes.andrewheiss.com/blob/main/slides/img/03/untidy-example.png?raw=true)

## Tidy data

Tidy data is also called 'long' data, and here's how you might rearrange the table above:  

![](https://github.com/andrewheiss/datavizs24.classes.andrewheiss.com/blob/main/slides/img/03/tidy-example.png?raw=true)

Here's how we might want to mess with the Gapminder dataset to make it untidy (or fit for human consumption): 

In [82]:
# original (tidy) format
display(gapminder.head())
gapminder.info()

|   | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952-01-01 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957-01-01 | 30.332 | 9240934 | 820.85303 |
| 2 | Afghanistan | Asia | 1962-01-01 | 31.997 | 10267083 | 853.10071 |
| 3 | Afghanistan | Asia | 1967-01-01 | 34.02 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972-01-01 | 36.088 | 13079460 | 739.981106 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   country    1704 non-null   object        
 1   continent  1704 non-null   object        
 2   year       1704 non-null   datetime64[ns]
 3   lifeExp    1704 non-null   float64       
 4   pop        1704 non-null   int64         
 5   gdpPercap  1704 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 80.0+ KB


💡 The function `pivot_table` might feel like an advanced step for some. Don't worry, we will practice it some more tomorrow!

In [83]:
# pivot table format, group by continent and year, value is lifeExp
gapminder_pivot_life_exp = gapminder.pivot_table(
                        index='year', columns='continent', 
                        values='lifeExp', aggfunc='mean')
display(gapminder_pivot_life_exp.head())

| year | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| 1952-01-01 | 39.1355 | 53.27984 | 46.314394 | 64.4085 | 69.255 |
| 1957-01-01 | 41.266346 | 55.96028 | 49.318544 | 66.703067 | 70.295 |
| 1962-01-01 | 43.319442 | 58.39876 | 51.563223 | 68.539233 | 71.085 |
| 1967-01-01 | 45.334538 | 60.41092 | 54.66364 | 69.7376 | 71.31 |
| 1972-01-01 | 47.450942 | 62.39492 | 57.319269 | 70.775033 | 71.91 |


In [84]:
# pivot table format, group by country and year, value is mean gdpPercap
gapminder_pivot_gdp = gapminder.pivot_table(
                        index='year', columns='country', 
                        values='gdpPercap', aggfunc='mean')
display(gapminder_pivot_gdp.head())

| year | Afghanistan | Albania | Algeria | Angola | Argentina | Australia | Austria | Bahrain | Bangladesh | Belgium | ... | Uganda | United Kingdom | United States | Uruguay | Venezuela | Vietnam | West Bank and Gaza | Yemen, Rep. | Zambia | Zimbabwe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1952-01-01 | 779.445314 | 1601.056136 | 2449.008185 | 3520.610273 | 5911.315053 | 10039.59564 | 6137.076492 | 9867.084765 | 684.244172 | 8343.105127 | ... | 734.753484 | 9979.508487 | 13990.48208 | 5716.766744 | 7689.799761 | 605.066492 | 1515.592329 | 781.717576 | 1147.388831 | 406.884115 |
| 1957-01-01 | 820.85303 | 1942.284244 | 3013.976023 | 3827.940465 | 6856.856212 | 10949.64959 | 8842.59803 | 11635.79945 | 661.637458 | 9714.960623 | ... | 774.371069 | 11283.17795 | 14847.12712 | 6150.772969 | 9802.466526 | 676.285448 | 1827.067742 | 804.830455 | 1311.956766 | 518.764268 |
| 1962-01-01 | 853.10071 | 2312.888958 | 2550.81688 | 4269.276742 | 7133.166023 | 12217.22686 | 10750.72111 | 12753.27514 | 686.341554 | 10991.20676 | ... | 767.27174 | 12477.17707 | 16173.14586 | 5603.357717 | 8422.974165 | 772.04916 | 2198.956312 | 825.623201 | 1452.725766 | 527.272182 |
| 1967-01-01 | 836.197138 | 2760.196931 | 3246.991771 | 5522.776375 | 8052.953021 | 14526.12465 | 12834.6024 | 14804.6727 | 721.186086 | 13149.04119 | ... | 908.918522 | 14142.85089 | 19530.36557 | 5444.61962 | 9541.474188 | 637.123289 | 2649.715007 | 862.442146 | 1777.077318 | 569.795071 |
| 1972-01-01 | 739.981106 | 3313.422188 | 4182.663766 | 5473.288005 | 9443.038526 | 16788.62948 | 16661.6256 | 18268.65839 | 630.233627 | 16672.14356 | ... | 950.735869 | 15895.11641 | 21806.03594 | 5703.408898 | 10505.25966 | 699.501644 | 3133.409277 | 1265.047031 | 1773.498265 | 799.362176 |
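
To see what `pivot_table` does on something small enough to check by eye, here is a sketch with a made-up six-row frame (the `toy` data and its numbers are invented for illustration). Each `(year, continent)` pair ends up in one cell holding the mean of the matching rows.

```python
import pandas as pd

# Invented long-format data: two years, two continents
toy = pd.DataFrame({
    "year":      [1952, 1952, 1952, 2007, 2007, 2007],
    "continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "lifeExp":   [30.0, 40.0, 65.0, 60.0, 70.0, 80.0],
})

# Rows become years, columns become continents, cells hold the mean lifeExp
wide = toy.pivot_table(index="year", columns="continent",
                       values="lifeExp", aggfunc="mean")
print(wide)
```

The two Asian rows for 1952 (30.0 and 40.0) collapse into a single cell with their mean, which is exactly the aggregation step that makes the wide table compact but lossy.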


## Moving from wide to long format

If you encounter an Excel table or a PDF report that emphasises human readability, you can use the `pd.melt()` function to convert it to the long format, which you can then analyse and visualise.

In [None]:
# melt() is roughly the inverse of pivot_table(): it turns column headers back into row values
gapminder_melt = gapminder_pivot_life_exp.melt(ignore_index=False).reset_index()
        
display(gapminder_melt.head())

In [86]:
gapminder_melt_life_exp = gapminder_pivot_life_exp.reset_index() \
    .melt(id_vars='year', 
          var_name='continent', 
          value_name='lifeExp')
    
display(gapminder_melt_life_exp.head())

|   | year | continent | lifeExp |
|---|---|---|---|
| 0 | 1952-01-01 | Africa | 39.1355 |
| 1 | 1957-01-01 | Africa | 41.266346 |
| 2 | 1962-01-01 | Africa | 43.319442 |
| 3 | 1967-01-01 | Africa | 45.334538 |
| 4 | 1972-01-01 | Africa | 47.450942 |


In [87]:
gapminder_melt_gdp = gapminder_pivot_gdp.reset_index() \
    .melt(id_vars='year', 
          var_name='country', 
          value_name='gdpPercap')
    
display(
    gapminder_melt_gdp.head(),
    gapminder_melt_gdp.tail()
    )

|   | year | country | gdpPercap |
|---|---|---|---|
| 0 | 1952-01-01 | Afghanistan | 779.445314 |
| 1 | 1957-01-01 | Afghanistan | 820.85303 |
| 2 | 1962-01-01 | Afghanistan | 853.10071 |
| 3 | 1967-01-01 | Afghanistan | 836.197138 |
| 4 | 1972-01-01 | Afghanistan | 739.981106 |

|   | year | country | gdpPercap |
|---|---|---|---|
| 1699 | 1987-01-01 | Zimbabwe | 706.157306 |
| 1700 | 1992-01-01 | Zimbabwe | 693.420786 |
| 1701 | 1997-01-01 | Zimbabwe | 792.44996 |
| 1702 | 2002-01-01 | Zimbabwe | 672.038623 |
| 1703 | 2007-01-01 | Zimbabwe | 469.709298 |
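
A quick way to convince yourself that `melt` undoes the pivot is a round trip on a tiny frame. The sketch below uses invented numbers and hypothetical variable names. It pivots a long table to wide and melts it back. Because there is exactly one value per `(year, country)` pair, the default mean aggregation leaves the values unchanged.

```python
import pandas as pd

# Invented long-format data with one value per (year, country) pair
long_df = pd.DataFrame({
    "year":      [1952, 1952, 2007, 2007],
    "country":   ["A", "B", "A", "B"],
    "gdpPercap": [100.0, 200.0, 300.0, 400.0],
})

# Long -> wide: one column per country
wide_df = long_df.pivot_table(index="year", columns="country", values="gdpPercap")

# Wide -> long again: column headers go back into a 'country' column
back = (
    wide_df.reset_index()
    .melt(id_vars="year", var_name="country", value_name="gdpPercap")
    .sort_values(["year", "country"])
    .reset_index(drop=True)
)

print(back)
```

The sort is only there because `melt` stacks the result column by column, so the row order differs from the original until it is sorted.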


# 3. A few more examples: faceting

### `mtcars` dataset

In [88]:
# !pip install statsmodels
import statsmodels.api as sm
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mtcars = pd.DataFrame(mtcars)
mtcars.head()

| rownames | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |


In [89]:
(
    ggplot(
        data = mtcars, 
        mapping = aes(
            x = 'wt', 
            y = 'mpg', 
            color = 'hp'
            )
        ) +
        geom_point() +
        geom_smooth(method = 'lm') +
        theme_minimal()
)
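
`geom_smooth(method = 'lm')` draws a fitted straight line through the points. As a rough illustration of what a least-squares line is, the sketch below fits one with `numpy.polyfit` to the five `mtcars` rows printed above. Treat it as a stand-in under stated assumptions: it uses only five cars, and it says nothing about how `lets_plot` computes its own line or confidence band internally.

```python
import numpy as np

# wt (weight) and mpg for the five cars shown in mtcars.head()
wt = np.array([2.62, 2.875, 2.32, 3.215, 3.44])
mpg = np.array([21.0, 21.0, 22.8, 21.4, 18.7])

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(wt, mpg, deg=1)

print(f"mpg = {intercept:.2f} + ({slope:.2f}) * wt")
```

A negative slope here matches the downward trend in the scatter plot: heavier cars tend to have lower `mpg`.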


In [90]:
(
    ggplot(
        data = mtcars, 
        mapping = aes(
            x = 'wt', 
            y = 'mpg', 
            color = 'cyl'
            )
        ) +
        geom_point() +
        geom_smooth(method = 'lm') +
        theme_minimal()
)

In [91]:
(
  ggplot(data = mtcars,
       mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'gear')) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_color_viridis() 
)

In [92]:
(
    ggplot(data = mtcars,
       mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'gear')) +
        geom_point() +
        geom_smooth(method = "lm") +
        scale_color_viridis() +
        facet_wrap('gear', ncol = 1) #<<

)
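
Conceptually, `facet_wrap('gear', ...)` splits the data by the values of `gear` and draws the same layers once per subset. Here is a minimal pandas sketch of that split, using the five `mtcars` rows printed earlier. The `cars` frame is a hand-typed subset, so the counts are not those of the full dataset.

```python
import pandas as pd

# Hand-typed subset of mtcars: the five rows from mtcars.head()
cars = pd.DataFrame({
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
    "mpg":  [21.0, 21.0, 22.8, 21.4, 18.7],
    "gear": [4, 4, 4, 3, 3],
})

# One group per distinct 'gear' value, analogous to one panel per facet
for gear, panel in cars.groupby("gear"):
    print(f"{gear} gears: {len(panel)} cars, mean mpg {panel['mpg'].mean():.2f}")
```

Each group plays the role of one panel, which is also why `geom_smooth` fits a separate line in every facet.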


In [93]:
(
     ggplot(data = mtcars,
          mapping = aes(x = 'disp',
                         y = 'mpg',
                         color = 'hp')) +
          geom_point() +
          geom_smooth(method = "lm") +
          scale_color_viridis() +
          facet_wrap('gear', ncol = 1) +
          labs(x = "Displacement", y = "Miles per gallon",  #<<
               color = "Horsepower",   #<<
               title = "Heavier cars get lower mileage",  #<<
               subtitle = "Displacement indicates weight(?)",  #<<
               caption = "I know nothing about cars")
)

In [94]:
(
    ggplot(data = mtcars,
          mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'hp')
                     ) +
     geom_point() +
     geom_smooth(method = "lm") +
     scale_color_viridis() +
     facet_wrap('gear', ncol = 1, 
                    format = '{} gears') +#<<               
     labs(x = "Displacement", y = "Miles per gallon",  #<<
          color = "Horsepower",   #<<
          title = "Heavier cars get lower mileage",  #<<
          subtitle = "Displacement indicates weight(?)",  #<<
          caption = "I know nothing about cars") +
     theme_bw() #<<
) 

In [95]:
(
    ggplot(data = mtcars,
          mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'hp')
     ) +
     geom_point() +
     geom_smooth(method = "lm") +
     scale_color_viridis(breaks = [100, 200, 300]) +
     facet_wrap('gear', ncol = 1, format = '{} gears') +
     labs(x = "Displacement", y = "Miles per gallon",  #<<
          color = "Horsepower",   #<<
          title = "Heavier cars get lower mileage",  #<<
          subtitle = "Displacement indicates weight(?)",  #<<
          caption = "I know nothing about cars") +
     theme_bw() + 
     theme(legend_position = "bottom", #<<
        plot_title = element_text(face = "bold")) #<<
) 

In [96]:
# get 2007 and 1952 data
_gapminder = gapminder[gapminder.year.isin([pd.to_datetime('2007', format='%Y'), pd.to_datetime('1952', format='%Y')])]

(
    ggplot(data = _gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
        geom_point(alpha=0.5) +
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            ) +
        theme_minimal() +
        labs(
            x = 'GDP per Capita',
            y = 'Life Expectancy',
            color = 'Continent',
            size = 'Population' 
        )
        
)
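
Building `Timestamp` objects just to filter on two years is a little verbose. One alternative, sketched here on an invented frame, is to compare the integer year from the `.dt` accessor. The explicit `.copy()` makes the result an independent DataFrame, which keeps pandas from warning if a column is reassigned later.

```python
import pandas as pd

# Invented stand-in for gapminder with a datetime 'year' column
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": pd.to_datetime(["1952", "1977", "2007", "1952", "1977", "2007"], format="%Y"),
})

# Keep only 1952 and 2007; .copy() gives an independent DataFrame
subset = toy[toy["year"].dt.year.isin([1952, 2007])].copy()

# Safe to overwrite the column now: store the plain integer year
subset["year"] = subset["year"].dt.year
print(subset)
```

Storing the year as a plain integer also gives cleaner panel titles when the column is later used for faceting.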

## Gapminder faceting

In [97]:
# get 2007 and 1952 data
_filt = [pd.to_datetime(x, format='%Y') for x in ['2007', '1952']]
# .copy() makes _gapminder an independent DataFrame, so the assignment
# below does not trigger pandas' SettingWithCopyWarning
_gapminder = gapminder[gapminder.year.isin(_filt)].copy()
_gapminder.year = _gapminder.year.dt.year

(
    ggplot(data = _gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
                    ) +
            geom_point(alpha=0.5) +
            scale_x_log10(
                breaks = [1000, 10000, 25000, 50000, 100000],
                labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
                ) +
            facet_wrap('year', ncol=1) + ## <<<<
            theme_minimal() +
            labs(
                x = 'GDP per Capita',
                y = 'Life Expectancy',
                color = 'Continent',
                size = 'Population' 
        )
        
)



In [98]:
(
    ggplot(data = gapminder, 
       mapping = aes(x = 'continent',
                     y = 'lifeExp',
                     fill = 'continent')
       ) +
        geom_violin(alpha=0.5) +
        geom_boxplot(alpha = 0.5) +
        guides(fill = 'none')  +# Turn off legend
        labs(
            # title = 'Life Expectancy by Continent',
            y = 'Life Expectancy',
            # x-axis label
            x = 'Continent'
        ) 
)

# 4. Getting fancier: `plotly`, dashboards, and animations

In [99]:
# !pip install plotly==5.22.0
import plotly.express as px

px.scatter(gapminder, 
           x="gdpPercap", 
           y="lifeExp", 
           animation_frame = "year",       
           animation_group = "country",
           size = "pop", color = "continent", 
           hover_name = "country",
           log_x = True, 
           size_max = 55, 
           range_x = [100,100000], 
           range_y=[25,90]
           )