{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**📝 Week 5 Summative Assessment** \n",
"\n",
"## DS105A – Data for Data Science\n",
"\n",
"**PURPOSE**: The purpose of this Jupyter Notebook is to document my answers to the DS105 W5 Summative assessment, show the steps I took and explain the rationale behind these decisions. It will also include some additional, relevant insights into the data itself.\n",
"\n",
"**CLICK THE IMAGE BELOW TO VIEW THE WEBSITE THAT WAS SCRAPED FOR THIS PROJECT:**\n",
"\n",
"\n",
" \n",
"\n",
"\n",
"**LAST REVISION:** 30th October 2023"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"Notice how well organise this notebook is! It even has pictures. It's easy to read and follow. It's also easy to see what the author has done and why. This is a great example of how to present your work.\n",
"\n",
"Another way to have a sense for what is in a notebook is by clicking on the **Outline** button above, while the notebook is open in VS Code. This will show you all the headings in the notebook. You can then click on a heading to jump to that section of the notebook.\n",
"\n",
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"This is a fantastic great practice! I know exactly what I need to type on the Terminal to get the same packages they used in the submission\n",
"\n",
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"- This person discovered these very useful Python packages (`ordered_set` and `collections`). It is not clear if those were discovered with the help of generative AI (they should have ellaborated a bit more on this at the end), but they used it well.\n",
"\n",
"- This person used other Python packages useful for text mining analysis (!) This made us go WOW! We weren't expecting any deep analysis this early in the course.\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Folders Created\n",
"- `Data` folder is created using a relative path (so that my username/name isn't given away by the path)\n",
"- This contains `Schedule.CSV` (from Part 1 of the Summative) and `Agenda.CSV` (from Part 2 of the Summative), these files are both saved directly to the `Data` Folder\n",
"- An `if` statement is used so that multiple `Data` folders are not generated, and whenever this Jupyter Notebook is ran the new `agenda.csv` and `schedule.csv` will replace the old version within the `Data` folder"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Folder 'Data' already exists in the current directory.\n"
]
}
],
"source": [
"import os\n",
"\n",
"# Get the current working directory (where your Python script is located)\n",
"current_directory = os.getcwd()\n",
"\n",
"# Specify the name of your new folder\n",
"folder_name = 'Data'\n",
"\n",
"# Construct the relative path to the new folder\n",
"data_folder_path = os.path.join(current_directory, folder_name)\n",
"\n",
"# Check if the folder doesn't exist already, then create it\n",
"if not os.path.exists(data_folder_path):\n",
" os.makedirs(data_folder_path)\n",
" print(f\"Folder '{folder_name}' created successfully in the current directory.\")\n",
"else:\n",
" print(f\"Folder '{folder_name}' already exists in the current directory.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"This is great! They found a way to make this replicable and reproducible. The [`os` library](https://docs.python.org/3/library/os.html), from the standard Python library, is used to create a folder and save the files there, emulating what one would do in the Terminal. This is a very good practice.\n",
"\n",
"If you aren't familiar with the `os` library, you could have achieved similar greatness by adding a markdown cell with instructions: \n",
"\n",
"> Go to the Terminal and type the following before running the rest of this notebook\n",
">\n",
"> ```console\n",
"> mkdir Data\n",
"> ```\n",
">\n",
"\n",
"\n",
"
"
]
},
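{
"cell_type": "markdown",
"metadata": {},
"source": [
"The folder-creation step above can also be sketched with `pathlib`, which is likewise part of the Python standard library. This is only a minimal sketch, not the approach used in the submission: the helper name `ensure_folder` and its `base` parameter are hypothetical, and `exist_ok=True` is assumed to be an acceptable stand-in for the explicit `if` check.\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"\n",
"\n",
"def ensure_folder(folder_name, base=None):\n",
"    # Hypothetical helper: create `folder_name` inside `base` (default: current working directory)\n",
"    base_path = Path(base) if base is not None else Path.cwd()\n",
"    folder_path = base_path / folder_name\n",
"    # exist_ok=True makes the call safe to repeat, so re-running the notebook does not fail\n",
"    folder_path.mkdir(parents=True, exist_ok=True)\n",
"    return folder_path\n",
"\n",
"\n",
"data_folder_path = ensure_folder('Data')\n",
"schedule_csv_file_path = data_folder_path / 'schedule.csv'\n",
"```\n",
"\n",
"Because the path is built from the current working directory at run time, nothing user-specific is written into the notebook itself."
]
},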
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"## 🔐 Requesting a Web Page\n",
"\n",
"### Requesting the Website and using a Selector\n",
"\n",
"- We store the URL of the target website, send a GET request to the specified URL and store the server's response\n",
"- We then create a Scrapy Selector object which allows us to apply CSS selectors to extract specific data from the HTML code"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# This is the address of the website we want to scrape\n",
"my_url = 'https://socialdatascience.network/index.html#schedule'\n",
"\n",
"# We set a GET request to the website\n",
"response = requests.get(my_url)\n",
"\n",
"# Parse the HTML code using Scrapy Selector\n",
"sel = Selector(text=response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"## 🔍 Part 1: Scraping for titles, speakers, dates and links to CIVICA Seminar Schedule\n",
"\n",
"### Scrape the **names of all the events** and save them to a list\n",
"\n",
"- All event titles are represented within a `
` tag (found by inspecting the page)\n",
"- Therefore: `h6.card-title` is a good CSS Selector to use"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Saving all the event titles as a list\n",
"titles = sel.css('h6.card-title ::text').getall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scrape the **links to all the events** and save them to a list\n",
"\n",
"- From inspecting the page we can see that all event links are under the attribute `href` and are represented within `
`\n",
"- Therefore: `div.card-body a ::attr(href)` is a good selector to use\n",
"- The generated list has duplicates, so to remove these I used an ordered set (a normal set would have changed the order of the links leading to mismatched data in the final table). Although I considered using a for loop to remove duplicates, I found that the ordered set was much simpler\n",
"- I also added the prefix of \"https://socialdatascience.network/\" to all the links to make them full links instead of partial"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# Saving all the links to events as a list\n",
"from ordered_set import OrderedSet\n",
"partial_links = OrderedSet(sel.css('div.card-body a ::attr(href)').getall())\n",
"\n",
"# Adding a prefix to all the links in the list\n",
"address = \"https://socialdatascience.network/\" \n",
"links = []\n",
"for partial_link in partial_links:\n",
" full_link = address + partial_link\n",
" links.append(full_link)"
]
},
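{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the for-loop alternative mentioned above can be sketched with the standard library alone. This is an illustration under assumptions, not the code used in the submission: `unique_in_order` is a hypothetical helper, the sample `href` values are a small hand-picked subset, and `urljoin` is swapped in for plain string concatenation.\n",
"\n",
"```python\n",
"from urllib.parse import urljoin\n",
"\n",
"\n",
"def unique_in_order(items):\n",
"    # Hypothetical helper: keep the first occurrence of each item, preserving the original order\n",
"    seen = set()\n",
"    result = []\n",
"    for item in items:\n",
"        if item not in seen:\n",
"            seen.add(item)\n",
"            result.append(item)\n",
"    return result\n",
"\n",
"\n",
"base_address = 'https://socialdatascience.network/'\n",
"# Sample partial links with one duplicate (illustrative values only)\n",
"sample_hrefs = ['sess9.html', 'sess7.html', 'sess9.html', 'launch.html']\n",
"\n",
"# urljoin resolves each partial link against the base address\n",
"full_links = [urljoin(base_address, href) for href in unique_in_order(sample_hrefs)]\n",
"print(full_links)\n",
"```\n",
"\n",
"The order of first appearance is kept, which is what keeps the links aligned with the titles, speakers and dates scraped from the same page."
]
},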
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"Their solution is elegant in its use of the `ordered_set` package to remove duplicated entries. If you were dealing with thousand, or millions, of entries, this would be a very efficient way to remove duplicates.\n",
"\n",
"But even without knowledge of this package, you could have achieved the same result by using a regular, slightly less efficient, Python `set` data structure. A concise alternative to the chunk above would be to use list comprehension:\n",
"\n",
"```python\n",
"partial_links = set(sel.css('div.card-body a ::attr(href)').getall())\n",
"links = [address+partial_link for partial_link in partial_links]\n",
"```\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scrape the **names of all the speakers** and **dates of all the events** and save them to seperate lists\n",
"\n",
"- All speakers names are represented within a `
` tag which would be our ideal CSS selector\n",
"- However, both the speaker name and date are contained within this, separated by a ` `\n",
"- We can separate the speaker name from the date of the event using a for loop, and ammend different lists to keep the names and dates in separate lists"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"speakers = []\n",
"dates = []\n",
"\n",
"for i in sel.css('div.card-body p ::text').getall():\n",
" i = i.lstrip()\n",
" if i.startswith('Speaker:'):\n",
" speaker_name = i.replace('Speaker: ', '')\n",
" speakers.append(speaker_name)\n",
" elif i.startswith('Date:'):\n",
" date = i.replace('Date: ', '')\n",
" dates.append(date)\n",
" else:\n",
" speakers.append('')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert the lists to a **pandas data frame** and save it to a **CSV file**\n",
"\n",
"- Note: In order to use the `pd.DataFrame()` function to create a data frame all the lists/arrays must be the same length. This was checked using `if` statement\n",
"- The final CSV file (schedule.csv) is saved directly to the data folder creating in the Setting Up stage\n",
"- The final CSV file (schedule.csv) can be viewed as a table\n",
"- If an event has no speakers, then the relevant cell of the table will be left empty"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All lists are same length\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
title
\n",
"
speaker
\n",
"
date
\n",
"
link
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Promoting the systematic use of real-world dat...
\n",
"
Dr. Divya Srivastava, LSE
\n",
"
Wednesday, 22 November 2023
\n",
"
https://socialdatascience.network/fall2023/ses...
\n",
"
\n",
"
\n",
"
1
\n",
"
Data science for the Sustainable Development G...
\n",
"
Prof. Elisa Omodei, CEU
\n",
"
Wednesday, 18 October 2023
\n",
"
https://socialdatascience.network/fall2023/ses...
\n",
"
\n",
"
\n",
"
2
\n",
"
CentralBankRoBERTa: A Fine-Tuned Large Languag...
\n",
"
Moritz Pfeifer & Vincent Philipp Marohl
\n",
"
Wednesday, 27 September 2023
\n",
"
https://socialdatascience.network/fall2023/ses...
\n",
"
\n",
"
\n",
"
3
\n",
"
The Evolution of the Climate Discourse on Twit...
\n",
"
Dr. Max Falkenberg
\n",
"
Wednesday, 13 September 2023
\n",
"
https://socialdatascience.network/fall2023/ses...
\n",
"
\n",
"
\n",
"
4
\n",
"
The Handbook of Computational Social Science f...
\n",
"
Dr. Eleonora Bertoni
\n",
"
Wednesday, 31 May 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
5
\n",
"
Artificial Intelligence, Algorithmic Recommend...
\n",
"
Prof. Giacomo Calzolari
\n",
"
Wednesday, 03 May 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
6
\n",
"
Exploring A New Model of Industry/Academic Col...
\n",
"
Prof. Pablo Barberá
\n",
"
Wednesday, 19 April 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
7
\n",
"
Using Multimodal Neural Networks to Better Und...
\n",
"
Prof. Bryce Jensen Dietrich
\n",
"
Wednesday, 22 March 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
8
\n",
"
Models, mathematics, and data science: how to ...
\n",
"
Dr. Erica Thompson
\n",
"
Wednesday, 08 March 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
9
\n",
"
CIVICA Conference on European Polarisation
\n",
"
NaN
\n",
"
Wednesday, 15 February 2023
\n",
"
https://socialdatascience.network/polarisation...
\n",
"
\n",
"
\n",
"
10
\n",
"
New Faces of Bias in Online Platforms
\n",
"
Prof. Aniko Hannak
\n",
"
Wednesday, 08 February 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
11
\n",
"
Introducing the Online Harms Observatory: AI p...
\n",
"
Pica Johnsson
\n",
"
Wednesday, 11 January 2023
\n",
"
https://socialdatascience.network/spring2023/s...
\n",
"
\n",
"
\n",
"
12
\n",
"
Using Open Source Data Streams and Surveys to ...
\n",
"
Prof. Lisa Singh
\n",
"
Wednesday, 02 November 2022
\n",
"
https://socialdatascience.network/fall2022/ses...
\n",
"
\n",
"
\n",
"
13
\n",
"
Does Epistemic Vice Explain Corporate Misconduct?
\n",
"
Dr. Marco Meyer
\n",
"
Wednesday, 19 October 2022
\n",
"
https://socialdatascience.network/fall2022/ses...
\n",
"
\n",
"
\n",
"
14
\n",
"
Becoming a data scientist: what it means to pu...
\n",
"
Prof. Anne Beaulieu
\n",
"
Wednesday, 14 September 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
15
\n",
"
The Making of a French Migration Crisis
\n",
"
Dr. Michelle Reddy & Dr. Hélène Thiollet
\n",
"
Wednesday, 15 June 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
16
\n",
"
A New Approach to Visualizing Spatial Exposure...
\n",
"
Prof. Stephanie Lackner
\n",
"
Wednesday, 01 June 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
17
\n",
"
Modeling Sustainable Development from the Bott...
\n",
"
Dr. Omar A. Guerrero
\n",
"
Wednesday, 04 May 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
18
\n",
"
Internet Communities and the French Presidenti...
\n",
"
Prof. David Chavalarias
\n",
"
Wednesday, 20 April 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
19
\n",
"
The Science of Success: Quantifying Outcomes i...
\n",
"
Prof. Laszlo Barabasi
\n",
"
Wednesday, 09 March 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
20
\n",
"
Embedding Regression: Models for Context-Speci...
\n",
"
Prof. Arthur Spirling
\n",
"
Wednesday, 23 February 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
21
\n",
"
Information and Irregular Migration: Evidence ...
\n",
"
Dr. Alexandra Scacco
\n",
"
Wednesday, 09 February 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
22
\n",
"
Adjusting for Confounding with Text Matching
\n",
"
Prof. Margaret Roberts
\n",
"
Wednesday, 26 January 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
23
\n",
"
What is Data Feminist?
\n",
"
Prof. Lauren Klein
\n",
"
Wednesday, 12 January 2022
\n",
"
https://socialdatascience.network/spring2022/s...
\n",
"
\n",
"
\n",
"
24
\n",
"
The Principles of Collective Learning
\n",
"
Prof. Cesar A. Hidalgo
\n",
"
Wednesday, 1 December 2021
\n",
"
https://socialdatascience.network/fall2021/ses...
\n",
"
\n",
"
\n",
"
25
\n",
"
More Than Words: How Political Rhetoric Shapes...
\n",
"
Prof. Christopher Lucas
\n",
"
Wednesday, 3 November 2021
\n",
"
https://socialdatascience.network/fall2021/ses...
\n",
"
\n",
"
\n",
"
26
\n",
"
Serendipity or Confinement? Deconstructing Alg...
\n",
"
Prof. Camille Roth
\n",
"
Wednesday, 20 October 2021
\n",
"
https://socialdatascience.network/fall2021/ses...
\n",
"
\n",
"
\n",
"
27
\n",
"
Framing a Protest: Determinants and Effects of...
\n",
"
Prof. Michelle Torrest
\n",
"
Wednesday, 06 October 2021
\n",
"
https://socialdatascience.network/fall2021/ses...
\n",
"
\n",
"
\n",
"
28
\n",
"
Understanding Beautiful Places with AI
\n",
"
Prof. Suzy Moat
\n",
"
Wednesday, 22 September, 2021
\n",
"
https://socialdatascience.network/sess9.html
\n",
"
\n",
"
\n",
"
29
\n",
"
Incentives and Covid-19 Vaccination Uptake
\n",
"
Prof. Macartan Humphreys
\n",
"
Wednesday, 08 September 2021
\n",
"
https://socialdatascience.network/fall2021/ses...
\n",
"
\n",
"
\n",
"
30
\n",
"
The Art of Quantitative Editing
\n",
"
Dr. Laura Bronner
\n",
"
Wednesday, 02 June 2021
\n",
"
https://socialdatascience.network/sess7.html
\n",
"
\n",
"
\n",
"
31
\n",
"
Breaking the Social Media Prism
\n",
"
Chris Bail
\n",
"
Wednesday, 19 May, 2021
\n",
"
https://socialdatascience.network/sess6.html
\n",
"
\n",
"
\n",
"
32
\n",
"
Using Public Video Cameras to Detect Racial Di...
\n",
"
Dr. Melissa Sands
\n",
"
Wednesday, 05 May 2021
\n",
"
https://socialdatascience.network/sess5.html
\n",
"
\n",
"
\n",
"
33
\n",
"
How to Detect Fake News Before It Is Written
\n",
"
Dr. Preslav Nakov
\n",
"
Wednesday, 21 April 2021
\n",
"
https://socialdatascience.network/sess4.html
\n",
"
\n",
"
\n",
"
34
\n",
"
Negotiating with AI: Fairness in the Labor Market
\n",
"
Prof. Christo Wilson
\n",
"
Wednesday, 07 April, 2021
\n",
"
https://socialdatascience.network/sess3.html
\n",
"
\n",
"
\n",
"
35
\n",
"
Tracking Covid-19 with the Financial Times
\n",
"
John Burn-Murdoch
\n",
"
Wednesday, 24 March 2021
\n",
"
https://socialdatascience.network/sess2.html
\n",
"
\n",
"
\n",
"
36
\n",
"
Police Diversity to Prevent Violence: Does It ...
\n",
"
Roman Rivera
\n",
"
Wednesday, 17 March 2021
\n",
"
https://socialdatascience.network/sess1.html
\n",
"
\n",
"
\n",
"
37
\n",
"
Data Science in the Time of Covid and What Hap...
\n",
"
NaN
\n",
"
Wednesday, 24 February, 2021
\n",
"
https://socialdatascience.network/launch.html
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" title \\\n",
"0 Promoting the systematic use of real-world dat... \n",
"1 Data science for the Sustainable Development G... \n",
"2 CentralBankRoBERTa: A Fine-Tuned Large Languag... \n",
"3 The Evolution of the Climate Discourse on Twit... \n",
"4 The Handbook of Computational Social Science f... \n",
"5 Artificial Intelligence, Algorithmic Recommend... \n",
"6 Exploring A New Model of Industry/Academic Col... \n",
"7 Using Multimodal Neural Networks to Better Und... \n",
"8 Models, mathematics, and data science: how to ... \n",
"9 CIVICA Conference on European Polarisation \n",
"10 New Faces of Bias in Online Platforms \n",
"11 Introducing the Online Harms Observatory: AI p... \n",
"12 Using Open Source Data Streams and Surveys to ... \n",
"13 Does Epistemic Vice Explain Corporate Misconduct? \n",
"14 Becoming a data scientist: what it means to pu... \n",
"15 The Making of a French Migration Crisis \n",
"16 A New Approach to Visualizing Spatial Exposure... \n",
"17 Modeling Sustainable Development from the Bott... \n",
"18 Internet Communities and the French Presidenti... \n",
"19 The Science of Success: Quantifying Outcomes i... \n",
"20 Embedding Regression: Models for Context-Speci... \n",
"21 Information and Irregular Migration: Evidence ... \n",
"22 Adjusting for Confounding with Text Matching \n",
"23 What is Data Feminist? \n",
"24 The Principles of Collective Learning \n",
"25 More Than Words: How Political Rhetoric Shapes... \n",
"26 Serendipity or Confinement? Deconstructing Alg... \n",
"27 Framing a Protest: Determinants and Effects of... \n",
"28 Understanding Beautiful Places with AI \n",
"29 Incentives and Covid-19 Vaccination Uptake \n",
"30 The Art of Quantitative Editing \n",
"31 Breaking the Social Media Prism \n",
"32 Using Public Video Cameras to Detect Racial Di... \n",
"33 How to Detect Fake News Before It Is Written \n",
"34 Negotiating with AI: Fairness in the Labor Market \n",
"35 Tracking Covid-19 with the Financial Times \n",
"36 Police Diversity to Prevent Violence: Does It ... \n",
"37 Data Science in the Time of Covid and What Hap... \n",
"\n",
" speaker date \\\n",
"0 Dr. Divya Srivastava, LSE Wednesday, 22 November 2023 \n",
"1 Prof. Elisa Omodei, CEU Wednesday, 18 October 2023 \n",
"2 Moritz Pfeifer & Vincent Philipp Marohl Wednesday, 27 September 2023 \n",
"3 Dr. Max Falkenberg Wednesday, 13 September 2023 \n",
"4 Dr. Eleonora Bertoni Wednesday, 31 May 2023 \n",
"5 Prof. Giacomo Calzolari Wednesday, 03 May 2023 \n",
"6 Prof. Pablo Barberá Wednesday, 19 April 2023 \n",
"7 Prof. Bryce Jensen Dietrich Wednesday, 22 March 2023 \n",
"8 Dr. Erica Thompson Wednesday, 08 March 2023 \n",
"9 NaN Wednesday, 15 February 2023 \n",
"10 Prof. Aniko Hannak Wednesday, 08 February 2023 \n",
"11 Pica Johnsson Wednesday, 11 January 2023 \n",
"12 Prof. Lisa Singh Wednesday, 02 November 2022 \n",
"13 Dr. Marco Meyer Wednesday, 19 October 2022 \n",
"14 Prof. Anne Beaulieu Wednesday, 14 September 2022 \n",
"15 Dr. Michelle Reddy & Dr. Hélène Thiollet Wednesday, 15 June 2022 \n",
"16 Prof. Stephanie Lackner Wednesday, 01 June 2022 \n",
"17 Dr. Omar A. Guerrero Wednesday, 04 May 2022 \n",
"18 Prof. David Chavalarias Wednesday, 20 April 2022 \n",
"19 Prof. Laszlo Barabasi Wednesday, 09 March 2022 \n",
"20 Prof. Arthur Spirling Wednesday, 23 February 2022 \n",
"21 Dr. Alexandra Scacco Wednesday, 09 February 2022 \n",
"22 Prof. Margaret Roberts Wednesday, 26 January 2022 \n",
"23 Prof. Lauren Klein Wednesday, 12 January 2022 \n",
"24 Prof. Cesar A. Hidalgo Wednesday, 1 December 2021 \n",
"25 Prof. Christopher Lucas Wednesday, 3 November 2021 \n",
"26 Prof. Camille Roth Wednesday, 20 October 2021 \n",
"27 Prof. Michelle Torrest Wednesday, 06 October 2021 \n",
"28 Prof. Suzy Moat Wednesday, 22 September, 2021 \n",
"29 Prof. Macartan Humphreys Wednesday, 08 September 2021 \n",
"30 Dr. Laura Bronner Wednesday, 02 June 2021 \n",
"31 Chris Bail Wednesday, 19 May, 2021 \n",
"32 Dr. Melissa Sands Wednesday, 05 May 2021 \n",
"33 Dr. Preslav Nakov Wednesday, 21 April 2021 \n",
"34 Prof. Christo Wilson Wednesday, 07 April, 2021 \n",
"35 John Burn-Murdoch Wednesday, 24 March 2021 \n",
"36 Roman Rivera Wednesday, 17 March 2021 \n",
"37 NaN Wednesday, 24 February, 2021 \n",
"\n",
" link \n",
"0 https://socialdatascience.network/fall2023/ses... \n",
"1 https://socialdatascience.network/fall2023/ses... \n",
"2 https://socialdatascience.network/fall2023/ses... \n",
"3 https://socialdatascience.network/fall2023/ses... \n",
"4 https://socialdatascience.network/spring2023/s... \n",
"5 https://socialdatascience.network/spring2023/s... \n",
"6 https://socialdatascience.network/spring2023/s... \n",
"7 https://socialdatascience.network/spring2023/s... \n",
"8 https://socialdatascience.network/spring2023/s... \n",
"9 https://socialdatascience.network/polarisation... \n",
"10 https://socialdatascience.network/spring2023/s... \n",
"11 https://socialdatascience.network/spring2023/s... \n",
"12 https://socialdatascience.network/fall2022/ses... \n",
"13 https://socialdatascience.network/fall2022/ses... \n",
"14 https://socialdatascience.network/spring2022/s... \n",
"15 https://socialdatascience.network/spring2022/s... \n",
"16 https://socialdatascience.network/spring2022/s... \n",
"17 https://socialdatascience.network/spring2022/s... \n",
"18 https://socialdatascience.network/spring2022/s... \n",
"19 https://socialdatascience.network/spring2022/s... \n",
"20 https://socialdatascience.network/spring2022/s... \n",
"21 https://socialdatascience.network/spring2022/s... \n",
"22 https://socialdatascience.network/spring2022/s... \n",
"23 https://socialdatascience.network/spring2022/s... \n",
"24 https://socialdatascience.network/fall2021/ses... \n",
"25 https://socialdatascience.network/fall2021/ses... \n",
"26 https://socialdatascience.network/fall2021/ses... \n",
"27 https://socialdatascience.network/fall2021/ses... \n",
"28 https://socialdatascience.network/sess9.html \n",
"29 https://socialdatascience.network/fall2021/ses... \n",
"30 https://socialdatascience.network/sess7.html \n",
"31 https://socialdatascience.network/sess6.html \n",
"32 https://socialdatascience.network/sess5.html \n",
"33 https://socialdatascience.network/sess4.html \n",
"34 https://socialdatascience.network/sess3.html \n",
"35 https://socialdatascience.network/sess2.html \n",
"36 https://socialdatascience.network/sess1.html \n",
"37 https://socialdatascience.network/launch.html "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check if length of lists are the same\n",
"if len(titles) == len(speakers) == len(dates) == len(links):\n",
" print(\"All lists are same length\")\n",
"else:\n",
" print(\"Lists are not same length, please ammend\")\n",
"\n",
"# Specify the file path including the data folder to save the schedule.csv file\n",
"schedule_csv_file_path = os.path.join(data_folder_path, 'schedule.csv')\n",
"\n",
"# Use the `pd.DataFrame()` function to create a data frame + Save pandas data frame df to a CSV file.\n",
"df_schedule = pd.DataFrame({'title': titles, 'speaker': speakers, 'date': dates, 'link': links})\n",
"df_schedule.to_csv(schedule_csv_file_path, index=False)\n",
"\n",
"# Double-check that the CSV file was created correctly by opening it using pandas\n",
"df_schedule = pd.read_csv(schedule_csv_file_path)\n",
"df_schedule"
]
},
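{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `date` column in the table above holds plain strings, and a few rows carry an extra comma before the year (for example `Wednesday, 22 September, 2021`). If real dates were needed for sorting or filtering, one possible clean-up is sketched below. It is not part of the submission: `parse_event_date` is a hypothetical helper, and it assumes every date follows the `Weekday, day Month year` pattern seen in the table, with English month names.\n",
"\n",
"```python\n",
"from datetime import datetime\n",
"\n",
"\n",
"def parse_event_date(date_text):\n",
"    # Split off the weekday, then drop the optional comma before the year\n",
"    weekday, _, rest = date_text.partition(', ')\n",
"    cleaned = weekday + ', ' + rest.replace(',', '')\n",
"    # %d accepts both '1' and '01', so unpadded days parse as well\n",
"    return datetime.strptime(cleaned, '%A, %d %B %Y').date()\n",
"\n",
"\n",
"sample_dates = [\n",
"    'Wednesday, 22 November 2023',\n",
"    'Wednesday, 22 September, 2021',\n",
"    'Wednesday, 1 December 2021',\n",
"]\n",
"parsed_dates = [parse_event_date(text) for text in sample_dates]\n",
"print(parsed_dates)\n",
"```\n",
"\n",
"The three sample strings are copied from the table; the parsed values could then be stored in a separate column so that the original text is kept as scraped."
]
},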
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"## 📆 Part 2: Scraping for talk links, titles, speakers and descriptions from CIVICA agendas\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scrape the **titles, speakers and descriptions of talks** and save them to a list (using df_schedule as reference)\n",
"\n",
"The agenda/talks for each event are found on the individual event page. A `for` loop is used to go through each event link and scrape the relevant data\n",
"\n",
"Talk Titles\n",
"+ All individual talk titles are represented within a `
` tag under a `
` (found by inspecting the event page)\n",
"+ However, this also includes the name of the assigned speaker for the talk within a further `` tag (we do not want to include this)\n",
"+ Therefore: `'div.row.schedule-item h4:not(:has(span))::text'` is a good CSS Selector to use as it included the title of the talk but excludes the name of the speaker\n",
"\n",
"\n",
"Talk Speakers\n",
"+ All talk speakers are represented within a `` tag under a `
` tag under `
` (found by inspecting the event page)\n",
"+ Therefore: `'div.row.schedule-item h4 span ::text'` is a good CSS Selector to use as it includes the name of the speaker but excludes other information under the `
` tag like title of the talk\n",
"\n",
"\n",
"Talk Descriptions\n",
"+ All talk descriptions are represented within a `
` tag under `
` (found by inspecting the event page)\n",
"+ Therefore: `'div.row.schedule-item p'` is a good CSS Selector to use"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"talk_title = []\n",
"talk_speaker = []\n",
"talk_description = []\n",
"\n",
"# For loop to go through each event link using the existing event links column from Part 1 df_schedule dataframe\n",
"for i in range(len(links)):\n",
" url = links[i]\n",
" response = requests.get(url)\n",
" sel = Selector(text=response.text)\n",
"\n",
" # Getting all the talk titles and excluding the names of the talk speakers\n",
" event_agenda = sel.css('div.row.schedule-item h4:not(:has(span))::text').getall()\n",
" talk_title.append(event_agenda)\n",
"\n",
" # Getting all the talk speakers\n",
" event_speaker = sel.css('div.row.schedule-item h4 span ::text').getall()\n",
" talk_speaker.append(event_speaker)\n",
"\n",
" # Getting all the talk descriptions\n",
" event_description = sel.css('div.row.schedule-item p').getall()\n",
" talk_description.append(event_description)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"The only thing lending this solution down is that this for loop could have been a function!\n",
"\n",
"
"
]
},
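{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way the loop could be turned into a function is sketched below. It is a hedged illustration rather than the submitted solution: `parse_agenda` is a hypothetical name, and the sketch assumes that every talk sits in its own `div.row.schedule-item` block, so that title, speaker and description can be read together per talk instead of being matched up by position afterwards.\n",
"\n",
"```python\n",
"from scrapy import Selector\n",
"\n",
"\n",
"def parse_agenda(html, event_link):\n",
"    # Hypothetical helper: return one dictionary per talk found in the event page HTML\n",
"    sel = Selector(text=html)\n",
"    rows = []\n",
"    for item in sel.css('div.row.schedule-item'):\n",
"        # Direct text of the h4 only, so the speaker inside the span is left out\n",
"        title = ''.join(item.css('h4::text').getall()).strip()\n",
"        # Text inside the span; an empty string when the talk has no speaker\n",
"        speaker = ' '.join(item.css('h4 span ::text').getall()).strip()\n",
"        # Join the text pieces of the paragraph so nested tags do not split the description\n",
"        pieces = [text.strip() for text in item.css('p ::text').getall()]\n",
"        description = ' '.join(piece for piece in pieces if piece)\n",
"        rows.append({\n",
"            'event_link': event_link,\n",
"            'talk_title': title,\n",
"            'talk_speakers': speaker,\n",
"            'talk_description': description,\n",
"        })\n",
"    return rows\n",
"```\n",
"\n",
"Inside the loop, a call such as `parse_agenda(requests.get(url).text, url)` would return the rows for one event, and the rows from all events could be collected in a single list and passed to `pd.DataFrame()`."
]
},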
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Flattening/Ammending lists for links, titles, speakers and descriptions of talks \n",
"\n",
"- In order to use the `pd.DataFrame()` function to create a data frame all the lists/arrays must be the same length\n",
"- The below code goes through each nested list and either flattens it out or ammends it such that all 4 lists are the same length\n",
"- In the case of final_talk_description, we also clean up the data to remove unecessary tags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List for Titles"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# Turning the nested title list into a flat list (same will be done with other lists to ensure everything is same length)\n",
"final_talk_title = [item for sublist in talk_title for item in sublist]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"**(Jon's comments)**\n",
"\n",
"Great use of list comprehension to 'flatten out' a nested list\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List for Speakers"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Ensuring the final speaker list corresponds with the final talk titles list\n",
"final_talk_speaker = []\n",
"for counter in range(len(talk_title)):\n",
" for i in range(len(talk_title[counter])):\n",
" if i < len(talk_speaker[counter]):\n",
" final_talk_speaker.append(talk_speaker[counter][i])\n",
" #If the talk doesn't have a speaker then this will be left blank eg. most 'Announcement' talks do not have an assigned speaker\n",
" else:\n",
" final_talk_speaker.append('')"
]
},
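{
"cell_type": "markdown",
"metadata": {},
"source": [
"The padding logic above can also be expressed with `itertools.zip_longest` from the standard library. The sketch below is an assumption-laden illustration, not the submitted code: `pad_speakers` is a hypothetical helper and the sample titles and speakers are made up.\n",
"\n",
"```python\n",
"from itertools import zip_longest\n",
"\n",
"\n",
"def pad_speakers(titles_per_event, speakers_per_event):\n",
"    # Hypothetical helper: return one speaker per talk title, using '' where no speaker exists\n",
"    padded = []\n",
"    for titles, speakers in zip(titles_per_event, speakers_per_event):\n",
"        # Pair titles and speakers by position; zip_longest fills the missing speakers with ''\n",
"        pairs = zip_longest(titles, speakers[:len(titles)], fillvalue='')\n",
"        padded.extend(speaker for _, speaker in pairs)\n",
"    return padded\n",
"\n",
"\n",
"# Made-up sample: the first event has three talks but only two speakers\n",
"sample_titles = [['Welcome Introduction', 'Seminar Session', 'Announcement'], ['Closing']]\n",
"sample_speakers = [['Speaker A', 'Speaker B'], []]\n",
"padded_speakers = pad_speakers(sample_titles, sample_speakers)\n",
"print(padded_speakers)\n",
"```\n",
"\n",
"Like the nested loop, this pairs titles and speakers by position, so it relies on the talks without a speaker coming after the talks that have one within each event."
]
},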
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List for Descriptions"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# If 'div.row.schedule-item p ::text' was used then the description was split into separate descriptions if there was additional text inside \n",
"# To avoid this problem, '::text' was left out, and instead the now included
and tags were removed using a for loop\n",
"for i in range(len(talk_description)):\n",
" for j in range(len(talk_description[i])):\n",
" # Remove
and tags using string.replace()\n",
" talk_description[i][j] = talk_description[i][j].replace(\"
\", \"\").replace(\"
\", \"\").replace(\"\", \"\").replace(\"\", \"\").replace(\"amp;\", \"\")\n",
"# Turning the nested list into a flat list (same will be done with other lists to ensure everything is same length)\n",
"final_talk_descriptions = [item for sublist in talk_description for item in sublist]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List for Links"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Currently the list of links will not match the parallel list for individual talks, so this is ammended below\n",
"final_links = []\n",
"for counter in range(len(talk_title)):\n",
" for i in range(len(talk_title[counter])):\n",
" final_links.append(links[counter]) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert the lists to a **pandas data frame** and save it to a **CSV file**\n",
"\n",
"- Note: In order to use the `pd.DataFrame()` function to create a data frame all the lists/arrays must be the same length. This was checked using `if` statement\n",
"- The final CSV file (agenda.csv) is saved directly to the data folder creating in the Setting Up stage\n",
"- The final CSV file (agenda.csv) can be viewed as a table\n",
"- If an event has no speakers, then the relevant cell of the table will be left empty"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All lists are same length\n"
]
},
{
"data": {
"text/plain": [
" event_link \\\n",
"0 https://socialdatascience.network/fall2023/ses... \n",
"1 https://socialdatascience.network/fall2023/ses... \n",
"2 https://socialdatascience.network/fall2023/ses... \n",
"3 https://socialdatascience.network/fall2023/ses... \n",
"4 https://socialdatascience.network/fall2023/ses... \n",
".. ... \n",
"146 https://socialdatascience.network/sess2.html \n",
"147 https://socialdatascience.network/launch.html \n",
"148 https://socialdatascience.network/launch.html \n",
"149 https://socialdatascience.network/launch.html \n",
"150 https://socialdatascience.network/launch.html \n",
"\n",
" talk_title \\\n",
"0 Welcome Introduction \n",
"1 Seminar Session \n",
"2 Research Discussion. \n",
"3 Announcement \n",
"4 Welcome Introduction \n",
".. ... \n",
"146 Announcement \n",
"147 Welcome Introduction \n",
"148 Institutional Update \n",
"149 Round Table Discussion \n",
"150 Closing \n",
"\n",
" talk_speakers \\\n",
"0 Dr. Ghita Berrada, LSE \n",
"1 Dr. Divya Srivastava, LSE \n",
"2 Lead Institution \n",
"3 NaN \n",
"4 Prof. Petra Novak, CEU \n",
".. ... \n",
"146 NaN \n",
"147 Prof. Slava Jankin, Hertie School Data Science... \n",
"148 Chair: Prof. Kenneth Benoit, LSE Data Science ... \n",
"149 Dr. Erica Thompson, LSE Data Science Institute \n",
"150 NaN \n",
"\n",
" talk_description \n",
"0 Setting the scene: Brief intro to the speaker ... \n",
"1 Promoting the systematic use of real-world dat... \n",
"2 Q&A / Discussion on the research \n",
"3 Upcoming seminar in the series and other annou... \n",
"4 Setting the scene: Brief intro to the speaker ... \n",
".. ... \n",
"146 Upcoming seminar in the series and other annou... \n",
"147 Setting the scene: Context and Goals for the C... \n",
"148 Introducing CIVICA Partner Institutions' Direc... \n",
"149 Topic: Data Science and Digital Transformation... \n",
"150 Closing remarks \n",
"\n",
"[151 rows x 4 columns]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check if length of lists are the same\n",
"if len(final_links) == len(final_talk_descriptions) == len(final_talk_speaker) == len(final_talk_title):\n",
"    print(\"All lists are same length\")\n",
"else:\n",
"    print(\"Lists are not same length, please amend\")\n",
"\n",
"# Specify the file path including the data folder to save the agenda.csv file\n",
"csv_file_path = os.path.join(data_folder_path, 'agenda.csv')\n",
"\n",
"# Use the `pd.DataFrame()` function to create a data frame + Save pandas data frame df to a CSV file.\n",
"df_agenda = pd.DataFrame({'event_link': final_links, 'talk_title': final_talk_title, 'talk_speakers': final_talk_speaker, 'talk_description': final_talk_descriptions})\n",
"df_agenda.to_csv(csv_file_path, index=False)\n",
"\n",
"# Double-check that the CSV file was created correctly by opening it using pandas\n",
"df_agenda = pd.read_csv(csv_file_path)\n",
"df_agenda"
]
},
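{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the two notes above, using made-up lists rather than the scraped ones. It illustrates that `pd.DataFrame()` raises a `ValueError` when the lists differ in length, and that an empty speaker string comes back from the CSV as `NaN` unless `keep_default_na=False` is passed to `pd.read_csv()`. The column names mirror `agenda.csv`; the values and the in-memory buffer are assumptions for illustration only.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Made-up example data (assumption), not the scraped lists\n",
"titles = ['Welcome', 'Closing']\n",
"speakers = ['Dr. A', '']  # the second talk has no speaker\n",
"\n",
"# Lists of different lengths are rejected by pandas\n",
"try:\n",
"    pd.DataFrame({'talk_title': titles, 'talk_speakers': speakers[:1]})\n",
"    length_error = False\n",
"except ValueError:\n",
"    length_error = True\n",
"\n",
"# Round trip through CSV, using an in-memory buffer instead of a file\n",
"df = pd.DataFrame({'talk_title': titles, 'talk_speakers': speakers})\n",
"csv_text = df.to_csv(index=False)\n",
"\n",
"as_nan = pd.read_csv(io.StringIO(csv_text))\n",
"as_empty = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)\n",
"\n",
"print(length_error)\n",
"print(as_nan['talk_speakers'].isna().tolist())\n",
"print(as_empty['talk_speakers'].tolist())\n",
"```\n",
"\n",
"Whether to keep `NaN` or empty strings is a design choice: `NaN` works well with pandas methods such as `isna()`, while empty strings are easier to compare against `''` directly."
]
},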
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**(Jon's comments)**\n",
"\n",
"The remainder of this notebook is very impressive and 🆒. It shows initiative and creativity, and it is insightful. It is also very well documented: we know exactly what is going on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"## 💡 Additional Insights into this information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I will be looking at the following insights about the data from the CIVICA Seminar Website\n",
"\n",
"**1. 📖 Natural Language Analysis of Topics of Events (to determine what kinds of events are most common)**\n",
"\n",
"**2. 🗣️ Frequency of different speakers**\n",
"\n",
"**3. 📈 Frequency of CIVICA seminar events per quarter**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### 📖 Natural Language Toolkit and Analysis of Most Popular Topics for CIVICA Data Science Seminar Events\n",
"\n",
"- I thought that it would be very interesting to see what the most popular topics/points of discussion for events and talks were\n",
"- I realised that I could explore this with NLTK (the Natural Language Toolkit)\n",
"- 'Stop words' like 'The', 'And', 'To' etc. are filtered out, leaving only words that could indicate the most popular topics for events. Some custom words were added to the list of stop words if they appeared in every description (e.g. \"Q&A\"). The 18 most frequent remaining words are also left out of the graph, as they are very common words like \"Data\" that give little new insight (this is a seminar series about data, so \"Data\" will obviously be the most common word)\n",
"- The most popular words in event titles, talk descriptions and event overviews are shown in a bar chart for easy visual digestion\n",
"- Key insights and takeaways from the output are included below the bar chart"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /Users/jon/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"from nltk.corpus import stopwords\n",
"import certifi\n",
"import os\n",
"\n",
"# Set the SSL_CERT_FILE environment variable to the path of the updated certificates bundle\n",
"os.environ['SSL_CERT_FILE'] = certifi.where()\n",
"# Download the stopwords dataset (only required once)\n",
"nltk.download('stopwords')"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Split every event title and talk description into individual words\n",
"title_words_list = [word for title in titles for word in title.split()]\n",
"description_words_list = [word for description in final_talk_descriptions for word in description.split()]\n",
"\n",
"event_overview = []\n",
"# Visit each event page, using the event links collected in Part 1 (df_schedule)\n",
"for url in links:\n",
"    response = requests.get(url)\n",
"    sel = Selector(text=response.text)\n",
"\n",
"    # Collect the paragraph text from the active tab of the event page (the event overview)\n",
"    overview = sel.css('div.col-lg-9.fade.show.active p::text').getall()\n",
"    event_overview.extend(overview)\n",
"\n",
"\n",
"event_overview_list = [word for paragraph in event_overview for word in paragraph.split()]\n",
"\n",
"combined_words_list = title_words_list + description_words_list + event_overview_list\n",
"\n",
"# Get the list of English stopwords from nltk\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"# Additional words to be added to stop_words\n",
"# Note: the filter below compares word.lower() against this set, so capitalised entries\n",
"# such as 'Setting' or 'University' only match if they are written here in lower case\n",
"custom_stop_words = ['us', 'Setting', 'scene:', 'University', 'Professor', 'also', 'session', 'policy.', 'social,', 'ten', 'Hertie', 'Series,', 'sciences,', 'School', '.', 'political,']\n",
"\n",
"# Extend the stop_words set with custom_stop_words\n",
"stop_words.update(custom_stop_words)\n",
"\n",
"# Remove stopwords from the list of words\n",
"filtered_words_list = [word for word in combined_words_list if word.lower() not in stop_words]\n",
"\n",
"# Count the filtered words\n",
"filtered_counted_words = Counter(filtered_words_list)\n",
"\n",
"items_list = []\n",
"frequencies_list = []\n",
"\n",
"# Iterate through counted items and frequencies, skipping the 18 most common words\n",
"for item, frequency in filtered_counted_words.most_common()[18:46]:\n",
"    items_list.append(item)\n",
"    frequencies_list.append(frequency)"
]
},
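{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way the filtering step above could be made less sensitive to capital letters and trailing punctuation is to normalise every word before comparing it with the stop words. The sketch below is only an illustration under stated assumptions: the tiny `stop_words` set stands in for NLTK's English list, and `normalise` and `count_keywords` are hypothetical helper names that do not appear in the original notebook.\n",
"\n",
"```python\n",
"import string\n",
"from collections import Counter\n",
"\n",
"# Tiny stand-in stop-word set (assumption); the notebook uses NLTK's English list\n",
"stop_words = {'the', 'and', 'to', 'of', 'setting', 'scene'}\n",
"\n",
"def normalise(word):\n",
"    # Lower-case and strip surrounding punctuation, so 'scene:' becomes 'scene'\n",
"    return word.lower().strip(string.punctuation)\n",
"\n",
"def count_keywords(texts):\n",
"    tokens = [normalise(word) for text in texts for word in text.split()]\n",
"    # Drop empty tokens (pure punctuation) and stop words\n",
"    kept = [token for token in tokens if token and token not in stop_words]\n",
"    return Counter(kept)\n",
"\n",
"# Made-up descriptions (assumption), not the scraped text\n",
"example = ['Setting the scene: Data and policy.', 'Policy, data and the public']\n",
"keyword_counts = count_keywords(example)\n",
"print(keyword_counts.most_common(2))\n",
"```\n",
"\n",
"With this approach, `policy.` and `Policy,` are counted as the same word, so entries such as `'policy.'` would no longer need to be listed separately as custom stop words."
]
},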
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Data prepared in the previous cell: items_list (words) and frequencies_list (counts)\n",
"\n",
"# Create a bar chart\n",
"plt.bar(items_list, frequencies_list)\n",
"\n",
"# Adding labels and title\n",
"plt.xlabel('Key Word')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Most Popular Keywords in CIVICA Data Science Seminar Events and Talk Descriptions')\n",
"\n",
"# Display the frequency of each bar above the bars\n",
"for i, freq in enumerate(frequencies_list):\n",
"    plt.text(i, freq + 0.5, str(freq), ha='center')\n",
"\n",
"# Rotate x-axis labels sideways\n",
"plt.xticks(rotation='vertical')\n",
"\n",
"# Display the graph\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📖 Takeaways from Natural Language Analysis of Topics and Descriptions\n",
"\n",
"- From the bar chart we can see that the seminars tend to focus a lot on uses of AI/Data Science (high frequency of \"applications\", \"methodologies\", \"use\")\n",
"- There is a heavy emphasis on the \"social\", \"public\" and \"economic\" aspects of the uses of AI and Data Science\n",
"- The lack of technical jargon in the event descriptions and titles suggests that these talks will be accessible to the general public/to audiences that do not have existing knowledge of data science concepts (this is a good thing in my opinion, as it means that the CIVICA events are accessible to a larger group of people!)\n",
"- The fact that \"World\" appears more times than \"European\" might indicate that, despite CIVICA being a European-based organisation, their work is applicable to a broader scope/to the entire world (not just focusing on Europe)\n",
"- The high frequency of words like \"multi-disciplinary\", \"unique\", \"leading\" and \"new\" potentially indicates that the CIVICA seminar series is good at staying ahead of the data science curve, and that we have been successful at bringing in speakers that are at the cutting edge of the field (or maybe the descriptions are just written in such a way as to bring in more interest to the events)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### 🗣️ Seeing who the most common event speakers and talk speakers are across all events\n",
"\n",
"- I thought it would be interesting to see who the most frequent speakers at all the CIVICA events are (as both main speakers and assigned speakers for individual talks within events)\n",
"- If there are any speakers who have spoken a lot, then CIVICA might make note to not bring them in for future events and to maybe bring in a different speaker to allow for more variety (alternatively, they might also want to give an acknowledgement to any speaker that has spoken lots at events!)\n",
"- A bar chart has been generated to visualise this information, and the code has been written such that events with no speaker will be shown as \"no assigned speaker\" as opposed to just showing a blank space"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# List of all speakers\n",
"all_speakers = final_talk_speaker + speakers\n",
"counted_speakers = Counter(all_speakers)\n",
"\n",
"speaker_list = []\n",
"occurence_list = []\n",
"\n",
"# Iterate through the 10 most common speakers and their frequencies\n",
"for item, frequency in counted_speakers.most_common(10):\n",
"    # Label talks without a speaker instead of showing a blank name\n",
"    speaker_list.append(\"No Assigned Speaker\" if item == \"\" else item)\n",
"    occurence_list.append(frequency)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Data prepared in the previous cell: speaker_list (names) and occurence_list (counts)\n",
"\n",
"# Create a bar chart\n",
"plt.bar(speaker_list, occurence_list)\n",
"\n",
"# Adding labels and title\n",
"plt.xlabel('Speaker Name')\n",
"plt.ylabel('Number of events/talks spoken at')\n",
"plt.title('10 Most Frequent Speakers in CIVICA Data Science Seminar Events and Talks')\n",
"\n",
"# Display the frequency of each bar above the bars\n",
"for i, freq in enumerate(occurence_list):\n",
"    plt.text(i, freq + 0.5, str(freq), ha='center')\n",
"\n",
"# Rotate x-axis labels sideways\n",
"plt.xticks(rotation='vertical')\n",
"\n",
"# Display the graph\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 🗣️ Takeaways from analysis of frequency of speakers\n",
"\n",
"- From the bar chart we can see that many talks have no assigned speaker or are broadly facilitated by the lead institution\n",
"- No single speaker has spoken at an especially high number of events or talks, indicating that the CIVICA seminar series is good at bringing in a variety of speakers (which is good for marketing, as we are always doing new things and not relying on a select group of speakers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### 📈 Seeing how the frequency of events has moved over time\n",
"\n",
"- I thought it would be interesting to get an overview of how many CIVICA seminar events are held each quarter to see how consistent these events are\n",
"- e.g. are we gradually holding more or fewer events, or is there significant fluctuation?\n",
"- I used `datetime` to parse the dates and a dictionary to count them into the correct quarters for the final graph"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Remove the commas from the scraped date strings so they match the format used below\n",
"formatted_dates = [string.replace(\",\", \"\") for string in dates]\n",
"\n",
"# Initialize quarter and count lists\n",
"quarters = []\n",
"counts = []\n",
"\n",
"# Initialize quarter and count dictionary\n",
"quarter_counts = {}\n",
"\n",
"# Iterate through the dates and count quarters for each year\n",
"for date_str in formatted_dates:\n",
"    # Convert string to datetime object using the appropriate format\n",
"    date_obj = datetime.strptime(date_str, \"%A %d %B %Y\")\n",
"    year = date_obj.year\n",
"    # Filter dates for 2021, 2022, and 2023\n",
"    if 2021 <= year <= 2023:\n",
"        quarter = (date_obj.month - 1) // 3 + 1\n",
"        # Update quarter count in the dictionary\n",
"        key = f\"Q{quarter} {year}\"\n",
"        quarter_counts[key] = quarter_counts.get(key, 0) + 1\n",
"\n",
"# Extract quarters and counts into separate lists\n",
"for quarter, count in quarter_counts.items():\n",
"    quarters.append(quarter)\n",
"    counts.append(count)"
]
},
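{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counting above keeps the quarters in the order in which the dates were scraped, and a quarter with no events never appears. As a hedged alternative, the sketch below sorts quarters chronologically and fills empty quarters with 0. The helper names `quarter_key` and `count_by_quarter` and the example dates are assumptions made for illustration; the date format is assumed to match the one used in the cell above.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"from datetime import datetime\n",
"\n",
"def quarter_key(date_obj):\n",
"    # (year, quarter) tuples sort chronologically\n",
"    return (date_obj.year, (date_obj.month - 1) // 3 + 1)\n",
"\n",
"def count_by_quarter(date_strings, fmt='%A %d %B %Y'):\n",
"    parsed = [datetime.strptime(s.replace(',', ''), fmt) for s in date_strings]\n",
"    counts = Counter(quarter_key(d) for d in parsed)\n",
"    first, last = min(counts), max(counts)\n",
"    # Walk every quarter from first to last so that empty quarters appear as 0\n",
"    labels, values = [], []\n",
"    year, quarter = first\n",
"    while (year, quarter) <= last:\n",
"        labels.append(f'Q{quarter} {year}')\n",
"        values.append(counts.get((year, quarter), 0))\n",
"        year, quarter = (year, quarter + 1) if quarter < 4 else (year + 1, 1)\n",
"    return labels, values\n",
"\n",
"# Made-up dates in the same format as the scraped ones (assumption)\n",
"example_dates = ['Thursday, 12 October 2023', 'Thursday, 16 March 2023', 'Thursday, 9 March 2023']\n",
"labels, values = count_by_quarter(example_dates)\n",
"print(labels, values)\n",
"```\n",
"\n",
"Because the labels come out in chronological order, they can be passed straight to `plt.plot()` without reversing the lists first."
]
},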
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot reversed copies of the lists so the x-axis is displayed in reverse order\n",
"# Slicing (rather than .reverse()) leaves quarters and counts unchanged if the cell is re-run\n",
"x = quarters[::-1]\n",
"y = counts[::-1]\n",
"\n",
"# Create a line graph\n",
"plt.plot(x, y, marker='o', color='b', linestyle='-', linewidth=2, markersize=8)\n",
"plt.xticks(rotation='vertical')\n",
"\n",
"# Adding labels and title\n",
"plt.xlabel('Date (Quarter, Year)')\n",
"plt.ylabel('Number of CIVICA Seminar Events')\n",
"plt.title('Frequency of CIVICA Seminar Talks')\n",
"\n",
"# Display the graph\n",
"plt.grid(True) # Add gridlines for better readability\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**(Jon's comments)**\n",
"\n",
"There is just one important mistake in the plot above: the Y-axis doesn't start at 0."
]
},
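{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to address the comment above, sketched with made-up counts rather than the scraped ones: calling `set_ylim(bottom=0)` on the axes anchors the y-axis at zero while leaving the upper limit to matplotlib. The `Agg` backend line is only there so the sketch runs without a display; it is an assumption about the environment and can be dropped inside a notebook.\n",
"\n",
"```python\n",
"import matplotlib\n",
"matplotlib.use('Agg')  # non-interactive backend (assumption: no display available)\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Made-up quarter counts (assumption), not the scraped values\n",
"example_quarters = ['Q1 2023', 'Q2 2023', 'Q3 2023']\n",
"example_counts = [6, 5, 2]\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.plot(example_quarters, example_counts, marker='o')\n",
"ax.set_ylim(bottom=0)  # anchor the y-axis at zero\n",
"ax.set_xlabel('Date (Quarter, Year)')\n",
"ax.set_ylabel('Number of CIVICA Seminar Events')\n",
"ax.grid(True)\n",
"\n",
"y_min, y_max = ax.get_ylim()\n",
"print(y_min)\n",
"```\n",
"\n",
"Starting the axis at zero keeps the vertical distances proportional to the actual counts, so small quarter-to-quarter changes are not visually exaggerated."
]
},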
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📈 Takeaways from analysing the frequency of CIVICA seminar events\n",
"\n",
"- As we can see from the graph, there is seasonal fluctuation in how many CIVICA Seminar events are held each quarter\n",
"- Q3 consistently sees the lowest number of seminars (probably due to the fact that this time period contains the summer holidays)\n",
"- There is no clearly observable trend, but compared to 2021 and 2022, 2023 is so far falling behind in terms of number of events (even if we account for the fact that there are 2 months left)\n",
"- This might suggest that, in order to hold a similar number of talks to previous years, CIVICA should try to schedule more talks for December 2023\n",
"- The average decline in the number of events may also reflect a shift in preference away from online events (like the CIVICA seminar series) towards in-person events post-pandemic. If we can get some additional information on this, then CIVICA may decide to hold more in-person events in order to attract more interest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"## 💻 In what ways did I use Generative AI tools to help with my assignment?\n",
"\n",
"- While working on my assignment, I found Generative AI tools (ChatGPT) to be incredibly useful for debugging specific sections of my code\n",
"- It helped me identify syntax errors and provided valuable suggestions for correcting loop structures\n",
"- However, I didn't heavily rely on Generative AI tools for the overall assignment\n",
"- This was mainly because the responses they generated were often generic and lacked the specificity required for the summative\n",
"- Additionally, there were instances where the suggestions provided contained errors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**(Jon's comments)**\n",
"\n",
"This is good and at times very specific (used it for debugging and to correct loops), but I feel that some parts of this notebook, such as the use of the `collections` and `ordered_set` packages, probably came from the output of generative AI. If true, the author failed to mention this.\n",
"\n",
"