✅ Week 07 Lab - Solutions

Author
Published

27 February 2024

Part I

Task 1

Step 1: Inspect the page

The solution to this is the CSS selector:

table.wikitable

Why? (and this is the crucial question)

  • When you inspect the page on the browser, you will find that the data we want is inside a <table> tag

  • But if I were to use the CSS selector `table, ’ I would capture all the tables on the page, many of which are hidden.

  • So, I need to be more specific. I need to specify the class of the table I want to capture.

  • Looking closer at the HTML, I see that the table I care about is defined as:

    <table class="wikitable sortable mw-datatable sticky-header static-row-numbers col1left col2left mobile-collapse-element jquery-tablesorter" style="text-align:right">
    ...
    </table>
  • I can use any of these classes to capture the table, but after playing around with the classes, I found that the wikitable class is the most robust one.

Step 2: Understand the output of pd.read_html(table)

  • You should realise that the output of pd.read_html(table) is a list of data frames.

  • The hint here is the [ and ] around the function’s output.

  • You can check this by running:

    type(pd.read_html(table))
  • Then, you could check the length of the list by running:

    len(pd.read_html(table))
  • When you check the first element of the list, you will see that it’s a data frame:

    type(pd.read_html(table)[0])
  • Therefore, the way to collect the first data frame is by using the index [0]:

    # Save this as df for later use
    df = pd.read_html(table)[0]

Task 2

If you go on ChatGPT or Google the warning, you will eventually discover that the solution to getting rid of the warnings is:

df = pd.read_html(StringIO(table))[0]

And that, for this to work, you would have to go back up the import cells and add:

from io import StringIO

Why, though?

Either through Googling or via an AI chatbot (the AIs do get this one right), you would eventually discover that pd.read_html() will no longer accept strings as input in the future. Instead, it will only accept files. You should also be able to discover that the StringIO is a way to convert a string into a file-like object, thus allowing pd.read_html() to accept it as input without throwing a warning.

But our goal with this task is more than finding out the solution. It is about understanding what you get out of AI chatbots and/or online searches.

It is doubtful that you will need to use the StringIO too frequently in the future, but you will definitely encounter errors and warnings every time you are asked to write code. The key here is for you to reflect:

  • How do I know this solved the problem? Is just ‘running the code’ enough? (No, it’s not)
  • Once I find a solution, how do I understand it? (You should be able to explain it to someone else)

⚠️ It is tempting to trust whatever ChatGPT throws at us. As long as it works, we are happy. But that is a recipe for future headaches. As the complexity of your code grows, you might notice that the AI chatbot solutions might become too cumbersome, unhelpful and you won’t be able to debug your own code!

Part II

Step 1: Understand the error

When you run:


df['population'].astype(np.int32)

You will get a very long error, but which eventually boils down to:

ValueError: invalid literal for int() with base 10: '217,732 3,000'"

It is okay if you don’t understand this cryptic error message. But once you do a bit of research (and AI chatbots can indeed be useful here again if used well), you will realise that the problem is that some of the values in the population column are not numbers. They contain multiple values, such as 217,732 3,000.

Step 2: Identifying the problem

☣️ AI CHATBOT DANGER ZONE:

Depending on how you query the AI chatbots, you might get a suggestion like this:

df['population'].str.replace(',', '').astype(np.int32)

THIS IS WRONG!

The AI-suggested solution assumes that the problem is with the comma (,), whereas, in fact, the problem is with the space character between the two numbers. The pandas library can handle commas in numbers just fine - but it cannot parse a string that has two numbers separated by a space.

How, then, do you solve this problem?

One solution (given by Oliver on Slack) is this:

weird_values = [(f"Index: {i}, Value: {df['population'][i]}") for i in range(len(df['population'])) if " " in df['population'][i]]

print(weird_values)

which prints out:

['Index: 152, Value: 217,732 3,000', 'Index: 233, Value: 449 0']

This is useful! I now know which rows have the problem; I know what they look like, and I can see that there are only two of them.

Steps 3-7: Asking a function to identify the problem for me

There are infinite ways to solve this problem, as this StackOverflow discussion thread shows.

In line with the solution provided in the previous step, I could write a function that checks for invalid numbers as follows:

def is_invalid_number(value):
  # Just check for a space
  # If there is one, it is an invalid number
  if " " in value:
    return True
  else:
    return False

Then, I could use this function to identify the problem:

is_invalid_number("217,732 3,000")

But why bother with this function? Why not just use the list comprehension as before? In this course, we want to make the most of the pandas library. The library is optimised for speed and efficiency, and it is best to use it as much as possible.

Instead of looping through values manually, we can use the apply method to apply a function to each value in the column.

Following Steps 5-7 should help you understand how apply works.

Steps 8: Creating a ‘clean’ column

We asked you to decide what to do with the invalid values. You might need to make a decision and take sides. Here are possible solutions:

Option 1: Remove the invalid values

# This wipes out what was in df before and replaces it with a new data frame
# The new dataframe only has rows where the population is a valid number
valid_rows = df['population'] == False
df = df[valid_rows].copy()

Then, you could create a new column with the valid values:

df['clean_population'] = df['population'].astype(np.int32)

This is a valid technical solution, but what does it mean to ignore the invalid values?? (We’ll leave that to you to think about)

Option 2: Replace the invalid values with NaN

NaN means “Not a Number”. This is a way to tell pandas that a value is invalid.

We would have to create yet another custom function to do this. Given that we are unlikely to need such a function again, let’s just use a lambda function (more on that in the W07 lecture):

# This replaces the invalid values with NaN
df['clean_population'] = df['population'].apply(lambda x: np.nan if is_invalid_number(x) else x)

# The regular np.int32 cannot handle NaNs
df['clean_population'] = df['clean_population'].astype('Int32')

Option 3: Split the invalid values and get the first/second number

# This splits the invalid values and gets the first number
df['clean_population'] = df['population'].apply(lambda x: x.split(" ")[0] if is_invalid_number(x) else x)

df['clean_population'] = df['clean_population'].astype(np.int32)

Part III

We will leave Part III as a take-home exercise. I will provide the solution in the W07 lecture.