W02 Formative Exercise - Warm up to data analysis
2024/25 Autumn Term
Update (20 Oct 2024, 5.30pm): Take a look at the Week 02 Solutions & Commentary page.
Am I ready to start this exercise?
You will benefit more from this exercise if you have completed the following tasks:
Context
How long will this take?
I estimate it should take about 2 hours to complete this exercise. Here's a rough breakdown of the time you might spend on each task if you don't encounter any unexpected errors along the way:
- Task 1: 15 minutes
- Task 2: 15 minutes if you skip the (optional) challenging step, 30 minutes if you do it.
- Task 3: about 45 minutes if done without the optional challenging step. Allow one extra hour if you decide to tackle it without seeing the solution.
- Task 4: about 15 minutes
Due Date:
- 9 October 2024, 8.00 PM (the day before the first lecture)
You won't be able to submit after this time, but you can still work on it and ask questions in the #help channel on Slack. This assignment won't count towards your grade (not even for General Course students), but you might feel behind if you complete it after the deadline.
Useful DS105 links
For a quick reminder of the most common terminal commands.
To learn how to interact with the Nuvolos Cloud Platform and submit your assignment.
W01 Exercise
If you want to revisit the commands in context of usage.
W01 Lecture Notes
You might want to read through Topic 2 of the notes if you feel confused about paths, files and directories.
Online resources
- Primitive Data Types vs Non Primitive Data Types in Python | Geeks for Geeks
- Python Data Structures | DataCamp
- Python Data Structures CheatSheet | Cheatography
Submission
- You should submit your solutions entirely via the Nuvolos Cloud platform.
Main Objectives:
If you complete this assignment successfully, you will have practiced and learned the following skills:
- Terminal commands! (cat, head, tail, wc, grep)
- Basic data structures in Python and manipulation of lists and dictionaries
- Basic string manipulation and arithmetic operations in Python
- Reading files using Pandas and working with JSON
NOTE: We will not provide any scripts. You are expected to create your own files and directories as needed through the Nuvolos Cloud Terminal app.
Task 1: Getting started with Nuvolos
Ideally, you should do this activity via Nuvolos. As I said in the lecture, accessing the Terminal of a completely different computer with a completely different file system structure will be very useful. The machine you will access on Nuvolos runs not on Windows or Mac but on a Linux Operating System. Working on it will help you understand the concepts of directories and files better.
But if you encounter any issues with Nuvolos and we can't help you sort them out in time, you can still complete this exercise on your own computer. You won't be able to "submit" your work, but you can still learn a lot from the tasks, which is the most important thing!
We're here to help: We are offering 6 hours of support via Slack and 10 hours of "live support" (office hours + drop-in session) in Week 02. If you're experiencing difficulties with Nuvolos, do not hesitate to ask your questions in the #help channel on Slack or attend a support session or office hours.
ACTION POINTS:
Access Nuvolos. Visit our Setting up Nuvolos Guide, read it carefully and follow the instructions to access the platform.
Accept the assignment on Nuvolos. Go to the Assignments tab and accept the assignment called "W02 Formative Exercise". Nothing will change in your files because we are giving you an empty folder, but accepting it will keep a record that you are engaging with the exercise.
Run the Terminal emulator app. After a few seconds, you will see a terminal window. Note: even if your computer runs Windows, you will have to use bash commands, the same ones that macOS and Linux users use. Check our Terminal Cheat Sheet to compare the commands you know with the ones you will use on Nuvolos.
Create the folder structure below. Use the appropriate terminal commands you saw last week (or on the cheat sheet) to create the following directory structure:
W02-Practice/
├── data/
└── code/
Check the log of commands you've run so far. You can use the history command to see a list of the commands you've run so far. Try running the following command:
history
On Nuvolos, I created a special command called print-history that formats the output of the history command in a way that is easier to read. You can run it by typing:
print-history
Submit your work
Simply creating folders and files on Nuvolos, like we did above, will not make your changes visible to us. Any alterations made within Nuvolos are only visible to you. To receive feedback and help us understand your progress with the exercises, it's important to submit your work.
NOTE: I will check all the submissions on Nuvolos. This will help me understand how easy or difficult the exercises and materials are for everyone. While I won't necessarily give you individualised feedback on your submission, I will compile common misconceptions and best practices to share with the class in the W02 Lecture.
ACTION POINTS:
Export your bash history. Every time you feel ready to submit, run the command below to export your bash history to a Comma-Separated Values (CSV) file:
print-history > ./data/bash_history.csv
The > symbol redirects the output of the print-history command to a file called bash_history.csv in the data directory. Note that you might need to edit the path to the file if you are not in the W02-Practice directory.
I will combine everyone's logs and show some aggregated data analysis in the next lecture.
(Optional) See the content of the file. You can use the cat command to see the content of the file you just created. Run the following command:
cat ./data/bash_history.csv
You should see a list of the commands you've run so far, one per line, with the fields separated by commas and each command enclosed in double quotes.
Tip: You can also use the less command to scroll through the file. Just type less ./data/bash_history.csv and use the arrow keys to navigate. Press q to exit.
Go to the Assignments tab. There you will find the list of assignments available. Locate the assignment and click on the "Hand-In" button.
Hand in the submission. You will be asked to type an identifier.
Re-submit any time you need. If you need to make changes to your submission, you can hand it in again. The last submission is the only one we will consider when giving feedback.
It's absolutely natural and common to have many questions at this stage. We want to hear from you! We feel lonely if you don't ask for help. Do voice your questions in the #help channel on Slack, attend support sessions or bring them to the lecture.
Check your understanding
After completing this task, you should be able to explain the following to a colleague:
Think about it: a good indicator that we have grasped a new concept is when we can clearly explain it to others.
Talk about it! If you're having difficulty articulating answers to the questions above, you can benefit from chatting with others (peers or instructors) to revisit your learning. What if you posted a message in the #help channel on Slack, such as, "So... why is the directory structure in Nuvolos so different from a normal computer's?"
Task 2: An analysis of your Terminal history data
Typically, the first step in any data analysis project is to understand the data you are working with. Here are a few things you typically investigate:
- What is the structure of the data? Is it like a table, with rows and columns? Or is it more like a list of items?
- What are the types of data in the dataset? Are they numbers, strings, dates, or something else?
- What are the unique values in the dataset? Can I do a tally of how many times each value appears?
You don't need to wait until we learn Python to start working with data! In this task, you will find a few new Terminal commands that will stay useful in the future, whenever you want to quickly check the content of a new dataset before you start working properly with it.
ACTION POINTS:
Visualise the content of a file with cat. On the Terminal, cd to where the file is and type the following command to see its content:
cat bash_history.csv
Notice that there is a structure to this file: each line represents a command you've run and the three fields (line, timestamp, and command) are separated by commas. This is a common format for tabular data, and it's called a CSV (Comma-Separated Values) file.
Tip: You can also use the less command to scroll through the file. Just type less bash_history.csv and use the arrow keys to navigate. Press q to exit.
Use head or tail to explore just a piece of the file. When working with very large files, you might just want to peek at the beginning or end of the file. You can do this with the head and tail commands. Try running the following commands:
head bash_history.csv
tail bash_history.csv
The head command shows the first 10 lines of the file, while the tail command shows the last 10 lines. You can specify the number of lines you want to see with the -n flag, like this:
head -n 5 bash_history.csv
tail -n 5 bash_history.csv
Count the number of lines in the file. You can use the wc command to count the number of lines in the file. Try running the following command:
wc -l bash_history.csv
The number should match the line number you see at the tail of the file.
What if you want to count the number of words? Swap the -l flag for -w:
wc -w bash_history.csv
Count the number of times you've run a specific command. You can use the grep command to search for specific commands in the file. For example, to count the number of times you've run the ls command, you can run the following command:
grep -c 'ls' bash_history.csv
The -c flag tells grep to count the number of lines that match the pattern within quotes. Keep in mind that the pattern can match anywhere in the line, so this count also includes lines where the letters ls merely appear inside another word or in an argument. You can also use the -i flag to make the search case-insensitive (that is, to ignore the difference between uppercase and lowercase letters).
Think about it: What other commands would you like to count?
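To build some intuition for what grep -c 'ls' actually counts, here is a rough Python sketch. It assumes a small made-up list of commands (not real history data) and compares "contains the letters ls anywhere" with "the first word is exactly ls". Treat it as an illustration of the idea, not as how grep is implemented.

```python
# Made-up commands standing in for the third field of the CSV file.
commands = ["ls -la", "cat tools.txt", "echo false", "ls", "cd data"]

substring_count = 0   # roughly what `grep -c 'ls'` reports
first_word_count = 0  # a stricter count: the command itself is `ls`

for command in commands:
    if "ls" in command:                 # matches anywhere, e.g. inside "tools"
        substring_count = substring_count + 1
    if command.split()[0] == "ls":      # only the first word is compared
        first_word_count = first_word_count + 1

print(substring_count)   # 4
print(first_word_count)  # 2
```

The two numbers differ because "tools.txt" and "false" both contain the letters ls even though the ls command was never run on those lines.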
(Optional) 5. CHALLENGING: What are the top 5 commands you typed so far?
Warning: This section might be overwhelming if you are new to the command line. It's totally fine to skip this step. You will learn all these skills in Python in the next few weeks.
Meet the uniq command!
There is a command called uniq that could help us count how many times you've run each unique command. But there is a catch: uniq only groups adjacent lines that are the same. That is, before using uniq, we would need to get our data sorted. If we had something like this:
mv
mv
cd
cd
ls
ls
ls
cat
cat
Then, running uniq -c would return the count of each command:
2 mv
2 cd
3 ls
2 cat
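If it helps, the "adjacent lines only" behaviour can be sketched in a few lines of Python. The function name uniq_count and the sample list are made up for this illustration; the sketch only mimics the counting idea and is not the real implementation of uniq.

```python
def uniq_count(lines):
    """Mimic `uniq -c`: count runs of identical *adjacent* lines."""
    runs = []  # list of [count, line] pairs, one per run
    for line in lines:
        if runs and runs[-1][1] == line:
            runs[-1][0] += 1          # same as the previous line: extend the run
        else:
            runs.append([1, line])    # different line: start a new run
    return runs

unsorted_lines = ["ls", "cd", "ls", "ls", "cd"]

print(uniq_count(unsorted_lines))          # [[1, 'ls'], [1, 'cd'], [2, 'ls'], [1, 'cd']]
print(uniq_count(sorted(unsorted_lines)))  # [[2, 'cd'], [3, 'ls']]
```

Without sorting, ls and cd each show up more than once in the result, which is exactly the catch described above.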
Maybe sort can help us?
One way to achieve this is to use the sort command before uniq. Try running the following command:
sort bash_history.csv | uniq -c
Notice the use of the pipe | to send the output of the sort command to the uniq command. The -c flag tells uniq to count the number of times each line appears.
The output is not quite what we are going for. The problem is that sort treats the entire line as a single entity. So, if you typed cd Documents and cd Downloads, they would be treated as different commands. On top of that, each line also includes a line number and a timestamp, so hardly any two lines are identical. We still need to find a way to trim each line down to just the command itself.
Enter the cut command!
One way to do this is to use the cut command, which extracts sections from each line of input. Try running the following command:
cut -d , -f 1 bash_history.csv
The command above will extract the line numbers. The -d flag specifies the delimiter, which in our case is a comma. The -f flag specifies the field we want to extract.
To get closer to what we want, we need to get the third field (the commands you've run):
cut -d , -f 3 bash_history.csv
To get just the command (not the arguments or pipes), we cut the remaining string again, this time using a space as the delimiter. Each command in the file starts with a double quote followed by a space, so the command name is the second field:
cut -d , -f 3 bash_history.csv | cut -f 2 -d ' '
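The two cut steps can be mirrored in Python on a single line of text. This is a minimal sketch that assumes a line in the three-field format shown in this exercise; the variable names are made up for the illustration.

```python
# A sample line in the three-field format used in this exercise.
line = '263 ,2024-09-24-17:50:01," head bash_history.csv"'

# cut -d , -f 3   -> keep the third comma-separated field
third_field = line.split(",")[2]

# cut -f 2 -d ' ' -> keep the second space-separated field
command_name = third_field.split(" ")[1]

print(repr(third_field))  # '" head bash_history.csv"'
print(command_name)       # head
```

The first space-separated piece of the third field is just the opening double quote, which is why the command name sits in the second position.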
Almost there!
Combining these commands with the sort and uniq commands, we can count the number of times you've run each command:
cut -d , -f 3 bash_history.csv | cut -f 2 -d ' ' | sort | uniq -c
Nice! But the commands are sorted in alphabetical order. Therefore, we need to add another sort command at the end to sort them by the number of times you've run each command (the -n flag sorts numerically):
cut -d , -f 3 bash_history.csv | cut -f 2 -d ' ' | sort | uniq -c | sort -n
The top commands will be at the end of the list.
Reverse the order and grab the top 5!
If you want to see the top commands at the beginning of the list, you can add the -r flag to the last sort command:
echo "My top 5 commands:"
cut -d , -f 3 bash_history.csv | cut -f 2 -d ' ' | sort | uniq -c | sort -nr | head -n 5
Here we added a line of text (with echo) to make it clear what the output is showing and used the head command to show only the first 5 lines, sorted in reverse order from the most used command to the least used.
Mine looks like this:
My top 5 commands:
115 echo
102 history
59 print-history
57 vim
51 source
Phew! That was a lot of commands in a row! You must feel like a hacker.
Submit your work again. Go back to the Submit your work section and follow the instructions again to update your submission.
This will demonstrate to us that you're reading the material and engaging with the course! Also, I will aggregate everyone's data (anonymised) to present some simple analysis at the upcoming lecture.
Check your understanding
After completing this task, you should be able to explain the following to a colleague:
Think about it: a good indicator that we have grasped a new concept is when we can clearly explain it to others.
Talk about it! If you're having difficulty articulating answers to the questions above, you can benefit from chatting with others (peers or instructors) to revisit your learning. What if you posted a message in the #help channel on Slack, such as, "I was brave and tried Step 5 of Task 2 but I'm still puzzled by this particular line. Does anyone have a better intuition into how it works?"
Task 3: Analysis of the CSV file using Python!
It's time to switch to Python! In this task, we will read the bash_history.csv file you created in the previous task and perform the same analysis we did there, but using Python.
ACTION POINTS:
Re-export your history log. Just to get the most up-to-date data:
print-history > ./data/bash_history.csv
Adapt the path to the file if you are not inside the W02-Practice directory.
Open the Python shell. Open the Terminal app, make sure your working directory is W02-Practice, then type python to open the Python shell. You should see something like this:
Python 3.xx.x (some info) on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Tip: You can always exit the Python shell by typing exit() or pressing Ctrl+D.
Use open() to read a plain text file in Python. Before you read the file, you must establish a connection to it. Inside the shell (right next to the >>> symbol), type this and hit Enter:
file = open('./data/bash_history.csv', 'r')
Then type:
file
to see what the file variable contains. You should see something like this:
<_io.TextIOWrapper name='./data/bash_history.csv' mode='r' encoding='UTF-8'>
This tells us that file is a connection to the file bash_history.csv in read mode. You might have a different encoding, but that's fine.
Warning: Actually, this is not a good way to read files in Python, but it will help us build the intuition for the next steps.
Actually read the file. Now that we have a connection to the file, we can read its content. Type the following and hit Enter:
content = file.read()
The read() method reads the entire file and returns its content as a string. Because we stored the result in a variable, nothing is printed yet.
The content variable has all of our data. This is what we would further process in Python.
You can check the content by typing:
content
and hitting Enter. You should see the content of the file printed to the screen as one long string, full of commas (,), double quotes (") and newlines (\n).
Alternatively, you can use the print() function to see the content in a more readable way, similar to the output of cat we saw in Task 2:
print(content)
Think about it: Why do you think content and print(content) produce something so different? (Why not post on Slack #help to discuss with others?)
Close the connection. After you are done reading the file, you must close the connection. Type the following and hit Enter:
file.close()
After this, if you try to run file.read(), it will raise an error because the connection is closed.
Now, here's the best way to do it! It's easy to forget to close a file, so Python has a helpful feature to handle the connection automatically. The with statement opens and closes the file for you if you use it correctly.
Here is how you can read the content of a file using the with statement:
with open('./data/bash_history.csv', 'r') as file:
    content = file.read()
After which, you can type content again to see the content of the file.
Tip: The with statement is a good practice because it ensures that the file is properly closed after its indented block finishes, even if an exception is raised.
It's extremely common to get some random errors at this stage. If that happens to you, contact us via Slack or attend a support session.
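If you want to convince yourself that the with statement really closes the file, the sketch below checks the closed attribute of the file object inside and outside the block. It writes a small throwaway file first so that it runs anywhere; the file name example_history.csv and its single line are made up for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_history.csv")
with open(path, "w") as file:
    file.write('1 ,2024-09-24-17:50:01," ls"\n')

# Read it back inside a `with` block.
with open(path, "r") as file:
    content = file.read()
    inside = file.closed    # still open while we are inside the block

outside = file.closed       # closed automatically once the block ends

print(inside)   # False
print(outside)  # True
```

No call to file.close() appears anywhere, yet the connection is closed as soon as the indented block ends.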
Split the lines. The content of the file is a single huge string. We need to split it into lines so we can make sense of it.
You can do this by using the split() method. Type the following and hit Enter:
lines = content.split('\n')
The split() method splits a string into a list using the specified separator. In this case, we are splitting the content into lines using the newline character (\n).
How many lines are there? Instead of wc -l, which we used in the Terminal, in Python you can simply run:
len(lines)
to obtain the length of the list. You should see the number of lines in the file. If the file ends with a newline character, the last element of the list is an empty string, so the result may be one more than what wc -l reports.
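Here is a small sketch of what split('\n') returns when the text ends with a newline, which is typical for files written from the Terminal. The two lines of content are made up, but they follow the same three-field format as the exercise file.

```python
# A tiny stand-in for the file content (made-up lines).
content = '1 ,2024-09-24-17:50:01," ls"\n2 ,2024-09-24-17:50:05," pwd"\n'

lines = content.split('\n')
print(lines)       # note the empty string at the end of the list
print(len(lines))  # 3, although the text only has 2 real lines

# One way to ignore empty entries: keep only the non-empty strings.
non_empty = []
for line in lines:
    if line != '':
        non_empty.append(line)

print(len(non_empty))  # 2
```

This is one reason why a line count in Python can be off by one compared with the Terminal.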
Seeing the head and tail of the list.
In Python, the first element of a list is at position 0. You can see the first line of the file by typing:
lines[0]
To get a similar experience to the head command in the Terminal, you can type:
lines[:5]
The : tells Python to get all elements from the beginning of the list up to (but not including) position 5, that is, the first five elements. You can change the number to see more or fewer lines.
Now to see the very last line of the file, you can type:
lines[-1]
And you can also use the same slicing technique to see the last 5 lines:
lines[-5:]
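The same indexing and slicing ideas can be tried on a short made-up list, where it is easy to check the result by eye. The list below is only a stand-in for lines.

```python
# A made-up list standing in for `lines`.
lines = ['line 0', 'line 1', 'line 2', 'line 3', 'line 4', 'line 5', 'line 6']

first = lines[0]    # first element
head = lines[:3]    # like `head -n 3`: positions 0, 1 and 2
last = lines[-1]    # last element
tail = lines[-3:]   # like `tail -n 3`: the final three elements

print(first)  # line 0
print(head)   # ['line 0', 'line 1', 'line 2']
print(last)   # line 6
print(tail)   # ['line 4', 'line 5', 'line 6']
```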
Note: In Python, the first element is always at position 0. If you want to see the second line, type lines[1].
Tip: You can check the content of lines by typing lines or print(lines) in the Python shell. You can also check how many lines there are by typing len(lines).
Split the line into fields. Recall that commas separate the line number, the timestamp and the command. You can split the first line into fields by typing the following and hitting Enter:
fields = lines[0].split(',')
You can check the content of the fields variable by typing fields.
Note: You should see a list that has three fields: the line_number, the timestamp and the command (as long as the command itself does not contain a comma).
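One thing to watch out for when splitting on commas: some commands contain commas themselves (cut -d , is one of them). The sketch below uses a made-up history line to show what happens, and one possible workaround using the optional second argument of split(), which limits the number of splits. This is a suggestion for illustration, not part of the official solution.

```python
# A made-up history line whose command contains a comma of its own.
line = '42 ,2024-09-24-17:50:01," cut -d , -f 3 bash_history.csv"'

naive_fields = line.split(',')     # splits on every comma
safe_fields = line.split(',', 2)   # splits on the first two commas only

print(len(naive_fields))  # 4
print(len(safe_fields))   # 3
print(safe_fields[2])     # " cut -d , -f 3 bash_history.csv"
```

With the limit of 2 splits, everything after the second comma stays together as the command.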
Neatly store the fields in a dictionary. You can store the fields in a dictionary to make it easier to access them. Type the following and hit Enter:
data = {
    'line_number': fields[0],
    'timestamp': fields[1],
    'command': fields[2]
}
You can check the content of the data variable by typing data.
Note: You should see a dictionary with the keys line_number, timestamp, and command and their respective values.
(CHALLENGE) Convert each one of the lines to a dictionary. If you have been taking the Python pre-sessional courses offered by Digital Skills Lab, try to use your knowledge of for loops and create a list called full_data such that each element of the list contains a dictionary representing a line in the file.
Note: If you are new to Python, don't worry! The W02 Lecture will feature a Python crash course.
Look at the solution
You could copy-paste the entire code below into the Python shell or, if you encounter errors, try to copy-paste each line separately. See if you can understand what each line does.
# Read the whole file into a single string
with open('./data/bash_history.csv', 'r') as file:
    content = file.read()

# Split the string into a list of lines
lines = content.split('\n')

# Build one dictionary per line that has exactly three fields
full_data = []
for line in lines:
    fields = line.split(',')
    if len(fields) == 3:
        data = {
            'line_number': fields[0],
            'timestamp': fields[1],
            'command': fields[2]
        }
        full_data.append(data)
You can check the content of the full_data variable by typing full_data.
Note: You should see a list of dictionaries, where each dictionary represents a line in the file.
If your code works, when you type full_data, you should see something like:
{'line_number': '263 ',
 'timestamp': '2024-09-24-17:50:01',
 'command': '" head bash_history.csv"'},
{'line_number': '263 ',
 'timestamp': '2024-09-24-17:50:01',
 'command': '" tail bash_history.csv"'},
{'line_number': '263 ',
 'timestamp': '2024-09-24-17:50:01',
 'command': '" wc -l bash_history.csv"'}
Think about it: How would you further process full_data to obtain just the names of the commands you used?
Save full_data to a JSON file. You will learn about JSON files in the W02 Lab, but you can start getting used to them.
# The JSON library is part of Python's standard library
import json

# This time we open the file with `w` to write to it
with open('./data/bash_history.json', 'w') as file:
    json.dump(full_data, file)
Exit the Python shell (type exit() and hit Enter) and use cat to see the content of the JSON file:
cat ./data/bash_history.json
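If you are curious about what json.dump() writes, the sketch below saves a made-up, one-item version of full_data to a throwaway file, reads the raw text back, and converts it into Python objects again with json.loads(). The file name example_history.json is invented for this illustration.

```python
import json
import os
import tempfile

# A made-up, one-item version of `full_data`.
full_data = [
    {'line_number': '1 ', 'timestamp': '2024-09-24-17:50:01', 'command': '" ls"'}
]

# Write to a throwaway location so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'example_history.json')
with open(path, 'w') as file:
    json.dump(full_data, file)

# Read the raw text back, then turn it into Python objects again.
with open(path, 'r') as file:
    raw_text = file.read()

loaded = json.loads(raw_text)

print(raw_text)             # the JSON text, as `cat` would show it
print(loaded == full_data)  # True
```

The JSON text looks a lot like the Python list of dictionaries, which is part of what makes the format convenient.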
Submit your work again. Go back to the Submit your work section and follow the instructions again to update your submission.
This time around, your directory should look like this:
W02-Practice/
├── data/
│   ├── bash_history.csv
│   └── bash_history.json
└── code/
Check your Understanding
After completing this task, you should be able to explain the following to a colleague:
Think about it: a good indicator that we have grasped a new concept is when we can clearly explain it to others.
Talk about it! If you're having difficulty articulating answers to the questions above, you can benefit from chatting with others (peers or instructors) to revisit your learning. What if you posted a message in the #help channel on Slack, such as, "Gee! I'm lost. I got stuck at Step 6 of Task 3 with the following error and can't get out of it! Help!"
Task 4: Save your code to a Python script
In Task 3, you played around with Python in the shell. This is a great way to experiment with code and test ideas. However, it is not a good way to write and save code that you want to reuse later. For this, we need to save our code to a Python script.
ACTION POINTS:
Create a new Python script. In the Terminal, ensure you are inside "W02-Practice", then type the following command to create a new Python script:
touch ./code/commands_analysis.py
Use touch (ni on Windows) every time you want to create a plain text file. If the file already exists, touch will simply update the "last modified" metadata on the file; it won't change its content.
Open the Python script using a terminal text editor. Use the nano command to edit a file on the Terminal, like you would on Word or Google Docs:
nano ./code/commands_analysis.py
Make changes to the file as you want. Remember: your mouse won't work as expected here, so use and abuse the arrow keys of your keyboard.
IMPORTANT: Whenever you are done editing, or you want to save your progress and go back to the Terminal, press Ctrl + X to exit. If there are unsaved changes, you will see a message at the bottom of the screen asking whether to save the modified buffer.
Type Y to confirm you want to save the changes. The nano app will ask you to give (or confirm) the name of your plain text file.
Edit the filename if it is incorrect, then hit Enter. You should be back on the shell app.
Copy the code from Task 3 to the Python script. Copy the code you wrote at the end of Task 3, including the bit where we save a JSON file, to the commands_analysis.py file.
Run the Python script. Open the Terminal app, cd to the W02-Practice folder and then run the following command:
python code/commands_analysis.py
Unlike the Python shell, a script only displays what you explicitly print, so add a line such as print(full_data) at the end of the script if you want to see the output on the screen as before.
Tip: It is common to see "file not found" errors at this stage. If that happens to you, check where you are (you should be in the "W02-Practice" folder), and check the path to bash_history.csv you wrote inside the Python script. It should be relative to "W02-Practice", too.
Submit your work again. Go back to the Submit your work section and follow the instructions again to update your submission.
This time around, your directory structure should look like this:
W02-Practice/
├── data/
│   ├── bash_history.csv
│   └── bash_history.json
└── code/
    └── commands_analysis.py
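To see why the "file not found" error depends on where you run the script from, here is a small sketch. It builds a throwaway copy of the folder layout in a temporary location (all made up for this demo) and checks whether the same relative path can be found from two different working directories.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway copy of the folder layout (made up for this demo).
root = Path(tempfile.mkdtemp()) / 'W02-Practice'
(root / 'data').mkdir(parents=True)
(root / 'code').mkdir()
(root / 'data' / 'bash_history.csv').write_text('1 ,2024-09-24-17:50:01," ls"\n')

relative_path = './data/bash_history.csv'
original_dir = os.getcwd()

os.chdir(root)                    # like `cd W02-Practice`
found_from_root = os.path.exists(relative_path)

os.chdir(root / 'code')           # like `cd W02-Practice/code`
found_from_code = os.path.exists(relative_path)

os.chdir(original_dir)            # go back to where we started

print(found_from_root)  # True
print(found_from_code)  # False
```

The path inside the script never changed. Only the working directory did, and a relative path is always resolved from the directory you are in when you run python.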
Check your Understanding
After completing this task, you should be able to explain the following to a colleague:
Think about it: a good indicator that we have grasped a new concept is when we can clearly explain it to others.
Talk about it! If you're having difficulty articulating answers to the questions above, you can benefit from chatting with others (peers or instructors) to revisit your learning. What if you posted a message in the #help channel on Slack, such as, "I'm trying to run my Python script but I keep getting this error. Can someone help me understand what's going on?"
Challenge
If you have already had the time to practice some Python, you can try to solve the following challenge.
Under Task 2 - Step 5 (Challenging), we used a series of bash commands to find the top 5 commands you've run so far. Can you try to achieve the same result using Python?
ACTION POINTS:
Edit the commands_analysis.py script. Open the commands_analysis.py script in the nano text editor and add the code to calculate the top 5 commands you've run so far.
- Use just basic Python! Just for or while loops + list + dict operations or str operations. Do not use any external libraries like pandas or numpy, nor more advanced modules like re or collections.
Create a rudimentary plot for your top commands. Try to make sense of the plot_ascii_bar() function we shared with you last week, at the end of the W01 Formative Exercise, and reuse it to make a plot of your commands.
Run the Python script. Open the Terminal app, cd to the W02-Practice folder and then run the following command:
python code/commands_analysis.py
You should see the output of the script printed to the screen (remember that a script only displays what you explicitly print).
Submit again! Go back to the Submit your work section and follow the instructions again to update your submission.
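If you are not sure where to start with the challenge, one building block you might find useful is tallying items with a dictionary. The sketch below does this on a made-up list of command names. It is only a hint, not the full solution: you would still need to extract the command names from your own data and pick the top 5.

```python
# A made-up list of command names (not real history data).
command_names = ['ls', 'cd', 'ls', 'cat', 'ls', 'cd']

counts = {}
for name in command_names:
    if name in counts:
        counts[name] = counts[name] + 1   # seen before: add one
    else:
        counts[name] = 1                  # first time we see this name

print(counts)  # {'ls': 3, 'cd': 2, 'cat': 1}
```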