Section 3.1 Tables
Preview Activity 3.1.1.
One of the important features of Python for programmers (and SageMath for mathematicians) is its extensive “standard library”: that is, a lot of features are available by default in the language without installing additional programs.
However, since different people use Python for different applications, not every feature can be made availble by default. There are several libraries available in Python for data science, but we will use one specifically developed for students called, well, datascience
1 .
(a)
Copy the contents of 3.1.1 into a Code cell and follow its instructions to install datascience
if it is not already available, then re-run the Code cell. You will only need to do this process once per project.
(b)
To confirm that you can create datascience
tables, copy-paste the contents of 3.1.2 into a Code cell to display a nicely formatted table of (incomplete) Scrabble tile data.
Activity 3.1.2.
As previously investigated, dictionaries can be used to describe complex data with many different properties. In 3.1.3 a list of dictionaries is used to describe several students in standard Python.
(a)
Copy this list into a Code cell, and add data for at least three more students to the list.
(b)
Data is often stored in a spreadsheet or database as a table. An illustration of this is given in 3.1.4.
Name | Age | Favorite Teacher |
Julia | 17 | Mr. Fullinwider |
Jessica | 19 | Dr. Clontz |
Lynn | 18 | Ms. Gilbreath |
Tables can be created programmatically by using the technique from 3.1.1 or by modifying 3.1.5.
Run from datascience import Table
in a Code cell, and then copy 3.1.5 to another Code cell. Finally, append the student data you created in the previous task.
(c)
Since data usually isn't hard-coded into a script, but instead entered by folks who may not be programmers into a spreadsheet or similar file, we will need a way to read external files containing data.
Open a spreadsheet application such as Microsoft Excel or Google Sheets. Copy the table of data displayed in the previous task and paste it into your spreadsheet.
Save the spreadsheet to a Comma-Separated Values file named student_data.csv
and use the (+) New menu to upload it to CoCalc in the same folder as this Jupyter notebook.
Finally, create a Code cell with the line Table.read_table('student_data.csv')
to check that your CSV data can be imported into a datascience
table.
Activity 3.1.3.
When working with data in industry, you're likely working with many more rows than can be displayed on your screen at a given time. In fact, so-called “Big Data” is one of the fastest-moving fields of mathematics and computer science. By working with datascience
tables, you can filter data as needed for display and analysis.
(a)
Visit the website https://vincentarelbundock.github.io/Rdatasets/datasets.html
to get a CSV with the transactions for a (fictional) pizza parlor for the year 2015. (Hint: Use Ctrl+F to search for “pizza”.) If you like, you can download the CSV and view it in a spreadsheet program such as Microsoft Excel.
For this activity however, you should right-click and copy the URL to the CSV file. In a Code cell, save this URL to the CS variable named pizza_url
, and then import this data directly by running pizza_table = Table.read_table(pizza_url)
and pizza_table.show(20)
.
pizza_table
Unnamed: 0 | id | date | time | name | size | type | price |
1 | 2015-000001 | 2015-01-01 | 11:38:36 | hawaiian | M | classic | 13.25 |
2 | 2015-000002 | 2015-01-01 | 11:57:40 | classic_dlx | M | classic | 16 |
3 | 2015-000002 | 2015-01-01 | 11:57:40 | mexicana | M | veggie | 16 |
(b)
There are almost fifty thousand transactions stored in this dataset. You could write pizza_table.show()
without specifying how many records to to try and show them all, but CoCalc will complain that 50,000 rows are too many.
Insert an integer into the .show()
tool to display exactly one hundred rows from the dataset.
(c)
The code pizza_table.where("date","2015-02-01").show(10)
allows you to view the first ten records of a pizza purchased on Feburary 1, 2015.
Display the first 20 pizza pies purchased on \(\pi\)-Day (that is, March 14, 2015).
(d)
You can filter by any column of the table by modifying the where()
arguments. Display the first 20 medium pizzas purchased in this parlor.
(e)
Display the first 20 “Mexicana” pizzas purchased in this parlor. (Hint: make sure you filter the data based on how its string appears in the table.)
(f)
Chain two where()
s and a show()
together to display the first 20 large-sized veggie pizzas purchased in this parlor.
(g)
Running pizza_table.where("date","2015-07-22").sort("size").show()
displays the large pizzas sold on July 22nd first, followed by medium, small, and extra-large pizzas (in that order). Running pizza_table.where("date","2015-07-22").sort("size",descending=True).show()
displays them in the order extra-large, small, medium, then large.
Write a sentence or two explaining why you suspect the datascience library orders the pizzas in this way.
(h)
Display all the Hawaiian pizzas ordered on April 12, once in increasing order, and once in descreasing order.
Activity 3.1.4.
While it's great to have data already organized in a CSV, most data is not already parsed for use directly by a data science library. Instead, data scientists must manually and/or programmatically convert the raw data into a suitable format.
In this activity, we will convert the text of a book in order to count the number of times each of its main characters is mentioned in each chapter.
(a)
Save a plain-text copy of Peter Pan by J. M. Barrie as the file peterpan.txt
in the same folder as this notebook by using the link https://www.gutenberg.org/ebooks/16
. This task may be confirmed by running f=open("peterpan.txt"); print(f.readline()); f.close()
in a Code cell.
(b)
Open the file and manually delete the header and footer information, so the text file begins at the start of “Chapter 1” and ends with the words “THE END”. Re-run print(open("peterpan.txt").readline())
, and you should see the line Chapter 1 PETER BREAKS THROUGH
.
(c)
Fix the code in 3.1.7 to save the chapters of the book to a datascience
table.
(d)
The code in 3.1.8 counts how many times Peter Pan is mentioned in each chapter of the book. Extend this code to also count how many times the Darling siblings Wendy, Michael, and John are mentioned in each chapter.
(e)
Sort this table to find the chapter where Wendy is mentioned the most, and the chapter where she is mentioned the least.
Exercises Exercises
1.
Repeat the necesary steps from the class activities in order to make pizza_table
and book_table
available for these exercises. Then run pizza_table.show(10)
and book_table.show(10)
to confirm.
2.
How many extra-extra-large pizzas were sold by the pizzeria represented by pizza_table
?
3.
Display all the “green garden” pizzas purchased on October 1st.
4.
Display all the pizzas sold on February 14th, sorted from most to least expensive.
5.
Update book_table
to include how frequently Tinker Bell is mentioned. (Note that she is often referred to as simply “Tink”.)
6.
Display book_table
sorted by how frequently Tinker Bell is mentioned, and confirm that she is most often mentioned in Chapter 3 (\(22\) times).
http://data8.org/datascience/