EMDST Tables

Section 3.1 Tables

Preview Activity 3.1.1.

One of the important features of Python for programmers (and SageMath for mathematicians) is its extensive “standard library”: that is, a lot of features are available by default in the language without installing additional programs.

However, since different people use Python for different applications, not every feature can be made availble by default. There are several libraries available in Python for data science, but we will use one specifically developed for students called, well, datascience¹.

(a)

Copy the contents of 3.1.1 into a Code cell and follow its instructions to install datascience if it is not already available, then re-run the Code cell. You will only need to do this process once per project.

try:
    from datascience import Table
    print("The module `datascience` is installed on your project.")
except ModuleNotFoundError:
    import sys
    print("The module `datascience` is not installed.")
    print("To install it for your CoCalc project, do the following:"
    print("1. Use the [(+) New] menu to open a Linux terminal.")
    print("2. Copy-paste the line below into the terminal and hit [Enter].")
    print()
    print(f"{sys.executable} -m pip install --user datascience")
    print()
    print("3. Return to this notebook and \"Restart Kernel\".")
    print("4. Re-run this Code cell.")

Listing 3.1.1. Install and import datascience

(b)

To confirm that you can create datascience tables, copy-paste the contents of 3.1.2 into a Code cell to display a nicely formatted table of (incomplete) Scrabble tile data.

Table().with_columns([
    'letter', ['a', 'b', 'c', 'z'],
    'count',  [  9,   3,   3,   1],
    'points', [  1,   2,   2,  10],
])

Listing 3.1.2. Try out datascience

Activity 3.1.2.

As previously investigated, dictionaries can be used to describe complex data with many different properties. In 3.1.3 a list of dictionaries is used to describe several students in standard Python.

[
    {
        "name":             "Jessica",
        "age":              19,
        "favorite_teacher": "Dr. Clontz",
    },
    {
        "name":             "Lynn",
        "age":              18,
        "favorite_teacher": "Ms. Gilbreath",
    },
    {
        "name":             "Julia",
        "age":              17,
        "favorite_teacher": "Mr. Fullinwider",
    },
]

Listing 3.1.3. A list of student dictionaries

(a)

Copy this list into a Code cell, and add data for at least three more students to the list.

(b)

Data is often stored in a spreadsheet or database as a table. An illustration of this is given in 3.1.4.

Table 3.1.4. Student data table

Name	Age	Favorite Teacher
Julia	17	Mr. Fullinwider
Jessica	19	Dr. Clontz
Lynn	18	Ms. Gilbreath

Tables can be created programmatically by using the technique from 3.1.1 or by modifying 3.1.5.

Table(labels=["Name", "Age", "Favorite Teacher"]).with_rows([
    ["Jessica", 19, "Dr. Clontz"],
    ["Lynn", 18, "Ms. Gilbreath"],
    ["Julia", 17, "Mr. Fullinwider"],
])

Listing 3.1.5. Table from rows

Run from datascience import Table in a Code cell, and then copy 3.1.5 to another Code cell. Finally, append the student data you created in the previous task.

(c)

Since data usually isn't hard-coded into a script, but instead entered by folks who may not be programmers into a spreadsheet or similar file, we will need a way to read external files containing data.

Open a spreadsheet application such as Microsoft Excel or Google Sheets. Copy the table of data displayed in the previous task and paste it into your spreadsheet.

Save the spreadsheet to a Comma-Separated Values file named student_data.csv and use the (+) New menu to upload it to CoCalc in the same folder as this Jupyter notebook.

Finally, create a Code cell with the line Table.read_table('student_data.csv') to check that your CSV data can be imported into a datascience table.

Activity 3.1.3.

When working with data in industry, you're likely working with many more rows than can be displayed on your screen at a given time. In fact, so-called “Big Data” is one of the fastest-moving fields of mathematics and computer science. By working with datascience tables, you can filter data as needed for display and analysis.

(a)

Visit the website https://vincentarelbundock.github.io/Rdatasets/datasets.html to get a CSV with the transactions for a (fictional) pizza parlor for the year 2015. (Hint: Use Ctrl+F to search for “pizza”.) If you like, you can download the CSV and view it in a spreadsheet program such as Microsoft Excel.

For this activity however, you should right-click and copy the URL to the CSV file. In a Code cell, save this URL to the CS variable named pizza_url, and then import this data directly by running pizza_table = Table.read_table(pizza_url) and pizza_table.show(20).

Table 3.1.6. The first three rows of pizza_table

Unnamed: 0	id	date	time	name	size	type	price
1	2015-000001	2015-01-01	11:38:36	hawaiian	M	classic	13.25
2	2015-000002	2015-01-01	11:57:40	classic_dlx	M	classic	16
3	2015-000002	2015-01-01	11:57:40	mexicana	M	veggie	16

(b)

There are almost fifty thousand transactions stored in this dataset. You could write pizza_table.show() without specifying how many records to to try and show them all, but CoCalc will complain that 50,000 rows are too many.

Insert an integer into the .show() tool to display exactly one hundred rows from the dataset.

(c)

The code pizza_table.where("date","2015-02-01").show(10) allows you to view the first ten records of a pizza purchased on Feburary 1, 2015.

Display the first 20 pizza pies purchased on \(\pi\)-Day (that is, March 14, 2015).

(d)

You can filter by any column of the table by modifying the where() arguments. Display the first 20 medium pizzas purchased in this parlor.

(e)

Display the first 20 “Mexicana” pizzas purchased in this parlor. (Hint: make sure you filter the data based on how its string appears in the table.)

(f)

Chain two where()s and a show() together to display the first 20 large-sized veggie pizzas purchased in this parlor.

(g)

Running pizza_table.where("date","2015-07-22").sort("size").show() displays the large pizzas sold on July 22nd first, followed by medium, small, and extra-large pizzas (in that order). Running pizza_table.where("date","2015-07-22").sort("size",descending=True).show() displays them in the order extra-large, small, medium, then large.

Write a sentence or two explaining why you suspect the datascience library orders the pizzas in this way.

(h)

Display all the Hawaiian pizzas ordered on April 12, once in increasing order, and once in descreasing order.

Activity 3.1.4.

While it's great to have data already organized in a CSV, most data is not already parsed for use directly by a data science library. Instead, data scientists must manually and/or programmatically convert the raw data into a suitable format.

In this activity, we will convert the text of a book in order to count the number of times each of its main characters is mentioned in each chapter.

(a)

Save a plain-text copy of Peter Pan by J. M. Barrie as the file peterpan.txt in the same folder as this notebook by using the link https://www.gutenberg.org/ebooks/16. This task may be confirmed by running f=open("peterpan.txt"); print(f.readline()); f.close() in a Code cell.

(b)

Open the file and manually delete the header and footer information, so the text file begins at the start of “Chapter 1” and ends with the words “THE END”. Re-run print(open("peterpan.txt").readline()), and you should see the line Chapter 1 PETER BREAKS THROUGH.

(c)

Fix the code in 3.1.7 to save the chapters of the book to a datascience table.

with open("peterpan.txt") as book_file:
    # Fix the order of the following lines
    book_table = Table().with_column("Chapter", book_chapters) # creates table of chapters
    book_chapters = book_string.split("Chapter ")[1:] # makes list of strings for each chapter
    book_table.show() # displays table
    book_string = book_file.read() # saves contents of the text file to a string

Listing 3.1.7. Import book text to a table

(d)

The code in 3.1.8 counts how many times Peter Pan is mentioned in each chapter of the book. Extend this code to also count how many times the Darling siblings Wendy, Michael, and John are mentioned in each chapter.

book_table = book_table.with_columns([
    "Peter", [c.count("Peter") for c in book_chapters],
])
book_table.show()

Listing 3.1.8. Counting how many times each character is mentioned

(e)

Sort this table to find the chapter where Wendy is mentioned the most, and the chapter where she is mentioned the least.

Exercises Exercises

1.

Repeat the necesary steps from the class activities in order to make pizza_table and book_table available for these exercises. Then run pizza_table.show(10) and book_table.show(10) to confirm.

2.

How many extra-extra-large pizzas were sold by the pizzeria represented by pizza_table?

3.

Display all the “green garden” pizzas purchased on October 1st.

4.

Display all the pizzas sold on February 14th, sorted from most to least expensive.

5.

Update book_table to include how frequently Tinker Bell is mentioned. (Note that she is often referred to as simply “Tink”.)

6.

Display book_table sorted by how frequently Tinker Bell is mentioned, and confirm that she is most often mentioned in Chapter 3 (\(22\) times).

http://data8.org/datascience/