Skip to main content

Section 4.1 Centers of Data

Preview Activity 4.1.1.

A group of seven friends decides to compare their tax refunds, listed in 4.1.1.

Table 4.1.1. Seven tax refunds
$2,000 $2,000 $2,000 $2,500
$2,700 $2,800 $14,000

For each of the given quotes, write a sentence or two explaining whether you agree or disagree with the speaker.

(a)

“The most common tax refund is $2,000. Therefore, the average tax refund in this group is $2,000.”

(b)

“When ordered from least to greatest, the middle tax refund is $2,500. Therefore, the average tax refund in this group is $2,500.”

(c)

“When the seven refunds are added together and divided by \(7\text{,}\) the result is $4,000. Therefore, the average tax refund in this group is $4,000.”

Activity 4.1.2.

There are several notions of “average”, that is, several ways to describe the center of a dataset.

Often, “average” is used to describe the mean of a dataset. If a dataset contains the \(n\) values \(x_0,x_1,\dots,x_{n-1}\text{,}\) then the mean is defined by the following formula:

\begin{equation*} \frac{1}{n}\sum_{0\leq k< n} x_k = \frac{x_0+x_1+\dots+x_{n-1}}{n} \end{equation*}

For example, the mean of the counting numbers \(2,3,5,7,18\) is \(\frac{2+3+5+7+18}{5}=7\text{:}\) all the values in the collection are added together, and then divided by the amount of data in the collection.

(a)

Show that \(8\) is the mean of the counting numbers \(4,8,9,11\text{.}\)

(b)

Is it possible for the mean of a dataset to not belong to the dataset itself? If so, give an example of such a dataset. If not, explain why.

(c)

First, make a copy of the pizza_table used in previous sections available for this notebook.

Then, correct the code in 4.1.2 to output the correct mean of the costs of pizza sold on Nov 10.

# FIXME
# this code says the mean cost of a pizza is less than a dollar!
prices = pizza_table.where("date","2015-10-09").column("price")
len(prices)/sum(prices)
Listing 4.1.2. Broken code to calculuate mean cost of pizza

(d)

Find the mean cost of a large pizza sold during the year of sales recorded in pizza_table.

Activity 4.1.3.

The mode of a dataset is simply its most-repeated value. For example, the mode of the collection of values \(4,5,5,5,8,9,9,10\) is \(5\text{.}\)

(a)

Find the mode of the values \(8, 6, 1, 3, 3, 1, 9, 9, 1, 4\text{.}\)

(b)

It's possible that a dataset may have many different modes. For example, both \(2\) and \(5\) are modes for the values in the list [3,2,6,5,5,11,2,7].

Make up an example of a dataset with seven values that has both \(3\) and \(8\) as modes.

(c)

The Python library statistics provides convenient tools for computing statistical measures. For example, from statistics import mean and mean(prices) is a shortcut to compute the mean of a list of values in Python.

Replace mean with mode to write a program that outputs the mode of the list [3,2,6,5,5,11,2,7].

(d)

While earlier we noted that there are two different modes for the list [3,2,6,5,5,10,2,7], the mode tool only returns the mode that appears first in the list.

Show how to use multimode in place of mode to return a list of all modes for [3,2,6,5,5,11,2,7].

(e)

Compute the mode of the pizza prices sold on Oct 09.

(f)

Compute the mode of the prices of large pizzas sold in 2015.

Activity 4.1.4.

When data is ordered from least to greatest, the middle value is called its median. For example, the median of \(1,1,6,7,9\) is \(6\text{.}\)

(a)

Find the median of \(9,12,3,0,11,9,2\text{.}\)

(b)

If a dataset has two middle values, \(a\) and \(b\text{,}\) then their mean \(\frac{a+b}{2}\) is used as the median. For example, the median of \(4,6,7,9,10,13\) is \(\frac{7+9}{2}=8\text{.}\)

Make up a collection of six numbers such that its median is \(5.5\text{.}\)

(c)

Use the statistics library to compute the median of the pizza prices sold on Oct 09.

(d)

Compute the median of the prices of large pizzas sold in 2015.

(e)

Create a histogram and box plot for the prices of pizzas sold on Oct 09. Do these visualizations effectively communicate the mean, mode, or median computed earlier? What might the visualizations communicate that isn't described by the mean, mode, or median? Write a few sentences summarizing your thoughts.

Exercises Exercises

1.

Find the mean of the values in the list [2, 2, 6, 6, 2, 9, 4, 2, 6] by hand. Then verify your answer using a Code cell.

2.

Find the mode of the values in the list [2, 2, 6, 6, 2, 9, 4, 2, 6] by hand. Then verify your answer using a Code cell.

3.

Find the median of the values in the list [2, 2, 6, 6, 2, 9, 4, 2, 6] by hand. Then verify your answer using a Code cell.

4.

If a dataset is known to contain no repeated values, which measure of center would be least useful: mean, mode, or median? Write a sentence or two explaining why.

5.

Make up a collection of five numbers such that its mode, median, and mean are all \(9\text{.}\)

6.

Make up a collection of five different numbers such that its median and mean are both \(9\text{.}\)

7.

Make a copy of pizza_table available in this notebook. Find the mean, mode, and median of the prices of classic-type pizzas sold in 2015.

8.

Create a histogram and box plot of the classic-type pizza price data summarized by the measures of center in the previous exercise. Write a few sentences comparing/contrasting the numerical measures of center with these visualizations.