
Section 4.2 Spread of Data

Preview Activity 4.2.1.

Consider the data in the lists [13, 13, 12, 10, 10, 14, 12, 11, 10, 14, 15, 10, 11, 13, 14, 14, 14, 15, 14, 11] and [19, 12, 13, 9, 15, 5, 7, 12, 9, 14, 8, 20, 19, 15, 13, 8, 14, 14, 7, 17].

(a)

Show that the mean, median, and mode of the data in the first list are the same as those of the second list.

(b)

Draw a dot plot or histogram visualizing the data in each list.

(c)

What does the visualization tell you about the data that the measures of center do not?
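As a starting point for part (a), Python's standard statistics module provides mean, median, and mode functions. The sketch below uses a small made-up list (not the activity data) only to show the calls; substituting each list from the activity is left to the reader.

```python
from statistics import mean, median, mode

# Illustrative values only; replace with a list from the activity.
sample = [2, 3, 3, 5, 7]

sample_mean = mean(sample)      # sum of the values divided by how many there are
sample_median = median(sample)  # middle value once the data are sorted
sample_mode = mode(sample)      # most frequently occurring value

print(sample_mean, sample_median, sample_mode)  # prints 4 3 3
```

Two lists can share all three of these measures and still look very different when plotted, which is the point of parts (b) and (c).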

Activity 4.2.2.

Often we are interested in the spread of values in a dataset. The simplest measure of spread is the range of data: the difference between its largest value (its maximum) and its smallest value (its minimum). For example, the range of the values in \(3,3,6,7,12,12,16,19\) is \(\text{max}-\text{min}=19-3=16\text{.}\)
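The worked example above can be reproduced in a few lines of Python. This is a minimal sketch using only the built-in max and min functions; the variable names are chosen here for illustration.

```python
# Values from the worked example above.
values = [3, 3, 6, 7, 12, 12, 16, 19]

largest = max(values)             # the maximum
smallest = min(values)            # the minimum
data_range = largest - smallest   # range = max - min

print(largest, smallest, data_range)  # prints 19 3 16
```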

(a)

Find the maximum, minimum, and range for the values \(9, 2, 3, 5, 3, 6, 2, 7\text{.}\)

(b)

Find the maximum, minimum, and range for the values \(5, 1, 3, 998, 2, 5, 8, 2\text{.}\)

(c)

What do you notice about the previous task? How does the range relate (or not relate) to the majority of the data?

(d)

Get a copy of pizza_table available for this notebook.

Set classic_prices = pizza_table.where("type","classic").column("price"), and then use min(classic_prices) and max(classic_prices) to describe the minimum price, maximum price, and range of prices for classic pizzas sold by the pizzeria.

(e)

Display a box plot and histogram for the table of classic pizza prices pizza_table.where("type","classic").select("price").

(f)

Notice that the box plot displays outliers in the data as dots: the values that are further away from the median than most points of data.

Ignoring outliers, what is an approximate range of the classic pizza prices?

Activity 4.2.3.

The absolute mean deviation measures what its name suggests: how far, on average, the data values deviate from the dataset's mean.

(a)

Draw dot plots for the data in Table 4.2.1 and Table 4.2.2.

Table 4.2.1. Dataset One

1	5	5	5	5
6	6	6	6	6
6	7	7	7	12

Table 4.2.2. Dataset Two

1	2	3	4	5
6	6	6	6	6
7	7	9	10	12

(b)

Confirm that both datasets have the same mean, median, mode, and range.

(c)

Despite these common measurements, based on your dot plots, which dataset appears to have values that are spread out more than the other?

(d)

To find a way to measure this spread numerically, flesh out the code given in Listing 4.2.3 to print lists of the absolute deviations of each data value from the mean.

data_one = [] # fill in values from first dataset
data_two = [] # fill in this too

# if means are different, data was entered incorrectly
from statistics import mean
assert mean(data_one) == mean(data_two), "Means should be the same"
print("The mean for each dataset is FIXME") # print out the mean of the datasets

deviations_one = [
    abs( value - mean(data_one) ) # measure distance of value from the mean
    for value in data_one # do this for each value of data
]
print(deviations_one)
deviations_two = [] # TODO
print(deviations_two) # prints [5, 4, 3, 2, 1, 0, 0, 0, 0, 0, 1, 1, 3, 4, 6]

Listing 4.2.3. Broken code to calculate deviations of values

(e)

The mean of each list created in the previous task is the absolute mean deviation of the original datasets. Compute these values, and confirm that the absolute mean deviation is higher for the dataset that has more spread-out values.
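The two steps described in this activity (list the absolute deviations, then average them) can be wrapped in one function. The sketch below is one possible arrangement, not the only one; the function name absolute_mean_deviation and the sample values are hypothetical and are not the activity's datasets.

```python
from statistics import mean

def absolute_mean_deviation(data):
    """Return the average distance of the values in data from their mean."""
    center = mean(data)
    # Distance of each value from the mean, ignoring direction.
    deviations = [abs(value - center) for value in data]
    return mean(deviations)

# Illustrative values only: the mean is 5 and the deviations are [3, 1, 1, 3].
sample = [2, 4, 6, 8]
amd = absolute_mean_deviation(sample)
print(amd)  # prints 2
```

Calling the same function on each of the two datasets gives a direct numerical comparison of their spread.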

Activity 4.2.4.

While the absolute mean deviation provides an easy-to-implement measurement of data spread, an alternative formula is used to compute standard deviation.

(a)

Let \(M\) be the mean of a collection of \(N\) data values \(x_0,x_1,\dots,x_{N-1}\text{.}\) Which formula measures absolute mean deviation, Listing 4.2.4 or Listing 4.2.5?

\begin{equation*} \frac{|x_0-M|+|x_1-M|+\dots+|x_{N-1}-M|}{N} \end{equation*}

Listing 4.2.4. Spread formula 1

\begin{equation*} \sqrt{\frac{(x_0-M)^2+(x_1-M)^2+\dots+(x_{N-1}-M)^2}{N-1}} \end{equation*}

Listing 4.2.5. Spread formula 2

(b)

The other formula shown above measures standard deviation. While the formulas have similarities, the standard deviation formula has certain nice mathematical properties that make it more commonly used by data scientists. (We defer exploring what those advantages are exactly to more advanced statistics courses.)

We wrote our own program to compute absolute mean deviations, but the standard deviation of a list may be computed in Python simply by using from statistics import stdev. Show that the standard deviations of the datasets studied in the previous activity are not exactly the same as the absolute mean deviations, but they still do the same job of communicating how one dataset is spread out more than the other.
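As a rough check on how stdev behaves, the sketch below computes a standard deviation twice for a small made-up list: once by hand, using the square-root formula with \(N-1\) in the denominator, and once with statistics.stdev. The sample values and variable names are assumptions chosen for illustration.

```python
from math import isclose, sqrt
from statistics import mean, stdev

sample = [2, 4, 6, 8]  # illustrative values only
center = mean(sample)

# Square each deviation, add them up, divide by N - 1, then take the square root.
sum_of_squares = sum((value - center) ** 2 for value in sample)
manual_stdev = sqrt(sum_of_squares / (len(sample) - 1))

# The library function should agree with the hand computation.
library_stdev = stdev(sample)

print(round(manual_stdev, 3), round(library_stdev, 3))  # prints 2.582 2.582
```

For this list the absolute mean deviation is 2, so the two measures differ in value even though both describe the same spread.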

(c)

Compute the standard deviation for classic_prices.

(d)

Compute the standard deviation for the prices of large pizzas sold in 2015.

(e)

Write a few sentences explaining your intuition for why the standard deviation for the prices of classic pizzas was larger/smaller than the standard deviation for the prices of large pizzas.

(f)

There is no hard-and-fast mathematical definition for what makes a value an outlier: different areas of study may allow for more or less tolerance for “unusual” values. However, one common assumption (known as the three-sigma rule) is that non-outlier data should be within three standard deviations of the mean, that is, each distance of a piece of non-outlier data from the mean should be less than \(3\times\text{stdev}\text{.}\)

Explain why the three-sigma rule requires that the range of non-outlier data should be less than six standard deviations, or \(6\times\text{stdev}\text{.}\)
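The three-sigma rule described above can also be sketched in code. The function name split_by_three_sigma and the price values below are hypothetical; the mean and standard deviation are computed from all of the values, including any that end up flagged.

```python
from statistics import mean, stdev

def split_by_three_sigma(data):
    """Separate data into (non_outliers, outliers) using the three-sigma rule."""
    center = mean(data)
    cutoff = 3 * stdev(data)  # largest distance from the mean allowed for a non-outlier
    non_outliers = [v for v in data if abs(v - center) < cutoff]
    outliers = [v for v in data if abs(v - center) >= cutoff]
    return non_outliers, outliers

# Made-up values: nineteen prices near 10 and one unusually large price.
prices = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
          10, 10, 9, 11, 10, 10, 9, 11, 10, 60]

kept, flagged = split_by_three_sigma(prices)
print(flagged)  # prints [60]

# The range of the non-outlier values stays below six standard deviations.
print(max(kept) - min(kept) < 6 * stdev(prices))  # prints True
```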

(g)

For example, if a dataset has a standard deviation of \(4\text{,}\) it's reasonable to expect that the range of non-outlier data is less than \(24\text{.}\)

Confirm that the datascience library generates a box plot of large pizza prices that shows no outliers. Then confirm that the range of large pizza prices is less than six times its standard deviation.

(h)

Use a box plot of classic pizza prices to show that the range of its non-outlier data is less than six standard deviations.

Exercises

1.

Find the maximum, minimum, and range for the values \(12,4,9,9,4,10,803,11,5\text{.}\)

2.

Which of the above values would you most consider to be an outlier, and why?

3.

Show how to find the absolute mean deviation for \(12,4,9,9,4,10,803,11,5\text{,}\) either by hand or by computer.

4.

Show how to use technology to compute the standard deviation for \(12,4,9,9,4,10,803,11,5\text{.}\)

5.

If you remove the outlier value from \(12,4,9,9,4,10,803,11,5\text{,}\) would you expect the range to increase or decrease? What about absolute mean deviation? What about standard deviation? And why?

6.

Confirm your expectation by computing the range, absolute mean deviation, and standard deviation for \(12,4,9,9,4,10,803,11,5\) with the outlier value removed.

7.

Without using technology to check, would you expect the range and standard deviation to be higher for the prices of all medium-sized pizzas sold in a year, or for the prices of all veggie pizzas sold in a year?

8.

Confirm your intuition by getting a copy of pizza_table available in this notebook, and computing the ranges and standard deviations for the medium pizza prices and veggie pizza prices.