Reading sections are from the Introductory Statistics Textbook
With quantitative data, you can find the center by using the mean, the median, and the mode. But how do you calculate the center of categorical data? What is the middle of the categorical set {Exhausted, Tired, Normal, Energized}, taking into account the probabilities of each category?
To calculate the center, we use the expected value.
Finding the expected value can only be done if there is an order to the categories. In other words, we are working with an ordinal variable.
Calculating the expected value can be done in 4 steps:
This is the mean! (If you are interested in seeing that the expected value is, in fact, the mean, look at my proof.) Here’s the full equation:
\[E[X] = \mu = \sum x\cdot P(x)\]
This can be applied in cases where you have values for each category. If there are no values, we can assign values. Let me give you two examples, one for each scenario.
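The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_value` and the check that the probabilities sum to 1 are my own assumptions, not part of the textbook reading.

```python
def expected_value(values, probabilities):
    """Return the sum of x * P(x) over all categories.

    `values` and `probabilities` are parallel lists: one numeric value
    and one probability per category (hypothetical helper for this page).
    """
    if len(values) != len(probabilities):
        raise ValueError("need exactly one probability per value")
    # Assumption: the probabilities describe every category, so they sum to 1.
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Multiply each value by its probability, then add up the results.
    return sum(x * p for x, p in zip(values, probabilities))


# Two equally likely categories with values 0 and 1.
print(expected_value([0, 1], [0.5, 0.5]))  # 0.5
```

The function takes plain lists so that it works the same way whether the values were given to you or assigned by you.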
First, we have an example where values are already given to each category.
I am involved in a program where I pay $1000 for a class. If I successfully pass the class, I get half of the payment back. If I fail the class, I lose the entire amount.
Putting those into concrete numbers:
If 80% of the students pass the class, what is the average amount paid per student?
Start by creating a table:
| Category | Pass | Fail |
|---|---|---|
| Value (\(x\)) | $500 | $1000 |
| Probability (\(P(x)\)) | 0.80 | 0.20 |
To find the expected value, multiply the values by their probabilities, then add up those results.
| Category | Pass | Fail |
|---|---|---|
| Value (\(x\)) | $500 | $1000 |
| Probability (\(P(x)\)) | 0.80 | 0.20 |
| \(x\cdot P(x)\) | $500*0.80 = $400 | $1000*0.20 = $200 |
Adding those results gives $400 + $200 = $600. So, the expected value, or the average amount paid per student, is $600.
Does this make sense? Most of the students pass, so they only pay $500. But some fail and pay $1000, which pulls the average above $500. So $600 does make sense.
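If you want to check the class example by computer, here is a minimal sketch. The dictionary names (`amount_paid`, `probability`) are made up for this illustration. The numbers are the ones from the table.

```python
# Net amount paid and probability for each outcome, taken from the table.
amount_paid = {"Pass": 500, "Fail": 1000}
probability = {"Pass": 0.80, "Fail": 0.20}

# x * P(x) for each category.
contributions = {
    outcome: amount_paid[outcome] * probability[outcome]
    for outcome in amount_paid
}

# The expected value is the sum of those products.
average_paid = sum(contributions.values())

for outcome, product in contributions.items():
    print(f"{outcome}: ${product:.2f}")
print(f"Average paid per student: ${average_paid:.2f}")
```

Changing the pass rate (say, to 0.50 and 0.50) shows how the average moves toward $1000 as more students fail.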
Now, we’ll address a scenario where we don’t have values for the categories.
After exercising, I track if I feel tired or energetic. I get the following data:
These categories have an order, from exhausted at one extreme to energized at the other. So, I’m going to assign 0 to the lowest extreme (exhausted) and 3 to the highest (energized), with 1 for tired and 2 for normal in between. Now, I have a numerical way of representing each category.
| Category | Exhausted | Tired | Normal | Energized |
|---|---|---|---|---|
| Value (\(x\)) | 0 | 1 | 2 | 3 |
Now, I add the relative frequencies of each category, and multiply the \(x\) values by \(P(x)\):
| Category | Exhausted | Tired | Normal | Energized |
|---|---|---|---|---|
| Value (\(x\)) | 0 | 1 | 2 | 3 |
| Probability (\(P(x)\)) | 12/56 = 0.214 | 19/56 = 0.339 | 16/56 = 0.286 | 9/56 = 0.161 |
| \(x\cdot P(x)\) | 0*0.214 = 0 | 1*0.339 = 0.339 | 2*0.286 = 0.572 | 3*0.161 = 0.483 |
Adding up those results gives me my expected value.
\[\begin{align*} E[X] &= 0\cdot 0.214 + 1\cdot 0.339 + 2\cdot 0.286 + 3\cdot 0.161 \\ &= 0 + 0.339 + 0.572 + 0.483 \\ &= \mathbf{1.394} \end{align*}\]
Now, what does this number mean? On average, my energy level is at 1.394. That is part way between 1 and 2. But remember that 1 means “Tired” and 2 means “Normal”. So, on average, I am somewhere between Tired and Normal, a little closer to Tired.
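The same calculation can be sketched in Python, using the counts from the probability row of the table (12, 19, 16, and 9 out of 56). The variable names are my own, and I use exact fractions instead of rounded decimals, which is an assumption about how you might want to do it. Without rounding each probability first, the result is 78/56, or about 1.393, a hair below the 1.394 from the hand calculation. The interpretation does not change.

```python
from fractions import Fraction

# Counts per category, listed from lowest to highest energy.
counts = {"Exhausted": 12, "Tired": 19, "Normal": 16, "Energized": 9}

# Assign 0, 1, 2, 3 in that order (dicts keep insertion order).
scores = {category: score for score, category in enumerate(counts)}

total = sum(counts.values())  # 56 observations

# x * P(x) for each category, where P(x) = count / total, kept exact.
expected = sum(
    scores[category] * Fraction(counts[category], total)
    for category in counts
)

print(expected)                   # 39/28
print(round(float(expected), 3))  # 1.393
```

Remember that the scores 0 through 3 are a choice. A different scoring would give a different number, so the result is best read relative to the labels ("between Tired and Normal").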
And there you have it! We have found the average of a categorical variable.