In 1898, Russian economist Ladislaus Bortkiewicz published his first statistics book, Das Gesetz der kleinen Zahlen (The Law of Small Numbers), in which he included an example that eventually became famous for illustrating the Poisson distribution.
Bortkiewicz researched the annual deaths by horse kicks in the Prussian Army from 1875 to 1894. Data were recorded for 14 different army corps, one of which was the Guard Corps. (According to one Wikipedia article on the subject, the Guard Corps may have been responsible for Prussia’s elite Guard units.) Let's take a closer look at his data and see what Minitab has to say using a Poisson goodness-of-fit test.
Here's the data set (thank you, University of Alabama in Huntsville):
As a review, the Poisson distribution is a discrete probability distribution for the counts of events that occur randomly in a given interval of time or space. The Poisson distribution has only one parameter, lambda, which is its mean. To divert your attention just a little bit before we run our goodness-of-fit test, let’s look at how the distribution changes with different values of lambda. Go to Graph > Probability Distribution Plot > View Single. Select Poisson from the Distribution drop-down, enter 0.5 for the mean, and then press OK:
After I created my first plot, I created three more probability distribution plots with lambda set to 2, 4, and 10. I then used Minitab’s Layout Tool under the Editor menu to combine the four graphs.
As lambda increases, the graphs become more symmetric and begin to resemble a normal curve:
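For readers without Minitab at hand, the same trend can be checked numerically. The following is a minimal sketch in Python (not part of the original Minitab workflow); the helper names `poisson_pmf` and `poisson_skewness` are my own. It uses the fact that a Poisson distribution's skewness is 1 divided by the square root of lambda, so the skew shrinks toward zero as lambda grows.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_skewness(lam):
    """Skewness of a Poisson distribution; 0 would be perfectly symmetric."""
    return 1 / math.sqrt(lam)

# The four lambda values plotted in the post.
for lam in (0.5, 2, 4, 10):
    p0 = poisson_pmf(0, lam)
    print(f"lambda={lam:>4}: P(X=0)={p0:.4f}, skewness={poisson_skewness(lam):.2f}")
```

The shrinking skewness is the numerical counterpart of the plots looking more bell-shaped from left to right.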
Interesting, right? But let's get back on track and test if the overall data obtained by Bortkiewicz follows a Poisson distribution.
I first had to stack the data from 14 columns into one column. This is done via Data > Stack > Columns…
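Outside Minitab, stacking amounts to laying the columns end to end in a single list. Here is a small Python sketch of the idea; the corps labels and counts are toy placeholders for illustration, not Bortkiewicz's data.

```python
# Toy wide table: one list of yearly counts per corps.
# These numbers are placeholders for illustration only.
wide = {
    "Corps A": [0, 1, 0],
    "Corps B": [2, 0, 1],
}

# Stack column by column, keeping a label that records the source column.
stacked = [(corps, count) for corps, counts in wide.items() for count in counts]

# The single column of counts that a goodness-of-fit test would use.
deaths = [count for _, count in stacked]

print(deaths)  # [0, 1, 0, 2, 0, 1]
```

Keeping the source label next to each value is optional for the goodness-of-fit test, but it preserves which corps each count came from.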
With the data stacked, I went to Stat > Basic Statistics > Goodness-of-Fit for Poisson…, filling out the dialog as shown below:
After I clicked OK, Minitab delivered the following results:
The Poisson mean, or lambda, is 0.70. This means that we can expect, on average, 0.70 deaths per corps per year. If I had known these statistics and served in one of these army corps at that time, I would have treated my horse like gold. Anything my horse wanted, it would have gotten.
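To put a mean of 0.70 in perspective, here is a short Python sketch (my own arithmetic, assuming the Poisson model holds) of the chance that a given corps sees no deaths, or at least one death, in a given year.

```python
import math

lam = 0.70  # Poisson mean reported in the Minitab output

p_zero = math.exp(-lam)        # P(no deaths in a corps-year)
p_at_least_one = 1 - p_zero    # P(one or more deaths in a corps-year)

print(f"P(0 deaths)  = {p_zero:.3f}")
print(f"P(>=1 death) = {p_at_least_one:.3f}")
```

Under this model, a corps-year with at least one fatal kick is roughly as likely as one with none.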
Further down you’ll see a table showing the observed counts and the expected counts for the number of deaths by horse kick. The expected counts mirror the observed counts quite well. To check more formally whether these data can be modeled by a Poisson distribution, we can use the p-value for the Goodness-of-Fit Test in the last section of the output.
The hypotheses for the Chi-Square Goodness-of-Fit test for Poisson are:
H0: The data follow a Poisson distribution
H1: The data do not follow a Poisson distribution
We are going to use an alpha level of 0.05. Since our p-value is greater than alpha, we do not have enough evidence to reject the null hypothesis that the horse kick deaths per corps per year follow a Poisson distribution. In other words, the data are consistent with a Poisson model.
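The test can also be worked through by hand. The Python sketch below is an illustration under stated assumptions, not Minitab's exact procedure: the frequency table is the one commonly published for the 14-corps, 20-year data (the post's own table is not reproduced in the text), the last category is taken as "4 or more," and the p-value uses a closed-form chi-square tail that is valid only for 3 degrees of freedom.

```python
import math

# Frequency table commonly published for the 14-corps, 20-year data.
# Assumption: this matches the data set used in the post.
observed = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}  # key 4 stands for "4 or more"

n = sum(observed.values())                          # number of corps-years
lam = sum(k * c for k, c in observed.items()) / n   # estimated Poisson mean

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Expected counts for 0..3, with the remaining probability in the last cell.
probs = [poisson_pmf(k, lam) for k in range(4)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((observed[k] - e) ** 2 / e for k, e in enumerate(expected))

# Degrees of freedom: categories - 1 - one parameter estimated from the data.
df = len(expected) - 1 - 1
assert df == 3  # the tail formula below holds for 3 degrees of freedom only

# Upper-tail probability of a chi-square variable with 3 degrees of freedom.
half = chi_sq / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_sq / math.pi) * math.exp(-half)

print(f"lambda = {lam:.2f}, chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.3f}")
```

One caveat: the expected count in the last cell is below 5, which is the usual rule-of-thumb threshold for the chi-square approximation, so the p-value should be read with some care.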
The chart below shows how close the expected and observed values for deaths are to each other.
I've been thinking about what other data could have been collected to serve as potential predictors if we wanted to do a Poisson regression. We could then see if there were any significant relationships between our horse kick death counts and some factor of interest. Maybe corps location or horse breed could have been documented? Because each count covers one corps for one year, the location or breed would have to stay the same for that corps throughout that year. For example, Corps 14 in 1893 must have remained entirely in “Location A” during that year, or every horse in a particular corps must have been of the same breed for a particular year.
According to equusmagazine.com, horses kick for six reasons:
- "I feel threatened."
- "I feel good."
- "I hurt."
- "I feel frustrated."
- "Back off."
- "I'm the boss around here."
Wouldn’t this have made for a great categorical variable?