
What in the World Is a VIF?


When you run a regression in Minitab, you receive a huge batch of output, and often it can be hard to know where to start. A lot of times, we get overwhelmed and just go straight to p-values, ignoring a lot of valuable information in the process. This post will give you an introduction to one of the other statistics Minitab displays for you, the VIF, or Variance Inflation Factor. 

To start, let's look at what the VIF tells us. It's essentially a way to measure the effect of multicollinearity among your predictors. What is multicollinearity? It's simply a term used to describe when two or more predictors in your regression are highly correlated. 

The VIF measures how much the variance of an estimated regression coefficient increases if your predictors are correlated. More variation is bad news; we're looking for precise estimates. If the variance of the coefficients increases, our model isn't going to be as reliable. 

So how are the VIF values calculated? Let's take a look at Minitab Help's regression example to see how it's done.

Each predictor in your model will have a VIF value. In our case, we have a response that is measuring the total heat flux from solar energy powered homes, being predicted by the position of the focal points in 3 different directions, East, South, and North. We can run a regular regression, and get the following Minitab regression output:

So how are the VIFs calculated? Essentially, we take the predictor in question, and regress it against all of the other predictors in our model. If you have your columns in Minitab, you can simply go to Stat > Regression > Regression > Fit Regression Model. In the Response field, enter the predictor in question. In our case, we'll choose South. In the continuous predictors field, you can enter the other predictors in the model, East and North for us here. Then, we simply run the regression.

We need one key piece of output from this regression, and that's the R-Sq value:

In this case, the R-Sq value is 0.1707 (that is, 17.07%). Then we use the following formula to calculate the VIF:

VIF = 1 / (1 − R²)

By the formula, 1/(1-.1707) = 1.21, our VIF. 
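If you'd like to see the same arithmetic outside of Minitab, here is a minimal Python sketch (using made-up columns standing in for the East, South, and North focal-point measurements) that follows the two steps described above: regress one predictor on the remaining predictors, then apply VIF = 1/(1 − R²).

```python
import numpy as np

def vif(predictors, j):
    """VIF for column j: regress column j on the remaining columns,
    take the R-squared of that regression, and return 1 / (1 - R-squared)."""
    y = predictors[:, j]
    X = np.delete(predictors, j, axis=1)
    X = np.column_stack([np.ones(len(X)), X])         # add an intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit
    resid = y - X @ beta
    r_sq = 1 - resid.var() / y.var()                   # R-squared of that regression
    return 1 / (1 - r_sq)

# Hypothetical stand-in data: columns play the role of East, South, North
rng = np.random.default_rng(1)
X = rng.normal(size=(29, 3))
print([round(vif(X, j), 2) for j in range(X.shape[1])])
```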

If you take the square root of the variance inflation factor, that value tells you how much larger the standard error of that coefficient is, compared to what it would be if the predictor were uncorrelated with the other predictors.

So in our case, for the South factor, the standard error of the coefficient is SqRt(1.21) = 1.1 times as large as it would be if South were uncorrelated with the other predictors in the model, which is not a meaningful increase. A VIF around 1 is very good.

There are some guidelines we can use to determine whether our VIFs are in an acceptable range. A rule of thumb commonly used in practice is if a VIF is > 10, you have high multicollinearity. In our case, with values around 1, we are in good shape, and can proceed with our regression. 


Combining Tools of the Past and Present: The i-Test


Just 100 years ago, very few statistical tools were available and the field was largely unknown. Since then, there has been an explosion of tools available, as well as ever-increasing awareness and use of statistics.  

While most readers of the Minitab Blog are looking to pick up new tools or improve their use of commonly-applied ones, I thought it would be worth stepping back and talking about one that was used before we had so many options at our disposal: the i-Test.

Like most early analysis tools, the i-test had to be simple to compute because computers were not available. It simply returned a binary response—"pass" or "fail" in most cases—that was usually meant to confirm or deny some prior assumption. This equates to what we call the null and alternative hypotheses today.  

But even without a calculator or computer, in most cases this test was lightning-fast to use and virtually any employee had it at their disposal, unlike current statistical tools.

No one has ever been credited with its invention, and as far as I can tell from my research, it has never been documented in any peer-reviewed journals. Perhaps it was obvious enough that it was simultaneously developed by multiple scholars around the same time.  

Given that better tools and computing power were not available at the time, those using it had quite the advantage over those who did not, so it quickly became a tool that virtually everyone used. Just as with the modern statistical tests you are familiar with doing in Minitab, the combination of simplicity, accuracy (at least for the time), and power made it a necessary tool for anyone wanting to make decisions based on something other than a guess or intuition.

Then along came the tools and computers we have today, and something sad and unfortunate happened. As people learned these new techniques, which were clearly more powerful and even more accurate, the i-test was almost completely forgotten as a potentially useful method among those who could suddenly do ANOVA and regression and t-tests.  

It seemed the only people still relying on it were those who didn't understand modern statistics, and they were soon being easily outperformed. The few who cling to relying solely on the i and reject what statistics have to offer are usually mocked.

But I've noticed something. While those using modern statistical tools to make decisions have been easily outperforming those who still rely on the outdated i-test, there is a subset of people who use both. They rely on p-values and R-Squared and confidence intervals, but they also employ the old standby i-test, just to make sure it agrees with what the statistics are telling them.  

These people are the highest performers, devastatingly accurate in their analysis and rarely making a bad decision based on data. They are confident in their results because they've double-checked them, and they easily convince others of what they've learned. They are the James Bond of the data analysis world, embracing the latest technology while never letting go of timeless values!

You probably call it the eye-test.  Make sure it never leaves your ever-growing toolbox.

10 Statistical Terms Designed to Confuse Non-Statisticians


Statisticians say the darndest things. At least, that's how it can seem if you're not well-versed in statistics. 

When I began studying statistics, I approached it as a language. I quickly noticed that compared to other disciplines, statistics has some unique problems with terminology, problems that don't affect most scientific and academic specialties. 

For example, dairy science has a highly specialized vocabulary, which I picked up when I was an editor for Penn State's College of Agricultural Sciences. I found the jargon fascinating, but not particularly confusing to learn. Why? Because words like "rumen" and "abomasum" and "omasum" simply don't turn up in common parlance. They have very specific meaning, and there's little chance of misinterpreting them.  

Now open up a statistics text and flip to the glossary. There are plenty of statistics-specific terms, but you're going to see a lot of very common words as well. The problem is that in statistics, these common words don't necessarily mean what they do outside statistics.

And that means that if you're not well versed in statistics, or even if it's just been a while since you thought about it, understanding statistical results—whether it's a research report on the news or an analysis done by a co-worker—can be a real challenge. Sometimes it seems like the language of statistics was designed to be confusing. 

That's one of the reasons we incorporated The Assistant into Minitab Statistical Software. This interactive menu guides you through your analysis and presents your results without ambiguity, making them easy to interpret if you aren't a statistician, and making them easy to share if you are one. 

Here are 10 common words that are also routinely used in statistics. Those of us who are practicing data analysis and sharing the results with others need to keep in mind the differences between what these words mean to statisticians, and what they mean to everyone else. 

1. Significant 

When most people say something is "significant," they mean it's important and worth your attention. But for statisticians, significance refers to the odds that what we observe is not simply a chance result. Statisticians know that on a practical level, significant results often have no importance at all. This distinction between practical and statistical significance is easy for people to overlook. 

2. Normal

Normally, people who say something is "normal" mean that it's ordinary or commonplace. We call a temperature of 98.6 degrees Fahrenheit "normal." What's more, when something isn't "normal," it often carries negative connotations: "That knocking from my car's engine isn't normal."  But to statisticians, data is “normal” when it follows the familiar bell-shaped curve, and there's nothing wrong with data that isn't normal. But it's easy for the uninitiated to conflate "nonnormal data" with "bad data." 

3. Regression

In everyday usage, regression means shrinkage or backwards movement. When the dog you're training has a bad day after a few positive ones, you might say his behavior regressed. Unless you're a statistician, you wouldn't immediately think "regression" refers to predicting an output variable based on input variables.

4. Average

In statistics, the arithmetic average (or mean) is the sum of the observations divided by the number of observations. When most people hear and say the word "average," they're not thinking about a mathematical value but rather a qualitative judgment, meaning “so-so,” "normal" or "fair." 

5. Error

Error is a measure of an estimate’s precision—if you're a statistician. To everyone else, errors are just mistakes. 

6. Bias

In statistics, bias refers to a systematic difference between the measurements taken by a particular tool or gauge and a reference value; it describes the accuracy of a measurement system. In everyday usage, however, bias refers to preconceptions and prejudices that affect a person's view of the world. 

7. Residual

For most people who aren't statisticians, residuals is a fancy word for leftovers, not the difference between observed and fitted values.

8. Power

Usually we talk about power in terms of impact and control. Influence. So the fact that a statistical test can be powerful but not influential seems contradictory, unless you already know it refers to the probability of finding a...um...significant effect when one truly exists.   

9. Interaction

People use this word to talk about their communications with others. For statisticians, an interaction means that the effect of one factor depends on the level of another factor.

10. Confidence 

In statistics, a confidence interval is a range of values, derived from a sample, that is likely to contain the true value of a population parameter. The confidence level is the percentage of such intervals that would contain the population parameter if you sampled the population many times and computed an interval each time.
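If that repeated-sampling idea feels abstract, here is a small simulation sketch in Python (not from the original article) that draws many samples from a known population and counts how often a 95% interval for the mean actually captures the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, covered, trials = 50, 0, 10_000

for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=10, size=30)
    m, se = sample.mean(), stats.sem(sample)
    t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # 95% two-sided t critical value
    low, high = m - t_crit * se, m + t_crit * se
    covered += low <= true_mean <= high

print(f"{covered / trials:.1%} of the intervals contained the true mean")
# Expect something close to 95%: that percentage is the confidence level.
```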

Outside of its technical meaning in statistics, the word "confidence" carries an emotional charge that can instantly create unintended implications. All too often, people interpret statistical confidence as meaning the researchers really believe in their results. 

These 10 terms are just a few of the most confusing double-entendres found in the statistical world. Terms like sample, assumptions, stability, capability, success, failure, risk, representative, and uncertainty can all mean different things to the world outside our small statistical circle.

Making an effort to help the people we communicate with appreciate the technical meanings of these terms as we use them would be an easy way to begin promoting higher levels of statistical literacy. 

What do you think the most confusing terms in statistics are? 

Coming Soon: The Big Ten 4th Down Calculator


Down 7-0 midway through the 1st quarter of the College Football Playoff National Championship game, Ohio State was facing a 4th and 2 at the Oregon 35 yard line. Buckeye coach Urban Meyer had a decision to make. Attempt a 52 yard field goal, punt and try to pin Oregon deep inside their own territory, or attempt to gain the 2 yards and get a fresh set of downs. Meyer decided to go for it. Ohio State got the first down, scored a touchdown on the drive, and didn't trail again the remainder of the game.

Clearly Ohio State made the correct decision. Right?

Too often we wait until we see the outcome of the play before we decide whether a coach’s decision was correct. If Meyer had failed on the 4th down conversion, certainly we would have said it was the wrong decision. But coaches can’t tell the future. So how could we determine what the correct decision is before the play is run? The answer is a 4th down calculator using Minitab Statistical Software.

How the Calculator Works

To make an informed decision on 4th down, you need to know two things. First is your probability of getting a first down. Obviously your probability increases as the yards you need to gain decrease. Second is the expected points a team would score with a first down at a specific field position. If you go for it on 4th down and fail, your opponent will be more likely to score than if you punted due to better field position. But is that risk outweighed by the increase in expected points you’ll gain by successfully converting on 4th down? That’s the question the calculator will answer.

Why Specifically the Big 10?

There are already multiple 4th down calculators out there using NFL data, so I wanted to stick to college football since I couldn’t find any calculators that use college data. And because of the massive time investment it took to collect the data, I had to limit my scope. So I decided to collect data from games that involved only teams from the Big 10 Conference. While the resulting model could probably be applied to any college football team, I’m only going to apply it to Big 10 Conference games this season, particularly because I want to track each team’s decisions.

You always hear stat-heads saying that on 4th down, coaches should be punting less and going for it more. (Spoiler alert: My model is surely going to echo the same sentiments.) But you don’t see anybody compiling what happens next. Are coaches who punt on 4th and short from midfield winning the battle of field position and scoring next? Or are their opponents scoring anyway, despite having to drive the length of the field? By tracking both the decision and the result, we can compare the theoretical expected points to what actually happened. And in a perfect world we’ll also have a number of coaches who go for it, so we can compare the two decisions directly and see who scores more points.

Tracking this for every college football team seemed daunting. But doing so for the fourteen Big 10 teams seems feasible. So at least for now, the Big 10 it is!

Back to the National Championship Game

So did Ohio State make the correct decision? Let’s find out! There are still some variables that we don’t (and can’t) know. For example, if they punt, where will the ball be downed? And if they successfully convert the 4th and 2, how many yards will they gain? So we’re going to make some assumptions.

  • If Ohio State punted, they would down the ball at the 1 yard line
  • If they convert the 4th down, they would only gain the minimum yards needed

This assumes the best case scenario for a punt, and the worst case scenario for a successful 4th down conversion.

Our model gives Ohio State a 62% chance of gaining the 1st down, and a 47% chance of making a 52 yard field goal. If Oregon starts with the ball on their own 1 yard line, their expected points are -1.24. That means that even though Oregon has the ball, Ohio State is more likely to be the next team that scores. If Oregon starts at their own 35 yard line (after a failed 4th down or field goal attempt), their expected points are 1.33. And if Ohio State gains 2 yards and has a 1st and 10 on the Oregon 33, their expected points are 3.75. Now we can determine what the correct decision is!

Expected Points FG Att = (3*.47) – (1.33*.53) = 0.71

Expected Points Punt = 1.24

Expected Points Go for it = (3.75*.62) – (1.33*.38) = 1.82
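Here is the same arithmetic as a quick Python sketch (the probabilities and expected-point values are the ones quoted above, not output from the model itself), so you can plug in different assumptions and see how the comparison changes.

```python
# Values from the discussion above
p_convert    = 0.62   # chance of gaining the first down on 4th and 2
p_field_goal = 0.47   # chance of making a 52-yard field goal
ep_pin_deep  = -1.24  # Oregon's expected points starting at their own 1 (after a punt)
ep_their_35  = 1.33   # Oregon's expected points starting at their own 35
ep_our_33    = 3.75   # Ohio State's expected points with 1st and 10 at the Oregon 33

ep_punt = -ep_pin_deep
ep_fg   = 3 * p_field_goal - ep_their_35 * (1 - p_field_goal)
ep_go   = ep_our_33 * p_convert - ep_their_35 * (1 - p_convert)

for label, ep in [("Punt", ep_punt), ("Field goal", ep_fg), ("Go for it", ep_go)]:
    print(f"{label:10s} expected points: {ep:5.2f}")
# Going for it comes out ahead: about 1.82 vs. 1.24 (punt) and 0.71 (field goal).
```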

So Meyer made the correct decision, regardless of the outcome of the play. And this assumes the best case scenario for a punt and the worst case scenario for a successful 4th down conversion. If you change either of our assumptions, the numbers will favor going for it even more. And of course the outcome worked out in Ohio State’s favor, as they ended up scoring a touchdown on the drive.

Those are the type of situations that will be analyzed this fall. In the following weeks I’ll write a series of posts detailing the model that will be used. Then starting in September, I’ll do a weekly post after each Big 10 Conference game (I won't include non-conference games so I avoid games like Ohio State vs. Hawaii and Wisconsin vs. Troy) summarizing each team’s 4th down decisions from the weekend. 

So is your coach costing your team points by making suboptimal 4th down decisions? Stay tuned to find out!

 

Kappa Studies: What Is the Meaning of a Kappa Value of 0?


Kappa statistics are commonly used to indicate the degree of agreement of nominal assessments made by multiple appraisers. They are typically used for visual inspection to identify defects. Another example might be inspectors rating defects on TV sets: Do they consistently agree on their classifications of scratches, low picture quality, poor sound?  Another application could be patients examined by different doctors for a particular disease: How often will the doctors' diagnoses of the condition agree?

A Kappa study will enable you to understand whether an appraiser is consistent with himself (within-appraiser agreement), coherent with his colleagues (inter-appraiser agreement) or with a reference value (standard) provided by an expert. If the kappa value is poor, it probably means that some additional training is required.

The higher the kappa value, the stronger the degree of agreement.

When:

  • Kappa = 1, perfect agreement exists.
  • Kappa < 0, agreement is weaker than expected by chance; this rarely happens.
  • Kappa close to 0, the degree of agreement is the same as would be expected by chance.

But what exactly is the meaning of a Kappa value of 0?

Remember your years at school? Suppose that you are expected to take a very difficult multiple-choice examination, and that for this particular subject you unfortunately had no time at all to review any course material, due to family constraints and other very understandable and valid reasons. Suppose that this exam gives you five possible choices and only one correct answer for every question.

If you tick randomly to select one choice per question, you might end up having 20% correct answers, by chance only. Not bad after all, considering the minimal amount of effort involved, but in this case, a 20% agreement with the correct answers would result in a…kappa score of 0.

Kappa Measure in Attribute Agreement Analysis

In an attribute agreement analysis, the kappa measure takes into account the agreement occurring by chance only.

The table below shows the odds of by-chance agreement between correct answer and appraiser assessment:

To estimate the Kappa value, we compare the observed proportion of correct answers to the proportion of correct answers expected by chance alone:

Kappa = (P_observed − P_chance) / (1 − P_chance)
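As a quick sketch of that calculation in Python (the 20% figures echo the exam example above, where pure guessing across five choices gives 20% agreement by chance):

```python
def kappa(p_observed, p_chance):
    """Kappa from the observed and by-chance agreement proportions."""
    return (p_observed - p_chance) / (1 - p_chance)

# Guessing on a 5-choice exam: about 20% agreement, and 20% expected by chance
print(kappa(0.20, 0.20))   # 0.0 -> no agreement beyond chance
# Perfect agreement
print(kappa(1.00, 0.20))   # 1.0
```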

  

Kappas can be used only with binary or nominal-scale ratings; they are not really relevant for ordered-categorical ratings (for example, "good," "fair," "poor").

Kappas are not restricted to visual inspection in a manufacturing environment. A call center might use this approach to rate the way incoming calls are dealt with, or a tech support service might use it to rate the answers provided by employees. In a hospital, this approach could be used to rate the adequacy of health procedures implemented for different types of situations or different symptoms.

Where could you use Kappa studies? 

 

High School Researchers: What Do We Do with All of this Data?


by Colin Courchesne, guest blogger, representing his Governor's School research team.  

High-level research opportunities for high school students are rare; however, that was just what the New Jersey Governor’s School of Engineering and Technology provided. 

Bringing together the best and brightest rising seniors from across the state, the Governor’s School, or GSET for short, tasks teams of students with completing a research project chosen from a myriad of engineering fields, ranging from biomedical engineering to, in our team's case, industrial engineering.

Tasked with analyzing, comparing, and simulating queue processes at Dunkin’ Donuts and Starbucks, our team of GSET scholars spent five days tirelessly collecting roughly 250 data points on each restaurant. Our data included how much time people spent waiting in line, what type of drinks customers ordered, and how much time they spent waiting for their drinks after ordering.

data collection interface
The students used a computerized interface to collect data about customers in two different coffee shops.

But once the data collection was over, we reached a sort of brick wall. What do we do with all this data? As research debutantes not well versed in the realm of statistics and data analysis, we had no idea how to proceed.

Thankfully, the helping hand of our project mentor, engineer Brandon Theiss, guided us towards Minitab.

Getting Meaning Out of Our Data

Our original, raw data told us nothing. In order to compare data between stores and create accurate process simulations, we needed a way to sort the data, determine descriptive statistics, and assign distributions; it is these very tools that Minitab offered. Getting started was both easy and intuitive.

First, we all managed to download Minitab 17 (thanks to the 30-day trial). Our team then went on to learn the ins and outs of Minitab, both through instructional videos on YouTube as well as helpful written guides, all of which are provided by Minitab. Less than an hour later, we were able to navigate the program with ease.

The nature of the simulations our team intended to create called for us to identify the arrival process for each store, the distributions for the wait time of a customer in line at each restaurant, as well as the distributions for the drink preparation time, sectioned off by both restaurant as well as drink type. In order to input this information into our simulation, we also needed certain parameters that were dependent on the distribution. Such parameters ranged from alpha and beta values for Gamma distributions to means and standard deviations for Normal distributions.

Thankfully, running the necessary hypothesis tests and calculating each of these parameters was simple. We first used the “Goodness of fit for Poisson” test in order to analyze our arrival rates.

All Necessary Information

Rather than having to fiddle with equations and arrange cells like in Excel, Minitab quickly provided us with all necessary information, including our P-value to determine whether the distribution fit the data as well as parameters for shape and scale.

As for distributions for individual drink preparation times, the process was similarly simple. Using the “Individual Distribution Identification” tool, Minitab ran a series of hypothesis tests, comparing our data against a total of 16 possible distributions. The software output graphs along with P-values and Anderson-Darling values for each distribution, allowing us to graphically and empirically determine the appropriateness of fit. 
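Outside of Minitab, a rough equivalent of that screening step can be sketched in Python with scipy; this is only an illustration of the idea (with stand-in data), not the team's actual workflow: fit each candidate distribution, then compare goodness-of-fit statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical drink-preparation times in seconds (stand-in data)
rng = np.random.default_rng(42)
prep_times = rng.gamma(shape=4, scale=20, size=250)

candidates = [stats.norm, stats.lognorm, stats.gamma, stats.weibull_min]
for dist in candidates:
    params = dist.fit(prep_times)                        # maximum-likelihood fit
    ks_stat, p_value = stats.kstest(prep_times, dist.name, args=params)
    print(f"{dist.name:12s}  KS={ks_stat:.3f}  p={p_value:.3f}")
# Larger p-values (and smaller KS statistics) indicate a better fit.
```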

Probability Plot for Latte S

Within 3 hours, we had sorted and analyzed all of our data.

Not only was Minitab a fantastic tool for our analysis purposes, but the software also provided us with a graphical platform, a means by which to produce most of the graphs used in our research paper and presentation. Once we determined which distribution to use with what data, we used Minitab to output histograms with fitted data distributions for each set of data points. The ease of use for this feature served to save us time, as a series of simple clicks allowed us to output all 10 of our required histograms at the same time.

Histogram of Line Time S

The same tools first used to analyze our data were then finally used to analyze the success of our simulations; we ran a Kolmogorov-Smirnov test to determine whether two sets of data—in this case, our observed data and the data output by our simulation—share a common distribution. Like most other features in Minitab, it was extremely easy to use and provided clear and immediate feedback on the results of the test, both graphically and through the requisite critical and KS values.
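For reference, the same kind of two-sample comparison can be sketched with scipy's Kolmogorov-Smirnov test; the arrays here are hypothetical stand-ins for the observed and simulated wait times.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
observed  = rng.exponential(scale=90, size=250)   # stand-in observed line times
simulated = rng.exponential(scale=90, size=250)   # stand-in simulation output

ks_stat, p_value = stats.ks_2samp(observed, simulated)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
# A large p-value means there is no evidence the two samples come from
# different distributions, i.e., the simulation matches the observed data.
```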

Empirical CDF of IcedSimulated vs Actual

Research isn’t always fun. It’s often long, tedious, and amounts to nothing. Thankfully, that wasn’t our case. Using Minitab, our entire analysis process was simple and painless. The software was easy to learn and was able to run any test quickly and efficiently, providing us with both empirical and graphical evidence of the results as well as high-quality graphs which were used throughout our project. It really was a pleasure to work with.

GSET Coff(IE) Team

—The GSET COFF[IE] Team, whose members were Kenneth Acquah, Colin Courchesne, Sheela Hanagal, Kenneth Li, and Caroline Potts. The team was mentored by Juilee Malavade and Brandon Theiss, PE. Photo courtesy Colin Courchesne. 

 

 

About the Guest Blogger:

Colin Courchesne was a scholar in the 2015 New Jersey Governor's School of Engineering and Technology, a summer program for high-achieving high school students. Students in the program complete a set of challenging courses while working in small groups on real-world research and design projects that relate to the field of engineering. Governor’s School students are mentored by professional engineers as well as Rutgers University honors students and professors, and they often work with companies and organizations to solve real engineering problems.

 

Would you like to publish a guest post on the Minitab Blog? Contact publicrelations@minitab.com.

 

  

The Bubble Plot: It's A Beautiful Display


As you may know, we added Bubble Plots to Minitab's menu of meaningful graphs in Release 17. If you are familiar, I think you'll agree that Bubble Plots make a perfect addition to the pantheon of impressive and powerful plots that you can produce in Minitab. They’re great. Of course, they would have been even greater if they used my idea...but that’s spilt milk under the bridge now.

If you haven’t met the Bubble Plot yet, it’s a lot like a scatterplot, only the dots on the plot (a.k.a. the aforementioned “bubbles”) are different sizes so you can visualize the value of a 3rd variable in addition to the x-variable and the y-variable. For example, the bubble plot below shows gross sales (in thousands of dollars) on the y-axis and quarter of the year (1 through 4) on the x-axis. The size of each bubble indicates the number of orders that were received during the quarter.

The first bubble (far left) shows that the company earned approximately $350,000 in revenue during Quarter 1. The second bubble is smaller and lower than the first bubble, which indicates that both the number of orders and total sales revenues were down in Q2 as compared to Q1. Things rebounded a bit in Q3. Q4 was in progress when the graph was made, but the preliminary data look promising. The red bubble was added to show the projected orders and sales for Q4.

It’s a great graph, and it really speaks to you. But it doesn’t quite sing.

Boring Bubble Plot

You can’t say that I didn’t try. I traveled endlessly up and down the hallowed corridors of Minitab and shared my idea with all who would listen (and several who would not).

I started with a visit to the Software Development department. The developers seemed generally impressed with my idea. At least they smiled a lot while I was explaining it. But they said that I should talk to Research and Development first, so I ventured over there.

The R&D folks were inquisitive and asked thoughtful questions like, “You want to do what?” and “Are you serious?” and “Is that even legal?”  In order to address that last question, I took a trip to the Legal Department.

Initially, I was concerned that the folks in Legal would talk over my head. I imagined that they would use Latin words and legal jargon and cite obscure precedents from volumes of landmark court cases. In the end, however, I found them to be quite plain-spoken. I think the exact words were, “Ain’t nobody got time for that.” Legal then sent me to Human Resources. As a reward for my brilliance, HR added an extensive psychotherapy rider to my existing health insurance policy and encouraged me to use it. Which I did. (It’s going very well, by the way. I’m learning a lot about my mother.)

You get the idea. I basically got the run around. Frankly, I think that everyone is simply jealous or embarrassed that they didn’t think of this themselves. Especially since it’s so obvious when you think about it. I mean, instead of settling for a mere bubble plot, who wouldn’t want to showcase their data in a fabulous Bublé Plot!

Introducing the Bublé Plot

Just think of the extra attention that you’ll garner at your next meeting when your data are brought to life...not by boring old bubbles, but by the viral and vivacious visage of the one and only Mr. Michael Bublé!

The Bublé Plot

Now there’s a graph that just sings out to you. Looking at that graph, how can you possibly doubt that things are looking up? (Or at least looking left?)

And when that happy day comes and you finally do meet those fourth-quarter projections, how do you want to receive the good news? Would you rather stare blankly at expressionless bubbles? Or crack a smile with the chart that smiles back with a look that says, “You did it, Kid! You’re the greatest. That therapist doesn’t know what he’s talking about.”

You did it Kid!

I know which graph I’d rather use. Reminds me of a song ...

I’m not surprised
There’s been a slump
I’m not gonna let that get me,
down in the dumps
Revenues they come in,
and expenditures out
We get all worked up
then we let our guard down

We’ve tried so very hard to improve it
Now is not the time for excuses
Let’s think of every source of variability

And I know that Q4 it’ll all turn out
They’ll make us work so we can work to work it out
And we promised, yes we did, and will, but we haven’t quite met
Fourth quarter projections yet

 

 

Credit for the original image of a smiling Mr. Bublé goes to www.vancityallie.com.  Credit for the original image of a smoldering Mr. Bublé goes to Dallas Bittle. Both are available under Creative Commons License 2.0. 

Credit for the bubbles in the first plot go to the colors Blue and Red, and to the letter Q. All are creative, common, and available in Minitab Statistical Software.

The Null Hypothesis: Always “Busy Doing Nothing”


The 1949 film A Connecticut Yankee in King Arthur's Court includes the song “Busy Doing Nothing,” and this could be written about the Null Hypothesis as it is used in statistical analyses. 

The words to the song go:

We're busy doin' nothin'
Workin' the whole day through
Tryin' to find lots of things not to do

And that summarises the role of the Null Hypothesis perfectly. Let me explain why.

What's the Question?

Before doing any statistical analysis—in fact even before we collect any data—we need to define what problem and/or question we need to answer. Once we have this, we can then work on defining our Null and Alternative Hypotheses.

The null hypothesis is always the option that maintains the status quo and results in the least amount of disruption, hence it is “Busy Doin’ Nothin'”. 

When the probability of the Null Hypothesis is very low and we reject the Null Hypothesis, then we will have to take some action and we will no longer be “Doin Nothin'”.

Let’s have a look at how this works in practice with some common examples.

  • Question: Do the chocolate bars I am selling weigh 100g?
    Null hypothesis: Chocolate weight = 100g.
    Doing nothing: If I am giving my customers the right size chocolate bars, I don’t need to make changes to my chocolate packing process.

  • Question: Are the diameters of my bolts normally distributed?
    Null hypothesis: Bolt diameters are normally distributed.
    Doing nothing: If my bolt diameters are normally distributed, I can keep using any statistical techniques that assume normality.

  • Question: Does the weather affect how my strawberries grow?
    Null hypotheses: The number of hours of sunshine has no effect on strawberry yield; the amount of rain has no effect on strawberry yield; temperature has no effect on strawberry yield.
 

Note that the last instance in the table, investigating if weather affects the growth of my strawberries, is a bit more complicated. That's because I needed to define some metrics to measure the weather. Once I decided that the weather was a combination of sunshine, rain and temperature, I established my null hypotheses. These all assume that none of these factors impact the strawberry yield. I only need to control the sunshine, temperature and rain if the probability that they have no effect is very small.
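For the chocolate-bar example in the list above, the corresponding test would be a one-sample t-test of the null hypothesis that the mean weight is 100g. A minimal sketch in Python, with made-up weight data:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of chocolate bar weights in grams
weights = np.array([99.8, 100.4, 100.1, 99.6, 100.3, 99.9, 100.2, 100.0])

t_stat, p_value = stats.ttest_1samp(weights, popmean=100)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A large p-value: fail to reject the null hypothesis, so we keep
# "doing nothing" to the chocolate packing process.
```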

Is Your Null Hypothesis Suitably Inactive?

So in conclusion, in order to be “Busy Doin’ Nothin’,” your Null Hypothesis has to be the statement that maintains the status quo, so that as long as it is not rejected, no action is required.


Big Ten 4th Down Calculator: Creating a Model for Expected Points


If you want to use data to predict the impact of different variables, whether it's for business or some personal interest, you need to create a model based on the best information you have at your disposal. In this post and subsequent posts throughout the football season, I'm going to share how I've been developing and applying a model for predicting the outcomes of 4th down decisions in Big 10 games. I hope sharing my experiences will help you, whether the questions you want to answer are about football or business logistics. 

Here are some questions I was looking to answer when I began thinking about creating a 4th down calculator. If you have a 1st and 10 at your opponent’s 20-yard line, on average you’ll score more points than if you have the ball at your own 20-yard line. But how many more? And how does that number change as you move to different positions on the field? And what if you’re playing on the road as opposed to playing at home?

If you’re trying to use analytics to determine what the best decision is on 4th down, you need to know how many points you (or your opponent) would be expected to score on the ensuing 1st down. So my first step in creating a Big Ten 4th down calculator was to use Minitab Statistical Software to model a team’s expected points on 1st and 10 from anywhere on the field.

The Data

I went through every Big Ten conference game the last two seasons. For each instance a team had 1st and 10, I recorded the field position and the next score. If your opponent was the next team to score, then the value for the next score was negative. If nobody scored before halftime or the end of the game (depending on which half they were in) the value was 0.

I only included conference games because many non-conference games are one-sided (I’m looking at you, Ohio State vs. Kent State in 2014). I also didn’t include the conference championship game, since I want to account for home field advantage and that game is played at a neutral site. Finally, I did my best to exclude drives that ended prematurely because of halftime and drives in the 4th quarter of blowouts.

I ended up with 5,496 drives over the two seasons.  You can get both the raw and summarized data here.

A bar chart can give us a quick glance at what the most common score is.

Bar Chart

The most common outcome when you have possession of the ball is that you score a touchdown. No revelation there. But surprisingly, it was actually more common for your opponent to get the ball back and score a touchdown than it was for you to kick a field goal. I wouldn’t have expected that.

So now let’s see what happens when we account for the field position and home field advantage.

A Model for Expected Points

I grouped field position into 5-yard intervals. Then for each group, I took the average of the next score. So first, let’s look at a fitted line plot of the data, without accounting for home field advantage.

Fitted Line Plot

The regression model fits the data very well. The R-squared value indicates that 96.4% of the variation in Expected Points can be explained by the number of yards to the end zone. That’s fantastic! I added a reference line at the point where the expected value is 0. It crosses our regression line at a distance to the end zone of approximately 85 yards. That suggests you have to be inside your own 15 yard line before the team on defense is more likely to be the next team to score.

Now let’s factor in home field advantage. We’ll start by examining a scatterplot that will show the difference in expected points for home and away teams at each yard line group.

Scatterplot

In 17 of the 20 groups, the home team has a higher number of expected points than the away team. And in the 3 cases where the away team is higher, the two values are very close. This gives strong evidence that we need to account for home field advantage. I ran a regression analysis to confirm that we should include that game location in our model.

Regression

The p-value for location is less than 0.05, and the R-squared value remains very high. I can now use these two equations (one for home games, one for away games) to predict how many points a team with a first down will score from anywhere on the field.
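As a sketch of that modeling step outside Minitab, here is how the same additive model could be fit in Python with statsmodels; the column names and file name are hypothetical stand-ins for the yard-line group averages described above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per 5-yard group and location, with columns
# yards_to_endzone, location ("Home"/"Away"), and expected_points.
df = pd.read_csv("big_ten_expected_points.csv")   # placeholder file name

model = smf.ols("expected_points ~ yards_to_endzone + C(location)", data=df).fit()
print(model.summary())   # check the p-value for C(location) and the R-squared

# An interaction term ("yards_to_endzone * C(location)") can be tested the
# same way, which is exactly what the next section does.
```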

Testing the Interaction Between Home Field Advantage and Yards to the End Zone

There is one last thing I want to look into. Is there an interaction between our two terms? Think about it this way: Say you have 1st and goal inside your opponent’s 10 yard line. You’re so close to the end zone, it seems like it might not matter whether you’re at home or on the road.

Now imagine you have a 1st and 10 inside your own 10 yard line. It seems like a much more daunting task to drive the length of the field on the road with the hostile crowd roaring than it would be with the cheers of a friendly home crowd.

In other words, does the effect of home field advantage increase the further a team is from the end zone? Intuitively, it seems like it should. But we should run a regression analysis to see if the data supports that notion.

Regression

The data does not support my intuition. The p-value for the interaction term is much higher than 0.05, indicating that it is not a significant term, and thus that we should not include it in our model. To visualize why, let’s revisit the previous scatterplot, but this time I'll add regression lines to each group.

Scatterplot

If there were an interaction between our two terms, we would expect the two lines to be close together at small distances to the end zone. Then they should move farther apart as the yards to the end zone increase. But you can see here that the lines are pretty parallel to each other. So we can safely remove the interaction term from our model.

The Final Model

Let’s take a final look at the model created by this regression analysis.

Model Equations

The equations indicate that if you start a drive on the road, you’ll be expected to score approximately 0.6 fewer points than you would if you were playing at home. Because there is no interaction term, the slopes are the same for both equations. The value of -0.075 means that for every yard you move away from the end zone, your expected points decrease by 0.075. So if you decide to punt the football away and get a net of 40 yards (the average in the Big Ten last year), this model indicates you’ll have saved yourself about 3 points on average.

Of course, that 3 points assumes that you turned the ball over on downs. But a third option exists: successfully converting on 4th down.

Will the reward of a successful conversion outweigh the risk of losing those 3 points you would gain by punting? That all depends on the probability of successfully converting on 4th down. And that’s exactly what I'll look at in my next post. Once we can determine the probability of converting on 4th down, we’ll be able to get some data-driven insights into what the correct decision is on 4th down. Stay tuned!

 

Interpreting Results from a Split-Plot Design


When performing a design of experiments (DOE), some factor levels may be very difficult to change—for example, temperature changes for a furnace. Under these circumstances, completely randomizing the order in which tests are run becomes almost impossible. To minimize the number of factor level changes for a Hard-to-Change (HTC) factor, a split-plot design is required.

Why Do We Want to Randomize a Designed Experiment?

Randomization means that the experimental tests are run in a random order specified by Minitab, and that factor level changes occur randomly. Randomization in a DOE is desirable, because it helps ensure that factor estimates are not biased by long-term drifts during the experiments. 

Suppose that, due to environmental conditions, a gradual change in temperature takes place during the tests. This gradual change may affect factor effect estimates. If most of the tests are performed at the lower setting for a particular factor at the beginning of the experiment, and then most of the tests at the end use the upper settings of this factor, then temperature effect and the drift will be erroneously attributed to this factor, leading to biases and wrong conclusions.

That's why complete randomization in a DOE is desirable. When you can randomize, a drift in environmental conditions will not have a systematic effect on some factor estimates. There will certainly be a random impact, but not a systematic one. But when HTC factors need to be studied, randomization is not always possible.

Enter the split-plot design.

What Is a Split-Plot Design?

The first designs of experiments were agricultural experiments at the beginning of the 20th century. Think about a large field in which experiments need to be performed to test different types of plant varieties, fertilizers, soil treatments, etc.

This experimental field may be divided into plots, and different treatments will be allocated to these plots. If the number of level changes needs to be minimized for a specific HTC factor, large plots of the field will be used. These are referred to as whole plots.

For an Easy-to-Change (ETC) factor, smaller plots may be used. We create these sub-plots by subdividing the whole plots into the smaller subplots. 

In the presence of spatial variability in the experimental field, we can expect soil variations to be smaller when subplots are located close to one another, whereas we would expect soil variations to be much larger between whole plots since they are located further from one another.

In a manufacturing context, whole plots represent long-term variations and sub plots represent short-term variations. To estimate long-term variations, whole plots need to be replicated.

To minimize complex modifications of settings, the levels of HTC factors are never changed within a whole plot and all the runs within a whole plot are performed together, in the same period of time.

The field above has been divided into four whole plots, and the whole plots have then been subdivided into subplots.

A split-plot design array as displayed in Minitab Statistical Software appears below, with different colors for whole plots and subplots. In the HTC column, the 1 or -1 settings are changed much less often than in the ETC column:

There are two main sources of variations to be considered in a split plot design: short-term variation (between subplots: SP) and long-term variation (between whole plots: WP). Hard-to-Change (WP) factors are affected by long term variability whereas Easy-to-Change (SP) factors are affected by short term variability.

  • HTC + : treatment at the upper setting for the Hard to change factor
  • HTC - : treatment at the lower setting for the Hard to change factor
  • ETC + : treatment at the upper setting for the Easy to change factor
  • ETC - : treatment at the lower setting for the Easy to change factor

To determine whether a factor is significant or not (according to the F test), the effects of ETC factors will be compared to the short-term error term (SP) only, whereas the effects of HTC factors will be compared to the long term error term (WP) only.

Also, two estimates are displayed in a split plot design analysis: a short term R² and a long term R².

To compute the short term R² value (SP), the amount of variability which is explained by short-term, Easy-to-Change (SP) factors is compared to the short-term overall (subplot sum of squares) variability.

To estimate the long-term R² value (WP), the amount of variability explained by long term Hard-to-Change (WP) factors will be compared to the long term overall (WP sum of squares) variability only.
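One way to sketch this kind of analysis in Python is with a mixed model, treating the whole plot as a random effect so that whole-plot (long-term) variation is separated from subplot (short-term) variation. This is only an analogue of Minitab's split-plot analysis, and the column and file names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: response y, HTC and ETC factor settings (-1/+1),
# and whole_plot, an ID for the group of runs made without changing HTC.
df = pd.read_csv("split_plot_runs.csv")   # placeholder file name

# A random intercept for each whole plot captures long-term (WP) variation;
# the residual captures short-term (SP) variation between subplots.
model = smf.mixedlm("y ~ HTC * ETC", data=df, groups=df["whole_plot"]).fit()
print(model.summary())
```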

Conclusion

In a split plot design, two error terms need to be considered (short term and long term) separately, and two R² values need to be computed (short term and long term). The analysis may look more complex, but that makes the interpretation of the DOE results a lot more realistic.

 

Considering Defects and Defectives via the Republican Primary


The difference between defects and defectives lets you answer questions like whether to use a P chart or a U chart in Minitab, so it’s a handy difference to be able to explain. Of course, if you’ve explained it enough times—or if someone’s explained it to you enough times—the whole thing can get a little tired.

Fortunately, a new explanation of defects and defectives is one more way we can entertain ourselves with the candidates from the 2016 Republican Presidential Primary, even though it’s only 2015. Ready? Here we go!

Defectives

A defective item is not acceptable for use, but when we do a statistical analysis we don’t have to be overly literal. In politics, when we do a poll about whether a voter will vote for a certain candidate, we’re using the same math that we do when we talk about defective items. The voter either votes for the candidate or doesn’t vote for the candidate. The voter is either useful to the candidate or not useful to the candidate, in terms of election results.

So when a Fox News poll reported on August 3rd that 26% of their poll respondents who answered a question about who they would choose in the Republican primary chose Donald Trump, the other 74% of the respondents were defectives—as far as their usefulness to Donald Trump in that poll.

26% respond that they would vote for Trump. The other 74% are of no use to him in this poll.

Defects

A defect is any departure from specifications, but a single defect does not make an item unacceptable for use. In fact, an item can have multiple defects and the defects might not even be noticeable to the person who needs to use the item.

People who do not vote for a candidate are defectives from the perspective of their usefulness to the candidate, but might or might not have defects.

An interesting example of defects comes from grammarly.com’s Grammar Power Rankings, which check the grammar of candidates’ supporters on Facebook. Grammarly determined, for example, that Carly Fiorina’s supporters wrote comments on her Facebook page that contained 6.3 grammatical errors per 100 words. Individual posts can have a higher or lower rate of defects, but the candidate might not care at all. The post, as an item, is still usable.

Grammar mistakes are good examples of defects: in a Facebook post, they probably won't make the post impossible to use.
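The arithmetic behind the two ideas is different, and a tiny sketch makes the contrast plain (the counts below are hypothetical, chosen to echo the 26% and 6.3-per-100-words figures above).

```python
# Defectives: each unit simply passes or fails, so we track a proportion.
respondents = 1000                    # hypothetical poll size
trump_votes = 260                     # 26% would vote for Trump
proportion_defective = (respondents - trump_votes) / respondents
print(f"Proportion 'defective' (of no use to Trump): {proportion_defective:.0%}")

# Defects: one unit can contain several, so we track a rate per unit.
total_words = 10_000                  # hypothetical sample of Facebook comments
grammar_errors = 630                  # 6.3 errors per 100 words
defects_per_100_words = grammar_errors / total_words * 100
print(f"Defects per 100 words: {defects_per_100_words}")
```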

Wrap up

If you need a more traditional explanation of defects and defectives, that information is in the Minitab Support Center (plus a lot more).

If you’re in the Assistant, then you can click on “What are you counting” to get an explanation right when you need it. With the support you get with Minitab, you can spend less time looking for answers and more time making decisions.

Using Probability Distribution Plots to See Data Clearly


When we take pictures with a digital camera or smartphone, what the device really does is capture information in the form of binary code. At the most basic level, our precious photos are really just a bunch of 1s and 0s, but if we were to look at them that way, they'd be pretty unexciting.

In its raw state, all that information the camera records is worthless. The 1s and 0s need to be converted into pictures before we can actually see what we've photographed.

We encounter a similar situation when we try to use statistical distributions and parameters to describe data. There's important information there, but it can seem like a bunch of meaningless numbers without an illustration that makes them easier to interpret.

For instance, if you have data that follows a gamma distribution with a scale of 8 and a shape of 7, what does that really mean? If the distribution shifts to a shape of 10, is that good or bad? And even if you understand it, how easy would it be explain to people who are more interested in outcomes than statistics?

Enter the Probability Distribution Plot

That's where the probability distribution plot comes in. Making a probability distribution plot using Minitab Statistical Software will create a picture that helps bring the numbers to life. Even novices can benefit from understanding their data’s distribution.

Let's take a look at a few examples.

Changing Shape

A building materials manufacturer develops a new process to increase the strength of its I-beams. The old process fit a gamma distribution with a scale of 8 and a shape of 7, whereas the new process has a shape of 10. 

estimates

The manufacturer does not know what this change in the shape parameter means, and the numbers alone don't tell the story. 

But if we go in Minitab to Graph > Probability Distribution Plot, select the "View Probability" option, and enter the information about these distributions, the impact of the change will be revealed.

Here's the original process, with the shape of 7:

And here is the plot for the new process, with a shape of 10: 

The probability distribution plots make it easy to see that the shape change increases the number of acceptable beams from 91.4% to 99.5%, an 8.1% improvement. What's more, the right tail appears to be much thicker in the second graph, which indicates the new process creates many more unusually strong units. Hmmm...maybe the new process could ultimately lead to a premium line of products.
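For what it's worth, you can reproduce those percentages with a couple of lines of Python. The lower specification limit of 30 is inferred here because it matches the 91.4% and 99.5% quoted above; it isn't stated explicitly in the output shown.

```python
from scipy import stats

lower_spec = 30   # assumed lower spec limit for beam strength

old = stats.gamma.sf(lower_spec, a=7, scale=8)    # old process: shape 7, scale 8
new = stats.gamma.sf(lower_spec, a=10, scale=8)   # new process: shape 10, scale 8

print(f"Old process acceptable: {old:.1%}")   # about 91.4%
print(f"New process acceptable: {new:.1%}")   # about 99.5%
```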

Communicating Results

Suppose a chain of department stores is considering a new program to reduce discrepancies between an item’s tagged price and the amount charged at the register. Ideally, the system would eliminate any discrepancies, but a ± 0.5% difference is considered acceptable. However, implementing the program will be extremely expensive, so the company runs a pilot test in a single store. 

In the pilot study, the mean improvement is small, and so is the standard deviation. When the company's board looks at the numbers, they don't see the benefits of approving the program, given its cost. 

communicate results data

The store's quality specialist thinks the numbers aren't telling the story, and decides to show the board the pilot test data in a probability distribution plot instead: 

By overlaying the before and after distributions, the specialist makes it very easy to see that price differences using the new system are clustered much closer to zero, and most are in the ± 0.5% acceptable range. Now the board can see the impact of adopting the new system. 

Comparing Distributions

An electronics manufacturer counts the number of printed circuit boards that are completed per hour. The sample data is best described by a Poisson distribution with a mean of 3.2. However, the company's test lab prefers to use an analysis that requires a normal distribution and wants to know if it is appropriate.

The manufacturer can easily compare the known distribution with a normal distribution using the probability distribution plot. If the normal distribution does not approximate the Poisson distribution, then the lab's test results will be invalid.

As the graph indicates, the normal distribution—and the analyses that require it—won’t be a good fit for data that follow a Poisson distribution with a mean of 3.2.
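A quick way to sketch that comparison in Python, assuming a normal distribution with the same mean and variance as the Poisson (the natural candidate for an approximation):

```python
import numpy as np
from scipy import stats

mean = 3.2
k = np.arange(0, 11)

poisson_pmf = stats.poisson.pmf(k, mu=mean)
normal_pdf = stats.norm.pdf(k, loc=mean, scale=np.sqrt(mean))  # same mean and variance

for ki, p, n in zip(k, poisson_pmf, normal_pdf):
    print(f"{ki:2d}  Poisson {p:.3f}   Normal {n:.3f}")
# The normal curve is symmetric and puts some probability below 0, while the
# Poisson(3.2) is discrete and right-skewed, so the approximation is poor.
```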

Creating Probability Distribution Plots in Minitab

It's easy to use Minitab to create plots to visualize and to compare distributions and even to scrutinize an area of interest.

Let's say a market researcher wants to interview customers with satisfaction scores between 115 and 135. Minitab’s Individual Distribution Identification feature shows that these scores are normally distributed with a mean of 100 and a standard deviation of 15. However, the analyst can’t visualize where his subjects fall within the range of scores or their proportion of the entire distribution.

1. Choose Graph > Probability Distribution Plot > View Probability.
2. Click OK.

dialog box

3. From Distribution, choose Normal.
4. In Mean, type 100.
5. In Standard deviation, type 15.
6. Click on the "Shaded Area" tab.

distribution plot dialog box 2

7. In Define Shaded Area By, choose X Value.
8. Click Middle.
9. In X value 1, type 115.
10. In X value 2, type 135.
11. Click OK.

Minitab creates the following plot: 

distribution plot

About 15% of sampled customers had scores in the region of interest (115-135). This is not a very large percentage, so the researcher may face challenges in finding qualified subjects.
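You can check that shaded-area figure with a quick calculation (a sketch, not part of the original article):

```python
from scipy import stats

p = stats.norm.cdf(135, loc=100, scale=15) - stats.norm.cdf(115, loc=100, scale=15)
print(f"P(115 <= score <= 135) = {p:.1%}")   # about 14.9%, i.e. roughly 15%
```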

Using Probability Distribution Plots

Just like your camera when it assembles 1s and 0s into pictures, probability distribution plots let you see the deeper meaning of the numbers that describe your distributions. You can use these graphs to highlight the impact of changing distributions and parameter values, to show where target values fall in a distribution, and to view the proportions that are associated with shaded areas. These simple plots also clearly and easily communicate these advanced concepts to a non-statistical audience that might be confused by hard-to-understand concepts and numbers. 

Who Will Win in 2016? Ask Someone from Connecticut


Presidential Seal

There's more data available today than ever before, and with statistical software such as Minitab, it only takes a couple of seconds to get some significant insights, whether it concerns how to make your business run better or national politics.

For instance, if we look back at the last 9 presidential elections (1980 to 2012), there are some interesting correlations between the percent of state votes for Democrats/Republicans and the percent voting for Democrats/Republicans nationally.

The bar chart below shows how each state's percent Democratic vote count correlates with the percent national vote count. (Each state's votes were taken out of the national count before the correlation was calculated, so that a state like California didn't have a high correlation just because it has a large proportion of the national vote.)
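The "take the state's votes out of the national count before correlating" step could be sketched like this in Python; the column names and data layout are hypothetical, and the actual analysis was done in Minitab.

```python
import pandas as pd

# Hypothetical layout: one row per state per election year, with the
# Democratic vote count and total vote count for that state and year.
votes = pd.read_csv("state_votes_1980_2012.csv")   # placeholder file name

national = votes.groupby("year")[["dem_votes", "total_votes"]].sum()

correlations = {}
for state, grp in votes.groupby("state"):
    grp = grp.set_index("year").sort_index()
    # Remove this state's votes from the national totals before correlating
    rest = national.loc[grp.index] - grp[["dem_votes", "total_votes"]]
    state_pct = grp["dem_votes"] / grp["total_votes"]
    rest_pct = rest["dem_votes"] / rest["total_votes"]
    correlations[state] = state_pct.corr(rest_pct)

print(pd.Series(correlations).sort_values(ascending=False).head())
```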

individual state correlations with national popular vote

Connecticut had the highest correlation, and the fitted line plot below, which plots Connecticut's percent Democratic voting against the national percentage, shows just how closely correlated the state's percentages have been over the last nine presidential election cycles. 

Connecticut correlation

Other states, including Michigan and Ohio, had similarly high correlations. Keep in mind, however, that no matter how high these correlations are, correlation does not imply causation. If a candidate focused on Connecticut thinking their national percent would increase, they would be falling victim to flawed statistical thinking. In fact, that candidate's percent in other states would probably decrease, as opposed to increase, since they would be neglecting the other states.

West Virginia had the lowest correlation with the national voting percentages—so low, in fact, that its correlation was negative. 

West Virginia correlation

In years where West Virginia saw a high percentage of voters go Democratic, the national Democratic vote percent was low, and in years where West Virginia had a low Democratic vote percent, the Democratic vote percent was high at the national level. In general, southern states (South Carolina, Mississippi, Louisiana, Georgia, Arkansas, Tennessee, Kentucky, and Alabama) had low correlations with the national percentages. 

This data came from The University of California Santa Barbara website, http://www.presidency.ucsb.edu/elections.php, which has state-by-state historical voting data. As the 2016 election cycle gathers more steam, it will be very interesting to see what the data from a myriad of sources will be able to tell us. 

Chi-Square Analysis: Powerful, Versatile, Statistically Objective

To make objective decisions about the processes that are critical to your organization, you often need to examine categorical data. You may know how to use a t-test or ANOVA when you’re comparing measurement data (like weight, length, revenue, and so on), but do you know how to compare attribute or counts data? It’s easy to do with statistical software like Minitab. 

One person may look at this bar chart and decide that each production line had the same proportion of defects. But another person may focus on the small difference between the bars and decide that one of the lines has outperformed the others. Without an appropriate statistical analysis, how can you know which person is right?

When time, money, and quality depend on your answers, you can’t rely on subjective visual assessments alone. To answer questions like these with statistical objectivity, you can use a Chi-Square analysis.

Which Analysis Is Right for Me?

Minitab offers three Chi-Square tests. The appropriate analysis depends on the number of variables that you want to examine. And for all three options, the data can be formatted either as raw data or summarized counts.

Chi-Square Goodness-of-Fit Test – 1 Variable

Use Minitab’s Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable) when you have just one variable.

The Chi-Square Goodness-of-Fit Test can test if the proportions for all groups are equal. It can also be used to test if the proportions for groups are equal to specific values. For example:

  • A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. The manufacturer uses the Chi-Square Goodness-of-Fit Test to determine if the proportion of defects is equal across all three lines.
  • A bottle cap manufacturer operates three production lines and records the number of defective caps for each line. One line runs at high speed and produces twice as many caps as the other two lines, which run at a slower speed. The manufacturer uses the Chi-Square Goodness-of-Fit Test to determine if the number of defects for each line is proportional to the volume of caps it produces (a quick sketch of this scenario follows below).
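The second scenario can also be checked outside of Minitab with SciPy's chi-square goodness-of-fit function. This is only a rough sketch: the defect counts below are invented, and the expected proportions reflect the 2:1:1 production volumes.

    from scipy.stats import chisquare

    defects = [60, 25, 35]               # hypothetical defect counts for lines 1-3
    expected_props = [0.5, 0.25, 0.25]   # line 1 produces twice as many caps as lines 2 and 3
    expected = [p * sum(defects) for p in expected_props]

    chi2, p_value = chisquare(f_obs=defects, f_exp=expected)
    print(chi2, p_value)   # a small p-value would suggest defects are not proportional to volume
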
Chi-Square Test for Association – 2 Variables

Use Minitab’s Stat > Tables > Chi-Square Test for Association when you have two variables.

The Chi-Square Test for Association can tell you if there’s an association between two variables. In other words, it can test if two variables are independent or not. For example:

  • A paint manufacturer operates two production lines across three shifts and records the number of defective units per line per shift. The manufacturer uses the Chi-Square Test for Association to determine if the defect rates are similar across all shifts and production lines (see the sketch after this list). Or, are certain lines during certain shifts more prone to defects?
  • A credit card billing center records the type of billing error that is made, as well as the type of form that is used. The billing center uses a Chi-Square Test to determine whether certain types of errors are related to certain forms.
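To give a flavor of what this looks like outside of Minitab, here is the paint-line example as a two-way table tested with SciPy; the counts are invented for illustration.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical defect counts: rows = production lines, columns = shifts
    counts = np.array([[23, 31, 42],
                       [18, 29, 57]])

    chi2, p_value, dof, expected = chi2_contingency(counts)
    print(p_value)   # a small p-value suggests defect rates depend on the line/shift combination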

Cross Tabulation and Chi-Square – 2 or more variables

Use Minitab’s Stat > Tables > Cross Tabulation and Chi-Square when you have two or more variables.

If you simply want to test for associations between two variables, you can use either Cross Tabulation and Chi-Square or Chi-Square Test for Association. However, Cross Tabulation and Chi-Square also lets you control for the effect of additional variables. Here’s an example:

  • A dairy processing plant records information about each defective milk carton that it produces. The plant uses a Cross Tabulation and Chi-Square analysis to look for dependencies between the defect types and the machine that produces the carton, while controlling for any shift effect. Perhaps a particular filling machine is prone to a certain type of defect, but only during the first shift.

This analysis also offers advanced options. For example, if your categories are ordinal (good, better, best or small, medium, large) you can include a special test for concordance.

Conducting a Chi-Square Analysis in Minitab

Each of these analyses is easy to run in Minitab. For more examples that include step-by-step instructions, just navigate to the Chi-Square menu of your choice and then click Help > example.

It can be tempting to make subjective assessments about a given set of data, their makeup, and possible interdependencies, but why risk an error in judgment when you can be sure with a Chi-Square test?

Whether you’re interested in one variable, two variables, or more, a Chi-Square analysis can help you make a clear, statistically sound assessment.

Calculating the Probability of Converting on 4th Down

4th Down

Imagine a multi-million dollar company that released a product without knowing the probability that it will fail after a certain amount of time. “We offer a 2 year warranty, but we have no idea what percentage of our products fail before 2 years.” Crazy, right? Anybody who wanted to ensure the quality of their product would perform a statistical analysis to look at the reliability and survival of their product.  

Now imagine a multimillion-dollar football organization that makes 4th down decisions without knowing the probability that they will convert the 4th down. “We punt on every 4th and 1, but we have no idea what percentage of the time we would keep possession if we went for it.” That's just as crazy, except that seems to be what every football organization does.

But it doesn’t have to be this way. Just like businesses use statistics to improve the quality of their products, football teams should use statistics to improve their chances of winning. So I’m going to use Minitab’s binary logistic regression to create a model that will let us know the probability a team has of successfully converting on 4th down.

The Data

We’re continuing our quest to make a Big Ten 4th down calculator, so we’ll start with the same data that we used to create a model for expected points. For every 3rd down in Big Ten conference games the last 2 seasons, I recorded the distance needed to convert, whether the team on offense was at home or away, and whether they converted. I used 3rd down instead of 4th down to increase the sample size. And since the goal on 3rd down is the same as 4th down (convert in one play), the probabilities should be the same.

Speaking of the probabilities, we can use a scatterplot to get an initial look at how distance affects the probability of converting.

Scatterplot

The probability of converting decreases pretty consistently as the distance increases. The data does appear to level out a bit between 10 and 15 yards before decreasing again. And there are some outliers at the end of the data, but that is due to small sample sizes.

Now, I do have a different data set with a much larger sample that we can use to eliminate the noise in the data, but first I want to show something with this first data set that we can’t show with the next one.

The Effect of Playing at Home or Away

In the model for expected points, the location of the game affected a team's expected points. Will we see the same effect on the probability of converting on 3rd down? We’ll use binary logistic regression to determine whether Home or Away is a significant term in the model.

Binary Logistic Regression

When it comes to the probability of converting on 3rd down, it doesn’t matter whether the team is home or away. The p-value in the regression analysis is 0.994, which is much greater than the common significance level of 0.05. So why does it matter for expected points, but not here? My best guess is the sample size. Home field advantage has such a small effect on a single play that it doesn’t show up in the 3rd down conversions. But over the course of a multiple play drive (like what we looked at in the expected points model), those small effects add up and the effect of home field advantage becomes noticeable.  

So when it comes to a single play, we can ignore home field advantage.

The Data: Part II

To increase our sample size, fellow blogger Joel Smith was kind enough to share data he collected on every college football game from 2006–2012. Because our sample size was so large, we can actually look at 4th downs instead of 3rd downs. Here is a scatterplot of the data:

Scatterplot

We see a similar pattern as before. The data decreases until about 10 yards where it levels out a bit before decreasing practically to 0% after 20 yards. And that outlier? Teams were 1 for 3 on 4th and 34. That one success came in the 4th quarter when the team on offense was down by 21 points, so the defense probably no longer had their starters in. That means we should clean up the data to try and remove points like these.

To try and avoid games that were blowouts, I removed any 4th downs where the score differential was greater than 4 touchdowns in the first 3 quarters, and greater than 16 points (3 scores) in the 4th quarter. Finally, I removed any distance greater than 20 yards, since the probability basically drops to 0. This means the decision on anything greater than 4th and 20 should be very easy. Punt or kick a FG unless it’s late in the game and you absolutely need to score a touchdown. So we don't really need to worry about modeling that for our 4th down calculator.

After removing these observations, we still have 11,623 4th downs. Here's the data I used.

The Final Model

We already saw that it doesn’t matter whether you’re playing at home or on the road, but there is another factor we should take into account. When you get closer to the goal line, the defense has a smaller portion of the field to defend. This might make it harder to convert on 4th down when you have to score a touchdown rather than simply get a first down. So I created a variable to determine whether it was 4th and goal or not to include in the model.

There also appears to be some curvature in the data, so I included the 2nd and 3rd order terms for distance. And lastly, our integers for distance represent the midpoint of the actual distance. For example, on 4th and 4 you could really have to gain anywhere from 3.5 to 4.5 yards. But on 4th and 1, the range is really 0 yards to 1.5 yards. So instead of using the integer 1, I used 0.75.

Now let’s put our data into Minitab and see the results.

Binary Logistic Regression

Binary Logistic Regression

The p-values for all of our terms are less than 0.05, so we can conclude that they are all significant and keep them in the model. The Deviance R-squared value tells us that 97% of the deviance in the probability of converting on 4th down can be explained by the model. We can now use the model to predict the probability of converting at different distances.
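If you'd like to reproduce something similar outside of Minitab, here is a minimal sketch of this kind of model in Python with statsmodels. The file and column names are placeholders for the play-by-play data described above, and the cubic-in-distance form simply mirrors the terms in the model.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical file and column names: converted (0/1), distance (midpoint-adjusted yards), goal_to_go (0/1)
    plays = pd.read_csv("fourth_downs.csv")

    # Logistic regression with 2nd- and 3rd-order distance terms and a goal-to-go indicator
    model = smf.logit("converted ~ distance + I(distance**2) + I(distance**3) + goal_to_go",
                      data=plays).fit()
    print(model.summary())

    # Predicted conversion probabilities for 4th and 1 (0.75 midpoint) through 4th and 10
    new = pd.DataFrame({"distance": [0.75] + list(range(2, 11)), "goal_to_go": 0})
    print(model.predict(new).round(2))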

Distance    Probability when Goal to go    Probability when not Goal to go
1*          61%                            70%
2           50%                            60%
3           43%                            53%
4           37%                            46%
5           32%                            41%
6           29%                            37%
7           26%                            34%
8           24%                            32%
9           22%                            30%
10          21%                            28%

*I used a value of 0.75 for the prediction

We see that being at the goal line decreases your chances of converting on 4th down by about 10 percentage points. We also see what a drastic effect just a couple of yards makes. Imagine getting a false start penalty and having your 4th and 1 turn into 4th and 6. You just cut your odds of converting nearly in half!

So let’s go back to that coach who punts on every 4th and 1. Now that we have our data, we can analyze whether he is making the correct decision. Let’s say he has a 4th and 1 at his own 10 yard line and is playing on the road. We can use our expected points model and our 4th down model to see what the correct decision should be.

Decision     Expected Points Success    Expected Points Fail    Total Expected Points
Go for it    -0.64                      -5.9                    -2.2
Punt*        -2.9                       N/A                     -2.9

* The average net punt in the Big Ten was about 40 yards, so that’s the value I used.

By this model, going for it on 4th down increases the coach's expected points by 0.7. That may not sound like much, but imagine making a similar decision 4 or 5 times a game. Those expected points add up to about a field goal. Think there is a coach out there who wouldn’t want an easy way to increase their score by 3 points?
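The arithmetic behind that comparison is simple to script. Here is a short sketch using the conversion probability and expected-point values quoted above:

    # 4th and 1 (not goal to go) converts about 70% of the time, per the table above
    p_convert = 0.70
    ep_success, ep_fail = -0.64, -5.9   # expected points after converting / after failing
    ep_punt = -2.9                      # expected points after a typical 40-yard net punt

    ep_go = p_convert * ep_success + (1 - p_convert) * ep_fail
    print(round(ep_go, 1), round(ep_go - ep_punt, 1))   # -2.2, and a 0.7-point edge for going for it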

And keep in mind our numbers assume you only gain 1 yard on 4th down. When you account for the fact that you can gain more than 1 yard, the case for going for it only strengthens. As Alabama found out against Ohio State last year, even a simple running play up the middle has the potential to go the distance.

So now we’re all set to track the 4th down decisions in this upcoming Big Ten season. The first Big Ten conference game is September 19th, when Rutgers takes on Penn State. And the Big Ten 4th down calculator is ready and waiting.

Let the games begin!


Graphing Wastewater in China

Map of China

Newsweek's recent article, The Environmental Disaster in Your Closet, led me (through Greenpeace's Detox Catwalk) to an interesting new data set on the web. Since I like public data, I thought I'd share some graphs I made from the global online platform of China's Institute of Public and Environmental Affairs (IPE). The IPE website states that its goal is "to expand environmental information disclosure to allow communities to fully understand the hazards and risks in the surrounding environment, thus promoting widespread public participation in environmental governance," which sounds great to me. Minitab makes it easy to see patterns in different groups of data over time, including environmental data by region across China.

Total Wastewater is Highest in Guangdong

Because the IPE’s aim is to expand environmental information disclosure, it has a lot of environmental data. The data sets exist at a number of different levels, including individual facilities and river basins. I started out looking at total wastewater by region (8/31/2015). Here’s what that looks like over time:

Total wastewater is highest in Guangdong

Most of the lines are close together on this scale, but the top one stands out. This is the line for Guangdong. Because I hadn’t acknowledged the depth of my ignorance about China, I had to do some research to find out whether there was an explanation for why this region would stand out in terms of their wastewater discharge.

Looking a Little Deeper

Turns out that Guangdong was the most populous region in China in 2013 and the region with the highest Gross Domestic Product (GDP) in 2009. Either factor could contribute to the amount of wastewater from a region. It turns out that the association between a region’s population in 2013 and the amount of wastewater recorded in the IPE database is fairly strong, which could be one explanation for why Guangdong has so much wastewater. Of course, we'd have to look more closely to establish a causal relationship (See mistake 3). Because the association between population and total wastewater is strong, it’s not surprising that a graph of wastewater per person looks different from the graph of total wastewater. While Guangdong is still one of the higher lines, it’s Shanghai that is the leader in per capita wastewater (using the 2013 population as representative for the years 2004 to 2013).

Shanghai has the highest amount of wastewater per person

If you look at the amount of wastewater divided by GDP, it looks like Guangxi will stand out, but a large drop in 2011 puts it closer to the rest of the lines.

Total wastewater declined more in Guangxi in 2011 than in most other regions.

So Many Graphs, So Little Time 

The amount of transparency in society is increasing all the time. The things that you can learn from that data are increasing too. Graphical analysis that Minitab provides can give you quick answers to difficult questions about your data. To see more, take a look at Which graphs are included in Minitab? for an overview of different ways you can examine your data.

Monitoring Rare Events with G Charts

Rare events inherently occur in all kinds of processes. In hospitals, there are medication errors, infections, patient falls, ventilator-associated pneumonias, and other rare, adverse events that cause prolonged hospital stays and increase healthcare costs. 

But rare events happen in many other contexts, too. Software developers may need to track errors in lines of programming code, or a quality practitioner may need to monitor a low-defect process in a high-yield manufacturing environment. Accidents that occur on the shop floor and aircraft engine failures are also rare events, ideally.

Whether you’re in healthcare, software development, manufacturing or some other industry, statistical process control is an important component of quality improvement. Using control charts, we can graph these rare events and monitor a process to determine if it’s stable or if it’s out of control and therefore unpredictable and in need of attention.

The G Chart

There are many different types of control charts available, but in the case of rare events, we can use Minitab Statistical Software and the G chart to assess the stability of our processes. The G chart, based on the geometric distribution, is a control chart designed specifically for monitoring rare events.

G charts are typically used to plot the number of days between rare events. They also can be used to plot the number of opportunities between rare events.

For example, suppose we want to monitor heart surgery complications. We can use a G chart to graph the number of successful surgeries that were performed in between the ones that involved complications.

The G chart is simple to create and use. To produce a G chart, all you need is either the dates on which the rare events occurred or the number of opportunities between occurrences.
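If you have only the event dates, the values a G chart plots are just the gaps between consecutive dates. Computing those gaps takes one line in most tools; here is a small Python sketch with invented dates:

    from datetime import date

    # Hypothetical dates on which the rare events occurred
    events = [date(2015, 1, 4), date(2015, 1, 19), date(2015, 3, 2), date(2015, 3, 9)]

    # Number of days between consecutive events -- the values a G chart plots
    gaps = [(b - a).days for a, b in zip(events, events[1:])]
    print(gaps)   # [15, 42, 7]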

Advantages of the G Chart

In addition to its simplicity, this control chart also offers greater statistical sensitivity for monitoring rare events than its traditional counterparts.

Because rare events occur at very low rates, traditional control charts like the P chart are typically not as effective at detecting changes in the event rates in a timely manner. Because the probability that a given event will occur is so low, considerably larger subgroup sizes are required to create a P chart and abide by the typical rules of thumb. In addition to the arduous task of collecting more data, this creates the unfortunate circumstance of having to wait longer to detect a shift in the process. Fortunately, G charts do not require large quantities of data to effectively detect a shift in a rare events process.

Another advantage of using the G chart to monitor your rare events is that it does not require that you collect and record data on the total number of opportunities, while P charts do.

For example, if you’re monitoring medication errors using a P chart, you must count the total number of medications administered to each and every patient in order to calculate and plot the proportion of medication errors. To create a G chart, however, you just need to record the dates on which the medication errors occurred. Note that the G chart does assume that the number of opportunities (medications administered, in this example) remains reasonably constant over time.

Creating a G Chart

Each year, nosocomial (hospital-acquired) infections cause an exorbitant number of additional hospital days nationally, and, unfortunately, a considerable number of deaths. Suppose you work for a hospital and want to monitor these infections so you can promptly detect changes in your process and react appropriately if it goes out of control.

In Minitab, you first need to input the dates when each of the nosocomial infections occurred. Then to create a G chart and plot the elapsed time between infections, select Stat > Control Charts > Rare Event Charts > G.

In the dialog box, you can input either the 'Dates of events' or the 'Number of opportunities' between adverse events. In this case, we have the date when each infection occurred so we can use 'Dates of events' and specify the Infections column.

Interpreting a G Chart

Minitab plots the number of days between infections on the G chart. Points above the upper control limit (UCL) are desirable as they indicate an extended period of time between events. Points near or below the lower control limit (LCL) are undesirable and indicative of a shortened time period between events.

Minitab flags any points that extend beyond the control limits, or fail any other tests for special causes, in red.

The G chart above shows that this hospital went nearly 2 months without an infection. Therefore, you should try to learn from this fortunate circumstance. However, you can also see that the number of days between events has recently started to decrease, meaning the infection rate is increasing, and the process is out of control. You should therefore investigate what is causing the recent series of infections.

Monitoring Rare Events with T Charts

While G charts are used to monitor the days or opportunities between rare events, you can use a T chart if your data are instead continuous. 

For example, if you have recorded both the dates and time of day when rare events occurred, you can assess process stability using Stat > Control Charts > Rare Event Charts > T.

As more and more organizations embrace and realize the benefits of quality improvement, they will encounter the good problem of increased cases of rare events. As these events present themselves with greater frequency, practitioners across industries can rely on Minitab and the G and T charts to effectively monitor their processes and detect instability when it occurs.

The Danger of Overfitting Regression Models

Example of an overfit regression model

In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading. In this post, I explain what an overfit model is and how to detect and avoid this problem.

An overfit model is one that is too complicated for your data set. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.

Instead, we want our model to approximate the true model for the entire population. Our model should not only fit the current sample, but new samples too.

The fitted line plot illustrates the dangers of overfitting regression models. This model appears to explain a lot of variation in the response variable. However, the model is too complex for the sample data. In the overall population, there is no real relationship between the predictor and the response. You can read about the model here.

Fundamentals of Inferential Statistics

To understand how overfitting causes these problems, we need to go back to the basics for inferential statistics.

The overall goal of inferential statistics is to draw conclusions about a larger population from a random sample. Inferential statistics uses the sample data to provide the following:

  • Unbiased estimates of properties and relationships within the population.
  • Hypothesis tests that assess statements about the entire population.

An important concept in inferential statistics is that the amount of information you can learn about a population is limited by the sample size. The more you want to learn, the larger your sample size must be.

You probably understand this concept intuitively, but here’s an example. If you have a sample size of 20 and want to estimate a single population mean, you’re probably in good shape. However, if you want to estimate two population means using the same total sample size, it suddenly looks iffier. If you increase it to three population means and more, it starts to look pretty bad.

The quality of the results worsens when you try to learn too much from a sample. As the number of observations per parameter decreases in the example above (20, 10, 6.7, etc), the estimates become more erratic and a new sample is less likely to reproduce them.

Applying These Concepts to Overfitting Regression Models

In a similar fashion, overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for all of the terms in the equation. The sample size limits the number of terms that you can safely include before you begin to overfit the model. The number of terms in the model includes all of the predictors, interaction effects, and polynomials terms (to model curvature).

Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.

Just like the example with multiple means, you must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression.

For example, if your model contains two predictors and the interaction term, you’ll need 30-45 observations. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.

How to Detect and Avoid Overfit Models

Cross-validation can detect overfit models by determining how well your model generalizes to other data sets by partitioning your data. This process helps you assess how well the model fits new observations that weren't used in the model estimation process.

Minitab statistical software provides a great cross-validation solution for linear models by calculating predicted R-squared. This statistic is a form of cross-validation that doesn't require you to collect a separate sample. Instead, Minitab calculates predicted R-squared by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation.

If the model does a poor job at predicting the removed observations, this indicates that the model is probably tailored to the specific data points that are included in the sample and not generalizable outside the sample.
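If you're curious about the mechanics, predicted R-squared can be computed by hand from the PRESS statistic (the sum of squared leave-one-out prediction errors). The sketch below uses the standard leverage-based shortcut for ordinary least squares; it's an illustration of the idea, not a description of Minitab's internal code.

    import numpy as np

    def predicted_r_squared(X, y):
        """Leave-one-out (PRESS-based) R-squared. X is the n-by-p design matrix, including a column of 1s."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        # Leverage values: the diagonal of the hat matrix X (X'X)^-1 X'
        leverage = np.sum(X * (X @ np.linalg.inv(X.T @ X)), axis=1)
        press = np.sum((residuals / (1 - leverage)) ** 2)
        sst = np.sum((y - y.mean()) ** 2)
        return 1 - press / sst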

To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.

For more about the model selection process, read my blog post, How to Choose the Best Regression Model.

Should Virginia Tech Always Onside Kick Against Ohio State?

In 2007, the Crayola crayon company encountered a problem. Labels were coming off of their crayons. Up to that point, Crayola had done little to implement data-driven methodology into the process of manufacturing their crayons. But that was about to change. An elementary data analysis showed that the adhesive didn’t consistently set properly when the labels were dry. Misting crayons as they went through the labeling machines solved the problem, and that project’s success prompted Crayola to expand the use of statistical methods. The following year, the company’s initial wave of Six Sigma projects saved more than $1.5 million, and Crayola now relies on a data-driven culture of continuous improvement to enhance the quality of their crayons.

But statistical success stories don’t have to be confined to the business world. Baseball has already proven that the advancement of statistical analyses can revolutionize a sport. Basketball and hockey teams are also starting to look into how analytics can improve the quality of their team. The only sport that seems to be lagging behind is football. But all it will take is one team to have success implementing statistics into their game plan, and others will surely follow.

Are you listening, Virginia Tech?

The Hokies are playing the defending National Champion Ohio State Buckeyes on Monday night. Ohio State is a double digit favorite in the game, and a good part of the reason is that their offense is great.

Just how great? I’m glad you asked.  

Ohio State’s Offense the Previous Two Seasons

Recently, I created a regression model that can calculate the number of points a football team is expected to score based on their field position and whether they are playing at home or on the road. The data comes from every Big Ten conference game the last two seasons. To no surprise, the farther you are from the end zone, the fewer points you’re expected to score. And you’re expected to score fewer points on the road than at home.

Since Ohio State is playing at Virginia Tech, let’s focus on teams playing on the road. I took the data from my previous analysis, removed drives by the home team, and I removed Ohio State. Here is a fitted line plot of the data:

Fitted Line Plot

This is exactly what we would expect. The farther you are from the end zone, the fewer points you’re expected to score. Now let’s make the same plot for all of Ohio State’s road drives the previous two seasons.

Fitted line plot

You can start Ohio State anywhere on the field, and odds are they are going to score on you before you score on them. Start Ohio State on their own 1 yard line, and the model says their expected points are still 2.6 (compared to a value of -1.8 for the other Big Ten teams). But the most impressive part is that the data included 20 drives that Ohio State started inside their own 20 yard line. Here is a bar chart of the next score:

Bar chart

Even backed up in their own territory and playing on the road, Ohio State was the next team to score 75% of the time, with almost all of those scores being touchdowns. With an offense that good, it really raises the question.

Why even give them the ball?

The Onside Kick

The onside kick is mainly used at the end of games when a losing team is desperate to get the ball back before the clock runs out. But there is no rule saying you can’t do an onside kick early in the game. Or even do an onside kick every time you have a kickoff.

Would it actually benefit Virginia Tech to attempt an onside kick every time? Let’s calculate the percentage of kicks they would need to recover to make it worth it. If you kick the ball deep, most drives will start from the 25 yard line. So we’ll use that for Ohio State’s starting position on a deep kick. The model above shows that we can expect Ohio State to score about 3.4 points starting from their own 25 yard line.

An onside kick needs to travel at least 10 yards in order for the kicking team to legally recover it. Kickoffs are from the 35 yard line, so if Virginia Tech recovers they will be 55 yards from the end zone, and if Ohio State recovers they will be 45 yards from the end zone. This gives Virginia Tech an expected point value of 2.5 points (calculated using this regression model) and Ohio State would have an expected point value of 4.5 points. Now we can use algebra to calculate the break-even success rate, where p is the probability that Virginia Tech recovers the onside kick.

-3.4 = 2.5*p – 4.5*(1-p)

-3.4 = 2.5*p – 4.5 + 4.5*p

1.1 = 7*p

p = 1.1/7 = .157 = 16%

So if Virginia Tech can recover the onside kick about 16% of the time, their total expected points will be the same as if they were to kick deep. If they can recover a higher percentage, then they should be attempting an onside kick every time.

I couldn’t find any good data on college onside kicks, but in the NFL, non-surprise onside kick recovery rates are approximately 20%. The success rate in college football should be pretty similar. And hey, 20% > 16%, so onside kick every time, right?

Not so fast.

Anytime you perform a data analysis, it’s important to know where your data came from. In this case, our expected points for Virginia Tech came from data from all Big Ten teams. So really, it’s what we would expect an average Big Ten offense to score against an average Big Ten defense. Last year, according to Football Outsiders S&P+ ratings, Virginia Tech ranked 85th in offense, and Ohio State ranked 11th in defense. So when Virginia Tech has the ball, it will really be closer to a below average Big Ten offense going up against an above average Big Ten defense. This means our estimate for Virginia Tech’s expected points after a successful onside kick is probably a little too high.

Additionally, Virginia Tech had the #10 ranked defense last year and almost everybody returns from that defense this year. Our model for Ohio State's expected points is based on an average Big Ten defense. So we should lower Ohio State’s expected points for both a deep kickoff and an unsuccessful onside kick. But how much we should decrease these values by is hard to quantify. So let's look at different values and see how it affects the break-even success rate. In the previous equation, I decreased both Ohio State's expected points and Virginia Tech's expected points by the values in the following table.

Decrease Expected Points By    New Break-Even Success Rate
0.5                            18%
1                              22%
1.5                            27.5%
2                              37%
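The break-even rates in that table come from the same algebra as before, just with the expected-point values shifted. Here is a short sketch of the computation:

    def break_even_recovery_rate(ep_deep_kick, ep_vt_success, ep_osu_fail):
        # Solve ep_deep_kick = p * ep_vt_success - ep_osu_fail * (1 - p) for p
        return (ep_osu_fail + ep_deep_kick) / (ep_vt_success + ep_osu_fail)

    for decrease in [0, 0.5, 1, 1.5, 2]:
        p = break_even_recovery_rate(-(3.4 - decrease), 2.5 - decrease, 4.5 - decrease)
        print(f"Decrease {decrease}: break-even recovery rate {p:.1%}")
    # Prints 15.7%, 18.3%, 22.0%, 27.5%, 36.7% -- matching the 16% above and the table values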

We see that the larger the effect of the better than average defenses and Virginia Tech's poor offense, the more the statistics side with not attempting an onside kick every time.

To Onside Kick or Not to Onside Kick?

With the uncertainty of how much the defenses and Virginia Tech's offense affect the numbers, we can’t definitively say that Virginia Tech should onside kick every single time. However, this data analysis has shown enough that we can definitely say one thing.

Virginia Tech should attempt an onside kick……..at least once.

The 20% value we used for onside kick recoveries was for non-surprise onside kicks. However, in the NFL, surprise onside kicks succeed close to 60% of the time. And the first onside kick Virginia Tech attempts will certainly take Ohio State by surprise. And the second one probably will too. Maybe even a third. But eventually Ohio State would adjust the formation of their kick return team and the success rate would drop to that 20% value.  

So if Virginia Tech wants to maximize their chances of winning, they really should attempt at least one onside kick. Until Ohio State adjusts their kick return team, anytime Virginia Tech kicks the ball deep they’re just giving away free points.

Ohio State at Home

I’d be remiss if I didn’t share one last thing about the Ohio State offense. The previous data for them only included Big Ten games played on the road. Impressive as it was, their offense gets even better when playing at home. How much better? Well, it doesn’t matter where they start with the football.

Like, at all.

Fitted Line Plot

Virginia Tech can be thankful they’re playing Ohio State at home. If it was at Columbus, this data analysis not only would have concluded that they should onside kick every time, but it would have said never to punt either!

Before Crayola fully embraced the widespread use of statistical analyses, the vice president of manufacturing said he saw people spend more time trying to figure out how to come up with data that supported their thesis rather than letting the data reveal where they needed to go. It’s that kind of thinking that is too prevalent in football today. If you were to never punt against Ohio State, the first time the Buckeyes scored a touchdown after you failed on 4th down, people would point to that as proof that your strategy doesn’t work. But the numbers speak for themselves. Had you punted, Ohio State probably would have scored a touchdown anyway.

So take note Hawaii, Northern Illinois, Western Michigan, Maryland, Penn State, Minnesota, and Michigan State. If you go into Columbus and willingly give Ohio State possession of the football, you’re doing nothing but hurting your football team. Go ahead and ignore the data if you want. Just know that if you do, you might end up with some defective crayons.

Regression with Meat Ants: Analyzing a Count Response (Part 1)

Ever use dental floss to cut soft cheese? Or Alka Seltzer to clean your toilet bowl? You can find a host of nonconventional uses for ordinary objects online. Some are more peculiar than others.

Ever use ordinary linear regression to evaluate a response (outcome) variable of counts? 

Technically, ordinary linear regression was designed to evaluate a continuous response variable. A continuous response variable, such as temperature or length, is measured along a continuous scale that includes fractional (decimal) values. In practice, however, ordinary linear regression is often used to evaluate a response of count data, which are whole numbers such as 0, 1, 2, and so on.

You can do that. Just like you can use a banana to clean a DVD. But there are things to watch out for if you do that. To examine issues related to performing ordinary linear regression analysis with count data, consider the following scenario.

Kids, Ants, and Sandwiches

A bored kid in a backyard makes a great scientist. One day, three Australian kids wondered which of their lunch sandwiches would attract more meat ants: Peanut butter, Vegemite, or Ham and pickles.

Note: Meat ants are an aggressive species of Australian ant that can kill a poisonous cane toad. Vegemite is a slightly bitter, salty brown paste made from brewer’s yeast extract.

To test their hypotheses, the kids started dropping pieces of the three sandwiches and counting the number of ants on each sandwich after a set amount of time. Years later, as an adult, one of the kids replicated this childhood experiment with increased rigor. You can find the details of his modified experiment and the sample data it produced on the web site of the American Statistical Association.

Preparing the Data

To make the data and the results easier to interpret, I coded and sorted the original sample data set using the Code and Sort commands in Minitab's Data menu. If you want to see those data manipulation maneuvers, click here to open the project file in Minitab, then open the Report Pad to see the instructions. If you don't have a copy of Minitab, you can download a free 30-day trial version.

After coding and sorting, the combination of factor levels for each sandwich used for ant bait are easy to see in the worksheet, and the data values are arranged in the order that they were collected. 

For example, row 9 shows that ham and pickles on rye with butter was the 9th piece of sandwich bait used—and it attracted 65 meat ants.

Performing Linear Regression

Are meat ants statistically more likely to swarm a ham sandwich—or will the pickles be a turnoff? Do they gravitate to the creamy comfort of butter? Or will salty, malty Vegemite drive them wild?

To evaluate the data using ordinary linear regression, choose Stat > Regression > Fit Regression Model. Fill out the dialog box as shown below and click OK.

First, examine the ANOVA table to determine whether any of the predictors are statistically significant.

At the 0.1 level of significance, both the Filling and Butter predictors are statistically significant (p-value < 0.1). What matters to a meat ant, it seems, is not the bread, but what's between the slices.

To see how each of the levels of the factors relate to the number of ants (the response), examine the Coefficients table.

Each coefficient value is calculated in relation to the reference level for the variable, which has a coefficient of 0. Whatever level isn’t shown in the table is the reference level. So for the Filling variable, the reference level is Vegemite.

Tip: You can see the reference levels used for each variable by clicking the Coding button on the Regression dialog box. If you want the coefficients to be calculated relative to a different level, simply change the reference level in the drop-down list and rerun the analysis.

So what do these coefficient values mean? Generally speaking, larger coefficients are associated with a response of greater magnitude. The positive coefficients indicate a positive association, and the negative coefficients indicate a negative association.

For example, the positive coefficient of 27.28 for ham and pickles indicates that many more ants are attracted to the ham and pickles over Vegemite.  The p-value of 0.000 for the coefficient indicates that the difference between ham and pickles and Vegemite is statistically significant. Based on these results, meat ants appear to be aptly named!
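The same model can be fit outside Minitab with any regression tool that handles categorical predictors. Here is a minimal sketch in Python with statsmodels; the file, column, and level names are placeholders rather than the names in the original data set.

    import pandas as pd
    import statsmodels.formula.api as smf

    ants = pd.read_csv("sandwich_ants.csv")   # hypothetical export of the worksheet

    # C() treats a predictor as categorical; the first level becomes the reference level,
    # similar in spirit to Minitab's coding (the chosen reference level may differ)
    model = smf.ols("Ants ~ C(Filling) + C(Bread) + C(Butter)", data=ants).fit()
    print(model.summary())

    # Estimated ant count for one combination of levels (level names are placeholders too)
    new = pd.DataFrame({"Filling": ["PeanutButter"], "Bread": ["White"], "Butter": ["No"]})
    print(model.predict(new))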

The Regression Equation: Caveat with a Count Response

The output for ordinary linear regression also includes a regression equation. The equation can be used to estimate the value of the response for specific values of the predictor variables.

For categorical predictors, substitute a value of 1 into the equation for the levels at which you want to predict a response, and substitute 0 for the other levels.

For example, using the equation above, the number of meat ants that you can expect to be attracted by a peanut butter sandwich, without butter, on white bread, is estimated at: 24.31 + 7.04(0) + 1.12(0) - 1.21(1) + 0.0(0) + 8.31(1) + 27.28(0) + 0.0(1) + 11.40(0) ≈ 31.41 ants. (You can have Minitab do these calculations for you. Simply choose Stat > Regression > Regression > Predict and enter the predictor levels in the dialog box.)

One issue that can arise if you use ordinary linear regression with a count response is that, at certain predictor levels, the regression equation may estimate negative values for the response. But a negative "count" of ants—or anything else—doesn't make any sense. In that case, the equation may not be practically useful.

For this particular data set, it's not a problem. Using the regression equation, the lowest possible estimated response is for a Vegemite sandwich on white bread without butter (24.31 - 1.21), which yields an estimate of about 23 ants. Negative estimates don't occur here, primarily because the counts in this data set are all considerably greater than 0. But often that's not the case.

Evaluating the Model Fit and Assumptions

Regardless of whether you're performing ordinary linear regression with a continuous response variable or a discrete response variable of counts, it's important to assess the model fit, investigate extreme outliers, and check the model assumptions. If there's a serious problem, your results might not be valid.

The R-squared (adj) value suggests this model explains about half of the variation of the ant count (47.35%). Not great—but not bad for a linear regression model with only a few categorical predictors. For this particular analysis, the ANOVA output also includes a p-value for lack-of-fit.

If the p-value for lack-of-fit is less than 0.05, there's statistically significant evidence that the model does not fit the data adequately. For this model, the p-value here is greater than 0.05.  That means there's not sufficient evidence to conclude that the model doesn't fit well. That's a good thing.

Minitab's regression output also flags unusual observations, based on the size of their residuals. Residuals, also called "model errors", measure how much the response values estimated by the regression model differ from the actual response values in your data. The smaller a residual, the closer the value estimated by the model is to the actual value in your data. If a residual is unusually large, it suggests that the observation may be an outlier that's "bucking the trend" of your model.

For the ant count sample data, three observations are flagged as unusual:

If you see unusual values in this table, it's not a cause for alarm. Generally, you can expect roughly 5% of the data values to have large standardized residuals. But if there's a lot more than that, or if the size of a residual is unusually large, you should investigate.

For this sample data set of 48 observations, the number of unusual observations is not worrisome. However, two of the observations (circled in red) appear to be very much out-of-whack with the other observations. To figure out why, I went back to the original sample data set online, and found this note from the experimenter:

"Two results are large outliers. A reading of 97 was due to…leaving a portion of sandwich behind from the previous observation (i.e., there were already ants there); and one of 2 was due to [the sandwich portion be placed] too far away from the entrance to the [ant] hill.”

Because these outliers can be attributed to a special (out-of-the-ordinary) cause, it would be OK to remove them and re-run the analysis, as long as you clearly state that you have done so (and why). However, in this case, removing these two outliers doesn't significantly change the overall results of the linear regression analysis anyway (for brevity, I won't include those results here).

Finally, examine the model assumptions for the regression analysis. In Minitab, choose Stat > Regression > Fit Regression Model. Then click Graphs and check Four in one.

The two plots on the left (the Normal Probability Plot and the Histogram) help you assess whether the residuals are normally distributed. Although normality of the residuals is a formal assumption for ordinary linear regression, the analysis is fairly robust (resilient) to this assumption if the data set is sufficiently large (greater than 15 observations or so). Here, the points fall along the line of the normal probability plot and the histogram shows a fairly normal distribution. All is well.

Constant variance of the residuals is a more critical assumption for linear regression. That means the residuals should be distributed fairly evenly and randomly across all the fitted (estimated) values. To assess constant variance, look at the Residuals versus Fits plot in the upper right. In the plot above, the points appear to be randomly scattered on both sides of the line representing a residual value of 0. Again, no evidence of a problem.

With this sample data, using ordinary linear regression with a count response seems to work OK. But with different count data, might things have worked out differently? We'll examine that in the next post (Part 2).

Meanwhile, kick back and fix yourself a ham and pickle sandwich on rye with butter. And keep an eye out for meat ants.
