
How to Choose the Best Regression Model


Rodin's statue, The Thinker

Choosing the correct linear regression model can be difficult. After all, the world is complex, and trying to model it with only a sample doesn't make things any easier. In this post, I'll review some common statistical methods for selecting models, describe complications you may face, and provide some practical advice for choosing the best regression model.

It starts when a researcher wants to mathematically describe the relationship between some predictors and the response variable. The research team tasked to investigate typically measures many variables but includes only some of them in the model. The analysts try to eliminate the variables that are not related and include only those with a true relationship. Along the way, the analysts consider many possible models.

They strive to achieve a Goldilocks balance with the number of predictors they include.  

  • Too few: An underspecified model tends to produce biased estimates.
  • Too many: An overspecified model tends to have less precise estimates.
  • Just right: A model with the correct terms has no bias and the most precise estimates.
Statistical Methods for Finding the Best Regression Model

For a good regression model, you want to include the variables that you are specifically testing along with other variables that affect the response in order to avoid biased results. Minitab statistical software offers statistical measures and procedures that help you specify your regression model. I’ll review the common methods, but please do follow the links to read my more detailed posts about each.

Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values. These statistics are designed to avoid a key problem with regular R-squared—it increases every time you add a predictor and can trick you into specifying an overly complex model.

  • The adjusted R-squared increases only if a new term improves the model more than would be expected by chance, and it can decrease when you add poor-quality predictors.
  • The predicted R-squared is a form of cross-validation, and it can also decrease. Cross-validation determines how well your model generalizes to other data sets by partitioning your data. (A short code sketch of both statistics follows this list.)
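
Outside of Minitab, both statistics are easy to compute for any candidate model. Here is a minimal sketch in Python (NumPy only); the data and model are invented for illustration, and predicted R-squared is computed from the PRESS statistic, which is one common way to define it.

```python
import numpy as np

# Simulated example data: x2 is a useless predictor on purpose
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
p = X.shape[1] - 1                             # number of predictors (no intercept)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Predicted R-squared via PRESS: leave-one-out residuals from the hat matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
press = np.sum((resid / (1 - np.diag(H))) ** 2)
pred_r2 = 1 - press / ss_tot

print(f"R-sq = {r2:.3f}, adjusted = {adj_r2:.3f}, predicted = {pred_r2:.3f}")
```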

P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. “Reducing the model” refers to the practice of including all candidate predictors in the model and then removing the term with the highest p-value, one at a time, until you are left with only significant predictors.
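
For illustration, here is a rough sketch of that backward-elimination idea in Python with statsmodels. The simulated data, the 0.05 cutoff, and the one-term-at-a-time removal are assumptions for the example, not Minitab's exact procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data: only x1 and x2 are truly related to the response
rng = np.random.default_rng(2)
n = 50
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 5 + 3 * X["x1"] - 2 * X["x2"] + rng.normal(size=n)

predictors = list(X.columns)
while predictors:
    model = sm.OLS(y, sm.add_constant(X[predictors])).fit()
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()
    if pvals[worst] <= 0.05:       # every remaining term is significant: stop
        break
    predictors.remove(worst)       # otherwise drop the least significant term

print("Terms kept:", predictors)
```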

Stepwise regression and Best subsets regression: These are two automated procedures that can identify useful predictors during the exploratory stages of model building. With best subsets regression, Minitab provides Mallows’ Cp, which is a statistic specifically designed to help you manage the tradeoff between precision and bias.
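
Mallows' Cp itself is simple to compute once you have each candidate model's error sum of squares. The sketch below does this for a few candidate models built from invented data; as a rule of thumb, a Cp value close to the number of parameters in the model suggests relatively little bias.

```python
import numpy as np

# Simulated data: x3 is pure noise
rng = np.random.default_rng(4)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

def fit_sse(columns):
    """Return the error sum of squares and parameter count for one candidate model."""
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2), X.shape[1]

sse_full, p_full = fit_sse([x1, x2, x3])
mse_full = sse_full / (n - p_full)             # error variance from the full model

for name, cols in [("x1", [x1]), ("x1 x2", [x1, x2]), ("x1 x2 x3", [x1, x2, x3])]:
    sse_p, p = fit_sse(cols)
    cp = sse_p / mse_full - (n - 2 * p)
    print(f"model {name:9s} Cp = {cp:6.1f}   parameters = {p}")
```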

Real World Complications

Great, there are a variety of statistical methods to help us choose the best model. Unfortunately, there also are a number of potential complications. Don’t worry, I’ll provide some practical advice!

  • The best model can be only as good as the variables measured by the study. The results for the variables you include in the analysis can be biased by the significant variables that you don’t include. Read about an example of omitted variable bias.
  • Your sample might be unusual, either by chance or by data collection methodology. False positives and false negatives are part of the game when working with samples.
  • P-values can change based on the specific terms in the model. In particular, multicollinearity can sap significance and make it difficult to determine the role of each predictor.
  • If you assess enough models, you will find variables that appear to be significant but are only correlated by chance. This form of data mining can make random data appear significant. A low predicted R-squared is a good way to check for this problem.
  • P-values, predicted and adjusted R-squared, and Mallows’ Cp can suggest different models.
  • Stepwise regression and best subsets regression are great tools and can get you close to the correct model. However, studies have found that they generally don’t pick the correct model.
Recommendations for Finding the Best Regression Model

Choosing the correct regression model is as much a science as it is an art. Statistical methods can help point you in the right direction but ultimately you’ll need to incorporate other considerations.

Theory

Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining.

Theoretical considerations should not be discarded based solely on statistical measures. After you fit your model, determine whether it aligns with theory and possibly make adjustments. For example, based on theory, you might include a predictor in the model even if its p-value is not significant. If any of the coefficient signs contradict theory, investigate and either change your model or explain the inconsistency.

Complexity

You might think that complex problems require complex models, but many studies show that simpler models generally produce more precise predictions. Given several models with similar explanatory ability, the simplest is most likely to be the best choice. Start simple, and only make the model more complex as needed. The more complex you make your model, the more likely it is that you are tailoring the model to your dataset specifically, and generalizability suffers.

Verify that added complexity actually produces narrower prediction intervals. Check the predicted R-squared and don’t mindlessly chase a high regular R-squared!

Residual Plots

As you evaluate models, check the residual plots because they can help you avoid inadequate models and help you adjust your model for better results. For example, the bias in underspecified models can show up as patterns in the residuals, such as the need to model curvature. The simplest model that produces random residuals is a good candidate for being a relatively precise and unbiased model.

In the end, no single measure can tell you which model is the best. Statistical methods don't understand the underlying process or subject-area. Your knowledge is a crucial part of the process!

If you're learning about regression, read my regression tutorial!

* The image of Rodin's The Thinker was taken by flickr user innoxius and licensed under CC BY 2.0.


What Are T Values and P Values in Statistics?


If you’re not a statistician, looking through statistical output can sometimes make you feel a bit like Alice in Wonderland. Suddenly, you step into a fantastical world where strange and mysterious phantasms appear out of nowhere.

For example, consider the T and P in your t-test results.

“Curiouser and curiouser!” you might exclaim, like Alice, as you gaze at your output.

What are these values, really? Where do they come from? Even if you’ve used the p-value to interpret the statistical significance of your results umpteen times, its actual origin may remain murky to you.

T & P: The Tweedledee and Tweedledum of a T-test

T and P are inextricably linked. They go arm in arm, like Tweedledee and Tweedledum. Here's why.

When you perform a t-test, you're usually trying to find evidence of a significant difference between population means (2-sample t) or between the population mean and a hypothesized value (1-sample t). The t-value measures the size of the difference relative to the variation in your sample data. Put another way, T is simply the calculated difference represented in units of standard error. The greater the magnitude of T (it can be either positive or negative), the greater the evidence against the null hypothesis that there is no significant difference. The closer T is to 0, the more likely there isn't a significant difference.

Remember, the t-value in your output is calculated from only one sample from the entire population. If you took repeated random samples of data from the same population, you'd get slightly different t-values each time, due to random sampling error (which is really not a mistake of any kind, just the random variation expected in the data).

How different could you expect the t-values from many random samples from the same population to be? And how does the t-value from your sample data compare to those expected t-values?

You can use a t-distribution to find out.

Using a t-distribution to calculate probability

For the sake of illustration, assume that you're using a 1-sample t-test to determine whether the population mean is greater than a hypothesized value, such as 5, based on a sample of 20 observations, as shown in the above t-test output.

  1. In Minitab, choose Graph > Probability Distribution Plot.
  2. Select View Probability, then click OK.
  3. From Distribution, select t.
  4. In Degrees of freedom, enter 19. (For a 1-sample t test, the degrees of freedom equals the sample size minus 1).
  5. Click Shaded Area. Select X Value. Select Right Tail.
  6.  In X Value, enter 2.8 (the t-value), then click OK.

The highest part (peak) of the distribution curve shows you where you can expect most of the t-values to fall. Most of the time, you’d expect to get t-values close to 0. That makes sense, right? Because if you randomly select representative samples from a population, the mean of most of those random samples from the population should be close to the overall population mean, making their differences (and thus the calculated t-values) close to 0.

T values, P values, and poker hands

T values of larger magnitudes (either negative or positive) are less likely. The far left and right "tails" of the distribution curve represent instances of obtaining extreme values of t, far from 0. For example, the shaded region represents the probability of obtaining a t-value of 2.8 or greater. Imagine a magical dart that could be thrown to land randomly anywhere under the distribution curve. What's the chance it would land in the shaded region? The calculated probability is 0.005712.....which rounds to 0.006...which is...the p-value obtained in the t-test results! 

In other words, the probability of obtaining a t-value of 2.8 or higher, when sampling from the same population (here, a population with a hypothesized mean of 5), is approximately 0.006.
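
If you would rather compute that tail area directly than read it off a plot, SciPy gives the same answer (the t-value and degrees of freedom come from the example above):

```python
from scipy import stats

t_value = 2.8
df = 19                                   # sample size of 20 minus 1
p_one_tailed = stats.t.sf(t_value, df)    # area in the right tail beyond 2.8
print(round(p_one_tailed, 6))             # about 0.0057
```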

How likely is that? Not very! For comparison, the probability of being dealt 3-of-a-kind in a 5-card poker hand is over three times as high (≈ 0.021).

Given that the probability of obtaining a t-value this high or higher when sampling from this population is so low, what’s more likely? It’s more likely this sample doesn’t come from this population (with the hypothesized mean of 5). It's much more likely that this sample comes from a different population, one with a mean greater than 5.

To wit: Because the p-value is very low (< alpha level), you reject the null hypothesis and conclude that there's a statistically significant difference.

In this way, T and P are inextricably linked. Consider them simply different ways to quantify the "extremeness" of your results under the null hypothesis. You can’t change the value of one without changing the other.

The larger the absolute value of the t-value, the smaller the p-value, and the greater the evidence against the null hypothesis. (You can verify this by entering lower and higher t-values for the t-distribution in step 6 above.)

Try this two-tailed follow up...

The t-distribution example shown above is based on a one-tailed t-test to determine whether the mean of the population is greater than a hypothesized value. Therefore the t-distribution example shows the probability associated with the t-value of 2.8 only in one direction (the right tail of the distribution).

How would you use the t-distribution to find the p-value associated with a t-value of 2.8 for two-tailed t-test (in both directions)?

Hint: In Minitab, adjust the options in step 5 to find the probability for both tails. If you don't have a copy of Minitab, download a free 30-day trial version.
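
If you want to check your answer in code afterward, the two-tailed p-value simply doubles the one-tailed area, because the t-distribution is symmetric:

```python
from scipy import stats

p_two_tailed = 2 * stats.t.sf(2.8, 19)    # both tails beyond |t| = 2.8
print(round(p_two_tailed, 4))             # about 0.0114
```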

Analyzing Qualitative Data, part 1: Pareto, Pie, and Stacked Bar Charts


In several previous blogs, I have discussed the use of statistics for quality improvement in the service sector. After all, services account for a very large part of the economy. Lately, when meeting with several people from financial companies, I realized that one of the problems they face is that they are collecting large amounts of "qualitative" data: types of product, customer profiles, different subsidiaries, several customer requirements, etc.

There are several ways to process such qualitative data. Qualitative data points may still be counted, and once they have been counted they may be quantitatively (numerically) analyzed using statistical methods.

I will focus on the analysis of qualitative data using a simple and obvious example. In this case, we would like to analyze mistakes on invoices made during a period of several weeks by three employees (anonymously identified).

I will present three different ways to analyze such qualitative data (counts). In this post, I will cover:

  1. A very simple graphical approach based on bar charts to display counts (stacked and clustered bars), Pareto diagrams and Pie charts.

Then, in my next post, I will demonstrate: 

  2. A more complex approach for testing statistical significance using a Chi-square test.

  3. An even more complex multivariate approach (using correspondence analysis).

Again, the main purpose of this example is to show several ways to analyze qualitative data. Quantitative data represent numeric values such as the number of grams, dollars, newtons, etc., whereas qualitative data may represent text values such as different colours, types of defects or different employees.

The Assistant in Minitab 17 provides a great breakdown of two main data types: 

Charts and Diagrams with Qualitative Data

I first created a pie chart using the Minitab Assistant (Assistant > Graphical Analysis) as well as a stacked bar chart on counts (in Minitab, select Graph > Bar Charts) to describe the proportion of each type of mistake according to the day of the week.

In the pie charts above, the proportion of mistake types seems to be fairly similar across the different days of the week.

 

The stacked bar chart above shows that the number of mistakes also seems to be very stable and uniform across the days of the week.

Now let's create a stacked bar chart on counts to analyze mistakes by employee. In this second graph, shown above, large variations in the number of errors do occur across employees. The distribution of errors also seems to be very different, with more “Product” errors associated with employee A.

Qualitative Data in a Pareto Chart

Above we see Pareto charts created using the Minitab Assistant: an overall Pareto chart and some additional Pareto diagrams, one for each employee. Again, it's easy to identify the large number of “Product” mistakes (red columns) for employee A.

Stacked Bar Charts of Qualitative Data

Mistake counts are represented as percentages in the stacked bar chart above. For each employee the error types are summed up to obtain 100% (within each employee's column). This provides a clearer understanding of how each employee's mistakes are distributed. Again, the high percentage of “Product” errors (in yellow) for employee A is very noticeable, but also note the high percentage, proportionately, of “Address” mistakes (blue areas) for employee C.

The stacked bar chart above displays changes in the number of errors and in error types according to the week (time trends). Notice that in the last three weeks, at the end of the period, only product and address issues occurred. Apparently error types tend to shift towards more “product” and “address” types of errors, at the end of the period.

Different Views of the Data Give a More Complete Picture

These diagrams do provide a clear picture of mistake occurrences according to employees, error types and weeks. However, as you've seen, it takes several graphs to provide a good understanding of the issue.

This is still a subjective approach, though: several people seated around the same table, looking at the same graphs, might interpret them differently, and in some cases this could result in endless discussions.

Therefore we would also like to use a more scientific and rigorous approach: the Chi-square test. We'll cover that in my next post.  

 

Analyzing Qualitative Data, part 2: Chi-Square and Multivariate Analysis


In my recent meetings with people from various companies in the service industries, I realized that one of the problems they face is that they are collecting large amounts of "qualitative" data: types of product, customer profiles, different subsidiaries, several customer requirements, etc.

As I discussed in my previous post, one way to look at qualitative data is to use different types of charts, including pie charts, stacked bar charts, and Pareto charts. In this post, we'll cover how to dig deeper into qualitative data with Chi-square analysis and multivariate analysis. 

A Chi-Square Test with Qualitative Data

The table below shows which statistical methods can be used to analyze data according to the nature of that data (qualitative or numeric/quantitative). Even when the output (Y) is qualitative and the input (predictor: X) is also qualitative, at least one statistical method is relevant and can be used: the Chi-square test.

  X \ Y                        Numeric/quantitative output   Qualitative output
  Numeric/quantitative input   Regression                    Logistic regression
  Qualitative input            ANOVA, t-tests                Chi-square, proportion tests

Let's perform the Chi-square test of statistical significance on the same qualitative mistakes data I used in my previous post:

data

In Minitab Statistical Software, go to Stat > Tables > Cross Tabulation and Chi-square... In the output below, you can see that observed counts are obtained for each Employee / Error type combination. Below them, expected counts (based on the assumption that the distribution of types of errors is strictly identical for each employee) are displayed. And below the expected count, each combination's contribution to the overall Chi-square is displayed.

A low p-value (p = 0.042 < 0.05), shown below the table, indicates a significant difference in the distribution of error types across the three employees.

We then need to consider the major contributions to the overall chi-square:

Largest contribution: 3.79, for the combination of the “Product” mistake type and employee A. Note that for that particular cell, the number of observed “Product” errors (third row) for employee A (first column of the table) is much larger than the number of expected errors. Because of that difference, the contribution for that particular combination is large: 3.79.

Second largest contribution: 2.66 for the Error type: “Address” & employee: C combination. Note that for this particular combination (i.e., this particular cell in the table) the observed number of address errors is much larger than the number of expected errors for Employee C (and therefore the contribution 2.66 is quite large).
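
For readers working outside Minitab, the same kind of analysis is available in Python. The counts below are hypothetical stand-ins (the original worksheet is not reproduced in this post), but the mechanics are identical: expected counts under the assumption of identical distributions, an overall chi-square statistic and p-value, and per-cell contributions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are error types, columns are employees A, B, C
observed = np.array([
    [ 6,  8, 14],   # Address
    [ 9,  7,  8],   # Date (assumed category)
    [18, 10,  9],   # Product
    [ 7,  9,  8],   # Price (assumed category)
])

chi2, p, dof, expected = chi2_contingency(observed)
contributions = (observed - expected) ** 2 / expected

print(f"Chi-square = {chi2:.2f}, df = {dof}, p = {p:.3f}")
print(np.round(contributions, 2))   # the largest cells drive the overall chi-square
```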

Simple Correspondence (Multivariate) Analysis for Qualitative Data 

This third approach to analyzing qualitative data is more complex and computationally intensive, but it is also a very effective and explicit statistical tool from a graphical point of view. In Minitab, go to Stat > Multivariate > Simple Correspondence Analysis...

To do this analysis, I rearranged the data in a two-way contingency table, with the addition of a column for the employee names:

The simple correspondence symmetric plot below indicates that “Product” type errors are more likely to be associated with employee A (on the right side of the graph, the two points are close to one another), whereas "Address" type errors are more likely to be associated with employee C (the two points are visually close on the left side of the graph). This is the same conclusion we reached using the Chi-square test.

How Can You Use Qualitative Data?

Counts of qualitative data may obviously be used to provide relevant information to decision makers, process owners, quality professionals, etc., and several graphical and statistical tools are available for that in Minitab. Our statistical software also includes tools that are useful for analyzing qualitative values but that I didn't have space to present in this short blog (for example, Kappa studies, attribute sampling inspection, and nominal logistic regression).

Quantitative analysis and statistics might still be used more extensively in the service sector to improve quality and customer satisfaction. Of course, analyses of qualitative data are also often performed in the manufacturing industry. If you're not already using it, please download our free 30-day trial and see what you can learn from your data!

Could Deflated Footballs Have Improved the Patriots' Pass Catching?


Deflated football

In the past week there has been a big commotion over this article, which shows that since 2007 the Patriots have fumbled at a rate far lower than the rest of the NFL. Why 2007? Because that’s the year the NFL changed its policies to allow every team to use their own footballs, even when playing on the road. So if the Patriots were going to try to gain an advantage by deflating footballs, that’s the year they would have started to do it.

Now, there have been several articles pointing out some flaws in the analysis (here’s a good one if you’re interested), so I’m not going to get into any statistics that have to do with fumbling. Instead I’m going to look at something else that I think could be affected by deflated footballs.

Dropped passes.

After all, if a deflated football makes it easier to carry, it should make it easier to catch, too. And when your offense is centered around one of the greatest quarterbacks of all time, making passes easier to catch would be a huge advantage. So I’m going to see if we can detect anything fishy about the Patriots' percentage of dropped passes since 2007.

For every NFL team, I obtained the percentage of passes that they dropped in each season from 2000-2014 (via sportingcharts.com). You can get the data I used here.

The 2000-2006 Era

Prior to 2007, the home team provided the footballs used by both teams in the game. Any advantage that you might gain by deflating them would also be gained by the opposing team. So let’s see how the Patriots in the Belichick era (which started in 2000) stacked up to the rest of the league when they couldn’t have gained an edge by tampering with the footballs.

Individual Value Plot

Time Series Plot

From 2000 to 2006, the New England receivers had some of the best hands in the NFL. Only three teams dropped a smaller percentage of passes than them (two of which played in a dome) and their drop percentage was less than the league average 6 out of the 7 years. Were the Patriots able to continue this trend after the rule change in 2007?

The 2007-2014 Era

Take one of the best quarterbacks ever (say, Tom Brady), give him one of the best receivers ever (say, Randy Moss), and the result is the 2007 Patriots, one of the most explosive offenses in the history of the NFL. In the same vein, imagine you take an already good pass-catching team (say, the Patriots from 2000-2006), give them an additional advantage in catching passes (say, deflated footballs), and the result should be a superior pass-catching team that is unrivaled by any other team in the NFL.

At least, that’s how you would imagine it to work, right?

Individual Value Plot

Time Series Plot

Conspiracy theorists can stop reading now. There is nothing to suggest that the Patriots have a drop percentage that is significantly lower than the rest of the league. In fact, their drop percentage after the rule change in 2007 is a full percentage point higher than it was from 2000-2006.

If this data analysis showed that the Patriots did in fact have a lower drop percentage than the rest of the league, we could continue to break down the data into specifically outdoor games in cold and wet weather. But with the Patriots' drop percentage from 2007 to 2014 already so high, it is highly unlikely we’d reach a different conclusion. If the Patriots have been deflating footballs since 2007, it doesn’t appear to have improved their pass catching at all. So if you want to bring down the Patriots for cheating, you'll have to look somewhere else.

Statistics: Another Weapon in the Galactic Patrol’s Arsenal


by Matthew Barsalou, guest blogger. 

E. E. Doc Smith, one of the greatest authors ever, wrote many classic books such as The Skylark of Space and his Lensman series. Doc Smith’s imagination knew no limits; his Galactic Patrol had millions of combat fleets under its command and possessed planets turned into movable, armored weapons platforms. Some of the Galactic Patrol’s weapons may be well known. For example, there is the sunbeam, which concentrated the entire output of a sun’s energy into one beam.

Amazing Stories featuring E. E. "Doc" Smith

The Galactic Patrol also created the negasphere, a planet-sized dark matter/dark energy bomb that could eat through anything. I’ll go out on a limb and assume that they first created a container that could contain such a substance, at least briefly.

When I read about such technology, I always have to wonder “How did they test it?” I can see where Minitab Statistical Software could be very helpful to the Galactic Patrol. How could the Galactic Patrol evaluate smaller, torpedo-sized units of negasphere? Suppose negasphere was created at the time of firing in a space torpedo and needed to be contained for the first 30 seconds after being fired, lest it break containment early and damage the ship that is firing it or rupture the torpedo before it reaches a space pirate.

The table below shows data collected from fifteen samples each of two materials that could be used for negasphere containment. Material 1 has a mean containment time of 33.951 seconds and Material 2 has a mean of 32.018 seconds. But is this difference statistically significant? Does it even matter?

  Material 1    Material 2
  34.5207       32.1227
  33.0061       31.9836
  32.9733       31.9975
  32.4381       31.9997
  34.1364       31.9414
  36.1568       32.0403
  34.6487       32.1153
  36.6436       31.9661
  35.3177       32.0670
  32.4043       31.9610
  31.3107       32.0303
  34.0913       32.0146
  33.2040       31.9865
  32.5601       32.0079
  35.8556       32.0328

The questions we're asking and the type and distribution of the data we have should determine the types of statistical tests we perform. Many statistical tests for continuous data require an assumption of normality, and this can easily be checked in our statistical software by going to Graph > Probability Plot… and entering the columns containing the data.

probability plot  of material 1

probability plot of material 2

The null hypothesis is “the data are normally distributed,” and the resulting P-values are greater than 0.05, so we fail to reject the null hypothesis. That means we can evaluate the data using tests that require the data to be normally distributed.

To determine if the mean of Material 1 is indeed greater than the mean of Material 2, we perform a two sample t-test: go to Stat > Basic Statistics > 2 Sample t… and select “Each sample in its own column.” We then choose “Options..” and select “Difference > hypothesized difference.”

two-sample t-test and ci output
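
If you do not have Minitab at hand, roughly the same comparison can be run in Python with SciPy, using the containment times from the table above. Welch's t-test with a "greater" alternative is a close analogue of Minitab's unpooled 2-sample t with the "Difference > hypothesized difference" option.

```python
from scipy import stats

material_1 = [34.5207, 33.0061, 32.9733, 32.4381, 34.1364, 36.1568, 34.6487,
              36.6436, 35.3177, 32.4043, 31.3107, 34.0913, 33.2040, 32.5601,
              35.8556]
material_2 = [32.1227, 31.9836, 31.9975, 31.9997, 31.9414, 32.0403, 32.1153,
              31.9661, 32.0670, 31.9610, 32.0303, 32.0146, 31.9865, 32.0079,
              32.0328]

# One-sided Welch t-test: is the mean containment time of Material 1 greater?
t_stat, p_value = stats.ttest_ind(material_1, material_2,
                                  equal_var=False, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
```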

The P-value for the two sample t-test is less than 0.05, so we can conclude there is a statistically significant difference between the materials. But the two sample t-test does not give us a complete picture of the situation, so we should look at the data by going to Graph > Individual Value Plot... and selecting a simple graph for multiple Y’s.

individual value plot

The mean of Material 1 may be higher, but our biggest concern is identifying a material that does not fail in 30 seconds or less. Material 2 appears to have far less variation and we can assess this by performing an F-test: go to Stat > Basic Statistics > 2 Variances… and select “Each sample in its own column.” Then choose “Options..” and select “Ratio > hypothesized ratio.” The data is normally distributed, so put a checkmark next to “Use test and confidence intervals based on normal distribution.”

two variances test output
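
The variance comparison can be cross-checked the same way: form the ratio of the two sample variances and compare it to an F distribution. This mirrors the normal-theory F-test (the option checked above), not Minitab's Bonett or Levene alternatives.

```python
import numpy as np
from scipy import stats

material_1 = [34.5207, 33.0061, 32.9733, 32.4381, 34.1364, 36.1568, 34.6487,
              36.6436, 35.3177, 32.4043, 31.3107, 34.0913, 33.2040, 32.5601,
              35.8556]
material_2 = [32.1227, 31.9836, 31.9975, 31.9997, 31.9414, 32.0403, 32.1153,
              31.9661, 32.0670, 31.9610, 32.0303, 32.0146, 31.9865, 32.0079,
              32.0328]

f_ratio = np.var(material_1, ddof=1) / np.var(material_2, ddof=1)
# One-sided p-value for H1: variance of Material 1 > variance of Material 2
p_value = stats.f.sf(f_ratio, len(material_1) - 1, len(material_2) - 1)
print(f"F = {f_ratio:.1f}, one-sided p = {p_value:.2e}")
```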

The P-value is less than 0.05, so we can conclude the evidence supports the alternative hypothesis that the variance of the first material is greater than the variance of the second material. Having already looked at a graph of the data, this should come as no surprise.

No statistical software program can tell us which material to choose, but Minitab can provide us with the information needed to make an informed decision. The objective is to exceed a lower specification limit of 30 seconds and the lower variability of Material 2 will achieve this better than the higher mean value for Material 1. Material 2 looks good, but the penalty for a wrong decision could be lost space ships if the negasphere breaches its containment too soon, so we must be certain.

The Galactic Patrol has millions of ships, so a failure rate of even one per million would be unacceptably high. We should therefore perform a capability study by going to Stat > Quality Tools > Capability Analysis > Normal… Enter the column containing the data for Material 1, use the same column for the subgroup size, and then enter a lower specification of 30. This would then be repeated for Material 2.

process capability for material 1

Process Capability for Material 2
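
As a rough cross-check of those capability reports, Ppk with only a lower specification limit can be computed directly from the overall mean and standard deviation. The sketch below does this for Material 2; it uses the overall (long-term) standard deviation and assumes normality, and it is not a substitute for Minitab's full capability report.

```python
import numpy as np
from scipy import stats

material_2 = [32.1227, 31.9836, 31.9975, 31.9997, 31.9414, 32.0403, 32.1153,
              31.9661, 32.0670, 31.9610, 32.0303, 32.0146, 31.9865, 32.0079,
              32.0328]
lsl = 30.0                                 # lower specification limit, in seconds

mean = np.mean(material_2)
sd = np.std(material_2, ddof=1)            # overall (long-term) standard deviation
ppk = (mean - lsl) / (3 * sd)              # only a lower spec, so Ppk = PPL

# Expected parts per million below the lower spec, assuming normality
ppm_below = stats.norm.cdf(lsl, loc=mean, scale=sd) * 1e6
print(f"Ppk = {ppk:.1f}, expected PPM below spec = {ppm_below:.3g}")
```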

Looking at the Minitab-generated capability studies, we can see that Material 1 can be expected to fail thousands of times per million uses, but Material 2 is not expected to fail at all. In spite of the higher mean, the Galactic Patrol should use Material 2 for the negasphere torpedoes.

About the Guest Blogger

Matthew Barsalou is an engineering quality expert in BorgWarner Turbo Systems Engineering GmbH’s global engineering excellence department. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time, Statistics for Six Sigma Black Belts, and The ASQ Pocket Guide to Statistics for Six Sigma Black Belts.

Tom Brady Is the Best Super Bowl Quarterback Ever


Tom Brady

There’s no shortage of interest this week in whether Tom Brady is the best quarterback to ever play the game of football. As a University of Tennessee alum, I have to recuse myself from that particular debate for lack of objectivity. (Everyone knows Peyton Manning is the best quarterback to ever play the game, right?) But now seems like a good time to look at some numbers that show where Brady fits among the greatest Super Bowl quarterbacks of all time. The passing data are from NFL.com and include only post-merger statistics, with my apologies to Bart Starr, Joe Namath, and Len Dawson. In the graphs that follow, gold indicates a good result for a quarterback and black indicates a bad result.

Narrow the Field

When making the case for someone to be the best Super Bowl quarterback ever, a common starting point is the number of victories. Joe Montana, Tom Brady, and Terry Bradshaw belong in the rarefied group of quarterbacks who have won 4 Super Bowls. Troy Aikman is the only other quarterback to have won at least 3 Super Bowls. If we use the number of victories as a standard for determining the best Super Bowl Quarterback ever, then these 4 make a good list of candidates.

Tom Brady has played in more Super Bowls than any other quarterback.

A Look at Passing Statistics

We could compare passer ratings among the quarterbacks, but it’s a little unfair across time. While not necessarily a perfect statistic, I’m going to compare each quarterback’s median passer rating in victories to the median passer ratings of other Super-Bowl-winning starting quarterbacks before and after their victories. I’m using the median because Jim Plunkett’s passer rating was so good in Super Bowl XV, and Ben Roethlisberger’s rating was so bad in Super Bowl XL, that I think using the mean would give an unfair advantage to Brady over Bradshaw and Montana. Here’s what the passer rating comparison looks like:

The median passer ratings for Brady and Montana both exceed the median passer ratings of their competition by 20.4 points.

In terms of the median, Aikman and Bradshaw have comparable passer ratings to other Super-Bowl-winning quarterbacks near them in time. Brady and Montana are better than their contemporaries. Amazingly, the difference of medians between Brady and his competition is identical to the difference of medians between Montana and his competition: 20.4 points.

Refining the analysis to compare Brady and Montana more closely leads me to an item that comes out in favor of the conclusion that Tom Brady is the best Super Bowl quarterback ever. This last graph shows the same median statistics for the quarterbacks as the previous graph. Each point is labeled with the margin of victory from that Super Bowl. Joe Montana’s best games, while extraordinary athletic accomplishments, came during Super Bowls where his team was much better than the competition. (Sorry Dolphins and Broncos, but you lost by more than 3 touchdowns.) Brady, in contrast, has never played in a Super Bowl where he could pad his stats in an uncompetitive contest.

Joe Montana's best games were in blowouts. All of Tom Brady's Super Bowls have been competitive matches.

Wrap Up

Without the number-of-victories requirement, the field for consideration would get much wider. Roger Staubach, Phil Simms, Doug Williams, and Jim Plunkett are easy names to come up with when you think of extraordinary performances by quarterbacks in Super Bowl victories. You might even consider Russell Wilson who, despite having 1 win and 1 loss, has two performances with higher passer ratings than Tom Brady’s highest passer rating ever. But I think Wilson’s case is far from settled. For now, with 4 victories, 3 Super Bowl MVP awards, a median passer rating that exceeds his most direct competition by over 20 points, and no victories over clearly inferior competition, Tom Brady is the best Super Bowl quarterback ever.

Bonus

There's lots of color editing in the graphs above. Want to see more about what you can do with graphs in Minitab Statistical Software? Check out Chapter 2 of the Getting Started Guide!

The photo of Tom Brady is by Keith Allison and is licensed under this Creative Commons License.

Understanding Monte Carlo Simulation with an Example


As someone who has collected and analyzed real data for a living, I find the idea of using simulated data for a Monte Carlo simulation a bit odd at first. How can you improve a real product with simulated data? In this post, I’ll help you understand the methods behind Monte Carlo simulation and walk you through a simulation example using Devize.

Process capability chart

What is Devize, you ask? Devize is Minitab's exciting new, web-based, Monte Carlo simulation software for manufacturing engineers!

What Is Monte Carlo Simulation?

The Monte Carlo method uses repeated random sampling to generate simulated data to use with a mathematical model. This model often comes from a statistical analysis, such as a designed experiment or a regression analysis.

Suppose you study a process and use statistics to model it like this:

Regression equation for the process

With this type of linear model, you can enter the process input values into the equation and predict the process output. However, in the real world, the input values won’t be a single value thanks to variability. Unfortunately, this input variability causes variability and defects in the output.

To design a better process, you could collect a mountain of data in order to determine how input variability relates to output variability under a variety of conditions. However, if you understand the typical distribution of the input values and you have an equation that models the process, you can easily generate a vast amount of simulated input values and enter them into the process equation to produce a simulated distribution of the process outputs.

You can also easily change these input distributions to answer "what if" types of questions. That's what Monte Carlo simulation is all about. In the example we are about to work through, we'll change both the mean and standard deviation of the simulated data to improve the quality of a product.
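
Devize handles all of this through a guided interface, but the core mechanics are easy to see in a few lines of Python. Everything below (the input distributions, the transfer equation, and the specification limits) is invented purely to illustrate the idea of pushing simulated inputs through a model of the process.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000                                   # number of simulated runs

# Hypothetical input distributions (means and standard deviations assumed)
temperature = rng.normal(loc=120, scale=4.0, size=n)
pressure = rng.normal(loc=12, scale=0.5, size=n)

# Hypothetical transfer equation, standing in for a fitted regression model
output = 0.05 * temperature + 1.2 * pressure

lsl, usl = 19.0, 22.0                        # assumed specification limits
defect_rate = np.mean((output < lsl) | (output > usl))
ppk = min(output.mean() - lsl, usl - output.mean()) / (3 * output.std(ddof=1))

print(f"simulated defect rate = {defect_rate:.2%}, Ppk = {ppk:.2f}")
```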

Today, simulated data is routinely used in situations where resources are limited or gathering real data would be too expensive or impractical.

How Can Devize Help You?

Devize helps engineers easily perform a Monte Carlo analysis in order to:

  • Simulate product results while accounting for the variability in the inputs
  • Optimize process settings
  • Identify critical-to-quality factors
  • Find a solution to reduce defects

Along the way, Devize interprets simulation results and provides step-by-step guidance to help you find the best possible solution for reducing defects. I'll show you how to accomplish all of this right now!

Step-by-Step Example of Monte Carlo Simulation using Devize

A materials engineer for a building products manufacturer is developing a new insulation product. The engineer performed an experiment and used statistics to analyze process factors that could impact the insulating effectiveness of the product. (The data for this DOE is just one of the many data set examples that can be found in Minitab’s Data Set Library.) For this Monte Carlo simulation example, we’ll use the regression equation shown above, which describes the statistically significant factors involved in the process.

Step 1: Define the Process Inputs and Outputs

The first thing we need to do is to define the inputs and the distribution of their values. The process inputs are listed in the regression output and the engineer is familiar with the typical mean and standard deviation of each variable. For the output, we simply copy and paste the regression equation that describes the process from Minitab statistical software right into Devize!

In Devize, we start with these entry fields:

Setup the process inputs and outputs

And, it's an easy matter to enter the information about the inputs and outputs for the process like this.

Setup the input values and the output equation

Verify your model with the above diagram and then click Simulate.

Initial Simulation Results

After you click Simulate, Devize very quickly runs 50,000 simulations by default, though you can specify a higher or lower number of simulations. (The free trial of Devize is limited to 75 simulations.)

Initial simulation results

Devize interprets the results for you using output that is typical for capability analysis—a capability histogram, percentage of defects, and the Ppk statistic. Devize correctly points out that our Ppk is below the generally accepted minimum value.

Step-by-Step Guidance for the Monte Carlo Simulation

Devize doesn’t just run the simulation and then let you figure out what to do next. Instead, Devize has determined that our process is not satisfactory and presents you with a smart sequence of steps to improve the process capability.

How is it smart? Devize knows that it is generally easier to control the mean than the variability. Therefore, the next step that Devize presents is Parameter Optimization, which finds the mean settings that minimize the number of defects while still accounting for input variability.

Next steps leading to parameter optimization

Step 2: Define the Objective and Search Range for Parameter Optimization

At this stage, we want Devize to find an optimal combination of mean input settings to minimize defects. After you click Parameter Optimization, you'll need to specify your goal and use your process knowledge to define a reasonable search range for the input variables.

Setup for parameter optimization

And, here are the simulation results!

Results of the parameter optimization

At a glance, we can tell that the percentage of defects is way down. We can also see the optimal input settings in the table. However, our Ppk statistic is still below the generally accepted minimum value. Fortunately, Devize has a recommended next step to further improve the capability of our process.

Next steps leading to a sensitivity analysis

Step 3: Control the Variability to Perform a Sensitivity Analysis

So far, we've improved the process by optimizing the mean input settings. That reduced defects greatly, but we still have more to do in the Monte Carlo simulation. Now, we need to reduce the variability in the process inputs in order to further reduce defects.

Reducing variability is typically more difficult. Consequently, you don't want to waste resources controlling the standard deviation for inputs that won't reduce the number of defects. Fortunately, Devize includes an innovative graph that helps you identify the inputs where controlling the variability will produce the largest reductions in defects.

Setup for the sensitivity analysis

In this graph, look for inputs with sloped lines because reducing these standard deviations can reduce the variability in the output. Conversely, you can ease tolerances for inputs with a flat line because they don't affect the variability in the output.

In our graph, the slopes are fairly equal. Consequently, we'll try reducing the standard deviations of several inputs. You'll need to use process knowledge in order to identify realistic reductions. To change a setting, you can either click the points on the lines, or use the pull-down menu in the table.
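
Sticking with the invented transfer equation from the earlier sketch, here is what that kind of "what if" check looks like in plain Python: re-run the simulation with each input's standard deviation halved and compare the defect rates. In this made-up example the pressure term dominates the output variation, so tightening it pays off far more than tightening temperature.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
lsl, usl = 19.0, 22.0                        # assumed specification limits

def simulated_defect_rate(temp_sd, press_sd):
    """Defect rate of the hypothetical process for the given input sigmas."""
    temperature = rng.normal(120, temp_sd, n)
    pressure = rng.normal(12, press_sd, n)
    output = 0.05 * temperature + 1.2 * pressure
    return np.mean((output < lsl) | (output > usl))

print("baseline              :", simulated_defect_rate(4.0, 0.50))
print("halve temperature sd  :", simulated_defect_rate(2.0, 0.50))
print("halve pressure sd     :", simulated_defect_rate(4.0, 0.25))
```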

Final Monte Carlo Simulation Results

Results of the sensitivity analysis

Success! We've reduced the number of defects in our process and our Ppk statistic is 1.41, which is above the benchmark value. The assumptions table shows us the new settings and standard deviations for the process inputs that we should try. If we ran Parameter Optimization again, it would center the process and I'm sure we'd have even fewer defects.

To improve our process, Devize guided us on a smart sequence of steps during our Monte Carlo simulation:

  1. Simulate the original process
  2. Optimize the mean settings
  3. Strategically reduce the variability

If you want to try Monte Carlo simulation for yourself, sign up for a free trial subscription of Devize!


Angry Birds?


Recently, Minitab’s Joel Smith posted about his vacation and being pooped on twice by birds. Then guest blogger Matthew Barsalou wrote a wonderful follow-up on the chances of Joel being pooped on a third time. While I cannot comment on how Joel has handled this situation psychologically so far, I can say that if I had been pooped on twice in a short amount of time, I would be wary of our feathered friends for quite a while.

falcon

If Joel is experiencing any ongoing distress about these recent events, one avenue he could take is birdwatching. Ebird.org allows birdwatchers to easily keep track of their observations. By observing and documenting what he sees, Joel could take a more proactive approach in avoiding a third "encounter." For example, here is what birdwatcher Ron Crandall documented on December 14, 2014, on ebird:

Observational location: PSU (Univ. Park)--central campus

  Red-tailed Hawk            2
  Rock Pigeon                2
  Blue Jay                   2
  Tufted Titmouse            3
  White-breasted Nuthatch    2
  Golden-crowned Kinglet     2
  White-throated Sparrow     8
  Dark-eyed Junco           34
  American Goldfinch         3

We can look at this data visually by using Minitab’s Pareto Chart, which can be found under Stat > Quality Tools > Pareto Chart. (If you want to follow along and don't already have Minitab on your computer, please download the free trial.) The dialog window would be filled out as shown below:

Here are the results after pressing OK: 

Pareto charts are a very common tool, and a very common question I receive from Minitab users is “How do I remove the table located below the Pareto Chart?" To delete the table, you simply have to select the labels ‘Observational Count’, ‘Percent’, and ‘Cum %’, one at a time and hit the delete key. However, the only way I have found to bring these rows back is by going to Edit > Undo. So if you’ve made multiple changes to your project since you deleted this Pareto table, you’ll have to either recreate the graph or undo multiple times!

It looks like Ron documented more visual contact with the Dark-eyed Junco than with any other bird that day. What does that mean for Ron? Most likely...nothing. If Joel were to take up bird watching, it may tell him how many birds are in visual range and what species they belong to, but it probably won't tell him if he's being targeted. That is, unless he finds himself in Tippi Hedren's situation in Hitchcock's The Birds. (Incidentally, I was having a hard time finding information on how many birds were used for the movie. One web site mentions that 20,000 crows were captured in Arizona, and that's just one location.)

Maybe a statistical analysis like a binary logistic regression can shed more light on this matter. You use binary logistic regression to perform logistic regression on a binary response variable. A binary response variable has two possible values, the event and the non-event. You could then attempt to model a relationship between various predictor variables and this binary response.

Let’s say, hypothetically, Joel constructs a helmet that could track when a bird was directly overhead of him. The helmet could also track if they attempted to poop on him, and we could define this as the “event.” When the bird didn't poop, we could define this as the “non-event.” The predictors can be continuous or categorical in nature. Three hypothetical (and quite silly) variables of interest could be outdoor temperature, whether one is wearing a Scarecrow outfit, and the loudness of the boombox that person is carrying.

Over the course of some time, a data sheet might start to look like this:

Under C1, “1” represents the event and “0” represents the non-event. To analyze this in Minitab, you would go to Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model, and fill in your dialog window as follows:

If you wanted to add interactions, you’d need to go into the Model… sub-menu:

To add the interaction between Temperature and ‘dB levels...’, you’d follow these steps:

  1. Shift-select both Temperature and dB levels… under the Predictors box.
  2. Under the interactions through order drop down, choose 2.
  3. Click on the Add button.

You’d see Temperature*‘dB levels of carried boombox’ under the Terms in the model box. After hitting "OK" in a few dialog windows, your results will appear in the Session window.
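
For anyone who prefers code, here is a hedged Python analogue of the same hypothetical model using statsmodels; the data frame is filled with random made-up values just so the example runs, and the formula includes the temperature-by-loudness interaction described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: 1 = the bird pooped (the event), 0 = it did not (the non-event)
rng = np.random.default_rng(3)
n = 200
birds = pd.DataFrame({
    "pooped": rng.integers(0, 2, n),
    "temperature": rng.normal(20, 5, n),          # outdoor temperature
    "scarecrow": rng.integers(0, 2, n),           # wearing a scarecrow outfit?
    "db_level": rng.uniform(60, 110, n),          # loudness of the boombox
})

# Binary logistic regression with the temperature * dB-level interaction
model = smf.logit("pooped ~ temperature * db_level + C(scarecrow)", data=birds).fit()
print(model.summary())
```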

Hopefully there will not be a third time for Joel. If there is, he may take solace in the fact that some people view it as good luck. However...my shirt would disagree. For more information on how to interpret results from a binary logistic regression, please click on the links below.

Analyzing Titanic Survival Rates, Part II: Binary Logistic Regression

Interpreting Halloween Statistics with Binary Logistic Regression

Using Binary Logistic Regression to Investigate High Employee Turnover

 

How Powerful Am I? Power and Sample Size in Minitab


power

 

In my experience, one of the hardest things for users to wrap their heads around is the Power and Sample Size menu in Minitab's statistical software, and more specifically, the field that asks for the "difference" or "difference to detect."

Let’s start with power. In statistics, the definition of power is the probability that you will correctly reject the null hypothesis when it is false. That’s a little abstract, so we can think of it a little more simply as the likelihood that you will find a significant effect or difference when one truly exists.

Let’s look at an example, and interpret the results. If we go to Stat > Power and Sample Size > 2 sample t, and enter the following:

We get the following results:

What this tells us is the following: Assuming the standard deviation of our population is 2.5, and given that we want to achieve a power of 0.9, we would need a sample size of 7 to detect a difference in means of 5.
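
As a quick cross-check, the same calculation can be done with statsmodels; the "difference" and standard deviation combine into a standardized effect size (5 / 2.5 = 2.0), and the fractional answer is rounded up to the next whole observation per group.

```python
from math import ceil
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=5 / 2.5, alpha=0.05, power=0.9)
print(ceil(n_per_group))   # 7, matching the Minitab output above
```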

It’s a pretty straightforward idea that, all else being equal, the more samples you have, the more power you will have as well. But how does power relate to the “difference” a test can detect? If you play around in Minitab, you’ll find out how this affects power.

The first thing to be aware of is the link between sample size and the size of the difference you want to detect. In the 2-sample t dialog box shown above, this is what the "Differences" field is used for. The larger the difference you’re interested in, the smaller the sample size you would need to detect it.

A professor once offered me the following example, which I have found invaluable. Let’s imagine Michael Jordan gets together with LeBron James for some pickup basketball. They play a series of games until one player reaches 11 points. This would probably proceed with the two players alternating wins, until one person eventually pulled ahead. How many games would it take for you to determine who is a better player? Since the talent level is so similar, it would probably take a large number of games. Statistically, this is an example of trying to detect a very small difference. To detect a small difference, you need a much larger sample—in this case, games played—to achieve sufficient power.

Now, if Michael Jordan would play me in a pickup basketball game (a matchup with two players who do not have a similar talent level), it would take relatively few samples to realize that one player is better than another.

Hopefully, this provides a little more light on what Minitab is asking for in the “difference” field, and how potential changes in that affect your power and sample size calculations.

 

How to Use Statistical Software to Predict the Exchange Rate


The Minitab Fan section of the Minitab blog is your chance to share with our readers! We always love to hear how you are using Minitab products for quality improvement projects, Lean Six Sigma initiatives, research and data analysis, and more. If our software has helped you, please share your Minitab story, too!

My LSS coach suggested that I regularly conduct data analysis to refresh my Minitab skills. I'm sure many of you have heard about the devaluation of the Russian currency, caused by European Union and United States sanctions and dropping oil prices.

I decided to check this situation with statistical analysis. The question I intended to answer was simple: is there any correlation between Brent oil price and the exchange rate between the Russian ruble and the U.S. dollar?

I found relevant data from 01-Jan-2014 through 18-Dec-2014 and used the regression tools in Minitab statistical software to interpret it.

First I looked at the model and saw the regression equation was:

RR/USD = 78.90 - 0.4071 USD/bbl (Brent), with R-Sq(adj) = 90.2%

This means the model describes about 90% of the behavior of the RR/USD exchange rate, which is very good. The remaining 10% can be attributed to factors other than the USD/bbl (Brent) price, such as sanctions.
Then I paid attention to Residual Plots.
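
The original price series is not included in the post, so the sketch below rebuilds stand-in data from the fitted equation above and shows how the same fit and residuals could be produced in Python with statsmodels; every number here is simulated, not the actual 2014 data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data generated from the fitted equation above, plus random noise
rng = np.random.default_rng(5)
brent = np.linspace(110, 60, 250)                        # oil price, USD/bbl
rr_usd = 78.9 - 0.4071 * brent + rng.normal(0, 2, 250)   # rubles per US dollar

prices = pd.DataFrame({"brent": brent, "rr_usd": rr_usd})
model = smf.ols("rr_usd ~ brent", data=prices).fit()

print(model.params)                                  # intercept and slope
print(f"adjusted R-squared = {model.rsquared_adj:.3f}")
residuals = model.resid                              # plot these in time order to spot patterns
```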


Ideally, the residuals should be distributed randomly. In the real world, we see that at the end of the observation period the residuals are much higher than expected (see the graphs in the right column).

This correlates well with recent news: many people in Russia are now buying foreign currency to avoid further devaluation. That behavior significantly increases currency demand and pushes the exchange rate up.

If we build the model from the period when the oil price was moving between $110 and $85 USD/bbl (Brent), then we would expect the RR/USD exchange rate to be about 45-50 rubles per US dollar when the oil price is 60. But we actually see that it falls within 60-70, which may reflect panic on the Russian exchange market.

The conclusions I draw from my analysis:

1. There is a strong correlation between Brent oil price and the exchange rate between the Russian ruble and U.S. dollar. About 90% of the ruble's fall can be explained by the oil price.

2. Be careful when you build a regression model: do not “extend” it to an interval you have not tested, as you may encounter other significant factors that are beyond the model's consideration.

You can see that the ruble is now cheaper than would be expected based on oil prices. So maybe it is a good time for you to visit Russia!

Alexander Drevalev
Consultant and Master Black Belt
Accenture
Moscow, Russia


 

The Brownie of Blednoch and Lean Six Sigma Belt Compensation


I typically attend a few Lean Six Sigma conferences each year, and at each there is at least one session about compensating belts. Any number of ideas exist out there, but they commonly include systems that provide a percentage of savings as a portion of pay or provide a bonus for meeting target project savings. There are always issues with these pay schemes, including the fact that belt compensation may be tied to the value of projects assigned to them or to the accuracy of estimated savings when the project was assigned (an inexact science at best).

There is a larger fallacy with these schemes, and to explain it you must know the story of the Brownie of Blednoch by William Nicholson. You can find the original poem as well as adapted stories online, but I'll re-tell the story here in my own words:

In the town of Blednoch, the people are gathered in the town square one day when they hear the sound of humming coming up the road and see a strange-looking, bearded man approaching. As he gets closer, the townspeople realize he is humming "Any work for Aiken-Drum? Any work for Aiken-Drum?" Granny, the wisest person in town, recognizes the man as a "Brownie" and explains to the townspeople that Brownies are the hardest-working people anywhere and have simple needs.

Aiken-Drum asks if the townspeople have any work he can do and says he does not need money, clothing, or fancy living, just a warm, dry place to sleep and something warm to drink at bedtime. The town blacksmith gives Aiken-Drum a horse blanket and allows him to sleep in a corner of his barn, and each night Granny brings him a warm drink. From that day on, the townspeople are amazed at the deeds he performs, often without being asked. A farmer finds his sheep have been led into the barn just before a storm. The town church finally gets built. One sick resident has Aiken-Drum show up, clean her entire house, and cook soup for her. The baker finds his wagon wheel repaired on the morning he is to deliver goods to town. Even the kids love Aiken-Drum, who often builds fires and sings songs and plays games with them.

Everyone is extremely pleased with the Brownie's work. Except Miss Daisy, who thinks it's unfair that Aiken-Drum isn't better compensated for his outstanding work. The other townspeople try to convince Miss Daisy that Brownies are driven by the love of their work for others and not by material things, but she can't be swayed and one evening while he is out working she leaves him some new pants.

In the morning, Aiken-Drum is gone and never seen again.

If you are trying to assign a specific dollar figure to the work your belts do, you are Miss Daisy.

In the highly recommended book Thinking, Fast and Slow by Daniel Kahneman, the difference between social norms and business norms is explained quite clearly. One example to illustrate the difference: suppose you need help assembling a new shed. Ask a close friend for help, and they'll likely give you their whole Saturday morning. Ask the same friend and offer them $20, and they'll likely come up with an excuse, because they feel insulted that you think their Saturday morning and energy are worth only $20. You can't put a price on the motivating factors behind the hard work.

You see, the best work comes from those driven by something other than the material compensation.

I'm not suggesting that we not pay belts—of course there is a market value for the job they do, and they should be compensated fairly. But if we want belts to work like Brownies, instead of expending time and energy on fine-tuning a bonus and compensation system tied to the tasks they do, we would be much better off considering what drives the best of the best and how we can provide that. Some examples:

  1. If the belt's job requires regular travel to work on projects, make their travel less stressful by allowing them to choose their own flights and accommodations or simplify their expense-reporting procedures. If any travel is extended, offer to pay airfare for their family and for appropriate accommodations.
  2. Offer the belt several project options and allow them to choose to work on the project that best aligns with their interests.
  3. The best belts are driven by a passion to reach towards perfection and a disdain for waste and defects—so do not assign projects in an area where management or process owners will not appreciate striving for this. Several programs I am familiar with use a pull system, where managers have to ask that a project be done in their area.
  4. Many belts feel their work is very important to the success of the organization but is not noticed by many outside of the improvement program and the areas they've done projects in. A high-level executive personally contacting the belt and expressing gratitude AND familiarity with what a particular project accomplished will achieve a level of employee satisfaction that a bonus cannot.

These are just some examples. What matters most is that fair compensation is offered and that, after that, there is no more haggling over money—instead, focus on the real drivers behind the belt's work and satisfy those to the best of your ability. I know of no organization that has created Brownies through financial compensation and bonuses, but I know of many driven belts who do great work because their company values what they are trying to achieve and tries to remove all barriers to that success.

Turn your belts into Brownies and you'll hear humming in the halls!

What’s the Probability that Your Favorite Football Team Will Win?


If you wanted to figure out the probability that your favorite football team will win their next game, how would you do it?  My colleague Eduardo Santiago and I recently looked at this question, and in this post we'll share how we approached the solution. Let’s start by breaking down this problem:

  1. There are only two possible outcomes: your favorite team wins, or they lose. Ties are a possibility, but they're very rare. So, to simplify things a bit, we'll assume they are so unlikely that they can be disregarded in this analysis.
  2. There are numerous factors to consider.
    1. What will the playing conditions be?
    2. Are key players injured?
    3. Do they match up well with their opponent?
    4. Do they have home-field advantage?
    5. And the list goes on...

First, since we assumed the outcome is binary, we can put together a Binary Logistic Regression model to predict the probability of a win occurring.  Next, we need to find which predictors would be best to include.  After a little research, we found that the betting markets seem to take all of this information into account.  Basically, we are utilizing the wisdom of the masses to find out what they believe will happen.  Since betting markets take this into account, we decided to look at the probability of a win, given the spread of an NCAA football game. 

Data Collection

In case you are not convinced of how accurate the spreads can be in determining the outcome of a game (win or loss), we collected data for every college football game played between 2000 and 2014. The structure of the data is illustrated below. The third column has the spread (or line) provided by casinos in Vegas, and the last column displayed is the actual score differential (vscore – hscore).

Note: In betting lines, a negative spread indicates how many points you are favored over the opponent.  In short, you are giving the opponent a certain number of points.

The original win-or-lose question can be rephrased then as follows: Is the difference between the spreads and actual score differentials statistically significant?

Since we have two populations that are dependent, we compare them via a paired t-test. In other words, both the Spread and scoreDiffer are observations (a priori and a posteriori) for the same game, and they reflect the relative strength of the home team i versus the road team j.

Using Stat > Basic Statistics > Paired t in Minitab Statistical Software, we get the output below.

Since the p-value is larger than 0.05, we can conclude from the 15 years of data that the average difference between Las Vegas spreads and actual score differentials is not significantly different from zero. With this we are saying that the bias that could exist between both measures of relative strength for teams is not different from zero, which in lay terms means that on average the error that exists between Vegas and actual outcomes is negligible.

It is worth noting that the results above were obtained with a sample size of 10,476 games! So we hope you'll excuse our not including power calculations here.
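For readers who want to reproduce this comparison outside of Minitab, here is a minimal sketch of the same paired t-test in Python. The file name and column names are hypothetical stand-ins for the data described above.

```python
# A minimal sketch of the paired comparison, assuming one row per game with
# the Vegas spread and the actual score differential (vscore - hscore).
import pandas as pd
from scipy import stats

games = pd.read_csv("games.csv")        # hypothetical file with the 2000-2014 games
spread = games["Spread"]                # a priori measure of relative strength
score_diff = games["scoreDiffer"]       # a posteriori result of the game

# Paired t-test: is the mean difference between spread and score differential zero?
t_stat, p_value = stats.ttest_rel(spread, score_diff)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```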

As a final remark on spreads, the histogram of the differences below shows a couple of interesting things:

  • The average difference between the spreads and score differentials seems to be very close to zero. So don’t get too excited yet, as the spreads cannot be used to predict the exact score differential for a game. Nevertheless, on average the spread is an essentially unbiased estimate of the score differential.
  • The standard deviation, however, is 15.5 points. That means that if a game shows a spread for your favorite team of -3 points, the outcome will, with high confidence, fall within plus or minus 2 standard deviations of the point estimate, which is -3 ± 31 points in this case. So your favorite team could win by 34 points, or lose by 28!

Figure 1 - Distribution of the differences between scores and spreads

The Binary Logistic Regression Model

By this point, we hope you are convinced about how good these spread values could be. To make the output more readable we summarized the data as follows:

Creating our Binary Logistic Regression Model

After summarizing the data, we used the Binary Fitted Line Plot (new in Minitab 17) to come up with our model.  

If you are following along, here are the steps:

  1. Go to Stat > Regression > Binary Fitted Line Plot
  2. Fill out the dialog box as shown below and click OK.

The steps will produce the following graph:

Interpreting the Plot

If your team is favored to win by 25 points or more, you have a very good chance of winning the game, but what if the spread is much closer?

For the 2014 National Championship, Ohio State was a 6-point underdog to Oregon.  Looking at the Binary Fitted Line Plot, the probability that a 6-point underdog wins the game is close to 31% in college football. 
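Outside of Minitab, a comparable fit can be sketched with an ordinary logistic regression. The file and column names below are hypothetical, and the sign convention assumes the spread is coded from the team of interest's perspective (positive = underdog).

```python
# A rough analogue of the Binary Fitted Line Plot: logistic regression of a
# 0/1 win indicator on the point spread (hypothetical file and column names).
import pandas as pd
import statsmodels.api as sm

games = pd.read_csv("games.csv")
X = sm.add_constant(games["Spread"])            # intercept + spread
model = sm.Logit(games["win"], X).fit()

# Predicted probability of a win for a 6-point underdog (spread = +6)
print(model.predict([[1.0, 6.0]]))              # roughly 0.3, per the plot above
```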

Ohio State University ended up beating Oregon by 22 points.  Given that the differences described in Figure 1 are normally distributed and centered at zero, then if we assume the spread is given (or known), we can compute the probability of the national championship game outcome being as extreme as—or more extreme than—it turned out.

With Ohio State a 6-point underdog, and a standard deviation of 15.53, we can run a Probability Distribution Plot to show that Ohio State would win by 22 points or more only 3.6% of the time.
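The same calculation is easy to sketch in Python, assuming the score differential is normal with mean equal to the team's (negative) spread and standard deviation 15.53:

```python
from scipy.stats import norm

# P(Ohio State wins by 22 or more), with mean -6 (6-point underdog) and sd 15.53
p = norm.sf(22, loc=-6, scale=15.53)
print(round(p, 3))   # about 0.036, i.e., 3.6% of the time
```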

Eduardo Santiago and I will be giving a talk on using statistics to rank college football teams at the upcoming Conference on Statistical Practice in New Orleans. Our talk is February 21 at 2 p.m., and we would love to have you join us.   

A Little Trash Talk: Improving Recycling Processes at Rose-Hulman, Part II


I left off last with a post outlining how the Six Sigma students at Rose-Hulman were working on a project to reduce the amount of recycling thrown in the normal trash cans in all of the academic buildings at the institution.

Using the DMAIC methodology for completing improvement projects, they had already defined the problem at hand: how could the amount of recycling that’s thrown in the normal trash cans be reduced? They collected baseline data for the types of recyclables thrown into the trash, including their weights and frequencies. In order to brainstorm ideas to improve recycling efforts at Rose-Hulman and to determine causes for the lack of recycling in the first place, the students created fishbone diagrams.

Implementing Improvements

The students then entered the ‘Improve’ phase of the project and formed a list of recommended actions, based on the variables they could control, to motivate recycling practices in a four-week time frame. The short time frame was fixed by the length of an academic quarter.

This list of actions included the following:

  • Placing a recycling bin next to each and every trash can throughout the academic buildings, including classrooms.
  • Constructing and displaying posters next to or on recycling bins indicating what items are recyclable and are not recyclable on campus:

  • Informing campus about Rose-Hulman recycling policies, as well as the current percentage of recyclables on campus (by weight), determined during the Measure phase. (The information was shared with the entire campus via an email and an article in the school newspaper by Dr. Evans.)
  • Encouraging good recycling habits through creative posters, contests, incentives, and concepts related to “The Fun Theory.” Fun theory seeks to change people’s behaviors by making activities fun. For example, the class discussed ways to make recycling bins produce amusing sounds when items are placed in them.

The students implemented many of these improvements and then gathered post-improvement data at the end of four weeks during four fixed collection periods.

Analyzing Pre-Improvement vs. Post-Improvement Data

There were a total of 15 areas in the academic buildings where recycling data was collected. Fifteen student teams were each assigned one of these areas for the entire project, collecting data during the pre- and post-improvement phases. There are a total of 60 data points in each phase (15 areas × 4 collection periods).

The teams compared pre-improvement and post-improvement statistics for the percentage of recyclables in the trash with Minitab (using Stat > Basic Statistics > Display Descriptive Statistics in the software):

Some highlights of this analysis:

  • The mean percentage of recyclables in trash decreased from 37% to 24%, which is a reduction of 35%.
  • The median percentage of recyclables in trash decreased from 31% to 17%, which is a reduction of 45%.
  • The total average weight of recyclables in trash over the baseline period (4 days) decreased from 84.3 pounds with a standard deviation of approximately 7.89 pounds to 45.9 pounds with a standard deviation of approximately 5.19 pounds during the improvement period, which is a reduction in the total average weight of 46%.
  • The mean recyclable weight for all areas decreased from 1.405 pounds to 0.765 pounds, which is a reduction of about 46%.

They were also able to view the improvements graphically with boxplots in Minitab:

Boxplots of the percentage of recyclables during the four collection periods in the pre-improvement phase (left plot) and the four collection periods in the post-improvement phase (right plot).

Although it is not apparent in these boxplots that the mean percentage of recyclables (the circles with the crossbars) has decreased in the improvement phase, it is obvious that the median percentage of recyclables (line within the boxplot) has decreased.

In addition, the students used Minitab plots to track changes in percentage of recyclables in the trash per area, both pre and post-improvement:  

Plot of the mean percentage of recyclables in the trash by academic building area for both pre and post-improvement phases. The mean is averaged over the four collection times in each phase.

These plots helped the students to graphically see gaps between the percentages of recyclables collected pre and post-improvement by area. Given the location of each academic area, the changes between pre and post means were justifiable and informative.

And in order to statistically determine if the true mean percentage of recyclables post-improvement was significantly less than the true mean percentage of recyclables pre-improvement, the students ran a paired t-test for all 60 data points, pairing by area and day. See below for the Minitab output for this test:

With a t-test statistic of 4.66, it is evident that the recycling improvements made a difference! They ran a paired t-test since the pre- and post-improvement recyclable percentages were linked by area and day. They did not need to check for normality of the paired differences since they had n = 60 data points.
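For anyone following along outside of Minitab, here is a sketch of the same one-sided paired t-test in Python; the file and column names are hypothetical, with one row per area/day pair.

```python
# One-sided paired t-test: is the post-improvement percentage of recyclables
# lower than the pre-improvement percentage? (Hypothetical file/column names;
# 60 rows, paired by area and collection day.)
import pandas as pd
from scipy import stats

recycling = pd.read_csv("recycling.csv")
t_stat, p_value = stats.ttest_rel(recycling["pre_pct"], recycling["post_pct"],
                                  alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4f}")
```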

After collecting baseline data, the students had created a Pareto Chart to display the type of trash (and recyclables) found in the regular trash cans. They also created a Pareto Chart for the post-improvement data—you can see both below to compare (pre-improvement – left, post-improvement – right):

Plastics were the most common recyclable item in the trash both pre- and post-improvement. Overall, aside from the Java City coffee cups increasing post-improvement, the other categories saw a noticeable decrease post-improvement compared to pre-improvement.

To complete their pre- and post-improvement analysis, the students also ran a capability study in Minitab to determine the pre and post-improvement capability of recyclables in the trash. Post-improvement, both their Pp and Ppk values improved.

Results and Future Improvement Efforts

Of the 15 areas (Spring Quarter 2014) that collected pre-improvement and post-improvement data over the span of two four-day collection periods, only two areas had an increased percentage of recyclables in the trash after the improvements were made. These two areas had explainable “special causes” associated with them.

One area with increased recyclables after improvements was the Moench Mailroom. The Moench Mailroom area is next to the campus mailroom where students pick up their daily mail, graded homework assignments, etc., in their mail slots. It was evident during post-improvement trash collection that a student had emptied an entire quarter’s worth of mail, including junk mail, magazines, and assignments, into the trash can by the mailroom. Since the student’s name was on the mail and assignments, it was clear that the recyclables discarded in the trash were from this one student. He certainly threw off that area’s post-improvement data!

Although the improvement efforts were short-term, the students saw their efforts significantly decrease the percentage of recyclables being discarded in the normal trash cans at the academic buildings. At the beginning of Spring Quarter 2014, 36% of the trash (by weight) consisted of recyclable items. At the end of Spring Quarter 2014, after the improvement phase, only 24% of the trash (by weight) consisted of recyclable items!

They were not only able to decrease the carbon footprint of their school and aid in their school’s sustainability program, but the increase in recycling also has the potential to create revenue for the school down the road (if they choose to recycle aluminum cans or sell paper, for example).

Dr. Evans and the students have shared their results with the campus community and plan to work with the administration to publish their results, which will hopefully highlight why these improvement efforts should stick around long-term. Way to go Dr. Evans and Rose-Hulman Six Sigma Students!

Many thanks to Dr. Evans for her contributions to this post! 

Using Regression to Evaluate Project Results, part 1


By Peter Olejnik, guest blogger.

Previous posts on the Minitab Blog have discussed the work of the Six Sigma students at Rose-Hulman Institute of Technology to reduce the quantities of recyclables that wind up in the trash. Led by Dr. Diane Evans, these students continue to make an important impact on their community.

As with any Six Sigma process, the results of the work need to be evaluated. A simple two-sample t-test could be performed, but it gives us a very limited amount of information – only whether there is a difference between the before- and after-improvement data. But what if we want to know whether a certain item or factor affects the amount of recyclables disposed of? What if we want to know how large an effect the important factors have? What if we want to create a predictive model that can estimate the weight of the recyclables without the use of a scale?

Sounds like a lot of work, right? But actually, with the use of regression analysis tools in Minitab Statistical Software, it's quite easy!

In this two-part blog post, I’ll share with you how my team used regression analysis to identify and model the factors that are important in making sure recyclables are handled appropriately. 

Preparing Your Data for Regression Analysis

All the teams involved in this project collected a substantial amount of data. But some of this data is somewhat subjective. Also, this data has been recorded in a manner that is geared toward people, and not necessarily for analysis by computers. To start doing analysis in Minitab, all of our data points need to be quantifiable and in long format.

The Data as Inserted by the Six Sigma Teams


The Data, After Conversion into Long Format and Quantifiable Values

Now that we have all this data in a computer-friendly format, we need to identify and eliminate any extreme outliers, since they can distort our final model. First we create a regression model with all of the factors included. As part of this, we generate the residuals from the data versus the fit. For our analysis, we utilized deleted T-residuals. These are less affected by the skew of an outlier than regular T-residuals, making them a better indicator. They can be displayed in Minitab in the same manner as any other residual. Looking at these residuals, those with values above 4 were removed. A new fit was then created, and the process was repeated until no outliers remained.
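The same iterative screen can be sketched in Python using externally studentized (deleted) residuals from statsmodels; the data frame, response name, and predictor list below are hypothetical placeholders for the project data.

```python
# Iteratively drop rows whose deleted (externally studentized) residual exceeds
# the cutoff of 4 in absolute value, then refit, until no such outliers remain.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

data = pd.read_csv("trash_long.csv")                        # hypothetical long-format data
predictors = ["plastic", "paper", "newspaper", "aluminum"]  # illustrative subset of factors

while True:
    X = sm.add_constant(data[predictors])
    fit = sm.OLS(data["weight"], X).fit()
    deleted_t = OLSInfluence(fit).resid_studentized_external
    outliers = np.abs(deleted_t) > 4          # cutoff of 4, as described above
    if not outliers.any():
        break                                 # keep this fit; no outliers left
    data = data[~outliers]                    # drop flagged rows and refit
```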

Satisfying the Assumptions for Regression Analysis

Once the outliers have been eliminated, we need to verify the regression assumptions for our data to ensure that the analysis conducted is valid. We need to satisfy five assumptions:

  1. The mean value of the errors is zero.
  2. The variance of the errors is even and consistent (or “homoscedastic”) through the data.
  3. The data is independent and identically distributed (IID).
  4. The errors are normally distributed.
  5. There is negligible variance in the predictor values.

For our third assumption, we know that the data points should be IID, because each area’s daily trash collection should have no effect on that of other areas or on the next day’s collection. We have no reason to suspect otherwise. The fifth assumption is also believed to be met, as we have no reason to suspect that there is meaningful variance in the predictor values. This means that only three of the five assumptions still need to be checked.

Our first and second assumptions can be checked simply by plotting the deleted T-residuals against the individual factors, as well as against the fitted values, and visually inspecting the plots.




 Plots used to verify regression assumptions.

When looking over the scatter plots, it looks like these two assumptions are met. Checking the fourth assumption is just as easy. All that needs to be done is to run a normality test on the deleted T-residuals.


Normality plot of deleted T-residuals

It appears that our residuals are not normally distributed, as seen from the p-value of our test. This is problematic, as it means any analysis we conduct would be invalid. Fortunately, all is not lost: we can perform a Box-Cox analysis on our results. This will tell us if the response variable needs to be raised to a power.


Box-Cox analysis of the data

The results of this analysis indicate that the response variable should be raised to a power of 0.75. A new model and residuals can be generated from this transformed response variable, and the assumptions can be checked again.
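As a point of comparison, here is a minimal Box-Cox sketch in Python. The file and column names are hypothetical; scipy estimates lambda by maximum likelihood, whereas the analysis above settled on the rounded power 0.75.

```python
import pandas as pd
from scipy import stats

weight = pd.read_csv("trash_long.csv")["weight"]    # hypothetical response column (values must be > 0)
transformed, lam = stats.boxcox(weight.to_numpy())  # maximum-likelihood estimate of lambda
print(f"estimated lambda = {lam:.2f}")

# Or apply the rounded power suggested by the Box-Cox plot and refit the model:
weight_075 = weight ** 0.75
```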



New plots used in order to verify regression assumptions for revised model.

The residuals again appear to be homoscedastic and centered about zero.

Normality plot on deleted T-residuals

The residuals now are normally distributed. Our data is now prepped and ready for analysis!

The second part of this post will detail the regression analysis. 

 

About the Guest Blogger

Peter Olejnik is a graduate student at the Rose-Hulman Institute of Technology in Terre Haute, Indiana. He holds a bachelor’s degree in mechanical engineering and his professional interests include controls engineering and data analysis.

 


Are Big 10 Basketball Referees Biased Towards Winning Teams?


Over the weekend Penn State men's basketball coach Pat Chambers had some strong words about a foul that went against his team in a 76-73 loss against Maryland. Chambers called it “The worst call I’ve ever seen in my entire life,” and he wasn’t alone in his thinking. Even sports media members with no affiliation to Penn State agreed with him.

Jay Bilas Tweet

Dan Dakich Tweet

This wasn't the first time this season Chambers has criticized the referees. After a game against Michigan State, Chambers said he thought the teams and coaches affected how the referees called the end of close games.

"At some point, this thing's got to stop, it's got to switch, I'm not a Hall of Fame coach," Chambers said. "Nothing against Tom (Izzo), nothing against John Beilein, nothing against all these other guys, but it's got to stop."

This quote follows the same line of thinking as the tweet by Dan Dakich. In a close game, will the refs be biased towards the more established coach or team? Now, I'm not saying the refs are having a secret meeting before the game, going "Okay, Michigan St is an elite program, if this game is close we need to make sure they win." But if you're a ref being yelled at by Tom Izzo/Bo Ryan/Thad Matta, and on the other side is, well, Pat Chambers, who are you going to listen to?

One way to answer this question is to look at the outcome of close games. Specifically, games decided by 2 possessions or less or that go into overtime (and from here on out, I'll just refer to these as "close games"). We've previously seen that the outcome in close games is pretty random. In the long run, you'll win about half your close games and lose about half of them. However, if the referees are consistently making calls that go against you (or for you) it's possible that your winning percentage could deviate from .500.

Penn State's Big 10 Record in Close Games

Since Pat Chambers has coached Penn State, they are 8-19 (.296) in close Big 10 games. And in close out-of-conference (OOC) games under Chambers (which are usually against traditionally weaker schools), they are 12-5 (.706). That's a huge difference! It's pretty easy to see why Chambers is so upset. However, we're dealing with some pretty small samples. So before we jump to any conclusions, we better increase our sample size. 

I collected data for every close Penn State game since 2002. I got my data from kenpom.com, which only goes back to 2002, which is why I chose that year. The following table has the results.

Type of Game         Wins   Losses   Winning Percentage
Out-of-Conference    28     18       .609
Big 10               38     58       .396

We see that in 96 close Big 10 games, Penn State wasn't even able to win 40% of them! And if they were actually missing some sort of "clutch gene" (or whatever Skip Bayless would say), we would expect to see the same thing in their OOC games. However, they've actually won a majority of their close OOC games! 

Let's see if we can chalk this up to random variation. Assuming that their actual probability in close games is .500, what is the probability that Penn State would win 38 or fewer games in 96 tries?

Distribution Plot

If their close games were truly random, the probability that Penn State would win 38 or fewer games is only 2.6%. That's low enough to conclude it didn't happen by chance. But if the reason is actually the refs giving preferential treatment to more established teams and coaches, we would expect to see similar results for other Big 10 teams.
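This tail probability comes from a simple binomial calculation, which you could also sketch in Python:

```python
from scipy.stats import binom

# If close games were a coin flip, how likely is winning 38 or fewer out of 96?
print(binom.cdf(38, 96, 0.5))   # roughly 0.026, i.e., about 2.6%
```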

Examining the Entire Big 10

If the refs are making calls against Penn State in close games, are they doing similar things to other lowly Big 10 teams? And on the flip side, are the better teams in the Big 10 seeing an increase in the amount of close games they win? First, let's just look at who the best and worst Big 10 teams have been since 2002. Here are each team's winning percentage in only Big 10 conference games since 2002. (Because I went back to 2002, Nebraska, Maryland, and Rutgers are not included.)

Team           Wins   Losses   Winning Percentage
Wisconsin      167    67       .714
Michigan St    157    77       .671
Ohio St        153    82       .651
Illinois       133    102      .566
Purdue         123    112      .523
Michigan       119    116      .506
Indiana        117    118      .498
Iowa           99     135      .423
Minnesota      97     138      .413
Northwestern   75     159      .321
Penn St        61     174      .260

We see that Wisconsin, Michigan St, and Ohio State have been the premier Big 10 teams since 2002. And it's no surprise that they have 3 of the most established coaches in the conference with Bo Ryan, Tom Izzo, and Thad Matta. On the flip side, Iowa, Minnesota, Northwestern, and Penn State all have Big 10 winning percentages significantly under .500. And naturally, these four programs have been through a combined 12 different coaches since 2002. So if referees were biased towards winning programs, we would expect Wisconsin, Michigan St, and Ohio State to have won a higher percentage of close games at the expense of Iowa, Minnesota, Northwestern, and Penn St. But if there is no bias, we would expect each team to have a winning percentage close to .500.

The following individual plot shows the percentage of close Big 10 games each team has won since 2002.

Individual Value Plot

The 4 perennial losers in the Big 10 also just happen to be the 4 teams that have had the worst "luck" in close games. But is it really bad luck, or could officiating be giving the benefit of the doubt to the more established team/coach? And if it's the latter, could Bo Ryan be the greatest referee manipulator ever? Wisconsin has won a ridiculous 56 of 87 close Big 10 games since 2002. If you assume their true chance of winning a close game is .500, the probability that they would win 56 or more games in 87 tries is...well, it's low.

Distribution Plot

That's less than half a percent! The chances of that are about 1 in 207! Is Wisconsin getting really, really lucky, or is something else going on?

Breaking Down the Top Versus the Bottom

Granted, we should expect to see some variation in the distribution of the teams. If you take 11 different coins and flip each one 90 times, all 11 coins are not going to have heads come up 45 times. Wisconsin has an insanely high winning percentage in close games, but Michigan St and Ohio St have winning percentages right around .500. And Northwestern doesn't seem to have had as many close games go against them as Minnesota or Iowa, despite having a lower overall winning percentage than both teams.

So let's break down the close games between only our top and bottom teams. After all, if the theory is that the established coaches/programs get calls over the perennial losers, we should look at games played specifically between those teams. So here is how the top 3 Big 10 teams have fared in close games against Penn St, Iowa, and Minnesota. And don't worry, Northwestern, we'll get to you in a minute.

Team          Wins   Losses   Winning Percentage
Wisconsin     19     11       .633
Michigan St   14     9        .609
Ohio St       12     8        .600
Total         45     28       .616

Michigan State and Ohio State didn't have winning percentages significantly higher than .500 in close games against the Big 10 as a whole, but when you look just at games against the weaker teams, they both have winning percentages around .600. In total, our top 3 teams have won 45 out of 73 close games against Penn State, Minnesota, and Iowa. So is this significantly greater than .500?

1 Proportion Test

The p-value, which is 0.03, is lower than the common significance level of 0.05, which means we can conclude the top 3 teams win more than 50% of their close games against the bottom of the Big 10. There is definitely a decent case to be made that Big 10 referees give the favorable calls to the better team at the end of close games.
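A quick sketch of an equivalent exact binomial test in Python, which gives a p-value in the same ballpark as the Minitab output:

```python
from scipy.stats import binomtest

# Did the top 3 teams win more than half of their 73 close games vs. the bottom teams?
result = binomtest(45, n=73, p=0.5, alternative="greater")
print(round(result.pvalue, 3))   # about 0.03
```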

Hey, What About Northwestern?

If there is one Big 10 team associated with losing even more than Penn State, it's Northwestern. Even though Penn State has a worse winning percentage in Big 10 games since 2002, the Nittany Lions have at least had some success. They won the NIT in 2009, made the NCAA tournament in 2011, and even reached the Sweet Sixteen in 2001. Meanwhile, Northwestern hasn't made the NCAA Tournament since..........well, ever. And yet they haven't fared poorly in close Big 10 games, winning 36 out of 78 (.462). Assuming their true chance of winning close games is .500, the probability of them winning 36 or fewer games out of 78 is 28.58%. That's not nearly uncommon enough to conclude it's significantly lower than .500. So has Northwestern ruined our referee theory?

Not at all. In fact, they're going to drive the final nail in the coffin.

Remember how Michigan State and Ohio State didn't appear to win a higher percentage of their close games until we looked at how they fared against only the worst teams? Well, Northwestern doesn't appear to lose an unusually high percentage of their close games.........until we look only at their games against the top 3 teams.

2-13

Yep, that's Northwestern's record in close games against Wisconsin, Ohio State, and Michigan State since 2002. Two and thirteen! With a record that poor, I was sure I could find a game where the refs might have played a part. It took a 30-second internet search to find not one, but two. And in the same season!

Jan 29th, 2011: #1 Ohio St 58 - Northwestern 57
Mar 11th, 2011: #1 Ohio St 67 - Northwestern 61 (OT)

Coming into both games, Ohio State was ranked #1 in the country. They also had one of the best players in the country in Jared Sullinger. The first game was played at Northwestern, while the second was played at a neutral site during the Big 10 tournament. So you don't have to worry about ref bias due to home-court advantage.

So what happened during these two games? 

          Northwestern Free Throws Attempted   Jared Sullinger Free Throws Attempted
Game 1    11                                   10
Game 2    18                                   18
Total     29                                   28

Sullinger had almost as many free throw attempts as the entire Northwestern team. Overall, Ohio State had 52 free throw attempts to Northwestern's 29. But even more astonishing is that 18 of Sullinger's attempts occurred during the final 5 minutes of the game or in overtime. A close game between Northwestern and the #1 ranked team, with one of the best players in the country? It's no longer looking like much of a surprise which way the fouls went.

When we add Northwestern's wins and losses to our previous table of the top teams versus the bottom teams, we get the following.

Team          Wins   Losses   Winning Percentage
Wisconsin     20     12       .625
Michigan St   18     9        .668
Ohio St       20     9        .690
Total         58     30       .659

When the top 3 teams in the Big 10 since 2002 have played a close game against the bottom 4 teams, they've won about 66% of the time. That's pretty B1G. So what do you think? Do refs actually favor established teams and coaches at the end of close games?

Make the call.

 

Using Regression to Evaluate Project Results, part 2


In part 1 of this post, I covered how Six Sigma students at Rose-Hulman Institute of Technology cleaned up and prepared project data for a regression analysis. Now we're ready to start our analysis. We’ll detail the steps in that process and what we can learn from our results.

What Factors Are Important?

We collected data about 11 factors we believe could be significant:

  • Whether the date of collection was a Monday or a Tuesday
  • The number of trashcans in a team's area
  • The ratio of recycle bins to trash cans
  • Number of plastic cups and bottles collected
  • Number of Java City (a coffee shop on campus) cups collected
  • Number of paper sheets collected
  • Number of newspapers collected
  • Number of glass bottles collected
  • Number of aluminum cans collected
  • Number of cardboard items collected
  • Whether the data was collected pre or post improvement

Just because we collected data about 11 factors doesn't mean that they are all important. Any good regression model should attempt to keep the number of factors down to a minimum. So how do we go about finding out which factors are important? The easiest way is to use Minitab's Best Subsets regression tool! Best Subsets evaluates and gives you important descriptive statistics about the regression models that can be formed from the different combinations of factors. The resulting output table lists the number of factors in each model, R2 and adjusted R2, and also tells us which factors are included in each model.


Results of the Best Subsets

Looking at Adjusted R2

The output from the Best Subsets analysis gave us quite a lot of potential models we could use. Which one should we choose? We used two components to narrow down the options. The first was the adjusted R2 values, since this statistic takes into account the number of variables used. We want this value to be as high as possible. When we plot the adjusted R2 values against the number of factors in each model, we see a point where adding additional factors has diminishing returns.  For this set of data, that point was at five factors.

Scatterplot of adjusted R² versus the number of variables in the model

Notice how at 5 variables and beyond the adjusted R-squared value hits a plateau? That’s our point of diminishing returns!
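If you wanted to mimic Best Subsets outside of Minitab, a brute-force sketch in Python looks like the following; the file and column names are hypothetical stand-ins for the factors listed above.

```python
# Brute-force "best subsets": fit every combination of candidate factors and
# keep the best adjusted R-squared for each model size.
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("trash_long.csv")                 # hypothetical long-format data
candidates = ["day", "trashcans", "bin_ratio", "plastic", "java_city", "paper",
              "newspaper", "glass", "aluminum", "cardboard", "improvement"]

best_by_size = {}
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        X = sm.add_constant(data[list(subset)])
        fit = sm.OLS(data["weight"], X).fit()
        if k not in best_by_size or fit.rsquared_adj > best_by_size[k][0]:
            best_by_size[k] = (fit.rsquared_adj, subset)

for k, (adj_r2, subset) in sorted(best_by_size.items()):
    print(f"{k} factors: adjusted R2 = {adj_r2:.3f}  {subset}")
```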

The Factors that Always Seem to Appear

The second component we considered was which factors consistently appeared in the top models generated. If these factors keep appearing in the top models, we reasoned, there's a good chance they’re significant.

When we look at the results from our Best Subsets analysis, we find that five factors are consistently chosen: the number of plastics, paper, newspapers, aluminum cans, and the effect of the improvement efforts.

Identifying those five factors enables us to generate our final model.

Verifying the Final Model

Great! So we went through all this and got ourselves a model. Now we are ready to make conclusions, right? Not quite. We still need to ensure that the model we’ve created adheres to the assumptions that are associated with regression analysis. If our model does not meet these assumptions, then we can't make any definitive conclusions. Luckily for us, the process doesn't change from before.

As before, first we need to check whether the mean error is zero and the data is homoscedastic.



Plots used to verify regression assumptions.

As before, the plots give us no reason to doubt that these assumptions are met. Moving on, we check whether the residuals are normally distributed.

Normality plot

Last but not least, we are continuing our assumption that the teams can count and that there is negligible variance in our predictor values.

It appears that this new model does in fact meet the regression assumptions. The final model created from this data is:

 

Final Results: What Have We Learned?

At the end of all of this, we determined our regression model, all ready to go and verified. But what does this single equation we created tell us? What can we use it for?

For starters, we now have an accurate model that we can use to predict the weight of recyclables disposed of in the trash, based solely on five factors. This is nice, as we can predict the weight of recyclables from various areas simply by looking at what items are present in the trash!

We also learned that of the 11 factors we started with, only five of them have a significant relationship with the weight of the recyclables. Plastic cups and bottles, sheets of paper, newspapers, and aluminum cans were found to be significant contributors to the total weight of recyclables disposed of in the trash. This is important to know, since it tells us what to focus on in future efforts.

The last factor found to be significant was the effect of the improvement phase of our project. More importantly, if you look at the equation for the final model, this factor has a negative coefficient associated with it. This tells us that our efforts have been successful: the improvement effect was statistically significant and in the direction of decreasing the amount of recyclables thrown away in the trash.

Now that wasn’t too bad, was it? With regression and a little help from Minitab, there was no chance our data analysis efforts would go to waste!

 

About the Guest Blogger

Peter Olejnik is a graduate student at the Rose-Hulman Institute of Technology in Terre Haute, Indiana. He holds a bachelor’s degree in mechanical engineering and his professional interests include controls engineering and data analysis.

 

Choosing Between a Nonparametric Test and a Parametric Test


It’s safe to say that most people who use statistics are more familiar with parametric analyses than nonparametric analyses. Nonparametric tests are also called distribution-free tests because they don’t assume that your data follow a specific distribution.

You may have heard that you should use nonparametric tests when your data don’t meet the assumptions of the parametric test, especially the assumption about normally distributed data. That sounds like a nice and straightforward way to choose, but there are additional considerations.

In this post, I’ll help you determine when you should use a:

  • Parametric analysis to test group means.
  • Nonparametric analysis to test group medians.

In particular, I'll focus on an important reason to use nonparametric tests that I don’t think gets mentioned often enough!

Hypothesis Tests of the Mean and Median

Nonparametric tests are like a parallel universe to parametric tests. The table shows related pairs of hypothesis tests that Minitab statistical software offers.

Parametric tests (means)                                    Nonparametric tests (medians)
1-sample t test                                             1-sample Sign, 1-sample Wilcoxon
2-sample t test                                             Mann-Whitney test
One-Way ANOVA                                               Kruskal-Wallis, Mood’s median test
Factorial DOE with one factor and one blocking variable     Friedman test

Reasons to Use Parametric Tests

Reason 1: Parametric tests can perform well with skewed and nonnormal distributions

This may be a surprise but parametric tests can perform well with continuous data that are nonnormal if you satisfy these sample size guidelines.

Parametric analyses   Sample size guidelines for nonnormal data
1-sample t test       Greater than 20
2-sample t test       Each group should be greater than 15
One-Way ANOVA         If you have 2-9 groups, each group should be greater than 15;
                      if you have 10-12 groups, each group should be greater than 20

Reason 2: Parametric tests can perform well when the spread of each group is different

While nonparametric tests don’t assume that your data follow a normal distribution, they do have other assumptions that can be hard to meet. For nonparametric tests that compare groups, a common assumption is that the data for all groups must have the same spread (dispersion). If your groups have a different spread, the nonparametric tests might not provide valid results.

On the other hand, if you use the 2-sample t test or One-Way ANOVA, you can simply go to the Options subdialog and uncheck Assume equal variances. Voilà, you’re good to go even when the groups have different spreads!

Reason 3: Statistical power

Parametric tests usually have more statistical power than nonparametric tests. Thus, you are more likely to detect a significant effect when one truly exists.

Reasons to Use Nonparametric Tests

Reason 1: Your area of study is better represented by the median

Comparing two skewed distributions

This is my favorite reason to use a nonparametric test and the one that isn’t mentioned often enough! The fact that you can perform a parametric test with nonnormal data doesn’t imply that the mean is the best measure of the central tendency for your data.

For example, the center of a skewed distribution, like income, can be better measured by the median where 50% are above the median and 50% are below. If you add a few billionaires to a sample, the mathematical mean increases greatly even though the income for the typical person doesn’t change.

When your distribution is skewed enough, the mean is strongly affected by changes far out in the distribution’s tail whereas the median continues to more closely reflect the center of the distribution. For these two distributions, a random sample of 100 from each distribution produces means that are significantly different, but medians that are not significantly different.
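A tiny numerical illustration of the billionaire effect described above; all numbers are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10.8, sigma=0.5, size=1000)   # a skewed "income" sample

with_billionaires = np.append(incomes, [2e9, 3e9, 5e9])    # add a few billionaires

print(np.mean(incomes), np.median(incomes))                # both around 50,000
print(np.mean(with_billionaires), np.median(with_billionaires))
# The mean jumps into the millions; the median barely moves.
```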

Two of my colleagues have written excellent blog posts that illustrate this point:

Reason 2: You have a very small sample size

If you don’t meet the sample size guidelines for the parametric tests and you are not confident that you have normally distributed data, you should use a nonparametric test. When you have a really small sample, you might not even be able to ascertain the distribution of your data because the distribution tests will lack sufficient power to provide meaningful results.

In this scenario, you’re in a tough spot with no valid alternative. Nonparametric tests have less power to begin with and it’s a double whammy when you add a small sample size on top of that!

Reason 3: You have ordinal data, ranked data, or outliers that you can’t remove

Typical parametric tests can only assess continuous data and the results can be significantly affected by outliers. Conversely, some nonparametric tests can handle ordinal data, ranked data, and not be seriously affected by outliers. Be sure to check the assumptions for the nonparametric test because each one has its own data requirements.

Closing Thoughts

It’s commonly thought that the need to choose between a parametric and nonparametric test occurs when your data fail to meet an assumption of the parametric test. This can be the case when you have both a small sample size and nonnormal data. However, other considerations often play a role because parametric tests can often handle nonnormal data. Conversely, nonparametric tests have strict assumptions that you can’t disregard.

The decision often depends on whether the mean or median more accurately represents the center of your data’s distribution.

  • If the mean accurately represents the center of your distribution and your sample size is large enough, consider a parametric test because parametric tests are more powerful.
  • If the median better represents the center of your distribution, consider the nonparametric test even when you have a large sample.

Finally, if you have a very small sample size, you might be stuck using a nonparametric test. Please, collect more data next time if it is at all possible! As you can see, the sample size guidelines aren’t really that large. Your chance of detecting a significant effect when one exists can be very small when you have both a small sample size and you need to use a less efficient nonparametric test!

Crossed Gage R&R: How are the Variance Components Calculated?


In technical support, we often receive questions about Gage R&R and how Minitab calculates the amount of variation that is attributable to the various sources in a measurement system.

This post will focus on how the variance components are calculated for a crossed Gage R&R using the ANOVA table, and how we can obtain the %Contribution, StdDev, Study Var and %Study Var shown in the Gage R&R output.  For this example, we will accept all of Minitab’s default values for the calculations.

The sample data used in this post is available within Minitab by navigating to File> Open Worksheet, and then clicking the Look in Minitab Sample Data folder button at the bottom of the dialog box. (If you're not already using Minitab, get the free 30-day trial.) The name of the sample data set is Gageaiag.MTW.  For this data set, 10 parts were selected that represent the expected range of the process variation. Three operators measured the 10 parts, three times per part, in a random order.

To see the Gage R&R ANOVA tables in Minitab, we use Stat> Quality Tools> Gage Study> Gage R&R Study (Crossed), and then complete the dialog box as shown below:

Minitab 17’s default alpha to remove the Part*Operator interaction is 0.05.  Since the p-value for the interaction in the first ANOVA table is 0.974 (much greater than 0.05), Minitab removes the interaction and shows a second ANOVA table with no interaction.

To calculate the Variance Components, we turn to Minitab’s Methods and Formulas section: Help> Methods and Formulas > Measurement systems analysis> Gage R&R Study (Crossed), and then choose VarComp for ANOVA method under Gage R&R table.

There are two parts to this section of Methods and formulas. The first provides the formulas used when the Operator*Part interaction is part of the model. In this example, the Operator*Part interaction was not significant and was removed. Therefore we use the formulas for the reduced model:

The variance components section of the crossed Gage R&R output is shown below so we can compare our hand calculations to Minitab’s results:

We will do the hand calculations using the reduced ANOVA table for each source of variation:

Repeatability is estimated as the Mean Square (MS column) for Repeatability in the ANOVA table, so the estimate for Repeatability is 0.03997.

We can see the formula for Operator above. The number of replicates is the number of times each operator measured each part.  We had 10 parts in this study, and each operator measured each of the 10 parts 3 times, so the denominator for the Operator calculation is 10*3. The numerator is the MS Operator – MS repeatability, so the formula for the variance component for the Operator is (1.58363-0.03997)/(10*3) = 1.54366/30 = 0.0514553.

Next, Methods and Formulas shows  how to calculate the Part-to-Part variation.  The b represents the number of operators (in this study we had 3), and n represents the number of replicates (that is also 3 since each operator measured each part 3 times). So the denominator for the Part-to-Part variation is 3*3, and the numerator is MS Part – MS Repeatability. Therefore, the Part-to-Part variation is (9.81799-0.03997)/(3*3) = 1.08645.

Reproducibility is easy since it is the same as the variance component for operator that we previously calculated; 0.0514553.

For the last two calculations, we’re just adding the variance components for the sources that we previously calculated:

Total Gage R&R = Repeatability + Reproducibility = 0.03997 + 0.0514553 = 0.0914253.

Total Variation = Total Gage R&R + Part-to-Part = 0.0914253 + 1.08645 = 1.17788.

Notice that the Total Variation is the sum of all the variance components.  The variances are additive so the total is just the sum of the other sources.

The %Contribution of VarComp column is calculated using the variance components: the VarComp for each source is divided by the Total Variation and multiplied by 100:

Source              VarComp     Calculation               %Contribution
Total Gage R&R      0.0914253   0.0914253/1.17788*100     7.76185
  Repeatability     0.03997     0.03997/1.17788*100       3.39338
  Reproducibility   0.0514553   0.0514553/1.17788*100     4.36847
    Operator        0.0514553   0.0514553/1.17788*100     4.36847
Part-to-Part        1.08645     1.08645/1.17788*100       92.2377
Total Variation     1.17788     1.17788/1.17788*100       100

Now that we’ve replicated the Variance components output, we can use these values to re-create the last table in Minitab’s Gage R&R output:

The StdDev column is simple: there we’re just taking the square root of each of the values in the VarComp column.  The Total Variation value in the StdDev column is the square root of the corresponding VarComp value (it is not the sum of the standard deviations):

Source              VarComp     Square Root of VarComp = StdDev   6 x StdDev = Study Var
Total Gage R&R      0.0914253   0.302366                          1.81420
  Repeatability     0.03997     0.199925                          1.19955
  Reproducibility   0.0514553   0.226838                          1.36103
    Operator        0.0514553   0.226838                          1.36103
Part-to-Part        1.08645     1.04233                           6.25397
Total Variation     1.17788     1.08530                           6.51181

Finally, the %Study Var column is calculated by dividing the Study Var for each source by the Study Var value in the Total Variation row.  For example, the %Study Var for Repeatability is 1.19955/6.51181*100 = 18.4211%.
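As a sanity check, here is a short Python sketch that reproduces all of the hand calculations above from the reduced-model mean squares (10 parts, 3 operators, 3 replicates per part):

```python
import math

# Mean squares from the reduced ANOVA table in this example
ms_part, ms_operator, ms_repeat = 9.81799, 1.58363, 0.03997
parts, operators, replicates = 10, 3, 3

repeatability = ms_repeat
operator = (ms_operator - ms_repeat) / (parts * replicates)
reproducibility = operator                     # only the operator term remains in this model
part_to_part = (ms_part - ms_repeat) / (operators * replicates)
total_gage = repeatability + reproducibility
total_var = total_gage + part_to_part

varcomps = {
    "Total Gage R&R":     total_gage,
    "  Repeatability":    repeatability,
    "  Reproducibility":  reproducibility,
    "    Operator":       operator,
    "Part-to-Part":       part_to_part,
    "Total Variation":    total_var,
}

print(f"{'Source':<20}{'VarComp':>10}{'%Contrib':>10}{'StdDev':>10}{'StudyVar':>10}{'%StudyVar':>11}")
for source, vc in varcomps.items():
    std_dev = math.sqrt(vc)
    study_var = 6 * std_dev                    # Minitab's default of 6 standard deviations
    pct_contrib = 100 * vc / total_var
    pct_study = 100 * study_var / (6 * math.sqrt(total_var))
    print(f"{source:<20}{vc:>10.5f}{pct_contrib:>10.3f}{std_dev:>10.5f}"
          f"{study_var:>10.5f}{pct_study:>11.3f}")
```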

I hope this post helps you understand where these numbers come from in a Gage R&R. Let’s just be glad that we have Minitab to do the calculations behind the scenes so we don’t have to do this by hand every time!

A Mommy’s Look at Scoliosis…A Study in Correlation


Juvenile Idiopathic Scoliosis. That was the diagnosis given to my then 8-year-old daughter last January. In short, it means that she’s young (under 10), she exhibits an abnormal amount of spinal curvature, and there’s no identified cause (aside from some bad luck).

Emilia’s x-rays indicated an S-shaped curve with 26 degrees at its largest curvature. To look at my healthy, active daughter, you’d never notice. However, on an x-ray, 26 degrees is quite alarming.

We learned quickly that the goal with scoliosis is to minimize further curvature, thereby preventing surgery. The typical solution: a brace. And, given her young age, it could be up to 5 years of wear.

Because Emilia was right on the edge of “bracing,” we had a decision to make: do we brace her now or wait and see? She’s our daughter and we want to do everything we can to support her. We definitely want to prevent surgery but we also want her to live an active life doing all of the things she loves: swimming, skiing, etc. How could we be sure wearing a brace will actually prevent curve progression?  Does a relationship between brace wear and non-progression even exist?  

A colleague, Meredith Griffith, found a particular study conducted at the Texas Scottish Rite Hospital for Children and reported by The Journal of Bone and Joint Surgery. A sample of 100 patients with curves between 25 and 45 degrees were each fitted with a brace containing a heat sensor for measuring the number of hours of brace wear. Once all patients reached skeletal maturity, doctors compared the number of hours of brace wear with the patient’s curve progression. Specifically, doctors were interested in a curve progression greater than or equal to 6 degrees as an indicator of brace treatment failure.

Based on this study, 82% of patients who wore the brace for more than twelve hours per day experienced successful brace treatment (<6 degrees of curve progression)! This result was more pronounced in patients who were less skeletally mature at the time of the study—indicating that earlier exposure to brace treatment offers a higher chance that the patient will experience minimal to no curve progression. It is also notable that as the hours of brace wear decrease, the rate of successful treatment decreases: those wearing a brace 7 to 12 hours per day showed a 61% treatment success rate, while those who wore the brace fewer than 7 hours per day showed only a 31% success rate. Ultimately, doctors found that a strong relationship—statistically speaking, a strong positive correlation—between hours of brace wear and non-progression exists.  So as the number of hours of wear increases, so does the probability of non-progression, and vice versa.

The saying goes:  “Correlation does not imply causation.”  So although we cannot assume that wearing a brace for 24 hours each day throughout childhood development will yield no curve progression, we can assign probabilities or likelihoods to non-progression based on the hours of wear. Understanding the likelihood of non-progression will equip parents to make valid, data-driven decisions. 

Emilia, being our level-headed, data-driven child, made the decision on her own: “Sounds like we need to brace it,” she told the doctor. I love that kid.

And so we did. Emilia wears a brace about 20 hours a day. She manages her time in it and it hasn’t slowed her down. She continues to be on the downhill race team, the swim team, and does everything else a 9-year-old does. Our adventure with scoliosis is a marathon and not a sprint, as our doctor would say. She has days where she doesn’t get a full 20 hours, but we manage and she always gets at least 12 hours of wear. She has a great attitude about it and wonderfully supportive friends.

And the results? At her 6-month checkup, her curvature measured 22 degrees. While there is measurement variation, the reading does indicate that it didn’t progress. As our spinal surgeon told us “No indication of progression and the rest, well, that’s just gravy.” That’s fancy doctor-speak for “We’re going to Disney to celebrate!” And we did.

Good Job, Emilia! You are a rock star!
