
Why Does Cpk Change When I Sort my Data?

If you need to assess process performance relative to some specification limit(s), then process capability is the tool to use. You collect some accurate data from a stable process, enter those measurements in Minitab, and then choose Stat > Quality Tools > Capability Analysis/Sixpack or Assistant > Capability Analysis.

Now, what about sorting the data? I’ve been asked “why does Cpk change when I sort my data?” many times during my years at Minitab, so if you’ve wondered the same thing, here’s your answer.


From Soap to Standard Deviations

Suppose you work for a company that manufactures bars of soap. Each bar should weigh between 3.2 and 5.2 ounces. To conduct the study, you randomly select 5 bars of soap every hour from the production line and weigh them.

You can see from the spreadsheet that at 9 a.m. on February 1, the 5 bars weighed in at 3.743, 4.447, 4.009, 4.252 and 3.973 ounces. These 5 measurements make up our first subgroup. For our second subgroup, we have the 5 bar weights corresponding to 10 a.m., and 11 a.m. data for the third subgroup, and so on.

To calculate Cpk, Minitab first computes the pooled standard deviation. Without getting into the specific mathematics of the pooled standard deviation formula, you can basically think of it as the average of all of the subgroup standard deviations. In other words, if we calculate the standard deviation for rows 1-5 (subgroup #1), then the standard deviation for rows 6-10 (subgroup #2), then the standard deviation for rows 11-15 (subgroup #3), etc., and then calculate the average of those standard deviations, we (more or less) arrive at the pooled standard deviation.

Therefore, the pooled standard deviation only accounts for the variability within subgroups—it does not include the shift and drift between them. If you want to account for all of the variability across all of the data, then you should look at the overall standard deviation, and use Ppk rather than Cpk.

The Sordid Details

Now let’s sort this data from smallest to largest and see what happens to the pooled standard deviation and Cpk. If we calculate the subgroup standard deviations for the sorted rows 1-5, then 6-10, then 11-15, etc., we’re going to arrive at much smaller values than the original subgroup standard deviations because we’ve minimized the variability within each subgroup. And the smaller the subgroup standard deviations, the smaller the pooled standard deviation, and thus the larger the Cpk statistic.

If we look at the original, unsorted soap weights and run capability analysis, we get a pooled (also known as “within”) standard deviation of 0.352 and a Cpk of 0.80. If we re-run the analysis on the sorted soap weights, we get a pooled standard deviation of 0.033 and a Cpk of 8.61. Those are two completely different sets of results: the original, accurate Cpk falls below the 1.33 rule-of-thumb, while the Cpk from the sorted data is far above it.
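If you'd like to see this mechanic outside of Minitab, here is a minimal Python sketch using NumPy and made-up soap weights (not the original worksheet). It treats the pooled standard deviation as a simple average of the subgroup standard deviations, which is the rough intuition described above; Minitab's actual pooled standard deviation uses a weighted formula and an unbiasing constant, so the numbers are illustrative only.

    import numpy as np

    rng = np.random.default_rng(1)
    LSL, USL = 3.2, 5.2                          # soap weight spec limits

    # 20 hourly subgroups of 5 bars each; the hourly mean drifts a little,
    # which is exactly the between-subgroup variation that Cpk ignores.
    subgroups = np.array([rng.normal(4.2 + drift, 0.25, size=5)
                          for drift in rng.normal(0, 0.25, size=20)])

    def capability(subs):
        flat = subs.ravel()
        s_within = subs.std(axis=1, ddof=1).mean()    # rough pooled ("within") std dev
        s_overall = flat.std(ddof=1)                  # overall std dev
        xbar = flat.mean()
        cpk = min(USL - xbar, xbar - LSL) / (3 * s_within)
        ppk = min(USL - xbar, xbar - LSL) / (3 * s_overall)
        return round(s_within, 3), round(s_overall, 3), round(cpk, 2), round(ppk, 2)

    print("unsorted (s_within, s_overall, Cpk, Ppk):", capability(subgroups))

    # Sorting the flattened data before re-forming subgroups makes each subgroup's
    # values nearly identical, so the within-subgroup spread collapses and Cpk is
    # inflated, while Ppk barely moves because the overall spread is unchanged.
    sorted_subgroups = np.sort(subgroups.ravel()).reshape(subgroups.shape)
    print("sorted   (s_within, s_overall, Cpk, Ppk):", capability(sorted_subgroups))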


The Moral of the Subgroup Story

I hope it's now clear why we should not sort our data when running capability analysis. Subgroups are intended to provide information regarding the natural variability of a process at a given point in time. By sorting the data, we are looking at an inaccurate picture of the true subgroup variability, and thereby inflating Cpk to an unrealistic value.


How to Analyze Like a Citizen Data Scientist in Flint

[Image: The Citizen's Bank Weather Ball in Flint, Michigan]
If you follow the news in the United States, then you’ve heard that there’s a water crisis in Flint, Michigan. Although debate will continue about the ethics of the data collection practices, it’s worthwhile to at least be ready to perform the correct analysis on the data when you have it. Here’s how you can use Minitab to be like a citizen data scientist in Flint, and see for yourself what the data indicate.

Let’s start with the Environmental Protection Agency’s (EPA) Lead and Copper Rule. The EPA says that a water system needs to act when “lead concentrations exceed an action level of 15 ppb” in more than 10% of samples. The statistic that separates the highest 10% of the samples from the rest is the 90th percentile.

The applicable Code of Federal Regulations (CFR) does not prescribe a random sample to characterize the entire water system. Instead, the CFR suggests that those who administer the water system should select sampling sites based on the likelihood of contamination. In particular, those who administer the system should prefer sampling sites that meet these two criteria:

(i) Contain copper pipes with lead solder installed after 1982 or contain lead pipes; and/or

(ii) Are served by a lead service line.

Clearly, we are not dealing with a random sample—that's because the goal is not to characterize the entire system, but to better understand the worst contamination risks. In this context we're characterizing only the sites that we sample, which we suspect contain the highest lead results in the system. The CFR suggests taking samples from at least 60 sites for a system the size of Flint’s.

The data we’ll work with was collected through an effort organized by an independent research team at Virginia Tech. The data contain 271 samples from 269 different locations, which exceeds the minimum recommended sample size. Because we’re looking for the 90th percentile, what we do isn’t very different from counting down 271/10 ≈ 27 data points from the maximum. The CFR references the use of “first draw” tap samples, so we’ll pay attention to that column in the Virginia Tech data.

A Quick Calculation of the 90th Percentile

Once the data’s in Minitab Statistical Software, the fastest way to calculate the 90th percentile is with Minitab’s calculator. Try this:

  1. Choose Calc > Calculator.
  2. In Store result in variable, enter 90th percentile.
  3. In Expression, enter percentile (‘PB Bottle 1 (ppb) – First Draw’, 0.9). Click OK.

Minitab stores the value 26.944. Because this value is greater than 15, you are now ready to make strongly-worded statements urging people to take measures to protect themselves from lead exposure.
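If you'd like to double-check that number outside Minitab, here's a minimal sketch in Python with pandas. The CSV file name is hypothetical (an assumed export of the Virginia Tech worksheet), and pandas' default percentile interpolation may differ slightly from Minitab's, so expect a value near, not necessarily identical to, 26.944.

    import pandas as pd

    # Hypothetical CSV export of the Flint worksheet; adjust the path and
    # column name to match your own copy of the data.
    flint = pd.read_csv("flint_first_draw.csv")
    lead = flint["PB Bottle 1 (ppb) - First Draw"].dropna()

    p90 = lead.quantile(0.90)                       # 90th percentile of first-draw lead
    print(f"90th percentile: {p90:.3f} ppb")
    print("Exceeds the 15 ppb action level:", p90 > 15)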

Communicating the 90th Percentile on a Graph

But if you’re really going to communicate your results, it’s nice to have a graph available. A simple bar chart might do:

[Image: Bar chart of the actual 90th percentile and the action limit]

However, you can show the data in more detail with a histogram.

  1. Choose Graph > Histogram.
  2. Select Simple. Click OK.
  3. In Graph variables, enter ‘PB Bottle 1 (ppb) – First Draw’.
  4. Click Scale.
  5. Select the Reference Lines tab.
  6. In Show reference lines at data values, enter 15 26.9. Click OK twice.

[Image: Histogram showing the 90th percentile exceeds the action limit of 15 parts per billion]

Histograms divide the sample values into intervals called bins. The height of each bar represents the number of observations in that bin: the taller the bar, the more observations in that interval. The reference lines on the graph show the action limit for the 90th percentile and the actual value of the 90th percentile. This graph shows that the action limit is exceeded.
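Outside Minitab, a rough equivalent of this graph takes only a few lines of matplotlib. The file and column names are the same assumptions as in the sketch above, and the bin count is arbitrary.

    import matplotlib.pyplot as plt
    import pandas as pd

    # Hypothetical CSV export of the Flint first-draw lead data
    lead = pd.read_csv("flint_first_draw.csv")["PB Bottle 1 (ppb) - First Draw"].dropna()

    plt.hist(lead, bins=40)
    plt.axvline(15, color="red", label="Action limit (15 ppb)")
    plt.axvline(lead.quantile(0.90), color="black", label="90th percentile")
    plt.xlabel("First-draw lead (ppb)")
    plt.ylabel("Number of samples")
    plt.legend()
    plt.show()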

Gather Your Data

In April of 2015, then-mayor of Flint Dayne Walling reported that he and his family “drink and use the Flint water everyday, at home, work, and schools.” It’s easy for me to believe that the mayor’s personal experience with water that was not dangerous affected his judgment about the situation. The zip code for the mayor’s office in Flint is 48502. The news bureau for WNEM TV 5, one place where Mayor Walling drank tap water on TV, is in the same zip code. The citizen data scientists who analyzed the Flint data knew that the geographically-limited sample being shown on TV and Twitter wasn't good enough. Instead, they collected data from 269 different locations around Flint and found that lead was a serious problem.

Of course, collecting that data was no small task: the data scientists estimate that gathering, preparing, and analyzing water samples ended up costing about $180,000, not including volunteer labor. If you’d like to donate towards offsetting the costs and future efforts, check out the Flint Water Study Research Support Fundraiser.

If you’d like to support residents in Flint, consider volunteering for or contributing to the United Way of Genesee County’s Flint Water Fund which “has sourced more than 11,000 filters systems and 5,000 replacement filters, ongoing sources of bottled water to the Food Bank of Eastern Michigan and also supports a dedicated driver for daily distribution.”

The attention brought to Flint has called into question the water testing done in other municipalities in the United States. If you’re concerned about the potential for lead in your own water, the EPA notes that lead testing kits are available in home improvement stores that can be sent to laboratories for analysis.

The citation for the referenced data set is: FlintWaterStudy.org (2015)“Lead Results from Tap Water Sampling in Flint, MI during the Flint Water Crisis.” This link provides the data as a Minitab worksheet: lead_results_from_tap_water_sampling_in_flint__mi_during_the_flint_water_crisis.MTW

 

The image of the Citizen's Bank Weather Ball is by the Michigan Municipal League and is licensed under this Creative Commons License.

ANOVA: Data Means and Fitted Means, Balanced and Unbalanced Designs

In this post, I’ll address some common questions we’ve received in technical support about the difference between fitted and data means, where to find each option within Minitab, and how Minitab calculates each.

First, let’s look at some definitions. It’s useful to have an example, so I’ll be using the Light Output data set from Minitab’s Data Set Library, which includes a description of the sample data here. This same data set is available within Minitab by choosing File> Open Worksheet, clicking the Look in Minitab Sample Data folder button at the bottom, and then opening the file titled LightOutput_model.MTW.

Calculating Data Means

In an ANOVA, data means are the raw response variable means for each factor/level combination.

For the LightOutput data set, we can calculate the data means for Temperature by choosing Stat> Basic Statistics> Display Descriptive Statistics, and then completing the dialog box as shown below:


Click the Statistics button and make sure only Mean is selected, then click OK in each dialog. Repeat the above steps, and replace Temperature with GlassType to calculate the data means for that second factor. The session window will display these results:


The means calculated directly from the data shown above are the values that would be plotted in a Main Effects plot. To create that plot in Minitab, use Stat> ANOVA> Main Effects Plot and complete the dialog box as shown below:


Click OK to display the graph, which will show the same mean values for each level of the two factors (I’ve added data labels to the graph below):


So, data means are the raw response variable means for each factor/level combination. Fitted means, on the other hand, are the means of the least squares predictions, calculated as if the design were balanced, that is, as if the data had the same number of observations for every combination of factor levels. The two types of means are identical for balanced designs but can differ for unbalanced designs.
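Outside Minitab, data means are just group averages. Here's a minimal pandas sketch, assuming the worksheet has been exported to CSV with columns named LightOutput, Temperature, and GlassType (the file and column names are assumptions):

    import pandas as pd

    light = pd.read_csv("light_output.csv")    # hypothetical export of LightOutput_model.MTW

    # Raw (data) means for each level of each factor
    print(light.groupby("Temperature")["LightOutput"].mean())
    print(light.groupby("GlassType")["LightOutput"].mean())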

Balanced Designs

As I mentioned above, in ANOVA a balanced design has an equal number of observations for all possible combinations of factor levels, whereas an unbalanced design has an unequal number of observations. 

If you’re not sure whether your design is balanced or not, Minitab makes it easy to find out. For the Light output data set, we can see that the design is balanced by choosing Stat> Tables> Cross Tabulation and Chi-Square, and then completing the dialog as shown below:


Because there are 3 observations for every combination of Temperature and GlassType, this design is balanced.
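The equivalent balance check in pandas is a one-line cross tabulation; for a balanced design, every cell shows the same count (3 here):

    import pandas as pd

    light = pd.read_csv("light_output.csv")    # same hypothetical export as above
    print(pd.crosstab(light["Temperature"], light["GlassType"]))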

We can fit a model to this data by choosing Stat> ANOVA> General Linear Model> Fit General Linear Model, and then completing that dialog box as shown below and clicking OK:


Now that we have a model for this data, we can obtain a main effect plot based on the least-squares model by choosing Stat> ANOVA > General Linear Model> Factorial Plots (NOTE: The Factorial Plots option will not be available until a model is fit, because these graphs are based on the model).  Click OK in the dialog box below to accept the defaults and generate the main effects plot:


Calculating Main Effects for Balanced Designs

Again, the fitted means in the main effects plot above match those in the previous data means plot because this is a balanced design. Although the answer is the same, Minitab obtained these results by finding the fitted value for every possible combination of factor levels. The following steps illustrate what Minitab is doing automatically, behind the scenes:

  1. To obtain these fitted values, after the model has already been fit to the data, type all possible combinations of factor levels into the worksheet as shown below, and then use Stat> ANOVA> General Linear Model> Predict, and enter the two columns with all possible combinations:


  2. Click OK in the dialog box above to store the results in the worksheet.

  3. Now use Stat> Basic Statistics> Store Descriptive Statistics twice; once to get the means of the fits calculated in step 2 for Temp, and a second time to get the means of the fits for Glass Type:


The results show the same means calculated in the fitted means main effects plot:

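The same behind-the-scenes procedure can be sketched outside Minitab with statsmodels: fit the two-factor model, predict at every combination of factor levels, and average those predictions by level. As before, the file and column names are assumptions, and this is an illustration of the idea rather than a reproduction of Minitab's output.

    from itertools import product

    import pandas as pd
    import statsmodels.formula.api as smf

    light = pd.read_csv("light_output.csv")    # hypothetical export of the worksheet

    model = smf.ols("LightOutput ~ C(Temperature) * C(GlassType)", data=light).fit()

    # Grid of all factor-level combinations, predicted from the fitted model
    grid = pd.DataFrame(
        list(product(sorted(light["Temperature"].unique()),
                     sorted(light["GlassType"].unique()))),
        columns=["Temperature", "GlassType"])
    grid["fit"] = model.predict(grid)

    # Averaging the predictions by level gives the fitted means
    print(grid.groupby("Temperature")["fit"].mean())
    print(grid.groupby("GlassType")["fit"].mean())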

Unbalanced Designs

Now let’s take a look at what happens in an unbalanced design, where there are an unequal number of observations per factor/level combination.

First, we’ll need to modify the worksheet to make the design unbalanced. Recall that this data set includes 3 observations per combination of factor levels. To make the design unbalanced, I’m changing the second row of data in the Temperature column. The original value there was 125, and I’ve changed that to 100:


With the data modified as shown above, we can use Stat> Tables> Cross Tabulation and Chi-Square again to see that the design is unbalanced:


Calculating Main Effects for Unbalanced Designs

Now let’s fit a model to this data using Stat> ANOVA> General Linear Model> Fit General Linear Model. This time, click the Results button and use the drop-down list next to Coefficients to select Full set of coefficients, then click OK in each dialog. Our results are different. If we generate new factorial plots using the new model, we can see that some of these fitted means are different than those in the balanced model:


We can calculate the fitted means of the main effects in the same way as we calculated them for the balanced case, or we can see the same results by looking at the full table of coefficients:


The fitted mean in the main effects plot for temperature at 100 is calculated by adding the coefficient for temperature at 100 to the constant: 957.3 + (-349.5) = 607.8 (rounded). For temperature at 125, we add 957.3 + 111.5 = 1068.8, and so forth.

If you’ve enjoyed this post and would like to learn more, check out our other blog posts related to ANOVA.

 

How to Calculate BX Life, Part 2

When I wrote How to Calculate B10 Life with Statistical Software, I promised a follow-up blog post that would describe how to compute any “BX” lifetime. In this post I’ll follow through on that promise, and in a third blog post in this series, I will explain why BX life is one of the best measures you can use in your reliability analysis.

As a refresher, B10 life refers to the time at which 10% of the population has failed—or, to put it another way, it is the 90% reliability of a population at a specific point in time. Let’s revisit our pacemaker battery example from part 1 of this blog series. Here's the data.


Recall that we found the B10 life of pacemaker batteries to be 6.36 years. Another way to interpret this value is to say that 6.36 years is the time at which 10% of the population of pacemaker batteries will fail. This information is useful in establishing a realistic warranty period for a product so that customers are covered through a product’s 90% reliability period, and so the manufacturer won’t have to incur extra cost by replacing an excess of the product during the warranty period.

But perhaps a particular product has additional reliability requirements a manufacturer wishes to monitor, such as B15 life. Or perhaps we would like to know when half of the population will fail—its B50 life. Both B10 and B50 life are industry standards for measuring the life expectancy of an automotive engine, for instance. This is where BX life calculations become even more useful—and Minitab makes it incredibly easy to compute and interpret those values. (If you don't already have Minitab and you'd like to follow along, download the free trial.)

Calculating BX Life

Navigate to Minitab's Stat > Reliability/Survival > Distribution Analysis (Right Censoring) > Parametric Distribution Analysis menu and set up the main dialog and the 'Censor' subdialog the same way we did in Part 1:

[Image: Parametric Distribution Analysis main dialog]

Press the "Censor" button and fill out the subdialog as follows: 

[Image: Censor subdialog]

When you press OK, Minitab analyzes the distribution of your data and by default displays a Table of Percentiles in the session window. We can take advantage of this table for measures such as B50 life, because the table produces output for a variety of percentiles by default, and the 50th percentile is among them.

[Image: Table of Percentiles for B50 life]

We see that 50% of the population of pacemaker batteries will fail by 9.735 years. But what if we want to compute B15 life? This percentile does not display by default in the Table of Percentiles.

Revisiting the Parametric Distribution Analysis dialog (pressing CTRL-E is a Minitab shortcut that will bring up your most recently completed dialog), we can click the ‘Estimate’ button to specify what “BX” life we want. In the section titled ‘Estimate percentiles for these additional percents,’ entering the number 15 will give us the B15 life for pacemaker batteries.

[Image: Estimate subdialog]

Click OK through the dialogs, and we see that a row of output for the 15th percentile is now included in the Table of Percentiles.

[Image: Table of Percentiles for B15 life]

It’s as simple as that!
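If you want intuition for what the Table of Percentiles is reporting, BX life is simply the Xth percentile of the fitted failure-time distribution. Here's a minimal scipy sketch that assumes, purely for illustration, a Weibull distribution with hypothetical shape and scale parameters; it does not reproduce Minitab's censored-data estimates for the pacemaker batteries.

    from scipy.stats import weibull_min

    shape, scale = 4.0, 8.5        # hypothetical Weibull parameters, not Minitab's estimates

    for x in (10, 15, 50):
        bx = weibull_min.ppf(x / 100, shape, scale=scale)   # time by which X% have failed
        print(f"B{x} life: {bx:.2f} years")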

If you’ve never used BX life as a reliability metric, and you’re wondering just how and why these can be some of the best measures of reliability, stay tuned for my final post in this series!

Imprisoned by Statistics: How Poor Data Collection and Analysis Sent an Innocent Nurse to Jail

If you want to convince someone that at least a basic understanding of statistics is an essential life skill, bring up the case of Lucia de Berk. Hers is a story that's too awful to be true—except that it is completely true.

A flawed analysis irrevocably altered de Berk's life and kept her behind bars for a full decade, and the fact that this analysis targeted and harmed just one person makes it more frightening. When tragedy befalls many people, aggregating the harmed individuals into a faceless mass helps us cope with the horror. You can't play the same trick on yourself when you consider a single innocent woman, sentenced to life in prison, thanks to an erroneous analysis.

The Case Against Lucia

It started with an infant's unexpected death at a children's hospital in The Hague. Administrators subsequently reviewed earlier deaths and near-death incidents, and identified 9 other incidents in the previous year they believed were medically suspicious. Dutch prosecutors proceeded to press charges against pediatric nurse Lucia de Berk, who had been responsible for patient care and medication at the time of all of those incidents. In 2003, de Berk was sentenced to life in prison for the murder of four patients and the attempted murder of three.

The guilty verdict, rendered despite a glaring lack of physical or even circumstantial evidence, was based (at least in part) on a prosecution calculation that only a 1-in-342-million chance existed that a nurse's shifts would coincide with so many suspicious incidents. "In the Lucia de B. case statistical evidence has been of enormous importance," a Dutch criminologist said at the time. "I do not see how one could have come to a conviction without it." The guilty verdict was upheld on appeal, and de Berk spent the next 10 years in prison.

One in 342 Million...?

If an expert states that the probability of something happening by random chance is just 1 in 342 million, and you're not a statistician, perhaps you'd be convinced those incidents did not happen by random chance. 

But if you are statistically inclined, perhaps you'd wonder how experts reached this conclusion. That's exactly what statisticians Richard Gill and Piet Groeneboom, among others, began asking. They soon realized that the prosecution's 1-in-342-million figure was very, very wrong.

Here's where the case began to fall apart—and not because the situation was complicated. In fact, the problems should have been readily apparent to anyone with a solid grounding in statistics. 

What Prosecutors Failed to Ask

The first question in any analysis should be, "Can you trust your data?" In de Berk's case, it seems nobody bothered to ask. 

Richard Gill graciously attributes this to a kind of culture clash between criminal and scientific investigation. Criminal investigation begins with the assumption a crime occurred, and proceeds to seek out evidence that identifies a suspect. A scientific approach begins by asking whether a crime was even committed.

In Lucia's case, investigators took a decidedly non-scientific approach. In gathering data from the hospitals where she worked, they omitted incidents that didn't involve Lucia from their totals (cherry-picking), and made arbitrary and inconsistent classifications of other incidents. Incredibly, events De Berk could not have been involved in were nonetheless attributed to her. Confirmation and selection bias were hard at work on the prosecution's behalf. 

Further, much of the "data" about events was based on individuals' memories, which are notoriously unreliable. In a criminal investigation where witnesses know what's being sought and may have opinions about a suspect's guilt, relying on memories of events that happened weeks and months ago seems like it would be a particularly dubious decision. Nonetheless, the prosecution's statistical experts deemed the data gathered under such circumstances trustworthy.

As Gill, one of the few heroes in this sordid and sorry mess, points out, "The statistician has to question all his clients’ assumptions and certainly not to jump to the conclusions which the client is aiming for." Clearly, that did not happen here. 

Even If the Data Had Been Reliable...

So the data used against de Berk didn't pass the smell test for several reasons. But even if the data had been collected in a defensible manner, the prosecution's statement about 1-in-342-million odds was still wrong. To arrive at that figure, the prosecution's statistical expert multiplied p-values from three separate analyses. However, in combining those p-values the expert failed to perform necessary statistical corrections, resulting in a p-value that was far, far lower than it should have been. You can read the details about these calculations in this paper.
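The paper linked above walks through the specific corrections. As a generic illustration of why a naive product of p-values overstates the evidence, here is a short sketch comparing that product with Fisher's method for combining independent p-values; the p-values are made up, and Fisher's method is one standard correction, not necessarily the one used in the de Berk case.

    import numpy as np
    from scipy.stats import chi2

    p_values = np.array([0.04, 0.10, 0.20])          # made-up p-values from three analyses

    naive_product = p_values.prod()                  # NOT a valid combined p-value

    # Fisher's method: -2 * sum(log p) follows a chi-square with 2k degrees of freedom
    stat = -2 * np.log(p_values).sum()
    fisher_p = chi2.sf(stat, df=2 * len(p_values))

    print(f"naive product:     {naive_product:.4f}")   # looks far more extreme
    print(f"Fisher combined p: {fisher_p:.4f}")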

In fact, when statisticians, including Gill, analyzed the prosecution's data using the proper formulas and corrected numbers, they found that the chance a nurse would experience the pattern of events exhibited in the data could have been as low as 1 in 25.

Justice Prevails at Last (Sort Of)

Even though de Berk had exhausted her appeals, thanks to the efforts of Gill and others, the courts finally re-evaluated her case in light of the revised analyses. The nurse, now declared innocent of all charges, was released from prison (and quietly given an undisclosed settlement by the Dutch government). But for an innocent defendant, justice remained blind to the statistical problems in this case across 10 years and multiple appeals, during which de Berk experienced a stress-induced stroke. It's well worth learning more about the role of statistics in her experience if you're interested in the impact data analysis can have on one person's life. 

At a minimum, what happened to Lucia de Berk should be more than enough evidence that a better understanding of statistics could set you free.

Literally. 

When Is It Crucial to Standardize the Variables in a Regression Model?

In statistics, there are things you need to do so you can trust your results. For example, you should check the sample size, the assumptions of the analysis, and so on. In regression analysis, I always urge people to check their residual plots.

In this blog post, I present one more thing you should do so you can trust your regression results in certain circumstances—standardize the continuous predictor variables. Before you groan about having one more thing to do, let me assure you that it’s both very easy and very important. In fact, standardizing the variables can actually reveal statistically significant findings that you might otherwise miss!

When and Why to Standardize the Variables

You should standardize the variables when your regression model contains polynomial terms or interaction terms. While these types of terms can provide extremely important information about the relationship between the response and predictor variables, they also produce excessive amounts of multicollinearity.

Multicollinearity is a problem because it can hide statistically significant terms, cause the coefficients to switch signs, and make it more difficult to specify the correct model.

Your regression model almost certainly has an excessive amount of multicollinearity if it contains polynomial or interaction terms. Fortunately, standardizing the predictors is an easy way to reduce multicollinearity and the associated problems that are caused by these higher-order terms. If you don’t standardize the variables when your model contains these types of terms, you are at risk of both missing statistically significant results and producing misleading results.

How to Standardize the Variables

[Image: Minitab's coding dialog box]
Many people are not familiar with the standardization process, but in Minitab 17 it’s as easy as choosing an option and then proceeding along normally. All you need to do is click the Coding button in the main dialog and choose an option from Standardize continuous predictors.

To reduce multicollinearity caused by higher-order terms, choose an option that includes Subtract the mean or use Specify low and high levels to code as -1 and +1.

These two methods reduce the amount of multicollinearity. In my experience, both methods produce equivalent results. However, it’s easy enough to try both methods and compare the results. The -1 to +1 coding scheme is the method that DOE models use. I tend to use Subtract the mean because it’s a more intuitive process.

One caution: the other two standardization methods won't reduce the multicollinearity.
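Outside Minitab, the Subtract the mean option corresponds to centering the continuous predictor before forming the interaction term. Here's a small simulated sketch of the effect on VIFs, using statsmodels' variance_inflation_factor; the data and the 0/1 condition variable are hypothetical.

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    n = 100
    x = rng.normal(10, 2, n)            # continuous predictor with a mean far from zero
    cond = rng.integers(0, 2, n)        # 0/1 condition indicator

    def vifs(cols):
        X = np.column_stack([np.ones(n)] + cols)        # add an intercept column
        return [round(variance_inflation_factor(X, i), 1) for i in range(1, X.shape[1])]

    print("raw predictors:     ", vifs([x, cond, x * cond]))

    xc = x - x.mean()                   # the "Subtract the mean" coding
    print("centered predictors:", vifs([xc, cond, xc * cond]))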

How to Interpret the Results When You Standardize the Variables

Conveniently, you can usually interpret the regression coefficients in the normal manner even though you have standardized the variables. Minitab uses the coded values to fit the model, but it converts the coded coefficients back into the uncoded (or natural) values—as long as you fit a hierarchical model. Consequently, this feature is easy to use and the results are easy to interpret.

I’ll walk you through an example to show you the benefits, how to identify problems, and how to determine whether they have been resolved. This example comes from a previous post where I show how to compare regression slopes. You can get the data here.

In the first model, the response variable is Output and the predictors are Input, Condition, and the interaction term, Input*Condition.

[Image: Regression results with unstandardized predictor variables]

For the results above, if you use a significance level of 0.05, Input and Input*Condition are statistically significant, but Condition is not significant. However, VIFs greater than 5 suggest problematic levels of multicollinearity, and the VIFs for Condition and the interaction term are right around 5.

I’ll refit the model with the same terms but I’ll standardize the continuous predictors using the Subtract the mean method.

[Image: Regression results with standardized predictor variables]

These results show that multicollinearity has been reduced because all of the VIFs are less than 5. Importantly, Condition is now statistically significant. Multicollinearity was obscuring the significance in the first model! The coefficients table shows the coded coefficients, but Minitab has converted them back into uncoded coefficients in the regression equation. You interpret these uncoded values in the normal manner.

This example shows the benefits of standardizing the variables when your regression model contains polynomial terms and interaction terms. You should always standardize when your model contains these types of terms. It is very easy to do and you’ll have more confidence that you’re not missing something important!

For more information, see my blog post What Are the Effects of Multicollinearity and When Can I Ignore Them? That post provides a more detailed explanation about the effects of multicollinearity and a different example of how standardizing the variables can reveal significant findings, and even a changing coefficient sign, that would have otherwise remained hidden.

If you're learning about regression, read my regression tutorial!

How to Think Outside the Boxplot

There's nothing like a boxplot, aka box-and-whisker diagram, to get a quick snapshot of the distribution of your data. With a single glance, you can readily intuit its general shape, central tendency, and variability.

[Image: Boxplot diagram]

To easily compare the distribution of data between groups, display boxplots for the groups side by side. Visually compare the central value and spread of the distribution for each group and determine whether the data for each group are symmetric about the center. If you hold your pointer over a plot, Minitab displays the quartile values and other summary statistics for each group.

[Image: Boxplot with hover summary statistics]

The "stretch" of the box and whiskers in different directions can help you assess the symmetry of your data.

[Image: Skewed boxplots]

Sweet, isn't it?  This simple and elegant graphical display is just one of the many wonderful statistical contributions of John Tukey. But, like any graph, the boxplot has both strengths and limitations. Here are a few things to consider.

Be Wary of Sample Size Effects 

Consider the boxplots shown below for two groups of data, S4 and L4.

[Image: Boxplots of the two groups, S4 and L4]

Eyeballing these plots, you couldn't be blamed for thinking that L4 has much greater variability than S4.

But guess what? Both data sets were generated by randomly sampling from a normal distribution with mean of 4 and a standard deviation of 1. That is, the data for both plots come from the same population.

Why the difference? The sample for L4 contains 100 data points. The sample for S4 contains only 4 data points. The small sample size shrinks the whiskers and gives the boxplot the illusion of decreased variability. In this way, if group sizes vary considerably, side-by-side boxplots can be easily misinterpreted.

How to See Sample Size Effects

Luckily, you can easily change the settings for a boxplot in Minitab to visually capture sample-size effects. Right-click the box and choose Edit Interquartile Range Box. Then click the Options tab and check the option to show the box widths proportional to the sample size.

[Image: Boxplot options dialog]

Do that, and the side-by-side boxplots will clearly reflect sample size differences.

[Image: Boxplots with widths proportional to sample size]

Yes that looks weird. But it should look weird! For the sake of illustration, we're comparing a sample of 4 to a sample of 100, which is a weird thing to do.

In practice, you'd be likely to see less drastic—though not necessarily less important—differences in the box widths when groups are different sizes. The following side-by-side boxplots show groups with sample sizes that range from 25 to 100 observations.

[Image: Side-by-side boxplots of groups with different sample sizes]

Thinner boxes (Group F) indicate smaller samples and "thinner" evidence. Heftier boxes (Group A) indicate larger samples and more ample evidence. The group comparisons are less misleading now because the viewer can clearly see that sample sizes for the groups differ.
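You can mimic this display outside Minitab with matplotlib's widths argument, scaling each box by its group's sample size (a sketch with simulated data; Minitab's exact width scaling may differ):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    sizes = [4, 25, 50, 100]
    groups = [rng.normal(4, 1, n) for n in sizes]       # every group drawn from the same N(4, 1)

    # Make each box's width proportional to its sample size
    widths = [0.8 * n / max(sizes) for n in sizes]
    plt.boxplot(groups, widths=widths)
    plt.xticks(range(1, len(sizes) + 1), [f"n={n}" for n in sizes])
    plt.ylabel("Value")
    plt.show()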

Small Samples Can Make Quartiles Meaningless

Another issue with using a boxplot with small samples is that the calculated quartiles can become meaningless. For example, if you have only 4 or 5 data values, it makes no sense to display an interquartile range that shows the "middle 50%" of your data, right?

Minitab display options for the boxplot can help illustrate the problem. Once again, consider the example with the groups S4 (N = 4) and L4 (N = 100), which were both sampled from a normal population with mean of 4 and standard deviation of 1.

[Image: Boxplots of the two groups, S4 and L4]

To visualize the precision of the estimate of the median (the center line of the box), select the boxplots, then choose Editor > Add > Data Display. You'll see a list of items that you can add to the plot. Select the option to display a confidence interval for the median on the plot.

[Image: Data Display option for the median confidence interval]

Here's the result:

[Image: Boxplots with confidence intervals for the median]

First look at the boxplot for L4 on the right. A small box is added to the plot inside the interquartile range box to show the 95% confidence interval for the median. For L4, the 95% confidence interval for the median is approximately (3.96, 4.35), which seems a fairly precise estimate for these data.

S4, on the left, is another story. The 95% confidence interval (3.65, 5.19) for the median is so wide that it completely obscures the whiskers on the plot. The boxplot looks like some kind of clunky, decapitated Transformer. That's what happens when the confidence interval for the median is larger than the interquartile range of the data. If your plot looks like that when you display the confidence interval for the median, your sample is probably too small to obtain meaningful quartile estimates.

Case in Point: Boxplots and Politics

Like Ginger Rogers, I'm kind of writing this post backwards—although not in high heels. What got me thinking about these issues with the boxplot was a comment from a reader who suggested that my choice of a time series plot to represent the U.S. deficit data was politically biased. Here's the time series plot:

[Image: Time series plot of the deficit data]

Even though I deliberately refrained from interpreting this graph from a political standpoint (given the toxic political climate on the Internet, I didn't want to go there!), the reader felt that by choosing a time series plot for these data, I was attempting to cast Democratic administrations in a more favorable light. The reader asked me to instead consider side-by-side boxplots of the same data:

[Image: Boxplots of the deficit data]

I appreciated the reader's suggestion in a general sense. After all, it's always a sound strategy to examine your data using a variety of graphical analyses.

But not every graph is appropriate for every set of data. And for these data, I'd argue that boxplots are not the best choice, regardless of whether you're a member of the Democratic, Republican, Objectivist, or Rent Is Too Damn High party.

For one thing, the sample sizes for each boxplot are much too small (between 4 and 8 data points, mostly), raising the issues previously discussed. But something else is amiss...

Context is everything...especially in statistics

In most cases, such as with typical process data, longer boxes and whiskers indicate greater variability, which is usually a "bad" thing. So when you eyeball the boxplots of %GDP deficits quickly, your eye is drawn to the longer boxes, such as the plot for the Truman administration. The implication is that the deficits were "bad" for those administrations.

But is variability a bad thing with a deficit? If a president inherits a huge deficit and quickly turns it into a huge surplus, that creates a great amount of variability—but it's good variability.

You could argue that the relative location of the center line (median) of the side-by-side plots provides a useful means of comparing "average" deficits for each administration. But really, with so few data values, the median value of each administration is just as easy to see in the time series plot. And the time series plot offers additional insight into overall  trends and individual values for each year.

Look what happens when you graph the same data values, but in a different time order, using time series plots and boxplots.

[Image: Time series plots of the reordered data]

[Image: Boxplots of the reordered data]

Using a boxplot for this trend data is like putting on a blindfold. You want to choose a graphical display that illuminates information about the data, not one that obscures it.

In conclusion, a monkey wrench is a wonderful tool. Unless you try to use it as a can opener. Graphs are kind of like that, too.

A Unique Application of Continuous Improvement: Using Lean Six Sigma Tools to Drive Sales

While the roots of Lean Six Sigma and other quality improvement methodologies are in manufacturing, it’s interesting to see how other organizational functions and industries apply LSS tools successfully. Quality improvement certainly has moved far beyond the walls of manufacturing plants!

For example, I recently had the opportunity to talk to Drew Mohler, a Lean Six Sigma black belt and senior organizational development consultant at Buckman, about their unique quality improvement initiative involving the sales function of the organization.

Buckman, a global leader in the chemical industry, does use LSS and statistics to complete internal improvement projects as many organizations do, but they also train their technical sales teams to use LSS and statistical tools to provide more value to customers and thus drive sales.

A ‘Lean’ Deployment of LSS

In many organizations that deploy LSS, statistical tools are taught in either a green or black belt course, as part of a broader tool set that’s framed by the DMAIC methodology, an approach that divides projects into five phases—define, measure, analyze, improve and control.

In these deployments, the use of statistical tools is typically presented in the context of working through an improvement project. Buckman chose a different approach, recognizing that the statistical tools of LSS are useful in any role that analyzes data. If these tools were taught independently of the DMAIC model, they could be taught to a broader audience than just green or black belts.

“We took a Lean approach to the deployment of Lean Six Sigma, realizing that many of our associates who should be using statistical tools would not benefit from the full curriculum of a LSS belt course,” says Mohler. “As we looked at this expanded view, we realized that a key group who should be taught these tools were our sales associates.”

At Buckman, technical field sales associates who have backgrounds in chemistry, biology, or engineering work directly with the company’s customers to help them assess their processes and look for improvement opportunities. Analyzing customer data is a key part of Buckman’s selling process.

“Using Buckman’s chemistry solutions, our technical sales teams work to make our customers’ systems better,” Mohler says. “In essence, they function as process engineers for our customers.”

Mohler and his colleagues developed two separate Lean Six Sigma courses. The first, a yellow belt course, focused on the traditional DMAIC process and the “soft” tools of quality improvement. This is the course Buckman teaches associates who will be leading simple improvement projects. The second course is a data analysis and statistical tools course targeting the organization’s sales associates.

Instead of using the DMAIC framework as the backbone of the statistics training, Buckman took the selling process and linked the appropriate statistical tools to each step. This framework broke the sales flow into more manageable pieces and looked at data analysis activities that are used to:

  • Gain knowledge of customer processes
  • Plan, run, evaluate and sell new chemical programs
  • Manage ongoing chemical programs
  • Solve problems within the account

“The end goal is to have our sales associates comfortable using practical statistical tools, so they’ll be able to make better recommendations—data-driven recommendations—to help our customers,” notes Mohler. “We believe that focusing our efforts on improving customer satisfaction will make Buckman more profitable and sustainable.”

In the statistics class, associates are trained on the tools they need for their job activities. Concepts such as control charting, hypothesis testing, capability analysis, and correlation are taught with practical examples that use the data analysis tools in Minitab.

Among the many statistical tools that are covered in the training, associates are taught to visualize their data using various graphs and charts. For example, suppose an associate runs a trial to improve the brightness of paper being produced for one of Buckman’s customers, and collects data both before the trial and during the trial to assess whether their product has made a difference. The staged control chart provides a powerful tool to show the impact that the company’s chemistry had on the process.

[Image: Staged I chart, https://www.minitab.com/uploadedImages/Content/Case_Studies/I_Chart.png]

Success

Since launching the statistical tools training more than 3 years ago, Buckman has trained more than 500 field sales associates worldwide. And the results are being noticed by other parts of the company.

“Our Research and Development teams are also interested in learning how to leverage Minitab for the data analysis work they routinely do. We are currently developing classes for them which will focus on analysis tools useful for product development and laboratory testing,” Mohler says. “Our corporate mentality has morphed into an approach that empowers everyone within our organization to make data-driven decisions.”

With their field statistics training, sales associates at Buckman are improving their conversion rates—but more important, they are delivering more value for customers.

“Since implementing the field statistics training, we have examples of where we’ve sold new business or protected existing business because of our improved data analysis skills,” notes Mohler. “But what we’ve really seen is increased confidence in our sales associates.

“Because they’ve done the statistical work behind the scenes, they are better prepared to explain the benefits resulting from our products and our customers see us as much more knowledgeable about their systems. This alone makes the training program a success.”

To learn more about how Buckman uses Minitab, check out the full case study: A Positive Reaction: Buckman Mixes Lean Six Sigma and Minitab to Drive Sales

Do you have a unique application of quality improvement tools to share?  Tell us in the comments, or share your story with us in a guest post on the Minitab Blog: http://blog.minitab.com/blog/landing-pages/share-your-story-about-minitab/n


Submitting an A+ Presentation Abstract, Even About Statistics

For the majority of my career with Minitab, I've had the opportunity to speak at conferences and other events somewhat regularly. I thought some of my talks were pretty good, and some were not so good (based on ratings, my audiences didn't always agree with either—but that's a topic for another post). But I would guess that well over 90% of the time, my proposals were accepted to be presented at the conference, so even though I may not have always delivered a home run on stage, I at least submitted an abstract that was appealing to the organizers.

As chair of the Lean and Six Sigma World Conference this year, I reviewed every abstract submitted and was able to experience things from the other side of the process. Now, with the submission period upon us for the Minitab Insights Conference, I thought I'd share some insights on submitting an A+ speaking submission.

Tell A Story

People are emotional beings, and a mere list of the technical content you plan to present doesn't engage the reviewers any more than it will an audience. Connecting the topic to some story sparks an emotional interest and desire to know more. Several years ago, I presented on the multinomial test at a conference, a topic that probably would have elicited yawns if I'd pitched it as the technical details of how to perform this hypothesis test. Instead I submitted an abstract asking if Virgos were worse drivers, as stated by a well-known auto insurer, and explaining that by answering the question we can also learn how to determine if defect rates were different among multiple suppliers or error rates were different for various claims filers. Want to know if they are, I asked. Accept my talk!  They did. 

Nail the Title

This can be the most difficult step, but it helps to remember that organizers use the program to promote the conference and draw attendees. A catchy title that elicits interest from prospective attendees can go a long way. So, what makes for a good title? I like to reference the story I will tell and not directly state the topic. For the talk I describe above, the title was "Are Virgos Unsafe Drivers?" Note that from the title, someone considering attending has no idea yet that the talk will be about a statistical test. But they are curious and will read the description. More important, the talk seems interesting and the speaker seems engaging, and those are the criteria attendees use to decide what talks to attend. An alternate title that is more descriptive but not catchy, "The Proper Application of the Multinomial Test of Proportions," sounds like a good place to take a nap.

Reference Prior Experience

If the submission process allows it (the Minitab Insights Conference does), reference prior speaking engagements and, even better, provide links to any recordings that may exist of you speaking. Even if it is not a formal presentation, anything that enables the organizers to get a feel for your personality when speaking is a huge plus. It is somewhat straightforward to assess whether a submitted talk would be of interest to attendees, but assessing whether speakers are engaging is difficult or impossible, even though ultimately it will make a huge impact on what attendees think of the conference. Even better, you don't actually have to be an excellent presenter—the organizer's fear is that you might be a terrible speaker! Simply demonstrating that you can present clearly and engage an audience goes a long way.

Don't Make Mistakes

It is best to assume that whoever is evaluating you is a complete stranger. Imagine you ask for something from a stranger and what they send you is incomplete or contains grammatical errors or typos: what is your impression of that person? If they are submitting to speak, my suspicion is that they will likely have unprofessional slides and possibly even be unprofessional when they speak. Further, the fact that they would not take the time to review and correct the submission tells me that they are not serious about participating in the event.

Write the Presentation First

Based on experience, I believe this is not done often—but that is a mistake. True, no one wants to put hours into a presentation only to have it get rejected, but that presentation could still be used elsewhere, so the time is not necessarily wasted. Inevitably, when you prepare a presentation new insights and ways of presenting the information come to light that greatly improve what will be presented and the story that will be told. So to tell the best story in the submission, it is immensely valuable to have already made the presentation slides! In fact, if I sorted every presentation I ever gave into buckets labeled "good" and "not so good," they would correspond almost perfectly to whether I had already made the presentation when I submitted the abstract.

Ask a Friend

Finally, approach someone you trust (and who is knowledgeable in the relevant subject area) to give you an honest opinion. Ask them what they think. Is the topic of interest to the expected attendees? Is it too simple? Too complicated? Will the example(s) resonate? After all, you don't want the earliest feedback you receive on your proposal to be from the person(s) deciding whether to accept the talk.

So that's my advice. It may seem like a big effort simply to submit an abstract, but everything here goes to good use as you prepare to actually give the presentation. It's better to put in more work at the start and get to put that work to good use later, than to put in a little work that goes to waste. Do these things and you'll be in a great position to be accepted and deliver a fantastic presentation!

Do We All Really Want Violence on TV? A Study Using Game of Thrones Data, Part 1

If you’ve not heard of the TV series Game of Thrones, you must have been living on Mars for the past few years! An adaptation of the fantasy series A Song of Ice and Fire by George R. R. Martin, the show is an epic tale of the political conflicts and wars between noble houses in the fictional continents of Westeros and Essos over who sits in the Iron Throne, and thus rules the whole realm. (It was the most pirated TV show for the third consecutive year in 2014, according to the TorrentFreak web site.)


Not only is the show extremely popular, it’s also extremely violent.

Whenever a new season starts, the first question fans usually ask is, “Which character are they going to kill off this season?” In fact, I suspect the number of deaths in each episode—and their impact on the series story line—is a key reason why so many of us (me included) are hooked on the show. I decided to investigate this further using Minitab 17.

From the genius.com web site (http://genius.com/Game-of-thrones-list-of-game-of-thrones-deaths-annotated), I obtained data on the number of key deaths in each episode of Game of Thrones up to now. And I managed to get the viewer numbers for each episode from Wikipedia. These viewing statistics are based on each episode’s initial airing on HBO in the U.S.

Here is a snapshot of the data in Minitab.


First, I am going to prepare the data for plotting by unstacking it, so that I end up with one column of data per season, as shown below. If you’re following along in Minitab, go to Data > Unstack Columns.


I then plot the data using the Time Series plot functionality (Graph > Time Series Plot…).

[Image: Time series plot of viewers for Game of Thrones]
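If you keep the data in a single table with Season, Episode, and Viewers columns, the same unstack-and-plot step looks like this in pandas (the file and column names are assumptions about how the worksheet is laid out):

    import pandas as pd
    import matplotlib.pyplot as plt

    got = pd.read_csv("game_of_thrones_viewers.csv")    # hypothetical export of the worksheet

    # One column per season (the "unstack"), indexed by episode number
    by_season = got.pivot(index="Episode", columns="Season", values="Viewers")

    by_season.plot(marker="o")                          # one line per season
    plt.xlabel("Episode")
    plt.ylabel("Viewers (millions)")
    plt.show()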

You can see that the show has definitely increased in popularity each season.

For seasons 1 and 2, viewer numbers were fairly steady for the first few episodes, with a slight dip at episode 9 before a larger audience for the finale. However, in seasons 3 and 4 the viewing numbers were higher for the earlier episodes than for those at the end of the season. This could be due to the way the storylines went in these seasons, with key events spread out across many episodes rather than clustered in the penultimate or final episodes.

For season 5, which many viewers believe diverged significantly from the books, we saw peak viewing numbers at the beginning and the end of the season. The dip in viewers after episode 6 could be due to a controversial scene in that episode involving the character Sansa Stark. In episode 7, we saw the long-anticipated first meeting of two key characters, Tyrion Lannister and Daenerys Targaryen. However, this did not have any positive impact on the viewing. Perhaps fans needed time to recover from trauma!

In my next post, I’ll show you how to use this data to create a model that helps predict viewing numbers. 

Do We All Really Want Violence on TV? A Study using Game of Thrones Data, Part 2

In my last post, I looked at viewership data for the five seasons of HBO’s hit series Game of Thrones. I created a time series plot in Minitab that showed how viewership rose season by season, and how it varied episode by episode within each season.
 
My next step is to fit a statistical model to the data, which I hope will allow me to predict the viewing numbers for future episodes. 
 
I am going to use the General Linear Model to analyse the data. This is because our variables include the season (1 to 5) and episode number (1 to 10), which are fixed and can be considered as categorical. In addition, we’ll consider the number of important characters who die in each episode as a covariate.
 
Under Stat > ANOVA > General Linear Model > Fit General Linear Model…, I fill in the dialog box as shown below. 
 
Then I click the “Model” button and tell Minitab to analyse the main effects of these variables and their interactions:
 
Next, I click “OK” to return to the first dialog box, and then click the “Stepwise” button to bring up this dialog box: 
 
 
By selecting the stepwise method, I’m telling Minitab to find the most suitable model for the given data without my having to try various combinations of the predictors. Press OK in this dialog and the first one, and Minitab returns the results.
 
 
That seems like a lot of information to sort through, so we’ll break it down into its important components.
First, let's look at the Analysis of Variance table, which shows the effect of each term in the model on the response. Using the p-value, we can determine whether an effect is significant; the general guideline is a significance level of 0.05. If the p-value is smaller than 0.05, the term's effect on the response is significant. The number of major deaths (p-value of 0.014) and season (p-value reported as 0.000) are both significant!
 
 
Next, looking at the model summary section, the R-square values are quite high, more than 90%. The R-square indicates the proportion or percentage of variation in the response that can be explained by the predictors. The higher the value, the better the model. 
 
[Screenshot: Model Summary]
 
Well, it turns out I have quite a good model here!
 
Now moving on to the most important part of the output, we have the regression equation, which is an algebraic representation of the regression line and describes the relationship between the response and predictor variables.  Since the predictor “season” is categorical, I have one equation for each season: 
 
[Screenshot: regression equations for each season]
 
Looking at the equations, you will notice that all 5 equations have a positive coefficient for the predictor “number of major deaths,” along with a positive constant. In other words, based on these equations, we can infer that as the number of major deaths increases, the viewing numbers will also increase. It would seem the show’s audience does have some appetite for violence. However, there are still exceptions.
 
In episodes 8 and 9 of season 1, there were 7 and 2 deaths, respectively (including Ned Stark’s execution in episode 9). The corresponding viewing numbers for those episodes were 2.72 and 2.66 million. However, the first season finale, with only 3 deaths, had a higher viewing number. This could be because of the storyline—this is the episode where Daenerys Targaryen became the “Mother of Dragons,” a key event in the books.
 
Since we now have a model for the data, let’s use a new feature in Minitab 17. Go to Stat > ANOVA > General Linear Model > Predict… to get some fitted values for the data. We can then compare these with the observed values in our original data set. Fill in the dialog box as shown below. 
 
[Screenshot: Predict dialog box]
 
Below are the screenshots from some of the results. The fitted values are stored in the worksheet. 
 
[Screenshots: fitted values stored in the worksheet]
 
Apart from the fitted value for each row of data, you will also see the standard error, 95% confidence interval and 95% prediction interval for each fitted value. The confidence interval is the range in which the estimated mean response for a given set of predictor values is expected to fall. The prediction interval is the range in which the predicted response for a single observation with a given set of predictor values is expected to fall. Now let’s make some comparisons. 
 
Season 3, episode 9 features the pivotal event “The Red Wedding.” With 8 key deaths, the highest number of casualties up to that point in the series, this episode drew 5.22 million viewers. For this episode, our model delivered the following fitted-value statistics.
 
[Screenshot: fitted value with 95% CI and 95% PI for season 3, episode 9]
 
The model slightly overestimates the viewing numbers (5.37 vs. 5.22). However, the observed value falls within both the 95% CI and the 95% PI, which indicates a reasonably good model.
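If you are following along with the statsmodels sketch above rather than Minitab, a comparable fitted value with its 95% confidence and prediction intervals can be obtained like this (again, the column names are assumptions):

import pandas as pd

# 'model' is the fitted OLS object from the earlier sketch.
new_data = pd.DataFrame({"Season": [3], "Episode": [9], "Deaths": [8]})
pred = model.get_prediction(new_data)
# mean_ci_lower/upper approximate the 95% CI; obs_ci_lower/upper the 95% PI.
print(pred.summary_frame(alpha=0.05))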
Another episode in the series with a high number of casualties is season 4, episode 9, which saw 10 deaths and captured 6.95 million viewers. For this episode, the model offers the following fitted-value statistics:
 
[Screenshot: fitted value with 95% CI and 95% PI for season 4, episode 9]
 
The model again slightly overestimates the viewing numbers (7.30 vs. 6.95). However, the observed value falls within both the 95% CI and the 95% PI, which again indicates a reasonably good model.
If we return to the model output, we can see that Minitab's diagnostics flagged some data points as unusual:
 
[Screenshot: table of unusual observations]
 
It appears that the unusual observations all come from the most recent season. The figures are summarized below:
 
Obs   Observed   Fit     95% CI               95% PI               No. of Deaths
41    8          6.807   (6.5295, 7.08479)    (5.90426, 7.71002)   2
47    5.4        6.807   (6.5295, 7.08479)    (5.90426, 7.71002)   2
50    8.11       7.171   (6.81615, 7.52672)   (6.24174, 8.10113)   7

Overall, the model provides a reasonable range of viewing numbers for many of the episodes in the series apart from episodes 1, 7, and 10 of season 5, where we see big discrepancies between fitted and observed data. The large viewing numbers for episode 1 require no explanation—there is a lot of anticipation and excitement among viewers for the opening episode of every season. The drop in viewing numbers for episode 7, as I noted previously, was likely due to some controversial scenes in the previous episode. And this season’s finale was actually a record-breaking episode, achieving the series’ highest viewing numbers so far. 

While this model is not perfect, it does, to some degree, suggest that we Game of Thrones viewers do have some appetite for violence on TV. "Winter is coming," and we are all still speculating about the fate of Jon Snow (a key character who appears to be killed at the end of season 5). 
 
However, one thing I know for sure is that to get record-breaking viewing numbers, all they need to do is have a certain white-haired lady sitting on the Iron Throne! We’ll have to wait for season 6 to see if that transpires. In the meantime, I hope you’ve enjoyed this analysis of the viewership data for the first five seasons!

Understanding Interactions with NBA 3-Point Shooting

What is an interaction? It’s when the effect of one factor depends on the level of another factor. Interactions are important when you’re performing ANOVA, DOE, or a regression analysis. Without them, your model may be missing an important term that helps explain variability in the response!

For example, let’s consider 3-point shooting in the NBA. We previously saw that the number of 3-point attempts per game has been steadily increasing in the NBA. And there is no better example of this than the Golden State Warriors, who shoot 35% of their shots from behind the arc (2nd in the NBA). Seeing as how the Warriors currently lead the NBA in points per 100 possessions (a better indicator of offense than points per game since it accounts for pace), could it be that shooting more 3-pointers increases the number of points you score? For every NBA team since 1981, I collected their season totals for points per 100 possessions (ORtg) and the percentage of field goal attempts from 3-point range (3PAr). For example, if your 3PAr is 0.30, then 30% of your field goal attempts are 3-pointers (and the other 70% are from 2). Here is a fitted line plot of the two variables.

[Figure: Fitted Line Plot]

[Figure: ANOVA Table]

At first glance, it doesn’t look like shooting a lot of your shots from 3 has any effect on a team’s offensive rating. However, we’re missing an important variable. The Golden State Warriors don’t score a lot of points just because they shoot a lot of 3-pointers. They score a lot because they shoot a lot of 3-pointers and they make a lot of them.

So now let’s include each team’s percentage of successful 3-pointers (3P%) in the model.

[Figure: ANOVA Table]

Both of our terms are now significant, but the R-squared value is only 4.53%. That means our model explains only 4.53% of the variation in a team’s offensive rating. This is because we’re still leaving out an important term: the interaction! If your percentage of successful 3-pointers is low and you shoot a lot of 3-pointers, your offensive rating is going to be lower than if your percentage of successful 3-pointers is high and you shoot a lot of 3-pointers.

Let’s see what happens when we include the interaction term:

[Figure: ANOVA Table]

The interaction term is significant in the model, and our R-squared value has now increased to 20.27%!
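As a rough illustration of the same idea outside Minitab, here is a sketch in Python with statsmodels; the data file and column names (ORtg, ThreePAr, ThreePct) are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

nba = pd.read_csv("nba_team_seasons.csv")  # hypothetical file of team-season totals

main_effects = smf.ols("ORtg ~ ThreePAr + ThreePct", data=nba).fit()
interaction = smf.ols("ORtg ~ ThreePAr * ThreePct", data=nba).fit()  # '*' adds the ThreePAr:ThreePct term

# The jump in R-squared reflects the variation explained by the interaction.
print(main_effects.rsquared, interaction.rsquared)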

When an interaction term is significant in the model, you should ignore the main effects of the variables and focus on the effect of the interaction. Minitab provides several tools to better help you understand this effect. The easiest to use is the line plot.

[Figure: Interaction Plot]

In this plot, the red line represents the highest value for percentage of successful 3-pointers (3P%) in the data, and the blue line represents the lowest. When you shoot significantly more 2-pointers than 3-pointers (the left side of the 3PAr axis) the offensive rating is similar for both the high and low settings of 3P%. But as you shoot fewer 2-pointers and more 3-pointers, offensive rating goes up for the high-success setting of 3-point shooting percentage, and drastically drops for the low-success setting.

Because 3P% is a continuous variable, we should be interested in the effects of the interaction for more than just the high and low settings. This can be accomplished using a contour plot.

[Figure: Contour Plot]

Now we can see the full range of values for both 3P% and 3PAr. The colors represent different ranges of offensive rating: dark green represents higher offensive ratings, while light green and blue represent lower ones.

We see that if your percentage of successful 3-pointers (3P%) is between approximately 33% and 38%, your 3PAr doesn’t have a large effect on your offensive rating. A 3P% above 38% means you should shoot more 3-pointers, whereas a 3P% below 33% means you should shoot fewer 3-pointers.

Now that we understand how the interaction works, let’s use our results to look at some NBA teams. So far in this NBA season, only five teams fall outside the 3P% range of 33% to 38%. Two teams make more than 38% of their 3-pointers (Warriors and Spurs), and three teams make less than 33% (Heat, Timberwolves, and Lakers). So do the Warriors and Spurs correctly shoot a high percentage of their field goals from 3, and do the Heat, Timberwolves, and Lakers correctly shoot a high percentage of their shots from 2?

[Figure: Contour Plot]

The Warriors are good at shooting 3s, and they know it. They have the highest 3-point percentage in the NBA, and shoot the second-highest percentage of their field goals from 3 (the Rockets, who shoot the highest percentage of their field goals from 3, are not shown on the plot). On the other side, the Timberwolves are bad at shooting 3s, and they know it. They have the second-worst 3-point percentage and shoot the lowest percentage of their field goals from 3. The Heat also shoot poorly from 3, but they don’t take a lot as they rank 24th in the NBA in percentage of field goal attempts from 3-point range.

The interesting teams are the Spurs and the Lakers. The Spurs are second in the league, making 39.3% of their 3-pointers. However, only 22.4% of their field goals are 3-pointers, which ranks 26th in the league. They could benefit from shooting an even higher percentage of their shots from 3. And then there’s the Lakers. Despite ranking dead last in 3-point percentage, they shoot 29% of their field goals from 3, good for 14th in the league. By this analysis, the Lakers are taking too many 3-pointers.

Now, this model purposely leaves out other predictors that could affect offensive rating (like 2-point shooting percentage). So don’t assume that 3-point shooting is all that goes into offensive rating. But it does give us a simple example of how interactions work and how you can use them to look at a real-life process. Interactions can be an important part of any data model, so don’t neglect them!

What Is Complete Separation in Binary Logistic Regression?


When running a binary logistic regression and many other analyses in Minitab, we estimate parameters for a specified model based on the sample data that has been collected. Most of the time, we use what is called Maximum Likelihood Estimation. However, based on specifics within your data, sometimes these estimation methods fail. What happens then?

Specifically, during binary logistic regression, an error comes up often enough that I want to explain what exactly it means, and offer some potential remedies for it. When you attempt to run your model, you may see the following error:

[Screenshot: the error message]

What's going on here? First, let's see what causes this error. Take a look at the following data set consisting of one response variable, Y, and one predictor variable, X.

X   1   2   3   4   4   5   5   6
Y   0   0   0   0   0   1   1   1

Note the key pattern. This data set can be simply described as follows:

If X <= 4, then Y=0 without fail. Similarly, if X >4, then Y=1, again without fail. This is what is known as "separation." 

This "perfect prediction" of the response is what causes the estimates, and thus your model, to fail. 

Often, separation occurs when the data set is too small to observe events with low probabilities. In the example above, it may be possible to observe a Y value of 1 with an X of 4 or less; however, with a small sample and low probabilities, we simply didn't observe any such instances in our data collection. The more predictors are in the model, the more likely separation is to occur, because the individual groups in the data have smaller sample sizes.

Essentially, separation occurs when there is a category or range of a predictor with only one value of the response. We need diversity, or variation among the response to estimate the model. 
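Here is a minimal sketch of that separated data set in Python with statsmodels. Depending on your version, the fit may stop with a separation error, or it may warn and return enormous, unstable coefficients; either way the maximum likelihood estimates are not usable.

import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 4, 5, 5, 6])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1])  # Y = 0 whenever X <= 4, Y = 1 whenever X > 4

X = sm.add_constant(x)
try:
    result = sm.Logit(y, X).fit()
    print(result.params)  # if the fit runs at all, the slope estimate blows up
except Exception as err:
    print("Estimation failed:", err)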

So when separation happens, what can we do to proceed? With the data as is, there's no way to estimate those parameters; however, there are some things we can do to work around this issue.

1. Obtain more data. Collecting more data increases the probability that you will observe different values of the response, which can eliminate the separation. If possible, this is a good first step. 

2. Consider an alternative model. The more terms are in the model, the more likely that separation occurs for at least one variable. When you select terms for the model, you can check whether the exclusion of a term allows the maximum likelihood estimates to converge. If a useful model exists that does not use the term, you can continue the analysis with the new model.

3. Regroup the predictor's categories. Depending on the predictor variable in question, you may be able to combine your groupings into one that has events occurring. For example, you may have a predictor in your model with groups for both "Oranges" and "Apples." With such specific groups, it is possible to see separation. However, that separation may disappear if you can combine those two levels into one broader grouping, such as "Fruit."

Seeing an error message like this can be frustrating, but it doesn't have to be the end of the line if you know some ways to work around it. Keep in mind these steps when analyzing a model, and you can overcome data issues such as this in the future.

Mind the Gap

[Photo: Mind the gap]
Mind the gap. It's an important concept to bear in mind whilst traveling on the Tube in London, the T in Boston, the Metro in Washington, D.C., etc. But how many of us remember to mind the gap when we create an interval plot in Minitab Statistical Software? Not too many of us, I'd wager. And it's a shame, too.

When you travel on the subway, minding the gap means giving thoughtful consideration to the space between the platform and the train. On the subway, minding the gap can make the difference between these two very different views of the subway station:

[Photo: Bad view of subway]
[Photo: Nice view of subway]

When you make an interval plot in Minitab, minding the gap means giving thoughtful consideration to the space between groups on the x-axis. For interval plots, minding the gap can make the difference between these two very different views of your data:

[Figure: Plain view of data]
[Figure: Awesome view of data]

Allow me to demonstrate with an example. If you like, you can download the data file, PercentMoisture.MTW from our data set library and follow along. (You can get the free 30-day trial of Minitab here if you don't already have the software.) Technicians at a food company collected these data to try to figure out the best combination of time and temperature to bake cereal grains to minimize their moisture content. 

Interval plots are useful because they summarize your data and allow you to simultaneously compare the means (represented by the points or symbols) and the variability (represented by the interval bars) for each sample or group. (To see more interval plots in action, check out these other blog posts: Seven Alternatives to Pie Charts and When Even Cupid Isn't Accurate Enough.) 

Creating a basic interval plot in Minitab is simple. Just select Graph > Interval Plot. Then choose the One Y, With Groups option, enter the data as follows, and click OK. (For the sake of space in this article, I renamed the columns "Time" and "Temp".)

[Screenshot: Creating the interval plot]
[Figure: Basic interval plot]

The nice thing about interval plots is that multiple levels of multiple factors can be represented by different positions on the x-axis. But the unfortunate thing about interval plots is that multiple levels of multiple factors are represented by different positions on the x-axis.

All the information is there, but it's hard to see how one group relates to the next. For example, to compare the results for the 130-degree oven temperature across the different oven times, you need to compare the 2nd interval bar to the 5th interval bar and the 8th interval bar. You end up going from one similar-looking bar to another and another, and that seldom ends well. 

To make the different oven temperatures stand out more, you can add a little color. Just double-click one of the symbols to open the Edit Mean Symbols dialog box. Click the Groups tab, enter the temperature variable, and click OK.

[Screenshot: Grouping the symbols]
[Figure: Interval plot with grouped symbols]

To help make the grouping even clearer, you can connect the dots. Right-click the graph and choose Add > Data Display, then select Mean connect line and click OK.

[Screenshot: Adding mean connect lines]
[Figure: Interval plot with mean connect lines]

Now it's much easier to identify and compare the results for the different oven temperatures. But here is where we really start to mind that gap. By which I mean that we start to give thoughtful consideration to the space between the oven-time groups on the x-axis. And by which I also mean that we mind these gaps because they are annoying and we want them to go away. But we need not worry, because that's one gap we can shrink easily.

Double-click the x-axis to open the Edit Scale dialog box. Notice the Gap within clusters setting. A setting of –1 means that the intervals for all levels of oven temperature at each level of oven time will be at the same location on the x-axis. Change the setting to –1 and the gap is closed. 

And while we're at it, let's make the tick labels for temperature go away as well, because they are redundant with the legend. And because if we don't, those labels would appear on top of each other, which looks pretty weird. 

[Screenshot: Removing the gap]
[Screenshot: Removing labels]
[Figure: Interval plot with no more gap!]

Awesome! The plot looks much better without the big gaps. Although, perhaps a little gap would make it easier to see the individual intervals more clearly. If we change that gap to –0.85, then everything is groovy.

[Figure: Interval plot with tasteful gap]

Now that's a gap I don't mind at all! It's now really easy to compare the results for different oven temperatures within and across the different oven times. The interval plot suggests that to minimize moisture content, we want to use the 90-minute oven time, but we don't want to use the 125-degree oven temperature. 
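If you ever need a similar display outside Minitab, here is a rough matplotlib sketch of the same kind of plot: group means with 95% confidence interval bars, one connected line per oven temperature. The moisture values below are simulated placeholders, not the PercentMoisture.MTW data.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
times, temps = [30, 60, 90], [125, 130, 135]
# Simulated placeholder data: 5 moisture readings per time/temperature cell.
data = {(t, T): rng.normal(loc=12 - 0.03 * t - 0.2 * (T - 125), scale=0.5, size=5)
        for t in times for T in temps}

for T in temps:
    means = [data[(t, T)].mean() for t in times]
    half = [stats.sem(data[(t, T)]) * stats.t.ppf(0.975, len(data[(t, T)]) - 1) for t in times]
    plt.errorbar(times, means, yerr=half, marker="o", capsize=4, label=f"{T} degrees")

plt.xlabel("Oven time (minutes)")
plt.ylabel("Moisture (%)")
plt.legend(title="Oven temp")
plt.show()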

As you can see, the interval plot is an easy and fast way to get a good idea of which differences could be important. But remember, the interval plot can’t tell us which effects or which differences are statistically significant or not. For that, we need to conduct an analysis of variance (ANOVA).

Spoiler alert: I already ran an ANOVA on these data and it confirms what we gleaned from the interval plot. The main effects for both time and temperature are significant. (The interaction effect is not quite significant at the 0.05-level.) Tukey comparisons show that 90 minutes in the oven reduces moisture significantly better than either 30 minutes or 60 minutes in the oven. Tukey comparisons also show that a 125-degree oven is significantly worse at reducing moisture than either a 130-degree oven or a 135-degree oven. The effects of the 135-degree oven are not significantly different from the 130-degree oven, so we can probably save some energy and just use 130 degrees to desiccate our wild oats. 

Credit for the subway tunnel photo goes to Thomas Claveirole.  Credit for the subway station photo goes to Tim Adams. Both are available under Creative Commons License 2.0. 

Five Reasons Why Your R-squared Can Be Too High

I’ve written about R-squared before and I’ve concluded that it’s not as intuitive as it seems at first glance. It can be a misleading statistic because a high R-squared is not always good and a low R-squared is not always bad. I’ve even said that R-squared is overrated and that the standard error of the estimate (S) can be more useful.

Even though I haven’t always been enthusiastic about R-squared, that’s not to say it isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, you should investigate why yours is different because there might be a problem.

In this blog post, I look at five reasons why your R-squared can be too high. This isn’t a comprehensive list, but it covers some of the more common reasons.

Is A High R-squared Value a Problem?

[Figure: Very high R-squared]
A very high R-squared value is not necessarily a problem. Some processes can have R-squared values that are in the high 90s. These are often physical processes where you can obtain precise measurements and there's low process noise.

You'll have to use your subject area knowledge to determine whether a high R-squared is problematic. Are you modeling something that is inherently predictable? Or, not so much? If you're measuring a physical process, an R-squared of 0.9 might not be surprising. However, if you're predicting human behavior, that's way too high!

Compare your study to similar studies to determine whether your R-squared is in the right ballpark. If your R-squared is too high, consider the following possibilities. To determine whether any apply to your model specifically, you'll have to use your subject area knowledge, information about how you fit the model, and data specific details.

Reason 1: R-squared is a biased estimate

The R-squared in your regression output is a biased estimate based on your sample—it tends to be too high. This bias is a reason why some practitioners don’t use R-squared at all but use adjusted R-squared instead.

R-squared is like a broken bathroom scale that tends to read too high. No one wants that! Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce a weight that is correct on average.

Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model. Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.
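For reference, the usual adjusted R-squared formula makes this shrinkage explicit; here is a small sketch (n is the number of observations, k the number of predictor terms):

def adjusted_r_squared(r_squared, n, k):
    """Shrink R-squared based on observations (n) per predictor term (k)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Few observations per term -> heavy shrinkage; many per term -> little shrinkage.
print(adjusted_r_squared(0.90, n=20, k=10))   # about 0.79
print(adjusted_r_squared(0.90, n=200, k=10))  # about 0.89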

For more information, read my posts about Adjusted R-squared and R-squared shrinkage.

Reason 2: You might be overfitting your model

An overfit model is one that is too complicated for your data set. You’ve included too many terms in your model compared to the number of observations. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.

Adjusted R-squared doesn't always catch this, but predicted R-squared often does. Read my post about the dangers of overfitting your model.

Reason 3: Data mining and chance correlations

If you fit many models, you will find variables that appear to be significant but are correlated with the response only by chance. While your final model might not be too complex for the number of observations (Reason 2), problems occur when you fit many different models to arrive at the final model. Data mining can produce high R-squared values even with entirely random data!
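A quick simulation sketch makes the point: with pure noise and enough candidate predictors, a crude selection step can still yield a respectable-looking R-squared.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, p = 30, 100
y = rng.normal(size=n)        # response is pure noise
X = rng.normal(size=(n, p))   # 100 candidate predictors, also pure noise

# Keep the 5 predictors most correlated with y (a stand-in for model hunting).
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
best = np.argsort(corr)[-5:]
model = sm.OLS(y, sm.add_constant(X[:, best])).fit()
print(model.rsquared)         # often surprisingly high for random data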

Before performing regression analysis, you should already have an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes based on previous research. Unfortunately, recent trends have moved away from this approach thanks to large, readily available databases and automated procedures that build regression models.

For more information, read my post about using too many phantom degrees of freedom.

Reason 4: Trends in Panel (Time Series) Data

If you have time series data and your response variable and a predictor variable both have significant trends over time, this can produce very high R-squared values. You might try a time series analysis, or include time-related variables in your regression model, such as lagged and/or differenced variables. Conveniently, these analyses and functions are all available in Minitab statistical software.

Reason 5: Form of a Variable

It's possible that you're including different forms of the same variable for both the response variable and a predictor variable. For example, if the response variable is temperature in Celsius and you include a predictor variable of temperature in some other scale, you'd get an R-squared of nearly 100%! That's an obvious example, but the same thing can happen more subtly.

For more information about regression models, read my post about How to Choose the Best Regression Model.


3 Ways to Identify the Most Common Categories in a Column


You have a column of categorical data. Maybe it’s a column of reasons for production downtime, or customer survey responses, or all of the reasons airlines give for those infuriating flight delays. Whatever type of qualitative data you may have, suppose you want to find the most common categories. Here are three different ways to do that:

1. Pareto Charts

Pareto Charts easily help you separate the vital few from the trivial many. To create a Pareto chart in Minitab Statistical Software, choose Stat > Quality Tools > Pareto Chart, then enter the column that contains your data and click OK. If you can’t easily read the chart because there are too many bars displayed, go back to the Pareto Chart dialog (hint: use Control-E to reopen the last dialog) and change the default percent from 95 to something less, say 50. You can keep playing with this percent until you arrive at the chart that tells the best story.

[Figure: Pareto chart]
2. Tally

[Screenshot: Tally output]

To create a table of all categories and their respective counts, use Minitab’s Stat > Tables > Tally Individual Variables. Just like the Pareto Chart, you can simply enter your column of data and click OK. Or you can use some of the other options within the dialog depending on what you need.

For example, you may want to calculate both the count and percent for each category, and also store your results in the worksheet so you can sort the data with the most common category first, or highlight the data, etc.
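Outside Minitab, the same counts and percentages can be tallied with a few lines of pandas; here is a minimal sketch with toy data.

import pandas as pd

delays = pd.Series(["Weather", "Crew", "Weather", "Maintenance", "Crew", "Weather"])

counts = delays.value_counts()               # like Tally: counts, most common first
percent = 100 * counts / counts.sum()
table = pd.DataFrame({"Count": counts, "Percent": percent, "Cum %": percent.cumsum()})
print(table)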

3. Conditional Formatting

Speaking of highlighting the data, you can also identify the most common categories in a column using conditional formatting. In Minitab, simply click anywhere in the column, then Right-click > Conditional Formatting > Pareto and select either:

  • Most Frequent Values – Suppose you want to look at the 5 most common categories. Enter the number 5 and click OK.
  • Most Frequent Percentage – Similar to the Pareto chart, suppose you want to look at the most common categories that represent 80% of all the data. Enter 80 and click OK.

After you use conditional formatting to identify cells of interest, you can also create new data sets that either include or exclude the highlighted cells by right-clicking and selecting Subset Worksheet.

Trying to summarize large amounts of categorical data isn’t always easy, but these 3 tools are a good place to start. And if there’s ever a case where you want to see all categories, you can use some of the same tools as above, or simply create a good old Pie Chart.

Where in the World Is Someone Most Likely to Be a Leap Year Baby?

The easiest way to determine the probability of being born on a certain day is to assume that every day of the year has an equal probability of being a birthday. But academic scholarship tends to point to seasonal variation in births. If you average statistics from the United Nations, the seasonality in the United States of America from 1969 to 2013, excluding 1976 and 1977, looks like this:

[Figure: Birth seasonality in the United States]

Seeing this pattern made me think that the seasonality could be different in different countries around the world. While we don’t have good data for all of them, the United Nations does provide statistics covering various years for 145 countries or areas. The data are not broken down to the day, so we won’t be able to precisely determine the number of births on Leap Days, but we will be able to see where births in February and March are most common.

The best thing about doing this analysis in Minitab is that it’s easy to remove unneeded rows from the data. For example, if you download and open the comma separated values file, you’ll see that the first row of the data is not a month. Instead, it’s the total for an entire year:

[Screenshot: The first row of the data includes a Total value in the Month column.]

It’s easy to preserve the original worksheet in Minitab while you create a worksheet without the extra information. In this case, just follow these steps:

  1. Choose Data > Subset Worksheet.
  2. In Do you want to include or exclude rows, choose Exclude rows that match condition.
  3. In Column, enter Month.

Once you choose the column, Minitab shows you a list of all of the values in the column. You don’t have to know every value in the column to subset the data. Although I noticed the Total values right away because they were at the top, I can also remove the Unknown and Missing values at the same time.

[Screenshot: Total, Unknown, and Missing are values in the column.]

  1. In Values, check the values to exclude.
  2. Enter a new worksheet name, such as Birth data. Click OK.

With a few formulas to get birth rates for comparisons and a bit more subsetting, you can produce graphs that show the most popular countries for babies in February and March.
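The same exclusion step is easy to mirror in pandas if you prefer to work outside Minitab; the file and column names below are assumptions.

import pandas as pd

births = pd.read_csv("un_births_by_month.csv")   # hypothetical export of the UN data

# Like "Exclude rows that match condition": drop the non-month labels.
exclude = {"Total", "Unknown", "Missing"}
birth_data = births[~births["Month"].isin(exclude)].copy()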

[Figure: Tajikistan and Ghana are popular nations for births in February and March.]

When you look at all of the data the United Nations provides, Saint Helena: Ascension has the highest birth rates combined in February and March. Ghana, with a population over 26,000,000 today, is the most populous location in the top 10.

[Figure: Tajikistan and South Africa are popular nations for February and March births in leap years.]

If you look at birth rates only in leap years from the United Nations data, Tajikistan moves into the top spot. South Africa moves into the top 10 locations and becomes the most populous location on the list.

[Figure: The Republic of Korea and Chile were popular nations for February and March births last leap year.]

Looking at only the most recent leap year changes the list considerably. Neither Tajikistan, Ghana, nor South Africa appear. The Republic of Korea is the most populous location among the top 10.

Because we’re looking at rates, none of these locations will actually have the most leap year babies. The raw numbers mean that countries with lots of births, such as China and India, will have the most babies born on February 29. But in terms of probability, we see that many locations have a seasonality effect, which means that for a randomly selected baby in a particular location, the probability of a Leap Day birthday depends on when babies tend to be born there. And, of course, if you want a Leap Day baby and you’re inclined to occasionally stretch associations into causes, it’s not too early to start planning your 2020 vacation to Saint Helena.

Want to see more about subsetting worksheets? Check out Subset a worksheet based on starting values of a row.

What a Trip to the Dentist Taught Us about Automation

After my husband’s most recent visit to the dentist, he returned home cavity-free...and with a $150 electric toothbrush in hand. 

I wanted details.

It began innocently. His dreaded trip to the dentist ended in high praise for no cavities and only a warning to floss more. That prompted my programming-and-automation-obsessed husband, still in the chair, to exclaim, "I wish there was a way to automate this whole process—the brushing and the flossing."

Next thing you know, he’s swiping the credit card (to "earn miles for our next flight," he says) and walking out with a nice Philips Sonicare DiamondClean Sonic Electric Toothbrush.

From this anecdote, you’d think I was sitting beside him as his teeth-cleaning proceeded. I merely received the story secondhand from our dental hygienist the very next day when I went in for my own visit. But I digress.

When my husband exclaimed his desire to automate a process that very few humans enjoy doing, our dentist was pleased to tell him that this toothbrush comes close. Granted, the toothbrush can’t completely automate these tasks: it still requires the user to be present. However, our dentist offered the following points to consider:

  1. The toothbrush does most of the brushing for you (with the exception of moving your hand so that you brush all your teeth).
  2. The bristles automatically move, reaching crevices between teeth that no manual tooth-brushing ever could.
  3. Because of point #2, plaque buildup will decrease and gum health will improve.
  4. Because of point #3, flossing won’t be a strict daily requirement.

Sold.

The dentist's points give us a nice framework for thinking about automation. An automated solution might not be perfect. But an automated solution should:

a. make a task easier and more efficient (brushing hard-to-reach places more effectively),
b. require less of your time (reducing the need to floss daily), and
c. save you money (better tooth and gum health and fewer fillings equates to cost savings).

Who wouldn’t buy into that idea?

Automated solutions can turn feelings of boredom over performing tedious tasks into feelings of excitement. Why? Because automation removes the need to perform repetitive tasks that we know how to do but might not particularly enjoy, helps us see results faster, and incites us to implement change sooner. This can translate into business efficiency and increased profit.

The mere idea of automating the task of brushing teeth and the results he might experience incited my husband to think about tooth-brushing differently, and prompted the decision to purchase this custom solution (the electric toothbrush) before even implementing it in his daily habits; imagine the changes and process improvements that might occur once the automated solution is in place. Perhaps a report of no cavities for several visits in a row and an extra lump of cash for him to spend on me!

Just as Philips (and other manufacturers) developed an electric toothbrush as a custom solution to automate difficult or tedious aspects of brushing teeth, Minitab has created custom statistical solutions and has automated processes for numerous customers in various industries, including manufacturing, pharmaceutical, medical devices, and healthcare.

Did you know that Minitab is not merely an out-of-the-box statistical software package? Behind the software interface is a powerful statistical and graphical engine that can integrate with a customer’s workflow and provide a unique solution tailored to that customer’s industry-specific problem. Minitab’s engine can communicate with a customer’s databases, applications, and other programs such as Excel, in order to automatically perform analyses and provide output relevant to the customer’s needs.

One interesting example that comes to mind is a project our custom development consultants tackled for a pharmaceutical company. This company was responding to an FDA warning letter and needed to assess the quality of hundreds of active ingredients in a particular drug. They needed to analyze data collected for each ingredient using Minitab’s capability analysis tool, and create a report detailing the result of the analysis in order to show the FDA that their drug was stable and safe for consumption—but they needed to perform the same analysis and create the same report hundreds of times over.

Our custom development consultants used Minitab’s engine to access the customer’s data in Excel, automatically perform capability analysis on each active ingredient in the drug, and create custom reports detailing the quality level of each ingredient and a few additional pieces of output that the FDA wanted to see. Automating this work saved a tremendous amount of time, energy, and money, and ultimately helped the pharmaceutical company respond to the FDA warning letter in a timely manner.

Of course, Minitab’s custom solutions can take on many forms, including custom reports as mentioned in the pharmaceutical example above, real-time dashboard solutions, and alert systems (I’ll save details on that one for the second installment of this blog series, where we’ll hear about more of my husband’s shenanigans pertaining to online bill payments).

If you’re interested in learning more about automated solutions Minitab provides, please join us for a live demo of a real-time dashboard solution!

We live in a world of innovation and creativity; automated solutions touch on both ideas. If we can automate aspects of brushing our teeth, then surely we can automate a business process or task to help you become more efficient, save time, reduce costs, and see results sooner. If you’d like to learn how Minitab can help you, contact us at customdev@minitab.com.

My hope is that after reading this blog post, you see the relevance and value of automation—whether brushing your teeth, performing the same statistical analyses, or creating custom reports. And the power of automation extends far beyond these simple examples! So if I’ve piqued your interest, stay tuned for Part 2 of this series to hear more lessons learned by my husband in his automation endeavors!

Weight for It! A Healthy Application of the Central Limit Theorem

Like so many of us, I try to stay healthy by watching my weight. I thought it might be interesting to apply some statistical thinking to the idea of maintaining a healthy weight, and the central limit theorem could provide some particularly useful insights. I’ll start by making some simple (maybe even simplistic) assumptions about calorie intake and expenditure, and see where those lead. And then we can take a closer look at these assumptions to try to get a little closer to reality.

I should hasten to add that I’m not a dietitian—or any kind of health professional, for that matter. So take this discussion as an example of statistical thinking rather than a prescription for healthy living.

Some Basic Assumptions

Wearable fitness trackers like a FitBit or pedometer can give us data about the calories we burn, while a food journal or similar tool helps monitor how many calories we take in. The key assumption I am going to make for this discussion is that the calories I take in are roughly in balance with the calories I burn. Not that they balance exactly every day, but on average they tend to be in balance. Applying statistical thinking, I am going to assume my daily calorie balance is a random variable X with mean 0, which corresponds to perfect balance.

On days when I consume more calories than I burn, X is positive. On days when I burn more calories than I consume, X is negative. On a day when a coworker brings in doughnuts, X might be positive. On a day when I take a walk after dinner instead of watching TV, X might be negative. I will assume that when X is positive, the extra calories are stored as fat. On days when X is negative, I burn up stored fat to fuel my extra activity. I will assume each pound of body fat is the accumulation of 3500 extra calories.

The variation in X is represented by the variance, which is the mean squared deviation of X from its mean. The standard deviation is the square root of the variance. I will assume a standard deviation of 200 calories.

Each day there’s a new realization of X. If I assume each day’s X value is independent of that from the day before, then it’s like taking a random sample over time from the distribution of X. The central limit theorem assumes independence, so I’ll at least start off with that assumption. Later I’ll revisit my assumptions.

Based on all these assumptions, if I add up all the X’s over the next year (X1 + X2 + … + X365) that will tell me how much weight I will gain or lose. If the sum is positive, I gain at the rate of one pound for every 3500 calories. If the sum is negative, I lose at the same rate. So let’s apply some statistical theory to see what we can say about this sum.

First, the mean of the sum will be the sum of the means. That’s why I wanted to assume that my daily calorie balance has a mean of 0. Add up 365 zeroes, and you still have zero. Just like my daily calorie balance is a random variable with mean zero, so is my yearly calorie balance. So far so good!

Accounting for Variation

Next consider the variability. Variances also add. With the assumption of independence, the variance of the sum is the sum of the variances. I assumed a daily standard deviation of 200 calories, which is the square root of the variance, which would be 40,000 calories squared. It’s weird to talk about square calories, so that’s why I prefer to talk about the standard deviation, which is in units of calories. But standard deviations don’t sum nicely the way variances do. My yearly calorie balance will have a variance of 365 × 40,000 calories squared.

The standard deviation is the square root of this, or 200 times the square root of 365. The square root of 365 is about 19.1, so the standard deviation of my yearly calorie balance is about 19.1*200 = 3820. Is that good? Is that bad? Not sure, but this quantifies the intuitive but vague idea that my weight varies more from year to year than it does from day to day.

Enter the Central Limit Theorem

Now let’s bring the central limit theorem into the discussion. What can it add to what we have found already? The central limit theorem is about the distribution of the average of a large number of independent identically distributed random variables—such as our X. It says that for large enough samples, the average has an approximately normal distribution. And because the average is just the sum divided by the total number of Xs, which is 365 in our example, this also lets us use the normal distribution to get approximate probabilities for my weight change over the next year.

Let’s use Minitab’s Probability Distribution Plot to visualize this distribution. First let’s see the distribution of my yearly calorie balance using a mean of 0 and a standard deviation of 3820.

[Figure: Distribution plot of yearly calorie balance (mean 0, standard deviation 3820)]

We can get the corresponding distribution in terms of pounds gained by dividing the mean and standard deviation by 3500.

[Figure: Distribution plot of pounds gained in a year]

The right tail of the distribution is labeled to show that under my assumptions I have about an 18% probability of gaining at least one pound. The distribution is symmetric about zero, so I have the same probability of losing at least one pound over the year. On the bright side, I have about a 64% probability of staying within one pound of my current weight as shown in this next graph.

[Figure: Normal distribution plot showing the probability of staying within one pound]

Before I revisit the assumptions, let’s project this process farther into the future. What does it imply for 10 years from now? What’s the distribution of the sum X1+X2+…+X3652 (I included a couple of leap years)? The mean will still be zero. The standard deviation will be 200 times the square root of 3652, or about 12,086. Dividing by 3500 calories per pound, we have a standard deviation of about 3.45 pounds. What’s the probability that I will have gained 5 pounds or more over the next 10 years?

[Figure: Distribution plot showing the probability of gaining 5 or more pounds over 10 years]

It’s about 7.3%. That’s actually not too bad!
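These probabilities are easy to double-check with SciPy's normal distribution; here is a short sketch that reproduces the numbers above.

import numpy as np
from scipy.stats import norm

daily_sd = 200
sd_year = daily_sd * np.sqrt(365) / 3500    # yearly SD in pounds, about 1.09
sd_10yr = daily_sd * np.sqrt(3652) / 3500   # 10-year SD in pounds, about 3.45

print(norm.sf(1, loc=0, scale=sd_year))                    # P(gain >= 1 lb in a year), about 0.18
print(norm.cdf(1, 0, sd_year) - norm.cdf(-1, 0, sd_year))  # P(stay within 1 lb), about 0.64
print(norm.sf(5, loc=0, scale=sd_10yr))                    # P(gain >= 5 lb in 10 years), about 0.073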

A More Realistic Look at the Assumptions

Now let’s revisit the assumptions. A key assumption is that my mean calorie imbalance is exactly zero. I’m thinking that’s easier said than done—after all, I’m not weighing my food and calculating calories. I am wearing a smart watch to keep track of my exercise calories, but that’s only a piece of the puzzle, and even there, it’s probably not accurate down to the exact calorie.

So let’s look at what happens if I’m off by a little. Suppose the mean of X is slightly positive, say 10 calories more in than out per day. Means add up, so over a year, the mean imbalance is 365 ×10 = 3650 calories. So on average I’ll gain a little more than a pound. Applying the central limit theorem again, what’s my probability of gaining a pound or more in a year?

[Figure: Distribution plot showing the probability of gaining at least one pound in a year with a 10-calorie daily imbalance]

As this graph shows, the probability is about 51.57% that I will gain at least one pound in a year.

What about 10 years? The average is now 36,520 calories, which translates to about 10.43 pounds. Now what’s the probability of gaining at least 5 pounds over the next 10 years?

[Figure: Distribution plot showing the probability of gaining at least 5 pounds over 10 years with a 10-calorie daily imbalance]

That’s a probability of over 94% of gaining at least 5 pounds, with gains of around 10 pounds or more being very likely.
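Shifting the mean in the same kind of SciPy sketch reproduces these numbers too.

import numpy as np
from scipy.stats import norm

sd_year = 200 * np.sqrt(365) / 3500
sd_10yr = 200 * np.sqrt(3652) / 3500
mean_year = 10 * 365 / 3500    # about 1.04 lb per year from a 10-calorie daily surplus
mean_10yr = 10 * 3652 / 3500   # about 10.4 lb over 10 years

print(norm.sf(1, loc=mean_year, scale=sd_year))   # about 0.516
print(norm.sf(5, loc=mean_10yr, scale=sd_10yr))   # about 0.94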

That’s a big difference due to a seemingly insignificant 10 calorie imbalance per day. Ten calories is about a minute of jumping rope or a dill pickle.

Considering Correlation?

I assumed that each day’s calorie balance was independent of every other day’s. What happens if there are correlations between days? I could get some positive correlation if I buy a whole cheesecake and take a few days to eat it. Or if I go on a hiking trip over a long weekend. On the other hand, I could get some negative correlation if I try to make up for yesterday’s overeating by eating less or exercising more than usual today.

If there are correlations between days, then in addition to summing variances, I have to include a contribution for each pair of days that are correlated. Positive correlations make the variance of the sum larger, but negative correlations make the variance of the sum smaller. So introducing some negative correlations on a regular basis would help reduce the fluctuations of my weight from its mean. But as we’ve seen, that’s no substitute for keeping the long-term mean as close to zero as possible. If I notice my weight trending too quickly in one direction or the other, I had better make a permanent adjustment to how much I eat, how much activity I get, or both.

Do Actors Wait Longer than Actresses for Oscars? A Comparison Between Academy Award Winners

I am a bit of an Oscar fanatic. Every year after the ceremony, I religiously go online to find out who won the awards and listen to their acceptance speeches. This year, I was so chuffed to learn that Leonardo DiCaprio won his first Oscar for his performance in The Revenant at the 88th Academy Awards—after five nominations in previous ceremonies. As a longtime DiCaprio fan, I still remember going to the cinema when Titanic was released, and returning four more times. Every time, I could not hold back my tears and used up all the tissues I'd brought with me!
[Photo: this year's winner, Leonardo DiCaprio]

Compared to his Titanic costar Kate Winslet, who won the Best Actress award in 2009 (aged 33), Leonardo waited 7 more years (20 years since his first nomination) before his turn came. I can name several actresses—Gwyneth Paltrow, Hilary Swank, and Jennifer Lawrence come immediately to mind—who obtained the award at younger ages. However, it appears that few young actors have received the Academy Award in recent years. This makes me wonder whether Oscar-winning actors tend to be older than Oscar-winning actresses.

To investigate, I collected data on the dates of past Academy Awards ceremonies and the birthdays of the winning actors and actresses. From these, I calculated the age of each winner on their Oscar-winning night. Below is a screenshot of some of the data.

[Screenshot: Oscars data]

I used Minitab Statistical Software to create a time series plot of the data, shown below.

[Figure: time series plot of winners' ages]

The plot suggests that there is usually a substantial age difference between the Best Actor and Best Actress winners. There are more years when the Best Actor winner is much older than the best actress winner (blue dots above red dots) than years where the winning actress is older. Some examples:

1992:  Anthony Hopkins (54.2466), Jodie Foster (29.3616)

1987: Paul Newman (62.1726), Marlee Matlin (21.5973)

1989:  Dustin Hoffman (51.6329), Jodie Foster (26.3507)

1990:  Daniel Day-Lewis (32.9068), Jessica Tandy (80.8000)

1998:  Jack Nicholson (60.9178), Helen Hunt (34.7699)

2011:  Colin Firth (50.4658), Natalie Portman (29.7205)

2013:  Daniel Day-Lewis (55.8247), Jennifer Lawrence (22.5288)

There are not many occasions when both the Best Actor and Best Actress are in their 30s, 40s, 50s, etc.

Conditional formatting was introduced with the release of Minitab 17.2, and this is what I am going to use to identify any repeat winners in the data.

[Screenshot: conditional formatting]

Minitab applies the following conditional formatting to the data set:

[Screenshot: conditionally formatted data set]

For the Best Actor award, Daniel Day-Lewis received the award on three occasions, while Marlon Brando, Gary Cooper, Tom Hanks, Dustin Hoffman, Fredric March, Jack Nicholson, Sean Penn, and Spencer Tracy each won the award twice.

For the Best Actress category, Katharine Hepburn won four times. Ingrid Bergman, Bette Davis, Olivia de Havilland, Sally Field, Jane Fonda, Jodie Foster, Glenda Jackson, Vivien Leigh, Luise Rainer, Meryl Streep, Hilary Swank, and Elizabeth Taylor each received the award twice.

Winners below the age of 30 could be regarded as obtaining the award at an early stage of their careers. Using conditional formatting again, I can quickly identify the actors and actresses in the data who are in this group.

[Screenshot: conditional formatting]

As shown below, a lot more actresses than actors obtain the award before the age of 30.

[Screenshot: conditionally formatted data]

To get a better comparison, I am going to remove the repeats (with the help of the highlighted cells) for actors and actresses who won more than once and take into account only their age at first win. This gives data from 79 Best Actor and 74 Best Actress winners. I am going to use the Assistant to carry out a comparison using the 2-sample t-test.

[Screenshot: Assistant 2-sample t-test]

Apart from generating easy-to-interpret output, the Assistant also has the advantage of carrying out a powerful t-test even with unequal sample sizes using the Welch approach.
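For anyone repeating this comparison outside the Assistant, Welch's two-sample t-test is also available in SciPy; the ages below are simulated placeholders standing in for the real worksheet.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
actor_ages = rng.normal(loc=44, scale=9, size=79)      # placeholder ages at first Best Actor win
actress_ages = rng.normal(loc=35, scale=10, size=74)   # placeholder ages at first Best Actress win

t_stat, p_value = stats.ttest_ind(actor_ages, actress_ages, equal_var=False)  # Welch's t-test
print(t_stat, p_value)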

[Screenshot: Report Card]

The Report Card indicates that we have sufficient data and that the assumptions of the t-test are fulfilled. However, Minitab also detects some unusual data, which I will look into further.

[Screenshot: 2-sample t-test diagnostic report]

Using the brush, the following unusual data are identified.

Best Actor:  
John Wayne (62.8658)
Henry Fonda (76.8685)

These winners were considerably older, as the majority of the actor winners are in their 40s and 50s.

Best Actress:
Marie Dressler (63.0027)
Geraldine Page (61.3342)
Jessica Tandy (80.8000)
Helen Mirren (61.5863)

These winners were considerably older as the majority of the actress winners were in their late 30s and 40s.

[Screenshot: 2-sample t-test summary report]

The Summary Report provides the key output of the t-test. The mean age of the Best Actor winners is 43.746, while the mean age of the Best Actress winners is 35. The p-value of the test is very small (<0.001). This means that we have enough evidence to suggest that, on average, the Best Actor winner is older than the Best Actress winner.

So it seems actors do have to accumulate more years of experience before getting their Oscars. On the other hand, actresses—if they choose the right role—could win at a very young age.

I will leave it to others to speculate (and perhaps even use data to explore) why this apparent age gap exists. However, whatever their ages, we all enjoy seeing these Oscar winners' amazing performances on the big screen!

Photograph of Leonardo DiCaprio by See Li, used under Creative Commons 2.0. 
