
Applying DOE for Great Grilling, part 2


Design of Experiments is an extremely powerful statistical method, and we added a DOE tool to the Assistant in Minitab 17  to make it more accessible to more people.

Since it's summer grilling season, I'm applying the Assistant's DOE tool to outdoor cooking. Earlier, I showed you how to set up a designed experiment that will let you optimize how you grill steaks. 

If you're not already using it and you want to play along, you can download the free 30-day trial version of Minitab Statistical Software.

Perhaps you are following along, and you've already grilled your steaks according to the experimental plan and recorded the results of your experimental runs. Otherwise, feel free to download my data here for the next step: analyzing the results of our experiment. 

Analyzing the Results of the Steak Grilling Experiment 

After collecting your data and entering it into Minitab, you should have an experimental worksheet that looks like this: 

With your results entered in the worksheet, select Assistant > DOE > Analyze and Interpret. As you can see below, the only button you can click is "Fit Linear Model." 

As you might gather from the flowchart, when it analyzes your data, the Assistant first checks to see if the response exhibits curvature. If it does, the Assistant will prompt you to gather more data so it can fit a quadratic model. Otherwise, the Assistant will fit the linear model and provide the following output.

When you click the "Fit Linear Model" button, the Assistant automatically identifies your response variable.

All you need to do is confirm your response goal—maximizing flavor, in this case—and press OK. The Assistant performs the analysis, and provides you the results in a series of easy-to-interpret reports. 

Understanding the DOE Results

First, the Assistant offers a summary report that gives you the bottom-line results of the analysis. The Pareto Chart of Effects in the top left shows that Turns, Grill type, and Seasoning are all statistically significant, and there's a significant interaction between Turns and Grill type, too. 

The summary report also shows that the model explains a very high proportion of the variation in flavor, with an R² value of 95.75 percent. And the "Comments" window in the lower right corner puts things in plain language: "You can conclude that there is a relationship between Flavor and the factors in the model..."

The Assistant's Effects report, shown below, tells you more about the nature of the relationship between the factors in the model and Flavor, with both Interaction Plots and Main Effects plots that illustrate how different experimental settings affect the Flavor response. 

And if we're looking to make some changes as a result of our experimental results—like selecting an optimal method for grilling steaks in the future—the Prediction and Optimization report gives us the optimal solution (1 turn on a charcoal grill, with Montreal seasoning) and its predicted Flavor response (8.425). 

It also gives us the Top 5 alternative solutions, shown in the bottom right corner, so if there's some reason we can't implement the optimal solution—for instance, if we only have a gas grill—we can still choose the best solution that suits our circumstances. 

I hope this example illustrates how easy a designed experiment can be when you use the Assistant to create and analyze it, and that designed experiments can be very useful not just in industry or the lab, but also in your everyday life.  

Where could you benefit from analyzing process data to optimize your results? 


Using Marginal Plots, aka "Stuffed-Crust Charts"


In my last post, we took the red pill and dove deep into the unarguably fascinating and uncompromisingly compelling world of the matrix plot. I've stuffed this post with information about a topic of marginal interest...the marginal plot.

Margins are important. Back in my English composition days, I recall that margins were particularly prized for the inverse linear relationship they maintained with the number of words that one had to string together to complete an assignment. Mathematically, that relationship looks something like this:

Bigger margins = fewer words

In stark contrast to my concept of margins as information-free zones, the marginal plot actually utilizes the margins of a scatterplot to provide timely and important information about your data. Think of the marginal plot as the stuffed-crust pizza of the graph world. Only, instead of extra cheese, you get to bite into extra data. And instead of filling your stomach with carbs and cholesterol, you're filling your brain with data and knowledge. And instead of arriving late and cold because the delivery driver stopped off to canoodle with his girlfriend on his way to your house (even though he's just not sure if the relationship is really working out: she seems distant lately and he's not sure if it's the constant cologne of consumables about him, or the ever-present film of pizza grease on his car seats, on his clothes, in his ears?)

...anyway, unlike a cold, late pizza, marginal plots are always fresh and hot, because you bake them yourself, in Minitab Statistical Software.

I tossed some randomly-generated data around and came up with this half-baked example. Like the pepperonis on a hastily prepared pie, the points on this plot are mostly piled in the middle, with only a few slices venturing to the edges. In fact, some of those points might be outliers. 

Scatterplot of C1 vs C2

If only there were an easy, interesting, and integrated way to assess the data for outliers when we make a scatterplot.

Boxplots are a useful way to look for outliers. You could make separate boxplots of each variable, like so:

Boxplot of C1  Boxplot of C2

It's fairly easy to relate the boxplot of C1 to the values plotted on the y-axis of the scatterplot. But it's a little harder to relate the boxplot of C2 to the scatterplot, because the y-axis on the boxplot corresponds to the x-axis on the scatterplot. You can transpose the scales on the boxplot to make the comparison a little easier. Just double-click one of the axes and select Transpose value and category scales:

Boxplot of C2, Transposed

That's a little better. The only thing that would be even better is if you could put each boxplot right up against the scatterplot...if you could stuff the crust of the scatterplot with boxplots, so to speak. Well, guess what? You can! Just choose Graph > Marginal Plot > With Boxplots, enter the variables, and click OK.

Marginal Plot of C1 vs C2

Not only are the boxplots nestled right up next to the scatterplot, but they also share the same axes as the scatterplot. For example, the outlier (asterisk) on the boxplot of C2 corresponds to the point directly below it on the scatterplot. Looks like that point could be an outlier, so you might want to investigate further. 

Marginal plots can also help alert you to other important complexities in your data. Here's another half-baked example. Unlike our pizza delivery guy's relationship with his girlfriend, it looks like the relationship between the fake response and the fake predictor represented in this scatterplot really is working out: 

Scatterplot of Fake Response vs Fake Predictor 

In fact, if you use Stat > Regression > Fitted Line Plot, the fitted line appears to fit the data nicely. And the regression analysis is highly significant:

Fitted Line: Fake Response versus Fake Predictor

Regression Analysis: Fake Response versus Fake Predictor

The regression equation is
Fake Response = 2.151 + 0.7723 Fake Predictor

S = 2.12304   R-Sq = 50.3%   R-Sq(adj) = 49.7%

Analysis of Variance
Source      DF       SS       MS      F      P
Regression   1  356.402  356.402  79.07  0.000
Error       78  351.568    4.507
Total       79  707.970

But wait. If you create a marginal plot instead, you can augment your exploration of these data with histograms and/or dotplots, as I have done below. Looks like there's trouble in paradise:

Marginal Plot of Fake Response vs Fake Predictor, with Histograms
Marginal Plot of Fake Response vs Fake Predictor, with Dotplots

Like the poorly made pepperoni pizza, the points on our plot are distributed unevenly. There appear to be two clumps of points. The distribution of values for the fake predictor is bimodal: that is, it has two distinct peaks. The distribution of values for the response may also be bimodal.

Why is this important? Because the two clumps of toppings may suggest that you have more than one metaphorical cook in the metaphorical pizza kitchen. For example, it could be that Wendy, who is left handed, started placing the pepperonis carefully on the pie and then got called away, leaving Jimmy, who is right handed, to quickly and carelessly complete the covering of cured meats. In other words, it could be that the two clumps of points represent two very different populations. 

When I tossed and stretched the data for this example, I took random samples from two different populations. I used 40 random observations from a normal distribution with a mean of 8 and a standard deviation of 1.5, and 40 random observations from a normal distribution with a mean of 13 and a standard deviation of 1.75. The two clumps of data are truly from two different populations. To illustrate, I separated the two populations into two different groups in this scatterplot: 

 Scatterplot with Groups

This is a classic conundrum that can occur when you do a regression analysis. The regression line tries to pass through the center of the data. And because there are two clumps of data, the line tries to pass through the center of each clump. This looks like a relationship between the response and the predictor, but it's just an illusion. If you separate the clumps and analyze each population separately, you discover that there is no relationship at all: 

Fitted Line: Fake Response 1 versus Fake Predictor 1

Regression Analysis: Fake Response 1 versus Fake Predictor 1

The regression equation is
Fake Response 1 = 9.067 - 0.1600 Fake Predictor 1

S = 1.64688   R-Sq = 1.5%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF       SS       MS     F      P
Regression   1    1.609  1.60881  0.59  0.446
Error       38  103.064  2.71221
Total       39  104.673

Fitted Line: Fake Response 2 versus Fake Predictor 2

Regression Analysis: Fake Response 2 versus Fake Predictor 2

The regression equation is
Fake Response 2 = 12.09 + 0.0532 Fake Predictor 2

S = 1.62074   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS       MS     F      P
Regression   1   0.291  0.29111  0.11  0.741
Error       38  99.818  2.62679
Total       39  100.109
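If you'd like to recreate this demonstration outside Minitab, here is a minimal Python sketch (assuming numpy and scipy are installed) that draws the same two populations described above and compares the pooled regression with the per-group regressions:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)

  # Two populations, x and y generated independently, as described above:
  # N(8, 1.5) and N(13, 1.75), 40 observations each
  x1, y1 = rng.normal(8, 1.5, 40), rng.normal(8, 1.5, 40)
  x2, y2 = rng.normal(13, 1.75, 40), rng.normal(13, 1.75, 40)

  # Pooled, the two clumps create an illusory relationship (tiny p-value)
  pooled = stats.linregress(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
  print(f"pooled:  slope = {pooled.slope:.3f}, p = {pooled.pvalue:.4f}")

  # Within each clump there is no relationship at all (large p-values)
  for label, x, y in [("group 1", x1, y1), ("group 2", x2, y2)]:
      fit = stats.linregress(x, y)
      print(f"{label}: slope = {fit.slope:.3f}, p = {fit.pvalue:.4f}")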

If only our unfortunate pizza delivery technician could somehow use a marginal plot to help him assess the state of his own relationship. But alas, I don't think a marginal plot is going to help with that particular analysis. Where is that guy anyway? I'm getting hungry. 

Does Major League Baseball Really Need the Second Half of the Season?


When you perform a statistical analysis, you want to make sure you collect enough data that your results are reliable. But you also want to avoid wasting time and money collecting more data than you need. So it's important to find an appropriate middle ground when determining your sample size.

Now, technically, the Major League Baseball regular season isn't a statistical analysis. But it does kind of work like one, since the goal of the regular season is to "determine who the best teams are." The National Football League uses a 16-game regular season to determine who the best teams are. Hockey and basketball use 82 games.

Baseball uses 162 games.

So is baseball wasting time collecting more data than it needs? Right now the MLB regular season is about halfway over. So could they just end the regular season now? Will playing another 81 games really have a significant effect on the standings? Let's find out.

How much do MLB standings change in the 2nd half of the season?

I went back through five years of records and recorded where each MLB team ranked in their league (American League and National League) on July 8, and then again at the end of the season. We can use this data to look at concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.

For example, let's compare the Astros and Angels from 2015. On July 8th, the Astros were ranked 2nd in the AL and the Angels were ranked 3rd. At the end of the season, Houston was ranked 5th and the Angels were ranked 6th. This pair is concordant since in both cases the Astros were ranked higher than the Angels. But if you compare the Astros and the Yankees, you'll see the Astros were ranked higher on July 8th, but the Yankees were ranked higher at the end of the season. That pair is discordant.

When we compare every team, we end up with 11,175 pairs. How many of those are concordant? Minitab Statistical Software has the answer.
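Outside Minitab, the pair counting is easy to verify with a few lines of Python. The ranks below are made up to mirror the Astros/Angels/Yankees example above:

  from itertools import combinations

  def concordance(july, final):
      """Count concordant and discordant pairs between two rankings."""
      concordant = discordant = 0
      for i, j in combinations(range(len(july)), 2):
          d = (july[i] - july[j]) * (final[i] - final[j])
          if d > 0:
              concordant += 1      # same order in July and at season's end
          elif d < 0:
              discordant += 1      # the order flipped
      return concordant, discordant

  july = [2, 3, 4]    # hypothetical July 8 ranks: Astros, Angels, Yankees
  final = [5, 6, 4]   # final ranks: the Yankees jumped both of them
  print(concordance(july, final))  # (1, 2)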

Measures of Concordance

There are 8,307 concordant pairs, which is just over 74% of the data. So most of the time, if a team is higher in the standings as of July 8th, they will finish higher in the final standings too. We can also use Spearman's rho and Pearson's r to assess the association between standings on July 8th and the final standings. These two values give us a coefficient that can range from -1 to +1. The larger the absolute value, the stronger the relationship between the variables. A value of 0 indicates the absence of a relationship.
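Both coefficients are one function call away in scipy if you want to replicate them outside Minitab; the two lists below are hypothetical stand-ins for the July 8 and final-standings rank columns:

  from scipy.stats import pearsonr, spearmanr

  july = [1, 2, 3, 4, 5]    # hypothetical July 8 ranks
  final = [1, 3, 2, 5, 4]   # hypothetical final ranks

  r, r_p = pearsonr(july, final)
  rho, rho_p = spearmanr(july, final)
  print(f"Pearson's r = {r:.2f}, Spearman's rho = {rho:.2f}")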

Pearson's r and Spearman's rho

Both values are high and positive, once again indicating that teams ranked higher than other teams on July 8th usually stay that way by the end of the season. So did we do it? Did we show that baseball doesn't really need the 2nd half of their season?

Not quite.

Consider that each league has 15 teams. So a lot of our pairs are comparing teams that aren't that close together, like the 1st team to the 15th, the 1st team to the 14th, the 2nd team to the 15th, and so on. It's not very surprising that those pairs are going to be concordant. So let's dig a little deeper and compare each individual team's ranking in July to its ranking at the end of the season. The following histogram shows the difference in a team's rank. Positive values mean the team moved up in the standings; negative values mean it fell.

Histogram

The most common outcome is that a team doesn't move up or down in the standings, as 34 of our observations have a difference of 0. However, there are 150 total observations, so most of the time a team does move up or down. In fact, 55 times a team moved up or down in the standings by 3 or more spots. That's over a third of the time! And there are multiple instances of a team moving 6, 7, or even 8 spots! That doesn't seem to imply that the 2nd half of the season doesn't matter. So what if we narrow the scope of our analysis?

Looking at the Playoff Teams

We previously noted that the regular season is supposed to determine the best teams. So let's focus on the top of the MLB standings. I took the top 5 teams in each league (since the top 5 teams make the playoffs) on July 8th, and recorded whether they were still a top 5 team (and in the playoffs) at the end of the season. The following pie chart shows the results.

Pie Chart

Twenty-eight percent of the time, a team that was in the playoffs in July fell far enough in the standings to drop out. So over a quarter of your playoff teams would be different if the season ended around 82 games. That sounds like a significant effect to me. And last, let's return to our concordant and discordant pairs. Except this time, we'll just look at the top half of the standings (top 8 teams).

Measures of Concordance

This time our percentage of concordant pairs has dropped to 59%, and the values for Spearman's rho and Pearson's r show a weaker association. Teams ranked higher in the 1st half of the season are usually still ranked higher at the end of the season. But there is clearly enough shuffling among the top teams to warrant the 2nd half of the season. So don't worry baseball fans, your regular season will continue to extend to September.

Because, you know, Major League Baseball totally would have shortened the season if this statistical analysis suggested doing so!

And if you're looking to determine the appropriate sample size for your own analysis, Minitab offers a wide variety of power and sample size analyses that can help you out.

 

A Visual Look at Baseball's All-Star Teams


Last Tuesday night, Major League Baseball announced the rosters for tomorrow's All-Star game in San Diego. Immediately, as I'm sure was anticipated, people began talking about who made it and who didn't. Who got left out, and who shouldn't have made it.

As a fun little exercise, I decided to take a visual look at the all-star teams, to see what kind of players were selected. I looked at position players only (no pitchers) and made a simple scatterplot, with the x-axis representing their offensive value so far this season, and the y-axis representing their defensive value. This would allow me to see any extreme outliers in terms of value generated so far this year. In Minitab Statistical Software, this command can be found by going to Graph > Scatterplot. I also added data labels through the Editor menu (Editor > Add > Data Labels) so that I could see which point on the plot corresponds to which player.
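If you wanted to build a similar labeled scatterplot outside Minitab, a rough matplotlib sketch follows. The offensive and defensive values here are invented for illustration; the real ones came from fangraphs.com:

  import matplotlib.pyplot as plt

  # Hypothetical values: (offensive value, defensive value) per player
  players = {"Mike Trout": (30.1, 4.2), "Salvador Perez": (2.0, 8.5),
             "David Ortiz": (22.3, -12.0)}

  fig, ax = plt.subplots()
  for name, (off, dfn) in players.items():
      ax.scatter(off, dfn, color="steelblue")
      ax.annotate(name, (off, dfn))     # data labels, like Editor > Add > Data Labels
  ax.axhline(0, color="gray", lw=0.5)   # reference lines that mark the quadrants
  ax.axvline(0, color="gray", lw=0.5)
  ax.set_xlabel("Offensive value")
  ax.set_ylabel("Defensive value")
  plt.show()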

The plot below shows the American League selections:

Scatterplot of American League all-star selections

Looking at the graph, some groupings become apparent. The most populated quadrant is the upper right, which represents a high offensive and defensive value. For an all-star team, this makes sense: these are the best of the best. Here is where you'll find names like Mike Trout, Josh Donaldson, and Jose Altuve, the American League leaders in Wins Above Replacement, which is a metric that tries to capture all of a player's value into one nice statistic.

Another grouping that becomes apparent is the upper left quadrant. This is where we see our defensive maestros. To fall in the upper left quadrant, you need to have a high defensive value and a (relatively) low offensive output. We have a shortstop and three catchers here, which makes sense given that those are the two most demanding defensive positions. 

The lower right corner represents players whose value is mostly on offense. Here we see Edwin Encarnacion, David Ortiz, and Mark Trumbo. Their defensive value is so low because they don't even play defense—they are designated hitters.

This is a fun way to visualize what kind of all-stars we have, and what they excel at. If the manager needs to make a late-game defensive substitution, this graph can show us where they might lean. Additionally, if they need one pinch hitter for a key at-bat, we can see whom they might lean on by looking at the other end of the graph.

We looked at the American League in detail up above, and I've also created the same plot for the National League below:

Scatterplot of National League all-star selections

*Note: All statistics from fangraphs.com

What Were the Odds of Getting into Willy Wonka's Chocolate Factory?


In the great 1971 movie Willy Wonka and the Chocolate Factory, the reclusive owner of the Wonka Chocolate Factory decides to place golden tickets in five of his famous chocolate bars, and allow the winners of each to visit his factory with a guest. Since restarting production after three years of silence, no one has come in or gone out of the factory. Needless to say, there is enormous interest in finding a golden ticket!

Through a series of news reports we get an understanding that all over the world, kids are desperately purchasing and opening Wonka bars in an attempt to win. But just what were the odds? Unfortunately young Charlie Bucket's teacher is not particularly good at percentages and doesn't offer much help:

I hope I can be at least a little more useful. While the movie only vaguely suggests how many bars were actually being opened, we are provided with two data points. First, the spoiled, bratty, unlikable Veruca Salt's factory-owning father states that he's had his workers open 760,000 Wonka bars just before one of them finds a golden ticket:

Meanwhile the polite, likable Charlie Bucket—who is very poor—has received one Wonka Bar for his birthday and another from his Grandpa Joe. Neither bar was a winner, but Charlie finds some money on the street to buy a third:

In the movie, you can't help but feel that Charlie's odds must have been much, much higher than the nasty Veruca Salt's (or any of the other winners). But is there statistical evidence of that?

In Minitab Statistical Software, I set up a basic 2x2 table like this:

2x2 table

Often when practitioners have a 2x2 table, the Chi-Square test immediately comes to mind. But the Chi-Square test is not accurate when any of the cell counts or expected cell counts are small, which is clearly the case here. Fisher's exact test, however, carries no such restriction, and it's available in the "Other Stats" subdialog of Stat > Tables > Cross Tabulation and Chi-Square. The output looks like this:

Fisher's exact test output

For the Chi-Square portion of the output, Minitab not only refuses to provide a p-value but gives two warnings and a note. The Fisher's exact test can be performed, however, and tests whether the likelihood of a winning ticket was the same for both Charlie and Veruca. The p-value of 0.0000079 confirms what we all knew—karma was working for Charlie and against Veruca!
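As a sanity check, you can reproduce this p-value with scipy's implementation of Fisher's exact test. The 2x2 table below is one reasonable way to lay out the counts from the movie (1 winner among Charlie's 3 bars versus 1 winner among Veruca's 760,000):

  from scipy.stats import fisher_exact

  # Rows: Charlie, Veruca; columns: [golden tickets, ordinary bars]
  table = [[1, 2],         # Charlie: 1 winner among his 3 bars
           [1, 759999]]    # Veruca: 1 winner among her 760,000 bars

  odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
  print(f"p = {p_value:.7f}")   # about 0.0000079, matching the output above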

For fun, let's ignore this evidence that the odds were not equal for each child. Let's pretend that the odds are the same, and a really unlikely thing happened anyway because that's what makes the movie great. Aside from our two data points, we have reports from two children in the classroom that they have opened 100 and 150 bars, respectively, and neither won. So we have two golden tickets among 3 + 760,000 + 100 + 150 = 760,253 Wonka bars. This would be a proportion of 2/760,253 = 0.0000026, or 0.00026%. Think those odds are low? That represents an inflated estimate! That is because rather than randomly sampling many children, our sample includes two known winners. Selecting four children at random would almost certainly produce four non-winners, and the estimate would be 0%.

There is one additional data point that doesn't really make logical sense, but let's use it to come up with a low-end estimate by accepting that it is likely not a real number. At one point, a news reporter indicates that five tickets are hidden among the "countless billions of Wonka bars." Were there actually "countless billions" of unopened Wonka bars in the world? Consider that the most popular chocolate bar in the world—the famous Hershey bar—has annual sales of about 250 million units. And that's per year! It is very, very unlikely that there were countless billions of unopened Wonka bars from that single factory at any one time. Further, that news report is about the contest being announced, so the Wonka factory had not yet delivered the bars with the golden tickets inside. Suffice to say, this is not an accurate number.

But let's suppose that even 1 billion Wonka bars were produced in the run that contained the golden tickets. Then the odds of a single bar containing one would be 5/1,000,000,000 = 0.000000005 or 0.0000005%.

Either way, the chances of finding one were incredibly low...confirming again what Grandpa Joe told Charlie:

CHARLIE: "I've got the same chance as anybody else, haven't I?"

GRANDPA JOE: "You've got more, Charlie, because you want it more! Go on, open it!"

DOE Center Points: What They Are & Why They're Useful


Design of Experiments (DOE) is the perfect tool to efficiently determine if key inputs are related to key outputs. Behind the scenes, DOE is simply a regression analysis. What’s not simple, however, is all of the choices you have to make when planning your experiment. What X’s should you test? What ranges should you select for your X’s? How many replicates should you use? Do you need center points? Etc. So let’s talk about center points.

What Are Center Points?

Center points are simply experimental runs where your X’s are set halfway between (i.e., in the center of) the low and high settings. For example, suppose your DOE includes these X’s:

Factor settings table: Temperature and Time

The center point would then be set midway at a Temperature of 150 °C and a Time of 20 seconds.

And your data collection plan in Minitab Statistical Software might look something like this, with the center points shown in blue:

Minitab Worksheet

You can have just 1 center point, or you can collect data at the center point multiple times. This particular design includes 2 experimental runs at the center point. Why pick 2, you may be asking? We’ll talk about that in just a moment.
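Outside Minitab, you can see how such a plan comes together with a few lines of Python. The low/high settings below are hypothetical, chosen only so their midpoints match the example's center point of 150 °C and 20 seconds:

  from itertools import product

  # Hypothetical low/high settings whose midpoints match the example's
  # center point (150 degrees C and 20 seconds)
  factors = {"Temperature": (100, 200), "Time": (10, 30)}

  corner_runs = list(product(*factors.values()))            # the 2^2 corner runs
  center = tuple(sum(levels) / 2 for levels in factors.values())
  design = corner_runs + [center] * 2                       # plus 2 center points

  for run in design:
      print(dict(zip(factors, run)))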

Why Should You Use Center Points in Your Designed Experiment?

Including center points in a DOE offers many advantages:

1. Is Y versus X linear?

Factorial designs assume there’s a linear relationship between each X and Y. Therefore, if the relationship between any X and Y exhibits curvature, you shouldn’t use a factorial design because the results may mislead you.

So how do you statistically determine if the relationship is linear or not? With center points! If the center point p-value is significant (i.e., less than alpha), then you can conclude that curvature exists and use response surface DOE—such as a central composite design—to analyze your data. While factorial designs can detect curvature, you have to use a response surface design to model (build an equation for) the curvature.

Bad Fit: Factorial Design | Good Fit: Response Surface Design

And the good news is that curvature often indicates that your X settings are near an optimum Y, and you've discovered insightful results!

2. Did you collect enough data?

If you don’t collect enough data, you aren’t going to detect significant X’s even if they truly exist. One way to increase the number of data points in a DOE is to use replicates. However, replicating an entire DOE can be expensive and time-consuming. For example, if you have 3 X’s and want to replicate the design, then you have to increase the number of experimental runs from 8 to 16!

Fortunately, using replicates is just one way to increase power. An alternative way to increase power is to use center points. By adding just a few center points to your design, you can increase the probability of detecting significant X’s, and estimate the variability (or pure error, statistically speaking).

Learn More about DOE

DOE is a great tool. It tells you a lot about your inputs and outputs and can help you optimize process settings. But it’s only a great tool if you use it the right way. If you want to learn more about DOE, check out our e-learning course Quality Trainer for $30 US. Or, you can participate in a full-day Factorial Designs course at one of our instructor-led training sessions.

Conditional Formatting of Large Residuals and Unusual Combinations of Predictors


If you've used our software, you’re probably used to many of the things you can do in Minitab once you’ve fit a model. For example, after you fit a response to a given model for some predictors with Stat > DOE > Response Surface > Analyze Response Surface Design, you can do the following:

  • Predict the mean value of the response variable for new combinations of settings of the predictors.
  • Draw factorial plots, surface plots, contour plots, and overlaid contour plots.
  • Use the model to find combinations of predictor settings that optimize the predicted mean of the response variable.

In the Response Surface menu, you can see the tools that you can use with a fitted model: Predict, Factorial Plots, Contour Plot, Surface Plot, Overlaid Contour Plot, and Response Optimizer

But once your response has that little green check box that says you have a valid model, there’s even more that you can do. For example, you can also use conditional formatting to highlight two kinds of rows:

  • Unusual combinations of predictor values
  • Values of the response variable that the model does not explain well

Want to try it out? You can follow along using this data set about how deep a stream is and how fast the water flows. Open the data set in Minitab, then:

  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Response, enter Flow.
  3. In Continuous Predictors, enter Depth.
  4. In Categorical Predictors, enter Location. Click OK.

Once you’ve clicked OK, the green checkbox will appear in your worksheet to show that you have a valid model.

The green square with a white checkmark shows that the column is a response variable for a current model.

To show unusual combinations of predictors, follow these steps:

  1. Choose Data > Conditional Formatting > Statistical > Unusual X.
  2. In Response, enter Flow. Click OK.

The text and background color for the response value in row 7 changes so that you can see that it’s unusual to have a depth of 0.76 in the first stream.

The value of the response in the row with the unusual X value has red shading and red letters.

You can indicate values that aren’t fit well by the model in a similar fashion.

  1. Choose Data > Conditional Formatting > Statistical > Large Residual.
  2. In Response, enter Flow.
  3. In Style, select Yellow. Click OK.

Now, in the worksheet, the unusual combinations of predictors are red and the values that aren’t fit well by the model are yellow:

The unusual cell with the response value for the row with an unusual X value has a red theme. The cell with the response value for a row that the model does not fit well has a yellow theme.

Not all of the ways that Minitab can conditionally format depend on the model. If you’re ready for more, take a look at the online support center to see examples of these other uses of conditional formatting.

High Cpk and a Funny-Looking Histogram: Is My Process Really that Amazing?


Here is a scenario involving process capability that we’ve seen from time to time in Minitab's technical support department. I’m sharing the details in this post so that you’ll know where to look if you encounter a similar situation.

You need to run a capability analysis. You generate the output using Minitab Statistical Software. When you look at the results, the Cpk is huge and the histogram in the output looks strange:

What’s going on here? The Cpk seems unrealistic at 42.68, the "within" fit line is tall and narrow, and the bars on the histogram are all smashed down. Yet if we use the exact same data to make a histogram using the Graph menu, we see that things don’t look so bad:

So what explains the odd output for the capability analysis?

Notice that the ‘within subgroup’ variation in the capability output is represented by the tall dashed line in the middle of the histogram.  This is the StDev (Within) shown on the left side of the graph. The within subgroup variation of 0.0777 is very small relative to the overall standard deviation. 

So what is causing the within subgroup variation to be so small? Another graph in Minitab can give us the answer: The Capability Sixpack. In the case above, the subgroup size was 1 and Minitab’s Capability Sixpack in Stat> Quality Tools> Capability Sixpack> Normal will plot the data on a control chart for individual observations, an I-chart:

Hmmm...this could be why, in Minitab training, our instructors recommend using the Capability Sixpack first.

In the Capability Sixpack above, we can see that the individually plotted values on the I-chart show an upward trend, and it appears that the process is not stable and in control (as it should be for data used in a capability analysis).  A closer look at the data in the worksheet clearly reveals that the data was sorted in ascending order:

Because the within-subgroup variation for data not collected in subgroups is estimated based on the moving ranges (average of the distance between consecutive points), sorting the data causes the within-subgroup variation to be very small. With very little within-subgroup variation we see a very tall, narrow fit line that represents the within subgroup variation, and that is ‘smashing down’ the bars on the histogram. We can see this by creating a histogram in the Graph menu and forcing Minitab to use a very small standard deviation (by default this graph uses the overall standard deviation that is used when calculating Ppk): Graph> Histogram > Simple, enter the data, click Data View, choose the Distribution tab, check Fit distribution and for the Historical StDev enter 0.0777, then click OK and now we get:

Mystery solved!  And if you still don’t believe me, we can get a better looking capability histogram by randomizing the data first (Calc> Random Data> Sample From Columns):

Now if we run the capability analysis using the randomized data in C2 we see:

A note of caution: I’m not suggesting that the data for a capability analysis should be randomized. The moral of the story is that the data in the worksheet should be entered in the order it was collected so that it is representative of the normal variation in the process (i.e., the data should not be sorted). 
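To see the moving-range arithmetic for yourself, here is a small Python sketch (assuming numpy) of the usual individuals-data estimate: the average moving range divided by the unbiasing constant d2 = 1.128. Sorting the data shrinks every moving range, and therefore the within estimate, while the overall standard deviation is unchanged:

  import numpy as np

  def stdev_within(data):
      """Within-subgroup stdev for individuals data: MR-bar / d2, d2 = 1.128."""
      moving_ranges = np.abs(np.diff(data))
      return moving_ranges.mean() / 1.128

  rng = np.random.default_rng(7)
  x = rng.normal(10, 1, 100)            # a stable process with sigma = 1

  print(stdev_within(x))                # close to 1, as it should be
  print(stdev_within(np.sort(x)))       # sorted data: tiny consecutive gaps
  print(x.std(ddof=1))                  # overall stdev: the same either way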

Too bad our Cpk doesn’t look as amazing as it did before…now it's time to get to work with Minitab to improve our Cpk!


Can Regression and Statistical Software Help You Find a Great Deal on a Used Car?


You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal?  By analyzing the data you have available.  

Let's look at how this works using the Assistant in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.

And it's probably a lot easier than you think. 

A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose Assistant > Regression…

At this point, if you aren’t very comfortable with regression, the Assistant makes it easy to select the right option for your analysis.

A Decision Tree for Selecting the Right Analysis

We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.

This data set includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.

We don’t know which of these factors may have a significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include—but we don’t need to. Just fill out the dialog box as shown.

Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.

Interpreting Regression Results in Plain Language

The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. 

The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides 2 equations, one for vehicles that include a free CARFAX report, and one for vehicles that do not.

We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354).  This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).

Second, each additional mile added to the car decreases its expected price by roughly 8 cents, while each year added to the car’s age decreases the expected price by $2,357.

The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. 

Residuals versus Fitted Values

While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.

For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values.  If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.

In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price—exactly the information you’re looking for! The Assistant marks large residuals in red, making them very easy to find. And three of those residuals—which appear in light blue above because we’ve selected them—appear to be very far below the asking price predicted by the regression analysis.

Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. 

Sure enough, the records for those vehicles reveal that two of them have severe collision damage.

But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!

With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.
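If your data lives outside Minitab, the same bargain-hunting logic takes only a few lines with pandas and statsmodels. The file name, column names, and model terms below are hypothetical stand-ins; the Assistant selects its own terms automatically:

  import pandas as pd
  import statsmodels.formula.api as smf

  cars = pd.read_csv("used_cars.csv")   # Price, Mileage, Age, TechPkg, FreeCarfax

  # A simplified model; interaction and quadratic terms could be added
  model = smf.ols("Price ~ Mileage + Age + TechPkg * FreeCarfax", data=cars).fit()

  # Residual = actual price minus predicted price, so the most negative
  # residuals are the cars priced furthest below what the model expects
  cars["residual"] = model.resid
  print(cars.nsmallest(3, "residual"))  # the candidate bargains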

 

On Paying Bills, Marriage, and Alert Systems


When I blogged about automation back in March, I made my husband out to be an automation guru. Well, he certainly is. But what you don’t know about my husband is that while he loves to automate everything in his life, sometimes he drops the ball. He’s human; even I have to cut him a break every now and then.

On the other hand, instances of hypocrisy in his behavior tend to make for a good story. So here we are again.

On Paying Bills

When we married 5 years ago and began combining our bank accounts, I learned a few things about my husband. Nothing that I haven’t already shared with you. Because he loves automation, it came as no surprise to me that all his accounts resided in a single online repository (mint.com) where he could view his net worth—assets such as his home and car value, and debts including the loan left on his home and bills and credit card expenses that needed to be paid. He’d also made sure to automate the payment of all loans, utility bills, and credit cards—and the respective account would notify him when a payment was made.

This mint.com account served as one dashboard view of all possible accounts he would otherwise have to access independently to see statements and make payments. It was genius! 


He could set up savings goals, budgets, email alerts for credit card payment reminders and notification of payment, suspicious account activity, and just about any other miscellaneous charge or activity or change in spending habits. It really did make life easier.

Until I entered the picture.

On Marriage

We married, I synced my bank accounts, and we combined cash. I scoured his historical data to observe spending habits—areas where we could save money (Taco Bell topped the ‘high spending’ for the Food/Dining category). As I began poking around his accounts, I noticed a monthly fee his Chase Freedom Visa credit card was charging him. I asked him about the fee; he pleaded ignorance. When I investigated further, I discovered that he’d been charged this fee for years, since he first got the credit card.

I researched online and discovered that other cardholders had complained of being erroneously enrolled in a protection program when they first got their Chase Freedom card, and were being charged a similar fee of varying amounts monthly. Turns out this monthly fee was a percentage of monthly spending—and the Chase Freedom Visa credit card incentivized a cardholder to make all his purchases with that card, given its offer of 5% cash back on all purchases at the time.

Needless to say, I wanted that money back. No less than a few minutes later, we were on the phone with Chase disputing the program enrollment and monthly charges. They acknowledged their error and refunded us the money lost over a span of several years.

The lesson in all of this? Marry someone who’s not afraid to dig through your historical data.

On Alert Systems

More seriously, automating processes or workflows is incredibly helpful, but without the proper attention and alert systems in place, you may still encounter holes in the story. Automation and alerts must go hand-in-hand to be effective—and as a consumer of the information you’re automating, you still must be invested enough to look at the big picture.

For my husband, the beauty in automating his bill payments and aggregating all his accounts on mint.com was to save the time he'd otherwise spend paying bills separately and checking cash flows in multiple different accounts. But he failed to set up alerts about important aspects of the process he was automating, and he failed to check in on his process from time to time. Mint.com provides an incredibly useful dashboard to give you the big picture overview of your accounts and your net worth; it also provides a plethora of alert options that spare you from digging for red flags after an undesirable event has already become a regular occurrence in the process (like I did). But if you don't check the status of the system or use its full automation potential, the system is only as good as its inputs until you revisit it or tweak it.

This is just one piece of the puzzle. Alert systems offer so much more!

  1. Awareness—setting alerts through mint.com with regard to miscellaneous fees would have offered insight about the credit card program my husband had been erroneously enrolled in.
  2. Immediate Feedback—the first time a fee was charged, he would have been able to take immediate action rather than waiting years later for his wife to discover the charge (manually, mind you).
  3. Time Saver—aside from automating bill pay and combining all accounts into a single repository for a big picture view of one’s financial status (which is certainly a time-saver in reviewing accounts and paying bills in various locations), an alert system would have saved me a lot of time in digging through my husband’s financial data to understand the origin of the fee Chase was charging him.
  4. Money Saver—while we were refunded all the money charged in monthly fees by Chase, clearly an alert system would have been a more foolproof way to save money in the first place. Alerts are also effective in ensuring bill pay occurs on time, notifying you when a statement has been prepared, when the bill is due, and when the bill has been paid.

As process engineers or quality managers in the manufacturing world, you are very close to your process and its inputs. You want to know when something goes wrong, right when it happens. You don’t want a consumer to discover a flaw in a part or product you manufactured and sold years before, only to be faced with product recalls, customer reimbursements, time and money invested to re-manufacture and replace the defective product for unhappy customers, and in some cases, lawsuits. The stakes are high.

Minitab offers a solution to this pain point in its Real-Time SPC dashboard. The dashboard is completely powered by Minitab Statistical Software, taking the graphs and output you know and love and placing them on customized dashboard views that show the current state of your processes. The dashboard gives you a big picture view of your processes across all your production sites, for instance, and highlights where improvements can be made. You can incorporate any graph or analysis you want—such as histograms, control charts, or process capability analysis. You can automatically generate quality reports about your processes, and set up any alert that will help you respond to defects faster.


In the case of my marriage, alert systems are certainly practical from a financial standpoint. But in the world of manufacturing, ensuring alerts are set up around your automated processes has far-reaching implications, as the time- and money-saving elements of alert systems greatly impact a company’s bottom line. To learn more about how Minitab can help you, contact us at Sales@minitab.com.

And if you’ve ever thought twice about whether or not you should marry, let this story be an encouragement to you—you may actually find a spouse who can make you richer.

 

One-Sample t-test: Calculating the t-statistic is not really a bear


While some posts in our Minitab blog focus on understanding t-tests and t-distributions, this post will focus more simply on how to hand-calculate the t-value for a one-sample t-test (and how to replicate the p-value that Minitab gives us).

The formulas used in this post are available within Minitab Statistical Software by choosing the following menu path: Help> Methods and Formulas> Basic Statistics> 1-sample t.

The null and three alternative hypotheses for a one-sample t-test are shown below:

The default alternative hypothesis is the last one listed: the true population mean is not equal to the hypothesized mean, and this is the option used in this example.

To understand the calculations, we’ll use a sample data set available within Minitab.  The name of the dataset is Bears.MTW, because the calculation is not a huge bear to wrestle (plus who can resist a dataset with that name?).  The path to access the sample data from within Minitab depends on the version of the software.

For the current version of Minitab, Minitab 17.3.1, the sample data is available by choosing Help> Sample Data.

For previous versions of Minitab, the data set is available by choosing File> Open Worksheet and clicking the Look in Minitab Sample Data folder button at the bottom of the window.

For this example, we will use column C2, titled Age, in the Bears.MTW data set, and we will test the hypothesis that the average age of bears is 40. First, we’ll use Stat> Basic Statistics> 1-sample t to test the hypothesis:

After clicking OK above we see the following results in the session window:

With a high p-value of 0.361, we don’t have enough evidence to conclude that the average age of bears is significantly different from 40. 

Now we’ll see how to calculate the T value above by hand.

The T value (0.92) shown above is calculated using the following formula in Minitab:

The output from the 1-sample t test above gives us all the information we need to plug the values into our formula:

Sample mean: 43.43

Sample standard deviation: 34.02

Sample size: 83

We also know that our target or hypothesized value for the mean is 40.

Using the numbers above to calculate the t-statistic we see:

t = (43.43 - 40) / (34.02/√83) = 0.918542
(which rounds to 0.92, as shown in Minitab’s 1-sample t-test output)

Now, we could dust off a statistics textbook and use it to compare our calculated t of 0.918542 to the corresponding critical value in a t-table, but that seems like a pretty big bear to wrestle when we can easily get the p-value from Minitab instead.  To do that, I’ve used Graph> Probability Distribution Plot> View Probability:

In the dialog above, we’re using the t distribution with 82 degrees of freedom (we had an N = 83, so the degrees of freedom for a 1-sample t-test is N-1).  Next, I’ve selected the Shaded Area tab:

In the dialog box above, we’re defining the shaded area by the X value (the calculated t-statistic), and I’ve typed in the t-value we calculated in the X value field. This was a 2-tailed test, so I’ve selected Both Tails in the dialog above.

After clicking OK in the window above, we see:

We add together the probabilities from both tails, 0.1805 + 0.1805 and that equals 0.361 – the same p-value that Minitab gave us for the 1-sample t test. 
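If you'd rather check the arithmetic in code, a few lines of Python (assuming scipy is installed) reproduce both the t-statistic and the two-tailed p-value:

  from math import sqrt
  from scipy import stats

  x_bar, s, n, mu0 = 43.43, 34.02, 83, 40  # sample stats from the output above

  t = (x_bar - mu0) / (s / sqrt(n))        # t-statistic for a 1-sample t-test
  p = 2 * stats.t.sf(abs(t), df=n - 1)     # area in both tails, df = N - 1 = 82

  print(f"t = {t:.4f}, p = {p:.3f}")       # t is about 0.9185, p is about 0.361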

That wasn’t so bad—not a difficult bear to wrestle at all!

Model Fit: Don't be Blinded by Numerical Fundamentalism


Statistics is all about modelling. But that doesn’t mean strutting down the catwalk with a pouty expression. 

It means we’re often looking for a mathematical form that best describes relationships between variables in a population, which we can then use to estimate or predict data values, based on known probability distributions.

To aid in the search and selection of a “top model,” we often utilize calculated indices for model fit.

In a time series trend analysis, for example, mean absolute percentage error (MAPE) is used to compare the fit of different time series models. Smaller values of MAPE indicate a better fit.

You can see that in the following two trend analysis plots:

low Mape

high MAPE

The MAPE value is much lower in top plot for Model A (9.37) than it is for the bottom plot with Model B (24.84). So Model A fits its data better than Model B fits its dat—ah…er, wait…that doesn’t seem right.. I mean… Model B looks like a closer fit, doesn’t it…hmmm…do I have it backwards…what the...???

Step back from the numbers!

Statistical indices for model fit can be great tools, but they work best when interpreted using a broad, flexible attitude, rather than a narrow, dogmatic approach. Here are a few tips to make sure you're getting the big picture: 

  • Look at your data

No, don't just look. Gaze lovingly. Stare rudely. Peer penetratingly. Because it's too easy to get carried away by calculated stats. If you graphically examine your data carefully, you can make sure that what you see, on the graph, is what you get, with the statistics. Looking at the data for these two trend models, you know the MAPE value isn’t telling the whole story.

  • Understand the metric

MAPE measures the absolute percentage error in the model. To do that, it divides the absolute error of the model by the actual data values. Why is that important to know? If there are data values close to 0, dividing by those very small fractional values greatly inflates the value of MAPE.

That’s what’s going on in Model B. To see this, look what happens when you add 200 to each value in the data set for Model B—and fit the same model. 

MAPE lowest

Same trend, same fit, but now the absolute percentage of error is more than 25 times lower (0.94611) than it was with the data that included values close to 0—and more than 10 times lower than the MAPE value in Model A. That result makes more sense, and is coherent with the model fit shown on the graph. (A small sketch after these tips reproduces this effect.)

  • Examine multiple measures

MAPE is often considered the go-to measurement for the fit of time series models. But notice that there are two other measures of model error in the trend plots: MAD (mean absolute deviation) and MSD (mean squared deviation). In both trend plots for Model B, those values are low and identical. They’re not affected by values close to 0.

Examining multiple measures helps ensure you won't be hoodwinked by a quirk of a single measure.

  • Interpret within the context

Generally you’re safest using measures of fit to compare the fits of candidate models for a single data set. Comparing model fits across different data sets, in different contexts, leads to invalid comparisons. That’s why you should be wary of blanket generalizations (and you’ll hear them), such as “every regression model should have an R-squared of at least 70%.” It really depends on what you’re modelling, and what you’re using the model for. For more on that, read this post by Jim Frost on R-squared.
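Here is the sketch promised above: a minimal Python illustration of how MAPE inflates when actual values sit near zero, while MAD and MSD are unaffected. The toy numbers are invented; only the behavior matters:

  import numpy as np

  def fit_metrics(actual, fitted):
      actual, fitted = np.asarray(actual, float), np.asarray(fitted, float)
      err = actual - fitted
      return {"MAPE": np.mean(np.abs(err / actual)) * 100,  # divides by actuals
              "MAD": np.mean(np.abs(err)),
              "MSD": np.mean(err ** 2)}

  fitted = np.array([0.5, 1.5, 2.5, 3.5])
  actual = fitted + np.array([0.1, -0.1, 0.1, -0.1])  # identical absolute errors

  print(fit_metrics(actual, fitted))              # data near 0: MAPE balloons
  print(fit_metrics(actual + 200, fitted + 200))  # shifted by 200: MAPE collapses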

Finally, a good model is more than just a perfect fit

Don't let small numerical differences in model fit be your be-all and end-all. There are other important practical considerations, as shown by these models.

Simple model vs. complex model

All About Run Charts


I blogged a few months back about three different Minitab tools you can use to examine your data over time. Did you know that you can also use a simple run chart to display how your process data changes over time? Of course those “changes” could be evidence of special-cause variation, which a run chart can help you see.

What’s special-cause variation, and how’s it different from common-cause variation?

You know that variation occurs in all processes, and common-cause is just that—a natural part of any process. Special-cause variation comes from outside the system and causes recognizable patterns, shifts, or trends in the data. A run chart shows graphically whether special causes are affecting your process.

A process is in control when special causes of variation have been eliminated.

How can I create a run chart in Minitab?

It’s easy! Follow along with this example:

Suppose you want to be sure the widgets your company makes are within the correct weight specifications requested by your customer. You’ve collected a data set that contains weight measurements from the injection molding process used to create the widgets (Open the worksheet WEIGHT.MTW that’s included with Minitab’s sample data sets—in Minitab 17.2, open Help > Sample Data).

To evaluate the variation in weight measurements, you create a run chart in Minitab:

  1. Choose Stat > Quality Tools > Run Chart.
  2. In Single column, enter Weight.
  3. In Subgroup size, enter 1. Click OK.

Here’s what Minitab creates for you:

Minitab Run Chart

*Note that Minitab plots each data point in the order it was collected and draws a horizontal reference line at the median.

What does my run chart tell me about my data?

You can examine the run chart to see if there are any obvious patterns, but Minitab includes two tests for randomness that provide information about non-random variation due to trends, oscillation, mixtures, and clustering in your data. Such patterns indicate that the variation observed is due to special-cause variation.

In the example above, because the approximate p-values for clustering, mixtures, trends, and oscillation are all greater than the significance level of 0.05, there’s no indication of special-cause variation or non-randomness. The data appear to be randomly distributed with no temporal patterns, but to be certain, you should examine the tests for runs about the median and runs up or down. However, it looks as if the variation in widget weights will be acceptable to your customer.
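If you are curious about the machinery behind those p-values, here is a rough Python sketch of a runs-about-the-median test using the standard normal approximation. It is a simplified two-sided version for illustration only; Minitab reports separate one-sided p-values for clustering and mixtures:

  import numpy as np
  from scipy.stats import norm

  def runs_about_median(data):
      data = np.asarray(data, float)
      med = np.median(data)
      above = data[data != med] > med          # drop points exactly on the median
      n1, n2 = above.sum(), (~above).sum()
      runs = 1 + np.sum(above[1:] != above[:-1])

      expected = 2 * n1 * n2 / (n1 + n2) + 1
      variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                  / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
      z = (runs - expected) / np.sqrt(variance)
      # Too few runs suggests clustering; too many suggests mixtures
      return runs, expected, 2 * norm.sf(abs(z))

  rng = np.random.default_rng(3)
  print(runs_about_median(rng.normal(10, 1, 50)))   # random data: large p-value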

Tell me more about these nonrandom patterns that can be identified by a run chart …

There are four basic patterns of nonrandomness that a run chart will detect—mixture, cluster, oscillating, and trend patterns.

A mixture is characterized by an absence of points near the center line:

http://support.minitab.com/en-us/minitab/17/runchart_mixture.png

Clusters are groups of points in one area of the chart:

http://support.minitab.com/en-us/minitab/17/runchart_cluster.png

Oscillation occurs when the data fluctuates up and down:

http://support.minitab.com/en-us/minitab/17/runchart_oscillation.png

A trend is a sustained drift in the data, either up or down:

http://support.minitab.com/en-us/minitab/17/runchart_trend.png

To learn more about what these patterns can tell you about your data, visit run chart basics on Minitab 17 Support. 

Have You Accidentally Done Statistics?


Have you ever accidentally done statistics? Not all of us can (or would want to) be “stat nerds,” but the word “statistics” shouldn’t be scary. In fact, we all analyze things that happen to us every day. Sometimes we don’t realize that we are compiling data and analyzing it, but that’s exactly what we are doing. Yes, there are advanced statistical concepts that can be difficult to understand—but there are many concepts that we use every day that we don’t realize are statistics.

I consider myself a student of baseball, so my example of unknowingly performing statistical procedures concerns my own experiences playing that game.

My baseball career ended as a 5’7” college freshman walk-on. When I realized that my ceiling as a catcher was a lot lower than that of my 6’0”-6’5” teammates, I hung up my spikes. As an adult, while finishing my degree in Business Statistics, I had the opportunity to shadow a couple of scouts from the Major League Baseball Scouting Bureau. Yes, I’ve seen Moneyball, and I know that traditional scouting methods are reputed to conflict with the methods of stat nerds like myself, but as a former player I wanted to see what these scouts were looking at.

My first day with the scouts, I found out they were traditional baseball guys. They didn’t believe data could tell you how good a player is better than observation could, and ultimately they didn't think statistics were important to what they do.

I found their thinking to be a little off, and a little funny. Although they didn’t believe in statistics, the tools they used for their jobs actually quantify a player's attributes. I watched as they used a radar gun to measure pitch speed, a stopwatch to measure running speed, and a notepad to record their measurements (they didn’t realize they were compiling data). As one of the scouts chatted with me, asking how statistics were going to be brought into baseball, he was making a dot plot by hand of the pitcher's pitch speeds to find the pitcher's velocity distribution.

After I explained to him that he was unknowingly creating a dot plot (like the one I created for Raisel Iglesias using Minitab, which has a bimodal distribution), we started talking about grading players’ skills. The scouts would grade players’ hitting, power, running, arm strength, and fielding ability. They used a numeric grading system from 20 to 80 for each characteristic, with 20 being the lowest, 50 being average, and 80 being elite. After they compiled this data, they would analyze it to assign the players grades, and they would create a report with those grades to convey to others what they saw in each player.

I was amazed at how these scouts—true, old-school baseball guys who said stats weren’t important for their jobs—were compiling data and analyzing it for their reports. 

A few of the other statistical ideas the scouts were (accidentally) concerned with included the sample size of observations of a player, comparison analysis, and predicting where a player falls within his physical development (regression).

Like the baseball scouts, many of us are unwittingly doing statistics. We run into data all day long without recognizing that we can compile and analyze it. At work we worry about customer satisfaction, wait time, average transaction value, cost ratios, efficiency, and so on. And while many people get intimidated when we use the word "statistics," we don’t need advanced degrees to embrace observing, compiling data, and making solid decisions based on our analysis.

So, are you accidentally doing statistics? If you want to get beyond accidentally doing statistics and analyze a little more deliberately, Minitab has many tools, like the Assistant menu and StatGuide, to help you on your stats journey.

Analyzing the Jaywalking Habits of New England Wildlife


My recent beach vacation began with the kind of unfortunate incident that we all dread: killing a distant relative.  

It was about 3 a.m. Me, my two sons, and our dog had been on the road since about 7 p.m. the previous day to get to our beach house on Plum Island, Massachusetts. Google maps said our exit was coming up and that we were only about 15 minutes away from our palace. Buoyed by that projection, I sat a little taller in my seat.

Is that the salty sea air filling my nostrils? I thought to myself. Is that a refreshing ocean breeze cooling the air?

And then:

Is that a f—thumpity bump bump bump—ox that just disappeared under my car?!

"I think that was a fox, dad." My son answered my question before I could ask it.

"That's what I thought, too. Darn. Kind of ironic. And not in a good way."

"Yeah, way to go, dad," my other son added. 

Everyone's a critic.

The irony is that my last name is Fox. And I've always kind of identified with the handsome, intelligent, and resourceful creatures. I couldn't feel too bad, though; there was nothing I could have done about it. The poor critter had been sprinting across the highway. No sooner had its small frame popped into the glow of my headlamps than it disappeared into the empty void under our feet. Oh well, at least for him it was over quickly. And at least we were almost to the beach.

Before I could ponder this potential omen too long, we came to our exit. There, in the middle of the exit ramp, were the 2-dimensional remains of what looked to have been, in life, another fox. Apparently, we were traveling through an area of dense foxes. By which I mean that there was a high density of foxes in the area, not that the foxes in the area were highly dense. Although, truth be told, I was feeling a little dense myself at the time. Did I mention that it was now 3 a.m.? 

We continued onto a back-country road that Google maps promised would lead us to our beach house. Is that a marsh off to the left just up ahead? I thought to myself. Are those sea grasses waving in the gentle breeze?

Is that the oil light glowing on my console?

"Oh crap."

Stepping out of the car, I noticed the smells of sea air and motor oil mingling with the scent of the forest. I had hoped that the warning light was just an electrical glitch. However, a casual inspection confirmed that the oil that should be inside the engine had been working its way outside of the engine, where it is considerably less effective. I was reminded of the words of a noted transportation engineer: "If I push 'er any further cap'n, the engine's gonna blow!"

This sentiment was echoed by the tow truck driver as well. As he descended from his cab to assess the scene, he exclaimed, "You left quite a trail. Can't be much oil left in that engine." I told him what had happened. He scratched his chin and asked, "Did you say a fox? That's funny, because I towed a customer last week who hit a fox in a rental car. Busted the oil filter. What do you know?"

As we stood by the side of the road waiting for our taxi, dawn's first light broke slowly over the marsh, the birds began singing to greet the new day, and the mosquitoes worked persistently to move sizable quantities of blood from inside of our bodies to outside of our bodies. Where it is considerably less effective. Even so, it was kind of a nice moment. Moved by a surprising sense of peace, I turned to my sons.

"I think I know what this all means. I think that perhaps my spirit animal appeared in physical form to test me. To remind me that—to a large extent—happiness is a choice. And if I allow circumstance to rob me of my happiness, that, too, is a choice."

"Spirit animal, huh?" As he spoke I could actually hear my son's eyes rolling back in his head.

My other son chimed in, "If he wasn't a spirit before, he is now."

Everyone's a critic. 

The rest of our vacation went swimmingly. (Pun intended.) In the end, the momentary hassle and added expense of the incident didn't detract at all from our enjoyment of the trip. However, I was curious about the confluence of jaywalking wildlife, so I started doing a little research and learned that some states are actively collecting data on such accidents. I found that Massachusetts has a web page where you can report animal collisions, so I contributed my data for the cause.

I also found out that California and Maine actually enlist and train "citizen scientists" to peruse roadways in a coordinated effort to determine where animals are most frequently hit, and what kinds of animals are hit in each location. This is important data, because animal crossings represent a significant hazard to motorists and wildlife alike. Knowing what kinds of animals are frequently hit in different locations can help authorities focus efforts to introduce culverts, bridges, and other means of safe passage for critters so they can get where they need to go safely, without venturing onto the blacktop.

You can read the details of a three-year Maine study and explore an interactive map on the Maine Audubon web site. I thought it might be interesting to create a few graphs in Minitab Statistical Software to bring the roadkill data to life, so to speak. (Pun intended. Ill-advised, perhaps, but intended.) 

The first thing I noticed was that collisions with foxes are definitely not that unusual. The following bar chart shows the number of each species found during the data collection. 

Bar chart of counts by species

The web site also gives data for whether the animals found during data collection were alive or dead. As this stacked bar chart makes clear, animals with wings fare much better than earthbound critters when they encounter an automobile.

Stacked bar chart by animal group

The same trend is clear in this pie chart. The red slice in each pie shows the proportion of animals that survived the encounter. For birds, the red slice is much bigger than the blue slice.  

Pie chart of dead vs. live by group

Next time I encounter a spirit animal, or any animal on the road, I hope it has wings. 


Analyzing the History of Olympic Events with Time Series


The Olympic Games are about to begin in Rio de Janeiro. Over the next 16 days, more than 11,000 athletes from 206 countries will be competing in 306 different events. That's the most events ever in any Olympic Games. It's almost twice as many events as there were 50 years ago, and exactly three times as many as there were 100 years ago.

Since the number of Olympic events has changed over time, this makes it a great data set for a time series analysis.

A time series is a sequence of observations over regularly spaced intervals of time. The first step when analyzing time series data is to create a time series plot to look for trends and seasonality. A trend is a long-term tendency of a series to increase or decrease. Seasonality is the periodic fluctuation in the time series within a certain period—for example, sales for a store might increase every year in November and December. Here is a time series plot of the number of Olympic events since 1896.

Time Series Plot

There is clearly an upward trend, but no seasonal pattern. The data are also a little choppy at the beginning. Part of the explanation is that the data points are not evenly spaced: most Olympic Games are 4 years apart, but a few of them are just 2 years apart, and during World War I and World War II there were 8-year and 12-year gaps, respectively. Since time series data should be evenly spaced over time, we'll only look at data from 1948 on, when the Olympics started being held every 4 years without any interruptions.

Time Series Plot

Now that we have an evenly spaced series that clearly exhibits a trend, we can use a trend analysis in Minitab Statistical Software to model the data. With a trend analysis, you can use four different types of models: linear, quadratic, exponential growth, and S-curve. We'll analyze our data using both the linear and S-curve models. An additional time series analysis you can use when your data exhibit a trend is double exponential smoothing, so we'll use that method too.

Trend Analysis

Trend Analysis

Double Exponential Smoothing

You can use the accuracy measures (MAPE, MAD, MSD) to compare the fits of different time series models. For all three of these statistics, smaller values usually indicate a better-fitting model. If a single model does not have the lowest values for all three statistics, MAPE is usually the preferred measurement.
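To make those measures concrete, here's a minimal Python sketch of double exponential smoothing (Holt's method) together with the three accuracy measures. The smoothing weights and initial estimates are arbitrary placeholders; Minitab can optimize the weights for you, so the numbers from this sketch won't match its output exactly.

```python
import numpy as np

def double_exp_smoothing(y, alpha=0.2, gamma=0.2):
    """Holt's double exponential smoothing: returns one-step-ahead
    fitted values for a series with a trend."""
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]     # simple initial estimates
    fits = []
    for obs in y:
        fits.append(level + trend)       # one-step-ahead forecast
        new_level = alpha * obs + (1 - alpha) * (level + trend)
        trend = gamma * (new_level - level) + (1 - gamma) * trend
        level = new_level
    return np.array(fits)

def accuracy_measures(y, fits):
    """MAPE, MAD, and MSD of the one-step-ahead fits."""
    y, fits = np.asarray(y, float), np.asarray(fits, float)
    err = y - fits
    mape = np.mean(np.abs(err / y)) * 100   # assumes no zero observations
    mad = np.mean(np.abs(err))
    msd = np.mean(err**2)
    return mape, mad, msd
```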

For the time series of Olympic event data, the S-curve model has the lowest values of MAPE and MAD, while the double exponential smoothing method has the lowest value for MSD. Based on the "MAPE breaks all ties" guideline, it appears that the S-curve model is the one we want to use.

However, accuracy measures shouldn't be the sole criteria you use to select a model. It's also important to examine the fit of the model, especially at the end of the series. And if the last 5 Olympics are any indication, it appears that the trend of adding large quantities of events to the Olympic Games is coming to an end. In the last 16 years, only 6 events have been added.

The double exponential smoothing model appears to have adjusted for this change, whereas the two trend analysis models have not. Given this additional consideration, the double exponential smoothing model is the one we should pick, especially if we want to use it to forecast future observations.

And now that we've settled on a model, we can sit back, relax, and watch all 918 medals be won. Let the games begin!

 

Correlation: What It Shows You (and What It Doesn't)


Often, when we start analyzing new data, one of the very first things we look at is whether certain pairs of variables are correlated. Correlation can tell if two variables have a linear relationship, and the strength of that relationship. This makes sense as a starting point, since we're usually looking for relationships and correlation is an easy way to get a quick handle on the data set we're working with. 

Sharks and ice cream: we'll talk about the correlation between these two factors later in the post.
What Is Correlation?

How do we define correlation? We can think of it in terms of a simple question: when X increases, what does Y tend to do? In general, if Y tends to increase along with X, there's a positive relationship. If Y decreases as X increases, that's a negative relationship.

Correlation is defined numerically by a correlation coefficient, a value that ranges from -1 to +1. A coefficient of -1 is perfect negative linear correlation: a straight line trending downward. A coefficient of +1 is, conversely, perfect positive linear correlation. A coefficient of 0 means no linear correlation at all.

Making a scatterplot in Minitab can give you a quick visualization of the correlation between variables, and you can get the correlation coefficient by going to Stat > Basic Statistics > Correlation... Here are a few examples of data sets that a correlation coefficient can accurately assess.

Scatterplot: positive correlation

This graph shows a positive correlation of 0.7, close to 1. As you can see from the scatterplot, it's a fairly strong linear relationship: as the values of X increase, Y tends to increase as well. Below is a similar plot, but here the relationship is negative.

Scatterplot: negative correlation

Correlation's Limits

However, there are some drawbacks and limitations to simple linear correlation. A correlation coefficient can only tell whether your two variables have a linear relationship. Take, for example, the following chart, which has a correlation coefficient of about 0; we can pretty easily see that there isn't much of a relationship at all:

Scatterplot: no relationship (correlation near 0)

However, now take a look at this graph, in which there is an obvious relationship, but not a linear one. Notice that the correlation coefficient is also 0 in this case:

Scatterplot: nonlinear relationship (correlation also 0)

This is what you have to keep in mind when interpreting correlations. The correlation coefficient will only detect linear relationships. Just because the correlation coefficient is near 0, it doesn't mean that there isn't some type of relationship there. 
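Here's a quick way to see this numerically, as a minimal Python sketch with made-up data: a perfect but nonlinear (quadratic) relationship yields a correlation coefficient of essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 101)        # values symmetric around zero
y_line = 2 * x + 1                 # perfect linear relationship
y_quad = x ** 2                    # perfect relationship, but not linear

print(np.corrcoef(x, y_line)[0, 1])  # 1.0: perfect positive linear correlation
print(np.corrcoef(x, y_quad)[0, 1])  # ~0: the coefficient misses the pattern
```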

The other thing to remember is something most of us hear soon after we begin exploring data—that correlation does not imply causation. Just because X and Y are correlated in some way does not mean that X causes a change in Y, or vice versa.

Here's my favorite example for this. If we look at two variables, shark attacks and ice cream sales, we know intuitively that there's no way one variable has a cause-and-effect impact on the other. However, both shark attacks and ice cream sales will have greater numbers in summer months, so they will be strongly correlated with each other. Be careful not to fall into this trap with your data!

Correlation has a lot of benefits, and it is still a good starting point in a number of different cases, but it's important to know its limitations as well. 

When Should You Mistrust Statistics?


Figures lie, so they say, and liars figure. A recent post at Ben Orlin's always-amusing mathwithbaddrawings.com blog nicely encapsulates why so many people feel wary about anything related to statistics and data analysis. Do take a moment to check it out; it's a fast read.

In all of the scenarios Orlin offers in his post, the statistical statements are completely accurate, but the person offering the statistics is committing a lie of omission by not putting the statement in context. Holding back critical information prevents an audience from making an accurate assessment of the situation.

Ethical data analysts know better.

Unfortunately, unethical data analysts know how to spin outcomes to put them in the most flattering, if not the most direct, light. Done deliberately, that's the sort of behavior that leads many people to mistrust statistics completely.

Lessons for People Who Consume Statistics

So, where does this leave us as consumers of statistics? Should we mistrust statistics? The first question to ask is whether we trust the people who deliver statistical pronouncements. I believe most people try to do the right thing.

However, we all know that it's easy—all too easy—for humans to make mistakes. And since statistics can be confusing, and not everyone who wants or needs to analyze data is a trained statistician, great potential exists for erroneous conclusions and interpretive blunders.

Bottom line: whether their intentions are good or bad, people often cite statistics in ways that may be statistically correct, but practically misleading. So how can you avoid getting fooled?

The solution is simple, and it's one most statisticians internalized long ago, but doesn't necessarily occur to people who haven't spent much time in the data trenches:

Always look at the underlying distribution of the data.

Especially if the statistic in question pertains to something extremely important to you—like mean salary at your company, for example—ask about the distribution of the data if those details aren't volunteered. If you're told the mean or median as a number, are you also given a histogram, boxplot, or individual value plot that lets you see how the data are arranged? My colleague Michelle Paret wrote an excellent post about this. 

If someone is trying to keep the distribution of the data a mystery, then the ultimate meaning of parameters like mean, median, or mode is also unknown...and your mistrust is warranted.

Lessons for People Who Produce Statistics

As purveyors and producers of statistics, who need to communicate results with people who aren't statistically savvy, what lessons can we take from this? After reading the Math with Bad Drawings blog, I thought about it and came up with two rules of thumb.

1. Don't use statistics to obscure or deflect attention from a situation.

Most people do not deliberately set out to distort the truth or mislead others. Most people would never use the mean to support one conclusion when they know the median supports a far different story. Our conscience rebels when we set out to deceive others. I'm usually willing to ascribe even the most horrendous analysis to gross incompetence rather than outright malice. On the other hand, I've read far too many papers and reports that torture language to mischaracterize statistical findings.

Sometimes we don't get the outcomes we expected. Statisticians aren't responsible for what the data show—but we are responsible for making sure we've performed appropriate analyses, satisfied checks and assumptions, and that we have trustworthy data. It should go without saying that we are ethically compelled to report our results honestly, and...

2. Provide all of the information the audience needs to make informed decisions.

When we present the results of an analysis, we need to be thorough. We need to offer all of the information and context that will enable our audience to reach confident conclusions. We need to use straightforward language that helps people tune in, and avoid jargon that makes listeners turn off.

That doesn't mean that every presentation we make needs to be laden with formulas and extended explanations of probability theory; often the bottom line is all a situation requires. When you're addressing experts, you don't need to cover the introductory material. But if we suspect an audience needs some background to fully appreciate the results of an analysis, we should provide it. 

There are many approaches to communicating statistical results clearly. One of the easiest ways to present the full context of an analysis in plain language is to use the Assistant in Minitab. As many expert statisticians have told us, the Assistant doesn't just guide you through an analysis, it also explains the output thoroughly and without resorting to jargon.

And when statistics are clear, they're easier to trust.

 

Bad drawing by Ben Orlin, via mathwithbaddrawings.com

 

Process Capability Statistics: Cpk vs. Ppk


Back when I used to work in Minitab Tech Support, customers often asked me, “What’s the difference between Cpk and Ppk?” It’s a good question, especially since many practitioners default to using Cpk while overlooking Ppk altogether. It’s like the '80s pop duo Wham!, where Cpk is George Michael and Ppk is that other guy.

Poofy hairdos styled with mousse, shoulder pads, and leg warmers aside, let’s start by defining rational subgroups and then explore the difference between Cpk and Ppk.

Rational Subgroups

A rational subgroup is a group of measurements produced under the same set of conditions. Subgroups are meant to represent a snapshot of your process. Therefore, the measurements that make up a subgroup should be taken from a similar point in time. For example, if you sample 5 items every hour, your subgroup size would be 5.

Formulas, Definitions, Etc.

The goal of capability analysis is to ensure that a process is capable of meeting customer specifications, and we use capability statistics such as Cpk and Ppk to make that assessment. If we look at the formulas for Cpk and Ppk for normal (distribution) process capability, we can see they are nearly identical:
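In their standard form (where μ is the process mean, and USL and LSL are the upper and lower specification limits):

Cpk = min( (USL − μ) / (3 × σwithin), (μ − LSL) / (3 × σwithin) )

Ppk = min( (USL − μ) / (3 × σoverall), (μ − LSL) / (3 × σoverall) )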

The only difference lies in the denominator for the Upper and Lower statistics: Cpk is calculated using the WITHIN standard deviation, while Ppk uses the OVERALL standard deviation. Without boring you with the details surrounding the formulas for the standard deviations, think of the within standard deviation as the average of the subgroup standard deviations, while the overall standard deviation represents the variation of all the data; the sketch after the lists below shows the difference in code. This means that:

Cpk:
  • Only accounts for the variation WITHIN the subgroups
  • Does not account for the shift and drift between subgroups
  • Is sometimes referred to as the potential capability because it represents the potential your process has at producing parts within spec, presuming there is no variation between subgroups (i.e. over time)
Ppk:
  • Accounts for the OVERALL variation of all measurements taken
  • Theoretically includes both the variation within subgroups and also the shift and drift between them
  • Is where you are at the end of the proverbial day
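To make the distinction concrete, here is a minimal Python sketch. The data are made up, and the pooled standard deviation used here is only one of the within-subgroup estimators Minitab offers, so treat the numbers as illustrative.

```python
import numpy as np

def cpk_ppk(subgroups, lsl, usl):
    """Estimate Cpk and Ppk from a list of equal-size subgroups.
    Uses the pooled subgroup standard deviation as the within estimate."""
    data = np.concatenate(subgroups)
    mean = data.mean()

    # Within-subgroup: pooled standard deviation across subgroups
    sd_within = np.sqrt(np.mean([np.var(s, ddof=1) for s in subgroups]))
    # Overall: standard deviation of all measurements together
    sd_overall = data.std(ddof=1)

    cpk = min(usl - mean, mean - lsl) / (3 * sd_within)
    ppk = min(usl - mean, mean - lsl) / (3 * sd_overall)
    return cpk, ppk

# Example: 10 days of 5 measurements each, with a drifting mean (made up)
rng = np.random.default_rng(1)
subgroups = [rng.normal(10 + 0.3 * day, 0.5, size=5) for day in range(10)]
cpk, ppk = cpk_ppk(subgroups, lsl=7, usl=13)
print(f"Cpk = {cpk:.2f}, Ppk = {ppk:.2f}")  # the drift makes Ppk < Cpk
```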
Examples of the Difference Between Cpk and Ppk

For illustration, let's consider a data set where 5 measurements were taken every day for 10 days.

Example 1 - Similar Cpk and Ppk

similar Cpk and Ppk

As the graph on the left side shows, there is not a lot of shift and drift between subgroups compared to the variation within the subgroups themselves. Therefore, the within and overall standard deviations are similar, which means Cpk and Ppk are similar, too (at 1.13 and 1.07, respectively).

Example 2 - Different Cpk and Ppk

different Cpk and Ppk

In this example, I used the same data and subgroup size, but I shifted the data around, moving it into different subgroups. (Of course we would never want to move data into different subgroups in practice – I’ve just done it here to illustrate a point.)

Since we used the same data, the overall standard deviation and Ppk did not change. But that’s where the similarities end.

Look at the Cpk statistic. It’s 3.69, which is much better than the 1.13 we got before. Looking at the subgroups plot, can you tell why Cpk increased? The graph shows that the points within each subgroup are much closer together than before. Earlier I mentioned that we can think of the within standard deviation as the average of the subgroup standard deviations. So less variability within each subgroup equals a smaller within standard deviation. And that gives us a higher Cpk.

To Ppk or Not to Ppk

And here is where the danger lies in only reporting Cpk and forgetting about Ppk like it’s George Michael’s lesser-known bandmate (no offense to whoever he may be). We can see from the examples above that Cpk only tells us part of the story, so the next time you examine process capability, consider both your Cpk and your Ppk. And if the process is stable with little variation over time, the two statistics should be about the same anyway.

(Note: It is possible, and okay, to get a Ppk that is larger than Cpk, especially with a subgroup size of 1, but I’ll leave that explanation for another day.)

Taking a Stratified Sample in Minitab Statistical Software


The Centers for Medicare and Medicaid Services (CMS) updated their star ratings on July 27. It turns out the list of hospitals is a great way to look at how easy it is to get random samples from data within Minitab.

Roper Hospital in Charleston, South Carolina

Say, for example, that you wanted to look at the association between the government’s new star ratings and the safety rating scores provided by hospitalsafetyscore.org. The CMS score is about overall quality, which includes components that aren't explicitly about safety, such as the quality of the communication between patients and doctors.

The safety score judges patient safety, using components like how often patients begin antibiotics before surgery and whether the process by which doctors order medications is reliable.

The CMS score gives out 1 to 5 stars. The safety score gives out A through F grades. The two measures aren't supposed to be duplicates, but it would be interesting to know whether there's an association between being a safer hospital and being a higher-quality hospital.

The government kindly provides the ability to download all 4,788 rows of data in its star ratings, but hospitalsafetyscore.org prefers to provide information by location so that potential patients can quickly examine hospitals near them or find a particular hospital. To compare the star ratings and the safety scores, we need both values.

One solution would be to search hospitalsafetyscore.org for the names of all 4,788 hospitals in the government’s database and record all the scores we found. (Though even if we did this, we wouldn't find all of them. For example, hospitals in Maryland aren't required to provide the data hospitalsafetyscore.org uses.) However, searching 4,788 hospitals is time-consuming.

A faster solution is to study the relationship using a sample of the data. We’ll use the government’s star score data as our sampling frame.

A simple random sample

It’s easy to get a simple random sample in Minitab. If you already have the government's star data in Minitab, you can try this (or, you can skip getting it from the government and use this Minitab worksheet version I created):

  1. Choose Calc > Random Data > Sample From Columns.
  2. In Number of Rows to Sample, enter 50.
  3. In From columns, enter c1-c29. That lets you get all of the information from a row of data into your new sample.
  4. In Store sample in, enter c30-c58. Click OK.
  5. Copy the column headers from the original data to the sample data.

Now you have a sample of 50 hospitals, chosen so that each row in the original data set was equally likely to be selected.
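If your data happen to live in a CSV file rather than a Minitab worksheet, the equivalent simple random sample is a one-liner in pandas; the file name here is just a placeholder.

```python
import pandas as pd

hospitals = pd.read_csv("star_ratings.csv")       # placeholder file name
sample = hospitals.sample(n=50, random_state=42)  # each row equally likely
```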

A stratified sample

Of course, every simple random sample that you draw might not give you something representative, especially if your sample is small. For example, in the government’s star rating, only 2.82% of hospitals achieved 5 stars (102 hospitals). Even worse, nearly 25% of the hospitals in the data don't have a star rating (1,171 hospitals with no star rating).

If we do a hypergeometric probability calculation on a sample of size 50, assuming 102 events in a population of 3617, we find that roughly 25% of the random samples we could take would have 0 hospitals that achieved 5 stars. A simple random sample without any 5-star hospitals could tell us about the general association, but it wouldn’t give us much information about the expected safety ratings for hospitals that achieved 5-star rank.
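Here's that calculation as a short Python sketch using scipy's hypergeometric distribution. The population of 3617 is the 4,788 hospitals minus the 1,171 with no star rating.

```python
from scipy.stats import hypergeom

# hypergeom.pmf(k, M, n, N): probability of k successes in N draws
# from a population of M containing n successes
p_none_50 = hypergeom.pmf(0, 3617, 102, 50)    # no 5-star hospitals in n=50
p_none_100 = hypergeom.pmf(0, 3617, 102, 100)  # no 5-star hospitals in n=100
print(p_none_50, p_none_100)                   # roughly 0.24 and 0.06
```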

One way to fix the problem would be to take a larger simple random sample. If you take a sample of size 100 instead of a sample of size 50, then the probability that you don’t get any 5-star hospitals is almost down to 5%. Another method would be to modify your sampling scheme to make sure that you get some of every hospital ranking into your sample. Usually, you break your sample down into different groups, or strata. Then you take a simple random sample from each stratum. At the end, you combine your multiple simple random samples to form your final sample.

The exact way that you determine how many observations to take from each stratum depends on your goals, but let’s say that for this case, we’re going to get 10 hospitals for each star rating. We start by dividing the data:

  1. Choose Data > Split Worksheet.
  2. In By variables, enter ‘Hospital overall rating’. Click OK.

Now, we have separate worksheets with the hospitals that achieved each number of stars. We repeat the simple random-sampling process on each worksheet so that we have a sample of 10 from each ranking.

Now we want to combine those samples from the different star rating data.

  1. Choose Data > Stack Worksheets.
  2. Move the worksheets with the star rating data from Available Worksheets to Worksheets to stack.
  3. Name the new worksheet and click OK.

If you’d like the worksheet to be just your final sample, you can go one step further.

  1. Choose Data > Copy > Columns to Columns.
  2. In Copy from Columns, enter c29-c58.
  3. Name the new worksheet.
  4. Click Subset the data.
  5. Select Rows that match and click Condition.
  6. In Condition, enter c42 <> '*'. Click OK in all 3 dialog boxes.

Now you have a worksheet with 50 hospitals, 10 for each star rating.
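For comparison, the whole stratified scheme collapses to a couple of lines in pandas. The file name is a placeholder, the column name is taken from the government data, and the "Not Available" filter is an assumption about how unrated hospitals are coded.

```python
import pandas as pd

hospitals = pd.read_csv("star_ratings.csv")     # placeholder file name
rated = hospitals[hospitals["Hospital overall rating"] != "Not Available"]

# 10 hospitals per star rating: a simple random sample within each stratum
stratified = (rated.groupby("Hospital overall rating")
                   .sample(n=10, random_state=42))
```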

Hospital Data

At hospitalsafetyscore.org, I was able to find safety ratings for 30 of the hospitals in my sample of hospitals with government star ratings. I have a little bit of concern because I was more likely to find safety ratings for hospitals with lower star ratings than for those with higher star ratings, but I did find at least 4 hospitals in each category. Because I'm interested in the relationship between the scores and not in evaluating individual hospitals, I can proceed with my smaller sample size to see if I can get a rough idea about the relationship.

My sample data suggest a relationship between the safety score and the star rating from the government. If we treat the variables as ordinal, the Spearman's rho that measures their correlation is about 0.73 and significantly different from 0. We would not expect perfect agreement because the two ratings are intended to measure different constructs. Still, in the stratified sample, we can see that no 1-star hospital achieved a safety score better than a C and that no 5-star hospital had a safety rating less than a B.
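If you want to reproduce that kind of check outside of Minitab, here's a minimal Python sketch. The values below are made up, and the mapping of letter grades to numbers is an assumption of the illustration.

```python
from scipy.stats import spearmanr

# Made-up sample: star ratings (1-5) and safety grades mapped A=5 ... F=1
stars  = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
safety = [2, 1, 2, 3, 3, 4, 4, 3, 5, 4]

rho, p_value = spearmanr(stars, safety)  # rank-based correlation for ordinal data
print(rho, p_value)
```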

As the overall rating from the government increases, so does the safety score.

Ready for more on Minitab? Read about the role Minitab played in helping Akron Children's Hospital reduce costs while improving patient care.

The image of Roper-Saint Francis Hospital in Charleston, South Carolina, is by ProfReader and is licensed under this Creative Commons License.