
Using Data Analysis to Maximize Webinar Attendance


We like to host webinars, and our customers and prospects like to attend them. But when our webinar vendor moved from a pay-per-person pricing model to a pay-per-webinar pricing model, we wanted to find out how to maximize registrations and thereby minimize our costs.

We collected webinar data on the following variables:

  • Webinar topic
  • Day of week
  • Time of day – 11 a.m. or 2 p.m.
  • Newsletter promotion – no promotion, newsletter article, newsletter sidebar
  • Number of registrants
  • Number of attendees

Once we'd collected our data, it was time to analyze it and answer some key questions using Minitab Statistical Software.

Should we use registrant or attendee counts for the analysis?

First we needed to decide what we would use to measure our results: the number of people who signed up, or the number of people who actually attended the webinar. This question really boils down to answering the question, “Can I trust my data?”

Our data collection system for webinar registrants is much more accurate than our data collection system for webinar attendees. This is due to customer behavior and their willingness to share contact information, in addition to the automated database processes that connect our webinar vendor data with our own database. So, for a period of time, I manually collected the attendee data directly from our webinar vendor to see how it correlated with the easily-accessible and accurate registration data. The scatterplot above shows the results.

With a correlation coefficient of 0.929 and a p-value of 0.000, there was a strong positive linear relationship between the registrations and attendee counts. If registrations are high, then attendance is also high. If registrations are low, then attendance is also low. I concluded that I could use the registration data—which is both easily accessible and extremely reliable—to conduct my analysis.
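For readers who want to reproduce this kind of check outside Minitab, here is a minimal sketch in Python, assuming the counts live in columns named Registrants and Attendees (hypothetical names and file; the original analysis was done in Minitab):

```python
# Sketch: checking the registrations-vs-attendees relationship with a
# Pearson correlation. Column and file names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

webinars = pd.read_csv("webinars.csv")  # hypothetical file
r, p = pearsonr(webinars["Registrants"], webinars["Attendees"])
print(f"Pearson r = {r:.3f}, p-value = {p:.3f}")
```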

Should we consider data for the last 6 years?

We’ve been collecting webinar data for 6 years, but that doesn’t mean we can treat the last 6 years of data as one homogeneous population.

A lot can change in a 6-year time period. Perhaps there was a change in the webinar process that affected registrations. To determine whether or not I should use all of the data, I used an Individuals and Moving Range (I-MR, also referred to as X-MR) control chart to evaluate the process stability of webinar registrations over time.

The graph revealed a single point on the MR chart that flagged as out-of-control. I looked more closely at this point and verified that the data was accurate and that this webinar belonged with the larger population. Based on this information, I decided to proceed with analyzing all 6 years of data together. (Note there is some clustering of points due to promotions, but again the goal here was to determine if we could use data over a 6-year time period.)

What variables impact registrations?

I performed an ANOVA using Minitab's General Linear Model tool to find out which factors—topic, day of week, time of day, or newsletter promotion—significantly affect webinar registrations.

The ANOVA results revealed that the day of week, time of day, and webinar topic do not affect webinar registrations, but the newsletter promotion type does (p-value = 0.000).
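As a rough sketch of the same kind of model outside Minitab, a general linear model ANOVA could be fit as shown below. The column names are hypothetical stand-ins for the variables described above, and this is not the Assistant's or GLM dialog's exact procedure:

```python
# Sketch: ANOVA on registrations with four categorical factors,
# using statsmodels. Column and file names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

webinars = pd.read_csv("webinars.csv")  # hypothetical file
model = smf.ols(
    "Registrants ~ C(Topic) + C(Day) + C(Time) + C(Promotion)",
    data=webinars,
).fit()
print(sm.stats.anova_lm(model, typ=2))  # p-value for each factor
```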

So which webinar promotion type maximizes webinar registrations?

Using Minitab to conduct Tukey comparisons, we can see that registrations for webinars promoted in the newsletter sidebar space were not significantly different from webinars that weren't promoted at all.

However, webinars that were promoted in the newsletter article space resulted in significantly more registrations than both the sidebar promotions and no promotions.
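A hedged sketch of the same pairwise-comparison idea in Python, again with hypothetical column names (the actual Tukey comparisons were run in Minitab):

```python
# Sketch: Tukey pairwise comparisons of registrations across the three
# promotion types (none, sidebar, article). Hypothetical column names.
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

webinars = pd.read_csv("webinars.csv")  # hypothetical file
tukey = pairwise_tukeyhsd(webinars["Registrants"], webinars["Promotion"], alpha=0.05)
print(tukey.summary())
```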

From this analysis, we concluded that we still had the flexibility to offer webinars at various times and days of the week, and we could continue to vary webinar topics based on customer demand and other factors. To maximize webinar attendance and minimize webinar cost, we needed to focus our efforts on promoting the webinars in our newsletter, utilizing the article space.

But over the past year, we’ve started to actively promote our webinars via other channels as well, so next up is some more data analysis—using Minitab—to figure out what marketing channels provide the best results…


Does the Impact of Your Quality Initiative Reach C-Level Executives?


Here's a shocking finding from the most recent ASQ Global State of Quality report: The higher you rise in your organization's leadership, the less often you receive reports about quality metrics. Only 2% of senior executives get daily quality reports, compared to 33% of front-line staff members.  

A quarter of the senior executives reported getting quality metrics at least monthly, but just as many reported getting them only once a year.  

This is simultaneously scary and depressing. It's scary because it indicates that company leaders don't have good access to the kind of information they need about their quality improvement initiatives. More than half of the executives are getting updates about quality only once a quarter, or even less often. You can bet they're making decisions that impact quality much more frequently than that. 

It's depressing because quality practitioners are a dedicated, hard-working lot, and their task is both challenging and often thankless. Their efforts don't appear to be reaching the C-level offices as often as they deserve. 

Why do so many leaders get so few reports about their quality programs?

Factors that Complicate Reporting on Quality Programs 

In fairness to everyone involved, from the practitioner to the executive, piecing together the full picture of quality in a company is daunting. Practitioners tell us that even in organizations with robust, mature quality programs, assessing the cumulative impact of an initiative can be difficult, and sometimes impossible. 

The challenges start with the individual project. Teams are very good at capturing and reporting their results, but a large company may have thousands of simultaneous quality projects. Just gathering the critical information from all of those projects and putting it into a form leaders can use is a monumental task. 

But there are other obstacles, too. 

  • Teams typically use an array of different applications to create charters, process maps, value stream maps, and other documents. So the project record is a hodgepodge of files for different applications. And since the latest versions of some documents may reside on several different computers, project leaders often need to track multiple versions of a document to keep the official project record current. 
     
  • Results and metrics aren’t always measured the same way from one team's project to another. If one team measures apples and the next team measures oranges, their results can't be evaluated or aggregated as if they were equivalent. 
     
  • Many organizations have tried quality tracking methods ranging from homegrown project databases to full-featured project portfolio management (PPM) systems. But homegrown systems often become a burden to maintain, while off-the-shelf PPM solutions created for IT or other business functions don’t effectively support projects involving methods like Lean and Six Sigma. 
     
  • Reporting on projects can be a burden. There are only so many hours in the day, and busy team members need to prioritize. Copying and pasting information from project documents into an external system seems like non-value-added time, so it's easy to see why putting the latest information into the system gets low priority—if it happens at all.

Reporting on Quality Shouldn't Be So Difficult

Given the complexity of the task, and the systemic and human factors involved in improving quality, it's not hard to see why many organizations struggle with knowing how well their initiatives are doing. 

But for quality practitioners and leaders, the challenge is to make sure that reporting on results becomes a critical step in every individual project, and that all projects are using consistent metrics. Teams that can do that will find their results getting more attention and more credit for how they affect the bottom line. 

This finding in the ASQ report caught my attention because it so dramatically underscores problems we at Minitab have been focusing on recently—in fact, last year we released Qeystone, a product portfolio management system for lean six sigma, to address many of these factors. 

Regardless of the tools they use, this issue—how to ensure the results of quality improvement initiatives are understood throughout an organization—is one that every practitioner is likely to grapple with in their career.  

How do you make sure the results of your work reach your organization's decision-makers?   

 

Can Regression and Statistical Software Help You Find a Great Deal on a Used Car?


You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal?  By analyzing the data you have available.  

Let's look at how this works using the Assistant in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.

And it's probably a lot easier than you think. 

A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose Assistant > Regression…

At this point, if you aren’t very comfortable with regression, the Assistant makes it easy to select the right option for your analysis.

A Decision Tree for Selecting the Right Analysis

We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.

This data set includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.

We don’t know which of these factors may have significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include—but we don’t need to. Just fill out the dialog box as shown. 

Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that notifies you to potential problems with the analysis, if there are any.
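For the curious, here is a rough sketch of the kind of model the Assistant considers—main effects, two-way interactions, and squared terms—fit directly in Python with hypothetical column names. It does not reproduce the Assistant's own model-selection procedure:

```python
# Sketch: a candidate model with two-way interactions and quadratic terms.
# Column and file names are hypothetical; model selection is not shown.
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("cars.csv")  # hypothetical file
model = smf.ols(
    "Price ~ (Mileage + Age + C(TechPackage) + C(FreeCarfax))**2"
    " + I(Mileage**2) + I(Age**2)",
    data=cars,
).fit()
print(model.summary())
```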

Interpreting Regression Results in Plain Language

The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. 

The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides 2 equations, one for vehicles that include a free CARFAX report, and one for vehicles that do not.

We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354).  This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).

Second, each additional mile added to the car decreases its expected price by roughly 8 cents, while each year added to the car's age decreases the expected price by $2,357.

The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. 

Residuals versus Fitted Values

While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.

For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values.  If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.

In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price—exactly the information you’re looking for! The Assistant marks large residuals in red, making them very easy to find. Three of those residuals—shown in light blue above because we’ve selected them—fall very far below the asking prices predicted by the regression analysis.
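Continuing the hypothetical sketch above, the bargain-hunting step amounts to sorting the residuals and looking at the most negative ones:

```python
# Sketch: using residuals (actual price minus fitted price) to flag the
# biggest apparent bargains. Continues the hypothetical model fit earlier.
residuals = model.resid            # actual minus fitted
bargains = residuals.nsmallest(3)  # most negative = priced furthest below prediction
print(cars.loc[bargains.index])    # rows worth revisiting online
```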

Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. 

Sure enough, the records for those vehicles reveal that two of them have severe collision damage.

But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!

With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.

 

Simulating Robust Processing with Design of Experiments, part 1


by Jasmin Wong, guest blogger

The combination of statistical methods and injection moulding simulation software gives manufacturers a powerful way to predict moulding defects and to develop a robust moulding process at the part design phase. 

CAE (computer-aided engineering) is widely used in the injection moulding industry today to improve product and mould designs as well as to resolve or troubleshoot engineering problems. But CAE can also be used to carry out in-depth processing simulations, allowing the critical process parameters that influence part quality to be identified and an appropriate, achievable process window to be determined at the earliest stage of the development process.

Warpage and tube concentricity were the key quality criteria in this injection-moulded hand dispenser pump.

In order to produce good quality injection mouldings with high consistency, a well-designed part and mould is critical, along with the selection of the right material and processing parameters. Changes to any of these four factors can have a significant effect on the moulded part.

With regard to defining process parameters, the injection moulding industry has been dependent on experienced process engineers using trial-and-error methods. Without insight into polymer behaviour inside the mould, more often than not engineers would ‘process the part dimensions in.’ Such an approach typically leads to a narrow process window, where just a slight change in processing conditions can cause part dimensions to fall outside of the specification limits. This trial-and-error method is also laborious, expensive, and frequently ineffective, making it unsuitable for today’s fast-moving plastics processing industry.

Plastic injection moulding simulation software such as Moldex 3D from CoreTech System can help in the validation and optimisation of the part and/or mould design by identifying potential moulding defects before the tool is manufactured. The software can reduce the need for expensive prototypes, minimise the cost of tooling (since less rework needs to be done), and shorten validation time. When combined with the Design of Experiments techniques available in statistical software such as Minitab, running simulations ahead of real-world mould trials can also speed mould approval. 

The Design of Experiments (DOE) Approach

Design of Experiments, or DOE, involves performing a series of carefully planned, systematic tests while controlling the inputs and monitoring the outputs. In the context of injection moulding, the process parameters are usually referred to as the factors or inputs, while the customer requirements (part quality/dimensions or other part specifications) are referred to as responses or outputs. By analysing the results from these tests, moulders can characterise, optimise and/or troubleshoot the injection moulding process effectively and efficiently.

By applying DOE in an injection moulding simulation, designers and/or moulders can study the relationship between the moulding factors (inputs) and response (outputs) prior to the actual trial on the mould floor. This means that they can have a good understanding of which factors will affect the quality or certain part specifications as early as possible in the development process. Optimal moulding process conditions for the specific part design can then be identified so the focus can be directed to the conditions that have the biggest influence on the customer’s requirements. This can save time and increase productivity.

When Should Simulation Be Performed?

Ideally, CAE simulation should be carried out before the actual mould trial so potential mould defects—such as sink marks, weld lines, short shots, etc.—can be predicted and rectified in the original mould design.

The most challenging problem is often warpage. Due to temperature variations and differences in volumetric shrinkage, it is almost impossible to get a part which is exactly the same as the CAD model. It is, therefore, important to conduct a DOE to understand the impact certain processing parameters have and to define the optimum processing settings.

Before the DOE is conducted, however, it is important to carry out a preliminary simulation to understand the root cause of mould defects. Changes to the part are sometimes inevitable to avoid having too narrow a process window to work within. If the fill pattern is not balanced, for example, there is a high possibility of warpage occurring regardless of the process parameters.

 

The second half of this two-part post includes a detailed case study illustrating how moulding simulation software and design of experiments can be combined to speed part design and approval. 

 

About the guest blogger

Jasmin Wong is project engineer at UK-based Plazology, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.

 

 
A version of this article originally appeared in the October 2012 issue of Injection World magazine.

Simulating Robust Processing with Design of Experiments, part 2


by Jasmin Wong, guest blogger

 

Part 1 of this two-part blog post discusses the issues and challenges in injection moulding and suggests using simulation software and the statistical method called Design of Experiments (DOE) to speed development and boost quality. This part presents a case study that illustrates this approach. 

Preliminary Fill and Designed Experiment

This case study considers the example of a hand dispensing pump for a sanitiser bottle where the main areas of concern were warpage and the concentricity of the tube, as this had a critical impact on fit and functionality. 

In this example, the first step was to carry out a preliminary fill, pack, cool and warp analysis to ensure that the part had no filling difficulties such as short shots or hesitation. DOE was then carried out and, since the areas of concern were warpage and concentricity, these were selected as the quality factor/responses.

Four control factors that affected warpage and concentricity were used to carry out the DOE: melt temperature, packing pressure, cooling time, and fill time. The factor levels are shown in the table below:

Taguchi DOE control factors

A Taguchi L9 DOE was then created using Minitab Statistical Software. It should be noted that a Taguchi DOE assumes no significant interaction between factors, but this may not necessarily be true. In this case, however, it was selected to determine the relationship between the factors and responses in the shortest simulation time.

The Minitab worksheet below shows the process settings for the nine runs using the Taguchi L9 Design.

Taguchi design worksheet
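For reference, the standard L9(3^4) orthogonal array looks like the sketch below, with levels coded 1–3; the codes would be mapped to the actual factor settings from the table above (the mapping is not reproduced here):

```python
# Sketch: the standard Taguchi L9 orthogonal array for four factors at
# three levels each (level codes 1-3, not the actual process settings).
import pandas as pd

L9 = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]
design = pd.DataFrame(L9, columns=["MeltTemp", "PackPressure", "CoolTime", "FillTime"])
print(design)
```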

Moldex3D DOE was then used to perform the mathematical calculations based on the user’s specification (minimum warpage and linear shrinkage between nodes) to determine the optimum process setting.

From the nine different simulated runs, a main effect graph for warpage was plotted. 

Main Effects Plor for Warpage

From this, it could be seen that by increasing the packing pressure and cooling time, warpage was reduced. Increasing melt temperature, on the other hand, led to higher warpage. Using a filling time of 0.2s or 0.3s seemed to give slightly less warpage than 0.1s. Hence, it was determined that to achieve lower warpage, the optimum process setting should be a melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.3s.

Taking the results obtained from Moldex3D, Minitab 17 statistical software was used to determine which of the four factors had the biggest influence on part warpage.

response table for warpage

This data analysis showed that cool time had the biggest impact on part warpage, followed by packing pressure, melt temperature and then filling time. An area graph of warpage showed a quick comparison of the nine different runs, indicating that run 3 gave the least warpage.

area graph of warpage
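A response table like this can also be sketched by hand: average the response at each level of each factor and rank the factors by the range (delta) of those averages. The column names below are hypothetical stand-ins for the nine simulated runs:

```python
# Sketch: building a response table - mean warpage at each level of each
# factor, with the range (delta) used to rank factor influence.
# Column and file names are hypothetical.
import pandas as pd

results = pd.read_csv("taguchi_runs.csv")  # hypothetical file of the nine runs
factors = ["MeltTemp", "PackPressure", "CoolTime", "FillTime"]
for f in factors:
    level_means = results.groupby(f)["Warpage"].mean()
    print(f, "delta =", level_means.max() - level_means.min())
```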

Concentricity is difficult to measure, in both real life and in simulation. In real life, the distance between different points is measured using a coordinate-measuring machine (CMM). In the Moldex3D simulation, the linear shrinkage between different nodes was measured. Eight different nodes were identified. The linear shrinkage across the diameter of the tube was determined; the lower the linear shrinkage, the more circular the part and the better its concentricity.

The main effects plot below for shrinkage shows that to get better concentricity/linear shrinkage between the nodes, a lower melt temperature, cooling time and filling time with a high pack pressure was preferable.

Main Effects Plot for Shrinkage

It had already been established that to achieve lower linear shrinkage, the optimum process setting should be melt temperature of 225°C, packing pressure of 15MPa, cooling time of 8s and filling time of 0.1s. However, a cooling time of 8s may not be practical, as the analysis of warpage shows it would give high warpage.

Minitab was also used to find out which of the four control factors resulted in the greatest impact on linear shrinkage.

Response Table for Shrinkage

This showed that pack pressure is ranked first, followed by cooling time, melt temperature and lastly the filling time. Since the 8s cooling time would lead to high warpage, a compromise had to be made.

As mentioned earlier, for linear shrinkage the packing pressure was more of a contributing factor than the cooling time, so it makes sense to use 12s cooling time with 15MPa packing pressure. Comparing the nine different runs for linear shrinkage in an area graph showed that run six gave the lowest linear shrinkage.

Based on the user specification, Moldex3D’s mathematical calculations obtained the optimised run. For this example, weighting for warpage was the same as for linear shrinkage. However, based on the DOE simulation results obtained, the optimum process setting for the lowest warpage was to have a cooling time of 12s and filling time of 0.3s. The optimum process for the lowest linear shrinkage, on the other hand, required a cooling time of 8s and fill time of 0.1s.

Concluding thoughts

Moldex3D simulation resulted in a compromise process setting (melt temperature of 225°C, packing pressure of 15MPa, cooling time of 12s and filling time of 0.1s), which was used as the optimum run. From the area graphs shown below, it can be seen that the optimised run 10 gives the lowest warpage compared to the other nine runs, while having low linear shrinkage.

optimized run - area chart

From the simulation in Moldex 3D, shown below, it can be seen that part warpage and concentricity of the tube has been significantly improved (warpage has been improved by 20-30% while linear shrinkage has been kept to 0.6-0.7%).

Moldex 3D simulation

It is important that designers and moulders understand that numerical results in a simulation such as this provide only a relative comparison and should not be treated as absolute. This is because there are various uncontrollable factors in the actual mould shop environment—‘noise’—which cannot be re-enacted in a simulation. However, running DOE using simulation can give the engineering team a head start on identifying which control factors to focus on and the relationship those factors have with part quality.

 

About the guest blogger

Jasmin Wong is project engineer at UK-based Plazology, which provides product design optimisation, injection moulding flow simulation, mould design, mould procurement, and moulding process validation services to global manufacturing customers. She is an MSc graduate in polymer composite science and engineering and recently gained Moldex3D Analyst Certification.

 

 

A version of this article originally appeared in the October 2012 issue of Injection World magazine.

WHO Cares about How Much Sugar You Eat on Halloween?


It’s almost Halloween, so there’s lots to do. If you haven’t picked out your costume, get ideas from the National Retail Federation’s list of the most popular costumes for 2014. Last-minute candy shopping? Check out kidzworld.com’s list of the top 10 candies for Halloween. And of course, you have to plan your daily candy consumption to match the limits on free sugar recommended by the World Health Organization (WHO) earlier this year.

Mixed candy

What’s that you say? You didn’t plan your candy consumption yet? Well, the guideline says that no more than 10% of your calories should come from free sugars and that you can achieve increased health benefits by keeping the number below 5%. If you’re a good nutrition tracker, that should be no problem for you. For those of you looking for more general suggestions, we’re going to make a scatterplot in Minitab that should provide a helpful reference.

We like to show some fairly nifty graph features on the Minitab blog. For example, Carly Barry’s shown you how to make your graphs more manageable with paneling, Jim Frost’s shown you how to adjust your scales for travel bumps, and Eston Martz adjusted contour plots while looking at data about hyena skulls. This time though, we’re going to see how our statistical software makes it easy to clarify a graph by taking something away.

The USDA last published their dietary guidelines in 2010. Appendix 6 contains calorie estimates based on age, gender and activity level, rounded to the nearest 200 calories. Multiply those calorie levels by 0.05 to get an estimate of your recommended sugar limit in calories. To change that into grams that you can find on candy labels, we’ll assume that sugar has 4 calories per gram and divide by 4.
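As a quick worked example of that arithmetic (using a 2,000-calorie estimate purely for illustration):

```python
# Sketch of the arithmetic: at 2,000 calories, the 5% guideline allows
# 100 calories from free sugar, or about 25 grams at 4 calories per gram.
calories = 2000
for pct in (0.05, 0.10):
    sugar_calories = calories * pct
    sugar_grams = sugar_calories / 4
    print(f"{pct:.0%} guideline: {sugar_calories:.0f} kcal = {sugar_grams:.0f} g of sugar")
```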

Now, if we create the default graph in Minitab we get something a bit like this. Note the symbols crammed together along each line:

Crowded symbols make the graph less clear.

Let’s be honest, pushing all those symbols together to show a line with no variation looks a bit silly. But select those symbols and a clearer graph is only a right-click away:

Right-click the symbols and click Delete to make the graph clearer.

Without the symbols on the graph, the lines and the differences between them are clearer, especially when the lines are closest together during the early phase when people grow rapidly.

Without the symbols, the graph is clearer.

Much has been made of the fact that the 5% WHO guideline is less than the sugar in a can of soda, so Halloween can be a treacherous time for someone who wants to limit their sugar intake. After all, Popular Science reports that the average trick-or-treater brings home over 600 grams. So what do you do if your ghost or goblin brings home more candy than you want? Natalie Silverstein offers some suggestions about how to make your candy do some good for others.

  

The image of mixed candy is by Steven Depolo and appears under this Creative Commons license.

R-squared Shrinkage and Power and Sample Size Guidelines for Regression Analysis


Using a sample to estimate the properties of an entire population is common practice in statistics. For example, the mean of a random sample estimates the mean of the entire population. In linear regression analysis, we’re used to the idea that the regression coefficients are estimates of the true parameters. However, it’s easy to forget that R-squared (R2) is also an estimate. Unfortunately, it has a problem that many other estimates don’t have. R-squared is inherently biased!

In this post, I look at how to obtain an unbiased and reasonably precise estimate of the population R-squared. I also present power and sample size guidelines for regression analysis.

R-squared as a Biased Estimate

R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.

  • An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
  • A biased estimate is systematically too high or too low, so even its average is off target. It’s like a bathroom scale that always indicates you are heavier than you really are. No one wants that!

R-squared is like the broken bathroom scale: it is deceptively large. Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

This bias is a reason why some practitioners don’t use R-squared at all—it tends to be wrong.

R-squared Shrinkage

What should we do about this bias? Fortunately, there is a solution and you’re probably already familiar with it: adjusted R-squared. I’ve written about using the adjusted R-squared to compare regression models with a different number of terms. Another use is that it is an unbiased estimator of the population R-squared.

Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce an accurate weight. In statistics this is called shrinkage. (You Seinfeld fans are probably giggling now. Yes, George, we’re talking about shrinkage, but here it’s a good thing!)

We need to shrink the R-squared down so that it is not biased. Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model.

Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.
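The usual adjusted R-squared formula makes the shrinkage explicit; here is a small sketch showing how the correction depends on the sample size relative to the number of terms:

```python
# Sketch: the standard adjusted R-squared formula. Shrinkage grows as the
# number of observations per model term shrinks.
def adjusted_r_squared(r2, n, p):
    """Shrink the sample R-squared for a model with p terms fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.65, n=20, p=3))   # few samples per term: more shrinkage
print(adjusted_r_squared(0.65, n=200, p=3))  # many samples per term: little shrinkage
```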

Line plot showing R-squared shrinkage by sample size per term

The graph shows greater shrinkage when you have a smaller sample size per term and lower R-squared values.

Precision of the Adjusted R-squared Estimate

Now that we have an unbiased estimator, let's take a look at the precision.

Estimates in statistics have both a point estimate and a confidence interval. For example, the sample mean is the point estimate for the population mean. However, the population mean is unlikely to exactly equal the sample mean. A confidence interval provides a range of values that is likely to contain the population mean. Narrower confidence intervals indicate a more precise estimate of the parameter. Larger sample sizes help produce more precise estimates.

All of this is true with the adjusted R-squared as well because it is just another estimate. The adjusted R-squared value is the point estimate, but how precise is it and what’s a good sample size?

Rob Kelly, a senior statistician at Minitab, was asked to study this issue in order to develop power and sample size guidelines for regression in the Assistant menu. He simulated the distribution of adjusted R-squared values around different population values of R-squared for different sample sizes. This histogram shows the distribution of 10,000 simulated adjusted R-squared values for a true population value of 0.6 (rho-sq (adj)) for a simple regression model.

Histogram showing distribution of adjusted R-squared values around the population value

With 15 observations, the adjusted R-squared varies widely around the population value. Increasing the sample size from 15 to 40 greatly reduces the likely magnitude of the difference. With a sample size of 40 observations for a simple regression model, the margin of error for a 90% confidence interval is +/- 20%. For multiple regression models, the sample size guidelines increase as you add terms to the model.

Power and Sample Size Guidelines for Regression Analysis

Satisfying these sample size guidelines helps ensure that you have sufficient power to detect a relationship and provides a reasonably precise estimate of the strength of that relationship. Specifically, if you follow these guidelines:

  • The power of the overall F-test ranges from about 0.8 to 0.9 for a moderately weak relationship (0.25). Stronger relationships yield higher power.
  • You can be 90% confident that the adjusted R-squared in your output is within +/- 20% of the true population R-squared value. Stronger relationships (~0.9) produce more precise estimates.

Terms     Total sample size
1-3       40
4-6       45
7-8       50
9-11      55
12-14     60
15-18     65
19-21     70

In closing, if you want to estimate the strength of the relationship in the population, assess the adjusted R-squared and consider the precision of the estimate.

Even when you meet the sample size guidelines for regression, the adjusted R-squared is a rough estimate. If the adjusted R2 in your output is 60%, you can be 90% confident that the population value is between 40% and 80%.

If you're learning about regression, read my regression tutorial! For more histograms and the full guidelines table, see the simple regression white paper and multiple regression white paper.

Comparing the College Football Playoff Top 25 and the Preseason AP Poll


The college football playoff committee waited until the end of October to release their first top 25 rankings. One of the reasons for waiting so far into the season was that the committee would rank the teams off of actual games and wouldn’t be influenced by preseason rankings.

At least, that was the idea.

Earlier this year, I found that the final AP poll was correlated with the preseason AP poll. That is, if team A was ranked ahead of team B in the preseason and they had the same number of losses, team A was still usually ranked ahead of team B. The biggest exception was SEC teams, who were able to regularly jump ahead of teams (with the same number of losses) ranked ahead of them in the preseason.

If the final AP poll can be influenced by preseason expectations, could the college football playoff committee be influenced, too? Let’s compare their first set of rankings to the preseason AP poll to find out.

Comparing the Ranks

There are currently 17 different teams in the committee’s top 25 that have just one loss. I recorded the order they are ranked in the committee’s poll and their order in the AP preseason poll. Below is an individual value plot of the data that shows each team’s preseason rank versus their current rank.

Individual value plot of each team's preseason AP rank versus its current playoff committee rank

Teams on the diagonal line haven’t moved up or down since the preseason. Although Notre Dame is the only team to fall directly on the line, most teams aren’t too far off.

Teams below the line have jumped teams that were ranked ahead of them in the preseason. The biggest winner is actually not an SEC team—it’s TCU. Before the season, 13 of the current one-loss teams were ranked ahead of TCU, but now there are only 4. On the surface, TCU seems to counter the idea that only SEC teams can drastically move up from their preseason ranking. However, of the 9 teams TCU jumped, only one (Georgia) is from the SEC. And the only other team to jump up more than 5 spots is Mississippi—who, of course, is from the SEC. So I wouldn’t conclude that the CFB playoff committee rankings behave differently from the AP poll quite yet.

Teams above the line, in contrast, have been passed by teams that had been ranked behind them in the preseason. Ohio State is the biggest loser, having had 9 different teams pass over them. Part of this can be explained by the fact that they have the worst loss of the group (at home to a Virginia Tech team that is now 4-4). But another factor is that the preseason AP poll was released before anybody knew Buckeye quarterback Braxton Miller would miss the entire season. Had voters known that, Ohio State probably wouldn’t have been ranked so high to begin with.  

Overall, 10 teams have moved up or down from their preseason spot by 3 spots or fewer. The correlation between the two polls is 0.571, which indicates a positive association between the preseason AP poll and the current CFB playoff rankings. That is, teams ranked higher in the preseason poll tend to be ranked higher in the playoff rankings.

Concordant and Discordant Pairs

We can take this analysis a step further by looking at the concordant and discordant pairs. A pair is concordant if the observations are in the same direction. A pair is discordant if the observations are in opposite directions. This will let us compare teams to each other two at a time.

For example, let’s compare Auburn and Mississippi. In the preseason, Auburn was ranked 3 (out of the 17 one-loss teams) and Mississippi was ranked 10. In the playoff rankings, Auburn is ranked 1 and Mississippi is ranked 2. This pair is concordant, since in both cases Auburn is ranked higher than Mississippi. But if you compare Alabama and Mississippi, you’ll see Alabama was ranked higher in the preseason, but Mississippi is ranked higher in the playoff rankings. That pair is discordant.
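Counting concordant and discordant pairs is simple enough to sketch by hand; in the sketch below, the two rank lists are hypothetical placeholders for the 17 one-loss teams in a fixed order:

```python
# Sketch: counting concordant and discordant pairs from two rankings of
# the same teams. 'preseason' and 'playoff' are lists of ranks in the
# same team order (hypothetical inputs).
from itertools import combinations

def concordance(preseason, playoff):
    concordant = discordant = 0
    for i, j in combinations(range(len(preseason)), 2):
        pre = preseason[i] - preseason[j]
        now = playoff[i] - playoff[j]
        if pre * now > 0:
            concordant += 1
        elif pre * now < 0:
            discordant += 1
    return concordant, discordant

# 17 teams give 17 * 16 / 2 = 136 pairs in total.
```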

When we compare every team, we end up with 136 pairs. How many of those are concordant? Our favorite statistical software has the answer: 

Measures of Concordance

There are 96 concordant pairs, which is just over 70%. So most of the time, if a team ranked higher in the preseason poll, they are ranked higher in the playoff rankings. And consider this: of the one-loss teams, the top 4 ranked preseason teams were Alabama, Oregon, Auburn, and Michigan St. Currently, the top 4 one-loss teams are Auburn, Mississippi, Oregon, and Alabama. That’s only one new team—which just so happens to be from the SEC.

That’s bad news for non-SEC teams that started the season ranked low, like Arizona, Notre Dame, Nebraska, and Kansas State. It's going to be hard for them to jump teams with the same record, especially if those teams are from the SEC. Just look at Alabama’s résumé so far. Their best win is over West Virginia and they lost to #4 Mississippi. Is that really better than Kansas State, who lost to #3 Auburn and beat Oklahoma on the road? If you simply changed the name on Alabama’s uniform to Utah and had them unranked to start the season, would they still be ranked three spots higher than Kansas State?  I doubt it.

The good news is that there are still many games left to play. Most of these one-loss teams will lose at least one more game. But with 4 teams making the playoff this year, odds are we'll see multiple teams with the same record vying for the last playoff spot. And if this first college football playoff ranking is any indication, teams that were highly regarded in the preseason will have an edge over teams that weren't—unless those teams are from the SEC.


Methods and Formulas: How Are I-MR Chart Control Limits Calculated?


Users often contact Minitab technical support to ask how the software calculates the control limits on control charts.

A frequently asked question is how the control limits are calculated on an I-MR Chart or Individuals Chart. If Minitab plots the upper and lower control limits (UCL and LCL) three standard deviations above and below the mean, why are the limits plotted at values other than 3 times the standard deviation that I get using Stat > Basic Statistics?

That’s a valid question—if we’re plotting individual points on the I-Chart, it doesn’t seem unreasonable to try to calculate a simple standard deviation of the data points, multiply by 3 and expect the UCL and LCL to be the data mean plus or minus 3 standard deviations. This can be especially confusing because the Mean line on the Individuals chart IS the mean of the data!

However, the standard deviation that Minitab Statistical Software uses is not the simple standard deviation of the data. The default method that Minitab uses—and the option to change it—can be found by clicking the I-MR Options button, and then choosing the Estimate tab:

There we can see that Minitab is using the Average moving range method with 2 as the length of moving range to estimate the standard deviation.

That’s all well and good, but exactly what the heck is an average moving range with length 2?!

Minitab’s Methods and Formulas section details the formulas used for these calculations. In fact, Methods and Formulas provides information on the formulas used for all the calculations available through the dialog boxes. This information can be accessed via the Help menu by choosing Help > Methods and Formulas...

To see the formulas for control chart calculations, we choose Control Charts > Variables Charts for Individuals as shown below:

The next page shows the formulas organized by topic. By selecting the link Methods for estimating standard deviation we find the formula for the Average moving range:

Looking at the formula, things become a bit clearer—the ‘length of the moving range’ is the number of data points used when we calculate the moving range (i.e., the difference from point 1 to point 2, 2 to 3, and so forth).

If we want to hand-calculate the control limits for a dataset, we can do that with a little help from Minitab!

The dataset I’ve used for this example is available HERE.

First, we’ll need to get the values of the moving ranges. We’ll use the calculator by navigating to Calc > Calculator; in the example below, we’re storing the results in column C2 (an empty column) and we’re using the LAG function in the calculator. That will move each of our values in column C1 down by 1 row. Click OK to store the results in the worksheet.

Note: By choosing the Assign as a formula option at the bottom of the calculator, we can add a formula to column C2 which we can easily go back and edit if a mistake was made.

Now with the lags stored in C2, we use the calculator again: Calc > Calculator (here's a tip: press F3 on the keyboard to clear out the previous calculator entry), then subtract column C2 from column C1 as shown below, storing the results in C3. We use the ABS calculator command to get the absolute differences of each row:

Next we calculate the sum of the absolute values of the moving ranges by using Calc > Calculator once again. We’ll store the sum in the next empty column, C4:

The value of this sum represents the numerator in the Rbar calculation:

To complete the Rbar calculation, we use the information from Methods and Formulas to come up with the denominator; n is the number of data points (in this example it’s 100), w’s default value is 2, and we add 1, so the denominator is 100 − 2 + 1 = 99. In Minitab, we can once again use Calc > Calculator to store the results in C5:

With Rbar calculated, we find the value of the unbiasing constant d2 from the table that is linked in Methods and Formulas:

For a moving-range of length 2, the d2 value is 1.128, so we enter 1.128 in the first row in column C6, and use the calculator one more time to divide Rbar by d2 to get the standard deviation, which works out to be 2.02549:
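The whole hand calculation boils down to a few lines; here is a sketch that mirrors the steps above, with x standing in for the column of 100 individual measurements (the example dataset itself isn't reproduced here):

```python
# Sketch: I-chart center line and control limits from the average moving
# range of length 2. 'x' is a 1-D numpy array of the individual values.
import numpy as np

def imr_limits(x, d2=1.128):                      # d2 for moving ranges of length 2
    moving_ranges = np.abs(np.diff(x))            # |point i - point i-1|
    rbar = moving_ranges.sum() / len(moving_ranges)  # sum / (n - w + 1)
    sigma = rbar / d2                             # estimated standard deviation
    center = x.mean()
    return center - 3 * sigma, center, center + 3 * sigma

# lcl, center, ucl = imr_limits(np.array(data_column))
```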

We can check our results by using the original data to create an I-MR chart.  We enter the data column in Variables, and then click I-MR Options and choose the Storage tab; here we can tell Minitab to store the standard deviation in the worksheet when we create the chart:

The stored standard deviation is shown in the new column titled STDE1, and it matches the value we hand-calculated. Notice also that the Rbar we calculated is the average of the moving ranges on the Moving-Range chart. Beautiful!

Creating and Reading Statistical Graphs: Trickier than You Think


A few weeks ago my colleague Cody Steele illustrated how the same set of data can appear to support two contradictory positions. He showed how changing the scale of a graph that displays mean and median household income over time drastically alters the way it can be interpreted, even though there's no change in the data being presented.

Graph interpretation is tricky, especially if you're doing it quickly.

When we analyze data, we need to present the results in an objective, honest, and fair way. That's the catch, of course. What's "fair" can be debated...and that leads us straight into "Lies, damned lies, and statistics" territory.  

Cody's post got me thinking about the importance of statistical literacy, especially in a mediascape saturated with overhyped news reports about seemingly every new study, not to mention omnipresent "infographics" of frequently dubious origin and intent.

As consumers and providers of statistics, can we trust our own impressions of the information we're bombarded with on a daily basis? It's an increasing challenge, even for the statistics-savvy. 

So Much Data, So Many Graphs, So Little Time

The increased amount of information available, combined with the acceleration of the news cycle to speeds that wouldn't have been dreamed of a decade or two ago, means we have less time available to absorb and evaluate individual items critically. 

A half-hour television news broadcast might include several animations, charts, and figures based on the latest research, or polling numbers, or government data. They'll be presented for several seconds at most, then it's on to the next item. 

Getting news online is even more rife with opportunities for split-second judgment calls. We scan through the headlines and eyeball the images, searching for stories interesting enough to click on. But with 25 interesting stories vying for your attention, and perhaps just a few minutes before your next appointment, you race through them very quickly. 

But when we see graphs for a couple of seconds, do we really absorb their meaning completely and accurately? Or are we susceptible to misinterpretation?  

Most of the graphs we see are very simple: bar charts and pie charts predominate. But as statistics educator Dr. Nic points out in this blog post, interpreting even simple bar charts can be a deceptively tricky business. I've adapted her example to demonstrate this below.  

Which Chart Shows Greater Variation? 

A city surveyed residents of two neighborhoods about the quality of service they get from local government. Respondents were asked to rate local services on a scale of 1 to 10. Their responses were charted using Minitab Statistical Software, as shown below.  

Take a few seconds to scan the charts, then choose which neighborhood's responses exhibit more variation: Ferndale or Lawnwood?

Lawnwood Bar Chart

Ferndale Bar Chart

Seems pretty straightforward, right? Lawnwood's graph is quite spiky and disjointed, with sharp peaks and valleys. The graph of Ferndale's responses, on the other hand, looks nice and even. Each bar's roughly the same height.  

It looks like Lawnwood's responses have the most variation. But let's verify that impression with some basic descriptive statistics about each neighborhood's responses:

Descriptive Statistics for Ferndale and Lawnwood

Uh-oh. A glance at the graphs suggested that Lawnwood has more variation, but the analysis demonstrates that Ferndale's variation is, in fact, much higher. How did we get this so wrong?  

Frequencies, Values, and Counterintuitive Graphs

The answer lies in how the data were presented. The charts above show frequencies, or counts, rather than individual responses.  

What if we graph the individual responses for each neighborhood?  

Lawnwood Individuals Chart

Ferndale Individuals Chart

In these graphs, it's easy to see that the responses of Ferndale's citizens had much more variation than those of Lawnwood. But unless you appreciate the differences between values and frequencies—and paid careful attention to how the first set of graphs was labelled—a quick look at the earlier graphs could well leave you with the wrong conclusion. 

Being Responsible 

Since you're reading this, you probably both create and consume data analysis. You may generate your own reports and charts at work, and see the results of other peoples' analyses on the news. We should approach both situations with a certain degree of responsibility.  

When looking at graphs and charts produced by others, we need to avoid snap judgments. We need to pay attention to what the graphs really show, and take the time to draw the right conclusions based on how the data are presented.  

When sharing our own analyses, we have a responsibility to communicate clearly. In the frequency charts above, the X and Y axes are labelled adequately—but couldn't they be more explicit?  Instead of just "Rating," couldn't the label read "Count for Each Rating" or some other, more meaningful description? 

Statistical concepts may seem like common knowledge if you've spent a lot of time working with them, but many people aren't clear on ideas like "correlation is not causation" and margins of error, let alone the nuances of statistical assumptions, distributions, and significance levels.

If your audience includes people without a thorough grounding in statistics, are you going the extra mile to make sure the results are understood? For example, many expert statisticians have told us they use the Assistant in Minitab 17 to present their results precisely because it's designed to communicate the outcome of analysis clearly, even for statistical novices. 

If you're already doing everything you can to make statistics accessible to others, kudos to you. And if you're not, why aren't you?  

What to Do When Your Data's a Mess, part 1


Isn't it great when you get a set of data and it's perfectly organized and ready for you to analyze? I love it when the people who collect the data take special care to make sure to format it consistently, arrange it correctly, and eliminate the junk, clutter, and useless information I don't need.  

You've never received a data set in such perfect condition, you say?

Yeah, me neither. But I can dream, right? 

The truth is, when other people give me data, it's typically not ready to analyze. It's frequently messy, disorganized, and inconsistent. I get big headaches if I try to analyze it without doing a little clean-up work first. 

I've talked with many people who've shared similar experiences, so I'm writing a series of posts on how to get your data in usable condition. In this first post, I'll talk about some basic methods you can use to make your data easier to work with. 

Preparing Data Is a Little Like Preparing Food

I'm not complaining about the people who give me data. In most cases, they aren't statisticians and they have many higher priorities than giving me data in exactly the form I want.  

The end result is that getting data is a little bit like getting food: it's not always going to be ready to eat when you pick it up. You don't eat raw chicken, and usually you can't analyze raw data, either.  In both cases, you need to prepare it first or the results aren't going to be pretty.

Here are a couple of very basic things to look for when you get a messy data set, and how to handle them.  

Kitchen-Sink Data and Information Overload

Frequently I get a data set that includes a lot of information that I don't need for my analysis. I also get data sets that combine or group information in ways that make analyzing it more difficult. 

For example, let's say I needed to analyze data about different types of events that take place at a local theater. Here's my raw data sheet:  

April data sheet

With each type of event jammed into a single worksheet, it's a challenge to analyze just one event category. What would work better?  A separate worksheet for each type of occasion. In Minitab Statistical Software, I can go to Data > Split Worksheet... and choose the Event column: 

split worksheet

And Minitab will create new worksheets that include only the data for each type of event. 

separate worksheets by event type
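If you prefer scripting, the rough analogue of this split in Python/pandas (not Minitab) is a groupby, sketched below with the Event column and a hypothetical file name:

```python
# Sketch: splitting one data set into one DataFrame per event type,
# analogous to Minitab's Split Worksheet. File name is hypothetical.
import pandas as pd

events = pd.read_csv("april_events.csv")  # hypothetical file
by_event = {name: grp.reset_index(drop=True) for name, grp in events.groupby("Event")}
# e.g., by_event["Film"] would hold only the rows for that (hypothetical) event type
```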

Minitab also lets you merge worksheets to combine items provided in separate data files. 

Let's say the data set you've been given contains a lot of columns that you don't need: irrelevant factors, redundant information, and the like. Those items just clutter up your data set, and getting rid of them will make it easier to identify and access the columns of data you actually need. You can delete rows and columns you don't need, or use the Data > Erase Variables tool to make your worksheet more manageable. 

I Can't See You Right Now...Maybe Later

What if you don't want to actually delete any data, but you only want to see the columns you intend to use? For instance, in the data below, I don't need the Date, Manager, or Duration columns now, but I may have use for them in the future: 

unwanted columns

I can select and right-click those columns, then use Column > Hide Selected Columns to make them disappear. 

hide selected columns

Voila! They're gone from my sight. Note how the displayed columns jump from C1 to C5, indicating that some columns are hidden:  

hidden columns

It's just as easy to bring those columns back in the limelight. When I want them to reappear, I select the C1 and C5 columns, right-click, and choose "Unhide Selected Columns." 

Data may arrive in a disorganized and messy state, but you don't need to keep it that way. Getting rid of extraneous information and choosing the elements that are visible can make your work much easier. But that's just the tip of the iceberg. In my next post, I'll cover some more ways to make unruly data behave.  

What to Do When Your Data's a Mess, part 2


In my last post, I wrote about making a cluttered data set easier to work with by removing unneeded columns entirely, and by displaying just those columns you want to work with now. But too much unneeded data isn't always the problem.

What can you do when someone gives you data that isn't organized the way you need it to be?  

That happens for a variety of reasons, but most often it's because the format that is simplest for collecting the data isn't a format that's easy to analyze in a worksheet. Most statistical software will accept a wide range of data layouts, but just because a layout is readable doesn't mean it will be easy to analyze.

You may not be in control of how your data were collected, but you can use tools like sorting, stacking, and ordering to put your data into a format that makes sense and is easy for you to use. 

Decide How You Want to Organize Your Data

Depending on how it's arranged, the same data can be easier to work with, simpler to understand, and can even yield deeper and more sophisticated insights. I can't tell you the best way to organize your specific data set, because that will depend on the types of analysis you want to perform and the nature of the data you're working with. However, I can show you some easy ways to rearrange your data into the form you choose. 

Unstack Data to Make Multiple Columns

The data below show concession sales for different types of events held at a local theater. 

stacked data

If we wanted to perform an analysis that requires each type of event to be in its own column, we can choose Data > Unstack Columns... and complete the dialog box as shown:

unstack columns dialog 

Minitab creates a new worksheet that contains a separate column of Concessions sales data for each type of event:

Unstacked Data
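
For readers who like to script their data prep, here's a minimal pandas sketch of the same unstacking idea. The Event and Concessions names follow this example; the numbers are invented:

  import pandas as pd

  # Hypothetical stacked data: one Concessions value per row, labeled by Event
  stacked = pd.DataFrame({
      "Event": ["Play", "Concert", "Play", "Lecture", "Concert"],
      "Concessions": [180, 420, 205, 95, 390],
  })

  # Spread Concessions into one column per event type --
  # the same idea as Data > Unstack Columns
  unstacked = (stacked
               .assign(obs=stacked.groupby("Event").cumcount())
               .pivot(index="obs", columns="Event", values="Concessions"))

  print(unstacked)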

Stack Data to Form a Single Column (with Grouping Variable)

A similar tool will help you put data from separate columns into a single column for the type of analysis required. The data below show sales figures for four employees: 

Select Data > Stack > Columns... and select the columns you wish to combine. Checking the "Use variable names in subscript column" will create a second column that identifies the person who made each sale. 

Stack columns dialog

When you press OK, the sales data are stacked into a single column of measurements and ready for analysis, with Employee available as a grouping variable: 

stacked columns
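
The reverse operation is just as quick to script. Here's a hedged pandas sketch of stacking; the employee names and sales figures are placeholders, not data from the post:

  import pandas as pd

  # Hypothetical wide data: one column of sales per employee
  wide = pd.DataFrame({
      "Amy":  [120, 135, 150],
      "Bob":  [110, 140, 125],
      "Cara": [160, 155, 170],
      "Dan":  [130, 120, 145],
  })

  # Stack the four columns into one, with Employee as a grouping variable --
  # the same idea as Data > Stack > Columns with the subscript option checked
  stacked = wide.melt(var_name="Employee", value_name="Sales")

  print(stacked)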

Sort Data to Make It More Manageable

The following data appear in the worksheet in the order in which individual stores in a chain sent them into the central accounting system.

When the data appear in this uncontrolled order, finding an observation for any particular item, or from any specific store, would entail reviewing the entire list. We can fix that problem by selecting Data > Sort... and reordering the data by either store or item. 

sorted data by item    sorted data by store
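
If you prefer code, sorting is a one-liner in pandas. This sketch assumes Store and Item columns like the example; the values themselves are invented:

  import pandas as pd

  # Hypothetical transactions arriving in arbitrary order
  sales = pd.DataFrame({
      "Store": ["North", "South", "North", "East"],
      "Item":  ["Mugs", "Hats", "Hats", "Mugs"],
      "Units": [14, 9, 22, 5],
  })

  by_item  = sales.sort_values(["Item", "Store"])   # like sorting by Item in Data > Sort
  by_store = sales.sort_values(["Store", "Item"])   # like sorting by Store

  print(by_store)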

Merge Multiple Worksheets

What if you need to analyze information about the same items, but that were recorded on separate worksheets?  For instance, if one group was gathering historic data about all of a corporation's manufacturing operations, while another was working on strategic planning, and your analysis required data from each? 

two worksheets

You can use Data > Merge Worksheets to bring the data together into a single worksheet, using the Division column to match the observations:

merging worksheets

You can also choose whether or not multiple, missing, or unmatched observations will be included in the merged worksheet.  
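
Outside Minitab, the same match-and-merge step might look like this pandas sketch. The Division column follows the example; the other column names and values are made up:

  import pandas as pd

  # Hypothetical worksheets that share a Division column
  historic  = pd.DataFrame({"Division": ["A", "B", "C"], "Output": [510, 460, 395]})
  strategic = pd.DataFrame({"Division": ["B", "C", "D"], "Budget": [1.2, 0.8, 2.1]})

  # Match observations by Division; how="outer" keeps unmatched rows,
  # how="inner" drops them -- roughly the include/exclude choices
  # offered by Data > Merge Worksheets
  merged = historic.merge(strategic, on="Division", how="outer")

  print(merged)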

Reorganizing Data for Ease of Use and Clarity

Making changes to the layout of your worksheet does entail a small investment of time, but it can bring big returns in making analyses quicker and easier to perform. The next time you're confronted with raw data that isn't ready to play nice, try some of these approaches to get it under control. 

In my next post, I'll share some tips and tricks that can help you get more information out of your data.

Leaving Out-of-control Points Out of Control Chart Calculations Looks Hard, but It Isn't


Houston skylineControl charts are excellent tools for looking at data points that seem unusual and for deciding whether they're worthy of investigation. If you use control charts frequently, then you're used to the idea that if certain subgroups reflect temporary abnormalities, you can leave them out when you calculate your center line and control limits. If you include points that you already know are different because of an assignable cause, you reduce the sensitivity of your control chart to other, unknown causes that you would want to investigate. Fortunately, Minitab Statistical Software makes it fast and easy to leave points out when you calculate your center line and control limits. And because Minitab’s so powerful, you have the flexibility to decide if and how the omitted points appear on your chart.

Here’s an example with some environmental data taken from the Meyer Park ozone detector in Houston, Texas. The data are the readings at midnight from January 1, 2014 to November 9, 2014. (My knowledge of ozone is too limited to properly chart these data, but they’re going to make a nice illustration. Please forgive my scientific deficiencies.) If you plot these on an individuals chart with all of the data, you get this:

The I-chart shows seven out-of-control points between May 3rd and May 17th.

Beginning on May 3, a two-week period contains 7 out of 14 days where the ozone measurements are higher than you would expect based on the amount that they normally vary. If we know the reason that these days have higher measurements, then we could exclude them from the calculation of the center line and control limits. Here are the three options for what to do with the points:

Three ways to show or hide omitted points

Like it never happened

One way to handle points that you don't want to use to calculate the center line and control limits is to act like they never happened. The points neither appear on the chart, nor are there gaps that show where omitted points were. The fastest way to do this is by brushing:

  1. On the Graph Editing toolbar, click the paintbrush.

The paintbrush is between the arrow and the crosshairs.

  2. Click and drag a square that surrounds the 7 out-of-control points.
  3. Press CTRL + E to recall the Individuals chart dialog box.
  4. Click Data Options.
  5. Select Specify which rows to exclude.
  6. Select Brushed Rows.
  7. Click OK twice.

On the resulting chart, the upper control limit changes from 41.94 parts per billion to 40.79 parts per billion. The new limits indicate that April 11 was also a measurement that's larger than expected based on the variation typical of the rest of the data. These two facts will be true on the control chart no matter how you treat the omitted points. What's special about this chart is that there's no suggestion that any other data exists. The focus of the chart is on the new out-of-control point:

The line between the data is unbroken, even though other data exists.
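
If you're curious about the arithmetic behind those shifting limits, an individuals chart puts its center line at the mean of the plotted points and its control limits about 2.66 average moving ranges on either side. Here's a small Python sketch of that recalculation; the readings and the flagged rows are invented stand-ins, not the Houston ozone data:

  import numpy as np

  def i_chart_limits(x):
      """Center line and control limits for an individuals (I) chart."""
      x = np.asarray(x, dtype=float)
      center = x.mean()
      mr_bar = np.abs(np.diff(x)).mean()   # average moving range
      half_width = 2.66 * mr_bar           # 3 / d2, with d2 = 1.128
      return center - half_width, center, center + half_width

  # Hypothetical readings; the real limits (41.94 -> 40.79 ppb) come from
  # the full data set, not from these numbers
  readings = np.array([28, 31, 27, 35, 30, 52, 55, 49, 33, 29, 26, 32])
  flagged = [5, 6, 7]                      # rows with an assignable cause

  print(i_chart_limits(readings))                       # limits from all points
  print(i_chart_limits(np.delete(readings, flagged)))   # limits without them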

Guilty by omission

A display that only shows the data used to calculate the control line and center limits might be exactly what you want, but you might also want to acknowledge that you didn't use all of the data in the data set. In this case, after step 6, you would check the box labeled Leave gaps for excluded points. The resulting gaps look like this:

Gaps in the control limits and data connect lines show where points were omitted.

In this case, the spaces are most obvious in the control limit line, but the gaps also exist in the lines that connect the data points. The chart shows that some data was left out.

Hide nothing

In many cases, not showing data that wasn't in the calculations for the center line and control limits is effective. However, we might want to show all of the points that were out-of-control in the original data. In this case, we would still brush the points, but not use the Data Options. Starting from the chart that calculated the center line and control limits from all of the data, these would be the steps:

  1. On the Graph Editing toolbar, click the paintbrush.

The paintbrush is between the arrow and the crosshairs.

  2. Click and drag a square that surrounds the 7 out-of-control points.
  3. Press CTRL + E to recall the Individuals chart dialog box. Arrange the dialog box so that you can see the list of brushed points.
  4. Click I Chart Options.
  5. Select the Estimate tab.
  6. Under Omit the following subgroups when estimating parameters, enter the row numbers from the list of brushed points.
  7. Click OK twice.

This chart still shows the new center line, control limits, and out-of-control point, but also includes the points that were omitted from the calculations.

Points not in the calculations are still on the chart.

Wrap up

Control charts help you to identify when some of your data are different than the rest so that you can examine the cause more closely. Developing control limits that exclude data points with an assignable cause is easy in Minitab and you also have the flexibility to decide how to display these points to convey the most important information. The only thing better than getting the best information from your data? Getting the best information from your data faster!

Ready for more? Check out some more tips about optimizing the performance of your control charts!

 

The image of the Houston skyline is from Wikimedia commons and is licensed under this creative commons license.

The Power of Multivariate ANOVA (MANOVA)


Willy WonkaAnalysis of variance (ANOVA) is great when you want to compare the differences between group means. For example, you can use ANOVA to assess how three different alloys are related to the mean strength of a product. However, most ANOVA tests assess one response variable at a time, which can be a big problem in certain situations. Fortunately, Minitab statistical software offers a multivariate analysis of variance (MANOVA) test that allows you to assess multiple response variables simultaneously.

In this post, I’ll run through a MANOVA example, explain the benefits, and cover how to know when you should use MANOVA.

Limitations of ANOVA

Whether you're using general linear model (GLM) or one-way ANOVA, most ANOVA procedures can only assess one response variable at a time. Even with GLM, where you can include many factors and covariates in the model, the analysis simply cannot detect multivariate patterns across the response variables.

This limitation can be a huge roadblock for some studies because it may be impossible to obtain significant results with a regular ANOVA test. You don’t want to miss out on any significant findings!

Example That Compares MANOVA to ANOVA

What the heck are multivariate patterns in the response variable? It sounds complicated but it’s very easy to show the difference between how ANOVA and MANOVA tests the data by using graphs.

Let’s assume that we are studying the relationship between three alloys and the strength and flexibility of our products. Here is the dataset for the example.

The two individual value plots below show how one-way ANOVA analyzes the data—one response variable at a time. In these graphs, alloy is the factor and strength and flexibility are the response variables.

Individual value plot of strength by alloyIndividual value plot of flexibility by alloy

The two graphs seem to show that the type of alloy is not related to either the strength or flexibility of the product. When you perform the one-way ANOVA procedure for these graphs, the p-values for strength and flexibility are 0.254 and 0.923 respectively.

Drat! I guess Alloy isn't related to either Strength or Flexibility, right? Not so fast!

Now, let’s take a look at the multivariate response patterns. To do this, I’ll display the same data with a scatterplot that plots Strength by Flexibility with Alloy as a categorical grouping variable.

Scatterplot of strength by flexibility grouped by alloy

The scatterplot shows a positive correlation between Strength and Flexibility. MANOVA is useful when you have correlated response variables like these. You can also see that for a given flexibility score, Alloy 3 generally has a higher strength score than Alloys 1 and 2. We can use MANOVA to statistically test for this response pattern to be sure that it’s not due to random chance.

To perform the MANOVA test in Minitab, go to: Stat > ANOVA > General MANOVA. Our response variables are Strength and Flexibility and the predictor is Alloy.

Whereas one-way ANOVA could not detect the effect, MANOVA finds it with ease. The p-values in the results are all very significant. You can conclude that Alloy influences the properties of the product by changing the relationship between the response variables.

MANOVA results

For a more complete guide on how to interpret MANOVA results in Minitab, go to: Help > StatGuide > ANOVA > General MANOVA.
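
If you'd like to reproduce this kind of test outside Minitab, the statsmodels package in Python includes a MANOVA class. Here's a minimal sketch assuming a data set shaped like the alloy example; the numbers below are placeholders, not the example data:

  import pandas as pd
  from statsmodels.multivariate.manova import MANOVA

  # Hypothetical data in the same shape as the alloy example
  df = pd.DataFrame({
      "Alloy":       ["1", "1", "1", "2", "2", "2", "3", "3", "3"],
      "Strength":    [22.1, 23.4, 22.8, 21.8, 22.9, 22.3, 25.0, 25.7, 24.6],
      "Flexibility": [10.2, 11.0, 10.6, 10.5, 11.1, 10.8, 11.8, 12.3, 12.0],
  })

  # Test both response variables against Alloy at once
  fit = MANOVA.from_formula("Strength + Flexibility ~ Alloy", data=df)
  print(fit.mv_test())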

When and Why You Should Use MANOVA

Use multivariate ANOVA when you have continuous response variables that are correlated. In addition to multiple responses, you can also include multiple factors, covariates, and interactions in your model. MANOVA uses the additional information provided by the relationship between the responses to provide three key benefits.

  • Increased power: If the response variables are correlated, MANOVA can detect differences too small to be detected through individual ANOVAs.
  • Detects multivariate response patterns: The factors may influence the relationship between responses rather than affecting a single response. Single-response ANOVAs can miss these multivariate patterns as illustrated in the MANOVA example.
  • Controls the family error rate: Your chance of incorrectly rejecting the null hypothesis increases with each successive ANOVA. Running one MANOVA to test all response variables simultaneously keeps the family error rate equal to your alpha level.

Are Preseason Football or Basketball Rankings More Accurate?


College basketball season tips off today, and for the second straight season Kentucky is the #1 ranked preseason team in the AP poll. Last year Kentucky did not live up to that ranking in the regular season, going 24-10 and earning a lowly 8 seed in the NCAA tournament. But then, in the tournament, they overachieved and made a run all the way to the championship game...before losing to Connecticut.

In football, Florida State was the AP poll preseason #1 football team. While they are currently still undefeated, they aren't quite playing like the #1 team in the country. So this made me wonder, which preseason rankings are more accurate, football or basketball?

I gathered data from the last 10 seasons, and recorded the top 10 teams in the preseason AP poll for both football and basketball. Then I recorded the difference between their preseason ranking and their final ranking. Both sports had 10 teams that weren’t ranked or receiving votes in the final poll, so I gave all of those teams a final ranking of 40.

Creating a Histogram to Compare Two Distributions

Let’s start with a histogram to look at the distributions of the differences. (It's always a good idea to look at the distribution of your data when you're starting an analysis, whether you're looking at quality improvement data for work or sports data for yourself.) 

You can create this graph in Minitab Statistical Software by selecting Graph > Histograms, choosing "With Groups" in the dialog box, and using the Basketball Difference and Football Difference columns as the graph variables:

Histogram

The differences in the rankings appear to be pretty similar. Most of the data is towards the left side of this histogram, meaning for most cases the difference between the preseason and final ranking is pretty small.
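
As an aside, if your analysis lives in Python rather than Minitab, overlaid histograms are easy to sketch with matplotlib. The arrays below are simulated placeholders standing in for the two difference columns:

  import numpy as np
  import matplotlib.pyplot as plt

  # Simulated stand-ins for the Basketball Difference and Football Difference columns
  rng = np.random.default_rng(7)
  basketball = rng.integers(0, 40, size=100)
  football   = rng.integers(0, 40, size=100)

  bins = range(0, 45, 5)
  plt.hist(basketball, bins=bins, alpha=0.5, label="Basketball difference")
  plt.hist(football,   bins=bins, alpha=0.5, label="Football difference")
  plt.xlabel("Preseason minus final ranking")
  plt.ylabel("Count")
  plt.legend()
  plt.show()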

Conducting a Mann-Whitney Hypothesis Test on Two Medians

We can further investigate the data by performing a hypothesis test. Because the data is heavily skewed, I’ll use a Mann-Whitney test. This compares the medians of two samples with similarly-shaped distributions, as opposed to a 2-sample t test, which compares the means. The median is the middle value of the data. Half the observations are less than or equal to it, and half the observations are greater than or equal to it. 

To perform this test in our statistical software, we select Stat > Nonparametrics > Mann-Whitney, then choose the appropriate columns for our first and second sample: 

Mann-Whitney Test

The basketball rankings have a smaller median difference than the football rankings. However, when we examine the p-value we see that this difference is not statistically significant. There is not enough evidence to conclude that one preseason poll is more accurate than the other.
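
The same comparison is available in Python through scipy, if that's your tool of choice. A hedged sketch, with invented difference values standing in for the real columns:

  from scipy.stats import mannwhitneyu

  # Invented rank differences; substitute the real worksheet columns
  basketball = [2, 5, 0, 12, 40, 3, 7, 1, 9, 4]
  football   = [6, 3, 40, 8, 2, 15, 5, 40, 11, 7]

  stat, p = mannwhitneyu(basketball, football, alternative="two-sided")
  print(f"U = {stat}, p-value = {p:.3f}")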

But what about the best teams? I grouped each of the top 3 ranked teams and looked at the median difference between their preseason and final rank.

Bar Chart

The preseason AP basketball poll has a smaller difference for the #1 and #3 ranked teams. But the football poll is better for the #2 team, having an impressive median value of 1. Overall, both polls are relatively good, as neither has a median value greater than 6. And the differences are close enough that we can’t conclude that one is more accurate than the other.

What Does It Mean for the Teams?

While the odds are against both Kentucky and Florida State to finish the season ranked #1 in their respective polls, previous seasons indicate that they’re still likely to finish as one of the top teams. This is better news for Kentucky, as being one of the top teams means they’ll easily make the NCAA basketball tournament and get a high seed. However, Florida State must finish as one of the top 4 teams, or else they’ll miss out on the football postseason completely.

So while we can’t conclude one poll is better than the other, teams at the top of the AP basketball poll are clearly much more likely to reach the postseason than teams at the top of the football poll.


What to Do When Your Data's a Mess, part 3


Everyone who analyzes data regularly has the experience of getting a worksheet that just isn't ready to use. Previously I wrote about tools you can use to clean up and eliminate clutter in your data and reorganize your data. 

In this post, I'm going to highlight tools that help you get the most out of messy data by altering its characteristics.

Know Your Options

Many problems with data don't become obvious until you begin to analyze it. A shortcut or abbreviation that seemed to make sense while the data was being collected, for instance, might turn out to be a time-waster in the end. What if abbreviated values in the data set only make sense to the person who collected it? Or a column of numeric data accidentally gets coded as text?  You can solve those problems quickly with statistical software packages.

Change the Type of Data You Have

Here's an instance where a data entry error resulted in a column of numbers being incorrectly classified as text data. This will severely limit the types of analysis that can be performed using the data.

misclassified data

To fix this, select Data > Change Data Type and use the dialog box to choose the column you want to change.

change data type menu

One click later, and the errant text data has been converted to the desired numeric format:

numeric data
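
The pandas equivalent of that one-click conversion is pd.to_numeric. This sketch uses a made-up Sales column just to show the idea:

  import pandas as pd

  # Hypothetical column that was read in as text instead of numbers
  df = pd.DataFrame({"Sales": ["1200", "980", "1415", "1030"]})
  print(df.dtypes)      # Sales shows up as 'object' (text)

  # Convert to numeric, analogous to Data > Change Data Type;
  # errors="coerce" turns anything unconvertible into a missing value
  df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce")
  print(df.dtypes)      # Sales is now numeric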

Make Data More Meaningful by Coding It

When this company collected data on the performance of its different functions across all its locations, it used numbers to represent both locations and units. 

uncoded data

That may have been a convenient way to record the data, but unless you've memorized what each set of numbers stands for, interpreting the results of your analysis will be a confusing chore. You can make the results easy to understand and communicate by coding the data. 

In this case, we select Data > Code > Numeric to Text...

code data menu

And we complete the dialog box as follows, telling the software to replace the numbers with more meaningful information, like the town each facility is located in.  

Code data dialog box

Now you have data columns that can be understood by anyone. When you create graphs and figures, they will be clearly labelled.  

Coded data
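
Scripting the same recoding step takes one mapping dictionary in pandas. The post doesn't list the actual code-to-town pairs, so the names below are invented purely for illustration:

  import pandas as pd

  # Hypothetical location codes
  df = pd.DataFrame({"LocationCode": [1, 2, 3, 1, 2]})

  # Made-up town names -- the same idea as Data > Code > Numeric to Text
  towns = {1: "Springfield", 2: "Riverton", 3: "Lakeside"}
  df["Location"] = df["LocationCode"].map(towns)

  print(df)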

Got the Time? 

Dates and times can be very important in looking at performance data and other indicators that might have a cyclical or time-sensitive effect.  But the way the date is recorded in your data sheet might not be exactly what you need. 

For example, if you wanted to see if the day of the week had an influence on the activities in certain divisions of your company, a list of dates in the MM/DD/YYYY format wouldn't be very helpful.   

date column

You can use Data > Date/Time > Extract to Text... to identify the day of the week for each date.

extract-date-to-text

Now you have a column that lists the day of the week, and you can easily use it in your analysis. 

day column
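
If you're doing this step in Python instead, pandas can parse the dates and hand back the weekday names directly. The dates below are placeholders in the MM/DD/YYYY format mentioned above:

  import pandas as pd

  # Placeholder dates stored as MM/DD/YYYY text
  df = pd.DataFrame({"Date": ["04/03/2015", "04/04/2015", "04/07/2015"]})

  # Parse the text, then pull out the weekday name --
  # the same idea as Data > Date/Time > Extract to Text
  df["Day"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.day_name()

  print(df)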

Manipulating for Meaning

These tools are commonly seen as a way to correct data-entry errors, but as we've seen, you can use them to make your data sets more meaningful and easier to work with.

There are many other tools available in Minitab's Data menu, including an array of options for arranging, combining, dividing, fine-tuning, rounding, and otherwise massaging your data to make it easier to use. Next time you've got a column of data that isn't quite what you need, try using the Data menu to get it into shape.

 

 

Lessons in Quality from Guadalajara and Mexico City


View of Mexico CityLast week, thanks to the collective effort from many people, we held very successful events in Guadalajara and Mexico City, which gave us a unique opportunity to meet with over 300 Spanish-speaking Minitab users. They represented many different industries, including automotive, textile, pharmaceutical, medical devices, oil and gas, electronics, and mining, as well as academic institutions and consultants.

As I listened to my peers Jose Padilla and Marilyn Wheatley deliver their presentations, it was interesting to see people's reactions as they learned more about our products and services. Several attendees were particularly pleased to learn more about Minitab's ease-of-use and step-by-step help with analysis offered by the Assistant menu. I saw others react to demonstrations of Minitab's comprehensive Help system, the use of executables for automation purposes, and several of the tips and tricks discussed throughout our presentations.

We also had multiple conversations on Minitab's flexible licensing options. Several attendees who spend a lot of time on the road were particularly glad to learn about our borrowing functionality, which lets you “check out” a license so you can use Minitab software without accessing your organization’s license server.

Acceptance Sampling Plans

There were plenty of technical discussions as well. One interesting question came from a user who asked how Minitab's Acceptance Sampling Plans compare to the ANSI Z1.4 standard (a.k.a. MIL-STD 105E). The short answer is that the tables provided by ANSI Z1.4 are for a specific AQL (Acceptable Quality Level), while implicitly assuming a certain RQL (Rejectable Quality Level) based solely on the lot size. ANSI Z1.4 is an AQL-based system, while Minitab's acceptance sampling plans give you the flexibility to create a customized sampling scheme for a specific AQL, RQL, or lot size using either the binomial or the hypergeometric distribution.
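
To make the AQL/RQL distinction concrete, here's a short Python sketch that computes the probability of accepting a lot under a single sampling plan using the binomial model. The plan (n = 125, c = 3) and the quality levels are made-up numbers for illustration, not values taken from the standard:

  from scipy.stats import binom

  n, c = 125, 3            # hypothetical plan: sample 125 units, accept if <= 3 are defective
  aql, rql = 0.01, 0.05    # hypothetical acceptable and rejectable quality levels

  # Probability of accepting the lot at each incoming quality level
  print("P(accept at AQL):", binom.cdf(c, n, aql))
  print("P(accept at RQL):", binom.cdf(c, n, rql))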

Destructive Testing and Gage R&R

Other users had questions about Gage R&R and destructive testing. Practitioners commonly assess a destructive test using Nested Gage R&R; however, this is not always necessary. The main problem with destructive testing is that every part tested is destroyed and thus can only be measured by a single operator. Since the purpose of this type of analysis is to measure the repeatability and reproducibility of the measurement system, one must identify parts that are as homogeneous as possible. Typically, instead of 10 parts, practitioners may use multiple parts from each of 10 batches. If the within-batch variation is small enough, then the parts from each batch can be considered to be "the same," and the readings measured by all the operators can be used to produce repeatability and reproducibility measures. The main trick is to have homogeneous units or batches that can give you enough samples to be tested by all operators for all replicates. If this is the case, you can analyze a destructive test with crossed Gage R&R.

Control Charts and Subgroup Size

We also had an interesting discussion about the sensitivity of Shewhart control charts to the subgroup size. Specifically, one of the attendees asked our recommendation for subgroup size: 4, or 5? 

The answer to this intriguing question requires an understanding of why subgroups are recommended in the first place. Control charts have limits constructed so that if the process is stable, the probability of observing points outside the control limits is very small; this probability is typically referred to as the false alarm rate, and it is usually set at 0.0027. This calculation assumes the process is normally distributed, so if we plot the individual data, as on an Individuals chart, the control limits are effective at detecting an out-of-control situation only if the data come from a normal distribution. To reduce this dependence on normality, Shewhart suggested collecting the data in subgroups: if we plot the subgroup means instead of the individual data, the control limits become less and less sensitive to normality as the subgroup size increases. This is a result of the Central Limit Theorem (CLT), which states that regardless of the underlying distribution of the data, if we take independent samples and compute the average (or sum) of the observations in each sample, the distribution of those sample means converges to a normal distribution.

So, going back to the original question, what is the recommended subgroup size for building control charts? The answer depends on how skewed the underlying distribution may be. For many distributions, a subgroup size of 5 is sufficient for the CLT to kick in and make the control charts robust to departures from normality; however, for extremely skewed distributions like the exponential, the subgroup size may need to be much larger than 50. This topic was discussed in a paper by Schilling and Nelson titled "The Effect of Non-normality on the Control Limits of Xbar Charts," published in the Journal of Quality Technology back in 1976.
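
You can see the effect for yourself with a quick simulation. This sketch is my own illustration, not part of the discussion in Mexico: it draws exponential data and shows how the skewness of the subgroup means shrinks as the subgroup size grows.

  import numpy as np
  from scipy.stats import skew

  rng = np.random.default_rng(0)
  data = rng.exponential(scale=1.0, size=100_000)   # heavily skewed individual values

  # Skewness of the subgroup means shrinks as the subgroup size grows (the CLT at work)
  for n in (1, 5, 25, 50):
      means = data[: (len(data) // n) * n].reshape(-1, n).mean(axis=1)
      print(f"subgroup size {n:>2}: skewness of means = {skew(means):.2f}")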

Analyzing Variability

We also had a great discussion about modeling variability in a process. One of the attendees, who works for McDonald's, was looking for statistical methods for reducing the variation in the weight of apple slices. An apple is cut into 10 slices, and the goal was to minimize the variation in weight so that exactly four slices could be placed in each bag without further rework. This gave me the opportunity to demonstrate how to use the Analyze Variability command in Minitab, which happens to be one of the topics we cover in our DOE in Practice course.

We Love Your Questions

For me and my fellow trainers, there’s nothing better than talking with people who are using Minitab software to solve problems.  Sometimes we’re able to provide a quick, helpful answer.  Sometimes a question provokes a great discussion about some quality challenge we all have in common. And sometimes a question will lead to a great idea that we’re able to share with our developers and engineers to make our software better. 

If you have a question about Minitab, statistics, or quality improvement, please feel free to comment here.  And if you use Minitab software, you can always contact our customer support team for direct assistance from specialists in IT, statistics, and quality improvement.

 

Giving Thanks for Ways to Edit a Bar Chart of Pies


My siblings occasionally remind me that because I’m getting older, one day, my metabolism is going to collapse. When that day comes, consuming mass quantities of food will surely lead to the collapse of my body, mind, and soul. But, as that day is coming slowly, on Thanksgiving, I’m an every-pie-kind-of-guy.

Now, I know what you’re thinking. It’s Thanksgiving. I’ve just mentioned pies. We’re going to look at pie charts of pies. If you really want to look at pie charts of pies, go ahead and get it out of your system:

2012 survey by National Public Radio about pie preferences

2008 survey by Schwan’s Consumer Brands North  America

A Robot that Puts Pie Charts onto Actual Pies

In this post, we’re going to do something more like this:

At our house, we usually do three pies for Thanksgiving: Pumpkin, Chess, and Pecan. I’m going to use a chart of these to show you the things I’m most thankful you can do after you’ve made your bar chart in Minitab. Let’s say that we start with a chart of the calories per slice.

The default graph has all blue bars. In this case, the order of the bars is the order from the worksheet.

Reorder the bars

These bars are presently in the order that they were listed in the worksheet. But I like to eat them in order of difficulty, starting with the pecan and easing towards the pumpkin. This tends to follow the order of the calories, so we can put the pies in descending order.

  1. Double-click the bars.
  2. Select the Chart Options tab.
  3. In Order Main X Groups By, select Decreasing Y. Click OK.

The pecan pie is on the left because it has the most calories. Other pies follow in descending order.

Add labels that show the y-values

Bar charts are great for making comparisons. Ordering them makes it even clearer which categories are greatest and which are least. But if you want to get precise numbers, you can easily add labels that show the values from the data.

  1. Right-click the graph.
  2. Select Add > Data Labels. Click OK.

The numbers above the bars give the exact number of calories per slice.

Accumulate bars

As an every-pie-kind-of-guy, one of the things I might want to know is how many calories I eat when I have a slice of each pie.  That’s the kind of situation when it’s helpful to accumulate Y across X.

  1. Double-click the bars.
  2. Select the Chart Options tab.
  3. In Percent and Accumulate, check Accumulate Y across X. Click OK.

The resulting graph shows the number of calories for a slice of pecan, for a slice of pecan and a slice of chess, and for a slice of all 3.

The right bar shows the number of calories if I eat one slice of each pie.

Edit the fill patterns

When you're making a graph about something as colorful as pies, it's often helpful to give the bars colors that represent the categories in the data. In this case, all you have to do is follow these steps:

  1. Click the bars in the graph once to select all of them.
  2. Click one of the bars in the graph once to select only one bar.
  3. Double-click the selected bar to edit the bar.
  4. In Fill Pattern, select Custom.
  5. From Background color, select the color that represents your category. Click OK.

For example, we could make the pecan bar “chestnut,” the chess bar “gold,” and the pumpkin bar “orange.”

Colors of the bars are the colors of the pies.

It’s generally best to leave this step to last, because some other editing steps, like changing the order, can change the bar colors.

Wrap up

Very often, editing a graph so that it presents the message that you want is easier once you’re able to see the graph. That makes it wonderful that it’s so easy to edit a graph after you’ve already made it in Minitab. To see even more about what you can do with different types of graphs, check out the list of graph options. And have a Happy Thanksgiving where you are!

Four Quick Tips for Editing Control Charts


Hi everyone! Over the past month, I fielded some interesting customer calls regarding control chart creation and editing. I wanted to share these potential scenarios with you in hopes that you will find them informative and useful. For these scenarios, I used the XBar-R chart as my template, but you could easily apply them to many of the other control charts in Minitab. 

Scenario 1: Create a Control Chart with Stages

Suppose you want to create an XBar-R Chart with stages. Stages show how a process changes over specific time periods. At each stage, Minitab Statistical Software recalculates the center line and control limits on the chart by default.

You decide to create a two-stage chart. However, you want to use a historical standard deviation of 2 for the first stage, as opposed to letting Minitab calculate it for you. For the second stage, you’ll let Minitab calculate the standard deviation.

You can enter historical estimates for the standard deviation under the Parameters tab under the Xbar-R Options sub-menu:

You may be inclined to enter 2 in the second box and hit OK, but that will set the standard deviation to 2 for both stages. You’ll need to add an asterisk to represent the stage that is not affected by the historical estimate:

The resulting Xbar-R chart will set the standard deviation to 2 for the first stage, leaving the second stage unaffected.

Scenario 2: Showing Control Limits for Different Stages

Let’s piggyback off of our first scenario and look at an Xbar-R chart with stages:

You'll notice that only the last stage’s control limits are displayed, but you really want the first stage's to be displayed as well. This change can be made in the Xbar-R Options sub-menu, under the Display tab:

After checking this box and hitting OK a few times, your Xbar-R Chart will show the control limits for all stages. You could set this to be the default behavior under Tools > Options > Control Charts and Quality Tools > Other:

Scenario 3: Hiding Symbols on Your Control Charts

On our Xbar-R chart, let's hide the symbols from the top graph (Xbar) by right-clicking on one of the symbols and going to the Edit Symbols sub-menu…

Choose "None" under "Custom" and select OK. The symbols disappear from the top graph. But what if we have a change of heart and want those symbols back? We right-click on different places on the graph, but all we see are options for “Edit Figure Region…”, “Edit Data Region”, or “Edit Connect Line…” Uh-oh.  Have we lost our symbols?

Fortunately, we have not.  Go to the Editor Menu. (Make sure you have your graph selected prior to doing this. The Editor Menu dynamically changes based on what you are currently selecting with Minitab.) Under Editor > Select Item, select "Symbols." If we had hidden the points from the bottom graph, we would have selected "Symbols 2."

After the symbols have been selected, press CTRL + T to go to the ‘Edit Symbols’ dialog window or simply to go Editor > Edit Symbols…

Scenario 4: Setting Data Labels on Control Charts

I would like to close this post with a very minor but useful tip. When adding reference lines to a control chart, you can choose whether you want the data label to appear on the lower end or higher end of the line. First add a reference line to your control chart by right clicking on your chart and selecting Add > Reference Lines…

Fill out the dialog as you please by entering a few values, and hit OK to add your reference line to the chart. With the reference line selected, press CTRL+T to open up the Edit Reference Lines dialog. Under the Show tab, you can choose what side of the reference line you want the label to appear.

Before:

After Changing to ‘Low Side’:

 

Sharing these tips has helped many people who have contacted us about control charts over the years, and I hope they will help you the next time you find yourself in our control chart menus!

 

An Unauthorized Biography of the Stem-And-Leaf Plot, Part I: A Stem by Any Other Name


Greetings fair reader. In the past, I've written several posts with practical tips related to Minitab graphs, such as:

In this post, I thought I'd take a step back and explore the historical side of Minitab's graphs. Specifically, we'll explore the history behind the dodo bird of the graph kingdom: the Stem-and-Leaf plot. 

Recently, I was chatting with Bob (not her real name) in Minitab's excellent Technical Support department. Bob recently got a call from a customer who wanted to know how the Stem-and-Leaf plot got its name. Imagine my shock and horror as Bob explained that many budding statisticians don't know this story! (For the record, I personally do not subscribe to the notion that budding is the only successful reproductive strategy for statisticians like myself. There's also cloning.) 

For the curious, but uninitiated, here's an example of a Stem-and-Leaf plot:

Each digit in the right column represents a single value from the sample. The left column serves as a base of equally spaced intervals (or "bins," as I will later call them). Together, the value in the left column and the digits in the right column give you the data values in each bin. For example, the largest value in the sample of budding rates is 15.3 (bottom row).

At this juncture, experienced readers might ask, "Wait a minute, what about the counts?" I, for one, take no stock in antiquated monarchistic hierarchies, so I tend to leave out the counts. However, you are correct: Stem-and-Leaf plots often include a column of counts on the far left. For example, here is a Stem-and-Leaf plot as it appears in Minitab Statistical Software:

The left-most column shows the cumulative count of observations from the top to the middle and from the bottom to the middle. In the example, the counts indicate that the top 5 rows include a total of 14 observations, and the bottom 5 rows also contain 14 observations.
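
If you'd like to see the mechanics behind the display, here's a bare-bones Python sketch that builds a stem-and-leaf plot (without the cumulative counts). The sample values are invented, chosen only so the largest one is 15.3 like the example:

  from collections import defaultdict

  def stem_and_leaf(values, leaf_unit=0.1):
      """Print a simple stem-and-leaf display: one leaf digit per value."""
      stems = defaultdict(list)
      for v in sorted(values):
          scaled = int(round(v / leaf_unit))
          stems[scaled // 10].append(scaled % 10)
      for stem in range(min(stems), max(stems) + 1):
          leaves = "".join(str(d) for d in stems.get(stem, []))
          print(f"{stem:>4} | {leaves}")

  # Invented sample ending at 15.3, like the example above
  stem_and_leaf([12.1, 12.4, 13.0, 13.2, 13.7, 14.2, 14.8, 15.3])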

The Seeds of a Plot

The Stem-and-Leaf plot is sometimes called a "character graph", owing (no doubt) to the colorful and interesting characters who invented it: Dr. Woodrow "Woody" Stem, and Dr. August "Russell" Leaf. (I think they won the Nobel prize for their work. I don't know for sure, but I'd have to look it up, so let's go with that.)

Woody and Russell were aspiring students under the careful tutelage of renowned statistician, Dr. Histeaux Graham. The good professor challenged our heroes, as he did all of his students, to come up with a new and better way to examine the distribution of values in a sample. (Mind you, this was before computers, calculators, or even highly-leveraged derivatives trading.)

Legend has it that our heroes Woody and Russell "got into it" one night at a pub, after a particularly intense lecture by Dr. Graham on the twin scourges of platykurtosis and leptokurtosis. (Mind you, this was before penicillin.)

Woody was adamant that, in order to meet Dr. Graham's challenge, they must divide the sample into equal intervals (or "bins," as he would later call them).

Russell insisted that Woody's idea was a load of BS (Basic Statistics). "How can you understand a sample," Russell demanded, "by looking at a bunch of evenly spaced intervals, or 'bins' as you will no doubt resort to calling them?!?"

Sadly, it was at this juncture that our learned gentlemen succumbed to more pugilistic impulses: fisticuffs broke out. (Mind you, this was before boxing gloves, moisturizing cream, or even those convenient Isotoner one-size-fits-most gloves.)

I will spare you the details, but suffice it to say that the damage was significant. Woody looked like he had been worked over with the business end of a boxplot, and Russell's wallis was completely kruskaled. It looked like Dr. Graham's challenge might go unanswered. And it might have, but for an unlikely twist of fate.

To be continued ...
