
A Simple Guide to Multivariate Control Charts

This is an era of massive data. Huge amounts of data are being generated from the web and from customer relations records, not to mention from the sensors used in manufacturing (semiconductor, pharmaceutical, petrochemical, and many other industries).

Univariate Control Charts

In the manufacturing industry, critical product characteristics get routinely collected to ensure that all products at every step of the process remain well within specifications. Dedicated univariate control charts are deployed to ensure that any drift gets detected as early as possible to avoid negative effects on the final product performance. Ideally, when a special cause gets identified, the equipment should be immediately stopped until the issue gets resolved.

Monitoring Tool Process Parameters

In modern plants, many manufacturing tools are connected to IT networks so that tool process parameters can be collected and stored in real time (pressures, temperatures etc.). Unfortunately, this type of data is, very often, not continuously monitored, although we might expect process parameters to play an important role in terms of final product quality. When a quality incident occurs, data from these numerous upstream process parameters are sometimes retrieved from databases, to investigate (after the fact) why this incident took place in the first place.

A more efficient approach would be to monitor these process parameters in real time and try to understand how they affect complex manufacturing processes: Which process parameters are really important, and which ones are not? What are their best settings?

Multivariate Control Charts

Monitoring upstream tool parameters might lead to a huge increase in the number of control charts, though. In this context, process engineers might benefit from using multivariate charts which let you monitor up to 7 or 8 parameters together in a single chart. Rather than using equipment process parameter data  to investigate the causes of previous quality incidents in a fire-fighting mode, this approach would focus on long-term improvements.

Multivariate control charts are based on squared standardized (generalized) multivariate distances from the general mean. In Minitab, the T² Hotelling method is used to generate multivariate charts. If you don't already have Minitab and you'd like to try creating some of the charts I'm discussing, you can download the free 30-day trial.
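For intuition, here is a minimal Python sketch (not Minitab's implementation) of the squared generalized distance that underlies a T² chart, computed for made-up observations of two correlated parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    # Two correlated parameters (think pressure and temperature at one step)
    data = rng.multivariate_normal([50, 200], [[4, 3], [3, 9]], size=100)

    mean = data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))

    centered = data - mean
    t2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)  # squared distances
    print(t2[:5])  # unusually large values would signal an out-of-control point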

An obvious advantage of using multivariate charts is that they enable you to minimize the total number of control charts you need to manage, but there are some additional related benefits involved as well:

  • Analyzing process parameters jointly: Many process parameters are related to one another, for example, for a particular process step we might expect the pressure value to be large when temperature is high. Considering every process parameter separately is not necessarily a good option and might even be misleading. Detecting any mismatch between parameter settings may be very useful.

    In the graph below, the Y1 and Y2 parameter values are correlated (high values for Y1 are associated with high values for Y2) so that the red point in the lower right corner appears to be out-of-control (beyond the control ellipse) from a multivariate point of view. From a univariate perspective, this red point remains within the usual fluctuation bounds for both Y1 and Y2, though. This point clearly represents a mismatch between Y1 and Y2. The squared generalized multivariate distance from the red point to the scatterplot mean is unusually large.

[Figure: Scatterplot of Y1 vs. Y2 with a control ellipse]

  • Overall rate of false alarms: The probability of a false alarm with three-sigma limits in a single control chart is 0.27%. If 100 independent charts are monitored at the same time, the chance of at least one false alarm rises to roughly 24% (1 − (1 − 0.0027)^100); see the short calculation after this list.

    However, when numerous variables are monitored simultaneously using a single multivariate chart, the overall (family) rate of false alarms remains close to 0.27%.

  • 3-D measurements: When three-dimensional measurements of a product are taken, the amount of data needed to ensure that all dimensions (X, Y and Z) remain within specifications can get pretty big. But if the product gets damaged in a particular area, it will usually affect more than one dimension, so the three dimensions should not be considered separately from one another. If a multivariate chart simultaneously monitors deviations from the ideal planned X, Y, Z values, their combined effects will be taken into account.
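Here is the short calculation behind the false-alarm figures above, as a minimal Python sketch; the 0.27% single-chart rate and the assumption of 100 independent charts come from the bullet above:

    # Family-wise false alarm probability for k independent three-sigma charts
    alpha = 0.0027          # false alarm rate of a single chart
    k = 100                 # number of univariate charts monitored at once
    family_rate = 1 - (1 - alpha) ** k
    print(f"{family_rate:.1%}")  # roughly 24%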

A Simple Example 

Eight process parameters have been monitored using eight univariate Xbar control charts. No out-of-control point has been detected (see below):

[Figure: Xbar control charts for the eight process parameters]

The eight control charts above may be replaced by a single multivariate chart that monitors the eight variables simultaneously. Although no out-of-control point had been detected in the univariate charts, subgroup number 12 turns out to be out of control in the multivariate chart:

[Figure: Multivariate T² control chart of the eight parameters]

To investigate why an out-of-control point (subgroup 12) occurred in the multivariate chart, I used simple scatterplots to examine time trends. For the X3, X4, and X5 parameters, subgroup 12 is positioned far away from the other points.

[Figure: Scatterplots of X3, X4, and X5 over time]

Conclusion

When process parameters have no direct critical effect, a dedicated univariate chart is not necessarily required. Multivariate charts enable you to routinely monitor many tool process parameters with fewer charts. The objective would be to better understand whether out-of-control points in a multivariate chart can be used to anticipate quality issues in the final product characteristics.

To better control a process, we need to assess how upstream tool parameters affect the final product. Multivariate charts are also very useful to monitor 3-D measurements. Identifying the reason for an out-of-control point in a multivariate chart is a key aspect of using it successfully.


A Six Sigma Healthcare Project, part 1: Examining Factors with a Pareto Chart

Over the past year I've been able to work with and learn from practitioners and experts who are using data analysis and Six Sigma to improve the quality of healthcare, both in terms of operational efficiency and better patient outcomes. I've been struck by how frequently a very basic analysis can lead to remarkable improvements, but some insights cannot be attained without conducting more sophisticated analyses. One such situation is covered in a 2011 Quality Engineering article on the application of binary logistic regression in a healthcare Six Sigma project.

In this series of blog posts, I'll follow the path of the project discussed in that article and show you how to perform the analyses described using Minitab Statistical Software. (I am using simulated data, so my analyses will not match those in the original article.)

The Six Sigma Project Goal

The goal of this Six Sigma project was to attract and retain more patients in a hospital's cardiac rehabilitation program. On being discharged, heart-surgery patients are advised to join this program, which offers psychological support and guidance on a healthy diet and lifestyle. Program participants also have two or three physical therapy sessions per week, for up to 45 sessions.

An average of 33 new patients begin participating in the program per month, and participants attend an average of 29 sessions. But many discharged patients do not enroll in the program, and many who do drop out before they complete it. Greater rates of participation would benefit individual patients' health and increase the hospital's revenues.

The project team identified two critical metrics they might improve:

  • The number of patients participating in the program each month
  • The number of therapy sessions for each participant

The team set a goal to increase the average number of new participants to 36 per month, and to increase the average number of sessions each patient attends to 32.

Available Patient Data

Existing data on the hospital's cardiac patients includes:

  • The distance between each patient's home and the hospital
  • Patient's age and gender
  • Whether or not the patient has access to a car
  • Whether or not the patient participated in the rehabilitation program

To illustrate the analyses conducted for this project, we will use a simulated set of data for 500 patients. Download the data set to follow along and try these analyses yourself. If you don't already have Minitab, you can download and use our statistical software free for 30 days.

Exploring Why Patients Leave the Program with a Pareto Chart

Encouraging patients who start the program to complete it, or at least to attend a greater number of sessions, has the potential to be a quick and easy "win," so the project team began by looking at why 156 patients who started the program eventually dropped out.

The reasons patients gave for dropping out of the rehabilitation program were placed into several different categories, then visualized with a Pareto chart.

The Pareto chart is a must-have in any analyst's toolbox. The Pareto principle states that about 80% of outcomes come from 20% of the possible causes. By plotting the frequencies and corresponding percentages of a categorical variable, a Pareto chart helps identify the "vital few" (the 20% that really matter) so you can focus your efforts where they can make the most difference.
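If you want the same kind of view outside Minitab, a rough Python/pandas sketch of a Pareto chart looks like this; the drop-out reasons below are placeholders, not the project's actual data:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder drop-out reasons; substitute the Reason column from the
    # simulated hospital worksheet described in this post.
    reasons = pd.Series(["Readmitted", "Work conflict", "Medical", "Readmitted",
                         "Own facilities", "Work conflict", "Readmitted", "Other"])

    counts = reasons.value_counts()                 # frequency of each reason
    cum_pct = counts.cumsum() / counts.sum() * 100  # cumulative percentage

    ax = counts.plot.bar()                          # bars: counts per reason
    cum_pct.plot(ax=ax.twinx(), color="red", marker="o")  # line: cumulative %
    plt.show()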

To create this chart in Minitab, open Stat > Quality Tools > Pareto Chart... From our worksheet of simulated hospital data, select the Reason column as shown:

[Figure: Pareto Chart dialog]

When you press OK, Minitab creates the following chart:

[Figure: Pareto chart of reasons for dropping out]

Along the x-axis, Minitab displays the reasons people dropped out of the rehabilitation program, along with the percent of the total and the cumulative percentage each reason accounted for. We can see that some 80% of these patients dropped out of the program for one of the following reasons:

  • They were readmitted to the hospital.
  • Work or other obligations conflicted with the program schedule.
  • They could not participate for medical reasons.
  • They had their own exercise facilities.

While encouraging existing participants to complete the program seemed like a good strategy, the Pareto chart shows that most people stop participating due to factors that are beyond the hospital's control. Therefore, rather than focusing on keeping existing participants, the team decided to explore how to attract more new participants.

Getting More Patients to Participate in the Program

Having decided to focus on increasing initial enrollment, the project team next gathered cardiologists, physical therapists, patients, and other stakeholders to brainstorm about the factors that influence participation.

At these brainstorming sessions, many stakeholders insisted that more people would participate in the rehabilitation program if the brochure about it were better. Another suggested solution involved sending a letter to cardiologists encouraging them to be more positive about the program and to mention it to patients at an earlier point in their treatment.

The project team recorded these suggestions, but they were wary of jumping to conclusions that weren't supported by data. They decided to look more closely at the data they had from existing patients before proceeding with any potential solutions.

In part 2, we will review how the team used graphs and basic descriptive statistics to get quick insight into the influence of individual factors on patient participation in the program.

A Six Sigma Healthcare Project, part 2: Visualizing the Impact of Individual Factors

My previous post covered the initial phases of a project to attract and retain more patients in a cardiac rehabilitation program, as described in a 2011 Quality Engineering article. A Pareto chart of the reasons enrolled patients left the program indicated that the hospital could do little to encourage participants to attend a greater number of sessions, so the team focused on increasing initial enrollment from 33 to 36 patients per month.

Stakeholders offered several solutions. Before implementing any improvement strategy, however, the team decided to look at how other individual factors influenced patient participation in the program. Taking this step can help avoid devoting resources to "fixing" factors that have little impact on the outcome. 

In this post, we will look at how the team analyzed those individual factors. We have (simulated) data from 500 patients, including:

  • Address and distance between each patient's home and hospital
  • Each patient's age and gender
  • Whether or not the patient had a car
  • Whether or not the patient participated in the program

Download the data set to follow along and try these analyses yourself. If you don't already have Minitab, you can download and use our statistical software free for 30 days.

The team used simple statistics and graphs to get some preliminary insight into how these different factors affected whether or not patients decided to participate in the rehabilitation program. 

Looking at the Influence of Distance on Patient Participation

The team looked first at the influence of distance on participation using a boxplot. Also known as a box-and-whisker diagram, the boxplot shows your data's general shape, central tendency, and variability at a single glance. Displaying boxplots side-by-side lets you compare the distribution of data between groups: you can compare each group's central value and spread, and see whether the data in each group are symmetric about the center.

To create this graph, open the patient data set in Minitab and select Graph > Boxplot > One Y With Groups.  

[Figure: Boxplot dialog]

In the dialog box, select "Distance" as the graph variable, choose "Participation" as the categorical variable, and click OK.

[Figure: Boxplot of Distance dialog]

Minitab generates the following graph: 

[Figure: Boxplot of Distance by Patient Participation]

The boxplot indicates that patients who live closer to the hospital are more likely to participate in the program. This is valuable, but it would be interesting to know more about the relationship between distance and participation. Because "Participation" is a binary response—a patient either participates, or does not—we can't visualize that relationship directly with graphs that require a continuous response.

However, to get a bit more insight, the project team divided the patients into groups according to how far away from the hospital they live, then calculated the relative percentage of participation for each group. To do this, select Data > Recode > To Text... and complete the dialog box using the following groups. The picture below shows only the first five of the seven groups, so here is the complete list:  

Group 1: 0 to 25 km
Group 2: 25 to 35 km
Group 3: 35 to 45 km
Group 4: 45 to 55 km
Group 5: 55 to 65 km
Group 6: 65 to 75 km
Group 7: 75 to 200 km

[Figure: Recode to Text dialog for Distance]

When you recode the data, Minitab creates new columns of coded data and provides a summary in the Session Window:

[Figure: Distance group summary in the Session window]

Minitab automatically names the new column of data "Recoded Distance," which I've renamed as "Distance Group."

To determine the relative frequency of participation among each group, choose Stat > Tables > Descriptive Statistics... In the dialog box, select 'Distance Group' as the variable for rows, and Participation as the variable for columns, as shown. Click on the "Categorical Variables" button and make sure 'Counts' and 'Row percents' are selected, then press OK twice. 

[Figure: Table of Descriptive Statistics dialog for Distance Group]

In the session window, Minitab will display a table that shows the total number in each distance group, the number participating, and the relative frequency of participation for each group.
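For reference, here is a rough pandas sketch of the same recode-and-tabulate steps; the column names mirror the worksheet described above, but the data are simulated stand-ins:

    import numpy as np
    import pandas as pd

    # Simulated stand-in for the patient worksheet (Distance in km, 0/1 Participation)
    rng = np.random.default_rng(1)
    patients = pd.DataFrame({
        "Distance": rng.uniform(0, 200, 500),
        "Participation": rng.integers(0, 2, 500),
    })

    # Recode distance into the seven groups listed above
    bins = [0, 25, 35, 45, 55, 65, 75, 200]
    patients["Distance Group"] = pd.cut(patients["Distance"], bins=bins,
                                        labels=range(1, 8))

    # Count, number participating, and participation rate per group
    summary = patients.groupby("Distance Group")["Participation"].agg(
        ["count", "sum", "mean"])
    summary["mean"] *= 100   # participation rate as a percent
    print(summary)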

[Figure: Tabulated counts and row percents by distance group]

If we enter that information into the Minitab worksheet like this: 

[Figure: Worksheet of participation rates by distance group]

we can create a scatterplot that reveals more about the relationship between distance and participation. Select Graph > Scatterplot..., and choose "With connect line."

[Figure: Scatterplot dialog]

Select 'Part %' as the Y variable and 'Distance Grp' as the X variable, and Minitab creates the following graph, which shows the relationship between distance and participation more clearly:

[Figure: Scatterplot of participation rate vs. distance group]

We can see that the percentage of participation is very high among patients who live closest to the hospital, but decreases steadily among the groups living more than 45 km away.

Looking at the Influence of Age on Patient Participation

We can use the same methods to get initial insight into how age affects a patient's likelihood of participation in the program. The boxplot below indicates age does have some influence on participation: 

[Figure: Boxplot of Age by Patient Participation]

By dividing the patient data into groups based on Age as we did for Distance, as detailed in the table below, we can create a similar rough scatterplot to enhance our understanding of the relationship between these variables. We’ll divide the data as shown here before using Stat > Tables > Descriptive Statistics… to determine the relative participation rates:

[Figure: Table of age groups]

The scatterplot of the relative frequency of participation for patients in each Age group again yields greater insight into the relationship between this factor and the likelihood of participation. In this case, a much higher percentage of patients in the younger groups take part. 

[Figure: Scatterplot of participation rate vs. age group]

Looking at the Influence of Mobility and Gender on Patient Participation

Because both "Mobility" and "Participation" are binary variables, we can select Stat > Tables > Descriptive Statistics... to get a tabular view of the data. Select "Mobility" as the rows and "Participation" as the columns, and Minitab will provide output that gives the percentage of participation among patients who own a car and among those who do not.
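A quick pandas sketch of that cross-tabulation (with made-up rows standing in for the real worksheet) could look like this:

    import pandas as pd

    # Made-up rows standing in for the Mobility and Participation columns
    patients = pd.DataFrame({
        "Mobility":      ["Car", "No car", "Car", "Car", "No car", "Car"],
        "Participation": [1, 0, 1, 0, 0, 1],
    })

    # Row percents: participation rate within each mobility group
    print(pd.crosstab(patients["Mobility"], patients["Participation"],
                      normalize="index") * 100)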

We can put these data into a bar chart for a quick visual assessment. Minitab offers several ways to accomplish this easily; I opted to place the table data for each variable into the worksheet as shown here:

[Figure: Gender and Mobility data in the worksheet]

Now, by selecting Graph > Bar Chart, and choosing a simple chart in which "Bars represent values from a table"...

[Figure: Bar Chart dialog]

we can create the following bar charts that show the proportion of those with and without cars who participate in the program, and the proportion of men and women who participate: 

[Figure: Bar chart of participation by gender]

[Figure: Bar chart of participation by mobility]
  

It appears that gender may have a slight influence on participation, but access to a car is clearly an important factor.

An initial look at these factors indicates that access to the hospital is very important in getting people to participate. Offering a bus or shuttle service for people who do not have cars might be a good way to increase participation, but only if such a service doesn't cost more than the additional revenue it would generate.

In the next part of this series, we'll use binary logistic regression—which is not as scary as it might sound—to develop a model that will let us predict the probability a patient will join the program based on the influence factors we've looked at. A good estimate of that probability will enable us to calculate the break-even point for such a service. 

 

Regression versus ANOVA: Which Tool to Use When

Suppose you’ve collected data on cycle time, revenue, the dimension of a manufactured part, or some other metric that’s important to you, and you want to see what other variables may be related to it. Now what?


When I graduated from college with my first statistics degree, my diploma was bona fide proof that I'd endured hours and hours of classroom lectures on various statistical topics, including linear regression, ANOVA, and logistic regression.

However, there wasn’t a single class that put it all together and explained which tool to use when. I have all of this data for my Y and X's and I want to describe the relationship between them, but what do I do now?

Back then, I wish someone had clearly laid out which regression or ANOVA analysis was most suited for this type of data or that. Let's start with how to choose the right tool for a continuous Y…

Continuous Y, Continuous X(s)

Example:

     Y: Weights of adult males

     X’s: Age, Height, Minutes of exercise per week

What tool should you use?  Regression

Where’s that in Minitab?  Stat > Regression > Regression > Fit Regression Model

 

Continuous Y, Categorical X(s)

Example:

     Y: Your Mario Kart Wii score

     X’s: Wii controller type (racing wheel or standard), whether you stand or sit while playing, character (Mario, Luigi, Yoshi, Bowser, Peach)

What tool should you use?  ANOVA

Where’s that in Minitab?  Stat > ANOVA > General Linear Model > Fit General Linear Model

 

Continuous Y, Continuous AND Categorical X(s)

Example:

     Y: Number of hours people sleep per night

     X’s: Age, activity prior to sleeping (none, read a book, watch TV, surf the internet), whether or not the person has young children…“I had a bad dream, I'm thirsty, there’s a monster under my bed!”

What tool should you use?  You have a choice of using either ANOVA or Regression

Where’s that in Minitab? Stat > ANOVA > General Linear Model > Fit General Linear Model or Stat > Regression > Regression > Fit Regression Model

I personally prefer GLM because it offers multiple comparisons, which are useful if you have a significant categorical X with more than 2 levels. For example, suppose activity prior to sleep is significant. Comparisons will tell you which of the 4 levels—none, read a book, watch TV, surf the Internet—are significantly different from one another.

Do people who watch TV sleep, on average, the same as people who surf the Internet, but significantly less than people who do nothing or read? Or, perhaps, are internet surfers significantly different from the other three categories? Comparisons help you detect these differences.
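For readers who like to see the idea in code, here is a hedged sketch of the same multiple-comparison concept using Tukey's method in Python's statsmodels; the sleep data are invented and this is not Minitab's GLM output:

    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    activity = np.repeat(["none", "book", "TV", "internet"], 30)
    hours = rng.normal(7, 1, 120) - (activity == "internet") * 0.8  # invented effect

    # Tukey pairwise comparisons: which activities differ in mean sleep hours?
    print(pairwise_tukeyhsd(hours, activity))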

Categorical Y

If Y is categorical, then you can use logistic regression for your continuous and/or categorical X’s. The 3 types of logistic regression are:

     Binary:  Y with 2 levels (yes/no, pass/fail)

     Ordinal:  Y with more than 2 levels that have a natural order (low/medium/high)

     Nominal:  Y with more than 2 levels that have no order (sedan/SUV/minivan/truck)

So the next time you have a bunch of X’s and a Y and you want to see if there's a relationship between them, here is a summary of which tool to use when:

[Figure: Tool selection guide]

For step-by-step instructions on how to use General Regression, General Linear Model, or Logistic Regression in Minitab Statistical Software, just navigate to any of these tools in Minitab and click Help in the bottom left corner of the dialog. You will then see ‘example’ located at the top of the Help screen. And Minitab customers can always contact Minitab Technical Support at 814-231-2682 or www.minitab.com/contact-us. Our Tech Support team is staffed with statisticians, and best of all, accessing them is free!

The Life You Improve May Be Your Own: Honing Healthcare with Statistical Data Analysis

What does the eyesight of a homeless person have in common with complications from dental anesthesia?  Or with reducing side-effects from cancer? Or monitoring artificial hip implants?

These are all subjects of recently published studies that use statistical analyses in Minitab to improve healthcare outcomes. And they're a good reminder that when we improve the quality of healthcare for others, we improve it for ourselves.

Vision care for the homeless 

A recent retrospective review study was the first to investigate the visual healthcare needs of homeless people in the United Kingdom. Using clinical records of over 1,000 homeless individuals in East London who sought vision care, researchers summarized the demographics of this special-needs population and established baseline reference levels for future studies.

Using t-tests in Minitab, they determined that the homeless population tends to have more eye problems and a greater need for vision care than the general population. Although vision problems might appear to be a secondary issue for those facing the constellation of severe, chronic problems often associated with homelessness, the researchers point out that even something as simple as a spectacle correction can substantially improve a person's quality of life. BMC Health Services Research 2016; 16:54.

Reducing complications from local anesthesia in oral surgery

Noting the proven ability of Six Sigma methodology to increase patient compliance and satisfaction, as well as hospital profitability, investigators applied quality improvement tools to identify and reduce the most common complications from local anesthesia in dental and oral surgery.

They used a Pareto chart to identify the most common complications, and a binomial capability analysis to evaluate the rate of complications before and after implementing remedial measures. The results showed a significant reduction in complications from local anesthesia: the percent defective fell from 7.99 (95% CI 6.65, 9.51) before the improvements to 4.58 (95% CI 3.58, 5.77) afterward. Journal of Clinical & Diagnostic Research. 2015;9(12):ZC34-ZC38.

Exercise, quality of life, and fatigue in breast cancer patients

Researchers explored associations between physical activity in women with breast cancer and their quality of life and levels of fatigue. Descriptive statistics were used to summarize characteristics of the study group. The nonparametric Kruskal-Wallis test was used to evaluate differences in median scores, and Pearson's chi-square test was used to explore possible associations between categorical variables.

The authors found a significant positive correlation between increased physical activity level and a higher quality of life, as well as less fatigue. Although the study didn't prove a causal connection, their results support other studies that suggest that physical activity may help preserve quality of life and reduce side effects during cancer treatment. Rev Assoc Med Bras. 2016, 62(1).

Monitoring results of total hip replacement

In hip replacement surgery, an important technical factor is the inclination angle of the acetabular component. Variations from the target angle can lead to increased wear and poorer outcomes after surgery. Therefore, researchers used time-weighted control charts in Minitab, such as CUSUM, EWMA, and MA charts, to monitor the acetabular inclination angle in the postoperative radiographs of patients who underwent hip replacement surgery. The control charts demonstrated that the surgical process, in relation to the angle achieved, was stable and in control. The researchers noted that the time-weighted control charts helped them make a "faster visual decision." Biomed Research International 2015; ID 199610.

Additional Questions

What other types of quality improvement studies are being published in the fields of health and medicine? What are the overall trends for these studies? And how can the studies themselves be improved?

We'll look at that in my next post.

3 Ways to Get Up and Running with Statistical Software—Fast

The last thing you want to do when you purchase a new piece of software is spend an excessive amount of time getting up and running. You’ve probably been ready to use the software since, well, yesterday. Minitab has always focused on making our software easy to use, but many professional software packages do have a steep learning curve.

Whatever package you’re using, here are three things you can do to speed the process of starting to analyze your data with statistical software:

1. Get Technical Support

If you’re having trouble figuring out how to do something in a statistical software package, the makers of the software should be ready to provide the assistance you need.

When you purchase Minitab, whether for a single user or for your entire organization, we offer free technical support, by phone or online, to help you install and use the software. We’ve also got quick-start installation guides and an extensive library of installation-related FAQs to browse.

Minitab’s technical support team includes specialists in statistics and quality improvement, as well as technology, so they can assist with virtually any challenge you encounter while using the software.

2. Consult Help

Let’s face it, when a problem arises, the documentation for a lot of software is not all that helpful. That’s why many of us tend to ignore the “Help” menu when we encounter a software-related question. But if you haven’t explored the Help options offered by your statistical software, you should check them out.

Most software packages have some sort of built-in Help content, but our team has taken it a step further by offering truly useful, valuable information within Minitab. That information includes concise overviews of major statistical topics, guidance for setting up your data, information on methods and formulas, comprehensive guidance for completing dialog boxes, and easy-to-follow examples. And that’s just the start. Minitab’s built-in help options also include:

The Assistant: You certainly don’t need to be a statistics expert to get the insight you need from your data. Minitab’s Assistant menu interactively guides you through several types of analyses—including Measurement Systems Analysis, Capability Analysis, Hypothesis Tests, Control Charts, DOE and Multiple Regression.

StatGuide: After you analyze your data, the built-in StatGuide helps you interpret statistical graphs and tables in a practical, straightforward way. To access the StatGuide, just right-click on your output, press Shift+F1 on the keyboard, or click the StatGuide icon in the toolbar:

[Figure: StatGuide toolbar icon]

Tutorials: For a refresher on statistical tasks, take a look at built-in tutorials (Help > Tutorials), which include an overview of data requirements, step-by-step instructions, and guidance on interpreting the results.

3. Free Web Site Resources

See what kinds of materials exist on the website of your statistical software package. There may be much more there than basic information about the product!

For instance, at the Minitab web site you can attend live webinars, view recorded webcasts, and read step-by-step how-to’s and detailed technical articles. The Minitab Blog also offers tips and techniques for using Minitab in quality improvement projects, research, and more.

Perhaps my favorite resource on Minitab.com is the Minitab Product Support Section, which features a getting started guide, a topic library with all the various analyses available in Minitab, a free data set library to practice analyses, and a macro library that contains over 100 helpful macros you can use to automate, customize, and extend the functionality of Minitab analyses.

Interested in learning more about our pricing and licensing options? Visit http://www.minitab.com/products/minitab/pricing and contact us if you have questions.

A Six Sigma Healthcare Project, part 3: Creating a Binary Logistic Regression Model for Patient ...

In part 2 of this series, we used graphs and tables to see how individual factors affected rates of patient participation in a cardiac rehabilitation program. This initial look at the data indicated that ease of access to the hospital was a very important contributor to patient participation.

Given this revelation, a bus or shuttle service for people who do not have cars might be a good way to increase participation, but only if such a service doesn't cost more than the amount of revenue generated by participation.

A good estimate of the probability that a given patient will participate will enable us to calculate the break-even point for such a service. We can use regression to develop a statistical model that lets us do just that.

We have a binary response variable, because only two outcomes exist: a patient either participates in the rehabilitation program, or does not. To model these kinds of responses, we need to use a statistical method called "Binary Logistic Regression." This may sound intimidating, but it's really not as scary as it sounds, especially with a statistical software package like Minitab.

Download the data set to follow along and try these analyses yourself. If you don't already have Minitab, you can download and use our statistical software free for 30 days.

Using Stepwise Binary Logistic Regression to Obtain an Initial Model

First, let's review our data. We know the gender, age, and distance from the hospital for 500 cardiac patients. We also know whether or not they have access to a vehicle ("Mobility") and whether or not they participated in the rehabilitation program after their surgery (coded so that 0 = no, and 1 = yes). 

[Figure: Worksheet of patient data]

The process of developing a regression equation that can predict a response based on your data is called "Fitting a model." We'll do this in Minitab by selecting Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model... 

[Figure: Binary Logistic Regression menu]

In the dialog box, we need to select the appropriate columns of data for the response we want to predict, and the factors we wish to base the predictions on. In this case, our response variable is "Participation," and we're basing predictions on the continuous factors of "Age" and "Distance," along with the categorical factor "Mobility." 

[Figure: Binary Logistic Regression dialog]

After selecting the factors, click on the "Model" button. This lets us tell Minitab whether we want to consider interactions and polynomial terms in addition to the main effects of each factor. Complete the Model dialog as shown below. To include the two-way interactions in the model, highlight all the items in the Predictors window, make sure that the “Interactions through order:” drop-down reads “2,” and press the Add button next to it:

[Figure: Binary Logistic Regression Model dialog]

Click OK to return to the main dialog, then press the “Coding” button. In this subdialog, we can tell Minitab to automatically standardize the continuous predictors, Age and Distance. There are several reasons you might want to standardize the continuous predictors, and different ways of standardizing depending on your intent.

In this case, we’re going to standardize by subtracting the mean of the predictor from each row of the predictor column, then dividing the difference by the standard deviation of the predictor. This centers the predictors and also places them on a similar scale. This is helpful when a model contains highly correlated predictors and interaction terms, because standardizing helps reduce multicollinearity and improves the precision of the model’s estimated coefficients. To accomplish this, we just need to select that option from the drop-down as shown below:
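At this point, a rough Python (statsmodels) sketch of an analogous model fit (main effects, two-way interactions, and standardized continuous predictors) might look like the following. It uses simulated stand-in data with the column names described above, and it is only an illustration, not the Minitab stepwise procedure used in the project:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated stand-in for the downloaded patient data set
    rng = np.random.default_rng(2)
    patients = pd.DataFrame({
        "Age": rng.uniform(30, 85, 500),
        "Distance": rng.uniform(0, 200, 500),
        "Mobility": rng.integers(0, 2, 500),        # 1 = has a car
        "Participation": rng.integers(0, 2, 500),   # 1 = participated
    })

    # Standardize the continuous predictors: subtract the mean, divide by the SD
    for col in ["Age", "Distance"]:
        patients[col + "_z"] = (patients[col] - patients[col].mean()) / patients[col].std()

    # Main effects plus all two-way interactions, as in the Model subdialog
    model = smf.logit("Participation ~ (Age_z + Distance_z + Mobility)**2",
                      data=patients).fit()
    print(model.summary())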

[Figure: Binary Logistic Regression Coding dialog]

After you click OK to return to the main dialog, press the "Stepwise" button. We use this subdialog to perform a stepwise selection, which is a technique that automatically chooses the best model for your data. Minitab will evaluate several different models by adding and removing various factors, and select the one that appears to provide the best fit for the data set. You can have Minitab provide details about the combination of factors it evaluates at each "step," or just show the recommended model.

[Figure: Binary Logistic Regression Stepwise dialog]
 

Now click OK to close the Stepwise dialog, and OK again to run the analysis. The output in Minitab's Session window will include details about each potential model, followed by a summary or "deviance" table for the recommended model.

Assessing and Refining the Regression Model

Using software to perform stepwise regression is extremely helpful, but it's always important to check the recommended model to see if it can be refined further. In this case, all of the model terms are significant, and the adjusted deviance R² in the model summary indicates that the model explains about 40 percent of the observed variation in the response data.

[Figure: Stepwise regression selected model]

We also want to look at the table of coded coefficients immediately below the summary. The final column of the table lists the VIFs, or variance inflation factors, for each term in the model. This is important because VIF values greater than 5–10 can indicate unstable coefficients that are difficult to interpret.

None of these terms have VIF values over 10.  

[Figure: Coded coefficients with variance inflation factors (VIF)]

Minitab also performs goodness-of-fit tests that assess how well the model predicts observed data. The first two tests, the deviance and Pearson chi-squared tests, have high p-values, indicating that these tests do not support the conclusion that this model is a poor fit for the data. However, the low p-value for the Hosmer-Lemeshow test indicates that the model could be improved.

[Figure: Goodness-of-fit tests]

It may be that our model does not account for curvature that exists in the data.  We can ask Minitab to add polynomial terms, which model curvature between predictors and the response, to see if it improves the model. Press CTRL-E to recall the binary logistic regression dialog box, then press the "Model" button. To add the polynomial terms, select Age and Distance in the Predictors window, make sure that "2" appears in the “Terms through order:” drop-down, and press "Add" to add those polynomial terms to the model. An order 2 polynomial is the square of the predictor.

[Figure: Binary Logistic Regression Model dialog with polynomial terms]

You may have noticed that we did not select “Mobility” above. Why? Because that categorical variable is coded with 1’s and 0’s, so the polynomial term would be identical to the term that is already in the model.
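Continuing the earlier illustrative sketch, adding squared terms in a formula interface might look like this (I() is patsy's way of squaring a column):

    import statsmodels.formula.api as smf

    # 'patients' is the simulated DataFrame from the earlier sketch
    model2 = smf.logit(
        "Participation ~ (Age_z + Distance_z + Mobility)**2"
        " + I(Age_z**2) + I(Distance_z**2)",
        data=patients).fit()
    print(model2.summary())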

Now press OK all the way out to have Minitab evaluate models that include the polynomial terms. Minitab generates the following output:

[Figure: Binary logistic regression output for the model with polynomial terms]

So far, so good—all model terms are significant, and the adjusted R² indicates that the new model accounts for 51 percent of the observed variation in the response, compared to the initial model’s 40 percent. The coefficients are also acceptable, with no variance inflation factors above 10:

[Figure: Coded coefficients with VIF values]

However, the VIFs for Mobility and the Distance*Mobility interaction remain higher than desirable. These terms are moderately correlated, but probably not enough to make the regression results unreliable:

[Figure: VIF values for the model terms]

The goodness-of-fit tests for this model also look good—the lack of p-values below 0.05 indicates that these tests do not suggest the model is a poor fit for the observed data.

[Figure: Goodness-of-fit tests for the final model]

The Binary Logistic Regression Equations

This model seems like the best option for predicting the probability of patient participation in the program. Based on the available data, Minitab has calculated the following regression equations, one that predicts the probability of attendance for people who have access to their own transportation, and one for those who do not:

[Figure: Binary logistic regression equations]

Now we can use this model to make predictions about the probability of participation and how much we can afford to invest in a transportation program to help more cardiac patients participate in the rehabilitation program. In the next post, we'll complete that process.

A Six Sigma Healthcare Project, part 4: Predicting Patient Participation with Binary Logistic ...

By looking at the data we have about 500 cardiac patients, we've learned that easy access to the hospital and good transportation are key factors influencing participation in a rehabilitation program.

Past data shows that each month, about 15 of the patients discharged after cardiac surgery do not have a car. Providing transportation to the hospital might make these patients more likely to join the rehabilitation program, but the costs of such a service can't exceed the potential revenue from participation.

We can use the binary logistic regression model developed in part 3 to predict probabilities of participation, to identify where transportation assistance might make the biggest impact, and to develop an estimate of how much we could invest in such assistance. 

Download the data set to follow along and try these analyses yourself. If you don't already have Minitab, you can download and use our statistical software free for 30 days.

Using the Regression Model to Predict Patient Participation

We want to develop some estimates of the probability of participation based on whether or not a patient has access to transportation. The first step is to make some mesh data representing our population. In Minitab, go to Calc > Make Mesh Data..., and complete the dialog box as shown below. (The maximum and minimum ranges for Age and Distance are drawn directly from the descriptive statistics for the sample data we used to create our regression model.)

[Figure: Make Mesh Data dialog]

When you press OK, Minitab adds two new columns to the worksheet that contain the 200 different combinations of the levels of these factors. We'll then add two more columns, one representing patients who have access to a car and one representing those who don't, so the worksheet includes four columns of data as shown:
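A rough NumPy equivalent of the mesh step might look like this; the Age and Distance ranges below are assumptions, so substitute the minimum and maximum values from your own descriptive statistics:

    import numpy as np
    import pandas as pd

    age = np.linspace(35, 85, 20)        # 20 levels of Age (assumed range)
    distance = np.linspace(2, 150, 10)   # 10 levels of Distance (assumed range)
    A, D = np.meshgrid(age, distance)

    mesh = pd.DataFrame({"Age": A.ravel(), "Distance": D.ravel()})  # 200 rows
    mesh["Car"] = 1      # scenario: patient has access to a car
    mesh["NoCar"] = 0    # scenario: patient does not
    print(mesh.head())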

[Figure: Mesh data in the worksheet]

Now we'll go to Stat > Regression > Binary Logistic Regression > Predict...  Minitab remembers the last regression model that was run; to make sure it's the right one, click the "View Model..." button...

[Figure: View Model button]

and confirm that the model displayed is the correct one.

[Figure: Model displayed for confirmation]

Next, press the "Predict" button and complete the dialog box using the mesh variables we created, as shown. We can also press the "Storage" button to tell Minitab to store the Fits (the predicted probabilities) for each data point in the worksheet. Note that the column selected for the Mobility term is "Car," so all of these predictions will be based on the equation for patients who have access to a vehicle. 

[Figure: Regression prediction dialog]

When you click OK through all dialogs, Minitab will add a column of data that shows the predicted probability of participation for patients, assuming they have a vehicle. 

Now we'll create the predictions for individuals who don't have cars. Press CTRL-E to edit the previous dialog box. This time, for the Mobility column, select "NoCar."

[Figure: Prediction dialog with the NoCar column selected]

When you press OK, Minitab recalculates the probabilities for the patients, this time using the equation that assumes they do not have a vehicle. The probabilities of participation for each data point are stored in two columns in the worksheet, which I've renamed PFITS-Car and PFITS-No car.  

[Figure: Stored prediction columns PFITS-Car and PFITS-No car]

Where Can Providing Transportation Make an Impact?

Now we have estimated probabilities of participation for patients with the same age and distance characteristics, both with and without access to a vehicle. It would be helpful to visualize the differences in these probabilities to see where offering transportation might make the biggest impact in increasing participation rates.

First, we'll use Minitab's calculator to compute the difference in probabilities between having and not having a car. Go to Calc > Calculator... and complete the dialog as shown: 

[Figure: Calculator dialog]

Now we have a column of data named "Car - NoCar" that contains the probability difference for patients with the same age and distance characteristics both with and without a vehicle. We can use that column to create a contour plot that offers additional insight into the relationships between the likelihood of participation in the rehabilitation program and a patient's age, distance, and mobility. Select Graph > Contour Plot... and complete the dialog as shown:

[Figure: Contour Plot dialog]

Minitab produces this contour plot (we have edited the range of colors from the default):

[Figure: Contour plot of the difference in participation probability]

From this plot we can see the patients for whom transportation assistance is likely to make the most impact. These are the patients whose age and distance characteristics fall within the dark-red-colored area, where access to a vehicle raises the probability of participation by more than 40 percent.
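For readers working outside Minitab, here is a hedged matplotlib sketch of the same contour idea. The probabilities below are stand-ins purely so the sketch runs; in practice you would reshape the two stored prediction columns onto the Age-by-Distance mesh:

    import numpy as np
    import matplotlib.pyplot as plt

    A, D = np.meshgrid(np.linspace(35, 85, 20), np.linspace(2, 150, 10))

    # Stand-in probabilities so the sketch runs; replace with the stored
    # prediction columns (with car / without car) reshaped to the mesh.
    p_car = 1 / (1 + np.exp(0.04 * (A - 55) + 0.03 * (D - 40)))
    p_nocar = 0.4 * p_car

    cs = plt.contourf(A, D, p_car - p_nocar, levels=[0, 0.1, 0.2, 0.3, 0.4, 1.0])
    plt.colorbar(cs, label="Gain in participation probability with a car")
    plt.xlabel("Age")
    plt.ylabel("Distance (km)")
    plt.show()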

The hospital could use this information to carefully target potential recipients of transportation assistance, but doing so would raise many ethical issues. Instead, the hospital will offer transportation assistance to any potential participant who needs it. The project team decides to calculate the average probability of participation for all patients without access to a vehicle.

To obtain that average, select Stat > Basic Statistics > Display Descriptive Statistics... in Minitab, and choose "PFITS-NoCar" as the variable. Click on the "Statistics" button to make sure the Mean is among the descriptive statistics being calculated, and click OK. Minitab will display the descriptive statistics you've selected in the Session Window. 

[Figure: Descriptive statistics for PFITS-NoCar]

According to our binary logistic regression model, the average probability of participation for all patients without a car equals 0.1695, which we will round up to .17.  Now we can easily calculate an estimated break-even point for ensuring transport for patients who need it. We have the following information on hand: 

Patients per month without a car: 15
Average probability of participation without a car: 0.17
Average number of sessions per participant: 29
Revenue per session: $23

Based on these figures, a per-patient maximum for transportation can be calculated as:

.17 probability of participation x 29 sessions x $23 per session = $113.39

Since about 15 discharged cardiac patients each month do not have a car, we can invest at most 15 x $113.39 = $1700.85/month in transportation assistance. 
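The break-even arithmetic is easy to script as well; this small sketch just repeats the numbers above:

    p_participate = 0.17     # average participation probability without a car
    sessions = 29            # average sessions per participant
    revenue = 23             # dollars of revenue per session
    patients_per_month = 15  # discharged cardiac patients without a car

    per_patient = p_participate * sessions * revenue    # about $113.39
    per_month = patients_per_month * per_patient         # about $1,700.85
    print(round(per_patient, 2), round(per_month, 2))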

Implementing Transportation Assistance for Patient Participation

As described in the article that inspired this series of posts, the project team evaluated potential improvement options against this economic calculation and developed a process that paired patients who have cars with those who don't, so they could carpool to sessions. A pilot test of the process proved successful, and most of the car-less patients noted that they would not have participated in the rehabilitation program without the service.

After implementing the new carpool process, the project team revisited the key metrics they had considered at the start of the initiative: the number of patients enrolling in the program each month and the average number of sessions participants attended.

After implementing the carpool process, the average number of sessions attended remained constant at 29. But patient participation rose from 33 to 45 per month, which exceeded the project goal of increasing participation to 36 patients per month. Additional revenues turned out to be circa $96,000 annually.

Take-Away Lessons from This Project Study

If you've read all four parts of this series, you may recall that at the start of the  Six Sigma project, several stakeholders believed that the problem of low participation could be addressed by creating a nicer brochure for the program, and by encouraging surgeons to tell their patients about it at an earlier point in their treatment. 

None of those initial ideas wound up being implemented, but the project team succeeded in meeting the project goals by enacting improvements that were supported by their data analysis. For me, this is a core takeaway from this article. 

As the authors note, "Often people’s ideas on processes are incorrect, but improvement actions based on these are still being implemented. These actions cause frustrated employees, may not be cost effective, and in the end do not solve the problem."

Thus, the article makes a compelling case for the value of applying data analysis to improve processes in healthcare. "Even when a somewhat more advanced technique like logistic regression modeling is required," the authors write, "exploratory graphics such as boxplots and bar charts point the direction toward a valuable solution."


Poisson Data: Examining the Number of Deaths in an Episode of Game of Thrones

There may not be a situation more perilous than being a character on Game of Thrones. Warden of the North, Hand of the King, and apparent protagonist of the entire series? Off with your head before the end of the first season! Last male heir of a royal bloodline? Here, have a pot of molten gold poured on your head! Invited to a wedding? Well, you probably know what happens at weddings in the show. 

So what do all these gruesome deaths have to do with statistics? They are data that come from a Poisson distribution.

Data from a Poisson distribution describe the number of times an event occurs in a finite observation space. For example, a Poisson distribution can describe the number of defects in the mechanical system of an airplane, the number of calls to a call center, or, in our case, the number of deaths in an episode of Game of Thrones.

Goodness-of-Fit Test for Poisson

If you're not certain whether your data follow a Poisson distribution, you can use Minitab Statistical Software to perform a goodness-of-fit test. If you don't already use Minitab and you'd like to follow along with this analysis, download the free 30-day trial.

I collected the number of deaths for each episode of Game of Thrones (as of this writing, 57 episodes have aired), and put them in a Minitab worksheet. Then I went to Stat > Basic Statistics > Goodness-of-Fit Test for Poisson to determine whether the data follow a Poisson distribution. You can get the data I used here.

[Figure: Goodness-of-Fit Test for Poisson Distribution output]

Before we interpret the p-value, we see that we have a problem. Three of the categories have an expected value less than 5. If the expected value for any category is less than 5, the results of the test may not be valid. To fix our problem, we can combine categories to achieve the minimum expected count. In fact, we see that Minitab actually already started doing this by combining all episodes with 7 or more deaths.

So we'll just continue by making the highest category 6 or more deaths, and the lowest category 1 or 0 deaths. To do this, I created a new column with the categories 1, 2, 3, 4, 5 and 6. Then I made a frequency column that contained the number of occurrences for each category. For example, the "1" category is a combination of episodes with 0 deaths and 1 death, so there were 14 occurrences. Then I ran the analysis again with the new categories.

[Figure: Goodness-of-Fit Test for Poisson Distribution output with combined categories]

Now that all of our categories have expected counts greater than 5, we can examine the p-value. If the p-value is less than the significance level (usually 0.05 works well), you can conclude that the data do not follow a Poisson distribution. But in this case the p-value is 0.302, which is greater than 0.05. Therefore, we cannot conclude that the data do not follow the Poisson distribution, and can continue with analyses that assume the data follow a Poisson distribution. 
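If you prefer to check the fit in code, here is a hedged SciPy sketch of the same chi-square goodness-of-fit idea with combined categories; the death counts below are placeholders rather than the real episode data:

    import numpy as np
    from scipy import stats

    # Placeholder episode death counts (not the real data)
    deaths = np.array([1, 0, 3, 2, 5, 4, 3, 2, 6, 1, 2, 3, 4, 7, 2,
                       3, 1, 0, 2, 5, 3, 4, 2, 3, 6, 1, 2, 4, 3, 2])
    lam = deaths.mean()    # estimated Poisson rate

    # Combine the tails into "0 or 1" and "6 or more" categories
    observed = np.array([np.sum(deaths <= 1),
                         *(np.sum(deaths == k) for k in (2, 3, 4, 5)),
                         np.sum(deaths >= 6)])
    expected = np.array([stats.poisson.cdf(1, lam),
                         *(stats.poisson.pmf(k, lam) for k in (2, 3, 4, 5)),
                         stats.poisson.sf(5, lam)]) * deaths.size

    # One degree of freedom is lost for the estimated rate (ddof=1)
    chi2, p = stats.chisquare(observed, expected, ddof=1)
    print(lam, chi2, p)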

Confidence Interval for 1-Sample Poisson Rate

When you have data that come from a Poisson distribution, you can use Stat > Basic Statistics > 1-Sample Poisson Rate to get a rate of occurrence and calculate a range of values that is likely to include the population rate of occurrence. We'll perform the analysis on our data.

[Figure: 1-Sample Poisson Rate output]

The rate of occurrence tells us that on average there are about 3.2 deaths per episode on Game of Thrones. If our 57 episodes were a sample from a much larger population of Game of Thrones episodes, the confidence interval would tell us that we can be 95% confident that the population rate of deaths per episode is between 2.8 and 3.7.

The length of observation lets you specify a value to represent the rate of occurrence in a more useful form. For example, suppose instead of deaths per episode,  you want to determine the number of deaths per season. There are 10 episodes per season. So because an individual episode represents 1/10 of a season, 0.1 is the value we will use for the length of observation. 

[Figure: 1-Sample Poisson Rate output with length of observation = 0.1]

With a different length of observation, we see that there are about 32 deaths per season with a confidence interval ranging from 28 to 37.
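A rough SciPy sketch of an exact Poisson rate interval (the Garwood chi-square form) is shown below, reusing the placeholder 'deaths' array from the earlier sketch; it illustrates the idea rather than reproducing Minitab's exact output:

    from scipy import stats

    total, n = deaths.sum(), deaths.size   # 'deaths' from the earlier sketch
    lo = stats.chi2.ppf(0.025, 2 * total) / (2 * n)
    hi = stats.chi2.ppf(0.975, 2 * (total + 1)) / (2 * n)
    print(f"rate per episode: {total / n:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")

    # Rescale to deaths per season (10 episodes per season)
    print(f"per season: {10 * total / n:.1f}, CI ({10 * lo:.1f}, {10 * hi:.1f})")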

Poisson Regression

The last thing we'll do with our Poisson data is perform a regression analysis. In Minitab, go to Stat > Regression > Poisson Regression > Fit Poisson Model to perform a Poisson regression analysis. We'll look at whether we can use the episode number (1 through 10) to predict how many deaths there will be in that episode.
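For comparison, a minimal statsmodels sketch of a Poisson regression with episode number as a categorical predictor might look like this (the counts are simulated, not the real data):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    episodes = pd.DataFrame({
        "episode": np.tile(np.arange(1, 11), 5),   # episode number within season
        "deaths": rng.poisson(3.2, 50),             # simulated death counts
    })

    # C(episode) treats episode number as categorical, with episode 1 as reference
    fit = smf.glm("deaths ~ C(episode)", data=episodes,
                  family=sm.families.Poisson()).fit()
    print(fit.summary())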

[Figure: Poisson regression output]

The first thing we'll look at is the p-value for the predictor (episode). The p-value is 0.042, which is less than 0.05, so we can conclude that there is a statistically significant association between the episode number and the number of deaths. However, the Deviance R-Squared value is only 18.14%, which means that the episode number explains only 18.14% of the variation in the number of deaths per episode. So while an association exists, it's not very strong. Even so, we can use the coefficients to determine how the episode number affects the number of deaths. 

[Figure: Poisson regression coefficients]

The episode number was entered as a categorical variable, so the coefficients show how each episode number affects the number of deaths relative to episode number 1. A positive coefficient indicates that episode number is likely to have more deaths than episode 1. A negative coefficient indicates that episode number is likely to have fewer deaths than episode 1.

We see that each season usually starts slow, as 7 of the 9 other episode numbers have positive coefficients. Episodes 8, 9, and 10 have the highest coefficients, meaning that, relative to the first episode of the season, they have the greatest number of deaths. So even though our model won't be great at predicting the exact number of deaths for each episode, it's clear that the show ends each season with a bang.

And considering episode 8 of the current season airs this Sunday, if you're a Game of Thrones viewer you should brace yourself, because death is coming. Or, as they would say in Essos:

Valar morghulis.

Fitting an ARIMA Model

Time series data is proving to be very useful these days in a number of different industries. However, fitting a specific model is not always a straightforward process. It requires a good look at the series in question, and possibly trying several different models before identifying the best one. So how do we get there? In this post, I'll take a look at how we can examine our data and get a feel for what models might work in a particular case. 

How Does a Time Series Work?

The first thing to note is how time series work in general, and how those concepts apply to fitting the ARIMA model we're going to create.

In general, there are two things we look at when trying to fit a time series model. One is past values, which is what we use in AR (autoregressive) models. Essentially, we predict what our next point would be based on looking at a certain number of past points. An AR(1) model would forecast future values by looking at 1 past value.

The second thing we can look at is past prediction errors. These are called MA (moving average) models, and an MA(1) model would be predicting future values using 1 past prediction error.

Both of these concepts make sense individually; they're just different approaches to how we predict future points. An ARIMA model uses both of these ideas and allows us to fit one nice model that looks at both past values and past prediction errors. 

Example of Fitting a Time Series Model

So let's take a look at an example and see if we can't fit a model. I've randomly created some time series data, and the first thing to do is simply plot it and see what's happening. Here, I've plotted my series:

[Image: tsplot]

Here are some things to look for. First, a key assumption with these models is that our series has to be stationary. A stationary time series is one whose mean and variance are constant over time. In our case, it's clear that our mean is not constant over time—it's decreasing.

To resolve this, we can take a first difference of our data, and investigate that. In Minitab, this can be done by going to Stat > Time Series > Differences and taking a difference of lag 1. (This means that we are subtracting each data point from the one that follows it.) 
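Outside Minitab, taking a lag-1 difference is a one-liner. Here is a minimal pandas sketch with a made-up, downward-drifting series standing in for the plotted data.

```python
import pandas as pd

# Hypothetical downward-drifting series (stand-in for the plotted data)
series = pd.Series([50.1, 49.4, 48.8, 48.9, 47.6, 47.0, 46.2, 45.9, 45.0])

# Lag-1 difference: each point minus the point that precedes it
first_diff = series.diff(1).dropna()
print(first_diff)
```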

When we plot this lag 1 difference data, we can see it is now stationary:

[Image: first diff]

It took one difference to make our data stationary, so we now have one piece of our ARIMA model: the "I," which stands for "Integrated." We know that we have an ARIMA(p,1,q). Now, how do we find the AR term (p) and the MA term (q)? To do that, we need to dive into two plots, namely the ACF and PACF—and this is where it gets tricky.

Interpreting ACF and PACF Plots

The ACF stands for the autocorrelation function, and the PACF for the partial autocorrelation function. Looking at these two plots together can help us form an idea of what models to fit. The ACF computes and plots the autocorrelations of a time series, where autocorrelation is the correlation between observations of a time series separated by k time units.

Similarly, partial autocorrelations measure the strength of relationship with other terms accounted for; in this case, the other terms are the intervening lags present in the model. For example, the partial autocorrelation at lag 4 is the correlation at lag 4, accounting for the correlations at lags 1, 2, and 3. To generate these plots in Minitab, we go to Stat > Time Series > Autocorrelation or Stat > Time Series > Partial Autocorrelation. I've generated these plots for our simulated data below:

[Image: acf]

So what do these plots tell us? They each show a clear pattern, but how does that pattern help us to determine what our p and q values will be? Let's look at the patterns. Our PACF slowly tapers to 0, although it has two spikes at lags 1 and 2. On the other side, our ACF shows a tapering pattern, with lags slowly decaying towards 0. The table below can be used to help identify patterns, and what model conclusions we can draw from those patterns. 

ACF Pattern | PACF Pattern | Conclusion
Tapers to 0 in some fashion | Non-zero values at first p points; zero values elsewhere | AR(p) model
Non-zero values at first q points; zero values elsewhere | Tapers to 0 in some fashion | MA(q) model
Values that remain close to 1, no tapering off | Values that remain close to 1, no tapering off | Symptoms of a non-stationary series; differencing is most likely needed
No significant correlations | No significant correlations | Random series

If a model contains both AR and MA terms, the interpretation gets trickier. In general, both will taper off to 0. There may still be spikes in the ACF and/or PACF which could lead you to try AR and MA terms of that quantity. However, it usually helps to try a few different models, and based on model diagnostics, choose which one fits best. 

In this case, I used simulated data, so I know the best fit for my model is going to be an ARIMA(1,1,1). However, with real-world data, the answer may not be so obvious, and thus many models may have to be considered before landing on a single choice.
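For anyone following along outside Minitab, here is a minimal statsmodels sketch that simulates a series, looks at the ACF and PACF of the first difference, and fits an ARIMA(1,1,1). The simulation parameters are purely illustrative and not the data used in this post.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Simulate an ARMA(1,1) process, then cumulate it so one difference is needed
rng = np.random.default_rng(7)
e = rng.normal(size=300)
arma = np.zeros(300)
for t in range(1, 300):
    arma[t] = 0.6 * arma[t - 1] + e[t] + 0.4 * e[t - 1]
series = pd.Series(arma).cumsum()

diff1 = series.diff().dropna()
plot_acf(diff1, lags=20)    # look for tapering vs. cut-off patterns
plot_pacf(diff1, lags=20)
plt.show()

fit = ARIMA(series, order=(1, 1, 1)).fit()   # (p, d, q) = (1, 1, 1)
print(fit.summary())
```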

In my next post, I'll go over some diagnostic measures we can compare between models to see which gives us the best fit. 

 

The Matrix, It's a Complex Plot

[Image: Welcome to the Matrix]

Remember the classic science fiction film The Matrix? The dark sunglasses, the leather, computer monitors constantly raining streams of integers (inexplicably in base 10 rather than binary or hexadecimal)? And that mind-blowing plot twist when Neo takes the red pill from Morpheus' outstretched hand? Well to me, there's one thing even more mind-blowing than the plot of the Matrix: the Matrix Plot. You know, in Minitab Statistical Software. (Click here to download a free trial.)

Just as Neo and his band of futuristic rebels were constantly barraged with endless streams of data, it seems like we, too, often face large amounts of data that we must make sense of. When faced with such a challenge, a good place to start is to create some exploratory graphs in Minitab. Previous posts have extolled the virtues of the Individual Value Plot and Graphical Summary for this purpose. Today, we're going to use the oracle of all plots, the Matrix Plot, to uncover the secrets of automobile specifications data. (Follow the link and scroll to the bottom of the page to download the worksheet.)

The data set looks like this:

[Image: AutoSpecs data set]

There's a lot to take in here. The columns look like streams of random numbers...but are they? Time to enter the matrix. A matrix plot is a great exploratory tool because you can throw a bunch of data in it and just see what happens. 

From Minitab's Graph menu, choose Matrix Plot. Under Matrix of plots, choose With Groups, and fill out the dialog box thusly:

[Image: Click the red pill]

It is at this point that you must make a difficult choice. You can choose the blue pill1 (a.k.a., the Cancel button) and go about your business, oblivious to and untroubled by the mind-blowing automotive realities that surround you. Or you can choose the red pill (click OK), after which your life will forever be altered by your ability to see into the data, to understand it, and—with practice—to even control it.2 

If you chose the blue pill, click here.

If you chose the red pill, read on.

[Image: Matrix plot of auto data]

As you can see, the matrix plot packs a lot of information into a small space. I like to do a couple of things to allow the data to spread out just a little. Remove the graph title by clicking it and pressing Delete. Then, choose Editor > Graph Options, and select Don't alternate (under Alternate Ticks on Plots). There, that's a little better:

[Image: Matrix plot of vehicle data without title]

It's a lot to take in, but don't worry. Just as our band of heroes in The Matrix learned to read the endless streams of integers on their monitors, so too will this mass of dots soon make sense to you.

The matrix plot is simply a grid of scatterplots. For example, the left-most scatterplot in the top row shows City MPG on the y-axis and Hwy MPG on the x-axis. Not surprisingly, there appears to be a very tight relationship between these two variables: vehicles with good city mileage tend to also have good highway mileage. You can tell from the scales that city MPG for all vehicles ranges between about 10 and 55 and that highway MPG ranges between about 19 and 50. From the symbols, you can also easily tell that the hybrid vehicles (red squares) get better mileage than gas-only vehicles (blue dots). 
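As an aside, if you work in Python rather than Minitab, a seaborn pairplot gives a comparable grid of pairwise scatterplots with a grouping variable. The little data frame below is made up just to mirror the kind of worksheet described here.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up vehicle data with a grouping column, loosely mirroring the worksheet
autos = pd.DataFrame({
    "CityMPG": [22, 48, 19, 51, 25, 45, 17, 40],
    "HwyMPG":  [30, 47, 27, 49, 33, 44, 24, 42],
    "Retail":  [24000, 27000, 31000, 29000, 22000, 35000, 71000, 26000],
    "Type":    ["Gas", "Hybrid", "Gas", "Hybrid", "Gas", "Hybrid", "Gas", "Hybrid"],
})

# Grid of pairwise scatterplots, colored by vehicle type (like the groups option)
sns.pairplot(autos, hue="Type")
plt.show()
```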

To simplify things, we can remove City MPG and Hwy MPG from the plot and leave just Total MPG (which is just City MPG + Hwy MPG). We can also remove Total Volume (which is Interior Volume + Cargo Volume). 

To return to the Matrix Plot dialog box, you can press Ctrl + E. (This handy shortcut was #2 in the Minitab Tips and Tricks: Top 10 Countdown.) This time, in Graph variables, enter just columns C6 through C10. 

[Image: Matrix plot without the redundant variables]

(To maximize the space for data, I deleted the title and un-alternated the tick marks for this graph like we did for the last one.)

One thing that jumps out is that Safety isn't like the other variables. The other variables are continuous, but the safety ratings take on one of three discrete values: 3, 4, or 5. For discrete variables, the plot looks like an individual value plot. Interestingly, all hybrid vehicles scored a 4 or a 5; the only vehicles to score a 3 were gas-only.

Another thing that jumps out is the outlier in the Retail (price) measurements. While the other vehicles cost under $45,000, one vehicle sells for more than $70,000. Conveniently, we can brush the outlier and quickly see how that vehicle scores on the other measures. (For more information on this powerful tool, see Using brushing to investigate data points.)

[Image: The magic of brushing]

The brushing palette shows that the outlier is in row 10 of the worksheet. The point for this observation is highlighted in each plot of the matrix. So you can quickly tell, for example, that even though you may have to ransack your kid's college fund to afford this beauty, at least he or she will enjoy the extra passenger room afforded by this luxury vehicle. And they are assured to arrive at their non-college-campus destinations in one piece because this vehicle gets the highest safety rating. However, you may have to pass the hat for gas because it looks like this baby is always thirsty.

Among its other virtues, the high price tag has the added effect of squishing the data for the other vehicles into the low end of the scale and thus making the graph harder to read. Now that I've scratched this rig off my wish list, let's go ahead and remove it from the plot. Again, we use the Ctrl + E trick to reopen the dialog box. This time we click the Data Options button and specify to exclude row 10 from the graph: 

[Image: Exclude row 10 from the matrix]

[Image: Matrix plot without row 10]

Without the gas-guzzling outlier in the picture, it becomes clear that there is another outlier in town. One of the vehicles has an unusually low interior volume. Again, we can brush this point to see what's going on. 

[Image: Brushed outlier in Volume]

Brushing shows that this vehicle is about average on the other measures. It doesn't cost less than the others and doesn't seem to get better mileage; it's just cramped on the inside. Not a big selling point. Let's remove this point as well. (This vehicle is in row 15.)

[Image: Final matrix, no outliers]

Without the outliers, the overall picture becomes still clearer. In general, it looks like more money does not buy you better gas mileage. The negative relationship between price and mileage is clear for both hybrid and gas-only vehicles. However, more money does seem to buy you more space. It looks like there is a positive relationship between price and interior volume and between price and cargo volume. Bigger vehicles are heavier and generate more wind resistance, so no wonder the more expensive vehicles tend to get worse gas mileage. 

[Image: Pill plots]

I think you'll agree that we have learned a lot about these data since we first entered the matrix just a few mouse clicks ago. No doubt more time in the matrix will reveal even more insights. Aren't you glad you chose the red pill? 

 

Notes

1. The Matrix Plot dialog box featured in this post has been embellished for the purpose of dramatizing this reenactment. In real life, Minitab dialog boxes do not feature pills, or pharmaceutical agents of any kind. No actual dialog boxes or buttons were harmed during the making of this blog post. [return]

2. OK, so you can't really use a matrix plot to actually change the data in the worksheet. But you *can* use the matrix plot to change how *you see* the data and enable you to reveal more of your data secrets. And isn't that what's important? [return]

Acknowledgements

Credit for the original pill images goes to W.carter. Pills and steak dinner available under Creative Commons License 2.0 and Creative Commons License 1.0 respectively.

 

Using Multivariate Statistical Tools to Analyze Customer and Survey Data

Businesses are getting more and more data from existing and potential customers: whenever we click on a web site, for example, it can be recorded in the vendor's database. And whenever we use electronic ID cards to access public transportation or other services, our movements across the city may be analyzed.

In the very near future, connected objects such as cars and electrical appliances will continuously generate data that will provide useful insights regarding user preferences, personal habits, and more. Companies will learn a lot from users and the way their products are being used. This learning process will help them focus on particular niches and improve their products according to customer expectations and profiles.

For example, insurance companies will monitor how motorists are driving connected cars, to adjust insurance premiums according to perceived risks, or to analyze driving behaviors so they can advise motorists how to boost fuel efficiency. No formal survey will be needed, because customers will be continuously surveyed.

Let's look at some statistical tools we can use to create and analyze user profiles, map expectations, study which expectations are related, and so on. I will focus on multivariate tools, which are very efficient methods for analyzing surveys and taking into account a large number of variables. My objective is to provide a very high level, general overview of the statistical tools that may be used to analyze such survey data.

A Simple Example of Multivariate Analysis

Let us start with a very simple example. The table below presents data some customers have shared about their enjoyment of specific types of food:

[Image: table of customer food preferences]

A simple look at the table does not really help us easily understand preferences, so we can use Simple Correspondence Analysis, a multivariate statistical tool, to display these preferences visually.

In Minitab, go to Stat > Multivariate > Simple Correspondence Analysis... and enter your data as shown in the dialogue box below. (Also click on "Graphs" and check the box labeled "Symmetric plot showing rows and columns.")

[Image: Simple Correspondence Analysis dialog box]

Minitab creates the following plot: 

[Image: symmetric plot of rows and columns]

Looking at the plot, we quickly see that vegetables tend to be associated with “Disagree” (positioned close to each other in the graph) and Ice cream is positioned close to “Neutral” (they are related to each other). As for Meat and Potatoes, the panel tends either to “Agree” or “Strongly agree.”

We now have a much better understanding of the preferences of our panel, because we know what they tend to like and dislike.

Selecting the Right Type of Tool to Analyze Survey Data

Many multivariate tools are available, so how can you choose the right one to analyze your survey data?

The decision tree below shows which method you might choose according to your objectives and the type of data you have. For example, we selected correspondence analysis in the previous example because all our variables were categorical, or qualitative in nature.

[Image: decision tree for choosing a multivariate method]

 

Categorical Data and Prediction of Group Membership (Right Branch) 

Clustering
If you have some numerical (or continuous) data and you want to understand how your customers might be grouped / aggregated (from a statistical point of view) into several homogeneous groups, you can use clustering techniques. This could be helpful to define profiles and user groups.

Discriminant Analysis or Logistic Regression (Scoring)
If your individuals already belong to different groups and you want to understand which variables are important to define an existing user group, or predict group membership for new individuals, you can use discriminant analysis, or binary logistic regression (if you only have two groups).

Correspondence Analysis 
As we saw in the first example, correspondence analysis lets us study relationships between variables that are categorical / qualitative.

Numeric or Continuous Data Analysis (Left Branch)

Principal Component Analysis or Factor Analysis
If all your variables are numeric, you can use principal components analysis to understand how variables are related to one another. Factor analysis may be useful for identifying an underlying, unknown factor associated with your variables.
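As a rough illustration of this idea outside Minitab, here is a short scikit-learn sketch of principal component analysis on standardized survey items. The responses are randomly generated stand-ins, so the output is only meant to show where the variance shares and loadings come from.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for numeric survey responses (rows = respondents)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

X_std = StandardScaler().fit_transform(X)    # put all items on the same scale
pca = PCA(n_components=2).fit(X_std)

print(pca.explained_variance_ratio_)   # share of variation per component
print(pca.components_)                 # loadings: how each item relates to each component
```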

Item Analysis
This tool was specifically created for survey analysis. Do the items of a survey evaluate similar characteristics? Which items differ from the remaining questions? The objective is to assess the internal consistency of a survey. 

These methods are computationally intensive, but performing such multivariate analyses in Minitab is very user-friendly, and the software produces easy-to-understand graphs (as in the food preference example above).

A Closer Look at Some Specific Multivariate Tools

Let's take a closer look at the tools for numerical survey data analysis. The graph below shows the tools that are available to you and their objectives in each case. These methods are often used to group numeric variables according to similarity; they may also be useful for studying how individuals are positioned relative to the main groups of variables in order to identify user profiles.

[Image: overview of tools for numeric survey data]

And now let's look a bit more closely at the tools we can use for analyzing categorical survey data. Again, the diagram below shows the tools that are available to you and their objectives. Many of these tools can be used to study how numeric variables relate to qualitative categories.

[Image: overview of tools for categorical survey data]

Conclusion

This is a very general overview of multivariate tools for survey analysis. If you want to go deeper and learn more about these techniques, you can find some resources on the Minitab web site, in the Help menu in Minitab's statistical software, or you can contact our technical support team.

Using Fitness Tracker Data to Make Wise Decisions: Are You Working Out in the Right Zone?

[Image: gym]

Technology is very much part of our lives nowadays. We use our smartphones to have video calls with our friends and family, and watch our favourite TV shows on tablets. Technology has also transformed the fitness industry with the increasing popularity of fitness trackers.

Recently, I got myself a fitness watch and it's becoming my favourite gadget. It can track how many steps I’ve taken, my heart rate during a workout, and how many calories I've burned during my workout and over the whole day. Based on the calories burned, I can adjust my diet to ensure I have eaten what I require for the day. I’ve been collecting data from my weekly Zumba sessions, gym workouts and lunch-time walks. After collecting data for over a month, I decided to do some analysis with it using Minitab. Below is a snapshot of the data I collected in Minitab.

[Image: fitbit data]

For each activity, I have the following information:

  • Duration of exercise in minutes and seconds
  • Time spent (rounded to nearest minutes) on peak/high-intensity exercise heart-rate zone—heart rate greater than 85% of maximum
  • Time spent (rounded to nearest minutes) on cardio/medium-to-high-intensity exercise heart-rate zone—heart rate is 70 to 84% of maximum
  • Time spent (rounded to nearest minutes) on fat-burn/low-to-medium-intensity exercise heart-rate zone—heart rate is 50-69% of maximum
  • Average heart rate during the session
  • Total calories burned during the session

It appears that a higher average heart rate results in more calories burned, and that the total also depends on the time spent in the different heart-rate zones. Let's do some calculations using correlation coefficients.

[Image: Correlation - Cardio and Calories]

As expected, all three variables are positively correlated with calories burned. However, spending hours on the treadmill is probably not a very good way to burn calories. With the best summer weather just around the corner, I need a more efficient way to exercise to lose the few pounds from my indulgence in the winter months!
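For what it's worth, the same correlations are easy to compute with pandas. The little workout log below is invented for illustration, with column names chosen to mirror the tracker fields listed above.

```python
import pandas as pd

# Invented workout log with the same kinds of columns as the tracker export
workouts = pd.DataFrame({
    "cardio_min":  [20, 12, 25, 15, 22, 30, 18, 27],
    "fatburn_min": [15, 30, 10, 25, 12, 8, 20, 14],
    "avg_hr":      [142, 128, 150, 131, 155, 147, 138, 151],
    "calories":    [310, 240, 355, 260, 380, 345, 295, 365],
})

# Pearson correlations of everything with calories burned
print(workouts.corr()["calories"].sort_values(ascending=False))
```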

According to research, exercising at higher intensity can result in more calories burned due to the “afterburn” effect. The afterburn effect is the additional calories burned after intensive exercise. Recently, at my local gym, they have introduced 30-minute HIIT (high-intensity interval training) sessions, which I am considering taking. Hence, fitting a regression model using my data will probably help me make the decision.

In Minitab, I opened Stat > Regression > Regression > Fit Regression Model, and completed the dialog and sub-dialog boxes as shown below.

[Image: fitbit regression dialog]

[Image: fitbit regression subdialog]

Instead of using a trial-and-error approach to select terms for the model, I will use the stepwise approach to help me identify suitable terms for the model.

[Image: fitbit stepwise regression]

And after I press OK on each of my dialogs, Minitab returns the regression equation:

[Image: Regression Equation for Fitbit Data]

[Image: fitbit regression model summary]

The final model is quite decent, as the three types of R-squared values are all above 80%. This implies I can use this model to make predictions. The regression equation appears complex, but I can use the response optimizer in Minitab 17 to identify optimum settings to achieve my goal.
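If you prefer to script the modeling step, a comparable (if simplified) fit can be done with statsmodels. Note that this sketch skips stepwise selection entirely and just fits one candidate model to invented data, then predicts calories for one planned session.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented workout log (same columns as in the correlation sketch above)
workouts = pd.DataFrame({
    "cardio_min":  [20, 12, 25, 15, 22, 30, 18, 27],
    "fatburn_min": [15, 30, 10, 25, 12, 8, 20, 14],
    "avg_hr":      [142, 128, 150, 131, 155, 147, 138, 151],
    "calories":    [310, 240, 355, 260, 380, 345, 295, 365],
})

model = smf.ols("calories ~ cardio_min + fatburn_min + avg_hr",
                data=workouts).fit()
print(model.rsquared, model.rsquared_adj)

# Predict calories for a planned session: 21 min cardio, 15 min fat burn, HR 148
new = pd.DataFrame({"cardio_min": [21], "fatburn_min": [15], "avg_hr": [148]})
print(model.predict(new))
```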

There is a common belief that 1 pound of fat (0.45 kilogram) is approximately equal to 3500 calories. Let’s say I aim to burn about 300 calories in each session. This means after about 12 sessions I would have lost approximately a pound of fat, provided I also had a healthy diet. Since exercising at higher heart rate tends to burn more calories, I will also aim to maintain an average heart rate between, say, 128 and 148, which for me works out as somewhere between 70-80% of maximum heart rate.

With all the conditions above, using Stat > Regression > Regression > Response Optimizer, here are some screenshots of the dialog boxes.

[Image: response optimizer for fitbit]

[Image: response optimizer options for fitbit data]

My target for calories burned is 300, and getting above 300 would be a bonus. Hence, I am using 310 as the upper limit.

[Image: fitbit upper limit]

I would like to spend no more than 45 minutes per session and hence I am using a maximum of 30 minutes exercising in the cardio zone, and 15 minutes in the fat-burn zone.

[Image: Response optimization output for fitbit data]

[Image: Fitbit optimizer response plot]

To achieve my goal, I need to exercise in the cardio zone for about 21 minutes, exercise in the fat burn zone for about 15 minutes, and maintain my average heart rate at about 148 for the session.

I understand that the HIIT sessions involve very intense bursts of exercise followed by short, sometimes active, recovery periods. This type of training gets and keeps your heart rate up. Based on this, if out of a 30-minute HIIT session I can maintain about 21 minutes in the cardio zone, and spend the rest of the session exercising in the fat-burn zone, I will be close to achieving my goal. I can always supplement this by a few minutes on the exercise bike or cross-trainer after the class. 

Another good feature with the response optimizer is that I can evaluate different settings to see how the changes can affect the response. Let's consider the days when the HIIT class is not offered and I need to use the machines. I normally go for a longer session on the cross trainer (20-30 minutes), followed by a quick 10-minute session on the step machine. From past experience, I can easily get into the cardio heart-rate zone when using the cross-trainer. Now I can use the optimizer to predict the calories burned for 30 minutes of working out in the cardio zone and 10 minutes in the fat-burn zone. I will also use a lower average heart rate of 140.

By clicking on the current setup, I can input new settings.

[Image: Fitbit response optimizer new settings]

[Image: response optimizer for fitbit data cardio heart rate zone]

Well, this solution is not too far off from my target of 300 calories burned!

It’s turned out to be an enjoyable and informative experience analysing my own fitness data to see what my best workout options are. Taking the data collected by my fitness tracker and doing further analysis on it has definitely helped me to decide on how to exercise wisely and efficiently.   

 

Gym photo by Indigo Fitness Club Zurich, used under Creative Commons 2.0 license. 

2 Reasons 2 Recode Data and How 2 Do It in Less than 2 Minutes

[Image: convert numeric 2 into to]
It’s not easy to get data ready for analysis. Sometimes, data that include all the details we want aren’t clean enough for analysis. Even stranger, sometimes the exact opposite can be true: Data that are convenient to collect often don’t include the details that we want when we analyze them.

Let’s say that you’re looking at the documentation for the National Health and Nutrition Examination Survey (NHANES) from 2001-2002. By convention, the data set uses a symbol for missing values, but some variables have additional numeric codes for data that are missing for a specific reason. For example, one data set records hearing measurements (Audiometry). One variable in this data set is the middle ear pressure in the right ear, which has values from -282 to 180, but also includes these codes:

  • 555: Compliance <=0.2
  • 777: Refused
  • 888: Could not obtain

Although in some cases knowing how often each of these situations occurs could be important, to analyze the numeric data, you have to change these code values from numbers to something that won’t be analyzed. After all, leaving in a bunch of values that are more than twice what the maximum should be would have a serious effect on the mean of the data set.

In Minitab, try this:

  1. Choose Data > Recode > To Numeric.
  2. In Recode values in the following columns, enter the variables with the specialized missing values. If you’re following along with the NHANES data, the variable is AUXTMEPR.
  3. In Method, select Recode range of values.
  4. Complete the table with the endpoints and recoded values like this:

Lower endpoint | Upper endpoint | Recoded value
555 | 556 | *
777 | 778 | *
888 | 889 | *

  5. In Endpoints to include, select Lower endpoint only. Click OK.

The resulting column has missing values instead of the coded values. And that means the statistics that you calculate will now have the correct values.
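If you ever need to do the same cleanup outside Minitab, the pandas equivalent is a simple replace-with-missing. The readings below are invented; AUXTMEPR is the NHANES variable named above.

```python
import numpy as np
import pandas as pd

# Invented middle-ear pressure readings with the NHANES-style codes mixed in
auxtmepr = pd.Series([-50, 10, 555, -120, 777, 35, 888, 0])

# Turn the special codes into missing values so they don't distort the statistics
cleaned = auxtmepr.replace([555, 777, 888], np.nan)
print(auxtmepr.mean(), cleaned.mean())   # before vs. after
```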

Recoding can let you prepare data with numeric measurements for correct analysis, but the CDC data sets also often use numeric codes to represent categories. For example, one variable records these codes for the status of an audio exam:

  • 1: Complete
  • 2: Partial
  • 3: Not done

Another reason to recode your data before analyzing it is so that both the data itself and the values that subsequently appear as categories and on graphs are descriptive. You can recode these numeric codes to text in a similar fashion. Try this:

  1. Choose Data > Recode > To Text.
  2. In Recode values in the following columns, enter the variables with the numeric codes. If you are following along with the NHANES data, the variable is AUAEXSTS.
  3. In Method, select Recode individual values.
  4. Complete the table with the current values and recoded values like this:

Current value | Recoded value
1 | Complete
2 | Partial
3 | Not done

  5. Click OK.

The resulting column has the text labels instead of the numeric codes. When you create graphs, the labels will be descriptive.
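The pandas analogue of recoding to text is a simple map from codes to labels. Again, the codes below are invented; AUAEXSTS is the NHANES variable named above.

```python
import pandas as pd

# Invented exam-status codes
auaexsts = pd.Series([1, 1, 3, 2, 1, 3, 1, 2])

status = auaexsts.map({1: "Complete", 2: "Partial", 3: "Not done"})
print(status.value_counts())   # descriptive labels instead of 1/2/3
```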

Sometimes, data that are good to collect differ from data that are good to analyze. Sometimes we need more detail in the data that we collect than we need in the data that we analyze, such as when we record the reason that data are missing. And sometimes we need data to be quick to record, so we use abbreviations or codes that aren't as descriptive as they could be.

Fortunately, Minitab makes it easy for you to balance those needs by making it easy to manipulate your data, with features like recoding. Ready for more? Check out some of the ways that Minitab makes it easy to merge different worksheets together.

How to Identify Outliers (and Get Rid of Them)

[Image: an outlier among falcon tubes]
An outlier is an observation in a data set that lies a substantial distance from other observations. These unusual observations can have a disproportionate effect on statistics such as the mean, which can lead to misleading results. Outliers can provide useful information about your data or process, so it's important to investigate them. Of course, you have to find them first. 

Finding outliers in a data set is easy using Minitab Statistical Software, and there are a few ways to go about it.

Finding Outliers in a Graph

If you want to identify them graphically and visualize where your outliers are located compared to rest of your data, you can use Graph > Boxplot.

[Image: Boxplot]

This boxplot shows a few outliers, each marked with an asterisk. Boxplots are certainly one of the most common ways to visually identify outliers, but there are other graphs, such as scatterplots and individual value plots, to consider as well.

Finding Outliers in a Worksheet

To highlight outliers directly in the worksheet, you can right-click on your column of data and choose Conditional Formatting > Statistical > Outlier. Each outlier in your worksheet will then be highlighted in red, or whatever color you choose.

[Image: Conditional Formatting Menu in Minitab]

Removing Outliers

If you then want to create a new data set that excludes these outliers, that’s easy to do too. Now I’m not suggesting that removing outliers should be done without thoughtful consideration. After all, they may have a story – perhaps a very important story – to tell. However, for those situations where removing outliers is worthwhile, you can first highlight outliers per the Conditional Formatting steps above, then right-click on the column again and use Subset Worksheet > Exclude Rows with Formatted Cells to create the new data set.

The Math

If you want to know the mathematics used to identify outliers, let's begin by talking about quartiles, which divide a data set into quarters:

  • Q1 (the 1st quartile): 25% of the data are less than or equal to this value
  • Q3 (the 3rd quartile): 25% of the data are greater than or equal to this value
  • IQR (the interquartile range): the distance between Q1 and Q3 (Q3 – Q1); it contains the middle 50% of the data

Outliers are then defined as any values that fall below

Q1 – (1.5 * IQR)

or above

Q3 + (1.5 * IQR)

Of course, rather than doing this by hand, you can leave the heavy-lifting up to Minitab and instead focus on what your data are telling you.
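That said, if you do want to see the arithmetic, here is a small pandas sketch of the 1.5 × IQR rule on made-up data. Keep in mind that quartile calculation methods vary slightly between packages, so the flagged points may differ a little from Minitab's boxplot.

```python
import pandas as pd

# Made-up sample with one suspicious value
data = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 45])

q1, q3 = data.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]   # subset with outliers excluded
print(outliers.tolist())
```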

Don't see these features in your version of Minitab? Choose Help > Check for Updates to see if you're using Minitab 17.3.


There Is No Such Thing as “Bad” Data: Top Tips to Avoid Bad Analysis

You often hear the data being blamed when an analysis is not delivering the answers you wanted or expected. I was recently reminded that the data chosen or collected for a specific analysis is determined by the analyst, so there is no such thing as bad data—only bad analysis.

This made me think about the steps an analyst can take to minimise the risk of producing analysis that fails to answer the questions posed. Here are four tips I think are critical; we'd love to hear your thoughts and tips, too!

Tip 1: Diving Is Not Allowed

[Image: no diving!]

When presented with a business problem to solve, I love to dive straight into analyst mode; however, experience has taught me to resist this temptation at all costs. Before diving in, it's vital to step back, think about the problem, and consider what type of analysis you are going to do. Broadly speaking, there are three distinct types of analysis:

  • Descriptive—exploring what has happened.
    The tools you might use for this type include graphical analysis, hypothesis testing, capability, and control charts.
     
  • Predictive—forecasting what will happen next. 
    In this category of analysis, you use techniques such as regression, time series forecasting, and reliability analysis.
     
  • Prescriptive—determining what the business should do next. 
    Techniques in this type of analysis include design of experiments, optimisation, and simulation.

Once you have determined the type of analysis you want to do, you can start trying to find existing data or collect new data to complete your analysis.

Tip 2: Reliable Data Is Key

There are three things you need to consider when collecting data for a specific type of analysis. 

  1. How are you going to measure performance (your response variable)?  
    Once you have decided this, you need to ensure that this measurement can be collected accurately and precisely. If your measurements are unreliable for any reason, then your analysis and any recommendations also will be unreliable. Measurement system analysis, including gage analyses and attribute agreement analysis, can help with these problems.       
      
  2. What factors or input parameters might affect your performance? 
    These are useful in descriptive analysis for segmenting the results you are seeing, allowing you to highlight opportunities and problems in specific areas of your business. In predictive and prescriptive analysis these are essential for optimising your future business performance.
      
  3. What are the potential impacts of this analysis?
    Finally, you need to understand the costs, benefits and risks associated with any analysis. This will help you determine how much you are prepared to spend on the analysis itself, and more important, what you are prepared to invest to fix any problems and/or develop new opportunities the analysis reveals.

Tip 3: It’s All about the Power

Once you know what kind of analysis you need to do, then you can work out how much data you need to collect. Minitab's Power and Sample Size menu is one of the best tools for this, as it allows an analyst to calculate the sample size needed for different types of analyses, under a number of scenarios, with a minimal amount of prior knowledge about the data you are going to collect.

[Image: power-and-sample-size]

The decisions you as an analyst need to make are:

  • How big is the effect you need to find?  
    Power is the probability of finding an effect if it exists. For example, if you are making bolts that should be 10 mm in diameter on average, maybe a +/- 1 mm difference would result in too many bolts scrapped for being too big or too small. The determination of this effect (or difference) has to be done by someone with process knowledge, because it is a business decision, not a statistical one. However, it is a decision that will impact the sample size.
     
  • How much variation can you expect in your data, measured as a standard deviation? 
    You need to decide this because the power calculation depends on the ratio of the size of the effect you are looking for to the standard deviation of your data. (If you don’t have a historical standard deviation, you can use the value “1” and enter the differences you are looking for as standard deviations. Typically a one-standard-deviation difference is considered small, and a three-standard-deviation difference large.)
     
  • How powerful do you want your analyses to be?
    The power is the probability of finding an effect if there is one to find, and as a minimum this should be 80%. The higher the certainty you want of finding an effect if it exists, the larger the sample you will need.  

Once you have completed your power and sample size analysis, you are ready to collect your data and analyse it.
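If you like to sanity-check a sample-size calculation in code, statsmodels has power functions for common tests. This sketch assumes a two-sample t-test looking for a one-standard-deviation difference at 5% significance and 80% power; it is an illustration, not a replacement for the scenario-by-scenario options in the Power and Sample Size menu.

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group given effect size, alpha, and power
n_per_group = TTestIndPower().solve_power(effect_size=1.0, alpha=0.05, power=0.80)
print(n_per_group)   # roughly 17 per group for a one-sd difference
```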

Tip 4: Good Analysis Always Has Value

When you start an analysis, you often have an idea of what you expect the results to be, because you have seen some evidence of the problem or opportunity. Consequently, when our ideas or theories are not supported by the analysis, we become disappointed in the results. If you have followed a rigorous analytical methodology to answer a specific question, then accept the results, present the recommendations (in some cases this will be the recommendation of no change), and move on to the next analysis. Finding out that something is not important to your business performance can be just as important as finding out what the key influencers are!

Do you have additional suggestions for avoiding bad analyses? 

 

Using the Nelson Rules for Control Charts in Minitab

by Matthew Barsalou, guest blogger

Control charts plot your process data to identify and distinguish between common cause and special cause variation. This is important, because identifying the different causes of variation lets you take action to make improvements in your process without over-controlling it.

When you create a control chart, the software you're using should make it easy to see where you may have variation that requires your attention. For example, Minitab Statistical Software automatically flags any control chart data point that is more than three standard deviations from the centerline, as shown in the I chart below.

[Image: I Chart of Data - Nelson Rules]

I chart example with one out-of-control point.

A data point that is more than three standard deviations from the centerline is one indicator for detecting special-cause variation in a process. There are additional control chart rules introduced by Dr. Lloyd S. Nelson in his April 1984 Journal of Quality Technology column. The eight Nelson Rules are shown below, and if you're interested in using them, they can be activated in Minitab.
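To make that first rule concrete, here is a small Python sketch that estimates sigma the way an individuals chart typically does (average moving range divided by 1.128) and flags points beyond three sigma. The measurements are invented, and this covers only rule 1, not the other seven Nelson rules.

```python
import numpy as np

# Invented individual measurements, one of them unusually high
x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 14.5, 10.2, 9.7, 10.1])

center = x.mean()
mr_bar = np.abs(np.diff(x)).mean()   # average moving range
sigma = mr_bar / 1.128               # d2 constant for moving ranges of size 2

ucl, lcl = center + 3 * sigma, center - 3 * sigma
rule1 = np.where((x > ucl) | (x < lcl))[0]   # Nelson rule 1: beyond 3 sigma
print(rule1)                                 # index of the flagged point
```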

[Image: Nelson Rules for special cause variation in control charts]

The Nelson rules for tests of special causes. Reprinted with permission from Journal of Quality Technology©1984 ASQ, asq.org.

To activate the Nelson rules, go to Control Charts > Variables Charts for Individuals > Individuals... and then click on "I Chart Options." Go to the Tests tab and place a check mark next to the test you would like to select—or simply use the drop-down menu and select “Perform all tests for special causes,” as shown below.

[Image: Individual Charts Options in Minitab]

The resulting session window explains which tests failed.

[Image: session window output]

On the chart itself, the data points that failed each test are identified in red as shown below.

[Image: I chart of data]

Simply activating all of the rules is not recommended—the false positive rate goes up as each additional rule is activated. At some point the control chart will become more sensitive than it needs to be, and corrective actions for special causes of variation may be implemented when only common cause variation is present.

Fortunately, Nelson provided detailed guidance on the correct application of his namesake rules. Nelson’s guidance on applying his rules for tests of special causes is presented below.

[Image: comments on test for special causes]

Comments on tests for special causes. Reprinted with permission from Journal of Quality Technology©1984 ASQ, asq.org.

Nelson’s tenth comment is an especially important one, regardless of which tests have been activated.  

Minitab, together with the Nelson rules, can be very helpful, but neither can replace or remove the need for the analyst's judgment when assessing a control chart. These rules can, however, assist the analyst in making the proper decision. 

 

About the Guest Blogger

Matthew Barsalou is a statistical problem resolution Master Black Belt at BorgWarner Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time, Statistics for Six Sigma Black Belts, and The ASQ Pocket Guide to Statistics for Six Sigma Black Belts.

Those 10 Simple Rules for Using Statistics? They're Not Just for Research

Earlier this month, PLOS.org published an article titled "Ten Simple Rules for Effective Statistical Practice." The 10 rules are good reading for anyone who draws conclusions and makes decisions based on data, whether you're trying to extend the boundaries of scientific knowledge or make good decisions for your business. 

Carnegie Mellon University's Robert E. Kass and several co-authors devised the rules in response to the increased pressure on scientists and researchers—many, if not most, of whom are not statisticians—to present accurate findings based on sound statistical methods. 

Since the paper and the discussions it has prompted focus on scientists and researchers, it seems worthwhile to consider how the rules might apply to quality practitioners or business decision-makers as well. In this post, I'll share the 10 rules, some with a few modifications to make them more applicable to the wider population of all people who use data to inform their decisions. 

[Image: questions]

1. Statistical Methods Should Enable Data to Answer Specific Questions

As the article points out, new or infrequent users of statistics tend to emphasize finding the "right" method to use—often focusing on the structure or format of their data, rather than thinking about how the data might answer an important question. But choosing a method based on the data is putting the cart before the horse. Instead, we should start by clearly identifying the question we're trying to answer. Then we can look for a method that uses the data to answer it. If you haven't already collected your data, so much the better—you have the opportunity to identify and obtain the data you'll need.

2. Signals Always Come With Noise

If you're familiar with control charts used in statistical process control (SPC) or the Control phase of a Six Sigma DMAIC project, you know that they let you distinguish process variation that matters (special-cause variation) from normal process variation that doesn't need investigation or correction.

[Image: control chart]

Control charts are one common tool used to distinguish "noise" from "signal." 

The same concept applies here: whenever we gather and analyze data, some of what we see in the results will be due to inherent variability. Measures of probability for analyses, such as confidence intervals, are important because they help us understand and account for this "noise." 

3. Plan Ahead, Really Ahead

Say you're starting a DMAIC project. Carefully considering and developing good questions right at the start of a project—the DEFINE stage—will help you make sure that you're getting the right data in the MEASURE stage. That, in turn, should result in a much smoother and stress-free ANALYZE phase—and probably more successful IMPROVE and CONTROL phases, too. The alternative? You'll have to complete the ANALYZE phase with the data you have, not the data you wish you had. 

4. Worry About Data Quality

[Image: gauge]
"Can you trust your data?" My Six Sigma instructor asked us that question so many times, it still flashes through my mind every time I open Minitab. That's good, because he was absolutely right: if you can't trust your data, you shouldn't do anything with it. Many people take it for granted that the data they get is precise and accurate, especially when using automated measuring instruments and similar technology. But how do you know they're measuring precisely and accurately? How do you know your instruments are calibrated properly? If you didn't test it, you don't know. And if you don't know, you can't trust your data. Fortunately, with measurement system analysis methods like gage R&R and attribute agreement analysis, we never have to trust data quality to blind faith. 

5. Statistical Analysis Is More Than a Set of Computations

Statistical techniques are often referred to as "tools," and that's a very apt metaphor. A saw, a plane, and a router all cut wood, but they aren't interchangeable—the end product defines which tool is appropriate for a job. Similarly, you might apply ANOVA, regression, or time series analysis to the same data set, but the right tool depends on what you want to understand. To extend the metaphor further, just as we have circular saws, jigsaws, and miter saws for very specific tasks, each family of statistical methods also includes specialized tools designed to handle particular situations. The point is that we select a tool to assist our analysis, not to define it. 

6. Keep it Simple

Many processes are inherently messy. If you've got dozens of input variables and multiple outcomes, analyzing them could require many steps, transformations, and some thorny calculations. Sometimes that degree of complexity is required. But a more complicated analysis isn't always better—in fact, overcomplicating it may make your results less clear and less reliable. It also potentially makes the analysis more difficult than necessary. You may not need a complex process model that includes 15 factors if you can improve your output by optimizing the three or four most important inputs. If you need to improve a process that includes many inputs, a short screening experiment can help you identify which factors are most critical, and which are not so important. 

7. Provide Assessments of Variability

No model is perfect. No analysis accounts for all of the observed variation. Every analysis includes a degree of uncertainty. Thus, no statistical finding is 100% certain, and that degree of uncertainty needs to be considered when using statistical results to make decisions. If you're the decision-maker, be sure that you understand the risks of reaching a wrong conclusion based on the analysis at hand. If you're sharing your results with stakeholders and executives, especially if they aren't statistically inclined, make sure you've communicated that degree of risk to them by offering and explaining confidence intervals, margins of error, or other appropriate measures of uncertainty. 

8. Check Your Assumptions

Different statistical methods are based on different assumptions about the data being analyzed. For instance, many common analyses assume that your data follow a normal distribution. You can check most of these assumptions very quickly using functions like a normality test in your statistical software, but it's easy to forget (or ignore) these steps and dive right into your analysis. However, failing to verify those assumptions can yield results that aren't reliable and shouldn't be used to inform decisions, so don't skip that step. If you're not sure about the assumptions for a statistical analysis, Minitab's Assistant menu explains them, and can even flag violations of the assumptions before you draw the wrong conclusion from an errant analysis. 

9. When Possible, Verify Success!

In science, replication of a study—ideally by another, independent scientist—is crucial. It indicates that the first researcher's findings weren't a fluke, and provides more evidence in support of the given hypothesis. Similarly, when a quality project results in great improvements, we can't take it for granted those benefits are going to be sustained—they need to be verified and confirmed over time. Control charts are probably the most common tool for making sure a project's benefits endure, but depending on the process and the nature of the improvements, hypothesis tests, capability analysis, and other methods also can come into play.  

10. Share How You Did It

In the original 10 Simple Rules article, the authors suggest scientists share their data and explain how they analyzed it so that others can make sure they get the same results. This idea doesn't translate so neatly to the business world, where your data may be proprietary or private for other reasons. But just as science benefits from transparency, the quality profession benefits when we share as much information as we can about our successes. Of course you can't share your company's secret-sauce formulas with competitors—but if you solved a quality challenge in your organization, chances are your experience could help someone facing a similar problem. If a peer in another organization already solved a problem like the one you're struggling with now, wouldn't you like to see if a similar approach might work for you? Organizations like ASQ and forums like iSixSigma.com help quality practitioners network and share their successes so we can all get better at what we do. And here at Minitab, we love sharing case studies and examples of how people have solved problems using data analysis, too. 

How do you think these rules apply to the world of quality and business decision-making? What are your guidelines when it comes to analyzing data? 

 

QI Trends in Healthcare: What Are the Statistical "Soft Spots"?

[Image: big wave]
It's been called a "demographic watershed". 

In the next 15 years alone, the worldwide population of individuals aged 65 and older is projected to increase more than 60%, from 617 million to about 1 billion.1

Increasingly, countries are asking themselves: How can we ensure a high quality of care for our growing aging population while keeping our healthcare costs under control?

The answer? More efficiency. Less waste. Reduced error. Quicker turnaround. Improved patient outcomes. Which is, of course, where lean/six sigma quality improvement comes in.

Another Watershed: Lean/Six Sigma in Healthcare

Faced with the challenge of increased demand and rapidly rising cost, it's no surprise that more and more lean and six sigma studies are being performed and published in the fields of medicine and healthcare. A search for "lean sigma" or "six sigma" in the U.S. National Library of Medicine/National Institutes of Health database pulls up over 470 published studies. The year-by-year search results are shown in the following Trend Analysis Plot in Minitab.

[Image: trend]

Note that this trend forecast based on a linear model could be conservative. The database already shows 39 published studies in the first 6 months of this year (2016). In the coming years, these data may require a quadratic (curved) model!

A Surprising Breadth of Applications

QI studies in healthcare run the gamut. Reducing appointment no-shows. Avoiding preventable hospital readmissions. Cutting down on the amount of expired medical drugs and equipment that must be discarded. Reducing the number of falls in nursing care facilities. Even tracking and reducing the number of times the door is opened and closed during surgeries! 

The one consensus in all these diverse studies? There's no shortage of opportunity for trimming waste and for improving patient outcomes in the healthcare setting. 

Improving on Improvement: Statistical "Soft Spots" 

With the increased number of published studies on lean/six sigma in healthcare comes increased scrutiny. Review studies are beginning to look more closely at the methods used to monitor and measure healthcare quality outcomes. These reviews have identified statistical shortcomings in the published studies.2-4

  • No testing for statistical significance 
    A study might report a reduction in, say, mean waiting times for surgery for patients after an improvement initiative. But no statistical analyses are performed to determine whether that reduction is statistically significant.

  • Lack of randomization
    The samples examined in a study are not randomly selected, and may not be representative of the population. Therefore, the results are subject to selection bias.

  • Inadequate follow-up
    A study might report a statistically significant change after a lean/six sigma initiative, but there is lack of adequate follow-up. Therefore, it's unclear whether the improvement "stuck" or whether the initial change was simply a short-term Hawthorne effect.

  • Missing confidence intervals
    Many studies neglect to report associated confidence intervals with their results. As a result, the level of precision is unclear, and the clinical significance of the results cannot be reliably interpreted.

No Single Study is a Statistical Slam-Dunk

There's no such thing as a "perfect" study. You do the best you can, with the resources you have and the constraints you face. Progress is made in increments. Conditions change. And results must be re-verified and reproduced.

Still, it's important to remember that "continuous improvement" can be applied not only to the process itself, but to the methods that we use to monitor and improve a process. To get the most out of quality improvement efforts in healthcare, it's critical to be aware of the common statistical soft spots, and how to avoid them.

That way, when the wave breaks, we'll be on sturdier ground.

Sources

1 An Aging World: 2015. International Population Reports. US Census Bureau. Available here.

2 Nicolay CR, Purkayastha S, Greenhalgh A, Benn J, Chaturvedi S, Phillips N, Darzi A. Systematic review of the application of quality improvement methodologies from the manufacturing industry to surgical healthcare. Br J Surg. 2012 Mar;99(3):324-35. Epub 2011 Nov 18.

3 Amaratunga T, Dobranowski J. Systematic review of the application of Lean and Six Sigma quality improvement methodologies in radiology. J Am Coll Radiol. 2016 May 18.

4 Mason SE, Nicolay CR, Darzi A. The use of Lean and Six Sigma methodologies in surgery: a systematic review. Surgeon. 2015 Apr;13(2):91-100.

Photo: Big Wave Breaking by Brocken Inaglory. Licensed by Wikimedia Commons. 

Applying DOE for Great Grilling, part 1

Design of Experiments (DOE) has a reputation for difficulty, and to an extent, this statistical method deserves that reputation. While the basic idea is easy to grasp (acquire the maximum amount of information from the fewest possible experimental runs), practical application of this tool can quickly become very confusing. 

[Photo: steaks on the grill]
Even if you're a long-time user of designed experiments, it's still easy to feel uncertain if it's been a while since you last looked at split-plot designs or needed to choose the appropriate resolution for a fractional factorial design.

But DOE is an extremely powerful and useful tool, so when we launched Minitab 17, we added a DOE tool to the Assistant to make designed experiments more accessible to more people.

Since summer is here at Minitab's world headquarters, I'm going to illustrate how you can use the Assistant's DOE tool to optimize your grilling method.  

If you're not already using it and you want to play along, you can download the free 30-day trial version of Minitab Statistical Software.

Two Types of Designed Experiments: Screening and Optimizing

To create a designed experiment using the Assistant, open Minitab and select Assistant > DOE > Plan and Create. You'll be presented with a decision tree that helps you take a sequential approach to the experimentation process by offering a choice between a screening design and a modeling design.

[Screenshot: the Assistant's DOE Plan and Create decision tree]

A screening design is important if you have a lot of potential factors to consider and you want to figure out which ones are important. The Assistant guides you through the process of testing and analyzing the main effects of 6 to 15 factors, and identifies the factors that have the greatest influence on the response.
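If you're curious what a two-level screening design looks like under the hood, here's a minimal Python sketch. It is not the Assistant's algorithm; it simply builds an 8-run, resolution III fractional factorial for 7 factors by assigning the extra factors to interaction columns of a 2^3 full factorial.

# A minimal sketch of how an 8-run screening design for 7 two-level factors
# can be constructed: start from a 2^3 full factorial in A, B, C and define
# the remaining factors from interaction columns (D=AB, E=AC, F=BC, G=ABC).
# Illustrative only -- not the design the Minitab Assistant generates.
from itertools import product

runs = []
for a, b, c in product((-1, 1), repeat=3):
    runs.append((a, b, c, a * b, a * c, b * c, a * b * c))

header = ("A", "B", "C", "D=AB", "E=AC", "F=BC", "G=ABC")
print("  ".join(f"{h:>5s}" for h in header))
for run in runs:
    print("  ".join(f"{v:>5d}" for v in run))

Each of the 7 factor columns still alternates between its low (-1) and high (+1) setting a balanced number of times, which is what lets 8 runs estimate 7 main effects.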

Once you've identified the critical factors, you can use the modeling design. Select this option, and the Assistant guides you through testing and analyzing 2 to 5 critical factors and helps you find optimal settings for your process.

Even if you're an old hand at analyzing designed experiments, you may want to use the Assistant to create designs since the Assistant lets you print out easy-to-use data collection forms for each experimental run. After you've collected and entered your data, the designs created in the Assistant can also be analyzed using Minitab's core DOE tools available through the Stat > DOE menu.

Creating a DOE to Optimize How We Grill Steaks

For grilling steaks, there aren't that many variables to consider, so we'll use the Assistant to plan and create a modeling design that will optimize our grilling process. Select Assistant > DOE > Plan and Create, then click the "Create Modeling Design" button. 

Minitab brings up an easy-to-follow dialog box; all we need to do is fill it in. 

First we enter the name of our Response and the goal of the experiment.  Our response is "Flavor," and the goal is "Maximize the response." Next, we enter our factors. We'll look at three critical variables:

  • Number of turns, a continuous variable with a low value of 1 and high value of 3.
  • Type of grill, a categorical variable with Gas or Charcoal as options. 
  • Type of seasoning, a categorical variable with Salt-Pepper or Montreal steak seasoning as options. 

If we wanted to, we could select more than 1 replicate of the experiment.  A replicate is simply a complete set of experimental runs, so if we did 3 replicates, we would repeat the full experiment three times. But since this experiment has 16 runs, and neither our budget nor our stomachs are limitless, we'll stick with a single replicate. 
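For comparison, here's a rough Python sketch of the same idea outside Minitab: a simple 2x2x2 full-factorial run list for our three grilling factors, optionally replicated, randomized, and written out as a basic data collection sheet. It is illustrative only; the Assistant's actual modeling design for this experiment has 16 runs and differs from this bare-bones version.

# A minimal sketch (not the Assistant's 16-run modeling design) of building a
# run list for the three grilling factors: a 2x2x2 full factorial, optionally
# replicated, in randomized run order, written out as a simple data collection
# sheet with an empty Flavor column.
import csv
import random
from itertools import product

factors = {
    "Turns": [1, 3],                              # continuous, low/high values
    "Grill": ["Gas", "Charcoal"],                 # categorical
    "Seasoning": ["Salt-Pepper", "Montreal"],     # categorical
}
replicates = 1                                    # one complete set of runs

runs = list(product(*factors.values())) * replicates
random.seed(7)
random.shuffle(runs)                              # randomize the run order

with open("grilling_runs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["RunOrder", *factors.keys(), "Flavor"])
    for i, run in enumerate(runs, start=1):
        writer.writerow([i, *run, ""])            # leave Flavor blank to fill in at the grill

print(f"Wrote {len(runs)} runs to grilling_runs.csv")

Setting replicates = 3 would repeat the full set of factor combinations three times, exactly as described above, at the cost of three times as many steaks.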

When we click OK, the Assistant first asks if we want to print out data collection forms for this experiment: 

[Screenshot: prompt to print data collection forms]
Choose Yes, and you can print a form that lists each run, the variables and settings, and a space to fill in the response:

[Screenshot: printed data collection form listing each run]
Alternatively, you can just record the results of each run in the worksheet the Assistant creates, which you'll need to do anyway. But having the printed data collection forms can make it much easier to keep track of where you are in the experiment, and exactly what your factor settings should be for each run. 

If you've used the Assistant in Minitab for other methods, you know that it seeks to demystify your analysis and make it easy to understand. When you create your experiment, the Assistant gives you a Report Card and Summary Report that explain the steps of the DOE and important considerations, and a summary of your goals and what your analysis will show. 

Now it's time to cook some steaks, and rate the flavor of each. If you want to do this for real and collect your own data, please do so!  Tomorrow's post will show how to analyze your data with the Assistant. 
