Outliers in Boxplots

Minitab is the leading global provider of software and services for quality improvement and statistics education. Quality. Analysis. Results. For more info visit Minitab.com. This podcast is available at KeithBower.com.

Hello, today I’m going to talk about boxplots, and the detection of potential outliers when using a boxplot.

Now, boxplots were invented by John Tukey. And, if you think of the construction of a boxplot, the middle fifty percent of your set of data is encompassed by the box. The bottom of the box we call the first quartile, or Q1, and the top of the box is the third quartile, or Q3. Therefore we have 50% of the data in that box. And then there are whiskers that stretch out from it. These whiskers stretch up, potentially to the highest observation, and down, potentially to the lowest observation. But you’ll notice that they may not necessarily do that. The whiskers may go so far, and then stop, and then there’ll be, let’s say, an asterisk, further away. What those asterisks represent are potential outliers.

How do we flag those potential outliers in one of Tukey’s boxplots? Well, for most statistical software programs (that I’ve seen anyway) the way in which it’s computed is to take the height of that box, the spread of the middle 50% (we call that the Interquartile Range, Q3 - Q1), and multiply it by 1.5. Call that distance k. We add k to the top of the box, giving Q3 + k, and we subtract k from the first quartile, giving Q1 - k. Any values that fall outside that region, above Q3 + k or below Q1 - k, are flagged as potential outliers.

An important consideration with a boxplot is that when these outliers are flagged, the standard deviation plays no part in the detection. A lot of outlier detection tests out there, like Weisberg’s test and Extreme Studentized Deviate tests, do incorporate a standard deviation.
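
To make the rule just described concrete, here is a minimal Python sketch of it. The function names and the sample data are made up for illustration. The quartiles come from the standard library's `statistics.quantiles` with its default method, which is an assumption: different software packages compute quartiles in slightly different ways, so the fences from a given program may differ a little from these.

```python
from statistics import quantiles


def tukey_fences(values, multiplier=1.5):
    """Return (lower_fence, upper_fence) for one set of observations."""
    # Quartile convention is an assumption: Python's default 'exclusive' method.
    q1, _median, q3 = quantiles(values, n=4)
    k = multiplier * (q3 - q1)  # 1.5 times the interquartile range
    return q1 - k, q3 + k


def potential_outliers(values, multiplier=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = tukey_fences(values, multiplier)
    return [v for v in values if v < lower or v > upper]


def whisker_ends(values, multiplier=1.5):
    """Return the most extreme observations that are still inside the fences."""
    lower, upper = tukey_fences(values, multiplier)
    inside = [v for v in values if lower <= v <= upper]
    return min(inside), max(inside)


# Hypothetical data: ten ordinary readings and one far-away value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40]

fences = tukey_fences(data)         # (-5.0, 19.0)
flagged = potential_outliers(data)  # [40]
whiskers = whisker_ends(data)       # (2, 11)
```

With this sample, Q1 is 4 and Q3 is 10, so k is 9. The upper whisker stops at 11, the largest value inside the fence, and 40 is the one that would be drawn as an asterisk.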
So of course, if you’ve got skewed data, where the mean may be pulled upward and the standard deviation inflated, a test built on those statistics may not be the most worthwhile methodology for detecting potential outliers. The beauty of Tukey’s boxplot is that its rule doesn’t use the mean or the standard deviation at all. It’s based purely on how spread out the middle 50% is: multiply that by 1.5, then add it to the third quartile and subtract it from the first quartile.

You must keep in mind, of course, that if you are using boxplots, and let’s say there are 3 boxplots next to each other, corresponding to fertilizer a, fertilizer b, and fertilizer c, the detection of a potential outlier within a box comes solely from that box itself, the middle 50% for that particular level of the factor. It’s got nothing to do with any of the other levels.

So I hope this has been useful. Of course, if you have any questions on this or anything else, please feel free to email me through my website, KeithBower.com. For more information on statistical methods for quality improvement, visit KeithBower.com.
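
As a small illustration of the fertilizer point, the sketch below applies the 1.5 times IQR rule to each group separately. The yield figures and the function name are invented for the example, and the quartile method (the default of Python's `statistics.quantiles`) is an assumption that may not match a particular statistics package.

```python
from statistics import quantiles


def flag_potential_outliers(values, multiplier=1.5):
    """Flag values outside Q1 - k and Q3 + k, using this group's quartiles only."""
    q1, _median, q3 = quantiles(values, n=4)  # assumed quartile convention
    k = multiplier * (q3 - q1)
    return [v for v in values if v < q1 - k or v > q3 + k]


# Hypothetical yields; the value 35 appears in all three groups.
yields = {
    "fertilizer a": [20, 21, 22, 23, 24, 25, 35],
    "fertilizer b": [30, 31, 32, 33, 34, 35, 36],
    "fertilizer c": [10, 20, 30, 35, 40, 50, 60],
}

# Each group is judged against its own box, never against the other groups.
flagged = {name: flag_potential_outliers(vals) for name, vals in yields.items()}
```

Here 35 is flagged only for fertilizer a, where the middle 50% runs from 21 to 25. For fertilizer b and fertilizer c the same value sits comfortably inside the fences, because those boxes are located or spread differently.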