Minitab is the leading global provider of software and services for quality improvement
and statistics education.
Quality. Analysis. Results. For more info visit Minitab.com.
This podcast is available at KeithBower.com
Hello, today I’m going to talk about boxplots, and the detection of potential outliers when
using a boxplot. Now, boxplots were invented by John Tukey. And, if you think of the construction
of a boxplot, the middle fifty percent of your set of data is encompassed by this box.
The bottom of the box we call the first quartile, or Q1, and the top of the box is the third
quartile, or Q3. Therefore we have 50% of the data in that box. And then there are whiskers
that stretch out from it. These whiskers stretch up, potentially to the highest observation,
and stretch down, potentially to the lowest observation. But you’ll notice that they may not necessarily
do that. The whiskers may go so far, and then stop,
and then there'll be, let's say, an asterisk, further away. What those asterisks represent
are, let's say, potential outliers. How do we flag those potential outliers in one of
Tukey’s boxplots? Well, for most statistical software programs (that I’ve seen anyway),
the way it’s computed is to take that box in the boxplot, the middle 50% (we call
the height of that box the interquartile range, Q3 - Q1), and multiply that distance by 1.5.
Call that distance k. If we add k to the top of the box, that gives us Q3 plus k,
and if we subtract k from the first quartile, that gives us Q1 minus k.
Any values that fall outside that region are flagged as potential outliers. An important
consideration with a boxplot is that these potential outliers are flagged without
using a standard deviation at all. A lot of outlier detection
tests out there, like Weisberg’s test or the Extreme Studentized Deviate test, incorporate
a standard deviation. So of course if you’ve got skewed data, where the mean and
the standard deviation are pulled around by the long tail, those tests may not be the
most worthwhile methodology to detect potential outliers. The beauty of Tukey’s boxplot
is that it doesn’t use the mean or the standard deviation at all. It’s basing it purely
on how spread out the middle 50% is, multiplying that by 1.5, and then adding it to
the third quartile and subtracting it from the first quartile. You must keep in mind of course
that if you are using boxplots, and let’s say there are 3 boxplots next to each other,
corresponding to fertilizer A, fertilizer B, fertilizer C, the detection of a potential
outlier within a box comes solely from that box itself, the middle 50% for that
particular level of the factor. It’s got nothing to do with any of the other levels.
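
As a rough sketch of the rule described above, here is one way it could be written in Python. The function names and the fertilizer yields are made up for illustration. The quartile convention (Python's default "exclusive" method) is also an assumption: statistical packages compute Q1 and Q3 in slightly different ways, so the fences from a particular program may differ a little.

```python
from statistics import quantiles


def tukey_fences(values, multiplier=1.5):
    """Return (lower_fence, upper_fence) built from Q1, Q3 and the IQR."""
    # quantiles(..., n=4) returns [Q1, median, Q3]; the default "exclusive"
    # method is an assumption, other packages may use another convention.
    q1, _median, q3 = quantiles(values, n=4)
    k = multiplier * (q3 - q1)  # the distance k: 1.5 times the interquartile range
    return q1 - k, q3 + k


def potential_outliers(values, multiplier=1.5):
    """Return the values that fall outside the fences for this group only."""
    lower, upper = tukey_fences(values, multiplier)
    return [v for v in values if v < lower or v > upper]


# Hypothetical yields for three fertilizers (illustrative numbers, not real data).
yields = {
    "a": [20, 21, 22, 23, 24, 25, 40],
    "b": [38, 39, 40, 41, 42, 43, 44],
    "c": [10, 30, 31, 32, 33, 34, 35],
}

# Each group is checked against its own box, never against the other groups.
flagged = {name: potential_outliers(vals) for name, vals in yields.items()}
print(flagged)  # {'a': [40], 'b': [], 'c': [10]}
```

Notice that the value 40 is flagged for fertilizer a but not for fertilizer b, because each set of fences comes only from that group's own middle 50%.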
So I hope this has been useful. Of course, if you have any questions on this or anything
else, please feel free to email me through my website, KeithBower.com. For more information
on statistical methods for quality improvement, visit KeithBower.com.