Explore the Distribution of a Variable in SAS

In this video, you will learn to explore the distribution of a variable using the UNIVARIATE procedure in SAS. Suppose that you are performing an exploratory data analysis on a data set that you were just handed. During this exploratory analysis, you want to look at the distribution of one of the included variables. You want descriptive statistics and graphics that enable you to visualize the distribution of the variable. PROC UNIVARIATE will be your procedure of choice.

Let’s take a closer look at the Height variable from the bodyfat2 data set using PROC UNIVARIATE. The DATA= option directs SAS to the bodyfat2 data set that is located in the sasuser library. The variable of interest, Height, is placed in the VAR statement. If you omit the VAR statement, SAS performs a distribution analysis of all numeric variables in the data set.

The HISTOGRAM statement requests a histogram of the Height variable. The NORMAL option overlays a normal curve on the histogram. You have the option to specify a mean and standard deviation for this normal curve. If you specify MU=EST and SIGMA=EST, SAS overlays the normal curve whose mean and standard deviation are estimated from the observed data. The KERNEL option overlays a kernel density estimate on the histogram. The kernel curve is a smooth approximation of the histogram.

The first INSET statement places, within the histogram graphic, a text box that contains the listed statistics. In our example, we will include the skewness and kurtosis statistics in our text box. The POSITION= option enables you to select where in the graphic to place the text box. Our example places the text box in the northeast corner of the graphic. The default position is the northwest corner.

The PROBPLOT statement requests a probability plot for the Height variable.
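
The narration describes the program without showing it on the page. The following is a minimal sketch reconstructed from that description, not a copy of the code shown in the video. It assumes that bodyfat2 is available in the sasuser library, and it already includes the PROBPLOT options and the second INSET statement that the narration covers next.

```sas
/* Sketch reconstructed from the narration; assumes sasuser.bodyfat2 exists. */
proc univariate data=sasuser.bodyfat2;
   var Height;                                         /* variable to analyze */
   histogram Height / normal(mu=est sigma=est) kernel; /* fitted normal curve plus kernel density */
   inset skewness kurtosis / position=ne;              /* text box on the histogram, northeast corner */
   probplot Height / normal(mu=est sigma=est);         /* normal probability plot with reference line */
   inset skewness kurtosis / position=nw;              /* text box on the probability plot, northwest corner */
run;
```

Each INSET statement applies to the plot statement that precedes it, which is why the program contains two of them.
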
In the PROBPLOT statement, the NORMAL option inserts a diagonal reference line that enables you to visually check for normality of the variable. As with the HISTOGRAM statement, you have the option to specify a mean and standard deviation for the normal distribution. As before, we will have the mean and standard deviation of the normal distribution match those of the observed data. The second INSET statement places a text box in the northwest corner of the probability plot graphic that will contain the skewness and kurtosis statistics.

Now let’s submit the code and look at the output. The output begins with the Moments table. This table contains the sample size, the sum of the observation weights (which matches the sample size in this problem because the weight associated with each observation is 1), the mean, the sum of the observations, the standard deviation, the variance, the skewness, the kurtosis, both the uncorrected and corrected sums of squares, the coefficient of variation, and the standard error of the mean.

The Basic Statistical Measures table displays measures of location (mean, median, and mode) as well as measures of variability (standard deviation, variance, range, and interquartile range).

The Tests for Location table provides several statistical tests for the null hypothesis that the center of the distribution (its mean or median, depending on the test) is equal to a specific value. By default, this value is zero. By including the MU0= option in the PROC UNIVARIATE statement, you can change it to any value. The table includes both parametric and nonparametric tests.

The Quantiles table displays several of the calculated quantiles of the data. These include but are not limited to the minimum, maximum, median, lower quartile, and upper quartile.

The Extreme Observations table displays the five lowest and five highest values of the variable. In addition to each value, the number of the observation where the value occurs is listed.

In the histogram, the blue line represents the normal distribution with the mean and standard deviation estimated from the data.
The red line is the kernel curve, a smooth approximation of the histogram. In the northeast corner, we find the text box created by the INSET statement, containing the skewness and kurtosis statistics.

The Parameters for Normal Distribution table displays the mean and standard deviation estimated from the observed data. These are the values that we are using for our normal distribution comparisons.

The Goodness of Fit table displays three tests for, in our example, normality. Those tests are the Kolmogorov-Smirnov test, the Cramer-von Mises test, and the Anderson-Darling test.

The Quantiles for Normal Distribution table displays the information from the probability plot in a table format. For several quantiles, we see how the observed data compare to the estimated quantiles from a normal distribution with our specified mean and standard deviation.

The probability plot graphic visually compares the quantiles of the observed data to the quantiles of, in our example, a normal distribution. The NORMAL option in the PROBPLOT statement draws the diagonal reference line. Notice the text box in the northwest corner containing the skewness and kurtosis statistics.

It should be noted that, in our example, we compared our data to the normal distribution. PROC UNIVARIATE can also compare the data to other distributions. These include but are not limited to the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, and WEIBULL distributions.

Now that I have shown you a brief tour of PROC UNIVARIATE in SAS, it’s your turn to look at the distributions of your variables. Thank you for your interest in SAS.
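
As a follow-up to the note on other distributions, here is a hedged sketch of how the same analysis might be pointed at a lognormal distribution instead of the normal. It is an illustration only and was not part of the video. It assumes the same sasuser.bodyfat2 data set and relies on the procedure's default threshold parameter for the lognormal fit.

```sas
/* Illustrative variation, not from the video: compare Height to a lognormal distribution. */
proc univariate data=sasuser.bodyfat2;
   var Height;
   histogram Height / lognormal;            /* fitted lognormal curve; parameters estimated from the data */
   probplot Height / lognormal(sigma=est);  /* lognormal probability plot; shape parameter estimated */
run;
```

Swapping in another supported distribution keyword, such as GAMMA or WEIBULL, follows the same pattern. Each distribution has its own parameters, so check the options before relying on the defaults.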