Data Mining, Initial Graphical Review - Session 5

Welcome back to the Data Mining with STATISTICA Series. This is the fifth session, Initial Graphical Review. The data set that will be used in this session and many of the upcoming sessions was introduced previously, in Session 3. If you have not yet watched that short session, you may find benefit in it. Also, note that we have another graphing topic coming up in a few weeks where we will explore the cleaned data for relationships.
Today we will review the Credit Scoring data graphically. This initial exploration should highlight any problems with the data, such as outliers or data entry errors, that will need to be addressed in the data cleaning phase. It is important to involve someone who is knowledgeable about the data in this chore, someone who can recognize data errors and the like.
Let's move to STATISTICA and start the graphical exploration. The interactive drill-down tool is perfect for exploring all types of data. On the "Review" tab, I'll select the relevant variables for analysis and create a histogram of each, which shows the frequency of customers in each category, as well as where we have missing data.
The histogram of "credit rating" shows that more than twice as many customers are classified as good credit than as bad. Knowing this will be important as we analyze the data. Likely, we will need to use a stratified random sample of the data or use misclassification costs to safeguard against models always predicting "good," since the majority of customers are in this category. These techniques will be discussed more in future sessions.
Another interesting variable is "payment of previous credits." This variable has a few missing entries, which will have to be dealt with during the data cleaning phase. The majority of our customers either have no previous credits or paid back their previous loans. The plot of "gender" shows more than twice as many male customers as female.
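
The frequency review described above (counts per category, plus a count of missing entries) can also be sketched outside STATISTICA. The following is a minimal Python sketch, not STATISTICA's own method; the function names and the toy `credit_rating` values are assumptions made purely for illustration.

```python
from collections import Counter


def frequency_table(values, missing=(None, "")):
    """Count cases per category and report missing entries separately."""
    counts = Counter(v for v in values if v not in missing)
    n_missing = sum(1 for v in values if v in missing)
    return dict(counts), n_missing


def majority_share(counts):
    """Fraction of non-missing cases that fall in the most common category."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


# Hypothetical toy column; the real Credit Scoring data is not reproduced here.
credit_rating = ["good", "good", "bad", "good", None, "good", "bad", "good"]

counts, n_missing = frequency_table(credit_rating)
share = majority_share(counts)
print(counts, "missing:", n_missing, "majority share:", round(share, 2))
```

The majority share is the accuracy that a model would reach by always predicting the most common class, which is why an imbalanced "credit rating" calls for stratified sampling or misclassification costs.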
"Number of previous credits at this bank" is interesting. The last 2 categories look as though they should be combined, as there are not many customers in either category. So, we may want to make the categories "5 to 6" and "7 or more" one category, and we would call it "5 or more." The plot of "age" shows strange behavior. I would expect customers to be at least eighteen years old to apply for credit, so this variable needs further review as well. We can look at relationships between variables with the drill-down tool. Let me select the variables. For instance, let's drill down on the "payment of previous credits." So, I will select the drill button that says "Down" and select "No previous credits." So, this will let us look at just the customers with no previous credits. And what we would expect the plot of "number of previous credits at this bank" to look like? The plot sheet seems to show some contradictions. We're looking at just customers with no previous credits, but the number of previous credits at this bank for some of the customers are either "2 to 4," "5 to 6," or "7 or more," and I would expect all of the customers to fall on this "1 or less" category. And, so, these other entries will need to be explored more. A scatter plot matrix of the 3 continuous variables, shows us outliers and data entry errors. But let's focus on duration of credit and the amount of credit. So, I'm going to make a scatter plot of these 2 variables so we can focus on them a little bit better. And the brushing tool will allow us to visually explore this data. And say that we know that no loan from this department was made for more than $30,000. So, we are going to use the box selection method to select the points that are obviously more than $30,000 and turn those points off. And the graph updates automatically. Say we also know that no loan was made for more than 72 months. 
So, the points beyond 72 months are also either data entry errors, or they (for whatever reason) don't belong in our data set. When I click "Apply," the plot is updated and the erroneous points are removed. But now another problem becomes apparent. Duration of credit should be at least 3 months. Let's use the extended method to highlight the points that are less than 3 months. I'll click "Highlight." These points have an issue, too, so I will remove those. Then, on the Y axis, let's turn off points where loans are less than $100, and the plot is updated.
Looking back at the data set, the points that we found to be suspect are still contained in our data. The slash icons show which cases are turned off. It is also important to know that the points that are turned off will not be used in future analyses until their case state is changed back to "On."
Now we have reviewed the credit scoring data and found some issues that need to be addressed in data cleaning. We saw how graph brushing tools can be used to exclude erroneous points from the analysis. In the next few sessions, we'll use what we've learned in this graphical review (namely, that some of the variables contain outliers, missing data, and wrong entries) and apply data cleaning techniques.
For more information about StatSoft and STATISTICA Data Miner, please visit StatSoft.com or call 918-749-1119. Outside of the United States and Canada, StatSoft.com has links to our international offices that can be of assistance. Be sure to sign up for reminders when new episodes are available at StatSoft.com/dmsubscribe. Thank you for watching.