Welcome back to the Data Mining with STATISTICA Series. This is the fifth
session, Initial Graphical Review.
Previously, the data set that will be used in this session and many of the upcoming
sessions was introduced.
If you have not yet watched this short session, Session 3,
you may find benefit in it.
Also, note that we have another graphing topic coming up in a few weeks
where we will explore the
cleaned data for relationships.
Today we will review the Credit Scoring data graphically. This initial
exploration should highlight any problems with the data,
such as outliers or data entry errors.
We are looking for problems in the data that will need to be addressed in the
data cleaning phase.
It is important to involve someone who is knowledgeable about the data in this chore,
someone who can recognize data errors and the like.
Let's move to STATISTICA and start the graphical exploration.
The interactive drill-down tool is perfect for exploring all types of data.
On the "Review" tab, I'll select the relevant variables
for analysis
and create histograms of each,
which shows the frequency
of customers in each category, as well as where we have missing data.
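For anyone following along outside of STATISTICA, here is a minimal sketch in Python of the counts behind these histograms. The records and the field names (`credit_rating`, `gender`) are made up for illustration and are not the actual Credit Scoring data. Missing entries get their own bar so they stay visible.

```python
from collections import Counter

# Hypothetical records for illustration only; None marks a missing entry.
customers = [
    {"credit_rating": "good", "gender": "male"},
    {"credit_rating": "good", "gender": "female"},
    {"credit_rating": "bad", "gender": "male"},
    {"credit_rating": "good", "gender": None},
]

def frequency_table(records, variable):
    """Count the records in each category, keeping missing values as their own category."""
    counts = Counter()
    for record in records:
        value = record.get(variable)
        counts["(missing)" if value is None else value] += 1
    return counts

# One frequency table per variable, the same information a histogram displays.
for variable in ("credit_rating", "gender"):
    print(variable, dict(frequency_table(customers, variable)))
```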
The histogram of "credit rating" shows that more than twice as many customers
are classified as good credit than bad.
Knowing this will be important as we analyze the data.
Likely, we will need to use a stratified random sample
of the data or use misclassification costs to safeguard against models always
predicting "good," since the majority of customers are in this category. These
techniques will be discussed more in future sessions.
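As a rough illustration of the stratified sampling idea, the sketch below draws the same number of customers from each credit rating class. Balancing the classes equally is only one possible choice and is an assumption here, as are the helper name `stratified_sample` and the made-up records.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_class, seed=0):
    """Draw up to per_class records at random from each category of records[key]."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for members in groups.values():
        # Never ask for more records than the class actually contains.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample

# Hypothetical data with more than twice as many "good" customers as "bad".
records = [{"id": i, "credit_rating": "good"} for i in range(7)]
records += [{"id": i, "credit_rating": "bad"} for i in range(7, 10)]

balanced = stratified_sample(records, "credit_rating", per_class=3)
print(sorted(r["credit_rating"] for r in balanced))
```

With equal class sizes in the sample, a model can no longer score well just by always predicting "good."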
Another interesting variable is "payment of previous credits."
This variable has a few missing entries
which will have to be dealt with during the data cleaning phase.
The majority of our customers either have no previous credits or paid back
their previous loans.
The plot of "gender" shows more than twice as many male customers as female.
"Number of previous credits at this bank" is interesting.
The last 2 categories look as though they should be combined,
as there are not many customers in either category.
So, we may want to make the categories "5 to 6" and "7 or more"
one category, and we would call it "5 or more."
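The recoding just described could be sketched in Python like this. The list of values is made up, and the lookup table and function name are only illustrative.

```python
# Map the two sparse categories onto one combined label; other labels pass through unchanged.
MERGED_LABELS = {"5 to 6": "5 or more", "7 or more": "5 or more"}

def merge_sparse_categories(value):
    """Return the combined label for sparse categories, or the original label otherwise."""
    return MERGED_LABELS.get(value, value)

# Hypothetical values of "number of previous credits at this bank".
previous_credits = ["1 or less", "2 to 4", "5 to 6", "7 or more", "2 to 4"]
recoded = [merge_sparse_categories(value) for value in previous_credits]
print(recoded)
```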
The plot of "age" shows strange behavior.
I would expect customers to be at least eighteen years old to apply for credit,
so this variable needs further review as well.
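A simple way to list the cases that need review is to flag every age below the expected minimum. This is only a sketch: the ages are made up, and the cutoff of eighteen comes from the expectation stated above rather than from any documented rule for this data set.

```python
MINIMUM_AGE = 18  # assumed lower bound for a credit applicant

# Hypothetical ages; None marks a missing entry and is left for the missing-data step.
ages = [34, 17, 52, None, 3, 18]

# Keep the row position with each suspect value so the case can be looked up later.
suspect = [
    (row, age)
    for row, age in enumerate(ages)
    if age is not None and age < MINIMUM_AGE
]
print(suspect)
```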
We can look at relationships between variables
with the drill-down tool.
Let me select
the variables.
For instance, let's drill down on the "payment of previous credits." So, I will select
the drill button that says "Down"
and select "No previous credits."
So, this will let us look at just the customers with
no previous credits.
And what would we expect the plot of "number of previous credits at this bank"
to look like?
The plot sheet seems to show some contradictions.
We're looking at just customers with no previous credits,
but the number of previous credits at this bank for some of the customers is
either "2 to 4," "5 to 6," or "7 or more,"
and I would expect all of the customers to fall in this "1 or less" category.
And, so, these other entries will need to be explored more.
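The same drill-down can be approximated in Python by filtering to one category and then tabulating a second variable. The records and field names below are hypothetical. The check assumes, as stated above, that customers with no previous credits should all fall in the "1 or less" category.

```python
from collections import Counter

# Hypothetical records for illustration only.
customers = [
    {"payment_of_previous_credits": "No previous credits", "previous_credits_at_bank": "1 or less"},
    {"payment_of_previous_credits": "No previous credits", "previous_credits_at_bank": "2 to 4"},
    {"payment_of_previous_credits": "Paid back", "previous_credits_at_bank": "2 to 4"},
    {"payment_of_previous_credits": "No previous credits", "previous_credits_at_bank": "1 or less"},
]

# Drill down: keep only the customers with no previous credits.
subset = [c for c in customers if c["payment_of_previous_credits"] == "No previous credits"]

# Tabulate the second variable within that subset.
counts = Counter(c["previous_credits_at_bank"] for c in subset)
print(dict(counts))

# Any case outside "1 or less" contradicts the drill-down category and needs review.
contradictions = [c for c in subset if c["previous_credits_at_bank"] != "1 or less"]
print(len(contradictions), "contradictory case(s)")
```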
A scatter plot matrix of the 3 continuous variables
shows us outliers and data entry errors.
But let's focus on duration of credit and the amount of credit.
So, I'm going to make a scatter plot of these 2 variables
so we can focus on them
a little bit better.
And the brushing tool will allow us to visually
explore this data.
And say that we know that no loan from this department was made for more than
$30,000.
So, we are going to use the box
selection method
to select the
points that are obviously more than $30,000 and turn those points off.
And the graph updates automatically. Say we also know that no loan was made for more
than 72 months.
So, these
points are also either data entry errors, or they (for whatever reason) don't belong
in our data set.
When I click "Apply," the plot is updated
and the erroneous points are removed.
But now another problem becomes apparent.
Duration of credit should be at least 3 months.
Let's use the extended method
to
highlight the points
that are less than 3 months.
I'll click "Highlight."
And so, these points
have an issue, too. So, I will remove those,
and then on the Y axis, let's turn off points where loans are less than $100...
and the plot is updated.
Looking back at the data sets, the points that we found to be suspect are still
contained in our data.
The slash icons show which cases are turned off, and
it is also important to know that these points that are turned off
will not be used in future analyses until that case state is changed back
to "On."
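The four limits used while brushing (no more than $30,000, no more than 72 months, at least 3 months, at least $100) can be written as one plausibility rule. The sketch below is not how STATISTICA stores case states; it only mimics the idea with a hypothetical `included` flag, so that suspect cases are switched off but never deleted. The loans and field names are made up.

```python
MIN_AMOUNT, MAX_AMOUNT = 100, 30_000   # dollars, limits taken from the walkthrough
MIN_DURATION, MAX_DURATION = 3, 72     # months, limits taken from the walkthrough

def is_plausible(loan):
    """Return True when both amount and duration fall inside the known limits."""
    return (
        MIN_AMOUNT <= loan["amount"] <= MAX_AMOUNT
        and MIN_DURATION <= loan["duration_months"] <= MAX_DURATION
    )

# Hypothetical loans for illustration only.
loans = [
    {"amount": 5_000, "duration_months": 24},
    {"amount": 45_000, "duration_months": 36},  # amount too large
    {"amount": 8_000, "duration_months": 96},   # duration too long
    {"amount": 2_500, "duration_months": 1},    # duration too short
    {"amount": 50, "duration_months": 12},      # amount too small
]

# Switch cases on or off instead of deleting them, so the choice can be reversed.
for loan in loans:
    loan["included"] = is_plausible(loan)

active = [loan for loan in loans if loan["included"]]
print(len(loans), "cases in the data,", len(active), "used in analysis")
```

Keeping a flag rather than dropping rows means that someone who knows the data can review each excluded case before the data cleaning phase.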
Now we have reviewed the credit scoring data and found some issues that need to be
addressed in data cleaning.
We saw how graph brushing tools can be used
to exclude erroneous points from the analysis.
In the next few sessions, we'll use what we've learned in this graphical review:
namely, that some of the variables contain outliers,
missing data, and wrong entries,
and we'll apply data cleaning techniques to address them.
For more information about StatSoft and STATISTICA Data Miner,
please visit StatSoft.com or call
918-749-1119.
Outside of the United States and Canada, StatSoft.com has links to our
international offices
that can be of assistance.
And be sure to sign up for reminders when new episodes are available at
StatSoft.com/dmsubscribe.
And thank you for watching.