all right so for
chapter four we will cover data mining
for Business Intelligence
we're going to do chapter four and then chapter five later on
but not for a couple of weeks
the last chapter we'll do is chapter 3 so we're going to skip ahead
ok
the first slide goes over
data mining concepts and definitions
Why Data Mining?
Thomas Davenport said
strategic
analytical decision making is a strategic weapon for companies
and it actually has become something ...companies
want you to know how to analyze data
and they want you to be able to make decisions using
information derived from the analysis
There is more intense competition on the global scale
there's more and more information available
you know
a lot more time
goes into analyzing data than it used to
now
that's in terms of managers
it used to be
data was analyzed manually so it took a long time
you know a lot of people
now it can be done
management can do it themselves at their desks
they can pull a lot of data from this data warehouse
and they can analyze it using software
so storage capabilities are increased cost has decreased
so it's become more popular
and it is almost essential now to compete in the industry
you have to know how your competitors are doing and you have to know how you're
doing
so what is data mining?
data mining
basically mining for knowledge in large amounts of data
you're identifying
valid, novel, potentially useful, and ultimately understandable patterns in data
so basically the data...
the process
is iterative, meaning there's many repeating steps
but basically your data is non-trivial
it's valid
it needs to be novel
meaning it's not something you already could have guessed
and then potentially useful and understandable
nontrivial and understandable kind of go together
if it's trivial it's really not going to be understandable
or even useful
it is useful and nontrivial that go together (correction). If it is not useful
then it is likely trivial
and those definitions are all in the notes
let me go back
yea they are all down here in the notes if you want to read through them
okay
so this chart shows you that data mining is at the center of many
different disciplines so it's not a new concept
it is basically a new name for the use of all these disciplines together
so using artificial intelligence using statistics
using modeling, databases, pattern recognition
and MIS systems. All of these you're using together
to mine for information
and so as the book said here
the miner is often an end user empowered by data drills and
other powerful query tools
who may obtain answers quickly with little or no programming skills
it used to be
in order to query a database
you had to know SQL
now it's more
user interfaces and you type in what you want to know
so characteristics of data mining
it's often a consolidated data warehouse
but not always
it's either client-server or web-based, so you either have a
client's PC accessing the server
or you have a web-based system
again the miner is often the end user
and ease of use is essential
so this portion (types of data) you may have seen before if you've taken
business analysis
the data in data mining (heading)
data is either categorical
or numerical
categorical data is either nominal or ordinal
numerical data
is either interval or ratio
and this is important to know when you're going to analyze the data. Many
software programs will ask you
and the data is analyzed differently
based on the type of data it is
so that's where a lot of this will...
it will make more sense when you're doing it (hands-on)
it's good to learn it now to have the basis, but once we go into SAP
things will make a lot more sense then
it'll be kind of like "I get it now!"
a moment where you do it and then realize
Oh, that's what that meant
and then data consists of numbers, words, and images
data nowadays can be anything from pictures to just numbers
ok and I'll go into these (data types)
categorical data
it basically
can be categorised
you can either put it into groups or categories. Race, Sex, Age group
so it's not
see (next slide) numerical data can be measured
categorical data can be categorized, that's the difference between those two
now within categorical data you have nominal and ordinal data
nominal data
cannot be ranked
Ordinal data can be ranked
such as you have freshman, sophomore, junior, senior
that's a ranking system
low/medium/high
where single/married/divorced that's not a rank. Single is not better than married
or is not higher level than married
you might have 1st/2nd/3rd, that's a ranking.
these can make the most sense but when you're going to actually
look at data sometimes these can be the hardest to actually differentiate between
nominal or ordinal
you have to pay close attention to the differences
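To make the nominal versus ordinal difference concrete, here is a minimal Python sketch. The list CLASS_RANK, the set MARITAL_STATUS, and the helper rank_of are hypothetical names made up for illustration; they are not from the book or the slides.

```python
# Ordinal: the order of the values carries meaning (assumed ranking for illustration).
CLASS_RANK = ["freshman", "sophomore", "junior", "senior"]

# Nominal: just labels, with no order between them.
MARITAL_STATUS = {"single", "married", "divorced"}


def rank_of(level):
    """Return the position of an ordinal value in its ranking (0 = lowest)."""
    return CLASS_RANK.index(level)


students = ["junior", "freshman", "senior", "sophomore"]

# Ordinal data can be sorted by rank.
sorted_students = sorted(students, key=rank_of)
print(sorted_students)  # ['freshman', 'sophomore', 'junior', 'senior']

# "Higher than" is a meaningful question for ordinal data...
print(rank_of("senior") > rank_of("sophomore"))  # True

# ...but for nominal data the only meaningful question is "same or different?"
print("single" == "married")  # False
```

If you cannot write down a sensible ranking list for a variable, that is a good hint that it is nominal rather than ordinal.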
ok
so again numerical data can be measured
age, number of children
age you could take
current date - year (of birth)
as your age
that's a
measurement
the interval scale
interval data can be measured on the interval scale
all you have to remember is interval data has an arbitrary 0
that means 0 does not mean
an absence of something
zero
just means zero
take for instance, zero degrees Fahrenheit
that does not mean there is no temperature
there is a temperature
it's just that
temperature is 0
with ratio
zero means you don't have something there
say you have zero dollars in the bank, you don't have any dollars
it doesn't mean there's a $0 sitting there, there is nothing
but if it is 0 degrees outside, it's 0 degrees outside
so that's the easiest way
now with these
I know in Statistics we were supposed to memorize
these, most of us
memorized what categorical and numerical were and just remembered the definition of one of these (types)
so if you know one you know the other
If you know ordinal can be ranked, you know nominal can't be
if you know ratio
has a 0 where something is missing, you know interval has the 0 that does not mean the
absence of something
so that's a good technique to remember those two
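One way to see why the arbitrary zero matters is to try taking ratios. The short Python sketch below uses made-up bank balances and temperatures (assumptions for illustration only). A ratio of dollars means something, while a ratio of Fahrenheit temperatures changes as soon as you switch to Celsius, which suggests it was never meaningful to begin with.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


# Ratio scale: zero dollars means no money, so "twice as much" is meaningful.
balance_a, balance_b = 50.0, 100.0
dollar_ratio = balance_b / balance_a  # 2.0

# Interval scale: 0 degrees is an arbitrary point on the scale.
temp_a_f, temp_b_f = 40.0, 80.0
ratio_f = temp_b_f / temp_a_f  # 2.0 in Fahrenheit...

# ...but the same two temperatures in Celsius give a ratio of about 6,
# so "twice as hot" is not a meaningful statement.
ratio_c = fahrenheit_to_celsius(temp_b_f) / fahrenheit_to_celsius(temp_a_f)

# Differences are still fine on an interval scale: it is 40 degrees F warmer.
diff_f = temp_b_f - temp_a_f

print(dollar_ratio, ratio_f, round(ratio_c, 2), diff_f)
```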
so what does data mining do
it looks for patterns in data
you might look for a certain number of people
maybe you're going through
the data
some data I analyzed was for the Career & Testing Center at Lamar
and I had to go through the data and I would
mark everything
you know which advisor they saw
what they came in for, what college they came from, what their major was
then I would organize everything
and what we came up with were
charts and graphs that told us
patterns
so we were able to see
we have a higher percentage of engineering students coming in
we have a higher percentage of business students coming in
well those are the ones we market to the most
so there is a pattern between them (marketing and use of services)
there is also a pattern when
one adviser has
more visitors than others because maybe that advisor is more popular
software systems will help to do it (analyze) on a much bigger scale
and much more quickly
in a lot less
time. It would take me hours to analyze that data
this (software) is going to do a lot of it for you but you have to be able to understand it
and to be able to read it
and you have to know what you're looking for
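The kind of counting described above can be sketched in a few lines of Python. The visit records, advisor names, and the helper share_by are all hypothetical, made up to illustrate the idea; they are not the actual Career & Testing Center data.

```python
from collections import Counter

# Hypothetical visit log: one (advisor, college) pair per visit.
visits = [
    ("Advisor A", "Engineering"),
    ("Advisor A", "Business"),
    ("Advisor B", "Engineering"),
    ("Advisor A", "Engineering"),
    ("Advisor C", "Arts and Sciences"),
    ("Advisor A", "Business"),
]


def share_by(records, index):
    """Return each value's share of all records for the field at `index`."""
    counts = Counter(record[index] for record in records)
    total = len(records)
    return {key: count / total for key, count in counts.items()}


college_share = share_by(visits, 1)  # which colleges the visitors come from
advisor_share = share_by(visits, 0)  # which advisors see the most visitors

for college, share in sorted(college_share.items(), key=lambda kv: -kv[1]):
    print(f"{college}: {share:.0%}")
```

This is the manual tally done by a program; data mining software does the same sort of thing across far more fields and records at once.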
now
data mining has 3 main categories
there's prediction
association and clustering
on a more technical note
learning algorithms of data mining can be classified as supervised or
unsupervised
just know that supervised means that
descriptive and class attributes are included
unsupervised means that just descriptive attributes are included
and that has to do with how the data is mined
or analyzed
but we won't get that technical because this isn't a database class
just as long as you understand what those mean and that these are the three types
of data mining
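As a rough illustration of supervised versus unsupervised data, here is a small Python sketch. The loan records and the attribute names income and debt are invented for the example. The 1-nearest-neighbor rule is just one simple supervised method picked for illustration, not something the chapter prescribes.

```python
# Supervised: each record has descriptive attributes AND a class attribute.
labeled = [
    ({"income": 30, "debt": 20}, "reject"),
    ({"income": 80, "debt": 10}, "approve"),
    ({"income": 60, "debt": 5}, "approve"),
    ({"income": 25, "debt": 30}, "reject"),
]

# Unsupervised: the same records with only the descriptive attributes.
unlabeled = [attrs for attrs, _ in labeled]


def distance(a, b):
    """Euclidean distance between two records on income and debt."""
    return ((a["income"] - b["income"]) ** 2 + (a["debt"] - b["debt"]) ** 2) ** 0.5


def predict_class(new_record, training):
    """1-nearest-neighbor: copy the class of the closest labeled record."""
    _, closest_class = min(training, key=lambda pair: distance(new_record, pair[0]))
    return closest_class


prediction = predict_class({"income": 70, "debt": 8}, labeled)
print(prediction)  # approve
```

Without the class attribute there is nothing to copy, so with the unlabeled list all you can do is look for structure in the descriptive attributes themselves, which is what clustering does.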
with prediction
that's pretty much where you're predicting something
but predicting is not guessing
prediction is where you are taking into account
your experiences, opinions
and other relevant information
to foretell
forecasting on the other hand uses data and models to forecast
so with prediction you're going to
use information and try to
not guess but kind of guesstimate
basically foretell what might happen. You may have worked in an industry for a long
time and you kind of know how
things work like
a lot of predictions occur in the fashion industry
they can use forecasting as well, but there is a lot of prediction based on knowledge of fashion
knowledge of the industry
you know
a lot of times they can predict what is going to be "in" in the next season
forecasting is going to be more data and numbers
like when you forecast weather
what the temperature is going to be tomorrow, you're using big modeling systems, data,
numbers
classification
that might be more prediction where you're saying it is going to be rainy tomorrow
that it is going to be sunny
but it could also be forecasting
with forecasting you can get more specific
into the actual temperatures
so these are types of predictions
now clustering
that's where you are grouping
things based on similar characteristics
so you might have customers in a region
maybe you own a gym and you are recruiting them based on whether they have a gym
membership already or not
you're going to market to those who have a gym membership differently than you would to those who don't
you might offer them
special deals for switching
special new sign-on bonuses
if they don't have one
you might offer new membership deals
or maybe tips on
why they should be working out, maybe
take into consideration why they don't have a gym membership
maybe how far away it is
do they have kids? You might have a daycare
all these things are going to affect how you market (to) those customers
so you will cluster
based on characteristics
you can do these manually
or there are systems that will cluster them based on the characteristics
the goal is to create groups so that members within each group have maximum similarity
so kind of like
you may have the gym
members, non-gym members, but then you may have those with
kids
those who
live far away. Things like that
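For a sense of how a system might cluster customers automatically, here is a minimal k-means sketch in plain Python. K-means is one common clustering method chosen here as an assumption; the lecture does not name a specific algorithm. The customer data (miles from the gym, number of kids) and the starting centroids are made up.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical customers: (miles from the gym, number of kids)
customers = [(1, 0), (2, 0), (1, 1), (9, 2), (10, 3), (8, 2)]
clusters, centers = kmeans(customers, centroids=[(0, 0), (10, 3)])

for center, members in zip(centers, clusters):
    print(center, members)
```

With this data the nearby customers with few kids land in one group and the far-away customers with kids land in the other, so members within each group are as similar as possible on these two characteristics.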
associations
basically you're looking for groupings
such as
when someone buys diapers they often buy beer
you might notice that grouping (in the data)
if you have kids
and you drink beer, that might make sense
you know maybe
you may get other things
look at chocolates and flowers
someone's going on a date
you can probably think of several different groupings where people buy
things together
that's where they came up with the idea to do
plane
rental car, and hotel together
you get a special deal for buying all 3
because many who travel buy all three
so with link analysis
the linkage is automatically discovered
where with sequence mining
you identify associations over time
it is often called the market basket analysis (associations)
because
a lot of times association analyses are used in
large-scale transaction reporting systems, point of sale (POS) systems
such as when you use your Kroger card at Kroger
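A market basket analysis usually comes down to two numbers for a pair of items: support (how often they show up together) and confidence (how often the second item shows up given the first). The Python sketch below computes both on a handful of made-up baskets. The baskets and the function names support and confidence are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"flowers", "chocolates"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

# How many baskets contain each item, and each pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]


print(support("diapers", "beer"))     # 0.4: two of the five baskets
print(confidence("diapers", "beer"))  # two of the three diaper baskets
```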
okay so other data mining tasks
you have time series forecasting and visualization
time series forecasting
it is data that is captured and stored over time. You can see
how it changes over time
a good example is
if you look at stock prices
they have years and years where you can see how
they've changed over time and see similarities
across different years and things that affected them
maybe try to figure out a pattern
now with stocks it is a little more
it can be a little more based on
extenuating circumstances, you know stock market crashes
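A very simple way to look for a pattern in data captured over time is a moving average, sketched below in Python. The prices are made up, and using the last average as the next value is only a naive forecast for illustration. Real forecasting models are more involved, and as noted above stock prices can be thrown off by outside events.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in the series."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical daily closing prices, oldest first.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]

# Smoothing removes some day-to-day noise so the trend is easier to see.
smoothed = moving_average(prices, window=3)

# Naive forecast: assume the next value looks like the most recent average.
naive_forecast = smoothed[-1]

print([round(value, 2) for value in smoothed])
print(round(naive_forecast, 2))
```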
visualization that's
making things make more sense by having a visual
view of it
now there are two types of data mining
you have hypothesis driven
and you have discovery driven
hypothesis testing you'll do in statistics
that's where you have a hypothesis and you are trying to validate it
or you may
invalidate it
discovery driven data mining that's where you're going to find patterns, associations,
and relationships
so maybe something you didn't previously know was there
you're going to discover it
where with hypothesis
you have a proposition and you want to prove it
so data mining applications
customer relationship management that's where you want to create one-on-one
relationships with your customers
so you want to understand their needs and wants
and these (bullets on slide) are all things that
having that information can improve
you can
maximize your return on marketing campaigns, so you can better market to
your customers
you can identify
your valued customers and treat them better, maybe offer them special benefits
if you see something someone normally buys a lot, you can send them a catalog at the time
they normally buy it
for instance my mom always buys
Tillamook Beef Jerky
We are from there (Oregon) originally
she buys it every Christmas so every Christmas they email my mom to say
hey, order more beef jerky
it's stuff like that
banking you can automate loan applications, maximize customer value
detect fraudulent transactions. It might
flag if someone is suddenly using your card in China
and then again some more ways it can improve
processes
these are two areas that it (data mining) is
highly used
especially in medicine
because you're looking for
patterns
maybe it is survivability
for cancer patients
symptom and illness similarities
organ transplant success rates, you might put in factors and find certain factors that
affect whether the transplant will be accepted or not
and then you can put in certain factors to find out why
maybe a certain medicine doesn't work on some people yet it works on others
the book has quite a few other examples
that begin on page
146 and 147, if you want you can look through those
data mining software
the most popular tools are developed by the largest statistical tool vendors
you have SPSS, you have SAS
you have IBM
and you have StatSoft
there are more but these are the largest
business intelligence vendors, SAP Business Objects
Teradata
they will have a level of data mining integrated
but they are not considered as direct competitors for data mining tools because they
focus on multidimensional modeling
so they focus on analyzing the data
as opposed to
pulling the data
so Weka is the most popular open-source tool
it has a large number of algorithms
an intuitive user interface
RapidMiner is a newer software
it's more graphically enhanced
but
the difference between commercial and free is that
free software will not run as well on large amounts of data, and it may not run
at all
commercial software can handle large amounts of data better
and this is an example of a churn analysis
in Microsoft SQL Server 2008
churn is the rate of attrition in the customer base, so churn analysis looks at
how many customers you are losing
retention is the other side of that, how well you are
keeping your customers
so churn and retention go together
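As a quick worked example of the arithmetic, churn is commonly computed as the customers lost during a period divided by the customers you had at the start of it, and retention is what is left over. The function names and the numbers below are made up for illustration. This is not how the SQL Server churn model itself works, only the basic rate it is trying to explain.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of the starting customer base lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def retention_rate(customers_at_start, customers_lost):
    """Fraction of the starting customer base that stayed."""
    return 1 - churn_rate(customers_at_start, customers_lost)


# Hypothetical quarter: 1,000 customers at the start, 50 of them left.
churn = churn_rate(1000, 50)
retention = retention_rate(1000, 50)

print(f"churn: {churn:.1%}, retention: {retention:.1%}")  # churn: 5.0%, retention: 95.0%
```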
it says SQL server
actually that should be s-e-r-v-e-r
2008 BI Development Suite
and it has models that are stored in a relational database environment so it
makes management of the
data model easier, because you're only accessing one relational
database environment
again a lot of this will make more sense when we actually go in and start doing
the processes
we won't use Microsoft SQL Server but we'll use SAP
it's good to have the background now so that when you go to do the work
it makes more sense
so that's it for Chapter 4
and you will have an assignment for this week and then an assignment for next
week
and I think you have a
video or a session to listen to as well