MISY 4390 Chapter 4 Spring 2013 Lecture Video

All right, so for Chapter 4 we will cover data mining for business intelligence. We're going to do Chapter 4 and then Chapter 5 later on, but not for a couple of weeks. The last chapter we'll do is Chapter 3, so we're going to skip ahead.

OK, the first slide goes over data mining concepts and definitions. Why data mining? Thomas Davenport said analytical decision making is a strategic weapon for companies, and it actually has become something... companies want you to know how to analyze data, and they want you to be able to make decisions using information derived from the analysis. There is more intense competition on the global scale, and there's more and more information available. A lot more time goes into analyzing data than it used to. Now, that's in terms of managers. Data used to be analyzed manually, so it took a long time and a lot of people. Now management can do it themselves at their desks: they can pull a lot of data from the data warehouse and analyze it using software. Storage capabilities have increased and cost has decreased, so it's become more popular, and it is almost essential now. To compete in the industry you have to know how your competitors are doing and you have to know how you're doing.

So what is data mining? Data mining is basically mining for knowledge in large amounts of data. You're identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The process is iterative, meaning there are many repeating steps, and it is nontrivial. The patterns need to be valid; novel, meaning not something you already could have guessed; and then potentially useful and understandable. Nontrivial and understandable kind of go together... if it's trivial, it's really not going to be useful. Actually, it is useful and nontrivial that go together (correction).
If it is not useful then it is likely trivial. Those definitions are all in the notes. Let me go back... yeah, they are all down here in the notes if you want to read through them.

OK, so this chart shows you that data mining is at the center of many different disciplines. It's not a new concept; it is basically a new name for the use of all these disciplines together: artificial intelligence, statistics, modeling, databases, pattern recognition, and MIS. You're using all of these together to mine for information. As the book says here, the miner is often an end user, empowered by data drills and other powerful query tools, who may obtain answers quickly with little or no programming skill. It used to be that in order to query a database you had to know SQL; now it's more user interfaces, and you type in what you want to know.

So, characteristics of data mining. The data is often in a consolidated data warehouse, but not always. The environment is either client-server or web-based, so you either have a client's PC accessing the server or you have a web-based system. Again, the miner is often the end user, and ease of use is essential.

This portion (types of data) you may have seen before if you've taken business analysis. The data in data mining (heading): data is either categorical or numerical. Categorical data is either nominal or ordinal; numerical data is either interval or ratio. This is important to know when you're going to analyze the data. Many software programs will ask you, and the data is analyzed differently based on the type of data it is. That's where a lot of this will make more sense when you're doing it (hands-on). It is good to learn it now to have the basis, but once we go into SAP things will make a lot more sense. Then it'll be kind of like "I get it now!"
a moment where you do it and then realize, "Oh, that's what that meant." And then data consists of numbers, words, and images; data nowadays can be anything from pictures to just numbers.

OK, and I'll go into these (data types). Categorical data can basically be categorized: you can put it into groups or categories, such as race, sex, or age group. So it's not... see (next slide), numerical data can be measured and categorical data can be categorized. That's the difference between those two.

Now, within categorical data you have nominal and ordinal data. Nominal data cannot be ranked; ordinal data can be ranked. For example, freshman, sophomore, junior, senior is a ranking system, and so is low/medium/high, whereas single/married/divorced is not a rank. Single is not better than married and is not a higher level than married. You might have 1st/2nd/3rd; that's a ranking. These can make the most sense on the slide, but when you actually look at data, nominal versus ordinal can sometimes be the hardest to tell apart, so you have to pay close attention to the differences.

OK, so again, numerical data can be measured: age, number of children. For age, you could take the current date minus the year (of birth); that's a measurement. Interval data is measured on the interval scale. All you have to remember is that interval data has an arbitrary zero, which means zero does not mean an absence of something; zero just means zero. Take, for instance, zero degrees Fahrenheit. That does not mean there is no temperature. There is a temperature; it's just that the temperature is 0. With ratio data, zero means you don't have something. With zero dollars in the bank, you don't have any dollars. It doesn't mean there's a $0 sitting there; there is nothing. But if it is 0 degrees outside, it's 0 degrees outside. So that's the easiest way to tell them apart.

Now with these, I know in Statistics we were supposed to memorize these. Most of us memorized what categorical and numerical were and just remembered the definition of one of these
(types), so if you know one you know the other. If you know ordinal can be ranked, you know nominal can't be. If you know ratio has a zero where something is missing, you know interval has the zero that does not mean the absence of something. So that's a good technique to remember those two.

So what does data mining do? It looks for patterns in data. You might look for a certain number of people as you're going through the data. Some data I analyzed was for the Career & Testing Center at Lamar. I had to go through the data and I would mark everything: which advisor they saw, what they came in for, what college they came from, what their major was. Then I would organize everything, and what we came up with were charts and graphs that told us patterns. We were able to see that we have a higher percentage of engineering students coming in and a higher percentage of business students coming in. Well, those are the ones we market to the most, so there is a pattern between them (marketing and use of services). There is also a pattern when one advisor has more visitors than others, maybe because that advisor is more popular. Software systems will help to do it (analyze) on a much bigger scale and in a lot less time.
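The kind of hand tally described above can be sketched in a few lines of Python. This is only an illustration: the visit records, field names, and advisor labels are made up, not the actual Career & Testing Center data. It counts how often each category shows up and turns the counts into percentages, which is the sort of number those charts and graphs would be built from.

```python
from collections import Counter

# Hypothetical visit log; every field name and value here is invented for illustration.
visits = [
    {"college": "Engineering", "advisor": "Advisor A", "reason": "resume review"},
    {"college": "Business", "advisor": "Advisor A", "reason": "mock interview"},
    {"college": "Engineering", "advisor": "Advisor B", "reason": "resume review"},
    {"college": "Education", "advisor": "Advisor A", "reason": "career counseling"},
    {"college": "Business", "advisor": "Advisor C", "reason": "resume review"},
    {"college": "Engineering", "advisor": "Advisor A", "reason": "mock interview"},
]


def share_by(records, field):
    """Return each category's share of all records, as a percentage."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


college_share = share_by(visits, "college")
advisor_share = share_by(visits, "advisor")
print(college_share)
print(advisor_share)
```

College and advisor are both nominal fields, so counting and percentages are about all that makes sense for them; averaging them would not.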
It would take me hours to analyze that data. This (software) is going to do a lot of it for you, but you have to be able to understand it and read it, and you have to know what you're looking for.

Now, data mining has three main categories: prediction, association, and clustering. On a more technical note, the learning algorithms of data mining can be classified as supervised or unsupervised. Just know that supervised means that both the descriptive attributes and the class attribute are included in the training data; unsupervised means that just the descriptive attributes are included. That has to do with how the data is mined or analyzed, but we won't get that technical because this isn't a database class. Just make sure you understand what those mean and that these are the three types of data mining.

With prediction, that's pretty much where you're predicting something, but predicting is not guessing. Prediction is where you take into account your experiences, opinions, and other relevant information to foretell. Forecasting, on the other hand, uses data and models to forecast. So with prediction you're going to use information and try not to guess but kind of guesstimate, basically foretell what might happen.
You may have worked in an industry for a long time and you kind of know how things work. A lot of predictions occur in the fashion industry. They can use forecasting as well, but there is a lot of prediction based on knowledge of fashion and knowledge of the industry; a lot of times they can predict what is going to be "in" next season. Forecasting is going to be more data and numbers. When you forecast the weather, what the temperature is going to be tomorrow, you're using big models, data, numbers. Classification might be more prediction, where you're saying it is going to be rainy tomorrow or it is going to be sunny, but it could also be forecasting. With forecasting you can get more specific, into the actual temperatures. So these are types of prediction.

Now clustering: that's where you are grouping things based on similar characteristics. So you might have customers in a region. Maybe you own a gym and you are recruiting them based on whether they have a gym membership already or not. You're going to market to those who have a gym membership differently than you would to those who don't. If they have one, you might offer them special deals for switching or special sign-on bonuses. If they don't have one, you might offer new-membership deals, or maybe tips on why they should be working out. Maybe take into consideration why they don't have a gym membership: maybe how far away it is, or do they have kids? You might have a daycare. All these things are going to affect how you market to those customers, so you will cluster based on characteristics. You can do this manually, or there are systems that will cluster them based on the characteristics. The goal is to create groups so that members within each group have maximum similarity. So you may have the gym members and the non-gym members, but then you may also have those with kids, those who live far away.
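To give a feel for how a system might cluster customers automatically, here is a minimal sketch using k-means, one common clustering method. The lecture does not say which algorithm any particular tool uses, so k-means is an assumption here, and the prospect data (miles from the gym, number of kids) and starting centers are made up.

```python
def kmeans(points, centroids, rounds=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared distance from this point to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical prospects: (miles from the gym, number of kids).
prospects = [(1, 0), (2, 0), (1, 1), (9, 2), (10, 3), (8, 2)]

# Two starting centers picked by hand for the sketch.
clusters, centers = kmeans(prospects, centroids=[(0, 0), (10, 2)])
print(clusters)
print(centers)
```

With this toy data the nearby prospects with few kids land in one group and the far-away families in the other, so each group could get its own marketing message. Real tools usually pick the starting centers and scale the attributes for you.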
Associations: basically you're looking for groupings, such as when someone buys diapers they often buy beer. You might notice that grouping (in the data). If you have kids and you drink beer, that might make sense. You may find other things: look at chocolates and flowers, when someone's going on a date. You can probably think of several different groupings where people buy things together. That's where they came up with the idea to sell plane tickets, rental cars, and hotels together; you get a special deal for buying all three, because many who travel buy all three. With link analysis, the linkage is automatically discovered, whereas with sequence mining you identify associations over time. Association analysis is often called market basket analysis, because a lot of times it is used in large-scale transaction reporting systems, point-of-sale (POS) systems, such as when you use your Kroger card at Kroger.

OK, so other data mining tasks: you have time series forecasting and visualization. Time series forecasting uses data that is captured and stored over time.
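Going back to the market basket idea for a moment, the diapers-and-beer kind of grouping can be found by counting which items show up in the same basket. The sketch below is an illustration only: the baskets are invented, and the two measures it computes, support and confidence, are the standard names used in association analysis rather than terms from this lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"chocolates", "flowers"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

# How many baskets contain each item, and each pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(a, b):
    """Share of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(if_item, then_item):
    """Share of baskets containing if_item that also contain then_item."""
    return pair_counts[tuple(sorted((if_item, then_item)))] / item_counts[if_item]


print(support("diapers", "beer"))
print(confidence("diapers", "beer"))
```

A pair with high support and high confidence is a candidate for bundling or cross-promotion, like the plane, rental car, and hotel package.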
You can see how it changes over time. A good example is stock prices: they have years and years of data where you can see how prices have changed over time, see similarities between different years and the things that affected them, and maybe try to figure out a pattern. Now with stocks it can be a little more based on extenuating circumstances, you know, stock market crashes. Visualization: that's making things make more sense by having a visual view of them.

Now there are two types of data mining: hypothesis-driven and discovery-driven. Hypothesis testing you'll do in statistics. That's where you have a hypothesis and you are trying to validate it, or you may invalidate it. Discovery-driven data mining is where you're going to find patterns, associations, and relationships, so maybe something you didn't previously know was there; you're going to discover it. Whereas with hypothesis-driven you have a proposition and you want to prove it.

So, data mining applications. Customer relationship management: that's where you want to create one-on-one relationships with your customers, so you want to understand their needs and wants. These (bullets on slide) are all things that having that information can improve. You can maximize the return on your marketing campaigns, so you can better market to your customers. You can identify your valued customers and treat them better, maybe offer them special benefits. If you see something someone normally buys a lot, you can send them a catalog at the time they normally buy it. For instance, my mom always buys Tillamook beef jerky. We are from there (Oregon) originally. She buys it every Christmas, so every Christmas they email my mom to say, "Hey, order more beef jerky." It's stuff like that. Banking: you can automate loan applications, maximize customer value, and detect fraudulent transactions.
It might flag if someone is suddenly using your card in China. And then again, some more ways it can improve processes. These are two areas where it (data mining) is highly used, especially medicine, because you're looking for patterns. Maybe it is survivability for cancer patients, or symptom and illness similarities. With organ transplant success rates, you might put in factors and find certain factors that affect whether the transplant will be accepted or not. And you can put in certain factors to find out why a certain medicine doesn't work on some people yet works on others. The book has quite a few other examples on pages 146 and 147 if you want to look through those.

Data mining software: the most popular tools are developed by the largest statistical tool vendors. You have SPSS, SAS, IBM, and StatSoft. There are more, but these are the largest. Business intelligence vendors such as SAP Business Objects and Teradata will have a level of data mining integrated, but they are not considered direct competitors for data mining tools because they focus on multidimensional modeling, so they focus on analyzing the data as opposed to pulling the data. Weka is the most popular open-source tool; it has a large number of algorithms and an intuitive user interface. RapidMiner is newer software; it's more graphically enhanced. But the difference between commercial and free is that free software will not run as well on large amounts of data, and it may not run at all. Commercial software can handle large amounts of data better.

And this is an example of a churn analysis in Microsoft SQL Server 2008. Churn is the rate of attrition in the customer base, so how fast you are losing customers. The flip side of that is retention, which is how well you are keeping them. It says SQL Server (actually, that should be s-e-r-v-e-r) 2008 BI Development Suite, and it has models
that are stored in a relational database environment, so it makes management of the data models easier, because you're only accessing one relational database environment. Again, a lot of this will make more sense when we actually go in and start doing the processes. We won't use Microsoft SQL Server, but we'll use SAP. It's good to have the background now so that when you go to do the work it makes more sense.

So that's it for Chapter 4. You will have an assignment for this week and then an assignment for next week, and I think you have a video or a session to listen to as well.