>> Sue: Hello! This morning I’d like to show you a quick example of analyzing insurance
data using Revolution’s big data GLM capabilities.
The full-featured rxGlm function in the RevoScaleR package is fast and scalable. The same code
works on small and big data, and on a laptop, a server, a cluster, or in the cloud. Estimation
time scales about linearly with the number of rows in your data set, and does so without
increasing memory requirements.
Generalized linear models provide a framework for a wide range of analyses. For example,
count data, such as the number of vehicles an auto policyholder owns, or data containing
only positive values, such as the value of auto insurance claims, are typically modeled
using a GLM. Here we’ll use a third example: a Tweedie model. The Tweedie model is
appropriate for data with lots of exact zeros but also positive values, for example, data
on insured vehicles. The claims amount for many vehicles is zero, but there is also
a range of positive claims values.
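As a quick reference for those three cases, here is a rough sketch of the corresponding family choices in R. The rxTweedie() variance power of 1.5 is just an illustrative value between 1 and 2, not the one used in this demo.

```r
poisson(link = "log")   # count data, e.g. number of vehicles owned
Gamma(link = "log")     # strictly positive data, e.g. claim values
# Tweedie with 1 < var.power < 2: a compound Poisson-gamma
# distribution with a point mass at exactly zero
RevoScaleR::rxTweedie(var.power = 1.5)
```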
I had the great pleasure of working with Jim Guszcza of Deloitte Consulting and the University
of Wisconsin Business School to develop this example. Jim had downloaded the Allstate data
on insured vehicles used in a Kaggle competition.
Let me switch my screen to the Revolution R Enterprise R Productivity Environment to
show you what we did. The data was imported into two files: one with data from 2005 and
2006, which we use for training, and one with data for 2007, which we can use for testing
our model. First, let’s look at summary statistics for the claims for the first two
years. It went by pretty fast, but 27 blocks of data with a total of about 8 and a half million
observations were processed, showing us that less than 1 percent of the insured vehicles
had claims, and that the largest claim was over 11 thousand dollars.
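A minimal sketch of those two steps in RevoScaleR code follows; the file names and the Claim_Amount variable name are assumptions, since the transcript doesn’t show them.

```r
library(RevoScaleR)

# Import the 2005-2006 training data into XDF format (names hypothetical)
trainXdf <- rxImport(inData = "train0506.csv",
                     outFile = "claimsTrain.xdf", overwrite = TRUE)

# Summary statistics for the claim amounts
rxSummary(~ Claim_Amount, data = trainXdf)
```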
Since most of the variables in the data set provided don’t have meaningful names, our
quick modeling attempt was somewhat arbitrary. Here you see the formula. Now I’ll run the
estimation, and time it for you. It’s a fairly big model, estimating 58 coefficients.
And GLM is an iterative procedure. Looking at the results, we can see it made 16 passes
through the data. All done on my laptop in a little over 2 minutes – not bad.
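Because the Kaggle variables are anonymized, the actual formula can’t be reconstructed here, but the shape of the call would be roughly this sketch; the formula terms and the variance power are placeholders.

```r
# Fit a Tweedie GLM on the training data and time it
# (formula terms stand in for the anonymized Kaggle variables)
system.time(
  claimsGlm <- rxGlm(Claim_Amount ~ Cat1 + Cat2 + Var1 + Var2,
                     data = trainXdf,
                     family = rxTweedie(var.power = 1.5))
)
summary(claimsGlm)  # coefficients, plus the number of iterations
```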
Next, we’ll use the estimated model to compute predicted claims on our 2007 test data. And
last, we’ll visualize our results using some R graphics code for a relativity plot
that Jim provided.
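The prediction step would look something like the sketch below; again, the file and variable names are assumptions. Jim’s relativity-plot code isn’t shown in the video, so it isn’t reproduced here.

```r
# Score the 2007 test data with the fitted model (names hypothetical);
# predictions are appended to the test file as a new column
testXdf <- RxXdfData("claimsTest.xdf")
rxPredict(claimsGlm, data = testXdf, outData = testXdf,
          predVarNames = "predictedClaim")
```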
The next step, of course, would be to refine our model. If I planned to experiment with
many models, I’d put the data on our small cluster of commodity hardware. Then I could
run the same code and get the results back about 4 times faster.
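Switching hardware is a one-line change of compute context. The sketch below uses RxHadoopMR as one example of a RevoScaleR compute context; the video doesn’t specify the cluster setup, so the connection details are placeholders.

```r
# Point RevoScaleR at the cluster (connection details hypothetical)
rxSetComputeContext(RxHadoopMR(sshUsername = "analyst",
                               sshHostname = "cluster-head"))

# The identical rxGlm call now runs distributed on the cluster
claimsGlm <- rxGlm(Claim_Amount ~ Cat1 + Var1, data = trainXdf,
                   family = rxTweedie(var.power = 1.5))

rxSetComputeContext("local")  # switch back to the laptop
```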
If you’d like to know more about estimating scalable generalized linear models on your
data, please get in touch. Thanks for listening.