>> Sue: Hello! This morning I’d like to show you a quick example of analyzing insurance
data using Revolution’s big data GLM capabilities.
The full-featured rxGlm function in the RevoScaleR package is fast and scalable. The same code
works on small and big data, and on a laptop, a server, a cluster, or in the cloud. Estimation
time scales about linearly with the number of rows in your data set, and does so without
increasing memory requirements.
Generalized linear models provide a framework for a wide range of analyses. For example,
count data, such as the number of vehicles an auto policyholder owns, or data containing
only positive values, such as the value of auto insurance claims, are typically modeled
using a GLM. Here we’ll use a third example: a Tweedie model. The Tweedie model is
appropriate for data with lots of exact zeros but also positive values, for example, data
on insured vehicles. The claims amount for many vehicles is zero, but there is also
a range of positive claims values.
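As a quick reference for those three cases, here is a rough sketch of the corresponding family choices in R. The rxTweedie() variance power of 1.5 is just an illustrative value between 1 and 2, not the one used in this demo.

```r
poisson(link = "log")   # count data, e.g. number of vehicles owned
Gamma(link = "log")     # strictly positive data, e.g. claim values
# Tweedie with 1 < var.power < 2: a compound Poisson-gamma
# distribution with a point mass at exactly zero
RevoScaleR::rxTweedie(var.power = 1.5)
```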
I had the great pleasure of working with Jim Guszcza of Deloitte Consulting and the University
of Wisconsin Business School to develop this example. Jim had downloaded the Allstate data
on insured vehicles used in a Kaggle competition.
Let me switch my screen to the Revolution R Enterprise R Productivity Environment to
show you what we did. The data was imported into two files: one with data from 2005 and
2006, which we use for training, and one with data for 2007, which we can use for testing
our model. First, let’s look at summary statistics for the claims for the first two
years. It went by pretty fast, but 27 blocks of data with a total of about 8 and a half million
observations were processed, showing us that less than 1 percent of the insured vehicles
had claims, and that the largest claim was over 11 thousand dollars.
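A minimal sketch of those two steps in RevoScaleR code follows; the file names and the Claim_Amount variable name are assumptions, since the transcript doesn’t show them.

```r
library(RevoScaleR)

# Import the 2005-2006 training data into XDF format (names hypothetical)
trainXdf <- rxImport(inData = "train0506.csv",
                     outFile = "claimsTrain.xdf", overwrite = TRUE)

# Summary statistics for the claim amounts
rxSummary(~ Claim_Amount, data = trainXdf)
```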
Since most of the variables in the data set provided don’t have meaningful names, our
quick modeling attempt was somewhat arbitrary. Here you see the formula. Now I’ll run the
estimation, and time it for you. It’s a fairly big model, estimating 58 coefficients.
And GLM is an iterative procedure. Looking at the results, we can see it made 16 passes
through the data. All done on my laptop in a little over 2 minutes – not bad.
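Because the Kaggle variables are anonymized, the actual formula can’t be reconstructed here, but the shape of the call would be roughly this sketch; the formula terms and the variance power are placeholders.

```r
# Fit a Tweedie GLM on the training data and time it
# (formula terms stand in for the anonymized Kaggle variables)
system.time(
  claimsGlm <- rxGlm(Claim_Amount ~ Cat1 + Cat2 + Var1 + Var2,
                     data = trainXdf,
                     family = rxTweedie(var.power = 1.5))
)
summary(claimsGlm)  # coefficients, plus the number of iterations
```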
Next, we’ll use the estimated model to compute predicted claims on our 2007 test data. And
last, we’ll visualize our results using some R graphics code for a relativity plot
that Jim provided.
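The prediction step would look something like the sketch below; again, the file and variable names are assumptions. Jim’s relativity-plot code isn’t shown in the video, so it isn’t reproduced here.

```r
# Score the 2007 test data with the fitted model (names hypothetical);
# predictions are appended to the test file as a new column
testXdf <- RxXdfData("claimsTest.xdf")
rxPredict(claimsGlm, data = testXdf, outData = testXdf,
          predVarNames = "predictedClaim")
```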
The next step, of course, would be to refine our model. If I planned to experiment with
many models, I’d put the data on our small cluster of commodity hardware. Then I could
run the same code and get the results back about 4 times faster.
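Switching hardware is a one-line change of compute context. The sketch below uses RxHadoopMR as one example of a RevoScaleR compute context; the video doesn’t specify the cluster setup, so the connection details are placeholders.

```r
# Point RevoScaleR at the cluster (connection details hypothetical)
rxSetComputeContext(RxHadoopMR(sshUsername = "analyst",
                               sshHostname = "cluster-head"))

# The identical rxGlm call now runs distributed on the cluster
claimsGlm <- rxGlm(Claim_Amount ~ Cat1 + Var1, data = trainXdf,
                   family = rxTweedie(var.power = 1.5))

rxSetComputeContext("local")  # switch back to the laptop
```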
If you’d like to know more about estimating scalable generalized linear models on your
data, please get in touch. Thanks for listening.