QAC 201 Data Decisions 2 - Freq dist

Up until this point, all of the decisions that you’ve made have been based on the codebooks. You’ve looked at the codebooks to decide what kind of dataset you were going to use; you looked at them to identify what kind of variables you want to look at, and you can make decisions about your rows and columns, and which ones you want to keep, based on your codebook. From this point forward, we’re going to start making decisions based on the data, on what we see in the data, and we’re going to focus a lot on frequency distributions. I mean, I don’t care how sophisticated you get in the research process or how many years you’ve been doing it, you always start with a frequency distribution. What that is, is a tabulation of the values that one or more variables take in a sample.
Here is a frequency distribution for gender. This is Stata output, and I’ll show you output from each of the three programs so that you get, again, the feeling that they’re all doing the same thing. I’ve actually subset my sample, the NESARC, to people who have smoked in the past year, and I’ve subset to 18- to 25-year-olds. Anyway, if I look at the sex or gender variable in this sample, one is male and two is female. My frequency distribution here shows me my response codes, one and two. It shows me the frequency of each code in the sample. I have a total of 1,706 people in the sample who smoked within the last year and are between 18 and 25. It shows me the percent of each gender, and it also shows me a cumulative percent, which can be very helpful.
If you go to the SPSS output, it looks pretty much the same. It has an extra column here called “valid percent.” When you looked at the medical record data, you got acquainted with missing data. Sometimes you don’t know; sometimes there is no information, and so you have missing data. SPSS actually has an extra column that’s based not on the whole sample, but only on valid reports.
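To make those columns concrete, here is a minimal sketch in plain Python of what a frequency distribution computes: frequency, percent of the whole sample, valid percent (excluding missing), and cumulative percent. This is not how Stata, SPSS, or SAS do it internally. The `freq_table` function and the tiny `sex` list are made up for illustration, and I am assuming that missing responses are stored as `None` and that the cumulative column runs over valid responses only.

```python
from collections import Counter

def freq_table(values):
    """Tabulate response codes, treating None as a missing response."""
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    running = 0  # running count used for the cumulative percent
    for code in sorted(counts):
        n = counts[code]
        running += n
        rows.append({
            "code": code,
            "freq": n,
            "percent": 100 * n / total,             # share of the whole sample
            "valid_percent": 100 * n / len(valid),  # share of non-missing responses
            "cum_percent": 100 * running / len(valid),
        })
    return rows, total - len(valid)

# Hypothetical toy responses (1 = male, 2 = female, None = missing); not NESARC data.
sex = [1, 2, 2, 1, None, 2, 1, 2]
rows, n_missing = freq_table(sex)

for r in rows:
    print(f"{r['code']:>4} {r['freq']:>5} {r['percent']:>7.1f} "
          f"{r['valid_percent']:>7.1f} {r['cum_percent']:>7.1f}")
print(f"missing: {n_missing}")
```

With no missing responses, percent and valid percent come out the same, which is why the extra column only matters once you have missing data.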
This is what SPSS would look like, and SAS looks very similar as well: again, gender, frequency, percent, cumulative frequency, and cumulative percent. This is incredibly important. You’re going to make lots and lots of decisions based on these frequency distributions, so you’re going to run them; you’re going to run them often, and even when you get into your data analysis, you’re going to go back and run them again. They’re constantly telling you things about the data that are helpful to you.
One thing that I know here is that I can just look at this frequency distribution and say that if I’m interested in gender differences, I have a really good chance of being able to look at them because it’s about 50/50. What if I found something different once I subset the data? If you take smokers who are between 18 and 25, I’m sure in certain countries that would be mostly men, right? There would be maybe very few women. So if I’ve got 1 or 2 percent women and 98 or 99 percent men, it’s possible that I won’t have the power; I won’t have the numbers to actually look at gender differences. When you look at these frequency distributions, they can help you determine: am I going to be able to do what I want to do with this data? Again, we’re using this to learn about our data.
The place you probably want to start is with your main constructs. Gender is not my main construct, so that may not be the place I want to start in terms of looking at my data and managing it. Nicotine dependence is my main topic. The variable in NESARC for this, nicotine dependence in the past 12 months, is tab12mdx. They don’t have lovely names, but again, our codebook tells us exactly what these variables mean. What I find is that in this group of 1,706 young smokers, about half of them (52.5 percent, 896) have nicotine dependence and about half, 47.5 percent, do not. Yes is one, and no is zero, a general convention in our data.
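As a rough sketch of that check, the Python below rebuilds the tab12mdx tabulation from the counts quoted above and looks at how small the smallest category is. It assumes the 810 people who are not coded 1 are all coded 0, with no missing values. The `MIN_SHARE` cutoff is a made-up number for illustration only; it is not a power calculation and not a rule from the course.

```python
from collections import Counter

# Rebuilt from the counts quoted in the lecture: 896 of 1,706 coded 1 (yes).
# Assumption: the remaining 810 are all coded 0 (no), with no missing values.
tab12mdx = [1] * 896 + [0] * 810

counts = Counter(tab12mdx)
total = sum(counts.values())
percents = {code: 100 * n / total for code, n in counts.items()}

for code in sorted(percents):
    print(f"{code}: n={counts[code]}, {percents[code]:.1f}%")

# Hypothetical cutoff: flag the variable if any category is under 10 percent.
MIN_SHARE = 10.0
enough_variability = min(percents.values()) >= MIN_SHARE
print("enough variability to compare groups:", enough_variability)
```

With a 99/1 split like the gender example, the same check would come back False, and that would be the signal to rethink the question before going any further.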
Again, what I know now is that I have some variability here. They’re not all nicotine dependent, and they’re not all free of nicotine dependence. Again, there’s a good chance that I can pursue this question.
Another thing that you can do by looking at these frequency distributions is error check your code. I subset my data to individuals who are 18 to 25. Say I run a frequency distribution on age and find that there are people who are 30 and 40, people that I don’t expect to be in there because I subset my data. This is where you’re going to find that out. You run your frequency distribution to make sure that your code worked, that in fact you only have people between 18 and 25. Frequency distributions are the first step. After you subset your data to the variables that you want, you’re going to be looking at frequency distributions for your particular variables.
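Here is a small sketch of that error check in plain Python. The records and the field names `age` and `smoked_past_year` are hypothetical, not the real NESARC variable names. The idea is just to subset, tabulate age, and confirm nothing falls outside 18 to 25.

```python
from collections import Counter

# Hypothetical records; field names are made up for this illustration.
people = [
    {"age": 19, "smoked_past_year": True},
    {"age": 24, "smoked_past_year": True},
    {"age": 31, "smoked_past_year": True},
    {"age": 22, "smoked_past_year": False},
    {"age": 18, "smoked_past_year": True},
    {"age": 40, "smoked_past_year": True},
    {"age": 24, "smoked_past_year": True},
]

# Subset: past-year smokers aged 18 to 25.
subset = [p for p in people if p["smoked_past_year"] and 18 <= p["age"] <= 25]

# Frequency distribution of age in the subset.
age_counts = Counter(p["age"] for p in subset)
for age in sorted(age_counts):
    print(age, age_counts[age])

# Any age outside the intended range means the subsetting code did not work.
unexpected = [age for age in age_counts if not 18 <= age <= 25]
print("unexpected ages:", unexpected)
```

If the subsetting condition were wrong, the 31- and 40-year-olds would show up in the age table and in `unexpected`, which is exactly what the frequency distribution is there to catch.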