Up until this point, all of the decisions that you’ve made have been based on the codebooks.
You’ve looked at the codebooks to decide what kind of dataset you were going to use;
you’ve looked at them to identify what kind of variables you want to look at, and you’ve
made decisions about your rows and columns and which ones you want to keep based on your
codebook.
From this point forward, we’re going to start making decisions based on the data, on what
we see in the data, and we’re going to focus on that a lot. I don’t care how sophisticated
you get in the research process or how many years you’ve been doing it: you always start
with a frequency distribution. A frequency distribution is a tabulation of the values that
one or more variables take in a sample. Here is a frequency distribution for gender. This
is Stata output, and I’ll show you output from each of the three programs so that you
again get the feeling that they’re all doing the same thing.
I've actually subset my sample, the NESARC, to people who have smoked in the past
year, and I've subset to 18- to 25-year-olds.
If I look at the sex or gender variable in this sample, one is male and two is female.
My frequency distribution here shows me my response codes, one and two. It shows me
the frequency of each code in the sample. I have a total of 1,706 people in the sample
who smoked within the last year and are between 18 and 25. It shows me the percent of
each gender, and it also shows me a cumulative percent, which can be very
helpful.
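To make those columns concrete, here is a minimal sketch of how a frequency table could be built by hand in Python. It is not the Stata, SPSS, or SAS procedure; the function name frequency_table and the ten toy values are made up for illustration and are not NESARC data.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (code, frequency, percent, cumulative percent), sorted by code."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0  # running count used for the cumulative percent column
    for code in sorted(counts):
        running += counts[code]
        rows.append((code,
                     counts[code],
                     100.0 * counts[code] / total,
                     100.0 * running / total))
    return rows

# Toy data only (hypothetical): 1 = male, 2 = female
sex = [1, 2, 2, 1, 2, 1, 1, 2, 2, 2]

print(f"{'code':>5} {'freq':>5} {'percent':>8} {'cum.':>8}")
for code, freq, pct, cum in frequency_table(sex):
    print(f"{code:>5} {freq:>5} {pct:>8.1f} {cum:>8.1f}")
```

The last row's cumulative percent should always come out to 100, which is a quick way to confirm that the table accounts for every observation.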
If you go to the SPSS output, it looks pretty much the same. It has an extra column here
called “valid percent.” When you’ve looked at the medical record data, you’ve
already been acquainted with missing data. Sometimes you don’t know; sometimes there
is no information, and so you have missing data. SPSS actually has an extra column that’s
based not on the whole sample but only on valid, non-missing responses. This is what SPSS
would look like, and SAS looks very similar as well: again, gender, frequency, percent,
cumulative frequency, and cumulative percent.
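The difference between percent and valid percent can be sketched in a few lines of Python. This is only an illustration of the idea, not how SPSS computes it internally; the function name, the use of None to mark a missing response, and the toy values are all assumptions.

```python
from collections import Counter

def percent_and_valid_percent(values):
    """Map each code to (frequency, percent of all cases, percent of valid cases).

    None marks a missing response (an assumption for this sketch).
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    return {
        code: (count, 100.0 * count / total, 100.0 * count / len(valid))
        for code, count in sorted(counts.items())
    }

# Toy data only (hypothetical): two of the ten responses are missing
responses = [1, 2, None, 2, 1, None, 2, 2, 1, 2]

for code, (freq, pct, valid_pct) in percent_and_valid_percent(responses).items():
    print(f"{code}: freq={freq} percent={pct:.1f} valid percent={valid_pct:.1f}")
```

When nothing is missing, the two percent columns are identical; the more missing data there is, the further apart they drift.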
This is incredibly important. You’re going to make lots and lots of decisions based
on these frequency distributions, so you’re going to run them; you're going to run them
often, and even when you get into your data analysis, you're going to go back and run
them again. They’re constantly telling you things about the data that are helpful to
you.
One thing I can tell just by looking at this frequency distribution is that if I’m
interested in gender differences, I have a really good chance of being able to look at
them, because the split is about 50/50. But what if it had come out differently once I
subset the data? If you take smokers between 18 and 25 in certain countries, I’m sure
that would be mostly men, right? There would be maybe very few women. If I've got 1 or 2
percent women and 98 or 99 percent men, it’s possible that I won’t have the power; I
won’t have the numbers to actually look at gender differences.
When you look at these frequency distributions, they can help you determine whether you’re
going to be able to do what you want to do with this data.
Again, we’re using this to learn about our data. The place you probably want to start
is with your main constructs. Gender is not my main construct, so that may not be the
place I want to start in terms of looking at my data and managing it. Nicotine dependence
is my main topic. The variable in NESARC for nicotine dependence in the past 12 months
is tab12mdx. The variables don’t have lovely names, but again, our codebook
tells us exactly what they mean. What I find is that in this group of 1,706
young smokers, about half of them (896, or 52.5 percent) have nicotine dependence and about
half (47.5 percent) do not. Yes is one and no is zero, a general convention in our data.
Again, what I know now is that I have some variability here. They’re not all
nicotine-dependent, and they’re not all non-dependent. Again, there’s a good chance that
I can pursue this question.
Another thing that you can do by looking at these frequency distributions is
error check your code. I subset my data to individuals who are 18 to 25.
If I run a frequency distribution on age and find that there are people who are 30 and 40,
people I don’t expect to be in there because I subset my data, this is where
I’m going to find that out. You run your frequency distribution to make sure that your
code worked, that in fact you only have people between 18 and 25.
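That kind of check can also be written as a short test. The Python sketch below assumes the ages are already in a plain list; the function name out_of_range and the toy ages are hypothetical, chosen only to show what a failed subset would look like.

```python
def out_of_range(ages, low=18, high=25):
    """Return the distinct ages that fall outside the intended subset, sorted."""
    return sorted({age for age in ages if not (low <= age <= high)})

# Toy data only (hypothetical): a subset that accidentally kept a 30- and a 40-year-old
ages = [18, 19, 25, 30, 22, 40, 21]

unexpected = out_of_range(ages)
if unexpected:
    print("Subset did not work; unexpected ages:", unexpected)
else:
    print("All ages are between 18 and 25.")
```

An empty result means the subsetting code did what it was supposed to do; anything else tells you which values slipped through.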
Frequency distributions are the first step. After you subset your data to the variables
that you want, you’re going to be looking at frequency distributions for your particular
variables.