What we will do for the next couple of weeks is really make decisions about our
data, so all of the things that you’re going to be thinking about related
to your software and how you want to use it are about decisions that you want to make.
Now, you’ve made your first decision and that was what dataset you’re going to use.
The second decision, which you made in selecting your codebook, was what variables you actually
want to examine. What we’re going to do is take another step with
that decision. You have a codebook that has the variables that you want to examine, and
so now we’re actually going to make that decision in the data. The data decision is
just asking which columns in the data we want to keep.
One thing that I want to point you to is where the code is available. You’ll
probably be getting additional resources in your lab, but whenever you need the code
for a particular step in your program, Moodle actually has a few lines of
code for it in each of the different software packages. Again, we
just really want to highlight, if you read across them, the similarity of the logic behind
them. They’re a little bit different, but they’re doing the exact same thing.
I should mention, too, the way that we tried to write this is that the bold is meant to
be things that you don’t change, and the unbolded things are things that you adapt
for your own purposes in your own program. We’ve tried to keep that pretty consistent.
If as we’re moving through we need to make any changes, just let us know, but we’ve
tried to keep that convention.
To select the columns, or variables, that we want, we’re actually
going to write a program today or tomorrow in your lab section. Whatever software you
use, the steps are the same. The first step is that you’re going to call in your
dataset. You’re going to basically point to your dataset and say, “That’s where
I want you to get my data from.” That’s the data that all of the other code
below it is going to speak to.
You can see that SPSS, Stata and SAS all use slightly different words — get file, use,
libname and data. Again, they’re all doing the exact same thing; they’re just pointing
the program to the particular dataset of interest.
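For comparison, here is a rough sketch of the same "point to your dataset" step in Python, which is not one of the course packages and is not the code posted on Moodle. The column names (ID, GENDER, MEMBERSHIP, STATUS, CITY) and the data values are made up for illustration, and an in-memory string stands in for a file on disk so that the example runs on its own.

```python
import csv
import io

# Hypothetical stand-in for a data file on disk. With a real file you
# would pass open("your_file.csv", newline="") instead of io.StringIO.
RAW_DATA = """ID,GENDER,MEMBERSHIP,STATUS,CITY
3,Female,Gold,Active,Hartford
1,Male,Silver,Paused,Middletown
2,Female,Bronze,Active,New Haven
"""


def load_dataset(source):
    """Read every row of the dataset into a list of dictionaries."""
    return list(csv.DictReader(source))


# Everything written below this line works on the data loaded here.
rows = load_dataset(io.StringIO(RAW_DATA))
print(len(rows), "observations with columns", list(rows[0].keys()))
```

As with get file, use, and libname, nothing is analyzed yet; the program has only been told where the data comes from.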
The next step is that we select the variables that we want to work with, and they all use
the “keep statement” for that. The reason we’re doing that is you have a dataset,
of course, that you’re going to use. You have a caterpillar dataset or Add Health or
the CMT Data from the Middletown Schools or whatever, but many of those are absolutely
huge and they require a lot of processing power basically.
What we want to do is subset to the variables that you’re going to look at, both
so that you can really make a commitment to those variables and, most importantly, so
that when you run these programs it doesn’t take five minutes only to find that
you have an error message and have to go back and spend the five minutes again. What this
is going to do is really speed up the process. You’re going to select the columns or variables
that you need, and your programs are going to run much more quickly.
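The idea of keeping only the columns you need can be sketched in Python as follows. This is an illustration under assumptions, not the course code: the rows, the column names, and the helper `keep_columns` are all hypothetical.

```python
# Hypothetical rows, as if they had already been read in from a full dataset.
rows = [
    {"ID": "3", "GENDER": "Female", "MEMBERSHIP": "Gold", "STATUS": "Active", "CITY": "Hartford"},
    {"ID": "1", "GENDER": "Male", "MEMBERSHIP": "Silver", "STATUS": "Paused", "CITY": "Middletown"},
    {"ID": "2", "GENDER": "Female", "MEMBERSHIP": "Bronze", "STATUS": "Active", "CITY": "New Haven"},
]

# Assumed variables of interest; this list is the part you would adapt.
KEEP = ["ID", "GENDER", "MEMBERSHIP", "STATUS"]


def keep_columns(rows, keep):
    """Return new rows that contain only the listed columns."""
    return [{name: row[name] for name in keep} for row in rows]


subset = keep_columns(rows, KEEP)
print(subset[0])
```

Every row is still there; only the columns that were not listed (here, CITY) are gone.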
We’ll also sort the data, which is really good programming etiquette. We’ll talk a
little bit about that. Then the most important step, and the last one, is that we’re going to
output an abbreviated dataset: your own dataset with just the variables you’re looking
at. While several of you may be looking at Add Health data, you’re all going to have
very different looking datasets at the end of this step, because you will
have selected from Add Health all of the variables of particular interest to you that
will help you to answer your question.
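The last two steps, sorting and then writing out the abbreviated dataset, might look like this in Python. It is a sketch only: the rows and column names are hypothetical, sorting by the identifier is an assumption about what you would sort on, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical rows that have already been cut down to the kept columns.
subset = [
    {"ID": "3", "GENDER": "Female", "MEMBERSHIP": "Gold"},
    {"ID": "1", "GENDER": "Male", "MEMBERSHIP": "Silver"},
    {"ID": "2", "GENDER": "Female", "MEMBERSHIP": "Bronze"},
]

# Sort by the unique identifier. int() makes "10" sort after "9",
# which plain text sorting would not do.
ordered = sorted(subset, key=lambda row: int(row["ID"]))


def write_dataset(rows, target):
    """Write the rows, with a header line, to an open text target."""
    writer = csv.DictWriter(target, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Stand-in for open("my_subset.csv", "w", newline="") with a real file.
out = io.StringIO()
write_dataset(ordered, out)
print(out.getvalue())
```

Two people starting from the same large dataset would end up with different output files here, because each would have kept a different set of columns.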
Again, let’s look at some data, because that’s the part that we often don’t see when we’re
doing this programming. This is actually a dataset you may have seen in lab. It’s one
of those nice datasets that has a lot of string variables, so you can see really
easily with words what the data is. I’ve described what kind of a dataset this is
differently at different times. It strikes me from looking at it this morning that it may be
like a Visa card dataset, because they’ve got membership status here, whether they’re
active or paused and that kind of thing. We’ll say that this is like the membership status
of your Visa card.
Let’s say that my interest in this dataset is that I want to see whether that membership
— silver, gold, bronze — is related to gender. Are males or females more likely to
have a higher membership rating? I decide that I don’t really need all of those variables.
I definitely need those two; I need my unique identifier, and I may even want to look at
a few other variables. Basically, what I’m doing in this first program is choosing the
columns.
You’re going to have a keep statement that selects the variables of interest to you,
and you’re basically getting rid of, in your dataset, all of the columns you don’t
need and keeping things that you think you’re going to use.
The next decision that you want to make is not a final decision. If you happen
to be decision-phobic, remember that all of the decisions we’re talking about, short of what dataset
you’re using (because that one you’re now very committed to), are ones you can always
change. That’s what research is all about. We make the best decisions that we possibly
can with the information that we have in front of us right now. We keep moving forward with
research; we keep collecting more information, and we can often make better decisions.
This is a decision you’re going to sort of revisit again and again, and that is do
you want to focus on a subset of your observations? In other words, are there particular rows
in the dataset that you want to keep? Now, the answer at this point may be no; the answer
may be yes; the answer at this point may be no and later in the semester it may be yes,
but it’s a decision that I want you to always sort of carefully consider when you’re thinking
about your research question.
Let’s assume that I have an interest in disabilities in the population. What kinds
of disabilities are there, mental health vs. physical health? Let’s say that I have
a nationally representative dataset. When I look at the disability item, I find that
15.4 percent of the population report a physical health disability, 3.5 percent
report a mental health disability, and most of the sample, 81.1 percent, report
no disability at all.
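Percentages like these come from counting how many observations fall in each category and dividing by the total. Here is a small Python sketch of that arithmetic. The sample of 1,000 respondents is invented purely so that the counts reproduce the percentages just quoted; it is not the real data, and the category labels are hypothetical.

```python
from collections import Counter

# Invented sample of 1,000 respondents, built to mirror the quoted percentages.
responses = ["physical"] * 154 + ["mental"] * 35 + ["none"] * 811


def percent_table(values):
    """Return the share of each category as a percentage, to one decimal."""
    counts = Counter(values)
    total = len(values)
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


table = percent_table(responses)
for category, pct in table.items():
    print(f"{category}: {pct} percent")
```

Because the question is about the whole population, every observation goes into the total; no rows are dropped before counting.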
For this particular question my observations are the whole dataset. If my dataset is the
U.S. population and my question is a question within that, then I would keep everybody.
I wouldn’t subset the data. However, what if I have a school sample and I’m not interested
in the disabilities of the entire school, but I’m interested specifically in the kids
who have access to special education or IDEA? That stands for the Individuals with Disabilities
Education Act, the federal law that provides funding to schools to help with special education
services.
What if in this school sample I’m really interested in disabilities reported among
kids who are eligible for special education? What I find here is that 8.2 percent
are identified with an emotional disturbance, and other disabilities (anything from
dyslexia to math disabilities to autism) account for the remaining 91.8 percent
of that population.
Now, to ask that question, I’d need to subset my
observations from the school system specifically to youth served by IDEA, or youth in special
education. That’s the kind of decision that I want you to think about when you’re thinking
about your question.
Again, if we go back to this dataset on Visa cards, I may say that I’m interested in
whether this membership level is associated with gender, but I’m actually only interested
in people who are active. What I would do is put a line in my program to
subset by row, that is, by observation. That basically gets rid of all of my non-active or paused
people, so now I’ve just got active people in my dataset.
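Subsetting by row can be sketched in Python like this. The rows, the column names, and the helper `keep_rows` are hypothetical, and the course packages each have their own wording for the same step.

```python
# Hypothetical rows from the membership dataset.
rows = [
    {"ID": "1", "GENDER": "Male", "MEMBERSHIP": "Silver", "STATUS": "Paused"},
    {"ID": "2", "GENDER": "Female", "MEMBERSHIP": "Bronze", "STATUS": "Active"},
    {"ID": "3", "GENDER": "Female", "MEMBERSHIP": "Gold", "STATUS": "Active"},
]


def keep_rows(rows, column, value):
    """Return only the observations whose column equals the given value."""
    return [row for row in rows if row[column] == value]


# Keep active members only; paused members are dropped from the subset.
active = keep_rows(rows, "STATUS", "Active")
print(len(active), "of", len(rows), "observations kept")
```

The columns are untouched here; it is the number of observations that shrinks. The original `rows` list is left as it was, so the decision is easy to revisit later.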