QAC 201 Data Decisions - Rows and Columns

For the next couple of weeks we’re really making decisions about our data. Everything you’ll be thinking about with your software, and how you want to use it, comes down to decisions that you want to make. Now, you’ve made your first decision, and that was what dataset you’re going to use. The second decision, which you made in selecting your codebook, was what variables you actually want to examine. What we’re going to do is take another step with that decision. You have a codebook that has the variables that you want to examine, and so now we’re actually going to make that decision in the data. The data decision is just asking what columns in the data we want to keep.
One thing that I want to point you to is the availability of the code. You’ll probably be getting additional resources in your lab, but whenever you need the code for a particular step in your program, Moodle has a few lines of code for it in each of the different software packages. Again, we really want to highlight, if you read across them, the similarity of the logic behind them. They’re a little bit different, but they’re doing the exact same thing. I should mention, too, that the way we tried to write this is that the bold is meant to be things that you don’t change, and the unbolded things are things that you adapt for your own purposes in your own program. We’ve tried to keep that convention pretty consistent; if we need to make any changes as we’re moving through, just let us know.
The first step in terms of selecting the columns or the variables that we want is that we’re actually going to write a program, today or tomorrow in your lab section. Whatever software you use, the steps are the same. The first step is that you’re going to call in your dataset.
You’re basically going to point to your dataset and say, “That’s where I want you to get my data from.” That’s the data that all of the other code below it is going to speak to. You can see that SPSS, Stata and SAS all use slightly different words: get file, use, libname and data. Again, they’re all doing the exact same thing; they’re just pointing the program to the particular dataset of interest.
The next step is that we select the variables that we want to work with, and they all use a “keep” statement for that. The reason we’re doing that is that many of the datasets you’re going to use, whether it’s the caterpillar dataset or Add Health or the CMT data from the Middletown schools or whatever, are absolutely huge, and they require a lot of processing power. What we want to do is subset to the variables that you’re going to look at, both so that you can really make a commitment to those variables and, most importantly, because we don’t want a program to take five minutes to run only for you to find an error message and have to go back and spend the five minutes again. This is going to really speed up the process. You’re going to select the columns or variables that you need, and your programs are going to run much more quickly.
We’ll also sort the data, which is really good programming etiquette. We’ll talk a little bit about that. Then the last step, and the most important, is that we’re going to output an abbreviated dataset, your abbreviated dataset, with just the variables you’re looking at. While several of you may be looking at Add Health data, you’re all going to have very different-looking datasets at the end of this step, because you’ll have selected from Add Health the variables of particular interest to you that will help you answer your question.
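To make the four steps concrete, here is a rough sketch of the same logic in Python. This is not the SPSS, Stata or SAS code from Moodle; it is just an illustration using Python's standard library. The tiny dataset, the variable names (`id`, `age`, `income`, `region`, `score`) and the function name `build_abbreviated_dataset` are all made up for the example.

```python
import csv
import io

# Hypothetical raw data standing in for a large dataset file.
RAW_CSV = """id,age,income,region,score
3,41,52000,west,7
1,29,38000,east,9
2,35,61000,south,4
"""

# Hypothetical list of variables chosen from a codebook.
KEEP = ["id", "age", "score"]


def build_abbreviated_dataset(raw_text, keep, sort_key):
    # Step 1: call in the dataset (here, parse CSV text into rows).
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    # Step 2: keep only the columns of interest.
    kept = [{name: row[name] for name in keep} for row in rows]
    # Step 3: sort the observations by the unique identifier.
    kept.sort(key=lambda row: int(row[sort_key]))
    # Step 4: output the abbreviated dataset as CSV text.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()


abbreviated = build_abbreviated_dataset(RAW_CSV, KEEP, "id")
print(abbreviated)
```

In a real program the raw text would come from your data file and the output would be saved as a new file, but the order of the steps is the same: point to the data, keep, sort, output.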
Again, let’s look at the data itself, because that’s the part that we often don’t see when we’re doing this programming. This is actually a dataset you may have seen in lab. It’s one of those nice datasets that has a lot of string variables, so you can see really easily, in words, what the data is. I’ve described what kind of dataset it is differently at different times. It struck me, looking at it this morning, that it may be like a Visa card dataset, because it has membership status here, whether they’re active or paused and that kind of thing. We’ll say that this is the membership status of your Visa card. Let’s say that my interest in this dataset is whether membership level (silver, gold, bronze) is related to gender. Are males or females more likely to have a higher membership rating? I decide that I don’t really need all of those variables. I definitely need those two; I need my unique identifier, and I may even want to look at a few other variables. Basically, what I’m doing in this first program is choosing the columns. You’re going to have a keep statement that selects the variables of interest to you, and you’re getting rid of all of the columns in your dataset that you don’t need and keeping the things that you think you’re going to use.
The next decision that you want to make is not a final decision. If you happen to be decision-phobic, remember that all of the decisions we’re talking about, short of what dataset you’re using (because you’re now very committed to that one), can always be changed. That’s what research is all about. We make the best decisions that we possibly can with the information that we have in front of us right now. We keep moving forward with research; we keep collecting more information, and we can often make better decisions. This is a decision you’re going to revisit again and again, and that is: do you want to focus on a subset of your observations?
In other words, are there particular rows in the dataset that you want to keep? Now, the answer at this point may be no; the answer may be yes; the answer at this point may be no and later in the semester it may be yes, but it’s a decision that I want you to always consider carefully when you’re thinking about your research question. Let’s assume that I have an interest in disabilities in the population. What kinds of disabilities are there, mental health vs. physical health? Let’s say that I have a nationally representative dataset. When I look at the disability item, I find that 15.4 percent of the population report a physical health disability, 3.5 percent report a mental health disability, and most of the sample, 81.1 percent, report no disability at all. For this particular question my observations are the whole dataset. If my dataset represents the U.S. population and my question is a question about that population, then I would keep everybody. I wouldn’t subset the data.
However, what if I have a school sample and I’m not interested in the disabilities of the entire school, but specifically in the kids who have access to special education, or IDEA? That stands for the Individuals with Disabilities Education Act, which provides federal funding to schools to help with special education services. What if in this school sample I’m really interested in disabilities reported among kids who are eligible for special education? What I find here is that 8.2 percent are identified with an emotional disturbance, and other disabilities, anything from dyslexia to math disabilities to autism or whatever, make up the remaining 91.8 percent of that group. Now, to ask that question, I’d need to subset my observations from the school system specifically to youth served by IDEA, or youth in special education.
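Here is a small sketch of why the subsetting decision changes the percentages you see. It is written in Python only for illustration, and the eight student records, the field names (`idea`, `disability`) and the helper `percent_table` are invented; they are not the data behind the numbers above.

```python
from collections import Counter

# Hypothetical school records; field names and values are invented.
students = [
    {"id": 1, "idea": True, "disability": "emotional disturbance"},
    {"id": 2, "idea": True, "disability": "other"},
    {"id": 3, "idea": False, "disability": "none"},
    {"id": 4, "idea": True, "disability": "other"},
    {"id": 5, "idea": False, "disability": "none"},
    {"id": 6, "idea": True, "disability": "other"},
    {"id": 7, "idea": False, "disability": "none"},
    {"id": 8, "idea": False, "disability": "none"},
]


def percent_table(rows, field):
    """Return {category: percent of rows}, rounded to one decimal."""
    counts = Counter(row[field] for row in rows)
    total = len(rows)
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Question about the whole school: keep every row.
whole_school = percent_table(students, "disability")

# Question about youth served by IDEA: subset the rows first.
idea_only = [row for row in students if row["idea"]]
within_idea = percent_table(idea_only, "disability")

print(whole_school)
print(within_idea)
```

The same variable gives a different distribution depending on which rows are in the denominator, which is why the choice of observations has to follow from the research question.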
That’s the kind of decision that I want you to think about when you’re thinking about your question. Again, if we go back to this dataset on Visa cards, I may say that I’m interested in whether membership level is associated with gender, but only among people who are active. What I would do is put a line in my program to subset by row, that is, by observation. I’ve basically gotten rid of all of my non-active or paused people, and now I’ve just got active people in my dataset.
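Putting the two decisions together for the membership example, a minimal Python sketch might look like the following. The records and the column names (`id`, `gender`, `level`, `status`, `city`) are hypothetical stand-ins for the lab dataset, and the course software would use its own keep and subsetting statements rather than this code.

```python
from collections import Counter

# Hypothetical membership records; all names and values are invented.
members = [
    {"id": 101, "gender": "F", "level": "gold", "status": "active", "city": "A"},
    {"id": 102, "gender": "M", "level": "silver", "status": "paused", "city": "B"},
    {"id": 103, "gender": "F", "level": "bronze", "status": "active", "city": "C"},
    {"id": 104, "gender": "M", "level": "gold", "status": "active", "city": "A"},
    {"id": 105, "gender": "F", "level": "gold", "status": "paused", "city": "B"},
]

# Column decision: the identifier plus the variables of interest.
KEEP = ["id", "gender", "level", "status"]
kept = [{name: row[name] for name in KEEP} for row in members]

# Row decision: keep only observations with an active membership.
active = [row for row in kept if row["status"] == "active"]

# Count membership level by gender among active members only.
crosstab = Counter((row["gender"], row["level"]) for row in active)

for (gender, level), count in sorted(crosstab.items()):
    print(gender, level, count)
```

Because the row filter is a single line, it is easy to take it out or change it later if you revisit the decision.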