R 2.4 - For Loops and handling missing observations

A second key programming tool is the for loop. A for loop is a structure used to execute a block of code repeatedly. The for statement specifies an index over which the loop runs. For example, here I'll execute a for loop using an index called i. The object i will start the loop by taking the first value in the vector 1 through 10. That is, to start, i will equal 1. Next, the body of the loop will execute. Then i will take the next value in this vector, which is 2, and the loop will execute again. This will continue until i has taken the last value in the vector, 10, and the code executes one last time. In this set of code, the value i squared will be appended to the vector x using the append function.

Here I got an error. I'll take a closer look. I can see that x never actually existed, so there was nothing to append to on the first iteration. To fix this, I'll just initialize x as an empty vector using the concatenate function, c, leaving the arguments empty. OK, this runs well. Look at x. In each iteration, the value of i squared was appended to the end of x. So the first element was created when i was 1, the second when i was 2, and so on.

OK, I've done something pretty cool here. I've done 10 calculations using a for loop, and it wouldn't be hard to do many more with the same set of code. For example, i could easily go from 1 to 100, rather than just 1 to 10. While there are other, better ways to do this particular calculation, there are instances where a for loop is very useful.

All right, one more look at the stock data. To calculate the smallest and largest values for each stock in the stock data set, I'm going to start by creating an object called the.tickers that simply holds the unique stock tickers in the data. Since for loops can iterate over any vector, I will write a for loop to iterate over the object that I've called the.tickers.
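For reference, the squaring loop from the first half of this segment can be sketched roughly as follows. The names x and i come from the narration; the exact code shown on screen is not part of the transcript, so treat this as an approximation rather than the original.

```r
# Initialize x as an empty vector. Without this line, append() fails on
# the first iteration because x does not exist yet.
x <- c()

# i takes each value of the vector 1:10 in turn.
for (i in 1:10) {
  # Append i squared to the end of x.
  x <- append(x, i^2)
}

print(x)
```

Changing 1:10 to 1:100 would run the same code for 100 values. The vectorized expression (1:10)^2 gives the same result without a loop, which is plausibly one of the "better ways" mentioned above.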
It's sometimes helpful to also give a meaningful name to the index, so I'm going to change the index i to ticker. Now I need to create code for the general case: for a given ticker, calculate both the low and the high value. I'll start by identifying which rows are of interest in the stocks data set. The vector called look.at is a Boolean vector indicating which observations belong to the ticker for the given iteration. Next, I can create two statements to calculate the lowest low and highest high of these observations. Finally, I need to store these values somehow. I can start by initializing two objects, lows and highs. Next, I can use an append command to append each value on to the end of the corresponding vector.

All right, I can run the code and print the results, but something's wrong. While I'd want to spot check some of my results anyway, here something has clearly gone bad: some of the results are NA. A value of NA in R means that a value is missing, and, more generally, many functions will return NA if any of the observations passed to them are missing. If I took a look through our data set, I'd find that there are several observations with missing (NA) values. Here, I've checked how many entries in the column low of the stocks data set are missing.

Many functions, such as min and max, take an optional extra argument that is useful for ignoring missing data, the na.rm argument, which I will set to TRUE in the min and max functions. Now when I re-run the code, I get sensible results. I'd want to look at the data more carefully to see why some observations are missing, but I'll leave this as a topic for another set of videos.

One final word: even in this example, there are other, better functions that could have been used to get the same results much more quickly. This would be important for code that should be implemented efficiently, and I'll get to these functions in the future. However, for the beginning R programmer, it's sometimes easier and clearer to simply implement a for loop.
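As a rough sketch of the ticker loop described above: the real stocks data set is not included in the transcript, so the small data frame below is a made-up stand-in, and the column names ticker, low, and high, as well as the helper names lowest.low and highest.high, are assumptions. Only the.tickers, look.at, lows, highs, and the na.rm argument are taken from the narration.

```r
# Hypothetical stand-in for the stocks data set (values are invented).
stocks <- data.frame(
  ticker = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  low    = c(10.0, NA, 9.5, 20.0, 19.0, 21.5),
  high   = c(11.0, 12.5, NA, 22.0, 23.0, 24.0),
  stringsAsFactors = FALSE
)

# Count how many entries in the low column are missing.
n.missing.low <- sum(is.na(stocks$low))
print(n.missing.low)

# The unique tickers to iterate over.
the.tickers <- unique(stocks$ticker)

# Initialize the result vectors before the loop.
lows  <- c()
highs <- c()

for (ticker in the.tickers) {
  # Logical vector: TRUE for rows that belong to the current ticker.
  look.at <- stocks$ticker == ticker

  # na.rm = TRUE drops missing values; without it, min() and max()
  # return NA whenever any selected value is NA.
  lowest.low   <- min(stocks$low[look.at],  na.rm = TRUE)
  highest.high <- max(stocks$high[look.at], na.rm = TRUE)

  lows  <- append(lows,  lowest.low)
  highs <- append(highs, highest.high)
}

print(lows)
print(highs)
```

Removing na.rm = TRUE from the two calls reproduces the problem from the narration: with this sample data, the values for the ticker with missing entries come back as NA.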