Intro to programming in R part two 2

Part of QMUL RMC361-2013. © Robert Verity & Yannick Wurm

In this session we will build on some of the skills learned in the previous practical, and will move on to consider some more advanced ways of controlling program flow. As before, you will want to refer frequently to the help files and your own notes when exploring these new concepts.

        To get your brains warmed up, here a a few questions on the previous material.

Table of Contents

Research Methods and Communication - R Session 2

Table of Contents

Functions

Intro to functions

Defining a function and its inputs

The main workings of a function

Outputs

More complex functions

Loops

Intro to loops

Slightly more complex loops

Creating Custom Data Sets

Reshaping data using loops

Bringing it Together – i.e. exam practice!

*Bonus Section: default values

Functions

Intro to functions

We start by taking a more detailed look at functions. You have already encountered a number of different functions, and should be quite familiar with their usage. For example, rbind(), subset(), grep() and gsub() are all functions that you have used already (if you can't remember what these functions do then I suggest you take another look at your notes from the first session)! Although these functions all do different things, they have the same common structure to them. This structure is key to understanding how a function really works.

A function has three main parts:

  1. Inputs
  2. Main workings of the function
  3. Outputs

The inputs are what go into the function. For example, in the case of grep() we were required to input a pattern that we were searching for, and a vector x within which to search. The main workings of the function are a set of rules or calculations describing what should be done using these inputs. Finally, the end result of these calculations is an output, or a series of outputs, that should be returned by the function. Figure 1 gives a rough schematic showing how a function works. Don't worry if this does not make perfect sense to you at this stage - hopefully it will do shortly!

Figure 1   Rough schematic of a function.

Now take a look at the following code. This code is designed to take a number in seconds, and convert it into hours, minutes and remaining seconds:

# Input raw number of seconds

number.of.seconds <- 19845

# Convert number.of.seconds into hours, minutes, and seconds

hours                <- floor(number.of.seconds /(60*60))

minutes        <- floor((number.of.seconds -hours*(60*60))/60)

seconds        <- number.of.seconds -((hours*60)+minutes)*60

# Create a single vector containing all three quantities

outputvec <- c(hours, minutes, seconds)

# Output the solution to the console

outputvec

Try to understand roughly what this code does, although don't worry too much about the calculations in the middle. Copy the code into a new script window, and run the script. You should find that the program outputs the solution, which in this case is 5 hours, 30 minutes, and 45 seconds.

        Notice that there are three clearly identifiable parts to the code above - the inputs (the variable number.of.seconds), the main workings of the code (where the actual calculations are performed), and the outputs (where the value of outputvec is returned). So, we can see that this script already has some similarities with our definition of a function. But! Now imagine that we need to carry out this operation not just once on the value number.of.seconds=19845, but many hundreds of times, and each time with a different input. At the moment we would have no choice but to copy the entire code many times over, varying the input each time. This is clearly an extremely wasteful and time-consuming way to program, and is not a practical solution to the problem.

        A much better solution is to write this code as a function, rather than as a standalone script. A function takes the input value - in this case the raw time in seconds - performs all of the necessary calculations behind the scenes, and then simply returns the desired output. We will now work through the process of converting this example script into a working function, looking at each of the stages in turn.

Defining a function and its inputs

The first thing that we need is a name for our function. As with naming variables, this is not rocket science, and the name timeconverter will do fine here.


Note - there a many functions loaded into R by default. When coming up with a function name, make sure you are not overwriting one of these existing functions. Do this by searching the help for that function name. For example, if you wanted to call your function prod then type ?prod (in this example you will find that this function name is already taken).


Now we need to tell R that we want to create a new function called timeconverter. This can be done as follows:

timeconverter <- function(inputseconds) { }

(don't run this code on its own, as it will not do anything). The crucial part here, i.e. the part that cannot be fiddled with, is the word function(), which tells R that we want timeconverter to be a function, rather than an ordinary variable. The part that can be fiddled with is the text inside the curly brackets. This is where we write the name, or names, of the inputs to the function. These are often called the arguments of the function. You are already familiar with the arguments of several functions from previous work. For example, the function grep() has arguments pattern and x, as well as the optional arguments value and ignore.case. In the code above we are stating that our function timeconverter has a single argument, which is called inputseconds. So far so simple.

The main workings of a function

Once we have defined our function and stated the inputs, we can get stuck into the main body of the function. This section always opens with a squiggly bracket "{", and closes with another squiggly bracket "}".  These tell R where the function starts, and where is stops. The overall format should look like this:

timeconverter <- function(inputseconds) {

        

        # this is where the calculations go

        

}

Notice that I have indented everything inside curly brackets using the Tab key. This is not strictly required to make the function work, but is good coding practice as it makes it more obvious where the function starts and stops.

        In this example we want to paste into this section the code that actually performs the time conversion. The new code should look like this:

timeconverter <- function(inputseconds) {

        

        # Convert the inputseconds into hours, minutes, and seconds

        hours                <- floor(inputseconds/(60*60))

        minutes        <- floor((inputseconds-hours*(60*60))/60)

        seconds        <- inputseconds-((hours*60)+minutes)*60

        

        # Create a single vector containing all three quantities

        outputvec <- c(hours, minutes, seconds)

        

}

In the initial script we used the fixed variable number.of.seconds in our calculations. In the code above we have replaced this with the variable inputseconds, which is an argument of the function. The value of inputseconds can be different every time the function is run, while in the initial script we were stuck with the fixed value number.of.seconds=19845. In other words, by writing this as a function the user will be able to decide whatever value they want for the value of inputseconds.

Outputs

The only thing missing from the function above is the code for outputting the solution. At the moment the function simply takes the input, performs the calculations, and then keeps the solution to itself! This is no good.

        Telling a function which values to output is very simple. We simply need to use the command return() within the function, with the variable that we want to output written inside the brackets. For example, in this case we will write return(outputvec). The complete code should look like this:

timeconverter <- function(inputseconds) {

        

        # Convert the inputseconds into hours, minutes, and seconds

        hours                <- floor(inputseconds/(60*60))

        minutes        <- floor((inputseconds-hours*(60*60))/60)

        seconds        <- inputseconds-((hours*60)+minutes)*60

        

        # Create a single vector containing all three quantities

        outputvec <- c(hours, minutes, seconds)

        

        # Output the solution

        return(outputvec)

}

Copy this code over into a new script window, and run the script. Once this script has been evaluated you will have access to the function timeconverter(). Try typing timeconverter(345) in the console. You should find that all of the calculations required to convert this number into hours, minutes, and seconds, takes place behind the scenes, and just the final answer is output to the console.

        Remember that you can assign the output of a function to a new variable. For example, try typing converted<-timeconverter(345) in the console. The variable converted should now contain the desired output.

Returning to the beginning of this section, our initial problem was that we wanted to perform this time conversion many times over, and with different inputs each time. The unsatisfactory solution was to copy the entire script many times over, varying the input each time. We can now see that the way to solve this problem is as follows:

  1. Define the function timeconverter() at the beginning of the script. Let the raw time in seconds be an argument to this function.
  2. Use the function timeconverter()throughout the rest of the script whenever we need to perform a conversion.

When using this approach it is important that you remember to define the function at the beginning of the script. Otherwise when you load up R the next time, or when you try to run the script on a different machine, you will not have access to the function.

Q1. Write your own function for converting distances between different units. Your function should take the distance in kilometres as input, and return the distance in miles as output (8 kilometres is roughly equal to 5 miles). Remember to clearly annotate your code, and make appropriate use of white space.

More complex functions

Hopefully you can already see how functions can be very useful things. We can make them even more useful by considering some simple extensions.

        First, you can define multiple arguments to a function. For example, in the case of timeconverter() we might have another input - say, an additional number of hours that need to be added on to the solution. Lets call this argument additionalhours. This extra argument goes in at the point where we define the function, separated by a comma from the previous argument. The new code looks like this:

timeconverter <- function(inputseconds, additionalhours) {

        

        # Convert the inputseconds into hours, minutes, and seconds

        hours                <- floor(inputseconds/(60*60))

        minutes        <- floor((inputseconds-hours*(60*60))/60)

        seconds        <- inputseconds-((hours*60)+minutes)*60

        

        # Create a single vector containing all three quantities

        outputvec <- c(additionalhours+hours, minutes, seconds)

        

        # Output the solution

        return(outputvec)

}

The value of additionalhours has been included in the calculations, being added on to the existing solution. Run this code and have a go at using the function with some different input values. Remember that you will now have to specify values for both arguments when you run the function, i.e. timeconverter(inputseconds=345, additionalhours=10). See the bonus section at the end for how to set default values for your arguments.

It is also worth noting that functions do not have to take single numbers as input. They can take vectors, matrices, data frames, or any other type of object, and they can also take character and logical data as well as numerical. For example, the following function takes a vector of words as input, and outputs the number of characters in the longest word:

maxcharacters <- function(wordvec) {

        

        # Find the number of characters in the longest word

        characters                <- nchar(wordvec)

        largestvalue        <- max(characters)

        

        # Output the solution

        return(largestvalue)

}

Try creating  a vector of words and running it through this function. If you want to understand how this function actually works, simply copy the main workings of the function over into a new script and look at each command in turn (you may not have encountered the functions nchar() and max() yet).

Q2. This task is a bit more challenging! Go back to your function for converting kilometres to miles; make a copy with an appropriate new name. The new extended function you must create should:

  1. take a distance in kilometres and a time in minutes as input.
  2. convert the distance into miles, and the time into hours. It should then calculate a speed in miles per hour.
  3. return the speed in miles per hour as output.

All of the skills required to complete this task this are given above. Take your time, and approach this problem one step at a time.

Q3. Go back to the script you wrote in the previous practical for calculating the volume of a rectangular room. Rewrite this as a function, with the width, length, and height as arguments.

Q4. Write a function that makes use of logical expressions. For example, you might design a simple function that takes x and y as input, and returns TRUE if x is greater than y, or FALSE otherwise.

Loops

Intro to loops

One of the main things that computers are useful for is doing the same task over and over again many times. Something that may take hours to do manually might take seconds on a computer, just as long as we have the right code to do it. The programming tool that allows you to perform a task many times over is the loop.

Have a look at the following code. This code takes a particular time value, converts it to hours, minutes, and seconds using the function timeconverter(), and then prints the output in the console. It does this for a range of different time values, one after the other:

# Convert first time and print output

time <- 178

converted <- timeconverter(inputseconds=time, additionalhours=0)

print(converted)

# Convert second time and print output

time <- 179

converted <- timeconverter(inputseconds=time, additionalhours=0)

print(converted)

# Convert third time and print output

time <- 180

converted <- timeconverter(inputseconds=time, additionalhours=0)

print(converted)

# Convert fourth time and print output

time <- 181

converted <- timeconverter(inputseconds=time, additionalhours=0)

print(converted)

Notice that at each stage we are simply overwriting the value of the variables time and converted as we need to. Notice also that the script is very repetitive, with the same block of code being copied many times over. This is very wasteful in terms of the space it takes up, and the amount of time it takes to write. Now imagine doing it thousands of times rather than just four times!

        A loop solves this problem by creating a block of code that will be evaluated many times over automatically. As part of this process we create a single variable, called the index of the loop, that is overwritten with a different value each time the block is evaluated. As with functions, there is a standard syntax (way of writing) that must be used when defining a loop. A simple example of a loop is as follows:

possible.values <- 1:5

for (index in possible.values) {

        

        print(index)

}

Here I have used black text to show things that cannot be changed, and red text to show things that can be changed. To start with, the word "for" simply tells R that you are defining a for loop (the only type of loop that we will consider here). Then we write some curly brackets. The first thing inside these brackets is the name of the index variable of the loop. Here I have called it index, although any valid variable name can be used. The index variable will take on a different value each time through the loop. Now we need to tell R what values to use for the variable index. These values are written after the word "in". Here I have used a simple vector of values going from 1 to 5. Thus, the first time through the loop the variable index will have a value 1, the second time through it will have a value 2, etc. Finally, we create a block using squiggly brackets, just as we did when defining a function. Everything inside of these brackets falls within the scope of the loop, and will be evaluated each time through. In this example I have written a simple command that prints the value of the index to the console.

        Try running this code. You should see the numbers one to five output in the console. It is important to understand exactly where these numbers come from. The complete workings of the loop can be written out as follows:

  1. First time through the loop. Set index=1. Evaluate code within squiggly brackets.
  2. Second time through the loop. Set index=2. Evaluate code within squiggly brackets.
  3. Third time through the loop. Set index=3. Evaluate code within squiggly brackets.
  4. Fourth time through the loop. Set index=4. Evaluate code within squiggly brackets.
  5. Fifth time through the loop. Set index=5. Evaluate code within squiggly brackets.

This is an extremely simple example, but its easy to see how this method can be put to good use. Going back to the problem stated at the outset, we can now write the code for converting a series of different time values in a much more compact way as follows:

time.values <- 178:181

for (time in time.values) {

        

        # Convert time and print output

        converted <- timeconverter(inputseconds=time, additionalhours=0)

        print(converted)

}

Notice that the index of the loop has been given a different name here, and the values that we are looping over have also changed. Run this code and look at the output. Really try to get your head around how the loop is operating at each iteration (each time through the loop), and how the output is being produced.

Slightly more complex loops

There are some interesting ways in which we can stretch our understanding of loops. First of all, it is important to recognise that the values that we are indexing over do not have to be sequential. The part where we write 178:181 in the example above really just specifies a vector that we are indexing over. Compare this with the following code:

loop.values <- c(15,5,2.3,100,16)

for (index in loop.values) {

        

        print(index)

}

Evaluate this code and look at the output. Notice that we have simply looped over the vector of values contained in loopvalues. We can even do away with numerical values altogether. Evaluate the following code:

loop.text <- c("How", "Now", "Brown", "Cow")

for (index in loop.text) {

        

        print(index)

}

Looking at the output we can see that this loop indexes over a series of character strings. This technique can be useful in some situations.

        Another important way of extending loops is to consider nested loops - in other words, loops within other loops! Have a look at the following code:

for (i in 1:10) {

        

        for (j in 3:6) {

                

                print(c(i,j))

        }

}

Here we have one loop (with index j) nested within another loop (with index i). I have also defined the values that i and j can take directly within the loops, rather than outside of the loops as in previous examples - this is simply a way of saving space. Evaluate this code and try to make sense of the output. Fiddle around with the different elements of this code until you are comfortable with nested loops. You could even try writing a third-degree nested loop (put another loop inside the j-loop)! If you plan to do this then keep in mind the following warning...


Warning - loops require your computer to perform many operations, and as such it is quite easy to crash R using loops. A simple block of code evaluated 100,000 times amounts to quite a big job. If you want to force R to exit a loop part way through, simply press Esc. Nested loops are particularly hazardous!


Once you are comfortable with loops, have a go at the following tasks:

Creating Custom Data Sets

Reshaping data using loops

One way in which you might find loops particularly useful is in reformatting data. By looping through all of the fields of a particular data set, and at each iteration dropping the relevant entry into a new data structure, it is possible to convert from one data format to another.

        The data set that we will use in this example is typical of the sort of data that you might be faced with in the future. Load the data by running the following line of code:

helianthus.data <- as.matrix(read.table("http://www.antgenomes.org/~yannickwurm/tmp/HelianthusData_num.txt ", header=T))

Alternatively download the file from here and similarly read it into R:

helianthus.data <- as.matrix( read.table(file.choose()), header=T)

Each row in this data set represents a different strain of Helianthus annuus (sunflowers), grown under controlled conditions. The first column tells us the Strain (these are numbered 1 to 5). The remaining columns describe the number of plants found in the study area at six different points in time. For example, looking at the first row, we can see that strain 1 started out with 12 plants, but by the final time point contained 57 plants.

        We want to get this data into a new format - sometimes called long format - in which we have a matrix of three columns; the first column describes the strain, the second column describes the time point, and the third column describes the number of plants observed. The first few lines of this new data structure should look like this:

Strain        Time        Count

1        0        12

1        1        33

1        2        71

1        3        61

1        4        73

1        5        57

2        0        10

2        1        27

...

We can make the transition from the wide format of helianthus.data to the long format described above using a nested loop. But first, let us create an empty matrix, called long.data, which we will eventually fill with our new values. This matrix must have 3 columns, and 30 rows (the number 30 comes from the fact that we have five strains at six time points each). You should already know the code to produce this matrix. Have a go yourself before looking at the answer below...

# Create an empty matrix

long.data <- matrix(0, nrow=30, ncol=3)

What you might not know is how to name each of the columns of this matrix. This can be done as follows:

colnames(long.data) <- c("Strain", "Time", "Count")

With this empty matrix created we can move on to the next part of the problem - populating it with values. We want to look at each of the elements of helianthus.data one after the other, using a nested loop. The basic structure of this nested loop is as follows:

# Loop through all rows

for (myrow in 1:5) {

        

        # Loop through all columns except the first

        for (mycol in 2:7) {

                

                # This is where the main code goes.

                

        }

}

Here we are using loops to index through each of the rows of the matrix helianthus.data, and for each row we are indexing through columns 2 to 7 (as these are the columns that contain relevant data). At any point in the two loops, the value that we are focusing on is given by helianthus.data[myrow,mycol]. For example, the following script could be used to print out each of the values that we are interested in one after the other:

# Loop through all rows

for (myrow in 1:5) {

        

        # Loop through all columns except the first

        for (mycol in 2:7) {

                

                # Print out helianthus.data for this row and column

                print(helianthus.data[myrow,mycol])

                

        }

        

}

Hopefully you can already see that these are the exact values  we want to drop into the third column of our matrix long.data. However, we are presented with a problem - how do we drop these values one after the other into the right place in the matrix long.data? We cannot use the index myrow to help us, as this only goes through values 1:5. Similarly, we cannot use the index mycol, as this only goes through values 2:7. What we really need is a new index that goes all the way from 1 to 30, irrespective of the row or column that we are focusing on.

        If we want such an index then we will have to make it ourselves! Let us start by creating the variable myindex, and giving it a value of zero. Then, each time through the inner loop we want to increase the value of myindex by 1; this can be done using the simple command myindex <- myindex+1. This is called incrementing the variable myindex by a value of 1. At this point the script might look something like this:

# Create myindex

myindex <- 0

# Loop through all rows

for (myrow in 1:5) {

        

        # Loop through all columns except the first

        for (mycol in 2:7) {

                

                # Increment myindex by a value of 1

                myindex <- myindex+1

                # Print values of all three indexes

                print(c(myindex,myrow,mycol))

                                

        }

        

}

This example code prints out the values of all three indices each time through the inner loop. Have a look at the output produced by this code, and make sure you can explain the pattern of numbers that you see (ask a demonstrator to explain if you do not).

        Now that we have three indices - one going through the rows of helianthus.data, one going through the columns of helianthus.data, and one simply going from 1 to 30 - we have all the ingredients we need to populate the matrix long.data. The adventurous among you might want to have a go at this yourselves! Otherwise, read on for the solution.

         We have already seen that the values we want to drop into the third column of our matrix long.data are given by helianthus.data[myrow,mycol]. We have also created the variable myindex telling us the row of longdata that we want to drop into. Thus, the code we need at each iteration of the loop is the following:

long.data[myindex,3] <- helianthus.data[myrow,mycol]

This is fairly straightforward. We also want to drop the time point in the second column of long.data. Although we do not have a vector describing each of the time points, in fact the timings are very simply given by mycol minus two. For example, if we are looking at the fourth column then we are looking at the second time point. Therefore, we need the following line of code to extract the timings:

long.data[myindex,2] <- mycol-2

Finally, we want to drop the strain type into the first column of long.data. The strain type is given by the first element in every row, meaning it is given by helianthus.data[myrow,1]. Therefore, we need the following line of code to extract the strain types:

long.data[myindex,1] <- helianthus.data[myrow,1]

With the addition of these three lines of code we are finished! The final script should look something like this:

# Create an empty matrix

long.data           <- matrix(0, nrow=30, ncol=3)

colnames(long.data) <- c("Strain", "Time", "Count")

# Create myindex

myindex <- 0

# Loop through all rows

for (myrow in 1:5) {

        

        # Loop through all columns except the first

        for (mycol in 2:7) {

                

                # Increment myindex by a value of 1

                myindex <- myindex+1

                # Drop the appropriate values into long.data

                long.data[myindex,1] <- helianthus.data[myrow,1]

                long.data[myindex,2] <- mycol-2

                long.data[myindex,3] <- helianthus.data[myrow,mycol]

                

        }

        

}

After running this code the matrix long.data should contain the same data as helianthus.data, but in long format rather than wide format.

Bringing it Together

Well done for making it this far! Now try to bring all of your knowledge together. Have another look over your previous notes, and make sure these ideas are still fresh in your mind. Have a go at writing your own scripts, preferably combining a number of different concepts (for example, you could try making a function that acts on a matrix, and then using this function within a loop). Another good idea would be to come up with a problem that you want to solve (either for fun or for other coursework), and then write a function that solves it.

Many of the ideas presented in this session and the last are designed to challenge you. Do not expect to understand them from a single read through the material. Rather, you will have to play around with many different examples and applications before the penny drops completely. On the plus side, structures such as loops and functions are common to almost all programming languages, and so once you understand these concepts the world of programming is your oyster!

*Bonus Section: default values

Returning to functions, another neat thing that we can do is set default values for our arguments. Have another look at the timeconverter() function. This function will complain if you try to run it without setting a value for both the argument inputseconds and the argument additionalhours - for example, if we tried to run the line timeconverter(inputseconds=123) we would get an error message. This might become very annoying - especially if the vast majority of the time the value of additionalhours will be set to zero. We can set a default value for this argument when we define the function as follows:

timeconverter <- function(inputseconds, additionalhours=0) {

        

        # (the rest of the code)

        

}

        

Make this change to your function, and then try running the line timeconverter(inputseconds=123). You should find that this command goes through fine, using the value additionalhours=0 as the default.