6.1 Functions in R
Functions are your loyal servants, waiting patiently to do your bidding to the best of their ability. They’re made with the utmost care and attention … though sometimes may end up being something of a Frankenstein’s monster - with an extra limb or two and a head put on backwards. But no matter how ugly they may be they’re completely faithful to you.
They’re also very stupid.
If we asked you to go to the supermarket to get us some ingredients to make Francesinha, even if you don’t know what the heck that is, you’d be able to guess and bring at least something back. Or you could decide to make something else. Or you could ask a celebrity chef for help. Or you could pull out your phone and search online for what Francesinha is. The point is, even if we didn’t give you enough information to do the task, you’re intelligent enough to, at the very least, try to find a workaround.
If, instead, we asked our loyal function to do the same, it would listen intently to our request, stand still for a few milliseconds, compose itself, and then start shouting Error: 'data' must be a data frame, or other object ... It would then repeat this every single time we asked it to do the job. The point here is that code and functions are not intelligent. They cannot find workarounds. They’re totally reliant on you to tell them very explicitly what they need to do, step by step.
Remember two things: the intelligence of code comes from the coder, not the computer; and functions need exact instructions to work.
To prevent functions from being too stupid you must provide the information the function needs in order for it to function. As with the Francesinha example, if we’d supplied a recipe list to the function, it would have managed just fine. We call this “fulfilling an argument”. The vast majority of functions require the user to fulfill at least one argument.
This can be illustrated in the pseudocode below. When we make a function we can specify what arguments the user must fulfill (e.g. argument1 and argument2), as well as what to do once it has this information (expression):
nameOfFunction <- function(argument1, argument2, ...) {expression}
The first thing to note is that we’ve used the function function() to create a new function called nameOfFunction. Within the round brackets we specify what information (i.e. arguments) the function requires to run (as many or as few as needed). These arguments are then passed to the expression part of the function. The expression can be any valid R command or set of R commands and is usually contained between a pair of braces { } (if the body of the function is a single expression you can omit the braces). Once you run the above code, you can then use your new function by typing:
nameOfFunction(argument1, argument2)
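As a minimal sketch of the template above, here is a toy function. The names add_two_numbers, add_two_numbers_short, number1 and number2 are hypothetical, chosen only for illustration. The second definition shows the braces being dropped when the body is a single expression.

```r
# Hypothetical example function: adds its two arguments together
add_two_numbers <- function(number1, number2) {
  number1 + number2
}

# The same function with the braces omitted (the body is a single expression)
add_two_numbers_short <- function(number1, number2) number1 + number2

# Call both versions with the same arguments
result_braces <- add_two_numbers(2, 3)
result_short <- add_two_numbers_short(2, 3)

print(result_braces)
print(result_short)
```

Both calls should give the same value, since the two definitions differ only in layout.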
Confused? Let’s work through an example to help clear things up.
First we are going to create a data frame called city, where columns porto, aberdeen, nairobi, and genoa are filled with 100 random values drawn from a bag (using the rnorm() function to draw random values from a Normal distribution with mean 0 and standard deviation of 1). We also include a “problem”, for us to solve later, by including 10 NA values within the nairobi column (using rep(NA, 10)).
city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)
Let’s say that you want to multiply the values in the columns porto and aberdeen and create a new object called porto_aberdeen. We can do this “by hand” using:
porto_aberdeen <- city$porto * city$aberdeen
We’ve now created an object called porto_aberdeen by multiplying the vectors city$porto and city$aberdeen. Simple. If this was all we needed to do, we could stop here. R works with vectors, so doing these kinds of operations in R is actually much simpler than in other programming languages, where this type of code might require loops (we say that R is a vectorised language). Something to keep in mind for later is that doing these kinds of operations with loops in R can be much slower than using vectorisation.
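To make the contrast concrete, the sketch below does the same multiplication twice: once vectorised and once with an explicit for loop. It uses its own small vectors rather than the city data frame so that it runs on its own; the seed value and the object names vectorised_result and loop_result are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
porto <- rnorm(100)
aberdeen <- rnorm(100)

# Vectorised: a single expression multiplies all 100 pairs of values
vectorised_result <- porto * aberdeen

# Loop: the same calculation, one element at a time
loop_result <- numeric(length(porto))
for (i in seq_along(porto)) {
  loop_result[i] <- porto[i] * aberdeen[i]
}

# Both approaches should give exactly the same values
identical(vectorised_result, loop_result)
```

The loop needs an empty vector to fill, an index, and four lines of code to do what the vectorised version does in one.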
But what if we want to repeat this multiplication many times? Let’s say we wanted to multiply columns porto and aberdeen, aberdeen and genoa, and nairobi and genoa. In this case we could copy and paste the code, replacing the relevant information.
porto_aberdeen <- city$porto * city$aberdeen
aberdeen_genoa <- city$aberdeen * city$aberdeen
nairobi_genoa <- city$nairobi * city$genoa
While this approach works, it’s easy to make mistakes. In fact, here we’ve “forgotten” to change aberdeen to genoa in the second line of code when copying and pasting. This is where writing a function comes in handy. If we were to write this as a function, there is only one source of potential error (within the function itself) instead of many copy-pasted lines of code (which we also cut down on by using a function).
In this case, we’re using some fairly trivial code where it’s maybe hard to make a genuine mistake. But what if we increased the complexity?
city$porto * city$aberdeen / city$porto + (city$porto * 10^(city$aberdeen)) -
  city$aberdeen - (city$porto * sqrt(city$aberdeen + 10))
Now imagine having to copy and paste this three times, and in each case having to change the porto and aberdeen variables (especially if we had to do it more than three times).
What we could do instead is generalise our code for x and y columns instead of naming specific cities. If we did this, we could recycle the x * y code. Whenever we wanted to multiply columns together, we would assign a city to either x or y. We’ll assign the results of the multiplication to the objects porto_aberdeen and aberdeen_nairobi so we can come back to them later.
# Assign x and y values
x <- city$porto
y <- city$aberdeen

# Use multiplication code
porto_aberdeen <- x * y

# Assign new x and y values
x <- city$aberdeen
y <- city$nairobi

# Reuse multiplication code
aberdeen_nairobi <- x * y
This is essentially what a function does. OK, down to business: let’s call our new function multiply_columns() and define it with two arguments, x and y. In the function code we simply return the value of x * y using the return() function. Using the return() function is not strictly necessary in this example, as R will automatically return the value of the last expression evaluated in our function. We include it here to make this explicit.
multiply_columns <- function(x, y) {
  return(x * y)
}
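As a quick sketch of the point about return(), the two definitions below differ only in whether return() is written out. The names multiply_columns_explicit and multiply_columns_implicit are hypothetical and used only to keep the two versions apart.

```r
# Version with an explicit return()
multiply_columns_explicit <- function(x, y) {
  return(x * y)
}

# Version relying on R returning the value of the last expression
multiply_columns_implicit <- function(x, y) {
  x * y
}

# Call both with the same small input vectors
explicit_result <- multiply_columns_explicit(c(1, 2, 3), c(4, 5, 6))
implicit_result <- multiply_columns_implicit(c(1, 2, 3), c(4, 5, 6))

print(explicit_result)
identical(explicit_result, implicit_result)
```

Which style to use is largely a matter of taste; writing return() can make the intent clearer in longer functions.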
Now that we’ve defined our function we can use it. Let’s use the function to multiply the columns city$porto and city$aberdeen and assign the result to a new object called porto_aberdeen_func:
porto_aberdeen_func <- multiply_columns(x = city$porto, y = city$aberdeen)
porto_aberdeen_func
## [1] -0.307114712 -0.360959996 3.941625914 -0.289150026 0.492506990
## [6] 0.209054609 -1.347587705 0.276356974 -0.054092126 0.021149013
## [11] 1.043516596 0.045826546 1.565602509 -1.443223617 -0.373649438
## [16] 0.502097231 -0.349780883 0.028953227 1.119563183 -0.017142371
## [21] -1.886050888 -0.074130489 -0.044650165 0.773984761 0.685732826
## [26] 0.022614751 -0.006395326 0.707118262 -0.235188719 0.780752616
## [31] 0.710584101 -0.021210710 -1.772706349 1.295354938 -1.524136493
## [36] 0.018464777 -0.141345982 -0.115425893 -1.475300110 -0.550203379
## [41] -2.763911010 -0.546173142 0.038728029 -0.011646731 2.166084263
## [46] -0.004622742 -0.316633650 -0.287744333 -0.879742181 1.100278000
## [51] -0.144737850 0.033280573 -0.103285334 0.617822177 1.038480349
## [56] 0.121759557 -0.350370195 -0.310088449 0.331685890 -2.794717107
## [61] 2.656723105 0.225779677 -0.382431055 -0.552516413 -0.006441641
## [66] -0.426830897 -0.641953686 -2.459007811 0.009554811 0.666721643
## [71] 0.507328145 -0.148350817 0.594508188 -0.437764373 0.167038047
## [76] 0.009953480 0.225905602 -1.029616325 0.269291709 1.095903828
## [81] -0.096381978 -0.406608120 0.107608825 0.068760938 0.188963732
## [86] -3.819174431 -1.558811489 -0.626827884 -0.178879942 0.196669303
## [91] 0.053352855 0.024180604 0.131036169 0.509112777 0.366264190
## [96] 0.741064070 -0.344254475 -1.094911631 -0.944383229 -1.081067560
If we’re only interested in multiplying city$porto and city$aberdeen, it would be overkill to create a function to do something once. However, the benefit of creating a function is that we now have that function added to our environment which we can use as often as we like. We also have the code to create the function, meaning we can use it in completely new projects, reducing the amount of code that has to be written (and retested) from scratch each time. As a rule of thumb, you should consider writing a function whenever you’ve copied and pasted a block of code more than twice.
To satisfy ourselves that the function has worked properly, we can compare the porto_aberdeen variable with our new variable porto_aberdeen_func using the identical() function. The identical() function tests whether two objects are exactly identical and returns either a TRUE or FALSE value. Use ?identical if you want to know more about this function.
identical(porto_aberdeen, porto_aberdeen_func)
## [1] TRUE
This confirms that the function has produced the same result as when we do the calculation manually. We recommend getting into the habit of checking that the function you’ve created works the way you think it does.
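One caveat is worth a small sketch: identical() is strict, so two numbers that differ only by floating-point rounding are reported as different. The base R function all.equal() compares with a small tolerance instead. The values below are a standard illustration and are not taken from the city data; the object names are arbitrary.

```r
# Two ways of arriving at "0.3" that differ by floating-point rounding
a <- 0.1 + 0.2
b <- 0.3

# identical() demands an exact match
strict_check <- identical(a, b)

# all.equal() allows a small numerical tolerance; wrap it in isTRUE()
# because it returns a description of the difference rather than FALSE
tolerant_check <- isTRUE(all.equal(a, b))

print(strict_check)    # FALSE
print(tolerant_check)  # TRUE
```

In our example identical() is appropriate because both objects come from exactly the same calculation on the same values.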
Now let’s use our multiply_columns() function to multiply columns aberdeen and nairobi. Notice now that argument x is given the value city$aberdeen and y the value city$nairobi.
aberdeen_nairobi_func <- multiply_columns(x = city$aberdeen, y = city$nairobi)
aberdeen_nairobi_func
## [1] NA NA NA NA NA
## [6] NA NA NA NA NA
## [11] -0.503208196 0.054365058 0.722799877 1.094486045 -0.525756681
## [16] 0.714138022 0.456418967 0.090766852 3.741193871 -0.018398977
## [21] 0.837778958 -0.120064066 -0.343139541 -0.021082738 0.133871619
## [26] 0.161163637 -0.072404966 0.170946000 0.117603326 0.271239680
## [31] 0.251949244 0.062692865 1.953966553 -2.050275560 2.141756872
## [36] -0.009278149 -0.218571789 -1.087386929 0.395002279 2.663559943
## [41] 1.124099422 -0.470393778 0.387113170 0.032825952 -1.755396727
## [46] 0.281451272 -0.388776499 -0.469883122 -1.533107339 -0.660747499
## [51] -0.266369025 -0.300856628 0.098916178 -0.254046196 0.388328959
## [56] 2.945564275 -0.185086010 -0.203982037 0.684224309 0.004342962
## [61] 1.374867289 -0.308028852 -0.375770542 -0.061522006 -0.069597272
## [66] 0.115869426 -0.147976197 1.574260867 0.391410214 -1.811687329
## [71] -0.199521652 0.074789249 -0.161706748 -0.930227445 -0.007129059
## [76] -0.173457003 0.408962592 1.933713741 -0.453110727 0.689874837
## [81] -0.004959891 -0.122503185 0.255589758 0.187936204 0.027866297
## [86] 0.777804930 0.434716215 0.301224151 0.370487707 1.107499519
## [91] 0.248733999 -0.092824218 0.122094993 0.253625083 -1.058436924
## [96] 1.349231476 0.037417755 -0.668251216 -0.483922791 -0.265941332
So far so good. All we’ve really done is wrap the code x * y into a function, where we ask the user to specify what their x and y variables are.
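As a small sketch of how those x and y values can be supplied, the calls below pass the arguments by name and by position. The function and data frame are redefined here (with an arbitrary seed and only two columns) so that the sketch runs on its own; by_name and by_position are names chosen for this illustration.

```r
# Redefine the function and a cut-down data frame so the sketch is self-contained
multiply_columns <- function(x, y) {
  return(x * y)
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
city <- data.frame(
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90))
)

# Arguments matched by name
by_name <- multiply_columns(x = city$aberdeen, y = city$nairobi)

# Arguments matched by position: the first value goes to x, the second to y
by_position <- multiply_columns(city$aberdeen, city$nairobi)

identical(by_name, by_position)
```

Naming the arguments takes more typing but makes it harder to mix them up, which matters more for functions where the order changes the result.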
Now let’s add a little complexity. If you look at the output of aberdeen_nairobi_func, some of the calculations have produced NA values. This is because of those NA values we included in nairobi when we created the city data frame. Despite these NA values, the function appeared to have worked, but it gave us no indication that there might be a problem. In such cases we may prefer it to warn us that something is wrong. How can we get the function to let us know when NA values are produced? Here’s one way.
multiply_columns <- function(x, y) {
  temp_var <- x * y
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}
aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
## Warning in multiply_columns(city$aberdeen, city$nairobi): The function has
## produced NAs
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)
The core of our function is still the same. We still have x * y, but we’ve now got an extra six lines of code. Namely, we’ve included some conditional statements, if and else, to test whether any NAs have been produced and, if they have, to display a warning message to the user. The next section of this chapter will explain how these work and how to use them.
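As one possible variation, and only a sketch, the warning could also report how many NAs were produced. The function name multiply_columns_count and the object names below are hypothetical and not part of the chapter’s code. tryCatch() is used here only to capture the text of the warning so that it can be inspected.

```r
# Hypothetical variant: warn with the number of NAs, then return the result
multiply_columns_count <- function(x, y) {
  temp_var <- x * y
  n_missing <- sum(is.na(temp_var))
  if (n_missing > 0) {
    warning("The function has produced ", n_missing, " NAs")
  }
  temp_var
}

# Small made-up input vectors with missing values
x_values <- c(1, NA, 3)
y_values <- c(2, 2, NA)

# Capture the warning message instead of the result
caught_message <- tryCatch(
  multiply_columns_count(x_values, y_values),
  warning = function(w) conditionMessage(w)
)
print(caught_message)

# A normal call still returns the values (and raises the warning)
result <- suppressWarnings(multiply_columns_count(x_values, y_values))
print(result)
```

Because the value is returned in both cases, this version needs no else branch; whether that reads more clearly than the if and else version above is a matter of preference.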