
Open Source Statistics with R

Writing Faster R Code

In the sections above we've introduced a number of features of R and have begun the journey to becoming a proficient and productive user of the language. In the remaining sections we'll change tack and focus on a question commonly asked by those beginning to use R in anger: "My R code is slow. How can I speed it up?" In this section we'll consider the related tasks of finding which parts of your R code are responsible for the majority of the run-time, and what you can do about them.

Profiling & Timing

In order to remain productive (and sane, and keep a social life...), it is essential that we first identify which portions of your R code are responsible for the majority of the run-time. We could spend ages optimising a portion that we think is running slowly, but computers have a gift(!) for constantly surprising us: if that portion of your program accounted for, say, only 10% of the run-time, then you would have sweated for very little useful gain.

The simplest method of investigation is to time the application of a function:

system.time(some.function())
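
For instance (an illustrative call added here, rather than one taken from the original page), we could time the sort of a million random numbers:

system.time(sorted <- sort(runif(1000000)))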

You can get a more detailed analysis of a block of code using the built-in R profiler. The general pattern of invocation is:

Rprof(filename="~/rprof.out")
# Do some work
Rprof()
summaryRprof(filename="~/rprof.out")

For example, here's an R script, profile.r:

Rprof(filename="~/rprof.out")
# Create a list of 10 vectors, each containing 100,000 random numbers
data <- lapply(1:10, function(x) {rnorm(100000)})
# Map a function over the list
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof()
summaryRprof(filename="~/rprof.out")

I ran this script by typing:

R CMD BATCH profile.r

In the output file, profile.r.Rout, I found the following breakdown:

               self.time self.pct total.time total.pct
"simpleLoess"       4.84    88.00       5.10     92.73
"rnorm"             0.22     4.00       0.22      4.00
"loess.smooth"      0.18     3.27       5.28     96.00

The profile tells us that the function simpleLoess takes 88% of the run-time, whereas rnorm accounts for only 4%. (self.time is the time spent in a function itself, while total.time also includes time spent in the functions it calls, which is why loess.smooth shows a small self.time but a large total.time.)

Preallocation of Memory

As with other scripting languages, such as MATLAB, the simplest way to speed up your R code is to pre-allocate storage for variables whenever possible. To see the benefit of this, consider the following two functions:

> f1 <- function() {
+ v <- c()
+ for (i in 1:30000)
+   v[i] <- i^2
+ }

and:

> f2 <- function() {
+ v <- c(NA)
+ length(v) <- 30000
+ for (i in 1:30000)
+   v[i] <- i^2
+ }

Timing calls to each of them shows that pre-allocating the memory gives a whopping ~30x speed-up. Your mileage will vary depending upon the details of your code.

> system.time(f1())
   user  system elapsed 
  1.720   0.040   1.762
> system.time(f2())
   user  system elapsed 
  0.052   0.000   0.05
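
A common alternative to growing the vector to its final length via length<- is to create it at that length in a single call, e.g. with numeric(). A small variant sketch (f3 is not from the original page, but behaves like f2):

f3 <- function() {
  # pre-allocate a vector of 30,000 zeros in one call
  v <- numeric(30000)
  for (i in 1:30000)
    v[i] <- i^2
}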

Vectorised Operations

The other principal method for speeding up your R code is to eliminate loops whenever you can. Many functions and operators in R accept vectors as input, rather than just single values, and this may allow you to avoid a loop altogether. The examples in the previous section used for loops to step through a vector, squaring each element. However, you can achieve the same result far more quickly by passing the vector en masse to the exponentiation operator:

> system.time(v <- (1:1000000)^2)
   user  system elapsed 
  0.024   0.004   0.026

Here we've been able to square 1,000,000 items in half the time it took to process 30,000!
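
The same principle applies to many other operations. For instance (an illustration added here, not from the original page), summing the rows of a matrix with the vectorised rowSums() is much faster than looping over the rows with apply():

# a 1000 x 1000 matrix of random numbers
m <- matrix(rnorm(1000000), nrow=1000)
system.time(s1 <- apply(m, 1, sum))   # apply sum() to each row in turn
system.time(s2 <- rowSums(m))         # vectorised equivalent, implemented in C
all.equal(s1, s2)                     # both give the same answer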

Calling Functions Written in a Compiled Language (e.g. C or Fortran)

Another way to get more speed is to rewrite the portions of your R code that are found to be slow in a compiled language, such as C or Fortran, and to call the compiled routines from R.
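
As one possible illustration (assuming the Rcpp package is installed; Rcpp is not covered on this page), a small C++ function can be compiled and called from R like this:

library(Rcpp)
# compile a short C++ function and expose it to R as squareC()
cppFunction('
NumericVector squareC(NumericVector x) {
  int n = x.size();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) out[i] = x[i] * x[i];
  return out;
}')
squareC(1:5)   # 1 4 9 16 25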

R and HPC

If you've profiled your code and tried all that you can to speed it up, as described in the previous sections, you might be interested in the various initiatives that exist to run R on high-performance computers, such as BlueCrystal.

As we will see in the following examples, the general approach to running R in parallel is to arrange your task so that a function is applied to a list of inputs, and then to split that list over several CPU cores or cluster worker nodes.

Multicore

The multicore package allows us to make use of several CPU cores within a single machine. Note, however, that the package does not work on MS Windows computers.

As an example, let's look at the package's mclapply function, a multicore equivalent of R's built-in list mapper, lapply. I saved the following commands into an R script called multicore.r:

library(multicore)
# how many cores are present?
multicore:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix.  First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))

I then used the following submission script to run it on BlueCrystal Phase 2 (BCp2):

#!/bin/bash 

#PBS -l nodes=1:ppn=8,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH multicore.r

After the job had run, I got the following output in the file multicore.r.Rout:

> library(multicore)
> # how many cores are present?
> multicore:::detectCores()
[1] 8
> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix.  First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
   user  system elapsed 
  0.674   0.007   0.749 
> # .. and secondly in parallel (using multicore, within a node)
> system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
   user  system elapsed 
  0.301   0.074   0.113 
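
Note that mclapply chooses the number of worker processes for you. If you want to control it explicitly, you can pass the mc.cores argument (a sketch, not part of the run above):

system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}, mc.cores=4))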

Rmpi

The Rmpi package allows us to create and use cohorts of message passing processes from within R. It does so by providing an interface to the MPI (Message Passing Interface) library.

In order to use the Rmpi package on BCp2, you will need the ofed/openmpi/gcc/64/1.4.2-qlc module loaded.

Here's a short example that I saved as Rmpi.r:

library(Rmpi)
# spawn as many slaves as possible
mpi.spawn.Rslaves()
mpi.remote.exec(mpi.get.processor.name())
mpi.remote.exec(runif(1))
mpi.close.Rslaves()
mpi.quit()

I submitted the job to BCp2 using the following submission script:

#!/bin/bash 

#PBS -l nodes=4:ppn=1,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Create a machine file (used for multi-node jobs)
cat $PBS_NODEFILE > machine.file.$PBS_JOBID

#! Disable PSM on the QLogic HCAs
export OMPI_MCA_mtl=^psm

#! Run the R script
mpirun -np 1 -machinefile machine.file.$PBS_JOBID R CMD BATCH Rmpi.r

and got the following output:

> library(Rmpi)
> # spawn as many slaves as possible
> mpi.spawn.Rslaves()
        4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: u03n074 
slave1 (rank 1, comm 1) of size 5 is running on: u03n098 
slave2 (rank 2, comm 1) of size 5 is running on: u04n029 
slave3 (rank 3, comm 1) of size 5 is running on: u04n030 
slave4 (rank 4, comm 1) of size 5 is running on: u03n074 
> mpi.remote.exec(mpi.get.processor.name())
$slave1
[1] "u03n098"

$slave2
[1] "u04n029"

$slave3
[1] "u04n030"

$slave4
[1] "u03n074"

> mpi.remote.exec(runif(1))
         X1        X2        X3        X4
1 0.5154871 0.5154871 0.5154871 0.5154871
> mpi.close.Rslaves()
[1] 1
> mpi.quit()
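
One thing worth noticing in the output above: all four slaves returned exactly the same value from runif, because each slave starts from the same random number state. If your task needs independent random numbers on each worker, Rmpi provides mpi.setup.rngstream() to give every slave its own stream. A minimal sketch (not part of the original run):

library(Rmpi)
mpi.spawn.Rslaves()
# give every slave an independent L'Ecuyer RNG stream
mpi.setup.rngstream(iseed=123)
mpi.remote.exec(runif(1))   # the slaves should now return different values
mpi.close.Rslaves()
mpi.quit()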

Snow

Calling MPI routines from within R may be too low level for many people to use comfortably. Happily, the snow package provides a higher level abstraction for distributed memory programming from within R.

Here's my example program, which I saved as snow.r:

library(snow)
# request a cluster of 3 worker nodes
cl <- makeCluster(3)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix.  First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using snow, across a cluster of workers)
system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)

I ran it on BCp2 using the same submission script given for Rmpi, save for changing Rmpi.r to snow.r. The output was:

> library(snow)
> # request a cluster of 3 worker nodes
> cl <- makeCluster(3)
Loading required package: Rmpi
        3 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
 nodename   machine 
"u01n105"  "x86_64" 

[[2]]
 nodename   machine 
"u02n014"  "x86_64" 

[[3]]
 nodename   machine 
"u03n098"  "x86_64" 

> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix.  First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
   user  system elapsed 
  0.711   0.001   0.715 
> # .. and secondly in parallel (using snow, across a cluster of workers)
> system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
   user  system elapsed 
  0.259   0.001   0.260 
> stopCluster(cl)
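
snow also provides other mappers of the same shape, for example the load-balancing variant clusterApplyLB, which hands list elements to workers as they become free. A sketch (to be run before stopCluster, and not part of the original run):

# load-balanced version of the parallel map above
system.time(x <- clusterApplyLB(cl, data, function(x) {loess.smooth(x,x)}))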

Parallel

The parallel package is an amalgamation of functionality from the multicore and snow packages. Unlike the multicore package, parallel is available on MS Windows, although there the fork-based functions (such as mclapply) run serially and parallelism is instead provided by the snow-derived socket-cluster functions.

A trivial translation of our previous multicore example is:

library(parallel)
# how many cores are present?
parallel::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix.  First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using mclapply, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
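
The snow-derived half of parallel is also available. On a single machine (including MS Windows, where mclapply runs serially) you can create a local socket cluster and use parLapply; a minimal sketch, not part of the original script:

library(parallel)
# a local PSOCK (socket) cluster of 4 workers
cl <- makeCluster(4)
system.time(x <- parLapply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)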

I have not been able to get a distributed memory cluster working on BCp2 using the parallel package.

Further Reading