SourceWiki - User contributions [en]

DataScience

2015-11-05T10:45:45Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

[[Media:intro-to-data-science-nov15.pdf|Intro to Data Science]]

DataScience

2015-11-05T10:45:15Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

[[File:intro-to-data-science-nov15.pdf|Intro to Data Science]]

DataScience

2015-11-05T10:44:17Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

[[:File:intro-to-data-science-nov15.pdf|Intro to Data Science]]

DataScience

2015-11-05T10:43:34Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

[[:File:intro-to-data-science-nov15.pdf]]

File:Intro-to-data-science-nov15.pdf

2015-11-05T10:38:01Z

GethinWilliams:

DataScience

2015-01-05T13:49:43Z

GethinWilliams: /* Topics would include */

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

=Introduction=

[[Image:Data_Science_VD.png|400px|thumbnail|center|Drew Conway's Venn diagram of data science]]

=Topics would include=

* What is relevant for the UoB?
* y=f(x) relationships: classifiers & regression
** Examples: Linear & logistic regression, K-Nearest Neighbours, Decision Trees, Neural Networks etc.
* Data topics:
** Training, test & validation data.
** Sources of data, e.g. web scraping.
** Exploratory Data Analysis (EDA).
** Cleaning & munging data (90% of your effort?). Useful Linux tools.
** Feature selection.
* Model selection & training topics:
** Algorithms that scale.
** Supervised vs. Unsupervised training.
** Overfitting.
** The curse of dimensionality.
* Programming Skills:
** "Clean code shows clarity of mind."
** Languages: R? Python? Others?
** Version control.
** Build systems.
** Testing.
** Scripting and automation.

DataScience

2015-01-05T13:29:32Z

GethinWilliams: /* Introduction */


DataScience

2015-01-05T13:28:51Z

GethinWilliams: /* Introduction */


DataScience

2015-01-05T13:17:02Z

GethinWilliams: /* Introduction */

File:Data Science VD.png

2015-01-05T13:16:06Z

GethinWilliams:

DataScience

2015-01-05T13:15:37Z

GethinWilliams: /* Introduction */

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

=Introduction=

[[Image:R-lm(cars)-abline.png|400px|thumbnail|center|linear regression of stopping distance against speed from the built-in data set, cars]]

DataScience

2015-01-05T13:14:39Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''What would a course on Data Science look like?'''

=Introduction=

DataScience

2015-01-05T13:13:51Z

GethinWilliams: Created page with 'category:Pragmatic Programming '''Open Source Statistics with R''' =Introduction='

[[category:Pragmatic Programming]]
'''Open Source Statistics with R'''

=Introduction=

R2

2014-12-12T12:23:38Z

GethinWilliams:

[[category:Pragmatic Programming]]
'''Open Source Statistics with R'''

=Submitting R jobs on BlueCrystal=

If you have an R script called, for example, '''myscript.r''', you can run it on BlueCrystal using the following submission script:

<source>
#!/bin/bash

#PBS -l nodes=1:ppn=1,walltime=01:00:00

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH myscript.r
</source>

Note that this script runs an R script on a single processor. That processor is requested for 1 hour in the above example (walltime=01:00:00). You can change this by modifying the 'walltime' resource request. (See the later examples if you would like to use more than one processor.) The output of the job will be placed in a file called '''myscript.r.Rout'''.

If the submission script is saved as '''r-submit''', then you would submit the job by typing: '''qsub r-submit'''.
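If you expect to vary the walltime or processor count often, the submission script can be generated rather than edited by hand. The following is only a minimal sketch: the function name '''make_r_submit''' and its argument order are hypothetical, and it assumes the same PBS directives as the script above.

```shell
#!/bin/bash
# Hypothetical helper: print a PBS submission script for an R script.
# Usage: make_r_submit SCRIPT.r [WALLTIME] [PPN] > r-submit
make_r_submit() {
    local script="$1"
    local walltime="${2:-01:00:00}"   # default: 1 hour, as in the example above
    local ppn="${3:-1}"               # default: a single processor

    # \$PBS_O_WORKDIR is escaped so that it is expanded when the job runs,
    # not when the submission script is generated.
    cat <<EOF
#!/bin/bash

#PBS -l nodes=1:ppn=${ppn},walltime=${walltime}

#! change the working directory (default is home directory)
cd \$PBS_O_WORKDIR

#! Run the R script
R CMD BATCH ${script}
EOF
}
```

For example, '''make_r_submit myscript.r 02:30:00 > r-submit''' would write a script requesting two and a half hours, which you could then submit with '''qsub r-submit''' as before.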

=Writing Faster R Code=

In the above sections we've introduced a number of features of R and have begun the journey to becoming a proficient and productive user of the language. In the remaining sections, we'll change tack and focus on a question commonly asked by those beginning to use R in anger: '''"My R code is slow. How can I speed it up?"''' In this section we'll consider the related tasks of finding which parts of your R code are responsible for the majority of the run-time and deciding what you can do about it.

==Profiling & Timing==

In order to remain productive (and sane, and have a social life...), it is essential that we first identify which portions of your R code are responsible for the majority of the run-time. We could spend ages optimising a portion that we ''think'' may be running slowly, but computers have the gift(!) of constantly surprising us, and if that portion of your program accounts for, say, only 10% of the run-time, then even removing it entirely could not make the program more than about 10% faster, and you will have sweated for very little gain.

The simplest method of investigation is to time a call to a function:

<source>
system.time(some.function())
</source>

You can get a more detailed analysis of a block of code using the built-in R profiler. The general pattern of invocation is:

<source>
Rprof(filename="~/rprof.out")
# Do some work
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

For example, here's an R script, '''profile.r''':
<source>
Rprof(filename="~/rprof.out")
# Create a list of 10 vectors, each holding 100,000 random numbers
data <- lapply(1:10, function(x) {rnorm(100000)})
# Map a smoothing function over each vector in the list
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

Which I ran by typing:

<pre>
R CMD BATCH profile.r
</pre>

In the output file, '''profile.r.Rout''', I found the following breakdown:

<pre>
               self.time self.pct total.time total.pct
"simpleLoess"       4.84    88.00       5.10     92.73
"rnorm"             0.22     4.00       0.22      4.00
"loess.smooth"      0.18     3.27       5.28     96.00
</pre>

The profile tells us that the function '''simpleLoess''' takes 88% of the run-time, whereas '''rnorm''' takes only 4%.

==Preallocation of Memory==

As with other scripting languages, such as MATLAB, the simplest method that you can use to speed up your R code is to pre-allocate the storage for variables whenever possible. To see the benefits of this, consider the following two functions:

<source>
> f1 <- function() {
+ v <- c()
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

and:

<source>
> f2 <- function() {
+ v <- c(NA)
+ length(v) <- 30000
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

Timing calls to each of them shows that the pre-allocation of memory gives a whopping ~'''x30 speed-up'''. Your mileage will vary depending upon the details of your code.

<source>
> system.time(f1())
user system elapsed
1.720 0.040 1.762
> system.time(f2())
user system elapsed
0.052 0.000 0.05
</source>

==Vectorised Operations==

The other principal method for speeding up your R code is to eliminate loops whenever you can. Many functions and operators in R will accept vectors as input, rather than just single values, and this may allow you to avoid a loop altogether. The examples in the previous section used for loops to step through a vector, squaring each element. However, you can achieve the same result far more quickly by passing the vector ''en masse'' to the exponentiation operator:

<source>
> system.time(v <- (1:1000000)^2)
user system elapsed
0.024 0.004 0.026
</source>

Here we've been able to square 1,000,000 items in about half the time that the pre-allocated loop took to process 30,000!

==Calling Functions Written in a Compiled Language (e.g. C or Fortran)==

Another way to get more speed is to outsource portions of R code that are found to be slow to a compiled language, such as C or Fortran. A good starting point on this topic is:

* http://mazamascience.com/WorkingWithData/?p=1067

=R and HPC=

If you've profiled your code and tried all that you can to speed it up, as described in the previous section, you might be interested in the various initiatives that exist to run R on high performance computers, such as BlueCrystal:

* http://cran.r-project.org/web/views/HighPerformanceComputing.html

As we will see in the following examples, the general approach to running R in parallel is to arrange your task so that a function is applied to a list of inputs, and then to split the list over several CPU cores or cluster worker nodes.

==Multicore==

The '''multicore''' package allows us to make use of several CPU cores within a single machine. Note, however, that the package does not work on MS Windows computers.

As an example, let's look at the use of the package's '''mclapply''' function, a multicore equivalent of R's built-in list apply mapper, '''lapply'''. I saved the following commands into an R script called '''multicore.r''':
<source>
library(multicore)
# how many cores are present?
multicore:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

And used the following submission script to run it on BlueCrystal phase 2 (BCp2):
<source>
#!/bin/bash

#PBS -l nodes=1:ppn=8,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH multicore.r
</source>

After the job had run, I got the following output in the file '''multicore.r.Rout''':
<pre>
> library(multicore)
> # how many cores are present?
> multicore:::detectCores()
[1] 8
> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.674 0.007 0.749
> # .. and secondly in parallel (using multicore, within a node)
> system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.301 0.074 0.113
</pre>

==Rmpi==

The '''Rmpi''' package allows us to create and use cohorts of message passing processes from within R. It does so by providing an interface to the MPI (Message Passing Interface) library.

In order to use the Rmpi package on BCp2, you will need the '''ofed/openmpi/gcc/64/1.4.2-qlc''' module loaded.

Here's a short example that I saved as '''Rmpi.r''':
<source>
library(Rmpi)
# spawn as many slaves as possible
mpi.spawn.Rslaves()
mpi.remote.exec(mpi.get.processor.name())
mpi.remote.exec(runif(1))
mpi.close.Rslaves()
mpi.quit()
</source>

I submitted the job to BCp2 using the following submission script:
<source>
#!/bin/bash

#PBS -l nodes=4:ppn=1,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Create a machine file (used for multi-node jobs)
cat $PBS_NODEFILE > machine.file.$PBS_JOBID

#! Disable PSM on the QLogic HCAs
export OMPI_MCA_mtl=^psm

#! Run the R script
mpirun -np 1 -machinefile machine.file.$PBS_JOBID R CMD BATCH Rmpi.r
</source>

and got the following output:
<pre>
> library(Rmpi)
> # spawn as many slaves as possible
> mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: u03n074
slave1 (rank 1, comm 1) of size 5 is running on: u03n098
slave2 (rank 2, comm 1) of size 5 is running on: u04n029
slave3 (rank 3, comm 1) of size 5 is running on: u04n030
slave4 (rank 4, comm 1) of size 5 is running on: u03n074
> mpi.remote.exec(mpi.get.processor.name())
$slave1
[1] "u03n098"

$slave2
[1] "u04n029"

$slave3
[1] "u04n030"

$slave4
[1] "u03n074"

> mpi.remote.exec(runif(1))
X1 X2 X3 X4
1 0.5154871 0.5154871 0.5154871 0.5154871
> mpi.close.Rslaves()
[1] 1
> mpi.quit()
</pre>

==Snow==

Calling MPI routines from within R may be too low level for many people to use comfortably. Happily, the '''snow''' package provides a higher level abstraction for distributed memory programming from within R.

Here's my example program that I saved as '''snow.r''':
<source>
library(snow)
# request a cluster of 3 worker nodes
cl <- makeCluster(3)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using snow, across a cluster of workers)
system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)
</source>

I ran it on BCp2 using the same submission script given for Rmpi, save for changing Rmpi.r to snow.r. The output was:

<pre>
> library(snow)
> # request a cluster of 3 worker nodes
> cl <- makeCluster(3)
Loading required package: Rmpi
3 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename machine
"u01n105" "x86_64"

[[2]]
nodename machine
"u02n014" "x86_64"

[[3]]
nodename machine
"u03n098" "x86_64"

> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.711 0.001 0.715
> # .. and secondly in parallel (using snow, across a cluster of workers)
> system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.259 0.001 0.260
> stopCluster(cl)
</pre>

==Parallel==

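The '''parallel''' package is an amalgamation of functionality from the multicore and snow packages. Unlike the multicore package, it is available on MS Windows. Note, however, that '''mclapply''' relies on forking, which Windows does not support, so parallel runs on Windows need the snow-style cluster functions (e.g. '''makeCluster''' and '''parLapply''') instead.

A trivial translation of our previous multicore example is: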
<source>
library(parallel)
# how many cores are present?
detectCores()
# Create a list of 10 vectors, each holding 10,000 random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the list. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using mclapply from parallel, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

I have not been able to get a distributed memory cluster working on BCp2 using the parallel package.

=Further Reading=

* [http://shop.oreilly.com/product/9780596801717.do R in a Nutshell]
* [http://shop.oreilly.com/product/0636920021421.do Parallel R]

R1

2014-12-10T17:06:03Z

GethinWilliams: /* Packages */

R1

2014-12-10T17:03:49Z

GethinWilliams: /* Packages */

Subversion

2014-11-07T11:24:03Z

GethinWilliams: /* Modifying your Working Copy */

[[Category:Pragmatic Programming]]

=Introduction=

In this workshop, we'll look at using a particular Version Control System (VCS) called Subversion (often abbreviated to SVN). Before getting into the nitty-gritty of using SVN, we'll pause to consider the motivations for adopting version control and also the key concepts that are common to most available systems.

==Why is Version Control useful?==

OK, here's the sales pitch:

* It '''removes confusion''' about versions. For example, you will no longer have to keep inventing names for different versions of essentially the same document e.g. blah.old, blah.sav, blah.older, blah.newest2 (look familiar?).
* It makes '''collaborative working''' easier. Version control assists in coordination as it removes any confusion about versions, highlights conflicts, allows the use of independent working copies, records log messages and much more besides.
* It makes '''distributing your code''' easier. A version control repository can be visible to the world (often as a URL). However, using some highly customisable access controls, you can arrange for some (perhaps anyone) to download your project while also specifying that only a select few may be trusted to upload.
* It makes '''reproducing experiments''' easier. The ability to reproduce an experiment is a ''key characteristic of science''. However, all too often, in the digital age, people are unable to run the same version of a model that they ran six months ago. With version control, you can always access any previous version of your model.
* It aids '''disaster recovery'''. Your computer is fried? No problem! Just check out your code onto another machine and you're working productively again in minutes.

=Version Control Concepts=

A picture can be worth a thousand words, so let's try illustrating some of the key version control concepts, before wading into acres of text:

[[Image:Svn-cartoon.jpg|700px|thumbnail|center|Files stored in a repository; a checkout; modification and commit. All versions recorded.]]
[[Image:File-tree.jpg|400px|thumbnail|center|A checkout adds a copy of the files held in the repository to your local computer.]]



Subversion is a centralised version control system. Centralised version control means that a copy of your project is held in a central location called the '''repository''' and the subversion server logs all operations happening on the repository: every time something is changed in the repository, the server logs the '''time and date''', the '''changes''', the '''author''' as well as a '''log message'''. The server can also be configured to control access, allowing some people actions which are disallowed for others. For instance, in the practical, the server allows anonymous read-only access but only a selected number of people can change things.

All the operations described above (logging and authentication) happen on the server. However, the server is only accessible directly to system administrators. To interact with the server, a user makes use of a subversion '''client'''. Some of you might already know about some graphical subversion clients such as TortoiseSVN (see the screen grab below). This practical will show how the command line client can be used. The subversion client can be used to (1) ask information from and (2) send information to the server. The client can also be used to get information about your '''working copy''' which is the local copy of the project that resides on your filespace. You can use the client to ask questions such as:
* which files have I modified since I last synchronised with the server?
* when was that file last modified?
* who, inadvertently, created a bug at line 18 in file foo.c?
* what has changed in that file?

<gallery widths=500px heights=350px perrow=2>
File:Tsvn_switch1.png|TortoiseSVN for MS Windows
File:Svn-cli.png|svn command line client for Linux
</gallery>

=Acquiring a Repository=

For the purposes of this practical, you can get yourself a repository from one of the hosting sites that can be found out in the cloud. Or you could use another repository that you have access to--perhaps hosted here in the University, some other portion of UK academia or elsewhere. I've prepared the examples using a repository accessed via the GitHub website (https://github.com). '''Note, however, that to obtain a free repository on GitHub, you must agree to it being readable by anyone.'''

'''NB With that in mind, you may want to have a think about the right home for any of your intellectual property.'''

OK, let's assume that you are happy to work with a GitHub hosted repository--at least for your initial steps learning about version control and subversion. (A natty feature of GitHub repositories is that they can be used with both Subversion and Git VCS.)

Registering and creating a repository is easy; just follow the instructions on the webpage:

[[Image:Github.png|500px|thumbnail|center|The GitHub web interface]]

Be sure to check the box: '''Initialize this repository with a README'''

In addition to the command line client that we will describe in the following sections, you can manage your repository, view and even edit your files through the GitHub website:

[[Image:Github-ggdagw-test.png|500px|thumbnail|center|A test repository]]

=svn: The Subversion Command Line Client=
The subversion command line client is called '''svn'''. To execute a subversion command, simply type:
<pre>
$ svn command arguments
</pre>

Some commands can also use options which are given with dashes:
<pre>
$ svn command arguments --option optionvalue
</pre>
Subversion provides extensive help about the commands to use. To get help for a particular subversion command, simply use:
<pre>
$ svn help command
</pre>

=Checkout a Working Copy=

Now that you have access to a repository, let's create a '''working copy''' of the files in the repository. To do this we use the '''svn checkout''' command, or '''svn co''', for short:

<pre>
gethin@gethin-desktop:~$ svn co https://github.com/ggdagw/test ./test
A test/branches
A test/trunk
A test/trunk/README.md
Checked out revision 1.
</pre>

This command gets a copy of the content at the URL and places it in a new directory called test. The letter "A" simply means that these files have been added to your working copy. You'll also notice two subdirectories called '''trunk''' and '''branches'''. This pattern follows a convention. Usually, subversion repositories are organised so that the main strand of development is in the ''trunk''. Sometimes it is useful to store variants of the trunk version (more of that later) and the ''branches'' folder exists to accommodate those. (This is purely convention as far as subversion is concerned, however, and "trunk" and "branches" are merely two folders under the URL.)

The content that you saw through your browser is now in your own file space. You may also notice hidden directories called ".svn". '''It is very important that you do not touch these directories.'''

<pre>
gethin@gethin-desktop:~$ cd test/
gethin@gethin-desktop:~/test$ ls -al
total 28
drwxr-xr-x 5 gethin gethin 4096 2013-07-26 12:29 .
drwxr-xr-x 117 gethin gethin 12288 2013-07-26 12:29 ..
drwxr-xr-x 3 gethin gethin 4096 2013-07-26 12:29 branches
drwxr-xr-x 6 gethin gethin 4096 2013-07-26 12:29 .svn
drwxr-xr-x 3 gethin gethin 4096 2013-07-26 12:29 trunk
</pre>

=Modifying your Working Copy=

Right oh. The working copy is yours to work with so let's go ahead and modify the README.md file.

<pre>
gethin@gethin-desktop:~/test$ cd trunk
gethin@gethin-desktop:~/test/trunk$ ls
README.md
gethin@gethin-desktop:~/test/trunk$ emacs -nw README.md
</pre>

(See, e.g. https://www.acrc.bris.ac.uk/acrc/pdf/emacs.pdf, if you'd like to use the emacs text editor, but are new to it.)

To see what files you have modified, you ask the client for the status of your working copy:
<pre>
gethin@gethin-desktop:~/test/trunk$ svn status
M README.md
</pre>

The status shows the letter "M" for README.md, indicating it has been modified.

Note that this status only shows the things that have changed in '''your''' working copy. It does not show any changes made by others, either in the repository or in their own working copies.

You can also add a new file. Let's add a file called '''foo.txt''':

<pre>
gethin@gethin-desktop:~/test/trunk$ touch foo.txt
gethin@gethin-desktop:~/test/trunk$ svn status
? foo.txt
M README.md
</pre>

The question mark shows that the subversion client knows nothing about the new file (i.e. it is not currently under the auspices of version control). By default, svn will ignore new files. To indicate that a new file should be versioned, use the '''add''' command:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn add foo.txt
A foo.txt
gethin@gethin-desktop:~/test/trunk$ svn status
A foo.txt
M README.md
</pre>

The letter "A" is used to indicate an addition.

=Recording Changes in the Repository=

Sending changes to the repository is called a '''commit'''. Here's the command I used to send our two local modifications:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn commit --message "Added text to README.md and added the empty file foo.txt"
Sending trunk/README.md
Adding trunk/foo.txt
Transmitting file data ..
Committed revision 2.
</pre>

where the '''--message''', or '''-m''' for short, allows us to write a log message inline.

Notice the revision number. These numbers encode the state of the whole repository at a given juncture and are the passport to retrieving earlier versions of your project. As you commit future changes to your repository, your revision numbers will steadily increase.

Sometimes, you want a long message to go with a commit. To do this, simply execute the commit without the --message option. A text editor will then open in which you can write the message; once you save and exit, the commit goes ahead. Note that svn uses the editor indicated by the '''EDITOR''' environment variable. The editor often defaults to vi if this variable is undefined. If you are an emacs fan, set the variable first:

<pre>
$ export EDITOR=emacs
</pre>

(Note that you can use ''':q!''' to get out of vi, if you started it by accident. You could also set EDITOR=nano or gedit etc. You can also use the SVN_EDITOR environment variable.)
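Since more than one variable is involved, it can be handy to check which editor you are likely to get. As far as I am aware, the svn client consults the '''--editor-cmd''' option first, then '''SVN_EDITOR''', then the '''editor-cmd''' setting in its runtime configuration, then '''VISUAL''' and finally '''EDITOR'''. The sketch below mimics only the environment variable part of that lookup. The function name '''svn_editor_guess''' is hypothetical, and the final fallback to vi is an assumption based on the common default mentioned above.

```shell
#!/bin/bash
# Hypothetical helper: report which editor svn would most likely launch,
# considering only environment variables (not --editor-cmd or the config file).
svn_editor_guess() {
    if [ -n "${SVN_EDITOR:-}" ]; then
        echo "$SVN_EDITOR"
    elif [ -n "${VISUAL:-}" ]; then
        echo "$VISUAL"
    elif [ -n "${EDITOR:-}" ]; then
        echo "$EDITOR"
    else
        echo "vi"    # assumed fallback; your build of svn may differ
    fi
}
```

Typing '''svn_editor_guess''' before a commit shows the name that would be picked up from your current environment.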

=Revert: Your "Get-Out-of-Jail Card"=

Just as we can add files, we can '''delete''', for example:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn delete foo.txt
D foo.txt
gethin@gethin-desktop:~/test/trunk$ ls
README.md
</pre>

The letter "D" indicates deletions and we see from typing 'ls' that the file has gone.

Subversion allows you to '''revert''' changes when you have made an error. Let's assume that 'foo.txt' was deleted in error. Fear not, you can get it back with:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn revert foo.txt
Reverted 'foo.txt'
gethin@gethin-desktop:~/test/trunk$ ls
foo.txt README.md
gethin@gethin-desktop:~/test/trunk$ svn status
gethin@gethin-desktop:~/test/trunk$
</pre>

and foo.txt is back! A silent return from svn status (svn stat for short) indicates that there are no pending modifications in your working copy. Put another way, it exactly matches the revision of the project that you last checked out, updated to or committed.

=Updating your Working Copy=

You can '''update''' your working copy to synchronise it with the latest version (known as the HEAD) held in the repository. The general form of the update command is:

<pre>
$ svn update
... <- list of files that have been added/modified
At revision X.
</pre>

If I update my working copy now:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn update
At revision 2.
</pre>

we see an empty list of files--i.e. there is nothing to update and my working copy perfectly matches the HEAD of the repository.

That needn't be the case, however. Let's imagine that you and a collaborator in Japan have access to the repository. You obviously work independently and, for good measure, in different time zones. Your collaborator may have committed some changes to the repository since you were last in front of a computer. That being the case, an update will bring all of her changes to your working copy.

A similar situation can arise if you are simultaneously operating two checkouts. Perhaps one at work and another on your home computer. If you had done some work at home yesterday evening and committed the fruits of your labours, an update will bring your work copy in line.

You can even update your working copy if you have some local modifications pending. In that situation, SVN will attempt to merge your changes with those made by your collaborator. If you both have edited the same line in a file, a '''conflict''' is flagged. More on that possibility later.

With all the foregoing in mind, '''status''', '''commit''' and '''update''' will probably be your most widely used commands:
# '''update''' regularly to bring in other people's work
# '''status''' to make sure all is well
# '''commit''' frequently so that you can always recover a version you care about

=Investigating Changes=

This section highlights some commands that can be used to boost productivity.

==log==
To get a log of what happened in the repository, use the log command. To see the files that have been modified as well as the log messages, use the --verbose option:
<pre>
gethin@gethin-desktop:~/test/trunk$ svn log --verbose
------------------------------------------------------------------------
r2 | ggdagw | 2013-07-26 14:58:08 +0100 (Fri, 26 Jul 2013) | 2 lines
Changed paths:
M /trunk/README.md
A /trunk/foo.txt

Added text to README.md and added the empty file foo.txt

------------------------------------------------------------------------
r1 | ggdagw | 2013-07-26 12:28:32 +0100 (Fri, 26 Jul 2013) | 2 lines
Changed paths:
A /branches
A /trunk
A /trunk/README.md

Initial commit

------------------------------------------------------------------------
</pre>

You can also invoke the log command on a particular file/path and provide a range of revisions.
For instance to see which commits affected file1 between revisions 4 and 6, one could use:
<pre>
$ svn log --verbose --revision 4:6 file1
... <- log output
</pre>

==diff==
After you have modified something, it can be handy to highlight what you've done. You can do this using the '''diff''' command.

For instance, add some text to 'README.md' and use '''diff''' to see what you have done.

<pre>
gethin@gethin-desktop:~/test/trunk$ svn stat
M README.md
gethin@gethin-desktop:~/test/trunk$ svn diff README.md
Index: README.md
===================================================================
--- README.md (revision 2)
+++ README.md (working copy)
@@ -7,3 +7,16 @@
------------

This file is formatted in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) and will be automatically rendered on your GitHub webpage.
+
+Here is an itemised list:
+* bread
+* butter
+* marmalade
+
+A Table:
+
+| Name | Colour | Price |
+| ------- |:-------------:|--------------:|
+| Thomas | centered | right-aligned |
+| Gordon | blue | £3.56 |
+| Henry | green | £2.81 |
</pre>

You can also use diff to highlight differences between two versions of some file, as stored in the repository:

<pre>
$ svn diff -r73:74 foo.txt
...
</pre>

==blame (praise)==
Sometimes, you want to know who wrote a particular bit of code. Subversion makes that easy with the '''blame''' command:
<pre>
$ svn blame file2
2 jprenaud Added some stuff
3 jprenaud Another line
4 jprenaud A third line.
</pre>

You see the content of file2 and for each line the name of the author and the revision number. You could then fetch the log message for that particular revision to get more information.

<pre>
$ svn log file2 --revision 3
------------------------------------------------------------------------
r3 | jprenaud | 2008-05-14 16:03:45 +0100 (Wed, 14 May 2008) | 1 line

More things.
------------------------------------------------------------------------
</pre>

However, this feature is not available for GitHub-hosted repositories:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn blame README.md
svn: Server does not support custom revprops via log
</pre>

=Conflicts=

Sometimes, a commit or an update will fail because of conflicting changes. As a rule, you should always update before a commit so the example here will show a conflict created after an update.

==Creating the conflict==

As mentioned previously, conflicts arise when SVN cannot merge together changes to the same file--i.e. the changes are on the same line.

You can manufacture a conflict using two checkouts--let's call them A and B. I could create two such working copies by typing the following:

<pre>
svn co https://github.com/ggdagw/test ./testA
svn co https://github.com/ggdagw/test ./testB
</pre>

There is nothing to stop me having multiple checkouts on the same computer. Now that we have the raw materials:

# Ensure that both A and B are up-to-date.
# Edit line 1 of README.md in A and commit.
# Edit line 1 of README.md in B--do not commit.
# Now attempt to update B.

Et voila, you will have a conflict. Since SVN cannot resolve it, we must apply the old human grey matter to the task. If you were working with a collaborator, this may well involve a phone or email conversation to decide on the best course of action.

The update does not immediately fail. Rather, you are presented with some options:

<pre>
Conflict discovered in 'README.md'.
Select: (p) postpone, (df) diff-full, (e) edit,
(mc) mine-conflict, (tc) theirs-conflict,
(s) show all options: df
</pre>

If you choose "df", then you will be presented with a summary of the 3-way difference: First how it was prior to your local change; second how it is in your working copy and lastly how it currently is in the repository. You will also be presented with the list of options again. If you choose mine-conflict, "mc", your local modifications will be preferred--at least in this working copy, since nothing has been committed back at this stage. Theirs-conflict, "tc", will prefer the repository version. If you elect to postpone, "p", then 'README.md' is flagged with the letter "C" indicating a conflict and you will notice new files in your working copy:
<pre>
$ svn status
? README.md.r5
? README.md.r6
? README.md.mine
C README.md
</pre>

* README.md.r5 is README.md as at revision 5 (i.e. the one at your last update)
* README.md.r6 is README.md at revision 6 (i.e. the one that is on the repository now)
* README.md.mine is README.md as it was in your working copy before the update
* README.md contains an attempt at merging the changes (this will be similar to what you see with "df").

The last option, edit (e), will present to you the attempted merge in a text editor, for you to resolve as you see fit. Note that if you type "svn status" after editing README.md in this situation, you will see that the file is still marked as conflicted and you will not be able to commit your changes until you have '''resolved''' the conflict by typing e.g.:

<pre>
svn resolved README.md
</pre>

=Other useful commands=

==info==

From inside a working copy, e.g.:

<pre>
svn info
</pre>

will give:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn info
Path: .
URL: https://github.com/ggdagw/test/trunk
Repository Root: https://github.com/ggdagw/test
Repository UUID: be566c2d-dc09-ebaf-f5e5-ce57b7db7bff
Revision: 3
Node Kind: directory
Schedule: normal
Last Changed Author: ggdagw
Last Changed Rev: 3
Last Changed Date: 2013-07-26 16:03:50 +0100 (Fri, 26 Jul 2013)
</pre>

==list==

<pre>
gethin@gethin-desktop:~/test2/trunk$ svn list https://github.com/ggdagw/test
</pre>

will give:

<pre>
branches/
trunk/
</pre>

and

<pre>
gethin@gethin-desktop:~/test2/trunk$ svn list https://github.com/ggdagw/test/trunk
</pre>

will give:

<pre>
README.md
foo.txt
</pre>

==move & copy==
If you rename a file or directory manually, you lose its history, because subversion needs to be notified that a tracked file or directory will have a new name. It is simpler to use the subversion '''move''' command. For instance, to rename "file2", do:
<pre>
$ svn move file2 new_file2
A new_file2
D file2
</pre>

You notice that the new file is added and the old one deleted. You could have done this manually but the advantage of this is that the history of the new file before the new name is still available.

A close relation to '''move''' is '''copy'''. This creates a new file, with a copy of the revision history of its template:

<pre>
$ svn copy havana havana2
A havana2
</pre>

==import==
When you ask for a new repository, it is empty by default. To populate it, you can use the import command. (The name reflects the server's point of view: the repository imports something.) The syntax is:

<pre>
$ svn import PATH URL/trunk --message "Log message."
</pre>

* PATH is the path to the local folder (by default it uses "./", i.e. the current folder)
* URL is the full URL of the repository. In the example, "trunk" is appended to the URL, so the trunk directory is created automatically.

==mkdir==
Often people ask how they can create the "branches/" directory in the repository to store some specific versions of their code. This can be done by invoking mkdir directly on the server. The syntax is:

<pre>
$ svn mkdir URL/branches --message "Log message."
</pre>

==export==
Sometimes, you want to get the files from the version control system without creating a working copy, for instance because you are going to send them to somebody who is not involved in the development. The '''export''' command does this; it was used for the [[Linux1]] and [[Linux2]] practicals.

You could do a checkout and remove all the hidden ".svn" directories manually, but the easiest way is to use the "export" command. It works exactly like a checkout except that you end up with a normal local folder, not a working copy. The syntax is:

<pre>
$ svn export URL PATH
</pre>

==branching and merge==

Subversion can support multiple development strands via the creation of branches. The svn '''copy''' command is used to make a (space efficient) copy of an entire file tree from, say, the trunk to a subdir called "branches". A popular reason for creating a branch is for a particular developer (or team) to work on something speculative or disruptive:

<pre>
svn copy https://svn.ggy.bris.ac.uk/subversion/ourproject/trunk \
https://svn.ggy.bris.ac.uk/subversion/ourproject/branches/sally_dev \
-m "The reason why I'm branching is..."
</pre>

When you have different branches in your project, you might want to merge the changes from one branch to another. For instance, somebody has fixed a bug on a branch and that bug is still present in the trunk. You might want to apply the changes made on the branch back onto the trunk. Subversion allows you to do this and it is called a merge operation. We will cover this topic very quickly here but you can refer to the [http://svnbook.red-bean.com Subversion Red Book] for more information about merging.

[[Image:Svn_merge_general.jpg|frame|centre|General situation for svn merge]]

For example, the following command will merge in a change set from revA to revB from the sally_dev branch of ourproject into your working copy:

<pre>
svn merge -r revA:revB https://svn.ggy.bris.ac.uk/subversion/ourproject/branches/sally_dev
</pre>

You can then commit these changes--should you so desire--which will end up in the development line of whatever you chose to checkout in order to obtain your working copy.

=To go further=
The [http://svnbook.red-bean.com/ Subversion Red Book] is the bible of subversion. Highly recommended.

In the book, you can see how to create your own repository, should you desire. For example, some simple repository setup commands will provide you with a working facility via the filesystem (i.e. the repository is on the same computer that you typically work on), or SSH (i.e. you have SSH access to the machine that will host the repository). See below. '''However, note two things:'''
* '''It is dangerous to create a repository on a filesystem that is not backed-up. You could lose all your work that way.'''
* '''You can request a (backed-up) University of Bristol repository simply by contacting the service desk.'''

==Creating our own Repository==

<pre>
svnadmin create $HOME/my_test_repo
</pre>

Next you might import some files (that you have stored in a directory called 'projectX'). If you're working on the same filesystem you would use:

<pre>
cd projectX
svn import . file://$HOME/my_test_repo/trunk -m "my import message"
svn list file://$HOME/my_test_repo/trunk
...
</pre>

Which--via SSH--would translate to:

<pre>
cd projectX
svn import . svn+ssh://user@host/absolute/path/to/my_test_repo/trunk -m "my import message"
svn list svn+ssh://user@host/absolute/path/to/my_test_repo/trunk
...
</pre>

NB where the components of '''user@host/absolute/path/to''' are:
* user: your username on the remote machine hosting your repository
* host: the hostname of the remote machine
* /absolute/path/to: you must specify the absolute path to your repository ($HOME will be evaluated on the machine you are SSH'ing ''from'')

And you're away!

Subversion

2014-11-05T14:34:12Z

GethinWilliams: /* To go further */

[[Category:Pragmatic Programming]]

=Introduction=

In this workshop, we'll look at using a particular Version Control System (VCS) called Subversion (often abbreviated to SVN). Before getting into the nitty-gritty of using SVN, we'll pause to consider the motivations for adopting version control and also the key concepts that are common to most available systems.

==Why is Version Control useful?==

OK, here's the sales pitch:

* It '''removes confusion''' about versions. For example, you will no longer have to keep inventing names for different versions of essentially the same document e.g. blah.old, blah.sav, blah.older, blah.newest2 (look familiar?).
* It makes '''collaborative working''' easier. Version control assists in coordination as it removes any confusion about versions, highlights conflicts, allows the use of independent working copies, records log messages and much more besides.
* It makes '''distributing your code''' easier. A version control repository can be visible to the world (often as a URL). However, using some highly customisable access controls, you can arrange for some (perhaps anyone) to download your project while also specifying that only a select few may be trusted to upload.
* It makes '''reproducing experiments''' easier. The ability to reproduce an experiment is a ''key characteristic of science''. However, all too often, in the digital age, people are unable to run the same version of a model that they ran six months ago. With version control, you can always access any previous version of your model.
* It aids '''disaster recovery'''. Your computer is fried? No problem! Just check out your code to another and you're working productively again in minutes.

=Version Control Concepts=

A picture can be worth a thousand words, so let's try illustrating some of the key version control concepts, before wading into acres of text:

[[Image:Svn-cartoon.jpg|700px|thumbnail|center|Files stored in a repository; a checkout; modification and commit. All versions recorded.]]
[[Image:File-tree.jpg|400px|thumbnail|center|A checkout adds a copy of the files held in the repository to your local computer.]]



Subversion is a centralised version control system. Centralised version control means that a copy of your project is held in a central location called the '''repository''' and the subversion server logs all operations happening on the repository: every time something is changed in the repository, the server logs the '''time and date''', the '''changes''', the '''author''' as well as a '''log message'''. The server can be configured to give privacy, allowing some people actions which are disallowed for others. For instance, in the practical, the server allows anonymous read-only access but only a selected number of people can change things.

All the operations described above (logging and authentication) happen on the server. However, the server is only accessible directly to system administrators. To interact with the server, a user makes use of a subversion '''client'''. Some of you might already know about some graphical subversion clients such as TortoiseSVN (see the screen grab below). This practical will show how the command line client can be used. The subversion client can be used to (1) ask information from and (2) send information to the server. The client can also be used to get information about your '''working copy''' which is the local copy of the project that resides on your filespace. You can use the client to ask questions such as:
* which files have I modified since I last synchronised with the server?
* when was that file last modified?
* who, inadvertently, created a bug at line 18 in file foo.c?
* what has changed in that file?

<gallery widths=500px heights=350px perrow=2>
File:Tsvn_switch1.png|TortoiseSVN for MS Windows
File:Svn-cli.png|svn command line client for Linux
</gallery>

=Acquiring a Repository=

For the purposes of this practical, you can get yourself a repository from one of the hosting sites that can be found out in the cloud. Or you could use another repository that you have access to--perhaps hosted here in the University, some other portion of UK academia or elsewhere. I've prepared the examples using a repository accessed via the GitHub website (https://github.com). '''Note, however, that to obtain a free repository on GitHub, you must agree to it being readable by anyone.'''

'''NB With that in mind, you may want to have a think about the right home for any of your intellectual property.'''

OK, let's assume that you are happy to work with a GitHub hosted repository--at least for your initial steps learning about version control and subversion. (A natty feature of GitHub repositories is that they can be used with both Subversion and Git VCS.)

Registering and creating a repository is easy, just follow the instructions on the webpage:

[[Image:Github.png|500px|thumbnail|center|The Github web interface]]

Be sure to check the box: '''Initialize this repository with a README'''

In addition to the command line client that we will describe in the following sections, you can manage your repository, view and even edit your files through the GitHub website:

[[Image:Github-ggdagw-test.png|500px|thumbnail|center|A test repository]]

=svn: The Subversion Command Line Client=
The subversion command line client is called '''svn'''. To execute a subversion command, simply type:
<pre>
$ svn command arguments
</pre>

Some commands can also use options which are given with dashes:
<pre>
svn command arguments --option optionvalue
</pre>
Subversion provides extensive help about the commands to use. To get help for a particular subversion command, simply use:
<pre>
$ svn help command
</pre>

=Checkout a Working Copy=

Now that you have access to a repository, let's create a '''working copy''' of the files in the repository. To do this we use the '''svn checkout''' command, or '''svn co''', for short:

<pre>
gethin@gethin-desktop:~$ svn co https://github.com/ggdagw/test ./test
A test/branches
A test/trunk
A test/trunk/README.md
Checked out revision 1.
</pre>

This command gets a copy of the content at the URL and places it in a new directory called test. The letter "A" simply means that these files have been added to your working copy. You'll also notice two subdirectories called '''trunk''' and '''branches'''. This pattern follows a convention. Usually, subversion repositories are organised so that the main strand of development is in the ''trunk''. Sometimes it is useful to store variants of the trunk version (more of that later) and the ''branches'' folder exists to accommodate those. (This is purely convention as far as subversion is concerned, however, and "trunk" and "branches" are merely two folders under the URL.)

The content that you saw through your browser is now in your own file space. You may also notice hidden directories called ".svn". '''It is very important that you do not touch these directories.'''

<pre>
gethin@gethin-desktop:~$ cd test/
gethin@gethin-desktop:~/test$ ls -al
total 28
drwxr-xr-x 5 gethin gethin 4096 2013-07-26 12:29 .
drwxr-xr-x 117 gethin gethin 12288 2013-07-26 12:29 ..
drwxr-xr-x 3 gethin gethin 4096 2013-07-26 12:29 branches
drwxr-xr-x 6 gethin gethin 4096 2013-07-26 12:29 .svn
drwxr-xr-x 3 gethin gethin 4096 2013-07-26 12:29 trunk
</pre>

=Modifying your Working Copy=

Right oh. The working copy is yours to work with so let's go ahead and modify the README.md file.

<pre>
gethin@gethin-desktop:~/test$ cd trunk
gethin@gethin-desktop:~/test/trunk$ ls
README.md
gethin@gethin-desktop:~/test/trunk$ emacs -nw README.md
</pre>

(See, e.g. http://refcards.com/docs/gildeas/gnu-emacs/emacs-refcard-a4.pdf, if you'd like to use the emacs text editor, but are new to it.)

To see what files you have modified, you ask the client for the status of your working copy:
<pre>
gethin@gethin-desktop:~/test/trunk$ svn status
M README.md
</pre>

The status shows the letter "M" for README.md, indicating it has been modified.

Note that this status only shows the things that have changed in '''your''' working copy. It does not show any changes made by others, either in the repository or in their own working copies.

You can also add a new file. Let's add a file called '''foo.txt''':

<pre>
gethin@gethin-desktop:~/test/trunk$ touch foo.txt
gethin@gethin-desktop:~/test/trunk$ svn status
? foo.txt
M README.md
</pre>

The question mark shows that the subversion client knows nothing about the new file (i.e. it is not currently under the auspices of version control). By default, svn will ignore new files. To indicate that a new file should be versioned, use the '''add''' command:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn add foo.txt
A foo.txt
gethin@gethin-desktop:~/test/trunk$ svn status
A foo.txt
M README.md
</pre>

The letter "A" is used to indicate an addition.
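Because svn leaves new files untracked until you '''add''' them, it is easy to forget one. The following is a minimal sketch of a helper that picks the "?" entries out of the status listing. The function name is hypothetical (it is not part of svn), and it assumes the status letter is in the first column, as in the output shown above.

```shell
# unversioned_files: read `svn status` output on stdin and print only the
# paths that svn marks with "?" (files it is not yet tracking).
# NOTE: the function name is an assumption made for this illustration.
unversioned_files() {
    # Strip the leading "?" and the whitespace after it, leaving the path.
    # Lines that do not start with "?" are not printed at all.
    sed -n 's/^?[[:space:]]*//p'
}

# Typical use from inside a working copy (this line needs svn installed):
#   svn status | unversioned_files
```

You could then review the list and run '''svn add''' on the files that really do belong in the repository.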

=Recording Changes in the Repository=

Sending changes to the repository is called a '''commit'''. Here's the command I used to send our two local modifications:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn commit --message "Added text to README.md and added the empty file foo.txt"
Sending trunk/README.md
Adding trunk/foo.txt
Transmitting file data ..
Committed revision 2.
</pre>

where the '''--message''', or '''-m''' for short, allows us to write a log message inline.

Notice the revision number. These numbers encode the state of the whole repository at a given juncture and are the passport to retrieving earlier versions of your project. As you commit future changes to your repository, your revision numbers will steadily increase.

Sometimes, you want a long message to go with a commit. To do this, simply execute the commit without the --message option. A text editor will then pop up for you to write the message; when you save and exit, the commit is made. Note that svn uses the editor indicated by the '''EDITOR''' environment variable. The editor often defaults to vi if this variable is undefined. If you are an emacs fan, set the variable first:

<pre>
$ export EDITOR=emacs
</pre>

(Note that you can use ''':q!''' to get out of vi, if you started it by accident. You could also set EDITOR=nano or gedit etc. You can also use the SVN_EDITOR environment variable.)

=Revert: Your "Get-Out-of-Jail Card"=

Just as we can add files, we can '''delete''', for example:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn delete foo.txt
D foo.txt
gethin@gethin-desktop:~/test/trunk$ ls
README.md
</pre>

The letter "D" indicates deletions and we see from typing 'ls' that the file has gone.

Subversion allows you to '''revert''' changes when you have made an error. Let's assume that 'foo.txt' was deleted in error. Fear not, you can get it back with:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn revert foo.txt
Reverted 'foo.txt'
gethin@gethin-desktop:~/test/trunk$ ls
foo.txt README.md
gethin@gethin-desktop:~/test/trunk$ svn status
gethin@gethin-desktop:~/test2/trunk$
</pre>

and foo.txt is back! A silent return from svn status, svn stat for short, indicates that there are no pending modifications in your working copy. Put another way, it exactly matches the repository version of the project when you made the checkout.

=Updating your Working Copy=

You can '''update''' your working copy to synchronise it with the latest version (known as the HEAD) held in the repository. The general form of the update command is:

<pre>
$ svn update
... <- list of files that have been added/modified
At revision X.
</pre>

If I update my working copy now:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn update
At revision 2.
</pre>

we see an empty list of files--i.e. there is nothing to update and my working copy perfectly matches the HEAD of the repository.

That needn't be the case, however. Let's imagine that you and a collaborator in Japan have access to the repository. You obviously work independently and, for good measure, in different time zones. Your collaborator may have committed some changes to the repository since you were last in front of a computer. That being the case, an update will bring all of her changes to your working copy.

A similar situation can arise if you are simultaneously operating two checkouts. Perhaps one at work and another on your home computer. If you had done some work at home yesterday evening and committed the fruits of your labours, an update will bring your work copy into line.

You can even update your working copy if you have some local modifications pending. In that situation, SVN will attempt to merge your changes with those committed by your collaborator. If you have both edited the same line in a file, a '''conflict''' is flagged. More on that possibility later.

With all the foregoing in mind, '''status''', '''commit''' and '''update''' will probably be your most widely used commands:
# '''update''' regularly to bring in other people's work
# '''status''' to make sure all is well
# '''commit''' frequently so that you can always recover a version you care about
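
The three-step cycle above can be sketched as a small shell function. This is only an illustration under stated assumptions: the name sync_then_commit is made up, it must be run from inside a working copy, and it only looks for text conflicts (a "C" in the first status column).

```shell
# sync_then_commit "log message"
# Hypothetical helper: update, check status, then commit.
sync_then_commit() {
    if [ -z "$1" ]; then
        echo "usage: sync_then_commit \"log message\"" >&2
        return 2
    fi
    # 1. update: bring in other people's work first.
    svn update || return 1
    # 2. status: refuse to go on while any path is flagged "C" (conflicted).
    if svn status | grep -q '^C'; then
        echo "conflicts present; resolve them before committing" >&2
        return 1
    fi
    # 3. commit: record the local changes with the given log message.
    svn commit --message "$1"
}
```

For example, sync_then_commit "Fixed typo in README.md" would update, check for conflicts and then commit in one go.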

=Investigating Changes=

This section highlights some commands that can be used to boost productivity.

==log==
To get a log of what happened in the repository, use the log command. To see the files that have been modified as well as the log messages, use the --verbose option:
<pre>
gethin@gethin-desktop:~/test/trunk$ svn log --verbose
------------------------------------------------------------------------
r2 | ggdagw | 2013-07-26 14:58:08 +0100 (Fri, 26 Jul 2013) | 2 lines
Changed paths:
M /trunk/README.md
A /trunk/foo.txt

Added text to README.md and added the empty file foo.txt

------------------------------------------------------------------------
r1 | ggdagw | 2013-07-26 12:28:32 +0100 (Fri, 26 Jul 2013) | 2 lines
Changed paths:
A /branches
A /trunk
A /trunk/README.md

Initial commit

------------------------------------------------------------------------
</pre>

You can also invoke the log command on a particular file/path and provide a range of revisions.
For instance to see which commits affected file1 between revisions 4 and 6, one could use:
<pre>
$ svn log --verbose --revision 4:6 file1
... <- log output
</pre>
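
Since each log entry starts with a header line of the form "rN | author | date | n lines", the output is easy to post-process. Here is a rough sketch that counts commits per author; the function name is invented for this example, and it assumes the header layout shown above (a log message that itself begins with "rN | " would be miscounted).

```shell
# commits_per_author: read `svn log` output on stdin and print one line per
# author in the form "author count".
# NOTE: hypothetical helper, relying on the "rN | author | date | n lines"
# header format shown in the log output above.
commits_per_author() {
    awk -F ' [|] ' '
        /^r[0-9]+ [|] / { n[$2]++ }              # header line: field 2 is the author
        END { for (a in n) print a, n[a] }
    ' | LC_ALL=C sort
}

# Typical use from inside a working copy (this line needs svn installed):
#   svn log | commits_per_author
```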

==diff==
After you have modified something, it can be handy to highlight what you've done. You can do this using the '''diff''' command.

For instance add some text to 'README.md' and use '''diff''' to see what you have done.

<pre>
gethin@gethin-desktop:~/test/trunk$ svn stat
M README.md
gethin@gethin-desktop:~/test/trunk$ svn diff README.md
Index: README.md
===================================================================
--- README.md (revision 2)
+++ README.md (working copy)
@@ -7,3 +7,16 @@
------------

This file is formatted in [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) and will be automatically rendered on your GitHub webpage.
+
+Here is an itemised list:
+* bread
+* butter
+* marmalade
+
+A Table:
+
+| Name | Colour | Price |
+| ------- |:-------------:|--------------:|
+| Thomas | centered | right-aligned |
+| Gordon | blue | £3.56 |
+| Henry | green | £2.81 |
</pre>

You can also use diff to highlight differences between two versions of some file, as stored in the repository:

<pre>
$ svn diff -r73:74 foo.txt
...
</pre>

==blame (praise)==
Sometimes, you want to know who wrote a particular bit of code. Subversion makes that easy with the '''blame''' command:
<pre>
$ svn blame file2
2 jprenaud Added some stuff
3 jprenaud Another line
4 jprenaud A third line.
</pre>

You see the content of file2 and for each line the name of the author and the revision number. You could then fetch the log message for that particular revision to get more information.

<pre>
$ svn log file2 --revision 3
------------------------------------------------------------------------
r3 | jprenaud | 2008-05-14 16:03:45 +0100 (Wed, 14 May 2008) | 1 line

More things.
------------------------------------------------------------------------
</pre>

However, this feature is not available for GitHub-hosted repositories:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn blame README.md
svn: Server does not support custom revprops via log
</pre>

=Conflicts=

Sometimes, a commit or an update will fail because of conflicting changes. As a rule, you should always update before a commit so the example here will show a conflict created after an update.

==Creating the conflict==

As mentioned previously, conflicts arise when SVN cannot merge together changes to the same file--i.e. the changes are on the same line.

You can manufacture a conflict using two checkouts--let's call them A and B. I could create two such working copies by typing the following:

<pre>
svn co https://github.com/ggdagw/test ./testA
svn co https://github.com/ggdagw/test ./testB
</pre>

There is nothing to stop me having multiple checkouts on the same computer. Now that we have the raw materials:

# Ensure that both A and B are up-to-date.
# Edit line 1 of README.md in A and commit.
# Edit line 1 of README.md in B--do not commit.
# Now attempt to update B.

Et voila, you will have a conflict. Since SVN cannot resolve it, we must apply the old human grey matter to the task. If you were working with a collaborator, this may well involve a phone or email conversation to decide on the best course of action.

The update does not immediately fail. Rather, you are presented with some options:

<pre>
Conflict discovered in 'README.md'.
Select: (p) postpone, (df) diff-full, (e) edit,
(mc) mine-conflict, (tc) theirs-conflict,
(s) show all options: df
</pre>

If you choose "df", then you will be presented with a summary of the 3-way difference: First how it was prior to your local change; second how it is in your working copy and lastly how it currently is in the repository. You will also be presented with the list of options again. If you choose mine-conflict, "mc", your local modifications will be preferred--at least in this working copy, since nothing has been committed back at this stage. Theirs-conflict, "tc", will prefer the repository version. If you elect to postpone, "p", then 'README.md' is flagged with the letter "C" indicating a conflict and you will notice new files in your working copy:
<pre>
$ svn status
? README.md.r5
? README.md.r6
? README.md.mine
C README.md
</pre>

* README.md.r5 is README.md as at revision 5 (i.e. the one at your last update)
* README.md.r6 is README.md at revision 6 (i.e. the one that is on the repository now)
* README.md.mine is README.md as it was in your working copy before the update
* README.md contains an attempt at merging the changes (this will be similar to what you see with "df").

The last option, edit (e), will present to you the attempted merge in a text editor, for you to resolve as you see fit. Note that if you type "svn status" after editing README.md in this situation, you will see that the file is still marked as conflicted and you will not be able to commit your changes until you have '''resolved''' the conflict by typing e.g.:

<pre>
svn resolved README.md
</pre>
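
When svn cannot merge a file, it brackets the disputed region in the merged file with marker lines beginning "<<<<<<< " and ">>>>>>> ". Before declaring a conflict resolved, it is worth checking that none of these are left behind. A minimal sketch follows; the function name is hypothetical and the check is deliberately simple.

```shell
# has_conflict_markers FILE
# Hypothetical helper: succeeds (exit status 0) if FILE still contains the
# "<<<<<<< " or ">>>>>>> " lines that svn writes around a conflicting region.
has_conflict_markers() {
    grep -Eq '^(<<<<<<<|>>>>>>>) ' "$1"
}

# Typical use (the svn part needs svn installed and a conflicted README.md):
#   has_conflict_markers README.md || svn resolved README.md
```

The "=======" separator line is not tested for, since a line of equals signs can legitimately appear in a Markdown file.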

=Other useful commands=

==info==

From inside a working copy, e.g.:

<pre>
svn info
</pre>

will give:

<pre>
gethin@gethin-desktop:~/test/trunk$ svn info
Path: .
URL: https://github.com/ggdagw/test/trunk
Repository Root: https://github.com/ggdagw/test
Repository UUID: be566c2d-dc09-ebaf-f5e5-ce57b7db7bff
Revision: 3
Node Kind: directory
Schedule: normal
Last Changed Author: ggdagw
Last Changed Rev: 3
Last Changed Date: 2013-07-26 16:03:50 +0100 (Fri, 26 Jul 2013)
</pre>

==list==

<pre>
gethin@gethin-desktop:~/test2/trunk$ svn list https://github.com/ggdagw/test
</pre>

will give:

<pre>
branches/
trunk/
</pre>

and

<pre>
gethin@gethin-desktop:~/test2/trunk$ svn list https://github.com/ggdagw/test/trunk
</pre>

will give:

<pre>
README.md
foo.txt
</pre>

==move & copy==
If you rename a file or directory manually, you lose its history, because subversion needs to be notified that a tracked file or directory will have a new name. It is simpler to use the subversion '''move''' command. For instance, to rename "file2", do:
<pre>
$ svn move file2 new_file2
A new_file2
D file2
</pre>

You notice that the new file is added and the old one deleted. You could have done this manually but the advantage of this is that the history of the new file before the new name is still available.

A close relation to '''move''' is '''copy'''. This creates a new file, with a copy of the revision history of its template:

<pre>
$ svn copy havana havana2
A havana2
</pre>

==import==
When you ask for a new repository, it is empty by default. To populate it, you can use the import command. (The name reflects the server's point of view: the repository imports something.) The syntax is:

<pre>
$ svn import PATH URL/trunk --message "Log message."
</pre>

* PATH is the path to the local folder (by default it uses "./", i.e. the current folder)
* URL is the full URL of the repository. In the example, "trunk" is appended to the URL, so the trunk directory is created automatically.

==mkdir==
Often people ask how they can create the "branches/" directory in the repository to store some specific versions of their code. This can be done by invoking mkdir directly on the server. The syntax is:

<pre>
$ svn mkdir URL/branches --message "Log message."
</pre>

==export==
Sometimes, you want to get the files from the version control system without creating a working copy, for instance because you are going to send them to somebody who is not involved in the development. The '''export''' command does this; it was used for the [[Linux1]] and [[Linux2]] practicals.

You could do a checkout and remove all the hidden ".svn" directories manually, but the easiest way is to use the "export" command. It works exactly like a checkout except that you end up with a normal local folder, not a working copy. The syntax is:

<pre>
$ svn export URL PATH
</pre>

==branching and merge==

Subversion can support multiple development strands via the creation of branches. The svn '''copy''' command is used to make a (space efficient) copy of an entire file tree from, say, the trunk to a subdir called "branches". A popular reason for creating a branch is for a particular developer (or team) to work on something speculative or disruptive:

<pre>
svn copy https://svn.ggy.bris.ac.uk/subversion/ourproject/trunk \
https://svn.ggy.bris.ac.uk/subversion/ourproject/branches/sally_dev \
-m "The reason why I'm branching is..."
</pre>

When you have different branches in your project, you might want to merge the changes from one branch to another. For instance, somebody has fixed a bug on a branch, but the bug is still present in the trunk. You might want to apply the changes made on the branch back onto the trunk. Subversion allows you to do this and it is called a merge operation. We will cover this topic very quickly here but you can refer to the [http://svnbook.red-bean.com Subversion Red Book] for more information about merging.

[[Image:Svn_merge_general.jpg|frame|centre|General situation for svn merge]]

For example, the following command will merge in a change set from revA to revB from the sally_dev branch of ourproject into your working copy:

<pre>
svn merge -r revA:revB https://svn.ggy.bris.ac.uk/subversion/ourproject/branches/sally_dev
</pre>

You can then commit these changes--should you so desire--which will end up in the development line of whatever you chose to checkout in order to obtain your working copy.

=To go further=
The [http://svnbook.red-bean.com/ Subversion Red Book] is the bible of subversion. Highly recommended.

In the book, you can see how to create your own repository, should you desire. For example, some simple repository setup commands will provide you with a working facility via the filesystem (i.e. the repository is on the same computer that you typically work on), or SSH (i.e. you have SSH access to the machine that will host the repository). See below. '''However, note two things:'''
* '''It is dangerous to create a repository on a filesystem that is not backed-up. You could lose all your work that way.'''
* '''You can request a (backed-up) University of Bristol repository simply by contacting the service desk.'''

==Creating our own Repository==

<pre>
svnadmin create $HOME/my_test_repo
</pre>

Next you might import some files (that you have stored in a directory called 'projectX'). If you're working on the same filesystem you would use:

<pre>
cd projectX
svn import . file://$HOME/my_test_repo/trunk -m "my import message"
svn list file://$HOME/my_test_repo/trunk
...
</pre>

Which--via SSH--would translate to:

<pre>
cd projectX
svn import . svn+ssh://user@host/absolute/path/to/my_test_repo/trunk -m "my import message"
svn list svn+ssh://user@host/absolute/path/to/my_test_repo/trunk
...
</pre>

NB where the components of '''user@host/absolute/path/to''' are:
* user: your username on the remote machine hosting your repository
* host: the hostname of the remote machine
* /absolute/path/to: you must specify the absolute path to your repository ($HOME will be evaluated on the machine you are SSH'ing ''from'')

And you're away!

R1

2014-11-04T14:42:03Z

GethinWilliams: /* Standard Graphics: A taster */

R1

2014-11-03T16:28:27Z

GethinWilliams: /* Standard Graphics: A taster */

2014-03-07T12:15:13Z

GethinWilliams: Protected "R1" ([edit=sysop] (indefinite) [move=sysop] (indefinite))

[[category:Pragmatic Programming]]
'''Open Source Statistics with R'''

=Introduction=

R is a mature, open-source (i.e. free!) statistics package, with an intuitive interface, excellent graphics and a vibrant community constantly adding new methods for the statistical investigation of your data to the library of packages available.

The goal of this tutorial is to introduce you to the R package, and not to be an introductory course in statistics.

If you are working on a Linux system, you will typically start R from the command line. On a Windows machine, or a Mac, you will typically start up R in some form of GUI. However you get R started, you will have access to an R command prompt. The good news is that the examples below will all work at the R command prompt, however you gained access to it.

Further resources:

* The R manual is a great resource for learning R: http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
* Some excellent examples of using R can also be found at: http://msenux.redwoods.edu/math/R/ and http://www.r-tutor.com/

=Getting Started=

The very simplest thing we can do with R is to perform some arithmetic at the command prompt:

<source>
> phi <- (1+sqrt(5))/2
> phi
[1] 1.618034
</source>

Parentheses are used to modify the usual order of precedence of the operators ('''/''' would otherwise be evaluated before '''+'''). Note the '''[1]''' accompanying the returned value. All numbers entered at the console are interpreted as vectors. The '[1]' indicates that the line in question displays the values of the vector starting from the first index. We can use the handy sequence function, '''seq''', to create a vector containing more than a single element:

<source>
> odds <- seq(from=1, to=67, by=2)
> odds
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67
</source>

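In the above example, '''<-''' assigns the result to a variable, while '''=''' binds values to the named arguments of '''seq'''. The '''=''' operator can also be used for assignment at the command prompt, although '''<-''' is the conventional choice.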

Vectors are commonly used data structures in R:

<source>
coords.bris <- c(51.5, 2.6)
</source>

As are matrices:

<source>
> magic <- matrix(data=c(2,7,6,9,5,1,4,3,8),nrow=3,ncol=3)
> magic
     [,1] [,2] [,3]
[1,]    2    9    4
[2,]    7    5    3
[3,]    6    1    8
</source>

Here, the '''c''' function combines its arguments into a vector, which '''matrix''' then fills column by column. We can access portions of the matrix using square brackets. For example, we can access the first row using the '''[1,]''' notation, and similarly the second column using '''[,2]'''. Since this is a 3x3 magic square, the numbers in both slices should sum to 15:

<source>
> sum(magic[1,])
[1] 15
> sum(magic[,2])
[1] 15
</source>
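
The same check can be extended to the whole square. The following is a minimal sketch, assuming the same '''magic''' matrix as above, that uses the base R functions '''rowSums''', '''colSums''' and '''diag''' to test every row, column and diagonal at once. The variable names are illustrative only.

<source>
# Rebuild the 3x3 magic square used above (filled column by column)
magic <- matrix(data=c(2,7,6,9,5,1,4,3,8), nrow=3, ncol=3)

# Sum every row and every column in one call each
row.sums <- rowSums(magic)
col.sums <- colSums(magic)

# Main diagonal, and the anti-diagonal (reverse the column order first)
diag.sum <- sum(diag(magic))
anti.diag.sum <- sum(diag(magic[, ncol(magic):1]))

# TRUE only if all eight sums are equal to 15
all(c(row.sums, col.sums, diag.sum, anti.diag.sum) == 15)
</source>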

Single elements and ranges can also be accessed:

<source>
> magic[2,2]
[1] 5
> magic[2:3,2:3]
     [,1] [,2]
[1,]    5    3
[2,]    1    8
</source>

R also provides '''arrays''', which have more than two dimensions, and '''lists''' to hold heterogeneous collections.

An example list:

<source>
> list.r4 <- list(name="Radio4", frequency="93.7")
</source>

We can access the items in several ways. Note that single brackets return a sub-list, whereas double brackets and '''$''' return the item itself:

<source>
> list.r4$frequency
[1] "93.7"
> list.r4[1]
$name
[1] "Radio4"

> list.r4[[1]]
[1] "Radio4"
</source>
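
The difference between the access methods is easiest to see by asking R for the class of each result. This is a small sketch using the same list; the variable '''freq.mhz''' is a name chosen purely for illustration.

<source>
list.r4 <- list(name="Radio4", frequency="93.7")

# Single brackets return a list of length one...
class(list.r4[1])
# ...whereas double brackets (and $) return the item itself
class(list.r4[[1]])
class(list.r4$frequency)

# Items can also be picked out by name, and converted as required
freq.mhz <- as.numeric(list.r4[["frequency"]])
freq.mhz + 0.1
</source>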

A very commonly used data structure is the '''data frame''', which R uses to store tabular data. Given several vectors of equal length, we can collate them into a data frame:

<source>
> country <- c("USA", "China", "GB")
> gold <- c(46, 38, 29)
> silver <- c(29, 27, 17)
> bronze <- c(29, 23, 19)
> medals.2012 <- data.frame(country, gold, silver, bronze)
> medals.2012
  country gold silver bronze
1     USA   46     29     29
2   China   38     27     23
3      GB   29     17     19
</source>

We can access columns of a data frame using the '''$''' operator:

<source>
> medals.2012$country
[1] USA China GB
Levels: China GB USA
> medals.2012$gold
[1] 46 38 29
</source>
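
Data frames can also be extended and filtered using the same '''$''' and square bracket notation. The sketch below rebuilds the medals table, adds a derived column and selects a subset of rows. The column name '''total''', the variable '''big.winners''' and the threshold of 30 golds are our own choices for illustration.

<source>
# stringsAsFactors=FALSE keeps the country names as plain character strings
medals.2012 <- data.frame(country=c("USA", "China", "GB"),
                          gold=c(46, 38, 29),
                          silver=c(29, 27, 17),
                          bronze=c(29, 23, 19),
                          stringsAsFactors=FALSE)

# Add a derived column holding the total number of medals
medals.2012$total <- medals.2012$gold + medals.2012$silver + medals.2012$bronze

# Select the rows where more than 30 golds were won, and two of the columns
big.winners <- medals.2012[medals.2012$gold > 30, c("country", "total")]
big.winners
</source>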

=Standard Graphics: A taster=

One aspect that makes R popular is its graphing functions. R also has some very handy built-in data sets--we'll use these to demonstrate just a small fraction of R's graphing abilities.

First up is the humble '''plot()''' function. Given a data frame of points, such as one charting the relationship between temperature and the vapour pressure of mercury, it will give us a (handily labelled) scatter plot:

<source>
> plot(pressure)
</source>

See the gallery below for all the plots created in this section.

The plot function will also accept a time-series (another class of object recognised by R) and will sensibly join the points with a line:

<source>
> plot(co2)
> class(co2)
[1] "ts"
</source>

Pie charts are easily constructed. In this case, to show the relative proportions of electricity generated from different sources in the UK in 2011 (source: https://www.gov.uk/government/.../5942-uk-energy-in-brief-2012.pdf):

<source>
> uk.electricity.sources.2011 <- c(41,29,18,5,4,2,1)
> names(uk.electricity.sources.2011) <- c("Gas", "Coal", "Nuclear", "Hydro & other", "Wind", "Imports", "Oil")
> pie(uk.electricity.sources.2011, main="UK Electricity Generating Mix, 2011", col=rainbow(7))
</source>

Next, let's create a bar chart of monthly average precipitation falling here in the fair city of Bristol (source: http://www.worldweatheronline.com):

<source>
> bristol.precip <- c(82.9, 56.1, 59.2, 69, 50.8, 50.9, 50.8, 74.8, 74.7, 91.1, 94.5, 93.6)
> names(bristol.precip) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> barplot(bristol.precip,
+ main="Average Monthly Precipitation in Bristol",
+ ylab="Mean precipitation (mm)",
+ ylim=c(0,100),
+ col=c("darkblue"))
</source>

[http://en.wikipedia.org/wiki/Box_plot 'Box and whisker' plots] are useful ways to graph the quartiles of some data. In this case, the fuel efficiencies of various US cars, circa 1974:

<source>
> boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
+ xlab="Number of Cylinders", ylab="Miles Per Gallon")
</source>

R includes a very useful help facility. In the case of the '''filled.contour()''' plotting function, the help page includes an example of its use to plot the topography of a volcano in Auckland, NZ:

<source>
> ?filled.contour
</source>

<gallery widths=300px heights=300px perrow=3>
File:Vapour-pressure.png|Vapour pressure of mercury against temperature
File:Mauna-loa.png|CO2 concentrations measured at Mauna-Loa between 1959 and 1997
File:Pie.png|The UK's electricity generating mix, 2011
File:Barplot.png|Average monthly precipitation in Bristol
File:Boxplot.png|Range of fuel efficiencies for different engine sizes
File:Maunga-Whau.png|Topography of Maunga Whau volcano in Auckland
</gallery>

There are many more example plots--complete with the R code required to create the plots (at the bottom of the page, after the comments)--on the following web page:
* http://gallery.r-enthusiasts.com/thumbs.php

=Loops=

A simple '''for''' loop:

<source>
> for (ii in seq(1,10)) print(ii)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
</source>

Some more exotic counting:

<source>
> for (ii in seq(from=10, to=0, by=-2)) print(ii)
[1] 10
[1] 8
[1] 6
[1] 4
[1] 2
[1] 0
</source>

'''while''' loops are for when we don't know the number of iterations in advance:

<source>
> ii <- runif(1,0,1)
> ii
[1] 0.3998513
> while (ii < 0.5) {print(ii); ii <- runif(1,0,1)}
[1] 0.3998513
[1] 0.05469244
> ii
[1] 0.8265036
</source>

=Functions=

You can define your own functions in R, using the '''function''' keyword. For example, Pythagoras' Theorem:

<source>
> hypotenuse <- function(x, y) {sqrt(x^2 + y^2)}
</source>

The braces ({}) are optional, but add clarity.

To call the function:

<source>
> hypotenuse(3,4)
[1] 5
</source>

We can provide default values for the arguments, which can be overridden for any given invocation of the function:

<source>
> hypot2 <- function(x=3 ,y=4) {sqrt(x^2 + y^2)}
> hypot2()
[1] 5
> hypot2(12,16)
[1] 20
> hypot2(y=16, x=12)
[1] 20
</source>

You can see that the order of the arguments is respected, unless the names are given, in which case the order can be changed.
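
The matching rules can be explored a little further. In this sketch, which reuses '''hypot2''' from above, we override only one default, and then mix a named argument with an unnamed one. The variable names '''partial''' and '''mixed''' are illustrative.

<source>
hypot2 <- function(x=3, y=4) {sqrt(x^2 + y^2)}

# Override only the first argument; y keeps its default value of 4
partial <- hypot2(5)

# Name one argument, and the unnamed value fills the remaining slot (x)
mixed <- hypot2(y=16, 12)

partial
mixed
</source>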

Longer functions can be spread over several lines. We can also use the '''return''' keyword to control which value is returned by the function:

<source>
> hypot3 <- function(x=3 ,y=4) {
+ x_sq <- x^2
+ y_sq <- y^2
+ return( sqrt(x_sq + y_sq) )}
> hypot3(6,8)
[1] 10
</source>

You can check on the contents of a function, by just typing its name (without parentheses):

<source>
> hypot3
function(x=3 ,y=4) {
x_sq <- x^2
y_sq <- y^2
return( sqrt(x_sq + y_sq) )}
</source>

Or just check the arguments, using the '''args''' function. (It returns a function with the same arguments but an empty body, which is why NULL is printed):

<source>
> args(hypot3)
function (x = 3, y = 4)
NULL
</source>

=Packages=

The add-on packages available for R are listed at http://cran.r-project.org/

Let's install the '''multicore''' package, that will give us access to functions within R which will run on the multiple processors which we often find in our computers these days:

<source>
> install.packages("multicore")
</source>

Et voila! It is done.

We can list the packages installed in the libraries available to our session:

<source>
> library()
</source>

To load one into the current session, we type e.g.:

<source>
> library(multicore)
</source>

Now, an example of using a function from the multicore package. The '''lapply''' function, which is included in base R, will map a given function over a list of inputs, giving a list of the function outputs in return. For example, we can map a squaring function over the list of integers from 1 to 3:

<source>
> lapply(1:3, function(x) {x^2})
</source>

which gives us the list:

<pre>
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9
</pre>
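
A list is not always the most convenient form for the results. As a brief sketch, the base R functions '''unlist''' and '''sapply''' can be used to obtain a plain vector instead. The variable names here are our own.

<source>
# lapply always returns a list
squares.list <- lapply(1:3, function(x) {x^2})

# unlist flattens that list into a vector...
squares.vec <- unlist(squares.list)

# ...and sapply does the mapping and the simplification in one step
squares.sapply <- sapply(1:3, function(x) {x^2})

squares.vec
squares.sapply
</source>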

Now, we can do the same work in parallel using:

<source>
> mclapply(1:3, function(x) {x^2})
</source>

=Reading Data from File=

R provides some very useful functions for reading and writing data from/to file.

==Text Files==

Let's start with text files. Perhaps the simplest function to use is '''read.table()''', which suits data organised so that it looks like a table with column headings. Suppose that I have a text file, '''medals.txt''', with the following contents:

<pre>
country gold silver bronze
"USA" 46 29 29
"China" 38 27 23
"Great Britain" 29 17 19
"Russian Federation" 24 26 32
"Republic of Korea" 13 8 7
"Germany" 11 19 14
</pre>

It will be a simple matter to use the '''read.table()''' function to load the data into R:

<source>
> medals.2012 <- read.table("medals.txt", header=TRUE)
> medals.2012
             country gold silver bronze
1                USA   46     29     29
2              China   38     27     23
3      Great Britain   29     17     19
4 Russian Federation   24     26     32
5  Republic of Korea   13      8      7
6            Germany   11     19     14
</source>

There is a corresponding '''write.table()''' function to export the contents of a data frame into a text file.

CSV files can be easily handled by specifying '''sep=","''' as an argument to read.table(). However, for convenience, there are also '''read.csv()''' and '''write.csv()''' functions defined. For example:

<source>
> write.csv(medals.2012,"medals.csv")
</source>

Gives us the file, '''medals.csv''', with the contents:

<pre>
"","country","gold","silver","bronze"
"1","USA",46,29,29
"2","China",38,27,23
"3","Great Britain",29,17,19
"4","Russian Federation",24,26,32
"5","Republic of Korea",13,8,7
"6","Germany",11,19,14
</pre>
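
The first, unnamed column in the file above holds the row names of the data frame. If you do not want them, '''write.csv()''' accepts '''row.names=FALSE'''. The following sketch writes a small table to a temporary file and reads it back again with '''read.csv()'''. The use of '''tempfile()''' and the variable names are illustrative choices, made so that the example does not overwrite any of your own files.

<source>
medals <- data.frame(country=c("USA", "China", "Great Britain"),
                     gold=c(46, 38, 29),
                     silver=c(29, 27, 17),
                     bronze=c(29, 23, 19),
                     stringsAsFactors=FALSE)

# Write to a temporary file, leaving out the column of row names
csv.file <- tempfile(fileext=".csv")
write.csv(medals, csv.file, row.names=FALSE)

# Read it back in; the header line supplies the column names
medals.back <- read.csv(csv.file, stringsAsFactors=FALSE)
names(medals.back)
nrow(medals.back)
</source>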

==Binary Files==

The '''save()''' function will store an R data structure in binary form:

<source>
> save(medals.2012,file="medals.RData")
</source>

<pre>
gethin@gethin-desktop:~$ file medals.RData
medals.RData: gzip compressed data, from Unix
</pre>

There is, of course, a corresponding function to load such data:

<source>
> load("medals.RData")
</source>
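
Note that '''load()''' restores objects under the names they had when saved. As an alternative sketch, base R also provides '''saveRDS()''' and '''readRDS()''', which store a single object and let you choose its name when reading it back. The file and variable names below are illustrative.

<source>
medals <- data.frame(country=c("USA", "China"), gold=c(46, 38))

# saveRDS stores one object, without recording its name
rds.file <- tempfile(fileext=".rds")
saveRDS(medals, file=rds.file)

# readRDS returns the object, so we choose the name at load time
medals.copy <- readRDS(rds.file)
all.equal(medals, medals.copy)
</source>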

==Databases==

If you would like to read and write data directly from/to a database, there are several packages to help you. See http://cran.r-project.org/doc/manuals/r-release/R-data.html#Relational-databases for more information.

==NetCDF==

The [http://cran.r-project.org/web/packages/ncdf/index.html '''ncdf''' package] provides an interface to NetCDF files. Before installing the package, you will need the Unidata NetCDF libraries installed on your system. On Linux, the standard package managers conveniently provide this. Note that you will need the 'development' packages. Once the prerequisites are satisfied, you can use the standard R command to install the package from CRAN:

<source>
> install.packages("ncdf")
</source>

=Examples of Common Tasks=

==Preparing Data==

===Sorting===

Using '''sort''':

<source>
> railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
> sort(railway.engines)
[1] "edward" "gordon" "henry" "james" "thomas"
</source>

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html

===Random Sampling===

Using '''sample''':

<source>
> railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "edward"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "thomas"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
</source>

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
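
Random draws differ from run to run, which can make results hard to reproduce. A minimal sketch of two common variations follows: fixing the random seed with '''set.seed()''' and sampling without replacement. The seed value of 42 is arbitrary.

<source>
railway.engines <- c("thomas", "henry", "gordon", "edward", "james")

# Fixing the seed makes the 'random' draws repeatable
set.seed(42)
first.shuffle <- sample(railway.engines)

set.seed(42)
second.shuffle <- sample(railway.engines)

# Same seed, same permutation
identical(first.shuffle, second.shuffle)

# Draw two engines without replacement: no engine can appear twice
pair <- sample(railway.engines, 2, replace=FALSE)
pair
</source>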

===Combining===

Using '''rbind''' to combine the rows of two data frames:

<source>
> country <- c("France", "Italy", "Hungary", "Australia")
> gold <- c(11, 8, 8, 7)
> silver <- c(11, 9, 4, 16)
> bronze <- c(12, 11, 5, 12)
> extras.2012 <- data.frame(country, gold, silver, bronze)
> rbind(medals.2012, extras.2012)
country gold silver bronze
1 USA 46 29 29
2 China 38 27 23
3 Great Britain 29 17 19
4 Russian Federation 24 26 32
5 Republic of Korea 13 8 7
6 Germany 11 19 14
7 France 11 11 12
8 Italy 8 9 11
9 Hungary 8 4 5
10 Australia 7 16 12
</source>

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html

===Binning Data===

Using '''cut''':

<source>
> girls_2=c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
> bins=cut(girls_2, breaks=3)
> bins
[1] (81.3,84.1] (84.1,86.9] (84.1,86.9] (86.9,89.7] (81.3,84.1] (86.9,89.7]
[7] (86.9,89.7] (81.3,84.1] (86.9,89.7] (86.9,89.7]
Levels: (81.3,84.1] (84.1,86.9] (86.9,89.7]
> plot(bins)
</source>

Plotting the data couldn't be simpler with '''plot(bins)'''!

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html

==Linear Regression==

<source>
> plot(cars)
> res=lm(dist ~ speed, data=cars)
> abline(res)
</source>

[[Image:R-lm(cars)-abline.png|400px|thumbnail|center|linear regression of stopping distance against speed from the built-in data set, cars]]
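
The fitted model object holds much more than the line drawn by '''abline'''. As a brief sketch, the standard extractor functions '''coef''', '''summary''' and '''predict''' can be used to inspect the fit. The speed of 20 used for the prediction is an arbitrary value chosen for illustration.

<source>
res <- lm(dist ~ speed, data=cars)

# Intercept and slope of the fitted line
coef(res)

# Fuller report: standard errors, R-squared and so on
summary(res)

# Predicted stopping distance at a speed of 20
pred <- predict(res, newdata=data.frame(speed=20))
pred
</source>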

----

'''Exercises'''
* You may wish to compare different methods of estimation. From the MASS package, you can fit a line with the '''rlm''' and '''lqs''' functions. Assuming that the three fits have been stored in '''res.lm''', '''res.rlm''' and '''res.lqs''', you can plot all the lines against the data using:
<source>
> abline(res.lm, lty=1)
> abline(res.rlm, lty=2)
> abline(res.lqs, lty=3)
> legend(x=5, y=100, legend=c("lm","rlm","lqs"), lty=c(1,2,3))
</source>
See: http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/rlm.html and http://stat.ethz.ch/R-manual/R-devel/RHOME/library/MASS/html/lqs.html.

* Weighted least squares. The '''lm''' function will accept a vector of weights, '''lm(... weights=...)'''. If given, the function will find the line of best fit by weighted least squares. Experiment with different linear model fits, given different weighting vectors. Some handy hints for creating a vector of weights:
** '''w1<-rep(0.1,50)''' will give you a vector, length 50, where each element has a value of 0.1. '''w1[1]<-10''' will then give the first element of the vector a value of 10.
** '''w2<-seq(from=0.02, to=1.0, by=0.02)''' provides a vector containing a sequence of values from 0.02 to 1.0 in steps of 0.02 (handily, again 50 in total).
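
As a starting point for these exercises, here is one possible sketch of how the fits might be created. It assumes that the MASS package is installed (it is normally distributed with R). The names '''res.lm''', '''res.rlm''' and '''res.lqs''' match those used in the plotting commands above, while '''res.w2''' is our own name for a weighted fit.

<source>
library(MASS)   # provides rlm and lqs

res.lm  <- lm(dist ~ speed, data=cars)
res.rlm <- rlm(dist ~ speed, data=cars)
set.seed(1)     # lqs may use random resampling, so fix the seed for repeatability
res.lqs <- lqs(dist ~ speed, data=cars)

# A weighted fit: rows later in the data frame count for more
w2 <- seq(from=0.02, to=1.0, by=0.02)
res.w2 <- lm(dist ~ speed, data=cars, weights=w2)

# Compare the intercepts and slopes side by side
fits <- rbind(lm=coef(res.lm), rlm=coef(res.rlm),
              lqs=coef(res.lqs), weighted=coef(res.w2))
fits
</source>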

==Significance Testing==

<source>
> boys_2=c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
> girls_2=c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
> res=var.test(boys_2,girls_2)
> res

F test to compare two variances

data: boys_2 and girls_2
F = 1.0186, num df = 9, denom df = 9, p-value = 0.9786
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2529956 4.1007126
sample estimates:
ratio of variances
1.018559
> res=t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)
> res

Two Sample t-test

data: boys_2 and girls_2
t = 0.8429, df = 18, p-value = 0.4103
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.656675 3.876675
sample estimates:
mean of x mean of y
87.48 86.3
</source>
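
The object returned by '''t.test''' is a list, so the individual results can be extracted for use later in a script. The sketch below repeats the test above and compares the p-value against a 5% significance level, which is a conventional threshold that we have assumed here.

<source>
boys_2 <- c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
girls_2 <- c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
res <- t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)

# Individual items can be extracted by name
res$statistic
res$p.value
res$conf.int

# Is the difference in means significant at the 5% level?
res$p.value < 0.05
</source>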

==Classification==

===k Nearest Neighbours===

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa (s), versicolor (c), and virginica (v).

See: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/iris.html

k-nearest neighbour classification for test set from training set: For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.

See: http://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html

<source>
library(class)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
iris3.knn <- knn(train, test, cl, k = 3, prob=TRUE)
table(predicted=iris3.knn, actual=cl)
</source>

How did we do?

<pre>
         actual
predicted  c  s  v
        c 23  0  3
        s  0 25  0
        v  2  0 22
</pre>
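
A table like this can be summarised as a single accuracy figure. In the sketch below, the counts reported above are entered by hand so that the arithmetic is repeatable; with a live result you could use the output of '''table()''' directly. The variable names are illustrative.

<source>
# The confusion matrix reported above, entered column by column
confusion <- matrix(c(23,  0,  2,
                       0, 25,  0,
                       3,  0, 22),
                    nrow=3,
                    dimnames=list(predicted=c("c", "s", "v"),
                                  actual=c("c", "s", "v")))

# Overall accuracy: correct predictions lie on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Fraction of each actual species that was correctly identified
per.species <- diag(confusion) / colSums(confusion)

accuracy
per.species
</source>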

===Classification Trees===

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

This data frame contains the following columns:
* Kyphosis: a factor with levels '''absent''' and '''present''', indicating if a kyphosis (a type of deformation) was present after the operation.
* Age: in months
* Number: the number of vertebrae involved
* Start: the number of the first (topmost) vertebra operated on.

See: http://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html

<source>
library(rpart)  # provides rpart and the kyphosis data frame
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped
plot(fit)
text(fit, use.n = TRUE)
plot(fit2)
text(fit2, use.n = TRUE)
</source>

[[Image:R-classification-tree.png|500px|thumbnail|center|Classification tree for the kyphosis data frame.]]

==Solving Systems of Linear Equations==

See, e.g.: https://source.ggy.bris.ac.uk/wiki/NumMethodsPDEs

<source>
> A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
> b <- c(5,7,8)
> solve(A,b)
[1] -15 8 2
</source>
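
It is good practice to check a solution by substituting it back into the original equations. A minimal sketch, using the matrix multiplication operator '''%*%''' and the same '''A''' and '''b''' as above:

<source>
A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
b <- c(5,7,8)
x <- solve(A, b)

# Multiply back: A %*% x should reproduce b (up to rounding error)
all.equal(as.vector(A %*% x), b)

# solve(A) with a single argument returns the inverse of A
A.inv <- solve(A)
all.equal(as.vector(A.inv %*% b), x)
</source>

For a single right-hand side, calling '''solve(A,b)''' directly is generally preferable to forming the inverse explicitly.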

=Suggested Exercises=

If you would like to work through some exercises, with model answers included, you could take a look at:
* http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/reed/rexercises.pdf

If you would prefer to noodle about with some real-world data, you could take a look at:
* http://www.theguardian.com/news/datablog/2010/oct/18/historic-government-spending-area#data

=Writing Faster R Code=

In the above sections we've introduced a number of features of R and have begun the journey to becoming a proficient and productive user of the language. In the remaining sections, we'll switch tack and focus on a question commonly asked by those beginning to use R in anger--'''"My R code is slow. How can I speed it up?"'''. In this section we'll consider the related tasks of finding which parts of your R code are responsible for the majority of the run-time, and what you can do about it.

==Profiling & Timing==

In order to remain productive (and sane, and have a social life...), it is essential that we first identify which portions of your R code are responsible for the majority of the run-time. We could spend ages optimising a portion that we ''think'' may be running slowly, but computers have the gift(!) to constantly surprise us, and if that portion of your program accounted for, say, 10% of the run-time, then you will have sweated for absolutely no useful gain.

The simplest method of investigation is to simply time the application of a function:

<source>
system.time(some.function())
</source>

You can get a more detailed analysis of a block of code using the built-in R profiler. The general pattern of invocation is:

<source>
Rprof(filename="~/rprof.out")
# Do some work
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

For example, here's an R script, '''profile.r''':
<source>
Rprof(filename="~/rprof.out")
# Create a list of 10 vectors, each of 100,000 random numbers
data <- lapply(1:10, function(x) {rnorm(100000)})
# Map a function over the list
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

Which I ran by typing:

<pre>
R CMD BATCH profile.r
</pre>

In the output file, '''profile.r.Rout''', I found the following break down:

<pre>
               self.time self.pct total.time total.pct
"simpleLoess"       4.84    88.00       5.10     92.73
"rnorm"             0.22     4.00       0.22      4.00
"loess.smooth"      0.18     3.27       5.28     96.00
</pre>

The profile tells us that the function '''simpleLoess''' takes 88% of the runtime, whereas '''rnorm''' takes only 4%.

==Preallocation of Memory==

As with other scripting languages, such as MATLAB, the simplest method that you can use to speed up your R code is to pre-allocate the storage for variables whenever possible. To see the benefits of this, consider the following two functions:

<source>
> f1 <- function() {
+ v <- c()
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

and:

<source>
> f2 <- function() {
+ v <- c(NA)
+ length(v) <- 30000
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

Timing calls to each of them shows that the pre-allocation of memory gives a whopping ~'''x30 speed-up'''. Your mileage will vary depending upon the details of your code.

<source>
> system.time(f1())
user system elapsed
1.720 0.040 1.762
> system.time(f2())
user system elapsed
0.052 0.000 0.05
</source>
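
A more common idiom for pre-allocation is to create the vector at its full length with '''numeric(n)'''. The following sketch is a variation on '''f2''' that also returns the filled vector. The function name '''f3''' and its argument '''n''' are our own, and no timings are claimed for it.

<source>
f3 <- function(n=30000) {
  v <- numeric(n)          # allocate n zeros in one go
  for (i in seq_len(n))    # seq_len(n) is safe even when n is 0
    v[i] <- i^2
  v                        # return the filled vector
}

system.time(squares <- f3())
head(squares)
</source>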

==Vectorised Operations==

The other principal method for speeding up your R code is to eliminate loops whenever you can. Many functions and operators in R will accept arrays as input, rather than just single values, and this may allow you to avoid a loop altogether. The examples in the previous section used for loops to step through an array, squaring each element. However, you can achieve the same result far more quickly by passing the array ''en masse'' to the exponentiation operator:

<source>
> system.time(v <- (1:1000000)^2)
user system elapsed
0.024 0.004 0.026
</source>

Here we've been able to square 1,000,000 items in half the time it took to process 30,000!
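
To confirm that the two approaches really are interchangeable, the sketch below computes the squares both ways and compares the results. It also shows a few other vectorised base R functions that commonly stand in for loops. The variable names are illustrative.

<source>
n <- 30000

# Explicit loop, with pre-allocation
loop.squares <- numeric(n)
for (i in seq_len(n)) loop.squares[i] <- i^2

# Vectorised: the operator is applied to every element at once
vec.squares <- (1:n)^2

# Both routes give the same answer
identical(loop.squares, vec.squares)

# Other vectorised helpers replace common loop patterns
total <- sum(vec.squares)                    # instead of accumulating in a loop
running <- cumsum(1:5)                       # running totals
labels <- ifelse(1:5 > 2, "big", "small")    # element-wise conditional
</source>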

==Calling Functions Written in a Compiled Language (e.g. C or Fortran)==

Another way to get more speed is to outsource portions of R code that are found to be slow to a compiled language, such as C or Fortran. A good starting point on this topic is:

* http://mazamascience.com/WorkingWithData/?p=1067

=R and HPC=

If you've profiled your code and tried all that you can to speed it up, as described in the previous section, you might be interested in the various initiatives that exist to run R on high performance computers, such as bluecrystal:

* http://cran.r-project.org/web/views/HighPerformanceComputing.html

As we will see in the following examples, the general approach to running R in parallel is to arrange your task so that a function is applied to a list of inputs, and then to split the list over several CPU cores or cluster worker nodes.

==Multicore==

The '''multicore''' package allows us to make use of several CPU cores within a single machine. Note, however, that the package does not work on MS Windows computers.

As an example, let's look at the use of the package's '''mclapply''' function, a multicore equivalent of R's built-in list apply mapper, '''lapply'''. I saved the following commands into an R script called '''multicore.r''':
<source>
library(multicore)
# how many cores are present?
multicore:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

And used the following submission script to run it on bluecrystal phase2:
<source>
#!/bin/bash

#PBS -l nodes=1:ppn=8,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH multicore.r
</source>

After the job had run, I got the following output in the file '''multicore.r.Rout''':
<pre>
> library(multicore)
> # how many cores are present?
> multicore:::detectCores()
[1] 8
> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.674 0.007 0.749
> # .. and secondly in parallel (using multicore, within a node)
> system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.301 0.074 0.113
</pre>

==Rmpi==

The '''Rmpi''' package allows us to create and use cohorts of message passing processes from within R. It does so by providing an interface to the MPI (Message Passing Interface) library.

In order to use the Rmpi package on BCp2, you will need the '''ofed/openmpi/gcc/64/1.4.2-qlc''' module loaded.

Here's a short example that I saved as '''Rmpi.r''':
<source>
library(Rmpi)
# spawn as many slaves as possible
mpi.spawn.Rslaves()
mpi.remote.exec(mpi.get.processor.name())
mpi.remote.exec(runif(1))
mpi.close.Rslaves()
mpi.quit()
</source>

I submitted the job to BCp2 using the following submission script:
<source>
#!/bin/bash

#PBS -l nodes=4:ppn=1,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Create a machine file (used for multi-node jobs)
cat $PBS_NODEFILE > machine.file.$PBS_JOBID

#! Disable PSM on the QLogic HCAs
export OMPI_MCA_mtl=^psm

#! Run the R script
mpirun -np 1 -machinefile machine.file.$PBS_JOBID R CMD BATCH Rmpi.r
</source>

and got the following output:
<pre>
> library(Rmpi)
> # spawn as many slaves as possible
> mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: u03n074
slave1 (rank 1, comm 1) of size 5 is running on: u03n098
slave2 (rank 2, comm 1) of size 5 is running on: u04n029
slave3 (rank 3, comm 1) of size 5 is running on: u04n030
slave4 (rank 4, comm 1) of size 5 is running on: u03n074
> mpi.remote.exec(mpi.get.processor.name())
$slave1
[1] "u03n098"

$slave2
[1] "u04n029"

$slave3
[1] "u04n030"

$slave4
[1] "u03n074"

> mpi.remote.exec(runif(1))
X1 X2 X3 X4
1 0.5154871 0.5154871 0.5154871 0.5154871
> mpi.close.Rslaves()
[1] 1
> mpi.quit()
</pre>

==Snow==

Calling MPI routines from within R may be too low level for many people to use comfortably. Happily, the '''snow''' package provides a higher level abstraction for distributed memory programming from within R.

Here's my example program that I saved as '''snow.r''':
<source>
library(snow)
# request a cluster of 3 worker nodes
cl <- makeCluster(3)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using snow, across a cluster of workers)
system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)
</source>

I ran it on BCp2 using the same submission script given for Rmpi, save for changing Rmpi.r to snow.r. The output was:

<pre>
> library(snow)
> # request a cluster of 3 worker nodes
> cl <- makeCluster(3)
Loading required package: Rmpi
3 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename machine
"u01n105" "x86_64"

[[2]]
nodename machine
"u02n014" "x86_64"

[[3]]
nodename machine
"u03n098" "x86_64"

> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.711 0.001 0.715
> # .. and secondly in parallel (using snow, across a cluster of workers)
> system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.259 0.001 0.260
> stopCluster(cl)
</pre>

==Parallel==

The '''parallel''' package is an amalgamation of functionality from the multicore and snow packages. Unlike multicore, it is available on MS Windows, although fork-based functions such as '''mclapply''' can only run serially there.

A trivial translation of our previous multicore example is:
<source>
library(parallel)
# how many cores are present?
detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

I have not been able to get a distributed memory cluster working on BCp2 using the parallel package.

=Further Reading=

* [http://shop.oreilly.com/product/9780596801717.do R in a Nutshell]
* [http://shop.oreilly.com/product/0636920021421.do Parallel R]

R1

2014-03-07T12:14:29Z

GethinWilliams: Unprotected "R1"

R1

2014-03-07T12:11:02Z

GethinWilliams: Changed protection level for "R1" ([edit=sysop] (indefinite) [move=sysop] (indefinite))

R1

2014-03-07T12:10:06Z

GethinWilliams: Changed protection level for "R1" ([edit=autoconfirmed] (indefinite) [move=autoconfirmed] (indefinite))

MATLAB1

2014-03-07T12:08:30Z

GethinWilliams: /* Finding where your code is slow */

[[category:Pragmatic Programming]]
'''An Introduction to MATLAB'''

=Introduction=

Rather than re-invent the wheel, we'll use some tried and tested tutorial material. The following notes from the Maths department at the University of Dundee are concise, comprehensive, but also easy to read:
http://www.maths.dundee.ac.uk/ftp/na-reports/MatlabNotes.pdf

Once you have read through and understood the above notes, you might like to try your hand at some example exercises:
* easier ones: http://www.facstaff.bucknell.edu/maneval/help211/basicexercises.html
* harder ones: http://www.cl.cam.ac.uk/teaching/2006/UnixTools/matlab-answers.pdf

=Hints and Tips on Performance=

A common query is, '''"How can I speed up my MATLAB code?"'''. People often go on to say that it ran fine when they were developing their code, but now that their ambition has grown and they are working on larger problems, they end up waiting for days to get a result. This is sometimes followed up by, "it'll run faster on the HPC system, right?" Well, not necessarily.

Let's try to pick some of this apart.

Several aspects of MATLAB code can really limit its performance. For loops are a common limiting factor, as is the allocation of memory on-the-fly. These limitations can often be addressed by:

* Pre-allocating memory, where appropriate.
* Replacing loops over the elements of a vector or matrix with:
** Scalar and array operations.
** Built-in functions which take vectors or matrices as arguments.

However, before we get into examples of improved code, we need to determine '''where''' your code is spending the majority of its time. It would not be sensible to invest lots of effort in re-writing a section of your program which took only 1% of the overall runtime. Accordingly, the next section focusses on methods for finding ''hot spots'' in your code:

==Finding where your code is slow==

Possibly the simplest way to assess the performance of a sequence of MATLAB operations is to employ the timing functions '''tic''' and '''toc'''. For example:

<source lang="matlab">
tic;
n=1500;
A=rand(n);
B=pinv(A);
toc
</source>

gives the result:

<pre>
Elapsed time is 2.163306 seconds.
</pre>

A more detailed analysis can be obtained from the MATLAB profiler. Let's suppose we have a function which converts Cartesian to polar coordinates:

<source lang="matlab">
function [r,theta] = cart2plr(x,y)
% cart2plr Convert Cartesian coordinates to polar coordinates
%
% [r,theta] = cart2plr(x,y) computes r and theta with
%
% r = sqrt(x^2 + y^2);
% theta = atan2(y,x);

r = sqrt(x^2 + y^2);
theta = atan2(y,x);
</source>

and we call that function a number of times in the following script:

<source lang="matlab">
profile on
for i=1:3000
cart2plr(rand(),rand());
end
profile off
profile viewer
</source>

We will be able to see the following analysis in the profile viewer window:

[[Image:MATLAB-Profiler.png|thumb|800px|none|The MATLAB profiler]]

==Preallocation of Vectors==

Memory allocation is an expensive operation. MATLAB will allow us to assign values to an array inside a loop, where the array keeps growing to accommodate all the iterations of the loop. For example:

<source lang="matlab">
for i=1:1000
vec(i) = i^2;
end
</source>

However, this flexibility will come at the cost of performance, as the frequent resizing of the container ''vec'' will incur many requests for additional memory for storage. Therefore, it is wise to pre-allocate storage, if you can predict ahead of time how large the container needs to be. This is probably the simplest way in which you can speed up your MATLAB code.

To demonstrate the benefit of pre-allocation, consider the following two MATLAB scripts.

'''noprealloc.m''':
<source lang="matlab">
tic;
for i=1:3000,
for j=1:3000,
x(i,j)=i+j;
end
end
toc
</source>

'''prealloc.m''':
<source lang="matlab">
tic;
x=zeros(3000);
for i=1:3000,
for j=1:3000,
x(i,j)=i+j;
end
end
toc
</source>

When we run these two scripts (on BCp2), we see a ''significant'' difference in the runtime:

<pre>
>> noprealloc
Elapsed time is 14.317089 seconds.
>> prealloc
Elapsed time is 0.279115 seconds.
</pre>

==Scalar and Array Operators==

If you would like to apply a scalar operation to every element of a vector, '''vec''' (say, multiply each element by 3), then you do not need to write a loop.

Replace:
<source lang="matlab">
for i = 1:length(vec)
vec(i) = vec(i) * 3;
end
</source>

with:
<source lang="matlab">
vec = vec*3
</source>

Similarly, if you have two vectors or matrices '''of the same size''', you can perform element-by-element operations using, e.g.

<source lang="matlab">
m3 = m1 - m2
</source>

Note that array versions of the multiplication, division and exponentiation operators are '''.*''', '''./''' and '''.^''', respectively.

If you wish to apply the same function to all the elements of an array or vector, then you can pass the whole array as an argument to the function. If you write your own functions, ensure that the operators that you use inside the function can handle vectors or matrices.

==Built-in Functions==

MATLAB contains a number of built-in functions which can save you from writing a loop. Examples include:

* '''sum''' and '''prod''': which compute the sum or product, respectively, of all the elements of a vector.
* '''cumsum''' and '''cumprod''': both return a vector and are the cumulative counterparts of '''sum''' and '''prod'''.
* '''min''' and '''max'''.
* '''any''' and '''all''': will return true if any or all of the elements of a vector or matrix are true (non-zero), respectively.
* '''find''': returns the indices of a vector that satisfy the given expression. For example, '''find(vec > 7)''' returns the indices of all elements of vec that are greater than 7.

==MEX Files==

Another route to higher performance is to outsource an identified bottleneck in your MATLAB code to a piece of compiled code written in C/C++ or Fortran. This is the MEX file approach. A good introduction to creating MEX files is:

* http://classes.soe.ucsc.edu/ee264/Fall11/cmex.pdf

Fortran1

2014-03-03T15:14:49Z

GethinWilliams: /* Containers and the Types of Things */

[[Category:Pragmatic Programming]]
'''Fortran1: The Basics'''
=Getting the content for the practical=
We'll forge our path through the verdant garden of '''Fortran90''' using a number of examples. To get your copy of these examples from the version control repository, log in to your favourite Linux machine (perhaps dylan), and type:

<pre>
svn co https://svn.ggy.bris.ac.uk/subversion-open/fortran1/trunk fortran1
</pre>

=hello, world=

Without further ado, and in keeping with the most venerable of traditions, let's meet our first example--"hello, world":

<source>
cd fortran1/examples/example1
</source>

You can compile the program by typing:

<source>
gfortran hello_world.f90 -o hello_world.exe
</source>

and run it by typing:

<source>
./hello_world.exe
</source>

'''Bingo!''' You've just compiled and run a Fortran90 program, perhaps your first. Hurrah! We're on our way:) Everybody whoop! Yeehah!

OK, OK...you'd better rein in your excitement. This is serious you know!:)

Enough of the magic, let's take a look inside the source code file. Take a look at the contents of hello_world.f90, using '''cat''', '''less''', '''more''' or your favourite text editor, and you'll see:

<source lang="fortran">
!
! This is a comment line.
! Below is a simple 'hello, world' program written in Fortran90.
! It illustrates creating a main 'program' unit together
! with good habits, such as using 'implicit none' and comments.
!
program hello_world
implicit none
write(*,*) "hello, world"
end program hello_world
</source>

We have:
#some comment lines, giving us a helpful narrative
#the start of the '''main program unit'''
#the '''implicit none''' statement (more of that in the next section, but suffice to say, every well dressed Fortran program should have one)
#a '''write''' statement, printing our greeting to the screen
#and last, but not least, the end of the main program.

This is all pretty straightforward, right? Open up your text editor and try changing the greeting, just for the heck of it. Recompile (using the '''gfortran''' command above) and re-run it. We'll adopt a similar strategy for all the other examples we'll meet. If you ever want to get back to the original version of a program, just type:

<source>
svn revert hello_world.f90
</source>

Although this has all been fairly painless, we have made a very significant step--we are now editing, compiling and running Fortran programs. All the rest is basically just details!:)

=Containers and the Types of Things=

As fun as "hello, world" was, let's spice things up a little. For instance, let's introduce some variables. We'll need to move to the next example:

<source>
cd ../example2
</source>

Fortran90 has several types of built-in, or '''intrinsic''', variables. Take a look in '''basic_types.f90''':

<source lang="fortran">
character :: sex ! a letter e.g. 'm' or 'f'
character(len=12) :: name ! a string
logical :: wed ! married?
integer :: numBooks ! must be a whole number
real :: height ! e.g. 1.83 m (good to include units in comment)
complex :: z ! real and imaginary parts
</source>

This set of types suffices for a great many programs. The above are all single entities. We'll meet arrays of things in a couple of examples' time. In ''Fortran2'', we'll also meet user-defined types. These allow us to group instances of intrinsic types together, forming new kinds of thing--new types. User-defined types are the ''bee's knees'' and can make programs much easier to work with. We'll leave the details to that later course, however.

The above snippet shows some variable '''declarations''', along with helpful comments. It's good practice to comment your declarations, as a programmer new to your code (or even yourself in a couple of months' time) can have a hard time figuring out what is supposed to be stored in such-and-such a variable. While we're on the topic, it's also good practice to give your variables meaningful names, even if they are long. Trust me, a bit more typing now, perhaps, but a lot less head-scratching later on over what is stored in the inspiringly named '''xbNew''', or '''z2'''!

It's often a good idea to give variables an initial value when we declare them (working with uninitialised variables is another common source of bugs). This time we're looking at '''more_types.f90''':

<source lang="fortran">
character :: nucleotide = 'A' ! DNA has A,C,G & T
character(len=50) :: infile = 'yourData.nc', outfile = "myData.nc"
logical :: initialised = .true. ! or .false.
real :: solConst = 1.37 ! Solar 'constant' in kW/m^2
complex :: sqrtMinusOne = (0.0,1.0) ! (real,imag), sqrd gives (-1.0,0.0)
</source>

Fortran90 also allows us to give variables certain attributes. For example, from '''pitfalls.f90''':

<source lang="fortran">
real,parameter :: pi = 3.14159 ! a fixed constant
real(kind=8) :: totPrecip ! this is preferred to 'double precision'
</source>

The '''parameter''' attribute tells you, me and Fortran that '''pi''' is a '''constant'''. It's fixed and it's a compile-time error if we try to change it. This is a good thing, since we can catch nasty bugs that can creep in that way. We never want pi to be anything other than pi, right?! Assigning '''parameter''' attributes to quantities we know are constant is an example of '''defensive programming''', or ''bug avoidance''!

By default, reals in Fortran are represented using 4 bytes of memory. With most compilers, the addition of '''(kind=8)''' gives us an 8-byte real, often referred to as a '''double precision''' real. Fortran does have a ''double precision'' type, but the '''kind''' attribute is preferred. (Many compilers also support the promotion of all default, 4-byte reals and integers in your program through flags, typically named ''-r8'' and ''-i8'', respectively.) 8-byte reals can be useful as accumulators, since they can help to reduce rounding errors.
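
As a minimal sketch of the point about accumulators (this is not one of the course examples, and the program and variable names are made up for illustration), the following adds 0.1 to a running total ten million times, once with a default real and once with a wider real. It asks for the wider kind through the standard '''selected_real_kind''' intrinsic, rather than assuming that kind 8 always means 8 bytes:

<source lang="fortran">
program accumulate
  implicit none
  ! ask for at least 15 significant decimal digits (typically an 8-byte real)
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: n = 10000000
  real               :: sumSingle = 0.0     ! default real accumulator
  real(kind=dp)      :: sumDouble = 0.0_dp  ! wider accumulator
  integer            :: ii

  do ii = 1, n
     sumSingle = sumSingle + 0.1
     sumDouble = sumDouble + 0.1_dp
  end do

  write (*,*) "default real accumulator:", sumSingle
  write (*,*) "wider real accumulator:  ", sumDouble
  write (*,*) "exact answer:            ", 1000000.0_dp
end program accumulate
</source>

The default real total will typically drift visibly away from one million, whereas the wider accumulator stays very close to it.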

The third program illustrates some arithmetic pitfalls--'''beware!''':
* '''integer division''' and its truncation
* '''casting''' as a solution to mismatched types
* (integer) '''overflow'''
* (real number) '''underflow'''
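
To give a flavour of these pitfalls before you open the file, here is a small stand-alone sketch. It is not the contents of '''pitfalls.f90''' and the names are invented for illustration. It shows integer division, casting with '''real()''', and the intrinsics '''huge''' and '''tiny''', which report the limits beyond which overflow and underflow occur:

<source lang="fortran">
program pitfalls_sketch
  implicit none
  integer :: numerator = 7, denominator = 2
  real    :: smallest

  ! integer division truncates: the fractional part is thrown away
  write (*,*) "7/2 as integers:  ", numerator/denominator
  ! casting one operand to real gives the quotient we probably wanted
  write (*,*) "7/2 after casting:", real(numerator)/denominator
  ! huge() reports the largest value a type can hold;
  ! going beyond it is integer overflow
  write (*,*) "largest default integer:", huge(numerator)
  ! tiny() reports the smallest normalised positive real;
  ! going below it is underflow
  smallest = tiny(smallest)
  write (*,*) "smallest normalised default real:", smallest
end program pitfalls_sketch
</source>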

Let's have a play with the programs. You can compile them by typing:

<source>
make
</source>

and run them by typing, e.g.:

<source>
./basic_types.exe
</source>

or,

<source>
./more_types.exe
</source>

or,

<source>
./pitfalls.exe
</source>

Although we compiled our first example ''by-hand'', we'll be using '''make''' to compile the rest of our example programs, so you won't have to worry about that side of things. (If you'd like to know more about make, you can take a look at [[Make|our course on make]], presented in a very similar style to this here excursion into Fortran90.)

Now modify the programs (remembering, e.g., '''svn revert basic_types.f90''' if you make a mess). Try giving values to various types and also using operators such as:

* '''arithmetic''': +, -, *, /, ** (exponentiation)
* '''functions''': sin, cos, floor (rounding down)
* '''logic''': .and., .or., .not., .eqv., .neqv.
* and you'll meet many more in the future..

For example:

* Calculate the [http://en.wikipedia.org/wiki/Sphere#Surface_area_of_a_sphere surface area of a sphere].
* Is sine of pi divided by four really the same as one over root two?
* What is the truth table of the NAND operator?

----

About that mysterious '''implicit none''' which we keep seeing at the start of our programs. Let me tell you a story: Once upon a time, the kings and queens of the garden of Fortran, being a generous and well meaning bunch, decided to save the programmers the bother of specifying the type of their variables. "Don't bother!", they said, "just be sure to give them appropriate names, and we'll sort out the rest." "Thank you. Thank you very much", said the programmers, and it was decreed that the names of integers should start with the letters i, j, k, l, m, or n, and the names of reals would start with the other letters. Anyhow, this all seemed like a great wheeze and everybody was very happy. This lasted for a while, but after a time, the programmers got complacent and forgot how to name things and it all got rather messy. Integers became reals, reals became integers and before they knew it, the programmers had '''bugs all over the place!''' Boo. The kings and queens conferred on the matter and they realised that they had made a grave error in their gift of implicit typing. However, they couldn't undo what they had done. Instead, they had to persuade the programmers to give it up voluntarily. "Anything, anything!", they pleaded "to get rid of '''all these bugs!'''", and so it passed that every good programmer agreed to put '''implicit none''' at the top of every program they wrote, and they all lived happily ever after.
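
To see the moral of the story in action, here is a tiny sketch (the program and variable names are made up) which deliberately leaves out '''implicit none''', so that the first letter of each name decides its type:

<source lang="fortran">
program implicit_bug
  ! deliberately NO 'implicit none' here, so names decide types
  ratio  = 2.7    ! starts with 'r': implicitly a real
  number = 2.7    ! starts with 'n': implicitly an integer, so truncated
  write (*,*) "ratio is: ", ratio
  write (*,*) "number is:", number
end program implicit_bug
</source>

The program compiles without complaint and ''number'' quietly ends up holding 2. Add '''implicit none''' at the top and the compiler will instead refuse both undeclared names, which is exactly the kind of telling-off we want.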

=If, Do, Select and Other Ways to Control the Flow=

Programs are like cooking recipes. We've covered the how much of this and how much of that part. However, we also need to cover the doing bit--do this and then do that, and for how long etc. This is generically termed '''control flow'''. Fortran gives us a fairly rich language with which to describe how we would like things done. Next example:

<pre>
cd ../example3
</pre>

Take a look inside '''control.f90'''. We have some variable declarations and then we encounter our first '''conditional''':

<source lang="fortran">
if (initialised .eqv. .true.) then
write (*,*) "The variable 'area' is initialised and has the value:", area
else
write (*,*) "The variable 'area' is NOT initialised and has the value:", area
end if
</source>

This is fairly self-explanatory--'''if'''..something is the case..'''then'''..'''else'''.. You can also have an '''elseif'''. In fact you can have as many of those as you like. You can also have as many statements inside each clause as you like. Talk about spoiled!

'''Select''' is another control structure:
<source lang="fortran">
select case (nucleotide)
case ('A')
write (*,*) "nucleotide is Adenine"
case ('G')
write (*,*) "nucleotide is Guanine"
case ('T')
write (*,*) "nucleotide is Thymine"
case ('C')
write (*,*) "nucleotide is Cytosine"
case default
write (*,*) "default is the catch-all. 'Fall-through' can be a nasty bug."
stop
end select
</source>

This is a neat way of saying, "if..then..elseif..else.." The '''default''' clause at the bottom is important. Dropping it can lead to '''fall-through''', where none of the cases is triggered and the program carries on regardless. This is rarely what you want and can lead to nasty bugs.

Our first '''do loop''' is of the form:

<source lang="fortran">
do ii=1,5
write (*,*) "Do loop counter ii is:", ii
end do
</source>

Again, this is fairly readable. '''ii''' is first given the value of 1, the body of the loop is executed and then we go back to the top again. Except this time we '''increment''' the counter (ii) by the '''default amount''', which is 1. When we're at the top and we take '''ii''' past 5, we stop the loop and move on to the next statement past the '''end do'''. You're allowed as many statements inside the loop as you like. Indeed, you're allowed more loops, conditions, loops in loops, just about anything you can think of! '''Beware''', however: debugging a huge construct of nested this, that and the other can be beyond the limits of human patience. Keep your programs simple and you will be happier for it.

The other loop examples show variations in the stopping condition and '''stride''' (i.e. how much we increment by), including counting backwards, and stopping before we've even started!
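
The variations just described can be sketched as follows. This is a stand-alone illustration with invented names, not a copy of what is in '''control.f90''':

<source lang="fortran">
program loop_variants
  implicit none
  integer :: ii

  ! stride of 2: visits 1, 3, 5
  do ii = 1, 5, 2
     write (*,*) "stride 2, ii is:", ii
  end do

  ! negative stride: counts backwards 3, 2, 1
  do ii = 3, 1, -1
     write (*,*) "backwards, ii is:", ii
  end do

  ! the start is already past the end, so the body never runs
  do ii = 5, 1
     write (*,*) "this line is never printed"
  end do
end program loop_variants
</source>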

You'll notice that all the loops we've seen thus far will run for a pre-determined number of iterations. What if we don't know how many iterations we want ahead of time? Some languages, such as C for example, include a '''while loop''' for this purpose. In Fortran we still use '''do''', but omit any start and end conditions. Note that we must include an '''exit condition''', if we want such a loop to terminate:

<source lang="fortran">
threshold = 0.5
ii = 0
do
random_value = rand()
if (random_value .gt. threshold) then
print*, 'counter is:', ii, random_value, '>', threshold, 'stopping.'
exit
end if
print*, 'counter is:', ii, random_value, '<', threshold
ii = ii+1
end do
</source>

As before, compile it, run it and generally muck about. These are only a few of the control structures provided to us by Fortran. You'll find that you can do most things with these three, however.

Before leaving this example, let's consider '''if''' tests that compare floating point numbers for equality. Remember that there is an infinite set of real numbers and so a computer, with its limited precision, can only approximate them. For example, '''how would a computer represent 10/3'''? It follows that we should be careful when we need to test whether a real number is equal to some value, such as '''3.3''' (see the last section of the program). A common way around this problem is to subtract one real from the other and to compare the '''absolute''' value of the result to some small threshold (to account for rounding errors).

<source lang="fortran">
if ((abs(val-ref)) .lt. 0.0001) then
...
end if
</source>
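
Here is the same idea as a complete program that you can compile and run. It is only a sketch: the names and the tolerance of 0.0001 are illustrative choices, and a suitable threshold depends on the size of the numbers you are comparing:

<source lang="fortran">
program compare_reals
  implicit none
  real, parameter :: tolerance = 0.0001
  real :: val, ref

  ref = 3.3
  val = 1.1 * 3.0   ! mathematically 3.3, but rounding may intervene

  ! the fragile test: exact equality
  if (val == ref) then
     write (*,*) "exactly equal"
  else
     write (*,*) "not exactly equal, difference is:", val - ref
  end if

  ! the robust test: equal to within a small threshold
  if (abs(val - ref) .lt. tolerance) then
     write (*,*) "equal to within the tolerance"
  end if
end program compare_reals
</source>

Which branch of the first test is taken may vary with your compiler and its settings, which is rather the point. The second test gives the same answer everywhere.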

'''Exercises'''

* Write a select statement which prints out the names of the numbers 1 to 10 in different languages. Write a loop to trigger each of these print statements.
* Write a nested loop: Write a pair of nested do loops which count between 1 and 3. Print the values of the two counters in the inner-most loop. Think about where your 'end do's should go.
* Write a nested if statement: For example, create character variables called 'vehicle', 'colour' and 'size'. We could represent a small red car as; vehicle = 'c', colour = 'r' and size = 's'. Arrange for your nested if to print 'eureka' given a big green train. Think about where your 'end if' statements should go.

=Not one, Many!=

That was fun. Back to thinking about variables for a moment:

<pre>
cd ../example4
</pre>

Last time we declared just one thing of a given type. Sometimes we're greedy! Sometimes we want more! To be fair, some things are naturally represented by a vector or a matrix. Think of values on a grid, solutions to linear systems, points in space, transformations such as scaling or rotation of vectors. For these sorts of things the kings and queens of Fortran gave us programmers '''arrays'''. Take a look inside '''static_array.f90''':

<source lang="fortran">
real, dimension(4) :: tinyGrid = (/1.0, 2.0, 3.0, 4.0/)
real, dimension(2,2) :: square = 0.0 ! 2 rows, 2 columns, init to all zeros
real, dimension(3,2) :: rectangle ! 3 rows, 2 columns, uninit
</source>

The syntax here reads, "we'll have a one-dimensional array (i.e. vector) of 4 reals called tinyGrid, please, and we'll set the initial values of the cells in that array to be 1.0, 2.0, 3.0 and 4.0, respectively."

For the second and third declarations, we're asking for two-dimensional arrays: one with two rows and two columns, called ''square'', and one with three rows and two columns, called ''rectangle''.

The program then goes on to print out the contents of ''tinyGrid''.

If we want to access a single element of an array, we can do so by specifying its indices:

<source lang="fortran">
write (*,*) "square(1,2) is the top right corner:", square(1,2)
</source>

Fortran90 provides a couple of handy '''intrinsic routines''' for determining the '''size''' (how many cells in total) and the '''shape''' (number of dimensions and the ''extent'' of each dimension) of an array.

<source lang="fortran">
write (*,*) "size(rectangle) gives total number of elements:", size(rectangle)
write (*,*) "shape(rectangle) gives rank and extent:", shape(rectangle)
</source>

Fortran90 also allows us to '''reshape''' an array on-the-fly. Using this intrinsic, we can copy the values from ''tinyGrid'' into ''square''. Neat.

<source lang="fortran">
square = reshape(tinyGrid,(/2,2/))
</source>

Fortran also provides us with a rather rich set of operators (+, -, *, / etc.) for array-valued variables. Have a go at playing with these. If you know some linear algebra, you're going to have a great time with this example!
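
As a starting point for that play, here is a small sketch (with made-up array names). One thing to watch for: the arithmetic operators work element-by-element on arrays, so '''*''' is not the matrix product of linear algebra. For that we have the intrinsic '''matmul''':

<source lang="fortran">
program array_ops
  implicit none
  real, dimension(2,2) :: a, b

  a = reshape((/1.0, 2.0, 3.0, 4.0/), (/2,2/))
  b = 2.0                      ! every element set to 2.0

  write (*,*) "a + b      :", a + b          ! element-by-element sum
  write (*,*) "a * b      :", a * b          ! element-by-element product
  write (*,*) "matmul(a,b):", matmul(a, b)   ! the linear algebra product
  write (*,*) "sum(a)     :", sum(a)         ! adds up every element
end program array_ops
</source>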

One thing to bear in mind when we consider 2D arrays is that Fortran stores them in memory as a 1D array and 'unwraps' them according to '''column-major order''':

[[Image:columnMajor.jpg|300px|thumbnail|centre|a 2D array 'unwrapped' into a 1D array using column-major order.]]
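
A short sketch (again with invented names) makes the ordering visible. '''reshape''' fills an array in storage order, so the values 1 to 6 run down each column in turn. The nested loop shows the habit that follows from this: put the first index in the innermost loop, so that successive iterations touch neighbouring memory locations:

<source lang="fortran">
program column_major
  implicit none
  integer, dimension(2,3) :: grid
  integer :: row, col

  ! reshape fills the array in storage order: down each column in turn
  grid = reshape((/1, 2, 3, 4, 5, 6/), (/2,3/))

  do row = 1, 2
     write (*,*) "row", row, ":", grid(row,:)
  end do

  ! first index in the innermost loop, to walk through memory in order
  do col = 1, 3
     do row = 1, 2
        grid(row,col) = grid(row,col) * 10
     end do
  end do
  write (*,*) "in storage order:", grid
end program column_major
</source>

The first row holds 1, 3 and 5 and the second row holds 2, 4 and 6.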

'''Exercises'''

* Create a 3x3 grid to store the outcome of a game of noughts-and-crosses (tic-tac-toe), populate it and print the grid to screen.
* Fortran allows you to '''slice''' arrays. For example the second column of the 2d-array 'a' is a(:,2). Print the third row from the grid above.
* Add two vectors together. Is this algebraically correct? What about adding two matrices?
* Create two 2-d arrays. Populate one randomly and create another as a mask. Combine the two matrices and print the result.

----

The ''static'' malarkey is because Fortran90 also allows us to say we want an array, but we don't know how big we want it to be yet. "We'll decide that at run-time", we programmers say. This can be handy if you're reading in some data, say a pay-roll, and you don't know how many employees you'll have from one year to the next. Fortran90 calls these '''allocatable arrays''' and we'll meet them in ''Fortran2''.

=If Things get Hectic, Outsource!=

<source>
cd ../example5
</source>

Now, as we get more ambitious, the size of our program grows. Before long, it can get unwieldy. Also we may find that we repeat ourselves. We do the same thing twice, three times. Heck, many times! Now is the time to start breaking your program into chunks, re-using some from time-to-time, making it more manageable. Fortran gives us two routes to chunk-ification, '''functions''' and '''subroutines'''.

Let's deal with subroutines first. In '''procedures.f90''', scroll down a bit and you can see:

<source lang="fortran">
subroutine mirror(len,inArray,outArray)

implicit none

! dummy variables
integer, intent(in) :: len
character,dimension(len),intent(in) :: inArray
character,dimension(len),intent(out) :: outArray

! local variables
integer :: ii
integer :: lucky = 3 ! notice scope of this identically named variable

do ii=1,len
outArray(len+1-ii) = inArray(ii)
end do

write (*,*) "'lucky' _inside_ subroutine is:", lucky

end subroutine mirror
</source>

Now, note that this is '''outside''' of the main program unit. (In principle, we could hive this off into another source code file, but we'll leave that discussion until ''Fortran2''.) Notice also that we have a shiny '''implicit none''' resplendent at the top of the subroutine. Overall, it looks pretty similar to how a main program unit might look, but with the addition of '''arguments''': those fellows in the parentheses after the subroutine name. The declaration part also lists those arguments and we've commented that these are so-called '''dummy variables'''.

We also see an attribute that we've not seen before, called '''intent'''. This is a very handy tool for defensive programming (aka ''bug avoidance'', remember). Using ''intent'' we can say that the integer ''len'' is an input and as such we're not going to try to change it. Likewise for the character array ''inArray''. It would be a compile-time error if we did. We also state that the character array ''outArray'' is an output and we're going to give it a new value ''come what may''!

We also have some variables that are '''local'''. Interestingly enough, one of our local variables, the integer called ''lucky'', has exactly the same name as a variable in the main program unit. When we run the program, however, we will see that the two do not interfere with each other. This is down to their '''scope'''. The scope of lucky in the main program is all and only the main program unit and the scope of lucky in the subroutine is all and only the subroutine. We say that the main program unit and the subroutine units have different '''name spaces'''.

Well, we've seen a lot of new syntax and concepts in all that. Useful ones though. This program is small and artificial, so it's hard to see the benefits just yet. You will, however, as your programs grow. The subroutine is '''called''' from the main program, funnily enough by a '''call''' statement. Notice how the arguments passed to the subroutine in the call statement also have different names in the main program and the subroutine. That's scope again.
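
To see a call statement and scope at work in the smallest possible setting, here is a stand-alone sketch. The subroutine ''swap'' and all the variable names are invented for illustration and are not part of '''procedures.f90'''. It also uses '''intent(inout)''', a third flavour of intent for arguments that are both read and changed:

<source lang="fortran">
program call_sketch
  implicit none
  integer :: spare = 0      ! the main program's own 'spare'
  real    :: first = 1.0, second = 2.0

  call swap(first, second)
  write (*,*) "after swap:", first, second
  write (*,*) "'spare' in the main program is still:", spare
end program call_sketch

subroutine swap(a, b)
  implicit none
  ! dummy variables: named differently from the actual arguments
  real, intent(inout) :: a, b
  ! local variable: nothing to do with 'spare' in the main program
  real :: spare

  spare = a
  a = b
  b = spare
end subroutine swap
</source>

The two variables called ''spare'' even have different types, yet they live quite happily side by side.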

Functions are similar to, yet different from, subroutines. Notice that we don't call them in the same way. We still pass arguments, but the function '''returns''' a value, and so we can place a function call on the right-hand side (RHS) of an assignment. Typically we would write a function for a smaller body of work. We can potentially invoke it in a neater way from our main program, however.

Looking at the body of the function:

<source lang="fortran">
function isPrime(num)

implicit none

! return and dummy variables
logical :: isPrime
integer, intent(in) :: num

select case (num)
case (2,3,5,7,11,13,17,19,23,29)
isPrime = .true.
case default
isPrime = .false.
end select

end function isPrime
</source>

we can see that we have a declaration for a variable with the same name as the function. The type of this variable is the '''return type''' of the function. Indeed the value of this variable when we reach the bottom of the function is the value passed back to the calling routine. '''Note that we can call functions and subroutines from other functions and subroutines etc.''' in a nested fashion.

The last thing of note is the funky '''interface''' structure at the top of the main program:

<source lang="fortran">
interface
function isPrime(num)
logical :: isPrime
integer, intent(in) :: num
end function isPrime
end interface
</source>

Our function '''definition''' is outside of the main program and so is said to be '''external'''. The main program unit needs to know about it, however, and an interface structure is a good way to do this as it prompts Fortran to check that all the arguments match up between the call and the definition. It's the way to do it..for now. (We'll meet Fortran90 modules in ''Fortran2'', which will give us a neater way to perform the same checks.)
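
Putting the pieces together, a complete program with an external function and a matching interface might look like the sketch below. The function ''toKelvin'' is made up for this illustration and does not appear in '''procedures.f90''':

<source lang="fortran">
program use_function
  implicit none

  ! tell the main program what the external function looks like
  interface
     function toKelvin(celsius)
       real :: toKelvin
       real, intent(in) :: celsius
     end function toKelvin
  end interface

  real :: boiling

  ! a function call can sit on the right-hand side of an assignment..
  boiling = toKelvin(100.0)
  write (*,*) "water boils at (K):", boiling
  ! ..or be used directly inside an expression
  write (*,*) "span of liquid water (K):", toKelvin(100.0) - toKelvin(0.0)
end program use_function

function toKelvin(celsius)
  implicit none
  ! return and dummy variables
  real :: toKelvin
  real, intent(in) :: celsius

  toKelvin = celsius + 273.15
end function toKelvin
</source>

Try passing an integer, such as '''toKelvin(100)''', and see what the compiler has to say now that it can check the call against the interface.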

Try running the program and writing some functions and subroutines of your own.

'''Exercises'''

* Write a simple error handling subroutine that '''stop'''s the program after printing a message, which is passed in as an argument.
* Does the area of a circle increase linearly with an increase in radius? How about circumference? Write some functions and a loop to investigate.
* What happens if you write a function which calls itself?

=Input and output=

OK, say we want some '''permanence'''? Perhaps we want to record the outputs of our program for some time, or we want to read the same values into a program each time that we run it. Storing data in files is the way to go.

==File i/o ==
<pre>
cd ../example6
</pre>

In this example we'll see how to read and write data to and from files. The code for this example is contained in '''file_access.f90'''.

Before we can either read from- or write to- a file, we must first open it. Funnily enough, we can do this using the '''open''' statement, e.g.:

<source lang="fortran">
open(unit=19,file="output.txt",form='formatted',status='old',iostat=ios)
</source>

* '''unit''' is a positive integer given to the file so that we can refer to it later. Several numbers are reserved, however, so we must avoid 5 (keyboard), 6 (screen), 101 (stdout) & 102 (punch card!).
* '''file''' is a character string (can be 'literal') containing the name of the file to open.
* '''form''' is the file format--'formatted' (text file) or 'unformatted' (binary file).
* '''status''' specifies the behaviour if the file exists:
*# '''old''' the file must exist
*# '''new''' the file must not exist prior to being opened
*# '''replace''' the old file will be overwritten
* '''iostat''' names an integer variable which is set to zero on success and to a non-zero value in case of an error, e.g. if the file cannot be opened.

When you have finished with a file, you must '''close''' it:
<source lang="fortran">
close(19)
</source>

Note that if you need to go back to the beginning of a file, you don't have to close it and open it again, you can '''rewind''' it:
<source lang="fortran">
rewind(19)
</source>

To '''write''' to a file, the syntax is, e.g.:
<source lang="fortran">
write(unit=19,fmt=*) array1, array2
</source>

where,
* '''array1''' & '''array2''' are variables we wish to write to the file.
* '''unit''' is the unit number of the file we want to write to.
* '''fmt''' is a '''format string'''. We have chosen '*', which gives us default settings. However, we can gain greater control by specifying in detail what we will be writing and how we would like it formatted. Take a look in the example '''file_access.f90''' for examples.

Similarly, we use a '''read''' statement to extract data from a file, e.g.:
<source lang="fortran">
read(unit=19,fmt=*,iostat=ios) var1, var2
</source>

Note that it is very important to use the '''iostat''' specifier to make sure that we handle any errors, and don't press on regardless into oblivion! e.g.:

<source lang="fortran">
if (ios /= 0) then
print*,'ERROR: could not open file'
stop
end if
</source>

Among other things, the program '''file_access.exe''' writes the same data to (i) a text file and (ii) a binary file:

<source lang="fortran">
open(unit=56,file='output.txt',status='replace',iostat=ios,form='formatted')
if (ios /= 0) then
print*,'ERROR: could not open output.txt'
stop
end if

open(unit=57,file='output.bin',status='replace',iostat=ios,form='unformatted')
if (ios /= 0) then
print*,'ERROR: could not open output.bin'
stop
end if

! write array1 and array2 to the output files (and then compare the file sizes)
write(57) array1
write(57) array2
close(57)

write(56,*) array1
write(56,*) array2
close(56)
</source>

If you do a long listing ('ls -l'), you'll see the size difference:

<pre>
-rw-r--r-- 1 fred users 220032 Feb 8 11:01 output.bin
-rw-r--r-- 1 fred users 825002 Feb 8 11:01 output.txt
</pre>

==Using character strings as file names==
<pre>
cd ../example7
</pre>

'''filename.f90''' in <tt>example7</tt> shows a little trick for people who output a lot of data and need to manipulate a lot of files. It is possible to use a write statement to output a number (or anything else) to a character string and to subsequently use that string as a filename to open, close and output to many files on the fly:

<source lang="fortran">
character(len=8) :: filename
integer :: ii, ios

! define a format to write a filename
10 format('output',i2.2)

do ii=1,20
write (unit=filename,fmt=10) ii
open(20,file=filename,status='replace',form='formatted',iostat=ios)
...
</source>

==Namelists==
And all of a sudden, we're at our last example:

<pre>
cd ../example8
</pre>

In '''namelists.f90''', we'll look at another approach to file input and output. Fortran provides a way of grouping variables together into a set, called a '''namelist''', that is read into or written out from our program ''en masse''. This is a common requirement, for example when reading the parameters of a model. The statement:

<source lang="fortran">
namelist /my_bundle/ numBooks,initialised,name,vec
</source>

sets it up.

Fortran further provides us with built-in mechanisms for reading a namelist from, or writing one to, a file.

First, we must open the file of course:

<source lang="fortran">
open(unit=56,file='input.nml',status='old',iostat=ios)
if (ios /= 0) then
print*,'ERROR: could not open namelist file'
stop
end if
</source>
Note that we made sure to tell Fortran that this is an '''old''' file, i.e. that it already exists, and to check the error code. In the case that the '''open''' operation failed, we've asked the program to halt with an error.

Now, assuming that we've opened the file OK, we proceed to read its contents. Fortran makes this rather easy for us, given that the information is contained in a namelist:

<source lang="fortran">
read(UNIT=56,NML=my_bundle,IOSTAT=ios)
if (ios /= 0) then
print*,'ERROR: could not read example namelist'
stop
else
close(56)
end if
</source>

In the '''read''' statement, we told Fortran that we wanted to read a namelist and that the group of variables should be flagged in the file as '''my_bundle'''. Again we've checked the error status and decided to halt with an error should the read statement fail for any reason. Take a look at the contents of '''input.nml'''. See that it is ASCII text and that the variables (with their values) do not need to be listed in the same order in the file as they are in the program:

<pre>
&my_bundle
numBooks=6
vec=1.0 2.0 3.0 4.0
initialised=.false.
name='romeo'
&end
</pre>

The program proceeds to print the values to screen, demonstrating that they have indeed come from the file, then assigns new values to the variables and writes the modified values to a new file, called '''modified.nml'''. Compare the two ASCII text files. Try running the program a second time and you will receive an error, telling you that the output could not be written, since a file called modified.nml exists and we expressly stated that it was to be a '''new''' file. Delete modified.nml, try again and it will succeed. Try changing the namelist, values, filenames etc. Go for it, make a mess! You'll learn a lot from it:)

'''Exercises'''

* Write some code to read in a colour (red, blue, green etc.) from a namelist and then print out the names of all the [http://en.wikipedia.org/wiki/Railway_engines_(Thomas_and_Friends) railway engines] (thomas, percy, gordon etc.) that match.
* Read in the dimensions of an ellipse from file and print out its area.

= To go further =
The [[:category:Pragmatic Programming | Pragmatic Programming]] course continues with [[Linux2]], a look at some of the more advanced but very useful Linux concepts.

Now that you are getting familiar with Fortran, go a bit further by reading [[Fortran2]].

A useful Fortran textbook is called '''Fortran 90 Programming''' by Ellis, Philips & Lahey. Take a look at the [[A_Good_Read|'A Good Read?']] page for more details.

2013-11-28T16:26:48Z

GethinWilliams: /* Examples of Common Tasks */

[[category:Pragmatic Programming]]
'''Open Source Statistics with R'''

=Introduction=

R is a mature, open-source (i.e. free!) statistics package, with an intuitive interface, excellent graphics and a vibrant community constantly adding new methods for the statistical investigation of your data to the library of packages available.

The goal of this tutorial is to introduce you to the R package, and not to be an introductory course in statistics.

If you are working on a Linux system, you will typically start R from the command line. On a Windows machine, or a Mac, you will typically start up R in some form of GUI. However you get R started, you will have access to an R command prompt. The good news is that the examples below will all work at the R command prompt, however you gained access to it.

Further resources:

* The R manual is a great resource for learning R: http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
* Some excellent examples of using R can also be found at: http://msenux.redwoods.edu/math/R/ and http://www.r-tutor.com/

=Getting Started=

The very simplest thing we can do with R is to perform some arithmetic at the command prompt:

<source>
> phi <- (1+sqrt(5))/2
> phi
[1] 1.618034
</source>

Parentheses are used to modify the usual order of precedence of the operators ('''/''' will typically be evaluated before '''+'''). Note the '''[1]''' accompanying the returned value. All numbers entered at the console are interpreted as vectors. The '[1]' indicates that the line in question displays the vector of values starting at the first index. We can use the handy sequence function, '''seq''', to create a vector containing more than a single element:

<source>
> odds <- seq(from=1, to=67, by=2)
> odds
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67
</source>

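In the above example, '''<-''' assigns the result to a variable, while '''=''' binds values to the named arguments of '''seq'''. At the prompt, '''=''' can also be used for assignment, although '''<-''' is the more common convention.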

Vectors are commonly used data structures in R:

<source>
coords.bris <- c(51.5, 2.6)
</source>

As are matrices:

<source>
> magic <- matrix(data=c(2,7,6,9,5,1,4,3,8),nrow=3,ncol=3)
> magic
     [,1] [,2] [,3]
[1,]    2    9    4
[2,]    7    5    3
[3,]    6    1    8
</source>

Here, the '''c''' function combines the arguments given in the parentheses into a vector, which '''matrix''' uses to fill the array column by column. We can access portions of the array using square brackets. For example, we can access the first row using the '''[1,]''' notation, and similarly the second column using '''[,2]'''. Since this is a 3x3 magic square, the numbers in both slices should sum to 15:

<source>
> sum(magic[1,])
[1] 15
> sum(magic[,2])
[1] 15
</source>
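
As a quick sketch (not part of the original examples), the same check can be run over every row, column and diagonal at once. It uses only the base R functions '''rowSums''', '''colSums''' and '''diag'''; the variable names are our own:

<source>
# Rebuild the magic square used above; matrix() fills column by column
magic <- matrix(data=c(2,7,6,9,5,1,4,3,8), nrow=3, ncol=3)

row.totals <- rowSums(magic)            # one total per row
col.totals <- colSums(magic)            # one total per column
diag.total <- sum(diag(magic))          # leading diagonal
anti.total <- sum(diag(magic[, 3:1]))   # other diagonal, via reversed columns

print(row.totals)
print(col.totals)
print(c(diag.total, anti.total))
</source>

If the square really is magic, every one of these totals will be 15.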

Single elements and ranges can also be accessed:

<source>
> magic[2,2]
[1] 5
> magic[2:3,2:3]
     [,1] [,2]
[1,]    5    3
[2,]    1    8
</source>

R also provides '''arrays''', which have more than two dimensions, and '''lists''' to hold heterogeneous collections.

An example list:

<source>
> list.r4 <- list(name="Radio4", frequency="93.7")
</source>

We can access its items in several ways:

<source>
> list.r4$frequency
[1] "93.7"
> list.r4[1]
$name
[1] "Radio4"

> list.r4[[1]]
[1] "Radio4"
</source>

A very commonly used data structure is the '''data frame''', which R uses to store tabular data. Given several vectors of equal length, we can collate them into a data frame:

<source>
> country <- c("USA", "China", "GB")
> gold <- c(46, 38, 29)
> silver <- c(29, 27, 17)
> bronze <- c(29, 23, 19)
> medals.2012 <- data.frame(country, gold, silver, bronze)
> medals.2012
  country gold silver bronze
1     USA   46     29     29
2   China   38     27     23
3      GB   29     17     19
</source>

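We can access columns of a data frame using the '''$''' operator. (The output below comes from an older version of R, in which data.frame() converted strings to factors by default. From R 4.0.0 onwards, the country column is a plain character vector unless you pass stringsAsFactors=TRUE.)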

<source>
> medals.2012$country
[1] USA China GB
Levels: China GB USA
> medals.2012$gold
[1] 46 38 29
</source>
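
The '''$''' operator also works on the left-hand side of an assignment, and square brackets can select rows by a logical condition. The following is a small sketch along those lines; the '''total''' column and the threshold of 30 golds are our own choices for illustration:

<source>
# Rebuild the medals table from the text
medals.2012 <- data.frame(country=c("USA", "China", "GB"),
                          gold=c(46, 38, 29),
                          silver=c(29, 27, 17),
                          bronze=c(29, 23, 19))

# Add a derived column: total medals per country
medals.2012$total <- medals.2012$gold + medals.2012$silver + medals.2012$bronze

# Keep only the rows with more than 30 golds (note the trailing comma: all columns)
big.winners <- medals.2012[medals.2012$gold > 30, ]

print(medals.2012)
print(big.winners$country)
</source>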

=Standard Graphics: A taster=

One aspect which makes R popular is its graphing functions. R also has some very handy built-in data sets, and we'll use these to demonstrate just a small fraction of R's graphing abilities.

First up is the humble '''plot()''' function. Given a data frame of points, such as one charting the relationship between temperature and the vapour pressure of mercury, it will give us a (handily labelled) scatter plot:

<source>
> plot(pressure)
</source>

See the gallery below for all the plots created in this section.

The plot function will also accept a time-series (another class of object recognised by R) and will sensibly join the points with a line:

<source>
> plot(co2)
> class(co2)
[1] "ts"
</source>

Pie charts are easily constructed. In this case, to show the relative proportions of electricity generated from different sources in the UK in 2011 (source: https://www.gov.uk/government/.../5942-uk-energy-in-brief-2012.pdf):

<source>
> uk.electricity.sources.2011 <- c(41,29,18,5,4,2,1)
> names(uk.electricity.sources.2011) <- c("Gas", "Coal", "Nuclear", "Hydro & other", "Wind", "Imports", "Oil")
> pie(uk.electricity.sources.2011, main="UK Electricity Generating Mix, 2011", col=rainbow(7))
</source>

Next, let's create a bar chart of monthly average precipitation falling here in the fair city of Bristol (source: http://www.worldweatheronline.com):

<source>
> bristol.precip <- c(82.9, 56.1, 59.2, 69, 50.8, 50.9, 50.8, 74.8, 74.7, 91.1, 94.5, 93.6)
> names(bristol.precip) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> barplot(bristol.precip,
+ main="Average Monthly Precipitation in Bristol",
+ ylab="Mean precipitation (mm)",
+ ylim=c(0,100),
+ col=c("darkblue"))
</source>

[http://en.wikipedia.org/wiki/Box_plot 'Box and whisker' plots] are useful ways to graph the quartiles of some data. In this case, the fuel efficiencies of various US cars, circa 1974:

<source>
> boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
+ xlab="Number of Cylinders", ylab="Miles Per Gallon")
</source>

R includes a very useful help facility. In the case of the '''filled.contour()''' plotting function, the help page includes an example of its use to plot the topography of a volcano in Auckland, NZ:

<source>
> ?filled.contour
</source>

<gallery widths=300px heights=300px perrow=3>
File:Vapour-pressure.png|Vapour pressure of mercury against temperature
File:Mauna-loa.png|CO2 concentrations measured at Mauna-Loa between 1959 and 1997
File:Pie.png|The UK's electricity generating mix, 2011
File:Barplot.png|Average monthly precipitation in Bristol
File:Boxplot.png|Range of fuel efficiencies for different engine sizes
File:Maunga-Whau.png|Topography of Maunga Whau volcano in Auckland
</gallery>

There are many more example plots--complete with the R code required to create the plots (at the bottom of the page, after the comments)--on the following web page:
* http://gallery.r-enthusiasts.com/thumbs.php

=Loops=

A simple '''for''' loop:

<source>
> for (ii in seq(1,10)) print(ii)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
</source>

Some more exotic counting:

<source>
> for (ii in seq(from=10, to=0, by=-2)) print(ii)
[1] 10
[1] 8
[1] 6
[1] 4
[1] 2
[1] 0
</source>

'''while''' loops are for when we don't know the number of iterations in advance:

<source>
> ii <- runif(1,0,1)
> ii
[1] 0.3998513
> while (ii < 0.5) {print(ii); ii <- runif(1,0,1)}
[1] 0.3998513
[1] 0.05469244
> ii
[1] 0.8265036
</source>
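
Because '''runif''' draws random numbers, the session above will differ each time it is run. A minimal sketch of making such a loop repeatable, assuming an arbitrary seed of 42 and a counter variable of our own, is:

<source>
set.seed(42)            # arbitrary seed, chosen only so that reruns match
count <- 0              # number of times the loop body has run
ii <- runif(1, 0, 1)
while (ii < 0.5) {
  count <- count + 1
  ii <- runif(1, 0, 1)  # draw again until we get a value of at least 0.5
}
print(count)
print(ii)
</source>

Whatever the seed, the loop can only finish once '''ii''' is at least 0.5.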

=Functions=

You can define your own functions in R, using the '''function''' keyword. For example, Pythagoras' Theorem:

<source>
> hypotenuse <- function(x, y) {sqrt(x^2 + y^2)}
</source>

The braces ({}) are optional when the body is a single expression, but add clarity.

To call the function:

<source>
> hypotenuse(3,4)
[1] 5
</source>

We can provide default values for the arguments, which can be overridden for any given invocation of the function:

<source>
> hypot2 <- function(x=3 ,y=4) {sqrt(x^2 + y^2)}
> hypot2()
[1] 5
> hypot2(12,16)
[1] 20
> hypot2(y=16, x=12)
[1] 20
</source>

You can see that the order of the arguments is respected, unless the names are given, in which case the order can be changed.

Longer functions can be spread over several lines. We can also use the '''return''' keyword to control which value is returned by the function:

<source>
> hypot3 <- function(x=3 ,y=4) {
+ x_sq <- x^2
+ y_sq <- y^2
+ return( sqrt(x_sq + y_sq) )}
> hypot3(6,8)
[1] 10
</source>

You can check on the contents of a function by just typing its name (without parentheses):

<source>
> hypot3
function(x=3 ,y=4) {
x_sq <- x^2
y_sq <- y^2
return( sqrt(x_sq + y_sq) )}
</source>

Or just check the arguments, using the '''args''' function. (The body of the function in general is reported as NULL):

<source>
> args(hypot3)
function (x = 3, y = 4)
NULL
</source>

=Packages=

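Contributed packages are listed at http://cran.r-project.org/

Let's install the '''multicore''' package, which gives us access to functions that run on the multiple processor cores we often find in our computers these days. (Note that multicore has since been archived on CRAN; its functions, including '''mclapply''', now live in the '''parallel''' package that ships with R.)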

<source>
> install.packages("multicore")
</source>

Et voila! It is done.

We can list the packages that are installed and available to load into our workspace:

<source>
> library()
</source>

To load one of them into the current session, we type e.g.:

<source>
> library(multicore)
</source>

Now, an example of using a function from the multicore package. The '''lapply''' function, which is included in the standard R core, will map a given function over a list of inputs, giving a list of the function outputs in return. For example, we can map a squaring function over the list of integers from 1 to 3:

<source>
> lapply(1:3, function(x) {x^2})
</source>

which gives us the list:

<pre>
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9
</pre>

Now, we can do the same work in parallel using:

<source>
> mclapply(1:3, function(x) {x^2})
</source>
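
On recent versions of R, the same comparison can be sketched with the '''parallel''' package, which ships with R and provides its own '''mclapply'''. The choice of two cores, and the fallback to a single core on Windows (where forking is not available), are assumptions made for this illustration:

<source>
library(parallel)   # bundled with R, no installation needed

# mclapply relies on forking, so use a single core on Windows
n.cores <- if (.Platform$OS.type == "windows") 1 else 2

squares.serial   <- lapply(1:3, function(x) {x^2})
squares.parallel <- mclapply(1:3, function(x) {x^2}, mc.cores=n.cores)

# Both approaches should give the same list of results
print(all.equal(squares.serial, squares.parallel))
</source>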

=Reading Data from File=

R provides some very useful functions for reading and writing data from/to file.

==Text Files==

Let's start with text files. Perhaps the simplest function is '''read.table()''', which suits data organised into a file such that it looks like a table with column headings. If I have a text file with the following contents:

<pre>
country gold silver bronze
"USA" 46 29 29
"China" 38 27 23
"Great Britain" 29 17 19
"Russian Federation" 24 26 32
"Republic of Korea" 13 8 7
"Germany" 11 19 14
</pre>

It will be a simple matter to use the '''read.table()''' function to load the data into R:

<source>
> medals.2012 <- read.table("medals.txt", header=TRUE)
> medals.2012
             country gold silver bronze
1                USA   46     29     29
2              China   38     27     23
3      Great Britain   29     17     19
4 Russian Federation   24     26     32
5  Republic of Korea   13      8      7
6            Germany   11     19     14
</source>

There is a corresponding '''write.table()''' function to export the contents of a data frame into a text file.

CSV files can be easily handled by specifying '''sep=","''' as an argument to read.table(). However, for convenience, there are also '''read.csv()''' and '''write.csv()''' functions defined. For example:

<source>
> write.csv(medals.2012,"medals.csv")
</source>

Gives us the file, '''medals.csv''', with the contents:

<pre>
"","country","gold","silver","bronze"
"1","USA",46,29,29
"2","China",38,27,23
"3","Great Britain",29,17,19
"4","Russian Federation",24,26,32
"5","Republic of Korea",13,8,7
"6","Germany",11,19,14
</pre>
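
Notice the empty first heading: by default, '''write.csv()''' also writes the row names. A short sketch of a round trip without them follows, using a temporary file and a cut-down table of our own so that nothing in your working directory is touched:

<source>
medals <- data.frame(country=c("USA", "China", "Great Britain"),
                     gold=c(46, 38, 29),
                     silver=c(29, 27, 17),
                     bronze=c(29, 23, 19))

csv.file <- tempfile(fileext=".csv")             # throwaway file name
write.csv(medals, csv.file, row.names=FALSE)     # suppress the row-name column
medals.back <- read.csv(csv.file, stringsAsFactors=FALSE)

print(readLines(csv.file))                       # the raw text that was written
print(all.equal(medals$gold, medals.back$gold))  # values survive the round trip
unlink(csv.file)                                 # tidy up
</source>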

==Binary Files==

The '''save()''' function will store an R data structure in binary form:

<source>
> save(medals.2012,file="medals.RData")
</source>

<pre>
gethin@gethin-desktop:~$ file medals.RData
medals.RData: gzip compressed data, from Unix
</pre>

There is, of course, a corresponding function to load such data:

<source>
> load("medals.RData")
</source>
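
To convince ourselves that the pair really does restore an object, here is a small sketch that saves a data frame, deletes it from the workspace and loads it back. The temporary file and the two-row table are assumptions for illustration only:

<source>
medals <- data.frame(country=c("USA", "China"), gold=c(46, 38))

rdata.file <- tempfile(fileext=".RData")   # throwaway file name
save(medals, file=rdata.file)
rm(medals)                                 # the object is now gone from the workspace

restored <- load(rdata.file)               # load() returns the names it restored
print(restored)
print(exists("medals"))
unlink(rdata.file)                         # tidy up
</source>

Note that '''load()''' recreates objects under their original names, silently overwriting any object of the same name in your workspace.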

==Databases==

If you would like to read and write data directly from/to a database, there are several packages to help you. See http://cran.r-project.org/doc/manuals/r-release/R-data.html#Relational-databases for more information.

==NetCDF==

The [http://cran.r-project.org/web/packages/ncdf/index.html '''ncdf''' package] provides an interface to NetCDF files. Before installing the package, you will need the Unidata NetCDF libraries installed on your system. On Linux, the standard package managers conveniently provide this. Note that you will need the 'development' packages. Once the prerequisites are satisfied, you can use the standard R command to install the package from CRAN:

<source>
> install.packages("ncdf")
</source>

=Examples of Common Tasks=

==Preparing Data==

===Sorting===

For example:

<source>
> railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
> sort(railway.engines)
[1] "edward" "gordon" "henry" "james" "thomas"
</source>

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/sort.html
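
'''sort''' works on a single vector. To sort the rows of a data frame by one of its columns, a common approach is the base R function '''order''', sketched below. The engine numbers are included purely as illustrative data:

<source>
engines <- data.frame(name=c("thomas", "henry", "gordon", "edward", "james"),
                      number=c(1, 3, 4, 2, 5),
                      stringsAsFactors=FALSE)

# order() returns the row permutation that sorts the chosen column
by.number    <- engines[order(engines$number), ]
by.name.desc <- engines[order(engines$name, decreasing=TRUE), ]

print(by.number)
print(by.name.desc)
</source>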

===Random Sampling===

For example:

<source>
> railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "edward"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "thomas"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
</source>

See: http://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
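
Two variations that often come in handy are sketched here: fixing the random seed so that a draw can be repeated, and sampling without replacement to shuffle a vector. The seed value is arbitrary:

<source>
railway.engines <- c("thomas", "henry", "gordon", "edward", "james")

# The same seed gives the same draw
set.seed(2013)
draw.1 <- sample(railway.engines, 3)   # 3 distinct engines (replace=FALSE by default)
set.seed(2013)
draw.2 <- sample(railway.engines, 3)
print(identical(draw.1, draw.2))

# With no size given, sample() returns a random permutation of its input
shuffled <- sample(railway.engines)
print(shuffled)
</source>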

==Linear Regression==

<source>
> plot(cars)
> res=lm(dist ~ speed, data=cars)
> abline(res)
</source>

[[Image:R-lm(cars)-abline.png|400px|thumbnail|center|linear regression of stopping distance against speed from the built-in data set, cars]]
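
Beyond drawing the line, the fitted model object can be interrogated directly. The sketch below uses the standard accessors '''coef''', '''predict''' and '''summary'''; the two speeds chosen for prediction are arbitrary:

<source>
res <- lm(dist ~ speed, data=cars)

# Intercept and slope of the fitted line
print(coef(res))

# Predicted stopping distances at two (arbitrarily chosen) speeds
new.speeds <- data.frame(speed=c(10, 20))
print(predict(res, newdata=new.speeds))

# Proportion of the variance in dist explained by the model
print(summary(res)$r.squared)
</source>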

----

'''Exercises'''
* You may wish to compare different methods of estimation. From the MASS package, you can fit a line with the '''rlm''' and '''lqs''' functions. Having stored the three fits as res.lm, res.rlm and res.lqs, you can plot all the lines against the data using:
<source>
> abline(res.lm, lty=1)
> abline(res.rlm, lty=2)
> abline(res.lqs, lty=3)
> legend(x=5, y=100, legend=c("lm","rlm","lqs"), lty=c(1,2,3))
</source>
See: http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/rlm.html and http://stat.ethz.ch/R-manual/R-devel/RHOME/library/MASS/html/lqs.html.

* Weighted least squares. The '''lm''' function will accept a vector of weights, '''lm(... weights=...)'''. If given, the function will optimise the line of best fit according to the equation of weighted least squares. Experiment with different linear model fits, given different weighting vectors. Some handy hints for creating a vector of weights:
** '''w1<-rep(0.1,50)''' will give you a vector, length 50, where each element has a value of 0.1. '''w1[1]<-10''' will give the first element of the vector a value of 10.
** '''w2<-seq(from=0.02, to=1.0, by=0.02)''' provides a vector containing a sequence of values from 0.02 to 1.0 in steps of 0.02 (handily, again 50 in total).

==Significance Testing==

<source>
> boys_2=c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
> girls_2=c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
> res=var.test(boys_2,girls_2)
> res

F test to compare two variances

data: boys_2 and girls_2
F = 1.0186, num df = 9, denom df = 9, p-value = 0.9786
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2529956 4.1007126
sample estimates:
ratio of variances
1.018559
> res=t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)
> res

Two Sample t-test

data: boys_2 and girls_2
t = 0.8429, df = 18, p-value = 0.4103
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.656675 3.876675
sample estimates:
mean of x mean of y
87.48 86.37
</source>
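
The objects returned by these tests are lists, so individual results can be pulled out for use later in a script. Here is a minimal sketch, assuming the conventional (but arbitrary) significance level of 0.05:

<source>
boys_2  <- c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
girls_2 <- c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
res <- t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)

print(names(res))      # the components available in the result
print(res$p.value)     # the p-value on its own
print(res$conf.int)    # the confidence interval for the difference in means

alpha <- 0.05          # assumed significance level
if (res$p.value < alpha) {
  print("Reject the null hypothesis of equal means")
} else {
  print("No evidence against the null hypothesis of equal means")
}
</source>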

==Classification==

===k Nearest Neighbours===

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa (s), versicolor (c), and virginica (v).

See: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/iris.html

k-nearest neighbour classification for test set from training set: For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.

See: http://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html

<source>
library(class)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
iris3.knn <- knn(train, test, cl, k = 3, prob=TRUE)
table(predicted=iris3.knn, actual=cl)
</source>

How did we do?

<pre>
         actual
predicted  c  s  v
        c 23  0  3
        s  0 25  0
        v  2  0 22
</pre>
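
The table can be boiled down to a single accuracy figure: the correct predictions lie on the diagonal. This is a sketch of that calculation; the seed is an arbitrary choice of ours, set because ties are broken at random:

<source>
library(class)

train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test  <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl    <- factor(c(rep("s",25), rep("c",25), rep("v",25)))

set.seed(1)   # arbitrary seed, so that random tie-breaking is repeatable
iris3.knn <- knn(train, test, cl, k = 3)
confusion <- table(predicted=iris3.knn, actual=cl)

# Correct predictions sit on the diagonal of the confusion table
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
</source>

For the table shown above, this works out as (23 + 25 + 22) / 75, or about 93%.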

===Classification Trees===

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

This data frame contains the following columns:
* Kyphosis: a factor with levels ''absent'' and ''present'', indicating if a kyphosis (a type of deformation) was present after the operation.
* Age: in months
* Number: the number of vertebrae involved
* Start: the number of the first (topmost) vertebra operated on.

See: http://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html

<source>
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped
plot(fit)
text(fit, use.n = TRUE)
plot(fit2)
text(fit2, use.n = TRUE)
</source>

[[Image:R-classification-tree.png|500px|thumbnail|center|Classification tree for the kyphosis data frame.]]
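
Once fitted, a tree can be used to classify observations with '''predict'''. The sketch below simply predicts back onto the training data, which gives an optimistic picture and is shown only to illustrate the calls:

<source>
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Predicted class for each of the 81 children in the training data
pred <- predict(fit, type = "class")
confusion <- table(predicted = pred, actual = kyphosis$Kyphosis)
print(confusion)

# Complexity table, useful when deciding how far to prune the tree
printcp(fit)
</source>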

==Solving Systems of Linear Equations==

See, e.g.: https://source.ggy.bris.ac.uk/wiki/NumMethodsPDEs

<source>
> A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
> b <- c(5,7,8)
> solve(A,b)
[1] -15 8 2
</source>
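
It is good practice to check a numerical solution by substituting it back into the original equations. A minimal sketch, using matrix multiplication ('''%*%''') and our own variable names:

<source>
A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
b <- c(5,7,8)
x <- solve(A, b)

# Substitute the solution back in: A %*% x should reproduce b
residual <- A %*% x - b
print(max(abs(residual)))

# A non-zero determinant confirms that the system has a unique solution
print(det(A))
</source>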

=Suggested Exercises=

If you would like to work through some exercises, with model answers included, you could take a look at:
* http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/reed/rexercises.pdf

=Writing Faster R Code=

In the above sections we've introduced a number of features of R and have begun the journey to becoming a proficient and productive user of the language. In the remaining sections, we'll switch tack and focus on a question commonly asked by those beginning to use R in anger--'''"My R code is slow. How can I speed it up?"'''. In this section we'll consider the related tasks of finding which bits of your R code are responsible for the majority of the run-time and what you can do about it.

==Profiling & Timing==

In order to remain productive (and sane, and have a social life...), it is essential that we first identify which portions of your R code are responsible for the majority of the run-time. We could spend ages optimising a portion that we ''think'' may be running slowly, but computers have the gift(!) to constantly surprise us, and if that portion of your program accounts for, say, 10% of the run-time, then even a perfect optimisation can never save you more than that 10%, and you will have sweated for very little gain.

The simplest method of investigation is to simply time the application of a function:

<source>
system.time(some.function())
</source>

You can get a more detailed analysis of a block of code using the built-in R profiler. The general pattern of invocation is:

<source>
Rprof(filename="~/rprof.out")
# Do some work
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

For example, here's an R script, '''profile.r''':
<source>
Rprof(filename="~/rprof.out")
# Create a list of 10 vectors, each holding 100,000 random numbers
data <- lapply(1:10, function(x) {rnorm(100000)})
# Map a function over the list, in serial
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof()
summaryRprof(filename="~/rprof.out")
</source>

Which I ran by typing:

<pre>
R CMD BATCH profile.r
</pre>

In the output file, '''profile.r.Rout''', I found the following break down:

<pre>
               self.time self.pct total.time total.pct
"simpleLoess"       4.84    88.00       5.10     92.73
"rnorm"             0.22     4.00       0.22      4.00
"loess.smooth"      0.18     3.27       5.28     96.00
</pre>

The profile tells us that the function '''simpleLoess''' takes 88% of the runtime, whereas '''rnorm''' takes only 4%.

==Preallocation of Memory==

As with other scripting languages, such as MATLAB, the simplest method that you can use to speed up your R code is to pre-allocate the storage for variables whenever possible. To see the benefits of this, consider the following two functions:

<source>
> f1 <- function() {
+ v <- c()
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

and:

<source>
> f2 <- function() {
+ v <- c(NA)
+ length(v) <- 30000
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

Timing calls to each of them shows that the pre-allocation of memory gives a whopping ~'''x30 speed-up'''. Your mileage will vary depending upon the details of your code.

<source>
> system.time(f1())
user system elapsed
1.720 0.040 1.762
> system.time(f2())
user system elapsed
0.052 0.000 0.05
</source>
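
An arguably more idiomatic way to pre-allocate is with '''numeric(n)''', which creates a vector of n zeros in one step. The function below is a sketch of that variant; the name '''f3''' and the argument '''n''' are our own, and unlike f1 and f2 it returns the vector so that the result can be checked:

<source>
f3 <- function(n = 30000) {
  v <- numeric(n)            # allocate all n elements up front
  for (i in seq_len(n))      # seq_len(n) is safe even when n is 0
    v[i] <- i^2
  v                          # return the filled vector
}

print(f3(5))
print(system.time(f3()))
</source>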

==Vectorised Operations==

The other principal method for speeding up your R code is to eliminate loops whenever you can. Many functions and operators in R will accept arrays as input, rather than just single values, and this may allow you to avoid a loop altogether. The examples in the previous section used for loops to step through an array, squaring each element. However, you can achieve the same result far more quickly by passing the array ''en masse'' to the exponentiation operator:

<source>
> system.time(v <- (1:1000000)^2)
user system elapsed
0.024 0.004 0.026
</source>

Here we've been able to square 1,000,000 items in about half the time it took the pre-allocated loop to process 30,000!
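
When rewriting a loop in vectorised form, it is worth confirming that both versions agree before trusting the faster one. The following sketch does so for the squaring example, and also shows '''vapply''', a base R function that can help when no directly vectorised operator is available. The variable names are our own:

<source>
n <- 30000

# Loop version, with pre-allocated storage
loop.result <- numeric(n)
for (i in seq_len(n)) loop.result[i] <- i^2

# Vectorised version: the operator is applied to the whole vector at once
vec.result <- (1:n)^2

# vapply version: applies a function to each element and checks the result type
vapply.result <- vapply(1:n, function(i) i^2, numeric(1))

print(all.equal(loop.result, vec.result))
print(all.equal(loop.result, vapply.result))
</source>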

==Calling Functions Written in a Compiled Language (e.g. C or Fortran)==

Another way to get more speed is to outsource portions of R code that are found to be slow to a compiled language, such as C or Fortran. A good starting point on this topic is:

* http://mazamascience.com/WorkingWithData/?p=1067

=R and HPC=

If you've profiled your code and tried all that you can to speed it up, as described in the previous section, you might be interested in the various initiatives that exist to run R on high performance computers, such as bluecrystal:

* http://cran.r-project.org/web/views/HighPerformanceComputing.html

As we will see in the following examples, the general approach to running R in parallel is to arrange your task so that a function is applied to a list of inputs, and then to split the list over several CPU cores or cluster worker nodes.

==Multicore==

The '''multicore''' package allows us to make use of several CPU cores within a single machine. Note, however, that the package does not work on MS Windows computers.

As an example, let's look at the use of the package's '''mclapply''' function, a multicore equivalent of R's built-in list apply mapper, '''lapply'''. I saved the following commands into an R script called '''multicore.r''':
<source>
library(multicore)
# how many cores are present?
multicore:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

And used the following submission script to run it on bluecrystal phase2:
<source>
#!/bin/bash

#PBS -l nodes=1:ppn=8,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH multicore.r
</source>

After the job had run, I got the following output in the file '''multicore.r.Rout''':
<pre>
> library(multicore)
> # how many cores are present?
> multicore:::detectCores()
[1] 8
> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.674 0.007 0.749
> # .. and secondly in parallel (using multicore, within a node)
> system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.301 0.074 0.113
</pre>

==Rmpi==

The '''Rmpi''' package allows us to create and use cohorts of message passing processes from within R. It does so by providing an interface to the MPI (Message Passing Interface) library.

In order to use the Rmpi package on BCp2, you will need the '''ofed/openmpi/gcc/64/1.4.2-qlc''' module loaded.

Here's a short example that I saved as '''Rmpi.r''':
<source>
library(Rmpi)
# spawn as many slaves as possible
mpi.spawn.Rslaves()
mpi.remote.exec(mpi.get.processor.name())
mpi.remote.exec(runif(1))
mpi.close.Rslaves()
mpi.quit()
</source>

I submitted the job to BCp2 using the following submission script:
<source>
#!/bin/bash

#PBS -l nodes=4:ppn=1,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Create a machine file (used for multi-node jobs)
cat $PBS_NODEFILE > machine.file.$PBS_JOBID

#! Run the R script
mpirun -np 1 -machinefile machine.file.$PBS_JOBID R CMD BATCH Rmpi.r
</source>

and got the following output:
<pre>
> library(Rmpi)
> # spawn as many slaves as possible
> mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: u03n074
slave1 (rank 1, comm 1) of size 5 is running on: u03n098
slave2 (rank 2, comm 1) of size 5 is running on: u04n029
slave3 (rank 3, comm 1) of size 5 is running on: u04n030
slave4 (rank 4, comm 1) of size 5 is running on: u03n074
> mpi.remote.exec(mpi.get.processor.name())
$slave1
[1] "u03n098"

$slave2
[1] "u04n029"

$slave3
[1] "u04n030"

$slave4
[1] "u03n074"

> mpi.remote.exec(runif(1))
X1 X2 X3 X4
1 0.5154871 0.5154871 0.5154871 0.5154871
> mpi.close.Rslaves()
[1] 1
> mpi.quit()
</pre>

==Snow==

Calling MPI routines from within R may be too low level for many people to use comfortably. Happily, the '''snow''' package provides a higher level abstraction for distributed memory programming from within R.

Here's my example program that I saved as '''snow.r''':
<source>
library(snow)
# request a cluster of 3 worker nodes
cl <- makeCluster(3)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using snow, across a cluster of workers)
system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)
</source>

I ran it on BCp2 using the same submission script given for Rmpi, save for changing Rmpi.r to snow.r. The output was:

<pre>
> library(snow)
> # request a cluster of 3 worker nodes
> cl <- makeCluster(3)
Loading required package: Rmpi
3 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename machine
"u01n105" "x86_64"

[[2]]
nodename machine
"u02n014" "x86_64"

[[3]]
nodename machine
"u03n098" "x86_64"

> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.711 0.001 0.715
> # .. and secondly in parallel (using snow, across a cluster of workers)
> system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.259 0.001 0.260
> stopCluster(cl)
</pre>

==Parallel==

The '''parallel''' package is an amalgamation of functionality from the multicore and snow packages. Unlike the multicore package, it can be used on an MS Windows machine, although there the fork-based '''mclapply''' is limited to a single core and the socket-cluster functions (e.g. '''parLapply''') are the route to parallelism.

A trivial translation of our previous multicore example is:
<source>
library(parallel)
# how many cores are present?
parallel:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

I have not been able to get a distributed memory cluster working on BCp2 using the parallel package.

=Further Reading=

* [http://shop.oreilly.com/product/9780596801717.do R in a Nutshell]
* [http://shop.oreilly.com/product/0636920021421.do Parallel R]

R1

2013-11-28T16:18:11Z

GethinWilliams: /* Examples of Common Tasks */

[[category:Pragmatic Programming]]
'''Open Source Statistics with R'''

=Introduction=

R is a mature, open-source (i.e. free!) statistics package, with an intuitive interface, excellent graphics and a vibrant community constantly adding new methods for the statistical investigation of your data to the library of packages available.

The goal of this tutorial is to introduce you to the R package, and not to be an introductory course in statistics.

If you are working on a Linux system, you will typically start R from the command line. On a Windows machine, or a Mac, you will typically start up R in some form of GUI. However you get R started, you will have access to an R command prompt. The good news is that the examples below will all work at the R command prompt, however you gained access to it.

Further resources:

* The R manual is a great resource for learning R: http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
* Some excellent examples of using R can also be found at: http://msenux.redwoods.edu/math/R/ and http://www.r-tutor.com/

=Getting Started=

The very simplest thing we can do with R is to perform some arithmetic at the command prompt:

<source>
> phi <- (1+sqrt(5))/2
> phi
[1] 1.618034
</source>

Parentheses are used to modify the usual order of precedence of the operators ('''/''' will typically be evaluated before '''+'''). Note the '''[1]''' accompanying the returned value. All numbers entered at the console are interpreted as a vector. The '[1]' indicates that the line in question is displaying the vector of values starting at the first index. We can use the handy sequence function to create a vector containing more than a single element:

<source>
> odds <- seq(from=1, to=67, by=2)
> odds
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67
</source>

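In the above example, '''<-''' is used for assignment, whereas '''=''' binds values to the named arguments of '''seq()'''. (The '''=''' operator can also be used for assignment at the prompt, but '''<-''' is the more common idiom.)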

Vectors are commonly used data structures in R:

<source>
coords.bris <- c(51.5, 2.6)
</source>

As are matrices:

<source>
> magic <- matrix(data=c(2,7,6,9,5,1,4,3,8),nrow=3,ncol=3)
> magic
     [,1] [,2] [,3]
[1,]    2    9    4
[2,]    7    5    3
[3,]    6    1    8
</source>

Here, the '''c''' function combines the arguments given in the parentheses into a vector, which '''matrix''' then uses to fill the array column by column. We can access portions of the array using the syntax shown in the square brackets. For example, we can access the first row using the '''[1,]''' notation, and similarly the second column using '''[,2]'''. Since this is a 3x3 magic square, the numbers in both slices should sum to 15:

<source>
> sum(magic[1,])
[1] 15
> sum(magic[,2])
[1] 15
</source>
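
As a further check, here is a small sketch (the variable names are illustrative and not part of the original session) that tests every row, every column and both diagonals of the square at once, using the built-in '''rowSums''', '''colSums''' and '''diag''' functions:

<source>
# Rebuild the 3x3 magic square from the text (filled column by column)
magic <- matrix(data=c(2,7,6,9,5,1,4,3,8), nrow=3, ncol=3)
row.totals <- rowSums(magic)                 # one total per row
col.totals <- colSums(magic)                 # one total per column
diag.total <- sum(diag(magic))               # leading diagonal
anti.total <- sum(magic[cbind(1:3, 3:1)])    # anti-diagonal, picked out by (row, column) pairs
# TRUE only if every total equals the magic constant, 15
is.magic <- all(c(row.totals, col.totals, diag.total, anti.total) == 15)
is.magic
</source>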

Single elements and ranges can also be accessed:

<source>
> magic[2,2]
[1] 5
> magic[2:3,2:3]
     [,1] [,2]
[1,]    5    3
[2,]    1    8
</source>

R also provides '''arrays''', which have more than two dimensions, and '''lists''' to hold heterogeneous collections.

An example list:

<source>
> list.r4 <- list(name="Radio4", frequency="93.7")
</source>

We can access its items in several ways:

<source>
> list.r4$frequency
[1] "93.7"
> list.r4[1]
$name
[1] "Radio4"

> list.r4[[1]]
[1] "Radio4"
</source>

A very commonly used data structure is the '''data frame''', which R uses to store tabular data. Given several vectors of equal length, we can collate them into a data frame:

<source>
> country <- c("USA", "China", "GB")
> gold <- c(46, 38, 29)
> silver <- c(29, 27, 17)
> bronze <- c(29, 23, 19)
> medals.2012 <- data.frame(country, gold, silver, bronze)
> medals.2012
  country gold silver bronze
1     USA   46     29     29
2   China   38     27     23
3      GB   29     17     19
</source>

We can access columns of a data frame using the '''$''' operator:

<source>
> medals.2012$country
[1] USA China GB
Levels: China GB USA
> medals.2012$gold
[1] 46 38 29
</source>
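
Data frames can also be extended and filtered. The following is a minimal sketch along those lines: the '''total''' column and the threshold of 30 gold medals are choices made purely for illustration.

<source>
# Rebuild the medals table from the text
country <- c("USA", "China", "GB")
gold <- c(46, 38, 29)
silver <- c(29, 27, 17)
bronze <- c(29, 23, 19)
medals.2012 <- data.frame(country, gold, silver, bronze)

# Add a derived column: the total medal count for each country
medals.2012$total <- with(medals.2012, gold + silver + bronze)

# Select rows with a logical condition and columns by name
big.winners <- medals.2012[medals.2012$gold > 30, c("country", "total")]
big.winners
</source>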

=Standard Graphics: A taster=

One aspect that makes R popular is its graphing functions. R also has some very handy built-in data sets--we'll use these to demonstrate just a small fraction of R's graphing abilities.

First up is the humble '''plot()''' function. Given a data frame of points, such as one charting the relationship between temperature and the vapour pressure of mercury, it will give us a (handily labelled) scatter plot:

<source>
> plot(pressure)
</source>

See the gallery below for all the plots created in this section.

The plot function will also accept a time-series (another class of object recognised by R) and will sensibly join the points with a line:

<source>
> plot(co2)
> class(co2)
[1] "ts"
</source>

Pie charts are easily constructed. In this case, to show the relative proportions of electricity generated from different sources in the UK in 2011 (source: https://www.gov.uk/government/.../5942-uk-energy-in-brief-2012.pdf):

<source>
> uk.electricity.sources.2011 <- c(41,29,18,5,4,2,1)
> names(uk.electricity.sources.2011) <- c("Gas", "Coal", "Nuclear", "Hydro & other", "Wind", "Imports", "Oil")
> pie(uk.electricity.sources.2011, main="UK Electricity Generating Mix, 2011", col=rainbow(7))
</source>

Next, let's create a bar chart of monthly average precipitation falling here in the fair city of Bristol (source: http://www.worldweatheronline.com):

<source>
> bristol.precip <- c(82.9, 56.1, 59.2, 69, 50.8, 50.9, 50.8, 74.8, 74.7, 91.1, 94.5, 93.6)
> names(bristol.precip) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
> barplot(bristol.precip,
+ main="Average Monthly Precipitation in Bristol",
+ ylab="Mean precipitation (mm)",
+ ylim=c(0,100),
+ col=c("darkblue"))
</source>

[http://en.wikipedia.org/wiki/Box_plot 'Box and whisker' plots] are useful ways to graph the quartiles of some data. In this case, the fuel efficiencies of various US cars, circa 1974:

<source>
> boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
+ xlab="Number of Cylinders", ylab="Miles Per Gallon")
</source>

R includes a very useful help facility. In the case of the '''filled.contour()''' plotting function, the help page includes an example of its use to plot the topography of a volcano in Auckland, NZ:

<source>
> ?filled.contour
</source>

<gallery widths=300px heights=300px perrow=3>
File:Vapour-pressure.png|Vapour pressure of mercury against temperature
File:Mauna-loa.png|CO2 concentrations measured at Mauna-Loa between 1959 and 1997
File:Pie.png|The UK's electricity generating mix, 2011
File:Barplot.png|Average monthly precipitation in Bristol
File:Boxplot.png|Range of fuel efficiencies for different engine sizes
File:Maunga-Whau.png|Topography of Maunga Whau volcano in Auckland
</gallery>

There are many more example plots--complete with the R code required to create the plots (at the bottom of the page, after the comments)--on the following web page:
* http://gallery.r-enthusiasts.com/thumbs.php

=Loops=

A simple '''for''' loop:

<source>
> for (ii in seq(1,10)) print(ii)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
</source>

Some more exotic counting:

<source>
> for (ii in seq(from=10, to=0, by=-2)) print(ii)
[1] 10
[1] 8
[1] 6
[1] 4
[1] 2
[1] 0
</source>

'''while''' loops are for when we don't know the number of iterations in advance:

<source>
> ii <- runif(1,0,1)
> ii
[1] 0.3998513
> while (ii < 0.5) {print(ii); ii <- runif(1,0,1)}
[1] 0.3998513
[1] 0.05469244
> ii
[1] 0.8265036
</source>

=Functions=

You can define your own functions in R, using the '''function''' keyword. For example, Pythagoras' Theorem:

<source>
> hypotenuse <- function(x, y) {sqrt(x^2 + y^2)}
</source>

The braces ({}) are optional when the body is a single expression, but add clarity.

To call the function:

<source>
> hypotenuse(3,4)
[1] 5
</source>
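
Because the arithmetic operators and '''sqrt''' work element by element, the same function can be handed vectors of side lengths without any changes. A short sketch (the triangles chosen here are arbitrary examples):

<source>
hypotenuse <- function(x, y) {sqrt(x^2 + y^2)}
# Three right-angled triangles in one call: (3,4), (5,12) and (8,15)
sides <- hypotenuse(c(3, 5, 8), c(4, 12, 15))
sides   # 5 13 17
</source>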

We can provide default values for the arguments, which can be overridden for any given invocation of the function:

<source>
> hypot2 <- function(x=3 ,y=4) {sqrt(x^2 + y^2)}
> hypot2()
[1] 5
> hypot2(12,16)
[1] 20
> hypot2(y=16, x=12)
[1] 20
</source>

You can see that the order of the arguments is respected, unless the names are given, in which case the order can be changed.

Longer functions can be spread over several lines. We can also use the '''return''' keyword to control which value is returned by the function:

<source>
> hypot3 <- function(x=3 ,y=4) {
+ x_sq <- x^2
+ y_sq <- y^2
+ return( sqrt(x_sq + y_sq) )}
> hypot3(6,8)
[1] 10
</source>

You can check on the contents of a function by just typing its name (without parentheses):

<source>
> hypot3
function(x=3 ,y=4) {
x_sq <- x^2
y_sq <- y^2
return( sqrt(x_sq + y_sq) )}
</source>

Or just check the arguments, using the '''args''' function. (It returns a function with the same arguments but an empty body, which is why NULL is printed beneath them):

<source>
> args(hypot3)
function (x = 3, y = 4)
NULL
</source>

=Packages=

Add-on packages for R are listed at http://cran.r-project.org/

Let's install the '''multicore''' package, which gives us access to functions that can make use of the multiple processor cores that we often find in our computers these days:

<source>
> install.packages("multicore")
</source>

Et voila! It is done.

We can list the packages installed in the libraries available to our session:

<source>
> library()
</source>

To load one of them into our session, we type e.g.:

<source>
> library(multicore)
</source>

Now, an example of using a function from the multicore package. The '''lapply''' function, which is included in the standard R core, will map a given function over a list of inputs, giving a list of the function outputs in return. For example, we can map a squaring function over the list of integers from 1 to 3:

<source>
> lapply(1:3, function(x) {x^2})
</source>

which gives us the list:

<pre>
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9
</pre>

Now, we can do the same work in parallel using:

<source>
> mclapply(1:3, function(x) {x^2})
</source>
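
Since '''lapply''' always returns a list, it is sometimes handier to use one of its base-R relatives, which simplify the result to a vector. A brief sketch for comparison (the variable names are illustrative):

<source>
# lapply: always a list, one element per input
squares.list <- lapply(1:3, function(x) {x^2})
# sapply: simplifies the result to a vector when it can
squares.vec <- sapply(1:3, function(x) {x^2})
# vapply: as sapply, but we declare that each result is a single number
squares.checked <- vapply(1:3, function(x) {x^2}, numeric(1))
squares.vec   # 1 4 9
</source>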

=Reading Data from File=

R provides some very useful functions for reading and writing data from/to file.

==Text Files==

Let's start with text files. Perhaps the simplest function to use is '''read.table()''', which suits data organised so that it looks like a table with column headings. If I have a text file, '''medals.txt''', with the following contents:

<pre>
country gold silver bronze
"USA" 46 29 29
"China" 38 27 23
"Great Britain" 29 17 19
"Russian Federation" 24 26 32
"Republic of Korea" 13 8 7
"Germany" 11 19 14
</pre>

It will be a simple matter to use the '''read.table()''' function to load the data into R:

<source>
> medals.2012 <- read.table("medals.txt", header=TRUE)
> medals.2012
             country gold silver bronze
1                USA   46     29     29
2              China   38     27     23
3      Great Britain   29     17     19
4 Russian Federation   24     26     32
5  Republic of Korea   13      8      7
6            Germany   11     19     14
</source>

There is a corresponding '''write.table()''' function to export the contents of a data frame into a text file.

CSV files can be easily handled by specifying '''sep=","''' as an argument to read.table(). However, for convenience, there are also '''read.csv()''' and '''write.csv()''' functions defined. For example:

<source>
> write.csv(medals.2012,"medals.csv")
</source>

Gives us the file, '''medals.csv''', with the contents:

<pre>
"","country","gold","silver","bronze"
"1","USA",46,29,29
"2","China",38,27,23
"3","Great Britain",29,17,19
"4","Russian Federation",24,26,32
"5","Republic of Korea",13,8,7
"6","Germany",11,19,14
</pre>
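
Note the unnamed first column of row names in the file above. If you would rather not store it, '''write.csv()''' accepts '''row.names=FALSE'''. Here is a minimal round-trip sketch, which assumes a cut-down version of the medals table and writes to a temporary file rather than '''medals.csv''':

<source>
# A cut-down medals table, for illustration only
medals <- data.frame(country=c("USA", "China", "Great Britain"),
                     gold=c(46, 38, 29))
csv.file <- tempfile(fileext=".csv")           # throw-away file name
write.csv(medals, csv.file, row.names=FALSE)   # omit the row-name column
medals.again <- read.csv(csv.file)             # read it straight back in
names(medals.again)                            # only "country" and "gold"
unlink(csv.file)                               # tidy up
</source>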

==Binary Files==

The '''save()''' function will store an R data structure in binary form:

<source>
> save(medals.2012,file="medals.RData")
</source>

<pre>
gethin@gethin-desktop:~$ file medals.RData
medals.RData: gzip compressed data, from Unix
</pre>

There is, of course, a corresponding function to load such data:

<source>
> load("medals.RData")
</source>

==Databases==

If you would like to read and write data directly from/to a database, there are several packages to help you. See http://cran.r-project.org/doc/manuals/r-release/R-data.html#Relational-databases for more information.

==NetCDF==

The [http://cran.r-project.org/web/packages/ncdf/index.html '''ncdf''' package] provides an interface to NetCDF files. Before installing the package, you will need the Unidata NetCDF libraries installed on your system. On Linux, the standard package managers conveniently provide this. Note that you will need the 'development' packages. Once the prerequisites are satisfied, you can use the standard R command to install the package from CRAN:

<source>
> install.packages("ncdf")
</source>

=Examples of Common Tasks=

==Random Sampling==

<source>
> railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "edward"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "thomas"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "gordon"
> sample(railway.engines, 1, replace = TRUE, prob = NULL)
[1] "james"
</source>
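
The draws above will differ from run to run. If you need repeatable results (e.g. for testing), you can fix the seed of the random number generator first. A small sketch, in which the seed value 42 is an arbitrary choice:

<source>
railway.engines <- c("thomas", "henry", "gordon", "edward", "james")
set.seed(42)                                  # fix the random number stream
first.draw <- sample(railway.engines, 3)      # three engines, without replacement
set.seed(42)                                  # rewind to the same point
second.draw <- sample(railway.engines, 3)
identical(first.draw, second.draw)            # TRUE: same seed, same draw
</source>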

==Linear Regression==

<source>
> plot(cars)
> res=lm(dist ~ speed, data=cars)
> abline(res)
</source>

[[Image:R-lm(cars)-abline.png|400px|thumbnail|center|linear regression of stopping distance against speed from the built-in data set, cars]]

----

'''Exercises'''
* You may wish to compare different methods of estimation. From the MASS package, you can fit a line with the '''rlm''' and '''lqs''' functions. You can plot all the lines against the data using:
<source>
> abline(res.lm, lty=1)
> abline(res.rlm, lty=2)
> abline(res.lqs, lty=3)
> legend(x=5, y=100, legend=c("lm","rlm","lqs"), lty=c(1,2,3))
</source>
See: http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/rlm.html and http://stat.ethz.ch/R-manual/R-devel/RHOME/library/MASS/html/lqs.html.

* Weighted least squares. The '''lm''' function will accept a vector of weights, '''lm(... weights=...)'''. If given, the function will optimise the line of best fit according to the equation of weighted least squares. Experiment with different linear model fits, given different weighting vectors. Some handy hints for creating a vector of weights:
** '''w1<-rep(0.1,50)''' will give you a vector, length 50, where each element has a value of 0.1. '''w1[1]<-10''' will give the first element of the vector a value of 10.
** '''w2<-seq(from=0.02, to=1.0, by=0.02)''' provides a vector containing a sequence of values from 0.02 to 1.0 in steps of 0.02 (handily, again 50 in total).
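
As a starting point for these exercises, here is one possible sketch that creates the fitted objects and compares their coefficients. The names '''res.lm''', '''res.rlm''' and '''res.lqs''' match those used in the plotting commands above; '''res.wls''', '''fits''' and the seed value are assumptions made for illustration. Since '''lqs''' uses random resampling, its line may vary slightly unless a seed is set.

<source>
library(MASS)
set.seed(1)                                    # lqs resamples at random
res.lm  <- lm(dist ~ speed, data=cars)         # ordinary least squares
res.rlm <- rlm(dist ~ speed, data=cars)        # robust M-estimation
res.lqs <- lqs(dist ~ speed, data=cars)        # resistant regression
# Weighted least squares, using the second hint above for the weights
w2 <- seq(from=0.02, to=1.0, by=0.02)
res.wls <- lm(dist ~ speed, data=cars, weights=w2)
# Intercept and slope from each method, side by side
fits <- rbind(lm=coef(res.lm), rlm=coef(res.rlm),
              lqs=coef(res.lqs), wls=coef(res.wls))
fits
</source>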

==Significance Testing==

<source>
> boys_2=c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
> girls_2=c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
> res=var.test(boys_2,girls_2)
> res

F test to compare two variances

data: boys_2 and girls_2
F = 1.0186, num df = 9, denom df = 9, p-value = 0.9786
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2529956 4.1007126
sample estimates:
ratio of variances
1.018559
> res=t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)
> res

Two Sample t-test

data: boys_2 and girls_2
t = 0.8429, df = 18, p-value = 0.4103
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.656675 3.876675
sample estimates:
mean of x mean of y
    87.48     86.37
</source>
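
The objects returned by these tests are lists, so individual results can be picked out for use later in a script. A brief sketch, in which the 5% significance level is an assumption made for illustration:

<source>
boys_2 <- c(90.2, 91.4, 86.4, 87.6, 86.7, 88.1, 82.2, 83.8, 91, 87.4)
girls_2 <- c(83.8, 86.2, 85.1, 88.6, 83, 88.9, 89.7, 81.3, 88.7, 88.4)
res <- t.test(boys_2, girls_2, var.equal=TRUE, paired=FALSE)
res$statistic                        # the t value
res$conf.int                         # the 95 percent confidence interval
res$p.value                          # the p-value, as a plain number
# Is the difference in means significant at the 5% level?
significant <- res$p.value < 0.05
significant
</source>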

==Classification==

===k Nearest Neighbours===

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa (s), versicolor (c), and virginica (v).

See: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/iris.html

k-nearest neighbour classification for test set from training set: For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.

See: http://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html

<source>
library(class)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
iris3.knn <- knn(train, test, cl, k = 3, prob=TRUE)
table(predicted=iris3.knn, actual=cl)
</source>

How did we do?

<pre>
         actual
predicted  c  s  v
        c 23  0  3
        s  0 25  0
        v  2  0 22
</pre>
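
A single summary figure can be derived from such a table: the proportion of test flowers on the diagonal, i.e. those classified correctly. For the table shown above, that is (23+25+22)/75, or about 93%. The sketch below repeats the calculation in code; the seed is an assumption, added because ties are broken at random, so your counts may differ slightly.

<source>
library(class)
set.seed(1)                                    # ties are broken at random
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
iris3.knn <- knn(train, test, cl, k = 3)
confusion <- table(predicted=iris3.knn, actual=cl)
# Correct predictions sit on the diagonal of the table
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
</source>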

===Classification Trees===

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

This data frame contains the following columns:
* Kyphosis: a factor with levels ''absent'' and ''present'', indicating if a kyphosis (a type of deformation) was present after the operation.
* Age: in months
* Number: the number of vertebrae involved
* Start: the number of the first (topmost) vertebra operated on.

See: http://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html

<source>
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped
plot(fit)
text(fit, use.n = TRUE)
plot(fit2)
text(fit2, use.n = TRUE)
</source>

[[Image:R-classification-tree.png|500px|thumbnail|center|Classification tree for the kyphosis data frame.]]

==Solving Systems of Linear Equations==

See, e.g.: https://source.ggy.bris.ac.uk/wiki/NumMethodsPDEs

<source>
> A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
> b <- c(5,7,8)
> solve(A,b)
[1] -15 8 2
</source>
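
It is good practice to check a computed solution by substituting it back into the original equations. A minimal sketch of such a check (the name '''x''' for the solution vector is illustrative):

<source>
A <- array(c(1,3,2,3,5,4,-2,6,3), dim=c(3,3))
b <- c(5,7,8)
x <- solve(A,b)
# Multiply back: A %*% x should reproduce b, up to rounding error
check <- all.equal(as.vector(A %*% x), b)
check
</source>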

=Suggested Exercises=

If you would like to work through some exercises, with model answers included, you could take a look at:
* http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/reed/rexercises.pdf

=Writing Faster R Code=

In the above sections we've introduced a number of features of R and have begun the journey to becoming a proficient and productive user of the language. In the remaining sections, we'll switch tack and focus on a question commonly asked by those beginning to use R in anger--'''"My R code is slow. How can I speed it up?"'''. In this section we'll consider the related tasks of finding which bits of your R code are responsible for the majority of the run-time and what you can do about it.

==Profiling & Timing==

In order to remain productive (and sane, and have a social life...), it is essential that we first identify which portions of our R code are responsible for the majority of the run-time. We could spend ages optimising a portion that we ''think'' may be running slowly, but computers have a gift(!) for surprising us: if that portion of the program accounts for, say, only 10% of the run-time, then even an infinite speed-up of it can cut the total run-time by at most 10%, and we will have sweated for very little gain.

The simplest method of investigation is to simply time the application of a function:

<source>
system.time(some.function())
</source>

You can get a more detailed analysis of a block of code using the built-in R profiler. The general pattern of invocation is:

<source>
Rprof(filename="~/rprof.out")
# Do some work
Rprof(NULL)  # passing NULL switches the profiler off
summaryRprof(filename="~/rprof.out")
</source>

For example, here's an R script, '''profile.r''':
<source>
Rprof(filename="~/rprof.out")
# Create a 10 x 100,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(100000)})
# Map a function over the matrix. First in serial..
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof(NULL)  # passing NULL switches the profiler off
summaryRprof(filename="~/rprof.out")
</source>

Which I ran by typing:

<pre>
R CMD BATCH profile.r
</pre>

In the output file, '''profile.r.Rout''', I found the following breakdown:

<pre>
               self.time self.pct total.time total.pct
"simpleLoess"       4.84    88.00       5.10     92.73
"rnorm"             0.22     4.00       0.22      4.00
"loess.smooth"      0.18     3.27       5.28     96.00
</pre>

The profile tells us that the function '''simpleLoess''' takes 88% of the runtime, whereas '''rnorm''' takes only 4%.
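
The summary can also be kept as an R object and inspected within the same session. The following is a sketch under a few assumptions: the profile is written to a temporary file instead of '''~/rprof.out''', and the vectors are shortened to 20,000 numbers so that it runs quickly.

<source>
prof.file <- tempfile(fileext=".out")        # throw-away profile file
Rprof(filename=prof.file)                    # start profiling
data <- lapply(1:10, function(x) {rnorm(20000)})
x <- lapply(data, function(x) {loess.smooth(x,x)})
Rprof(NULL)                                  # stop profiling
prof <- summaryRprof(filename=prof.file)
# by.self ranks functions by the time spent in their own code
head(prof$by.self, 3)
unlink(prof.file)                            # tidy up
</source>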

==Preallocation of Memory==

As with other scripting languages, such as MATLAB, the simplest method that you can use to speed up your R code is to pre-allocate the storage for variables whenever possible. To see the benefits of this, consider the following two functions:

<source>
> f1 <- function() {
+ v <- c()
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

and:

<source>
> f2 <- function() {
+ v <- c(NA)
+ length(v) <- 30000
+ for (i in 1:30000)
+ v[i] <- i^2
+ }
</source>

Timing calls to each of them shows that the pre-allocation of memory gives a whopping ~'''x30 speed-up'''. Your mileage will vary depending upon the details of your code.

<source>
> system.time(f1())
user system elapsed
1.720 0.040 1.762
> system.time(f2())
user system elapsed
0.052 0.000 0.05
</source>
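
A more idiomatic way to pre-allocate is to ask for a numeric vector of the required length directly. The function '''f3''' below is a hypothetical variant of '''f2''', written as a sketch; unlike '''f1''' and '''f2''', it also returns the vector that it fills in.

<source>
f3 <- function(n=30000) {
  v <- numeric(n)             # n zeros, allocated once, up front
  for (i in seq_len(n))       # seq_len(n) is safe even when n is 0
    v[i] <- i^2
  v                           # hand the filled vector back to the caller
}
system.time(v <- f3())
head(v)
</source>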

==Vectorised Operations==

The other principal method for speeding up your R code is to eliminate loops whenever you can. Many functions and operators in R will accept vectors and arrays as input, rather than just single values, and this may allow you to avoid a loop altogether. The examples in the previous section used for loops to step through an array, squaring each element. However, you can achieve the same result far more quickly by passing the array ''en masse'' to the exponentiation operator:

<source>
> system.time(v <- (1:1000000)^2)
user system elapsed
0.024 0.004 0.026
</source>

Here we've been able to square 1,000,000 items in about half the time it took the pre-allocated loop to process 30,000!
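
When replacing a loop with a vectorised expression, it is worth confirming that the two give the same answer. A minimal sketch of such a check (the variable names are illustrative):

<source>
n <- 30000
# The loop version, with pre-allocated storage
loop.squares <- numeric(n)
for (i in seq_len(n)) loop.squares[i] <- i^2
# The vectorised version: one call to the ^ operator
vec.squares <- seq_len(n)^2
# TRUE if the two results match exactly
same.result <- identical(loop.squares, vec.squares)
same.result
</source>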

==Calling Functions Written in a Compiled Language (e.g. C or Fortran)==

Another way to get more speed is to outsource portions of R code that are found to be slow to a compiled language, such as C or Fortran. A good starting point on this topic is:

* http://mazamascience.com/WorkingWithData/?p=1067

=R and HPC=

If you've profiled your code and tried all that you can to speed it up, as described in the previous section, you might be interested in the various initiatives that exist to run R on high performance computers, such as bluecrystal:

* http://cran.r-project.org/web/views/HighPerformanceComputing.html

As we will see in the following examples, the general approach to running R in parallel is to arrange your task so that a function is applied to a list of inputs, and then to split the list over several CPU cores or cluster worker nodes.

==Multicore==

The '''multicore''' package allows us to make use of several CPU cores within a single machine. Note, however, that the package does not work on MS Windows computers.

As an example, let's look at the use of the package's '''mclapply''' function, a multicore equivalent of R's built-in list apply mapper, '''lapply'''. I saved the following commands into an R script called '''multicore.r''':
<source>
library(multicore)
# how many cores are present?
multicore:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

And used the following submission script to run it on bluecrystal phase2:
<source>
#!/bin/bash

#PBS -l nodes=1:ppn=8,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Run the R script
R CMD BATCH multicore.r
</source>

After the job had run, I got the following output in the file '''multicore.r.Rout''':
<pre>
> library(multicore)
> # how many cores are present?
> multicore:::detectCores()
[1] 8
> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.674 0.007 0.749
> # .. and secondly in parallel (using multicore, within a node)
> system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.301 0.074 0.113
</pre>

==Rmpi==

The '''Rmpi''' package allows us to create and use cohorts of message passing processes from within R. It does so by providing an interface to the MPI (Message Passing Interface) library.

In order to use the Rmpi package on BCp2, you will need the '''ofed/openmpi/gcc/64/1.4.2-qlc''' module loaded.

Here's a short example that I saved as '''Rmpi.r''':
<source>
library(Rmpi)
# spawn as many slaves as possible
mpi.spawn.Rslaves()
mpi.remote.exec(mpi.get.processor.name())
mpi.remote.exec(runif(1))
mpi.close.Rslaves()
mpi.quit()
</source>

I submitted the job to BCp2 using the following submission script:
<source>
#!/bin/bash

#PBS -l nodes=4:ppn=1,walltime=00:00:05

#! Ensure that we have the correct version of R loaded
module add languages/R-2.15.1

#! change the working directory (default is home directory)
cd $PBS_O_WORKDIR

#! Create a machine file (used for multi-node jobs)
cat $PBS_NODEFILE > machine.file.$PBS_JOBID

#! Run the R script
mpirun -np 1 -machinefile machine.file.$PBS_JOBID R CMD BATCH Rmpi.r
</source>

and got the following output:
<pre>
> library(Rmpi)
> # spawn as many slaves as possible
> mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: u03n074
slave1 (rank 1, comm 1) of size 5 is running on: u03n098
slave2 (rank 2, comm 1) of size 5 is running on: u04n029
slave3 (rank 3, comm 1) of size 5 is running on: u04n030
slave4 (rank 4, comm 1) of size 5 is running on: u03n074
> mpi.remote.exec(mpi.get.processor.name())
$slave1
[1] "u03n098"

$slave2
[1] "u04n029"

$slave3
[1] "u04n030"

$slave4
[1] "u03n074"

> mpi.remote.exec(runif(1))
X1 X2 X3 X4
1 0.5154871 0.5154871 0.5154871 0.5154871
> mpi.close.Rslaves()
[1] 1
> mpi.quit()
</pre>

==Snow==

Calling MPI routines from within R may be too low level for many people to use comfortably. Happily, the '''snow''' package provides a higher level abstraction for distributed memory programming from within R.

Here's my example program that I saved as '''snow.r''':
<source>
library(snow)
# request a cluster of 3 worker nodes
cl <- makeCluster(3)
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using snow, across a cluster of workers)
system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
stopCluster(cl)
</source>

I ran it on BCp2 using the same submission script given for Rmpi, save for changing Rmpi.r to snow.r. The output was:

<pre>
> library(snow)
> # request a cluster of 3 worker nodes
> cl <- makeCluster(3)
Loading required package: Rmpi
3 slaves are spawned successfully. 0 failed.
> clusterCall(cl, function() Sys.info()[c("nodename","machine")])
[[1]]
nodename machine
"u01n105" "x86_64"

[[2]]
nodename machine
"u02n014" "x86_64"

[[3]]
nodename machine
"u03n098" "x86_64"

> # Create a 10 x 10,000 matrix of random numbers
> data <- lapply(1:10, function(x) {rnorm(10000)})
> # Map a function over the matrix. First in serial..
> system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.711 0.001 0.715
> # .. and secondly in parallel (using snow, across a cluster of workers)
> system.time(x <- clusterApply(cl, data, function(x) {loess.smooth(x,x)}))
user system elapsed
0.259 0.001 0.260
> stopCluster(cl)
</pre>

==Parallel==

The '''parallel''' package is an amalgamation of functionality from the multicore and snow packages. Unlike the multicore package, it can be used on an MS Windows machine, although there the fork-based '''mclapply''' is limited to a single core and the socket-cluster functions (e.g. '''parLapply''') are the route to parallelism.

A trivial translation of our previous multicore example is:
<source>
library(parallel)
# how many cores are present?
parallel:::detectCores()
# Create a 10 x 10,000 matrix of random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map a function over the matrix. First in serial..
system.time(x <- lapply(data, function(x) {loess.smooth(x,x)}))
# .. and secondly in parallel (using multicore, within a node)
system.time(x <- mclapply(data, function(x) {loess.smooth(x,x)}))
</source>

I have not been able to get a distributed memory cluster working on BCp2 using the parallel package.
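
The snow-style interface in the parallel package can nonetheless be tried out on a single machine. The following is a sketch only: it assumes a local socket cluster of 2 workers (an arbitrary number), has not been run on BCp2, and is not a recipe for a multi-node job there.

<source>
library(parallel)
# Start 2 worker processes on the local machine (a socket cluster)
cl <- makeCluster(2)
# Create a list of 10 vectors, each holding 10,000 random numbers
data <- lapply(1:10, function(x) {rnorm(10000)})
# Map the function over the list, sharing the work among the workers
x <- parLapply(cl, data, function(x) {loess.smooth(x,x)})
stopCluster(cl)             # always shut the workers down afterwards
length(x)                   # one result per input vector
</source>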

=Further Reading=

* [http://shop.oreilly.com/product/9780596801717.do R in a Nutshell]
* [http://shop.oreilly.com/product/0636920021421.do Parallel R]