---
title: Using the `foreach` package
author: Steve Weston
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{foreach}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{utf8}
---

_Converted to RMarkdown by Hong Ooi_

## Introduction

One of R's most useful features is its interactive interpreter. This
makes it very easy to learn and experiment with R. It allows you to
use R like a calculator to perform arithmetic operations, display data
sets, generate plots, and create models.

Before too long, new R users will find a need to perform some
operation repeatedly. Perhaps they want to run a simulation repeatedly
in order to find the distribution of the results. Perhaps they need to
execute a function with a variety of different arguments passed to it.
Or maybe they need to create a model for many different data sets.

Repeated executions can be done manually, but it becomes quite
tedious to execute repeated operations, even with the use of command
line editing. Fortunately, R is much more than an interactive
calculator. It has its own built-in language that is intended to
automate tedious tasks, such as repeatedly executing R calculations.

R comes with various looping constructs that solve this problem. The
`for` loop is one of the more common looping constructs, but
the `repeat` and `while` statements are also quite useful.
In addition, there is the family of "apply" functions, which includes
`apply`, `lapply`, `sapply`, `eapply`,
`mapply`, `rapply`, and others.

The `foreach` package provides a new looping construct for
executing R code repeatedly. With the bewildering variety of existing
looping constructs, you may doubt that there is a need for yet another
construct. The main reason for using the `foreach` package is
that it supports _parallel execution_, that is, it can execute those
repeated operations on multiple processors/cores on your computer, or on
multiple nodes of a cluster. If each operation takes over a minute, and
you want to execute it hundreds of times, the overall runtime can take
hours. But using `foreach`, that operation can be executed in
parallel on hundreds of processors on a cluster, reducing the execution
time back down to minutes.

But parallel execution is not the only reason for using the
`foreach` package. There are other reasons that you might choose
to use it, even for quick-running operations, as we will see later in
the document.

## Getting Started

Let's take a look at a simple example use of the `foreach` package.
Assuming that you have the `foreach` package installed, you first
need to load it:

```{r loadLibs}
library(foreach)
```

Note that all of the packages that `foreach` depends on will be
loaded as well.

Now I can use `foreach` to execute the `sqrt` function
repeatedly, passing it the values 1 through 3, and returning the results
in a list, called `x`. (Of course, `sqrt` is a
vectorized function, so you would never really do this. But later,
we'll see how to take advantage of vectorized functions with
`foreach`.)

```{r ex1}
x <- foreach(i=1:3) %do% sqrt(i)
x
```

This is a bit odd looking, because it looks vaguely like a `for`
loop, but is implemented using a binary operator, called
`%do%`. Also, unlike a `for` loop, it returns a
value. This is quite important. The purpose of this statement is to
compute the list of results. Generally, `foreach` with
`%do%` is used to execute an R expression repeatedly, and return
the results in some data structure or object, which is a list by
default.

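As a rough point of comparison, here is a minimal sketch that computes the same square roots three ways. It is only meant to illustrate the "returns a value" point; the variable names `y` and `z` are introduced here for illustration and are not part of the original example.

```r
library(foreach)

# foreach with %do% returns its results, as a list by default
x <- foreach(i=1:3) %do% sqrt(i)

# the closest base R equivalent is lapply, which also returns a list
y <- lapply(1:3, sqrt)

# a for loop returns nothing useful, so the results must be stored by hand
z <- vector("list", 3)
for (i in 1:3) z[[i]] <- sqrt(i)
```

All three objects hold the same values; the difference is that the `for` loop relies on a side effect to collect them.
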
You will note in the previous example that we used a variable `i` as
the argument to the `sqrt` function. We specified the values of the
`i` variable using a named argument to the `foreach` function. We
could have called that variable anything we wanted, for example, `a`,
or `b`. We could also specify other variables to be used in the R
expression, as in the following example:

```{r ex2}
x <- foreach(a=1:3, b=rep(10, 3)) %do% (a + b)
x
```

Note that parentheses are needed here, because `%do%` has higher
precedence than `+`: without them, only `a` would be treated as the
loop body. We can also use braces:

```{r ex3}
x <- foreach(a=1:3, b=rep(10, 3)) %do% {
  a + b
}
x
```

We call `a` and `b` the _iteration variables_, since those are the
variables that are changing during the multiple executions. Note that
we are iterating over them in parallel, that is, they are both changing
at the same time. In this case, the same number of values are being
specified for both iteration variables, but that need not be the case.
If we only supplied two values for `b`, the result would be a list of
length two, even if we specified a thousand values for `a`:

```{r ex4}
x <- foreach(a=1:1000, b=rep(10, 2)) %do% {
  a + b
}
x
```

Note that you can put multiple statements between the braces, and you
can use assignment statements to save intermediate values of
computations. However, if you use an assignment as a way of
communicating between the different executions of your loop, then your
code won't work correctly in parallel, which we will discuss later.

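To make the distinction concrete, here is a small sketch of the safer pattern: each iteration returns its own value, and `foreach` assembles the results (here by summing them with the `.combine` option, which is described in the next section). The variable names `sq` and `total` are illustrative only.

```r
library(foreach)

# Fragile pattern (shown only as a comment): every iteration updates a shared
# variable, which cannot work once iterations run in separate worker processes.
#   total <- 0
#   foreach(i=1:4) %do% { total <- total + i^2 }

# Safer pattern: each iteration returns a value and foreach combines them
total <- foreach(i=1:4, .combine='+') %do% {
  sq <- i^2      # an intermediate value used within one iteration is fine
  sq
}
total
```

Because no iteration depends on a value assigned by another, the same loop gives the same answer whether it runs sequentially or in parallel.
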
## The `.combine` Option

So far, all of our examples have returned a list of results. This is a
good default, since a list can contain any R object. But sometimes
we'd like the results to be returned in a numeric vector, for example.
This can be done by using the `.combine` option to `foreach`:

```{r ex5}
x <- foreach(i=1:3, .combine='c') %do% exp(i)
x
```

The result is returned as a numeric vector, because the standard R `c`
function is being used to concatenate all the results. Since the
`exp` function returns numeric values, concatenating them with
the `c` function will result in a numeric vector of length three.

What if the R expression returns a vector, and we want to combine those
vectors into a matrix? One way to do that is with the `cbind` function:

```{r ex6}
x <- foreach(i=1:4, .combine='cbind') %do% rnorm(4)
x
```

This generates four vectors of four random numbers, and combines them by
column to produce a 4 by 4 matrix.

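The row-wise counterpart works the same way. As a minimal sketch (the loop body `c(i, i^2)` is just an illustrative choice), `rbind` stacks each result as one row:

```r
library(foreach)

# each iteration returns a vector of length two; rbind stacks them as rows
m <- foreach(i=1:3, .combine='rbind') %do% c(i, i^2)
dim(m)
```

Here three iterations of a length-two result give a 3 by 2 matrix.
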
We can also use the `"+"` or `"*"` functions to combine our results:

```{r ex7}
x <- foreach(i=1:4, .combine='+') %do% rnorm(4)
x
```

You can also specify a user-written function to combine the results.
Here's an example that throws away the results:

```{r ex7.1}
cfun <- function(a, b) NULL
x <- foreach(i=1:4, .combine='cfun') %do% rnorm(4)
x
```

Note that this `cfun` function takes two arguments. The
`foreach` function knows that the functions `c`,
`cbind`, and `rbind` take many arguments, and
will call them with up to 100 arguments (by default) in order to improve
performance. But if any
other function is specified (such as `"+"`), it assumes that it only
takes two arguments. If the function does allow many arguments, you can
specify that using the `.multicombine` argument:

```{r ex7.2}
cfun <- function(...) NULL
x <- foreach(i=1:4, .combine='cfun', .multicombine=TRUE) %do% rnorm(4)
x
```

If you want the combine function to be called with no more than 10
arguments, you can specify that using the `.maxcombine` option:

```{r ex7.3}
cfun <- function(...) NULL
x <- foreach(i=1:4, .combine='cfun', .multicombine=TRUE, .maxcombine=10) %do% rnorm(4)
x
```

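The `cfun` examples above discard their inputs, so here is a small sketch with a combine function that does something. The helper `addall` is hypothetical; it simply wraps `sum`, which accepts any number of arguments and can therefore be called on batches of results:

```r
library(foreach)

# hypothetical combine function: adds up however many values it is given
addall <- function(...) sum(...)

# combine the values 1..10 in batches of at most three arguments per call
total <- foreach(i=1:10, .combine=addall, .multicombine=TRUE,
                 .maxcombine=3) %do% i
total
```

Since addition is associative, the final value does not depend on how the results are grouped into batches.
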
The `.inorder` option is used to specify whether the order in which the
arguments are combined is important. The default value is
`TRUE`, but if the combine function is `"+"`, you could specify
`.inorder` to be `FALSE`. Actually, this option is important
only when executing the R expression in parallel, since results are always
computed in order when running sequentially. This is not necessarily true when
executing in parallel, however. In fact, if the expressions take very
different lengths of time to execute, the results could be returned in
any order. Here's a contrived example that executes the tasks in
parallel to demonstrate the difference. The example uses the
`Sys.sleep` function to cause the earlier tasks to take longer to execute:

```{r ex7.4}
foreach(i=4:1, .combine='c') %dopar% {
  Sys.sleep(3 * i)
  i
}
foreach(i=4:1, .combine='c', .inorder=FALSE) %dopar% {
  Sys.sleep(3 * i)
  i
}
```

The result of the first of these two examples is guaranteed to be the
vector `c(4, 3, 2, 1)`. The second example will return the same values,
but when the tasks really do run in parallel they will probably be in a
different order.

## Iterators

The values for the iteration variables don't have to be specified with
only vectors or lists. They can be specified with an _iterator_, many
of which come with the `iterators` package. An iterator is an
abstract source of data. A vector isn't itself an iterator, but the
`foreach` function automatically creates an iterator from a
vector, list, matrix, or data frame, for example. You can also create
an iterator from a file or a database query, which are natural sources
of data. The `iterators` package supplies a function called
`irnorm` which can return a specified number of random numbers
each time it is called. For example:

```{r ex8}
library(iterators)
x <- foreach(a=irnorm(4, count=4), .combine='cbind') %do% a
x
```

This becomes useful when dealing with large amounts of data. Iterators
allow the data to be generated on-the-fly, as it is needed by your
operations, rather than requiring all of the data to be generated at the
beginning.

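To see what "on-the-fly" means in practice, here is a minimal sketch that drives an iterator by hand with `iter` and `nextElem` from the `iterators` package. The variable names are illustrative; `foreach` normally makes these calls for you.

```r
library(iterators)

# wrap a vector in an iterator; values are handed out one at a time
it <- iter(c(10, 20, 30))
first  <- nextElem(it)
second <- nextElem(it)
third  <- nextElem(it)

# once the data is exhausted, nextElem signals an error whose
# message is 'StopIteration'
done <- tryCatch(nextElem(it), error=function(e) conditionMessage(e))
```

Each call produces only the next value, so nothing beyond the current element has to exist in memory at once.
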
For example, let's say that we want to sum together a thousand random
vectors:

```{r ex9}
set.seed(123)
x <- foreach(a=irnorm(4, count=1000), .combine='+') %do% a
x
```

This uses very little memory, since it is equivalent to the following
`while` loop:

```{r ex10}
set.seed(123)
x <- numeric(4)
i <- 0
while (i < 1000) {
  x <- x + rnorm(4)
  i <- i + 1
}
x
```

This could have been done using the `icount` function, which
generates the values from one to 1000:

```{r ex11}
set.seed(123)
x <- foreach(icount(1000), .combine='+') %do% rnorm(4)
x
```

but sometimes it's preferable to generate the actual data with the
iterator (as we'll see later when we execute in parallel).

In addition to introducing the `icount` function from the
`iterators` package, the last example also used an unnamed
argument to the `foreach` function. This can be useful when
we're not intending to generate variable values, but only controlling
the number of times that the R expression is executed.

There's a lot more that I could say about iterators, but for now,
let's move on to parallel execution.

## Parallel Execution

Although `foreach` can be a useful construct in its own right,
the real point of the `foreach` package is to do parallel computing.
To make any of the previous examples run in parallel, all you have to do
is to replace `%do%` with `%dopar%`. But for the
kinds of quick running operations that we've been doing, there wouldn't
be much point to executing them in parallel. Running many tiny tasks
in parallel will usually take more time to execute than running them
sequentially, and if it already runs fast, there's no motivation to make
it run faster anyway. But if the operation that we're executing in
parallel takes a minute or longer, there starts to be some motivation.

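For `%dopar%` to actually use more than one process, a parallel backend has to be registered first. The sketch below assumes the separate `doParallel` package is installed (it is not part of `foreach`, and other backends exist); the worker count of two is an arbitrary choice. If the package is missing, the loop still runs, because `%dopar%` falls back to sequential execution with a warning.

```r
library(foreach)

# Assumption: the doParallel package is available as the backend.
have_backend <- requireNamespace("doParallel", quietly=TRUE)
if (have_backend) {
  cl <- parallel::makeCluster(2)        # two local worker processes
  doParallel::registerDoParallel(cl)    # tell %dopar% to use them
}

# without a registered backend, %dopar% warns and runs sequentially
res <- foreach(i=1:4, .combine='c') %dopar% sqrt(i)

if (have_backend) {
  parallel::stopCluster(cl)             # shut the workers down
  registerDoSEQ()                       # go back to the sequential backend
}
res
```

The loop itself is unchanged from the sequential examples; only the registration step decides where the iterations run.
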
### Parallel Random Forest

Let's take random forest as an example of an operation that can take
a while to execute. Let's say our inputs are the matrix `x`, and the
factor `y`:

```{r ex12.data}
x <- matrix(runif(500), 100)
y <- gl(2, 50)
```

We've already loaded the `foreach` package, but we'll also need
to load the `randomForest` package:

```{r ex12.load}
library(randomForest)
```

If we want to create a random forest model with 1000 trees, and
our computer has four cores in it, we can split up the problem into four
pieces by executing the `randomForest` function four times, with
the `ntree` argument set to 250. Of course, we have to combine
the resulting `randomForest` objects, but the
`randomForest` package comes with a function called
`combine` that does just that.

Let's do that, but first, we'll do the work sequentially:

```{r ex12.seq}
rf <- foreach(ntree=rep(250, 4), .combine=combine) %do%
  randomForest(x, y, ntree=ntree)
rf
```

To run this in parallel, we need to change `%do%` to `%dopar%`, but we
also need to use another `foreach` option called `.packages` to tell
the `foreach` package that the R expression needs to have the
`randomForest` package loaded in order to execute successfully.
Here's the parallel version:

```{r ex12.par}
rf <- foreach(ntree=rep(250, 4), .combine=combine, .packages='randomForest') %dopar%
  randomForest(x, y, ntree=ntree)
rf
```

If you've done any parallel computing, particularly on a cluster, you
may wonder why I didn't have to do anything special to handle `x` and
`y`. The reason is that the `%dopar%` function noticed that
those variables were referenced, and that they were defined in the current
environment. In that case `%dopar%` will automatically export
them to the parallel execution workers once, and use them for all of the
expression evaluations for that `foreach` execution. That is
true for functions that are defined in the current environment as well,
but in this case, the function is defined in a package, so we had to
specify the package to load with the `.packages` option instead.

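As a small illustration of the automatic export, the sketch below references both a variable and a function from the current environment inside a `%dopar%` loop. The names `scale_by` and `k` are hypothetical, and the sequential backend is registered only so that the sketch runs without any extra packages; with a real parallel backend the loop would be written the same way.

```r
library(foreach)
registerDoSEQ()   # sequential backend, so no extra packages are needed here

# hypothetical helper and variable, both defined in the current environment
scale_by <- function(v, k) v * k
k <- 10

# neither scale_by nor k is passed explicitly; both are found automatically
res <- foreach(i=1:3, .combine='c') %dopar% scale_by(i, k)
res
```

For objects that are not picked up automatically, `foreach` also has an `.export` option that takes a character vector of variable names.
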
### Parallel Apply

Now let's take a look at how to make a parallel version of the standard
R `apply` function. The `apply` function is written in R,
and although it's only about 100 lines of code, it's a bit difficult to
understand on a first reading. However, it all really comes down to two
`for` loops, the slightly more complicated of which looks like:

```{r ex13.orig}
applyKernel <- function(newX, FUN, d2, d.call, dn.call=NULL, ...) {
  ans <- vector("list", d2)
  for(i in 1:d2) {
    tmp <- FUN(array(newX[,i], d.call, dn.call), ...)
    if(!is.null(tmp)) ans[[i]] <- tmp
  }
  ans
}
applyKernel(matrix(1:16, 4), mean, 4, 4)
```

I've turned this into a function, because otherwise, R will complain
that I'm using `...` in an invalid context.

This could be executed using `foreach` as follows:

```{r ex13.first}
applyKernel <- function(newX, FUN, d2, d.call, dn.call=NULL, ...) {
  foreach(i=1:d2) %dopar%
    FUN(array(newX[,i], d.call, dn.call), ...)
}
applyKernel(matrix(1:16, 4), mean, 4, 4)
```

But this approach will cause the entire `newX` array to be sent
to each of the parallel execution workers. Since each task needs only
one column of the array, we'd like to avoid this extra data
communication.

One way to solve this problem is to use an iterator that iterates over
the matrix by column:

```{r ex13.second}
applyKernel <- function(newX, FUN, d2, d.call, dn.call=NULL, ...) {
  foreach(x=iter(newX, by='col')) %dopar%
    FUN(array(x, d.call, dn.call), ...)
}
applyKernel(matrix(1:16, 4), mean, 4, 4)
```

Now we're only sending any given column of the matrix to one parallel
execution worker. But it would be even more efficient if we sent the
matrix in bigger chunks. To do that, we use a function called
`iblkcol` that returns an iterator that will return multiple columns
of the original matrix. That means that the R expression will need to
execute the user's function once for every column in its submatrix.

```{r ex13.iter, results="hide"}
iblkcol <- function(a, chunks) {
  n <- ncol(a)   # number of columns not yet returned
  i <- 1         # index of the next column to return

  nextElem <- function() {
    # signal the end of the data in the way the iterators package expects
    if (chunks <= 0 || n <= 0) stop('StopIteration')
    m <- ceiling(n / chunks)      # columns in this chunk
    r <- seq(i, length.out=m)     # column indices of this chunk
    i <<- i + m
    n <<- n - m
    chunks <<- chunks - 1
    a[,r, drop=FALSE]
  }

  structure(list(nextElem=nextElem), class=c('iblkcol', 'iter'))
}
nextElem.iblkcol <- function(obj) obj$nextElem()
```

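To see how `iblkcol` divides the columns, the following sketch repeats only the chunk-size arithmetic from its `nextElem` function. The helper `chunk_sizes` is hypothetical and exists purely for illustration; it is not part of `foreach` or `iterators`.

```r
# hypothetical helper that mirrors the arithmetic inside nextElem above
chunk_sizes <- function(n, chunks) {
  sizes <- numeric(0)
  while (chunks > 0 && n > 0) {
    m <- ceiling(n / chunks)   # columns taken by the next chunk
    sizes <- c(sizes, m)
    n <- n - m                 # columns still to be handed out
    chunks <- chunks - 1       # chunks still to be produced
  }
  sizes
}

chunk_sizes(4, 3)    # 4 columns in 3 chunks: sizes 2, 1, 1
chunk_sizes(10, 3)   # 10 columns in 3 chunks: sizes 4, 3, 3
```

Recomputing the chunk size from the remaining columns and remaining chunks keeps the pieces as even as possible, with any larger chunks coming first.
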
```{r ex13.third}
applyKernel <- function(newX, FUN, d2, d.call, dn.call=NULL, ...) {
  foreach(x=iblkcol(newX, 3), .combine='c', .packages='foreach') %dopar% {
    foreach(i=1:ncol(x)) %do% FUN(array(x[,i], d.call, dn.call), ...)
  }
}
applyKernel(matrix(1:16, 4), mean, 4, 4)
```

Note the use of the `%do%` inside the `%dopar%` to
call the function on the columns of the submatrix `x`. Now that
we're using `%do%` again, it makes sense for the iterator to be
an index into the matrix `x`, since `%do%` doesn't need to
copy `x` the way that `%dopar%` does.

## List Comprehensions

If you're familiar with the Python programming language, it may have
occurred to you that the `foreach` package provides something
that is not too different from Python's _list comprehensions_.
In fact, the `foreach` package also includes a function called
`when` which can prevent some of the evaluations from happening,
very much like the "if" clause in Python's list comprehensions.
For example, you could filter out negative values of an iterator using
`when` as follows:

```{r when}
x <- foreach(a=irnorm(1, count=10), .combine='c') %:% when(a >= 0) %do% sqrt(a)
x
```

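Because the example above uses random input, its output changes from run to run. A deterministic sketch of the same idea, using an illustrative variable named `evens_sq`, squares only the even numbers from 1 to 10:

```r
library(foreach)

# roughly the Python comprehension [i**2 for i in range(1, 11) if i % 2 == 0]
evens_sq <- foreach(i=1:10, .combine='c') %:% when(i %% 2 == 0) %do% i^2
evens_sq
```

The `when` clause filters the iterations before the loop body runs, so the result is `4 16 36 64 100`.
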
I won't say much on this topic, but I can't help showing how
`foreach` with `when` can be used to write a simple quick
sort function, in the classic Haskell fashion:

```{r qsort}
qsort <- function(x) {
  n <- length(x)
  if (n == 0) {
    x
  } else {
    p <- sample(n, 1)
    smaller <- foreach(y=x[-p], .combine=c) %:% when(y <= x[p]) %do% y
    larger  <- foreach(y=x[-p], .combine=c) %:% when(y >  x[p]) %do% y
    c(qsort(smaller), x[p], qsort(larger))
  }
}

qsort(runif(12))
```

Not that I recommend this over the standard R `sort` function.
But it's a pretty interesting example use of `foreach`.

## Conclusion

Much of parallel computing comes down to doing three things: splitting the
problem into pieces, executing the pieces in parallel, and combining the
results back together. Using the `foreach` package, the
iterators help you to split the problem into pieces, the
`%dopar%` function executes the pieces in parallel, and the
specified `.combine` function puts the results back together.
We've demonstrated how simple things can be done in parallel quite
easily using the `foreach` package, and given some ideas about
how more complex problems can be solved. But it's a fairly new package,
and we will continue to work on ways of making it a more powerful system
for doing parallel computing.

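The three steps can be seen side by side in one short sketch. It computes the column sums of a small matrix; the names `m`, `column`, and `sums` are illustrative, and `%do%` is used so that the sketch runs without a parallel backend (swapping in `%dopar%` would leave the rest unchanged).

```r
library(foreach)
library(iterators)

m <- matrix(1:16, 4)

sums <- foreach(column=iter(m, by='col'),   # split: one column per task
                .combine='c') %do%          # combine: concatenate the results
  sum(column)                               # execute: the work done per piece
sums
```

The result matches `colSums(m)`, which is of course the better tool for this particular job.
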