The R package optimParallel provides a parallel version of the L-BFGS-B optimization method of optim()
. The main function of the package is optimParallel()
, which has the same usage and output as optim()
. Using optimParallel()
can significantly reduce the optimization time, especially when the evaluation time of the objective function is large and no analytical gradient is available. We introduce the R package and illustrate its implementation, which takes advantage of the lexical scoping mechanism of R.
Many statistical tools involve optimization algorithms, which aim to find the minima or maxima of an objective function. One of the most frequently used algorithms is the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints (L-BFGS-B), which is available through the general-purpose optimizer optim() of the R package stats and the more recent R packages lbfgsb3 (Nash et al. 2015) and lbfgsb3c (Fidler et al. 2018). The L-BFGS-B algorithm has proven to work well in numerous applications.
However, long optimization times of computationally intensive functions
sometimes hinder its application; see Gerber et al. (2017) for an
example of such a function from our research in spatial statistics. For
this reason we present a parallel version of the optim()
L-BFGS-B
algorithm, denoted with optimParallel()
, and explore its potential to
reduce optimization times.
To illustrate the possible speed gains of a parallel L-BFGS-B implementation, we consider the elapsed time per step of the algorithm rather than the total optimization time, because the latter may involve several steps. If a gradient function is supplied, one step of the L-BFGS-B algorithm evaluates the objective function fn() and its gradient gr() once each. optim() evaluates the two functions sequentially, whereas optimParallel() evaluates both functions in parallel, which reduces the evaluation time to little more than that of the slower of the two. If no gradient function is supplied, optim() calculates a numeric central difference gradient approximation (CGA). For a function of p parameters the CGA requires 2p evaluations of fn(), so optim() sequentially evaluates 1 + 2p function calls per step, whereas optimParallel() evaluates all calls of fn() in parallel, reducing the elapsed time per step to little more than the time of a single fn() evaluation.
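To make the CGA concrete, the following is a minimal sequential sketch of such an approximation; the helper cga() and its argument eps are our illustration and not part of the package:

> cga <- function(fn, par, eps = 1e-6) {
+   ## central difference gradient approximation, one coordinate at a time
+   sapply(seq_along(par), function(i) {
+     e <- replace(numeric(length(par)), i, eps)
+     (fn(par + e) - fn(par - e)) / (2 * eps)
+   })
+ }

The 2p calls of fn() inside cga() are mutually independent, which is exactly what optimParallel() exploits by distributing them over parallel R sessions.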
optimParallel() by examples

The main function of the R package optimParallel (Gerber 2019) is optimParallel(), which has the same usage and output as optim(), but evaluates the objective function and its gradient in parallel. For illustration, we draw 10^7 samples from a normal distribution and define the function negll(), which returns the negative log-likelihood of the normal model. We then use optim() to estimate the parameters mean and sd. (For this model closed-form maximum likelihood estimates exist, and hence, alternatives to the numerical optimization of negll() can be constructed based on mean() and sd().)
> x <- rnorm(n = 1e7, mean = 5, sd = 2)
> negll <- function(par, x) -sum(dnorm(x = x, mean = par[1], sd = par[2], log = TRUE))
> o1 <- optim(par = c(1, 1), fn = negll, x = x, method = "L-BFGS-B",
+ lower = c(-Inf, 0.0001))
> o1$par
[1] 5.000597 2.000038
Using optimParallel()
, we can do the same in parallel. The functions
makeCluster()
, detectCores()
, and setDefaultCluster()
from the
R package parallel are used to set up a default cluster for the
parallel execution.
> install.packages("optimParallel")
> library("optimParallel")
Loading required package: parallel
> cl <- makeCluster(detectCores()); setDefaultCluster(cl = cl)
> o2 <- optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001))
> identical(o1, o2)
[1] TRUE
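The worker R sessions persist until the cluster is stopped explicitly. We keep the cluster running for the remaining examples; at the end of a session it can be freed with:

> stopCluster(cl)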
On our computer with 12 Intel Xeon E5-2640 @ 2.50GHz processors, a single evaluation of negll() already takes a noticeable amount of time, and the elapsed time of an optimParallel() run is substantially reduced compared to that of an optim() run. Note that the relative time savings increase if the evaluation time of negll() is increased, e.g., by increasing the number of data points in x.
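The elapsed times of the two variants can be measured on any machine with system.time():

> system.time(optim(par = c(1, 1), fn = negll, x = x, method = "L-BFGS-B",
+                   lower = c(-Inf, 0.0001)))
> system.time(optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001)))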
In addition to the arguments of optim()
, optimParallel()
has the
argument parallel
, which takes a list of control parameters for the parallel execution. For example, we
can set parallel = list(loginfo = TRUE)
to store all evaluated
parameters and the corresponding gradients.
> o3 <- optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001),
+ parallel = list(loginfo = TRUE))
> head(o3$loginfo, n = 3)
step par1 par2 fn gr1 gr2
[1,] 1 1.000000 1.000000 109213991 -40005963 -190049608
[2,] 2 1.205988 1.978554 39513324 -9693283 -18700810
[3,] 3 1.265626 2.086455 37160791 -8579638 -14969646
> tail(o3$loginfo, n = 3)
step par1 par2 fn gr1 gr2
[16,] 16 5.000840 2.000140 21121045 609.9480540 507.56421
[17,] 17 5.000586 2.000041 21121045 -26.8237162 15.17266
[18,] 18 5.000597 2.000038 21121045 0.6494038 -1.67717
This can be used to visualize the optimization path as shown in Figure 1 and simplifies the following study, which illustrates the impact of using different (approximate) gradient specifications.
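For example, the path shown in Figure 1 can be drawn along the following lines (the plotting code is our illustration and not part of the package):

> plot(o3$loginfo[, "par1"], o3$loginfo[, "par2"], type = "b",
+      xlab = "par1", ylab = "par2")
> abline(v = o3$par[1], h = o3$par[2], col = "red", lty = 2)  # mark the estimates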
Figure 1: Visualization of the optimization path based on the log information obtained with optimParallel(..., parallel = list(loginfo = TRUE)). The red lines mark the estimates.

In the optimParallel()
calls above the argument gr
was not
specified, and hence, the CGA was used. Another way of using
optimParallel()
is with a gradient function given to the
argument gr
. If the computation of the analytical gradient is
tractable and does not take more time than evaluating fn(), this approach can speed up the optimization further. For negll() the analytical gradient is available in closed form and can be implemented as follows.
> negll_gr <- function(par, x){
+   ## analytical gradient of negll() w.r.t. the mean (par[1]) and sd (par[2])
+   sm <- mean(x); n <- length(x)
+   c(-n*(sm-par[1])/par[2]^2,
+     n/par[2] - (sum((x-sm)^2) + n*(sm-par[1])^2)/par[2]^3)
+ }
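As a quick consistency check (our addition, not in the original article), negll_gr() can be compared with the numerical approximation cga() sketched in the introduction:

> negll_gr(c(1, 1), x = x)
> cga(function(p) negll(p, x), c(1, 1))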
> o4 <- optimParallel(par = c(1, 1), fn = negll, gr = negll_gr, x = x,
+ lower = c(-Inf, 0.0001), parallel = list(loginfo = TRUE))
> tail(o4$loginfo, n = 3)
step par1 par2 fn gr1 gr2
[16,] 16 5.000840 2.000139 21121045 609.9651113 507.625076
[17,] 17 5.000586 2.000041 21121045 -26.8233339 15.172072
[18,] 18 5.000597 2.000037 21121045 0.6494452 -1.677113
We see that the resulting optimization path is very similar to that
based on the CGA (o3
above).
Besides the CGA and the option to explicitly specify a gradient
function, optimParallel()
provides a third option, namely the numeric
forward difference gradient approximation (FGA), defined for p = 1 as (f(x + ε) - f(x)) / ε with a small ε > 0. Because fn() is evaluated in each step anyway, the FGA requires only p additional evaluations of fn() per step instead of the 2p required by the CGA. It is selected via parallel = list(forward = TRUE).
> o5 <- optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001),
+ parallel = list(loginfo = TRUE, forward = TRUE))
> o5$loginfo[17:19, ]
step par1 par2 fn gr1 gr2
[1,] 17 5.000086 1.999541 21121046 -26.33029 14.35781
[2,] 18 5.000331 1.999645 21121045 586.89953 534.85303
[3,] 19 5.000346 1.999651 21121045 625.10421 567.27441
> tail(o5$loginfo, n = 3)
step par1 par2 fn gr1 gr2
[31,] 31 5.000347 1.999652 21121045 627.8436 569.5991
[32,] 32 5.000347 1.999652 21121045 627.8436 569.5991
[33,] 33 5.000347 1.999652 21121045 627.8436 569.5991
We see that the optimizer only stopped after 33 steps, although the values of par1 and par2 barely changed over the last steps. This suggests that the lower accuracy of the FGA prevents the algorithm from satisfying its stopping criterion. One way to handle this issue is to set a less restrictive stopping criterion by increasing the value of factr, which is multiplied by the machine epsilon to obtain the tolerance on the reduction of fn() between consecutive steps.
> o6 <- optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001),
+ parallel = list(loginfo = TRUE, forward = TRUE),
+ control = list(factr = 1e-6/.Machine$double.eps))
> tail(o6$loginfo, n = 3)
step par1 par2 fn gr1 gr2
[14,] 14 4.996680 2.001974 21121074 -8524.7678 12125.5022
[15,] 15 4.999743 1.998478 21121052 -884.4955 -5305.2516
[16,] 16 5.000347 1.999652 21121045 627.8436 569.5991
Now the resulting optimization path and the evaluation counts are
similar to those resulting from the optimization using the analytic
gradient (o4
above). The take-home message of this study is that the
choice of the (approximate) gradient can affect the optimization path
and it may be necessary to adjust control parameters such as factr
,
ndeps
, and parscale
to obtain satisfactory results. A more detailed
discussion of the use of (approximate) gradients in optimization can be
found in Nash (2014), Chapter 10.
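For completeness, such control parameters are passed through the control argument, exactly as for optim(); the object name o7 and the values below are purely illustrative, not recommendations:

> o7 <- optimParallel(par = c(1, 1), fn = negll, x = x, lower = c(-Inf, 0.0001),
+                     control = list(ndeps = rep(1e-4, 2), parscale = c(5, 2)))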
optimParallel()
is a wrapper to optim()
and enables the parallel
evaluation of all function calls involved in one step of the L-BFGS-B
optimization method. It is implemented in R and interfaces compiled
C code only through optim()
. The reuse of the stable and well-tested
C code of optim()
has the advantage that optimParallel()
leads to
the exact same results as optim()
. To ensure that optimParallel()
and optim()
indeed return the same results, optimParallel contains
systematic unit tests implemented with the R package
testthat
(Wickham 2011, 2017).
The basic idea of the implementation is that calling fn()
(or gr()
)
triggers the evaluation of both fn()
and gr()
. Their return values
are stored in a local environment. The next time fn()
(or gr()
) is
called with the same parameters the results are read from the local
environment without evaluating fn()
and gr()
again. The following R
code illustrates how optimParallel()
takes advantage of the lexical
scoping mechanism of R to store the return values of fn()
and gr()
.
> demo_generator <- function(fn, gr) {
+ par_last <- value <- grad <- NA
+ eval <- function(par) {
+ if(!identical(par, par_last)) {
+ message("--> evaluate fn() and gr()")
+ par_last <<- par
+ value <<- fn(par)
+ grad <<- gr(par)
+ } else
+ message("--> read stored value")
+ }
+ f <- function(par) {
+ eval(par = par)
+ value
+ }
+ g <- function(par) {
+ eval(par = par)
+ grad
+ }
+ list(fn = f, gr = g)
+ }
> demo <- demo_generator(fn = sum, gr = prod)
Calling demo$fn()
triggers the evaluation of both fn()
and gr()
.
> demo$fn(1:5)
--> evaluate fn() and gr()
[1] 15
The subsequent call of demo$gr()
with the same parameters returns the
stored value grad
without evaluating gr()
again.
> demo$gr(1:5)
--> read stored value
[1] 120
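Calling either function with parameters that differ from the stored ones triggers a fresh evaluation of both functions:

> demo$fn(2:6)
--> evaluate fn() and gr()
[1] 20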
A similar construct allows optimParallel() to evaluate fn() and gr() in parallel and to return the stored values when one of them is called again with the same parameters.
To illustrate the speed gains that can be achieved with
optimParallel()
we measure the elapsed times per step when optimizing
the following test function and compare them to those of optim()
.
> fn <- function(par, sleep) {
+ Sys.sleep(sleep)
+ sum(par^2)
+ }
> gr <- function(par, sleep) {
+ Sys.sleep(sleep)
+ 2*par
+ }
In both functions the argument par can be a numeric vector with one or more elements and the argument sleep controls the evaluation time of the functions. We measure the elapsed time per step for all combinations of several sleep values, several numbers of parameters p, and specifications with and without gr(). All measurements are taken on a computer with 12 Intel Xeon E5-2640 @ 2.50GHz processors.

Figure 2: Benchmark experiment comparing optimParallel() and optim(). Plotted are the elapsed times per step; the elapsed times of optimParallel() (solid line) are independent of p and of whether gr() is specified.

The results of the benchmark experiment are summarized in Figure 2. They show that for optimParallel() the elapsed time per step is only marginally larger than the evaluation time of fn() itself, regardless of p and of whether gr() is specified, whereas the elapsed time of optim() grows with the number of sequential fn() and gr() evaluations per step. For very short evaluation times optimParallel() adds a small overhead, and hence, it is only faster than optim() when the evaluation time of fn() clearly exceeds that overhead.
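A single comparison of this kind can be reproduced with system.time(); the chosen values of par and sleep are illustrative:

> system.time(optimParallel(par = c(1, 1), fn = fn, gr = gr, sleep = 0.2))
> system.time(optim(par = c(1, 1), fn = fn, gr = gr, sleep = 0.2, method = "L-BFGS-B"))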
The use of Sys.sleep()
in this illustration is useful to characterize
the implementation and its overhead. However, it does not represent a
practical use case of optimParallel()
and the speed gains might be
less pronounced for other examples. One factor that reduces the speed of
optimParallel()
is the specification of large objects in its "..."
argument. All those objects are copied to the running R sessions in the cluster, which increases the elapsed time; in the example above, the vector x alone occupies about 80 MB and is copied to every worker. Related to that is the increased memory usage, which may slow down the optimization when not
enough memory is available.
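One way to gauge this copying overhead (our suggestion) is to inspect the size of the objects passed via the "..." argument beforehand:

> format(object.size(x), units = "MB")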
The R package optimParallel provides a parallel version of the L-BFGS-B optimization method of optim(). After a brief theoretical illustration of the possible speed improvement based on parallel processing, we illustrate optimParallel() by examples. The examples demonstrate that one can replace optim() with optimParallel() to execute the optimization in parallel, and they illustrate additional features like capturing log information and using different (approximate) gradients. Moreover, we briefly sketch the basic idea of the implementation, which is based on the lexical scoping mechanism of R. Finally, a performance test shows that using optimParallel() significantly reduces the elapsed time needed to optimize computationally demanding functions. In summary, optimParallel() is most beneficial when (i) the evaluation time of fn() is large, (ii) no analytical gradient is available, and (iii) sufficiently many processors are available to evaluate all required function calls in parallel.