Gradient optimization methods (function minimizers) are well-represented in both the base and package universe of R (R Core Team 2019). However, some of the methods and the codes developed from them were published before standards for hardware and software were established, in particular the IEEE arithmetic (IEEE 1985). There have been cases of unexpected behaviour or outright errors, and these are the focus of the histoRicalg project. A summary history of some of the tools in R for gradient optimization methods is presented to give perspective on such methods and the occasions where they could be used effectively.
Ignoring exogenous data (assume that it is supplied when needed), our problem is to find
\[argmin{_x}{f(x)}\]
where \(x\) is our set of \(n\) parameters and \(f()\) is a scalar real function of those parameters. The gradient of \(f(x)\) is the vector valued function \[g(x) = \partial{f(x)}/\partial{x}.\] The Hessian of \(f(x)\) is the matrix of second partial derivatives \[H(x) = \partial^2{f(x)}/\partial{x^2}.\]
We will be considering methods for this problem where a gradient can be supplied, though many of the R tools will use a numerical approximation if needed.
The so-called Newton method tries to find \(x\) that satisfies the first order conditions for an extremum, namely,
\[H(x) \delta = - g(x)\]
and then updates \(x\) to \((x + \delta)\).
The main objections to this approach are
There are four main approaches to simplifying the Newton method:
The R Consortium awarded some modest funding for histoRicalg, a
project to document and transfer knowledge of some older algorithms used
by R and by other computational systems. These older codes are mainly in
Fortran, but some are in C, with the original implementations possibly
in other programming languages. This effort was prompted by finding some
apparent bugs in codes, which could be either from the original programs
or else the implementations. Two examples in particular – in nlm()
and in optim–L-BFGS-B
– gave motivation for the project. We welcome
interest and participation, and have a project repository
https://gitlab.com/nashjc/histoRicalg, where several aspects of this
work are in various stages of completion.
The function optim()
in base-R has three of the optimizers from the
1990 edition of Compact Numerical Methods (Nash 1979), abbreviated CNM
henceforth. Given the dates of their conception, these are an obvious
focus for histoRicalg. The direct-search Nelder-Mead method
(Nelder and Mead 1965) is in fact the default solver i.e., it uses
method="Nelder-Mead"
in the optim()
call and does not require a
gradient routine to be supplied. In fact, Nelder-Mead will be used even
if a gradient routine is included unless some other method is suggested.
The choice if method="BFGS"
is the variable metric method of
Fletcher (1970). The third choice from CNM is method="CG"
for conjugate
gradients, which is a combination of several similar approaches.
In this article on the provenance of the methods, the details will be discussed only in general terms. However, it is critical to get those details right.
If the source code for base R is in a directory that is named R-X.Y.Z
(in my case 3.5.1 when this vignette was started) then the calling
routine for optim()
is src/library/stats/R/optim.R
, but this uses
the .External2
method to call src/library/stats/src/optim.c
, where
the function vmmin
is called. However, the source for vmmin
is in
src/appl/optim.c
. I will venture that having two files of the same
name in different directories is tempting an error.
I will use optim–BFGS
when using the call method="BFGS"
in
optim()
, and similar abbreviations for the other methods. To use
optim–BFGS
, the relevant code is a C routine vmmin
. In
src/appl/optim.c
, above this routine is the comment
{,comment}/* BFGS variable-metric method, based on Pascal code
in J.C. Nash, `Compact Numerical Methods for Computers', 2nd edition,
converted by p2c then re-crafted by B.D. Ripley */
As author of this work, I can say that the code used is Algorithm 21 of (Nash 1979). In the First Edition, this was a step-and-description code which was replaced by Turbo Pascal in the Second Edition. The methods were worked out mainly on a Data General Nova which had partitions of between 3.5 K and 8 K bytes accessible via a 10 character per second teletype. The floating point available had a 24 bit mantissa, in a single level of precision with (as I recall) no guard digit. Programming was in a fairly early and simple form of BASIC.
This machine was at Agriculture Canada. In the late 1970s, it was replaced with a Data General Eclipse, but largely the facilities were the same, except the floating point went to 6 hexadecimal digits with no guard digit. I also implemented some codes on HP 9830 and Tektronix 4051 "programmable calculators". A Fortran translation was made as NASHLIB, which ran mainly on IBM System 360 class computers, but also was tested at least partially on Univac 1108, ICL 1906A and Xerox-Honeywell CP-6 machines. The codes were distributed as Fortran source code. See https://gams.nist.gov/cgi-bin/serve.cgi/Package/NASHLIB. I have been informed that the servers supporting this link will not be replaced when they fail.
NASHLIB dates from the 1979-1981 period, though the methods were developed in the mid-1970s at Agriculture Canada to support economic modelling. In 1975, I received an invitation (from Brian Ford) to collaborate with the Numerical Algorithms Group in Oxford, and Agriculture Canada generously allowed me two 6-week periods to do so. During the second of these, in the tail end of the hurricane of January 1976, I drove to Dundee to meet Roger Fletcher. He took an interest in the concept of a program that used minimal memory. We found a printout of the Fortran for the method in (Fletcher 1970), and with a ruler and pencil, Roger and I scratched out the lines of code that were not strictly needed. As I recall, the main deletion was the cubic interpolation line search. The resulting method simply used backtracking with an acceptable point condition.
On my return to Ottawa, I coded the method in BASIC and made two small changes:
the current search was a steepest descent. If so, then the method is terminated. If not, the inverse Hessian approximation is reset to the unit matrix and the cycle restarted. This "quick fix" has, of course, become permanent. My experience over the years suggests it is a reasonable compromise, but the possibility of better choices is still open.
Considering the simplicity of the method, it is surprising to me how robust and efficient it has shown itself to be over the last four decades.
The code does allow of some variations:
In the mid-1980s, the BASIC code was extended to allow for masks (fixed
parameters) and bounds (or box) constraints on the parameters
(Nash and Walker-Smith 1987). This adds about 50% more code, as well as vector storage for
the constraints and indices to manage them (essentially \(3*n\)), but does
permit a much wider variety of problems to be solved. Even for
unconstrained problems that may have difficult properties, imposing
loose bounds can prevent extreme steps in the parameters. To add this
capability to R, I put the package Rvmmin
on CRAN in 2011 (Nash 2018).
This code is, as of 2018, part of a new optimx
(Nash and Varadhan (2011), Nash (2014b))
package on CRAN. It is entirely written in R.
There are other R packages with related capability. In particular,
ucminf
(Nielsen and Mortensen 2012) is based on the Fortran code of Nielsen (2000). This
appears to use a very similar algorithm to optim–BFGS
, but employs a
more sophisticated line search. This package does not, however, allow
for constraints in the form of bounds or masks.
The approximate inverse Hessian could also be saved and used to provide
some estimates of parameter dispersion. Clearly, the use of a steepest
descents direction for the final line search in Rvmmin
before
termination means that the penultimate approximation must be saved.
In practice, I have found the approximate inverse Hessian bears little
or no resemblance to the actual Hessian. This may be because the
construction of the approximation maintains the positive definite
condition. It is an open question whether the approximation has any
utility.
Package mize (Melville 2017) offers some options to build a gradient minimizer and could possibly allow some algorithmic comparisons to be made, but I have not had the time to delve into this.
Rather more sophisticated codes are part of base R in nlm()
and
nlminb()
. The former is a polyalgorithm of quasi-Newton type for
unconstrained problems (Schnabel et al. 1985). The latter allows bounds
constraints and is drawn from the PORT library (Fox 1997) code by
Gay (1990), but there are a number of common ideas in both methods. For
users, these two functions generally work well with the default settings
of their controls. This is fortunate, as even though I am fairly
experienced with such programs, I hesitate to play with the numerous
control parameters.
To underline this last sentence, there was a thread on the R-help
mailing list in 2010 (see, for example,
https://stat.ethz.ch/pipermail/r-help/2010-July/245182.html) noting
that the default behaviour of nlminb()
at that time was to assume a
positive objective function such as a sum of squares. Thus the function
would terminate if a non-positive evaluation occurred. This has since
been corrected, but was present in R for some years, possibly resulting
in erroneous results being published.
The original Algorithm 22 of CNM which was converted to C and included
in R as the method="CG"
choice for optim()
by Brian Ripley is an
approach that has never felt quite right to me. And I am the author! It
offers three different search direction updates using the type
element
of the control
list. The default is type=1
for the Fletcher-Reeves
update (Fletcher and Reeves 1964), with 2 for Polak–Ribiere (Polak and Ribiere 1969) and 3 for
Beale–Sorenson (Sorenson 1969), (Beale 1972).
In the mid-1980’s I updated the CG code (in BASIC) to handle bounds and
masks. Then in 2009 I incorporated the Yuan/Dai (Dai and Yuan 2001) search
direction update that melds the different formulas used in optim–CG
.
This gives a remarkably effective method that retains a surprisingly
short code, and entirely in R, with a version that handles bounds and
masks. This is package Rcgmin
(Nash 2014c).
There has been a flurry of work on CG-related optimizers in the last
decade or so, in particular associated with Hager and Zhang. See
(Hager and Zhang 2006b) and (Hager and Zhang 2006a). I have experimentally wrapped the
CG-Descent
code, but as yet do not feel the program is ready for
production release. However, collaboration on this and related codes I
have been exploring is welcome.
The base-R code lbfgsb.c
(at the time of writing in R-3.5.2/src/appl/)
is commented:
/* l-bfgs-b.f -- translated by f2c (version 19991025).
:
From ?optimfor method ‘"L-BFGS-B"’ is based on Fortran code by Zhu,
The code -Chen and Nocedal obtained from Netlib (file 'opt/lbfgs_bcm.shar')
Byrd, Lu
The Fortran files contained no copyright information.
C. (1995) A limited
Byrd, R. H., Lu, P., Nocedal, J. and Zhu, for bound constrained optimization.
memory algorithm 16}, 1190--1208.
\emph{SIAM J. Scientific Computing}, \bold{*/
The paper (Byrd et al. 1995) builds on (Lu et al. 1994). There have been a
number of other workers who have followed-up on this work, but R code
and packages seem to have largely stayed with codes derived from these
original papers. Though the date of the paper is 1995, the ideas it
embodies were around for a decade and a half at least, in particular in
(Nocedal 1980) and (Liu and Nocedal 1989). The definitive Fortran code was published as
(Zhu et al. 1997). This is available as toms/778.zip
on
http://www.netlib.org.
Besides the ACM TOMS code, there are two related codes from the Northwestern team on NETLIB:
To these are attached references (Liu and Nocedal 1989) and (Byrd et al. 1995) respectively, most likely reflecting the effort required to implement the constraints.
The unconstrained code has been converted to C under the leadership of
Naoaki Okazaki (see http://www.chokkan.org/software/liblbfgs/, or the
fork at https://github.com/MIRTK/LBFGS). This has been wrapped for R
as the lbfgs
package (Coppola et al. 2014). I have included this as one of
the solvers callable from package optimx
.
A side-by-side comparison of the main subroutines in the two downloads
from Netlib (toms
and opt
) unfortunately shows a lot of differences.
I have not tried to determine if these affect performance or are simply
cosmetic.
More seriously perhaps, there were some deficiencies in the code(s), and
in 2011 Nocedal’s team published a Fortran code with some corrections
(Morales and Nocedal 2011). Since the R code in C predates this by many years, I
prepared package lbfgsb3
to wrap the Fortran code. However, I did not
discover any test cases where the optim–L-BFGS-B
and lbfgsb3
gave
different output different, though I confess that the tests I ran are
not exhaustive.
In 2016, I was at a Fields Institute optimization conference in Toronto
for the 70th birthday of Andy Conn. By sheer serendipity, Nocedal did
not attend the conference, but sat down next to me at the conference
dinner. When I asked him about the key changes, he said that the most
important one was to fix the computation of the machine precision, which
was not always correct in the 1995 code. Since R gets this number as
.Machine$double.eps
, the offending code is irrelevant.
Within (Morales and Nocedal 2011), there is also reported an improvement in the subspace minimization that is applied in cases of bounds constraints. In the few tests I have applied with bounds constraints, I have yet to see any substantive differences, but welcome communcation should such be found.
Using Rcpp (see Eddelbuettel and François (2011) and the Fortran code in package lbfgs3
,
Matthew Fidler developed package lbfgsb3c
. As this provides a more
standard call and return than lbfgsb3
, Fidler and I have unified the
two packages, but are still checking and cleaning the package at the
time of writing.
There are (at least) two implementations of truncated Newton methods available.
nloptr–tnewton()
is a wrapper of the NLopt truncated Newton method
translated to C and somewhat modified by Steven G. Johnson in the nlopt
project (https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/)
from Fortran code due to Ladislav Luksan
(http://www.cs.cas.cz/luksan/subroutines.html). The many layers and
translations make it difficult to unravel the particular details in this
code, and I do not feel confident to provide a competent overview.
optimx–tn()
and optimx–tnbc()
are translations from Matlab source of
the Truncated Newton codes of Stephen Nash (Nash 1983) that I prepared,
initially in package Rtnmin
. The code is entirely in R. In
implementing the codes, the principal awkwardness was in making
available to different functions a quite extensive set of variables
relating to the function and gradient. My choice was to use the
list2env()
function to create an environment to hold this data.
Note the survey of truncated Newton methods by Stephen in (Nash 2000).
This story of some of the function minimizers available to R users is not finished. There are certain to be related methods in packages or collections I have overlooked. Moreover, I have not had the time or energy to fully explore some of the items I have mentioned. Nevertheless, how codes come about is important to understanding their strengths and weaknesses and in maintaining them to a level that users can reliably use them. Such matters were a large consideration for Nash (2014a), but have been more explicitly addressed by histoRicalg.
When evaluating such tools, it is important to consider what is wanted. Personally, I value reliability as more important than speed. By this, I mean
"Speed" is, of course, desirable, but how this is measured is debatable. Execution time is notoriously sensitive to hardware and software environments, as well as to the way functions and gradients are coded. Number of function and gradient counts is an alternative, since these computations often are the bottleneck of optimization. However, there are some methods where the optimization calculations are demanding either in cycles or memory.
Readers may note that I have highlighted that some codes are written entirely in R. This allows for customization or in-line use of faster code, and I have exchanged notes with several workers wanting to speed up time-consuming optimizations by such ideas. Profiling and debugging tools are generally quite challenging to use when there are multiple programming languages involved. All-R code may also be much easier to read and maintain if well-coded, especially as the programming language support for "old" languages diminishes.
These considerations are, of course, why we have a number of methods,
and why Ravi Varadhan and I prepared the optimx
package with tools to
allow for comparison of several methods. But please, do use the
comparison for evaluating and choosing a method, not for production
minimization by a "try everything" strategy. As new methods are
released, we hope to include them in the set optimx
can call via a
unified interface that is very close to the optim()
function. We note
that a somewhat different approach to unifying methods is available in
the ROI package of Hornik et al. (2011).
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nash, "Provenance of R's Gradient Optimizers", The R Journal, 2020
BibTeX citation
@article{RJ-2020-029, author = {Nash, John C.}, title = {Provenance of R's Gradient Optimizers}, journal = {The R Journal}, year = {2020}, note = {https://rjournal.github.io/}, volume = {12}, issue = {1}, issn = {2073-4859}, pages = {477-483} }