Artificial neural networks are applied in many situations.
neuralnet is built
to train multi-layer perceptrons in the context of regression
analyses, i.e. to approximate functional relationships between
covariates and response variables. Thus, neural networks are used as
extensions of generalized linear models.
neuralnet is a very
flexible package. The backpropagation algorithm and three versions of
resilient backpropagation are implemented and it provides a
custom choice of activation and error functions. An arbitrary number of
covariates and response variables as well as of hidden layers can
theoretically be included.
The paper gives a brief introduction to multi-layer perceptrons and
resilient backpropagation and demonstrates the application of
neuralnet using the
data set infert
, which is contained in the R distribution.
In many situations, the functional relationship between covariates (also known as input variables) and response variables (also known as output variables) is of great interest. For instance, when modeling complex diseases, potential risk factors and their effects on the disease are investigated to identify risk factors that can be used to develop prevention or intervention strategies. Artificial neural networks can be applied to approximate any complex functional relationship. Unlike generalized linear models (GLM; McCullagh and J. Nelder 1983), it is not necessary to prespecify the type of relationship between covariates and response variables as, for instance, a linear combination. This makes artificial neural networks a valuable statistical tool. They are in particular direct extensions of GLMs and can be applied in a similar manner. Observed data are used to train the neural network and the neural network learns an approximation of the relationship by iteratively adapting its parameters.
The package neuralnet
(Fritsch and F. Günther 2008) contains a very flexible function to train feed-forward
neural networks, i.e. to approximate a functional relationship in the
above situation. It can theoretically handle an arbitrary number of
covariates and response variables as well as of hidden layers and hidden
neurons even though the computational costs can increase exponentially
with higher order of complexity. This can cause an early stop of the
iteration process since the maximum number of iteration steps, which can be
defined by the user, is reached before the algorithm converges. In
addition, the package provides functions to visualize the results or in
general to facilitate the usage of neural networks. For instance, the
function compute
can be applied to calculate predictions for new
covariate combinations.
There are two other packages that deal with artificial neural networks at the moment: nnet (Venables and B. Ripley 2002) and AMORE (Limas, E. P. V. G. Joaquín B. Ordieres Meré, F. J. M. de Pisón Ascacibar, A. V. P. Espinoza, and F. A. Elías 2007). nnet provides the opportunity to train feed-forward neural networks with traditional backpropagation, and in AMORE, the TAO robust neural network algorithm is implemented. neuralnet was built to train neural networks in the context of regression analyses. Thus, resilient backpropagation is used since this algorithm is still one of the fastest algorithms for this purpose (e.g. Schiffmann, M. Joost, and R. Werner 1994; Rocha, P. Cortez, and J. Neves 2003; Kumar and D. Zhang 2006; Almeida, C. Baugh, C. Lacey, C. Frenk, G. Granato, L. Silva, and A. Bressan 2010). Three different versions are implemented and the traditional backpropagation is included for comparison purposes. Due to a custom choice of activation and error functions, the package is very flexible. The user can also specify several hidden layers; adding an extra hidden layer while reducing the number of neurons per layer can lower the computational costs. We successfully used this package to model complex diseases, i.e. different structures of biological gene-gene interactions (Günther, N. Wawro, and K. Bammann 2009). Summarizing, neuralnet closes a gap concerning the provided algorithms for training neural networks in R.
To facilitate the usage of this package for new users of artificial neural networks, a brief introduction to neural networks and the learning algorithms implemented in neuralnet is given before describing its application.
The package neuralnet focuses on multi-layer perceptrons (MLP, (Bishop 1995)), which are well applicable when modeling functional relationships. The underlying structure of an MLP is a directed graph, i.e. it consists of vertices and directed edges, in this context called neurons and synapses. The neurons are organized in layers, which are usually fully connected by synapses. In neuralnet, a synapse can only connect to subsequent layers. The input layer consists of all covariates in separate neurons and the output layer consists of the response variables. The layers in between are referred to as hidden layers, as they are not directly observable. Input layer and hidden layers include a constant neuron relating to intercept synapses, i.e. synapses that are not directly influenced by any covariate. Figure 1 gives an example of a neural network with one hidden layer that consists of three hidden neurons. This neural network models the relationship between the two covariates A and B and the response variable Y. neuralnet theoretically allows inclusion of arbitrary numbers of covariates and response variables. However, there can occur convergence difficulties using a huge number of both covariates and response variables.
To each of the synapses, a weight is attached indicating the effect of the corresponding neuron, and all data pass the neural network as signals. The signals are processed first by the so-called integration function combining all incoming signals and second by the so-called activation function transforming the output of the neuron.
The simplest multi-layer perceptron (also known as perceptron) consists of an input layer with $n$ covariates and an output layer with one output neuron. It calculates the function $$o(\mathbf{x}) = f\left(w_0 + \sum_{i=1}^{n} w_i x_i\right) = f\left(w_0 + \mathbf{w}^{T}\mathbf{x}\right),$$ where $w_0$ denotes the intercept, $\mathbf{w} = (w_1, \ldots, w_n)$ the vector of all synaptic weights without the intercept, and $\mathbf{x} = (x_1, \ldots, x_n)$ the vector of all covariates.
To increase the modeling flexibility, hidden layers can be included.
However, Hornik, M. Stinchcombe, and H. White (1989) showed that one hidden layer is sufficient to model any piecewise continuous function. Such an MLP with a hidden layer consisting of $J$ hidden neurons calculates the function $$o(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{J} w_j \cdot f\left(w_{0j} + \sum_{i=1}^{n} w_{ij} x_i\right)\right),$$ where $w_0$ denotes the intercept of the output neuron, $w_{0j}$ the intercept of the $j$th hidden neuron, $w_j$ the synaptic weight corresponding to the synapse starting at the $j$th hidden neuron and leading to the output neuron, and $w_{ij}$ the synaptic weight corresponding to the synapse between the $i$th covariate and the $j$th hidden neuron.
Formally stated, all hidden neurons and output neurons calculate an output $f(g(z_0, z_1, \ldots, z_k)) = f(g(\mathbf{z}))$ from the outputs of all preceding neurons $z_0, z_1, \ldots, z_k$, where $g: \mathbb{R}^{k+1} \to \mathbb{R}$ denotes the integration function and $f: \mathbb{R} \to \mathbb{R}$ the activation function. The neuron $z_0 \equiv 1$ belongs to the intercept. The integration function is often defined as $g(\mathbf{z}) = w_0 z_0 + \sum_{i=1}^{k} w_i z_i = w_0 + \mathbf{w}^{T}\mathbf{z}$, and the activation function $f$ is usually a bounded, nondecreasing, nonlinear and differentiable function such as the logistic function $f(u) = 1/(1+e^{-u})$.
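To make these definitions concrete, the following minimal R sketch (an illustration only, not code from the package; all weight values are arbitrary) computes the output of an MLP with one hidden layer of three neurons and logistic activation for a single covariate vector.

logistic <- function(u) 1 / (1 + exp(-u))

# output of an MLP with one hidden layer
# x:     vector of covariates (length n)
# w.in:  (n+1) x J matrix of weights to the hidden layer (first row: intercepts)
# w.out: vector of length J+1 (first element: intercept of the output neuron)
mlp.output <- function(x, w.in, w.out) {
  hidden <- logistic(w.in[1, ] + crossprod(w.in[-1, , drop = FALSE], x))
  logistic(w.out[1] + sum(w.out[-1] * hidden))
}

x     <- c(0.5, 1)                        # two covariates
w.in  <- matrix(rnorm(3 * 3), nrow = 3)   # rows: intercept, x1, x2; columns: 3 hidden neurons
w.out <- rnorm(4)                         # intercept + 3 hidden-to-output weights
mlp.output(x, w.in, w.out)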
Neural networks are fitted to the data by learning algorithms during a training process. neuralnet focuses on supervised learning algorithms. These learning algorithms are characterized by the usage of a given output that is compared to the predicted output and by the adaptation of all parameters according to this comparison. The parameters of a neural network are its weights. All weights are usually initialized with random values drawn from a standard normal distribution. During an iterative training process, the following steps are repeated:
The neural network calculates an output $o(\mathbf{x})$ based on given inputs $\mathbf{x}$ and the current weights. If the training process is not yet completed, the predicted output will differ from the observed output $\mathbf{y}$.
An error function $E$, like the sum of squared errors (SSE) $$E = \frac{1}{2} \sum_{l=1}^{L} \sum_{h=1}^{H} \left(o_{lh} - y_{lh}\right)^2$$ or the cross-entropy $$E = - \sum_{l=1}^{L} \sum_{h=1}^{H} \left( y_{lh} \log(o_{lh}) + (1 - y_{lh}) \log(1 - o_{lh}) \right),$$ measures the difference between predicted and observed output, where $l = 1, \ldots, L$ indexes the observations, i.e. given input-output pairs, and $h = 1, \ldots, H$ the output nodes.
All weights are adapted according to the rule of a learning algorithm.
The process stops if a pre-specified criterion is fulfilled, e.g. if all absolute partial derivatives of the error function with respect to the weights ($\partial E / \partial w$) are smaller than a given threshold.
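Written out in R, these two error functions could look as follows (an illustrative sketch, not the package's internal implementation); y and o denote the matrices of observed and predicted outputs with one row per observation and one column per output node.

# sum of squared errors
sse <- function(y, o) 0.5 * sum((o - y)^2)

# cross-entropy; requires predicted outputs strictly between 0 and 1
cross.entropy <- function(y, o) -sum(y * log(o) + (1 - y) * log(1 - o))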
The resilient backpropagation algorithm is based on the traditional
backpropagation algorithm that modifies the weights of a neural network
in order to find a local minimum of the error function. Therefore, the
gradient of the error function ($\partial E / \partial \mathbf{w}$) is calculated with respect to the weights. The weights are then modified going in the opposite direction of the partial derivatives until a local minimum is reached. This basic idea is roughly illustrated in Figure 2 for a univariate error function.
If the partial derivative is negative, the weight is increased (left part of the figure); if the partial derivative is positive, the weight is decreased (right part of the figure). This ensures that a local minimum is reached. All partial derivatives are calculated using the chain rule since the calculated function of a neural network is basically a composition of integration and activation functions. A detailed explanation is given in Rojas (1996).
neuralnet provides the
opportunity to switch between backpropagation, resilient backpropagation
with (Riedmiller 1994) or without weight backtracking
(Riedmiller and H. Braun 1993) and the modified globally convergent version by
Anastasiadis, G. Magoulas, and M. Vrahatis (2005). All algorithms try to minimize the error function by
adding a learning rate to the weights going into the opposite direction
of the gradient. Unlike the traditional backpropagation algorithm, a
separate learning rate $\eta_k$, which can be changed during the training process, is used for each weight $w_k$ in resilient backpropagation. This solves the problem of defining an overall learning rate that is appropriate for the whole training process and the entire network. Additionally, instead of the magnitude of the partial derivatives, only their sign is used to update the weights. This guarantees an equal influence of the learning rate over the entire network (Riedmiller and H. Braun 1993).
In order to speed up convergence in shallow areas, the learning rate $\eta_k$ will be increased if the corresponding partial derivative keeps its sign. On the contrary, it will be decreased if the partial derivative of the error function changes its sign, since a changing sign indicates that the minimum was missed due to a too large learning rate. Weight backtracking is a technique of undoing the last iteration and adding a smaller value to the weight in the next step; without it, the algorithm can jump over the minimum several times. The pseudocode of resilient backpropagation with weight backtracking is given by
for all weights{
  if (grad.old*grad>0){
    delta := min(delta*eta.plus, delta.max)
    weights := weights - sign(grad)*delta
    grad.old := grad
  }
  else if (grad.old*grad<0){
    weights := weights + sign(grad.old)*delta
    delta := max(delta*eta.minus, delta.min)
    grad.old := 0
  }
  else if (grad.old*grad=0){
    weights := weights - sign(grad)*delta
    grad.old := grad
  }
}
while that of the regular backpropagation is given by
for all weights{
  weights := weights - grad*delta
}
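As a rough illustration of the update rule above (a simplified vectorized sketch, not the package's internal code; the default values eta.plus = 1.2, eta.minus = 0.5, delta.max = 50 and delta.min = 1e-6 follow Riedmiller and H. Braun 1993), one iteration of resilient backpropagation with weight backtracking could be written in R as:

rprop.step <- function(weights, grad, grad.old, delta,
                       eta.plus = 1.2, eta.minus = 0.5,
                       delta.max = 50, delta.min = 1e-6) {
  same <- grad.old * grad > 0    # partial derivative kept its sign
  diff <- grad.old * grad < 0    # sign changed: the minimum was jumped over
  zero <- grad.old * grad == 0

  # sign kept: increase the learning rate and take a step
  delta[same]   <- pmin(delta[same] * eta.plus, delta.max)
  weights[same] <- weights[same] - sign(grad[same]) * delta[same]

  # sign changed: undo the last step (weight backtracking) and decrease the rate
  weights[diff] <- weights[diff] + sign(grad.old[diff]) * delta[diff]
  delta[diff]   <- pmax(delta[diff] * eta.minus, delta.min)
  grad[diff]    <- 0             # corresponds to grad.old := 0 above

  # zero product: take an ordinary step without changing the learning rate
  weights[zero] <- weights[zero] - sign(grad[zero]) * delta[zero]

  list(weights = weights, grad.old = grad, delta = delta)
}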
The globally convergent version introduced by Anastasiadis, G. Magoulas, and M. Vrahatis (2005)
performs a resilient backpropagation with an additional modification of
one learning rate in relation to all other learning rates. It is either
the learning rate associated with the smallest absolute partial
derivative or the smallest learning rate (indexed with $i$) that is changed in relation to all other learning rates and partial derivatives such that the global convergence of the algorithm is ensured; for further details see Anastasiadis, G. Magoulas, and M. Vrahatis (2005).
neuralnet depends on
two other packages: grid
and MASS (Venables and B. Ripley 2002).
Its usage follows that of functions dealing with regression
analyses like lm
and glm
. As essential arguments, a formula in terms of response variables ~ sum of covariates and a data set containing covariates and response variables have to be specified. Default values are defined for all other parameters. We use the data set infert, which is provided by the package datasets, to illustrate its application. This data set contains data of a
case-control study that investigated infertility after spontaneous and
induced abortion (Trichopoulos, N. Handanos, J. Danezis, A. Kalandidi, and V. Kalapothaki 1976). The data set consists of 248
observations: 83 women who were infertile (cases) and 165 women who were not infertile (controls). It includes amongst others the variables
age
, parity
, induced
, and spontaneous
. The variables induced
and spontaneous
denote the number of prior induced and spontaneous
abortions, respectively. Both variables take possible values 0, 1, and 2
relating to 0, 1, and 2 or more prior abortions. The age in years is
given by the variable age
and the number of births by parity
.
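The relevant variables, including the case-control status case, can be inspected directly in R; infert is shipped with the datasets package, so no additional installation is needed (output omitted):

> data(infert, package = "datasets")
> str(infert[, c("age", "parity", "induced", "spontaneous", "case")])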
The function neuralnet
used for training a neural network provides the
opportunity to define the required number of hidden layers and hidden
neurons according to the needed complexity. The complexity of the
calculated function increases with the addition of hidden layers or
hidden neurons. The default value is one hidden layer with one hidden
neuron. The most important arguments of the function are the following:
formula
, a symbolic description of the model to be fitted (see
above). No default.
data
, a data frame containing the variables specified in
formula
. No default.
hidden
, a vector specifying the number of hidden layers and hidden
neurons in each layer. For example the vector (3,2,1) induces a
neural network with three hidden layers, the first one with three,
the second one with two and the third one with one hidden neuron.
Default: 1.
threshold
, a numeric value specifying the threshold for the partial
derivatives of the error function as stopping criteria. Default:
0.01.
rep
, number of repetitions for the training process. Default: 1.
startweights
, a vector containing prespecified starting values for
the weights. Default: random numbers drawn from the standard normal
distribution
algorithm
, a string containing the algorithm type. Possible values
are "backprop"
, "rprop+"
, "rprop-"
, "sag"
, or "slr"
.
"backprop"
refers to traditional backpropagation, "rprop+"
and
"rprop-"
refer to resilient backpropagation with and without
weight backtracking and "sag"
and "slr"
refer to the modified
globally convergent algorithm (grprop). "sag"
and "slr"
define
the learning rate that is changed according to all others. "sag"
refers to the smallest absolute derivative, "slr"
to the smallest
learning rate. Default: "rprop+"
err.fct
, a differentiable error function. The strings "sse"
and
"ce"
can be used, which refer to ‘sum of squared errors’ and
‘cross entropy’. Default: "sse"
act.fct
, a differentiable activation function. The strings
"logistic"
and "tanh"
are possible for the logistic function and
tangent hyperbolicus. Default: "logistic"
linear.output
, logical. If act.fct
should not be applied to the
output neurons, linear.output
has to be TRUE
. Default: TRUE
likelihood
, logical. If the error function is equal to the
negative log-likelihood function, likelihood
has to be TRUE
.
Akaike’s Information Criterion (AIC, (Akaike 1973)) and Bayes
Information Criterion (BIC, (Schwarz 1978)) will then be
calculated. Default: FALSE
exclude
, a vector or matrix specifying weights that should be
excluded from training. If given as a vector, the exact positions of the weights have to be known. A matrix with n rows and three columns will exclude n weights, where the first column indicates the layer, the second column the input neuron, and the third column the output neuron of the corresponding weight. Default: NULL
constant.weights
, a vector specifying the values of weights that
are excluded from training and treated as fixed. Default: NULL
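For instance, a network with three hidden layers containing three, two, and one hidden neurons, as described for the hidden argument above, could be requested with a call of the following form (a hypothetical example using the infert data introduced above; output omitted):

> nn3 <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=c(3,2,1),
+     err.fct="ce", linear.output=FALSE)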
The usage of neuralnet
is described by modeling the relationship
between the case-control status (case
) as response variable and the
four covariates age
, parity
, induced
and spontaneous
. Since the
response variable is binary, the activation function could be chosen as
logistic function (default) and the error function as cross-entropy
(err.fct="ce"
). Additionally, the item linear.output
should be
stated as FALSE
to ensure that the output is mapped by the activation
function to the interval [0, 1].
> library(neuralnet)
Loading required package: grid
Loading required package: MASS
>
> nn <- neuralnet(
+ case~age+parity+induced+spontaneous,
+ data=infert, hidden=2, err.fct="ce",
+ linear.output=FALSE)
> nn
Call:
neuralnet(
formula = case~age+parity+induced+spontaneous,
data = infert, hidden = 2, err.fct = "ce",
linear.output = FALSE)
1 repetition was calculated.
Error Reached Threshold Steps
1 125.2126851 0.008779243419 5254
Basic information about the training process and the trained neural
network is saved in nn
. This includes all information that has to be
known to reproduce the results, such as the starting weights.
Important values are the following:
net.result
, a list containing the overall result, i.e. the output,
of the neural network for each replication.
weights
, a list containing the fitted weights of the neural
network for each replication.
generalized.weights
, a list containing the generalized weights of
the neural network for each replication.
result.matrix
, a matrix containing the error, reached threshold,
needed steps, AIC and BIC (computed if likelihood=TRUE
) and
estimated weights for each replication. Each column represents one
replication.
startweights
, a list containing the starting weights for each
replication.
A summary of the main results is provided by nn$result.matrix
:
> nn$result.matrix
1
error 125.212685099732
reached.threshold 0.008779243419
steps 5254.000000000000
Intercept.to.1layhid1 5.593787533788
age.to.1layhid1 -0.117576380283
parity.to.1layhid1 1.765945780047
induced.to.1layhid1 -2.200113693672
spontaneous.to.1layhid1 -3.369491912508
Intercept.to.1layhid2 1.060701883258
age.to.1layhid2 2.925601414213
parity.to.1layhid2 0.259809664488
induced.to.1layhid2 -0.120043540527
spontaneous.to.1layhid2 -0.033475146593
Intercept.to.case 0.722297491596
1layhid.1.to.case -5.141324077052
1layhid.2.to.case 2.623245311046
The training process needed 5254 steps until all absolute partial
derivatives of the error function were smaller than 0.01 (the default
threshold). The estimated weights range from -5.14 to 5.59. For instance, the intercepts of the first hidden layer are 5.59 and 1.06, and the four weights leading to the first hidden neuron are estimated as -0.12, 1.77, -2.20, and -3.37 for the covariates age, parity, induced and spontaneous, respectively. If the error function is
equal to the negative log-likelihood function, the error refers to the
likelihood as is used for example to calculate Akaike’s Information
Criterion (AIC).
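Since the cross-entropy error used here corresponds to the negative log-likelihood of a binary response, AIC and BIC can be requested by additionally setting likelihood=TRUE (a hypothetical call; both criteria then appear as additional rows of the result matrix, output omitted):

> nn.lik <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=2, err.fct="ce",
+     linear.output=FALSE, likelihood=TRUE)
> nn.lik$result.matrix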
The given data is saved in nn$covariate
and nn$response
as well as
in nn$data
for the whole data set inclusive non-used variables. The
output of the neural network, i.e. the fitted values nn$net.result
:
> out <- cbind(nn$covariate,
+ nn$net.result[[1]])
> dimnames(out) <- list(NULL,
+ c("age","parity","induced",
+ "spontaneous","nn-output"))
> head(out)
age parity induced spontaneous nn-output
[1,] 26 6 1 2 0.1519579877
[2,] 42 1 1 0 0.6204480608
[3,] 39 6 2 0 0.1428325816
[4,] 34 4 2 0 0.1513351888
[5,] 35 3 1 1 0.3516163154
[6,] 36 4 2 1 0.4904344475
In this case, the object nn$net.result
is a list consisting of only
one element relating to one calculated replication. If more than one
replication were calculated, the outputs would be saved each in a
separate list element. This approach is the same for all values that
change with the replication apart from result.matrix, which is saved as a matrix with one column for each replication.
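For example, a hypothetical call with rep=3 would produce three replications; the fitted values of, say, the second replication are then accessible as nn.rep$net.result[[2]] and its weights as nn.rep$weights[[2]] (output omitted):

> nn.rep <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=2, err.fct="ce",
+     linear.output=FALSE, rep=3)
> head(nn.rep$net.result[[2]])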
To compare the results, neural networks are trained with the same
parameter setting as above using
neuralnet with
algorithm="backprop"
and the package
nnet.
> nn.bp <- neuralnet(
+ case~age+parity+induced+spontaneous,
+ data=infert, hidden=2, err.fct="ce",
+ linear.output=FALSE,
+ algorithm="backprop",
+ learningrate=0.01)
> nn.bp
Call:
neuralnet(
formula = case~age+parity+induced+spontaneous,
data = infert, hidden = 2, learningrate = 0.01,
algorithm = "backprop", err.fct = "ce",
linear.output = FALSE)
1 repetition was calculated.
Error Reached Threshold Steps
1 158.085556 0.008087314995 4
>
>
> nn.nnet <- nnet(
+ case~age+parity+induced+spontaneous,
+ data=infert, size=2, entropy=T,
+ abstol=0.01)
# weights: 13
initial value 158.121035
final value 158.085463
converged
nn.bp
and nn.nnet
show nearly equal results. Both training processes needed only very few iteration steps and the error is approximately 158. Thus
in this little comparison, the model fit is less satisfying than that
achieved by resilient backpropagation.
neuralnet includes the
calculation of generalized weights as introduced by Intrator and N. Intrator (2001). The
generalized weight $\tilde{w}_i$ is defined as the contribution of the $i$th covariate $x_i$ to the log-odds, $$\tilde{w}_i = \frac{\partial \log\left(\frac{o(\mathbf{x})}{1-o(\mathbf{x})}\right)}{\partial x_i}.$$ The generalized weight expresses the effect of each covariate and thus has an analogous interpretation as the $i$th regression parameter in regression models; however, it depends on all other covariates. Its distribution indicates whether the effect of the covariate is linear, since a small variance suggests a linear effect. The generalized weights are saved in nn$generalized.weights and are given in the following format (rounded values)
> head(nn$generalized.weights[[1]])
[,1] [,2] [,3] [,4]
1 0.0088556 -0.1330079 0.1657087 0.2537842
2 0.1492874 -2.2422321 2.7934978 4.2782645
3 0.0004489 -0.0067430 0.0084008 0.0128660
4 0.0083028 -0.1247051 0.1553646 0.2379421
5 0.1071413 -1.6092161 2.0048511 3.0704457
6 0.1360035 -2.0427123 2.5449249 3.8975730
The columns refer to the four covariates age, parity, induced, and spontaneous, and a generalized weight is given for each observation even though the values are equal for identical covariate combinations.
The results of the training process can be visualized by two different plots. First, the trained neural network can simply be plotted by
> plot(nn)
The resulting plot is given in Figure 3.
It reflects the structure of the trained neural network, i.e. the
network topology. The plot includes by default the trained synaptic
weights, all intercepts as well as basic information about the training
process like the overall error and the number of steps needed to
converge. Especially for larger neural networks, the size of the plot
and that of each neuron can be determined using the parameters
dimension
and radius
, respectively.
The second possibility to visualize the results is to plot generalized
weights. gwplot
uses the calculated generalized weights provided by
nn$generalized.weights
and can be used by the following statements:
> par(mfrow=c(2,2))
> gwplot(nn,selected.covariate="age",
+ min=-2.5, max=5)
> gwplot(nn,selected.covariate="parity",
+ min=-2.5, max=5)
> gwplot(nn,selected.covariate="induced",
+ min=-2.5, max=5)
> gwplot(nn,selected.covariate="spontaneous",
+ min=-2.5, max=5)
The corresponding plot is shown in Figure 4.
The generalized weights are given for all covariates within the same
range. The distribution of the generalized weights suggests that the
covariate age
has no effect on the case-control status since all
generalized weights are nearly zero and that at least the two covariates
induced
and spontaneous
have a non-linear effect since the variance
of their generalized weights is overall greater than one.
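This visual impression can be checked numerically, for instance by computing the empirical variance of the generalized weights for each covariate; a small variance for age and clearly larger variances for induced and spontaneous would be expected (output omitted):

> apply(nn$generalized.weights[[1]], 2, var)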
The function compute
calculates and summarizes the output of each neuron, i.e. all
neurons in the input, hidden and output layer. Thus, it can be used to
trace all signals passing the neural network for given covariate
combinations. This helps to interpret the network topology of a trained
neural network. It can also easily be used to calculate predictions for
new covariate combinations. A neural network is trained with a training
data set consisting of known input-output pairs. It learns an
approximation of the relationship between inputs and outputs and can
then be used to predict outputs for new covariate combinations. The function compute simplifies this calculation. It automatically redefines the structure of
the given neural network and calculates the output for arbitrary
covariate combinations.
To stay with the example, predicted outputs can be calculated for
instance for missing combinations with age=22
, parity=1
,
induced ≤ 1, and spontaneous ≤ 1. They are provided by new.output$net.result:
> new.output <- compute(nn,
covariate=matrix(c(22,1,0,0,
22,1,1,0,
22,1,0,1,
22,1,1,1),
byrow=TRUE, ncol=4))
> new.output$net.result
[,1]
[1,] 0.1477097
[2,] 0.1929026
[3,] 0.3139651
[4,] 0.8516760
This means that the predicted probability of being a case given the
mentioned covariate combinations, i.e. the conditional probability P(case = 1 | x), amounts to about 0.15, 0.19, 0.31, and 0.85, respectively, and thus increases in this example with the number of prior abortions.
The weights of a neural network follow a multivariate normal distribution if the network is identified (White 1989). A neural network is identified if it does not include irrelevant neurons in either the input layer or the hidden layers. An irrelevant neuron in the input layer can be for instance a covariate that has no effect or that is a linear combination of other included covariates. If this restriction is fulfilled and if the error function equals the negative log-likelihood, a confidence interval can be calculated for each weight. The neuralnet package provides the function confidence.interval to calculate these confidence intervals regardless of whether all restrictions are fulfilled. Therefore, the user has to be careful interpreting the results.
Since the covariate age
has no effect on the outcome and the related
neuron is thus irrelevant, a new neural network (nn.new
), which has
only the three input variables parity
, induced
, and spontaneous
,
has to be trained to demonstrate the usage of confidence.interval
. Let
us assume that all restrictions are now fulfilled, i.e. neither the
three input variables nor the two hidden neurons are irrelevant.
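The exact call used to train nn.new is not shown in the original text; a minimal sketch reusing the settings of the earlier example could look like this:

> nn.new <- neuralnet(
+     case~parity+induced+spontaneous,
+     data=infert, hidden=2, err.fct="ce",
+     linear.output=FALSE)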
Confidence intervals can then be calculated with the function
confidence.interval
:
> ci <- confidence.interval(nn.new, alpha=0.05)
> ci$lower.ci
[[1]]
[[1]][[1]]
[,1] [,2]
[1,] 1.830803796 -2.680895286
[2,] 1.673863304 -2.839908343
[3,] -8.883004913 -37.232020925
[4,] -48.906348154 -18.748849335
[[1]][[2]]
[,1]
[1,] 1.283391149
[2,] -3.724315385
[3,] -2.650545922
For each weight, ci$lower.ci
provides the related lower confidence
limit and ci$upper.ci
the related upper confidence limit. The first
matrix contains the limits of the weights leading to the hidden neurons.
The columns refer to the two hidden neurons. The other three values are
the limits of the weights leading to the output neuron.
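For a more readable display, the lower limits, estimated weights and upper limits can be combined, for instance for the weights leading to the output neuron (a hypothetical post-processing step; output omitted):

> cbind(lower = ci$lower.ci[[1]][[2]],
+       estimate = nn.new$weights[[1]][[2]],
+       upper = ci$upper.ci[[1]][[2]])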
This paper gave a brief introduction to multi-layer perceptrons and supervised learning algorithms. It introduced the package neuralnet that can be applied when modeling functional relationships between covariates and response variables. neuralnet contains a very flexible function that trains multi-layer perceptrons to a given data set in the context of regression analyses. It is a very flexible package since most parameters can be easily adapted. For example, the activation function and the error function can be arbitrarily chosen and can be defined by the usual definition of functions in R.
The authors thank Nina Wawro for reading preliminary versions of the paper and for giving helpful comments. Additionally, we would like to thank two anonymous reviewers for their valuable suggestions and remarks.
We gratefully acknowledge the financial support of this research by the grant PI 345/3-1 from the German Research Foundation (DFG).