The ** brms** package allows R users to easily specify a wide range of Bayesian single-level and multilevel models which are fit with the probabilistic programming language Stan behind the scenes. Several response distributions are supported, of which all parameters (e.g., location, scale, and shape) can be predicted. Non-linear relationships may be specified using non-linear predictor terms or semi-parametric approaches such as splines or Gaussian processes. Multivariate models can be fit as well. To make all of these modeling options possible in a multilevel framework,

Multilevel models (MLMs) offer great flexibility for researchers across
sciences (Gelman and Hill 2006; Pinheiro and Bates 2006; Demidenko 2013; Brown and Prescott 2015).
They allow modeling of data measured on different levels at the same
time – for instance students nested within classes and schools – thus
taking complex dependency structures into account. It is not surprising
that many packages have been developed to fit MLMs in R. Usually, the
functionality of these implementations is limited insofar as they only
predict the mean of the response distribution. Other parameters of the
response distribution, such as the residual standard deviation in linear
models, are assumed constant across observations, an assumption which
may be violated in many applications. Accordingly, it is desirable to
allow prediction of *all* response parameters at the same time. Models
that perform simultaneous estimation are often referred to as
*distributional models* or, more verbosely, as *models for location,
scale, and shape* (Rigby and Stasinopoulos 2005).

Another limitation of basic MLMs is that they only allow for linear
predictor terms. While linear predictor terms already offer good
flexibility, they are of limited use when relationships are inherently
non-linear. Such non-linearity can be handled in at least two ways: (1)
by fully specifying a non-linear predictor term with corresponding
parameters, each of which can be predicted using MLMs (Lindstrom and Bates 1990);
or (2) estimating the form of the non-linear relationship on-the-fly
using splines (Wood 2004) or Gaussian processes (Rasmussen and Williams 2006). The
former are often simply called *non-linear models*, while models
applying splines are referred to as *generalized additive models* (GAMs,
see Hastie and Tibshirani 1990).

Combining all of these modeling options into one framework is a complex task, both conceptually and with regard to model fitting. Maximum likelihood methods, typically applied in classical ‘frequentist’ statistics, can reach their limit at some point such that fully Bayesian methods become the go-to solutions to fit such complex models (Gelman et al. 2014). In addition to being more flexible, the Bayesian framework comes with other advantages, such as the ability to derive probability statements for every quantity of interest or explicitly incorporate prior knowledge about parameters into the model. The former is particularly relevant in non-linear models, for which classical approaches struggle more often than not in propagating all the uncertainty in the parameter estimates to non-linear functions such as out-of-sample predictions.

Possibly the most powerful program for performing full Bayesian inference available to date is Stan (Carpenter et al. 2017; Stan Development Team 2017b), which implements Hamiltonian Monte Carlo (Duane et al. 1987; Neal 2011; Betancourt et al. 2014) and its extension, the No-U-Turn (NUTS) Sampler (Hoffman and Gelman 2014). These algorithms converge much more quickly than other Markov-Chain Monte-Carlo (MCMC) algorithms, especially for high-dimensional models (Betancourt et al. 2014; Hoffman and Gelman 2014; Betancourt 2017). An excellent non-mathematical introduction to Hamiltonian Monte Carlo can be found in Betancourt (2017).

Stan comes with its own programming language, allowing for great
modeling flexibility (Carpenter et al. 2017; Stan Development Team 2017b). Many researchers may
still be hesitant to use Stan directly, as every model has to be
written, debugged, and possibly also optimized, which may be a
time-consuming and error-prone process even for researchers familiar
with Bayesian inference. The
** brms** package
(Bürkner 2017) presented in this paper aims to remove these hurdles for
a wide range of regression models by allowing the user to benefit from
the merits of Stan by using extended

`vignette("brms_customfamilies")`

).The purpose of the present article is to provide an introduction to the
advanced multilevel formula syntax implemented in ** brms**, which fits
a wide and growing range of non-linear distributional multilevel models.
A general overview of the package is given in Bürkner (2017).
Accordingly, the present article focuses on more recent developments. We
begin by explaining the underlying structure of distributional models.
Next, the formula syntax of

The core model implemented in ** brms** is the prediction of the
response \(y\) through predicting all parameters \(\theta_p\) of the
response distribution \(D\), which is also called the model

`family`

in
many R packages. We write
\[y_i \sim D(\theta_{1i}, \theta_{2i}, \dots)\]
to stress the dependency on the \(i^{th}\) observation. If
desired, every parameter \(\theta_p\) may be regressed on its own
predictor term \(\eta_p\), transformed by the inverse link function \(f_p\)
that is \(\theta_{pi} = f_p(\eta_{pi})\). Such models are typically
referred to as `family`

are given in
`vignette("brms_families")`

.Suppressing the index \(p\) for simplicity, a predictor term \(\eta\) can
generally be written as
\[\eta = \mathbf{X} \beta + \mathbf{Z} u + \sum_{k = 1}^K s_k(x_k)\]
where \(\beta\) and \(u\) are the respective coefficients at the
population-level and group-level, and \(\mathbf{X}, \mathbf{Z}\) are the
corresponding design matrices. The terms \(s_k(x_k)\) symbolize optional
smooth functions of unspecified form based on covariates \(x_k\) fitted
via splines (see Wood 2011 for the underlying implementation in the
** mgcv** package) or
Gaussian processes (Williams and Rasmussen 1996). The response \(y\) as well as
\(\mathbf{X}\), \(\mathbf{Z}\), and \(x_k\) make up the data, whereas \(\beta\),
\(u\), and the smooth functions \(s_k\) are the model parameters being
estimated. The coefficients \(\beta\) and \(u\) may be more commonly known
as fixed and random effects, but we avoid theses terms following the
recommendations of Gelman and Hill (2006). Details about prior distributions for
\(\beta\) and \(u\) can be found in Bürkner (2017) and under

`help("set_prior")`

.As an alternative to the strictly additive formulation described above,
predictor terms may also have any form specifiable in Stan. We call it a
*non-linear* predictor and write
\[\eta = f(c_1, c_2, \dots, \phi_1, \phi_2, \dots)\]
The structure of the function \(f\) is given by the user, \(c_r\) are known
or observed covariates, and \(\phi_s\) are non-linear parameters, each
having its own linear predictor term \(\eta_{\phi_s}\) of the form
specified above. In fact, we should think of non-linear parameters as
placeholders for linear predictor terms rather than as parameters
themselves. A frequentist implementation of such models, which inspired
the non-linear syntax in ** brms**, can be found in the

The formula syntax applied in ** brms** builds upon the syntax of the R
package

An ** lme4** formula has the general form

`~ pterms + (gterms | group) response `

The `pterms`

contain population-level effects, assumed to be the same
across observations. The `gterms`

contain so called group-level effects,
assumed to vary across the grouping variables specified in `group`

.
Multiple grouping factors each with multiple group-level effects are
possible. Usually, `group`

contains only a single variable name pointing
to a factor, but users may also use `g1:g2`

or `g1/g2`

if both `g1`

and
`g2`

are suitable grouping factors. The `:`

operator creates a new
grouping factor that consists of the combined levels of `g1`

and `g2`

.
The `/`

operator indicates nested grouping structures and expands one
grouping factor into two or more when using multiple `/`

within one
term. For instance, if the user were to write `(1 | g1/g2)`

, ** brms**
will expand this to

`(1 | g1) + (1 | g1:g2)`

. Instead of `|`

, users may
use `||`

in grouping terms to prevent group-level correlations from
being modeled. This may be used to reduce model complexity if
convergence of the sampler is an issue. One limitation of the `||`

operator in `mixed`

function of the While intuitive and visually appealing, the classic ** lme4** syntax is
not flexible enough to specify the more complex models supported by

`|`

operator into `|<ID>|`

, where `<ID>`

can be any value. Group-level terms
with the same `ID`

will then be modeled as correlated if they share same
grouping factor(s). For instance, if the terms `(x1|ID|g1)`

and
`(x2|ID|g1)`

appear somewhere in the same or different formulas passed
to Further extensions of the classical ** lme4** syntax refer to the

`group`

component. In the `group`

component is
rather limited in its flexibility since only variable names combined by
`:`

or `/`

are supported. We propose two extensions of this syntax.The first extension concerns the `group`

element. We propose that
`group`

can generally be split up in its terms so that, for example,
`(1 | g1 + g2)`

is expanded to `(1 | g1) + (1 | g2)`

. This is fully
consistent with the way `/`

is handled and provides a natural
generalization to the existing syntax.

The second extension concerns special grouping structures that cannot
currently be expressed by simply combining grouping variables. For
instance, multi-membership models cannot be expressed this way. To
overcome this limitation, we propose wrapping terms in `group`

within
special functions that allow specifying alternative grouping structures:
`(gterms | fun(group))`

. In ** brms**, there are currently two such
functions implemented:

`gr`

for the default behavior, and `mm`

for
multi-membership terms. To be compatible with the original syntax and to
keep formulas short, `gr`

is automatically added internally if no
function is specified.While some non-linear relationships, such as quadratic relationships,
can be expressed within the basic R formula syntax, more complicated
ones cannot. Thus, we have made it possible in ** brms** to specify
non-linear predictor terms. Similar to the implementation in

```
~ b1 * (1 - exp(-(x / b2) ^ b3)
y ~ z + (1|ID|g)
b1 ~ (1|ID|g)
b2 ~ (1|ID|g) b3
```

The first formula will not be evaluated using standard R formula
parsing, but is instead taken literally. In contrast, the formulas for
the non-linear parameters will be evaluated in the usual way and are
compatible with all terms supported by ** brms**. Note that we have
used the above described ID-syntax to model group-level effects
correlated across formulas.

There are other syntax extensions implemented in ** brms** that do not
directly target grouping terms. First, there are terms formally included
in the

`pterms`

part that are handled separately. The most prominent
examples are smooth terms specified using the `s`

and `t2`

functions of
the `cs`

, monotonic effects `mo`

, noise-free effects `me`

, or
Gaussian process terms `gp`

. The first is explained in Bürkner (2017),
while the latter three are documented in `help(brmsformula)`

.
Internally, these terms are extracted from `pterms`

and not included in
the construction of the population-level design matrix.Second, since the `|`

operator is unused on the left-hand side of \(\sim\)
in the formula, additional information on the response variable may be
specified via

`| aterms ~ <predictor terms> response `

The `aterms`

part may contain multiple terms of the form
`fun(<variable>)`

, each separated by `+`

and providing special
information on the response variable, including weighting observations,
providing known standard errors for meta-analysis, or modeling censored
or truncated data. As it is not the main topic of the present paper, we
refer users to `help("brmsformula")`

and `help("addition-terms")`

for
more details.

To set up the model formulas and combine them into one object,
** brms** defines the

`brmsformula`

function, or `bf`

for short. Its
output can then be passed to the `parse_bf`

function, which splits up
the formulas into separate parts and prepares them for the generation of
design matrices and related data. Other packages may re-use these
functions in their own routines, making it easier to offer support for
the above described multilevel syntax.The idea of ** brms** is to provide one unified framework for
multilevel regression models in R. The above described formula syntax
and all of its variations can be applied in combination with all
response distributions supported by

`help("brmsfamily")`

and
`vignette("brms_families")`

for an overview.In this section, we will discuss four examples in detail, each focusing on certain aspects of the syntax. These examples are chosen to provide a broad overview of the modeling options. The first is about the number of fish caught by different groups of people. It does not actually contain any multilevel structure, but helps to understand how to set up formulas for different model parts. The second example is about housing rents in Munich. We model the data using splines and a distributional regression approach. The third example is about cumulative insurance loss payments across several years, which is fitted using a rather complex non-linear multilevel model. Finally, the fourth example is about the performance of school children who change school during the year, thus requiring a multi-membership model.

In addition to the four examples, we wish to discuss briefly a few more
modeling options. First, ** brms** allows fitting of so-called
phylogenetic models. These models are relevant in evolutionary biology
when data of many species are analyzed at the same time. Species are not
independent if they come from the same phylogenetic tree, implying that
different levels of the same grouping-factor (i.e., species) are likely
correlated. See

`vignette("brms_phylogenetics")`

for an example. Second,
there is a canonical way to handle ordinal predictors without falsely
assuming they are either categorical or continuous. We call them
monotonic effects and discuss them in `vignette("brms_monotonic")`

. Last
but not least, it is possible to account for measurement error in both
response and predictor variables. Although often ignored in applied
regression modeling (Westfall and Yarkoni 2016), measurement error is common in all
scientific fields that employ observational data. There is no vignette
yet for this topic, but one will be added in the future. In the
meantime, `help("brmsformula")`

is the best place to start learning
about such models as well as about other details of the An important application of the distributional regression framework of
** brms** are zero-inflated and hurdle models. These models are helpful
whenever there are more zeros in the response variable than one would
naturally expect. Here, we consider an example dealing with the number
of fish caught by various groups of people. On the UCLA website
(https://stats.idre.ucla.edu/stata/dae/zero-inflated-poisson-regression),
the data are described as follows: “The state wildlife biologists want
to model how many fish are being caught by fishermen at a state park.
Visitors are asked how long they stayed, how many people were in the
group, were there children in the group and how many fish were caught.
Some visitors do not fish, but there is no data on whether a person
fished or not. Some visitors who did fish did not catch any fish so
there are excess zeros in the data because of the people that did not
fish.”

```
<- read.csv("http://stats.idre.ucla.edu/stat/data/fish.csv")
zinb $camper <- factor(zinb$camper, labels = c("no", "yes"))
zinbhead(zinb)
```

```
nofish livebait camper persons child xb zg count1 1 0 no 1 0 -0.8963146 3.0504048 0
2 0 1 yes 1 0 -0.5583450 1.7461489 0
3 0 1 no 1 0 -0.4017310 0.2799389 0
4 0 1 yes 2 1 -0.9562981 -0.6015257 0
5 0 1 no 1 0 0.4368910 0.5277091 1
6 0 1 yes 4 2 1.3944855 -0.7075348 0
```

For predictors, we choose the number of people per group, the number of children, and whether or not the group consists of campers. Many groups may not catch any fish because they do not try and so we fit a zero-inflated Poisson model. For now, we assume a constant zero-inflation probability across observations.

```
<- brm(count ~ persons + child + camper, data = zinb,
fit_zinb1 family = zero_inflated_poisson("log"))
```

The model is readily summarized via

`summary(fit_zinb1)`

```
: zero_inflated_poisson (log)
Family: count ~ persons + child + camper
Formula: zinb (Number of observations: 250)
Data: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
Samples-warmup samples = 4000
total post: Not computed
WAIC
-Level Effects:
Population-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l-1.01 0.17 -1.34 -0.67 2171 1
Intercept 0.87 0.04 0.79 0.96 2188 1
persons -1.36 0.09 -1.55 -1.18 1790 1
child 0.80 0.09 0.62 0.98 2950 1
camper
:
Family Specific Parameters-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l0.41 0.04 0.32 0.49 2409 1
zi
sampling(NUTS). For each parameter, Eff.Sample
Samples were drawn using
is a crude measure of effective sample size, and Rhat is the potential chains (at convergence, Rhat = 1). scale reduction factor on split
```

Figure 1 shows the graphical summary output by

`marginal_effects(fit_zinb1)`

According to the parameter estimates, larger groups catch more fish,
campers catch more fish than non-campers, and groups with more children
catch less fish. The zero-inflation probability `zi`

is pretty large
with a mean of 41%. Note that the probability of catching no fish is
actually higher than 41%, since parts of this probability are already
modeled by the Poisson distribution itself (hence the name
zero-*inflation*). If the user wants to treat all zeros as originating
from a separate process, hurdle models are also supported by ** brms**
(not shown here).

Since we expect groups with more children to avoid fishing, we next try to predict the zero-inflation probability using the number of children. From a purely statistical perspective, zero-inflated and hurdle distributions are a mixture of two processes. Predicting both parts of the model is natural and often very reasonable to make full use of the data.

```
<- brm(bf(count ~ persons + child + camper, zi ~ child),
fit_zinb2 data = zinb, family = zero_inflated_poisson())
```

To transform the linear predictor `zi`

into a probability, ** brms**
applies the logit-link, which takes values in \([0, 1]\) and returns
values on the real line. Thus, it allows the transition between
probabilities and linear predictors.

`summary(fit_zinb2)`

```
: zero_inflated_poisson (log)
Family: count ~ persons + child + camper
Formula~ child
zi : zinb (Number of observations: 250)
Data: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
Samples-warmup samples = 4000
total post: Not computed
WAIC
-Level Effects:
Population-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l-1.07 0.18 -1.43 -0.73 2322 1
Intercept 0.89 0.05 0.80 0.98 2481 1
persons -1.17 0.10 -1.37 -1.00 2615 1
child 0.78 0.10 0.60 0.96 3270 1
camper -0.95 0.27 -1.52 -0.48 2341 1
zi_Intercept 1.21 0.28 0.69 1.79 2492 1
zi_child
sampling(NUTS). For each parameter, Eff.Sample
Samples were drawn using
is a crude measure of effective sample size, and Rhat is the potential chains (at convergence, Rhat = 1). scale reduction factor on split
```

According to the model, trying to fish with children decreases the overall number fish caught (as implied by the Poisson part of the model) and decreases the chance of catching any fish (as implied by the zero-inflation part).

Next we demonstrate how to comparing model fit via leave-one-out cross
validation as implemented in the
** loo** package
(Vehtari et al. 2016a,b):

`loo(fit_zinb1, fit_zinb2)`

```
LOOIC SE1639.52 363.30
fit_zinb1 1621.35 362.39
fit_zinb2 - fit_zinb2 18.16 15.71 fit_zinb1
```

The example above shows that the second model (using the number of
children as a predictor) has better fit. However, when considering the
standard error of the `LOOIC`

difference, improvement in model fit is
apparently modest and not substantial. More examples of distributional
models can be found in `vignette("brms_distreg")`

.

Fahrmeir et al. (2013) use an example about housing rents in Munich. The data
contains information on roughly 3000 apartments, and includes variables
such as the absolute rent (`rent`

), rent per square meter (`rentsqm`

),
size of the apartment (`area`

), construction year (`yearc`

), and the
district in Munich where the apartment is located (`district`

). The data
can be found in the
** gamlss.data**
package (Stasinopoulos and Rigby 2016):

```
data("rent99", package = "gamlss.data")
head(rent99)
```

```
rent rentsqm area yearc location bath kitchen cheating district1 109.9487 4.228797 26 1918 2 0 0 0 916
2 243.2820 8.688646 28 1918 2 0 0 1 813
3 261.6410 8.721369 30 1918 1 0 0 1 611
4 106.4103 3.547009 30 1918 2 0 0 0 2025
5 133.3846 4.446154 30 1918 2 0 0 1 561
6 339.0256 11.300851 30 1918 2 0 0 1 541
```

Here, we wish to predict the rent per square meter using the following explanatory variables: size of the apartment, the construction year, and the district in which the apartment is located. As the effect of apartment size and construction year predictors is of unknown non-linear form, we model these variables using a bivariate tensor spline (Wood et al. 2013). The district is accounted for using a varying intercept.

```
<- brm(rentsqm ~ t2(area, yearc) + (1|district), data = rent99,
fit_rent1 chains = 2, cores = 2)
```

We fit the model using just two chains instead of the default of four chains on two processor cores to reduce the model fitting time for the purpose of the present paper. In general, using the default option of four chains (or more) is recommended.

`summary(fit_rent1)`

```
: gaussian(identity)
Family: rentsqm ~ t2(area, yearc) + (1 | district)
Formula: rent99 (Number of observations: 3082)
Data: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
Samples-warmup samples = 2000
total post: LOO = NA; WAIC = NA; R2 = NA
ICs
:
Smooth Terms-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error lsds(t2areayearc_1) 4.93 2.32 1.61 10.77 1546 1.00
sds(t2areayearc_2) 5.78 2.87 1.58 13.15 1175 1.00
sds(t2areayearc_3) 8.09 3.19 3.66 16.22 1418 1.00
-Level Effects:
Group~district (Number of levels: 336)
-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error lsd(Intercept) 0.60 0.06 0.48 0.73 494 1.01
-Level Effects:
Population-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l7.80 0.11 7.59 8.02 2000 1.00
Intercept -1.00 0.09 -1.15 -0.83 2000 1.00
t2areayearc_1 0.75 0.17 0.43 1.09 2000 1.00
t2areayearc_2 -0.07 0.16 -0.40 0.24 1579 1.00
t2areayearc_3
:
Family Specific Parameters-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l1.95 0.03 1.90 2.01 2000 1.00
sigma
sampling(NUTS). For each parameter, Eff.Sample
Samples were drawn using
is a crude measure of effective sample size, and Rhat is the potential chains (at convergence, Rhat = 1). scale reduction factor on split
```

For models including splines, the output of `summary`

gives limited
information. First, the credible intervals of the standard deviations of
the coefficients forming the splines (under `’Smooth Terms’`

) are
sufficiently far away from zero to indicate non-linearity in the
combined effect of `area`

and `yearc`

. Second, even after controlling
for these predictors, rent per square meter will still vary considerably
by district, as visible under `’Group-Level Effects’`

in the output. To
further understand the effect of the predictor, we apply graphical
methods:

`marginal_effects(fit_rent1, surface = TRUE)`

In Figure 2, the marginal effect of each predictor is displayed with the other predictor fixed at its mean. In Figure 3, the combined effect demonstrates the interaction between the variables. In particular, housing rents appear to be highest for small and relatively new apartments.

In the above example, we only consider the mean of the response
distribution to vary by `area`

and `yearc`

, but this may not be a
reasonable assumption since the variance might vary with these variables
as well. Accordingly, we fit splines and the effects of district for
both the location and scale parameter, which is called `sigma`

in
Gaussian models.

```
<- bf(rentsqm ~ t2(area, yearc) + (1|ID1|district),
bform ~ t2(area, yearc) + (1|ID1|district))
sigma <- brm(bform, data = rent99, chains = 2, cores = 2) fit_rent2
```

If not otherwise specified, `sigma`

is predicted on the log-scale to
ensure it is positive no matter how the predictor term looks. Instead of
`(1|district)`

as in the previous model, we now use `(1|ID1|district)`

in both formulas, which results in modeling the varying intercepts of
both model parts as correlated (see the description of the ID-syntax
above). The group-level part of the `summary`

output looks like the
following:

```
-Level Effects:
Group~district (Number of levels: 336)
-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error lsd(Intercept) 0.60 0.06 0.49 0.73 744 1.00
sd(sigma_Intercept) 0.11 0.02 0.06 0.15 751 1.00
cor(Intercept,sigma_Intercept) 0.72 0.17 0.35 0.98 648 1.00
```

As visible from the positive correlation of the intercepts, districts with overall higher rent per square meter also have higher variance in the outcome.

Lastly, we want to turn our attention to the splines. While
`marginal_effects`

is used to visualize effects of predictors on the
expected response, `marginal_smooths`

is used to show just the spline
parts of the model:

`marginal_smooths(fit_rent2)`

The plot on the left-hand side of Figure 4 resembles the
one in Figure 3, but the scale is different since only the
spline is plotted. The right-hand side of 4 shows the
spline for `sigma`

. Since we apply the log-link on `sigma`

by default,
the spline is on the log-scale as well. As visible in the plot, the
variation in the rent per square meter is highest for relatively small
and old apartments, while the variation is smallest for medium to large
apartments build around the 1960s.

On his blog, Markus Gesmann predicts the growth of cumulative insurance loss payments over time, originated from different origin years (see http://www.magesblog.com/2015/11/loss-developments-via-growth-curves-and.html). We will use a slightly simplified version of his model for demonstration purposes: \[cum_{AY, dev} \sim N(\mu_{AY, dev}, \sigma)\]

\[\mu_{AY, dev} = ult_{AY} \left(1 - \exp\left(- \left( \frac{dev}{\theta} \right)^\omega \right) \right)\] The cumulative insurance payments \(cum\) will grow over time, and we model this dependency using the variable \(dev\). Furthermore, \(ult_{AY}\) is the (to be estimated) ultimate loss of accident each year. It constitutes a non-linear parameter in our framework along with the parameters \(\theta\) and \(\omega\), which are responsible for the growth of the cumulative loss and are for now assumed to be the same across years. We load the data

```
<- paste0("https://raw.githubusercontent.com/mages/",
url "diesunddas/master/Data/ClarkTriangle.csv")
<- read.csv(url)
loss head(loss)
```

```
AY dev cum1 1991 6 357.848
2 1991 18 1124.788
3 1991 30 1735.330
4 1991 42 2182.708
5 1991 54 2745.596
6 1991 66 3319.994
```

and translate the proposed model into a non-linear ** brms** model.

```
<- bf(cum ~ ult * (1 - exp(-(dev / theta)^omega)),
nlform ~ 1 + (1|AY), omega ~ 1, theta ~ 1, nl = TRUE)
ult
<- c(prior(normal(5000, 1000), nlpar = "ult"),
nlprior prior(normal(1, 2), nlpar = "omega"),
prior(normal(45, 10), nlpar = "theta"))
<- brm(formula = nlform, data = loss, family = gaussian(),
fit_loss1 prior = nlprior, control = list(adapt_delta = 0.9))
```

In the above functions calls, quite a few things happen. The formulas
are wrapped in `bf`

to combine them into one object. The first formula
specifies the non-linear model and we set argument `nl = TRUE`

so that
** brms** takes this formula literally, instead of using standard R
formula parsing. We specify one additional formula per non-linear
parameter to: (a) clarify which variables are covariates and which are
parameters; and (b) specify the predictor term for the parameters. We
estimate a group-level effect for accident year (variable

`AY`

) and
ultimate loss `ult`

. This specification nicely shows how a non-linear
parameter is actually a placeholder for a linear predictor, which in the
case of `ult`

, contains only a varying intercept for year. Both `omega`

and `theta`

are assumed to be constant across observations so we just
fit a population-level intercept.Priors on population-level effects are required and are mandatory to
ensure identifiability for the present model. Otherwise, we may observe
that different Markov chains converge to different parameter regions as
multiple posterior distribution are equally plausible. Setting prior
distributions is a difficult task especially in non-linear models
because it requires some experience and knowledge both about the model
that is being fitted and about the data at hand. Additionally, there is
more to keep in mind to optimize the sampler’s performance: First, using
non- or weakly-informative priors in non-linear models often leads to
problems even if the model is generally identified. For instance, if a
zero-centered and reasonably wide prior such as `normal(0, 10000)`

is
set for `ult`

, there is little information about `theta`

and `omega`

for
samples of `ult`

being close to zero, which may lead to convergence
problems. Second, Stan works best when parameters are roughly on the
same order of magnitude (Stan Development Team 2017c). In the present example, `ult`

is
three orders larger than `omega`

. Although the sampler seems to work
quite well in this case, it may not work well for other models. One
solution is to rescale parameters before model fitting. For the present
example, one could downscale `ult`

by replacing it with `ult * 1000`

and
correspondingly the `normal(5000, 1000)`

prior with `normal(5, 1)`

.

In the `control`

argument we increase `adapt_delta`

to get rid of a few
divergent transitions (Bürkner 2017; cf. Stan Development Team 2017c). Again the model is
summarized via

`summary(fit_loss1)`

```
: gaussian (identity)
Family: cum ~ ult * (1 - exp(-(dev / theta)^omega))
Formula~ 1 + (1 | AY)
ult ~ 1
omega ~ 1
theta : loss (Number of observations: 55)
Data: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
Samples-warmup samples = 4000
total post: Not computed
WAIC
-Level Effects:
Group~AY (Number of levels: 10)
-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error lsd(ult_Intercept) 745.74 231.31 421.05 1306.04 916 1
-Level Effects:
Population-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l5273.70 292.34 4707.11 5852.28 798 1
ult_Intercept 1.34 0.05 1.24 1.43 2167 1
omega_Intercept 46.07 2.09 42.38 50.57 1896 1
theta_Intercept
:
Family Specific Parameters-95% CI u-95% CI Eff.Sample Rhat
Estimate Est.Error l139.93 15.52 113.6 175.33 2358 1
sigma
sampling(NUTS). For each parameter, Eff.Sample
Samples were drawn using
is a crude measure of effective sample size, and Rhat is the potential chains (at convergence, Rhat = 1). scale reduction factor on split
```

as well as

`marginal_effects(fit_loss1)`

which yields Figure 5.