This paper describes flan, a package providing tools for fluctuation analysis of mutant cell counts. It includes functions dedicated to the distribution of final numbers of mutant cells. Parametric estimation and hypothesis testing are also implemented, enabling inference on different sorts of data with several possible methods. An overview of the subject is given. The general form of mutation models is described, including the classical models as particular cases. Estimating with one model when the data have been generated by another induces several possible biases, which are identified and discussed. The three estimation methods available in the package are described, and their mean squared errors are compared. Finally, the implementation is discussed, and a few examples of usage on real data sets are given.
Mutation models are probabilistic descriptions of the growth of a
population of cells, where mutations occur randomly during the process.
Data are samples of integers, interpreted as final numbers of mutant
cells. These numbers may be coupled with final numbers of cells (mutant
and non mutant). The frequent appearance in the data of very large
mutant counts, usually called “jackpots”, evidences heavy-tailed
probability distributions. The parameter of interest is the mutation
probability, i.e., the probability π for a mutant cell to appear upon
completion of any given cell division.
Any mutation model can be interpreted as the result of the three following ingredients:
a random number of mutations occurring with small probability among
a large number of cell divisions. Due to the law of small numbers,
the number of mutations approximately follows a Poisson
distribution. The expectation of that distribution, denoted by α, is the mean number of mutations;
from each mutation, a clone of mutant cells growing for a random
time. Due to exponential growth, most mutations occur close to the
end of the experiment, and the developing time of a random clone has
exponential distribution. The rate of that distribution, denoted by ρ, is the relative fitness, i.e., the ratio of the growth rate of normal cells to that of mutants;
the number of mutant cells that any clone developing for a given time will produce. The distribution of this number depends on the distribution of division times of mutants.
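These three ingredients translate directly into a simulation procedure. The sketch below (plain R; the function name rluria is illustrative and not part of the package) samples final mutant counts in the simplest case: exponentially distributed division times of mutants and no cell deaths, so that a clone grown for an exponential time has a geometric size.

# Minimal sketch of the classical mutation model (exponential mutant lifetimes,
# no deaths): 'alpha' is the mean number of mutations, 'rho' the relative fitness.
rluria <- function(n, alpha = 1, rho = 1) {
  replicate(n, {
    k <- rpois(1, alpha)                   # ingredient 1: Poisson number of mutations
    if (k == 0) 0L else {
      t <- rexp(k, rate = rho)             # ingredient 2: exponential developing times
      sum(rgeom(k, prob = exp(-t)) + 1L)   # ingredient 3: geometric clone sizes (Yule growth)
    }
  })
}

set.seed(1)
mc <- rluria(100, alpha = 2, rho = 1)
summary(mc)   # mostly small counts, plus occasional "jackpots"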
Using the theory of continuous time branching processes (Bellman and Harris 1952; Athreya and Ney 1972), and under specific modeling assumptions, it can be proved that the asymptotic distribution of the final number of mutants has an explicit form. A first mutation model with explicit distribution is the well known Luria-Delbrück model (Luria and Delbrück 1943). Other mathematical models were introduced by Lea and Coulson (1949), followed by Armitage (1952) and Bartlett (1978). In these models, division times of mutant cells were supposed to be exponentially distributed. Thus a clone develops according to a Yule process, and its size at a given time follows a geometric distribution. The distribution of final mutant counts is also explicit when division times are supposed to be constant. This latter model is called the Haldane model by Sarkar (1991); an explicit form of the asymptotic distribution is given in Ycart (2013). General division times have been studied by Ycart (2013), but no explicit distribution is available apart from the exponential and constant cases.
The first estimation method was given by Luria and Delbrück (1943). It is based
on the simple relation between the probability of null counts in the
sample and the mutation probability, and it is called the P0 method. Of
course, if the sample does not contain null counts, the method cannot be
applied. Apart from the P0 method, all other methods couple the
estimation of the mean number of mutations α with that of the fitness ρ.
The problem with classical mutation models is that they are based on quite unrealistic assumptions: constant final number of cells (Angerer 2001b; Komarova et al. 2007; Ycart and Veziris 2014), no cell deaths (Angerer 2001b, Section 3.1; Dewanji et al. 2005; Komarova et al. 2007; Ycart 2014), fully efficient plating (Stewart et al. 1990; Stewart 1991; Angerer 2001a), or, as mentioned above, exponential distribution of division times. Using one model for estimation when the data have been generated by another necessarily induces a bias on estimates. For instance, if cell deaths are neglected, the mutation probability will be underestimated.
Several software tools have already been developed for fluctuation
analysis (Zheng 2002; Hall et al. 2009; Gillet-Markowska et al. 2015). These tools are quite
user-friendly. However, they do not take into account all the possible
model assumptions mentioned above. The flan package described
here is dedicated to mutation models and to parameter estimation with the
three methods P0, ML, and GF. It includes a set of functions for the
distribution of mutant cell counts (dflan, pflan, qflan, rflan)
and a graphic function (draw.clone). They treat general models, with
fluctuating final numbers, cell deaths, and division time
distributions other than exponential and constant. The general estimation
function is mutestim. It returns estimates of the model parameters
together with their asymptotic standard deviations; a hypothesis testing
function (flan.test) is provided. In order to make
the package user-friendly, the functions have been designed to resemble
classical R functions, like t.test or rnorm.
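As an illustration of this design, a minimal session could look as follows (a sketch only; the vector of mutant counts is an arbitrary toy sample, not one of the data sets shipped with the package):

library(flan)

# A toy sample of mutant counts, including one jackpot
mc <- c(0, 1, 0, 4, 12, 0, 2, 1, 0, 157)

# Estimates of the mean number of mutations and fitness (default settings)
mutestim(mc = mc)

# One-sample test, in the spirit of t.test
flan.test(mc = mc)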
The paper is organized as follows. Section 2 is devoted to the probabilistic setting: the hypotheses of the different models are described, and the asymptotic results are explained. In Section 3, the three estimation methods are presented, and the biases described above are discussed. A comparison of the three methods in terms of mean squared errors is provided. The user interface and the Rcpp (Eddelbuettel 2013) implementation are treated in Section 4; examples of execution are shown in Section 5.
In this section, probabilistic mutation models are described. The basic modeling hypotheses are the following:
at time 0, a population of normal (non-mutant) cells is given;
the lifetime of any normal cell is a random variable with a given
distribution function;
upon completion of the generation time of a normal cell:
with probability π, it divides into one normal cell and one mutant cell;
with probability 1 − π, it divides into two normal cells;
the lifetime of any mutant cell is a random variable with a given
distribution function, possibly different from that of normal cells;
upon completion of the lifetime of a mutant cell:
with probability δ, it dies;
with probability 1 − δ, it divides into two mutant cells;
all random variables and events (division times, mutations, and deaths) are mutually independent.
Consider that the initial number of cells is large and the mutation probability is small, in such a way that the mean number of mutations α stays fixed. Under the above hypotheses, the final number of mutants converges in distribution, and the limit is described through the three ingredients above:
the number of mutations follows the Poisson distribution with intensity α;
the developing time of a random clone follows the exponential distribution with rate ρ;
the final number of mutants follows the distribution with PGF
$$g(z) = \exp\{\alpha\,(h(z)-1)\}, \tag{3}$$
where h denotes the PGF of the size of a single clone. The PGF (3) defines a parametrized family of
distributions for the final number of mutants, with parameters α, ρ, and the death probability δ.
At this point, the PGF h, and hence the distribution of the final number of mutants, is explicit for exponentially distributed and for constant division times of mutants, with or without cell deaths. The functions dflan, pflan, and qflan compute densities,
probabilities, and quantiles of these distributions. When the final number of cells is itself a random variable N, the PGF of the number of mutants becomes
$$\mathbb{E}\left[\exp\{\pi N\,(h(z)-1)\}\right]. \tag{4}$$
Remark that if N is a constant n, then (4) reduces to (3) with α = π n.
The two particular cases considered for the distribution of the final number of cells are the constant and the log-normal ones. The function rflan outputs samples of pairs (mutant counts–final counts) following these distributions.
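For instance, a call of the following form should return a sample of 50 such pairs (a sketch: the argument names mutations and fitness are assumed from the package manual, all other arguments keeping their default values):

# Sample of 50 pairs (mutant count, final number) under the default model
x <- rflan(50, mutations = 1, fitness = 1)
str(x)   # expected: a list with components $mc and $fn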
Here the three estimation methods P0, ML and GF are described. The main
features and the limitations of each method are discussed. The three
methods compute estimates of the mean number of mutations α (or of the
mutation probability π) and of the fitness ρ. Even if the probabilities and their derivatives with respect to the parameters can be computed, their evaluation has a non-negligible cost, which induces practical limitations for some of the methods.
In the rest of this section, the three estimators are described, their performances compared in terms of MSE, and the possible sources of biases discussed.
The first method was introduced by Luria and Delbrück (1943).
If there are no cell deaths, the probability of observing a null count in the sample is exp(−α), so that α can be estimated by the negative logarithm of the relative frequency of null counts.
Consider now that mutant cells can die: the probability of a null count then also depends on the death probability δ, and the estimate is adjusted accordingly.
Notice that the P0 method does not directly yield an estimator of the fitness ρ; in the package, ρ is then estimated by likelihood maximization, with α fixed at its P0 estimate.
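In its simplest version (no deaths, fully efficient plating), the P0 estimate of α is minus the logarithm of the proportion of null counts. A minimal sketch (the helper name p0_estimate is illustrative, not the package implementation):

# P0 estimate of the mean number of mutations alpha (no deaths, full plating)
p0_estimate <- function(mc) {
  p0 <- mean(mc == 0)
  if (p0 == 0) stop("no null counts: the P0 method cannot be applied")
  -log(p0)
}

# Illustrative sample: 6 null counts out of 10
mc <- c(0, 0, 1, 0, 3, 0, 27, 0, 1, 0)
p0_estimate(mc)    # -log(0.6), about 0.51
# if final numbers fn were available, pi would be estimated by p0_estimate(mc) / mean(fn)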
Since algorithms
(Embrechts and Hawkes 1982; Zheng 2005; Hamon and Ycart 2012; Ycart and Veziris 2014) enable
the computation of the probabilities of the mutant count distributions and of their derivatives with respect to the parameters, maximum likelihood (ML) estimation is feasible. Two types of data must be distinguished:
sample of mutant counts: in that case, the likelihood is computed
with the probabilities of the model parametrized by (α, ρ);
sample of pairs of (mutant counts–final numbers): in that case, the
likelihood is computed with the probabilities of the model parametrized by (π, ρ).
In both cases, the likelihood is maximized numerically.
However, when the sample maximum is large, sums of products of small
terms must be computed (Hamon and Ycart 2012). The procedure can be very long
and numerically unstable. Thus, the ML estimators can fail for samples containing very large mutant counts; this is the reason why such counts are winsorized, i.e., truncated at a given bound.
In theory, it is also possible to be explicit about the probabilities of
the model when the plating process is less than fully efficient; in practice, the computational cost is prohibitive, and an inefficient plating is only taken into account by the GF method.
The GF method uses the PGF to estimate the parameters of a compound
Poisson distribution (Rémillard and Theodorescu 2000; Hamon and Ycart 2012). Let ĝ denote the empirical PGF of the sample of mutant counts. Since, by (3), log g(z) = α(h(z) − 1), the ratio log g(z₁)/log g(z₂) does not depend on α; the estimate of ρ is obtained as the zero of a monotone function derived from that ratio, evaluated at two fixed points z₁ and z₂; the estimate of α then follows, using a third point z₃.
The GF estimators depend on the three arbitrary values of z₁, z₂, and z₃.
In practice, the GF estimators are quite comparable in precision to ML estimators, with a much broader range of calculability, a better numerical stability, and a negligible computing time, even in cases where the ML method fails. For that reason, we have chosen to initialize the ML optimization by GF estimates, to improve both numerical stability and computing time.
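To fix ideas, here is a minimal sketch of the GF principle for the LD model without deaths. The clone-size PGF h follows from the three-ingredient construction of the introduction; the evaluation points 0.1, 0.9, 0.8 below are arbitrary choices, not necessarily those used by the package, and the function names are illustrative.

# PGF of the size of a single clone under the LD model (Yule growth during an
# exponential developing time with rate rho)
h <- function(z, rho) {
  rho * z * integrate(function(u) u^rho / (1 - (1 - u) * z), 0, 1)$value
}

# GF estimates of (alpha, rho) from a sample of mutant counts mc
gf_estimate <- function(mc, z1 = 0.1, z2 = 0.9, z3 = 0.8) {
  g <- function(z) mean(z^mc)          # empirical PGF of the sample
  # log g(z) = alpha * (h(z) - 1), so this ratio does not depend on alpha;
  # uniroot may fail when the sample contains no jackpot (see the limitation below)
  f <- function(rho) log(g(z1)) / log(g(z2)) - (h(z1, rho) - 1) / (h(z2, rho) - 1)
  rho <- uniroot(f, lower = 0.05, upper = 20)$root
  alpha <- log(g(z3)) / (h(z3, rho) - 1)
  c(mutations = alpha, fitness = rho)
}

# e.g., with the rluria sketch from the introduction:
# gf_estimate(rluria(1000, alpha = 2, rho = 1))   # should return values close to (2, 1)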
The only practical limitation of this method is the following. A zero of
the monotone function defining the estimator of ρ must be computed. If the sample contains no jackpots, that function may have no zero: in that case, the fitness cannot be estimated; it is set to its default value, and only α is estimated. The GF estimators are asymptotically normal, and their asymptotic standard deviations can be explicitly computed.
The function mutestim computes estimates and their respective standard
deviations for α (or π) and ρ, with any of the three methods. Based on asymptotic normality, a hypothesis testing function (flan.test) is provided. The null hypothesis will be either fixed
theoretical values of the parameters (one-sample test), or the equality of the parameters of two samples (two-sample test).
Figures 1 and 2 (drawn using ggplot2) show “maps of
usage” of the estimation methods. For each point of a grid of parameter values:
draw a large number of samples of mutant counts from the corresponding distribution;
for each sample, compute ML, GF and P0 estimates of α and ρ;
from the estimates, compute the relative MSE (8) of each method;
assign a RGB color according to the MSEs (the rule is sketched after this list). For each method:
if the MSE is less than 0.05, assign 1 to the corresponding RGB component;
if the MSE is greater than 1, assign 0 to the corresponding RGB component;
else, assign 1 minus the MSE to the corresponding RGB component.
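The rule mapping a relative MSE to an RGB component can be written as follows (a sketch; the assignment of methods to the red, green, and blue channels is an illustrative choice):

# Map a relative MSE to a color component in [0, 1]:
# below 0.05 -> 1, above 1 -> 0, and 1 - MSE in between
mse_to_component <- function(mse) {
  ifelse(mse < 0.05, 1, ifelse(mse > 1, 0, 1 - mse))
}

# One color per point of the grid, one component per method, e.g.
# rgb(mse_to_component(mse_ML), mse_to_component(mse_GF), mse_to_component(mse_P0))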
The same experiment is also performed in a second setting, yielding Figure 2.
The map can be roughly divided into four distinct parts:
For
For
For small values of
For
Figure 2 is quite similar to Figure 1. There are still two remarkable differences:
For
For
The three methods should also be compared in terms of computational
time. An illustration on real data will be given in
Section 5. The slowest method is ML, for the reasons
discussed in the previous section. It is even slower when the estimates
are calculated under the Haldane model, for which the computation of the probabilities is more demanding.
If the model used for the estimation does not correspond to the theoretical model, the estimates can be biased. Four different sources of bias are considered:
the final counts are random in the data, constant for the estimation model;
cell deaths occur in the data, not in the estimation model;
the lifetime distribution is different in the data and the estimation model;
the plating process is less than 100% efficient.
In each case, simulation experiments have been made along the following lines:
draw a large number of samples from a model including the feature under consideration;
for each sample, compute estimates using another model, and using the true one if available;
compare the empirical distributions of the estimates through boxplots.
In each figure, red lines mark the value 1 (no bias), blue lines mark relative biases of 0.9 and 1.1. According to Figures 1 and 2, the GF method is at least equivalent in terms of the relative MSE (8) to the ML and P0 methods. Moreover, it is also the best in terms of computational time. Therefore, the biases will be mainly illustrated with the GF method.
When the final numbers of cells are available, the mutation probability π can be estimated in three ways:
divide ML estimates of α by the empirical mean of the final numbers (left boxplots);
directly compute ML with the sample of pairs (mutant counts–final counts) (center boxplots);
derive π from ML estimates of α, taking into account both the empirical mean and the empirical coefficient of variation of the final numbers (right boxplots).
According to the visual observations, the bias reduction seems to be
working well when either the product
The PGFs (1), (4) and (2) depend on the death probability δ. To illustrate the effect of ignoring cell deaths, samples were simulated with δ > 0, and the estimates were obtained by:
computing GF estimates of α and ρ with δ set to 0 (left boxplots);
computing GF estimates of α and ρ with the true value of δ (right boxplots).
The visual results show that the negative bias induced by ignoring cell
deaths increases with the value of δ.
As mentioned earlier, the PGF h is explicit only for exponential and constant division times. Samples were therefore simulated with a lifetime distribution that is neither exponential nor constant, and the estimates were computed under both the LD and H models. From the visual observations, the following conclusions can be drawn:
both models correctly estimate the mean number of mutations α;
the estimates of the fitness ρ are biased under both models.
Ignoring the plating efficiency induces a negative bias on the estimate of the mean number of mutations. To illustrate it, samples were simulated with a plating efficiency smaller than 1, and the estimates were obtained in three ways (the corresponding calls are sketched after the list):
computing GF estimates with the plating efficiency set to 1, i.e., ignoring it (left boxplots);
applying the correction of Stewart et al. (1990, eq. (41)) to GF estimates (center boxplots);
computing GF estimates with the known value of the plating efficiency (right boxplots).
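In terms of package calls, the first and third cases correspond to the following sketch, where mc is a sample plated with efficiency 0.2:

# GF estimates ignoring the plating efficiency (left boxplots)
mutestim(mc = mc, method = "GF")

# GF estimates with the known plating efficiency (right boxplots)
mutestim(mc = mc, plateff = 0.2, method = "GF")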
The visual results illustrate first the fact that the correction of
Stewart et al. (1990) should not be used beyond the setting for which it was derived; taking the plating efficiency into account in the GF estimates (right boxplots) corrects the bias.
The available functions are described here; more details are given in the manual. The behavior of the inference functions for inputs which fall outside the practical limitations is described. Some details about the Rcpp implementation are also provided.
The flan package can be split into two distinct parts: the
distribution of the final number of mutants and statistical inference.
The functions dflan, pflan, qflan compute densities, probabilities
and quantiles of the mutant count distributions; rflan outputs samples of pairs
(mutant counts–final counts) following those distributions. A graphic function draw.clone is also provided. Using a binary
tree, it represents the growth of a clone starting from a single normal
cell with mutation occurrences until a finite time. The function
mutestim computes estimates of the model parameters; flan.test uses asymptotic normality to
perform one or two-sample hypothesis testing. It has been designed on
the principle of the classical hypothesis testing functions of R, such
as t.test. As mentioned in Section 3, there are
practical limitations for each estimation method. If the inputs of the
mutestim function do not respect those limitations, it will output
errors or warning messages:
If the P0 method is required on a sample which does not contain any null count, the mutestim function will throw an error with a message.
For now, an inefficient plating (i.e. a plating efficiency smaller than 1) is taken into account only by the GF method. If another method is required with a plating efficiency smaller than 1, the mutestim function will return a warning message and set the plating efficiency to 1.
Issues can also arise with the winsorization parameter:
If the minimum of the sample is larger than
If
The GF method does not have limitations of usage, even in extreme
cases where the ML estimators fail, i.e., samples with a theoretically
large mean number of mutations. In the mutestim function, the search domain of the numerical optimization is bounded.
Moreover, the initialization of the ML method is done with the GF
method; the domain of optimization is then a neighborhood of the GF estimates.
The function flan.test is a wrapper of mutestim. It will
output the same errors or warning messages if its inputs do not respect
the practical limitations.
Since most functions involve loops that are more expensive in R than in C, flan has been implemented with Rcpp modules. This paradigm provides an easy way to expose C++ functions and classes to R. There are four main classes in the C++ implementation:
FLAN_Sim: random generation for the mutation models;
FLAN_SimClone: random generation for the clone size distribution,
according to the lifetime distribution;
FLAN_MutationModel: computation of the descriptive functions
(probabilities, PGF, ...) for the mutation models;
FLAN_Clone: computation of the descriptive functions for the clone
size distribution, according to the lifetime distribution.
The Rcpp interface also enables importing any R
function into the C++ code. In particular, it is interesting to import the R functions
which are already implemented in C: no external C/C++ library is then
required, the installation remains basic, and the size of the installed
package is reduced. For example, the computation performed by integrate
is actually implemented in C, and the R function can be directly called
from the C++ code. However, such imports increase computational time
and memory consumption, which is unsatisfactory. Another solution
uses the package RcppGSL
(Eddelbuettel 2013, Chapter 11), which creates an interface between Rcpp
and the C library gsl. Several integration methods are thus available.
Since it only requires that gsl is correctly installed, this solution is
a good compromise between easy setup and computational cost.
Computing the probabilities for the Haldane model involves a Fast Fourier Transform, and the R function fft could be directly called
from the C++ code. For the same reasons as for the integrate function, it
is more adequate to use the package
RcppArmadillo
(Eddelbuettel 2013, Chapter 10; Eddelbuettel and Sanderson 2014). This package
links Rcpp to the C++ library armadillo, which is dedicated to linear
algebra. In particular, it includes an efficient Fast Fourier Transform.
Finally, likelihood optimizations in the ML and P0 methods are done with
a bounded BFGS optimizer. The package
lbfgsb3 provides the
eponymous function which is implemented in Fortran. It is much faster
than the basic R function optim
.
Some examples using the real data sets included in flan are given.
Practical limitations, the influence of the bias sources, and the comparison of the
estimation methods in terms of computational time are illustrated.
Consider first the eleventh sample of mutant counts of the werhoff
data (Werngren and Hoffner 2003):
werhoff$samples$W11$mc
## [1] 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 5 5
This sample does not contain any jackpot; hence the theoretical fitness,
in a mutation model, should be very large. If the GF method is used, it
will output a warning message and set the fitness to its default value 1:
W <- werhoff$samples$W11$mc
# Compute GF estimates of mutations number and fitness
mutestim(mc = W, method = "GF")
## Warning in mutestim(mc = W, method = "GF"): Impossible to estimate
## 'fitness' with 'GF'-method: 'fitness' is set to default value 1 and
## only mutations number is estimated.
## $mutations
## [1] 0.8792917
##
## $sd.mutations
## [1] 0.260549
##
## $fitness
## [1] 1
##
## $sd.fitness
## [1] 0
Notice that Table 1 of Werngren and Hoffner (2003) shows mutant counts and mean final numbers of cells for 1 mL of solution. However, the total volume of each culture is 5 mL. In other words, a plating efficiency of 20% has been applied. The fact is that the fitness parameter cannot be estimated, even when the plating efficiency is taken into account:
# Volume of each sample: 1 mL
# Total volume of each culture: 5 mL
# i.e. plating efficiency of 0.2
pef <- 0.2
# Compute GF estimates of mutation probability and fitness
# taking into account the plating efficiency
mutestim(mc = W, plateff = pef, method = "GF")
## Warning in mutestim(mc = W, plateff = pef, method = "GF"): Impossible to estimate 'fitness' with
## 'GF'-method: 'fitness' is set to default value 1 and only mutations number is estimated.
## $mutations
## [1] 2.804244
##
## $sd.mutations
## [1] 0.74011
##
## $fitness
## [1] 1
##
## $sd.fitness
## [1] 0
Using the P0 method with the fitness set to 1 gives another estimate of the mean number of mutations:
# Compute P0 estimate of mutations number
mutestim(mc = W, fitness = 1, method = "P0")
## $mutations
## [1] 2.525729
##
## $sd.mutations
## [1] 0.678233
The P0 estimate of the mean number of mutations is close to the GF estimate taking the plating efficiency into account. Consider now the sample D11 of the david data set, which contains both mutant counts and final numbers:
david$D11
## $mc
## [1] 4 0 1 0 1 0 0 0 0 0
##
## $fn
## [1] 1.3e+09 9.2e+08 1.3e+09 2.5e+09 1.3e+09 1.6e+09 1.3e+09 2.5e+09
## [9] 2.5e+09 2.0e+09
Since the final numbers are available, the mutation probability π can be estimated. Three ways of taking the final numbers into account are compared below:
D <- david$D11
Dmfn <- mean(D$fn)
# Compute ML estimates and confidence intervals of mutation probability and
# fitness with empirical mean and null coefficient of variation
ft <- flan.test(mc = D$mc, mfn = Dmfn)
ft$estimate
## mutation probability fitness
## 2.067641e-10 2.214676e+00
ft$conf.int
## mutation probability fitness
## bInf 0.000000e+00 0.000000
## bSup 4.423916e-10 8.109577
## attr(,"conf.level")
## [1] 0.95
# Compute ML estimates and confidence intervals of mutation probability and
# fitness with empirical mean and empirical coefficient of variation
ft <- flan.test(mc = D$mc, mfn = Dmfn, cvfn = sd(D$fn)/Dmfn)
ft$estimate
## mutation probability fitness
## 2.092720e-10 2.214676e+00
ft$conf.int
## mutation probability fitness
## bInf 0.000000e+00 0.000000
## bSup 4.506155e-10 8.109577
## attr(,"conf.level")
## [1] 0.95
# Compute ML estimates and confidence intervals of mutation probability and
# fitness with couples (mc,fn)
ft <- flan.test(mc = D$mc, fn = D$fn)
ft$estimate
## mutation probability fitness
## 1.977135e-10 2.048984e+00
ft$conf.int
## mutation probability fitness
## bInf 0.000000e+00 0.000000
## bSup 4.078591e-10 7.127426
## attr(,"conf.level")
## [1] 0.95
The sample of final counts is denoted by fn. The results are summarized in the following table.
 | Estimates of the mutation probability | Confidence intervals (95%) of the mutation probability | Estimates of the fitness | Confidence intervals (95%) of the fitness |
---|---|---|---|---|
ML using the empirical mean of fn | 2.068e-10 | [0; 4.424e-10] | 2.215 | [0; 8.110] |
ML using the empirical mean and coefficient of variation of fn | 2.093e-10 | [0; 4.506e-10] | 2.215 | [0; 8.110] |
ML using the pairs (mc, fn) | 1.977e-10 | [0; 4.079e-10] | 2.049 | [0; 7.127] |
Consider finally the data from Boe et al. (1994, Tab. 4). The authors studied
mutations of Escherichia coli from sensitivity to nalidixic acid
resistance. 23 samples of resistant bacteria counts are provided. As in
the original paper, the 23 samples are concatenated into one. The three
estimation methods are compared in terms of computational time on the
resulting sample of size 1104. The package microbenchmark is used,
evaluating each method 10^4 times:
B <- unlist(boeal) # Concatenation of the 23 samples
require(microbenchmark)
# Comparing the methods in terms of computational performance
# (mutations number only)
microbenchmark(P0 = mutestim(mc = B, fitness = 1, method = "P0"),
GF = mutestim(mc = B, fitness = 1, method = "GF"),
ML = mutestim(mc = B, fitness = 1, method = "ML"),
unit = "ms", times = 1e4)
## Unit: milliseconds
## expr min lq mean median uq max neval
## P0 0.063696 0.0800720 0.09701435 0.0899955 0.0986275 2.149537 10000
## GF 0.327989 0.3694555 0.41309310 0.3850380 0.4035050 26.774004 10000
## ML 8.381844 10.5333670 12.23137063 10.9118690 11.3860045 43.592177 10000
# Comparing the methods in terms of computational performance
microbenchmark(P0 = mutestim(mc = B, method = "P0"),
GF = mutestim(mc = B, method = "GF"),
ML = mutestim(mc = B, method = "ML"),
unit = "ms", times = 1e4)
## Unit: milliseconds
## expr min lq mean median uq max neval
## P0 8.451605 11.004419 12.832827 11.848562 13.082066 43.00397 10000
## GF 2.274944 2.419663 2.688851 2.481529 2.549597 33.19333 10000
## ML 73.704656 79.581238 83.841150 81.802464 84.323398 120.65768 10000
The results are shown in Figure 7, as boxplots of the timing distributions. Times are in milliseconds and plotted on a log scale.
Figure 7: Timing distributions of the three methods on the 23 samples of boeal concatenated as one. For each method, the estimates of α setting ρ = 1 (left panel) or of the couple (α, ρ) (right panel) have been computed under the LD model. The timings have been returned by microbenchmark, evaluating each method 10^4 times. Times are in milliseconds and plotted on log-scale.
As mentioned earlier, the ML method is the slowest. The GF method seems
to be quite faster than the P0 method. However, the estimate of the fitness with the P0 method is obtained by likelihood maximization, which explains most of its computational cost: when the fitness is set to 1, the P0 method is by far the fastest.