The distribution of the sum of independent non-identical binomial random variables is frequently encountered in areas such as genomics, healthcare, and operations research. Analytical solutions for the density and distribution are usually cumbersome to find and difficult to compute. Several methods have been developed to approximate the distribution, among which is the saddlepoint approximation. However, implementation of the saddlepoint approximation is non-trivial. In this paper, we implement the saddlepoint approximation in the sinib package and provide two examples to illustrate its usage. One example uses simulated data while the other uses real-world healthcare data. The sinib package addresses the gap between the theory and the implementation of approximating the sum of independent non-identical binomials.
Convolution of independent non-identical binomial random variables appears in a variety of applications, such as analysis of variant-region overlap in genomics (Schmidt et al. 2015), calculation of bundle compliance statistics in healthcare organizations (Benneyan and Taşeli 2010), and reliability analysis in operations research (Kotz and Johnson 1984).
Computing the exact distribution of the sum of non-identical independent binomial random variables requires enumeration of all possible combinations of binomial outcomes that satisfy the totality constraint. However, analytical solutions are often difficult to find for sums of more than two binomial random variables. Several studies have proposed approximate solutions (Jolayemi 1992; Johnson et al. 2005). In particular, Eisinga et al. (2013) examined the saddlepoint approximation and compared it to exact solutions. They note that in practice, the approximation is often as good as the exact solution and can be implemented in most statistical software.
Despite the theoretical development of the aforementioned approximate solutions, a software implementation in R is still lacking. The stats package includes functions for frequently used distributions such as dbinom and dnorm, and for less frequently used distributions such as pbirthday, but it does not contain functions for the distribution of the sum of independent non-identical binomials. The EQL package provides a saddlepoint function to approximate the mean of i.i.d. random variables, but it does not apply to the case where the random variables are not identical. In this paper, we implement a saddlepoint approximation in sinib (sum of independent non-identical binomial random variables). The package provides the standard suite of distributional functions for the distribution (psinib), density (dsinib), quantile (qsinib), and random deviates (rsinib). The package is accompanied by detailed documentation and can be easily integrated into existing applications.
The remainder of this paper is organized as follows. We begin by providing an overview of the distribution of the sum of independent non-identical binomial random variables. Next, we give an overview of the saddlepoint approximation. The following section describes the design and implementation of the saddlepoint approximation in the sinib package. We provide two examples and assess the accuracy of the saddlepoint approximation in these situations. The final section concludes and discusses future directions.
Suppose $X_1, X_2, \ldots, X_m$ are independent non-identical binomial random variables with $X_i \sim \mathrm{Bin}(n_i, p_i)$, and let $S = \sum_{i=1}^{m} X_i$. The exact distribution of $S$ can be obtained recursively:

1. Compute the exact distribution of each $X_i$.
2. Calculate the distribution of $S_2 = X_1 + X_2$ by direct convolution: $P(S_2 = s) = \sum_{j} P(X_1 = j)\,P(X_2 = s - j)$.
3. Calculate $S_k = S_{k-1} + X_k$ for $k = 3, \ldots, m$ in the same fashion.
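This recursion can be sketched directly in base R; exact_sum_pmf is a hypothetical helper name used for illustration here, not a function in sinib:

```r
# Exact PMF of S = X_1 + ... + X_m by iterated convolution.
# Each pass computes P(S_k = s) = sum_j P(S_{k-1} = j) * P(X_k = s - j).
exact_sum_pmf = function(size, prob) {
  pmf = dbinom(0:size[1], size[1], prob[1])
  for (i in seq_along(size)[-1]) {
    pmf_i = dbinom(0:size[i], size[i], prob[i])
    # convolve() with type = 'open' performs polynomial-style convolution
    pmf = as.vector(convolve(pmf, rev(pmf_i), type = 'open'))
  }
  pmf
}

# Sanity check: Bin(2, 0.3) + Bin(3, 0.3) is exactly Bin(5, 0.3)
exact_sum_pmf(c(2, 3), c(0.3, 0.3))
```

Note that convolve() uses the FFT, so small round-off errors can appear in the result.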
Although the recursion speeds up the calculation, studies have shown that the result may be numerically unstable due to round-off error in computing the convolution sums.
The saddlepoint approximation, first proposed by Daniels (1954) and later extended by Lugannani and Rice (1980), provides highly accurate approximations for the probability and density of many distributions. In brief, let $K(u) = \ln E[e^{uS}]$ denote the cumulant generating function of a random variable $S$. The saddlepoint $\hat{u}$ is the solution to $K'(\hat{u}) = s$, and the first-order approximation to the density at $s$ is
$$\hat{P}_1(S = s) = \frac{\exp\{K(\hat{u}) - \hat{u}s\}}{\sqrt{2\pi K''(\hat{u})}}.$$
Eisinga et al. (2013) applied the saddlepoint approximation to the sum of independent non-identical binomial random variables. Suppose that $X_i \sim \mathrm{Bin}(n_i, p_i)$ for $i = 1, \ldots, m$ and $S = \sum_{i=1}^{m} X_i$. The cumulant generating function of $S$ is
$$K(u) = \sum_{i=1}^{m} n_i \ln(1 - p_i + p_i e^{u}),$$
with first derivative
$$K'(u) = \sum_{i=1}^{m} \frac{n_i p_i e^{u}}{1 - p_i + p_i e^{u}}.$$
The saddlepoint of $K$ is the unique solution $\hat{u}$ to $K'(\hat{u}) = s$, which in general must be found numerically.
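These quantities translate directly into base R. The sketch below finds the saddlepoint with a one-dimensional root search and evaluates the first-order density approximation; the helper names K, K1, K2, p1_hat, and saddlepoint are ours for illustration, not functions exported by sinib:

```r
# Cumulant generating function of S and its first two derivatives:
K  = function(u, size, prob) sum(size * log(1 - prob + prob * exp(u)))
K1 = function(u, size, prob) sum(size * prob * exp(u) / (1 - prob + prob * exp(u)))
K2 = function(u, size, prob) {
  q = 1 - prob + prob * exp(u)
  sum(size * prob * exp(u) * (1 - prob) / q^2)
}

# Saddlepoint: solve K'(u) = s numerically for an interior point 0 < s < sum(size)
saddlepoint = function(s, size, prob)
  uniroot(function(u) K1(u, size, prob) - s,
          lower = -50, upper = 50, tol = 1e-9)$root

# First-order saddlepoint approximation to P(S = s):
p1_hat = function(s, size, prob) {
  u_hat = saddlepoint(s, size, prob)
  exp(K(u_hat, size, prob) - u_hat * s) / sqrt(2 * pi * K2(u_hat, size, prob))
}

# Single-binomial check: compare with dbinom(5, 10, 0.5)
p1_hat(5, 10, 0.5)
```

For a single binomial at its mean, the saddlepoint is zero and the first-order approximation is typically within a few percent of the exact mass.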
The above shows the first-order approximation of the distribution. The approximation can be improved by adding a second-order correction term (Daniels 1987; Akahira and Takahashi 2001):
$$\hat{P}_2(S = s) = \hat{P}_1(S = s)\left\{1 + \frac{\hat{\kappa}_4(\hat{u})}{8} - \frac{5\hat{\kappa}_3^2(\hat{u})}{24}\right\},$$
where $\hat{\kappa}_3(\hat{u}) = K'''(\hat{u})[K''(\hat{u})]^{-3/2}$ and $\hat{\kappa}_4(\hat{u}) = K''''(\hat{u})[K''(\hat{u})]^{-2}$.
Although the saddlepoint equation cannot be solved at the boundaries $s = 0$ and $s = \sum_{i=1}^{m} n_i$, the exact probabilities at these points are readily available: $P(S = 0) = \prod_{i=1}^{m} (1 - p_i)^{n_i}$ and $P(S = \sum_i n_i) = \prod_{i=1}^{m} p_i^{n_i}$. Incorporation of the boundary solutions into the approximation gives:
$$\bar{P}(S = s) = \begin{cases} \prod_i (1 - p_i)^{n_i}, & s = 0,\\ \hat{P}_2(S = s)\,\dfrac{1 - P(S = 0) - P(S = \sum_i n_i)}{\sum_{j=1}^{\sum_i n_i - 1} \hat{P}_2(S = j)}, & 0 < s < \sum_i n_i,\\ \prod_i p_i^{n_i}, & s = \sum_i n_i. \end{cases} \quad (2)$$
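The boundary masses themselves are straightforward: $P(S = 0)$ requires every trial to fail and $P(S = \sum_i n_i)$ requires every trial to succeed. A short check in base R, with illustrative size and prob vectors:

```r
size = c(10, 20)
prob = c(0.2, 0.4)

# P(S = 0): all sum(size) trials fail
p_zero = prod((1 - prob)^size)
# P(S = sum(size)): all trials succeed
p_max  = prod(prob^size)

# Consistency check against the product of per-binomial boundary masses
c(p_zero, dbinom(0, 10, 0.2) * dbinom(0, 20, 0.4))
```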
We have implemented Equation (2) as the final approximation of the probability mass function. For the cumulative distribution, Daniels (1987) gave the following approximation for $s \ne E(S)$:
$$\hat{P}_3(S \ge s) = 1 - \Phi(\hat{w}) - \phi(\hat{w})\left(\frac{1}{\hat{w}} - \frac{1}{\hat{u}_1}\right),$$
where $\hat{w} = \mathrm{sign}(\hat{u})\sqrt{2\{\hat{u}s - K(\hat{u})\}}$, $\hat{u}_1 = (1 - e^{-\hat{u}})\sqrt{K''(\hat{u})}$, and $\Phi$ and $\phi$ denote the standard normal distribution and density functions.
The accuracy can be improved by adding a second-order continuity correction:
$$\hat{P}_4(S \ge s) = \hat{P}_3(S \ge s) - \phi(\hat{w})\left\{\frac{1}{\hat{u}_2}\left(\frac{\hat{\kappa}_4}{8} - \frac{5\hat{\kappa}_3^2}{24}\right) - \frac{1}{\hat{u}_2^3} - \frac{\hat{\kappa}_3}{2\hat{u}_2^2} + \frac{1}{\hat{w}^3}\right\}, \quad (3)$$
where $\hat{u}_2 = \hat{u}\sqrt{K''(\hat{u})}$.
We have implemented Equation (3) to approximate the cumulative distribution.
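As an illustration, the first-order tail approximation of Daniels (1987) can be sketched in base R and checked against pbinom for a single binomial; sp_upper_tail and the CGF helpers below are our names for this sketch, not functions exported by sinib:

```r
# CGF of the sum and its first two derivatives (restated so the sketch is
# self-contained):
K  = function(u, size, prob) sum(size * log(1 - prob + prob * exp(u)))
K1 = function(u, size, prob) sum(size * prob * exp(u) / (1 - prob + prob * exp(u)))
K2 = function(u, size, prob) {
  q = 1 - prob + prob * exp(u)
  sum(size * prob * exp(u) * (1 - prob) / q^2)
}

# First-order continuity-corrected upper tail P(S >= s), valid for s != E(S):
sp_upper_tail = function(s, size, prob) {
  u_hat = uniroot(function(u) K1(u, size, prob) - s,
                  lower = -50, upper = 50, tol = 1e-9)$root
  w_hat = sign(u_hat) * sqrt(2 * (u_hat * s - K(u_hat, size, prob)))
  u1    = (1 - exp(-u_hat)) * sqrt(K2(u_hat, size, prob))
  1 - pnorm(w_hat) - dnorm(w_hat) * (1 / w_hat - 1 / u1)
}

# Single-binomial check: P(S >= 12) for Bin(20, 0.4),
# close to 1 - pbinom(11, 20, 0.4)
sp_upper_tail(12, 20, 0.4)
```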
The package uses only base R and the stats package to minimize compatibility issues. The arguments for the functions in the sinib package are designed to have meanings similar to those of the corresponding arguments in the stats package, thereby minimizing the learning curve. To illustrate, we compare the arguments of the *binom and the *sinib functions.
From the help page of the binomial distribution, the arguments are as follows:
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations.
size: number of trials.
prob: probability of success on each trial.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].
Since the distribution of the sum of independent non-identical binomials is defined by a vector of trial and probability pairs (one pair for each constituent binomial), it was necessary to redefine these arguments in the *sinib functions. Therefore, the following two arguments were redefined:

size: integer vector of numbers of trials.

prob: numeric vector of success probabilities.

All other arguments remain the same. It is worth noting that when the size and prob arguments are given as vectors of length 1, the *sinib functions reduce to the corresponding *binom functions:
# Binomial:
dbinom(x = 1, size = 2, prob = 0.5)
[1] 0.5
# Sum of binomials:
library(sinib)
dsinib(x = 1, size = 2, prob = 0.5)
[1] 0.5
The next section shows a few examples to illustrate the usage of sinib.
We use two examples to illustrate the use of this package, starting from the simplest case of two binomial random variables with the same probability of success but different numbers of trials:
library(foreach)
library(data.table)
library(cowplot)
library(sinib)
# Gaussian approximator:
p_norm_app = function(q,size,prob){
mu = sum(size*prob)
sigma = sqrt(sum(size*prob*(1-prob)))
pnorm(q, mean = mu, sd = sigma)
}
# Comparison of CDF between truth and approximation:
data=foreach(m=c(10,100,1000),.combine='rbind')%do%{
foreach(n=c(10,100,1000),.combine='rbind')%do%{
foreach(p=c(0.1, 0.5, 0.9),.combine='rbind')%do%{
a=pbinom(q=0:(m+n),size=(m+n),prob = p)
b=psinib(q=0:(m+n),size=c(m,n),prob=c(p,p))
c=p_norm_app(q=0:(m+n),size=c(m,n),prob = c(p,p))
data.table(s=seq_along(a),truth=a,saddle=b,norm=c,m=m,n=n,p=p)
}
}
}
data[,m:=paste0('m = ',m)]
data[,p:=paste0('p = ',p)]
data = melt(data,measure.vars = c('saddle','norm'))
ggplot(data[n==10],aes(x=truth,y=value,color=p,linetype=variable))+
geom_line()+
facet_grid(p~m)+
theme_bw()+
scale_color_discrete(name='prob',guide = 'none')+
xlab('Truth')+
ylab('Approximation')+
scale_linetype_discrete(name = '', breaks = c('saddle','norm'),
labels = c('Saddlepoint','Gaussian')) +
theme(legend.position = 'top')
Figure 1 shows that the saddlepoint approximations are close to the ground truth across a range of parameters. For parsimony, this figure only shows the case of n = 10; other values of n give similar results.
data=foreach(m=c(100),.combine='rbind')%do%{
foreach(n=c(100),.combine='rbind')%do%{
foreach(p=c(0.1, 0.5, 0.9),.combine='rbind')%do%{
a=pbinom(q=0:(m+n),size=(m+n),prob = p)
b=psinib(q=0:(m+n),size=as.integer(c(m,n)),prob=c(p,p))
c=p_norm_app(q=0:(m+n),size=c(m,n),prob = c(p,p))
data.table(s=seq_along(a),truth=a,saddle=b,norm=c,m=m,n=n,p=p)
}
}
}
data = melt(data,measure.vars = c('saddle','norm'),value.name = 'approx',
variable.name = 'Method')
data[,`Relative error` := (truth-approx)/truth]
data[,Error := (truth-approx)]
data[,p:=paste0('p = ',p)]
data = melt(data,measure.vars = c('Relative error','Error'),value.name = 'error',
variable.name = 'type')
ggplot(data[Method == 'saddle'],aes(x=s,y=error,color=p))+
geom_point(alpha=0.5)+theme_bw()+
facet_wrap(type~p,scales='free')+
xlab('Quantile')+
ylab(expression(
'Truth - Approximation'~~~~~~~~~~~~~~~frac(Truth - Approximation,Truth)))+
scale_color_discrete(guide='none') +
geom_vline(xintercept=200*0.5,color='green',linetype='longdash')+
geom_vline(xintercept=200*0.1,color='red',linetype='longdash')+
geom_vline(xintercept=200*0.9,color='blue',linetype='longdash')
Figure 2 shows the difference between the truth and the saddlepoint approximation for m = n = 100 across the three values of p. The dashed vertical lines mark the means of the three distributions.
# Comparison of PDF between truth and approximation:
d_norm_app = function(x,size,prob){
mu = sum(size*prob)
sigma = sqrt(sum(size*prob*(1-prob)))
dnorm(x, mean = mu, sd = sigma)
}
data=foreach(m=c(10,100,1000),.combine='rbind')%do%{
foreach(n=c(10,100,1000),.combine='rbind')%do%{
foreach(p=c(0.1, 0.5, 0.9),.combine='rbind')%do%{
a=dbinom(x=0:(m+n),size=(m+n),prob = p)
b=dsinib(x=0:(m+n),size=as.integer(c(m,n)),prob=c(p,p))
c=d_norm_app(x=0:(m+n),size=c(m,n),prob=c(p,p))
data.table(s=seq_along(a),truth=a,saddle=b,norm=c,m=m,n=n,p=p)
}
}
}
data[,m:=paste0('m = ',m)]
data[,p:=paste0('p = ',p)]
data = melt(data,measure.vars = c('saddle','norm'))
ggplot(data[n==10],aes(x=truth,y=value,color=p,linetype=variable))+
geom_line()+
facet_wrap(m~p,scales='free')+
theme_bw()+
scale_color_discrete(guide = 'none')+
xlab('Truth')+
ylab('Approximation') +
scale_linetype_discrete(name = '', breaks = c('saddle','norm'),
labels = c('Saddlepoint','Gaussian')) +
theme(legend.position = 'top')
We next examine the approximation for the probability density function. Figure 4 shows that the saddlepoint approximation is very close to the ground truth, whereas the Gaussian approximation is farther away. It is worthwhile to mention that the Gaussian method provides a good approximation for p = 0.5, where the distribution is roughly symmetric, but deviates from the truth for p = 0.1 and p = 0.9.
data=foreach(m=c(100),.combine='rbind')%do%{
foreach(n=c(100),.combine='rbind')%do%{
foreach(p=c(0.1, 0.5, 0.9),.combine='rbind')%do%{
a=dbinom(x=0:(m+n),size=(m+n),prob = p)
b=dsinib(x=0:(m+n),size=as.integer(c(m,n)),prob=c(p,p))
c=d_norm_app(x=0:(m+n),size=c(m,n),prob = c(p,p))
data.table(s=seq_along(a),truth=a,saddle=b,norm=c,m=m,n=n,p=p)
}
}
}
data = melt(data,measure.vars = c('saddle','norm'),value.name = 'approx',
variable.name = 'Method')
data[,`Relative error` := (truth-approx)/truth]
data[,Error := (truth-approx)]
data[,p:=paste0('p = ',p)]
data = melt(data,measure.vars = c('Relative error','Error'),
value.name = 'error', variable.name = 'type')
ggplot(data[Method == 'saddle'],aes(x=s,y=error,color=p))+
geom_point(alpha=0.5)+theme_bw()+
facet_wrap(type~p,scales='free')+
xlab('Quantile')+
ylab(expression(
'Truth - Approximation'~~~~~~~~~~~~~~~frac(Truth - Approximation,Truth)))+
scale_color_discrete(guide='none') +
geom_vline(xintercept=200*0.5,color='green',linetype='longdash')+
geom_vline(xintercept=200*0.1,color='red',linetype='longdash')+
geom_vline(xintercept=200*0.9,color='blue',linetype='longdash')
In the second example, we use a health system monitoring dataset from Benneyan and Taşeli (2010). To improve compliance with medical devices, healthcare organizations often monitor bundle reliability statistics, each representing a percentage of patient compliance. Suppose the sizes and probabilities of the ten constituent binomial random variables are given as follows:
size=as.integer(c(12, 14, 4, 2, 20, 17, 11, 1, 8, 11))
prob=c(0.074, 0.039, 0.095, 0.039, 0.053, 0.043, 0.067, 0.018, 0.099, 0.045)
Since it is difficult to find an analytical solution for the density, we estimated the density by simulation with 1e8 trials and treated it as the ground truth. We then compared simulations with 1e3, 1e4, 1e5, and 1e6 trials, as well as the saddlepoint approximation, against this ground truth. (Note that running the simulation takes several minutes.)
# Sinib:
approx=dsinib(0:sum(size),size,prob)
approx=data.frame(s=0:sum(size),pdf=approx,type='saddlepoint')
# Gauss:
gauss_approx = d_norm_app(0:sum(size),size,prob)
gauss_approx = data.frame(s=0:sum(size),pdf=gauss_approx,type='gauss')
# Simulation:
data=foreach(n_sim=10^c(3:6,8),.combine='rbind')%do%{
ptm=proc.time()
n_binom=length(prob)
set.seed(42)
mat=matrix(rbinom(n_sim*n_binom,size,prob),nrow=n_binom,ncol=n_sim)
S=colSums(mat)
sim=sapply(X = 0:sum(size), FUN = function(x) {sum(S==x)/length(S)})
data.table(s=0:sum(size),pdf=sim,type=n_sim)
}
data=rbind(data,gauss_approx,approx)
truth=data[type=='1e+08',]
merged=merge(truth[,list(s,pdf)],data,by='s',suffixes=c('_truth','_approx'))
merged=merged[type!='1e+08',]
ggplot(merged,aes(pdf_truth,pdf_approx))+
geom_point()+
facet_grid(~type)+
geom_abline(intercept=0,slope=1)+
theme_bw()+
xlab('Truth')+
ylab('Approximation')
Figure 7 shows that the simulation with 1e6 trials and the saddlepoint approximation are indistinguishable from the ground truth, while the Gaussian method and simulations with fewer trials show clear deviations. To further examine the magnitude of the deviations, we plot the difference in PDF between the truth and each approximation:
merged[,Error:=pdf_truth-pdf_approx]
merged[,`Relative Error`:=(pdf_truth-pdf_approx)/pdf_truth]
merged = melt(merged,measure.vars = c('Error','Relative Error'),variable.name = 'error_type',
value.name = 'error')
p2=ggplot(merged,aes(s,error))+
geom_point()+
facet_grid(error_type~type,scales = 'free_y')+
theme_bw()+
xlab('Outcome')+
xlim(0,20)+
ylab(expression(frac(Truth - Approximation,Truth)~~~~~~~'Truth - Approximation'))
Figure 8 shows that the saddlepoint method and the simulation with 1e6 trials both provide good approximations, while the Gaussian approximation and simulations with fewer trials show clear deviations. We also note that the saddlepoint approximation is about 5 times faster than the simulation with 1e6 trials:
# Simulation timing (1e6 trials):
n_sim=1e6
ptm=proc.time()
n_binom=length(prob)
mat=matrix(rbinom(n_sim*n_binom,size,prob),nrow=n_binom,ncol=n_sim)
S=colSums(mat)
sim=sapply(X = 0:sum(size), FUN = function(x) {sum(S==x)/length(S)})
proc.time()-ptm
# user system elapsed
# 1.008 0.153 1.173
ptm=proc.time()
approx=dsinib(0:sum(size),size,prob)
proc.time()-ptm
# user system elapsed
# 0.025 0.215 0.239
In this paper, we presented an implementation of the saddlepoint method to approximate the distribution of the sum of independent and non-identical binomials. We assessed the accuracy of the method by comparing it, first, with the analytical solution in the simple case of two binomials and, second, with the simulated ground truth on a real-world dataset in healthcare monitoring. These assessments suggest that the saddlepoint method generally provides an approximation superior to simulation in terms of both speed and accuracy, and outperforms the Gaussian approximation in terms of accuracy. Overall, the sinib package addresses the gap between the theory and the implementation of approximating the sum of independent and non-identical binomial random variables.
In the future, we aim to explore other approximation methods such as the Kolmogorov approximation and the Pearson curve approximation described by Butler and Stephens (2016).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Liu & Quertermous, "Approximating the Sum of Independent Non-Identical Binomial Random Variables", The R Journal, 2018
BibTeX citation
@article{RJ-2018-011,
  author = {Liu, Boxiang and Quertermous, Thomas},
  title = {Approximating the Sum of Independent Non-Identical Binomial Random Variables},
  journal = {The R Journal},
  year = {2018},
  note = {https://rjournal.github.io/},
  volume = {10},
  issue = {1},
  issn = {2073-4859},
  pages = {472-483}
}