The SimCorrMix package generates correlated continuous (normal, non-normal, and mixture), binary, ordinal, and count (regular and zero-inflated, Poisson and Negative Binomial) variables that mimic real-world data sets. Continuous variables are simulated using either Fleishman’s third-order or Headrick’s fifth-order power method transformation. Simulation occurs at the component level for continuous mixture distributions, and the target correlation matrix is specified in terms of correlations with components. However, the package contains functions to approximate expected correlations with continuous mixture variables. There are two simulation pathways which calculate intermediate correlations involving count variables differently, increasing accuracy under a wide range of parameters. The package also provides functions to calculate cumulants of continuous mixture distributions, check parameter inputs, calculate feasible correlation boundaries, and summarize and plot simulated variables. SimCorrMix is an important addition to existing R simulation packages because it is the first to include continuous mixture and zero-inflated count variables in correlated data sets.
Finite mixture distributions have a wide range of applications in clinical and genetic studies. They provide a useful way to describe heterogeneity in a population, e.g., when the population consists of several subpopulations or when an outcome is a composite response from multiple sources. In survival analysis, survival times in competing risk models have been described by mixtures of exponential, Weibull, or Gompertz densities (Larson and Dinse 1985; Lau et al. 2009, 2011). In medical research, finite mixture models may be used to detect clusters of subjects (cluster analysis) that share certain characteristics, e.g., concomitant diseases, intellectual ability, or history of physical or emotional abuse (McLachlan 1992; Newcomer et al. 2011; Pamulaparty et al. 2016). In schizophrenia research, Gaussian mixture distributions have frequently described the earlier age of onset in men than in women and the vast phenotypic heterogeneity in the disorder spectrum (Lewine 1981; Sham et al. 1994; Everitt 1996; Welham et al. 2000).
Count mixture distributions, particularly zero-inflated Poisson and Negative Binomial, are required to model count data with an excess number of zeros and/or overdispersion. These distributions play an important role in a wide array of studies, modeling health insurance claim count data (Ismail and Zamani 2013), the number of manufacturing defects (Lambert 1992), the efficacy of pesticides (Hall 2000), and prognostic factors of Hepatitis C (Baghban et al. 2013). Human microbiome studies, which seek to develop new diagnostic tests and therapeutic agents, use RNA-sequencing (RNA-seq) data to assess differential composition of bacterial communities. The operational taxonomic unit (OTU) count data may exhibit overdispersion and an excess number of zeros, necessitating zero-inflated Negative Binomial models (Zhang et al. 2016). Differential gene expression analysis utilizes RNA-seq data to search for genes that exhibit differences in expression level across conditions (e.g., drug treatments) (Soneson and Delorenzi 2013; Solomon 2014). Zero-inflated count models have also been used to characterize the molecular basis of phenotypic variation in diseases, including next-generation sequencing of breast cancer data (Zhou et al. 2017).
The main challenge in applying mixture distributions is estimating the
parameters for the component densities. This is usually done with the EM
algorithm, and the best model is chosen by the lowest Akaike or Bayesian
information criterion (AIC or BIC). Current packages that provide
Gaussian mixture models include:
AdaptGauss, which
uses Pareto density estimation (Thrun et al. 2017);
DPP, which uses a Dirichlet
process prior (Avila et al. 2017); bgmm,
which employs two partially supervised mixture modeling methods (Biecek and Szczurek 2017);
and ClusterR,
mclust, and
mixture, which conduct
cluster analysis (Browne et al. 2015; Fraley et al. 2017; Mouselimis 2017). Although Gaussian
distributions are the most common, the mixture may contain any
combination of component distributions. Packages that provide
alternatives include:
AdMit, which fits an
adaptive mixture of Student-t distributions (Ardia 2017);
bimixt, which uses
case-control data (Winerip et al. 2015);
bmixture, which
conducts Bayesian estimation for finite mixtures of Gamma, Normal and
Given component parameters, there are existing R packages which simulate mixture distributions. The mixpack package generates univariate random Gaussian mixtures (Comas-Cufí et al. 2017). The distr package produces univariate mixtures with components specified by name from stats distributions (Kohl 2017; R Core Team 2017). The rebmix package simulates univariate or multivariate random datasets for mixtures of conditionally independent Normal, Lognormal, Weibull, Gamma, Binomial, Poisson, Dirac, Uniform, or von Mises component densities. It also simulates multivariate random datasets for Gaussian mixtures with unrestricted variance-covariance matrices (Nagode 2017).
Existing simulation packages are limited by: 1) the variety of available component distributions and 2) the inability to produce correlated data sets with multiple variable types. Clinical and genetic studies which involve variables with mixture distributions frequently incorporate influential covariates, such as gender, race, drug treatment, and age. These covariates are correlated with the mixture variables and maintaining this correlation structure is necessary when simulating data based on real data sets (plasmodes, as in Vaughan et al. 2009). The simulated data sets can then be used to accurately perform hypothesis testing and power calculations with the desired type-I or type-II error.
SimCorrMix is an
important addition to existing R simulation packages because it is the
first to include continuous mixture and zero-inflated count variables in
correlated data sets. Therefore, the package can be used to simulate
data sets that mimic real-world clinical or genetic data. SimCorrMix
generates continuous (normal, non-normal, or mixture distributions),
binary, ordinal, and count (regular or zero-inflated, Poisson or
Negative Binomial) variables with a specified correlation matrix via the
functions corrvar
and corrvar2
. The user may also generate one
continuous mixture variable with the contmixvar1
function. The methods
extend those found in the
SimMultiCorrData
package (version ordsample
function (Barbiero and Ferrari 2015a). Count variables are simulated using the
inverse cumulative density function (CDF) method with distribution
functions imported from
VGAM (Yee 2017).
Two simulation pathways (correlation method 1 and correlation method 2)
within SimCorrMix provide two different techniques for calculating
intermediate correlations involving count variables. Each pathway is
associated with functions to calculate feasible correlation boundaries
and/or validate a target correlation matrix rho
, calculate
intermediate correlations (during simulation), and generate correlated
variables. Correlation method 1 uses validcorr
, intercorr
, and
corrvar
. Correlation method 2 uses validcorr2
, intercorr2
, and
corrvar2
. The order of the variables in rho
must be
The simulation functions do not contain parameter checks or variable
summaries in order to decrease simulation time. All parameters should be
checked first with validpar
in order to prevent errors. The function
summary_var
generates summaries by variable type and calculates the
final correlation matrix and maximum correlation error. The package also
provides the functions calc_mixmoments
to calculate the standardized
cumulants of continuous mixture distributions, plot_simpdf_theory
to
plot simulated PDF’s, and plot_simtheory
to plot simulated data
values. The plotting functions work for continuous or count variables
and overlay target distributions, which are specified by name (39
distributions currently available) or PDF function fx
. The fx
input
is useful when plotting continuous mixture variables since there are no
distribution functions available in R. There are five vignettes in the
package documentation to help the user understand the simulation and
analysis methods. The stable version of the package is available via the
Comprehensive R Archive Network (CRAN) at
https://CRAN.R-project.org/package=SimCorrMix, and the development
version may be found on GitHub at
https://github.com/AFialkowski/SimCorrMix. The results given in this
paper are reproducible (for R version
Mixture distributions describe continuous or discrete random variables
that are drawn from more than one component distribution. For a random
variable
Continuous mixture distributions are used in genetic research to model
the effect of underlying genetic factors (e.g., genotypes, alleles, or
mutations at chromosomal loci) on continuous traits
(Gianola1?; Thompson?). Consider a single locus with two alleles
|
|
|
|
For a continuous phenotype
Continuous variables, including components of mixture variables, are
created using either Fleishman (1978)’s third-order (method = "Fleishman"
) or
Headrick (2002)’s fifth-order (method =
"Polynomial"
) PMT applied to
standard normal variables. The transformation is expressed as follows:
find_constants
, which solves the
system of non-linear equations given in poly
or fleish
. The
simulation functions corrvar
and corrvar2
contain checks to see if
any distributions are repeated for non-mixture or components of mixture
variables. If so, these are noted so the constants are only calculated
once, decreasing simulation time. Mixture variables are generated from
their components based on random multinomial variables described by
their mixing probabilities (using stat’s rmultinom
).
The fifth-order PMT allows additional control over the fifth and sixth
moments of the generated distribution. In addition, the range of
feasible standardized kurtosis
For some sets of cumulants, it is either not possible to find power
method constants (indicated by a stop
error) or the calculated
constants do not generate valid PDF’s (indicated in the simulation
function results). For the fifth-order PMT, adding a value to the sixth
cumulant may provide solutions. This can be done for non-mixture
variables in Six
or components of mixture variables in mix_Six
, and
find_constants
will use the smallest correction that yields a valid
PDF. Another possible reason for function failure is that the
standardized kurtosis for a distribution is below the lower boundary of
values which can be generated using the third or fifth-order PMT. This
boundary can be found with SimMultiCorrData’s calc_lower_skurt
using
skew (for method = "Fleishman"
) and standardized fifth and sixth
cumulants (for method = "Polynomial"
).
The PMT simulates continuous variables by matching standardized
cumulants derived from central moments. Using standardized cumulants
decreases the complexity involved in calculations when a distribution
has large central moments. In view of this, let
The standardized cumulants are found by dividing the first six cumulants
The values
The standardized cumulants for a continuous mixture variable can be
derived in terms of the standardized cumulants of its component
distributions. Suppose the goal is to simulate a continuous mixture
variable poly
function and calculated by find_constants
.
Similar results hold for Fleishman’s third-order PMT, where the
constants fleish
The
If the desired mixture distribution calc_mixmoments
to determine the values for a
mixture variable. The summary_var
function executes calc_mixmoments
to provide target distributions for simulated continuous mixture
variables.
Let
The standardized cumulants for the Normal and Beta mixtures using the fifth-order PMT are found as follows:
library("SimCorrMix")
B1 <- calc_theory("Beta", c(13, 11))
B2 <- calc_theory("Beta", c(13, 4))
mix_pis <- list(c(0.36, 0.48, 0.16), c(0.3, 0.7))
mix_mus <- list(c(-5, 1, 7), c(B1[1], B2[1]))
mix_sigmas <- list(c(sqrt(2), sqrt(3), sqrt(4)), c(B1[2], B2[2]))
mix_skews <- list(c(0, 0, 0), c(B1[3], B2[3]))
mix_skurts <- list(c(0, 0, 0), c(B1[4], B2[4]))
mix_fifths <- list(c(0, 0, 0), c(B1[5], B2[5]))
mix_sixths <- list(c(0, 0, 0), c(B1[6], B2[6]))
Nstcum <- calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
Nstcum
## mean sd skew kurtosis fifth sixth
## -0.2000000 4.4810713 0.3264729 -0.6238472 -1.0244454 1.4939902
Bstcum <- calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
Bstcum
## mean sd skew kurtosis fifth sixth
## 0.6977941 0.1429099 -0.4563146 -0.5409080 1.7219898 0.5584577
SimMultiCorrData’s calc_theory
was used first to obtain the
standardized cumulants for each of the Beta distributions.
The target correlation matrix rho
in the simulation functions
corrvar
and corrvar2
is specified in terms of the correlations with
components of continuous mixture variables. This allows the user to set
the correlation between components of the same mixture variable to any
desired value. If this correlation is small (i.e., Sigma
may need to be converted to the
nearest positive-definite (PD) matrix. This is done within the
simulation functions by specifying use.nearPD = TRUE
, and Higham (2002)’s
algorithm is executed with the
Matrix package’s nearPD
function (Bates and Maechler 2017). Otherwise, negative eigenvalues are replaced with
The function intercorr_cont
calculates the intermediate correlations
for the standard normal variables used in Equation (3).
This is necessary because the transformation decreases the absolute
value of the final correlations. The function uses Equation 7b derived
by Headrick and Sawilowsky (1999 28) for the third-order PMT and Equation 26 derived by
Headrick (2002 694) for the fifth-order PMT.
Even though the correlations for the continuous mixture variables are
set at the component level, we can approximate the resulting
correlations for the mixture variables. Assume
The correlation between the mixture variables
rho <- matrix(0.4, 6, 6)
rho[1:3, 1:3] <- matrix(0.1, 3, 3)
rho[4:5, 4:5] <- matrix(0, 2, 2)
diag(rho) <- 1
rho_M1M2(mix_pis, mix_mus, mix_sigmas, rho[1:3, 4:5])
## [1] 0.103596
Note that rho
has 6 columns because
Here
rho_M1Y(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]], rho[1:3, 6])
## [1] 0.1482236
rho_M1Y(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]], rho[4:5, 6])
## [1] 0.2795669
The accuracy of these approximations can be determined through simulation:
means <- c(Nstcum[1], Bstcum[1])
vars <- c(Nstcum[2]^2, Bstcum[2]^2)
seed <- 184
Sim1 <- corrvar(n = 100000, k_mix = 2, k_pois = 1, method = "Polynomial",
means = means, vars = vars, mix_pis = mix_pis, mix_mus = mix_mus,
mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
mix_fifths = mix_fifths, mix_sixths = mix_sixths, lam = 5, p_zip = 0.1,
rho = rho, seed = seed, use.nearPD = FALSE)
## Total Simulation time: 0.065 minutes
names(Sim1)
## [1] "constants" "Y_cont" "Y_comp" "sixth_correction"
## [5] "valid.pdf" "Y_mix" "Y_pois" "Sigma"
## [9] "Error_Time" "Time" "niter"
Sum1 <- summary_var(Y_comp = Sim1$Y_comp, Y_mix = Sim1$Y_mix,
Y_pois = Sim1$Y_pois, means = means, vars = vars, mix_pis = mix_pis,
mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
lam = 5, p_zip = 0.1, rho = rho)
names(Sum1)
## [1] "cont_sum" "target_sum" "mix_sum" "target_mix" "rho_mix" "pois_sum"
## [7] "rho_calc" "maxerr"
Sum1$rho_mix
## [,1] [,2] [,3]
## [1,] 1.0000000 0.1012219 0.1475749
## [2,] 0.1012219 1.0000000 0.2776299
## [3,] 0.1475749 0.2776299 1.0000000
The results show that Equation (20) and Equation (25) provided good approximations to the simulated correlations. 8 also compares approximated expected correlations for continuous mixture variables with simulated correlations.
Figure 2 displays the PDF of the Normal mixture variable and
the simulated values of the zero-inflated Poisson (ZIP) variable
obtained using SimCorrMix’s graphing functions. These functions are
written with ggplot2
functions and the results are ggplot
objects that can be saved or
further modified (Wickham and Chang 2016). As demonstrated below, the target
distribution, specified by distribution name and parameters (39
distributions currently available by name) or PDF function fx
, can be
overlayed on the plot for continuous or count variables.
plot_simpdf_theory(sim_y = Sim1$Y_mix[, 1], title = "", sim_size = 2,
target_size = 2, fx = function(x) mix_pis[[1]][1] *
dnorm(x, mix_mus[[1]][1], mix_sigmas[[1]][1]) + mix_pis[[1]][2] *
dnorm(x, mix_mus[[1]][2], mix_sigmas[[1]][2]) + mix_pis[[1]][3] *
dnorm(x, mix_mus[[1]][3], mix_sigmas[[1]][3]), lower = -10, upper = 10,
legend.position = "none", axis.text.size = 30, axis.title.size = 30)
plot_simtheory(sim_y = Sim1$Y_pois[, 1], title = "", cont_var = FALSE,
binwidth = 0.5, Dist = "Poisson", params = c(5, 0.1),
legend.position = "none", axis.text.size = 30, axis.title.size = 30)
|
![]() |
|
|
The Continuous Mixture Distributions vignette explains how to
compare simulated to theoretical distributions of continuous mixture
variables, as demonstrated here for the Beta mixture variable
Obtain the standardized cumulants for the target mixture variable
calc_mixmoments
and calc_theory
.
Obtain the PMT constants for the components of Sim1$constants
.
Determine whether these constants produce valid PDF’s for the
components of Sim1$valid.pdf
.
## [1] "TRUE" "TRUE" "TRUE" "TRUE" "TRUE"
Select a critical value from the distribution of
beta_fx <- function(x) mix_pis[[2]][1] * dbeta(x, 13, 11) +
mix_pis[[2]][2] * dbeta(x, 13, 4)
beta_cfx <- function(x, alpha, fx = beta_fx) {
integrate(function(x, FUN = fx) FUN(x), -Inf, x, subdivisions = 1000,
stop.on.error = FALSE)$value - (1 - alpha)
}
y2_star <- uniroot(beta_cfx, c(0, 1), tol = 0.001, alpha = 0.05)$root
y2_star
## [1] 0.8985136
Calculate the cumulative probability for the simulated mixture
variable sim_cdf_prob
from SimMultiCorrData calculates
cumulative probabilities.
sim_cdf_prob(sim_y = Sim1$Y_mix[, 2], delta = y2_star)$cumulative_prob
## [1] 0.9534
This is approximately equal to the
Plot a graph of plot_sim_cdf
is in
SimMultiCorrData):
plot_simpdf_theory(sim_y = Sim1$Y_mix[, 2], title = "", sim_size = 2,
target_size = 2, fx = beta_fx, lower = 0, upper = 1,
legend.position = c(0.4, 0.85), legend.text.size = 30,
axis.text.size = 30, axis.title.size = 30)
plot_sim_cdf(sim_y = Sim1$Y_mix[, 2], title = "", calc_cprob = TRUE,
delta = y2_star, text.size = 30, axis.text.size = 30, axis.title.size = 30)
|
|
|
|
SimCorrMix extends the methods in SimMultiCorrData for regular
Poisson and Negative Binomial (NB) variables to zero-inflated Poisson
and NB variables. All count variables are generated using the inverse
CDF method with distribution functions imported from VGAM. The CDF of
a standard normal variable has a uniform distribution. The appropriate
quantile function
A zero-inflated random variable
The model for
lam
and
p_zip
. Setting p_zip = 0
(the default setting) generates a regular
Poisson variable.
The zero-deflated Poisson distribution is obtained by setting
A major limitation of the Poisson distribution is that the mean and
variance are equal. In practice, population heterogeneity creates extra
variability (overdispersion), e.g., if
The model for
size
, prob
, mu
, and p_zinb
. Either
prob
or mu
should be given for all NB and ZINB variables. Setting
p_zinb = 0
(the default setting) generates a regular NB variable.
The zero-deflated NB distribution may be obtained by setting
The two simulation pathways differ by the technique used for count variables. The intermediate correlations used in correlation method 1 are simulation based and accuracy increases with sample size and number of repetitions. The intermediate correlations used in correlation method 2 involve correction loops which make iterative adjustments until a maximum error has been reached (if possible). Correlation method 1 is described below:
Count variable pairs: Based on Yahav and Shmueli (2012)’s method, the intermediate
correlation between the standard normal variables intercorr_pois
, intercorr_nb
, and
intercorr_pois_nb
calculate these correlations.
Ordinal-count variable pairs: Extending Amatya and Demirtas (2015)’s method, the
intermediate correlations are the ratio of the target correlations
to correction factors. The correction factor is the product of the
upper Fréchet-Hoeffding bound on the correlation between the count
variable and the normal variable used to generate it and a simulated
upper bound on the correlation between an ordinal variable and the
normal variable used to generate it. This upper bound is Demirtas and Hedeker (2011)’s
generate, sort, and correlate (GSC) upper bound (see
6). The functions intercorr_cat_pois
and
intercorr_cat_nb
calculate these correlations.
Continuous-count variable pairs: Extending Amatya and Demirtas (2015)’s and Demirtas and Hedeker (2011)’s
methods, the correlation correction factor is the product of the
upper Fréchet-Hoeffding bound on the correlation between the count
variable and the normal variable used to generate it and the power
method correlation between the continuous variable and the normal
variable used to generate it. This power method correlation is given
by intercorr_cont_pois
and intercorr_cont_nb
calculate these
correlations.
Fialkowski and Tiwari (2017) showed that this method is less accurate for positive
correlations with small count variable means (i.e., less than
In correlation method 2, count variables are treated as "ordinal"
variables, based on the methods of Barbiero and Ferrari
(Ferrari and Barbiero 2012; Barbiero and Ferrari 2015b). The Poisson or NB support is made
finite by removing a small user-specified value (specified by pois_eps
and nb_eps
) from the total cumulative probability. This truncation
factor may differ for each count variable, but the default value is
pois_eps = 0.0001
means that the support values removed have a total probability of
maxcount_support
creates these new supports and associated marginal
distributions.
Count variable or ordinal-count variable pairs: The intermediate
correlations are calculated using the correction loop of ord_norm
(see 5).
Continuous-count variable pairs: Extending Demirtas et al. (2012)’s method, the
intermediate correlations are the ratio of the target correlations
to correction factors. The correction factor is the product of the
power method correlation between the continuous variable and the
normal variable used to generate it and the point-polyserial
correlation between the ordinalized count variable and the normal
variable used to generate it (Olsson et al. 1982). The functions
intercorr_cont_pois2
and intercorr_cont_nb2
calculate these
correlations.
This method performs best under the same circumstances as ordinal variables, i.e., when there are few categories and the probability of any given category is not very small. This occurs when the count variable has a small mean. Therefore, method 2 performs well in situations when method 1 has poor accuracy. In contrast, large means for the count variables would result in longer computational times. 8 compares the accuracy of correlation methods 1 and 2 under different scenarios.
Ordinal variables (marginal
. Each element of this
list is a vector of length
If at least one ordinal variable has more than ord_norm
is called. Based on SimMultiCorrData’s ordnorm
and
GenOrd’s ordcont
and contord
, the algorithm to simulate k_cat
ordinal random variables with target correlation matrix rho0
is as
follows:
Create the default support
if necessary.
Use norm_ord
to calculate the initial correlation of the ordinal
variables (rhoordold
) generated by discretizing k_cat
random
normal variates with correlation matrix set equal to rho0
, using
marginal
and the corresponding normal quantiles. These
correlations are calculated using means and variances found from
multivariate normal probabilities determined by
mvtnorm’s pmvnorm
(Genz and Bretz 2009; Genz et al. 2017).
Let rho
be the intermediate normal correlation updated in each
iteration, rhoord
be the ordinal correlation calculated in each
iteration, rhoold
be the intermediate correlation from the
previous iteration (initialized at rhoordold
), it
be the
iteration number, and maxit
and epsilon
be the user-specified
maximum number of iterations and pairwise correlation error. For
each variable pair, execute the following:
If rho0 = 0
, set rho = 0
.
While the absolute error between rhoord
and rho0
is greater
than epsilon
and it
is less than maxit
:
If rho0 * (rho0/rhoord) <= -1
:
rho = rhoold * (1 + 0.1 * (1 - rhoold) * -sign(rho0 - rhoord))
.
If rho0 * (rho0/rhoord) >= 1
:
rho = rhoold * (1 + 0.1 * (1 - rhoold) * sign(rho0 - rhoord))
.
Else, rho = rhoold * (rho0/rhoord)
.
If rho > 1
, set rho = 1
. If rho < -1
, set rho = -1
.
Calculate rhoord
using norm_ord
and the rho
.
Set rhoold = rho
and increase it
by
Store the number of iterations in the matrix niter
.
Return the final intermediate correlation matrix SigmaC = rho
for
the random normal variables. Discretize these to produce ordinal
variables with the desired correlation matrix.
For binary variable pairs, the correlation bounds are calculated as by
Demirtas et al. (2012). The joint distribution is determined using the moments of a
multivariate normal distribution (Emrich and Piedmonte 1991). For marginal
distributions and support
values are used in
Demirtas and Hedeker (2011)’s generate, sort, and correlate (GSC) algorithm. A large number
(default
The GSC algorithm is also used for continuous, continuous-ordinal,
ordinal-count, and continuous-count variable pairs. Since count
variables are treated as "ordinal" in correlation method 2, the
correlation bounds for count variable pairs is found with the GSC
algorithm after creating finite supports with associated marginal
distributions (with maxcount_support
). The correlation bounds for
count variable pairs in correlation method 1 are the Fréchet-Hoeffding
bounds (Fréchet 1957; Hoeffding 1994). For two random variables
Consider the Normal and Beta mixture variables from 3.
Additional variables are an ordinal variable with three equally-weighted
categories (e.g., drug treatment), two zero-inflated Poisson variables
with means validpar
.
marginal <- list(c(1/3, 2/3))
support <- list(c(0, 1, 2))
lam <- c(0.5, 1)
p_zip <- c(0.1, 0.2)
mu <- c(0.5, 1)
prob <- c(0.8, 0.6)
size <- prob * mu/(1 - prob)
p_zinb <- c(0.1, 0.2)
rho <- matrix(-0.5, 10, 10)
rho[2:4, 2:4] <- matrix(0.1, 3, 3)
rho[5:6, 5:6] <- matrix(0, 2, 2)
diag(rho) <- 1
validpar(k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2, method = "Polynomial",
means = means, vars = vars, mix_pis = mix_pis, mix_mus = mix_mus,
mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
mix_fifths = mix_fifths, mix_sixths = mix_sixths, marginal = marginal,
support = support, lam = lam, p_zip = p_zip, size = size, mu = mu,
p_zinb = p_zinb, rho = rho)
## Default of pois_eps = 0.0001 will be used for Poisson variables
## if using correlation method 2.
## Default of nb_eps = 0.0001 will be used for NB variables
## if using correlation method 2.
Target correlation matrix is not positive definite.
## [1] TRUE
valid1 <- validcorr(10000, k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2,
method = "Polynomial", means = means, vars = vars, mix_pis = mix_pis,
mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
marginal = marginal, lam = lam, p_zip = p_zip, size = size, mu = mu,
p_zinb = p_zinb, rho = rho, use.nearPD = FALSE, quiet = TRUE)
## Range error! Corr[ 7 , 9 ] must be between -0.388605 and 0.944974
## Range error! Corr[ 7 , 10 ] must be between -0.432762 and 0.925402
## Range error! Corr[ 8 , 9 ] must be between -0.481863 and 0.877668
## Range error! Corr[ 9 , 10 ] must be between -0.386399 and 0.937699
names(valid1)
## [1] "rho" "L_rho" "U_rho" "constants"
## [5] "sixth_correction" "valid.pdf" "valid.rho"
valid2 <- validcorr2(10000, k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2,
method = "Polynomial", means = means, vars = vars, mix_pis = mix_pis,
mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
marginal = marginal, lam = lam, p_zip = p_zip, size = size, mu = mu,
p_zinb = p_zinb, rho = rho, use.nearPD = FALSE, quiet = TRUE)
## Range error! Corr[ 7 , 9 ] must be between -0.385727 and 0.947462
## Range error! Corr[ 7 , 10 ] must be between -0.428145 and 0.921001
## Range error! Corr[ 8 , 9 ] must be between -0.477963 and 0.879439
## Range error! Corr[ 9 , 10 ] must be between -0.384557 and 0.939524
The validpar
function indicates that all parameter inputs have the
correct format and the default cumulative probability truncation value
of pois_eps
and
nb_eps
. Since rho
is not PD, the intermediate correlation matrix
Sigma
will probably also be non-PD. The user has three choices: 1)
convert rho
to the nearest PD matrix before simulation, 2) set
use.nearPD = TRUE
(default) in the simulation functions to convert
Sigma
to the nearest PD matrix during simulation, or 3) set
use.nearPD = FALSE
in the simulation functions to replace negative
eigenvalues with use.nearPD = TRUE
in validcorr
or
validcorr2
converts rho
to the nearest PD matrix before checking if
all pairwise correlations fall within the feasible boundaries, whereas
use.nearPD = FALSE
checks the initial matrix rho
. Setting
quiet = TRUE
keeps the non-PD message from being reprinted.
Range violations occur with the count variables. The lower and upper
correlation bounds are given in the list components L_rho
and U_rho
.
Note that these are pairwise correlation bounds. Although valid.rho
will return TRUE
if all elements of rho
are within these bounds,
this does not guarantee that the overall target correlation matrix rho
can be obtained in simulation.
The vignette Overall Workflow for Generation of Correlated Data
provides a detailed step-by-step guideline for correlated data
simulation with examples for corrvar
and corrvar2
. These steps are
briefly reviewed here.
Obtain the distributional parameters for the desired variables.
Continuous variables: For non-mixture and components of mixture
variables, these are skew, skurtosis, plus standardized fifth
and sixth cumulants (for method = "Polynomial"
) and sixth
cumulant corrections (if desired). The inputs are skews
,
skurts
, fifths
, sixths
, and Six
for non-mixture
variables; mix_skews
, mix_skurts
, mix_fifths
,
mix_sixths
, and mix_Six
for components of mixture variables.
If the goal is to simulate a theoretical distribution,
SimMultiCorrData’s calc_theory
will return these values
given a distribution’s name and parameters (fx
. If the goal
is to mimic a real data set, SimMultiCorrData’s calc_moments
uses the method of moments or calc_fisherk
uses Fisher (1929)’s
k-statistics given a vector of data. For mixture variables, the
mixing parameters (mix_pis
), component means (mix_mus
), and
component standard deviations (mix_sigmas
) are also required.
The means and variances of non-mixture and mixture variables are
specified in means
and vars
and these can be found using
calc_mixmoments
for mixture variables.
Ordinal variables: The cumulative marginal probabilities in
marginal
and support values in support
as described in
5.
Poisson variables: The mean values in lam
and probabilities of
structural zeros in p_zip
(default of pois_eps
.
NB variables: The target number of successes in size
,
probabilities of structural zeros in p_zinb
(default of mu
or success
probabilities in prob
. The mean refers to the mean of the NB
component of the distribution. For correlation method 2, also
cumulative probability truncation values in nb_eps
.
Check that all parameter inputs have the correct format using
validpar
. Incorrect parameter specification is the most likely
cause of function failure.
If continuous variables are desired, verify that the skurtoses are
greater than the lower skurtoses bounds using SimMultiCorrData’s
calc_lower_skurt
. The function permits a skurtosis correction
vector to aid in discovering a lower bound associated with PMT
constants that yield a valid PDF. Since this step can take
considerable time, the user may wish to do this at the end if any of
the variables have invalid PDF’s. The sixth cumulant value should be
the actual sixth cumulant used in simulation, i.e., the
distribution’s sixth cumulant plus any necessary correction (if
method = "Polynomial"
).
Check if the target correlation matrix rho
falls within the
feasible correlation boundaries. The variables in rho
must be
ordered correctly (see 1).
Generate the variables using either corrvar
or corrvar2
, with or
without the error loop.
Summarize the results numerically with summary_var
or graphically
with plot_simpdf_theory
, plot_simtheory
, or any of the plotting
functions in SimMultiCorrData.
Correlation methods 1 and 2 were compared to demonstrate situations when
each has greater simulation accuracy. In scenario A, the ordinal (O1),
Normal mixture (Nmix with components N1, N2, and N3), Beta mixture (Bmix
with components B1 and B2), two zero-inflated Poisson (P1 and P2), and
two zero-inflated NB (NB1 and NB2) variables from the 6
example were simulated. All count variables in this case had small means
(less than 1). In scenario B, the two Poisson variables were replaced
with two zero-inflated NB (NB3 and NB4) variables with means n
r
corrvar2
.
In scenarios A and B, the simulated correlations among the count
variables were compared to the target values using boxplots generated
with ggplot2’s geom_boxplot
. In scenario A, the simulated
correlations with the continuous mixture variables were compared to the
expected correlations approximated by rho_M1M2
and rho_M1Y
, with O1
as the non-mixture variable. Simulation times included simulation of the
variables only with corrvar
or corrvar2
. Medians and interquartile
ranges (IQR) were computed for the summary tables. Variable summaries
are given for Nmix, Bmix, and NB1–NB4 in scenario B. This example was
run on R version
Table 1 gives the three different correlations and total
simulation times (1,000 repetitions) for correlation method 1 using
corrvar
(Time Mcorrvar2
(Time
M
Scenario | A: Poisson and NB | B: NB | ||||
---|---|---|---|---|---|---|
Correlation Type | Time M |
Time M |
Time M |
Time M |
||
Strong | 2.55 | 2.03 | 2.00 | 9.30 | ||
Moderate | 1.65 | 0.92 | 1.98 | 8.01 | ||
Weak | 1.39 | 0.90 | 1.95 | 5.78 |
The strong correlations required the most time for each correlation method. Although method 2 was faster when all count variables had small means (scenario A), it was notably slower when two of the count variables had large means (scenario B). The reason is that method 2 treats all count variables as "ordinal," which requires creating finite supports and associated marginal distributions, as described in 4.3. When a count variable has a large mean, there are several support values with very small probabilities, making simulation more difficult.
Figure 4 contains boxplots of the simulated correlations for
the continuous mixture variables. Method 1 is in red; method 2 is in
green. The middle line is the median (rho_M1M2
and rho_M1Y
(also given
in Table 2).
Correlation Type | ||||
---|---|---|---|---|
Strong | ||||
Moderate | ||||
Weak |
Notice in Table 2 that the expected correlations are much
smaller than the pairwise correlations, demonstrating an important
consideration when setting the correlations for mixture components. Even
though the strong correlation between the components of Nmix and the
components of Bmix was set at
Figure 4 shows that, as expected, the results with
correlation methods 1 and 2 were similar, since the methods differ
according to count variable correlations. The simulated correlations
were farthest from the approximate expected values with the strong
correlation and closest for the weak correlation. In the simulations
with strong or moderate correlations, the intermediate correlation
matrix Sigma
was not PD due to the weak correlation (0.1) between N1,
N2, and N3 and independence (zero correlation) of B1 and B2. During
simulation, after Sigma
is calculated with intercorr
or
intercorr2
, eigenvalue decomposition is done on Sigma
. The square
roots of the eigenvalues form a diagonal matrix. The product of the
eigenvectors, diagonal matrix, and transposed standard normal variables
produces normal variables with the desired intermediate correlations. If
Sigma
is not PD and use.nearPD
is set to FALSE
in the simulation
functions, negative eigenvalues are replaced with 0 before forming the
diagonal matrix of eigenvalue square roots. If use.nearPD
is set to
TRUE
(default), Sigma
is replaced with the nearest PD matrix using
(Higham 2002)’s algorithm and Matrix’s nearPD
function. Either method
increases correlation errors because the resulting intermediate
correlations are different from those found in Sigma
. As the maximum
absolute correlation in the target matrix rho
increases, these
differences increase. In this example, the Sigma
matrix had two
negative eigenvalues in the strong correlation simulations and one
negative eigenvalue in the moderate correlation simulations. This is why
the correlation errors were largest for the strong correlation setting.
Figure 5 shows boxplots of the simulated correlations for
the count variables. The horizontal lines show the target values. These
correlations were also affected by the adjusted eigenvalues and the
errors for the strong correlations were again the largest. Correlation
method 2 performed better in each case except when generating
|
|
|
|
|
|
Tables 3 and 4 describe the target and simulated distributions for Nmix, Bmix, and NB1–NB4 in the weak correlation case. In all instances, the simulated distributions are close to the target distributions.
Nmix | Bmix | |||
---|---|---|---|---|
Mean | -0.20 | -0.20 (-0.20, -0.20) | 0.70 | 0.70 (0.70, 0.70) |
SD | 4.48 | 4.48 (4.48, 4.48) | 0.14 | 0.14 (0.14, 0.14) |
Skew | 0.33 | 0.33 (0.32, 0.33) | -0.46 | -0.46 (-0.47, -0.45) |
Skurtosis | -0.62 | -0.62 (-0.64, -0.61) | -0.54 | -0.54 (-0.56, -0.52) |
Fifth | -1.02 | -1.03 (-1.07, -0.98) | 1.72 | 1.73 (1.68, 1.77) |
Sixth | 1.49 | 1.50 (1.36, 1.62) | 0.56 | 0.54 (0.37, 0.72) |
Mean $ | []$ | |||
---|---|---|---|---|
NB1 | 0.68 (0.67, 0.68) | 0.68 | 0.45 (0.45, 0.45) | 0.45 |
NB2 | 0.57 (0.57, 0.57) | 0.57 | 0.80 (0.80, 0.80) | 0.80 |
NB3 | 0.10 (0.10, 0.10) | 0.10 | 45.00 (44.96, 45.03) | 45.00 |
NB4 | 0.20 (0.20, 0.20) | 0.20 | 80.00 (79.90, 80.10) | 80.00 |
Var | Median | Max | ||
NB1 | 0.58 (0.58, 0.59) | 0.58 | 0 (0, 0) | 7 (6, 7) |
NB2 | 1.49 (1.48, 1.51) | 1.49 | 0 (0, 0) | 11 (10, 12) |
NB3 | 337.76 (335.43, 339.67) | 337.50 | 48 (48, 48) | 101 (98, 105) |
NB4 | 2000.09 (1990.21, 2010.18) | 2000.00 | 92 (91, 92) | 204 (199, 212) |
Figure 6 shows boxplots of the simulated correlations for the count variables. The horizontal lines show the target values. Method 1 performed better for all strong correlation cases except between the two NB variables with small means (NB1 and NB2). Although method 2 had smaller errors overall, it did require considerably longer simulation times. Therefore, the user should consider using correlation method 1 when the data set contains count variables with large means. Tables 9–11 in the Appendix provide median (IQR) correlation errors for all variables and each correlation type.
|
|
|
|
|
|
The package SimCorrMix generates correlated continuous (normal,
non-normal, and mixture), ordinal (corrvar
and
corrvar2
, permit the user to accurately reproduce desired correlation
matrices for different parameter ranges. Correlation method 1 should be
used when the target distributions include count variables with large
means, and correlation method 2 is preferable in opposite situations.
The package also provides helper functions to calculate standardized
cumulants of continuous mixture variables, approximate expected
correlations with continuous mixture variables, validate parameter
inputs, determine feasible correlation boundaries, and summarize
simulation results numerically and graphically. Future extensions of the
package include adding more variable types (e.g., zero-inflated
Binomial, Gaussian, and Gamma).
The article’s supplementary file contains replication code for the examples in the paper and 8.
This research serves as part of Allison Fialkowski’s dissertation, which
was made possible by grant
Suppose the goal is to simulate a continuous mixture variable
Mean: Using
Variance: The variance of
Skew: Using
Skurtosis: Using
Standardized fifth cumulant: Using
Standardized sixth cumulant: Using
Scenario | ||
---|---|---|
Correlation Type | A: Poisson and NB | B: NB |
Strong | 6 | 9 |
Moderate | 7 | 10 |
Weak | 8 | 11 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | -0.08 (-0.083, -0.078) | -0.08 (-0.082, -0.078) | -0.08 (-0.082, -0.078) | -0.036 (-0.039, -0.034) |
N1 | -0.078 (-0.08, -0.076) | 0 | 0.153 (0.152, 0.153) | 0.153 (0.152, 0.153) | -0.164 (-0.164, -0.164) |
N2 | -0.078 (-0.08, -0.076) | 0.153 (0.153, 0.153) | 0 | 0.153 (0.152, 0.153) | -0.164 (-0.164, -0.164) |
N3 | -0.078 (-0.081, -0.076) | 0.153 (0.153, 0.153) | 0.153 (0.153, 0.153) | 0 | -0.164 (-0.164, -0.164) |
B1 | -0.034 (-0.036, -0.031) | -0.164 (-0.164, -0.164) | -0.164 (-0.164, -0.164) | -0.164 (-0.164, -0.164) | 0 |
B2 | -0.035 (-0.037, -0.033) | -0.165 (-0.166, -0.165) | -0.166 (-0.166, -0.165) | -0.166 (-0.166, -0.165) | 0.155 (0.154, 0.156) |
P1 | -0.033 (-0.036, -0.03) | -0.157 (-0.16, -0.153) | -0.156 (-0.159, -0.153) | -0.156 (-0.159, -0.153) | -0.123 (-0.125, -0.12) |
P2 | -0.018 (-0.02, -0.015) | -0.133 (-0.135, -0.13) | -0.133 (-0.135, -0.13) | -0.133 (-0.135, -0.13) | -0.097 (-0.1, -0.095) |
NB1 | -0.05 (-0.053, -0.047) | -0.168 (-0.172, -0.165) | -0.168 (-0.171, -0.165) | -0.168 (-0.171, -0.165) | -0.137 (-0.14, -0.134) |
NB2 | -0.028 (-0.031, -0.025) | -0.156 (-0.16, -0.153) | -0.156 (-0.159, -0.153) | -0.156 (-0.159, -0.153) | -0.125 (-0.128, -0.122) |
B2 | P1 | P2 | NB1 | NB2 | |
O1 | -0.038 (-0.04, -0.035) | -0.023 (-0.025, -0.02) | 0.003 (0, 0.005) | -0.053 (-0.056, -0.05) | -0.043 (-0.046, -0.04) |
N1 | -0.166 (-0.167, -0.165) | -0.156 (-0.159, -0.153) | -0.128 (-0.131, -0.126) | -0.166 (-0.169, -0.163) | -0.153 (-0.156, -0.15) |
N2 | -0.166 (-0.167, -0.165) | -0.156 (-0.158, -0.153) | -0.128 (-0.131, -0.126) | -0.166 (-0.17, -0.163) | -0.153 (-0.156, -0.15) |
N3 | -0.166 (-0.167, -0.165) | -0.156 (-0.159, -0.153) | -0.128 (-0.131, -0.126) | -0.166 (-0.17, -0.163) | -0.153 (-0.156, -0.15) |
B1 | 0.154 (0.153, 0.155) | -0.123 (-0.126, -0.12) | -0.093 (-0.096, -0.091) | -0.135 (-0.138, -0.132) | -0.121 (-0.124, -0.118) |
B2 | 0 | -0.156 (-0.159, -0.154) | -0.121 (-0.123, -0.118) | -0.174 (-0.177, -0.171) | -0.157 (-0.16, -0.155) |
P1 | -0.156 (-0.159, -0.153) | 0 | 0.029 (0.025, 0.031) | 0.013 (0.009, 0.016) | 0.035 (0.032, 0.038) |
P2 | -0.124 (-0.126, -0.122) | 0.013 (0.01, 0.016) | 0 | 0.036 (0.033, 0.039) | 0.014 (0.01, 0.017) |
NB1 | -0.175 (-0.178, -0.172) | 0.022 (0.018, 0.026) | 0.011 (0.008, 0.015) | 0 | 0.044 (0.04, 0.048) |
NB2 | -0.161 (-0.164, -0.158) | 0.02 (0.016, 0.024) | 0.012 (0.008, 0.015) | 0.022 (0.019, 0.027) | 0 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | -0.021 (-0.023, -0.018) | -0.021 (-0.023, -0.018) | -0.021 (-0.023, -0.018) | 0 (-0.003, 0.003) |
N1 | -0.019 (-0.022, -0.017) | 0 | 0.049 (0.049, 0.05) | 0.049 (0.049, 0.05) | -0.035 (-0.035, -0.035) |
N2 | -0.019 (-0.022, -0.016) | 0.051 (0.051, 0.051) | 0 | 0.049 (0.049, 0.05) | -0.035 (-0.035, -0.035) |
N3 | -0.019 (-0.022, -0.017) | 0.051 (0.051, 0.051) | 0.051 (0.051, 0.051) | 0 | -0.035 (-0.035, -0.035) |
B1 | -0.001 (-0.003, 0.002) | -0.034 (-0.034, -0.034) | -0.034 (-0.034, -0.033) | -0.034 (-0.034, -0.033) | 0 |
B2 | -0.001 (-0.003, 0.002) | -0.034 (-0.035, -0.033) | -0.034 (-0.035, -0.033) | -0.034 (-0.035, -0.033) | 0.016 (0.015, 0.017) |
P1 | -0.002 (-0.005, 0.001) | -0.041 (-0.045, -0.038) | -0.041 (-0.044, -0.038) | -0.041 (-0.044, -0.038) | -0.009 (-0.012, -0.006) |
P2 | -0.001 (-0.005, 0.002) | -0.034 (-0.037, -0.032) | -0.034 (-0.037, -0.032) | -0.034 (-0.037, -0.032) | -0.006 (-0.009, -0.003) |
NB1 | -0.004 (-0.007, 0) | -0.044 (-0.048, -0.041) | -0.045 (-0.048, -0.041) | -0.044 (-0.048, -0.041) | -0.011 (-0.014, -0.008) |
NB2 | -0.003 (-0.006, 0) | -0.041 (-0.044, -0.037) | -0.041 (-0.044, -0.037) | -0.041 (-0.044, -0.038) | -0.01 (-0.012, -0.007) |
B2 | P1 | P2 | NB1 | NB2 | |
O1 | 0.001 (-0.002, 0.003) | 0 (-0.004, 0.004) | 0.005 (0.001, 0.008) | -0.009 (-0.013, -0.006) | -0.007 (-0.011, -0.004) |
N1 | -0.035 (-0.036, -0.034) | -0.038 (-0.041, -0.035) | -0.031 (-0.034, -0.028) | -0.04 (-0.043, -0.037) | -0.037 (-0.04, -0.034) |
N2 | -0.035 (-0.036, -0.034) | -0.038 (-0.041, -0.035) | -0.031 (-0.034, -0.029) | -0.04 (-0.044, -0.037) | -0.037 (-0.04, -0.033) |
N3 | -0.035 (-0.036, -0.034) | -0.038 (-0.041, -0.035) | -0.031 (-0.034, -0.028) | -0.041 (-0.044, -0.037) | -0.037 (-0.04, -0.034) |
B1 | 0.013 (0.012, 0.014) | -0.005 (-0.008, -0.002) | -0.003 (-0.006, 0) | -0.006 (-0.01, -0.003) | -0.005 (-0.008, -0.002) |
B2 | 0 | -0.027 (-0.029, -0.024) | -0.019 (-0.022, -0.017) | -0.033 (-0.035, -0.03) | -0.029 (-0.031, -0.027) |
P1 | -0.03 (-0.033, -0.028) | 0 | 0.02 (0.016, 0.025) | 0.014 (0.008, 0.019) | 0.028 (0.023, 0.033) |
P2 | -0.022 (-0.024, -0.019) | 0.004 (0, 0.009) | 0 | 0.033 (0.028, 0.037) | 0.013 (0.008, 0.017) |
NB1 | -0.037 (-0.04, -0.034) | 0.008 (0.003, 0.013) | 0.004 (0, 0.01) | 0 | 0.037 (0.032, 0.042) |
NB2 | -0.033 (-0.036, -0.03) | 0.007 (0.002, 0.012) | 0.004 (0, 0.009) | 0.008 (0.002, 0.013) | 0 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) |
N1 | 0 (-0.003, 0.003) | 0 | 0 (0, 0) | 0 (0, 0) | 0 (0, 0) |
N2 | 0 (-0.003, 0.003) | 0 (0, 0) | 0 | 0 (0, 0) | 0 (0, 0) |
N3 | 0 (-0.003, 0.003) | 0 (0, 0) | 0 (0, 0) | 0 | 0 (0, 0) |
B1 | 0 (-0.003, 0.003) | 0 (0, 0) | 0 (0, 0) | 0 (0, 0) | 0 |
B2 | 0 (-0.003, 0.004) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) |
P1 | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | 0 (-0.004, 0.003) | 0 (-0.004, 0.004) | -0.001 (-0.004, 0.002) |
P2 | 0 (-0.004, 0.004) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) |
NB1 | -0.001 (-0.005, 0.003) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | -0.001 (-0.005, 0.002) |
NB2 | -0.001 (-0.005, 0.003) | 0 (-0.004, 0.003) | 0 (-0.003, 0.003) | 0 (-0.004, 0.003) | -0.002 (-0.005, 0.002) |
B2 | P1 | P2 | NB1 | NB2 | |
O1 | 0 (-0.003, 0.003) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | -0.002 (-0.006, 0.002) | -0.002 (-0.005, 0.002) |
N1 | 0 (-0.001, 0.001) | 0 (-0.004, 0.004) | 0 (-0.003, 0.003) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) |
N2 | 0 (-0.001, 0.001) | 0 (-0.003, 0.004) | 0 (-0.003, 0.003) | 0 (-0.004, 0.004) | 0 (-0.003, 0.004) |
N3 | 0 (-0.001, 0.001) | 0 (-0.004, 0.004) | 0 (-0.003, 0.003) | 0 (-0.004, 0.004) | 0 (-0.004, 0.003) |
B1 | 0 (-0.001, 0.001) | -0.001 (-0.004, 0.003) | -0.001 (-0.004, 0.002) | -0.001 (-0.005, 0.002) | -0.001 (-0.005, 0.002) |
B2 | 0 | -0.009 (-0.012, -0.006) | -0.006 (-0.009, -0.003) | -0.011 (-0.014, -0.008) | -0.01 (-0.013, -0.006) |
P1 | -0.009 (-0.012, -0.006) | 0 | 0.014 (0.008, 0.018) | 0.016 (0.01, 0.022) | 0.021 (0.016, 0.027) |
P2 | -0.006 (-0.009, -0.004) | 0 (-0.006, 0.004) | 0 | 0.022 (0.017, 0.027) | 0.01 (0.004, 0.015) |
NB1 | -0.011 (-0.014, -0.008) | 0 (-0.006, 0.006) | -0.001 (-0.006, 0.005) | 0 | 0.028 (0.022, 0.035) |
NB2 | -0.01 (-0.013, -0.006) | 0 (-0.006, 0.006) | 0 (-0.006, 0.005) | 0 (-0.006, 0.006) | 0 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | -0.095 (-0.097, -0.092) | -0.095 (-0.097, -0.092) | -0.095 (-0.097, -0.093) | -0.049 (-0.051, -0.046) |
N1 | -0.094 (-0.096, -0.092) | 0 | 0.141 (0.141, 0.141) | 0.141 (0.141, 0.141) | -0.172 (-0.172, -0.172) |
N2 | -0.094 (-0.096, -0.092) | 0.141 (0.141, 0.141) | 0 | 0.141 (0.141, 0.141) | -0.172 (-0.172, -0.171) |
N3 | -0.094 (-0.096, -0.092) | 0.141 (0.141, 0.141) | 0.141 (0.141, 0.141) | 0 | -0.172 (-0.172, -0.172) |
B1 | -0.048 (-0.05, -0.046) | -0.172 (-0.172, -0.171) | -0.172 (-0.172, -0.171) | -0.172 (-0.172, -0.171) | 0 |
B2 | -0.049 (-0.051, -0.047) | -0.173 (-0.174, -0.172) | -0.173 (-0.174, -0.172) | -0.173 (-0.174, -0.172) | 0.14 (0.139, 0.141) |
NB1 | -0.056 (-0.059, -0.053) | -0.161 (-0.164, -0.158) | -0.16 (-0.164, -0.157) | -0.161 (-0.164, -0.158) | -0.129 (-0.132, -0.126) |
NB2 | -0.033 (-0.036, -0.03) | -0.151 (-0.154, -0.148) | -0.151 (-0.154, -0.148) | -0.151 (-0.154, -0.148) | -0.118 (-0.121, -0.115) |
NB3 | -0.001 (-0.003, 0) | -0.094 (-0.096, -0.092) | -0.094 (-0.096, -0.092) | -0.094 (-0.096, -0.092) | -0.052 (-0.054, -0.051) |
NB4 | 0.001 (-0.002, 0.003) | -0.099 (-0.101, -0.098) | -0.1 (-0.101, -0.097) | -0.1 (-0.102, -0.098) | -0.056 (-0.057, -0.054) |
B2 | NB1 | NB2 | NB3 | NB4 | |
O1 | -0.05 (-0.052, -0.048) | -0.05 (-0.053, -0.047) | -0.04 (-0.043, -0.037) | -0.011 (-0.013, -0.009) | 0.022 (0.02, 0.024) |
N1 | -0.173 (-0.174, -0.172) | -0.158 (-0.161, -0.155) | -0.148 (-0.151, -0.145) | -0.094 (-0.096, -0.092) | -0.095 (-0.097, -0.093) |
N2 | -0.173 (-0.174, -0.172) | -0.158 (-0.161, -0.155) | -0.148 (-0.151, -0.145) | -0.094 (-0.096, -0.092) | -0.095 (-0.097, -0.093) |
N3 | -0.173 (-0.174, -0.172) | -0.158 (-0.162, -0.155) | -0.148 (-0.151, -0.145) | -0.094 (-0.096, -0.092) | -0.095 (-0.097, -0.093) |
B1 | 0.14 (0.138, 0.141) | -0.126 (-0.129, -0.124) | -0.115 (-0.118, -0.112) | -0.052 (-0.054, -0.05) | -0.051 (-0.053, -0.05) |
B2 | 0 | -0.166 (-0.169, -0.163) | -0.151 (-0.154, -0.149) | -0.042 (-0.044, -0.039) | -0.045 (-0.047, -0.042) |
NB1 | -0.168 (-0.171, -0.165) | 0 | 0.047 (0.043, 0.05) | -0.015 (-0.017, -0.013) | -0.008 (-0.009, -0.006) |
NB2 | -0.155 (-0.157, -0.152) | 0.027 (0.023, 0.031) | 0 | -0.009 (-0.011, -0.007) | -0.001 (-0.003, 0.001) |
NB3 | -0.042 (-0.045, -0.04) | -0.018 (-0.02, -0.016) | -0.014 (-0.016, -0.012) | 0 | 0.001 (-0.001, 0.004) |
NB4 | -0.049 (-0.051, -0.046) | -0.014 (-0.016, -0.012) | -0.013 (-0.015, -0.011) | 0.003 (0, 0.005) | 0 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | -0.02 (-0.023, -0.017) | -0.02 (-0.023, -0.017) | -0.02 (-0.023, -0.017) | 0.002 (-0.001, 0.004) |
N1 | -0.019 (-0.022, -0.017) | 0 | 0.038 (0.038, 0.038) | 0.038 (0.038, 0.038) | -0.043 (-0.043, -0.042) |
N2 | -0.02 (-0.022, -0.017) | 0.039 (0.039, 0.039) | 0 | 0.038 (0.038, 0.038) | -0.043 (-0.043, -0.042) |
N3 | -0.019 (-0.022, -0.017) | 0.039 (0.039, 0.039) | 0.039 (0.039, 0.039) | 0 | -0.043 (-0.043, -0.042) |
B1 | 0.001 (-0.001, 0.004) | -0.042 (-0.042, -0.042) | -0.042 (-0.042, -0.042) | -0.042 (-0.042, -0.042) | 0 |
B2 | 0.002 (-0.001, 0.004) | -0.042 (-0.043, -0.041) | -0.042 (-0.043, -0.041) | -0.042 (-0.043, -0.041) | 0.017 (0.016, 0.018) |
NB1 | 0 (-0.004, 0.003) | -0.029 (-0.033, -0.025) | -0.029 (-0.033, -0.026) | -0.029 (-0.032, -0.026) | -0.001 (-0.004, 0.003) |
NB2 | 0 (-0.003, 0.003) | -0.028 (-0.031, -0.024) | -0.027 (-0.03, -0.024) | -0.027 (-0.03, -0.025) | -0.001 (-0.004, 0.002) |
NB3 | 0 (-0.003, 0.003) | -0.015 (-0.017, -0.013) | -0.015 (-0.017, -0.013) | -0.015 (-0.017, -0.013) | -0.001 (-0.003, 0.001) |
NB4 | 0 (-0.003, 0.003) | -0.016 (-0.018, -0.014) | -0.016 (-0.018, -0.014) | -0.016 (-0.018, -0.014) | 0 (-0.002, 0.002) |
B2 | NB1 | NB2 | NB3 | NB4 | |
O1 | 0.002 (-0.001, 0.005) | -0.008 (-0.011, -0.004) | -0.006 (-0.009, -0.003) | -0.002 (-0.005, 0.001) | 0.008 (0.004, 0.011) |
N1 | -0.043 (-0.044, -0.042) | -0.027 (-0.031, -0.024) | -0.026 (-0.029, -0.022) | -0.015 (-0.017, -0.013) | -0.015 (-0.017, -0.012) |
N2 | -0.043 (-0.044, -0.042) | -0.027 (-0.031, -0.024) | -0.025 (-0.029, -0.022) | -0.015 (-0.017, -0.013) | -0.014 (-0.017, -0.012) |
N3 | -0.043 (-0.044, -0.042) | -0.027 (-0.031, -0.024) | -0.026 (-0.029, -0.023) | -0.015 (-0.017, -0.013) | -0.015 (-0.017, -0.012) |
B1 | 0.018 (0.017, 0.019) | 0 (-0.004, 0.003) | -0.001 (-0.004, 0.002) | -0.001 (-0.003, 0.001) | -0.001 (-0.003, 0.002) |
B2 | 0 | -0.028 (-0.031, -0.024) | -0.025 (-0.027, -0.022) | 0.005 (0.003, 0.008) | 0.003 (0.001, 0.006) |
NB1 | -0.027 (-0.031, -0.025) | 0 | 0.034 (0.029, 0.04) | 0.002 (0, 0.005) | 0.014 (0.012, 0.017) |
NB2 | -0.025 (-0.027, -0.022) | 0.004 (-0.002, 0.01) | 0 | 0.005 (0.002, 0.007) | 0.013 (0.01, 0.015) |
NB3 | 0.005 (0.003, 0.008) | -0.001 (-0.003, 0.001) | -0.001 (-0.003, 0.001) | 0 | -0.002 (-0.006, 0.001) |
NB4 | 0.005 (0.002, 0.007) | -0.001 (-0.004, 0.001) | -0.001 (-0.003, 0.002) | 0 (-0.003, 0.004) | 0 |
O1 | N1 | N2 | N3 | B1 | |
---|---|---|---|---|---|
O1 | 0 | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) | 0 (-0.003, 0.003) |
N1 | 0 (-0.003, 0.003) | 0 | 0 (0, 0) | 0 (0, 0) | 0 (0, 0) |
N2 | 0 (-0.003, 0.003) | 0 (0, 0) | 0 | 0 (0, 0) | 0 (0, 0) |
N3 | -0.001 (-0.003, 0.003) | 0 (0, 0) | 0 (0, 0) | 0 | 0 (0, 0) |
B1 | 0 (-0.003, 0.003) | 0 (0, 0) | 0 (0, 0) | 0 (0, 0) | 0 |
B2 | 0 (-0.003, 0.003) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) | 0 (-0.001, 0.001) |
NB1 | -0.001 (-0.005, 0.003) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | 0 (-0.004, 0.004) | -0.002 (-0.005, 0.002) |
NB2 | -0.001 (-0.005, 0.003) | 0 (-0.004, 0.003) | 0 (-0.003, 0.003) | 0 (-0.004, 0.003) | -0.002 (-0.005, 0.002) |
NB3 | 0 (-0.004, 0.003) | 0 (-0.002, 0.002) | 0 (-0.002, 0.002) | 0 (-0.002, 0.002) | 0 (-0.002, 0.003) |
NB4 | 0.001 (-0.003, 0.004) | 0 (-0.002, 0.002) | 0 (-0.002, 0.003) | 0 (-0.002, 0.002) | 0.001 (-0.002, 0.003) |
B2 | NB1 | NB2 | NB3 | NB4 | |
O1 | 0 (-0.003, 0.003) | -0.002 (-0.006, 0.002) | -0.002 (-0.005, 0.003) | 0 (-0.004, 0.003) | 0.001 (-0.002, 0.005) |
N1 | 0 (-0.001, 0.001) | 0 (-0.003, 0.004) | 0 (-0.003, 0.003) | 0 (-0.002, 0.002) | 0 (-0.002, 0.002) |
N2 | 0 (-0.001, 0.001) | 0 (-0.004, 0.004) | 0 (-0.004, 0.003) | 0 (-0.002, 0.002) | 0 (-0.002, 0.002) |
N3 | 0 (-0.001, 0.001) | 0 (-0.004, 0.004) | 0 (-0.004, 0.003) | 0 (-0.002, 0.002) | 0 (-0.002, 0.002) |
B1 | 0 (-0.001, 0.001) | -0.002 (-0.005, 0.002) | -0.002 (-0.005, 0.002) | 0 (-0.002, 0.002) | 0 (-0.002, 0.003) |
B2 | 0 | -0.011 (-0.014, -0.007) | -0.01 (-0.012, -0.007) | 0.002 (0, 0.005) | 0.002 (-0.001, 0.005) |
NB1 | -0.011 (-0.014, -0.007) | 0 | 0.028 (0.021, 0.034) | 0.003 (0, 0.006) | 0.012 (0.009, 0.015) |
NB2 | -0.01 (-0.013, -0.007) | 0 (-0.006, 0.006) | 0 | 0.004 (0.001, 0.007) | 0.008 (0.006, 0.012) |
NB3 | 0.003 (0, 0.005) | -0.001 (-0.004, 0.003) | 0 (-0.003, 0.003) | 0 | -0.002 (-0.005, 0.002) |
NB4 | 0.002 (-0.001, 0.005) | 0 (-0.004, 0.003) | 0 (-0.003, 0.003) | 0.001 (-0.003, 0.004) | 0 |
AdaptGauss, DPP, bgmm, ClusterR, mclust, mixture, AdMit, bimixt, bmixture, CAMAN, flexmix, mixdist, mixtools, nspmix, MixtureInf, Rmixmod, hurdlr, zic, mixpack, distr, stats, rebmix, SimCorrMix, SimMultiCorrData, GenOrd, VGAM, Matrix, ggplot2, mvtnorm
Bayesian, Cluster, Distributions, Econometrics, Environmetrics, ExtremeValue, Finance, MissingData, NumericalMathematics, Optimization, Phylogenetics, Psychometrics, Robust, Spatial, Survival, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Fialkowski & Tiwari, "SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions", The R Journal, 2019
BibTeX citation
@article{RJ-2019-022, author = {Fialkowski, Allison and Tiwari, Hemant}, title = {SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {250-286} }