SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions

Abstract:

The SimCorrMix package generates correlated continuous (normal, non-normal, and mixture), binary, ordinal, and count (regular and zero-inflated, Poisson and Negative Binomial) variables that mimic real-world data sets. Continuous variables are simulated using either Fleishman’s third-order or Headrick’s fifth-order power method transformation. Simulation occurs at the component level for continuous mixture distributions, and the target correlation matrix is specified in terms of correlations with components. However, the package contains functions to approximate expected correlations with continuous mixture variables. There are two simulation pathways which calculate intermediate correlations involving count variables differently, increasing accuracy under a wide range of parameters. The package also provides functions to calculate cumulants of continuous mixture distributions, check parameter inputs, calculate feasible correlation boundaries, and summarize and plot simulated variables. SimCorrMix is an important addition to existing R simulation packages because it is the first to include continuous mixture and zero-inflated count variables in correlated data sets.


Published: Aug. 16, 2019

Received: Apr 4, 2018

Citation: Fialkowski & Tiwari, 2019

Volume: 11/1

Pages: 250 - 286


1 Introduction

Finite mixture distributions have a wide range of applications in clinical and genetic studies. They provide a useful way to describe heterogeneity in a population, e.g., when the population consists of several subpopulations or when an outcome is a composite response from multiple sources. In survival analysis, survival times in competing risk models have been described by mixtures of exponential, Weibull, or Gompertz densities. In medical research, finite mixture models may be used to detect clusters of subjects (cluster analysis) that share certain characteristics, e.g., concomitant diseases, intellectual ability, or history of physical or emotional abuse. In schizophrenia research, Gaussian mixture distributions have frequently described the earlier age of onset in men than in women and the vast phenotypic heterogeneity in the disorder spectrum.

Count mixture distributions, particularly zero-inflated Poisson and Negative Binomial, are required to model count data with an excess number of zeros and/or overdispersion. These distributions play an important role in a wide array of studies, modeling health insurance claim count data, the number of manufacturing defects, the efficacy of pesticides, and prognostic factors of Hepatitis C. Human microbiome studies, which seek to develop new diagnostic tests and therapeutic agents, use RNA-sequencing (RNA-seq) data to assess differential composition of bacterial communities. The operational taxonomic unit (OTU) count data may exhibit overdispersion and an excess number of zeros, necessitating zero-inflated Negative Binomial models. Differential gene expression analysis utilizes RNA-seq data to search for genes that exhibit differences in expression level across conditions (e.g., drug treatments). Zero-inflated count models have also been used to characterize the molecular basis of phenotypic variation in diseases, including next-generation sequencing of breast cancer data.

The main challenge in applying mixture distributions is estimating the parameters for the component densities. This is usually done with the EM algorithm, and the best model is chosen by the lowest Akaike or Bayesian information criterion (AIC or BIC). Current packages that provide Gaussian mixture models include: AdaptGauss, which uses Pareto density estimation; DPP, which uses a Dirichlet process prior; bgmm, which employs two partially supervised mixture modeling methods; and ClusterR, mclust, and mixture, which conduct cluster analysis. Although Gaussian distributions are the most common, the mixture may contain any combination of component distributions. Packages that provide alternatives include: AdMit, which fits an adaptive mixture of Student-t distributions; bimixt, which uses case-control data; bmixture, which conducts Bayesian estimation for finite mixtures of Gamma, Normal and t-distributions; CAMAN, which provides tools for the analysis of finite semiparametric mixtures in univariate and bivariate data; flexmix, which implements mixtures of standard linear models, generalized linear models and model-based clustering; mixdist, which applies to grouped or conditional data; mixtools and nspmix, which analyze a variety of parametric and semiparametric models; MixtureInf, which conducts model inference; and Rmixmod, which provides an interface to the MIXMOD software and permits Gaussian or multinomial mixtures. With regards to count mixtures, the BhGLM, hurdlr, and zic packages model zero-inflated distributions with Bayesian methods.

Given component parameters, there are existing R packages which simulate mixture distributions. The mixpack package generates univariate random Gaussian mixtures. The distr package produces univariate mixtures with components specified by name from stats distributions. The rebmix package simulates univariate or multivariate random datasets for mixtures of conditionally independent Normal, Lognormal, Weibull, Gamma, Binomial, Poisson, Dirac, Uniform, or von Mises component densities. It also simulates multivariate random datasets for Gaussian mixtures with unrestricted variance-covariance matrices.

Existing simulation packages are limited by: 1) the variety of available component distributions and 2) the inability to produce correlated data sets with multiple variable types. Clinical and genetic studies which involve variables with mixture distributions frequently incorporate influential covariates, such as gender, race, drug treatment, and age. These covariates are correlated with the mixture variables, and maintaining this correlation structure is necessary when simulating data based on real data sets. The simulated data sets can then be used to accurately perform hypothesis testing and power calculations with the desired type-I or type-II error.

SimCorrMix is an important addition to existing R simulation packages because it is the first to include continuous mixture and zero-inflated count variables in correlated data sets. Therefore, the package can be used to simulate data sets that mimic real-world clinical or genetic data. SimCorrMix generates continuous (normal, non-normal, or mixture distributions), binary, ordinal, and count (regular or zero-inflated, Poisson or Negative Binomial) variables with a specified correlation matrix via the functions corrvar and corrvar2. The user may also generate one continuous mixture variable with the contmixvar1 function. The methods extend those found in the SimMultiCorrData package. Standard normal variables with an imposed intermediate correlation matrix are transformed to generate the desired distributions. Continuous variables are simulated using either Fleishman's third-order or Headrick's fifth-order polynomial transformation method (the power method transformation, PMT). The fifth-order PMT accurately reproduces non-normal data up to the sixth moment, produces more random variables with valid PDFs, and generates data with a wider range of standardized kurtoses. Simulation occurs at the component level for continuous mixture distributions. These components are transformed into the desired mixture variables using random multinomial variables based on the mixing probabilities. The target correlation matrix is specified in terms of correlations with components of continuous mixture variables. However, SimCorrMix provides functions to approximate expected correlations with continuous mixture variables given target correlations with the components. Binary and ordinal variables are simulated using a modification of GenOrd's ordsample function. Count variables are simulated using the inverse cumulative density function (CDF) method with distribution functions imported from VGAM.

Two simulation pathways (correlation method 1 and correlation method 2) within SimCorrMix provide two different techniques for calculating intermediate correlations involving count variables. Each pathway is associated with functions to calculate feasible correlation boundaries and/or validate a target correlation matrix rho, calculate intermediate correlations (during simulation), and generate correlated variables. Correlation method 1 uses validcorr, intercorr, and corrvar. Correlation method 2 uses validcorr2, intercorr2, and corrvar2. The order of the variables in rho must be 1st ordinal ($r \geq 2$ categories), 2nd continuous non-mixture, 3rd components of continuous mixture, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular Negative Binomial (NB), and 7th zero-inflated NB. This ordering is integral for the simulation process. Each simulation pathway shows greater accuracy under different parameter ranges, and Section 4.3 details the differences in the methods. The optional error loop can improve the accuracy of the final correlation matrix in most situations.

The simulation functions do not contain parameter checks or variable summaries in order to decrease simulation time. All parameters should be checked first with validpar in order to prevent errors. The function summary_var generates summaries by variable type and calculates the final correlation matrix and maximum correlation error. The package also provides the functions calc_mixmoments to calculate the standardized cumulants of continuous mixture distributions, plot_simpdf_theory to plot simulated PDF’s, and plot_simtheory to plot simulated data values. The plotting functions work for continuous or count variables and overlay target distributions, which are specified by name (39 distributions currently available) or PDF function fx. The fx input is useful when plotting continuous mixture variables since there are no distribution functions available in R. There are five vignettes in the package documentation to help the user understand the simulation and analysis methods. The stable version of the package is available via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=SimCorrMix, and the development version may be found on GitHub at https://github.com/AFialkowski/SimCorrMix. The results given in this paper are reproducible (for R version 3.4.1, SimCorrMix version 0.1.0).

2 Overview of mixture distributions

Mixture distributions describe continuous or discrete random variables that are drawn from more than one component distribution. For a random variable $Y$ from a finite mixture distribution with $k$ components, the probability density function (PDF) or probability mass function (PMF) is:

(1)  $h_Y(y) = \sum_{i=1}^{k} \pi_i f_{Y_i}(y), \quad \sum_{i=1}^{k} \pi_i = 1.$

The $\pi_i$ are mixing parameters which determine the weight of each component distribution $f_{Y_i}(y)$ in the overall probability distribution. As long as each component has a valid probability distribution, the overall distribution $h_Y(y)$ has a valid probability distribution. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves. Assume there is a random selection process that first generates the numbers $1, \ldots, k$ with probabilities $\pi_1, \ldots, \pi_k$. After selecting number $i$, where $1 \leq i \leq k$, a random variable $y_i$ is drawn from component distribution $f_{Y_i}(y)$.
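
As a minimal illustration of this two-step selection process (not part of the package; the two Normal components and mixing probabilities below are arbitrary choices), a mixture draw in base R first samples the component index and then samples from the selected component:

  set.seed(123)
  n <- 10000
  pis <- c(0.4, 0.6)                               # mixing parameters
  comp <- sample(1:2, n, replace = TRUE, prob = pis)
  # draw from the selected component distribution
  y <- ifelse(comp == 1, rnorm(n, -2, 1), rnorm(n, 3, 2))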

3 Continuous mixture distributions

Continuous mixture distributions are used in genetic research to model the effect of underlying genetic factors (e.g., genotypes, alleles, or mutations at chromosomal loci) on continuous traits. Consider a single locus with two alleles A and a, producing three genotypes AA, Aa, and aa with population frequencies $p_{AA}$, $p_{Aa}$, and $p_{aa}$. Figure 1a shows a codominant mixture in which each genotype exhibits a different phenotype; Figure 1b shows a dominant mixture in which individuals with at least one A allele possess the same phenotype.

Figure 1: Examples of commingled distributions in genetics. (a) Codominant mixture. (b) Dominant mixture.

For a continuous phenotype $y$, the normal mixture density function describing a commingled distribution is given by:

(2)  $f(y \mid p_{AA}, \mu_{AA}, \sigma^2_{AA};\ p_{Aa}, \mu_{Aa}, \sigma^2_{Aa};\ p_{aa}, \mu_{aa}, \sigma^2_{aa}) = p_{AA}\,\phi(y \mid \mu_{AA}, \sigma^2_{AA}) + p_{Aa}\,\phi(y \mid \mu_{Aa}, \sigma^2_{Aa}) + p_{aa}\,\phi(y \mid \mu_{aa}, \sigma^2_{aa}),$

where $\phi(y \mid \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. Commingling analysis may also study traits that are polygenic (result from the additive effects of several genes) or multifactorial. For example, mixture models explain the heterogeneity observed in gene-mapping studies of complex human diseases, including cancer, chronic fatigue syndrome, bipolar disorder, coronary artery disease, and diabetes. Segregation analysis extends commingling analysis to individuals within a pedigree. Mixed models evaluate whether a genetic locus is affecting a particular quantitative trait and incorporate additional influential factors. Finally, linkage analysis discovers the location of genetic loci using recombination rates, and the regression likelihood equation may be written as a mixture distribution.

Generation of continuous distributions in SimCorrMix

Continuous variables, including components of mixture variables, are created using either Fleishman's third-order (method = "Fleishman") or Headrick's fifth-order (method = "Polynomial") PMT applied to standard normal variables. The transformation is expressed as follows:

(3)  $Y = p(Z) = c_0 + c_1 Z + c_2 Z^2 + c_3 Z^3 + c_4 Z^4 + c_5 Z^5, \quad Z \sim N(0, 1),$

where $c_4 = c_5 = 0$ for Fleishman's method. The real constants are calculated by SimMultiCorrData's find_constants, which solves the system of non-linear equations given in poly or fleish. The simulation functions corrvar and corrvar2 contain checks to see if any distributions are repeated for non-mixture or components of mixture variables. If so, these are noted so the constants are only calculated once, decreasing simulation time. Mixture variables are generated from their components based on random multinomial variables described by their mixing probabilities (using the stats package's rmultinom).
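
As a rough sketch of this transformation (assuming the find_constants interface from SimMultiCorrData and using the standardized cumulants of an Exponential(1) distribution: skew 2, skurtosis 6, fifth cumulant 24, sixth cumulant 120), the fifth-order constants can be applied to a standard normal sample and the result rescaled to the target mean and standard deviation:

  library("SimMultiCorrData")
  con <- find_constants(method = "Polynomial", skews = 2, skurts = 6,
    fifths = 24, sixths = 120)$constants
  z <- rnorm(10000)
  # standardized non-normal variable from Equation (3)
  y_std <- con[1] + con[2] * z + con[3] * z^2 + con[4] * z^3 +
    con[5] * z^4 + con[6] * z^5
  y <- 1 + 1 * y_std    # rescale to the Exponential(1) mean and sd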

The fifth-order PMT allows additional control over the fifth and sixth moments of the generated distribution. In addition, the range of feasible standardized kurtosis ($\gamma_2$) values, given skew ($\gamma_1$) and standardized fifth ($\gamma_3$) and sixth ($\gamma_4$) cumulants, is larger than with the third-order PMT. For example, Fleishman's method cannot be used to generate a non-normal distribution with a ratio of $\gamma_1^2/\gamma_2 > 9/14$. This eliminates the $\chi^2$ family of distributions, which has a constant ratio of $\gamma_1^2/\gamma_2 = 2/3$. The fifth-order method also generates more distributions with valid PDFs. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used. This would be the case for t-distributions with fewer than 7 degrees of freedom.

For some sets of cumulants, it is either not possible to find power method constants (indicated by a stop error) or the calculated constants do not generate valid PDFs (indicated in the simulation function results). For the fifth-order PMT, adding a value to the sixth cumulant may provide solutions. This can be done for non-mixture variables in Six or components of mixture variables in mix_Six, and find_constants will use the smallest correction that yields a valid PDF. Another possible reason for function failure is that the standardized kurtosis for a distribution is below the lower boundary of values which can be generated using the third- or fifth-order PMT. This boundary can be found with SimMultiCorrData's calc_lower_skurt using skew (for method = "Fleishman") and standardized fifth and sixth cumulants (for method = "Polynomial").

Expected cumulants of continuous mixture variables

The PMT simulates continuous variables by matching standardized cumulants derived from central moments. Using standardized cumulants decreases the complexity involved in calculations when a distribution has large central moments. In view of this, let $Y$ be a real-valued random variable with cumulative distribution function $F$. Define the central moments, $\mu_r$, of $Y$ as:

(4)  $\mu_r = \mu_r(Y) = E[(y - \mu)^r] = \int_{-\infty}^{+\infty} (y - \mu)^r \, dF(y).$

The standardized cumulants are found by dividing the first six cumulants $\kappa_1$ - $\kappa_6$ by $\kappa_2^{r/2} = (\sigma^2)^{r/2} = \sigma^r$, where $\sigma^2$ is the variance of $Y$ and $r$ is the order of the cumulant:

(5)  $0 = \dfrac{\kappa_1}{\kappa_2^{1/2}} = \dfrac{\mu_1}{\sigma^1}$

(6)  $1 = \dfrac{\kappa_2}{\kappa_2^{2/2}} = \dfrac{\mu_2}{\sigma^2}$

(7)  $\gamma_1 = \dfrac{\kappa_3}{\kappa_2^{3/2}} = \dfrac{\mu_3}{\sigma^3}$

(8)  $\gamma_2 = \dfrac{\kappa_4}{\kappa_2^{4/2}} = \dfrac{\mu_4}{\sigma^4} - 3$

(9)  $\gamma_3 = \dfrac{\kappa_5}{\kappa_2^{5/2}} = \dfrac{\mu_5}{\sigma^5} - 10\gamma_1$

(10)  $\gamma_4 = \dfrac{\kappa_6}{\kappa_2^{6/2}} = \dfrac{\mu_6}{\sigma^6} - 15\gamma_2 - 10\gamma_1^2 - 15.$

The values $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ correspond to skew, standardized kurtosis (so that the normal distribution has a value of 0, subsequently referred to as skurtosis), and standardized fifth and sixth cumulants. The corresponding sample values for the above can be obtained by replacing $\mu_r$ by $m_r = \sum_{j=1}^{n} (x_j - m_1)^r / n$.
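
As a check of Equations (5)-(10) on sample data (a base-R sketch; the Gamma example is an arbitrary choice), the sample standardized cumulants can be computed from the sample central moments:

  set.seed(10)
  x <- rgamma(100000, shape = 2, rate = 1)
  m1 <- mean(x)
  m <- sapply(2:6, function(r) mean((x - m1)^r))   # central moments m2, ..., m6
  s <- sqrt(m[1])
  g1 <- m[2]/s^3                                   # skew
  g2 <- m[3]/s^4 - 3                               # skurtosis
  g3 <- m[4]/s^5 - 10 * g1                         # standardized fifth cumulant
  g4 <- m[5]/s^6 - 15 * g2 - 10 * g1^2 - 15        # standardized sixth cumulant
  round(c(g1, g2, g3, g4), 3)
  # theoretical Gamma(shape = 2) values: 1.414, 3, 8.485, 30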

The standardized cumulants for a continuous mixture variable can be derived in terms of the standardized cumulants of its component distributions. Suppose the goal is to simulate a continuous mixture variable $Y$ with PDF $h_Y(y)$ that contains two component distributions $Y_a$ and $Y_b$ with mixing parameters $\pi_a$ and $\pi_b$:

(11)  $h_Y(y) = \pi_a f_{Y_a}(y) + \pi_b g_{Y_b}(y), \quad y \in \boldsymbol{Y}, \ \pi_a \in (0,\,1), \ \pi_b \in (0,\,1), \ \pi_a + \pi_b = 1.$

Here,

(12)  $Y_a = \sigma_a Z_a' + \mu_a, \ Y_a \sim f_{Y_a}(y), \ y \in \boldsymbol{Y_a}$  and  $Y_b = \sigma_b Z_b' + \mu_b, \ Y_b \sim g_{Y_b}(y), \ y \in \boldsymbol{Y_b},$

so that $Y_a$ and $Y_b$ have expected values $\mu_a$ and $\mu_b$ and variances $\sigma_a^2$ and $\sigma_b^2$. Assume the variables $Z_a'$ and $Z_b'$ are generated with zero mean and unit variance using Headrick's fifth-order PMT given the specified values for skew ($\gamma_{1a}$, $\gamma_{1b}$), skurtosis ($\gamma_{2a}$, $\gamma_{2b}$), and standardized fifth ($\gamma_{3a}$, $\gamma_{3b}$) and sixth ($\gamma_{4a}$, $\gamma_{4b}$) cumulants:

(13)  $Z_a' = c_{0a} + c_{1a} Z_a + c_{2a} Z_a^2 + c_{3a} Z_a^3 + c_{4a} Z_a^4 + c_{5a} Z_a^5, \ Z_a \sim N(0,\,1)$
      $Z_b' = c_{0b} + c_{1b} Z_b + c_{2b} Z_b^2 + c_{3b} Z_b^3 + c_{4b} Z_b^4 + c_{5b} Z_b^5, \ Z_b \sim N(0,\,1).$

The constants $c_{0a}, \ldots, c_{5a}$ and $c_{0b}, \ldots, c_{5b}$ are the solutions to the system of equations given in SimMultiCorrData's poly function and calculated by find_constants. Similar results hold for Fleishman's third-order PMT, where the constants $c_{0a}, \ldots, c_{3a}$ and $c_{0b}, \ldots, c_{3b}$ are the solutions to the system of equations given in fleish ($c_{4a} = c_{5a} = c_{4b} = c_{5b} = 0$).

The $r$th expected value of $Y$ can be expressed as:

(14)  $E[Y^r] = \int y^r h_Y(y)\, dy = \pi_a \int y^r f_{Y_a}(y)\, dy + \pi_b \int y^r g_{Y_b}(y)\, dy = \pi_a E[Y_a^r] + \pi_b E[Y_b^r].$

Equation (14) can be used to derive expressions for the mean, variance, skew, skurtosis, and standardized fifth and sixth cumulants of $Y$ in terms of the $r$th expected values of $Y_a$ and $Y_b$. See Section 11.1 in the Appendix for the expressions and proofs.

Extension to more than two component distributions

If the desired mixture distribution $Y$ contains more than two component distributions, the expected values of $Y$ are again expressed as sums of the expected values of the component distributions, with weights equal to the associated mixing parameters. For example, assume $Y$ contains $k$ component distributions $Y_1, \ldots, Y_k$ with mixing parameters given by $\pi_1, \ldots, \pi_k$, where $\sum_{i=1}^{k} \pi_i = 1$. The component distributions are described by the following parameters: means $\mu_1, \ldots, \mu_k$, variances $\sigma_1^2, \ldots, \sigma_k^2$, skews $\gamma_{11}, \ldots, \gamma_{1k}$, skurtoses $\gamma_{21}, \ldots, \gamma_{2k}$, fifth cumulants $\gamma_{31}, \ldots, \gamma_{3k}$, and sixth cumulants $\gamma_{41}, \ldots, \gamma_{4k}$. Then the $r$th expected value of $Y$ can be expressed as:

(15)  $E[Y^r] = \int y^r h_Y(y)\, dy = \sum_{i=1}^{k} \pi_i \int y^r f_{Y_i}(y)\, dy = \sum_{i=1}^{k} \pi_i E_{f_i}[Y_i^r].$

Therefore, a method similar to that above can be used to derive the system of equations defining the mean, variance, skew, skurtosis, and standardized fifth and sixth cumulants of $Y$. These equations are used within the function calc_mixmoments to determine the values for a mixture variable. The summary_var function executes calc_mixmoments to provide target distributions for simulated continuous mixture variables.

Example with Normal and Beta mixture variables

Let $Y_1$ be a mixture of Normal(-5, 2), Normal(1, 3), and Normal(7, 4) distributions with mixing parameters 0.36, 0.48, and 0.16. This variable could represent a continuous trait with a codominant mixture distribution, as in Figure 1a, where $p_A = 0.6$ and $p_a = 0.4$. Let $Y_2$ be a mixture of Beta(13, 11) and Beta(13, 4) distributions with mixing parameters 0.3 and 0.7. Beta-mixture models are widely used in bioinformatics to represent correlation coefficients. These could arise from pathway analysis of a relevant gene to study if gene-expression levels are correlated with those of other genes. The correlations could also describe the expression levels of the same gene measured in different studies, as in meta-analyses of multiple gene-expression experiments. Since expression varies greatly across genes, the correlations may come from different probability distributions within one mixture distribution. Each component distribution represents groups of genes with similar behavior. A Beta-mixture model has been proposed for correlation coefficients and extended to methylation microarray data in order to reduce dimensionality and detect fluctuations in methylation status between various samples and tissues. Other extensions include cluster analysis, single nucleotide polymorphism (SNP) analysis, pattern recognition and image processing, and quantile normalization to correct probe design bias. Since these methods assume independence among components, a compound hierarchical correlated Beta-mixture model has also been developed to permit correlations among components and applied to cluster mouse transcription factor DNA binding data.

The standardized cumulants for the Normal and Beta mixtures using the fifth-order PMT are found as follows:

  library("SimCorrMix")
  B1 <- calc_theory("Beta", c(13, 11))
  B2 <- calc_theory("Beta", c(13, 4))
  mix_pis <- list(c(0.36, 0.48, 0.16), c(0.3, 0.7))
  mix_mus <- list(c(-5, 1, 7), c(B1[1], B2[1]))
  mix_sigmas <- list(c(sqrt(2), sqrt(3), sqrt(4)), c(B1[2], B2[2]))
  mix_skews <- list(c(0, 0, 0), c(B1[3], B2[3]))
  mix_skurts <- list(c(0, 0, 0), c(B1[4], B2[4]))
  mix_fifths <- list(c(0, 0, 0), c(B1[5], B2[5]))
  mix_sixths <- list(c(0, 0, 0), c(B1[6], B2[6]))
  Nstcum <- calc_mixmoments(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]],
    mix_skews[[1]], mix_skurts[[1]], mix_fifths[[1]], mix_sixths[[1]])
  Nstcum
  ##       mean         sd       skew   kurtosis      fifth      sixth
  ## -0.2000000  4.4810713  0.3264729 -0.6238472 -1.0244454  1.4939902
  Bstcum <- calc_mixmoments(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]],
    mix_skews[[2]], mix_skurts[[2]], mix_fifths[[2]], mix_sixths[[2]])
  Bstcum
  ##      mean         sd       skew   kurtosis      fifth      sixth
  ## 0.6977941  0.1429099 -0.4563146 -0.5409080  1.7219898  0.5584577

SimMultiCorrData’s calc_theory was used first to obtain the standardized cumulants for each of the Beta distributions.
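
A quick check of the Normal mixture mean and standard deviation, computing $E[Y]$ and $E[Y^2]$ from the component parameters as in Equation (15), reproduces the calc_mixmoments output above:

  mix_mean <- sum(mix_pis[[1]] * mix_mus[[1]])
  mix_var <- sum(mix_pis[[1]] * (mix_sigmas[[1]]^2 + mix_mus[[1]]^2)) - mix_mean^2
  c(mix_mean, sqrt(mix_var))
  # -0.2 and 4.4810713, matching the calc_mixmoments results for Nstcum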

Calculation of intermediate correlations for continuous variables

The target correlation matrix rho in the simulation functions corrvar and corrvar2 is specified in terms of the correlations with components of continuous mixture variables. This allows the user to set the correlation between components of the same mixture variable to any desired value. If this correlation is small (i.e., 0-0.2), the intermediate correlation matrix Sigma may need to be converted to the nearest positive-definite (PD) matrix. This is done within the simulation functions by specifying use.nearPD = TRUE, in which case the nearest PD matrix is found with the Matrix package's nearPD function. Otherwise, negative eigenvalues are replaced with 0.

The function intercorr_cont calculates the intermediate correlations for the standard normal variables used in Equation (3). This is necessary because the transformation decreases the absolute value of the final correlations. The function uses the equation derived for the third-order PMT (Equation 7b of that derivation) and the equation derived for the fifth-order PMT (Equation 26 of that derivation).

Approximate correlations for continuous mixture variables:

Even though the correlations for the continuous mixture variables are set at the component level, we can approximate the resulting correlations for the mixture variables. Assume $Y_1$ and $Y_2$ are two continuous mixture variables. Let $Y_1$ have $k_1$ components with mixing probabilities $\alpha_1, \ldots, \alpha_{k_1}$ and standard deviations $\sigma_{11}, \ldots, \sigma_{1k_1}$. Let $Y_2$ have $k_2$ components with mixing probabilities $\beta_1, \ldots, \beta_{k_2}$ and standard deviations $\sigma_{21}, \ldots, \sigma_{2k_2}$.

Correlation between continuous mixture variables Y1 and Y2

The correlation between the mixture variables $Y_1$ and $Y_2$ is given by:

(16)  $\rho_{Y_1 Y_2} = \dfrac{E[Y_1 Y_2] - E[Y_1]E[Y_2]}{\sigma_1 \sigma_2},$

where $\sigma_1^2$ is the variance of $Y_1$ and $\sigma_2^2$ is the variance of $Y_2$. Equation (16) requires the expected value of the product of $Y_1$ and $Y_2$. Since $Y_1$ and $Y_2$ may contain any number of components and these components may have any continuous distribution, there is no general way to determine this expected value. Therefore, it is approximated by expressing $Y_1$ and $Y_2$ as sums of their component variables:

(17)  $\rho_{Y_1 Y_2} = \dfrac{E\left[\left(\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right)\left(\sum_{j=1}^{k_2} \beta_j Y_{2j}\right)\right] - E\left[\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right] E\left[\sum_{j=1}^{k_2} \beta_j Y_{2j}\right]}{\sigma_1 \sigma_2},$

where

(18)  $E\left[\left(\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right)\left(\sum_{j=1}^{k_2} \beta_j Y_{2j}\right)\right] = E[\alpha_1 Y_{11} \beta_1 Y_{21} + \alpha_1 Y_{11} \beta_2 Y_{22} + \ldots + \alpha_{k_1} Y_{1k_1} \beta_{k_2} Y_{2k_2}] = \alpha_1 \beta_1 E[Y_{11} Y_{21}] + \alpha_1 \beta_2 E[Y_{11} Y_{22}] + \ldots + \alpha_{k_1} \beta_{k_2} E[Y_{1k_1} Y_{2k_2}].$

Using the general correlation equation, for $1 \leq i \leq k_1$ and $1 \leq j \leq k_2$:

(19)  $E[Y_{1i} Y_{2j}] = \sigma_{1i} \sigma_{2j} \rho_{Y_{1i} Y_{2j}} + E[Y_{1i}]E[Y_{2j}],$

so that we can rewrite $\rho_{Y_1 Y_2}$ as:

(20)  $\rho_{Y_1 Y_2} = \dfrac{\alpha_1 \beta_1 (\sigma_{11} \sigma_{21} \rho_{Y_{11} Y_{21}} + E[Y_{11}]E[Y_{21}]) + \ldots + \alpha_{k_1} \beta_{k_2} (\sigma_{1k_1} \sigma_{2k_2} \rho_{Y_{1k_1} Y_{2k_2}} + E[Y_{1k_1}]E[Y_{2k_2}])}{\sigma_1 \sigma_2} - \dfrac{\alpha_1 \beta_1 E[Y_{11}]E[Y_{21}] + \ldots + \alpha_{k_1} \beta_{k_2} E[Y_{1k_1}]E[Y_{2k_2}]}{\sigma_1 \sigma_2} = \dfrac{\sum_{i=1}^{k_1} \alpha_i \sigma_{1i} \sum_{j=1}^{k_2} \beta_j \sigma_{2j} \rho_{Y_{1i} Y_{2j}}}{\sigma_1 \sigma_2}.$

Extending the example from Section 3.2.1, assume there are now three variables: $Y_1$ (the Normal mixture), $Y_2$ (the Beta mixture), and $Y_3$ (a zero-inflated Poisson variable with mean 5 and probability of a structural zero set at 0.1). Let the target correlations among the components of $Y_1$, the components of $Y_2$, and $Y_3$ be 0.4. The components of $Y_1$ have a weak correlation of 0.1 and the components of $Y_2$ are independent. The resulting correlation between $Y_1$ and $Y_2$ is approximated as:

  rho <- matrix(0.4, 6, 6)
  rho[1:3, 1:3] <- matrix(0.1, 3, 3)
  rho[4:5, 4:5] <- matrix(0, 2, 2)
  diag(rho) <- 1
  rho_M1M2(mix_pis, mix_mus, mix_sigmas, rho[1:3, 4:5])
  ## [1] 0.103596

Note that rho has 6 columns because $k_1 = 3$, $k_2 = 2$, and $k_1 + k_2 + 1 = 6$.
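
Equation (20) can also be evaluated by hand from the mixing parameters, component standard deviations, and component correlations defined above, using the mixture standard deviations Nstcum[2] and Bstcum[2] found earlier; the result should agree with rho_M1M2:

  num <- sum(outer(mix_pis[[1]] * mix_sigmas[[1]],
                   mix_pis[[2]] * mix_sigmas[[2]]) * rho[1:3, 4:5])
  num / (Nstcum[2] * Bstcum[2])
  # approximately 0.1036, matching the rho_M1M2 result above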

Correlation between continuous mixture variable Y1 and other random variable Y3

Here $Y_3$ can be an ordinal, a continuous non-mixture, or a regular or zero-inflated Poisson or Negative Binomial variable. The correlation between the mixture variable $Y_1$ and $Y_3$ is given by:

(21)  $\rho_{Y_1 Y_3} = \dfrac{E[Y_1 Y_3] - E[Y_1]E[Y_3]}{\sigma_1 \sigma_3},$

where $\sigma_3^2$ is the variance of $Y_3$. Equation (21) requires the expected value of the product of $Y_1$ and $Y_3$, which is again approximated by expressing $Y_1$ as a sum of its component variables:

(22)  $\rho_{Y_1 Y_3} = \dfrac{E\left[\left(\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right) Y_3\right] - E\left[\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right] E[Y_3]}{\sigma_1 \sigma_3},$

where

(23)  $E\left[\left(\sum_{i=1}^{k_1} \alpha_i Y_{1i}\right) Y_3\right] = E[\alpha_1 Y_{11} Y_3 + \alpha_2 Y_{12} Y_3 + \ldots + \alpha_{k_1} Y_{1k_1} Y_3] = \alpha_1 E[Y_{11} Y_3] + \alpha_2 E[Y_{12} Y_3] + \ldots + \alpha_{k_1} E[Y_{1k_1} Y_3].$

Using the general correlation equation, for $1 \leq i \leq k_1$:

(24)  $E[Y_{1i} Y_3] = \sigma_{1i} \sigma_3 \rho_{Y_{1i} Y_3} + E[Y_{1i}]E[Y_3],$

so that we can rewrite $\rho_{Y_1 Y_3}$ as:

(25)  $\rho_{Y_1 Y_3} = \dfrac{\alpha_1 (\sigma_{11} \sigma_3 \rho_{Y_{11} Y_3} + E[Y_{11}]E[Y_3]) + \ldots + \alpha_{k_1} (\sigma_{1k_1} \sigma_3 \rho_{Y_{1k_1} Y_3} + E[Y_{1k_1}]E[Y_3])}{\sigma_1 \sigma_3} - \dfrac{\alpha_1 E[Y_{11}]E[Y_3] + \ldots + \alpha_{k_1} E[Y_{1k_1}]E[Y_3]}{\sigma_1 \sigma_3} = \dfrac{\sum_{i=1}^{k_1} \alpha_i \sigma_{1i} \rho_{Y_{1i} Y_3}}{\sigma_1}.$

Continuing with the example, the correlations between $Y_1$ and $Y_3$ and between $Y_2$ and $Y_3$ are approximated as:

  rho_M1Y(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]], rho[1:3, 6])
  ## [1] 0.1482236
  rho_M1Y(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]], rho[4:5, 6])
  ## [1] 0.2795669
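
Similarly, Equation (25) can be evaluated by hand for the correlation between $Y_1$ and $Y_3$; the result should match the first rho_M1Y value above:

  sum(mix_pis[[1]] * mix_sigmas[[1]] * rho[1:3, 6]) / Nstcum[2]
  # approximately 0.1482, matching rho_M1Y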

The accuracy of these approximations can be determined through simulation:

  means <- c(Nstcum[1], Bstcum[1])
  vars <- c(Nstcum[2]^2, Bstcum[2]^2)
  seed <- 184
  Sim1 <- corrvar(n = 100000, k_mix = 2, k_pois = 1, method = "Polynomial",
    means = means, vars = vars, mix_pis = mix_pis, mix_mus = mix_mus,
    mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
    mix_fifths = mix_fifths, mix_sixths = mix_sixths, lam = 5, p_zip = 0.1,
    rho = rho, seed = seed, use.nearPD = FALSE)
  ## Total Simulation time: 0.065 minutes
  names(Sim1)
  ## [1] "constants"        "Y_cont"           "Y_comp"           "sixth_correction"
  ## [5] "valid.pdf"        "Y_mix"            "Y_pois"           "Sigma"
  ## [9] "Error_Time"       "Time"             "niter"
  Sum1 <- summary_var(Y_comp = Sim1$Y_comp, Y_mix = Sim1$Y_mix,
    Y_pois = Sim1$Y_pois, means = means, vars = vars, mix_pis = mix_pis,
    mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
    mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
    lam = 5, p_zip = 0.1, rho = rho)
  names(Sum1)
  ## [1] "cont_sum"   "target_sum" "mix_sum"    "target_mix" "rho_mix"    "pois_sum"
  ## [7] "rho_calc"   "maxerr"
  Sum1$rho_mix
  ##           [,1]      [,2]      [,3]
  ## [1,] 1.0000000 0.1012219 0.1475749
  ## [2,] 0.1012219 1.0000000 0.2776299
  ## [3,] 0.1475749 0.2776299 1.0000000

The results show that Equation (20) and Equation (25) provided good approximations to the simulated correlations. Section 8 also compares approximated expected correlations for continuous mixture variables with simulated correlations.

Figure 2 displays the PDF of the Normal mixture variable and the simulated values of the zero-inflated Poisson (ZIP) variable obtained using SimCorrMix's graphing functions. These functions are written with ggplot2 functions, and the results are ggplot objects that can be saved or further modified. As demonstrated below, the target distribution, specified by distribution name and parameters (39 distributions currently available by name) or PDF function fx, can be overlaid on the plot for continuous or count variables.

  plot_simpdf_theory(sim_y = Sim1$Y_mix[, 1], title = "", sim_size = 2,
    target_size = 2, fx = function(x) mix_pis[[1]][1] *
      dnorm(x, mix_mus[[1]][1], mix_sigmas[[1]][1]) + mix_pis[[1]][2] *
      dnorm(x, mix_mus[[1]][2], mix_sigmas[[1]][2]) + mix_pis[[1]][3] *
      dnorm(x, mix_mus[[1]][3], mix_sigmas[[1]][3]), lower = -10, upper = 10,
    legend.position = "none", axis.text.size = 30, axis.title.size = 30)
  plot_simtheory(sim_y = Sim1$Y_pois[, 1], title = "", cont_var = FALSE,
    binwidth = 0.5, Dist = "Poisson", params = c(5, 0.1),
    legend.position = "none", axis.text.size = 30, axis.title.size = 30)
Figure 2: Graphs of variables (simulated = blue, target = green). (a) PDF of Normal mixture variable. (b) Simulated values of ZIP variable.

The Continuous Mixture Distributions vignette explains how to compare simulated to theoretical distributions of continuous mixture variables, as demonstrated here for the Beta mixture variable $Y_2$:

  1. Obtain the standardized cumulants for the target mixture variable Y2 and its components: these were found above using calc_mixmoments and calc_theory.

  2. Obtain the PMT constants for the components of Y2: these are returned in the simulation result Sim1$constants.

  3. Determine whether these constants produce valid PDF’s for the components of Y2 (and therefore for Y2): this is indicated for all continuous variables in the simulation result Sim1$valid.pdf.

      ## [1] "TRUE" "TRUE" "TRUE" "TRUE" "TRUE"
  4. Select a critical value from the distribution of $Y_2$, i.e., $y_2^*$ such that $\Pr[Y_2 \geq y_2^*] = \alpha$, for the desired significance level $\alpha$: Let $\alpha = 0.05$. Since there are no quantile functions for mixture distributions, determine where the cumulative probability equals $1 - \alpha = 0.95$.

      beta_fx <- function(x) mix_pis[[2]][1] * dbeta(x, 13, 11) +
        mix_pis[[2]][2] * dbeta(x, 13, 4)
      beta_cfx <- function(x, alpha, fx = beta_fx) {
        integrate(function(x, FUN = fx) FUN(x), -Inf, x, subdivisions = 1000,
          stop.on.error = FALSE)$value - (1 - alpha)
      }
      y2_star <- uniroot(beta_cfx, c(0, 1), tol = 0.001, alpha = 0.05)$root
      y2_star
      ## [1] 0.8985136
  5. Calculate the cumulative probability for the simulated mixture variable $Y_2$ up to $y_2^*$ and compare to $1 - \alpha$: The function sim_cdf_prob from SimMultiCorrData calculates cumulative probabilities.

      sim_cdf_prob(sim_y = Sim1$Y_mix[, 2], delta = y2_star)$cumulative_prob
      ## [1] 0.9534

    This is approximately equal to the $1 - \alpha$ value of 0.95, indicating that the simulation provides a good approximation to the theoretical distribution.

  6. Plot a graph of the simulated and target $Y_2$: Figure 3 shows the PDF and empirical CDF obtained as follows (plot_sim_cdf is in SimMultiCorrData):

      plot_simpdf_theory(sim_y = Sim1$Y_mix[, 2], title = "", sim_size = 2,
        target_size = 2, fx = beta_fx, lower = 0, upper = 1,
        legend.position = c(0.4, 0.85), legend.text.size = 30,
        axis.text.size = 30, axis.title.size = 30)
      plot_sim_cdf(sim_y = Sim1$Y_mix[, 2], title = "", calc_cprob = TRUE,
        delta = y2_star, text.size = 30, axis.text.size = 30, axis.title.size = 30)
Figure 3: Graphs of the Beta mixture variable. (a) PDF. (b) CDF.

4 Count mixture distributions

SimCorrMix extends the methods in SimMultiCorrData for regular Poisson and Negative Binomial (NB) variables to zero-inflated Poisson and NB variables. All count variables are generated using the inverse CDF method with distribution functions imported from VGAM. Evaluating the standard normal CDF $\Phi$ at a standard normal variable $Z$ yields a Uniform(0, 1) variable. The appropriate quantile function $F_Y^{-1}$ is applied to this uniform variable with the designated parameters to generate the count variable: $Y = F_Y^{-1}(\Phi(Z))$. The order within all parameters for count variables should be 1st regular and 2nd zero-inflated.
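
A minimal sketch of this inverse CDF step for a zero-inflated Poisson variable (assuming VGAM's qzipois quantile function with arguments lambda and pstr0) is:

  library("VGAM")
  z <- rnorm(10000)                        # standard normal variable
  u <- pnorm(z)                            # Phi(Z) is Uniform(0, 1)
  y_zip <- qzipois(u, lambda = 5, pstr0 = 0.1)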

A zero-inflated random variable $Y_{ZI}$ is a mixture of a degenerate distribution having the point mass at 0 and another distribution $Y$ that contributes both zero and non-zero values. If the mixing probability is $\phi$, then:

(26)  $\Pr[Y_{ZI} = 0] = \phi + (1 - \phi) \Pr[Y = 0], \quad 0 < \phi < 1.$

Therefore, $\phi$ is the probability of a structural zero, and setting $\phi = 0$ reduces $Y_{ZI}$ to the variable $Y$. In SimCorrMix, $Y$ can have either a Poisson ($Y_P$) or a Negative Binomial ($Y_{NB}$) distribution.

Zero-inflated Poisson (ZIP) distribution

The model for $Y_{ZIP} \sim ZIP(\lambda, \phi)$, $\lambda > 0$, $0 < \phi < 1$, is:

(27)  $\Pr[Y_{ZIP} = 0] = \phi + (1 - \phi)\exp(-\lambda)$
      $\Pr[Y_{ZIP} = y] = (1 - \phi)\exp(-\lambda)\dfrac{\lambda^y}{y!}, \quad y = 1, 2, \ldots$

The mean of $Y_{ZIP}$ is $(1 - \phi)\lambda$, and the variance is $(1 - \phi)\lambda(1 + \phi\lambda)$. The parameters $\lambda$ (mean of the regular Poisson component) and $\phi$ are specified in SimCorrMix through the inputs lam and p_zip. Setting p_zip = 0 (the default setting) generates a regular Poisson variable.
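
These moment formulas can be checked empirically (a sketch assuming VGAM's rzipois with arguments lambda and pstr0): with $\lambda = 5$ and $\phi = 0.1$, the mean should be near $0.9 \times 5 = 4.5$ and the variance near $0.9 \times 5 \times 1.5 = 6.75$.

  library("VGAM")
  set.seed(1)
  y <- rzipois(100000, lambda = 5, pstr0 = 0.1)
  c(mean(y), var(y))    # approximately 4.5 and 6.75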

The zero-deflated Poisson distribution is obtained by setting $\phi \in (-1/(\exp(\lambda) - 1),\ 0)$, so that the probability of a zero count is less than the nominal Poisson value. In this case, $\phi$ no longer represents a probability. When $\phi = -1/(\exp(\lambda) - 1)$, the random variable has a positive-Poisson distribution. The probability of a zero response is 0, and the other probabilities are scaled to sum to 1.

Zero-inflated Negative Binomial (ZINB) distribution

A major limitation of the Poisson distribution is that the mean and variance are equal. In practice, population heterogeneity creates extra variability (overdispersion), e.g., if $Y$ represents the number of events which occur in a given time interval and the length of the observation period varies across subjects. If the lengths of these periods are available for each subject, an offset term may be used. Otherwise, the length can be considered a latent variable, and the mean of the Poisson distribution for each subject is a random variable. If these means are described by a Gamma distribution, then $Y$ has a NB distribution, which has an extra parameter to account for overdispersion. However, an excessive number of zeros requires using a zero-inflated distribution. These extra (structural) zeros may arise from a subpopulation of subjects who are not at risk for the event during the study period. These subjects are still important to the analysis because they may possess different characteristics from the at-risk subjects.

The model for $Y_{ZINB} \sim ZINB(\eta, p, \phi)$, $\eta > 0$, $0 < p \leq 1$, $0 < \phi < 1$, is:

(28)  $\Pr[Y_{ZINB} = 0] = \phi + (1 - \phi)p^{\eta}$
      $\Pr[Y_{ZINB} = y] = (1 - \phi)\dfrac{\Gamma(y + \eta)}{\Gamma(\eta)\,y!}p^{\eta}(1 - p)^{y}, \quad y = 1, 2, \ldots$

In this formulation, the Negative Binomial component $Y_{NB}$ represents the number of failures that occur in a sequence of independent Bernoulli trials before a target number of successes ($\eta$) is reached. The probability of success in each trial is $p$. $Y_{NB}$ may also be parameterized by the mean $\mu$ (of the regular NB component) and dispersion parameter $\eta$ so that $p = \eta/(\eta + \mu)$ or $\mu = \eta(1 - p)/p$. The mean of $Y_{ZINB}$ is $(1 - \phi)\mu$, and the variance is $(1 - \phi)\mu(1 + \mu(\phi + 1/\eta))$. The parameters $\eta$, $p$, $\mu$, and $\phi$ are specified through the inputs size, prob, mu, and p_zinb. Either prob or mu should be given for all NB and ZINB variables. Setting p_zinb = 0 (the default setting) generates a regular NB variable.

The zero-deflated NB distribution may be obtained by setting $\phi \in (-p^{\eta}/(1 - p^{\eta}),\ 0)$, so that the probability of a zero count is less than the nominal NB value. In this case, $\phi$ no longer represents a probability. The positive-NB distribution results when $\phi = -p^{\eta}/(1 - p^{\eta})$. The probability of a zero response is 0, and the other probabilities are scaled to sum to 1.

Calculation of intermediate correlations for count variables

The two simulation pathways differ by the technique used for count variables. The intermediate correlations used in correlation method 1 are simulation-based, and accuracy increases with sample size and number of repetitions. The intermediate correlations used in correlation method 2 involve correction loops which make iterative adjustments until the pairwise error falls below a user-specified tolerance (if possible). Correlation method 1 is described below:

  1. Count variable pairs: The intermediate correlation between the standard normal variables $Z_1$ and $Z_2$ is calculated using a logarithmic transformation of the target correlation (see the sketch after this list). First, the lower and upper Fréchet-Hoeffding bounds (mincor, maxcor) on $\rho_{Y_1 Y_2}$ are simulated (see Section 6). Then the intermediate correlation $\rho_{Z_1 Z_2}$ is found as follows: (29) $\rho_{Z_1 Z_2} = \dfrac{1}{b}\log\left(\dfrac{\rho_{Y_1 Y_2} - c}{a}\right)$, where $a = -\dfrac{maxcor \cdot mincor}{maxcor + mincor}$, $b = \log\left(\dfrac{maxcor + a}{a}\right)$, and $c = -a$. The functions intercorr_pois, intercorr_nb, and intercorr_pois_nb calculate these correlations.

  2. Ordinal-count variable pairs: The intermediate correlations are the ratio of the target correlations to correction factors. The correction factor is the product of the upper Fréchet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it. This upper bound is the generate, sort, and correlate (GSC) upper bound (see Section 6). The functions intercorr_cat_pois and intercorr_cat_nb calculate these correlations.

  3. Continuous-count variable pairs: Here the correlation correction factor is the product of the upper Fréchet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it. This power method correlation is given by $\rho_{p(Z)Z} = c_1 + 3c_3 + 15c_5$, where $c_5 = 0$ for Fleishman's method. The functions intercorr_cont_pois and intercorr_cont_nb calculate these correlations.
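
The calculation in Equation (29) can be sketched in base R for a pair of Poisson variables (the means 3 and 0.5 and the target correlation 0.4 are arbitrary illustrative choices; intercorr_pois performs the equivalent computation internally):

  set.seed(1)
  n <- 100000
  u <- runif(n)
  # simulated Frechet-Hoeffding bounds for Poisson(3) and Poisson(0.5)
  maxcor <- cor(qpois(u, 3), qpois(u, 0.5))
  mincor <- cor(qpois(u, 3), qpois(1 - u, 0.5))
  a <- -(maxcor * mincor)/(maxcor + mincor)
  b <- log((maxcor + a)/a)
  c <- -a
  rho_Y <- 0.4                             # target correlation
  rho_Z <- log((rho_Y - c)/a)/b            # intermediate normal correlation
  rho_Z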

This method has been shown to be less accurate for positive correlations with small count variable means (i.e., less than 1) or high negative correlations with large count variable means.

In correlation method 2, count variables are treated as "ordinal" variables, based on the methods of Barbiero and Ferrari. The Poisson or NB support is made finite by removing a small user-specified value (specified by pois_eps and nb_eps) from the total cumulative probability. This truncation factor may differ for each count variable, but the default value is 0.0001. For example, pois_eps = 0.0001 means that the support values removed have a total probability of 0.0001 of occurring in the distribution of that variable. The effect is to remove improbable values, which may be of concern if the user wishes to replicate a distribution with outliers. The function maxcount_support creates these new supports and associated marginal distributions.

  1. Count variable or ordinal-count variable pairs: The intermediate correlations are calculated using the correction loop of ord_norm (see Section 5).

  2. Continuous-count variable pairs: The intermediate correlations are the ratio of the target correlations to correction factors. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it. The functions intercorr_cont_pois2 and intercorr_cont_nb2 calculate these correlations.

This method performs best under the same circumstances as ordinal variables, i.e., when there are few categories and the probability of any given category is not very small. This occurs when the count variable has a small mean. Therefore, method 2 performs well in situations where method 1 has poor accuracy. In contrast, large means for the count variables would result in longer computational times. Section 8 compares the accuracy of correlation methods 1 and 2 under different scenarios.

5 Simulation of ordinal variables

Ordinal variables ($r \geq 2$ categories) are generated by discretizing standard normal variables at the quantiles determined from the cumulative probabilities specified in marginal. Each element of this list is a vector of length $r - 1$ (the $r$th value is 1). If the support is not provided, the default is to use $\{1, 2, \ldots, r\}$. The tetrachoric correlation is used for the intermediate correlation of binary pairs. The assumptions are that the binary variables arise from latent normal variables and the actual trait is continuous and not discrete. For $Y_1$ and $Y_2$, with success probabilities $p_1$ and $p_2$, the intermediate correlation $\rho_{Z_1 Z_2}$ is the solution to the following equation:

(30)  $\Phi[z(p_1),\ z(p_2),\ \rho_{Z_1 Z_2}] = \rho_{Y_1 Y_2}\sqrt{p_1(1 - p_1)p_2(1 - p_2)} + p_1 p_2,$

where $z(p)$ indicates the $p$th quantile of the standard normal distribution.
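
Equation (30) can be solved numerically for a single binary pair with uniroot and mvtnorm's pmvnorm (a sketch; the success probabilities 0.3 and 0.6 and the target correlation 0.4 are arbitrary choices):

  library("mvtnorm")
  p1 <- 0.3; p2 <- 0.6; rho_Y <- 0.4
  f <- function(r) {
    lhs <- pmvnorm(upper = c(qnorm(p1), qnorm(p2)),
                   corr = matrix(c(1, r, r, 1), 2, 2))
    as.numeric(lhs) - (rho_Y * sqrt(p1 * (1 - p1) * p2 * (1 - p2)) + p1 * p2)
  }
  uniroot(f, c(-0.99, 0.99))$root    # intermediate correlation rho_Z1Z2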

If at least one ordinal variable has more than 2 categories, ord_norm is called. Based on SimMultiCorrData’s ordnorm and GenOrd’s ordcont and contord, the algorithm to simulate k_cat ordinal random variables with target correlation matrix rho0 is as follows:

  1. Create the default support if necessary.

  2. Use norm_ord to calculate the initial correlation of the ordinal variables (rhoordold) generated by discretizing k_cat random normal variates with correlation matrix set equal to rho0, using marginal and the corresponding normal quantiles. These correlations are calculated using means and variances found from multivariate normal probabilities determined by mvtnorm's pmvnorm.

  3. Let rho be the intermediate normal correlation updated in each iteration, rhoord be the ordinal correlation calculated in each iteration, rhoold be the intermediate correlation from the previous iteration (initialized at rhoordold), it be the iteration number, and maxit and epsilon be the user-specified maximum number of iterations and pairwise correlation error. For each variable pair, execute the following:

    1. If rho0 = 0, set rho = 0.

    2. While the absolute error between rhoord and rho0 is greater than epsilon and it is less than maxit:

      1. If rho0 * (rho0/rhoord) <= -1:
        rho = rhoold * (1 + 0.1 * (1 - rhoold) * -sign(rho0 - rhoord)).

      2. If rho0 * (rho0/rhoord) >= 1:
        rho = rhoold * (1 + 0.1 * (1 - rhoold) * sign(rho0 - rhoord)).

      3. Else, rho = rhoold * (rho0/rhoord).

      4. If rho > 1, set rho = 1. If rho < -1, set rho = -1.

      5. Calculate rhoord using norm_ord and the 2×2 correlation matrix formed by rho.

      6. Set rhoold = rho and increase it by 1.

    3. Store the number of iterations in the matrix niter.

  4. Return the final intermediate correlation matrix SigmaC = rho for the random normal variables. Discretize these to produce ordinal variables with the desired correlation matrix.
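
The final discretization step can be sketched directly in base R: a standard normal variable is cut at the quantiles of the cumulative probabilities in marginal (here 0.25 and 0.75 for three categories, an arbitrary example):

  set.seed(7)
  z <- rnorm(10000)
  breaks <- qnorm(c(0.25, 0.75))
  y_ord <- cut(z, breaks = c(-Inf, breaks, Inf), labels = FALSE)
  table(y_ord)/length(y_ord)    # approximately 0.25, 0.50, 0.25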

6 Calculation of correlation boundaries

For binary variable pairs, the correlation bounds are calculated from the joint distribution, which is determined using the moments of a multivariate normal distribution. For $Y_1$ and $Y_2$, with success probabilities $p_1$ and $p_2$, the boundaries are approximated by:

(31)  $\left\{\max\left(-\sqrt{\dfrac{p_1 p_2}{q_1 q_2}},\ -\sqrt{\dfrac{q_1 q_2}{p_1 p_2}}\right),\ \min\left(\sqrt{\dfrac{p_1 q_2}{q_1 p_2}},\ \sqrt{\dfrac{q_1 p_2}{p_1 q_2}}\right)\right\},$

where $q_1 = 1 - p_1$ and $q_2 = 1 - p_2$. If either variable of an ordinal pair has more than 2 categories, randomly generated variables with the given marginal distributions and support values are used in the generate, sort, and correlate (GSC) algorithm. A large number (default 100,000) of independent random samples from the desired distributions are generated. The lower bound is the sample correlation of the two variables sorted in opposite directions (i.e., one increasing and one decreasing). The upper bound is the sample correlation of the two variables sorted in the same direction.

The GSC algorithm is also used for continuous, continuous-ordinal, ordinal-count, and continuous-count variable pairs. Since count variables are treated as "ordinal" in correlation method 2, the correlation bounds for count variable pairs are found with the GSC algorithm after creating finite supports with associated marginal distributions (with maxcount_support). The correlation bounds for count variable pairs in correlation method 1 are the Fréchet-Hoeffding bounds. For two random variables $Y_1$ and $Y_2$ with CDFs $F_1$ and $F_2$, the correlation bounds are approximated by:

(32)  $\left\{\mathrm{Cor}\left(F_1^{-1}(U),\ F_2^{-1}(1 - U)\right),\ \mathrm{Cor}\left(F_1^{-1}(U),\ F_2^{-1}(U)\right)\right\},$

where $U$ is a Uniform(0, 1) random variable of default length 100,000.
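
The GSC bounds are straightforward to sketch for any pair of distributions (here an Exponential and a Poisson variable, chosen arbitrarily): sort one sample in the same and then in the opposite direction as the other and take the sample correlations:

  set.seed(5)
  n <- 100000
  x <- rexp(n)
  y <- rpois(n, 2)
  lower <- cor(sort(x), sort(y, decreasing = TRUE))
  upper <- cor(sort(x), sort(y))
  c(lower, upper)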

Example with multiple variable types

Consider the Normal and Beta mixture variables from Section 3. Additional variables are an ordinal variable with three equally-weighted categories (e.g., drug treatment), two zero-inflated Poisson variables with means 0.5 and 1 (for the regular Poisson components) and structural zero probabilities 0.1 and 0.2, and two zero-inflated NB variables with means 0.5 and 1 (for the regular NB components), success probabilities 0.8 and 0.6, and structural zero probabilities 0.1 and 0.2. The target pairwise correlation is set at -0.5. The components of the Normal mixture variable again have a weak correlation of 0.1 and those for the Beta mixture variable are uncorrelated. The parameter inputs are first checked with validpar.

  marginal <- list(c(1/3, 2/3))
  support <- list(c(0, 1, 2))
  lam <- c(0.5, 1)
  p_zip <- c(0.1, 0.2)
  mu <- c(0.5, 1)
  prob <- c(0.8, 0.6)
  size <- prob * mu/(1 - prob)
  p_zinb <- c(0.1, 0.2)
  rho <- matrix(-0.5, 10, 10)
  rho[2:4, 2:4] <- matrix(0.1, 3, 3)
  rho[5:6, 5:6] <- matrix(0, 2, 2)
  diag(rho) <- 1
  validpar(k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2, method = "Polynomial",
    means = means, vars = vars, mix_pis = mix_pis, mix_mus = mix_mus,
    mix_sigmas = mix_sigmas, mix_skews = mix_skews, mix_skurts = mix_skurts,
    mix_fifths = mix_fifths, mix_sixths = mix_sixths, marginal = marginal,
    support = support, lam = lam, p_zip = p_zip, size = size, mu = mu,
    p_zinb = p_zinb, rho = rho)
  ## Default of pois_eps = 0.0001 will be used for Poisson variables
  ##               if using correlation method 2.
  ## Default of nb_eps = 0.0001 will be used for NB variables
  ##               if using correlation method 2.
  ## Target correlation matrix is not positive definite.
  ## [1] TRUE
  valid1 <- validcorr(10000, k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2,
    method = "Polynomial", means = means, vars = vars, mix_pis = mix_pis,
    mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
    mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
    marginal = marginal, lam = lam, p_zip = p_zip, size = size, mu = mu,
    p_zinb = p_zinb, rho = rho, use.nearPD = FALSE, quiet = TRUE)
  ## Range error! Corr[ 7 , 9 ] must be between -0.388605 and 0.944974
  ## Range error! Corr[ 7 , 10 ] must be between -0.432762 and 0.925402
  ## Range error! Corr[ 8 , 9 ] must be between -0.481863 and 0.877668
  ## Range error! Corr[ 9 , 10 ] must be between -0.386399 and 0.937699
  names(valid1)
  ## [1] "rho"              "L_rho"            "U_rho"            "constants"
  ## [5] "sixth_correction" "valid.pdf"        "valid.rho"
  valid2 <- validcorr2(10000, k_cat = 1, k_mix = 2, k_pois = 2, k_nb = 2,
    method = "Polynomial", means = means, vars = vars, mix_pis = mix_pis,
    mix_mus = mix_mus, mix_sigmas = mix_sigmas, mix_skews = mix_skews,
    mix_skurts = mix_skurts, mix_fifths = mix_fifths, mix_sixths = mix_sixths,
    marginal = marginal, lam = lam, p_zip = p_zip, size = size, mu = mu,
    p_zinb = p_zinb, rho = rho, use.nearPD = FALSE, quiet = TRUE)
  ## Range error! Corr[ 7 , 9 ] must be between -0.385727 and 0.947462
  ## Range error! Corr[ 7 , 10 ] must be between -0.428145 and 0.921001
  ## Range error! Corr[ 8 , 9 ] must be between -0.477963 and 0.879439
  ## Range error! Corr[ 9 , 10 ] must be between -0.384557 and 0.939524

The validpar function indicates that all parameter inputs have the correct format and the default cumulative probability truncation value of 0.0001 will be used in correlation method 2 for pois_eps and nb_eps. Since rho is not PD, the intermediate correlation matrix Sigma will probably also be non-PD. The user has three choices: 1) convert rho to the nearest PD matrix before simulation, 2) set use.nearPD = TRUE (default) in the simulation functions to convert Sigma to the nearest PD matrix during simulation, or 3) set use.nearPD = FALSE in the simulation functions to replace negative eigenvalues with 0. Using use.nearPD = TRUE in validcorr or validcorr2 converts rho to the nearest PD matrix before checking if all pairwise correlations fall within the feasible boundaries, whereas use.nearPD = FALSE checks the initial matrix rho. Setting quiet = TRUE keeps the non-PD message from being reprinted.

Range violations occur with the count variables. The lower and upper correlation bounds are given in the list components L_rho and U_rho. Note that these are pairwise correlation bounds. Although valid.rho will return TRUE if all elements of rho are within these bounds, this does not guarantee that the overall target correlation matrix rho can be obtained in simulation.

7 Overall workflow for generation of correlated data

The vignette Overall Workflow for Generation of Correlated Data provides a detailed step-by-step guideline for correlated data simulation with examples for corrvar and corrvar2. These steps are briefly reviewed here.

  1. Obtain the distributional parameters for the desired variables.

    1. Continuous variables: For non-mixture and components of mixture variables, these are skew, skurtosis, plus standardized fifth and sixth cumulants (for method = "Polynomial") and sixth cumulant corrections (if desired). The inputs are skews, skurts, fifths, sixths, and Six for non-mixture variables; mix_skews, mix_skurts, mix_fifths, mix_sixths, and mix_Six for components of mixture variables. If the goal is to simulate a theoretical distribution, SimMultiCorrData's calc_theory will return these values given a distribution's name and parameters (39 distributions currently available by name) or PDF function fx. If the goal is to mimic a real data set, SimMultiCorrData's calc_moments uses the method of moments or calc_fisherk uses Fisher's k-statistics given a vector of data. For mixture variables, the mixing parameters (mix_pis), component means (mix_mus), and component standard deviations (mix_sigmas) are also required. The means and variances of non-mixture and mixture variables are specified in means and vars, and these can be found using calc_mixmoments for mixture variables.

    2. Ordinal variables: The cumulative marginal probabilities in marginal and support values in support, as described in Section 5.

    3. Poisson variables: The mean values in lam and probabilities of structural zeros in p_zip (default of 0 to yield regular Poisson distributions). The mean refers to the mean of the Poisson component of the distribution. For correlation method 2, also cumulative probability truncation values in pois_eps.

    4. NB variables: The target number of successes in size, probabilities of structural zeros in p_zinb (default of 0 to yield regular NB distributions), plus means in mu or success probabilities in prob. The mean refers to the mean of the NB component of the distribution. For correlation method 2, also cumulative probability truncation values in nb_eps.

  2. Check that all parameter inputs have the correct format using validpar. Incorrect parameter specification is the most likely cause of function failure.

  3. If continuous variables are desired, verify that the skurtoses are greater than the lower skurtosis bounds using SimMultiCorrData's calc_lower_skurt. The function permits a skurtosis correction vector to aid in discovering a lower bound associated with PMT constants that yield a valid PDF. Since this step can take considerable time, the user may wish to do this at the end if any of the variables have invalid PDFs. The sixth cumulant value should be the actual sixth cumulant used in simulation, i.e., the distribution's sixth cumulant plus any necessary correction (if method = "Polynomial").

  4. Check if the target correlation matrix rho falls within the feasible correlation boundaries. The variables in rho must be ordered correctly (see Section 1).

  5. Generate the variables using either corrvar or corrvar2, with or without the error loop.

  6. Summarize the results numerically with summary_var or graphically with plot_simpdf_theory, plot_simtheory, or any of the plotting functions in SimMultiCorrData.

8 Examples comparing the two simulation pathways

Correlation methods 1 and 2 were compared to demonstrate situations when each has greater simulation accuracy. In scenario A, the ordinal (O1), Normal mixture (Nmix with components N1, N2, and N3), Beta mixture (Bmix with components B1 and B2), two zero-inflated Poisson (P1 and P2), and two zero-inflated NB (NB1 and NB2) variables from the Section 6 example were simulated. All count variables in this case had small means (less than 1). In scenario B, the two Poisson variables were replaced with two zero-inflated NB (NB3 and NB4) variables with means 50 and 100 (for the regular NB components), success probabilities 0.4 and 0.2, and structural zero probabilities 0.1 and 0.2. This yielded two count variables with small means and two with large means. The simulations were done with a sample size of n = 10,000 and r = 1,000 repetitions using three different positive correlations as given in Table 1 (chosen based on the upper correlation bounds). The correlation among N1, N2, and N3 was set at 0.1; the correlation between B1 and B2 was set at 0. The default total cumulative probability truncation value of 0.0001 was used for each count variable in corrvar2.

In scenarios A and B, the simulated correlations among the count variables were compared to the target values using boxplots generated with ggplot2’s geom_boxplot. In scenario A, the simulated correlations with the continuous mixture variables were compared to the expected correlations approximated by rho_M1M2 and rho_M1Y, with O1 as the non-mixture variable. Simulation times included simulation of the variables only with corrvar or corrvar2. Medians and interquartile ranges (IQR) were computed for the summary tables. Variable summaries are given for Nmix, Bmix, and NB1–NB4 in scenario B. This example was run on R version 3.4.1 with SimCorrMix version 0.1.0 using CentOS. The complete code is in the supplementary file for this article.

Results

Table 1 gives the three different correlations and total simulation times (1,000 repetitions) for correlation method 1 using corrvar (Time M1) and correlation method 2 using corrvar2 (Time M2). The strong correlation was different between NB variables with small means (NB1, NB2) and NB variables with large means (NB3, NB4) because the upper bounds were lower for these variable pairs.

Table 1: Six comparisons and total simulation times for method 1 (M1) and method 2 (M2) in hours. Correlation ρ* applied to the NB1–NB3, NB1–NB4, NB2–NB3, and NB2–NB4 variable pairs.

                                  Scenario A: Poisson and NB    Scenario B: NB
  Correlation Type   ρ     ρ*     Time M1    Time M2            Time M1    Time M2
  Strong             0.7   0.6    2.55       2.03               2.00       9.30
  Moderate           0.5   0.5    1.65       0.92               1.98       8.01
  Weak               0.3   0.3    1.39       0.90               1.95       5.78

The strong correlations required the most time for each correlation method. Although method 2 was faster when all count variables had small means (scenario A), it was notably slower when two of the count variables had large means (scenario B). The reason is that method 2 treats all count variables as "ordinal," which requires creating finite supports and associated marginal distributions, as described in Section 4.3. When a count variable has a large mean, there are several support values with very small probabilities, making simulation more difficult.

Scenario A: Ordinal, Normal and Beta mixtures, Poisson, and NB variables

Figure 4 contains boxplots of the simulated correlations for the continuous mixture variables. Method 1 is in red; method 2 is in green. The middle line is the median (50th percentile); the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The upper whisker extends from the hinge to the largest value up to 1.5 * IQR from the hinge. The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR from the hinge. Data beyond the end of the whiskers are considered "outliers." The black horizontal lines show the approximate expected values obtained with the functions rho_M1M2 and rho_M1Y (also given in Table 2).

Table 2: Approximate expected correlations with the continuous mixture variables.

Correlation Type   ρ     ρ(Nmix, Bmix)   ρ(Nmix, O1)   ρ(Bmix, O1)
Strong             0.7   0.1813          0.2594        0.4892
Moderate           0.5   0.1295          0.1853        0.3495
Weak               0.3   0.0777          0.1112        0.2097

Notice in Table 2 that the expected correlations are much smaller than the pairwise correlations specified between the components, which demonstrates an important consideration when setting correlations for mixture components. Even though the strong correlation between the components of Nmix and the components of Bmix was set at 0.7, the expected correlation between Nmix and Bmix was only 0.1813. Combining continuous components into one continuous mixture variable always decreases the absolute correlation between the mixture variable and other variables.
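This attenuation can be verified with a quick simulation. In the hypothetical sketch below, two normal components are each correlated 0.7 with an outside variable X; after random component assignment, the mixture variable's correlation with X is markedly smaller because the between-component variability inflates the mixture's standard deviation. All parameters here are illustrative, not those of Nmix or Bmix.

```r
# Mix two normal components, each correlated 0.7 with X, and compare correlations.
set.seed(1)
n  <- 1e5
X  <- rnorm(n)
C1 <- 0.7 * X + sqrt(1 - 0.7^2) * rnorm(n)            # component 1: N(0, 1)
C2 <- 5 + 2 * (0.7 * X + sqrt(1 - 0.7^2) * rnorm(n))  # component 2: N(5, 4)
pick <- rbinom(n, 1, 0.4)                             # mixing probabilities 0.4 / 0.6
M  <- ifelse(pick == 1, C1, C2)                       # mixture variable
round(c(cor(C1, X), cor(C2, X), cor(M, X)), 3)        # ~0.7, ~0.7, ~0.38
```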

Figure 4 shows that, as expected, the results with correlation methods 1 and 2 were similar, since the two methods differ only in how they calculate intermediate correlations involving count variables. The simulated correlations were farthest from the approximate expected values with the strong correlation and closest with the weak correlation. In the simulations with strong or moderate correlations, the intermediate correlation matrix Sigma was not PD, due to the weak correlation (0.1) among N1, N2, and N3 and the independence (zero correlation) of B1 and B2. During simulation, after Sigma is calculated with intercorr or intercorr2, an eigenvalue decomposition is done on Sigma. The square roots of the eigenvalues form a diagonal matrix, and the product of the eigenvectors, the diagonal matrix, and the transposed standard normal variables produces normal variables with the desired intermediate correlations. If Sigma is not PD and use.nearPD is set to FALSE in the simulation functions, negative eigenvalues are replaced with 0 before forming the diagonal matrix of eigenvalue square roots. If use.nearPD is set to TRUE (the default), Sigma is replaced with the nearest PD matrix using Higham's (2002) algorithm as implemented in Matrix's nearPD function. Either adjustment increases correlation errors because the resulting intermediate correlations differ from those in Sigma, and these differences grow as the maximum absolute correlation in the target matrix rho increases. In this example, the Sigma matrix had two negative eigenvalues in the strong correlation simulations and one negative eigenvalue in the moderate correlation simulations, which is why the correlation errors were largest for the strong correlation setting.
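The eigenvalue adjustment described above can be sketched in a few lines of base R. The 3 x 3 matrix below is a hypothetical non-PD "intermediate correlation" matrix, and the code mirrors only the general idea (zeroing negative eigenvalues, as when use.nearPD = FALSE), not the package's internal implementation; note how the correlations of the resulting normal variables differ from Sigma.

```r
# Hypothetical non-positive-definite intermediate correlation matrix
Sigma <- matrix(c(1,   0.9,  0.9,
                  0.9, 1,   -0.9,
                  0.9, -0.9, 1), nrow = 3)
eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- pmax(eig$values, 0)                     # replace negative eigenvalues with 0
sqrtS  <- eig$vectors %*% diag(sqrt(lambda))      # "square root" of the adjusted matrix

set.seed(1)
n <- 1e5
Z <- matrix(rnorm(n * 3), nrow = n)               # independent standard normals
X <- Z %*% t(sqrtS)                               # approximately correlated normals
round(cor(X), 3)                                  # no longer matches Sigma exactly
```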

Figure 5 shows boxplots of the simulated correlations for the count variables, with horizontal lines at the target values. These correlations were also affected by the adjusted eigenvalues, and the errors were again largest for the strong correlations. Correlation method 2 performed better in each case except for the correlation between P1 and NB1 in the strong correlation case. Barbiero and Ferrari's (2015) method of treating count variables as "ordinal" is expected to exhibit better accuracy than Yahav and Shmueli's (2012) equation when the count variables have small means (less than 1). Tables 6–8 in the Appendix provide median (IQR) correlation errors for all variables and each correlation type.


Figure 4: Boxplots of simulated correlations for continuous mixture variables (scenario A). Method 1 is in red; method 2 is in green. The horizontal lines show the approximate expected values.
Figure 5: Boxplots of simulated correlations for P1, P2, NB1, and NB2 (scenario A). Method 1 is in red; method 2 is in green. The horizontal lines show the target values.

Scenario B: Ordinal, Normal and Beta mixtures, and NB variables

Tables 3 and 4 describe the target and simulated distributions for Nmix, Bmix, and NB1–NB4 in the weak correlation case. In all instances, the simulated distributions are close to the target distributions.

Table 3: Target and median (IQR) simulated distributions of continuous mixture variables.

            Nmix                                  Bmix
            Target   Simulated                    Target   Simulated
Mean        -0.20    -0.20 (-0.20, -0.20)          0.70     0.70 (0.70, 0.70)
SD           4.48     4.48 (4.48, 4.48)            0.14     0.14 (0.14, 0.14)
Skew         0.33     0.33 (0.32, 0.33)           -0.46    -0.46 (-0.47, -0.45)
Skurtosis   -0.62    -0.62 (-0.64, -0.61)         -0.54    -0.54 (-0.56, -0.52)
Fifth       -1.02    -1.03 (-1.07, -0.98)          1.72     1.73 (1.68, 1.77)
Sixth        1.49     1.50 (1.36, 1.62)            0.56     0.54 (0.37, 0.72)
Table 4: Target and median (IQR) simulated distributions of zero-inflated NB variables.

        P[Y = 0]            E[P[Y = 0]]   Mean                    E[Mean]
NB1     0.68 (0.67, 0.68)   0.68           0.45 (0.45, 0.45)       0.45
NB2     0.57 (0.57, 0.57)   0.57           0.80 (0.80, 0.80)       0.80
NB3     0.10 (0.10, 0.10)   0.10          45.00 (44.96, 45.03)    45.00
NB4     0.20 (0.20, 0.20)   0.20          80.00 (79.90, 80.10)    80.00

        Var                          E[Var]    Median        Max
NB1        0.58 (0.58, 0.59)            0.58   0 (0, 0)        7 (6, 7)
NB2        1.49 (1.48, 1.51)            1.49   0 (0, 0)       11 (10, 12)
NB3      337.76 (335.43, 339.67)      337.50   48 (48, 48)   101 (98, 105)
NB4     2000.09 (1990.21, 2010.18)   2000.00   92 (91, 92)   204 (199, 212)

Figure 6 shows boxplots of the simulated correlations for the count variables, with horizontal lines at the target values. Method 1 performed better in all strong correlation cases except between the two NB variables with small means (NB1 and NB2). Although method 2 had smaller errors overall, it required considerably longer simulation times. Therefore, the user should consider using correlation method 1 when the data set contains count variables with large means. Tables 9–11 in the Appendix provide median (IQR) correlation errors for all variables and each correlation type.

Figure 6: Boxplots of simulated correlations for NB1, NB2, NB3, and NB4 (scenario B). Method 1 is in red; method 2 is in green. The horizontal lines show the target values.

9 Summary

The package SimCorrMix generates correlated continuous (normal, non-normal, and mixture), ordinal (r ≥ 2 categories), and count (regular or zero-inflated, Poisson or Negative Binomial) variables. It is a significant contribution to existing R simulation packages because it is the first to include continuous and count mixture variables in correlated data sets. Since SimCorrMix simulates variables which mimic real-world data sets and provides great flexibility, the package has a wide range of applications in clinical and genetic studies. The simulated data sets could be used to compare statistical methods, conduct hypothesis tests, perform bootstrapping, or calculate power. The two simulation pathways, executed by the functions corrvar and corrvar2, permit the user to accurately reproduce desired correlation matrices for different parameter ranges. Correlation method 1 should be used when the target distributions include count variables with large means; correlation method 2 is preferable when all count variables have small means. The package also provides helper functions to calculate standardized cumulants of continuous mixture variables, approximate expected correlations with continuous mixture variables, validate parameter inputs, determine feasible correlation boundaries, and summarize simulation results numerically and graphically. Future extensions of the package include adding more variable types (e.g., zero-inflated Binomial, Gaussian, and Gamma).

10 Supplementary Material

The article’s supplementary file contains replication code for the examples in the paper and in Section 8.

Acknowledgments

This research serves as part of Allison Fialkowski’s dissertation, which was made possible by grant T32HL079888 from the National Heart, Lung, and Blood Institute of the National Institutes of Health, USA, and by Dr. Hemant K. Tiwari’s William "Student" Sealy Gosset Professorship Endowment. I would like to thank my dissertation mentor, Hemant K. Tiwari, PhD; and committee members T. Mark Beasley, PhD; Charles R. Katholi, PhD; Nita A. Limdi, PhD; M. Ryan Irvin, PhD; and Nengjun Yi, PhD.

11 Appendix

Derivation of expected cumulants of continuous mixture variables

Suppose the goal is to simulate a continuous mixture variable $Y$ with PDF $h_Y(y)$ that contains two component distributions $Y_a$ and $Y_b$ with mixing parameters $\pi_a$ and $\pi_b$:
\[
h_Y(y) = \pi_a f_{Y_a}(y) + \pi_b g_{Y_b}(y), \quad y \in \boldsymbol{Y}, \quad \pi_a \in (0,\,1), \quad \pi_b \in (0,\,1), \quad \pi_a + \pi_b = 1. \tag{33}
\]
Here,
\[
Y_a = \sigma_a Z_a + \mu_a, \quad Y_a \sim f_{Y_a}(y), \; y \in \boldsymbol{Y_a} \qquad \text{and} \qquad Y_b = \sigma_b Z_b + \mu_b, \quad Y_b \sim g_{Y_b}(y), \; y \in \boldsymbol{Y_b} \tag{34}
\]
so that $Y_a$ and $Y_b$ have expected values $\mu_a$ and $\mu_b$ and variances $\sigma_a^2$ and $\sigma_b^2$. Assume the variables $Z_a$ and $Z_b$ are generated with zero mean and unit variance using Headrick’s fifth-order PMT given the specified values for skew ($\gamma_{1a}$, $\gamma_{1b}$), skurtosis ($\gamma_{2a}$, $\gamma_{2b}$), and standardized fifth ($\gamma_{3a}$, $\gamma_{3b}$) and sixth ($\gamma_{4a}$, $\gamma_{4b}$) cumulants. The $r$th expected value of $Y$ can be expressed as:
\[
E[Y^r] = \int y^r h_Y(y)\,dy = \pi_a \int y^r f_{Y_a}(y)\,dy + \pi_b \int y^r g_{Y_b}(y)\,dy = \pi_a E[Y_a^r] + \pi_b E[Y_b^r]. \tag{35}
\]
Equation (35) can be used to derive expressions for the mean, variance, skew, skurtosis, and standardized fifth and sixth cumulants of $Y$ in terms of the $r$th expected values of $Y_a$ and $Y_b$.

  1. Mean: Using $r = 1$ in Equation (35) yields $\mu$:
\[
E[Y] = \pi_a E[Y_a] + \pi_b E[Y_b] = \pi_a E[\sigma_a Z_a + \mu_a] + \pi_b E[\sigma_b Z_b + \mu_b] = \pi_a \left(\sigma_a E[Z_a] + \mu_a\right) + \pi_b \left(\sigma_b E[Z_b] + \mu_b\right). \tag{36}
\]
     Since $E[Z_a] = E[Z_b] = 0$, this becomes:
\[
E[Y] = \pi_a \mu_a + \pi_b \mu_b. \tag{37}
\]

  2. Variance: The variance of $Y$ can be expressed by the relation $\mathrm{Var}[Y] = E[Y^2] - (E[Y])^2$. Using $r = 2$ in Equation (35) yields $\mu_2$:
\[
\begin{aligned}
E[Y^2] &= \pi_a E[Y_a^2] + \pi_b E[Y_b^2] = \pi_a E[(\sigma_a Z_a + \mu_a)^2] + \pi_b E[(\sigma_b Z_b + \mu_b)^2] \\
&= \pi_a \left(\sigma_a^2 E[Z_a^2] + 2\mu_a \sigma_a E[Z_a] + \mu_a^2\right) + \pi_b \left(\sigma_b^2 E[Z_b^2] + 2\mu_b \sigma_b E[Z_b] + \mu_b^2\right).
\end{aligned} \tag{38}
\]
     Applying the variance relation to $Z_a$ and $Z_b$ gives:
\[
E[Z_a^2] = \mathrm{Var}[Z_a] + (E[Z_a])^2, \qquad E[Z_b^2] = \mathrm{Var}[Z_b] + (E[Z_b])^2. \tag{39}
\]
     Since $E[Z_a] = E[Z_b] = 0$ and $\mathrm{Var}[Z_a] = \mathrm{Var}[Z_b] = 1$, $E[Z_a^2]$ and $E[Z_b^2]$ both equal 1. Therefore, Equation (38) simplifies to:
\[
E[Y^2] = \pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right), \tag{40}
\]
     and the variance of $Y$ is given by:
\[
\mathrm{Var}[Y] = \pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right) - \left[\pi_a \mu_a + \pi_b \mu_b\right]^2. \tag{41}
\]

  3. Skew: Using $r = 3$ in Equation (35) yields $\mu_3$:
\[
\begin{aligned}
E[Y^3] &= \pi_a E[Y_a^3] + \pi_b E[Y_b^3] = \pi_a E[(\sigma_a Z_a + \mu_a)^3] + \pi_b E[(\sigma_b Z_b + \mu_b)^3] \\
&= \pi_a \left(\sigma_a^3 E[Z_a^3] + 3\sigma_a^2 \mu_a E[Z_a^2] + 3\sigma_a \mu_a^2 E[Z_a] + \mu_a^3\right) + \pi_b \left(\sigma_b^3 E[Z_b^3] + 3\sigma_b^2 \mu_b E[Z_b^2] + 3\sigma_b \mu_b^2 E[Z_b] + \mu_b^3\right).
\end{aligned} \tag{42}
\]
     Then $E[Z_a^3] = \mu_{3a}$ and $E[Z_b^3] = \mu_{3b}$ are given by:
\[
E[Z_a^3] = (\mathrm{Var}[Z_a])^{3/2}\,\gamma_{1a} = \gamma_{1a}, \qquad E[Z_b^3] = (\mathrm{Var}[Z_b])^{3/2}\,\gamma_{1b} = \gamma_{1b}. \tag{43}
\]
     Combining these with $E[Z_a] = E[Z_b] = 0$ and $E[Z_a^2] = E[Z_b^2] = 1$, Equation (42) simplifies to:
\[
E[Y^3] = \pi_a \left(\sigma_a^3 \gamma_{1a} + 3\sigma_a^2 \mu_a + \mu_a^3\right) + \pi_b \left(\sigma_b^3 \gamma_{1b} + 3\sigma_b^2 \mu_b + \mu_b^3\right). \tag{44}
\]
     From Equation (7), the skew of $Y$ is given by:
\[
\gamma_1 = \frac{\pi_a \left(\sigma_a^3 \gamma_{1a} + 3\sigma_a^2 \mu_a + \mu_a^3\right) + \pi_b \left(\sigma_b^3 \gamma_{1b} + 3\sigma_b^2 \mu_b + \mu_b^3\right)}{\left(\pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right) - \left[\pi_a \mu_a + \pi_b \mu_b\right]^2\right)^{3/2}}. \tag{45}
\]

  4. Skurtosis: Using $r = 4$ in Equation (35) yields $\mu_4$:
\[
\begin{aligned}
E[Y^4] &= \pi_a E[Y_a^4] + \pi_b E[Y_b^4] = \pi_a E[(\sigma_a Z_a + \mu_a)^4] + \pi_b E[(\sigma_b Z_b + \mu_b)^4] \\
&= \pi_a \left(\sigma_a^4 E[Z_a^4] + 4\sigma_a^3 \mu_a E[Z_a^3] + 6\sigma_a^2 \mu_a^2 E[Z_a^2] + 4\sigma_a \mu_a^3 E[Z_a] + \mu_a^4\right) \\
&\quad + \pi_b \left(\sigma_b^4 E[Z_b^4] + 4\sigma_b^3 \mu_b E[Z_b^3] + 6\sigma_b^2 \mu_b^2 E[Z_b^2] + 4\sigma_b \mu_b^3 E[Z_b] + \mu_b^4\right).
\end{aligned} \tag{46}
\]
     Then $E[Z_a^4] = \mu_{4a}$ and $E[Z_b^4] = \mu_{4b}$ are given by:
\[
E[Z_a^4] = (\mathrm{Var}[Z_a])^{2}\,(\gamma_{2a} + 3) = \gamma_{2a} + 3, \qquad E[Z_b^4] = (\mathrm{Var}[Z_b])^{2}\,(\gamma_{2b} + 3) = \gamma_{2b} + 3. \tag{47}
\]
     Since $E[Z_a] = E[Z_b] = 0$ and $E[Z_a^2] = E[Z_b^2] = 1$, Equation (46) simplifies to:
\[
E[Y^4] = \pi_a \left[\sigma_a^4 (\gamma_{2a} + 3) + 4\sigma_a^3 \mu_a \gamma_{1a} + 6\sigma_a^2 \mu_a^2 + \mu_a^4\right] + \pi_b \left[\sigma_b^4 (\gamma_{2b} + 3) + 4\sigma_b^3 \mu_b \gamma_{1b} + 6\sigma_b^2 \mu_b^2 + \mu_b^4\right]. \tag{48}
\]
     From Equation (8), the skurtosis of $Y$ is given by:
\[
\gamma_2 = \frac{\pi_a \left[\sigma_a^4 (\gamma_{2a} + 3) + 4\sigma_a^3 \mu_a \gamma_{1a} + 6\sigma_a^2 \mu_a^2 + \mu_a^4\right] + \pi_b \left[\sigma_b^4 (\gamma_{2b} + 3) + 4\sigma_b^3 \mu_b \gamma_{1b} + 6\sigma_b^2 \mu_b^2 + \mu_b^4\right]}{\left(\pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right) - \left[\pi_a \mu_a + \pi_b \mu_b\right]^2\right)^{2}}. \tag{49}
\]

  5. Standardized fifth cumulant: Using $r = 5$ in Equation (35) yields $\mu_5$:
\[
\begin{aligned}
E[Y^5] &= \pi_a E[Y_a^5] + \pi_b E[Y_b^5] = \pi_a E[(\sigma_a Z_a + \mu_a)^5] + \pi_b E[(\sigma_b Z_b + \mu_b)^5] \\
&= \pi_a \left(\sigma_a^5 E[Z_a^5] + 5\sigma_a^4 \mu_a E[Z_a^4] + 10\sigma_a^3 \mu_a^2 E[Z_a^3] + 10\sigma_a^2 \mu_a^3 E[Z_a^2] + 5\sigma_a \mu_a^4 E[Z_a] + \mu_a^5\right) \\
&\quad + \pi_b \left(\sigma_b^5 E[Z_b^5] + 5\sigma_b^4 \mu_b E[Z_b^4] + 10\sigma_b^3 \mu_b^2 E[Z_b^3] + 10\sigma_b^2 \mu_b^3 E[Z_b^2] + 5\sigma_b \mu_b^4 E[Z_b] + \mu_b^5\right).
\end{aligned} \tag{50}
\]
     Then $E[Z_a^5] = \mu_{5a}$ and $E[Z_b^5] = \mu_{5b}$ are given by:
\[
E[Z_a^5] = (\mathrm{Var}[Z_a])^{5/2}\,(\gamma_{3a} + 10\gamma_{1a}) = \gamma_{3a} + 10\gamma_{1a}, \qquad E[Z_b^5] = (\mathrm{Var}[Z_b])^{5/2}\,(\gamma_{3b} + 10\gamma_{1b}) = \gamma_{3b} + 10\gamma_{1b}. \tag{51}
\]
     Since $E[Z_a] = E[Z_b] = 0$ and $E[Z_a^2] = E[Z_b^2] = 1$, Equation (50) simplifies to:
\[
\begin{aligned}
E[Y^5] &= \pi_a \left[\sigma_a^5 (\gamma_{3a} + 10\gamma_{1a}) + 5\sigma_a^4 \mu_a (\gamma_{2a} + 3) + 10\sigma_a^3 \mu_a^2 \gamma_{1a} + 10\sigma_a^2 \mu_a^3 + \mu_a^5\right] \\
&\quad + \pi_b \left[\sigma_b^5 (\gamma_{3b} + 10\gamma_{1b}) + 5\sigma_b^4 \mu_b (\gamma_{2b} + 3) + 10\sigma_b^3 \mu_b^2 \gamma_{1b} + 10\sigma_b^2 \mu_b^3 + \mu_b^5\right].
\end{aligned} \tag{52}
\]
     From Equation (9), the standardized fifth cumulant of $Y$ is given by:
\[
\gamma_3 = \frac{\pi_a \left[\sigma_a^5 (\gamma_{3a} + 10\gamma_{1a}) + 5\sigma_a^4 \mu_a (\gamma_{2a} + 3) + 10\sigma_a^3 \mu_a^2 \gamma_{1a} + 10\sigma_a^2 \mu_a^3 + \mu_a^5\right] + \pi_b \left[\sigma_b^5 (\gamma_{3b} + 10\gamma_{1b}) + 5\sigma_b^4 \mu_b (\gamma_{2b} + 3) + 10\sigma_b^3 \mu_b^2 \gamma_{1b} + 10\sigma_b^2 \mu_b^3 + \mu_b^5\right]}{\left(\pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right) - \left[\pi_a \mu_a + \pi_b \mu_b\right]^2\right)^{5/2}} - 10\gamma_1. \tag{53}
\]

  6. Standardized sixth cumulant: Using $r = 6$ in Equation (35) yields $\mu_6$:
\[
\begin{aligned}
E[Y^6] &= \pi_a E[Y_a^6] + \pi_b E[Y_b^6] = \pi_a E[(\sigma_a Z_a + \mu_a)^6] + \pi_b E[(\sigma_b Z_b + \mu_b)^6] \\
&= \pi_a \left(\sigma_a^6 E[Z_a^6] + 6\sigma_a^5 \mu_a E[Z_a^5] + 15\sigma_a^4 \mu_a^2 E[Z_a^4] + 20\sigma_a^3 \mu_a^3 E[Z_a^3] + 15\sigma_a^2 \mu_a^4 E[Z_a^2] + 6\sigma_a \mu_a^5 E[Z_a] + \mu_a^6\right) \\
&\quad + \pi_b \left(\sigma_b^6 E[Z_b^6] + 6\sigma_b^5 \mu_b E[Z_b^5] + 15\sigma_b^4 \mu_b^2 E[Z_b^4] + 20\sigma_b^3 \mu_b^3 E[Z_b^3] + 15\sigma_b^2 \mu_b^4 E[Z_b^2] + 6\sigma_b \mu_b^5 E[Z_b] + \mu_b^6\right).
\end{aligned} \tag{54}
\]
     Then $E[Z_a^6] = \mu_{6a}$ and $E[Z_b^6] = \mu_{6b}$ are given by:
\[
E[Z_a^6] = (\mathrm{Var}[Z_a])^{3}\,(\gamma_{4a} + 15\gamma_{2a} + 10\gamma_{1a}^2 + 15) = \gamma_{4a} + 15\gamma_{2a} + 10\gamma_{1a}^2 + 15, \qquad E[Z_b^6] = \gamma_{4b} + 15\gamma_{2b} + 10\gamma_{1b}^2 + 15. \tag{55}
\]
     Since $E[Z_a] = E[Z_b] = 0$ and $E[Z_a^2] = E[Z_b^2] = 1$, Equation (54) simplifies to:
\[
\begin{aligned}
E[Y^6] &= \pi_a \left[\sigma_a^6 (\gamma_{4a} + 15\gamma_{2a} + 10\gamma_{1a}^2 + 15) + 6\sigma_a^5 \mu_a (\gamma_{3a} + 10\gamma_{1a}) + 15\sigma_a^4 \mu_a^2 (\gamma_{2a} + 3) + 20\sigma_a^3 \mu_a^3 \gamma_{1a} + 15\sigma_a^2 \mu_a^4 + \mu_a^6\right] \\
&\quad + \pi_b \left[\sigma_b^6 (\gamma_{4b} + 15\gamma_{2b} + 10\gamma_{1b}^2 + 15) + 6\sigma_b^5 \mu_b (\gamma_{3b} + 10\gamma_{1b}) + 15\sigma_b^4 \mu_b^2 (\gamma_{2b} + 3) + 20\sigma_b^3 \mu_b^3 \gamma_{1b} + 15\sigma_b^2 \mu_b^4 + \mu_b^6\right].
\end{aligned} \tag{56}
\]
     From Equation (10), the standardized sixth cumulant of $Y$ is given by:
\[
\gamma_4 = \frac{E[Y^6]}{\left(\pi_a \left(\sigma_a^2 + \mu_a^2\right) + \pi_b \left(\sigma_b^2 + \mu_b^2\right) - \left[\pi_a \mu_a + \pi_b \mu_b\right]^2\right)^{3}} - 15\gamma_2 - 10\gamma_1^2 - 15, \tag{57}
\]
     with $E[Y^6]$ as given in Equation (56).
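As a check on the simpler pieces of this derivation, the hypothetical sketch below codes Equations (35), (37), and (41) for a two-component normal mixture and compares them with Monte Carlo estimates; the component parameters are illustrative only.

```r
# Mixture mean and variance from component parameters (Equations 37 and 41),
# plus a raw-moment check of Equation (35) with r = 2.
pi_   <- c(0.4, 0.6)   # mixing parameters
mu    <- c(-2, 3)      # component means
sigma <- c(1, 2)       # component standard deviations
mix_mean <- sum(pi_ * mu)                              # Equation (37)
mix_var  <- sum(pi_ * (sigma^2 + mu^2)) - mix_mean^2   # Equations (40)-(41)

set.seed(1)
comp <- sample(1:2, 1e6, replace = TRUE, prob = pi_)
y    <- rnorm(1e6, mean = mu[comp], sd = sigma[comp])
c(mix_mean, mean(y))                        # Equation (37) vs. empirical mean
c(mix_var, var(y))                          # Equation (41) vs. empirical variance
c(sum(pi_ * (sigma^2 + mu^2)), mean(y^2))   # Equation (35) with r = 2
```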

Results from examples comparing correlation methods 1 and 2

Table 5: Table numbers for matrices of correlation errors.

Correlation Type   Scenario A: Poisson and NB   Scenario B: NB
Strong             Table 6                      Table 9
Moderate           Table 7                      Table 10
Weak               Table 8                      Table 11
Table 6: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with strong correlations in scenario A.
O1 N1 N2 N3 B1
O1 0 -0.08 (-0.083, -0.078) -0.08 (-0.082, -0.078) -0.08 (-0.082, -0.078) -0.036 (-0.039, -0.034)
N1 -0.078 (-0.08, -0.076) 0 0.153 (0.152, 0.153) 0.153 (0.152, 0.153) -0.164 (-0.164, -0.164)
N2 -0.078 (-0.08, -0.076) 0.153 (0.153, 0.153) 0 0.153 (0.152, 0.153) -0.164 (-0.164, -0.164)
N3 -0.078 (-0.081, -0.076) 0.153 (0.153, 0.153) 0.153 (0.153, 0.153) 0 -0.164 (-0.164, -0.164)
B1 -0.034 (-0.036, -0.031) -0.164 (-0.164, -0.164) -0.164 (-0.164, -0.164) -0.164 (-0.164, -0.164) 0
B2 -0.035 (-0.037, -0.033) -0.165 (-0.166, -0.165) -0.166 (-0.166, -0.165) -0.166 (-0.166, -0.165) 0.155 (0.154, 0.156)
P1 -0.033 (-0.036, -0.03) -0.157 (-0.16, -0.153) -0.156 (-0.159, -0.153) -0.156 (-0.159, -0.153) -0.123 (-0.125, -0.12)
P2 -0.018 (-0.02, -0.015) -0.133 (-0.135, -0.13) -0.133 (-0.135, -0.13) -0.133 (-0.135, -0.13) -0.097 (-0.1, -0.095)
NB1 -0.05 (-0.053, -0.047) -0.168 (-0.172, -0.165) -0.168 (-0.171, -0.165) -0.168 (-0.171, -0.165) -0.137 (-0.14, -0.134)
NB2 -0.028 (-0.031, -0.025) -0.156 (-0.16, -0.153) -0.156 (-0.159, -0.153) -0.156 (-0.159, -0.153) -0.125 (-0.128, -0.122)
B2 P1 P2 NB1 NB2
O1 -0.038 (-0.04, -0.035) -0.023 (-0.025, -0.02) 0.003 (0, 0.005) -0.053 (-0.056, -0.05) -0.043 (-0.046, -0.04)
N1 -0.166 (-0.167, -0.165) -0.156 (-0.159, -0.153) -0.128 (-0.131, -0.126) -0.166 (-0.169, -0.163) -0.153 (-0.156, -0.15)
N2 -0.166 (-0.167, -0.165) -0.156 (-0.158, -0.153) -0.128 (-0.131, -0.126) -0.166 (-0.17, -0.163) -0.153 (-0.156, -0.15)
N3 -0.166 (-0.167, -0.165) -0.156 (-0.159, -0.153) -0.128 (-0.131, -0.126) -0.166 (-0.17, -0.163) -0.153 (-0.156, -0.15)
B1 0.154 (0.153, 0.155) -0.123 (-0.126, -0.12) -0.093 (-0.096, -0.091) -0.135 (-0.138, -0.132) -0.121 (-0.124, -0.118)
B2 0 -0.156 (-0.159, -0.154) -0.121 (-0.123, -0.118) -0.174 (-0.177, -0.171) -0.157 (-0.16, -0.155)
P1 -0.156 (-0.159, -0.153) 0 0.029 (0.025, 0.031) 0.013 (0.009, 0.016) 0.035 (0.032, 0.038)
P2 -0.124 (-0.126, -0.122) 0.013 (0.01, 0.016) 0 0.036 (0.033, 0.039) 0.014 (0.01, 0.017)
NB1 -0.175 (-0.178, -0.172) 0.022 (0.018, 0.026) 0.011 (0.008, 0.015) 0 0.044 (0.04, 0.048)
NB2 -0.161 (-0.164, -0.158) 0.02 (0.016, 0.024) 0.012 (0.008, 0.015) 0.022 (0.019, 0.027) 0
Table 7: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with moderate correlations in scenario A.
O1 N1 N2 N3 B1
O1 0 -0.021 (-0.023, -0.018) -0.021 (-0.023, -0.018) -0.021 (-0.023, -0.018) 0 (-0.003, 0.003)
N1 -0.019 (-0.022, -0.017) 0 0.049 (0.049, 0.05) 0.049 (0.049, 0.05) -0.035 (-0.035, -0.035)
N2 -0.019 (-0.022, -0.016) 0.051 (0.051, 0.051) 0 0.049 (0.049, 0.05) -0.035 (-0.035, -0.035)
N3 -0.019 (-0.022, -0.017) 0.051 (0.051, 0.051) 0.051 (0.051, 0.051) 0 -0.035 (-0.035, -0.035)
B1 -0.001 (-0.003, 0.002) -0.034 (-0.034, -0.034) -0.034 (-0.034, -0.033) -0.034 (-0.034, -0.033) 0
B2 -0.001 (-0.003, 0.002) -0.034 (-0.035, -0.033) -0.034 (-0.035, -0.033) -0.034 (-0.035, -0.033) 0.016 (0.015, 0.017)
P1 -0.002 (-0.005, 0.001) -0.041 (-0.045, -0.038) -0.041 (-0.044, -0.038) -0.041 (-0.044, -0.038) -0.009 (-0.012, -0.006)
P2 -0.001 (-0.005, 0.002) -0.034 (-0.037, -0.032) -0.034 (-0.037, -0.032) -0.034 (-0.037, -0.032) -0.006 (-0.009, -0.003)
NB1 -0.004 (-0.007, 0) -0.044 (-0.048, -0.041) -0.045 (-0.048, -0.041) -0.044 (-0.048, -0.041) -0.011 (-0.014, -0.008)
NB2 -0.003 (-0.006, 0) -0.041 (-0.044, -0.037) -0.041 (-0.044, -0.037) -0.041 (-0.044, -0.038) -0.01 (-0.012, -0.007)
B2 P1 P2 NB1 NB2
O1 0.001 (-0.002, 0.003) 0 (-0.004, 0.004) 0.005 (0.001, 0.008) -0.009 (-0.013, -0.006) -0.007 (-0.011, -0.004)
N1 -0.035 (-0.036, -0.034) -0.038 (-0.041, -0.035) -0.031 (-0.034, -0.028) -0.04 (-0.043, -0.037) -0.037 (-0.04, -0.034)
N2 -0.035 (-0.036, -0.034) -0.038 (-0.041, -0.035) -0.031 (-0.034, -0.029) -0.04 (-0.044, -0.037) -0.037 (-0.04, -0.033)
N3 -0.035 (-0.036, -0.034) -0.038 (-0.041, -0.035) -0.031 (-0.034, -0.028) -0.041 (-0.044, -0.037) -0.037 (-0.04, -0.034)
B1 0.013 (0.012, 0.014) -0.005 (-0.008, -0.002) -0.003 (-0.006, 0) -0.006 (-0.01, -0.003) -0.005 (-0.008, -0.002)
B2 0 -0.027 (-0.029, -0.024) -0.019 (-0.022, -0.017) -0.033 (-0.035, -0.03) -0.029 (-0.031, -0.027)
P1 -0.03 (-0.033, -0.028) 0 0.02 (0.016, 0.025) 0.014 (0.008, 0.019) 0.028 (0.023, 0.033)
P2 -0.022 (-0.024, -0.019) 0.004 (0, 0.009) 0 0.033 (0.028, 0.037) 0.013 (0.008, 0.017)
NB1 -0.037 (-0.04, -0.034) 0.008 (0.003, 0.013) 0.004 (0, 0.01) 0 0.037 (0.032, 0.042)
NB2 -0.033 (-0.036, -0.03) 0.007 (0.002, 0.012) 0.004 (0, 0.009) 0.008 (0.002, 0.013) 0
Table 8: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with weak correlations in scenario A.
O1 N1 N2 N3 B1
O1 0 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003)
N1 0 (-0.003, 0.003) 0 0 (0, 0) 0 (0, 0) 0 (0, 0)
N2 0 (-0.003, 0.003) 0 (0, 0) 0 0 (0, 0) 0 (0, 0)
N3 0 (-0.003, 0.003) 0 (0, 0) 0 (0, 0) 0 0 (0, 0)
B1 0 (-0.003, 0.003) 0 (0, 0) 0 (0, 0) 0 (0, 0) 0
B2 0 (-0.003, 0.004) 0 (-0.001, 0.001) 0 (-0.001, 0.001) 0 (-0.001, 0.001) 0 (-0.001, 0.001)
P1 0 (-0.004, 0.004) 0 (-0.004, 0.004) 0 (-0.004, 0.003) 0 (-0.004, 0.004) -0.001 (-0.004, 0.002)
P2 0 (-0.004, 0.004) 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003)
NB1 -0.001 (-0.005, 0.003) 0 (-0.004, 0.004) 0 (-0.004, 0.004) 0 (-0.004, 0.004) -0.001 (-0.005, 0.002)
NB2 -0.001 (-0.005, 0.003) 0 (-0.004, 0.003) 0 (-0.003, 0.003) 0 (-0.004, 0.003) -0.002 (-0.005, 0.002)
B2 P1 P2 NB1 NB2
O1 0 (-0.003, 0.003) 0 (-0.004, 0.004) 0 (-0.004, 0.004) -0.002 (-0.006, 0.002) -0.002 (-0.005, 0.002)
N1 0 (-0.001, 0.001) 0 (-0.004, 0.004) 0 (-0.003, 0.003) 0 (-0.004, 0.004) 0 (-0.004, 0.004)
N2 0 (-0.001, 0.001) 0 (-0.003, 0.004) 0 (-0.003, 0.003) 0 (-0.004, 0.004) 0 (-0.003, 0.004)
N3 0 (-0.001, 0.001) 0 (-0.004, 0.004) 0 (-0.003, 0.003) 0 (-0.004, 0.004) 0 (-0.004, 0.003)
B1 0 (-0.001, 0.001) -0.001 (-0.004, 0.003) -0.001 (-0.004, 0.002) -0.001 (-0.005, 0.002) -0.001 (-0.005, 0.002)
B2 0 -0.009 (-0.012, -0.006) -0.006 (-0.009, -0.003) -0.011 (-0.014, -0.008) -0.01 (-0.013, -0.006)
P1 -0.009 (-0.012, -0.006) 0 0.014 (0.008, 0.018) 0.016 (0.01, 0.022) 0.021 (0.016, 0.027)
P2 -0.006 (-0.009, -0.004) 0 (-0.006, 0.004) 0 0.022 (0.017, 0.027) 0.01 (0.004, 0.015)
NB1 -0.011 (-0.014, -0.008) 0 (-0.006, 0.006) -0.001 (-0.006, 0.005) 0 0.028 (0.022, 0.035)
NB2 -0.01 (-0.013, -0.006) 0 (-0.006, 0.006) 0 (-0.006, 0.005) 0 (-0.006, 0.006) 0
Table 9: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with strong correlations in scenario B.
O1 N1 N2 N3 B1
O1 0 -0.095 (-0.097, -0.092) -0.095 (-0.097, -0.092) -0.095 (-0.097, -0.093) -0.049 (-0.051, -0.046)
N1 -0.094 (-0.096, -0.092) 0 0.141 (0.141, 0.141) 0.141 (0.141, 0.141) -0.172 (-0.172, -0.172)
N2 -0.094 (-0.096, -0.092) 0.141 (0.141, 0.141) 0 0.141 (0.141, 0.141) -0.172 (-0.172, -0.171)
N3 -0.094 (-0.096, -0.092) 0.141 (0.141, 0.141) 0.141 (0.141, 0.141) 0 -0.172 (-0.172, -0.172)
B1 -0.048 (-0.05, -0.046) -0.172 (-0.172, -0.171) -0.172 (-0.172, -0.171) -0.172 (-0.172, -0.171) 0
B2 -0.049 (-0.051, -0.047) -0.173 (-0.174, -0.172) -0.173 (-0.174, -0.172) -0.173 (-0.174, -0.172) 0.14 (0.139, 0.141)
NB1 -0.056 (-0.059, -0.053) -0.161 (-0.164, -0.158) -0.16 (-0.164, -0.157) -0.161 (-0.164, -0.158) -0.129 (-0.132, -0.126)
NB2 -0.033 (-0.036, -0.03) -0.151 (-0.154, -0.148) -0.151 (-0.154, -0.148) -0.151 (-0.154, -0.148) -0.118 (-0.121, -0.115)
NB3 -0.001 (-0.003, 0) -0.094 (-0.096, -0.092) -0.094 (-0.096, -0.092) -0.094 (-0.096, -0.092) -0.052 (-0.054, -0.051)
NB4 0.001 (-0.002, 0.003) -0.099 (-0.101, -0.098) -0.1 (-0.101, -0.097) -0.1 (-0.102, -0.098) -0.056 (-0.057, -0.054)
B2 NB1 NB2 NB3 NB4
O1 -0.05 (-0.052, -0.048) -0.05 (-0.053, -0.047) -0.04 (-0.043, -0.037) -0.011 (-0.013, -0.009) 0.022 (0.02, 0.024)
N1 -0.173 (-0.174, -0.172) -0.158 (-0.161, -0.155) -0.148 (-0.151, -0.145) -0.094 (-0.096, -0.092) -0.095 (-0.097, -0.093)
N2 -0.173 (-0.174, -0.172) -0.158 (-0.161, -0.155) -0.148 (-0.151, -0.145) -0.094 (-0.096, -0.092) -0.095 (-0.097, -0.093)
N3 -0.173 (-0.174, -0.172) -0.158 (-0.162, -0.155) -0.148 (-0.151, -0.145) -0.094 (-0.096, -0.092) -0.095 (-0.097, -0.093)
B1 0.14 (0.138, 0.141) -0.126 (-0.129, -0.124) -0.115 (-0.118, -0.112) -0.052 (-0.054, -0.05) -0.051 (-0.053, -0.05)
B2 0 -0.166 (-0.169, -0.163) -0.151 (-0.154, -0.149) -0.042 (-0.044, -0.039) -0.045 (-0.047, -0.042)
NB1 -0.168 (-0.171, -0.165) 0 0.047 (0.043, 0.05) -0.015 (-0.017, -0.013) -0.008 (-0.009, -0.006)
NB2 -0.155 (-0.157, -0.152) 0.027 (0.023, 0.031) 0 -0.009 (-0.011, -0.007) -0.001 (-0.003, 0.001)
NB3 -0.042 (-0.045, -0.04) -0.018 (-0.02, -0.016) -0.014 (-0.016, -0.012) 0 0.001 (-0.001, 0.004)
NB4 -0.049 (-0.051, -0.046) -0.014 (-0.016, -0.012) -0.013 (-0.015, -0.011) 0.003 (0, 0.005) 0
Table 10: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with moderate correlations in scenario B.
O1 N1 N2 N3 B1
O1 0 -0.02 (-0.023, -0.017) -0.02 (-0.023, -0.017) -0.02 (-0.023, -0.017) 0.002 (-0.001, 0.004)
N1 -0.019 (-0.022, -0.017) 0 0.038 (0.038, 0.038) 0.038 (0.038, 0.038) -0.043 (-0.043, -0.042)
N2 -0.02 (-0.022, -0.017) 0.039 (0.039, 0.039) 0 0.038 (0.038, 0.038) -0.043 (-0.043, -0.042)
N3 -0.019 (-0.022, -0.017) 0.039 (0.039, 0.039) 0.039 (0.039, 0.039) 0 -0.043 (-0.043, -0.042)
B1 0.001 (-0.001, 0.004) -0.042 (-0.042, -0.042) -0.042 (-0.042, -0.042) -0.042 (-0.042, -0.042) 0
B2 0.002 (-0.001, 0.004) -0.042 (-0.043, -0.041) -0.042 (-0.043, -0.041) -0.042 (-0.043, -0.041) 0.017 (0.016, 0.018)
NB1 0 (-0.004, 0.003) -0.029 (-0.033, -0.025) -0.029 (-0.033, -0.026) -0.029 (-0.032, -0.026) -0.001 (-0.004, 0.003)
NB2 0 (-0.003, 0.003) -0.028 (-0.031, -0.024) -0.027 (-0.03, -0.024) -0.027 (-0.03, -0.025) -0.001 (-0.004, 0.002)
NB3 0 (-0.003, 0.003) -0.015 (-0.017, -0.013) -0.015 (-0.017, -0.013) -0.015 (-0.017, -0.013) -0.001 (-0.003, 0.001)
NB4 0 (-0.003, 0.003) -0.016 (-0.018, -0.014) -0.016 (-0.018, -0.014) -0.016 (-0.018, -0.014) 0 (-0.002, 0.002)
B2 NB1 NB2 NB3 NB4
O1 0.002 (-0.001, 0.005) -0.008 (-0.011, -0.004) -0.006 (-0.009, -0.003) -0.002 (-0.005, 0.001) 0.008 (0.004, 0.011)
N1 -0.043 (-0.044, -0.042) -0.027 (-0.031, -0.024) -0.026 (-0.029, -0.022) -0.015 (-0.017, -0.013) -0.015 (-0.017, -0.012)
N2 -0.043 (-0.044, -0.042) -0.027 (-0.031, -0.024) -0.025 (-0.029, -0.022) -0.015 (-0.017, -0.013) -0.014 (-0.017, -0.012)
N3 -0.043 (-0.044, -0.042) -0.027 (-0.031, -0.024) -0.026 (-0.029, -0.023) -0.015 (-0.017, -0.013) -0.015 (-0.017, -0.012)
B1 0.018 (0.017, 0.019) 0 (-0.004, 0.003) -0.001 (-0.004, 0.002) -0.001 (-0.003, 0.001) -0.001 (-0.003, 0.002)
B2 0 -0.028 (-0.031, -0.024) -0.025 (-0.027, -0.022) 0.005 (0.003, 0.008) 0.003 (0.001, 0.006)
NB1 -0.027 (-0.031, -0.025) 0 0.034 (0.029, 0.04) 0.002 (0, 0.005) 0.014 (0.012, 0.017)
NB2 -0.025 (-0.027, -0.022) 0.004 (-0.002, 0.01) 0 0.005 (0.002, 0.007) 0.013 (0.01, 0.015)
NB3 0.005 (0.003, 0.008) -0.001 (-0.003, 0.001) -0.001 (-0.003, 0.001) 0 -0.002 (-0.006, 0.001)
NB4 0.005 (0.002, 0.007) -0.001 (-0.004, 0.001) -0.001 (-0.003, 0.002) 0 (-0.003, 0.004) 0
Table 11: Median (IQR) of correlation errors using correlation methods 1 (in black) and 2 (in blue) with weak correlations in scenario B.
O1 N1 N2 N3 B1
O1 0 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003) 0 (-0.003, 0.003)
N1 0 (-0.003, 0.003) 0 0 (0, 0) 0 (0, 0) 0 (0, 0)
N2 0 (-0.003, 0.003) 0 (0, 0) 0 0 (0, 0) 0 (0, 0)
N3 -0.001 (-0.003, 0.003) 0 (0, 0) 0 (0, 0) 0 0 (0, 0)
B1 0 (-0.003, 0.003) 0 (0, 0) 0 (0, 0) 0 (0, 0) 0
B2 0 (-0.003, 0.003) 0 (-0.001, 0.001) 0 (-0.001, 0.001) 0 (-0.001, 0.001) 0 (-0.001, 0.001)
NB1 -0.001 (-0.005, 0.003) 0 (-0.004, 0.004) 0 (-0.004, 0.004) 0 (-0.004, 0.004) -0.002 (-0.005, 0.002)
NB2 -0.001 (-0.005, 0.003) 0 (-0.004, 0.003) 0 (-0.003, 0.003) 0 (-0.004, 0.003) -0.002 (-0.005, 0.002)
NB3 0 (-0.004, 0.003) 0 (-0.002, 0.002) 0 (-0.002, 0.002) 0 (-0.002, 0.002) 0 (-0.002, 0.003)
NB4 0.001 (-0.003, 0.004) 0 (-0.002, 0.002) 0 (-0.002, 0.003) 0 (-0.002, 0.002) 0.001 (-0.002, 0.003)
B2 NB1 NB2 NB3 NB4
O1 0 (-0.003, 0.003) -0.002 (-0.006, 0.002) -0.002 (-0.005, 0.003) 0 (-0.004, 0.003) 0.001 (-0.002, 0.005)
N1 0 (-0.001, 0.001) 0 (-0.003, 0.004) 0 (-0.003, 0.003) 0 (-0.002, 0.002) 0 (-0.002, 0.002)
N2 0 (-0.001, 0.001) 0 (-0.004, 0.004) 0 (-0.004, 0.003) 0 (-0.002, 0.002) 0 (-0.002, 0.002)
N3 0 (-0.001, 0.001) 0 (-0.004, 0.004) 0 (-0.004, 0.003) 0 (-0.002, 0.002) 0 (-0.002, 0.002)
B1 0 (-0.001, 0.001) -0.002 (-0.005, 0.002) -0.002 (-0.005, 0.002) 0 (-0.002, 0.002) 0 (-0.002, 0.003)
B2 0 -0.011 (-0.014, -0.007) -0.01 (-0.012, -0.007) 0.002 (0, 0.005) 0.002 (-0.001, 0.005)
NB1 -0.011 (-0.014, -0.007) 0 0.028 (0.021, 0.034) 0.003 (0, 0.006) 0.012 (0.009, 0.015)
NB2 -0.01 (-0.013, -0.007) 0 (-0.006, 0.006) 0 0.004 (0.001, 0.007) 0.008 (0.006, 0.012)
NB3 0.003 (0, 0.005) -0.001 (-0.004, 0.003) 0 (-0.003, 0.003) 0 -0.002 (-0.005, 0.002)
NB4 0.002 (-0.001, 0.005) 0 (-0.004, 0.003) 0 (-0.003, 0.003) 0.001 (-0.003, 0.004) 0

CRAN packages used

AdaptGauss, DPP, bgmm, ClusterR, mclust, mixture, AdMit, bimixt, bmixture, CAMAN, flexmix, mixdist, mixtools, nspmix, MixtureInf, Rmixmod, hurdlr, zic, mixpack, distr, stats, rebmix, SimCorrMix, SimMultiCorrData, GenOrd, VGAM, Matrix, ggplot2, mvtnorm

CRAN Task Views implied by cited packages

Bayesian, Cluster, Distributions, Econometrics, Environmetrics, ExtremeValue, Finance, MissingData, NumericalMathematics, Optimization, Phylogenetics, Psychometrics, Robust, Spatial, Survival, TeachingStatistics



    References

    A. Amatya and H. Demirtas. Simultaneous generation of multivariate mixed data with Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85(15): 3129–3139, 2015. URL https://doi.org/10.1080/00949655.2014.953534.
    D. Ardia. AdMit: Adaptive mixture of student-t distributions. 2017. URL https://CRAN.R-project.org/package=AdMit. R package version 2.1.3.
    L. M. Avila, M. R. May and J. Ross-Ibarra. DPP: Inference of parameters of normal distributions from a mixture of normals. 2017. URL https://CRAN.R-project.org/package=DPP. R package version 0.1.1.
    A. A. Baghban, A. Pourhoseingholi, F. Zayeri, A. A. Jafari and S. M. Alavian. Application of zero-inflated Poisson mixed models in prognostic factors of Hepatitis C. BioMed Research International, 2013. URL https://doi.org/10.1155/2013/403151.
    O. G. Bahcall. Complex traits: Genetic discovery, heritability and prediction. Nature Reviews Genetics, 16(257): 2015. URL https://doi.org/10.1038/nrg3947.
    E. Balderama and T. Trippe. Hurdlr: Zero-inflated and hurdle modelling using bayesian inference. 2017. URL https://CRAN.R-project.org/package=hurdlr. R package version 0.1.
    A. Barbiero and P. A. Ferrari. GenOrd: Simulation of discrete random variables with given correlation matrix and marginal distributions. 2015a. URL https://CRAN.R-project.org/package=GenOrd. R package version 1.4.0.
    A. Barbiero and P. A. Ferrari. Simulation of correlated Poisson variables. Applied Stochastic Models in Business and Industry, 31: 669–680, 2015b. URL https://doi.org/10.1002/asmb.2072.
    D. Bates and M. Maechler. Matrix: Sparse and dense matrix classes and methods. 2017. URL https://CRAN.R-project.org/package=Matrix. R package version 1.2-12.
    M. Bhattacharjee, M. S. Rajeevan and M. J. Sillanpää. Prediction of complex human diseases from pathway-focused candidate markers by joint estimation of marker effects: Case of chronic fatigue syndrome. Human Genomics, 9(1): 8, 2015. URL https://doi.org/10.1186/s40246-015-0030-6.
    P. Biecek and E. Szczurek. Bgmm: Gaussian mixture modeling algorithms and the belief-based mixture modeling. 2017. URL https://CRAN.R-project.org/package=bgmm. R package version 1.8.3.
    N. Bouguila, D. Ziou and E. Monga. Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statistics and Computing, 16: 215–225, 2006. URL https://doi.org/10.1007/s11222-006-8451-7.
    R. P. Browne, A. ElSherbiny and P. D. McNicholas. Mixture: Mixture models for clustering and classification. 2015. URL https://CRAN.R-project.org/package=mixture. R package version 1.4.
    M. Comas-Cufí, J. A. Martín-Fernández and G. Mateu-Figueras. Mixpack: Tools to work with mixture components. 2017. URL https://CRAN.R-project.org/package=mixpack. R package version 0.3.6.
    H. Dai and R. Charnigo. Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data. Biostatistics (Oxford, England), 16(4): 641–654, 2015. URL http://doi.org/10.1093/biostatistics/kxv016.
    X. Dai, T. Erkkilä, O. Yli-Harja and H. Lähdesmäki. A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data. BMC Bioinformatics, 10(1): 165, 2009. URL https://doi.org/10.1186/1471-2105-10-165.
    J. Davenport, J. Bezder and R. Hathaway. Parameter estimation for finite mixture distributions. Computers & Mathematics with Applications, 15(10): 819–828, 1988.
    H. Demirtas and D. Hedeker. A practical way for computing approximate lower and upper correlation bounds. The American Statistician, 65(2): 104–109, 2011. URL https://doi.org/10.1198/tast.2011.10090.
    H. Demirtas, D. Hedeker and R. J. Mermelstein. Simulation of massive public health data by power polynomials. Statistics in Medicine, 31(27): 3337–3346, 2012. URL https://doi.org/10.1002/sim.5362.
    R. C. Elston, J. M. Olson and L. Palmer. Biostatistical genetics and genetic epidemiology. Hoboken, New Jersey: John Wiley & Sons, 2002.
    L. J. Emrich and M. R. Piedmonte. A method for generating high-dimensional multivariate binary variates. The American Statistician, 45: 302–304, 1991. URL https://doi.org/10.1080/00031305.1991.10475828.
    B. S. Everitt. An introduction to finite mixture distributions. Statistical Methods in Medical Research, 5(2): 107–127, 1996. URL https://doi.org/10.1177/096228029600500202.
    P. A. Ferrari and A. Barbiero. Simulating ordinal data. Multivariate Behavioral Research, 47(4): 566–589, 2012. URL https://doi.org/10.1080/00273171.2012.692630.
    A. C. Fialkowski. SimMultiCorrData: Simulation of correlated data with multiple variable types. 2017. URL https://CRAN.R-project.org/package=SimMultiCorrData. R package version 0.2.1.
    A. C. Fialkowski and H. K. Tiwari. SimMultiCorrData: An R package for simulation of correlated non-normal or normal, binary, ordinal, poisson, and negative binomial variables. Manuscript submitted for publication, 2017.
    R. A. Fisher. Moments and product moments of sampling distributions. Proceedings of the London Mathematical Society Series 2, 30: 199–238, 1929.
    A. I. Fleishman. A method for simulating non-normal distributions. Psychometrika, 43: 521–532, 1978. URL https://doi.org/10.1007/BF02293811.
    C. Fraley, A. E. Raftery and L. Scrucca. Mclust: Gaussian mixture modelling for model-based clustering, classification, and density estimation. 2017. URL https://CRAN.R-project.org/package=mclust. R package version 5.4.
    M. Fréchet. Les tableaux de corrélation et les programmes linéaires. Revue de L’Institut International de Statistique / Review of the International Statistical Institute, 25(1/3): 23–40, 1957. URL https://doi.org/10.2307/1401672.
    B. L. Fridley, D. Serie, G. Jenkins, K. White, W. Bamlet, J. D. Potter and E. L. Goode. Bayesian mixture models for the incorporation of prior knowledge to inform genetic association studies. Genetic Epidemiology, 34(5): 418–426, 2010. URL https://doi.org/10.1002/gepi.20494.
    R. Fu, D. K. Dey and K. E. Holsinger. A beta-mixture model for assessing genetic population structure. Biometrics, 67(3): 1073–1082, 2011. URL http://www.jstor.org/stable/41242556.
    A. Genz and F. Bretz. Computation of multivariate normal and t probabilities. Heidelberg: Springer-Verlag, 2009. URL https://doi.org/10.1007/978-3-642-01689-9.
    A. Genz, F. Bretz, T. Miwa, X. Mi and T. Hothorn. Mvtnorm: Multivariate normal and t distributions. 2017. URL https://CRAN.R-project.org/package=mvtnorm. R package version 1.0-6.
    B. Gruen and F. Leisch. Flexmix: Flexible mixture modeling. 2017. URL https://CRAN.R-project.org/package=flexmix. R package version 2.3-14.
    D. B. Hall. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics, 56(4): 1030–1039, 2000. URL https://doi.org/10.1111/j.0006-341X.2000.01030.x.
    H. He, W. Tang, W. Wang and P. Crits-Christoph. Structural zeroes and zero-inflated models. Shanghai Archives of Psychiatry, 26(4): 236–242, 2014. URL https://doi.org/10.3969/j.issn.1002-0829.2014.04.008.
    T. C. Headrick. Fast fifth-order polynomial transforms for generating univariate and multivariate non-normal distributions. Computational Statistics & Data Analysis, 40(4): 685–711, 2002. URL https://doi.org/10.1016/S0167-9473(02)00072-5.
    T. C. Headrick and R. K. Kowalchuk. The power method transformation: Its probability density function, distribution function, and its further use for fitting data. Journal of Statistical Computation and Simulation, 77: 229–249, 2007. URL https://doi.org/10.1080/10629360600605065.
    T. C. Headrick and S. S. Sawilowsky. Simulating correlated non-normal distributions: Extending the Fleishman power method. Psychometrika, 64: 25–35, 1999. URL https://doi.org/10.1007/BF02294317.
    N. Higham. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis, 22(3): 329–343, 2002. URL https://doi.org/10.1093/imanum/22.3.329.
    W. Hoeffding. Scale-invariant correlation theory. In The collected works of wassily hoeffding, Eds N. I. Fisher and P. K. Sen pages. 57–107 1994. New York: Springer-Verlag. URL https://doi.org/10.1007/978-1-4612-0865-5_4.
    N. Ismail and H. Zamani. Estimation of claim count data using negative binomial, generalized Poisson, zero-inflated negative binomial and zero-inflated generalized Poisson regression models. Casualty Actuarial Society E-Forum, 41(20): 1–28, 2013.
    Y. Ji, C. Wu, P. Liu, J. Wang and K. R. Coombes. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21(9): 2118–2122, 2005. URL http://dx.doi.org/10.1093/bioinformatics/bti318.
    M. Jochmann. Zic: Bayesian inference for zero-inflated count models. 2017. URL https://CRAN.R-project.org/package=zic. R package version 0.9.1.
    M. Kendall and A. Stuart. The advanced theory of statistics. 4th ed New York: Macmillan, 1977.
    M. Kohl. Distr: Object oriented implementation of distributions. 2017. URL https://CRAN.R-project.org/package=distr. R package version 2.6.2.
    D. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1): 1–14, 1992.
    F. Langrognet, R. Lebret, C. Poli and S. Iovleff. Rmixmod: Supervised, unsupervised, semi-supervised classification with MIXture MODelling (interface of MIXMOD software). 2016. URL https://CRAN.R-project.org/package=Rmixmod. R package version 2.1.1.
    M. G. Larson and G. E. Dinse. A mixture model for the regression analysis of competing risks data. Journal of the Royal Statistical Society. Series C (Applied Statistics), 34(3): 201–211, 1985. URL http://www.jstor.org/stable/2347464.
    B. Lau, S. R. Cole and S. J. Gange. Competing risk regression models for epidemiologic data. American Journal of Epidemiology, 170(2): 244–256, 2009. URL http://dx.doi.org/10.1093/aje/kwp107.
    B. Lau, S. R. Cole and S. J. Gange. Parametric mixture models to evaluate and summarize hazard ratios in the presence of competing risks with time-dependent hazards and delayed entry. Statistics in Medicine, 30(6): 654–665, 2011. URL http://dx.doi.org/10.1002/sim.4123.
    K. Laurila, B. Oster, C. L. Andersen, P. Lamy, T. Orntoft, O. Yli-Harja and C. Wiuf. A beta-mixture model for dimensionality reduction, sample classification and analysis. BMC Bioinformatics, 12(1): 215, 2011. URL https://doi.org/10.1186/1471-2105-12-215.
    R. R. J. Lewine. Sex differences in schizophrenia: Timing or subtypes? Psychological Bulletin, 90: 432–444, 1981.
    S. Li, J. Chen and P. Li. MixtureInf: Inference for finite mixture models. 2016. URL https://CRAN.R-project.org/package=MixtureInf. R package version 1.1.
    Z. Ma and A. Leijon. Bayesian estimation of beta mixture models with variational inference. IEEE Trans Pattern Anal Mach Intell, 33(11): 2160–2173, 2011. URL https://doi.org/10.1109/TPAMI.2011.63.
    P. MacDonald and with contributions from Juan Du. Mixdist: Finite mixture distribution models. 2012. URL https://CRAN.R-project.org/package=mixdist. R package version 0.5-4.
    G. J. McLachlan. Cluster analysis and related techniques in medical research. Statistical Methods in Medical Research, 1(1): 27–48, 1992. URL https://doi.org/10.1177/096228029200100103.
    A. Mohammadi. Bmixture: Bayesian estimation for finite mixture of distributions. 2017. URL https://CRAN.R-project.org/package=bmixture. R package version 0.5.
    L. Mouselimis. ClusterR: Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering. 2017. URL https://CRAN.R-project.org/package=ClusterR. R package version 1.0.9.
    M. Nagode. Rebmix: Finite mixture modeling, clustering & classification. 2017. URL https://CRAN.R-project.org/package=rebmix. R package version 2.9.3.
    S. R. Newcomer, J. F. Steiner and E. A. Bayliss. Identifying subgroups of complex patients with cluster analysis. The American Journal of Managed Care, 17(8): e324–32, 2011.
    U. Olsson, F. Drasgow and N. J. Dorans. The polyserial correlation coefficient. Psychometrika, 47(3): 337–347, 1982. URL https://doi.org/10.1007/BF02294164.
    L. Pamulaparty, C. V. G. Rao and M. S. Rao. Cluster analysis of medical research data using R. Global Journal of Computer Science and Technology, 16(1): 1–6, 2016.
    R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2017. URL https://www.R-project.org/.
    P. Schlattmann, J. Hoehne and M. Verba. CAMAN: Finite mixture models and meta-analysis tools - based on c.a.MAN. 2016. URL https://CRAN.R-project.org/package=CAMAN. R package version 0.74.
    N. J. Schork, D. B. Allison and B. Thiel. Mixture distributions in human genetics research. Statistical Methods in Medical Research, 5: 155–178, 1996. URL https://doi.org/10.1177/096228029600500204.
    P. C. Sham, C. J. MacLean and K. S. Kendler. A typological model of schizophrenia based on age at onset, sex and familial morbidity. Acta Psychiatrica Scandinavica, 89(2): 135–141, 1994. URL http://dx.doi.org/10.1111/j.1600-0447.1994.tb01501.x.
    D. L. Solomon. Using RNA-seq data to detect differentially expressed genes. In Statistical analysis of next generation sequencing data, Eds S. Datta and D. Nettleton pages. 25–49 2014. Springer-Verlag.
    C. Soneson and M. Delorenzi. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14: 91, 2013. URL https://doi.org/10.1186/1471-2105-14-91.
    A. E. Teschendorff, F. Marabita, M. Lechner, T. Bartlett, J. Tegner, D. Gomez-Cabrero and S. Beck. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics, 29(2): 189–196, 2013. URL https://doi.org/10.1093/bioinformatics/bts680.
    M. Thrun, O. Hansen-Goos, R. Griese, C. Lippmann, F. Lerch, J. Lotsch and A. Ultsch. AdaptGauss: Gaussian mixture models (GMM). 2017. URL https://CRAN.R-project.org/package=AdaptGauss. R package version 1.3.3.
    L. K. Vaughan, J. Divers, M. Padilla, D. T. Redden, H. K. Tiwari, D. Pomp and D. B. Allison. The use of plasmodes as a supplement to simulations: A simple example evaluating individual admixture estimation methodologies. Computational Statistics & Data Analysis, 53(5): 1755–1766, 2009. URL https://doi.org/10.1016/j.csda.2008.02.032.
    Y. Wang. Nspmix: Nonparametric and semiparametric mixture estimation. 2017. URL https://CRAN.R-project.org/package=nspmix. R package version 1.4-0.
    J. Welham, G. Mclachlan, G. Davies and J. McGrath. Heterogeneity in schizophrenia; mixture modelling of age-at-first-admission, gender and diagnosis. Acta Psychiatrica Scandinavica, 101(4): 312–317, 2000. URL http://dx.doi.org/10.1034/j.1600-0447.2000.101004312.x.
    H. Wickham and W. Chang. Ggplot2: Create elegant data visualisations using the grammar of graphics. 2016. URL https://CRAN.R-project.org/package=ggplot2. R package version 2.2.1.
    M. Winerip, G. Wallstrom and J. LaBaer. Bimixt: Estimates mixture models for case-control data. 2015. URL https://CRAN.R-project.org/package=bimixt. R package version 1.0.
    I. Yahav and G. Shmueli. On generating multivariate Poisson data in management science applications. Applied Stochastic Models in Business and Industry, 28(1): 91–102, 2012. URL https://doi.org/10.1002/asmb.901.
    T. W. Yee. VGAM: Vector generalized linear and additive models. 2017. URL https://CRAN.R-project.org/package=VGAM. R package version 1.0-4.
    N. Yi. BhGLM: Bayesian hierarchical GLMs and survival models, with application to genetics and epidemiology. 2017. URL http://www.ssg.uab.edu/bhglm/. R package version 1.1.0.
    D. Young, T. Benaglia, D. Chauveau and D. Hunter. Mixtools: Tools for analyzing finite mixture models. 2017. URL https://CRAN.R-project.org/package=mixtools. R package version 1.1.0.
    X. Zhang, H. Mallick and N. Yi. Zero-inflated negative binomial regression for differential abundance testing in microbiome studies. Journal of Bioinformatics and Genomics, 2(2): 1–9, 2016. URL https://doi.org/10.18454/jbg.2016.2.2.1.
    Y. Zhou, X. Wan, B. Zhang and T. Tong. Classifying next-generation sequencing data using a zero-inflated poisson model. Bioinformatics, btx768, 2017. URL https://doi.org/10.1093/bioinformatics/btx768.

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Fialkowski & Tiwari, "SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions", The R Journal, 2019

    BibTeX citation

    @article{RJ-2019-022,
      author = {Fialkowski, Allison and Tiwari, Hemant},
      title = {SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions},
      journal = {The R Journal},
      year = {2019},
      note = {https://rjournal.github.io/},
      volume = {11},
      issue = {1},
      issn = {2073-4859},
      pages = {250-286}
    }