Requiring no analytical forms, nonparametric discrete patterns are flexible in representing complex relationships among random variables. This makes them increasingly useful for data-driven applications. However, there appear to be no software tools for simulating nonparametric discrete patterns, which prevents objective evaluation of statistical methods that discover discrete relationships from data. We present a simulator that generates nonparametric discrete functions as contingency tables. Users can request strictly many-to-one functional patterns. The simulator can also produce contingency tables representing dependent non-functional and independent relationships. An option is provided to apply random noise to contingency tables. We demonstrate the utility of the simulator by showing the advantage of the FunChisq test over Pearson’s chi-square test in detecting functional patterns. This simulator, implemented in the function simulate_tables in the R package FunChisq (version 2.4.0 or greater), offers an important means to evaluate the performance of nonparametric statistical pattern discovery methods.
The demand for pattern discovery methodologies has grown as massive
automatic data collection takes place in every application domain.
Consequently, an increasingly critical task in data science is to design
effective algorithms to recognize complex meaningful patterns from data.
Nonparametric discrete patterns do not require an analytical form,
allowing them to flexibly represent functional and associative
relationships among random variables. This makes them appealing for
data-driven capture of complex relationships from big data sources.
Indeed, many classical statistical methods can detect associative
patterns between discrete random variables, including the widely used
Pearson’s chi-square test (Pearson 1900), Fisher’s exact
test (Fisher 1922),
However, these methods are not specifically designed for detecting
functional patterns, where a dependent variable is a mathematical
function of other independent variables. Functional relationships are
considered powerful indicators of causality (Simon and Rescher 1966). Recent
methods for detecting discrete functional patterns such as the
functional chi-square test (FunChisq) (Zhang and Song 2013) have
demonstrated their effectiveness in identifying causal molecular
interactions in human breast cancer from both real and simulated protein
abundance data (Hill et al. 2016). To evaluate such methods, software tools
that can randomly generate diverse functional patterns are necessary.
Such computer programs are unavailable as far as we are aware, in spite
of several R functions for generating other types of random contingency
tables. The function r2dtable in the base package stats creates
random two-way tables with given marginals using Patefield’s algorithm
(Patefield 1981) under product-multinomial sampling. The
package rTableICC
(Demirhan 2016) produces random
To address this gap, we present a new simulator to generate discrete
functional patterns. This simulator is publicly available and
implemented as the R function simulate_tables within the R package FunChisq
(
The simulator randomly samples contingency tables with specified
functional and statistical patterns and also has an option to apply
additional noise on the tables. Let
We define four pattern types formed by
Functional—
Many-to-one—
Dependent non-functional—
Independent—
For all functional patterns, we do not consider the special case where
Pattern | Is | Is | Is |
---|---|---|---|
Functional | True | True or False | False |
Many-to-one | True | False | False |
Dependent non-functional | False | False | False |
Independent | False | False | True |
The simulator will generate three related tables in order: a noise-free
sampled contingency table, a pattern table, and a noisy contingency
table. All tables are
The sampled contingency table—With given sample size
The binary pattern table—This table is created by setting all non-zero entries in the sampled contingency table to 1. Thus values in the pattern table are either 0 or 1. The table strictly satisfies the mathematical relationship for a given pattern type requested by the user, but it does not meet the statistical requirements. It can be used as the ground truth or gold standard for benchmarking how well pattern discovery algorithms can uncover the mathematical relationships.
The noisy contingency table—At a user-specified noise level, this table is the noisy version of the sampled contingency table. Due to the added noise, this table may no longer strictly satisfy the required functional or statistical relationships. This table is the main output to be used for the evaluation of a discrete pattern discovery algorithm.
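The relationship between the sampled table and the binary pattern table can be illustrated with a minimal base R sketch. The counts below are made-up values for illustration, not output of the simulator.

```r
# A small noise-free sampled table (illustrative counts, not simulator output).
sampled <- matrix(c(0, 5, 0,
                    7, 0, 0,
                    0, 8, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("R", 1:3), paste0("C", 1:3)))

# Binary pattern table: every non-zero count becomes 1, zeros stay 0.
pattern <- (sampled > 0) * 1

print(pattern)
```

The pattern table keeps only where the samples fall, which is why it can serve as ground truth for the mathematical relationship while carrying no information about the marginal distributions.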
Figure 1 illustrates the four pattern types by sampled
contingency tables generated using the simulator. The tables are rotated
so that the horizontal axis represents the row variable
Next, we describe how to generate the sampled contingency tables for
each type of discrete pattern. All pattern types require the common
input of sample size
Functional patterns can be used to model causal relationships that are
either linear or nonlinear. A contingency table reduces the burden of
assuming a parametric form for the function. In a functional table
representing
Input: row marginal probability function
Output: a non-constant functional table.
Randomly generate the row sums
For each row
Convert the function from constant to non-constant: If the function
is constant—all nonzero values are on the same column
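The three steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used in FunChisq: the helper name sketch_functional_table is hypothetical, columns are chosen uniformly at random, and every row is given one sample before the remaining samples are distributed according to the row marginal.

```r
# Hypothetical helper, not the FunChisq implementation.
sketch_functional_table <- function(n, nrow, ncol,
                                    row.marginal = rep(1 / nrow, nrow)) {
  stopifnot(n >= nrow, nrow >= 2, ncol >= 2, length(row.marginal) == nrow)

  # Step 1: row sums. Assumption: give every row one sample, then spread the
  # remaining n - nrow samples multinomially according to row.marginal.
  row_sums <- 1 + as.vector(rmultinom(1, size = n - nrow, prob = row.marginal))

  # Step 2: each row sends all of its samples to one randomly chosen column.
  cols <- sample.int(ncol, nrow, replace = TRUE)

  # Step 3: if every row landed in the same column, the function is constant;
  # move one randomly chosen row to a different column.
  if (length(unique(cols)) == 1) {
    i <- sample.int(nrow, 1)
    others <- setdiff(seq_len(ncol), cols[i])
    cols[i] <- others[sample.int(length(others), 1)]
  }

  tab <- matrix(0, nrow, ncol)
  tab[cbind(seq_len(nrow), cols)] <- row_sums
  tab
}

set.seed(42)
tab <- sketch_functional_table(n = 50, nrow = 4, ncol = 3)
print(tab)
```

Each row of the result has exactly one non-zero entry, so the column variable is a function of the row variable, and at least two columns are used, so the function is not constant.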
Many-to-one functions are special cases of functional relationships. They are increasingly relevant as they can expose complex patterns from data of large sample sizes. Being able to generate such patterns will facilitate the evaluation of complex pattern discovery methods. The following two main steps generate this type of pattern:
Input: row marginal probability function
Output: a strictly many-to-one and non-constant functional table.
Generate a non-constant functional table using Steps 1 to 3 above.
Convert the function from one-to-one to many-to-one: If the function
is one-to-one—every column of the table has at most one non-zero
entry, do the following. Randomly pick a row
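The conversion from a one-to-one to a many-to-one function can be sketched as below. The helper sketch_many_to_one is hypothetical, and the rule used here (move all samples of one random row into the column used by another random row) is an assumption made for illustration; the simulator may choose rows and columns differently.

```r
# Hypothetical helper: turn a one-to-one functional table into a many-to-one
# table. Assumes each row has exactly one non-zero entry and at least 3 rows.
sketch_many_to_one <- function(tab) {
  stopifnot(nrow(tab) >= 3, all(rowSums(tab > 0) == 1))
  cols <- apply(tab > 0, 1, which)        # column used by each row

  # One-to-one means no column is shared by two rows.
  if (!any(duplicated(cols))) {
    rows <- sample.int(nrow(tab), 2)      # two distinct rows
    i <- rows[1]
    j <- rows[2]
    # Move all samples of row i into the column used by row j.
    tab[i, cols[j]] <- tab[i, cols[i]]
    tab[i, cols[i]] <- 0
  }
  tab
}

set.seed(7)
one_to_one <- diag(c(10, 20, 30))
many_to_one <- sketch_many_to_one(one_to_one)
print(many_to_one)
```

With at least three rows, the remaining rows still use a different column, so the resulting function stays non-constant while at least two rows now map to the same column.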
In non-functional dependent relationships between
Lemma 1. Let
and
Proof. As we are given
Based on Lemma 1, we design Algorithm 1 to break the statistical
independence. Let
Algorithm 1. Convert an independent non-functional table to a dependent non-functional table.
Input: an independent non-functional table
Output: a dependent non-functional table
Randomly pick an entry
Randomly select column
Move all
Theorem 1. An empirically statistically independent non-functional contingency table can be converted to a dependent non-functional contingency table by Algorithm 1. The row marginal probability function remains the same after the conversion.
Proof. Table 2 illustrates that any independent
non-functional table has at least two rows and two columns with non-zero
entries. Non-functionality guarantees at least one row (row
In the input table to Algorithm 1, entry
Also by the property shown in Table 2, no matter where
As the samples are moved within the same row
Therefore, we conclude that the new table must be dependent non-functional. Q.E.D.
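The idea behind Algorithm 1 and Theorem 1 can be checked numerically with a small base R sketch. It makes simplifying assumptions that are not part of the algorithm as stated: the helper sketch_break_independence is hypothetical, the input table has only positive entries, and the source entry and target column are drawn uniformly at random.

```r
# Hypothetical helper illustrating the idea of Algorithm 1 under simplifying
# assumptions: all entries of the input table are positive and the table is
# exactly proportional to the product of its marginals.
sketch_break_independence <- function(tab) {
  stopifnot(nrow(tab) >= 2, ncol(tab) >= 2, all(tab > 0))
  i <- sample.int(nrow(tab), 1)           # row of the entry to move
  cols <- sample.int(ncol(tab), 2)        # source column j and target column k
  j <- cols[1]
  k <- cols[2]
  tab[i, k] <- tab[i, k] + tab[i, j]      # move all samples within row i
  tab[i, j] <- 0
  tab
}

set.seed(3)
indep <- outer(c(2, 4, 6), c(1, 2, 3))    # empirically independent table
dep <- sketch_break_independence(indep)

# Expected counts under independence, computed from the new marginals.
expected <- outer(rowSums(dep), colSums(dep)) / sum(dep)
print(dep)
print(max(abs(dep - expected)))           # positive: no longer independent
```

Because samples only move within a row, the row sums are unchanged, in line with the statement of Theorem 1 about the row marginal.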
Such tables are only possible when
Input: row marginal probability function
Output: a dependent non-functional table.
Randomly generate the row sums
For each row
If
If
If empirically
When
Noise is common in real data and can arise at any stage from data preparation through the machinery involved in data acquisition. In statistical inference, noise must be handled to reduce type I and type II errors. Our simulator therefore includes a noise option so that the generated contingency tables resemble those constructed from real data sets and can be used to test the robustness of a method.
By specifying the noise level parameter in the simulate_tables
function, one can apply noise to a contingency table. We use the
discrete house noise model (Zhang et al. 2015) that is controlled by the
We implemented the model as an R function add.house.noise to apply
noise to an input contingency table and return a noisy contingency table
as the output. The syntax of the function is as follows:
add.house.noise(tables, u, margin = 1)
where tables is a list of input contingency tables and u is the
noise level between 0 and 1. The noise can be applied along rows
(margin=1), columns (margin=2), or both rows and columns
(margin=0). In simulate_tables, the noise is always applied along
the rows, which can be interpreted as applying noise to the dependent
variable simulate_tables
.
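The effect of the noise level can be inspected through simulate_tables itself. The sketch below assumes the FunChisq package is installed; the table size, sample size, and noise level are arbitrary values chosen for illustration.

```r
library(FunChisq)

set.seed(123)
# Same pattern type and size at two noise levels (illustrative values).
clean <- simulate_tables(n = 100, nrow = 4, ncol = 4,
                         type = "functional", noise = 0)
noisy <- simulate_tables(n = 100, nrow = 4, ncol = 4,
                         type = "functional", noise = 0.3)

# Sampled and noisy tables at noise level 0.
print(clean$sample.list[[1]])
print(clean$noise.list[[1]])

# Sampled and noisy tables at noise level 0.3: counts are redistributed,
# while the total sample size stays the same.
print(noisy$sample.list[[1]])
print(noisy$noise.list[[1]])
```

Comparing sample.list with noise.list for the same call shows how far the noisy table departs from the strict functional pattern at a given noise level.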
To confirm that each table type generated by the simulator indeed has the
required characteristics, we compared two hypothesis testing methods:
fun.chisq.test (Zhang and Song 2013) in the R package FunChisq
and chisq.test in the R package stats.
We first simulated 1000 random tables for each of the four types at the noise level of 0.1. The numbers of rows and columns of the tables were randomly selected from 3 to 10. The sample size of each table was randomly drawn from 10 to 1000, subject to being at least the table size.
Next, we applied the two tests to the tables. The log
In Fig. 2c, the FunChisq test was performed first on noisy
many-to-one functional patterns and second on the transposed table.
There is an increase in the
In Fig. 2d the log
These results confirm that our simulator was indeed able to generate functional, many-to-one, dependent non-functional, and independent patterns.
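A scaled-down version of this comparison can be sketched as follows, assuming the FunChisq package is installed. The number of tables, table size, and sample size are much smaller than in the evaluation above and are chosen only for illustration; the helper trim is hypothetical and drops empty rows and columns so that both tests receive tables with positive marginals.

```r
library(FunChisq)

set.seed(2017)
sim <- simulate_tables(n = 200, nrow = 4, ncol = 4, type = "many.to.one",
                       noise = 0.1, n.tables = 20)

# Hypothetical helper: drop empty rows and columns before testing.
trim <- function(tab) tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]

# FunChisq test in the simulated direction (rows as the independent variable).
p_fun <- sapply(sim$noise.list,
                function(tab) fun.chisq.test(trim(tab))$p.value)
# FunChisq test on the transposed table (reverse direction).
p_fun_t <- sapply(sim$noise.list,
                  function(tab) fun.chisq.test(t(trim(tab)))$p.value)
# Pearson's chi-square test, which does not depend on direction.
p_chi <- sapply(sim$noise.list,
                function(tab) suppressWarnings(chisq.test(trim(tab))$p.value))

print(c(funchisq = median(p_fun),
        funchisq_transposed = median(p_fun_t),
        pearson = median(p_chi)))
```

Comparing the p-values in the two directions is one way to see whether a test is sensitive to the direction of a many-to-one relationship.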
Figure 2: panels a, b, c, and d.
We further benchmarked the runtime of the simulator over four table
types, four sample sizes (100, 500, 1000, and 10000), and two noise
levels (0 and 0.5) at a fixed table size of 5
Figure: panels a, b, c, d, e, f, g, and h.
Here we demonstrate the usage of the simulator by providing examples of
all four pattern types including the description of important
parameters. The R package FunChisq (version 2.4.0 or greater) provides the function simulate_tables
that implements the simulator. The package is publicly
available from the Comprehensive R Archive Network (CRAN). The package
can be installed and loaded by
install.packages("FunChisq")
library("FunChisq")
The signature of the function simulate_tables is given below:
simulate_tables(n=100, nrow=3, ncol=3, type=c("functional", "many.to.one",
"independent", "dependent.non.functional"), noise=0.0, n.tables=1,
row.marginal=rep(1/nrow, nrow), col.marginal=rep(1/ncol, ncol))
The arguments are
n: an integer specifying the sample size to be distributed in the table. For "functional" and "many.to.one" tables, n must be no less than nrow. For "independent" tables, n must be no less than nrow*ncol. For "dependent.non.functional" tables, n must be greater than nrow.
nrow: an integer specifying the number of rows of output tables. The value must be no less than 2. For "many.to.one" tables, nrow must be no less than 3.
ncol: an integer specifying the number of columns of output tables. The value must be no less than 2.
type: a character string specifying the type of pattern underlying the table. The options are "functional" (default), "many.to.one", "independent", and "dependent.non.functional".
noise: a numeric value between 0 and 1 specifying the factor of noise to be added to the table using the house noise model (Zhang et al. 2015). The house noise is applied along the rows of the table. See add.house.noise for details.
n.tables: an integer specifying the number of tables to be generated.
row.marginal: a numeric vector of length at least 2 specifying row marginal probabilities. For "many.to.one" tables, the length of the row.marginal vector must be no less than 3.
col.marginal: a numeric vector specifying column marginal probabilities. It is only applicable in generating independent tables.
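The constraints listed above can be collected into a small checking function. The helper check_sim_args below is hypothetical and not part of FunChisq; it only restates the documented rules so that a parameter combination can be screened before a simulation is set up.

```r
# Hypothetical helper that mirrors the documented constraints; not part of
# FunChisq. Returns TRUE when the combination of arguments is admissible.
check_sim_args <- function(n, nrow, ncol, type) {
  min_nrow <- if (type == "many.to.one") 3 else 2
  n_ok <- switch(type,
    "functional" = ,
    "many.to.one" = n >= nrow,             # at least one sample per row
    "independent" = n >= nrow * ncol,      # at least one sample per cell
    "dependent.non.functional" = n > nrow, # strictly more samples than rows
    stop("unknown type: ", type))
  nrow >= min_nrow && ncol >= 2 && n_ok
}

print(check_sim_args(100, 3, 3, "functional"))
print(check_sim_args(10, 4, 5, "independent"))
print(check_sim_args(100, 2, 3, "many.to.one"))
```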
The return value of the function simulate_tables is a list containing the following components:
pattern.list: a list of tables containing 0-1 binary patterns. Each table is created by setting all non-zero entries in the corresponding sampled contingency table from sample.list to 1. Each table strictly satisfies the functional relationship for the pattern type requested but does not meet the statistical requirements. As each table represents the truth regarding the mathematical relationship between the row and column variables, it can be used as the ground truth or gold standard for benchmarking.
sample.list: a list of tables satisfying both the functional and statistical requirements. These tables are noise free.
noise.list: a list of tables after applying noise to the corresponding tables in sample.list. Each table is the noisy version of the sampled contingency table. Due to the added noise, each table may no longer strictly satisfy the required functional or statistical relationships. These tables are the main output to be used for the evaluation of a discrete pattern discovery algorithm.
pvalue.list: a list of
Example 1. A functional table. A scientist working at a weather
forecasting agency has devised a novel method that can be trained for
each geographical area to predict the amount of rainfall in that area
given a month. While the agency is busy collecting real data for the
last 10 years, the scientist wants to test the method and train it
to learn the following hypothesis: ‘The amount of rainfall in an
area is a function of time, expressed in months’. Using a different
probability for each month (row.marginal), the scientist
can simulate contingency tables with
simulate_tables(nrow=12, ncol=3, type='functional', n=100, n.tables=1,
row.marginal = c(0.116, 0.109, 0.049, 0.142, 0.083,
0.070, 0.140, 0.151, 0.037, 0.032, 0.050, 0.015), noise=0.1)
## Output:
## pattern.table sampled.table noise.table
## C1 C2 C3 C1 C2 C3 C1 C2 C3
## R1 0 1 0 R1 0 5 0 R1 0 5 0
## R2 0 0 1 R2 0 0 17 R2 0 1 16
## R3 0 0 1 R3 0 0 3 R3 0 0 3
## R4 0 1 0 R4 0 15 0 R4 0 14 1
## R5 0 1 0 R5 0 3 0 R5 1 2 0
## R6 0 0 1 R6 0 0 5 R6 1 0 4
## R7 1 0 0 R7 22 0 0 R7 20 1 1
## R8 0 1 0 R8 0 15 0 R8 1 13 1
## R9 0 0 1 R9 0 0 4 R9 0 1 3
## R10 1 0 0 R10 5 0 0 R10 5 0 0
## R11 0 0 1 R11 0 0 3 R11 0 0 3
## R12 0 0 1 R12 0 0 3 R12 0 0 3
The scientist can also tune parameters and generate 100 or more such tables representing data for 100 or more years.
Example 2. A many-to-one functional table. This type of table is a
special case of functional table. It can be used to model the
relationship between the effectiveness
simulate_tables(nrow=4, ncol=5, type='many.to.one', n=100, n.tables=1, noise=0.5)
## Output:
## pattern.table sampled.table noise.table
## C1 C2 C3 C4 C5 C1 C2 C3 C4 C5 C1 C2 C3 C4 C5
## R1 0 0 0 1 0 R1 0 0 0 19 0 R1 2 4 0 11 2
## R2 0 1 0 0 0 R2 0 32 0 0 0 R2 5 12 7 3 5
## R3 0 1 0 0 0 R3 0 31 0 0 0 R3 5 10 4 7 5
## R4 0 0 1 0 0 R4 0 0 18 0 0 R4 2 3 7 4 2
Example 3. An independent table. This type of table is useful for
generating a null distribution or for bootstrapping. To measure the
statistical significance or probability of getting a certain empirical
score reported by a method, one needs to randomly sample from an
independent and identically distributed population. The following code
generates 4
simulate_tables(nrow=4, ncol=5, type='independent', n=100, n.tables=1, noise=0.0,
row.marginal=c(0.1,0.3,0.2,0.4), col.marginal=c(0.3,0.2,0.1,0.3,0.1))
## Output:
## pattern.table sampled.table noise.table
## C1 C2 C3 C4 C5 C1 C2 C3 C4 C5 C1 C2 C3 C4 C5
## R1 1 1 1 1 1 R1 4 3 2 4 1 R1 4 3 2 4 1
## R2 1 1 1 1 1 R2 8 4 1 5 3 R2 8 4 1 5 3
## R3 1 1 1 1 1 R3 6 3 1 6 5 R3 6 3 1 6 5
## R4 1 1 1 1 1 R4 16 8 5 7 8 R4 16 8 5 7 8
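How independent tables can serve as a null distribution is sketched below in base R. The sketch does not call simulate_tables: it draws tables directly from the product of the two marginals used in Example 3, which is an assumption made to keep the example self-contained. The value of observed_stat is hypothetical.

```r
set.seed(11)
row.marginal <- c(0.1, 0.3, 0.2, 0.4)
col.marginal <- c(0.3, 0.2, 0.1, 0.3, 0.1)
# Under independence, each cell probability is the product of its marginals.
cell_prob <- outer(row.marginal, col.marginal)

# Pearson chi-square statistic computed by hand; cells with zero expected
# count are skipped so empty rows or columns cause no error.
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  keep <- expected > 0
  sum((tab[keep] - expected[keep])^2 / expected[keep])
}

# Draw 500 independent tables of sample size 100 and collect their statistics.
null_stats <- replicate(500, {
  tab <- matrix(rmultinom(1, size = 100, prob = cell_prob), nrow = 4)
  chisq_stat(tab)
})

# Empirical p-value of a hypothetical observed statistic against the null.
observed_stat <- 25
empirical_p <- mean(null_stats >= observed_stat)
print(empirical_p)
```

The same recipe applies to any score reported by a pattern discovery method: replace chisq_stat with the score of interest and compare the observed value with the scores obtained on independent tables.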
Example 4. A dependent non-functional table. Two genes
simulate_tables(nrow=3, ncol=3, type='dependent.non.functional', n=100, n.tables=1,
row.marginal=c(0.3,0.5,0.2), noise=0.1)
## Output:
## pattern.table sampled.table noise.table
## C1 C2 C3 C1 C2 C3 C1 C2 C3
## R1 1 1 0 R1 12 11 0 R1 8 13 2
## R2 0 0 1 R2 0 0 49 R2 3 1 45
## R3 1 1 1 R3 7 10 11 R3 9 9 10
We have just introduced a new simulator which conveniently generates
random noisy
Although other methods for constructing random contingency tables are available (Demirhan 2016), the generation of functional tables is innovative. Having diverse functional tables meets a need to detect causal relationships from functional dependencies without using a parametric form.
For practical reasons, if the row marginal probability of a row is non-zero, we generate at least one sample for that row. Currently the simulator uses row marginal probabilities to generate dependent contingency tables; in the future we may provide an option to use column marginal probabilities instead, to match experimental designs where the distribution of the effect variable is predefined, such as case-control studies of cancer.
The generated dependent non-functional patterns may span a wide range of
statistical dependency from being very weak to very strong. The
Beyond generating tables of bivariate patterns between the row and column variables, one may consider the row variable as a combination of multiple discrete variables. Therefore one can immediately extend the procedures to generate contingency tables representing multivariate patterns.
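The multivariate reading of the row variable can be made concrete with a short base R sketch. The two parent variables X1 and X2, their numbers of levels, and the counts are illustrative assumptions; the point is only that a table whose rows enumerate all parent combinations can be reshaped into a multi-way array.

```r
# Two parent variables X1 (2 levels) and X2 (3 levels) combined into one
# row variable with 6 levels; X1 varies fastest.
parents <- expand.grid(X1 = 1:2, X2 = 1:3)

# Illustrative functional table: 6 combined rows, 3 levels of Y.
tab <- matrix(0, nrow = nrow(parents), ncol = 3)
y_of_row <- c(1, 2, 2, 3, 1, 3)            # assumed value of Y for each row
tab[cbind(seq_len(nrow(tab)), y_of_row)] <- c(5, 8, 6, 9, 4, 7)

# Reshape to a 3-way array indexed by (X1, X2, Y).
arr <- array(tab, dim = c(2, 3, 3))

# Counts of Y for X1 = 2 and X2 = 1, which is row 2 of the two-way table.
print(arr[2, 1, ])
print(tab[2, ])
```

Because R stores matrices in column-major order, the reshape keeps every row of the two-way table aligned with its pair of parent levels.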
In summary, we described algorithms and implemented a simulator to construct contingency tables of desired mathematical and statistical properties, and illustrated the use of this function with several examples. We validated the generation of all table types by the FunChisq and Pearson’s chi-square tests. We evaluated the runtime of the function in generating various noisy patterns. This function offers a previously overlooked utility to generate diverse functional patterns to evaluate discrete pattern discovery methods increasingly important in data science research.
This work is supported by US NSF Advances in Biological Informatics Grant DBI 1661331.
For attribution, please cite this work as
Sharma, et al., "Simulating Noisy, Nonparametric, and Multivariate Discrete Patterns", The R Journal, 2017
BibTeX citation
@article{RJ-2017-053, author = {Sharma, Ruby and Kumar, Sajal and Zhong, Hua and Song, Mingzhou}, title = {Simulating Noisy, Nonparametric, and Multivariate Discrete Patterns}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {2}, issn = {2073-4859}, pages = {366-377} }