Decoupled (e.g. separate averages) and censored (e.g. cnbinom.pars()
function estimates the average and dispersion parameter of a censored univariate frequency table. The rec()
function reverse engineers summarized data into an uncensored bivariate table of probabilities.
The revengc R package
was originally developed to help model building occupancy (Stewart et al. 2016).
Household size and area of residential structures are typically found in
any given national census. If a census revealed the raw data or provided
a full uncensored contingency table (household size
A tool that approximates negative binomial parameters from a censored univariate frequency table as well as estimates interior cells of a contingency table governed by negative binomial and/or Poisson marginals can also be useful for other areas ranging from demographic and epidemiological data to ecological inference problems. For example, population and community ecologists could unpack censored organism counts or average life expectancy values. Moreover, other summarized examples include the average number of births, the number of new disease cases, the number of mutations in a gene or average mutation rate, etc. We attempt to accommodate for various application of count data by offering five scenarios that can be reverse engineered:
cnbinom.pars()
- An univariate frequency table estimates an
average and dispersion parameter
rec()
- Decoupled averages estimates an uncensored contingency
table of probabilities
rec()
- Decoupled frequency tables estimates an uncensored
contingency table of probabilities
rec()
- An average and frequency table estimates an uncensored
contingency table of probabilities
rec()
- A censored contingency table estimates an uncensored
contingency table of probabilities
This paper proceeds with our reverse engineering methodology for the two
main functions, cnbinom.pars()
and rec()
. We provide an in-depth
analysis of how we implemented both the negative binomial and Poisson
distribution as well as the
truncdist
(Nadarajah and Kotz 2006; Novomestky and Nadarajah 2016) and
mipfp R package
(Barthélemy and Suesse 2018a,b). Since the revengc package has specific
input requirements for both cnbinom.pars()
and rec()
, we continue
with an explanation of how to format the input tables properly. We then
provide coded examples that implements revengc on national census data
(household size and area) and end with concluding remarks.
cnbinom.pars()
The methodology for the cnbinom.pars()
function is relatively
straightforward. To estimate an average
rec()
rec()
is a statistical approach that estimates the probabilities of a
‘true’ contingency table given summarized information: two averages, two
univariate frequency tables, a combination of an average and univariate
frequency table, and a censored contingency table. Figure 1
presents a methodology workflow.
When only an average is provided, we assume the average and variance are
equal and rely on a Poisson distribution (i.e. the probability of
observing cnbinom.pars()
function estimates the optimal average
rec()
function.With the negative binomial (rec()
calculates truncated distributions to represent
uncensored row (Xlowerbound:Xupperbound
) and column
(Ylowerbound:Yupperbound
) margins. Calculations use the
truncdist R package,
and to provide a reference, Equation (4) gives the probability
density function of a truncated
The rec()
outputs a final contingency table with
higher probabilities near the edge(s) of the table, then it would make
sense to increase the range of the bound(s). For both variables, this
would just involve making the lower bound less, making the upper bound
more, or doing a combination of the two. The opposite holds true as
well. If the final contingency table in rec()
has very low
probabilities near the edge(s) of the table, the range of the particular
bound(s) should be decreased.
rec()
utilizing the
mipfp R package to
calculate cross tabulation probability estimates, and mipfp requires
fixed marginals, a seed estimation method, and a seed matrix. The row
and column marginals are uncensored truncated distributions (see
3.3 section) while opportunities for sensitivity
analysis are presented with the seed estimation method and seed matrix.
For example, mipfp offers four seed estimation methods
(Table 1); the default method in rec()
is the iterative
proportional fitting procedure. Although the algorithms vary, they all
adjust cell proportions
Method | Calculate |
---|---|
Iterative proportional fitting procedure - ipfp | Minimizing |
Maximum likelihood method - ml | Maximizing |
Minimum chi-squared - chi2 | Minimizing |
Weighted least squares - lsq | Minimizing |
The seed matrix input can be arbitrary, but rec()
provides reasonable
defaults. For the decoupled cases (two averages, two tables, or a
combination of a table and average), the absence of additional
information makes it difficult to say much about the joint distribution.
Therefore, rec()
assumes independence between the variables, which is
equivalent in making the rec()
converts the matrix of ones to probabilities. When a censored
contingency table is provided, independence does not have to be assumed
and the interior cells can be weighted. rec()
creates the default seed
matrix by first repeating probability cells, which correspond to the
censored contingency table, for the newly created and compatible
uncensored cross tabulations. Just as in the decoupled cases, the cell
values in this matrix are also changed to probabilities. To see an
example of the default seed for a censored contingency table, see
6 section.
The cnbinom.pars()
function has the following format:
cnbinom.pars(censoredtable)
where censoredtable
is a frequency table (censored and/or uncensored).
A data.frame and matrix are acceptable classes. See
5 section for formatting. The output is a list
consisting of an estimated average
The rec()
function has the following format:
rec(X, Y, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound,
seed.matrix, seed.estimation.method)
where
X
: Argument can be an average, a univariate frequency table, or a
censored contingency table. The average value should be a numeric
class while a data.frame or matrix are acceptable table classes. Y
defaults to NULL
if X
argument is a censored contingency table.
See 5 section for formatting.
Y
: Same description as X
but this argument is for the Y
variable. X
defaults to NULL
if Y
argument is a censored
contingency table.
Xlowerbound
: A numeric class value to represent the left bound for
X
(row in contingency table). The value must strictly be a
non-negative integer and cannot be greater than the lowest
category/average value provided for X
(e.g. the lower bound cannot
be 6 if a table has ‘X
or row category).
Xupperbound
: A numeric class value to represent the right bound
for X
(row in contingency table). The value must strictly be a
non-negative integer and cannot be less than the highest
category/average value provided for X
(e.g. the upper bound cannot
be 90 if a table has ‘X
or row category).
Ylowerbound
: Same description as Xlowerbound
but this argument
is for Y
(column in contingency table).
Yupperbound
: Same description as Xupperbound
but this argument
is for Y
(column in contingency table).
seed.matrix
: An initial probability matrix to be updated. If
decoupled variables is provided the default is a
Xlowerbound:Xupperbound
by Ylowerbound:Yupperbound
matrix with
interior cells of 1, which are then converted to probabilities. If a
censored contingency table is provided the default is the
seedmatrix()$Probabilities
output.
seed.estimation.method
: A character string indicating which method
is used for updating the seed.matrix
. The choices are: "ipfp"
,
"ml"
, "chi2"
, or "lsq"
. Default is "ipfp"
.
The output is a list containing an uncensored contingency table of
probabilities (rows range from Xlowerbound:Xupperbound
and the columns
range from Ylowerbound:Yupperbound
) as well as the row and column
parameters used in making the margins for the mipfp R package.
The input tables are formatted to accommodate most open source data. The
univariate frequency table used in cnbinom.pars()
and/or rec()
needs
to be a data.frame or matrix class with two columns and
Left censored symbols:
Interval censored symbols:
Right censored symbols:
Uncensored symbol: no symbol (only provide category value)
Note, less than or equal to (cnbinom.pars()
output.
The censored contingency table for rec()
has a similar format. The
censored symbols should follow the requirements listed above. The
table’s class can be a data.frame or a matrix. The column names should
be the seed.estimation.method = "ipfp"
or strictly
positive if the seed.estimation.method
is "ml"
, "lsq"
, or
"chi2
". The row NA
or blank. The bottom right corner can be a total
cross tabulation sum value, NA
, or blank. Table 3 is a
formatted example.
Category | Frequency | Category | Frequency | Category | Frequency |
---|---|---|---|---|---|
11800 | LE 6 | 11800 | 11800 | ||
7-12 | 57100 | 7 I 12 | 57100 | 7 I 12 | 57100 |
13-19 | 14800 | 13 I 19 | 14800 | 13-19 | 14800 |
20+ | 3900 | GE 20 | 3900 | 3900 |
NA | 20-30 | NA | ||
18 | 19 | 8 | 45 | |
5-9 | 13 | 8 | 12 | 33 |
7 | 5 | 10 | 21 | |
NA | 38 | 32 | 31 | NA |
The code below shows how to format these tables properly in R.
# create univariate table
# note univariatetable.csv is a preloaded example that provides the same table
univariatetable <- cbind(as.character(c("1-2", "3-4", "5-6", "7-8", ">=9")),
c(16.2, 41.7, 29.0, 9.0, 4.1))
# create contingency table
# note contingencytable.csv is a preloaded example that provides the same table
# fill a matrix
contingencytable <- matrix(c(
6185, 9797, 16809, 11126, 6156, 3637, 908, 147, 69, 4,
5408, 12748, 26506, 21486, 14018, 9165, 2658, 567, 196, 78,
7403, 20444, 44370, 36285, 23576, 15750, 4715, 994, 364, 136,
4793, 17376, 44065, 40751, 28900, 20404, 6557, 1296, 555, 228,
2354, 11143, 32837, 33910, 26203, 19301, 6835, 1438, 618, 245,
1060, 6038, 19256, 21298, 17774, 13864, 4656, 1039, 430, 178,
273, 2521, 9110, 11188, 9626, 7433, 2608, 578, 196, 112,
119, 1130, 4183, 5566, 5053, 3938, 1367, 318, 119, 66,
33, 388, 1707, 2367, 2328, 1972, 719, 171, 68, 37,
38, 178, 1047, 1672, 1740, 1666, 757, 193, 158, 164),
nrow = 10, ncol = 10, byrow = TRUE)
# calculate row marginal
rowmarginal <- apply(contingencytable, 1, sum)
# add row marginal to matrix
contingencytable <- cbind(contingencytable, rowmarginal)
# calculate column marginal
colmarginal <- apply(contingencytable, 2, sum)
# add column marginal to matrix
contingencytable <- rbind(contingencytable, colmarginal)
# remove row names
row.names(contingencytable)[row.names(contingencytable) == "colmarginal"]<-""
# add row names as first column
contingencytable <- data.frame(c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10+", NA), contingencytable)
# add column names
colnames(contingencytable) <- c(NA, "<20", "20-29", "30-39", "40-49",
"50-69", "70-99", "100-149", "150-199", "200-299", "300+", NA)
A Nepal Living Standards Survey (Government of Nepal, National Planning Commission Secretariat 2011) provides both a censored table
and average for urban household size. We use the censored table to show
that the cnbinom.pars()
function calculates a close approximation to
the provided average household size (4.4 people). Note, there is
overdispersion in the data.
# revengc has the Nepal household table preloaded as univariatetable.csv
cnbinom.pars(censoredtable = univariatetable.csv)
In 2010, the Population Census Data - Statistics Indonesia provided over
60 censored contingency tables containing household member size by floor
area of dwelling unit (square meter) (Statistics Indonesia 2010). The tables are separated
by province, urban, and rural. Here we use the rural Aceh Province table
to show the multiple coding steps and functions implemented inside
rec()
. This allows the user to see a methodology workflow in code
form. The final uncensored household size by area estimated probability
table, which implemented the "ipfp"
seed estimation method and default
seed matrix, has rows ranging from 1 (Xlowerbound
) to 15
(Xupperbound
) people and columns ranging from 10 (Ylowerbound
) to
310 (Yupperbound
) square meters.
# Packages needed if doing workflow of rec() step by step
library(stringr)
library(dplyr)
library(mipfp)
library(truncdist)
library(revengc)
# data = Indonesia's rural Aceh Province censored contingency table
# preloaded in revengc as 'contingencytable.csv'
contingencytable.csv
# provided upper and lower bound values for table
# X = row and Y = column
Xlowerbound = 1
Xupperbound = 15
Ylowerbound = 10
Yupperbound = 310
# table of row marginals provides average and dispersion for x
row.marginal.table <- row.marginal(contingencytable.csv)
x <- cnbinom.pars(row.marginal.table)
# table of column marginals provides average and dispersion for y
column.marginal.table <- column.marginal(contingencytable.csv)
y <- cnbinom.pars(column.marginal.table)
# create uncensored row and column ranges
rowrange <- Xlowerbound:Xupperbound
colrange <- Ylowerbound:Yupperbound
# new uncensored row marginal table = truncated negative binomial distribution
uncensored.row.margin <- dtrunc(rowrange, mu=x$Average, size = x$Dispersion,
a = Xlowerbound-1, b = Xupperbound, spec = "nbinom")
# new uncensored column margin table = truncated negative binomial distribution
uncensored.column.margin <- dtrunc(colrange, mu=y$Average, size = y$Dispersion,
a = Ylowerbound-1, b = Yupperbound, spec = "nbinom")
# sum of truncated distributions equal 1
# margins need to be equal for mipfp
sum(uncensored.row.margin)
sum(uncensored.column.margin)
# create seed of probabilities (rec() default)
seed.output <- seedmatrix(contingencytable.csv, Xlowerbound,
Xupperbound, Ylowerbound, Yupperbound)$Probabilities
# run mipfp
# store the new margins in a list
tgt.data <- list(uncensored.row.margin, uncensored.column.margin)
# list of dimensions of each marginal constrain
tgt.list <- list(1,2)
# calling the estimated function
## seed has to be in array format for mipfp package
## ipfp is the selected seed.estimation.method
final1 <- Estimate(array(seed.output, dim=c(length(Xlowerbound:Xupperbound),
length(Ylowerbound:Yupperbound))), tgt.list, tgt.data, method="ipfp")$x.hat
# filling in names of updated seed
final1 <- data.frame(final1)
row.names(final1) <- Xlowerbound:Xupperbound
names(final1) <- Ylowerbound:Yupperbound
# reweight estimates to known censored interior cells
final1 <- reweight.contingencytable(observed.table = contingencytable.csv,
estimated.table = final1)
# final results are probabilities
sum(final1)
# rec() function outputs the same table
# default of rec() seed.estimation.method is ipfp
# default of rec() seed.matrix is the output of seedmatrix()$Probabilities
final2<-rec(
X = contingencytable.csv,
Xlowerbound = 1,
Xupperbound = 15,
Ylowerbound = 10,
Yupperbound = 310)
# check that final1 and final2 have the same results
all(final1 == final2$Probability.Estimates)
revengc was designed to reverse engineer summarized and decoupled
variables with two main functions: cnbinom.pars()
and rec()
. Relying
on a negative binomial distribution, cnbinom.pars()
approximates the
average and dispersion parameter of a censored univariate frequency
table. rec()
fills in missing interior cell values from observed
aggregated data (e.g. decoupled average(s) and/or censored frequency
table(s) or a censored contingency table). It is worth noting the
required assumptions in rec()
. For instance, rec()
relies on a
Poisson distribution when only an average is provided, which is assuming
the variance and average are equal. More descriptive input variables,
such as univariate frequency tables or contingency tables, can account
for dispersion found in data. However, independence between decoupled
variables still has to be assumed when there is no external information
about the joint distribution. For these reasons, revengc provides two
options for sensitivity analysis: the seed matrix and the method used in
updating the seed matrix are both arbitrary inputs.
This manuscript has been authored by employees of UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy. Accordingly, the United States Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The publisher, by accepting the article for publication, acknowledges the above-mentioned conditions.
Distributions, OfficialStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Duchscherer, et al., "revengc: An R package to Reverse Engineer Summarized Data", The R Journal, 2018
BibTeX citation
@article{RJ-2018-044, author = {Duchscherer, Samantha and Stewart, Robert and Urban, Marie}, title = {revengc: An R package to Reverse Engineer Summarized Data}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {2}, issn = {2073-4859}, pages = {114-123} }