Many self-report measures of attitudes, beliefs, personality, and pathology include items that can be easily manipulated by respondents. For example, an individual may deliberately distort responses to simulate grossly exaggerated physical or psychological symptoms in order to reach specific goals such as obtaining financial compensation, avoiding criminal charges, avoiding military duty, or obtaining drugs. This article introduces the package sgr, which can be used to perform fake data analysis according to the sample generation by replacement approach. The package includes functions for making simple inferences about discrete/ordinal fake data. It allows the user to quantify uncertainty in inferences based on possible fake data as well as to study the implications of fake data for empirical results.
How can we evaluate the impact of fake information in real life contexts? In nature, some individuals tend to distort their behaviors or actions in order to reach specific goals. In some species, for example, wimpy animals may not signal their real social value, instead faking a higher status to deceive competitors. Similarly, in personnel selection some job applicants may misrepresent themselves on a personality test hoping to increase the likelihood of being offered a job. Being able to discriminate between honest and fraudulent signals and to evaluate the impact of counterfeit elements crucially depends on the way we can reason about the whole process of faking. A coherent knowledge of the type or structure of faking processes may lead to stronger inferences that lie on or close to what we may call the genuine, but probably hidden, representation of a manifest behavior. In general, fake data may alter a large variety of self-report measures. This problem is particularly relevant for discrete/ordinal data collected in sensitive contexts such as risky sexual behaviors, drug addiction, tax evasion, political preferences, financial compensation, and personnel selection. More generally, researchers interested in the study of human behavior in areas like psychology (Hopwood et al. 2006), organizational and social science (Van der Geest and Sarkodie 1998), psychiatry (Beaber et al. 1985), forensic medicine (Gray et al. 2003), scientific frauds (Marshall 2000), and economics (Crawford 2003) may face the fake data problem when analyzing and interpreting empirical data.
In this article, we discuss the sgr package that we have developed for running fake data analysis according to the sample generation by replacement (SGR) approach (Lombardi and Pastore 2012). SGR is a data simulation procedure to generate artificial samples of fake discrete/ordinal data. The main characteristic of the SGR approach is that it allows detailed explorations of what outcomes are produced by particular sets of faking assumptions. By changing the input in the faking model parameters and showing the effect on the outcome of a model, SGR provides a what-if analysis of the faking scenarios. Therefore, SGR can be used to quantify uncertainty in inferences based on possible fake data as well as to evaluate the implications of fake data for statistical results. To illustrate, consider the following example, in which a researcher is interested in studying the relationship between therapy noncompliance indicators (e.g., forgetting the treatment) and unsafe behavior indicators (e.g., drinking alcohol) in a group of liver transplant patients. Patients diagnosed with alcohol dependence who follow a pharmaceutical regimen after the liver transplant may deliberately answer a question about drinking alcohol fraudulently because of the required abstinence from ethanol and social desirability factors (e.g., Foster et al. 1997). In this context, an SGR analysis can help in testing the potential influence of faking in the drinking alcohol self-report measure on the strength of the relationship between therapy noncompliance and unsafe behavior indicators. More specifically, how sensitive are the empirical associations to possible fake observations in the drinking alcohol self-report measure? Are the conclusions still valid under one or more scenarios of faking (e.g., slight, moderate, and extreme faking) for the drinking alcohol variable?
In general, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking and the underlying true model representation. This makes SGR related in spirit to other statistical approaches such as uncertainty and sensitivity analysis (Helton et al. 2006) and prospective power analysis (Cohen 1988), which are all characterized by an attempt to directly quantify the uncertainty of general statistics computed on the data by means of specific hypotheses.
The rest of the paper is organized as follows. The next section reviews the SGR framework and its basic implementation in the sgr package. The following section provides three examples illustrating the application of sgr to faking scenarios. The final section discusses limitations and future extensions of the sgr components beyond the general scheme presented here.
SGR is characterized by a two-stage sampling procedure based on two distinct generative models: the model defining the process that generates the data prior to any fake perturbation (data generation process) and the model representing the faking process used to perturb the data (data replacement process). By repeatedly sampling data from the SGR procedure we can generate the so-called fake data sample (FDS) and eventually study the distribution of some relevant statistics computed on these simulated data samples. In SGR the data generation process is modeled by means of standard Monte Carlo procedures for ordinal data, whereas the data replacement process is implemented using ad hoc probabilistic faking models. In sum, the overall generative process is split into two conceptually independent and possibly simpler components (a divide and conquer strategy).
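In schematic form, an SGR simulation therefore looks like the following sketch, written in terms of the package functions introduced in the next sections (the statistic and the parameter values are arbitrary choices for illustration only):

> library(sgr)
> set.seed(1)
> RM <- replacement.matrix(Q=5,p=c(.3,0),fake.model="slight")   # data replacement model
> mc.stat <- NULL
> for (b in 1:100) {
+   Dx <- rdatagen(n=50,R=matrix(c(1,.4,.4,1),2),Q=c(5,5))$data  # 1) data generation
+   Fx <- rdatarepl(Dx,RM)$Fx            # 2) data replacement (prints the % of replaced data)
+   mc.stat <- c(mc.stat,cor(Fx[,1],Fx[,2],method="spearman"))   # 3) statistic on the FDS
+ }
> summary(mc.stat)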
With regard to the fake-data problem in general, we think of the data in the generation process as being represented by an $n \times m$ matrix $\mathbf{D}$ of discrete/ordinal values, where $n$ denotes the number of observations and $m$ the number of observed variables. The data replacement process then maps $\mathbf{D}$ onto a matrix $\mathbf{F}$ of the same size containing the possibly fake observations. It is important to note that in the standard SGR framework the replacement distribution is assumed to satisfy a conditional independence (CI) property: each entry of $\mathbf{F}$ is generated independently of the other entries, conditionally on the corresponding original entry of $\mathbf{D}$ and on the parameters of the faking model.
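In schematic form (our reading of the CI assumption, stated with the notation just introduced), this amounts to the factorization

$$ p(\mathbf{F} \mid \mathbf{D}; \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} p(f_{ij} \mid d_{ij}; \theta), $$

so that each fake entry depends on the data only through the corresponding original entry and on the faking parameters $\theta$.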
In the last section of this article we will discuss some potential limitations of the conditional independence assumption in real application domains of the SGR approach.
In general, several options are available to represent the data generation process (Samejima 1969; Muthén 1984; Jöreskog and Sörbom 1996; Moustaki and Knott 2000). In the current version of the sgr package we implemented a procedure based on the multivariate latent variable framework which is called the underlying variable approach (UVA; Muthén 1984; Jöreskog and Sörbom 1996). In this approach the observed ordinal variables are treated as metric through assumed underlying normal variables. In particular, we assume that there exists a continuous data matrix, sampled from a multivariate normal distribution with correlation matrix $\mathbf{R}$, which is turned into the ordinal data matrix $\mathbf{D}$ by discretizing each continuous variable according to a set of threshold values. In SGR the data generation process is therefore obtained by first generating the continuous data and then categorizing them using the thresholds. The function rdatagen implements this two-step procedure. For example, to sample $n = 100$ observations on two 5-point ordinal variables whose underlying continuous variables correlate .4:
> library(sgr)
> require(MASS)
> require(polycor)
> set.seed(367)
> R <- matrix(c(1,.4,.4,1),2,2)
> th <- list(c(-Inf,qnorm(c(0.04,0.27,0.73,0.96)), Inf),
+ c(-Inf,qnorm(c(0.06,0.31,0.69,0.94)),Inf))
> Dx <- rdatagen(n=100,R=R,Q=c(5,5),th=th)
> Dx$data
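To see how the thresholds translate into category probabilities, we can compare the simulated marginal frequencies of the first variable with the target proportions implied by th (a small check added here, not part of the original example):

> round(prop.table(table(Dx$data[,1])),2)      # simulated proportions, first variable
> round(diff(c(0,0.04,0.27,0.73,0.96,1)),2)    # target proportions implied by th[[1]]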
In this example, the threshold values are derived from the quantiles of the standard normal distribution in such a way that the first simulated variable shows a slightly smaller variance than the second simulated variable (its distribution places less mass on the extreme categories). Generally, the threshold values can be derived in two
different ways. In the first case, we can use empirically based
knowledge (e.g., an already existing data set) to estimate the threshold
values on the basis of the observed distribution function of the levels
of the discrete variable (e.g., Jöreskog and Sörbom 1996). In the second
case, some simple statistical knowledge can be used to simulate
threshold values according to desired properties. For example, the
normal quantiles used as corresponding threshold values can be computed
using the inverse of the binomial cumulative distribution function
(e.g., Jöreskog and Sörbom 1996). In the rdatagen
function call the
parameter Q
specifies the number of levels for each ordinal variable.
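For instance, the two strategies for deriving thresholds described above can be sketched as follows (our own illustration; the exact recipes are not prescribed by the package):

> # empirically based thresholds from an observed ordinal variable
> cp <- cumsum(prop.table(table(Dx$data[,1])))   # empirical distribution function
> c(-Inf,qnorm(cp[-length(cp)]),Inf)             # estimated threshold values
> # thresholds from simple statistical knowledge: normal quantiles of a
> # binomial CDF (here Q = 5 levels, success probability .5)
> c(-Inf,qnorm(pbinom(0:3,4,.5)),Inf)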
To compare the model correlation matrix $\mathbf{R}$ with the correlation estimated from the simulated ordinal data, we can use the polychor function in the polycor package (Fox 2010):
> d1 <- factor(Dx$data[,1],ordered=TRUE)
> d2 <- factor(Dx$data[,2],ordered=TRUE)
> polychor(d1,d2,ML=TRUE,std.err=TRUE)
Polychoric Correlation, ML est. = 0.3627 (0.09832)
To generate the fake ordinal data we used a parametrized replacement distribution based on a discrete beta kernel (Pastore and Lombardi 2014). Some examples of replacement distributions are shown in Figure 1. Let $d$ denote an original value and $f$ the corresponding, possibly fake, value of an ordinal variable with $Q$ levels. For positive faking and an original value $h < Q$ (values already at the top of the scale are left unchanged), the conditional replacement distribution can be described according to the following equation

$$ p(f = k \mid d = h) = \begin{cases} 1 - \pi & k = h, \\ \pi \, \mathrm{DB}(k;\, \gamma, \delta) & h < k \le Q, \\ 0 & \text{otherwise}, \end{cases} $$

with $0 \le \pi \le 1$ the overall probability of faking and $\mathrm{DB}(\cdot;\, \gamma, \delta)$ a discrete beta kernel with shape parameters $\gamma, \delta > 0$ defined on the admissible fake values $h < k \le Q$; negative faking is defined analogously on the values below $h$. In the package, $\pi$, $\gamma$, and $\delta$ correspond to the arguments p, gam, and del of the replacement functions.
In the following example, we applied a pure (slight) positive faking model (see Table 1) to generate a fake data matrix from the simulated ordinal data:
> RM <- replacement.matrix(Q=5,p=c(.5,0),fake.model="slight")
> Fx <- rdatarepl(Dx$data,RM)
46% of data replaced.
> Fx$Fx
We used the replacement.matrix function to construct the conditional replacement probability distribution and saved the result in the variable RM, which is then used as the argument of the data replacement generation function rdatarepl. Note that the argument fake.model in the replacement.matrix function allows the user to set the options reported in Table 1. However, all of the model parameters can be set manually by the user to any array of consistent values if so desired. For example, an equivalent syntax would have been
> RM <- replacement.matrix(Q=5,p=c(.5,0),gam=c(1.5,0),del=c(4,0))
We can evaluate the impact of positive faking on the new fake data matrix by comparing the frequencies of the ordinal categories in the original data Dx$data and in the fake data Fx$Fx:
> table(Dx$data[,1])
1 2 3 4 5
5 29 40 24 2
> table(Fx$Fx[,1])
1 2 3 4 5
2 17 36 31 14
which shows how the positive faking has shifted the values of the first ordinal variable towards the larger categories. In a similar way, we could also evaluate the impact of faking on the sample polychoric correlation matrix of the fake data.
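For instance, reusing the polychor call shown above on the perturbed variables (the resulting estimate will depend on the simulated sample):

> f1 <- factor(Fx$Fx[,1],ordered=TRUE)
> f2 <- factor(Fx$Fx[,2],ordered=TRUE)
> polychor(f1,f2,ML=TRUE,std.err=TRUE)   # compare with the estimate obtained on Dx$data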
Table 1: Shape parameters of the replacement distributions for the predefined faking models (first two columns: faking positive; last two columns: faking negative).

Model | γ (positive) | δ (positive) | γ (negative) | δ (negative)
---|---|---|---|---
uniform | 1 | 1 | 1 | 1
slight | 1.5 | 4 | 4 | 1.5
extreme | 4 | 1.5 | 1.5 | 4
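To see how these settings shape the replacement probabilities, the corresponding conditional replacement distributions can be built and inspected directly (an illustration added here; each call returns the object that is passed to rdatarepl as RM):

> replacement.matrix(Q=5,p=c(.5,0),fake.model="uniform")
> replacement.matrix(Q=5,p=c(.5,0),fake.model="slight")
> replacement.matrix(Q=5,p=c(.5,0),fake.model="extreme")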
By way of illustration we consider three simple SGR examples. The first is the evaluation of a correlational analysis computed on five-point rating data. This example is hypothetical and serves to introduce the main features and functions implemented in the sgr package. The second application considers real data about illicit drug use among young people aged 14 to 27. This second example shows how to model directional faking hypotheses (e.g., faking good or faking bad). It is also important because it illustrates how the replacement functions can be applied to dichotomous data. Finally, the third application extends the second example by analyzing a new set of data about cannabis consumption in young people using ordinal logit models.
We begin with a simple SGR analysis of a hypothetical observed difference (equal to .3 in the code below) between the polychoric correlations of two groups of 50 subjects each, measured on two five-point rating variables. The question is whether such a difference could be accounted for by group-specific faking when the two groups share the same underlying correlation (.25 in this example). The code below illustrates the SGR analysis:
> require(polycor)
> set.seed(367)
> obs.stat <- .3; mc.stat <- NULL
> Rmc <- matrix(c(1,.25,.25,1),2)
> PM <- matrix(c(rep(1,100),rep(2,100)),ncol=2,byrow=TRUE)
> Pparm <- list(p=matrix(c(.2,.3,.1,.2),2),gam=matrix(1,2,2),del=matrix(1,2,2))
> for (b in 1:1000) {
+ mcD <- rdatagen(n=100,R=Rmc,Q=5)$data
+ Fx <- partition.replacement(mcD,PM,Pparm=Pparm)
+ for (j in 1:ncol(Fx)) {
+ Fx[,j] <- ordered(Fx[,j])
+ }
+ mcpc1 <- hetcor(Fx[1:50,])$correlations[1,2]
+ mcpc2 <- hetcor(Fx[51:100,])$correlations[1,2]
+ Delta <- mcpc1-mcpc2
+ mc.stat <- c(mc.stat,Delta)
+ }
> hist(mc.stat)
> sum(mc.stat>=obs.stat)/1000
[1] 0.226
An ordinal data matrix mcD is first simulated according to the generative model with correlation matrix Rmc. This matrix contains two symmetrically distributed ordinal variables (obtained with the default threshold values of the rdatagen function). Next, the ordinal matrix is transformed according to the faking models. In particular, the function partition.replacement allows the user to set different replacement distributions for the two groups of subjects and returns the perturbed data matrix. This function has three main arguments: Dx=mcD, the data frame or matrix to be replaced; PM, the partition matrix used to cluster the observations into the distinct groups; Pparm, the list of replacement parameters for each of the different faking models. Note that the partition matrix must have the same dimension as the matrix to be replaced and a numeric code for each distinct cluster (group) in the partition. If a cell of the partition matrix contains a zero, no replacement is applied to the corresponding entry. Pparm is a list containing three elements (p, gam, and del), each of which is a matrix of faking parameters: p collects the overall faking positive and faking negative probabilities for each group, whereas gam and del collect the corresponding first and second shape parameters of the replacement distributions. Figure 3 shows the distribution of the test statistic under the hypothesized faking scenario (approximate p value = 0.226), suggesting that a difference as large as the observed one can plausibly arise from the hypothesized faking process even when the two groups share the same true correlation.
Table 2 refers to a real prospective study about illicit drug use among young people aged 14-27 (Pastore et al. 2007). In particular, we evaluated the relationship between age (dichotomized into two categories: adults and minors) and self-reported drug (ecstasy) use (yes/no).
Table 2: Drug use by age group (observed frequencies).

age | drug: yes (1) | drug: no (2)
---|---|---
adults (1) | 10 | 25
minors (2) | 32 | 29
The result of a log-linear model of independence for the two-way table showed a significant likelihood-ratio chi-squared statistic (computed as obs.lrt in the code below), indicating an association between age and drug use.
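As a quick check (our own verification, equivalent to the loglm fit in the code below), the likelihood-ratio statistic for independence can also be computed directly from the counts in Table 2:

> O <- matrix(c(10,25,32,29),2,2,byrow=TRUE)    # observed counts from Table 2
> E <- outer(rowSums(O),colSums(O))/sum(O)      # expected counts under independence
> 2*sum(O*log(O/E))                             # likelihood-ratio chi-squared statistic (df = 1)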
> require(MASS)
> set.seed(367)
> data(smokers)
> ecstasy.table <- table(smokers$drug,smokers$age,dnn=c("drug","age"))
> obs.lrt <- loglm(~drug+age,data=ecstasy.table)$lrt
>
> PM <- matrix(0,nrow(smokers),2)
> PM[smokers$age==1,2] <- 1
> PM[smokers$age==2,2] <- 2
> Pparm <- list(p=matrix(c(.8,.4,0,0),2),gam=matrix(c(1,1,0,0),2),
+ del=matrix(c(1,1,0,0),2))
> mc.lrt <- NULL
> for (b in 1:1000) {
+ smokers$simdrug <- rdatagen(nrow(smokers),R=matrix(1),Q=2,
+ probs=list(c(.75,.25)))$data
+ Fx <- partition.replacement(smokers[,c("age","simdrug")],PM,Pparm=Pparm)
+ mc.lrt <- c(mc.lrt,loglm(~simdrug+age,data=table(Fx$simdrug,Fx$age,
+ dnn=c("simdrug","age")))$lrt)
+ }
> hist(mc.lrt)
> sum(mc.lrt>=obs.lrt)/1000
[1] 0.812
Note that for dichotomous variables (Q = 2) each original response has only one admissible fake value, so the shape parameters of the replacement distribution play no substantial role and the faking process is governed by the overall replacement probabilities only. Figure 4 shows the distribution of the test procedure under the hypothesized faking scenario (approximate p value = 0.812), indicating that the observed likelihood-ratio statistic is consistent with the hypothesized faking process.
In this application we extend the results reported in the second example by analyzing a new set of ordinal data about illicit drug use among young people (see Table 3). This new two-way table relates an independent categorical variable, age (minors, < 18 years old, vs adults), to a dependent ordinal variable, cannabis consumption. In particular, the dependent variable uses a four-point ordinal scale ranging from never (1) to often (4), with intermediate levels once (2) and sometimes (3). When response categories are ordered, logit models can directly incorporate the ordering (Agresti 1990). In general, this results in model representations having simpler interpretations than ordinary multicategory logit models, at least when the proportional odds model holds.
Table 3: Cannabis consumption by age group (observed frequencies).

age | never (1) | once (2) | sometimes (3) | often (4)
---|---|---|---|---
adults (1) | 20 | 5 | 7 | 0
minors (2) | 27 | 5 | 18 | 10
The following code illustrates the results of applying an ordered logistic model to the data represented in Table 3. For the analysis we used the function polr in the MASS package (Venables and Ripley 2002), which fits a logistic or probit regression model to an ordered factor response.
> Y <- data.frame(list(age=gl(2,4),response=gl(4,1,8,ordered=TRUE),
+ counts=c(20,5,7,0,27,5,18,10)))
> fit0 <- polr(response~1,data=Y,weight=counts)
> fit1 <- polr(response~age,data=Y,weight=counts)
> lrt.obs <- -2*(logLik(fit0)-logLik(fit1))
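As a side note (not part of the original code), the same comparison can also be obtained with anova, which reports likelihood-ratio tests for polr fits:

> anova(fit0,fit1)   # likelihood-ratio test of the age effect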
The likelihood ratio statistic lrt.obs compares the intercept-only model with the model including age, and thus quantifies the age effect on the reported cannabis consumption. To evaluate whether an effect of this size could be produced by faking alone, the following SGR analysis simulates responses from a common distribution for both age groups and then applies a slight faking model to the adults' responses with an overall replacement probability of .1:
> set.seed(367)
> Z <- na.omit(smokers[,c("age","druguse")])
> PM <- matrix(0,nrow(Z),ncol(Z))
> PM[Z$age==1,2] <- 1
> lrt.mc <- NULL
> for (b in 1:1000) {
+ Z$simdrug <- rdatagen(nrow(Z),R=matrix(1),Q=4,
+ probs=list(c(27,5,18,10)/60))$data
+ Dx <- Z[,-2]
+ Fx <- partition.replacement(Dx,PM,p=matrix(c(0,.1),1),fake.model="slight")
+ Tmc <- table(Fx$age,Fx$simdrug)
+ Ymc <- data.frame(list(age=gl(2,4),response=gl(4,1,8,ordered=TRUE),
+ counts=c(Tmc[1,],Tmc[2,])))
+ fit0 <- polr(response~1,data=Ymc,weight=counts)
+ fit1 <- polr(response~age,data=Ymc,weight=counts)
+ lrt.mc <- c(lrt.mc,-2*(logLik(fit0)-logLik(fit1)))
+ }
> sum(lrt.mc>=lrt.obs)/1000
[1] 0.039
Figure 5 shows the distribution of the test procedure under this faking hypothesis. This time the observed likelihood ratio statistic seems not consistent with the hypothesized faking scenario (approximate p value = 0.039): an age effect as large as the observed one is unlikely to be produced by this amount of faking alone.
In this section we provide a full exploratory SGR analysis for the data presented in Table 3. In particular, we show how it is possible to vary the parameters of the faking model (the shape parameters of the replacement distribution, GA and DE in the code below, together with the overall probability of faking PI) in order to evaluate how the final conclusions change over a whole set of faking scenarios.
In what follows we present a short code example that the reader may easily manipulate to set the desired values for the parameters in the simulation study (shape parameters, overall probabilities of faking, number of runs in the SGR simulations). Note that in this exploratory setting the overall time required to complete the SGR simulation may widely vary according to the complexity (e.g., number of different values for the parameters) of the simulation design.
> data(smokers)
> Z <- na.omit(smokers[,c("age","druguse")])
>
> fit0 <- polr(ordered(druguse)~1,data=Z)
> fit1 <- polr(ordered(druguse)~age,data=Z)
> lrt.obs <- -2*(logLik(fit0)-logLik(fit1)) # observed LRT
>
> ### SGR algorithm
> PI <- .2; B <- 10 # for real simulations set B at least 500
> lrt.mc <- ga.mc <- de.mc <- p.mc <- NULL
> PM <- matrix(0,nrow(Z),ncol(Z)) # partition matrix
> PM[Z$age==1,2] <- 1
>
> for (GA in seq(.5,5.5,.5)) {
+ for (DE in seq(.5,5.5,.5)) {
+
+ Pparm <- list(p=matrix(c(0,PI),1),gam=matrix(c(0,GA),1),del=matrix(c(0,DE),1))
+
+ for (b in 1:B) {
+ Z$simdrug <- rdatagen(nrow(Z),R=matrix(1),Q=4,
+ probs=list(c(27,5,18,10)/60))$data
+ Dx <- Z[,-2]
+ Fx <- partition.replacement(Dx,PM,Pparm=Pparm)
+
+ Tmc <- table(Fx$age,Fx$simdrug)
+ Ymc <- data.frame(list(age=gl(2,ncol(Tmc)),response=gl(ncol(Tmc),1,
+ ordered=TRUE,labels=colnames(Tmc)),counts=c(Tmc[1,],Tmc[2,])))
+
+ fit0 <- polr(response~1,data=Ymc,weight=counts)
+ fit1 <- polr(response~age,data=Ymc,weight=counts)
+ statval <- -2*(logLik(fit0)-logLik(fit1))
+ lrt.mc <- c(lrt.mc,statval)
+
+ ga.mc <- c(ga.mc,GA); de.mc <- c(de.mc,DE)
+ p.mc <- c(p.mc,ifelse(statval>lrt.obs,1,0))
+ }
+ }
+ }
> LRT <- data.frame(list(gam=ga.mc,del=de.mc,lrt=lrt.mc))
> aggregate(p.mc,list(gam=LRT$gam,del=LRT$del),mean)
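One convenient way (our own addition) to inspect the resulting grid is to plot the proportion of simulated likelihood-ratio statistics exceeding the observed one as a function of the two shape parameters:

> P <- aggregate(p.mc,list(gam=ga.mc,del=de.mc),mean)
> gam.vals <- sort(unique(P$gam)); del.vals <- sort(unique(P$del))
> M <- matrix(NA,length(gam.vals),length(del.vals))        # grid of proportions
> M[cbind(match(P$gam,gam.vals),match(P$del,del.vals))] <- P$x
> filled.contour(gam.vals,del.vals,M,xlab="gam",ylab="del",
+ main="Proportion of simulated LRT >= observed LRT")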
This paper illustrated the usage of a new R package, sgr, for simulating and analyzing ordinal fake data. As far as we know, sgr is the first statistical package devoted to the analysis of fake data. Overall, the essential characteristic of this approach is its explicit use of mathematical models and appropriate probability distributions for quantifying uncertainty in inferences based on possible fake data. Moreover, it involves the derivation of new statistical results as well as the evaluation of the implications of such new results: Are the substantive conclusions reasonable? How sensitive are the results to the modeling assumptions about the process of faking? In sum, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking. In this contribution we illustrated the use of sgr on three simple scenarios of faking. More complex examples of SGR applications can be found in Lombardi and Pastore (2012) and Pastore and Lombardi (2014).
As with many Monte Carlo-type approaches, SGR involves simplifying assumptions that may result in lower external validity. For example, one relevant limitation regards the assumption that restricts the conditional replacement distribution to satisfy the CI property. Unfortunately, this restriction clearly limits the range of empirical faking processes that can be mimicked by the current SGR simulation procedure. In particular, because the replacement distribution under the CI assumption acts as a perturbation process for the original data, the resulting new fake data sets will in general show covariance patterns that are (on average) weaker than the ones observed for the original uncorrupted data. This may not be a serious problem, as different studies have shown that self-report measures under faking-motivating conditions tend to have smaller variances and lower reliability (covariance) estimates than those observed for measures collected under uncorrupted conditions (Eysenck et al. 1974; Topping and O'Gorman 1997; Ellingson et al. 2001; Hesketh et al. 2004). However, opposite results have also been observed, where simple fake-good instructions tend to increase the intercorrelations between the manipulated or faked items (Ellingson et al. 1999; Zickar and Robie 1999; Pauls and Crost 2005; Ziegler and Buehner 2009; Galic et al. 2012). Therefore, although encouraging, the promise of this approach should be examined across more varied conditions. We acknowledge that more work still needs to be done. We are in the process of extending sgr to include new replacement distributions, other than the ones presented in this article, which will allow the user to modulate different levels of correlational patterns in the simulated fake data.