Many self-report measures of attitudes, beliefs, personality, and pathology include items that can be easily manipulated by respondents. For example, an individual may deliberately distort responses to simulate grossly exaggerated physical or psychological symptoms in order to reach specific goals, such as obtaining financial compensation, avoiding criminal charges, avoiding military duty, or obtaining drugs. This article introduces the sgr package, which can be used to perform fake data analysis according to the sample generation by replacement approach. The package includes functions for making simple inferences about discrete/ordinal fake data. It allows users to quantify uncertainty in inferences based on possible fake data as well as to study the implications of fake data for empirical results.
How can we evaluate the impact of fake information in real-life contexts? In nature, some individuals tend to distort their behaviors or actions in order to reach specific goals. In some species, for example, weaker animals may misrepresent their real social value by faking a higher status to deceive competitors. Similarly, in personnel selection some job applicants misrepresent themselves on a personality test hoping to increase the likelihood of being offered a job. Being able to discriminate between honest and fraudulent signals and to evaluate the impact of counterfeit elements crucially depends on how we can reason about the whole process of faking. A coherent knowledge of the type or structure of faking processes may lead to stronger inferences that lie on, or close to, what we may call the genuine, but probably hidden, representation of a manifest behavior. In general, fake data may alter a large variety of self-report measures. This problem is particularly relevant for discrete/ordinal data collected in sensitive contexts such as risky sexual behaviors, drug addiction, tax evasion, political preferences, financial compensation, and personnel selection. More generally, researchers interested in the study of human behavior in areas like psychology (Hopwood et al. 2006), organizational and social science (Van der Geest and Sarkodie 1998), psychiatry (Beaber et al. 1985), forensic medicine (Gray et al. 2003), scientific frauds (Marshall 2000), and economics (Crawford 2003) may face the fake data problem when analyzing and interpreting empirical data.
In this article, we discuss the sgr package that we have developed for running fake data analysis according to the sample generation by replacement (SGR) approach (Lombardi and Pastore 2012). SGR is a data simulation procedure for generating artificial samples of fake discrete/ordinal data. The main characteristic of the SGR approach is that it allows detailed exploration of the outcomes produced by particular sets of faking assumptions. By changing the input in the faking model parameters and showing the effect on the outcome of a model, SGR provides a what-if analysis of faking scenarios. Therefore, SGR can be used to quantify uncertainty in inferences based on possible fake data as well as to evaluate the implications of fake data for statistical results. To illustrate, consider the following example in which a researcher is interested in studying the relationship between therapy-uncompliance indicators (e.g., forgetting the treatment) and unsafe behavior indicators (e.g., drinking alcohol) in a group of liver transplant patients. Patients diagnosed with alcohol dependence who follow a pharmaceutical regimen after the liver transplant may deliberately answer a question about drinking alcohol fraudulently, because of the required abstinence from ethanol and social desirability factors (e.g., Foster et al. 1997). In this context, an SGR analysis can help to test the potential influence of faking on the drinking alcohol self-report measure and, in turn, on the strength of the relationship between therapy-uncompliance and unsafe behavior indicators. More specifically, how sensitive are the empirical associations to possible fake observations in the drinking alcohol self-report measure? Are the conclusions still valid under one or more scenarios of faking (e.g., slight, moderate, and extreme faking) for the drinking alcohol variable?
In general, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking and the underlying true model representation. This makes SGR related in spirit to other statistical approaches, such as uncertainty and sensitivity analysis (Helton et al. 2006) and prospective power analysis (Cohen 1988), which are all characterized by an attempt to directly quantify the uncertainty of general statistics computed on the data by means of specific hypotheses.
The rest of the paper is organized as follows. The next section reviews the SGR framework and its basic implementation in the sgr package. The following section provides three examples illustrating the application of sgr to faking scenarios. The final section discusses limitations and future extensions of sgr beyond the general scheme presented here.
SGR is characterized by a two-stage sampling procedure based on two distinct generative models: the model defining the process that generates the data prior to any fake perturbation (data generation process) and the model representing the faking process that perturbs the data (data replacement process). By repeatedly sampling data from the SGR procedure we can generate the so-called fake data sample (FDS) and eventually study the distribution of some relevant statistics computed on these simulated data samples. In SGR the data generation process is modeled by means of standard Monte Carlo procedures for ordinal data, whereas the data replacement process is implemented using ad hoc probabilistic faking models. In sum, the overall generative process is split into two conceptually independent and possibly simpler components (a divide and conquer strategy).
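Schematically, an SGR run reduces to a simple two-stage loop. The sketch below is only meant to convey this structure; sgr_sketch and its gen, repl, and stat arguments are hypothetical placeholders and not part of the sgr package, whose concrete functions are introduced in the following sections.

> # Minimal sketch of the two-stage SGR procedure (hypothetical helper):
> # 'gen' draws a data matrix D from the data generation model, 'repl'
> # perturbs it into a fake data matrix F, and 'stat' computes the statistic
> # of interest on F; B replicates of the statistic are returned.
> sgr_sketch <- function(B, gen, repl, stat) {
+   replicate(B, stat(repl(gen())))
+ }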
With regard to the fake-data problem in general, we think of the data in the generation process as being represented by an \(n \times m\) matrix \(\mathbf{D}\), that is to say, \(n\) i.i.d. observations (hypothetical participants) each containing \(m\) elements (hypothetical participant’s responses). We assume that entry \(d_{ij}\) of \(\mathbf{D}\) (\(i = 1,\ldots, n\); \(j = 1,\ldots, m\)) takes values on a small ordinal range \(V_{q}=\{1,\ldots,q\}\) (for the sake of simplicity, in this presentation we assume identical ordinal scales). In particular, let \(\mathbf{d}_i\) be the \((1 \times m)\) array of \(\mathbf{D}\) denoting the pattern of responses of participant \(i\). The response pattern \(\mathbf{d}_i\) is a multidimensional ordinal random variable with probability distribution \(p(\mathbf{d}_i|\theta_{D})\), where \(\theta_{D}\) indicates the vector of parameters of the probabilistic model for the data generation process. The main idea of the replacement approach is to construct a new \(n \times m\) ordinal data matrix \(\mathbf{F}\), called the fake data matrix of \(\mathbf{D}\), by manipulating each element \(d_{ij}\) in \(\mathbf{D}\) according to a replacement probability distribution. Let \(\mathbf{f}_i\) be the \((1 \times m)\) array of \(\mathbf{F}\) denoting the replaced pattern of fake responses of participant \(i\). The fake response pattern \(\mathbf{f}_i\) is a multidimensional ordinal random variable with conditional replacement probability distribution
\[\begin{eqnarray} p(\mathbf{f}_i|\mathbf{d}_i,\theta_F)=\prod_{j=1}^{m}p(f_{ij}|d_{ij},\theta_F), \qquad i=1,\ldots,n \label{eq:CRPD} \end{eqnarray} \tag{1} \]
where \(\theta_F\) indicates the vector of parameters of the probabilistic faking model.
It is important to note that in the standard SGR framework the replacement distribution \(p(\mathbf{f}_i|\mathbf{d}_i,\theta_{F})\) is restricted to satisfy the conditional independence (CI) assumption (see Lombardi and Pastore 2012; Pastore and Lombardi 2014). More precisely, in the replacement distribution each fake response \(f_{ij}\) depends only on the corresponding data observation \(d_{ij}\) and the model parameter \(\theta_{F}\). Therefore, because the patterns of fake responses are also i.i.d. observations, the simulated data array \((\mathbf{D},\mathbf{F})\) is drawn from the joint probability distribution
\[\begin{eqnarray} p(\mathbf{D},\mathbf{F}|\theta_D,\theta_F) & = & \prod_{i=1}^{n}p(\mathbf{d}_i|\theta_D) p(\mathbf{f}_i|\mathbf{d}_i,\theta_F) \\ & = & \prod_{i=1}^{n}p(\mathbf{d}_i|\theta_D) \prod_{j=1}^{m}p(f_{ij}|d_{ij},\theta_F) \label{eq:JPDF} \end{eqnarray} \tag{2} \]
In the last section of this article we will discuss some potential limitations of the conditional independence assumption in real application domains of the SGR approach.
In general, several options are available to represent the data generation process (Samejima 1969; Muthén 1984; Jöreskog and Sörbom 1996; Moustaki and Knott 2000). In the current version of the sgr package we implemented a procedure based on the multivariate latent variable framework known as the underlying variable approach (UVA, Muthén 1984; Jöreskog and Sörbom 1996). This approach assumes that the observed ordinal variables arise from the discretization of underlying continuous normal variables. In particular, we assume that there exists a continuous data matrix \(\mathbf{D}^{\ast}\) underlying the ordinal data matrix \(\mathbf{D}\). Let \(\mathbf{d}_i^{\ast}\) be the (\(1 \times m\)) array of \(\mathbf{D}^{\ast}\) denoting the pattern of underlying continuous values of the \(i\)th observation. It is convenient to let \(\mathbf{d}_i^{\ast}\) have the multivariate standard normal distribution with density function \(\phi(\mathbf{0},\mathbf{R})\), where \(\mathbf{R}\) denotes the (\(m \times m\)) model correlation matrix. The connection between the ordinal variable \(d_{ij}\) and the underlying variable \(d_{ij}^{\ast}\) in \(\mathbf{D}^{\ast}\) is given by \[d_{ij} = h \qquad \mbox{iff} \qquad \tau_{h-1}^{j} < d_{ij}^{\ast} \leq \tau_{h}^{j}\] with \(h=1,\ldots,q\); \(i=1,\ldots,n\); \(j=1,\ldots,m\), and where \[-\infty=\tau_{0}^{j} < \tau_{1}^{j} < \tau_{2}^{j} < \ldots < \tau_{q-1}^{j} < \tau_{q}^{j}=+\infty\] are threshold parameters. Therefore, the joint probability of \(\mathbf{d}_i=(h_1,\ldots,h_m)\) is given by \[\begin{eqnarray} p(\mathbf{h}|\theta_M) = \int_{\tau_{h_{1}-1}^{1}}^{\tau_{h_{1}}^{1}} \cdots \int_{\tau_{h_{m}-1}^{m}}^{\tau_{h_{m}}^{m}}\phi(\mathbf{z}|\mathbf{0},\mathbf{R})d\mathbf{z} \end{eqnarray}\] with \(\theta_{M}=(\tau,\mathbf{R})\) and \(\mathbf{z}=(z_{1},\ldots,z_{m})\) being the parameter vector of the data generation model and the values of the continuous variables, respectively.
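To make the UVA construction concrete, the following lines sketch how ordinal data could be generated by hand from underlying normal variables using MASS::mvrnorm and cut; the correlation and thresholds are arbitrary illustrative values, and the sgr function rdatagen introduced next implements this generation step directly.

> # UVA sketch by hand (illustrative values only): draw the continuous data D*
> # from a bivariate standard normal and discretize it with fixed thresholds
> library(MASS)
> set.seed(1)
> R <- matrix(c(1,.4,.4,1),2,2)                        # model correlation matrix
> Dstar <- mvrnorm(200, mu=c(0,0), Sigma=R)            # continuous data D*
> tau <- c(-Inf, qnorm(c(.1,.35,.65,.9)), Inf)         # thresholds for q = 5 levels
> Dord <- apply(Dstar, 2, cut, breaks=tau, labels=FALSE)  # ordinal data D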
In SGR the data generation process is obtained by first generating the continuous data \(\mathbf{D}^{\ast}\) according to a model correlation matrix \(\mathbf{R}\) and then by transforming it to its discrete counterpart \(\mathbf{D}\) using the model thresholds \(\tau\). In the following example, we used the sgr function rdatagen to sample \(n=100\) random observations from a data generation model with two symmetrically distributed ordinal variables with five levels each and correlation value \(.4\).
> library(sgr)
> require(MASS)
> require(polycor)
> set.seed(367)
> R <- matrix(c(1,.4,.4,1),2,2)
> th <- list(c(-Inf,qnorm(c(0.04,0.27,0.73,0.96)), Inf),
+ c(-Inf,qnorm(c(0.06,0.31,0.69,0.94)),Inf))
> Dx <- rdatagen(n=100,R=R,Q=c(5,5),th=th)
> Dx$data
In this example, the threshold values are derived from the quantiles of the standard normal distribution in such a way that the first simulated variable shows a slightly larger variance than the second simulated variable. Generally, the threshold values can be derived in two different ways. In the first case, we can use empirically based knowledge (e.g., an already existing data set) to estimate the threshold values on the basis of the observed distribution function of the levels of the discrete variable (e.g., Jöreskog and Sörbom 1996; see the sketch below). In the second case, some simple statistical knowledge can be used to simulate threshold values according to desired properties. For example, the normal quantiles used as corresponding threshold values can be computed using the inverse of the binomial cumulative distribution function (e.g., Jöreskog and Sörbom 1996). In the rdatagen function call, the parameter Q specifies the number of levels for each ordinal variable.
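A minimal sketch of the first (empirically based) option, here applied to the simulated variable Dx$data[,1] as a stand-in for real data: the finite thresholds are simply normal quantiles of the observed cumulative proportions (x, cum.prop, and tau.hat are illustrative names).

> # Empirically based thresholds: normal quantiles of the observed cumulative
> # proportions (the final, infinite, threshold is dropped)
> x <- Dx$data[,1]
> cum.prop <- cumsum(table(x))/length(x)
> tau.hat <- qnorm(cum.prop[-length(cum.prop)])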
To compare the model correlation matrix \(\mathbf{R}\) with the sample polychoric correlation, we can use the polychor function in the polycor package (Fox 2010):
> d1 <- factor(Dx$data[,1],ordered=TRUE)
> d2 <- factor(Dx$data[,2],ordered=TRUE)
> polychor(d1,d2,ML=TRUE,std.err=TRUE)
Polychoric Correlation, ML est. = 0.3627 (0.09832)
To generate the fake ordinal data we used a parametrized replacement distribution based on a discrete beta kernel (Pastore and Lombardi 2014). Some examples of replacement distributions are shown in Figure 1. Let \(p_{k|h} \equiv p(k|h,\theta_F)\) be the conditional probability of replacing an original ordinal value \(h\) with the new ordinal value \(k\). In general, \(\theta_F\) represents hypothetical a priori knowledge about the distribution of faking (e.g., the chance of observing a fake observation in the data) or empirically based knowledge about the process of faking (e.g., the direction of faking: fake good vs. fake bad).
The conditional replacement distribution can be described according to the following equation
\[\begin{eqnarray} p_{k|h} = \left\{ \begin{array}{cc} DG(k;a^+,b^+,\theta_F^+) \pi^+ , & 1 \leq h < k \leq q \\ DG(k;a^-,b^-,\theta_F^-) \pi^- , & 1 \leq k < h \leq q \\ 1-(\pi^+ + \pi^-), & 1 < k=h < q \\ 1-\pi^+, & k=h = 1 \\ 1-\pi^-, & k=h = q \end{array} \right. \label{eq:WCPD} \end{eqnarray} \tag{3} \]
with \(DG\) being the generalized beta distribution for discrete variables (Pastore and Lombardi 2014). Note that in Eq. (3) the function \(DG\) is used with two different sets of parameters. More precisely, in the first line the function \(DG\) models the behavior of the faking distribution for fake positive values (\(k > h\)) by means of the governing shape parameter \(\theta_F^+=(\gamma^+,\delta^+)\) with bounds \((a^{+} = h+1,b^{+}=q)\). By contrast, the second line represents the behavior of the faking distribution for fake negative values (\(k < h\)), modelled by the governing shape parameter \(\theta_F^-=(\gamma^-,\delta^-)\) with bounds \((a^{-} = 1,b^{-}=h-1)\). Some examples of faking models with their parameter assignments are reported in Table 1 (see also Pastore and Lombardi 2014). In general, the governing shape parameters \(\theta_F^+\) and \(\theta_F^-\) must be strictly positive. In particular, if \(\gamma^+ = \delta^+ = 1\), the right part of the replacement distribution reduces to a uniform support fake positive distribution (Figure 1, first column). By contrast, if \(1 \leq \gamma^+ < \delta^+\) (resp. \(1 \leq \delta^+ < \gamma^+\)), the model mimics asymmetric faking configurations corresponding to moderate positive shifts (resp. exaggerated positive shifts) in the value of the original response (Figure 1, second and third columns). More specifically, in the slight positive faking configuration the chance of replacing an original value \(h\) with a greater value \(k\) decreases as a function of the distance between \(k\) and \(h\). By contrast, in the extreme positive faking configuration the chance of replacing an original value \(h\) with a greater value \(k\) increases as a function of the distance between \(k\) and \(h\). Unlike the asymmetric configurations (slight faking and extreme faking), the uniform support distribution (\(\gamma^+ = \delta^+ = 1\)) mimics a kind of random world model that can be used whenever we believe we are dealing with purely random fake data. This configuration requires the simplest quantitative representation of the replacement process and reflects the lack of information about the distributional properties of the faking behavior. Similar configurations can also be described for the left part of the replacement distribution, which represents the negative faking process [\(\theta_F^-=(\gamma^-,\delta^-)\)]. However, for this latter component the ordinal relation characterizing the shape parameters must be reversed (see Table 1). Finally, in the conditional replacement distribution the parameters \(\pi^+\) and \(\pi^-\) denote the overall probability of faking positive and the overall probability of faking negative, respectively. These probabilities act as weights to rescale the discrete beta distribution \(DG\) such that \(\pi=\pi^+ + \pi^- \leq 1\). In general, \(\pi^+\) and \(\pi^-\) represent a priori or empirically based knowledge about the distribution of faking for the two components (e.g., the chance of observing a positive or negative fake observation in the data). The third, fourth, and fifth lines of Eq. (3) show the probability of non-replacement (\(k=h\)). Note that if we set \(\pi^+=0\) (resp. \(\pi^-=0\)), the replacement model boils down to a pure faking negative (resp. positive) model, which corresponds to a context in which responses are exclusively subject to negative (resp. positive) faking (see Figure 2).
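As a rough illustration of how a discrete beta kernel of this kind can be built, one may normalize beta densities evaluated over the rescaled integer support and weight them by \(\pi^+\). The dg helper below is purely illustrative and is not necessarily the exact \(DG\) definition used inside sgr; the package function replacement.matrix (used in the next example) computes the actual conditional replacement distribution.

> # Illustrative discretization of a Beta(gamma, delta) density over the
> # integer support a..b (hypothetical helper, not the package's DG)
> dg <- function(k, a, b, gamma, delta) {
+   x <- (seq(a, b) - a + 1)/(b - a + 2)   # map the support into (0, 1)
+   w <- dbeta(x, gamma, delta)
+   (w/sum(w))[k - a + 1]
+ }
> # Conditional probabilities p(k | h = 2) on a q = 5 scale for a pure
> # positive "slight" faking model with pi+ = .5 (illustrative values)
> q <- 5; h <- 2; pi.pos <- .5
> pk <- numeric(q)
> pk[(h+1):q] <- dg((h+1):q, a=h+1, b=q, gamma=1.5, delta=4)*pi.pos
> pk[h] <- 1 - pi.pos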
In the following example, we applied a pure (slight) positive faking model (see Table 1) to generate a fake data matrix \(\mathbf{F}\) from the original data matrix \(\mathbf{D}\).
> RM <- replacement.matrix(Q=5,p=c(.5,0),fake.model="slight")
> Fx <- rdatarepl(Dx$data,RM)
46% of data replaced.
> Fx$Fx
We used the replacement.matrix function to construct the conditional replacement probability distribution and saved the result in the variable RM, which is then used as an argument of the data replacement generation function rdatarepl. Note that the argument fake.model in the replacement.matrix function allows the user to set the options reported in Table 1. However, all of the model parameters can be set manually to any array of consistent values if so desired. For example, an equivalent syntax would have been
> RM <- replacement.matrix(Q=5,p=c(.5,0),gam=c(1.5,0),del=c(4,0))
We can evaluate the impact of positive faking on the new fake data matrix by comparing the frequencies of the ordinal categories in \(\mathbf{D}\) and \(\mathbf{F}\). For example, for the first ordinal variable we have
> table(Dx$data[,1])
1 2 3 4 5
5 29 40 24 2
> table(Fx$Fx[,1])
1 2 3 4 5
2 17 36 31 14
which shows how the positive faking has shifted the values of the first ordinal variable towards larger ones. In a similar way, we could also evaluate the impact of faking on the sample polychoric correlation matrix of \(\mathbf{F}\).
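For instance, continuing the session above, the polychoric correlation of the perturbed variables can be compared with the one computed earlier on \(\mathbf{D}\) (output omitted; f1 and f2 are illustrative names).

> # Polychoric correlation before and after the positive faking perturbation
> f1 <- factor(Fx$Fx[,1],ordered=TRUE)
> f2 <- factor(Fx$Fx[,2],ordered=TRUE)
> polychor(d1,d2,ML=TRUE)   # original data Dx
> polychor(f1,f2,ML=TRUE)   # fake data Fx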
Table 1: Shape parameter values for the faking models available through the fake.model argument.

Model | \(\gamma^+\) | \(\gamma^-\) | \(\delta^+\) | \(\delta^-\) |
---|---|---|---|---|
uniform | 1 | 1 | 1 | 1 |
slight | 1.5 | 4 | 4 | 1.5 |
extreme | 4 | 1.5 | 1.5 | 4 |
By way of illustration we consider three simple SGR examples. The first concerns the evaluation of a correlational analysis computed on five-point rating data. This example is hypothetical and serves to introduce the main features and functions implemented in the sgr package. The second application considers real data about illicit drug use among young people aged 14 to 27. This second example shows how to model directional faking hypotheses (e.g., faking good or faking bad). It is also important because it illustrates how the replacement functions can be applied to dichotomous data. Finally, the third application extends the second example by analyzing a new set of data about cannabis consumption in young people using log-linear models for ordinal data.
We begin with a simple SGR analysis of a hypothetical observed difference (\(\hat{\Delta}=\hat{\rho}_1 - \hat{\rho}_2=.3\)) between two ordinal correlations computed on two five-point rating variables \(X\) and \(Y\) for two groups of subjects, \(G_1\) (\(n_1=50\)) and \(G_2\) (\(n_2=50\)). For example, in a risky sexual behavior scenario the rating variables \(X\) and \(Y\) can represent, in two groups of young adults (females and males), the self-reported attitude toward contraceptive use during sexual intercourse and the declared number of sexual partners in the last three months, respectively. Normally, an effect size of \(.3\) denotes a relevant difference between two correlations. However, how sensitive may this result be to possible fake data? Is this effect still observed under one or more scenarios of faking? In this example, we are interested in testing whether the observed correlation difference can still be consistent with a true generative model reflecting an identical moderate correlation \(\rho_1 = \rho_2=.25\) for the two groups. Moreover, we also assume a perturbation process represented by two distinct uniform faking models: \(\pi_{1}^{+}=.2\) and \(\pi_{1}^{-}=.1\) for \(G_1\), and \(\pi_{2}^{+}=.3\) and \(\pi_{2}^{-}=.2\) for \(G_2\). We can easily reformulate this example using Fisher significance testing (Lehmann 1993). More precisely, we can construct the corresponding hypothesis \[\begin{eqnarray} H \ & : & \rho_1 = \rho_2=.25 \ (\Delta=0), \nonumber \\ & & \pi_{1}^{+}= \pi_{2}^{-} = .2, \pi_{1}^{-}=.1,\pi_{2}^{+}=.3, \nonumber \\ & & \gamma_{s}^{+}= \gamma_{s}^{-} = \delta_{s}^{+}= \delta_{s}^{-} = 1, \quad s = 1,2 \nonumber \end{eqnarray}\] and examine whether or not the observed correlation difference \(\hat{\Delta}\) is consistent with \(H\). In particular, we are interested in the \(p\)-value \[Pr[\Delta>\hat{\Delta}|H].\]
The code below illustrates the SGR analysis
> require(polycor)
> set.seed(367)
> obs.stat <- .3; mc.stat <- NULL
> Rmc <- matrix(c(1,.25,.25,1),2)
> PM <- matrix(c(rep(1,100),rep(2,100)),ncol=2,byrow=TRUE)
> Pparm <- list(p=matrix(c(.2,.3,.1,.2),2),gam=matrix(1,2,2),del=matrix(1,2,2))
> for (b in 1:1000) {
+ mcD <- rdatagen(n=100,R=Rmc,Q=5)$data
+ Fx <- partition.replacement(mcD,PM,Pparm=Pparm)
+ for (j in 1:ncol(Fx)) {
+ Fx[,j] <- ordered(Fx[,j])
+ }
+ mcpc1 <- hetcor(Fx[1:50,])$correlations[1,2]
+ mcpc2 <- hetcor(Fx[51:100,])$correlations[1,2]
+ Delta <- mcpc1-mcpc2
+ mc.stat <- c(mc.stat,Delta)
+ }
> hist(mc.stat)
> sum(mc.stat>=obs.stat)/1000
[1] 0.226
An empirical \(p\)-value can be computed by a Monte Carlo experiment. In our example, the test procedure \(\Delta^\ast = \hat{\rho}_1^{\ast} - \hat{\rho}_2^{\ast}\) is replicated 1000 times under the condition of the hypothesis. Next, the approximate \(p\)-value is computed as the proportion of the simulated \(\Delta^\ast\) values which are larger than the observed correlation difference \(.3\). More precisely, for each replicate \(b=1,\ldots,1000\), we first generate a \(100 \times 2\) ordinal data matrix mcD according to the generative model with correlation matrix Rmc. This matrix contains two symmetrically distributed ordinal variables (the default in the rdatagen function). Next, the ordinal matrix is transformed according to the faking models. In particular, the function partition.replacement allows different replacement distributions to be set for the two groups of subjects and returns the perturbed data matrix. This function has three main arguments: Dx=mcD, the data frame or matrix to be replaced; PM, the partition matrix that clusters the observations into the distinct groups; and Pparm, the list of replacement parameters for each of the different faking models. Note that the partition matrix must have the same dimension as the matrix to be replaced and a numeric code for each distinct cluster (group) in the partition. If a cell of the partition matrix contains \(0\), then the corresponding cell value in the original data matrix is not modified (no replacement is applied; see the toy example below). In our example, Pparm is a list containing three elements. Each element is a \(2\) (number of groups) \(\times\) \(2\) (faking positive vs. faking negative) matrix. So, for example, p[1,1] and p[2,1] contain the overall faking positive probabilities for \(G_1\) and \(G_2\), respectively, while p[1,2] and p[2,2] contain the corresponding faking negative probabilities. Similarly, gam[1,1] (resp. gam[1,2]) indicates the first shape parameter for the faking positive (resp. faking negative) model in group \(G_1\). The same scheme applies to the second shape parameter del.
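As a toy illustration of the partition matrix (PM.toy is an illustrative name, unrelated to the analysis above): for four observations and two variables, a first column of zeros means that the first variable is never perturbed, while the second variable is split into two groups.

> # Toy partition matrix: column 1 all zeros (those cells are never replaced);
> # for the second variable, observations 1-2 form group 1 and 3-4 form group 2
> PM.toy <- cbind(rep(0,4), c(1,1,2,2))
> PM.toy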
Figure 3 shows the distribution of the test procedure under \(H\) (approximate \(p\)-value \(=.226\)). According to this distribution, the observed correlation difference \(\hat{\Delta}\) seems consistent with the hypothesis of faking.
Table 2 refers to a real prospective study about illicit drug use among young people aged 14-27 (Pastore, Lombardi, and Mereu 2007). In particular, we evaluated the relationship between age (dichotomized into two categories: adults, \(>17\), and minors) and ecstasy consumption. We expected that each individual would deliberately answer the question either honestly or fraudulently depending on her/his beliefs and intentions which, in turn, could be influenced by the context. How can the researcher evaluate the impact of possible fake answers when trying to provide an overall picture of the investigated phenomenon? Although the example is specific, a similar problem may occur in a variety of situations involving stigmatizing characteristics (e.g., habitual gambling, experience of induced abortion, tax evasion, rash driving, risky sexual behavior).
Table 2: Observed counts of ecstasy consumption (drug) by age group.

age | drug: yes (1) | drug: no (2) |
---|---|---|
adults (1) | 10 | 25 |
minors (2) | 32 | 29 |
The result of a log-linear model of independence for the two-way table showed a significant likelihood-ratio chi-squared statistic (\(G^2_{(1)}=5.29, p<.05\)). Hence the independence assumption was rejected. A quick inspection of the counts in Table 2 reveals that only 29% of adults answered the question affirmatively, whereas more than 50% of minors did. Therefore, we suspected that the adults might have shown a larger level of social desirability (Paulhus 1984) compared to the minors. This might have artificially boosted the observed difference between the two groups. To test this hypothesis we performed a new SGR analysis on the two-way table by assuming (a) a generative model implementing the independence assumption with marginal probability \(\Pr(\mbox{yes}) = .75\) and (b) a fake good model for the variable drug consumption. In general, faking good can be conceptualized as an individual’s deliberate attempt to manipulate or distort responses to create a positive impression (Paulhus 1984). Notice that the faking good (as well as the faking bad) scenario always entails a conditional replacement model in which the conditioning is a function of response polarity. In this application the scenario corresponds to a context in which all fakers respond negatively to the question. Finally, we also assumed two distinct levels of faking for the two groups: \(\pi_{1}^{+}= .8\) for the adults and \(\pi_{2}^{+}= .4\) for the minors. We reformulated the problem within a pure significance test setting: \[\begin{eqnarray} H \ & : & G^2 = 0 \ (\mbox{independence assumption}), \nonumber \\ & & \Pr(\mbox{yes}) = .75, \nonumber \\ & & \pi_{1}^{+}= .8, \ \pi_{2}^{+}= .4, \ \pi_{1}^{-}= \pi_{2}^{-}= .0, \nonumber \\ & & \gamma_{s}^{+}= \delta_{s}^{+}= 1, \ \gamma_{s}^{-}= \delta_{s}^{-}= 0, \quad s = 1,2 \nonumber \end{eqnarray}\] The following code illustrates the SGR analysis
> require(MASS)
> set.seed(367)
> data(smokers)
> ecstasy.table <- table(smokers$drug,smokers$age,dnn=c("drug","age"))
> obs.lrt <- loglm(~drug+age,data=ecstasy.table)$lrt
>
> PM <- matrix(0,nrow(smokers),2)
> PM[smokers$age==1,2] <- 1
> PM[smokers$age==2,2] <- 2
> Pparm <- list(p=matrix(c(.8,.4,0,0),2),gam=matrix(c(1,1,0,0),2),
+ del=matrix(c(1,1,0,0),2))
> mc.lrt <- NULL
> for (b in 1:1000) {
+ smokers$simdrug <- rdatagen(nrow(smokers),R=matrix(1),Q=2,
+ probs=list(c(.75,.25)))$data
+ Fx <- partition.replacement(smokers[,c("age","simdrug")],PM,Pparm=Pparm)
+ mc.lrt <- c(mc.lrt,loglm(~simdrug+age,data=table(Fx$simdrug,Fx$age,
+ dnn=c("simdrug","age")))$lrt)
+ }
> hist(mc.lrt)
> sum(mc.lrt>=obs.lrt)/1000
[1] 0.812
Note that for dichotomous variables (\(q=2\)) all faking positive models reduce to the following uniform conditional replacement distribution (Pastore and Lombardi 2014).
\[\begin{eqnarray} p_{k|h} = \left\{ \begin{array}{cc} 1 , & h = k =2 \\ \pi^+ , & h = 1, k = 2 \\ 1 - \pi^+, & h=k=1 \\ 0, & h=2, k = 1 \end{array} \right. \label{eq:BOOLPD} \end{eqnarray} \tag{4} \]
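For concreteness, the \(2 \times 2\) matrix of Eq. (4) can be written out directly; the value \(\pi^+ = .4\) below is purely illustrative.

> # Conditional replacement matrix of Eq. (4): rows index the original value h,
> # columns the fake value k; pi+ = .4 is an illustrative value
> pi.pos <- .4
> matrix(c(1-pi.pos, pi.pos,
+          0,        1),
+        nrow=2, byrow=TRUE, dimnames=list(h=c("1","2"), k=c("1","2")))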
Figure 4 shows the distribution of the test procedure under the hypothesis (approximate \(p\)-value \(=.812\)). According to the approximate \(G^2\) distribution the observed likelihood ratio seems consistent with the hypothesis of faking.
In this application we extend the results reported in the second example by analyzing a new set of ordinal data about illicit drug use among young people (see Table 3). This new two-way table relates an independent categorical variable, age (minors, < 18 years old, vs. adults), to a dependent ordinal variable, cannabis consumption. In particular, the dependent variable uses a four-point ordinal scale ranging from never (1) to often (4), with intermediate levels once (2) and some times (3). When response categories are ordered, logit models can directly incorporate the ordering (Agresti 1990). In general, this results in model representations with simpler interpretations than ordinary multicategory logit models, at least when the proportional odds assumption holds.
Table 3: Observed counts of cannabis consumption by age group (1 = never, 2 = once, 3 = some times, 4 = often).

age | cannabis (1) | cannabis (2) | cannabis (3) | cannabis (4) |
---|---|---|---|---|
adults (1) | 20 | 5 | 7 | 0 |
minors (2) | 27 | 5 | 18 | 10 |
The following code illustrates the results of applying an ordered logistic model to the data in Table 3. For the analysis we used the function polr in the MASS package (Venables and Ripley 2002), which fits a logistic or probit regression model to an ordered factor response.
> Y <- data.frame(list(age=gl(2,4),response=gl(4,1,8,ordered=TRUE),
+ counts=c(20,5,7,0,27,5,18,10)))
> fit0 <- polr(response~1,data=Y,weight=counts)
> fit1 <- polr(response~age,data=Y,weight=counts)
> lrt.obs <- -2*(logLik(fit0)-logLik(fit1))
The likelihood ratio statistic \(\Lambda = -2(L_0-L_1)\) for the observed sample showed a significant result (\(\Lambda_{(3)} =5.22, p < .05\)). Hence, the independence assumption was rejected in the logit model. Regarding the model of faking, in this application too we expected individuals’ responses to be subject exclusively to faking good manipulations. However, unlike the previous example, this time we speculated that only the group of adults showed a social desirability bias, whereas the minors’ responses were assumed not to be fake dependent (\(\pi_2 = \pi_{2}^{+} + \pi_{2}^{-}=0\)). In particular, we supposed that the adults showed a moderate level of faking good (\(10\%\)) and that their responses were characterized by a slight faking behavior (see Table 1). Note that, because of the meaning of the categories of the ordinal scale for cannabis consumption, in this application the faking good manipulations are modelled by means of the fake negative parameters (\(\pi^-\)). Finally, for the data generation process we constructed a generative model under the assumption of no relation between age and cannabis consumption (\(\Lambda=0\)) and with true response proportions equal to the empirical response proportions for the group of minors. We can collect all this information in the following hypothesis: \[\begin{eqnarray} H \ & : & \Lambda = 0 \ (\mbox{independence assumption}), \nonumber \\ & & \Pr(\mbox{1}) = .45, \Pr(\mbox{2}) = .08, \nonumber \\ & & \Pr(\mbox{3}) = .30, \Pr(\mbox{4}) = .17, \nonumber \\ & & \pi_{1}^{-}= .1, \ \pi_{2}^{-}= \pi_{1}^{+}= \pi_{2}^{+}= .0, \nonumber \\ & & \gamma_{1}^{-}= 1.5, \ \delta_{1}^{-}= 4, \nonumber \\ & & \mbox{all the other shape parameters are set to 0} \nonumber \end{eqnarray}\] The following code illustrates the SGR analysis
> set.seed(367)
> Z <- na.omit(smokers[,c("age","druguse")])
> PM <- matrix(0,nrow(Z),ncol(Z))
> PM[Z$age==1,2] <- 1
> lrt.mc <- NULL
> for (b in 1:1000) {
+ Z$simdrug <- rdatagen(nrow(Z),R=matrix(1),Q=4,
+ probs=list(c(27,5,18,10)/60))$data
+ Dx <- Z[,-2]
+ Fx <- partition.replacement(Dx,PM,p=matrix(c(0,.1),1),fake.model="slight")
+ Tmc <- table(Fx$age,Fx$simdrug)
+ Ymc <- data.frame(list(age=gl(2,4),response=gl(4,1,8,ordered=TRUE),
+ counts=c(Tmc[1,],Tmc[2,])))
+ fit0 <- polr(response~1,data=Ymc,weight=counts)
+ fit1 <- polr(response~age,data=Ymc,weight=counts)
+ lrt.mc <- c(lrt.mc,-2*(logLik(fit0)-logLik(fit1)))
+ }
> sum(lrt.mc>=lrt.obs)/1000
[1] 0.039
Figure 5 shows the distribution of the test procedure under the hypothesis. This time the observed likelihood ratio statistic does not seem consistent with \(H\) (approximate \(p\)-value \(=.039\)). In substantive terms, the observed association between age and cannabis consumption cannot be explained by an independent generative model combined with slight faking good manipulations in the adult group.
In this section we provide a full exploratory SGR analysis for the data presented in Table 3. In particular, we show how it is possible to vary the parameters (\(\gamma_{1}^{-},\delta_{1}^{-}\)) of the fake negative distribution and evaluate how these changes affect the approximate \(p\)-value. Figure 6 shows the contour plot of the approximate \(p\)-value as a function of different levels of the shape parameters \(\gamma_{1}^{-}\) and \(\delta_{1}^{-}\) in the group of adults. More specifically, the value of parameter \(\gamma_{1}^{-}\) was varied across 21 distinct levels from \(0.5\) to \(5.5\) (in steps of \(0.25\)). The same set of values was also applied to the second shape parameter \(\delta_{1}^{-}\). In this application we also changed the overall probability of faking good \(\pi^-\) by replacing the original value \(0.1\) (used in the previous example) with the new value \(0.2\). By contrast, all the other parameter values were left unchanged in the SGR simulation, keeping the same values reported in the previous analysis. The results show how the value of the observed statistic, \(\Lambda = 5.22\), is consistent with an independent true model (\(\Lambda = 0\)) that has been corrupted by a moderate amount of faking good perturbation (20%) and that is also characterized by an extreme faking pattern in the replacement distribution. This is evident from a quick inspection of Figure 6, where the parameter assignments that turned out to be consistent with the earlier faking hypothesis are restricted to the left portion of the main diagonal (\(\gamma_{1}^{-} < \delta_{1}^{-}\)) of the graphical representation. By contrast, the parameter assignments corresponding to the right portion of the main diagonal (\(\gamma_{1}^{-} > \delta_{1}^{-}\)) are not consistent with the hypothesis. Note that these latter values represent slight faking configurations in the replacement distribution. In sum, the results are in line with a moderate faking good process characterized by a general property of extremeness in the way the original true values are replaced with fake ones. That is to say, the chance of replacing an original true value with a lower value seems to increase as a function of the distance between the two values.
In what follows we present a short code example that the reader may easily adapt to set the desired values of the parameters in the simulation study (shape parameters, overall probabilities of faking, number of runs in the SGR simulations). Note that in this exploratory setting the overall time required to complete the SGR simulation may vary widely according to the complexity (e.g., the number of different parameter values) of the simulation design.
> data(smokers)
> Z <- na.omit(smokers[,c("age","druguse")])
>
> fit0 <- polr(ordered(druguse)~1,data=Z)
> fit1 <- polr(ordered(druguse)~age,data=Z)
> lrt.obs <- -2*(logLik(fit0)-logLik(fit1)) # observed LRT
>
> ### SGR algorithm
> PI <- .2; B <- 10 # for real simulations set B at least 500
> lrt.mc <- ga.mc <- de.mc <- p.mc <- NULL
> PM <- matrix(0,nrow(Z),ncol(Z)) # partition matrix
> PM[Z$age==1,2] <- 1
>
> for (GA in seq(.5,5.5,.5)) {  # the analysis reported in the text used a finer .25 step (21 levels)
+ for (DE in seq(.5,5.5,.5)) {
+
+ Pparm <- list(p=matrix(c(0,PI),1),gam=matrix(c(0,GA),1),del=matrix(c(0,DE),1))
+
+ for (b in 1:B) {
+ Z$simdrug <- rdatagen(nrow(Z),R=matrix(1),Q=4,
+ probs=list(c(27,5,18,10)/60))$data
+ Dx <- Z[,-2]
+ Fx <- partition.replacement(Dx,PM,Pparm=Pparm)
+
+ Tmc <- table(Fx$age,Fx$simdrug)
+ Ymc <- data.frame(list(age=gl(2,ncol(Tmc)),response=gl(ncol(Tmc),1,
+ ordered=TRUE,labels=colnames(Tmc)),counts=c(Tmc[1,],Tmc[2,])))
+
+ fit0 <- polr(response~1,data=Ymc,weight=counts)
+ fit1 <- polr(response~age,data=Ymc,weight=counts)
+ statval <- -2*(logLik(fit0)-logLik(fit1))
+ lrt.mc <- c(lrt.mc,statval)
+
+ ga.mc <- c(ga.mc,GA); de.mc <- c(de.mc,DE)
+ p.mc <- c(p.mc,ifelse(statval>lrt.obs,1,0))
+ }
+ }
+ }
> LRT <- data.frame(list(gam=ga.mc,del=de.mc,lrt=lrt.mc))
> aggregate(p.mc,list(gam=LRT$gam,del=LRT$del),mean)
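A contour plot along the lines of Figure 6 could then be drawn from the aggregated proportions. The following sketch is one possible way to do this (pmap, GAvals, DEvals, and z are illustrative names; the published figure may have been produced differently).

> # Reshape the aggregated p-values into a gamma x delta grid and draw contours
> pmap <- aggregate(p.mc, list(gam=LRT$gam, del=LRT$del), mean)
> GAvals <- sort(unique(pmap$gam)); DEvals <- sort(unique(pmap$del))
> z <- matrix(pmap$x[order(pmap$del, pmap$gam)], nrow=length(GAvals))
> contour(GAvals, DEvals, z, xlab=expression(gamma[1]^"-"), ylab=expression(delta[1]^"-"))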
This paper illustrated the usage of a new R package, sgr, for simulating and analyzing ordinal fake data. To the best of our knowledge, sgr is the first statistical package devoted to the analysis of fake data. The essential characteristic of this approach is its explicit use of mathematical models and appropriate probability distributions for quantifying uncertainty in inferences based on possible fake data. Moreover, it involves the derivation of new statistical results as well as the evaluation of their implications: Are the substantive conclusions reasonable? How sensitive are the results to the modeling assumptions about the process of faking? In sum, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking. In this contribution we illustrated the use of sgr on three simple scenarios of faking. More complex examples of SGR applications can be found in Lombardi and Pastore (2012) and Pastore and Lombardi (2014).
As with many Monte Carlo-type approaches, SGR involves simplifying assumptions that may result in lower external validity. For example, one relevant limitation regards the assumption that restricts the conditional replacement distribution to satisfy the CI property. Unfortunately, this restriction clearly limits the range of empirical faking processes that can be mimicked by the current SGR simulation procedure. In particular, because the replacement distribution under the CI assumption acts as a perturbation process on the original data, the resulting fake data sets will in general show covariance patterns that are (on average) weaker than those observed for the original uncorrupted data. This may not be a serious problem, as different studies have shown that self-report measures collected under faking-motivating conditions tend to have smaller variances and lower reliability (covariance) estimates than measures collected under uncorrupted conditions (Eysenck, Eysenck, and Shaw 1974; Topping and O’Gorman 1997; Ellingson, Smith, and Sackett 2001; Hesketh, Griffin, and Grayson 2004). However, opposite results have also been observed, where simple fake good instructions tend to increase the intercorrelations between the manipulated or faked items (Ellingson, Sackett, and Hough 1999; Zickar and Robie 1999; Pauls and Crost 2005; Ziegler and Buehner 2009; Galic, Jerneic, and Kovacic 2012). Therefore, although encouraging, the promise of this approach should be examined across more varied conditions. We acknowledge that more work still needs to be done. We are in the process of extending sgr to include new replacement distributions, beyond the ones presented in this article, that will make it possible to modulate different levels of correlational patterns in the simulated fake data.