sgr: A Package for Simulating Conditional Fake Ordinal Data

Many self-report measures of attitudes, beliefs, personality, and pathology include items that respondents can easily manipulate. For example, an individual may deliberately distort responses to simulate grossly exaggerated physical or psychological symptoms in order to reach specific goals such as obtaining financial compensation, avoiding being charged with a crime, avoiding military duty, or obtaining drugs. This article introduces the package sgr, which can be used to perform fake data analysis according to the sample generation by replacement approach. The package includes functions for making simple inferences about discrete/ordinal fake data. It allows users to quantify uncertainty in inferences based on possible fake data and to study the implications of fake data for empirical results.


Introduction
How can we evaluate the impact of fake information in real-life contexts? In nature, some individuals tend to distort their behaviors or actions in order to reach specific goals. In some species, for example, wimpy animals may conceal their real social value by faking a higher status to deceive competitors. Similarly, in personnel selection some job applicants may misrepresent themselves on a personality test, hoping to increase the likelihood of being offered a job. Being able to discriminate between honest and fraudulent signals, and to evaluate the impact of counterfeit elements, crucially depends on the way we can reason about the whole process of faking. A coherent knowledge of the type or structure of faking processes may lead to stronger inferences that lie on or close to what we may call the genuine, but probably hidden, representation of a manifest behavior. In general, fake data may alter a large variety of self-report measures. This problem is particularly relevant for discrete/ordinal data collected in sensitive contexts such as risky sexual behaviors, drug addiction, tax evasion, political preferences, financial compensation, and personnel selection. More generally, researchers interested in the study of human behavior in areas like psychology (Hopwood et al., 2006), organizational and social science (Van der Geest and Sarkodie, 1998), psychiatry (Beaber et al., 1985), forensic medicine (Gray et al., 2003), scientific fraud (Marshall, 2000), and economics (Crawford, 2003) may face the fake data problem when analyzing and interpreting empirical data.
In this article, we discuss the sgr package that we have developed for running fake data analysis according to the sample generation by replacement (SGR) approach (Lombardi and Pastore, 2012). SGR is a data simulation procedure that generates artificial samples of fake discrete/ordinal data. The main characteristic of the SGR approach is that it allows detailed exploration of the outcomes produced by particular sets of faking assumptions. By changing the values of the faking model parameters and showing the effect on the outcome of a model, SGR provides a what-if analysis of the faking scenarios. Therefore, SGR can be used to quantify uncertainty in inferences based on possible fake data as well as to evaluate the implications of fake data for statistical results. To illustrate, consider a researcher interested in studying the relationship between therapy noncompliance indicators (e.g., forgetting the treatment) and unsafe behavior indicators (e.g., drinking alcohol) in a group of liver transplant patients. Generally, patients diagnosed with alcohol dependence who follow a pharmaceutical regimen after the liver transplant tend to deliberately give fraudulent answers to a question about drinking alcohol, because of the required abstinence from ethanol and social desirability factors (e.g., Foster et al., 1997). In this context, an SGR analysis can help in testing for the potential influence of faking in the drinking alcohol self-report measure on the strength of the relationship between therapy noncompliance and unsafe behavior indicators. More specifically, how sensitive are the empirical associations to possible fake observations in the drinking alcohol self-report measure? Are the conclusions still valid under one or more scenarios of faking (e.g., slight, moderate, and extreme faking) for the drinking alcohol variable?
In general, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking and the underlying true model representation. This makes SGR related in spirit to other statistical approaches such as, for example, uncertainty and sensitivity analysis (Helton et al., 2006) and prospective power analysis (Cohen, 1988) which are all characterized by an attempt to directly quantify uncertainty of general statistics computed on the data by means of specific hypotheses.
The rest of the paper is organized as follows. The next section reviews the SGR framework and its basic implementations using the sgr package. The following section provides three examples illustrating the application of sgr to faking scenarios. The final section discusses limitations and future implementations in sgr components beyond the general scheme presented here.

The SGR framework
SGR is characterized by a two-stage sampling procedure based on two distinct generative models: the model defining the process that generates the data prior to any fake perturbation (data generation process) and the model representing the faking process to perturb the data (data replacement process). By repeatedly sampling data from the SGR procedure we can generate the so called fake data sample (FDS) and eventually study the distribution of some relevant statistics computed on these simulated data samples. In SGR the data generation process is modeled by means of standard Monte Carlo procedures for ordinal data whereas the data replacement process is implemented using ad hoc probabilistic faking models. In sum, the overall generative process is split into two conceptually independent and possibly simpler components (divide and conquer strategy).
With regard to the fake-data problem in general, we think of the data in the generation process as being represented by an $n \times m$ matrix $D$, that is to say, $n$ i.i.d. observations (hypothetical participants) each containing $m$ elements (the hypothetical participant's responses). We assume that entry $d_{ij}$ of $D$ ($i = 1, \ldots, n$; $j = 1, \ldots, m$) takes values on a small ordinal range $V_q = \{1, \ldots, q\}$ (for the sake of simplicity, in this presentation we assume identical ordinal scales). In particular, let $d_i$ be the $(1 \times m)$ array of $D$ denoting the pattern of responses of participant $i$. The response pattern $d_i$ is a multidimensional ordinal random variable with probability distribution $p(d_i \mid \theta_D)$, where $\theta_D$ indicates the vector of parameters of the probabilistic model for the data generation process. The main idea of the replacement approach is to construct a new $n \times m$ ordinal data matrix $F$, called the fake data matrix of $D$, by manipulating each element $d_{ij}$ in $D$ according to a replacement probability distribution. Let $f_i$ be the $(1 \times m)$ array of $F$ denoting the replaced pattern of fake responses of participant $i$. The fake response pattern $f_i$ is a multidimensional ordinal random variable with conditional replacement probability distribution
$$p(f_i \mid d_i, \theta_F),$$
where $\theta_F$ indicates the vector of parameters of the probabilistic faking model.
It is important to note that in the standard SGR framework the replacement distribution $p(f_i \mid d_i, \theta_F)$ is restricted to satisfy the conditional independence (CI) assumption (see Lombardi and Pastore, 2012; Pastore and Lombardi, 2014). More precisely, in the replacement distribution each fake response $f_{ij}$ only depends on the corresponding data observation $d_{ij}$ and the model parameter $\theta_F$. Therefore, because the patterns of fake responses are also i.i.d. observations, the simulated data array $(D, F)$ is drawn from the joint probability distribution
$$p(D, F \mid \theta_D, \theta_F) = \prod_{i=1}^{n} p(d_i \mid \theta_D) \prod_{j=1}^{m} p(f_{ij} \mid d_{ij}, \theta_F).$$
In the last section of this article we will discuss some potential limitations of the conditional independence assumption in real application domains of the SGR approach.

Data generation process
In general, several options are available to represent the data generation process (Muthén, 1984; Jöreskog and Sörbom, 1996; Moustaki and Knott, 2000; Samejima, 1969). In the current version of the sgr package we implemented a procedure based on the multivariate latent variable framework which is called the underlying variable approach (UVA; Muthén, 1984; Jöreskog and Sörbom, 1996). This approach assumes that the observed ordinal variables are treated as metric through assumed underlying normal variables. In particular, we assume that there exists a continuous data matrix $D^*$ underlying the ordinal data matrix $D$. Let $d^*_i$ be the $(1 \times m)$ array of $D^*$ denoting the pattern of underlying continuous values of the $i$th observation. It is convenient to let $d^*_i$ have the multivariate standard normal distribution with density function $\phi(0, R)$, where $R$ denotes the $(m \times m)$ model correlation matrix. The connection between the ordinal variable $d_{ij}$ and the underlying variable $d^*_{ij}$ in $D^*$ is given by
$$d_{ij} = h \iff \tau^{j}_{h-1} < d^{*}_{ij} < \tau^{j}_{h},$$
with $h = 1, \ldots, q$; $i = 1, \ldots, n$; $j = 1, \ldots, m$ and where
$$-\infty = \tau^{j}_{0} < \tau^{j}_{1} < \cdots < \tau^{j}_{q-1} < \tau^{j}_{q} = +\infty$$
are threshold parameters. Therefore, the joint probability of $d_i = (h_1, \ldots, h_m)$ is given by
$$p(d_i \mid \theta_M) = \int_{\tau^{1}_{h_1 - 1}}^{\tau^{1}_{h_1}} \cdots \int_{\tau^{m}_{h_m - 1}}^{\tau^{m}_{h_m}} \phi(z; 0, R)\, dz,$$
with $\theta_M = (\tau, R)$ and $z = (z_1, \ldots, z_m)$ being the parameter vector of the original data generation model and the values for the continuous variables, respectively.
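The threshold mapping between the underlying continuous values and the observed ordinal levels described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used inside sgr: the Cholesky-based sampler, the use of findInterval(), and the object names tau, Dstar and D are our own choices, and the result will not reproduce the output of rdatagen.

```r
# Base-R sketch of the underlying variable approach (not the sgr implementation).
set.seed(367)
n <- 100
R <- matrix(c(1, .4, .4, 1), 2, 2)            # model correlation matrix

# Interior thresholds tau_1, ..., tau_{q-1} for each variable (q = 5);
# tau_0 = -Inf and tau_q = +Inf are implicit in findInterval().
tau <- list(qnorm(c(0.04, 0.27, 0.73, 0.96)),
            qnorm(c(0.06, 0.31, 0.69, 0.94)))

# Step 1: continuous data D* with rows drawn from N(0, R).
# chol(R) returns U with t(U) %*% U = R, so Z %*% U has correlation R.
Z     <- matrix(rnorm(n * 2), n, 2)
Dstar <- Z %*% chol(R)

# Step 2: d_ij = h whenever d*_ij lies between tau_{h-1} and tau_h.
D <- sapply(1:2, function(j) findInterval(Dstar[, j], tau[[j]]) + 1L)

# Cross-tabulation of the two simulated five-point variables.
table(D[, 1], D[, 2])
```

Keeping the two steps separate mirrors the split between the continuous model (R) and the measurement model (thresholds), so either can be changed without touching the other.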
In SGR the data generation process is obtained by first generating the continuous data $D^*$ according to a model correlation matrix $R$ and then transforming it to its discrete counterpart $D$ using the model thresholds $\tau$. In the following example, we used the sgr function rdatagen to sample n = 100 random observations from a data generation model with two symmetrically distributed ordinal variables with five levels each and correlation value .4.
> library(sgr)
> require(MASS)
> require(polycor)
> set.seed(367)
> R <- matrix(c(1,.4,.4,1),2,2)
> th <- list(c(-Inf,qnorm(c(0.04,0.27,0.73,0.96)),Inf),
+            c(-Inf,qnorm(c(0.06,0.31,0.69,0.94)),Inf))
> Dx <- rdatagen(n=100,R=R,Q=c(5,5),th=th)
> Dx$data
In this example, the threshold values are derived from the quantiles of the standard normal distribution in such a way that the second simulated variable shows a slightly larger variance than the first simulated variable (its extreme categories carry more probability mass). Generally, the threshold values can be derived in two different ways. In the first case, we can use empirically based knowledge (e.g., an already existing data set) to estimate the threshold values on the basis of the observed distribution function of the levels of the discrete variable (e.g., Jöreskog and Sörbom, 1996). In the second case, some simple statistical knowledge can be used to simulate threshold values according to desired properties. For example, the cumulative probabilities can be taken from a binomial cumulative distribution function and then converted to thresholds with the normal quantile function, as for the second variable above (e.g., Jöreskog and Sörbom, 1996). In the rdatagen function call the parameter Q specifies the number of levels for each ordinal variable. To compare the model correlation matrix R with the sample polychoric correlation, we can use the polychor function in the polycor package (Fox, 2010):
> d1 <- factor(Dx$data[,1],ordered=TRUE)
> d2 <- factor(Dx$data[,2],ordered=TRUE)
> polychor(d1,d2,ML=TRUE,std.err=TRUE)
Polychoric Correlation, ML est. = 0.3627 (0.09832)
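To make the role of the thresholds more concrete, the following base-R sketch computes the category probabilities and the variances implied by the two sets of cumulative probabilities used in the example. The helper names level_probs and level_var are hypothetical and are not part of sgr; the comparison with a Binomial(4, 0.5) distribution is our own check of how the second set of values can be obtained.

```r
# Cumulative probabilities behind the two threshold vectors of the example.
cum1 <- c(0.04, 0.27, 0.73, 0.96)
cum2 <- c(0.06, 0.31, 0.69, 0.94)

# The second set matches the rounded CDF of a Binomial(4, 0.5) variable.
binom_cdf <- pbinom(0:3, size = 4, prob = 0.5)
round(binom_cdf, 2)                      # 0.06 0.31 0.69 0.94

# Hypothetical helpers: probabilities of the levels 1..q and their variance.
level_probs <- function(cum) diff(c(0, cum, 1))
level_var <- function(cum) {
  p <- level_probs(cum)
  x <- seq_along(p)
  sum(p * x^2) - sum(p * x)^2
}

level_probs(cum1)                        # 0.04 0.23 0.46 0.23 0.04
level_probs(cum2)                        # 0.06 0.25 0.38 0.25 0.06
v1 <- level_var(cum1)                    # 0.78
v2 <- level_var(cum2)                    # 0.98
```

Both distributions are symmetric around the central level 3; they differ only in how much probability mass is placed on the extreme categories.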

Data replacement process
To generate the fake ordinal data we used a parametrized replacement distribution based on a discrete beta kernel (Pastore and Lombardi, 2014). Some examples of replacement distributions are shown in Figure 1. Let $p_{k|h} \equiv p(k \mid h, \theta_F)$ be the conditional probability of replacing an original ordinal value $h$ with the new ordinal value $k$. In general, $\theta_F$ represents hypothetical a priori knowledge about the distribution of faking (e.g., the chance of observing a fake observation in the data) or empirically based knowledge about the process of faking (e.g., the direction of faking, fake good vs. fake bad).
The conditional replacement distribution can be described according to the following equation
$$p(k \mid h, \theta_F) = \begin{cases} DG(k; h+1, q, \gamma^{+}, \delta^{+})\,\pi^{+}, & 1 \le h < k \le q \\ DG(k; 1, h-1, \gamma^{-}, \delta^{-})\,\pi^{-}, & 1 \le k < h \le q \\ 1 - (\pi^{+} + \pi^{-}), & 1 < h = k < q \\ 1 - \pi^{+}, & h = k = 1 \\ 1 - \pi^{-}, & h = k = q \end{cases} \qquad (5)$$
with DG being the generalized beta distribution for discrete variables (Pastore and Lombardi, 2014). Note that in Eq. (5), the function DG is used with two different sets of parameters. More precisely, in the first line the function DG models the behavior of the faking distribution for fake positive values ($k > h$) by means of the governing shape parameter $\theta^{+}_F = (\gamma^{+}, \delta^{+})$ with bounds ($a^{+} = h + 1$, $b^{+} = q$). By contrast, the second line represents the behavior of the faking distribution for fake negative values ($k < h$) modelled by the governing shape parameter $\theta^{-}_F = (\gamma^{-}, \delta^{-})$ with bounds ($a^{-} = 1$, $b^{-} = h - 1$). Some examples of faking models with their parameter assignments are reported in Table 1 (see also Pastore and Lombardi, 2014). In general, the governing shape parameters $\theta^{+}_F$ and $\theta^{-}_F$ must be strictly positive. In particular, if $\gamma^{+} = \delta^{+} = 1$, the right part of the replacement distribution reduces to a uniform support fake positive distribution (Fig. 1, first column). By contrast, if $1 \le \gamma^{+} < \delta^{+}$ (resp. $1 \le \delta^{+} < \gamma^{+}$), the model mimics asymmetric faking configurations corresponding to slight positive shifts (resp. extreme positive shifts) in the value of the original response (Fig. 1, second and third columns). More specifically, in the slight positive faking configuration the chance to replace an original value $h$ with another greater value $k$ decreases as a function of the distance between $k$ and $h$. By contrast, in the extreme positive faking configuration the chance to replace an original value $h$ with another greater value $k$ increases as a function of the distance between $k$ and $h$. Unlike the asymmetric configurations (slight faking and extreme faking), the uniform support distribution ($\gamma^{+} = \delta^{+} = 1$) mimics a kind of random world model that can be used whenever we believe we are dealing with purely random fake data. This principle requires the simplest quantitative representation for the replacement process and reflects the lack of information about the distributional properties of the faking behavior. Similar configurations can also be described for the left part of the replacement distribution, which represents the negative faking process ($k < h$). However, for this latter component the ordinal relation characterizing the shape parameters must be reversed (see Table 1).
Figure 1: Examples of replacement distributions (see Table 1). For each example the overall probabilities are $\pi^{+} = .6$ and $\pi^{-} = .2$. Each row in the graphical representation corresponds to a different original 7-point discrete value $h$.
Finally, in the conditional replacement distribution the parameters $\pi^{+}$ and $\pi^{-}$ denote the overall probability of faking positive and the overall probability of faking negative, respectively. These probabilities act as weights to rescale the discrete beta distribution DG such that $\pi = \pi^{+} + \pi^{-} \le 1$. In general, $\pi^{+}$ and $\pi^{-}$ represent a priori or empirically based knowledge about the distribution of faking for the two components (e.g., the chance of observing a positive or negative fake observation in the data). The third, fourth, and fifth lines of Eq. (5) show the probability of non-replacement ($k = h$). Note that, if we set $\pi^{+} = 0$ (resp. $\pi^{-} = 0$), then the replacement model boils down to a pure faking negative (resp. positive) model, which corresponds to a context in which responses are exclusively subject to negative (resp. positive) faking (see Fig. 2).
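For the uniform support case ($\gamma = \delta = 1$), the replacement distribution can be written down directly, because the probabilities $\pi^{+}$ and $\pi^{-}$ are simply spread evenly over the values above and below $h$. The following base-R sketch builds the corresponding $q \times q$ matrix of replacement probabilities and applies it to some data. The functions uniform_replacement and apply_replacement are hypothetical helpers written for this illustration; they are not the sgr functions and do not cover the slight or extreme configurations, which require the DG kernel.

```r
# Hypothetical helper: replacement probabilities p(k | h) under uniform
# support faking (gamma = delta = 1). Rows index h, columns index k.
uniform_replacement <- function(q, p_pos, p_neg) {
  stopifnot(p_pos >= 0, p_neg >= 0, p_pos + p_neg <= 1)
  P <- matrix(0, q, q,
              dimnames = list(as.character(1:q), as.character(1:q)))
  for (h in 1:q) {
    up   <- which(1:q > h)                  # candidate fake positive values
    down <- which(1:q < h)                  # candidate fake negative values
    if (length(up) > 0)   P[h, up]   <- p_pos / length(up)
    if (length(down) > 0) P[h, down] <- p_neg / length(down)
    P[h, h] <- 1 - sum(P[h, ])              # probability of non-replacement
  }
  P
}

# Hypothetical helper: replace each entry of D independently, given P.
apply_replacement <- function(D, P) {
  Fx <- D
  Fx[] <- vapply(D, function(h) sample.int(nrow(P), 1, prob = P[h, ]),
                 integer(1))
  Fx
}

set.seed(1)
P  <- uniform_replacement(q = 5, p_pos = 0.4, p_neg = 0)  # pure positive faking
D  <- matrix(sample(1:5, 200, replace = TRUE), 100, 2)
Fx <- apply_replacement(D, P)
round(P, 2)
```

With p_neg = 0 the matrix is upper triangular, so a fake value can never fall below the original one; the last row has all its mass on the diagonal because the top category cannot be exaggerated further.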
In the following example, we applied a pure (slight) positive faking model (see Table 1) to generate a fake data matrix F from the original data matrix D.
We used the replacement.matrix function to construct the conditional replacement probability distribution and saved the result in the variable RM, which is used as the argument of the data replacement generation function rdatarepl. Note that the argument fake.model in the replacement.matrix function allows the user to set the options reported in Table 1.

Illustrative examples
By way of illustration we consider three simple SGR examples. The first is for the evaluation of a correlational analysis computed on five-point rating data. This example is hypothetical and serves to introduce the main features and functions implemented in the sgr package. The second application considers real data about illicit drug use among young people aged 14 to 27. This second example shows how to model directional faking hypotheses (e.g., faking good or faking bad). It is also important because it illustrates how the replacement functions can be applied to dichotomous data. Finally, the third application extends the second example by analyzing a new set of data about cannabis consumption in young people using log-linear models for ordinal data.

Example 1
We begin with a simple SGR analysis of a hypothetical observed difference ($\hat{\Delta} = \hat{\rho}_1 - \hat{\rho}_2 = .3$) between two ordinal correlations computed on two five-point rating variables X and Y for two groups of subjects, $G_1$ ($n_1 = 50$) and $G_2$ ($n_2 = 50$). For example, in a risky sexual behaviors scenario the rating variables X and Y can represent, in two groups of young adults (females and males), the self-reported attitude to contraceptive use during sexual intercourse and the declared number of sexual partners in the last three months, respectively. Normally, an effect size of .3 denotes a relevant difference between two correlations. However, how sensitive may this result be to possible fake data? Is this effect still observed under one or more scenarios of faking? In this example, we are interested in testing whether the observed correlation difference can still be consistent with a true generative model reflecting an identical moderate correlation $\rho_1 = \rho_2 = .25$ for the two groups. Moreover, we also assume a perturbation process represented by two distinct uniform faking models: $\pi^{+}_1 = .2$ and $\pi^{-}_1 = .1$ for $G_1$, and $\pi^{+}_2 = .3$ and $\pi^{-}_2 = .2$ for $G_2$. We can easily reformulate this example as a Fisher significance test (Lehmann, 1993). More precisely, we can construct the corresponding hypothesis
$$H: \rho_1 = \rho_2 = .25, \quad \pi^{+}_1 = .2, \quad \pi^{-}_1 = .1, \quad \pi^{+}_2 = .3, \quad \pi^{-}_2 = .2.$$
An empirical p-value can be computed by a Monte Carlo experiment. In our example, the test statistic $\Delta^{*} = \hat{\rho}^{*}_1 - \hat{\rho}^{*}_2$ is replicated 1000 times under the condition of the hypothesis. Next, the approximate p-value is computed as the proportion of the simulated $\Delta^{*}$ values which are larger than the observed correlation difference .3. More precisely, for each replicate $b = 1, \ldots, 1000$, we first generate a $100 \times 2$ ordinal data matrix mcD according to the generative model with correlation matrix Rmc. This matrix contains two symmetrically distributed ordinal variables (the default in the rdatagen function).
Next, the ordinal matrix is transformed according to the faking models.
In particular, the function partition.replacement allows the user to set different replacement distributions for the two groups of subjects and returns the perturbed data matrix. This function has three main arguments: Dx=mcD, the data frame or matrix to be replaced; PM, the partition matrix to cluster the observations into the distinct groups; Pparm, the list of replacement parameters for each of the different faking models. Note that the partition matrix must have the same dimension as the matrix to be replaced and a numeric code for each distinct cluster (group) in the partition. If a cell of the partition matrix contains 0, then the corresponding cell value in the original data matrix is not modified (no replacement condition is applied). In our example, Pparm is a list containing three elements (p, gam, and del). Each element is a 2 × 2 matrix whose rows index the direction of faking (positive, negative) and whose columns index the groups. So, for example, p[1,1] and p[1,2] denote the overall faking positive probabilities for $G_1$ and $G_2$, respectively. Similarly, gam[1,1] (resp. gam[2,1]) indicates the first shape parameter for the faking positive (resp. faking negative) model in group $G_1$. The same layout holds for the second shape parameter del. Figure 3 shows the distribution of the test statistic under H (approximate p-value = .226). According to this distribution, the observed correlation difference $\hat{\Delta}$ seems consistent with the hypothesis of faking.
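The structure of the partition matrix and of the parameter list, together with the Monte Carlo logic of this example, can be sketched in base R as follows. This is a hedged illustration and not the code behind Figure 3: gen_ordinal and replace_by_group are hypothetical stand-ins for rdatagen and partition.replacement, they only handle the uniform faking model (gam = del = 1), the symmetric thresholds are an assumption, and Spearman's rank correlation replaces the ordinal correlation used in the paper. The resulting p-value is therefore not expected to match the value reported above.

```r
set.seed(2014)
n1 <- 50; n2 <- 50; q <- 5L

# Partition matrix: same dimension as the 100 x 2 data, code 1 = G1, 2 = G2.
PM <- matrix(rep(c(1, 2), times = c(n1, n2)), nrow = n1 + n2, ncol = 2)

# Parameter list: rows = (positive, negative) faking, columns = (G1, G2).
Pparm <- list(p   = matrix(c(.2, .1, .3, .2), 2, 2),
              gam = matrix(1, 2, 2),     # uniform model: both shapes equal 1
              del = matrix(1, 2, 2))

# Hypothetical stand-in for the data generation step (assumed thresholds).
gen_ordinal <- function(n, rho, cum = c(.06, .31, .69, .94)) {
  R <- matrix(c(1, rho, rho, 1), 2, 2)
  Z <- matrix(rnorm(2 * n), n, 2) %*% chol(R)
  matrix(findInterval(Z, qnorm(cum)) + 1L, n, 2)
}

# Hypothetical stand-in for group-wise replacement, uniform faking only.
replace_by_group <- function(D, PM, Pparm, q) {
  Fx <- D
  for (idx in seq_along(D)) {
    g <- PM[idx]
    if (g == 0) next                     # code 0: cell is left untouched
    h <- D[idx]
    u <- runif(1)
    p_pos <- Pparm$p[1, g]
    p_neg <- Pparm$p[2, g]
    if (u < p_pos && h < q) {
      Fx[idx] <- h + sample.int(q - h, 1)        # uniform on h+1, ..., q
    } else if (u >= p_pos && u < p_pos + p_neg && h > 1) {
      Fx[idx] <- sample.int(h - 1L, 1)           # uniform on 1, ..., h-1
    }
  }
  Fx
}

# Monte Carlo distribution of the correlation difference under H.
B <- 1000
g1 <- 1:n1
g2 <- (n1 + 1):(n1 + n2)
delta_star <- replicate(B, {
  mcD <- gen_ordinal(n1 + n2, rho = 0.25)
  mcF <- replace_by_group(mcD, PM, Pparm, q)
  cor(mcF[g1, 1], mcF[g1, 2], method = "spearman") -
    cor(mcF[g2, 1], mcF[g2, 2], method = "spearman")
})
p_value <- mean(delta_star > 0.3)        # proportion exceeding observed .3
```

Storing the group codes in a matrix rather than a vector is what allows individual cells, and not only whole rows, to be exempted from replacement by setting them to 0.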

Example 2
Table 2 refers to a real prospective study about illicit drug use among young people aged 14-27 (Pastore et al., 2007). In particular, we evaluated the relationship between age (dichotomized into two categories: adults, > 17, and minors) and ecstasy drug consumption. We expected that each individual would deliberately answer the question either honestly or fraudulently depending on her/his beliefs and intentions which, in turn, could be influenced by the context. How can the researcher evaluate the impact of possible fake answers when trying to provide an overall picture of the investigated phenomena? Although the example is specific, a similar problem may occur in a variety of situations about stigmatizing characteristics (e.g., habitual gambling, experience of induced abortion, tax evasion, rash driving, risky sexual behavior).

Example 3
In this application we extend the results reported in the second example by analyzing a new set of ordinal data about illicit drug use among young people (see Table 3). This new two-way table relates an independent categorical variable, age (minors, < 18 years old, vs. adults), to a dependent ordinal variable, cannabis consumption. In particular, the dependent variable uses a four-point ordinal scale ranging from never (1) to often (4), with intermediate levels being once (2) and sometimes (3). When response categories are ordered, logit models can directly incorporate the ordering (Agresti, 1990). In general, this results in model representations having simpler interpretations than ordinary multicategory logit models, at least when the proportional odds model holds.
                 cannabis
              (1)   (2)   (3)   (4)
adults (1)     20     5     7     0
minors (2)     27     5    18    10
Table 3: Observed frequencies for testing independence of age and drug use.
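As a purely descriptive illustration of how the ordering of the categories enters the analysis, the cumulative odds implied by the frequencies in Table 3 can be computed in base R. This sketch is not the model fitted in the paper; the object names are our own, and the calculation only shows the quantities that a proportional odds model would constrain to a common ratio across cut points.

```r
# Observed frequencies from Table 3 (rows: age group, columns: cannabis use).
tab <- matrix(c(20, 5,  7,  0,
                27, 5, 18, 10), nrow = 2, byrow = TRUE,
              dimnames = list(age = c("adults", "minors"),
                              cannabis = c("never", "once", "sometimes", "often")))

n          <- rowSums(tab)                       # group sizes: 32 and 60
cum_counts <- t(apply(tab, 1, cumsum))           # counts with response <= j

# Cumulative odds P(Y <= j) / P(Y > j) at the three cut points j = 1, 2, 3.
cum_odds <- cum_counts[, 1:3] / (n - cum_counts[, 1:3])

# Adults-versus-minors cumulative odds ratio at each cut point.
odds_ratio <- cum_odds["adults", ] / cum_odds["minors", ]
odds_ratio
```

The ratio is about 2.04 at the first cut point and 3.125 at the second, while the third is infinite because no adult reported the highest category; a zero cell of this kind is worth keeping in mind when interpreting model-based results for this table.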

Example 3 (continued)
In this section we provide a full exploratory SGR analysis for the data presented in Table 3. In particular, we show how it is possible to vary the parameters ($\gamma^{-}_1$, $\delta^{-}_1$) of the fake negative distribution and evaluate how these changes affect the approximate p-value. Figure 6 shows the contour plot of the approximate p-value as a function of different levels for the shape parameters $\gamma^{-}_1$ and $\delta^{-}_1$ in the group of adults. More specifically, the value of parameter $\gamma^{-}_1$ was varied at 21 distinct levels from 0.5 to 5.5 (by a 0.25 step). The same set of values was also applied for the second shape parameter $\delta^{-}_1$. In this application we also changed the overall probability of faking good $\pi^{-}$ by replacing the original value 0.1 (used in the previous example) with the new value 0.2. By contrast, all the other parameter values were left unchanged in the SGR simulation, keeping the same values reported in the previous analysis. The results show that the value of the observed statistic, Λ = 5.22, is consistent with an independent true model (Λ = 0) that has been corrupted by a moderate amount of faking good perturbation (20%) and which is also characterized by an extreme faking pattern in the replacement distribution. This is evident from a quick inspection of Figure 6, where the parameter assignments that proved consistent with the earlier faking hypothesis are restricted to the left portion of the main diagonal ($\gamma^{-}_1 < \delta^{-}_1$) in the graphical representation. By contrast, the parameter assignments corresponding to the right portion of the main diagonal ($\gamma^{-}_1 > \delta^{-}_1$) are not consistent with the hypothesis. Note that these latter values represent slight faking configurations in the replacement distribution. In sum, the results are in line with a moderate faking good process which is characterized by a general property of extremeness in the way the original true values are replaced with the fake ones in the replacement distribution. That is to say, in general the chance to replace an original true value with another lower value seems to increase as a function of the distance between the two values.
In what follows we present a short code example that the reader may easily manipulate to set the desired values for the parameters in the simulation study (shape parameters, overall probabilities of faking, number of runs in the SGR simulations). Note that in this exploratory setting the overall time required to complete the SGR simulation may vary widely according to the complexity of the simulation design (e.g., the number of different values for the parameters).
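The parameter design described above can be laid out in base R before any simulation is run. The following sketch only builds the grid of shape parameter values and labels each combination; it does not run the SGR simulation itself, and the column names (gam, del, config, p_value) are illustrative choices rather than names used by sgr.

```r
# 21 levels from 0.5 to 5.5 in steps of 0.25 for each shape parameter.
shape_levels <- seq(0.5, 5.5, by = 0.25)
grid <- expand.grid(gam = shape_levels, del = shape_levels)

# For the fake negative component the ordering is reversed with respect to
# the positive one: gam < del is an extreme pattern, gam > del a slight one.
grid$config <- ifelse(grid$gam < grid$del, "extreme",
               ifelse(grid$gam > grid$del, "slight", "diagonal"))

# Placeholder for the approximate p-values, one SGR simulation per row.
grid$p_value <- NA_real_

table(grid$config)
```

The design has 441 cells (210 extreme, 210 slight, and 21 on the main diagonal), so the total running time is roughly 441 times the cost of a single SGR simulation with the chosen number of replicates.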

Summary, limitations, and future works
This paper illustrated the usage of a new R package, sgr, for simulating and analyzing ordinal fake data. As far as we know, sgr is the first statistical package that is devoted to the analysis of fake data. Overall, the essential characteristic of this approach is its explicit use of mathematical models and appropriate probability distributions for quantifying uncertainty in inferences based on possible fake data. Moreover, it involves the derivation of new statistical results as well as the evaluation of the implications of such new results: Are the substantive conclusions reasonable? How sensitive are the results to the modeling assumptions about the process of faking? In sum, SGR takes an interpretation perspective by incorporating in a global model all the available information about the process of faking. In this contribution we illustrated the use of sgr on three simple scenarios of faking. More complex examples of SGR applications can be found in Lombardi and Pastore (2012) and Pastore and Lombardi (2014).
As with many Monte Carlo-type approaches, SGR also involves simplifying assumptions that may result in lower external validity. For example, one relevant limitation regards the assumption that restricts the conditional replacement distribution to satisfy the CI property. This restriction clearly limits the range of empirical faking processes that can be mimicked by the current SGR simulation procedure. In particular, because the replacement distribution under the CI assumption acts as a perturbation process for the original data, the resulting new fake data sets will in general show covariance patterns that are (on average) weaker than the ones observed for the original uncorrupted data. In general, this may not be a serious problem, as different studies have shown that self-report measures under faking motivating conditions tend to have smaller variances and lower reliability (covariance) estimates than those observed for measures collected under uncorrupted conditions (Ellingson et al., 2001; Eysenck et al., 1974; Hesketh et al., 2004; Topping and O'Gorman, 1997). However, opposite results have also been observed, where simple fake good instructions tend to increase the intercorrelations between the manipulated or faked items (Ellingson et al., 1999; Galic et al., 2012; Pauls and Crost, 2005; Zickar and Robie, 1999; Ziegler and Buehner, 2009). Therefore, although encouraging, the promise of this approach should be examined across more varied conditions. We acknowledge that more work still needs to be done. We are in the process of extending sgr to