In analyzing data deriving from the administration of a questionnaire to a group of individuals, Item Response Theory (IRT) models provide a flexible framework to account for several aspects involved in the response process, such as the existence of multiple latent traits. In this paper, we focus on a class of semi-parametric multidimensional IRT models, in which these traits are represented through one or more discrete latent variables; these models allow us to cluster individuals into homogeneous latent classes and, at the same time, to properly study item characteristics. In particular, we follow a within-item multidimensional formulation similar to that adopted in the two-tier models, with each item measuring one or two latent traits. The proposed class of models may be estimated through the package MLCIRTwithin, whose functioning is illustrated in this paper with examples based on data about quality-of-life measurement and about the propensity to commit a crime.
Several fields of human knowledge require the measurement of unobservable constructs (or latent traits) through ad hoc methods based on questionnaires consisting of multiple items having dichotomously or ordered politomously scored response categories. This is the case of measurement of customer satisfaction, quality-of-life, level of physical and/or psychological disabilities, ability in certain subjects, and so on.
Item Response Theory (IRT) models (Hambleton and H. Swaminathan 1985; Van der Linden and R. K. Hambleton 1997; Bartolucci, S. Bacci, and M. Gnaldi 2015) are well-known statistical models to deal with these data. In their original formulation, these models are characterized by: (i) unidimensionality (i.e., only one latent trait is assumed to be measured by all items); (ii) a parametric (usually normal) distribution for the latent variables used to represent the trait of interest; and (iii) no effect of individual covariates on this latent trait. These elements often turn out to be restrictive in modern applications and, therefore, several extensions of IRT models have been proposed in the literature. Among the possible extensions, in this paper we consider the class of multidimensional Latent Class (LC) IRT models proposed by (Bartolucci 2007) and (Davier 2008); see also (Bacci, F. Bartolucci, and M. Gnaldi 2014). Models of this type are characterized by: (i) multidimensionality, in the sense that more latent traits may be measured by the set of items (Reckase 2009); (ii) discreteness of the latent variables, so that homogeneous subpopulations [or latent classes; Lazarsfeld and N. W. Henry (1968); Goodman (1974)] of individuals are detected with respect to the constructs measured by the questionnaire; and (iii) possible presence of individual covariates affecting the probabilities to belong to each latent class.
In particular, we focus on a specific extension of IRT models based on within-item multidimensionality (Adams, M. Wilson, and W.-C. Wang 1997), which is characterized by items affected by more than one latent variable. This is opposed to the more common between-item multidimensionality, where each item may measure only one latent variable as in the original approach of (Bartolucci 2007). More in detail, the model here proposed represents a discrete version of the item bifactor model and of the more general two-tier IRT model (Bock, R. Gibbons, and E. Muraki 1988; Gibbons and D. R. Hedeker 1992; Gibbons, R. D. Bock, D. Hedeker, D. J. Weiss, E. Segawa, D. K. Bhaumik, D. J. Kupfer, E. Frank, V. J. Grochocinski, and A. Stover 2007; Cai 2010; Cai, J. S. Yang, and M. Hansen 2011; Reise 2012; Bonifay 2015), based on a particular within-item multidimensional formulation with each item loading on at most two latent variables that are mutually uncorrelated. With respect to traditional item bifactor and two-tier models, which assume the normality of the latent variables, the discreteness assumption increases the flexibility of the approach and allows us to cluster individuals in homogeneous latent classes. (Formann and T. Kohlmann 2002) propose a general approach based on latent classes that includes the discrete two-tier model here proposed as special case. However, different from the proposal of these authors, we let the class membership probability depend on individual covariates and we also allow for more flexibility in terms of specification of model link function. Limited to binary items, a recent example of application of the proposed two-tier LC-IRT model is provided in (Bacci and F. Bartolucci 2015) to jointly study certain students’ abilities and the propensity to skipping item responses.
The procedures to estimate the proposed class of two-tier LC-IRT models
are implemented in the R package
MLCIRTwithin
(Bartolucci and S. Bacci 2016), downloadable from
http://CRAN.R-project.org/package=MLCIRTwithin, whose illustration is
the primary focus of the present paper. In particular, we are interested
in providing a detailed description of the main functions of this
package, named est_multi_poly_within
and search.model_within
, also
through some applications.
The remainder of the paper is organized as follows. In the next section we provide the formulation of the proposed class of two-tier LC-IRT models and, then, some details about likelihood inference for these models. Furthermore, we describe the main functions implemented in the R package MLCIRTwithin for model estimation. In the following, the functioning of the package is illustrated through two applications: the first one concerns the measurement of Heath-related Quality Of Life (HQOL) on cancer patients and the second one is about the measurement of propensity to commit crimes. Some final remarks conclude the work.
The proposed class of models is formulated on the basis of two
independent vectors of latent variables representing the unobservable
individual characteristics measured by the test items. For each unit
An illustrative example of the above assumptions is provided in
Figure 1, where
Adopting a semi-parametric approach for the latent distribution, the
first latent vector
For binary response variables, the measurement model assumes that, for
Different from traditional LC models characterized by constant mass
probabilities, a more general approach is based on assuming that the
probabilities to belong to every latent class defined by the
distribution of
In the case of polytomously scored items with ordered categories indexed
from 0 to
In order to ensure the identification of the proposed class of models,
two necessary conditions must hold. First, as usual in the IRT modeling,
we must constrain one discriminant index to be equal to 1 and one
difficulty parameter to be equal to 0 for each dimension. More in
detail, let
A further identification condition requires that at least one item
belongs to one of the subsets
To specify in a flexible way constraints on the support points and item
parameters, we denote the complete vectors of support points by
For instance, according to the usual IRT parametrization with free
support points and constraints on the item parameters (i.e., one
discriminant index equal to 1 and one difficulty parameter equal to 0
for each dimension),
The proposed two-tier LC-IRT model can be estimated by maximizing the
marginal log-likelihood
We maximize
the expected value of the complete data log-likelihood (i.e., the
log-likelihood for the observed and latent variables) is computed,
given the current parameter vector
the parameter vector
Similar to the other iterative algorithms, the first iteration of the EM algorithm needs to be initialized through suitable values for the model parameters that can be chosen according to certain deterministic or random rules. A common problem with finite mixture models, and then with the proposed model, is due to the presence of several local maximum points of the log-likelihood function. Therefore, in order to avoid a solution that does not correspond to the global maximum, a good practice consists in repeating the estimation process for a specific model a certain number of times using random starting values and, in the presence of different values of the log-likelihood at convergence, the solution corresponding to the highest log-likelihood value is selected.
A crucial point is that of model selection, mainly as concerns the
choice of the number of support points (or latent classes) for both
latent vectors (i.e.,
Other selection criteria may be based on the entropy, whose computation
involves the individual posterior probabilities. Entropy is a measure of
the capability of the model to provide a neat partition of the sample
units, which is computed as
In practice, we propose to fit a series of models with similar
specifications that distinguish one other for values assigned to
More in general, for certain values of
Finally, in order to facilitate the interpretation of the results, we
suggest to standardize the estimated support points
Let
Moreover,
for
The class of two-tier LC-IRT models previously described may be estimated through the R package MLCIRTwithin; for technical details see the official documentation provided in CRAN (Bartolucci and S. Bacci 2016).
Before illustrating the main functions in the package at issue, it is
worth mentioning some alternative R packages, which estimate models with
a formulation resembling the one proposed. A first example is provided
by the R package
MultiLCIRT
(Bartolucci, S. Bacci, and M. Gnaldi 2014, 2016), whose functions are similar, in terms of
input and output, to those of MLCIRTwithin; however, MultiLCIRT is
limited to the estimation of LC-IRT models under between-item
multidimensionality, in the sense that items loading on more than one
latent trait are not allowed and then a single vector of latent
variables
est_multi_poly_within
and est_multi_poly_between
The main function of MLCIRTwithin is est_multi_poly_within
, which
performs the maximum likelihood estimation of the model specified
through equations (1) to (5), allowing
for several options.
Function est_multi_poly_within
requires the following main input
arguments:
S
: matrix of item response configurations listed row-by-row; items
with a different number of categories and missing responses are
allowed.
yv
: vector of the frequencies of every row in S
; by default,
yv
is a vector of ones.
k1
: number of latent classes for latent variable
k2
: number of latent classes for latent variable
X
: matrix of covariates affecting the class weights; by default,
X
is NULL
.
start
: method of initialization of the algorithm:
"deterministic"
(default value) for values chosen according to a
deterministic rule, "random"
for values randomly drawn from
suitable distributions (continuous uniform between 0 and 1 for the
class weights and standard normal for the other parameters), and
"external"
for values provided by the researcher through inputs
Phi
, ga1t
, ga2t
, De1
, and De2
.
link
: type of link function: "global"
for global logits as in
Equation (4) and "local"
for local logits as in
Equation (5). With binary items, any type of link
function may be specified, resulting in a Rasch (Rasch 1960) or a
two-parameter logit [2PL; Birnbaum (1968)] type model depending on the
value assigned to input disc
.
disc
: constraints on the discriminating item parameters: FALSE
(default value) for parameters TRUE
for free values. With binary items, option
disc = FALSE
results in a Rasch model, whereas a 2PL model is
obtained when option disc = TRUE
.
difl
: constraints on the difficulty item parameters, in the case
of ordinal polytomously scored items: FALSE
(default value) for
unconstrained parameters TRUE
for a rating scale
parametrization as in (6). This option is not allowed in
the presence of items with a different number of response
categories.
multi1
: matrix with one row for each component of multi1
is specified as
(multi1 <- rbind(c(1,2,0,0), c(3,4,5,6)))
[,1] [,2] [,3] [,4]
[1,] 1 2 0 0
[2,] 3 4 5 6
multi2
: same as multi1
for latent variable multi2
that cannot be the same as the
indices in the first column of multi1
. For instance, in the
situation described in Figure 1, the matrix
multi2
specified as
(multi2 <- c(7, 2:3, 5:6))
[1] 7 2 3 5 6
implies multi1
and multi2
is
empty: in such a case a between-item multidimensional LC-IRT model
is specified, based on two completely independent latent vectors
fort
: if TRUE
, Fortran routines are used whenever possible to
speed up computation.
tol
: level of tolerance of the algorithm in terms of relative
difference between the log-likelihood corresponding to two
consecutive algorithm iterations (default value is
disp
: if TRUE
, the log-likelihood evolution is displayed
step-by-step.
output
: if TRUE
, additional output arguments are returned.
out_se
: if TRUE
, standard errors and variance-covariance matrix
for the parameter estimates are returned.
glob
: type of parametrization for the sub-model assumed on the
individual-specific latent class weights: FALSE
(default value)
for a multinomial logit model as in (2)-(3)
and TRUE
for a global logit model.
Zth1
, Zth2
: matrices for the specification of linear constraints
on the support points, according to (7) and
(8), respectively; by default these are identity
matrices with a number of rows (and columns) equal to the total
number of support points, that is,
zth1
, zth2
: vectors of length
Under the default specifications of Zth1
, Zth2
, zth1
, and
zth2
, the support points are freely estimated and, for the model
identification, certain constraints are assumed on the item
parameters. On the contrary, to fix the values of the support
points, Zth1
, Zth2
, zth1
, and zth2
must be supplied by the
user. For instance, in the situation described in Figure
1 under the assumption
Zth1 <- matrix(0,2*2,0)
zth1 <- c(rep(-1, times=2), rep(1, times = 2))
Zth2 <- matrix(0,2,0)
zth2 <- c(-0.5,0.5)
Zga1
, Zga2
, Zbe
: matrices for the specification of linear
constraints on the vectors of item parameters Zga1
and Zga2
is equal to the
number of non-null entries in multi1
and multi2
, respectively,
and coincides with the length of vectors Zbe
corresponds
to the total number of item difficulties and coincides with the
length of vector Zga1
,
Zga2
, Zbe
is equal to the total number of free parameters,
corresponding to the length of vectors multi1
and multi2
defined above, and est_multi_poly_within
:
(Zga1 <- diag(6)[, -c(1,3)])
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0
[2,] 1 0 0 0
[3,] 0 0 0 0
[4,] 0 1 0 0
[5,] 0 0 1 0
[6,] 0 0 0 1
(Zga2 <- diag(5)[, -5])
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
[5,] 0 0 0 0
(Zbe <- diag(7)[, -c(1,3,7)])
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 1 0 0 0
[3,] 0 0 0 0
[4,] 0 1 0 0
[5,] 0 0 1 0
[6,] 0 0 0 1
[7,] 0 0 0 0
Whenever we are interested in introducing further constraints, then
the matrices at issue must be supplied by the user. For instance, to
restrict Zga1
must be
defined as
Zga1 <- diag(6)[ , -c(1, 3, 5)]; Zga1[5, 2] <- 1
Zga1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 0 0 0
[4,] 0 1 0
[5,] 0 1 0
[6,] 0 0 1
zga1
, zga2
, and zbe
: vectors whose length is equal to the
number of rows of Zga1
, Zga2
, and Zbe
, respectively. In other
words, length of zga1
, zga2
, and zbe
is given by the number of
elements in Zga1
, Zga2
, and Zbe
, allows for
linear constraints and fixed values of the item parameters, as in
(9), (10), and (11).
By default, zga1
and zga2
are vectors with elements 1 for each
constrained item and 0 otherwise; by default zbe
is a null vector.
For instance, in the situation depicted in Figure 1
and matrices multi1
and multi2
, default values assumed for
vectors at issue are
zga1 <- c(1, 0, 1, 0, 0, 0)
zga2 <- c(0, 0, 0, 0, 1)
zbe <- rep(0, times = 7)
Any other constraint may be defined by modifying in a suitable way
these three vectors and matrices Zga1
, Zga2
, Zbe
. For
instance, if we are interested in fixing the difficulty of (binary)
item 4 to be equal to 2 (i.e., Zbe
and zbe
as follows:
Zbe <- diag(7)[, -c(1,3,4,7)]
Zbe
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 1 0 0
[3,] 0 0 0
[4,] 0 0 0
[5,] 0 1 0
[6,] 0 0 1
[7,] 0 0 0
zbe <- c(0, 0, 0, 2, 0, 0, 0)
Function est_multi_poly_within
supplies the following output:
piv1
and piv2
: vectors of the estimated weights of latent
classes for
Th1
and Th2
: matrices of estimated and constrained support
points for each dimension (by row) and each latent class (by column)
for
Bec
: matrix of estimated and constrained item difficulty
parameters; exact zeros correspond to identifiability constraints.
ga1c
and ga2c
: vectors of estimated and constrained item
discriminating parameters for NA
to items that do not load on the latent variable.
th1t
, th2t
, bet
, ga1t
, and ga2t
: estimated parameters
(i.e., parameters without constraints) related to
Th1, Th2, Bec, ga1c, ga2c
, respectively.
Th1s
, Th2s
, Becs
, ga1cs
, and ga2cs
: standardized values of
Th1
, Th2
, Bec
, ga1c
, and ga2c
, respectively; in Th1s
and
Th2s
classes are re-ordered according to the increasing values of
the support points for the first dimension.
piv1s
and piv2s
: the same as piv1
and piv2
, but re-ordered
according to Th1s
and Th2s
.
fv1
and fv2
: vectors indicating the reference items for each
dimension of
Phi
: conditional response probabilities for every item and each
pair of latent classes of
De1
and De2
: matrices of estimated regression coefficients for
the model on the class weights for
Piv1
, Piv2
, Pp1
, Pp2
, and lkv
: optional output (obtained
if output = TRUE
) referred to the matrices of weights for every
covariate configuration for latent variables
XX1dis
and XX2dis
: design matrices for the covariates affecting
the first and the second vector of latent variables, respectively
(optional output obtained if output = TRUE
).
lk
: value of the log-likelihood at convergence.
np
: number of estimated model parameters.
aic
, bic
, and ent
: AIC, BIC, and entropy indices,
respectively.
seDe1
, seDe2
, seTh1
, seTh2
, seBec
, sega1
, sega2
,
seth1t
, seth2t
, sebet
, sega1t
, sega2t
, and Vn
: standard
errors of the corresponding estimated parameters and estimated
variance-covariance matrix (if out_se = TRUE
).
Some relevant commands to display output from function
est_multi_poly_within
are based on the S3 methods summary
for the
main estimates; coef
and confint
for the point estimates and
confidence intervals (at a specified level of confidence) of support
points, item parameters, and regression coefficients; logLik
for the
value of log-likelihood at convergence; and vcov
for the estimated
variance-and-covariance matrix.
Another relevant function of package MLCIRTwithin is
est_multi_poly_between
, which performs the maximum likelihood
estimation of an LC-IRT model under between-item multidimensionality.
The main differences with respect to function est_multi_poly
of the R
package MultiLCIRT are that the latter does not allow for items with a
different number of response categories and refers to a slightly
different specification of item difficulties (for details see Bacci, F. Bartolucci, and M. Gnaldi 2014).
Input arguments required by est_multi_poly_between
are very similar to
those of function est_multi_poly_within
. The main difference is that
only one vector of latent variables is involved in the model
specification. Consequently, the number of latent classes (input k
) is
common to all the dimensions and the multidimensional structure of the
items is specified through one matrix (input multi
), having one row
for each dimension. Constraints on model parameters are also possible
through a suitable definition of arguments Zth
, zth
, Zbe
, zbe
,
Zga
, and zga
, whose functioning is the same as the corresponding
arguments in function est_multi_poly_within
. The function provides as
its main output argument a vector of estimated average weights of the
latent classes (output piv
) and a matrix of estimated support points
for each dimension and each latent class of the latent trait before
(output Th
) and after the standardization (output Ths
). Besides, a
matrix of difficulty item parameters (outputs Bec
and, in the case of
standardization, Becs
), a vector of discriminating indices (output
gac
and, in the case of standardization, gacs
), and a matrix of
regression coefficients (output De
) are provided, other than other
output arguments similar to those above described for
est_multi_poly_within
, included the S3 methods.
Finally, we clarify that a model specification of type
out1 <- est_multi_poly_between(S, k = k0, link = "global", multi = rbind(1:3, 4:6)),
with k0
latent classes, is substantially different from a model
specification of type
out2 <- est_multi_poly_within(S, k1 = k0, k2 = k0, link = "global", multi1 = c(1:3),
multi2 = c(4:6)).
In fact, the model corresponding to out2
involves two completely
independent latent variables, having incidentally the same number of
latent classes: thus, an individual belonging to a specific class (say,
class 1) according to the first latent variable may belong to any latent
class under the second latent variable (say, class 2). On the contrary,
model out1
involves only one latent variable decomposed in two
dimensions: thus, belonging to a given latent class under one dimension
implies belonging to the same class under the other dimension. Overall,
model out2
has k0
out1
.
search.model_within
and search.model_between
As outlined in Section “Likelihood inference,” the selection of a
two-tier LC-IRT model may be a quite demanding procedure, requiring the
choice of the number of support points for the latent variables and a
check for the possible presence of local maxima. Function
search.model_within
allows us to search for the global maximum of the
log-likelihood of a model with a specific formulation (in terms of
multidimensional structure, link function, and constraints on the item
parameters) given a vector of possible number of latent classes to try
for.
In practice, function search.model_within
applies function
est_multi_poly_within
a given number of times for each pair of values
for est_multi_poly_between
, which is
automatically retrieved by search.model_within
.
The function at issue requires the following main input arguments:
S
, yv
, X
, link
, disc
, difl
, multi1
, multi2
, fort
,
disp
, output
, out_se
, Zth1
, zth1
, Zth2
, zth2
, Zbe
,
zbe
, Zga1
, zga1
, Zga2
, and zga2
: are the same as in
function est_multi_poly_within
.
kv1
and kv2
: vectors of number of latent classes to try for
latent variable
tol1
and tol2
: tolerance levels (default value are tol1
is used for checks based on deterministic and random
starting values, whereas the value of tol2
is used for improving
the precision of estimates for the model with the best
log-likelihood level.
nrep
: constant value that drives the number of estimations of each
model with random starting values, given by nrep
nrep
is 2. In the case of nrep
equal to 0,
only the estimation with deterministic starting values is performed.
Note that if single values for kv1
and kv2
are specified and nrep
equals 0, function search.model_within
performs just one call of
function est_multi_poly_within
(or function est_multi_poly_between
if kv1
or kv2
equal 1) with option start = "deterministic"
.
Function search.model_within
supplies the following output:
aicv
, bicv
, and necv
: vectors of AIC, BIC, and NEC indices,
respectively, for each of the estimated models.
errv
: trace of any error occurred during the estimation process.
lkv
: values of log-likelihood at convergence for each of the
estimated models.
out.single
: output of each single model, similar to the output of
est_multi_poly_within
, with the addition of values of k1
); k2
); and the sequence of
log-likelihoods (output lktrace
) for the deterministic start, for
each random start, and for the final estimation provided by a
tolerance level equal to tol2
(if tol2
tol1
).
Finally, we outline that a function with input and output arguments
similar to those of the function search.model_within
is available to
perform the model selection in the case of between-item multidimensional
LC-IRT models. This function is named search.model_between
and it
relies on est_multi_poly_between
.
In the following we illustrate package MLCIRTwithin through two data analysis examples. In Example 1 we describe the model selection procedure, as well as the interpretation of the output, considering a set of ordered items measuring two latent variables. In Example 2, the specification of constraints on the support points and on the item parameters is illustrated through the analysis of data concerning repeated item responses along two time occasions. The detailed software scripts to implement the two examples, named Example1.R and Example2.R, are available in the Supplementary Online Material at https://sites.google.com/site/bartstatistics/sm_mlcirtwithin.zip.
Data set SF12_nomiss
, already provided in the R package
MLCIRTwithin, refer to a sample of 493 oncological Italian patients
who were asked to fill in the Italian validated Short Form 12 version 2
questionnaire [SF-12; Stewart and J. E. Ware (1992); Ware, M. Kosinski, D. M. Turner-Bowker, and B. Gandek (2002)] concerning the
assessment of HQOL. The questionnaire is comprised by 12 items having
five ordered response modalities, except items 2 and 3 having only three
modalities; a high score means a worse level of HQOL and vice-versa
(note that, in the original scoring system, modalities of items 9 and 10
are reversed). Also the age is available for each patient. In the
following we show the first few records of the data set and the related
summaries.
library(MLCIRTwithin)
data(SF12_nomiss)
head(SF12_nomiss)
Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10 Y11 Y12 age
1 1 0 1 1 0 2 2 2 1 1 1 0 74.94593
2 0 0 1 1 2 1 2 1 0 2 1 1 84.49829
3 1 1 1 2 1 0 0 2 1 1 1 1 77.44285
4 3 2 2 4 4 4 4 3 4 4 4 4 80.55305
6 1 1 2 2 2 2 2 2 2 2 2 2 81.68104
7 1 0 1 3 2 2 1 2 1 2 1 3 78.55168
str(SF12_nomiss)
'data.frame': 493 obs. of 13 variables:
$ Y1 : num 1 0 1 3 1 1 1 1 2 2 ...
$ Y2 : num 0 0 1 2 1 0 0 2 1 1 ...
$ Y3 : num 1 1 1 2 2 1 1 2 1 2 ...
$ Y4 : num 1 1 2 4 2 3 1 3 2 3 ...
$ Y5 : num 0 2 1 4 2 2 2 2 2 3 ...
$ Y6 : num 2 1 0 4 2 2 2 2 3 3 ...
$ Y7 : num 2 2 0 4 2 1 1 2 3 3 ...
$ Y8 : num 2 1 2 3 2 2 2 1 3 3 ...
$ Y9 : num 1 0 1 4 2 1 2 3 3 3 ...
$ Y10: num 1 2 1 4 2 2 1 2 2 2 ...
$ Y11: num 1 1 1 4 2 1 2 1 4 3 ...
$ Y12: num 0 1 1 4 2 3 1 2 3 3 ...
$ age: num 74.9 84.5 77.4 80.6 81.7 ...
# For the description of each item see the online documentation
?SF_nomiss
According to the main current literature (see, mainly, Ware, M. Kosinski, D. M. Turner-Bowker, and B. Gandek 2002), the SF-12 questionnaire may be used to properly evaluate two main aspects of HQOL: physical and emotional. The standard scoring algorithm for summarizing these two latent dimensions is based on an orthogonal factor analysis, on the basis of which positive and negative weights are assigned to each item. More in detail, items 1 to 5 and item 8 have positive weights for physical HQOL and negative weights for emotional HQOL, whereas items 6, 7, 9, 11, and 12 have negative weights for physical HQOL and positive weights for emotional HQOL; item 10 has positive weights for both components. According to this scoring system, the scores of physical and emotional HQOL result by a suitable weighted average of the item responses.
A low score on physical HQOL has the following meaning (Ware and B. Gandek 1998): substantial limitations in self-care, physical, social and role activities; severe bodily pain; frequent tiredness; health rated as poor. On the contrary, a high level of physical component corresponds to: no physical limitations, disabilities or decrements in well-being; high energy level; health rated as excellent. As regards the emotional component, a low score implies: frequent psychological distress, social and role disability due to emotional problems; health rated poor. On the other hand, a high level of emotional component corresponds to: frequent positive affect; absence of psychological distress and limitations in usual social activities due to emotional problems; health rated excellent.
The main drawback of the above algorithm based on the orthogonal factor analysis is that the summary score may be inconsistent due to weights of opposite sign for the same items, as higher emotional health scores drive physical health scores down and, similarly, higher physical health scores drive emotional health scores down (Farivar, W. E. Cunningham, and R. D. Hays 2007). An alternative approach for clustering patients according to their physical and emotional health status is based on IRT analysis (see, among others, Hays, C. D. Sherbourne, and R. M. Mazel 1993). In such a context, we analyze the multidimensional structure of SF-12 questionnaire through a two-dimensional model allowing items measuring both latent variables. In more detail, we compare several plausible multidimensional structures, defined through the following matrices, with the first one referred to the physical HQOL and the second one referred to the emotional HQOL:
within-item multidimensional model with two independent latent variables and no shared item; items are allocated according to the sign of weights resulting by the factor analysis mentioned above:
(multi1_dim1 <- c(1:5, 8))
[1] 1 2 3 4 5 8
(multi1_dim2 <- c(6:7, 9:12))
[1] 6 7 9 10 11 12
model with two latent variables sharing items that do not explicitly affect a specific dimension
(multi2_dim1 <- c(1:5, 8:12))
[1] 1 2 3 4 5 8 9 10 11 12
(multi2_dim2 <- c(6:12, 1))
[1] 6 7 8 9 10 11 12 1
multidimensional structure similar to the previous one, but with three items (9, 10, and 11) assigned only to the emotional HQOL
(multi3_dim1 <- c(1:5, 8, 12))
[1] 1 2 3 4 5 8 12
(multi3_dim2 <- c(6:12, 1))
[1] 6 7 8 9 10 11 12 1
multidimensional structure similar to that defined through multi21
and multi22
, but one more item (number 8), concerning the presence
of pain, is assigned only to physical HQOL, since pain is usually
intended in terms of physical health (i.e., bodily pain)
(multi4_dim1 <- c(1:5, 8, 12))
[1] 1 2 3 4 5 8 12
(multi4_dim2 <- c(6:7, 9:12, 1))
[1] 6 7 9 10 11 12 1
The allocation of every item according to one of the above proposed structures is suggested by the more or less explicit reference of the item text to physical or emotional component of HQOL (or to both of them).
Considering the possible multidimensional structures above defined, we focus on models with global logit link function and free discriminating item parameters; also the effect of age on the mass probabilities is investigated.
# Item responses and covariates
S <- SF12_nomiss[ , 1:12]
X <- SF12_nomiss[ , 13]
For each type of multidimensional structure, we select the optimal
number of latent classes on the basis of BIC index according to the
procedure described in Section “Likelihood inference”. Also the check of
local maxima solutions follows the same lines described therein. For
these aims we use function search.model_within
, as follows:
### Model selection
maxk1 <- 6
maxk2 <- 6
tol1 <- 10^-3
tol2 <- 10^-6
# Multidimensional structure of Type 1
set.seed(0)
out1 <- search.model_within(S, kv1 = 1:maxk1, kv2 = 1:maxk2, X = X, link = "global",
disc = TRUE, multi1 = multi1_dim1, multi2 = multi1_dim2,
fort = TRUE, tol1 = tol1, tol2 = tol2, nrep = 1)
# Multidimensional structure of Type 2
set.seed(0)
out2 <- search.model_within(S, kv1 = 1:maxk1, kv2 = 1:maxk2, X = X, link = "global",
disc = TRUE, multi1 = multi2_dim1, multi2 = multi2_dim2,
fort = TRUE, tol1 = tol1, tol2 = tol2, nrep = 1)
# Multidimensional structure of Type 3
set.seed(0)
out3 <- search.model_within(S, kv1 = 1:maxk1, kv2 = 1:maxk2, X = X, link = "global",
disc = TRUE, multi1 = multi3_dim1, multi2 = multi3_dim2,
fort = TRUE, tol1 = tol1, tol2 = tol2, nrep = 1)
# Multidimensional structure of Type 4
set.seed(0)
out4 <- search.model_within(S, kv1 = 1:maxk1, kv2 = 1:maxk2, X = X, link = "global",
disc = TRUE, multi1 = multi4_dim1, multi2 = multi4_dim2,
fort = TRUE, tol1 = tol1, tol2 = tol2, nrep = 1)
We advise that the entire estimation process may take a very long
computational time; then, we suggest to reduce the tolerance level of
the algorithm (we adopted tol1
and tol2
) and the number of repetitions with
random initializations (we specified nrep = 1
instead of the default
value nrep = 2
). Outputs out1
, out2
, out3
, and out4
are
contained in the file Example1.RData
, available in the supplementary
online material at
https://sites.google.com/site/bartstatistics/sm_mlcirtwithin.zip.
In order to select the optimal model, values of BIC index are displayed in the following 36-by-4 matrix, having one row for each model and one column for each multidimensional structure:
# BIC indices
BIC <- cbind(out1$bicv, out2$bicv, out3$bicv, out4$bicv)
colnames(BIC) <- c("Type 1", "Type 2", "Type 3", "Type 4")
k1 <- rep(1:6, times = 1, each = 6)
k2 <- rep(1:6, times = 6)
BIC <- cbind(k1, k2, BIC)
BIC
k1 k2 Type 1 Type 2 Type 3 Type 4
[1,] 1 1 16041.95 16041.95 16041.95 16041.95
[2,] 1 2 14952.57 14758.90 14758.90 14857.40
[3,] 1 3 14637.43 14397.05 14397.05 14523.30
[4,] 1 4 14457.90 14186.90 14186.90 14323.75
[5,] 1 5 14457.16 14182.29 14182.29 14322.70
[6,] 1 6 14456.62 14181.76 14181.76 14323.05
[7,] 2 1 15245.74 14755.76 15106.47 15106.47
[8,] 2 2 14180.93 14010.29 14071.65 14097.89
[9,] 2 3 13865.85 13721.04 13750.05 13784.53
[10,] 2 4 13686.27 13543.49 13554.24 13595.26
[11,] 2 5 13685.48 13543.77 13553.30 13599.42
[12,] 2 6 13684.91 13559.40 13563.59 13604.91
[13,] 3 1 14945.39 14292.43 14748.77 14748.77
[14,] 3 2 13880.61 13674.91 13783.16 13791.13
[15,] 3 3 13565.50 13435.72 13482.17 13495.04
[16,] 3 4 13385.96 13291.34 13303.73 13319.80
[17,] 3 5 13385.83 13294.18 13306.15 13324.50
[18,] 3 6 13392.96 13308.42 13318.83 13334.18
[19,] 4 1 14897.97 14183.55 14678.01 14678.01
[20,] 4 2 13837.48 13606.69 13728.08 13732.33
[21,] 4 3 13522.40 13369.83 13441.26 13451.80
[22,] 4 4 13342.82 13245.49 13262.09 13271.06
[23,] 4 5 13342.45 13251.35 13269.85 13277.34
[24,] 4 6 13341.56 13265.77 13271.07 13293.80
[25,] 5 1 14858.93 14122.61 14610.54 14610.54
[26,] 5 2 13851.00 13549.28 13744.75 13674.28
[27,] 5 3 13538.57 13353.08 13383.57 13387.67
[28,] 5 4 13356.25 13196.32 13215.00 13219.66
[29,] 5 5 13298.70 13207.16 13216.58 13226.12
[30,] 5 6 13298.20 13212.85 13237.74 13242.02
[31,] 6 1 14932.45 14110.19 14613.70 14613.70
[32,] 6 2 13805.78 13557.76 13683.30 13683.43
[33,] 6 3 13495.08 13339.91 13393.67 13410.61
[34,] 6 4 13311.17 13205.40 13231.35 13230.88
[35,] 6 5 13312.05 13214.68 13237.86 13240.88
[36,] 6 6 13310.90 13222.19 13241.50 13243.07
On the basis of the above matrix we observe that the minimum value of
the BIC index, that is, 13196.32, is displayed in column 2, denoting the
multidimensional structure of type 2, and row 28, which refers to models
with out2$out.single[[28]]
.
# Minimum BIC
min(BIC[ , 3:6])
[1] 13196.32
# Detect model with the minimum BIC
arrayInd(which.min(BIC[ , -c(1:2)]), .dim = dim(BIC))
[,1] [,2]
[1,] 28 2
# Number of support points for the best model
out2$out.single[[28]]$k1
[1] 5
out2$out.single[[28]]$k2
[1] 4
# Selected model
outsel = out2$out.single[[28]]
We also observe that the same number of support points is selected for the other multidimensional structures with shared items (i.e., structures of types 3 and 4).
# Minimum BIC and selected model for multidimensional structures of types 1, 3, 4
min(BIC[, "Type 1"])
[1] 13298.2
which.min(BIC[ ,"Type 1"])
[1] 30
min(BIC[, "Type 3"])
[1] 13215
which.min(BIC[ ,"Type 3"])
[1] 28
min(BIC[, "Type 4"])
[1] 13219.66
which.min(BIC[ ,"Type 4"])
[1] 28
After selecting the number of latent classes and the type of within-item
multidimensional structure, we estimate again the selected model out
with option out_se = T
for the computation of standard errors. To
reduce the computational time, we use function est_multi_poly_within
with option start = "external"
and specified as input for options
Phi
, ga1c
, ga2c
, De1
, and De2
the corresponding output from
the selected model.
# Re-estimate model to compute standard errors
out <- est_multi_poly_within(S = S, k1 = outsel$k1, k2 = outsel$k2, X = X,
start = "external",
multi1 = multi2_dim1, multi2 = multi2_dim2,
Phi = outsel$Phi, ga1t = outsel$ga1t, ga2t = outsel$ga2t,
De1 = outsel$De1, De2 = outsel$De2,
link = "global", disc = TRUE, fort = TRUE, output = TRUE,
out_se = TRUE, disp = TRUE)
Details of the output of the estimated model may be displayed through
the usual methods summary
, coef
, and confint
.
### Display output of the estimated model
# summary(out)
# coef(out)
# confint(out)
# Estimates of support points and average mass probabilities for physical HQOL
lv1 <- rbind(out$Th1, t(out$piv1))
rownames(lv1) <- c("Physical HQOL", "Average prob.")
round(lv1, 3)
1 2 3 4 5
Physical HQOL 0.823 1.684 2.854 0.093 -2.024
Average prob. 0.199 0.387 0.310 0.083 0.021
# Estimates of support points and average mass probabilities for emotional HQOL
lv2 <- rbind(out$Th2, t(out$piv2))
rownames(lv2) <- c("Emotional HQOL", "Average prob.")
round(lv2, 3)
1 2 3 4
Emotional HQOL 0.471 11.526 7.585 4.151
Average prob. 0.149 0.141 0.390 0.321
According to the estimated model, patients are clustered in 5 latent
classes denoting different levels of physical HQOL (output out$Th1
)
and in 4 latent classes denoting different levels of emotional HQOL
(output out$Th2
). To simplify the interpretation of the latent
classes, it is useful to re-order and standardize the estimated support
points. For this aim, function est_multi_poly_within
provides the
values of support points, which are standardized according to equations
(14)-(15) and re-ordered according to
increasing values of the first dimension of the corresponding latent
variable (outputs out$Th1s
and out$Th2s
with related weights
out$piv1s
and out$piv2s
, respectively).
# Standardized support points of physical HQOL
lv1s <- rbind(out$Th1s, t(out$piv1s))
rownames(lv1s) <- c("Stand. Physical HQOL", "Average prob.")
round(lv1s, 3)
5 4 1 2 3
Stand. Physical HQOL -3.561 -1.517 -0.812 0.019 1.149
Average prob. 0.021 0.083 0.199 0.387 0.310
# Standardized support points of emotional HQOL
lv2s <- rbind(out$Th2s, t(out$piv2s))
rownames(lv2s) <- c("Stand. Emotional HQOL", "Average prob.")
round(lv2s, 3)
1 4 3 2
Stand. Emotional HQOL -1.667 -0.553 0.486 1.679
Average prob. 0.149 0.321 0.390 0.141
We observe that classes 5, 4, and 1 of physical HQOL and classes 1 and 4 of emotional HQOL collect patients with negative levels of the related latent trait, whereas patients with levels of HQOL above the mean belong to the remaining classes (i.e., classes 2 and 3 for physical HQOL and classes 3 and 2 for emotional HQOL). We also observe that the distribution of physical HQOL is strongly skewed to negative values, whereas that of emotional HQOL is symmetric (Figure 2).
As concerns item parameters, the discriminating indices and the corresponding standard errors and confidence intervals are displayed as follows:
# Item discriminating parameters and related standard errors
gamma1 <- cbind(out$ga1c, out$sega1c)
colnames(gamma1) <- c("gamma1", "st.err.")
round(gamma1, 3)
gamma1 st.err.
[1,] 1.000 0.000
[2,] 1.988 0.329
[3,] 1.193 0.199
[4,] 3.519 0.570
[5,] 3.394 0.582
[6,] NA NA
[7,] NA NA
[8,] 1.579 0.255
[9,] -0.006 0.117
[10,] 0.732 0.151
[11,] 0.145 0.125
[12,] 1.002 0.184
gamma2 <-cbind(out$ga2c, out$sega2c)
colnames(gamma2) <- c("gamma2", "st.err.")
round(gamma2, 3)
gamma2 st.err.
[1,] 0.212 0.042
[2,] NA NA
[3,] NA NA
[4,] NA NA
[5,] NA NA
[6,] 1.000 0.000
[7,] 0.809 0.076
[8,] 0.128 0.037
[9,] 0.599 0.079
[10,] 0.429 0.059
[11,] 0.793 0.093
[12,] 0.606 0.074
# Confidence intervals at 95% of item discriminating parameters (columns 1-4)
confint(out)
[...] Output omitted
Confidence interval for the item parameters:
gamma1_1 gamma1_2 gamma2_1 gamma2_2 beta1_1 beta1_2 beta2_1 beta2_2 beta3_1 beta3_2 beta4_1 beta4_2
1 1.0000 1.0000 0.1293 0.2939 0.0000 0.0000 2.2927 3.0763 5.4703 6.6560 7.0976 8.9132
2 1.3425 2.6326 NA NA -0.2428 1.8308 3.6945 6.1378 NA NA NA NA
3 0.8036 1.5825 NA NA -1.4885 -0.0974 1.1539 2.5872 NA NA NA NA
4 2.4010 4.6368 NA NA -1.3097 2.1941 1.2833 4.9066 5.2791 9.4358 8.5939 13.4278
5 2.2542 4.5348 NA NA -0.8427 2.5507 1.9437 5.5126 5.3510 9.4330 8.2434 12.8277
6 NA NA 1.0000 1.0000 0.0000 0.0000 2.4866 3.9477 5.4657 7.6420 8.4771 11.2498
7 NA NA 0.6612 0.9574 -0.4397 0.7189 1.7730 3.0777 4.1602 5.8143 7.1608 9.3353
8 1.0790 2.0796 0.0556 0.2010 -2.8325 -0.8067 0.3711 2.0693 3.0574 4.9355 4.5989 6.6225
9 -0.2348 0.2235 0.4451 0.7529 -1.9089 -0.6523 1.0767 2.2309 3.5534 5.0154 6.9565 9.1180
10 0.4366 1.0274 0.3135 0.5451 -0.7823 0.3940 1.9615 3.1901 4.6207 6.1860 7.3341 9.3617
11 -0.0999 0.3905 0.6100 0.9762 -1.1600 0.1005 1.6032 2.9576 4.8700 6.7528 7.4612 9.8398
12 0.6411 1.3637 0.4611 0.7503 -1.5644 0.0124 1.4481 2.8845 4.6337 6.4619 6.9167 9.0782
[...] Output omitted
In the previous output, the first two tables show the estimates of item
discriminating parameters NA
refer to those items
that do not load on the corresponding latent variable.
We observe that discriminating parameters are generally highly
significant, with two notable exceptions: discriminating indices for
items 9 and 11 referred to physical HQOL (i.e., parameters
The standardized estimates of parameters
# Standardized discriminating parameters
gammas <- rbind(out$ga1cs, out$ga2cs)
rownames(gammas) <- c("Physical HQOL", "Emotional HQOL")
round(gammas, 3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
Physical HQOL 1.036 2.059 1.236 3.645 3.516 NA NA 1.636 -0.006 0.758 0.151 1.038
Emotional HQOL 0.699 NA NA NA NA 3.304 2.674 0.424 1.979 1.419 2.621 2.001
Note that the judgment about general health (item 1) is affected by both
components of HQOL, but the physical dimension has a more relevant role
(
We also highlight relevant differences among items in terms of difficulty, as results by the related standardized item parameters:
# Standardized difficulty parameters
round(out$Becs, 3)
cutoff
item 1 2 3 4
1 -2.929 -0.245 3.134 5.076
2 -2.514 1.608 NA NA
3 -2.778 -0.115 NA NA
4 -5.414 -2.761 1.501 5.155
5 -4.795 -1.921 1.743 4.887
6 -5.979 -2.762 0.575 3.884
7 -4.700 -2.414 0.148 3.409
8 -5.215 -2.175 0.601 2.215
9 -4.853 -1.918 0.712 4.465
10 -3.979 -1.209 1.618 4.563
11 -5.514 -2.704 0.827 3.667
12 -6.066 -3.124 0.258 2.707
In particular, positive responses to items related only (items 4 and 5) or mainly (item 1) to physical HQOL require a level of the latent trait in average higher than items related only (items 6 and 7) or mainly (items 9, 11, and 12) to emotional HQOL. In other words, the emotional component of HQOL interferes less than physical component of HQOL on work, daily activities, and social life.
In the estimated model we assume an effect of patient’s age on the HQOL,
according to the multinomial logit parametrization, as in equation
(3). Estimates of regression coefficients
# Effect of covariate age on physical HQOL
De1 <- cbind(out$De1, out$seDe1)
colnames(De1) <- c("delta12", "delta13", "delta14", "delta15", "se(delta12)",
"se(delta13)", "se(delta14)", "se(delta15)")
round(De1, 3)
delta12 delta13 delta14 delta15 se(delta12) se(delta13) se(delta14) se(delta15)
intercept 2.960 3.021 -0.997 -4.773 1.035 0.937 1.862 2.453
X1 -0.038 -0.043 0.002 0.039 0.017 0.015 0.026 0.035
# Effect of covariate age on emotional HQOL
De2 <- cbind(out$De2, out$seDe2)
colnames(De2) <- c("delta22", "delta23", "delta24", "se(delta22)", "se(delta23)",
"se(delta24)")
round(De2, 3)
delta22 delta23 delta24 se(delta22) se(delta23) se(delta24)
intercept 0.743 2.581 1.905 0.951 0.791 0.850
X1 -0.013 -0.027 -0.019 0.015 0.013 0.014
# Confidence intervals at 95% of regression coeffic. for physical and emotional HQOL
confint(out)
[...] Output omitted
Confidence interval for the regression coefficients for the 1st latent variable:
logit
2_1 2_2 3_1 3_2 4_1 4_2 5_1 5_2
intercept 0.9301 4.9890 1.1833 4.8581 -4.6451 2.6520 -9.5808 0.0355
X1 -0.0709 -0.0055 -0.0717 -0.0146 -0.0489 0.0528 -0.0295 0.1075
Confidence interval for the regression coefficients for the 2nd latent variable:
logit
2_1 2_2 3_1 3_2 4_1 4_2
intercept -1.1204 2.606 1.0306 4.1313 0.2383 3.5707
X1 -0.0434 0.017 -0.0521 -0.0021 -0.0458 0.0081
We observe a negative, although not particularly significant, effect of
age on the probability to belong to a specific latent class with respect
to the worst one (i.e., class 1), for both components of HQOL. A more
parsimonious and more easily interpretable solution is provided by
specifying a global logit parametrization through option glob = TRUE
of the function est_multi_poly_within
, as follows:
# Model with global logit parametrization for covariates
outgl <- est_multi_poly_within(S = S, k1 = 5, k2 = 4, X = X,
multi1 = multi2_dim1, multi2 = multi2_dim2, link = "global",
disc = TRUE, fort = TRUE, tol = 10^-8, disp = TRUE, output = TRUE,
out_se = TRUE, glob = TRUE)
# Effect of covariate age on physical HQOL
De1glob <- cbind(outgl$De1, outgl$seDe1)
colnames(De1glob) <- c("coef", "se")
round(De1glob, 3)
coef se
cutoff1 5.093 0.544
cutoff2 2.827 0.467
cutoff3 1.050 0.443
cutoff4 -0.777 0.477
X1 -0.028 0.007
# Effect of covariate age on emotional HQOL
De2glob <- cbind(outgl$De2, outgl$seDe2)
colnames(De2glob) <- c("coef", "se")
round(De2glob, 3)
coef se
cutoff1 2.306 0.463
cutoff2 0.664 0.453
cutoff3 -1.312 0.461
X1 -0.010 0.007
# Confidence intervals at 95% of regression coeffic. for physical and emotional HQOL
confint(outgl)
[...] Output omitted
Confidence interval for the regression coefficients for the 1st latent variable:
logit
_1 _2
cutoff1 4.0265 6.1586
cutoff2 1.9112 3.7427
cutoff3 0.1816 1.9187
cutoff4 -1.7110 0.1575
X1 -0.0422 -0.0132
Confidence interval for the regression coefficients for the 2nd latent variable:
logit
_1 _2
cutoff1 1.3987 3.2124
cutoff2 -0.2229 1.5509
cutoff3 -2.2153 -0.4088
X1 -0.0240 0.0049
In such a case just one regression coefficient for each latent variable
is estimated instead of
An interesting example of a real problem that can be solved through the
discrete two-tier models is encountered when a questionnaire is used on
multiple waves to measure the same latent variable at each time
occasion. In such a situation, following (Cai 2010) we distinguish a
“specific” dimension
To illustrate specification and estimation of the discrete two-tier model for longitudinal data we refer to a simulated data set about 10 different types of crime committed by a cohort of 10,000 hypothetical subjects on 6 waves; each crime corresponds to a binary item, which assumes value 1 if the crime was committed and 0 otherwise. The latent variable of interest denotes the propensity to commit a crime at each time occasion. The data set is available from the R package LMest (Bartolucci and S. Pandolfi 2016).
library(MLCIRTwithin)
### Load and prepare data
library(LMest)
data(data_criminal_sim)
data_criminal_sim[1:12,]
id sex time y1 y2 y3 y4 y5 y6 y7 y8 y9 y10
[1,] 1 1 1 0 0 0 0 0 0 0 0 0 0
[2,] 1 1 2 0 0 0 0 0 0 0 0 0 0
[3,] 1 1 3 0 0 0 0 0 0 0 0 0 0
[4,] 1 1 4 0 0 0 0 0 0 0 0 0 0
[5,] 1 1 5 0 0 0 0 0 0 0 0 0 0
[6,] 1 1 6 0 0 0 0 0 0 0 0 0 0
[7,] 2 1 1 0 0 0 0 0 0 0 0 0 0
[8,] 2 1 2 0 0 0 0 0 0 0 0 0 0
[9,] 2 1 3 0 0 0 0 0 0 0 0 0 0
[10,] 2 1 4 0 0 0 0 0 0 0 0 0 0
[11,] 2 1 5 0 0 0 0 0 0 0 0 0 0
[12,] 2 1 6 0 0 0 0 0 0 0 0 0 0
In order to simplify the illustration of the application, we keep the first two waves and five types of crime, as follows:
# Keep items: y1,y3,y5,y7,y10; keep occasions: 1, 2
criminal_red <- data_criminal_sim[(data_criminal_sim[,3]==1 | data_criminal_sim[,3]==2),
c(1:3,4,6,8,10,13)]
Moreover, in order to estimate the model at issue through the R package
MLCIRTwithin, we need to reshape the data from “long” to “wide”
format; we also aggregate records corresponding to the same patterns.
Note that, after reshaping the data set, the total amount of items is
10, that is, 5 items observed at time 1 (
# Data reshape in wide format
criminal_red <- data.frame(criminal_red)
crim_wide <- reshape(criminal_red, v.names = c("y1", "y3", "y5", "y7", "y10"),
timevar = "time", idvar = "id", direction = "wide")
head(crim_wide)
id sex y1.1 y3.1 y5.1 y7.1 y10.1 y1.2 y3.2 y5.2 y7.2 y10.2
1 1 1 0 0 0 0 0 0 0 0 0 0
3 2 1 0 0 0 0 0 0 0 0 0 0
5 3 1 0 0 0 0 0 0 0 0 0 0
7 4 1 0 0 0 0 0 0 0 0 0 0
9 5 1 0 0 0 0 0 0 0 0 0 0
11 6 1 0 0 0 0 0 0 0 0 0 0
dim(crim_wide)
[1] 10000 12
# Aggregate records with the same pattern
crim_wide <- as.matrix(crim_wide)
crim_wide2 <- aggr_data(crim_wide[, -1])
# Item responses, covariates, and vector of weights
S <- crim_wide2$data_dis[,-1]
X <- crim_wide2$data_dis[,1]; X <- X - 1
yv <- crim_wide2$freq
The multidimensional structure is defined through the specification of
matrices multi1
and multi2
, with multi1
referring to specific
dimensions capturing the dependence between responses to the same item
at the two different waves, whereas multi2
refers to the propensity to
commit a crime at time 1 and at time 2. Then,
### Define the multidimensional structure
multi1 <- matrix(0, nrow=5, ncol=2)
multi2 <- matrix(0, nrow=2, ncol=5)
multi1[1,] <- c(6, 1)
multi1[2,] <- c(2, 7)
multi1[3,] <- c(3, 8)
multi1[4,] <- c(4, 9)
multi1[5,] <- c(5, 10)
multi2[1,] <- c(1:5)
multi2[2,] <- c(7, 6, 8:10)
multi1
[,1] [,2]
[1,] 6 1
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
multi2
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 7 6 8 9 10
# Number of latent classes
k1 <- 2
k2 <- 2
The next step of the model specification consists in fixing some support
points to identify the model and adding equality constraints on
difficulties and discriminating indices to properly account for the
longitudinal data structure, similarly to (Cai 2010). In more detail, we
fix the support points of the latent variable
### Specification of model constraints
# Fix support points on latent variable U
# Fix support points on the first dimension of latent variable V
# Free support points on the second dimension of latent variable V
(Zth1 <- matrix(0, nrow(multi1)*k1, 0))
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
[7,]
[8,]
[9,]
[10,]
(zth1 <- c(rep(-1, times = nrow(multi1)), rep(1, times = nrow(multi1))))
[1] -1 -1 -1 -1 -1 1 1 1 1 1
(Zth2 <- diag(nrow(multi2)*k2)[ , -c(1,3)])
[,1] [,2]
[1,] 0 0
[2,] 1 0
[3,] 0 0
[4,] 0 1
(zth2 <- c(-1, 0, 1, 0))
[1] -1 0 1 0
# Equality constraints on difficulties and discriminating indices to account
# for the longitudinal data structure
(Zbe <- matrix(1, nrow(multi2), 1) %x% diag(nrow(multi1)))
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
[6,] 1 0 0 0 0
[7,] 0 1 0 0 0
[8,] 0 0 1 0 0
[9,] 0 0 0 1 0
[10,] 0 0 0 0 1
Zga2 <- Zbe
Zga1 <- Zbe
The final step consists in estimating the model according to the
structure specified above. The main difference with respect to the model
presented in the previous section (Example 1) is that now all items are
shared by both the latent variables Zth1
, zth1
, Zth2
, zth2
, Zbe
, Zga1
, and Zga2
,
according to equations (7)-(11).
out <- est_multi_poly_within(S = S, yv = yv, k1 = k1, k2 = k2, X = X, link = "global",
disc = TRUE, multi1 = multi1, multi2 = multi2, disp = TRUE,
output = TRUE, out_se = TRUE, Zth1 = Zth1, zth1 = zth1, Zth2 = Zth2,
zth2 = zth2, Zbe = Zbe, Zga1 = Zga1, Zga2 = Zga2)
Estimates of support points and item parameters may be displayed through
the usual methods summary
, coef
, and confint
:
### Display output
# summary(out)
# coef(out)
# confint(out)
# Support points and average weights of latent variable V
out$Th2
class
dimension 1 2
1 -1.0000000 1.000000
2 -0.6925198 1.679367
out$piv2
[,1]
[1,] 0.91757919
[2,] 0.08242081
# Conditional response probabilities for latent variable V
# Note that the latent classes are numbered following an increasing order,
# starting from latent variable U
round(out$Phi[,, 3:4], 3)
, , class = 3
item
category 1 2 3 4 5 6 7 8 9 10
0 0.994 0.982 0.944 0.986 0.998 0.991 0.974 0.919 0.981 0.997
1 0.006 0.018 0.056 0.014 0.002 0.009 0.026 0.081 0.019 0.003
, , class = 4
item
category 1 2 3 4 5 6 7 8 9 10
0 0.897 0.799 0.551 0.874 0.882 0.758 0.618 0.335 0.758 0.628
1 0.103 0.201 0.449 0.126 0.118 0.242 0.382 0.665 0.242 0.372
We observe that 91.8% of the individuals belong to class 1, which is
characterized by the smallest propensity to commit a crime, whereas the
remaining 8.24% of individuals are allocated in class 2. Both classes
present a tendency to increase the propensity to commit a crime from
time 1 to time 2. Indeed, the estimated support points of the propensity
to commit a crime at time 2 (i.e., estimates of
Finally, we outline that the constraints specified through matrices
Zbe
, Zga1
, and Zga2
result in the following estimates, which are
equal for every pair of items referring to the same type of crime
(i.e., items 1-6, 2-7, 3-8, 4-9, 5-10):
# Estimates of item parameters
beta <- cbind(out$Bec[1:5], out$Bec[6:10])
colnames(beta) <- c("time 1", "time 2")
gamma1 <- cbind(out$ga1c[1:5], out$ga1c[6:10])
colnames(gamma1) <- c("time 1", "time 2")
gamma2 <- cbind(out$ga2c[1:5], out$ga2c[6:10])
colnames(gamma2) <- c("time 1", "time 2")
beta
time 1 time 2
[1,] 4.994018 4.994018
[2,] 4.891019 4.891019
[3,] 2.700745 2.700745
[4,] 5.991825 5.991825
[5,] 5.544927 5.544927
gamma1
time 1 time 2
[1,] 1.318531 1.318531
[2,] 2.195804 2.195804
[3,] 1.184268 1.184268
[4,] 2.883623 2.883623
[5,] 1.353772 1.353772
gamma2
time 1 time 2
[1,] 1.510292 1.510292
[2,] 1.317424 1.317424
[3,] 1.311341 1.311341
[4,] 1.169848 1.169848
[5,] 2.183823 2.183823
In this paper we illustrate the R package MLCIRTwithin, whose main
function est_multi_poly_within
implements an Expectation Maximization
based approach to estimate the parameters of two-tier latent class item
response theory (IRT) models. This class of models, which extends in a
flexible way the class of basic IRT models, is based on two independent
vectors of latent variables. Moreover, two main assumptions hold: (i)
items are allowed to measure one or two latent variables (within-item
multidimensionality) and (ii) latent variables are assumed to have a
discrete distribution with a finite number of support points, which
identify homogeneous latent classes of individuals, and related mass
probabilities that may depend on individual covariates.
We illustrate the R package through two examples based on data about the measurement of health-related quality of life in cancer patients (Example 1) and about the measurement of the propensity to commit a crime in two time occasions (Example 2). The first example investigates the multidimensional structure of the questionnaire and is focused on the interpretation of the estimated model parameters. The second example illustrates how to treat longitudinal item responses through the specification of suitable constraints on support points and item parameters.
Both authors acknowledge the financial support from the grant FIRB (“Futuro in ricerca”) 2012 on “Mixture and latent variable models for causal inference and analysis of socio-economic data”, which is funded by the Italian Government (RBFR12SHVV). Authors are also grateful to Dr. G. Miccinesi of the Istituto per lo Studio e la Prevenzione Oncologica (Florence, IT) for making available the data used in Example 1.
MLCIRTwithin, MultiLCIRT, CDM, mirt, flirt, covLCA, lavaan, OpenMx, LMest
Econometrics, MissingData, MixedModels, OfficialStatistics, Psychometrics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Bacci & Bartolucci, "Two-Tier Latent Class IRT Models in R", The R Journal, 2016
BibTeX citation
@article{RJ-2016-038, author = {Bacci, Silvia and Bartolucci, Francesco}, title = {Two-Tier Latent Class IRT Models in R}, journal = {The R Journal}, year = {2016}, note = {https://rjournal.github.io/}, volume = {8}, issue = {2}, issn = {2073-4859}, pages = {139-166} }