We present the CRAN R package MAINT.Data for the modelling and analysis of multivariate interval data, i.e., where units are described by variables whose values are intervals of the real line.
In classical statistics and multivariate data analysis, the basic units
under analysis are single individuals, described by numerical and/or
categorical variables, each individual taking one single value for each
variable. For instance, a specific football player may be described by
his age, height, weight, goals marked, nationality; a specific passenger
by his/her gender, age, destination, weight of luggage, etc. Data are
organised in a data array, where each cell contains the value taken by a given individual for a given variable.
It is however often the case that the data under analysis are not single observations, but rather sets of values, either related to groups of units gathered on the basis of some common properties, or observed repeatedly over time or under different specific conditions. The classical framework is then too restricted to take into account the variability inherent to such data. This is the case when we are interested in describing football teams and not each specific player, or flights and not each particular passenger. The same issue often arises in official statistics analysis. Whether it is for the analysis’ purposes, or for confidentiality reasons, individual data – here usually called “microdata” – are gathered into more general data arrays, related to parishes, counties, socio-economical groups, etc. – the so-called “macro-data”. Internal variability should also be considered when the focus of the analysis lies in concepts (i.e., all elements sharing a given set of defining properties) rather than in a single specimen – whether it is a plant species (and not the specific plant I hold in my hand), a model of car (and not the particular one I am driving), etc. Another pertinent case arises when we are facing huge amounts of data, recorded in very large databases, and the elements of interest are not the individual records but some second-level entities. For instance, in a database of hypermarket purchases, we are surely more interested in describing the behaviour of some client (or some pre-defined class or group of clients) rather than each purchase by itself. The analysis then requires that the purchase data for each person (or group) be somehow aggregated to obtain the information of interest; here again the observed variability for each client or within each group is of utmost importance, and cannot be retained by summary statistics.
Symbolic Data Analysis (see e.g. (Diday and Noirhomme-Fraiture 2008), (Brito 2014)) provides a framework where the observed variability may effectively be considered in the data representation, and methods are developed that take it into account. To describe groups of individuals or concepts, new variable types may now assume other forms of realisations, which allow taking intrinsic variability into account. They may take the form of finite sets, intervals or distributions. In recent years, different approaches have been investigated and many methods proposed for the analysis of such symbolic data, and for the design of symbolic counterparts of statistical multivariate data analysis methods. Most existing methods for the analysis of such data rely however on non-parametric descriptive approaches. Among these, interval data are by far the most investigated data type, and the one for which most methods have been developed.
In (Brito and Duarte Silva 2012), parametric inference methodologies based on probabilistic models for interval variables are developed where each interval is represented by its midpoint and log-range, for which multivariate Normal and Skew-Normal (Azzalini and Dalla Valle 1996) distributions are assumed. The intrinsic nature of the interval variables leads to special structures of the variance-covariance matrix, which are represented by different possible configurations.
It should be noticed that we are modelling interval-valued variables, i.e. variables whose observed values are intervals, and not single-valued real variables. For this reason, they should not be confused with real-valued variables whose values are restricted to some intervals. Data structures for this latter type are available in some R packages such as survreg (Hubeaux and Rufibach 2015) or crch (Messner et al. 2019), but they obviously do not apply in our context.
In this paper, we present the package MAINT.Data (Duarte Silva and Brito 2021), which implements the proposed methodologies in R (R Core Team 2021). MAINT.Data is built using S4 classes and methods, introducing a specific data class for representing interval data. Functions for aggregating microdata into interval data objects are also provided. MAINT.Data includes functions and methods for modelling and analysing interval data, in particular maximum likelihood estimation and statistical tests for the different considered configurations. Methods for (M)ANOVA (Brito and Duarte Silva 2012) and Discriminant Analysis (Duarte Silva and Brito 2015) of this data class are also provided. For the Gaussian model, Model-based Clustering (Brito et al. 2015), robust estimation and outlier detection (Duarte Silva et al. 2018) are implemented; corresponding methods for Robust Discriminant Analysis are also available.
Multivariate analysis of interval-valued data has been addressed from different perspectives, such as Clustering (see, e.g., (De Carvalho et al. 2006; De Carvalho and Lechevallier 2009)), Principal Component Analysis (PCA) (see, e.g. (Douzal-Chouakria et al. 2011; Le-Rademacher and Billard 2012)), Discriminant Analysis (Duarte Silva and Brito 2015; Ramos-Guajardo and Grzegorzewski 2016), Regression Analysis (Lima Neto and De Carvalho 2008, 2010, 2011; Dias and Brito 2017), etc. For a survey the reader may refer to (Brito 2014). These are mostly non-parametric exploratory methodologies; recent approaches based on parametric models have also been proposed in (Brito and Duarte Silva 2012), (Le-Rademacher and Billard 2011), and (Lima Neto and De Carvalho 2011).
Many of the methods mentioned above for analysing interval-valued data may be found in R packages, namely symbolicDA (Dudek et al. 2019) (general multivariate data analysis/machine learning approaches, e.g. PCA, Discriminant Analysis, Multidimensional Scaling, Clustering), RSDA (Rodriguez 2021) (mainly classification and linear models), iRegression (Lima Neto et al. 2016) (Regression) and GPCSIV (Brahim and Makosso-Kallyth 2013) (PCA). We note that most of these packages implement non-parametric methods, an exception being iRegression, which includes regression based on the parametric approach proposed in (Lima Neto and De Carvalho 2011). To the best of our knowledge, no other implementations of parametric approaches for the (multivariate) analysis of interval-valued data are publicly available.
The remainder of the paper is organised as follows. In the next section, we introduce interval data array and fix notation. Section Models and estimation presents the proposed models and the estimation of corresponding parameters. Section Multivariate analysis develops multivariate analysis methods based on those models. Section Package discusses the main structure and technical implementation of the MAINT.Data package. In Section Applications, two applications illustrate the use of the package and its functionalities. Finally, Section Summary concludes the paper, pointing out perspectives for future developments.
Let $S = \{s_1, \ldots, s_n\}$ be the set of $n$ units under analysis, described by $p$ interval-valued variables $Y_1, \ldots, Y_p$. The value of an interval variable $Y_j$ for unit $s_i$ is the interval $Y_j(s_i) = [l_{ij}, u_{ij}]$, with $l_{ij} \le u_{ij}$. Each interval is represented by its MidPoint $c_{ij} = \frac{l_{ij}+u_{ij}}{2}$ and its Log-Range $r^*_{ij} = \ln(u_{ij} - l_{ij})$, and the data are organised in an $n \times p$ interval data array.
We note that the interval-valued data considered here do not represent uncertainty, but rather intrinsic variability. Such interval data may occur directly, or result from the aggregation of microdata. “Native” interval data are common e.g. in Botany and Zoology, one example being the length of the stem of a given plant species, which of course varies from specimen to specimen. The aggregation of microdata from potentially large databases also provides interval data, when individual numerical records are combined at the required level of granularity leading to a range of values representing the underlying variability. An example of such a case is the aggregation of the values of single purchases say, in the Bakery and Dairy section of a supermarket, for each client, during a year – we then obtain, for each client and for each supermarket section, an interval representing the variability of purchase values. Such aggregations are usually based on observed minima and maxima, but specific quantiles may also be considered for this purpose.
In (Brito and Duarte Silva 2012), parametric models for interval data, relying on multivariate Normal or Skew-Normal distributions for the MidPoints and Log-Ranges of the interval-valued variables have been proposed.
The Gaussian model consists in assuming a joint multivariate Normal distribution $N_{2p}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for the full vector of the $p$ MidPoints $C_j$ and the $p$ Log-Ranges $R^*_j$. Given that the MidPoint and the Log-Range of each interval variable refer to the same underlying quantity, correlations between them must be allowed for; the intrinsic nature of the interval variables then leads to the consideration of special structures of the variance-covariance matrix. The most general formulation allows for non-zero correlations among all MidPoints and Log-Ranges (configuration C1); in another setup, interval variables $Y_j$ are independent, although for each variable the MidPoint may be correlated with its Log-Range (configuration C2); a third case allows for correlations between different MidPoints and between different Log-Ranges, but not between MidPoints and Log-Ranges (configuration C3); finally, all MidPoints and Log-Ranges may be uncorrelated, both among themselves and between each other (configuration C4). Table 2 summarises these configurations.
Configuration | Characterization | $\Sigma$
C1 | Not restricted | Not restricted
C2 | The $Y_j$'s are non-correlated; $C_j$ may be correlated with $R^*_j$ | Block-diagonal, one $2 \times 2$ block per variable
C3 | $C_j$'s (respectively $R^*_j$'s) may be correlated among themselves, but not with each other | $\Sigma_{C R^*} = \Sigma_{R^* C} = \mathbf{0}$
C4 | All $C_j$'s and $R^*_j$'s are non-correlated | Diagonal
It should be remarked that for configurations C2, C3 and C4, $\Sigma$ has (possibly after row and column permutations) a block-diagonal or diagonal structure, so that the number of free parameters is substantially reduced.
In (Brito and Duarte Silva 2012) another configuration has been considered, where MidPoints (respectively, Log-Ranges) of different variables may be correlated, and the MidPoint of each variable may be correlated with its Log-Range, but no correlation between MidPoints and Log-Ranges of different variables is allowed. However, this case seems less natural, and leads to computational difficulties, since the resulting variance-covariance matrix has no convenient patterned structure; for these reasons, it is not considered in MAINT.Data.
The Gaussian model has many advantages, which explains its widespread use in multivariate data analysis; in particular, it allows for a direct modelling of the covariance structure between the variables. Nevertheless, it does present some limitations, namely the fact that it imposes a symmetric distribution on the MidPoints and a specific relation between mean, variance and skewness for the Ranges. A more general model that overcomes these limitations may be obtained by considering the family of Skew-Normal distributions (see, for instance, (Azzalini 1985; Azzalini and Dalla Valle 1996)). The Skew-Normal generalises the Gaussian distribution by introducing an additional shape parameter, while preserving some of its mathematical properties.
The density of a $p$-dimensional Skew-Normal distribution is given by
$$f(\mathbf{x}; \boldsymbol{\xi}, \boldsymbol{\Omega}, \boldsymbol{\alpha}) = 2\,\phi_p(\mathbf{x}-\boldsymbol{\xi}; \boldsymbol{\Omega})\,\Phi\!\left(\boldsymbol{\alpha}^t \boldsymbol{\omega}^{-1}(\mathbf{x}-\boldsymbol{\xi})\right),$$
where $\phi_p(\cdot\,; \boldsymbol{\Omega})$ is the density of the $N_p(\mathbf{0}, \boldsymbol{\Omega})$ distribution, $\Phi$ is the standard Gaussian distribution function, $\boldsymbol{\xi}$ is a location parameter, $\boldsymbol{\Omega}$ a scale matrix, $\boldsymbol{\omega}$ the diagonal matrix of standard deviations extracted from $\boldsymbol{\Omega}$, and $\boldsymbol{\alpha}$ a vector of shape parameters.
Notice that the Skew-Normal model encompasses mixed models with marginal Normal random variables, for which the corresponding shape parameter is null.
The mean vector, variance-covariance matrix, and skewness coefficients of a Skew-Normal distribution may be obtained in closed form from the location, scale and shape parameters $(\boldsymbol{\xi}, \boldsymbol{\Omega}, \boldsymbol{\alpha})$.
As an alternative to the Gaussian model, it may be considered that the joint distribution of the MidPoints and Log-Ranges is Skew-Normal; the four configurations described above then apply to the corresponding variance-covariance matrix.
As discussed in the previous subsection, (Brito and Duarte Silva 2012) consider as possible models for interval-valued data, eight possible combinations of two multivariate distributions (Gaussian or Skew-Normal) with four covariance configurations. Given an observed data set, the choice among these models may be based on their maximised likelihood using usual information criteria such as the Bayesian Information Criterion (BIC) (Schwarz 1978), the Akaike Information Criterion (AIC) (Akaike 1974), or pairwise likelihood ratio tests. In this subsection we will present the details of the respective maximum likelihood estimation.
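For instance, under the Skew-Normal model with the unrestricted configuration C1, a $2p$-dimensional vector has $2p$ location, $2p$ shape and $2p(2p+1)/2$ free variance-covariance parameters. The following plain-R check (a sketch using values that anticipate the flights application below, where $p=4$ and $n=142$) shows how the reported BIC values relate to the maximised log-likelihoods:

R> npar <- 8 + 8 + 8*9/2            # location + shape + covariance parameters
R> -2*(-2207.444) + npar*log(142)   # BIC = -2*logLik + npar*log(n)
[1] 4672.591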
Let $\bar{\mathbf{x}}$ and $\mathbf{S}$ denote the sample mean vector and the sample covariance matrix (with denominator $n$) of the observed MidPoints and Log-Ranges. Under the unrestricted configuration C1, the maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are simply $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ and $\hat{\boldsymbol{\Sigma}} = \mathbf{S}$; for the restricted configurations C2 to C4, the estimator of $\boldsymbol{\Sigma}$ is obtained from $\mathbf{S}$ by setting to zero the entries constrained to be null by the corresponding configuration.
Azzalini and Capitanio (see, e.g., (Azzalini and Capitanio 1999; Azzalini 2005)) have obtained the log-likelihood of a Skew-Normal sample as a function of the location, scale and shape parameters — an expression henceforth referred to as (5) — which has to be maximised numerically.
The optimal likelihood solution for the Skew-Normal model with restricted configurations may not be obtained by simply replacing corresponding entries in the appropriate matrices, because of the non-linear relations between the parameters in (5) and the entries of the variance-covariance matrix that the configurations restrict; a separate numerical optimisation is therefore required for each restricted configuration.
Multivariate datasets often include data units that deviate from the main pattern, usually called outliers, which may strongly influence the maximum likelihood estimators, creating the need for alternative (robust) estimators. In the context of interval-valued data this problem has been addressed in (Duarte Silva et al. 2018).
There is an extensive literature on robust estimation of location and scatter parameters. Trimmed likelihood estimators (Hadi and Luceño 1997) are based on a sample subset, keeping only the units with the largest contributions to the likelihood function. In the Gaussian case, this approach amounts to finding the subset of a given size whose sample covariance matrix has minimum determinant, i.e., to the well-known Minimum Covariance Determinant (MCD) estimator.
Outlier detection usually relies on Mahalanobis distances, flagging units as outliers if their distances from an appropriate estimate of the multivariate mean, computed with an appropriate estimate of the covariance matrix, exceed a chosen quantile of a reference distribution.
Moreover, more efficient one-step re-weighted MCD estimators are often used (Hubert et al. 2008). These are obtained by giving null weight only to the units for which the raw squared robust Mahalanobis distance exceeds a high threshold value, e.g., the 97.5% quantile of an appropriate reference distribution.
In practice, one needs to specify the number of units to be kept in the trimmed subset or, equivalently, the trimming proportion; this choice may be fixed by the user or made by a data-driven procedure (see the description of the fasttle arguments in Section Package).
The trimmed Maximum Likelihood approach described above has been adapted to the problem of robust parameter estimation for the Gaussian models proposed for interval-valued data. For all considered covariance configurations, the trimmed log-likelihood takes the form of a Gaussian log-likelihood evaluated only on the units kept in the trimmed subset.
In the case of a restricted covariance matrix, the block diagonal structure always implies that trimmed likelihood maximisation is equivalent to the minimisation of the determinant of the restricted trimmed sample covariance matrix.
The one-step re-weighted bias-corrected estimators are given by

$$\hat{\boldsymbol{\mu}}_{RW} = \frac{\sum_{i=1}^{n} w_i\, \mathbf{x}_i}{\sum_{i=1}^{n} w_i}, \qquad \hat{\boldsymbol{\Sigma}}_{RW} = c\, \frac{\sum_{i=1}^{n} w_i\, (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_{RW})(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_{RW})^t}{\sum_{i=1}^{n} w_i} \qquad (6)$$

where $w_i$ equals one if the raw squared robust Mahalanobis distance of unit $s_i$ does not exceed the chosen threshold, and zero otherwise. In expression (6), $c$ is a bias-correction factor, which depends on the assumed reference distribution and on the trimming proportion (see (Duarte Silva et al. 2018) for details).
These estimates may then be used for outlier detection in an interval-valued dataset. For that purpose, the robust squared Mahalanobis distance of each unit is compared with an appropriate quantile of the assumed reference distribution, and the unit is flagged as an outlier whenever its distance exceeds this cutoff.
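As an illustration of this flagging rule, a base-R sketch (X, muR and SigR are hypothetical placeholders for the data matrix of MidPoints and Log-Ranges and for robust estimates of its mean and covariance):

R> d2 <- mahalanobis(X, center=muR, cov=SigR)   # robust squared distances
R> cutoff <- qchisq(0.975, df=ncol(X))          # Chi-square reference quantile
R> which(d2 > cutoff)                           # units flagged as outliers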
The models presented above for interval-valued variables allow addressing (M)ANOVA problems with interval data - see (Brito and Duarte Silva 2012). Since each interval-valued variable $Y_j$ is modelled by the pair (MidPoint $C_j$, Log-Range $R^*_j$), an analysis of variance of $Y_j$ is accomplished by a two-dimensional MANOVA of this pair.
Assume a one-way design, with a single factor with $k$ levels defining $k$ groups; the null hypothesis states that all groups share a common distribution.
In the Gaussian model, the usual likelihood ratio statistic $\lambda$ compares the maximised likelihood under the null model, with common parameters across groups, with that of the full model, where group-specific parameters are allowed. As usual, under the null hypothesis, $-2 \ln \lambda$ follows asymptotically a Chi-square distribution, with the number of degrees of freedom given by the difference between the numbers of free parameters of the two models.
A simultaneous analysis of all the $p$ interval-valued variables is accomplished by a $2p$-dimensional MANOVA of the full vector of MidPoints and Log-Ranges.
The classical decision theoretic approach to Discriminant Analysis assumes that a given vector of attributes follows some known distribution and derives an optimal classification rule that minimises either the misclassification probability or the expected value of the misclassification cost. Parametric discriminant analysis of interval-valued data based on the models above has been investigated in (Duarte Silva and Brito 2015).
In a problem with $k$ groups, a unit with attribute vector $\mathbf{x}$ is assigned to the group $g$ that maximises $p_g f_g(\mathbf{x})$, where $p_g$ denotes the prior probability of group $g$ and $f_g$ the corresponding group-conditional density. When the groups share a common covariance matrix, the resulting Gaussian rule is linear in $\mathbf{x}$; when the covariance matrices differ across groups, the rule is quadratic.
Consider the Gaussian model for interval data. For each covariance configuration, an estimate of the optimum classification rule can be obtained by direct generalisation of the classical linear and quadratic discriminant classification rules, leading to group assignments defined by, respectively,

$$\hat{g}(\mathbf{x}) = \arg\max_{g} \left( \hat{\boldsymbol{\mu}}_g^t \hat{\boldsymbol{\Sigma}}^{-1} \mathbf{x} - \tfrac{1}{2}\, \hat{\boldsymbol{\mu}}_g^t \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_g + \ln p_g \right) \qquad (8)$$

$$\hat{g}(\mathbf{x}) = \arg\max_{g} \left( -\tfrac{1}{2}\, (\mathbf{x} - \hat{\boldsymbol{\mu}}_g)^t \hat{\boldsymbol{\Sigma}}_g^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}_g) - \tfrac{1}{2} \ln |\hat{\boldsymbol{\Sigma}}_g| + \ln p_g \right) \qquad (9)$$

In MAINT.Data, all mean and covariance estimates in (8) and (9) may be obtained by either classical maximum likelihood or the robust trimmed maximum likelihood approach (see Section Robust estimation and outlier detection).
We note that for the restricted configurations C2, C3 and C4, the covariance estimates used in rules (8) and (9) are obtained under the corresponding restrictions, by setting the appropriate entries to zero, so that the resulting rules differ from the classical unrestricted ones.
For the Skew-Normal case, we consider a Location Model, in which the groups differ only in terms of the location parameter $\boldsymbol{\xi}$, sharing common scale and shape parameters, and a General Model, in which all parameters may differ across groups.
Model-based Clustering considers the data as arising from a distribution that is a mixture of two or more components (Banfield and Raftery 1993; McLachlan and Peel 2000; Fraley and Raftery 2002). Each component, which can be thought of as a cluster, is characterised by a conditional density/mass function and has an associated probability or “weight”. When the conditional probability is specified as the multivariate Gaussian density, the probability model for clustering will be a finite mixture of multivariate Normals (known as the Gaussian mixture model).
The problem consists in estimating the model parameters for each component, as well as the membership (posterior) probabilities of each unit. To this purpose, the Expectation-Maximisation (EM) algorithm (Dempster et al. 1977) is commonly used. The method alternates between an expectation (E) step, which computes the expectation of the log-likelihood at the current parameter estimates, and a maximisation (M) step, which estimates parameters maximising the expected log-likelihood found in the E step.
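As a simple illustration of one EM iteration, consider the following univariate Gaussian-mixture sketch in plain R (illustrative only; IdtMclust implements the corresponding steps for the interval-data models in C++):

R> em_step <- function(x, tau, mu, sd) {
     d <- sapply(seq_along(tau), function(g) tau[g]*dnorm(x, mu[g], sd[g]))
     z <- d / rowSums(d)                    # E-step: posterior probabilities
     tau <- colMeans(z)                     # M-step: mixing probabilities
     mu <- colSums(z*x) / colSums(z)        # M-step: component means
     sd <- sqrt(colSums(z*(x - rep(mu, each=length(x)))^2) / colSums(z))
     list(tau=tau, mu=mu, sd=sd)            # iterate until convergence
   }

Idtmclust iterates analogous E and M steps for $2p$-dimensional Gaussian components under the chosen covariance configuration.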
Model-based Clustering of interval data has been addressed in (Brito et al. 2015), by considering the Gaussian parametrisation described above (see Section Models and estimation). For that purpose, the EM algorithm has been adapted to the likelihood maximisation in our models, for the different covariance configurations.
The finite mixture model with $G$ components assumes that each unit is drawn from a distribution with density
$$f(\mathbf{x}) = \sum_{g=1}^{G} \tau_g\, f_g(\mathbf{x}; \boldsymbol{\theta}_g),$$
where $\tau_1, \ldots, \tau_G$ are the mixing probabilities ($\tau_g > 0$, $\sum_{g=1}^{G} \tau_g = 1$) and $\boldsymbol{\theta}_g$ gathers the parameters of component $g$. Here the conditional distribution of each component is given by a $2p$-dimensional Gaussian density of the MidPoints and Log-Ranges, under one of the four covariance configurations, in a homoscedastic or a heteroscedastic setup. In Model-based Clustering of interval data, each unit is then assigned to the component with the highest posterior membership probability. For the selection of the appropriate model and the number of components $G$, the BIC is used by default.
MAINT.Data is built around S4 classes and methods, the most important being the IData class and the classes derived from the virtual IdtE (IData Estimates) class. Further specialised classes used to store the results of various multivariate analyses (e.g. Model-based Clustering, MANOVA and Discriminant Analysis) are also available. Figure 1 shows common interactions between different objects of MAINT.Data classes.
We note that in addition to the flow shown in Figure 1, objects containing the results of Discriminant Analysis of Interval Data may also be obtained from appropriate objects of class IdtMANOVA, or directly from the combination of objects of class IData with a grouping factor.
Class IData, which is used to store datasets of interval-valued variables, is the central class in the MAINT.Data package. Its design aims at replicating the functionalities of classical data frames as smoothly as possible. As seen in Figure 1, objects of class IData may be created in one of two alternative ways: (i) directly from data frames containing either lower and upper bounds or MidPoints and Log-Ranges, using the creator function IData; (ii) by aggregation of a data frame of the microdata by a given aggregating factor and criterion (e.g. min-max or a given pair of quantiles), using the function AgrMcDt.
The creator function IData takes five arguments as input. The first one, named Data, refers to a data frame or matrix containing either the lower and upper bounds or the MidPoints and Log-Ranges of the observed intervals, where each row corresponds to a different unit. Then, Seq is a string which describes the sequence of the data for each unit, namely, lower and upper bounds variable by variable ("LbUb_VarbVar", default), MidPoints and Log-Ranges variable by variable ("MidPLogR_VarbVar"), all lower bounds followed by all upper bounds ("AllLb_AllUb"), or all MidPoints followed by all Log-Ranges ("AllMidP_AllLogR"). The third and fourth arguments, named VarNames and ObsNames, allow the user to specify the variables' and units' names, respectively. Finally, the last argument, NbMicroUnits, provides the number of micro observations corresponding to each unit, when available. A typical call of this function would be ExampleIdt <- IData(dataDF, VarNames=c("Var1","Var2")) (no names for the units, number of micro observations corresponding to each unit not available).
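For illustration, a minimal sketch building a small IData object from scratch (hypothetical variable names; bounds given variable by variable, matching the default Seq):

R> dataDF <- data.frame(TempLb=c(10.1,12.4), TempUb=c(15.0,18.2),
                        PressLb=c(0.9,1.1), PressUb=c(1.7,2.3))
R> ExampleIdt <- IData(dataDF, Seq="LbUb_VarbVar", VarNames=c("Temp","Press"))
R> print(ExampleIdt)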
Function AgrMcDt has three arguments. The first one, MicDtDF, indicates a data frame with the microdata. The second argument, agrby, refers to a factor with the categories according to which the microdata should be aggregated. The last argument, agrcrt, specifies whether aggregation is done with the minimum and maximum observed values, or else based on user-defined quantiles. An example is shown in Section Applications.
A UML diagram of class IData is shown in Figure 2. As seen there, class IData implements specialised versions of standard R methods such as summary, print, nrow and ncol, rownames and colnames, rbind, cbind and plot. Special care has been taken in the development of the indexing operators and of a specialised cbind method, so that they work as smoothly as with data frames, but treating each column of IData as one interval-valued variable.
The remaining IData methods perform parameter estimation and/or multivariate analysis, leading to objects of class IdtE (parameter estimation), IdtMANOVA (Multivariate Analysis of Variance), Idtda (Discriminant Analysis), or IdtMclust (Model-based Clustering). All these methods include a CovCase argument used to specify the covariance configurations assumed; by default, the BIC of the results for all four configurations is compared, and the one with the lowest BIC value is selected.
The IdtE class is an abstract (virtual) class used to store parameter estimates of the models assumed for interval-valued variables. As shown in Figure 3, there are currently eight such specialisations, depending on the model assumed and the type of estimation performed. The names of these classes always start with the letters Idt, followed by Sng or Mx (estimates of parameters of a single or of several distributions), ND, SND or NandSND (Gaussian, Skew-Normal, or both Gaussian and Skew-Normal distributions), and end with E or RE (Maximum Likelihood or Robust estimates).
As shown in Figure 4, the same reasoning applies to the classes derived from the virtual class IdtMANOVA. However, in this case, only Maximum Likelihood estimation has been considered, and the specialisations distinguish classical MANOVA (class IdtClMANOVA), heteroscedastic MANOVA based on Gaussian distributions (IdtHetNMANOVA), Skew-Normal based MANOVA assuming that groups may differ only in location (IdtLocNMANOVA) or in all parameters (IdtGenNMANOVA), and analyses that consider both Gaussian and Skew-Normal assumptions (IdtLocNSNMANOVA and IdtGenNSNMANOVA).
Maximum likelihood estimation is performed by the mle method, which has six arguments. The first one, Idt, refers to an IData object representing interval-valued units. The second argument, Model, indicates the joint distribution assumed for the MidPoints and Log-Ranges; alternatives are "Normal" for Gaussian (default), "SKNormal" for Skew-Normal and "NrmandSKN" for both Gaussian and Skew-Normal distributions. The next argument, CovCase, indicates the configurations of the variance-covariance matrix to be used (default: 1:4). The fourth argument, SelCrit, indicates the model selection criterion, BIC (default) or AIC. The argument k2max specifies a tolerance criterion to identify singular correlation matrices. Finally, OptCntrl provides a list of optional control parameters to be passed to the optimisation routine.
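For instance, a minimal call relying on all defaults (a sketch, where myIdt stands for any IData object with enough units) fits the Gaussian model under all four configurations and selects the best one by BIC:

R> est <- mle(myIdt)
R> summary(est)
R> coef(est)      # estimates under the selected configuration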
Robust estimation is usually performed by the fasttle method. For small datasets, the fulltle method may be used instead; its arguments are common to fasttle. The first three arguments of fasttle are the same as for the mle method. Arguments alpha and getalpha specify how the trimming proportion is chosen. Other important arguments are the following: use.correction indicates whether to use finite sample correction factors; default is TRUE. rawMD2Dist provides the assumed reference distribution of the raw MCD squared distances, used to find the cutoffs defining the observations kept in the one-step reweighted MCD estimates; alternatives are "ChiSq" for the usual Chi-square (default), and "HardRockeAsF" and "HardRockeAdjF", respectively the asymptotic and the adjusted scaled F distributions proposed by Hardin and Rocke (2005). MD2Dist provides the assumed reference distribution used to find the cutoffs defining the observations flagged as outliers; alternatives are "ChiSq" and "CerioliBetaF", respectively for the usual Chi-square, and the Beta and F distributions proposed by Cerioli (2010). reweighted indicates whether a (re)weighted estimate of the covariance matrix should be used in the computation of the trimmed likelihood, or just a "raw" covariance estimate; default is TRUE. Argument outlin specifies the type of outliers to be considered; alternatives are "MidPandLogR" if outliers may be present in both MidPoints and Log-Ranges, "MidP" if outliers are only present in MidPoints, and "LogR" if outliers are only present in Log-Ranges.
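For instance, a sketch of a robust fit that fixes the trimming-related argument and considers outliers in the MidPoints only (myIdt again stands for any IData object; argument names as described above):

R> rob <- fasttle(myIdt, CovCase=1, alpha=0.75, outlin="MidP")
R> summary(rob)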
Method MANOVA applies multivariate analysis of variance. The arguments Idt, Model, CovCase, SelCrit, k2max and OptCntrl are identical to the corresponding ones of method mle. Argument grouping indicates the factor whose levels are the different groups. MxT indicates the type of mixing distributions to be considered: "Hom" (homoscedastic) or "Het" (heteroscedastic) for Gaussian models, "Loc" (location model) or "Gen" (general model) for Skew-Normal models (see Section Discriminant Analysis above). CVtol provides a tolerance value to identify almost constant variables within groups.
To perform discriminant analysis, three methods may be applied, namely, lda (linear discriminant analysis), qda (quadratic discriminant analysis) and snda (Skew-Normal based discriminant analysis). In all these methods, the first argument, x, denotes an IData object representing the interval-valued units, or else an object of class IdtMANOVA. Arguments grouping, CVtol, CovCase, SelCrit and k2max are identical to the corresponding ones of method MANOVA. The argument prior is used to specify the prior probabilities of group membership; by default, they are fixed at the corresponding training set proportions. In method snda, the argument MxT indicates the type of mixing distributions to be considered: "Loc" (location model, default) or "Gen" (general model).
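A minimal sketch (assuming an IData object myIdt and a hypothetical grouping factor g):

R> ldares <- lda(myIdt, grouping=g)     # configuration chosen by BIC, by default
R> predict(ldares, myIdt)$class         # predicted group memberships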
Method Idtmclust performs model-based clustering based on finite mixtures of Gaussian distributions. Arguments Idt, CovCase and SelCrit are identical to the corresponding ones in the previous methods. The argument G provides the numbers of clusters (segments) of the mixture to be considered; by default it is set to 1:9. MxT indicates the type of mixing distributions to be considered: "Hom" (homoscedastic, default), "Het" (heteroscedastic), or "HomandHet" (both). Finally, the argument control provides a list of control parameters for the EM algorithm.
The implementation of the IData class, as well as of maximum likelihood estimation and of the multivariate methods based on the Gaussian distribution, is relatively straightforward. As shown in Figure 2, the internal structure of the IData class consists of two data frames, containing MidPoints and Log-Ranges, respectively, a couple of auxiliary constants and string vectors, and the integer vector NbMicroUnits, which stores the number of microunits aggregated to form each interval-valued unit, when known. Therefore, IData objects require roughly twice the memory space used by traditional data frames. The IData slots may be retrieved by the accessor methods MidPoints, LogRanges, rownames, colnames, nrow, and ncol.
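For instance (a sketch using the hypothetical myIdt object from before):

R> head(MidPoints(myIdt))       # MidPoints of the interval-valued variables
R> head(LogRanges(myIdt))       # corresponding Log-Ranges
R> c(nrow(myIdt), ncol(myIdt))  # numbers of units and of interval variables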
The structure of the classes derived from the virtual IdtE class (see Figure 3) depends on the type of model specified and estimation performed. In addition to the common slots of the IdtE class, these classes include vector and/or matrix slots with estimates that are constant across all covariance configurations, and a list slot named ConvConfCases, in which each component contains the estimates obtained under the assumption of a particular configuration. We note that, although only the estimates corresponding to one single configuration are displayed and used in further analysis, all estimates resulting from the configurations specified by the argument CovCase are stored, and available to the user. The same logic applies to analyses that consider more than one model, with the results for all models being stored, but only one displayed by the summary and print methods.
The maximum likelihood estimation and multivariate analysis based on the Gaussian distribution do not entail any particular difficulties, usually involving well-known formulae and the replacement of some values by zero, according to the covariance configuration assumed. Covariance matrices of the Gaussian estimators are also computed in a straightforward manner and passed, if so requested, to the appropriate stdEr and vcov methods.
Maximum likelihood estimation of Skew-Normal parameters requires the numerical optimisation of the non-convex log-likelihood function (5). As this function often has many different local optima, MAINT.Data adopts a repeated local search strategy, calling a given local optimiser from different starting points. This is implemented in the auxiliary function RepLOptim, which works as described below.
First, a local optimiser is called from an initial starting point, leading to a local optimum. Then, this optimum is modified by a random perturbation, and the modified optimum is used as the starting point of a new call to the local optimiser. This process is repeated until several (default: 50) consecutive calls to the optimiser fail to improve the current best solution, or a limit (default: 250) on the total number of local optima is reached. This limit, the maximum number of non-improving consecutive local optimisations, and several other control options are set by default to reasonable values, but can be modified through the corresponding RepLOptim arguments or, for the methods that internally call RepLOptim (such as mle, MANOVA or snda), through the components of a list supplied to their OptCntrl argument.
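The overall idea may be sketched in a few lines of plain R (illustrative only: f is the objective function, x0 a starting point, and perturb() a hypothetical random-perturbation helper; RepLOptim itself has a richer interface):

R> best <- nlminb(x0, f)                      # first local optimisation
R> fails <- 0; total <- 1
R> while (fails < 50 && total < 250) {        # default stopping limits
     cand <- nlminb(perturb(best$par), f)     # restart from a perturbed optimum
     total <- total + 1
     if (cand$objective < best$objective) { best <- cand; fails <- 0 }
     else fails <- fails + 1
   }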
The default local optimiser of RepLOptim is the nlminb PORT function (Gay 1990). However, in the case of maximum likelihood estimation of Skew-Normal parameters with the unrestricted covariance configuration (C1), MAINT.Data overrides this default with the msn.mle function of Azzalini's sn package (Azzalini 2021). For the remaining configurations, the local optimisation relies on a quasi-Newton optimiser (by default nlminb), using the analytical gradient of the centred Skew-Normal parametrisation derived by (Valle and Azzalini 2008). In order to improve computational efficiency, the computation of the log-likelihood (5) and of its gradient was coded in C++, taking advantage of the numerical functions and classes provided by the Rcpp (Eddelbuettel and François 2011) and RcppArmadillo (François et al. 2021) packages. We note that global optimality cannot be ensured, and even with this strategy MAINT.Data sometimes identifies different local optima in different runs.
Once the optimisation of the log-likelihood (5) is completed, MAINT.Data approximates the covariance of the estimators using the evaluation of the expected Fisher information matrix implemented in the sn package. This approximation may fail if either the expected information matrix is ill-conditioned or the parameter estimates fall on the frontier of their domain. In such cases, posterior calls to the stdEr or vcov methods will result in appropriate warning messages.
The robust estimation of Gaussian model parameters by the trimmed maximum likelihood principle is implemented in the fulltle and fasttle methods. Method fulltle makes a full combinatorial search for the Trimmed Maximum Likelihood estimates, and should only be used when the number of units is relatively small (say, not much larger than 40). Method fasttle adapts the fast algorithm of Rousseeuw and Van Driessen (Rousseeuw and Van Driessen 1999). Both methods were coded in C++, using functions and classes from Rcpp and RcppArmadillo. Furthermore, the methods RobMxtDESt, Roblda and Robqda call fasttle or fulltle in order to obtain robust estimates in different groups, which may be used for robust discriminant analysis.
The interface of the MAINT.Data robust methods and classes is partially based on the framework developed in the popular rrcov package (Todorov and Filzmoser 2009). In particular, the control options for the estimation algorithm used in the fasttle, RobMxtDESt, Roblda and Robqda methods can be provided by an argument of class RobEstControl, which inherits and extends the class CovControl of the package rrcov. This way, algorithmic options may be specified in a uniform and familiar manner. The additional slots of class RobEstControl specify new options, such as the indicators of the distributions assumed for the robust Mahalanobis distances, the nature of the outliers (only in MidPoints, only in Log-Ranges or, by default, in both MidPoints and Log-Ranges), whether a two-step procedure should be used to find the trimming parameters, and other choices that are available in MAINT.Data but not in rrcov.
The MANOVA methods available in MAINT.Data are always based on the maximum likelihood principle. By default, the Chi-square distribution is used for the test statistic. However, for small samples, a permutation test has been implemented in the auxiliary function MANOVAPermTest (see (Seber 2009)).
The design and interface of class IdtMclust is modelled after class Mclust of the mclust package (Scrucca et al. 2016). As a result, the IdtMclust print and summary methods, with their default argument values, display only a very general description of the clustering results. A characterisation of the obtained clusters, and the partition itself, may be inspected by changing the summary arguments parameters and classification from FALSE to TRUE. A difference between the mclust and IdtMclust classes lies in the fact that, in the former case, detailed clustering results can only be retrieved directly from the Mclust slots, while IdtMclust provides accessor methods such as parameters, pro, mean, var, cor and classification to retrieve these results. The EM algorithm used in IdtMclust is implemented in C++, using facilities of the Rcpp and RcppArmadillo packages.
To illustrate the modelling and methods presented above, we use the flights dataset from the R data package nycflights13, available at CRAN, which contains on-time data for all flights that departed New York city in 2013. The original microdata consists of 336776 flights characterized by nineteen variables.
From these data we created a data frame named FlightsDF, after removing all rows with missing data, with six columns corresponding to the following descriptive variables at the microdata level: Month, Carrier (16 different carriers), Departure delay (min), Arrival delay (min), Air time (min), and Distance (miles).
We consider as units of interest the classes formed by crossing Month with Carrier, leading to 185 units (note that not all of the 192 possible combinations are present in the microdata). Therefore, we created a factor, named FlightsUnits, defining the class each individual case belongs to.
The command
R> FlightsIdt <- AgrMcDt(FlightsDF,FlightsUnits)
creates an interval data object FlightsIdt, where the values of the numerical variables Departure delay (DD), Arrival delay (AD), Air time (AT) and Distance (DT) are aggregated in the form of intervals for each unit. Leaving the aggregation argument agrcrt at its "minmax" default, the lower and upper bounds of the obtained intervals are, respectively, the minimum and maximum values observed in the microdata.
However, we prefer to use the robust aggregation alternative, filtering out the 5% lowest and the 5% highest observations, i.e., aggregating with the 5% and 95% quantiles:

R> FlightsIdt <- AgrMcDt(FlightsDF,FlightsUnits,agrcrt=c(0.05,0.95))
We note that the 43 units for which, for any variable, the lower and the upper bound are equal (degenerate interval) are eliminated, so that the final interval dataset has 142 units.
Table 3 shows a few rows of the resulting interval data table, where units such as Jan-9E, Jan-AA, ..., Dec-YV are described by intervals of Departure delay, Arrival delay, Air time and Distance; the full interval dataset is available in MAINT.Data.
Figure 5 illustrates the two alternative outputs of the plot method - (a) crosses, (b) rectangles - resulting respectively from the commands
R> plot(FlightsIdt[,"distance"],FlightsIdt[,"arr_delay"],
cex.main=3, cex.lab=1.9, cex.axis=2)
R> plot(FlightsIdt[,"distance"],FlightsIdt[,"arr_delay"],type="rectangles",
cex.main=3, cex.lab=1.9, cex.axis=2)
showing the intervals corresponding to the 142 units in two different forms for variables Distance and Arrival delay.
We note that the graphical arguments of traditional R plots are also available in the MAINT.Data plot methods. In this example, the default graphical settings were adequate for online display, but resulted in too small axes and legends when the resulting graphs were exported to an external file. Therefore, we used the cex.main, cex.lab, and cex.axis traditional R plot arguments to improve their readability. This particular example worked well on a PC under Linux but, since graphical characteristics are machine and operating system dependent, other argument values may be required in different computing environments.
Figure 6 plots the MidPoints versus the Log-Ranges for variable Arrival delay, resulting from the command
R> plot(MidPoints(FlightsIdt)[,"arr_delay.MidP"],
LogRanges(FlightsIdt)[,"arr_delay.LogR"],
xlab="Mid Points",ylab="Log Ranges",
main="Mid Points vs. Log Ranges for Arrival delays",
cex.main=2, cex.lab=1.5, cex.axis=1.5)
We observe a strong positive correlation between the MidPoints and the Log-Ranges of the Arrival delay, which is not uncommon for interval-valued variables.
The following statistical analyses of the data are directed towards the methods presented earlier. Accordingly, a first analysis aims at statistically characterising the input variables and the possible relationships between them. Then the interest is in possible outliers in the interval data: are there specific carrier/month combinations which are atypical for the observed features? The identified outliers will be excluded from the data for the subsequent analyses, which could otherwise be affected by them. The data are then split into two groups, the mainline carriers and the regional carriers. Do these groups differ in the considered variables (MANOVA)? Is it possible to distinguish the observations of the two groups from each other (discriminant analysis)? Are there even more subgroups in the multivariate data, and how can those be characterised (cluster analysis)?
We start by fitting the Gaussian and Skew-Normal models for all four considered covariance configurations, using the commands
R> Flightsmle <- mle(FlightsIdt,Model="NrmandSKN")
R> summary(Flightsmle)
which produce the output
Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2
-2278.626 -3249.162 -2547.369 -3564.026 -2207.444 -3183.179
SNModCovC3 SNModCovC4
-2491.175 -3498.258
Bayesian (Schwartz) Information Criteria:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2
4775.309 6597.440 5233.501 7207.346 4672.591 6505.121
SNModCovC3 SNModCovC4
5160.760 7115.455
Selected model:
[1] "SNModCovC1"
We recall that for the Skew-Normal model only local optima are identified, so that in different runs slightly different solutions may be obtained.
Among the eight models, the Skew-Normal with covariance configuration C1 leads to the lowest BIC value and is therefore selected. Pairwise likelihood ratio tests between the different models may be performed with the command testMod(Flightsmle). In this case, for any reasonable significance level, these tests also suggest the Skew-Normal model with covariance configuration C1.
The estimates of the mean, standard deviation and skewness coefficient vectors and of the variance-covariance matrix may be extracted by the usual coef() method. Alternatively, the standard methods mean(), sd(), var() and cor() may also be used. Furthermore, standard errors and variances and covariances of the estimates may be obtained, as usual, by the methods stdEr() and vcov().
The estimates for mean values, standard deviations and skewness coefficients are:
The estimate of the correlation matrix is:
We observe that MidPoints are positively correlated with the
corresponding Log-Ranges, with strong correlations for the delay
variables and moderate correlations for Distance and Air Time. The
MidPoints of Departure delay and Arrival delay on the one hand, and Air
time and Distance, on the other hand, have, as expected, strong
correlations; the corresponding Log-Ranges also present high
correlations. The observed correlation values explain the choice of the
unrestricted covariance configuration C1.
Robust estimation results are obtained by the commands:
R> Flightstle <- fasttle(FlightsIdt)
R> summary(Flightstle)
which produce the output
Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4
-1416.930 -2069.002 -1720.921 -2490.111
Bayesian (Schwartz) Information Criteria:
NModCovC1 NModCovC2 NModCovC3 NModCovC4
3040.279 4231.831 3573.200 5055.283
Selected model:
[1] "NModCovC1"
The estimates for mean values and standard deviations are now:
To identify outliers, we employ the default options of the robust methods and functions implemented in MAINT.Data, namely a cut-off based on the Chi-square distribution, and a trimming parameter chosen by the default two-step procedure (argument getalpha of fasttle).
The following command allows obtaining the list of units identified as outliers:
R> Flights_Otl <- getIdtOutl(FlightsIdt,Flightstle)
R> print(Flights_Otl)
which returns
Jan-FL Jan-VX Feb-FL Feb-VX Mar-FL Mar-VX Apr-FL Apr-VX Apr-YV
6 10 17 21 28 32 39 43 45
May-FL May-VX Jun-9E Jun-FL Jun-VX Jun-WN Jun-YV Jul-9E Jul-FL
51 55 58 63 67 68 69 70 75
Jul-VX Aug-FL Aug-VX Aug-YV Sep-VX Sep-YV Oct-FL Oct-VX Nov-FL
79 87 91 93 103 105 111 115 123
Nov-OO Nov-VX Nov-YV Dec-FL Dec-VX
125 128 130 136 140
From this list it is visible that FL (AirTran Airways) is an outlier for almost all months. When inspecting the aggregated data, it can be seen that the upper bound of the Distance is clearly lower than for the other airlines, and consequently also the upper bound of Air time. The contrary happens for airline VX (Virgin America), which is an outlier for all months. Note, however, that outlyingness can also be caused by a different multivariate behaviour of an observation.
Figure 7 shows the values of the robust Mahalanobis distances (to the mean) for all 142 units; the horizontal line indicates the adopted cut-off, the 97.5% quantile of the Chi-square reference distribution. The figure results from the command
R> plot(Flights_Otl, cex.main=2, cex.lab=1.5, cex.axis=0.3)
We now consider a partition of the carriers into two groups: the mainline carriers and the regional carriers (the latter comprising 9E, EV, MQ, OO and YV).
A MANOVA analysis was performed for all flights units, considering this two-group decomposition. We compared both a Gaussian and a Skew-Normal model, with all four covariance configurations (default), in a homoscedastic setup (default), using the BIC (default) as comparison criterion. For that purpose we used the commands
R> out1<-Flights_Otl@outliers
R> carr <- substring(rownames(FlightsIdt[-out1,]),5,6)
R> carr_class <- factor(ifelse(carr=="9E"|carr=="EV"|carr=="MQ"|
carr=="OO"|carr=="YV","REG","MAIN"))
R> MANOVAres <- MANOVA(FlightsIdt[-out1,],carr_class,Model="NrmandSKN")
R> summary(MANOVAres)
leading to the output
Null Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-1437.000 -2094.997 -1743.016 -2521.166 -1397.125 -2060.841 -1735.816 -2444.528
Full Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-1318.227 -1874.335 -1566.672 -2184.943 -1259.869 -1821.942 -1536.026 -2085.836
Full Model Bayesian (Schwartz) Information Criteria:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
2880.879 3880.282 3302.561 4482.697 2801.767 3813.102 3278.874 4322.088
Selected Model:
[1] "SNModCovC1"
Chi-squared statistic: 274.5117
degrees of freedom: 8
p-value: 1.082712e-54
The Skew-Normal model with variance-covariance configuration C1 is selected (on the basis of the BIC values) as the best model in this case; the results indicate that the two carrier groups are indeed different with respect to the considered variables.
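The reported statistic and p-value can be recovered directly from the listed log-likelihoods of the selected model (a quick verification in plain R, not part of the package output):

R> 2*(1397.125 - 1259.869)                  # -2 log(lambda), up to rounding
[1] 274.512
R> pchisq(274.5117, df=8, lower.tail=FALSE)
[1] 1.082712e-54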
These results were to be expected, given that regional carriers tend to fly shorter distances, and therefore have shorter air times, than mainline carriers.
We then proceeded to investigate whether the two carrier groups are different when it comes to each of the delay variables, using the commands
R> MANOVA_Dep_delay_res <- MANOVA(FlightsIdt[-out1,"dep_delay"],carr_class,Model="NrmandSKN")
R> summary(MANOVA_Dep_delay_res)
R> MANOVA_Arr_delay_res <- MANOVA(FlightsIdt[-out1,"arr_delay"],carr_class,Model="NrmandSKN")
R> summary(MANOVA_Arr_delay_res)
that produced the outputs
Null Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-313.5688 NA NA -492.4602 -287.8342 NA NA -473.1451
Full Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-296.6978 NA NA -469.0213 -272.3174 NA NA -449.6624
Full Model Bayesian (Schwartz) Information Criteria:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
626.2990 NA NA 966.2455 586.9391 NA NA 936.9286
Selected Model:
[1] "SNModCovC1"
Chi-squared statistic: 31.03359
degrees of freedom: 2
p-value: 1.824489e-07
Null Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-366.8737 NA NA -473.4313 -360.3910 NA NA -457.5188
Full Model Log likelihoods:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
-357.0406 NA NA -456.5808 -351.1153 NA NA -441.4999
Full Model Bayesian (Schwartz) Information Criteria:
NModCovC1 NModCovC2 NModCovC3 NModCovC4 SNModCovC1 SNModCovC2 SNModCovC3 SNModCovC4
746.9845 NA NA 941.3644 744.5350 NA NA 920.6036
Selected Model:
[1] "SNModCovC1"
Chi-squared statistic: 18.55139
degrees of freedom: 2
p-value: 9.367357e-05
The results show that the two carrier groups present different patterns for both departure and arrival delays.
We consider again the two carrier groups defined in the previous subsection, and now investigate to what extent the interval-valued variables allow discriminating between them.
The function DACrossVal estimates error rates by k-fold cross-validation or by leave-one-out. Its main arguments are: (i) the data object, (ii) the grouping factor, (iii) a function with the training algorithm (e.g. lda, qda, snda; see Subsection Design of Section Package), (iv) the number of cross-validation folds (default: 10), (v) the number of replications (default: 20), (vi) a boolean flag indicating whether the folds should be stratified according to the original class proportions (default) or randomly generated from the whole training sample, ignoring class membership, and (vii) a boolean flag (false by default) stating if the leave-one-out method should be used instead of k-fold cross-validation.
Different discriminant methods were compared by leave-one-out cross-validation: Linear and Quadratic Discriminant Analysis for the Gaussian model, and both the Location and the General models of Skew-Normal discriminant analysis. In each case the variance-covariance configuration (see Table 2) was chosen by minimising the value of BIC. The global errors, which are also provided, are computed as a weighted average of the estimated class specific errors.
The code below computes and displays these estimates.
R> DACrossVal(FlightsIdt[-out1,],carr_class,TrainAlg=lda,loo=TRUE)
R> DACrossVal(FlightsIdt[-out1,],carr_class,TrainAlg=qda,loo=TRUE)
R> DACrossVal(FlightsIdt[-out1,],carr_class,TrainAlg=snda,loo=TRUE)
R> DACrossVal(FlightsIdt[-out1,],carr_class,TrainAlg=snda,Mxt="Gen",loo=TRUE)
We note that while the first two commands are executed quite fast (a few seconds), the last two (for the Skew-Normal model) typically need several hours, given that a computationally heavy Skew-Normal estimation is repeated many times.
These commands lead to the output below. Note that when snda is used without any additional arguments, the default location model is assumed.
Error rate estimates of algorithm lda
MAIN REG Global
0.013888889 0.000000000 0.009090909
Error rate estimates of algorithm qda
MAIN REG Global
0.00000000 0.05263158 0.01818182
Error rate estimates of algorithm snda
MAIN REG Global
0.08333333 0.07894737 0.08181818
Error rate estimates of algorithm snda with argument Mxt=Gen
MAIN REG Global
0.05555556 0.18421053 0.10000000
These results suggest that lda performs better than the other alternatives in the problem at hand. The predicted classes using the lda method may be obtained, as usual, by the commands
R> ldares <- lda(FlightsIdt[-out1,],carr_class)
R> ldapred <- predict(ldares,FlightsIdt[-out1,])
R> print(ldapred$class)
We observed that all but one unit are correctly classified, the exception being carrier FL in September, the only FL unit which was not flagged as an outlier.
Model-based Clustering described in Section Model-based Clustering was applied to the Flights dataset, without the identified outliers, to identify up to 16 components. This is accomplished by the command:
R> mclust_res <- Idtmclust(FlightsIdt[-out1,],1:16,Mxt="HomandHet")
where again, by default, the recommended solution is selected by the BIC. The corresponding values may be graphically compared using the command
R> plotInfCrt(mclust_res, cex.lab=1.5, outlegsize=10, outlegdisp=0.25)
which produces the graph in Figure 8 and the output below. A homoscedastic nine-component model with the unrestricted covariance configuration C1 has been selected.
Best BIC values:
HomG9C1 HomG10C1 HomG11C1 HomG15C1 HomG16C1
BIC 2555.407 2602.228 2653.101 2674.13 2692.822
BIC diff 0 46.82063 97.69343 118.7225 137.4153
The value returned by Idtmclust is an object of class IdtMclust, with the same structure, and similar methods, as the corresponding Mclust class of package mclust (Scrucca et al. 2016). In particular, the classification results may be inspected by the command:
R> summary(mclust_res,classification=TRUE)
which returns
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Homoscedastic C1 model with 9 components
log.likelihood NObs BIC
-1005.076 110 2555.407
Clustering table:
CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9
12 12 14 12 10 12 24 9 5
Classification:
Jan-9E Jan-AA Jan-B6 Jan-DL Jan-EV Jan-MQ Jan-UA Jan-US Jan-WN Feb-9E
"CP5" "CP2" "CP7" "CP4" "CP5" "CP6" "CP7" "CP3" "CP3" "CP1"
Feb-AA Feb-B6 Feb-DL Feb-EV Feb-MQ Feb-UA Feb-US Feb-WN Mar-9E Mar-AA
"CP2" "CP7" "CP4" "CP5" "CP6" "CP7" "CP3" "CP3" "CP1" "CP2"
Mar-B6 Mar-DL Mar-EV Mar-MQ Mar-UA Mar-US Mar-WN Apr-9E Apr-AA Apr-B6
"CP7" "CP4" "CP5" "CP6" "CP7" "CP3" "CP8" "CP1" "CP2" "CP7"
Apr-DL Apr-EV Apr-MQ Apr-UA Apr-US Apr-WN May-9E May-AA May-B6 May-DL
"CP4" "CP5" "CP6" "CP7" "CP3" "CP8" "CP5" "CP2" "CP7" "CP4"
May-EV May-MQ May-UA May-US May-WN May-YV Jun-AA Jun-B6 Jun-DL Jun-EV
"CP1" "CP6" "CP7" "CP3" "CP8" "CP9" "CP2" "CP7" "CP4" "CP5"
Jun-MQ Jun-UA Jun-US Jul-AA Jul-B6 Jul-DL Jul-EV Jul-MQ Jul-UA Jul-US
"CP6" "CP7" "CP3" "CP2" "CP7" "CP4" "CP5" "CP6" "CP7" "CP3"
Jul-WN Jul-YV Aug-9E Aug-AA Aug-B6 Aug-DL Aug-EV Aug-MQ Aug-UA Aug-US
"CP8" "CP9" "CP1" "CP2" "CP7" "CP4" "CP1" "CP6" "CP7" "CP3"
Aug-WN Sep-9E Sep-AA Sep-B6 Sep-DL Sep-EV Sep-FL Sep-MQ Sep-UA Sep-US
"CP8" "CP1" "CP2" "CP7" "CP4" "CP1" "CP9" "CP6" "CP7" "CP3"
Sep-WN Oct-9E Oct-AA Oct-B6 Oct-DL Oct-EV Oct-MQ Oct-UA Oct-US Oct-WN
"CP8" "CP1" "CP2" "CP7" "CP4" "CP1" "CP6" "CP7" "CP3" "CP8"
Oct-YV Nov-9E Nov-AA Nov-B6 Nov-DL Nov-EV Nov-MQ Nov-UA Nov-US Nov-WN
"CP9" "CP1" "CP2" "CP7" "CP4" "CP1" "CP6" "CP7" "CP3" "CP8"
Dec-9E Dec-AA Dec-B6 Dec-DL Dec-EV Dec-MQ Dec-UA Dec-US Dec-WN Dec-YV
"CP5" "CP2" "CP7" "CP4" "CP5" "CP6" "CP7" "CP3" "CP8" "CP9"
We observe that units corresponding to the same carrier tend to cluster together: for instance, component 2 gathers all AA units, component 4 the DL units, and component 6 the MQ units.
In order to have a better description of the obtained partition, we then print the corresponding mixing probabilities and the component-wise mean vectors.
R> print(pro(mclust_res), digits=3)
R> print(mean(mclust_res),digits=3)
obtaining
CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9
0.10684 0.10909 0.12727 0.10752 0.09316 0.10909 0.21976 0.08182 0.04545
CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9
dep_delay.MidP 40.26 31.02 21.06 27.39 59.83 35.15 34.85 43.51 44.85
arr_delay.MidP 25.84 18.28 15.44 16.45 51.08 31.89 25.00 32.63 34.63
air_time.MidP 99.13 223.40 163.22 208.73 101.86 98.97 193.37 171.20 72.08
distance.MidP 640.43 1604.00 1163.86 1481.65 632.53 616.50 1360.07 1165.50 411.80
dep_delay.LogR 4.59 4.35 4.07 4.19 4.92 4.48 4.41 4.55 4.68
arr_delay.LogR 4.76 4.68 4.42 4.59 5.05 4.71 4.69 4.80 4.80
air_time.LogR 4.83 5.47 5.55 5.56 4.86 4.63 5.71 4.91 3.97
distance.LogR 6.86 7.46 7.58 7.59 6.81 6.69 7.75 6.81 5.85
These mean vectors can be compared by a parallel coordinate plot, using the pcoordplot method as follows:
R> pcoordplot( mclust_res, cex.main=2, cex.lab=2,
legendpar=list(cex.main=2.5, cex.lab=2.5) )
which produces the graph shown in Figure 9.
We observe that component 5 is mainly characterised by the largest delays, both at departure and arrival, also displaying their highest variability. Component 3, on the other hand, presents the lowest delays, with the lowest variability, and concerns long flights. Component 9 corresponds to the shortest flights, also with low variability of distance and air time. Components 4 and 7 present similar patterns in the distance and air time variables, although component 4 displays slightly larger MidPoints while component 7 has a higher variability; in terms of delays, we observe in component 7 higher values together with greater variability.
From Figure 8 we observe that the best heteroscedastic model corresponds to configuration C2 and identifies six components.
The corresponding mean vectors may again be displayed by a parallel coordinate plot, using the pcoordplot method, now indicating the solution of interest:
R> pcoordplot(mclust_res, model="HetG6C2", cex.main=2, cex.lab=2,
legendpar=list(cex.main=2.5, cex.lab=2.5) )
leading to the graph shown in Figure 10.
This second example explores the diamonds dataset (from the R package tidyverse, available at CRAN). The original microdata consists of 53940 diamonds characterised by ten variables. The descriptive variables are: carat (weight of the diamond), x (length in mm), y (width in mm), and z (depth in mm). All rows with missing data or with null values in at least one of these variables were removed. Because the distributions of these variables are positively skewed, they were log-transformed (natural logarithm).
The units of analysis were defined by the variables cut (quality of the cut: Fair, Good, Very Good, Premium, Ideal), color (diamond colour, with seven levels: J (worst) to D (best)), and clarity (measurement of how clear the diamond is, with eight levels: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)), totalling 271 units (out of the 280 possible combinations, four were not present in the data, and five were degenerate). The variable DiamondsUnits defines these combinations.
The commands
R> library(tidyverse)
R> valid_diamonds <- diamonds %>%
     filter(carat != 0, x != 0, y != 0, z != 0) %>%
     drop_na() %>%
     mutate(logcarat = log(carat),
            logx = log(x),
            logy = log(y),
            logz = log(z))
R> DiamondsUnits <- factor( paste(
     valid_diamonds$cut, valid_diamonds$color, valid_diamonds$clarity, sep="-"
   ) )
R> DiamondsIdt <- AgrMcDt(valid_diamonds[,c("logcarat","logx","logy","logz")],
     agrby=DiamondsUnits)
do the initial data processing and create the interval data object DiamondsIdt using the default option (min-max). In the application we do not filter out potential outliers as we want to use the finite mixture model to detect them. Indeed, outliers can be seen as an unstructured component of the mixture model (Aitkin and Wilson 1980).
Estimating mixtures with one to eight components, we have
R> Diamd_mclust_res <- Idtmclust(DiamondsIdt,1:8,Mxt="HomandHet")
R> plotInfCrt(Diamd_mclust_res, cex.lab=1.5, outlegsize=10, outlegdisp=0.25)
Best BIC values:
HetG3C3 HetG4C3 HetG3C1 HetG5C3 HetG6C3
BIC -7818.702 -7789.946 -7752.533 -7702.916 -7637.323
BIC diff 0 28.75589 66.16948 115.7865 181.3797
Selection based on BIC recommends configuration 3 with three components (see Figure 11). Thus, the best solution is a heteroscedastic solution in which centers are not correlated with ranges.
R> summary(Diamd_mclust_res)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Heteroscedastic C3 model with 3 components
log.likelihood NObs BIC
4150.242 271 -7818.702
Clustering table:
CP1 CP2 CP3
11 176 84
R> print(pro(Diamd_mclust_res), digits=3)
CP1 CP2 CP3
0.0407 0.6328 0.3265
We conclude that component CP1 has estimated weight 0.0407 and contains eleven observations (hard classification), i.e., it is an outlier or niche group. Whenever a population/sample is well represented by the Normal distribution, a single component (or point of support) is enough to model its density. Often, however, distributions have skewness and kurtosis that are far from those of the multivariate Normal distribution. In that case, Gaussian mixture models (GMM) have been used to estimate densities as an alternative to nonparametric or semi-parametric kernels (Scott 2015). Indeed, this application of finite mixtures is more general than model-based clustering, as the latter tends to be specific to the correspondence between modes and clusters or groups. In this example, the departure from normality (skewness and kurtosis) is modelled using two additional components.
R> plot(MidPoints(DiamondsIdt)[,"logz.MidP"],
     LogRanges(DiamondsIdt)[,"logz.LogR"],
     xlab="Mid Points", ylab="Log Ranges",
     main="Mid Points vs. Log Ranges for logdepth in mm",
     col = ifelse(Diamd_mclust_res@classification == 'CP1', 1,
            ifelse(Diamd_mclust_res@classification == 'CP2', 2, 7)),
     pch = 19, cex.lab=1.5)
Figure 12, obtained from the code above, illustrates the approximation of the density of the data by the finite mixture. The core group is well described by the multivariate Normal distribution. As a result of skewness, a second group is added. Finally, the heavy (multidimensional) “tail” is captured by the outlier component (black dots), with its large estimated variances and covariances.
The MAINT.Data R package implements models and methods for the analysis of interval-valued data, relying on multivariate Normal or Skew-Normal distributions for the MidPoints and Log-Ranges of the interval-valued variables. Implemented in the S4 framework, it introduces a data class for representing interval data and functions and methods for parametric modelling and analysis.
The available tools for interval variable management include interval-data versions of most standard R methods, such as print and summary, indexing and subsetting, and plot. Moreover, functions for aggregating microdata into interval data objects are also provided. The multivariate methodologies available include maximum likelihood estimation and statistical tests for the different configurations, (M)ANOVA, parametric Discriminant Analysis, and Model-based Clustering. Moreover, outlier detection and estimation based on robust techniques are provided; discriminant parametric methods based on robust estimates are implemented accordingly.
MAINT.Data, currently in its 2.6.1 version, offers an integrated solution for the management and parametric analysis of interval-valued data, from aggregation to modelling, analysis and visualisation, extending the R “programming with data” paradigm to new and complex data types.
This work was financed by the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within projects UIDB/00731/2020, UIDB/50014/2020, UIDB/00315/2020 and PTDC/EEI-TEL/32454/2017.