Variants of Simple Correspondence Analysis

Abstract:

This paper presents the R package CAvariants. The package performs six variants of correspondence analysis on a two-way contingency table. The main function, which shares the same name as the package (CAvariants), allows the user to choose (via a series of input parameters) from six different correspondence analysis procedures. These include the classical approach to (symmetrical) correspondence analysis, singly ordered correspondence analysis, doubly ordered correspondence analysis, non symmetrical correspondence analysis, singly ordered non symmetrical correspondence analysis and doubly ordered non symmetrical correspondence analysis. The code provides the flexibility for constructing either a classical correspondence plot or a biplot graphical display. It also allows the user to consider other important features that help assess the reliability of the graphical representations, such as the inclusion of algebraically derived elliptical confidence regions. This paper provides R functions that elaborate more fully on previously published code.

Published: Oct. 21, 2016
Received: Feb 28, 2016
Citation: Lombardo & Beh, 2016
Volume: 8/2
Pages: 167 - 184


1 Introduction

Computational procedures for detecting the association between two or more categorical variables are important aspects of statistical theory and practice. In particular, correspondence analysis provides a quick and simple graphical summary of how categories and variables are associated with one another. The theoretical aspects of the analysis are well documented in the statistical and allied disciplines. Despite the need for programs and functions that allow users to perform correspondence analysis, software support for many of the varied approaches is generally limited. Commercially available statistical software, such as MATLAB, Minitab, SAS and SPSS, provides a means of carrying out correspondence analysis, although these procedures often offer only the most basic features as part of their output. Generally, nothing beyond the calculation of principal inertia values, profile coordinates, contributions to inertia and a two-dimensional correspondence plot is provided. Other popular statistical languages, such as R, provide some packages for performing simple and multiple correspondence analysis of the classical (symmetrical) type. Nevertheless, at present, no popular statistical packages provide functions to perform ordered variants of symmetrical and non symmetrical correspondence analysis.

2 Overview of correspondence analysis in R

Since the mid-2000s, the programming environment of R has proven to be extremely popular in all areas of theoretical and applied statistics. This is due in part to the free availability of the program from the Comprehensive R Archive Network (CRAN; http://CRAN.R-project.org/), the versatility of the coding environment and the ever increasing number of packages that are now available on CRAN.

Various R packages have received a great deal of attention for their contribution to the computing of correspondence analysis (CA). One of the first is the MASS package. It provides the user with a means of performing simple and multiple correspondence analysis with the option of including supplementary points on a display. More recently, the ca package has provided functions for performing simple, multiple and joint correspondence analysis using two and three dimensions for the graphical displays. Supplementary points have also been incorporated into other published R code, while the anacor package allows the user to perform classical and canonical correspondence analysis with missing values. Further, one may refer to the CA or MCA functions in the FactoMineR package. For lexical tables, the CaGalt function incorporated into the FactoMineR package may be used. Another recent package, cabootcrs, checks the reliability of association by superimposing bootstrap confidence regions onto a plot. The CAinterprTools package makes use of graphical features to enrich a visual interpretation of CA results. Alternatively, the homals package performs Gifi's approach to correspondence analysis, and the dualScale package performs dual scaling (i.e., multiple correspondence analysis) of multiple choice data. Good overviews of correspondence analysis using R with an archaeological focus are also available. Another R package that can be downloaded freely from CRAN is ExPosition. It is written by Hervé Abdi and his team and performs a variety of different multivariate data analysis techniques, including correspondence analysis and multiple correspondence analysis. Abdi's group has also been responsible for other variations of correspondence analysis, including multi-block discriminant correspondence analysis and discriminant correspondence analysis.

Furthermore, another suite of R functions that enables the user to perform a variety of correspondence analysis techniques is vegan, which was developed primarily for vegetation ecologists. It includes functions that provide the user with a large array of techniques to choose from, including classical correspondence analysis, canonical correspondence analysis and detrended correspondence analysis. One may also consider the ade4 package, which also includes non symmetrical correspondence analysis, to analyze ecological and environmental data in the framework of numerous Euclidean exploratory methods. Further, the cncaGUI package allows canonical correspondence analysis and canonical non symmetrical correspondence analysis, providing inferential results by using bootstrap methods. The PTAk package includes functions for multiway data decomposition and, in particular, allows simple correspondence analysis and a generalization of correspondence analysis for k-way tables. Last, but certainly not least, other published R code for performing simple and multiple correspondence analysis may also be considered.

An overview of the broad areas of correspondence analysis that these packages cover is given in Table 1. While non symmetrical correspondence analysis for nominal variables is included in some of the R packages on CRAN that perform correspondence analysis, the remaining ordered variants have not yet been made available in any R package. However, fragments of R code for some of these CA variants have been published previously. Therefore, this paper provides a comprehensive description of R code that extends, beyond the classical approaches, the types of correspondence analysis that one may use. The advantage of these variants is that they enable the user to incorporate categorical predictor/response associations and the ordinal structure of a variable. For ordered variables we can easily identify any linear and non-linear sources of association that may exist in the data. The ordered variants also provide a visualization of non-linear trends of association; the classical approaches to correspondence analysis do not encompass these features.

The theoretical aspects underlying all six variants of correspondence analysis considered in this paper are documented in the literature. Here, however, we provide the reader with a brief overview of the theoretical aspects of these analyses. We also describe how the algebraic confidence ellipses for polynomial biplots can be derived; this aspect of the analysis has not been described elsewhere.

Table 1: R packages and some CA variants. CA: simple CA; NSCA: non symmetrical CA; MCA: multiple CA; JCA: joint CA; SOCA: singly ordered CA; DOCA: doubly ordered CA; SONSCA: singly ordered NSCA; DONSCA: doubly ordered NSCA; CCA: canonical CA; CNSCA: canonical NSCA; DCA: discriminant CA
package CA NSCA MCA JCA SOCA DOCA SONSCA DONSCA CCA CNSCA DCA
ade4 x x x x x
anacor x x x
ca x x x
cabootcrs x
CAinterprTools x x
CAvariants x x x x x x
cncaGUI x x
dualScale x x
ExPosition x x x
FactoMineR x x
homals x x
MASS x x
PTAk x
vegan x x x

3 Some theory

Symmetrical and non symmetrical correspondence analysis

Consider a two-way contingency table $\mathbf{N}$ of dimension $I \times J$ that cross-classifies two variables consisting of $I$ row categories and $J$ column categories, respectively. Denote the matrix of the joint relative frequencies by $\mathbf{P} = (p_{ij})$ so that $\sum_{i=1}^{I}\sum_{j=1}^{J} p_{ij} = 1$. Let $p_{i\bullet} = \sum_{j=1}^{J} p_{ij}$ and $p_{\bullet j} = \sum_{i=1}^{I} p_{ij}$ be the $i$th marginal row proportion and the $j$th marginal column proportion, the general elements of the diagonal matrices $\mathbf{D}_I$ and $\mathbf{D}_J$, respectively.

There are many ways in which correspondence analysis can be performed, and excellent overviews of some of them are available. Here, we present the chi-squared statistic expressed in terms of the weighted sum-of-squares of the centered column profiles, since this alternative expression of $X^2$ is useful when comparing symmetrical correspondence analysis with its non symmetrical variant. Therefore, consider the chi-squared statistic of $\mathbf{N}$, which is defined as

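$$X^{2} = n\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{p_{\bullet j}}{p_{i\bullet}}\left(\frac{p_{ij}}{p_{\bullet j}} - p_{i\bullet}\right)^{2} = n\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{p_{\bullet j}}{p_{i\bullet}}\,\pi_{ij}^{2}, \tag{1}$$

where $\boldsymbol{\Pi} = (\pi_{ij})$, with $\pi_{ij} = p_{ij}/p_{\bullet j} - p_{i\bullet}$, is the $I \times J$ matrix of centered column profiles. In this case, the weight matrices in $\mathbb{R}^{I}$ and $\mathbb{R}^{J}$ are defined by the elements of the matrices $\mathbf{D}_I^{-1}$ and $\mathbf{D}_J$, respectively.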
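As a minimal sketch of equation (1), the following base R code computes the centered column profiles and $X^2$ for a small made-up table and compares the result with the statistic returned by `chisq.test`. The table `N` and all variable names are illustrative assumptions; none of them are objects from CAvariants.

```r
# Toy 3 x 4 contingency table (illustrative counts, not from the paper)
N <- matrix(c(20, 10,  5,  8,
              10, 15, 12,  6,
               4,  9, 18, 14),
            nrow = 3, byrow = TRUE)
n      <- sum(N)
P      <- N / n              # joint relative frequencies p_ij
pi_row <- rowSums(P)         # row marginal proportions
pj_col <- colSums(P)         # column marginal proportions

# Centered column profiles: p_ij / p_.j - p_i.
# (subtracting pi_row recycles down the columns, i.e. row by row)
Pi <- sweep(P, 2, pj_col, "/") - pi_row

# X^2 = n * sum_ij (p_.j / p_i.) * pi_ij^2
X2 <- n * sum(outer(1 / pi_row, pj_col) * Pi^2)

# Reference value from base R's Pearson chi-squared test
X2_ref <- unname(chisq.test(N, correct = FALSE)$statistic)
```

The two values `X2` and `X2_ref` should agree up to floating point error, which is a quick way to confirm that the profile-based expression is only a rewriting of the usual Pearson statistic.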

Suppose we now treat the column variable as a predictor variable and the row variable as its response variable. When such an asymmetric association structure exists between the two categorical variables, one may consider non symmetrical correspondence analysis. To quantify this asymmetric association, consider the Goodman-Kruskal (1954) tau index

$$\tau = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} p_{\bullet j}\left(\frac{p_{ij}}{p_{\bullet j}} - p_{i\bullet}\right)^{2}}{1 - \sum_{i=1}^{I} p_{i\bullet}^{2}} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} p_{\bullet j}\,\pi_{ij}^{2}}{1 - \sum_{i=1}^{I} p_{i\bullet}^{2}} = \frac{\tau_{num}}{\tau_{den}}.$$

For this asymmetric case, the weight matrices are $\mathbf{I}$ (an $I \times I$ identity matrix) and $\mathbf{D}_J$, respectively. Notice that the denominator can be treated as a constant term since it does not depend on the predictor variable. For this reason it can be neglected without losing any information about the structure of the association. Therefore $\tau_{num}$ is the measure of association considered in non symmetrical correspondence analysis.
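The tau index can be sketched in a few lines of base R. The toy table and variable names below are assumptions made for illustration; the last line uses a standard algebraically equivalent form of the numerator as a cross-check.

```r
# Toy 3 x 4 contingency table (illustrative counts, not from the paper);
# rows play the role of the response, columns the predictor.
N <- matrix(c(20, 10,  5,  8,
              10, 15, 12,  6,
               4,  9, 18, 14),
            nrow = 3, byrow = TRUE)
P      <- N / sum(N)
pi_row <- rowSums(P)
pj_col <- colSums(P)
Pi     <- sweep(P, 2, pj_col, "/") - pi_row   # centered column profiles

tau_num <- sum(sweep(Pi^2, 2, pj_col, "*"))   # sum_ij p_.j * pi_ij^2
tau_den <- 1 - sum(pi_row^2)                  # depends on the row margins only
tau     <- tau_num / tau_den

# Equivalent form of the numerator: sum_ij p_ij^2 / p_.j - sum_i p_i.^2
tau_num_alt <- sum(sweep(P^2, 2, pj_col, "/")) - sum(pi_row^2)
```

Because `tau_den` involves only the row (response) margins, comparing tables with the same response distribution through `tau_num` alone loses nothing, which is the point made in the text.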

In order to graphically depict the association or the prediction of the rows given the columns in a low dimensional space, we may consider the generalized singular value decomposition of the centered column profile matrix $\boldsymbol{\Pi}$ using suitable weight matrices.

Suppose we consider a general framework for the symmetrical and non symmetrical variants of CA that uses generic weight matrices, $\mathbf{V}_I$ and $\mathbf{W}_J$, in $\mathbb{R}^{I}$ and $\mathbb{R}^{J}$. This general framework may be defined by considering the weighted centered column profile matrix

$$\tilde{\boldsymbol{\Pi}} = \mathbf{V}^{1/2}\,\boldsymbol{\Pi}\,\mathbf{W}^{1/2}.$$

Therefore, symmetrical (or classical) correspondence analysis may be performed by considering $\mathbf{V} = \mathbf{D}_I^{-1}$ and $\mathbf{W} = \mathbf{D}_J$, while non symmetrical correspondence analysis is defined when $\mathbf{V} = \mathbf{I}$ and $\mathbf{W} = \mathbf{D}_J$. Doing so leads to the generalized singular value decomposition (GSVD)

$$\mathrm{GSVD}(\tilde{\boldsymbol{\Pi}}) = \mathbf{A}\boldsymbol{\Lambda}\mathbf{B}^{T},$$

where the left and right singular vectors are $\mathbf{A} = (a_{im})$ and $\mathbf{B} = (b_{jm})$, respectively. These quantities have the orthonormality properties with metrics $\mathbf{D}_I^{-1}$ or $\mathbf{I}$ (in $\mathbb{R}^{I}$, depending on the symmetric or asymmetric relationship between the rows and columns) and $\mathbf{D}_J$ (in $\mathbb{R}^{J}$), respectively. As usual, the elements of the diagonal matrix of singular values, $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_m)$, are arranged in descending order.
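One common way to obtain such a decomposition in practice is to apply the ordinary SVD to the weighted matrix. The sketch below does this with base R's `svd` for both choices of row weights; the helper `weighted_profiles` and the toy table are hypothetical names introduced here, and this is not necessarily how CAvariants computes its decomposition internally.

```r
# Toy 3 x 4 contingency table (illustrative counts, not from the paper)
N <- matrix(c(20, 10,  5,  8,
              10, 15, 12,  6,
               4,  9, 18, 14),
            nrow = 3, byrow = TRUE)
n      <- sum(N)
P      <- N / n
pi_row <- rowSums(P)
pj_col <- colSums(P)
Pi     <- sweep(P, 2, pj_col, "/") - pi_row   # centered column profiles

# diag(v)^{1/2} %*% Pi %*% diag(w)^{1/2}, without building diagonal matrices
weighted_profiles <- function(Pi, v, w) {
  sqrt(v) * sweep(Pi, 2, sqrt(w), "*")
}

# Symmetrical CA: V = D_I^{-1}, W = D_J
svd_ca   <- svd(weighted_profiles(Pi, 1 / pi_row, pj_col))
# Non symmetrical CA: V = identity, W = D_J
svd_nsca <- svd(weighted_profiles(Pi, rep(1, nrow(Pi)), pj_col))

inertia_ca   <- sum(svd_ca$d^2)     # should equal X^2 / n
inertia_nsca <- sum(svd_nsca$d^2)   # should equal the tau numerator
```

Summing the squared singular values recovers $X^2/n$ under the symmetrical weights and $\tau_{num}$ under the asymmetrical weights, so the same code path serves both variants and only the row weights change.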

Ordered symmetrical and non symmetrical correspondence analysis

When both variables are ordered, we adapt the SVD by using basis vectors for the row and column spaces, performing the bivariate moment decomposition (BMD) on the matrix $\tilde{\boldsymbol{\Pi}}$. The BMD of $\tilde{\boldsymbol{\Pi}}$ is expressed as

$$\mathrm{BMD}(\tilde{\boldsymbol{\Pi}}) = \mathbf{A}\mathbf{Z}\mathbf{B}^{T}, \tag{2}$$

where $\mathbf{A}$ and $\mathbf{B}$ are the row and column polynomial matrices, respectively, and $\mathbf{Z}$ is the matrix of the generalized correlations. The construction of the polynomials $\mathbf{A}$ and $\mathbf{B}$ requires the specification of a priori scores, $s_X(i)$ and $s_Y(j)$ (defined by mi and mj in CAvariants, respectively), to reflect the ordinal structure of the row and column variables. These polynomials are orthonormal with respect to the weight matrices. As in the analysis of nominal variables, when a symmetrical association between the variables is considered, the weights in $\mathbb{R}^{I}$ and $\mathbb{R}^{J}$ are $\mathbf{D}_I^{-1}$ and $\mathbf{D}_J$, respectively. When an asymmetric association is considered, the weights are given by $\mathbf{I}$ and $\mathbf{D}_J$, respectively.
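To illustrate what "polynomials orthonormal with respect to the weights" means, the sketch below builds them by a weighted (modified) Gram-Schmidt process on the powers of the scores. This is a swapped-in textbook technique chosen for brevity and may differ from the recurrence used inside CAvariants; the function name `orth_poly` is hypothetical. The weights are the rounded diagonal of Jmass printed in Section 6, with natural scores 1 to 9 assumed.

```r
# Polynomials in `scores`, orthonormal under the weights `w`:
# sum_j w_j * b_u(j) * b_v(j) = 1 if u == v, and 0 otherwise.
orth_poly <- function(scores, w, degree = length(scores) - 1) {
  w <- w / sum(w)
  B <- matrix(0, nrow = length(scores), ncol = degree + 1)
  for (v in 0:degree) {
    b <- scores^v
    # Modified Gram-Schmidt: remove components along lower-degree polynomials
    for (k in seq_len(v)) {
      b <- b - sum(w * b * B[, k]) * B[, k]
    }
    B[, v + 1] <- b / sqrt(sum(w * b^2))
  }
  B   # column 1 is the trivial (constant) polynomial
}

# Rounded column masses from the Jmass output, natural scores 1..9 (assumed)
w <- c(0.137, 0.27, 0.171, 0.0808, 0.166, 0.0665, 0.0394, 0.0448, 0.0255)
w <- w / sum(w)
B <- orth_poly(scores = 1:9, w = w, degree = 3)

gram   <- crossprod(B * sqrt(w))   # should be close to the identity matrix
linear <- round(B[, 2], 3)         # linear polynomial (location axis)
```

Under these assumptions the linear polynomial lands close to the Axis 1 column standard polynomial coordinates reported in Section 6 (about -1.23 for the youngest age group and about 2.55 for the oldest), with small differences attributable to the rounding of the printed masses.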

When only one of the two variables consists of ordered categories, rather than considering the BMD or the GSVD of $\tilde{\boldsymbol{\Pi}}$, one may consider instead its hybrid decomposition (HD). This method of decomposition consists of singular vectors for the nominal variable and orthogonal polynomials for the ordered variable. Consider the case, as does the package CAvariants, where the column variable consists of ordered categories and the row variable consists of nominal categories. Then the HD of $\tilde{\boldsymbol{\Pi}}$ takes the form

$$\mathrm{HD}(\tilde{\boldsymbol{\Pi}}) = \mathbf{A}\mathbf{Z}\mathbf{B}^{T}, \tag{3}$$

where $\mathbf{A}$ is the column matrix of singular vectors for the nominal row categories and $\mathbf{B}$ is the column matrix of orthogonal polynomials for the ordered column categories. The generic elements of $\mathbf{Z}$, $z_{mv}$, are the hybrid generalized correlations.

Generalized correlations in ordered CA variants

The generalized correlation matrix, $\mathbf{Z}$, in the BMD of $\tilde{\boldsymbol{\Pi}}$ reflects the various sources of association between the variables and is derived using orthogonal polynomials. For example, when the row and column scores are defined as consecutive integers such that $s_X(i) = i$ for $i = 1, \ldots, I$ and $s_Y(j) = j$ for $j = 1, \ldots, J$, then $z_{11}$ is Pearson's product moment correlation of $\mathbf{N}$. A simple generalization of this correlation is $z_{12}$, which is a measure of the correlation between any change in the location of the row categories and the dispersion of the column categories. For this reason, $z_{12}$ is a generalized correlation describing the linear-by-quadratic association between the row and column categories.

For ordered CA variants, the total inertia is $\mathrm{Inertia}(\tilde{\boldsymbol{\Pi}}) = \sum_{u=1}^{I-1}\sum_{v=1}^{J-1} z_{uv}^{2}$, which can also be written in matrix form as

$$\mathrm{Inertia}(\tilde{\boldsymbol{\Pi}}) = \mathrm{trace}(\mathbf{Z}^{T}\mathbf{Z}) = \mathrm{trace}(\mathbf{Z}\mathbf{Z}^{T}) = \mathrm{trace}(\boldsymbol{\Lambda}^{2}). \tag{4}$$

From the matrix of generalized correlations $\mathbf{Z}$, we can obtain the inertia of each polynomial axis by considering the sum-of-squares of $z_{uv}$ over either $u$ or $v$. Using BMD or HD, the symmetric and asymmetric measures of association ($X^2$ and $\tau$) can be partitioned into polynomial components that reflect various sources of variation for each of the categories. The inertias of the polynomial components will henceforth be referred to as sources of inertia and are akin to the principal inertia values in (symmetrical or non symmetrical) correspondence analysis.
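The two properties just described (the first generalized correlation being Pearson's correlation under integer scores, and the squared generalized correlations summing to the total inertia) can be checked numerically for the doubly ordered symmetrical case. The sketch below assumes polynomials that are orthonormal with respect to the row and column masses and then rescaled to Euclidean-orthonormal vectors; this convention, the helper `orth_poly` and the toy table are assumptions of the example, not a description of the package internals.

```r
# Non-trivial polynomials in `scores`, orthonormal under the masses `w`
# (weighted modified Gram-Schmidt; the constant polynomial is dropped).
orth_poly <- function(scores, w) {
  w <- w / sum(w)
  d <- length(scores) - 1
  B <- matrix(0, nrow = length(scores), ncol = d + 1)
  for (v in 0:d) {
    b <- scores^v
    for (k in seq_len(v)) b <- b - sum(w * b * B[, k]) * B[, k]
    B[, v + 1] <- b / sqrt(sum(w * b^2))
  }
  B[, -1, drop = FALSE]
}

# Toy 3 x 4 contingency table with ordered rows and columns (illustrative)
N <- matrix(c(20, 10,  5,  8,
              10, 15, 12,  6,
               4,  9, 18, 14),
            nrow = 3, byrow = TRUE)
n  <- sum(N)
P  <- N / n
rm <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses

# Weighted centered profiles for symmetrical CA:
# (p_ij - p_i. p_.j) / sqrt(p_i. p_.j)
Pi_tilde <- (P - outer(rm, cm)) / sqrt(outer(rm, cm))

A <- orth_poly(1:nrow(N), rm)   # I x (I - 1) row polynomials
B <- orth_poly(1:ncol(N), cm)   # J x (J - 1) column polynomials

# Generalized correlations: project Pi_tilde on the rescaled polynomials
Z <- crossprod(sqrt(rm) * A, Pi_tilde %*% (sqrt(cm) * B))

total_inertia <- sum(Z^2)            # trace(Z^T Z)
row_axis_inertia <- rowSums(Z^2)     # inertia of each row polynomial axis
col_axis_inertia <- colSums(Z^2)     # inertia of each column polynomial axis

# Pearson correlation of the integer row and column scores
x <- rep(as.vector(row(N)), times = as.vector(N))
y <- rep(as.vector(col(N)), times = as.vector(N))
pearson_r <- cor(x, y)
```

Here `Z[1, 1]` should reproduce `pearson_r`, and `total_inertia` should equal $X^2/n$, while `row_axis_inertia` and `col_axis_inertia` generally differ even though both sum to the same total.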

A formal statistical test of the $X^2$ or $\tau$ index can be made. To test the statistical significance of the total inertia in the symmetrical and non symmetrical case, we can compare the chi-squared statistic, or the C-statistic, $C = \tau(n-1)(I-1)$, with the $\chi^2$ distribution with $(I-1)(J-1)$ degrees of freedom.
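As a worked example of this test, the sketch below evaluates the C-statistic for the shoplifting table analyzed in Section 6, using the rounded value $\tau = 0.0414$ quoted there together with $n = 20819$, $I = 13$ and $J = 9$. The function name `c_statistic` is hypothetical.

```r
# C-statistic for the Goodman-Kruskal tau index and its chi-squared p-value
c_statistic <- function(tau, n, I, J) {
  C  <- tau * (n - 1) * (I - 1)
  df <- (I - 1) * (J - 1)
  list(C = C, df = df, p.value = pchisq(C, df = df, lower.tail = FALSE))
}

# Rounded tau from the text, so C only approximates the reported 10331.5
res_c <- c_statistic(tau = 0.0414, n = 20819, I = 13, J = 9)
```

Because the tau value entered here is rounded to three significant digits, `res_c$C` differs slightly from the C-statistic printed by the package, but the degrees of freedom (96) and the conclusion of the test are unaffected.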

Unequal inertias of the row and column polynomials.

When considering the BMD of $\tilde{\boldsymbol{\Pi}}$, the total inertia of the row and column spaces ($\mathbb{R}^{I}$ and $\mathbb{R}^{J}$, respectively) will be identical. However, the inertia associated with each of the row and column polynomials will often be different. For the row categories, there are $I-1$ row inertia values, one for each of the axes, where the inertia of the $u$th polynomial axis is $z_{u\bullet}^{2} = \sum_{v=1}^{J-1} z_{uv}^{2}$. Similarly, there are $J-1$ column inertia values, one for each of the axes, where the inertia of the $v$th axis is $z_{\bullet v}^{2} = \sum_{u=1}^{I-1} z_{uv}^{2}$. For this reason, we recommend constructing polynomial biplots for the ordered variants of correspondence analysis instead of the traditional correspondence plots constructed using principal coordinates.

For the HD of $\tilde{\boldsymbol{\Pi}}$, the interpretation and properties of the $\mathbf{Z}$ matrix are a mixture (or hybrid) of $\boldsymbol{\Lambda}$ from the GSVD and $\mathbf{Z}$ from the BMD. When considering the space $\mathbb{R}^{J}$, calculating $\sum_{m=1}^{M} z_{m1}^{2} = z_{\bullet 1}^{2}$ gives the location component of the ordered columns and represents the principal inertia for this variable along the first polynomial axis. Similarly in $\mathbb{R}^{I}$, computing $\sum_{v=1}^{J-1} z_{1v}^{2} = z_{1\bullet}^{2} = \lambda_{1}^{2}$ yields the principal inertia of the first principal axis for the nominal row variable. Like BMD, HD yields different sets of inertia values for each axis in the $\mathbb{R}^{I}$ and $\mathbb{R}^{J}$ spaces.

4 Polynomial biplots and elliptical confidence regions

When constructing a polynomial biplot, the ordered row and column categories can be displayed in a single plot since the row and column coordinates are computed with respect to the same set of polynomial axes. For example, in a polynomial row metric preserving (or row isometric) biplot, the column standard polynomial coordinates are $\mathbf{G} = \mathbf{B}$ (with $g_{jv} = \beta_{jv}$), while the principal polynomial coordinates for the row categories are

$$\mathbf{F} = \mathbf{A}\mathbf{Z} = \tilde{\boldsymbol{\Pi}}\mathbf{W}_J\mathbf{B} \qquad \left(f_{iv} = \sum_{u}\alpha_{iu}z_{uv} = \sum_{j=1}^{J} w_j\,\tilde{\pi}_{ij}\,\beta_{jv}\right). \tag{5}$$

In practice, the coordinates for both the row and column categories are computed using the same orthonormal polynomial axes, i.e., the column polynomials.

The plot method for objects returned by CAvariants provides the user with the option of constructing parametric (or algebraic) elliptical confidence regions for all six CA variants, not only for the nominal CA variants as originally proposed. We compute the semi-major and semi-minor axis lengths of the elliptical region for the row and column categories. Here, we provide the ellipse axis lengths for the ordered symmetrical variants of correspondence analysis. For example, the semi-major axis length of the confidence ellipse for the $i$th row category is

$$x_{i(\alpha)} = \sqrt{z_{11}^{2}\,\frac{\chi_{\alpha}^{2}}{n \times \mathrm{trace}(\mathbf{Z}^{T}\mathbf{Z})}\left(\frac{1}{p_{i\bullet}} - \sum_{m=3}^{I-1} a_{im}^{2}\right)}, \tag{6}$$

while the semi-minor axis length for this row is

$$y_{i(\alpha)} = \sqrt{z_{22}^{2}\,\frac{\chi_{\alpha}^{2}}{n \times \mathrm{trace}(\mathbf{Z}^{T}\mathbf{Z})}\left(\frac{1}{p_{i\bullet}} - \sum_{m=3}^{I-1} a_{im}^{2}\right)}. \tag{7}$$

Similar semi-axis lengths can also be derived for the column categories and for the non symmetrical CA variants. Furthermore, note that ellipsoids can be constructed for three- or higher-dimensional correspondence plots by setting the input parameter M > 2 in the plot method.
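The semi-axis lengths can be sketched as a small R function. The sketch assumes one particular reading of equations (6) and (7): each semi-axis is the square root of the product of the squared generalized correlation ($z_{11}^2$ or $z_{22}^2$), the ratio of the chi-squared quantile to $n \times \mathrm{trace}(\mathbf{Z}^T\mathbf{Z})$, and the term $1/p_{i\bullet} - \sum_{m \geq 3} a_{im}^2$. The function name, its arguments and the degrees of freedom passed to `qchisq` are all assumptions of this example; the numbers in the call are made up.

```r
# Semi-axis lengths of the confidence ellipse for one row category,
# under the reading of equations (6)-(7) stated above (an assumption).
#   z11, z22      : generalized correlations scaling the two axes
#   total_inertia : trace(Z^T Z)
#   p_i           : marginal proportion of the row category
#   a_higher      : elements a_im for dimensions m >= 3
#   df            : degrees of freedom of the chi-squared quantile (assumed input)
ellipse_axes <- function(z11, z22, n, total_inertia, p_i, a_higher,
                         df, alpha = 0.05) {
  chi2  <- qchisq(1 - alpha, df = df)
  scale <- chi2 / (n * total_inertia) * (1 / p_i - sum(a_higher^2))
  stopifnot(scale >= 0)
  c(semi_major = sqrt(z11^2 * scale), semi_minor = sqrt(z22^2 * scale))
}

# Made-up inputs, for illustration only
ax <- ellipse_axes(z11 = 0.3, z22 = 0.15, n = 500, total_inertia = 0.12,
                   p_i = 0.25, a_higher = c(0.5, -0.2), df = 6)
```

Under this reading the ratio of the two semi-axes is the same for every category, since the category-specific term is shared by both axes; only the overall size of the ellipse changes from one category to the next.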

Unlike confidence circles and the more computationally intensive bootstrap techniques proposed in the literature, constructing confidence ellipses in this manner takes into consideration the contribution of the $i$th row principal polynomial coordinate in dimensions higher than the second. In fact, since all dimensions are reflected in the semi-major and semi-minor axis lengths, all of the contribution that a point makes to the symmetrical or asymmetrical association can be accounted for in a two-dimensional plot using equations (6) and (7). The p-value of each category point can be computed by following a theoretical development similar to that used for the correspondence analysis of a contingency table with nominal variables.

5 An overview of the CAvariants package

The primary function discussed in this paper is CAvariants. It allows the user to select which analysis to perform from a suite of six correspondence analysis techniques. These include symmetrical (or classical) correspondence analysis, non symmetrical correspondence analysis and their ordered variants.

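The six variations of simple correspondence analysis included in the package CAvariants are: classical (symmetrical) correspondence analysis (catype = "CA"); non symmetrical correspondence analysis (catype = "NSCA"); singly ordered correspondence analysis (catype = "SOCA"); doubly ordered correspondence analysis (catype = "DOCA"); singly ordered non symmetrical correspondence analysis (catype = "SONSCA"); and doubly ordered non symmetrical correspondence analysis (catype = "DONSCA").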

The input parameters of the function CAvariants are:

To visually portray and assess the statistical significance of the categories to the association between the variables of a contingency table, the plot method can be called by the user. As well as displaying the classical correspondence plot or biplot, this function allows one to superimpose onto the plot algebraically derived elliptical confidence regions for each of the principal coordinates for all CA variants. The plot method for “CAvariants” objects offers other features that the user may utilize. Some of these are applicable to all of the analyses and some are applicable to only a few. The input parameters of the plot method for “CAvariants” objects are:

The print method for “CAvariants” objects is included in the package CAvariants and displays the main results of the analysis specified by the user. The results displayed depend on the type of analysis being performed. The principal inertia values, total inertia and p-values are included as part of its output when catype = "CA", catype = "SOCA" or catype = "DOCA" and are based on Pearson’s chi-squared statistic. The Goodman-Kruskal tau index is the association measure of interest when catype = "NSCA", catype = "SONSCA" or catype = "DONSCA". When an ordered analysis is specified (that is, when catype = "DOCA", catype = "SOCA", catype = "SONSCA" or catype = "DONSCA"), a table describing the significant polynomial components of inertia will also be reported.

The input parameters of the print method for “CAvariants” objects are:

In general, this function produces the following output:

Furthermore, the package CAvariants contains a summary method for the objects returned by CAvariants. This method lists the names of the output objects and displays a selection of the main output objects described for the print method.

Numerical outputs

As an example of the complete set of numerical results obtained from performing a particular variant of correspondence analysis, consider the case where a singly ordered non symmetrical correspondence analysis is performed on the data table shopdataM available in the package CAvariants. This object is the contingency table being analyzed and is described more fully in Section 6. The output of the main function is stored in an object called res, which is obtained by running the CAvariants function on shopdataM using

R> res <- CAvariants(shopdataM, catype = "SONSCA")

The results are available in the following entries which can be obtained using

R> names(res)

which gives

 [1] "Xtable"      "rows"        "cols"        "r"           "rowlabels"  
 [6] "collabels"   "Rprinccoord" "Cprinccoord" "Rstdcoord"   "Cstdcoord"  
[11] "tauden"      "tau"         "inertiasum"  "inertias"    "inertias2"  
[16] "comps"       "catype"      "mj"          "mi"          "pcc"        
[21] "Jmass"       "Imass"       "Trend"       "Z"           "ellcomp"    
[26] "risell"      "Mell"       

These results may be printed to the screen by using

R> print(res)

while a summary of each of these numerical features is produced by using

R> summary(res)

6 Application

To demonstrate the application of a variant of simple correspondence analysis described in the CAvariants package, we present the following example. We shall confine our attention to the non symmetrical correspondence analysis of a singly ordered contingency table. The contingency table that we are examining is concerned with shoplifting in The Netherlands and summarizes, in part, the results of a survey of the Dutch Central Bureau of Statistics. The data consist of a sample of 20819 men who were suspected of shoplifting in Dutch stores between 1977 and 1978. The predictor variable consists of the age groups of the perpetrators (less than 12yrs, 12 to 14yrs, 15 to 17yrs, 18 to 20yrs, 21 to 29yrs, 30 to 39yrs, 40 to 49yrs, 50 to 64yrs, 65yrs and over) while the response variable of the table consists of the items stolen. These items are clothing, clothing accessories, tobacco and/or provisions, stationery (labelled stationary in the output), books, records, household goods, candy, toys, jewelry, perfume, hobby and/or tools and other items.

After choosing the suitable variant of correspondence analysis, we create the object res that consists of the complete features of the analysis by running the command

R> res <- CAvariants(shopdataM, catype = "SONSCA")

print(res) will return as part of its output the following numerical features:

    RESULTS for SONSCA Correspondence Analysis

    Data Table:
            M12<  M13 M16 M19 M25 M35 M45 M57 M65+
clothing     81  138 304 384 942 359 178 137  45
accessories  66  204 193 149 297 109  53  68  28
tobacco     150  340 229 151 313 136 121 171 145
stationary  667 1409 527  84  92  36  36  37  17
books        67  259 258 146 251  96  48  56  41
records      24  272 368 141 167  67  29  27   7
household    47  117  98  61 193  75  50  55  29
candy       430  637 246  40  30  11   5  17  28
toys        743  684 116  13  16  16   6   3   8
jewelry     132  408 298  71 130  31  14  11  10
perfumes     32   57  61  52 111  54  41  50  28
hobby       197  547 402 138 280 200 152 211 111
other       209  550 454 252 624 195  88  90  34

    Row Weights: Imass
            clothing accessories tobacco stationary books records household
clothing           1           0       0          0     0       0         0
accessories        0           1       0          0     0       0         0
tobacco            0           0       1          0     0       0         0
stationary         0           0       0          1     0       0         0
books              0           0       0          0     1       0         0
records            0           0       0          0     0       1         0
household          0           0       0          0     0       0         1
candy              0           0       0          0     0       0         0
toys               0           0       0          0     0       0         0
jewelry            0           0       0          0     0       0         0
perfumes           0           0       0          0     0       0         0
hobby              0           0       0          0     0       0         0
other              0           0       0          0     0       0         0
            candy toys jewelry perfumes hobby other
clothing        0    0       0        0     0     0
accessories     0    0       0        0     0     0
tobacco         0    0       0        0     0     0
stationary      0    0       0        0     0     0
books           0    0       0        0     0     0
records         0    0       0        0     0     0
household       0    0       0        0     0     0
candy           1    0       0        0     0     0
toys            0    1       0        0     0     0
jewelry         0    0       1        0     0     0
perfumes        0    0       0        1     0     0
hobby           0    0       0        0     1     0
other           0    0       0        0     0     1

    Column Weights: Jmass
      12<   13    16     19    25     35     45     57    65+
12< 0.137 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
13  0.000 0.27 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
16  0.000 0.00 0.171 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
19  0.000 0.00 0.000 0.0808 0.000 0.0000 0.0000 0.0000 0.0000
25  0.000 0.00 0.000 0.0000 0.166 0.0000 0.0000 0.0000 0.0000
35  0.000 0.00 0.000 0.0000 0.000 0.0665 0.0000 0.0000 0.0000
45  0.000 0.00 0.000 0.0000 0.000 0.0000 0.0394 0.0000 0.0000
57  0.000 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0448 0.0000
65+ 0.000 0.00 0.000 0.0000 0.000 0.0000 0.0000 0.0000 0.0255

 Total inertia  0.038

Inertias, percent inertias and cumulative percent inertias of the row space

  inertia inertiapc cuminertiapc
1  0.0300     79.88        79.88
2  0.0037      9.86        89.74
3  0.0032      8.44        98.18
4  0.0003      0.92        99.10
5  0.0003      0.67        99.77
6  0.0001      0.17        99.94
7  0.0000      0.05        99.99
8  0.0000      0.01       100.00
Inertias, percent inertias and  cumulative percent inertias of the column space

  inertia2 inertiapc2 cuminertiapc2
1   0.0225      59.83         59.83
2   0.0096      25.58         85.41
3   0.0028       7.33         92.74
4   0.0019       5.18         97.92
5   0.0003       0.82         98.74
6   0.0003       0.74         99.48
7   0.0001       0.35         99.83
8   0.0001       0.17        100.00

    Predictability Index for Variants of Non symmetrical Correspondence Analysis:

 Numerator of Tau Index predicting the rows given the column categories

[1] 0.038

 Tau Index predicting the rows given the column categories

[1] 0.041

 C-statistic 10331.51 and p-value 0

 Polynomial Components of Inertia

** Column Components **
                  Component Value P-value
Location                 6181.536       0
Dispersion               2642.363       0
Cubic                     757.192       0
Error                     750.418       0
** C-Statistic **       10331.509       0

 Generalized correlation matrix of Hybrid Decomposition
       v1     v2     v3     v4     v5     v6     v7     v8
m1 -0.147  0.084  0.018 -0.030  0.011  0.005 -0.005  0.003
m2 -0.028 -0.034 -0.032  0.024  0.005 -0.010  0.003  0.001
m3 -0.013 -0.037  0.036 -0.016 -0.004  0.006 -0.006  0.002
m4 -0.001  0.002  0.006  0.014 -0.010  0.005 -0.001 -0.001
m5 -0.001 -0.001 -0.007 -0.006 -0.007  0.009 -0.004 -0.004
m6  0.000  0.000 -0.001  0.000  0.000 -0.001 -0.006  0.005
m7  0.000  0.000  0.000 -0.001 -0.002 -0.003 -0.001 -0.002
m8  0.000  0.000  0.000  0.000 -0.001  0.000  0.001  0.001
m9  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000

 Column standard polynomial coordinates = column polynomial axes
      Axis 1  Axis 2
M12<  -1.232   1.352
M13   -0.759   0.142
M16   -0.285  -0.652
M19    0.188  -1.029
M25    0.661  -0.991
M35    1.135  -0.536
M45    1.608   0.336
M57    2.081   1.624
M65+   2.554   3.328

  Row principal polynomial coordinates
             Axis 1  Axis 2
clothing      0.072  -0.056
accessories   0.017  -0.014
tobacco       0.039   0.017
stationary   -0.084   0.033
books         0.012  -0.012
records       0.000  -0.021
household     0.015  -0.004
candy        -0.045   0.027
toys         -0.067   0.049
jewelry      -0.017  -0.006
perfumes      0.014   0.000
hobby         0.030   0.015
other         0.014  -0.030

 Column distances from the origin of the plot
      Axis 1  Axis 2
M12<   0.057   0.002
M13    0.027   0.000
M16    0.000   0.000
M19    0.027   0.002
M25    0.046   0.004
M35    0.041   0.000
M45    0.031   0.005
M57    0.021   0.022
M65+   0.010   0.047

 Row distances from the origin of the plot
             Axis 1  Axis 2
clothing      0.005   0.003
accessories   0.000   0.000
tobacco       0.001   0.000
stationary    0.007   0.001
books         0.000   0.000
records       0.000   0.000
household     0.000   0.000
candy         0.002   0.001
toys          0.005   0.002
jewelry       0.000   0.000
perfumes      0.000   0.000
hobby         0.001   0.000
other         0.000   0.001

 Inner product of coordinates (first two axes when 'firstaxis=1' and 'lastaxis=2')
              M12<    M13    M16    M19    M25    M35    M45    M57   M65+
clothing     0.111  0.097  0.014 -0.112 -0.150 -0.118 -0.065 -0.012  0.044
accessories  0.029  0.022  0.002 -0.024 -0.031 -0.027 -0.019 -0.011 -0.002
tobacco      0.057  0.017 -0.010 -0.003  0.004 -0.023 -0.059 -0.094 -0.122
stationary  -0.132 -0.089 -0.003  0.089  0.114  0.110  0.098  0.085  0.064
books        0.023  0.015  0.000 -0.014 -0.018 -0.018 -0.018 -0.017 -0.015
records      0.011  0.008  0.000 -0.008 -0.010 -0.010 -0.008 -0.006 -0.004
household    0.023  0.014  0.000 -0.013 -0.016 -0.018 -0.018 -0.018 -0.017
candy       -0.074 -0.049 -0.001  0.048  0.061  0.061  0.055  0.049  0.039
toys        -0.122 -0.070  0.004  0.061  0.074  0.088  0.101  0.113  0.115
jewelry     -0.021 -0.013  0.000  0.012  0.015  0.016  0.016  0.016  0.015
perfumes     0.021  0.010 -0.001 -0.008 -0.009 -0.013 -0.018 -0.023 -0.026
hobby        0.048  0.007 -0.012  0.010  0.021 -0.011 -0.055 -0.098 -0.135
other        0.026  0.030  0.007 -0.039 -0.054 -0.036 -0.010  0.017  0.043

    Eccentricity of ellipses
[1] 0.757

    Ellipse axes, Area, p-values of rows
            HL Axis 1 HL Axis 2 Area P-value
clothing        0.013     0.009    0   0.000
accessories     0.010     0.007    0   0.000
tobacco         0.011     0.007    0   0.000
stationary      0.010     0.007    0   0.000
books           0.012     0.008    0   0.000
records         0.008     0.005    0   0.000
household       0.015     0.010    0   0.000
candy           0.013     0.008    0   0.000
toys            0.011     0.007    0   0.000
jewelry         0.013     0.008    0   0.000
perfumes        0.015     0.010    0   0.297
hobby           0.011     0.007    0   0.000
other           0.011     0.007    0   0.000

    Ellipse axes, Area, p-values of columns
     HL Axis 1 HL Axis 2  Area P-value
M12<     0.034     0.022 0.002       0
M13      0.020     0.013 0.001       0
M16      0.020     0.013 0.001       0
M19      0.023     0.015 0.001       0
M25      0.025     0.016 0.001       0
M35      0.026     0.017 0.001       0
M45      0.031     0.020 0.002       0
M57      0.046     0.030 0.004       0
M65+     0.070     0.046 0.010       0

The total inertia of the data, defined by the Goodman-Kruskal tau index (which may also be referred to as the index of predictability) when performing a non symmetrical correspondence analysis, is τ = 0.0414; in the output this is reported under Tau Index predicting the rows given the column. To determine whether this index is statistically significant, we compute the C-statistic and find that it is equal to 10331.5 (with 96 degrees of freedom). Therefore, with a p-value that is less than 0.0001, the age of the perpetrators is a statistically significant predictor of the items that are stolen. The Goodman-Kruskal tau index and the statistical significance of the C-statistic are summarized as part of the output, together with the partition of the C-statistic, which identifies significant sources of variation in the ordered column categories. The inertia explained by each polynomial axis is what distinguishes this analysis from its non-ordered counterpart. We can see that the most dominant contribution to the total inertia of the data is due to the component associated with the linear polynomial of the columns. This location component is 6182 and explains 59.8% of the total inertia. The next most dominant is the dispersion component of 2642, which reflects that 25.6% of the variation in the column categories is due to their difference in dispersion. Similarly, the cubic component is 757 and accounts for about 7% of the column variation. Although the remaining higher order components are all statistically significant (each has a p-value less than 0.001), they are not considered further, since polynomials of degree higher than three (or, more commonly, four) convey limited information about the association structure and variation of the variables. Hence, collectively, components higher than the fourth are referred to as the error polynomial component.
Note that the first two components (linear and dispersion) explain 85.4% of the total inertia, so the first two polynomial axes will provide a sufficient graphical display of the variation of the categories. Furthermore, with the specification of ellprint = TRUE in the print method for “CAvariants” objects, the output consists of the eccentricity value of the ellipses, the semi-axis lengths of the ellipse for each of the categories, the area of each ellipse and the associated p-values.
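
The percentages quoted above can be reproduced directly from the printed components. The following is a minimal sketch that uses only the values reported in the output (the C-statistic, its degrees of freedom and the first three polynomial components). The variable names are illustrative and are not part of the CAvariants package, and the p-value assumes a chi-squared reference distribution for the C-statistic.

```r
# Values copied from the printed output of the singly ordered NSCA
c_stat <- 10331.5          # C-statistic
df     <- 96               # degrees of freedom
comp   <- c(location = 6182, dispersion = 2642, cubic = 757)

# Percentage of the total inertia explained by each polynomial component
pct <- round(100 * comp / c_stat, 1)
print(pct)

# Cumulative share of the first two components (location + dispersion)
pct_two <- round(100 * sum(comp[1:2]) / c_stat, 1)
print(pct_two)

# Upper-tail p-value of the C-statistic (assumed chi-squared reference)
p_value <- pchisq(c_stat, df = df, lower.tail = FALSE)
print(p_value < 0.0001)
```

The cumulative share of the location and dispersion components is the figure used to justify a two-dimensional polynomial display.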

Polynomial biplot: Portraying the predictability

When an ordered analysis is performed, the trend plots of the row and column categories are depicted. For example, when performing a singly ordered NSCA, the variation, or trend, of the row categories is examined by observing how it is affected by the ordered column categories when using a polynomial transformation. Figure 1 shows a parabolic trend of the row category clothing. This trend highlights that there is a greater propensity to steal clothing by people aged 25 to 45 years than those of a younger, or older, age. Figure 2 provides an alternative visual display of these trends and is constructed by depicting the row (items) categories using principal coordinates and the column (age) categories using standard coordinates. Hence a row isometric biplot is constructed. Since the analysis also incorporates the ordered nature of the column categories and the nominal structure of the row categories, Figure 2 is referred to as the row isometric polynomial biplot of the data.

Figure 1: Trend of rows: A selection of rows of the centered column profile table reconstructed by using the first two polynomials.

The trend plot of Figure 1 and the polynomial biplot given by Figure 2 can be obtained using the following command:

R> plot(res, plottype = "biplot", biptype = "row", scaleplot = 5, pos = 1)

When the first two polynomial axes are used to construct the biplot of Figure 2, the resulting configuration has a parabolic shape. The first polynomial axis accounts for 59.8% of the inertia and the second polynomial axis for 25.6%. The novelty of the polynomial biplot lies in the polynomial representation of the predictor variable. The first (linear) polynomial axis represents the deviation from the mean-centered profile while accounting for the ordered structure of the age groups, which is reflected in the correct ordering of the age categories along this axis. The second polynomial axis shows a parabolic shape of the categories with positive concavity. Furthermore, note that the left-hand side of the first axis is dominated by the young age groups, with adolescents and young adults at the center of the display (they steal items consistent with the average number of thefts of all items). The mid-adult and older age groups are on the right-hand side of Figure 2.
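
To make the role of the polynomial axes concrete, the sketch below builds polynomials that are orthonormal with respect to a set of column weights, which is the property that the linear (location) and quadratic (dispersion) axes rely on. It is an illustration only: the helper orth_poly, the natural scores 1, ..., 9 and the equal weights are assumptions made here, and the construction uses a weighted QR decomposition rather than the package's internal code.

```r
# Hypothetical helper (not part of CAvariants): polynomials of the given
# scores that are orthonormal with respect to the weights w.
orth_poly <- function(scores, w, degree = 2) {
  w <- w / sum(w)                        # weights must sum to one
  V <- outer(scores, 0:degree, `^`)      # columns 1, x, x^2, ...
  Q <- qr.Q(qr(sqrt(w) * V))             # orthonormal basis of sqrt(w) * V
  B <- Q / sqrt(w)                       # undo the row scaling
  s <- sign(colSums(w * B * V))          # sign of each leading coefficient
  B <- sweep(B, 2, s, `*`)               # make leading coefficients positive
  colnames(B) <- paste0("degree", 0:degree)
  B
}

scores <- 1:9                 # assumed natural scores for the nine age groups
w      <- rep(1 / 9, 9)       # assumed equal weights, for illustration only
B      <- orth_poly(scores, w)

# Orthonormality check: t(B) %*% diag(w) %*% B should be the identity
print(round(crossprod(sqrt(w) * B), 10))

# The linear polynomial preserves the category order; the quadratic is U-shaped
print(round(B[, c("degree1", "degree2")], 3))
```

With the observed column marginal proportions in place of the equal weights, the same construction would give unequally spaced values, but the ordering along the linear polynomial and the positive concavity of the quadratic polynomial are retained.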

Figure 2: Row-isometric polynomial biplot of singly ordered NSCA of shoplifting data: first two polynomial components, Stolen goods and Age.

The magnitude of the coordinates indicates the importance of the first two polynomial components for modeling the trends of the items. In particular, we see that the first two polynomial coordinates are sufficient to model the trends for most stolen goods.

Figure 3: 95% confidence ellipses in the row isometric polynomial biplot of singly ordered NSCA of the shoplifting data: Stolen goods and Age.

The reliability of the graphical representation can be assessed by constructing elliptical confidence regions for the row categories which are depicted using row principal polynomial coordinates. These ellipses can be obtained using the plot method for “CAvariants” objects such that

R> plot(res, scaleplot = 1, ell = TRUE, alpha = 0.05)

Figure 3 gives the 95% confidence ellipses for the row categories. They are constructed so that the weights of the semi-axes are expressed in terms of the hybrid generalized correlations rather than the squared singular values associated with each axis, and so that the information contained in all of the dimensions is reflected; for the plot method for “CAvariants” objects this means M = 8. Because the scale of the coordinates makes the ellipses hard to distinguish in this figure, we can focus more closely on the points near the origin of Figure 3 by specifying that

R> plot(res, scaleplot = 1, ell = TRUE, alpha = 0.05, prop = 60)

Zooming in on the origin gives the configuration of points shown in Figure 4. It shows that the confidence region for perfumes overlaps the origin. Hence all of the items except perfumes are important contributors to the asymmetric association, since their confidence ellipses do not overlap the origin of the plot.

Figure 4: A zoomed view of the origin of the row-isometric polynomial biplot given by Figure 3.

The contribution of all items to the association structure is also reflected in the p-values that are summarized as part of the output of the print method for “CAvariants” objects with M = 8; they appear in the last column of the table titled Ellipse axes, Area, p-values of rows, where alpha = 0.05. These results show that the only row category that is not statistically significant is perfumes, as expected from its ellipse, with a p-value of 0.297. Turning to the age of the males in the sample, and again specifying M = 8 when constructing confidence ellipses and calculating p-values, the last column of the table titled Ellipse axes, Area, p-values of columns shows that all age groups are useful predictors of the items that are stolen.
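
The ellipse summaries in the output can be cross-checked from the reported half-lengths. The sketch below assumes the usual formulas for an ellipse with semi-axes a ≥ b, namely area πab and eccentricity sqrt(1 - (b/a)^2), and applies them to the rounded column values printed in the output. It is not the package's own code, which works from unrounded quantities, so small discrepancies are to be expected.

```r
# Half-lengths (HL Axis 1, HL Axis 2) copied from the column table of the output
hl <- data.frame(
  a = c(0.034, 0.020, 0.020, 0.023, 0.025, 0.026, 0.031, 0.046, 0.070),
  b = c(0.022, 0.013, 0.013, 0.015, 0.016, 0.017, 0.020, 0.030, 0.046),
  row.names = c("M12<", "M13", "M16", "M19", "M25", "M35", "M45", "M57", "M65+")
)

# Area of an ellipse with semi-axes a and b
hl$area <- round(pi * hl$a * hl$b, 3)

# Eccentricity, assuming the standard definition sqrt(1 - (b / a)^2)
hl$ecc <- round(sqrt(1 - (hl$b / hl$a)^2), 3)

print(hl)
```

Under these assumptions the computed areas agree with the Area column, and the eccentricities all lie close to the single value 0.757 reported in the output, which is consistent with every ellipse sharing the same shape and differing only in size.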

7 Conclusion

There are many freely downloadable programs/code available for performing classical correspondence analysis. For example, the R code of and may be considered for performing simple and joint correspondence analysis. However, the CAvariants package provides variants of correspondence analysis which are not offered by other correspondence analysis R packages on CRAN. To the best of these authors’ knowledge, CAvariants is the only package available that provides the user with the option of performing six variants of two-way correspondence analysis and, in particular, ordered symmetrical and non symmetrical correspondence analysis variants. Indeed, symmetrical correspondence analysis for ordered variables was implemented in SPLUS by and has been easily adapted for R.

Subsequent versions of the function may allow for more flexibility by giving the user more tools to assess the reliability of graphical results. These may include bootstrap confidence regions to complement the algebraic regions developed by these authors, or three-dimensional polynomial biplots. While and describe the theoretical aspects of these variants of correspondence analysis for two-way contingency tables in detail, they also provide fragments of R code to undertake the relevant calculations. However, this paper has described the CAvariants package by demonstrating the applicability of one variant and providing new insight into the development of elliptical regions for ordered variants of correspondence analysis.

CRAN packages used

MASS, ca, anacor, FactoMineR, cabootcrs, CAinterprTools, homals, dualScale, ExPosition, vegan, ade4, cncaGUI, PTAk, CAvariants

CRAN Task Views implied by cited packages

ChemPhys, Distributions, Econometrics, Environmetrics, MedicalImaging, MissingData, MixedModels, NumericalMathematics, Phylogenetics, Psychometrics, Robust, Spatial, TeachingStatistics

Note

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

    References

    H. Abdi. Discriminant correspondence analysis. In Encyclopedia of measurement and statistics, Ed. N. J. Salkind, pages 270–275. Sage Publications, Inc., 2007.
    G. Alberti. CAinterprTools: An R package to help interpreting correspondence analysis results. SoftwareX, 1–2: 26–31, 2015. DOI 10.1016/j.softx.2015.07.001.
    M. J. Baxter and H. E. M. Cool. Correspondence analysis in R for archaeologists: An educational account. Archeologia e Calcolatori, 21: 211–228, 2010.
    D. Beaton, C. R. C. Fatt and H. Abdi. An ExPosition of multivariate analysis with the singular value decomposition in R. Computational Statistics & Data Analysis, 72: 176–189, 2014. DOI 10.1016/j.csda.2013.11.006.
    E. J. Beh. Elliptical confidence regions for simple correspondence analysis. Journal of Statistical Planning and Inference, 140: 2582–2588, 2010. DOI 10.1016/j.jspi.2010.03.018.
    E. J. Beh. Partitioning Pearson’s chi-squared statistic for singly ordered two-way contingency tables. The Australian and New Zealand Journal of Statistics, 43: 327–333, 2001. DOI 10.1111/1467-842x.00179.
    E. J. Beh. Simple correspondence analysis of nominal-ordinal contingency tables. Journal of Applied Mathematics and Computer Sciences, 8: 1–12, 2008. DOI 10.1155/2008/218140.
    E. J. Beh. Simple correspondence analysis of ordinal cross-classifications using orthogonal polynomials. Biometrical Journal, 39: 589–613, 1997. DOI 10.1002/bimj.4710390507.
    E. J. Beh. Simple correspondence analysis: A bibliographic review. International Statistical Review, 72: 257–284, 2004a. DOI 10.1111/j.1751-5823.2004.tb00236.x.
    E. J. Beh. S-PLUS code for ordinal correspondence analysis. Computational Statistics, 19: 593–612, 2004b. DOI 10.1007/bf02753914.
    E. J. Beh and R. Lombardo. Confidence regions and p-values for classical and non-symmetric correspondence analysis. Communications in Statistics – Theory and Methods, 44: 95–114, 2015. DOI 10.1080/03610926.2013.768665.
    E. J. Beh and R. Lombardo. Correspondence analysis: Theory, practice and new strategies. John Wiley & Sons, 2014. DOI 10.1002/9781118762875.
    J. P. Benzécri. Analyse des données. Dunod, Paris, 1973.
    D. J. Best and J. C. W. Rayner. Nonparametric analysis for doubly ordered two-way contingency tables. Biometrics, 52: 1153–1156, 1996. DOI 10.2307/2533077.
    D. Chessel, A. B. Dufour and J. Thioulouse. The ade4 package I: One-table methods. R News, 4(1): 5–10, 2004. URL https://www.R-project.org/doc/Rnews/Rnews_2004-1.pdf.
    J. G. Clavel, S. Nishisato and A. Pita. dualScale: Dual scaling analysis of multiple choice data. 2014. URL https://CRAN.R-project.org/package=dualScale. R package version 0.9.1.
    L. D’Ambra and N. C. Lauro. Non-symmetrical correspondence analysis for three-way contingency table. In Multiway data analysis, Eds. R. Coppi and S. Bolasco, pages 301–315. Amsterdam: Elsevier, 1989.
    J. De Leeuw. Correspondence analysis in R. 2006.
    J. De Leeuw and P. Mair. Gifi methods for optimal scaling in R: The package homals. Journal of Statistical Software, 31(4): 1–20, 2009a. DOI 10.18637/jss.v031.i04.
    J. De Leeuw and P. Mair. Simple and canonical correspondence analysis using the R package anacor. Journal of Statistical Software, 31(5): 1–18, 2009b. DOI 10.18637/jss.v031.i05.
    S. Dray and A. B. Dufour. The ade4 package: Implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4): 1–20, 2007. DOI 10.18637/jss.v022.i04.
    P. L. Emerson. Numerical construction of orthogonal polynomials from a general recurrence formula. Biometrics, 24: 696–701, 1968. DOI 10.2307/2528328.
    J. Gower, S. Lubbe and N. le Roux. Understanding biplots. Chichester: John Wiley & Sons, 2011. DOI 10.1002/9780470973196.
    M. Greenacre. Theory and applications of correspondence analysis. London: Academic Press, 1984.
    A. Israëls. Eigenvalue techniques for qualitative data. Leiden: DSWO Press, 1987.
    B. Kostov, M. Bécue-Bertaut and F. Husson. Correspondence analysis on generalised aggregated lexical tables (CA-GALT) in the FactoMineR package. The R Journal, 7(1): 109–117, 2015. URL https://journal.r-project.org/archive/2015-1/kostov-becuebertaut-husson.pdf.
    P. M. Kroonenberg and R. Lombardo. Nonsymmetric correspondence analysis: A tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research Journal, 34: 367–397, 1999. DOI 10.1207/s15327906mbr3403_4.
    N. C. Lauro and L. D’Ambra. L’analyse non symmetrique des correspondances. In Data analysis and informatics III, Ed. E. Diday, pages 433–446. Amsterdam: Elsevier, 1984.
    S. Lê, J. Josse and F. Husson. FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25(1): 1–18, 2008. DOI 10.18637/jss.v025.i01.
    L. Lebart, A. Morineau and K. M. Warwick. Multivariate descriptive statistical analysis. New York, USA: John Wiley & Sons, 1984.
    D. G. Leibovici. Principal tensor analysis on k modes. 2015.
    D. G. Leibovici. Spatio-temporal multiway decomposition using principal tensor analysis on k-modes: The R package PTAk. Journal of Statistical Software, 34(10): 1–34, 2010. DOI 10.18637/jss.v034.i10.
    A. B. N. Librero, P. Willems and P. G. Villardon. cncaGUI: Canonical non-symmetrical correspondence analysis in R. 2015. URL https://CRAN.R-project.org/package=cncaGUI. R package version 1.0.
    R. J. Light and B. H. Margolin. An analysis of variance for categorical data. Journal of the American Statistical Association, 66(335): 534–544, 1971. DOI 10.1080/01621459.1971.10482297.
    M. Linting, J. J. Meulman, P. F. J. Groenen and A. J. Van der Kooij. Stability of nonlinear principal components analysis: An empirical study using the balanced bootstrap. Psychological Methods, 12(3): 359–379, 2007. DOI 10.1037/1082-989x.12.3.359.
    R. Lombardo and E. J. Beh. CAvariants: Correspondence Analysis Variants. R package version 3.4, 2017. URL https://CRAN.R-project.org/package=CAvariants.
    R. Lombardo, E. J. Beh and P. M. Kroonenberg. Modelling trends in ordered correspondence analysis using orthogonal polynomials. Psychometrika, 81: 325–349, 2016. DOI 10.1007/s11336-015-9448-y.
    R. Lombardo and T. J. Ringrose. Bootstrap confidence regions in non-symmetrical correspondence analysis. Electronic Journal of Applied Statistical Analysis, 5: 413–417, 2012. DOI 10.1080/00949655.2011.579968.
    M. T. Markus. Bootstrap confidence regions in non-linear multivariate analysis. DSWO Press, 1994.
    F. Murtagh. Correspondence analysis and data coding with Java and R. Boca Raton, FL: Chapman & Hall/CRC, 2005. DOI 10.1201/9781420034943.
    O. Nenadic and M. Greenacre. Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20: 1–13, 2007. DOI 10.18637/jss.v020.i03.
    S. Nishisato. Multidimensional nonlinear descriptive analysis. Taylor & Francis Group, LLC, 2007.
    J. Oksanen, F. G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P. R. Minchin, R. B. O’Hara, G. L. Simpson, P. Solymos, et al. Vegan: Community ecology package. 2016. URL https://CRAN.R-project.org/package=vegan. R package version 2.4-1.
    J. C. W. Rayner and E. J. Beh. Towards a better understanding of correlation. Statistica Neerlandica, 63: 324–333, 2009. DOI 10.1111/j.1467-9574.2009.00425.x.
    T. J. Ringrose. Bootstrap confidence regions for correspondence analysis. Journal of Statistical Computation and Simulation, 83: 1397–1413, 2012. DOI 10.1080/00949655.2011.579968.
    B. Ripley. MASS: Support functions and datasets for Venables and Ripley’s MASS. 2016. URL https://CRAN.R-project.org/package=MASS. R package version 7.3-45.
    J. Thioulouse, D. Chessel, S. Dolédec and J. M. Olivier. ADE-4: A multivariate analysis and graphical display software. Statistics and Computing, 7: 75–83, 1997. DOI 10.1023/a:1018513530268.
    W. N. Venables and B. D. Ripley. Modern applied statistics with S. 4th ed. Springer-Verlag, 2002. DOI 10.1007/978-0-387-21706-2.
    J. L. Williams, H. Abdi, R. French and B. J. Orange. A tutorial on multi-block discriminant correspondence analysis (MUDICA): A new method for analyzing discourse data from clinical populations. Journal of Speech Language and Hearing Research, 53: 1372–1393, 2010.

    Reuse

    Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

    Citation

    For attribution, please cite this work as

    Lombardo & Beh, "Variants of Simple Correspondence Analysis", The R Journal, 2016

    BibTeX citation

    @article{RJ-2016-039,
      author = {Lombardo, Rosaria and Beh, Eric J.},
      title = {Variants of Simple Correspondence Analysis},
      journal = {The R Journal},
      year = {2016},
      note = {https://rjournal.github.io/},
      volume = {8},
      issue = {2},
      issn = {2073-4859},
      pages = {167-184}
    }