Principal Components Analysis (PCA) is a common way to study the sources of variation in a high-dimensional data set. Typically, the leading principal components are used to understand the variation in the data or to reduce the dimension of the data for subsequent analysis. The remaining principal components are ignored since they explain little of the variation in the data. However, the space spanned by the low variation principal components may contain interesting structure, structure that PCA cannot find. Prinsimp is an R package that looks for interesting structure of low variability. “Interesting” is defined in terms of a simplicity measure. Looking for interpretable structure in a low variability space has particular importance in evolutionary biology, where such structure can signify the existence of a genetic constraint.
Principal Components Analysis is simply an eigenanalysis of a covariance or correlation matrix. To learn more about PCA, see any standard multivariate analysis text, such as Johnson and Wichern (2007). Researchers can use the existing R functions princomp and prcomp to calculate the eigenvectors, eigenvalues and proportions of variance explained. Likewise, our library prinsimp (Cubranic et al. 2013) can be used to study eigenvectors and their properties. In addition, prinsimp allows researchers to study the nearly null space more carefully, by calculating simple and easily interpretable basis vectors.
We focus on analysis of covariance matrices, although the user can input a correlation matrix for analysis. As in usual PCA, the choice between using a correlation matrix and a covariance matrix depends upon the types of responses and the question of interest. A correlation matrix should be used when the responses under consideration have non-comparable units of measurement, such as meters for measuring the response, height, and kilograms for measuring the response, weight. In all of our applications, the responses are directly comparable: they are from the same variable measured under different conditions (e.g., the variable weight measured at different ages). To study structure in such data, analysis of the covariance matrix is most useful. In addition, such data usually have natural simplicity measures, whereas non-comparable data do not.
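As a reminder of how the two inputs relate, base R's cov2cor converts a covariance matrix into the corresponding correlation matrix; the 2 x 2 matrix below is purely illustrative and not taken from any of our applications.

```r
# Toy covariance matrix for two responses on different scales
S <- matrix(c(4, 1,
              1, 9), nrow = 2)
Rho <- cov2cor(S)   # Rho[i, j] = S[i, j] / sqrt(S[i, i] * S[j, j])
Rho                 # diagonal is 1; off-diagonal is 1/6
```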
Details of the methodology that prinsimp implements appear in Gaydos et al. (2013), a paper which also discusses the importance of studying genetic constraints.
The simplicity of a vector is defined by a quadratic simplicity measure: a symmetric, nonnegative definite matrix determines the measure, and a vector's simplicity is the value of the associated quadratic form at that vector. A simplicity basis for a subspace is an orthogonal basis of that subspace, ordered from simplest to least simple. Note that we require that simplicity measures be quadratic.
Examples of quadratic simplicity measures can be found in smoothing and
penalized regression. See Eilers and Marx (1996) and Green and Silverman (1994). Different
examples of simplicity measures are discussed in the next section, in
particular the built-in measures of the simpart
function.
The function simpart has three built-in simplicity measures of vectors: two based on divided differences and one based on periodicity.
Divided differences are often used to approximate derivatives of a function. For instance, the first divided difference of a differentiable function f between the points x_j and x_{j+1}, namely (f(x_{j+1}) - f(x_j))/(x_{j+1} - x_j), approximates the first derivative of f. To define a divided difference simplicity measure of a vector, we use the vector of explanatory variables x: the entries of a data vector are treated as values of a function observed at the points in x, and the vector is considered simple when its divided differences are small, that is, when it is smooth as a function of x. The first divided difference simplicity measure is the default measure of simpart. The simplicity measure based on second divided differences is defined analogously, with second divided differences approximating second derivatives. Each of these measures can be written as a quadratic form in the vector, and simpart calculates the matrix that defines this quadratic form.
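As a concrete numerical illustration (the exact normalization used by prinsimp is given in Gaydos et al. (2013) and may differ from this sketch; the function first_divdiff below is ours, not part of the package), first divided differences of a vector can be computed with base R's diff:

```r
# First divided differences of v at explanatory values x:
# d[j] = (v[j + 1] - v[j]) / (x[j + 1] - x[j])
first_divdiff <- function(v, x) diff(v) / diff(x)

x       <- c(11, 17, 23, 29, 35, 40)           # e.g. the caterpillar x vector
v_const <- rep(1, 6) / sqrt(6)                 # constant unit vector
v_rough <- c(1, -1, 1, -1, 1, -1) / sqrt(6)    # rapidly oscillating unit vector

sum(first_divdiff(v_const, x)^2)   # 0: perfectly flat, hence very simple
sum(first_divdiff(v_rough, x)^2)   # positive: rough, hence not simple
```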
In some applications, responses may vary periodically. For instance, the growth of a plant may have a daily cycle caused by the natural daily cycle of sunlight. However, this periodic variation may be swamped by larger sources of variation and thus not detectable by Principal Components Analysis. In this case, we need new methods to look for small-variation periodic structure, that is, to look for periodic structure in the nearly null space.
To use prinsimp’s periodic simplicity measure, the user must input a period. For example, if plant height had a daily cycle and was measured hourly, then the input period would be 24.
As noted, to define a periodicity measure of a vector, the user must specify an index period p; this definition of periodicity only makes practical sense if p is small relative to the length of the vector. The periodic simplicity measure of a vector is based on the vector's departure from exact periodicity: the departure is zero if and only if the vector repeats itself with period p, with the obvious adjustment when the length of the vector is not a multiple of p. The resulting simplicity measure is largest for exactly periodic vectors, and the vectors that maximize it are easily found.
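As a sketch (prinsimp's exact definition, including its normalization and its treatment of vector lengths that are not multiples of the period, may differ; the function departure below is ours, not the package's), a departure-from-periodicity measure can sum squared differences between components one period apart:

```r
# Departure from periodicity with period p: sum of squared differences
# between components one period apart; zero iff v repeats with period p
departure <- function(v, p) {
  n <- length(v)
  sum((v[(p + 1):n] - v[1:(n - p)])^2)
}

v <- sin(2 * pi * (1:72) / 12)   # exactly periodic with period 12
departure(v, 12)                 # essentially 0
departure(v, 10)                 # clearly positive: period 10 does not fit
```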
For an alternative method for finding periodic structure in functional data, see Zhao et al. (2004).
The main function, simpart, of the package prinsimp is designed for analysis of a covariance matrix, either supplied directly or calculated from data. If the user has data consisting of response vectors y, simpart will calculate the required sample covariance matrix. For some data, for instance the caterpillar data analyzed below, the covariance matrix must be estimated separately and input to simpart.
The simpart function uses the covariance matrix to partition the full space into a model space and a nearly null space. The outputs of the simpart function help the user to look for
“simple” and “interesting” structure in the nearly null space. The main
output of simpart
is two sets of orthogonal basis vectors: those that
span the model space and those that span the nearly null space. The
output also includes information about these basis vectors, such as
their simplicity measures and the percents of variance they explain. The
plot
and basisplot
functions allow the user to visually study
simpart
’s output.
This section guides the reader through the capabilities of prinsimp
using analyses of two data sets. We begin with an analysis of
caterpillar data, illustrating the basic simpart
function and its
default print
, summary
and plot
methods and how to use the
argument rev. This example follows the analysis from Gaydos et al. (2013).
The second data set is simulated using sine functions and, in the
analysis, we study periodicity in the nearly null space. We use the
built-in measure periodic
and we also define a function that supplies
simpart
a tailor-made definition of periodic simplicity. At the end of
this section, we give an additional example to show how to use a
tailored simplicity measure. The option for the user to provide a
tailored simplicity measure gives simpart
the flexibility needed to
answer a specific scientific question.
We start with a simple example that replicates some of the analysis from
Gaydos et al. (2013). We use the estimated genetic covariance matrix of the
caterpillar growth rate data described in Kingsolver et al. (2004). Growth rates
(mg/hour) were recorded on 529 individuals at six temperatures (degrees Celsius). The x vector input to simpart is
> x.cat <- c(11, 17, 23, 29, 35, 40)
We have a choice of the dimension of the model space, specified through the argument simpledim. We first load the package and the covariance matrix caterpillar.
> library(prinsimp)
> data(caterpillar)
Throughout this example, we will use the default simplicity measure of first divided differences.
First consider a model space of dimension 6, the full dimension, obtained by setting simpledim = 0.
> cat.sim.6 <- simpart(caterpillar, simpledim = 0, x = x.cat, cov = TRUE)
Since the input caterpillar
is a covariance matrix and not the
original data, we must set cov = TRUE
. The argument
measure = "first"
is not needed because "first"
is the default. When
we choose simpledim = 0
, the resulting model space basis vectors are
simply the usual principal components of the covariance matrix, that is,
they are the eigenvectors of caterpillar
. Recall that, for all values
of simpledim
, the model space basis vectors are always the
eigenvectors of caterpillar
corresponding to the largest eigenvalues.
As in usual principal component analysis, the components of an
eigenvector are called loadings. We extend this terminology to any basis
vector.
Due to rounding, the covariance matrix, caterpillar
, has one negative
eigenvalue. The function simpart
automatically sets negative
eigenvalues equal to 0.
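The clamping is easy to reproduce with base R's eigen; the matrix below is a toy example, not the caterpillar matrix.

```r
# A symmetric matrix whose determinant is slightly negative, as can
# happen for an estimated covariance matrix after rounding
S <- matrix(c(1.000, 0.999,
              0.999, 0.998), nrow = 2)
ev <- eigen(S, symmetric = TRUE)
ev$values                        # one eigenvalue is slightly negative
clamped <- pmax(ev$values, 0)    # negative eigenvalues set to 0
clamped
```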
We plot our results by the command
> plot(cat.sim.6)
The resulting plot appears in Figure 1.
The summary method and display = list() argument for the caterpillar data

Now consider the case with a model space of dimension 5, setting simpledim = 1, the dimension of the nearly null space.
the nearly null space is the same space as that spanned by the sixth
principal component. Thus the simpart
output and plot will be
identical to Figure 1, save for colour labelling.
> cat.sim.5 <- simpart(caterpillar, simpledim = 1, x = x.cat, cov = TRUE)
The summary
method prints the simplicity measures, the percents of
variance explained and the cumulative percents of variance explained of
the basis vectors of both the model space and the nearly null space. If
we set loadings=TRUE
, then the summary
method also prints the basis
vectors of both the model space and the nearly null space.
> summary(cat.sim.5, loadings = TRUE)
Simple partition (first divided differences): 1 simple basis
model 1 model 2 model 3 model 4 model 5 simple 1
Simplicity 1.85 3.54 1.25 2.47 3.55 2.68
%-var explained 59.5 19.2 14.7 5.85 0.731 <.1
Cumulative %-var 59.5 78.7 93.4 99.3 100 <.1
Loadings:
model 1 model 2 model 3 model 4 model 5 simple 1
11 0.022 -0.031 0.059 0.022 0.948 -0.308
17 -0.038 0.384 -0.141 -0.479 0.270 0.727
23 0.243 0.317 0.698 0.510 0.047 0.300
29 0.074 0.284 -0.687 0.650 0.074 0.121
35 -0.512 0.736 0.088 -0.082 -0.102 -0.413
40 0.819 0.359 -0.098 -0.284 -0.098 -0.317
In calculating the percent of total variance explained, summary
“restarts” the accumulation for the simplicity basis. In the summary
output, we see that the simplicity basis vector of the nearly null space explains less than 0.1 percent of the variance in caterpillar. By examining the variance component directly, we see that the percent is actually essentially equal to 0.
> cat.sim.5$variance
$model
[1] 0.618482630 0.199899686 0.153080939 0.060800542 0.007597639
$simple
[1] 2.305732e-17
$full
[1] 0.618482630 0.199899686 0.153080939 0.060800542 0.007597639 0.000000000
We see that cat.sim.5$variance$simple is essentially equal to 0, and that cat.sim.5$variance$full[6], the variance associated with the sixth eigenvalue from the full eigenanalysis, is equal to 0.
We now consider a more interesting higher-dimensional nearly null space,
setting simpledim = 2
. The resulting model basis vectors are identical
to the first four model basis vectors in Figure 1, as
these are the first four principal components of caterpillar
. We can
make a plot similar to Figure 1, but instead we plot just
the two basis vectors of the nearly null space. We do this with the
display = list()
argument in the plot
method.
> cat.sim.4 <- simpart(caterpillar, simpledim = 2, x = x.cat, cov = TRUE)
> plot(cat.sim.4, display = list(simple = c(1, 2)))
This is shown in Figure 2.
Recall that the default plot, without the display = list()
argument,
would contain all of the model and nearly null space basis vectors. If
the data vectors or the covariance matrix have high dimension, an
unappealing number of subplots would be generated. Because of this, the
plot
method plots a maximum of six basis vectors. With the caterpillar data, this is not an issue because the data are only of dimension 6. To plot chosen basis vectors individually, the user can call basisplot, illustrated in the next section.
The print command for the caterpillar data

The print command gives simplicity measure information: the measures of the basis vectors constructed according to the simpart call, and the measures of the “full space” simplicity basis vectors - that is, the simplicity basis vectors obtained with simpledim equal to the full dimension, 6.
> print(cat.sim.4)
Call:
simpart(y = caterpillar, simpledim = 2, x = x.cat, cov = TRUE)
Simplicity measure: first divided differences
Partition simplicity (2 simple basis):
model 1 model 2 model 3 model 4 simple 1 simple 2
1.848069 3.539756 1.245756 2.471847 3.726200 2.501706
Full space simplicity:
full 1 full 2 full 3 full 4 full 5 full 6
4.000000 3.773655 3.132870 2.237603 1.370816 0.818390
Note that the maximum simplicity score is 4, as claimed. We now study the percent of variance explained in the nearly null space.
> cat.sim.4$varperc$simple
[1] 0.6244724 0.1061671
> summary(cat.sim.4)
Simple partition (first divided differences): 2 simple basis
model 1 model 2 model 3 model 4 simple 1 simple 2
Simplicity 1.85 3.54 1.25 2.47 3.73 2.50
%-var explained 59.5 19.2 14.7 5.85 0.624 0.106
Cumulative %-var 59.5 78.7 93.4 99.3 0.624 0.731
The simplest vector in the nearly null space explains 0.62 percent of the total variance. Analyses with simpledim = 3, 2, 1, 0 are left to the user.
The rev argument for the caterpillar data

In principal component analysis, the signs of the loadings are arbitrary
and so, in simpart
, each basis vector can have one of two possible
directions. To improve interpretability the user may want to reverse the
direction of a basis vector by multiplying all of its components by -1, using the rev argument in simpart. The rev
argument takes a logical vector specifying which basis vector directions
we want to reverse. For example if we would like to reverse model basis
vector 2 and keep the other basis vectors’ directions unchanged, we
would use the following command.
> plot(simpart(caterpillar, simpledim = 2, x = x.cat, cov = TRUE,
+ measure = "first",
+ rev = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)))
> # Reverses the second basis vector, in this case the second principal component
This is shown in Figure 3. The subplot in the
second row of the first column contains the reversed second principal
component. Compare this subplot with the one in the same position in
Figure 1: it is the same except for the sign of the
loadings. We would not, in general, prefer the subplot in Figure
3 because all of the loadings are negative. Indeed,
if we were to see such a plot in our analysis we would use the rev argument to reverse its direction.
In this section we analyze simulated data with two simplicity measures that are useful when we expect the data to have some periodic structure “hidden” in the nearly null space. The first measure is the built-in periodic measure. While this built-in measure may be fine on its own, it produces simplicity basis vectors that are rough. We have found that the periodic simplicity measure is most useful in combination with a divided difference measure, to force smoothness of basis functions. That is the subject of our second analysis, which demonstrates how to input a function for a user-defined simplicity measure.
For both examples we use simulated data generated from sine functions.
Parts of this analysis can be seen in the R demo in prinsimp via the
command demo(periodic)
, which will also provide the user with the
function periodic.example
. We have fine-tuned the data parameters to
make our point, but have left enough flexibility in our code to allow
the interested user to experiment.
The function periodic.example produces one simulated data set of n independent vectors, each of length x.max.
> periodic.example(L = 72, M = 5, K = 12, a0, a1, a2, sd.sigma, e, n = 100, x.max)
The arguments of periodic.example are L, M and K, which define the periods of the different components of the simulated curves, a0, a1 and a2, which set the amplitudes of those components, and sd.sigma and e, which control the amount of random variation and noise.
> # Simulate using demo(periodic)
> example <- periodic.example(a0 = 4, a1 = 4, a2 = 0, sd.sigma = 0.2,
+ e = 1, x.max = 72)
Thus the data from each subject follow a sum of sine functions plus random noise.
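The exact data-generating model of periodic.example is not reproduced here; the sketch below is a hypothetical stand-in consistent with the arguments used above (a dominant smooth component, a small period-12 component, and noise), not the function's actual definition.

```r
set.seed(1)
x <- 1:72

# Hypothetical stand-in for one simulated subject; the model actually
# used by periodic.example may differ.
y <- 4 * sin(2 * pi * x / 72) +   # large, slowly varying component
  0.16 * sin(2 * pi * x / 12) +   # small "hidden" period-12 structure
  rnorm(72, sd = 0.2)             # measurement noise

length(y)
```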
We will use simpart to analyze our high-dimensional data and see if we can locate any periodic structure. Recall that we simulated the data with low-variability periodic structure of period 12. As we will see, this structure is not found by simpart with the built-in periodic simplicity measure alone. However, simpart with a smoothed periodic simplicity measure easily identifies the period 12 structure.
We begin with model space determination. This determination is the same as determining the number of components to retain in principal component analysis: by using either princomp or simpart with simpledim = 0, we can study the percent of variance explained. With this analysis, we see that a model space of dimension 2 is reasonable. We plot the leading model space basis vectors with basisplot.
> # We do not need to specify the x vector because
> # the data are present in unit increments from 1 to 72, the default for x
> periodic.full <- simpart(example, simpledim = 0, "periodic", period = 12)
> par(mfrow = c(3, 2))
> basisplot(periodic.full, display = list(model = 1:6))
The chosen measure has no effect on the model basis vectors, as these vectors are simply the principal components of the covariance matrix.
Most researchers would see no reason to look beyond a two-dimensional model space. To search the nearly null space for hidden periodic structure, we instead call simpart specifying a periodic simplicity measure with a period of 12.
> periodic.sim <- simpart(example, simpledim = 70, measure = "periodic", period = 12)
> # Produce Figure 6.
> par(mfrow = c(3, 4)); basisplot(periodic.sim, display = list(simple = 1:12))
In Figure 6, we see that the leading simplicity basis vectors are very rough. We next plot each nearly null space basis vector’s simplicity score against its variance.
> plot(periodic.sim$variance$simple, periodic.sim$simplicity$simple,
+ xlab = "Variance", ylab = "Simplicity")
We see that there are no vectors that have both a high simplicity score and explain noticeably more variance than the others.
Just as in principal components analysis, we can look at individuals’ scores to detect outliers or group structure. For a given basis vector, an individual’s score is the inner product of the individual’s data vector with that basis vector.
> plot(periodic.sim$scores[,1], periodic.sim$scores[,3],
+ xlab = "Score on first model basis vector",
+ ylab = "Score on first simplicity basis vector")
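For readers who want the scores from first principles, the sketch below computes principal component scores by hand for toy data; the column-centering convention is our assumption, and prinsimp's internal computation may differ in detail.

```r
set.seed(2)
Y  <- matrix(rnorm(20 * 4), nrow = 20)         # toy data: 20 subjects, 4 responses
Yc <- scale(Y, center = TRUE, scale = FALSE)   # column-centre (assumed convention)
b1 <- eigen(cov(Y), symmetric = TRUE)$vectors[, 1]  # first principal component
scores <- Yc %*% b1                            # one score per subject

# The variance of these scores equals the first eigenvalue of cov(Y)
var(scores)
```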
In addition to simpart
’s three built-in measures, simpart
allows the
user to pass a user-defined function to measure
. We provide details
for one function, which calculates a weighted combination of two of the built-in measures: the second divided difference measure and the periodic simplicity measure.
The previous simpart
analysis used measure = "periodic"
. However,
the periodic simplicity measure was not able to pick up any structure in
the nearly null space. In addition, the simplicity basis vectors were
very rough. If we expect to see low variability structure in our data
that is not only periodic but also smooth, we should incorporate
smoothness into our periodic simplicity measure. An easy way to do that
is by using a simplicity measure that is a weighted sum of the periodic
and the second divided difference measures. This will force the leading
periodic simplicity basis vectors to be smoother. We show how to define
such a simplicity measure by writing the function blend_measures and passing it to simpart. A user-defined measure must return the matrix that defines the quadratic simplicity measure; blend_measures creates this matrix.
> blend_measures <- function(alpha, x, period) {
+     (1 - alpha) * prinsimp:::lambda_second(x) +
+         (alpha) * prinsimp:::lambda_periodic(x, period)
+ }
> # Recall that the periodic simplicity measure requires an extra period argument
Notice that if we set alpha = 1 then we have the built-in periodic simplicity measure, and alpha = 0 gives us the second divided difference measure. Note that all user-defined measure functions must include the argument x, but can also have additional arguments that are provided in the call to simpart, in this case alpha and period.
We analyze the same simulated data set as in the previous section, still using a two-dimensional model space. As before, we analyze our data with a period of 12.
> # We use a 50-50 weighting of the periodic and
> # second divided difference Lambda matrices
> periodic.blend <- simpart(example, simpledim = 70,
+ measure = "blend_measures", alpha = 0.5, period = 12)
> plot(periodic.blend$variance$simple, periodic.blend$simplicity$simple,
+ xlab = "Variance", ylab = "Simplicity Score")
The outlying point in the upper right corner is from the second simplicity basis vector. We examine this vector’s variance and percent of variance explained, then plot it together with a scaled sine curve of period 12.
> periodic.blend$variance$simple[2]
> periodic.blend$varperc$simple[2]
> basisplot(periodic.blend, display = list(simple = 2))
> lines(0.16 * sin(2*pi*(1:72)/12))
This analysis shows how to use the function blend_measures
to locate
an important and simple direction of variability, a direction that was
not captured by principal component analysis. This example also
demonstrates how the periodic measure is used: the user has some idea
about the structure of data and so can propose a period. The user is
encouraged to experiment with the measures.
In some applications, we may want to consider that a vector is complex if the variance of its components is large - that is, if the components are very dissimilar. This might be sensible if, for instance, the data vectors’ entries are different variables measured on similar scales, such as test scores. Thus, we take a vector to be simple when its components are close to constant. We can supply this measure to simpart by defining the function constant_simple. Since our simplicity measure only depends on the length of the input vector, the function can ignore everything about x except its length.
> constant_simple <- function(x) {
+     # Lambda = J/n: for a unit vector v, v' Lambda v = (sum(v))^2 / n,
+     # which is maximized when all components of v are equal
+     n <- length(x)
+     matrix(1, n, n) / n
+ }
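We can check directly that this quadratic form rewards constant vectors: for a unit vector v, v'(J/n)v = (sum of the components of v)^2 / n, which equals 1 when all components are equal and 0 when the components sum to zero. The test vectors below are purely illustrative.

```r
# constant_simple as defined above: Lambda = J/n
constant_simple <- function(x) { n <- length(x); matrix(1, n, n) / n }

Lambda  <- constant_simple(rep(0, 6))          # only the length of x matters
v_const <- rep(1, 6) / sqrt(6)                 # constant unit vector
v_alt   <- c(1, -1, 1, -1, 1, -1) / sqrt(6)    # alternating unit vector

drop(t(v_const) %*% Lambda %*% v_const)   # 1: maximal simplicity
drop(t(v_alt) %*% Lambda %*% v_alt)       # ~0: components sum to zero
```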
We use this function in simpart
via measure=constant_simple
as
follows.
> constant.example <- simpart(caterpillar, simpledim = 4,
+ measure = "constant_simple", x = x.cat, cov = TRUE)
The core functionality of the prinsimp package is accessed through the
simpart
function. This function, as illustrated in examples above,
takes in the data y
in the form of a matrix or a formula object, the
independent variable x
, the dimension of the nearly null space
(argument simpledim
), and the simplicity measure (argument measure
).
Optionally, y
can contain the covariance matrix itself, which is
signalled with argument cov=TRUE
.
The value returned by simpart
is an S3 object of class "simpart"
,
with the following named components:
model
, simple
: basis of the model and nearly null space,
respectively, with vectors arranged in columns.
variance
: a list of variances associated with the vectors in the
model and nearly null spaces (components model
and simple
), as
well as the eigenvalues of the covariance matrix (component full
).
varperc
: a list of the percent of variance explained by the
corresponding basis vector in the model and nearly null space
(components model
and simple
).
simplicity
: a list of simplicity values of vectors in the model
and simplicity basis (components model
and simple
), and the
simplicity values of the full space simplicity basis, computed with simpledim equal to the full dimension (component full).
measure
, call
: the simplicity measure used and the call object.
In addition, when the y argument is a data matrix rather than a covariance matrix, the result includes the component scores, containing each observation’s scores on the basis vectors.
The prinsimp package provides a range of methods to view objects of
class simpart
. In addition to the print
and summary
methods for
printing out the basis vectors, simplicity scores, and variances to the
console, the user can also view them graphically. The plot
method that
we have used above puts together the collage of basis vectors along with
variance-simplicity view and the percent of variance explained panel.
The individual subplots can also be displayed alone, using the functions
basisplot
, varsimp
, and varperc
. (The plot
method merely sets up
the layout and calls those functions for the actual plotting.)
As mentioned earlier, as an alternative to using one of the built-in
simplicity measures, the user can provide his or her own measure by
passing a custom function as the value of the measure
argument. This function must return the matrix that defines the simplicity measure, and it must have an argument x, which receives the values of the independent variable from the call to simpart
. The custom function is allowed to have additional arguments;
these are included in the call to simpart
, and simply passed through
to the measure function.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Zhang, et al., "Prinsimp", The R Journal, 2014
BibTeX citation
@article{RJ-2014-022,
  author  = {Zhang, Jonathan and Heckman, Nancy and Cubranic, Davor and Kingsolver, Joel G. and Marron, J.S. and Gaydos, Travis L.},
  title   = {Prinsimp},
  journal = {The R Journal},
  year    = {2014},
  note    = {https://rjournal.github.io/},
  volume  = {6},
  issue   = {2},
  issn    = {2073-4859},
  pages   = {27-42}
}