SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares

The R Journal Vol. 13/1, June 2021, ISSN 2073-4859

Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. The so-called seedCCA is a newly developed R package that implements not only the standard and seeded CCA but also partial least squares. The package enables users to fit CCA to large-p, small-n data. The paper provides a complete guide to the package, and the seeded CCA application results are compared with those of the regularized CCA in an existing R package. It is believed that the package, along with the paper, will help practitioners in various scientific fields analyze high-dimensional data and will make the statistical methodologies of multivariate analysis more fruitful.


Introduction
Explanatory studies are important to identify patterns and special structures in data prior to developing a specific model. When the study of two sets of variables, a p-dimensional random variable X (X ∈ R^p) and an r-dimensional random variable Y (Y ∈ R^r), is of primary interest, one of the most popular explanatory statistical methods is canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is the dimension reduction of two sets of variables by measuring the association between the two sets. For this, pairs of linear combinations of variables are constructed by maximizing the Pearson correlation. The CCA has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis.
In Lee and Yoo (2014), it is shown that the CCA can be used as a dimension reduction tool for high-dimensional data and that it is connected to the least squares estimator. Therefore, the CCA is not only an explanatory and dimension reduction method but can also be utilized as an alternative to least squares estimation.
If max(p, r) is bigger than or equal to the sample size n, the usual CCA application is not plausible due to the inability to invert the sample covariance matrices. To overcome this, a regularized CCA was developed by Leurgans et al. (1993), whose idea was first suggested in Vinod (1976). In practice, the CCA package by González et al. (2008) implements a version of the regularized CCA. To make the sample covariance matrices, say Σ̂_x and Σ̂_y, invertible, González et al. (2008) replace them with Σ̂_x^{λ1} = Σ̂_x + λ1 I_p and Σ̂_y^{λ2} = Σ̂_y + λ2 I_r.
The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score throughout a two-dimensional grid search. Although González et al. (2008) discuss that a relatively small grid of reasonable values for λ1 and λ2 can lessen the intensive computing, it is still time-consuming, as observed in later sections. Additionally, a fast regularized CCA and a robust CCA via projection-pursuit were recently developed in Cruz-Cano (2012) and Alfons et al. (2016), respectively.

Another version of CCA to handle max(p, r) > n is the so-called seeded canonical correlation analysis proposed by Im et al. (2014). Since the seeded CCA does not require any regularization procedure, which is computationally intensive, its implementation for larger data is quite fast. The seeded CCA requires two steps. In the initial step, a set of variables bigger than n is reduced based on iterative projections. In the next step, the standard CCA is applied to the two sets of variables acquired from the initial step to finalize the CCA of the data. Another advantage is that the procedure of the seeded CCA has a close relation with partial least squares, one of the popular statistical methods for large-p, small-n data. Thus the seeded CCA can yield the partial least squares estimates.

The seedCCA package was recently developed mainly to implement the seeded CCA. However, the package can fit a collection of statistical methodologies, which are the standard canonical correlation analysis and partial least squares with uni/multi-dimensional responses, including the seeded CCA. The package is already uploaded to CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html). The main goal of the paper is to introduce and illustrate the seedCCA package. Accordingly, three real data sets are fitted by the standard CCA, the seeded CCA, and partial least squares. Two of the three data sets are available in the package. One of them has been analyzed in González et al. (2008), so the implementation results by the seeded and regularized CCA are closely compared.
The organization of the paper is as follows. The collection of three methodologies is discussed in Section 2. The implementation of seedCCA is illustrated and compared with CCA in Section 3. In Section 4, we summarize the work.
We will use the following notation throughout the rest of the paper. A p-dimensional random variable X will be denoted as X ∈ R^p; that is, X ∈ R^p means a random variable even when there is no specific mention. For X ∈ R^p and Y ∈ R^r, we define cov(X) = Σ_x, cov(Y) = Σ_y, cov(X, Y) = Σ_xy, and cov(Y, X) = Σ_yx. Moreover, it is assumed that Σ_x and Σ_y are positive-definite.

Collection of implemented methodologies in seedCCA

Canonical correlation analysis
Suppose the two sets of variables X ∈ R^p and Y ∈ R^r, and consider their linear combinations U = a^T X and V = b^T Y. Then we have var(U) = a^T Σ_x a, var(V) = b^T Σ_y b, and cov(U, V) = a^T Σ_xy b, where a ∈ R^{p×1} and b ∈ R^{r×1}. The Pearson correlation between U and V is as follows:

cor(U, V) = a^T Σ_xy b / ( (a^T Σ_x a)^{1/2} (b^T Σ_y b)^{1/2} ).   (1)

We seek a and b that maximize cor(U, V) by satisfying the following criteria.

1. The first canonical variate pair (U_1 = a_1^T X, V_1 = b_1^T Y) is obtained by maximizing (1).
2. The second canonical variate pair (U_2 = a_2^T X, V_2 = b_2^T Y) is constructed from the maximization of (1) with the restriction that var(U_2) = var(V_2) = 1 and that (U_1, V_1) and (U_2, V_2) are uncorrelated.
3. At the kth step, the kth canonical variate pair (U_k = a_k^T X, V_k = b_k^T Y) is obtained from the maximization of (1) with the restriction that var(U_k) = var(V_k) = 1 and that (U_k, V_k) is uncorrelated with the previous (k − 1) canonical variate pairs.
4. Repeat Steps 1 to 3 until k reaches q (= min(p, r)).
5. Select the first d pairs of (U_k, V_k) to represent the relationship between X and Y.

Under these criteria, the pairs (a_i, b_i) are constructed as a_i = Σ_x^{−1/2} ψ_i and b_i = Σ_y^{−1/2} φ_i for i = 1, ..., q, where (ψ_1, ..., ψ_q) and (φ_1, ..., φ_q) are, respectively, the q eigenvectors of Σ_x^{−1/2} Σ_xy Σ_y^{−1} Σ_yx Σ_x^{−1/2} and Σ_y^{−1/2} Σ_yx Σ_x^{−1} Σ_xy Σ_y^{−1/2}, with the corresponding common ordered eigenvalues ρ*_1^2 ≥ ··· ≥ ρ*_q^2 ≥ 0. Then the matrices M_x = (a_1, ..., a_d) and M_y = (b_1, ..., b_d) are called canonical coefficient matrices for d = 1, ..., q. Also, M_x^T X and M_y^T Y are called canonical variates. In the sample, the population quantities are replaced with their usual moment estimators. For more details regarding the standard CCA, readers may refer to Johnson and Wichern (2007).
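As a numerical illustration of the construction described above, the following sketch computes sample canonical correlations and coefficient matrices directly from covariance blocks. It is written in Python with NumPy rather than R, the function names are our own, and it is a minimal sketch of the textbook construction, not the seedCCA implementation.

```python
import numpy as np

def cca_from_cov(Sx, Sy, Sxy):
    """Canonical correlations and coefficient matrices from covariance blocks,
    following a_i = Sx^{-1/2} psi_i and b_i = Sy^{-1/2} phi_i."""
    def inv_sqrt(S):                        # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T
    Sx_h, Sy_h = inv_sqrt(Sx), inv_sqrt(Sy)
    Kx = Sx_h @ Sxy @ np.linalg.inv(Sy) @ Sxy.T @ Sx_h
    Ky = Sy_h @ Sxy.T @ np.linalg.inv(Sx) @ Sxy @ Sy_h
    wx, Psi = np.linalg.eigh(Kx)            # eigh: ascending eigenvalues
    wy, Phi = np.linalg.eigh(Ky)
    q = min(Sx.shape[0], Sy.shape[0])
    rho = np.sqrt(np.clip(wx[::-1][:q], 0, None))   # rho*_1 >= ... >= rho*_q
    Mx = Sx_h @ Psi[:, ::-1][:, :q]         # columns a_1, ..., a_q
    My = Sy_h @ Phi[:, ::-1][:, :q]         # columns b_1, ..., b_q
    return rho, Mx, My
```

Applied to sample covariance blocks, the leading pair of scores X M_x[, 1] and Y M_y[, 1] has Pearson correlation rho[1] (up to sign), which is what the package reports as cor.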

Seeded canonical correlation analysis
Since the standard CCA application requires the inversion of Σ̂_x and Σ̂_y in practice, it is not plausible for high-dimensional data with max(p, r) > n. In Im et al. (2014), a seeded canonical correlation analysis approach is proposed to overcome this deficit. The seeded CCA is a two-step procedure consisting of an initialized step and a finalized step. In the initialized step, the original two sets of variables are reduced to m-dimensional pairs without loss of information on the CCA application; it is essential here to force m << n. In the finalized step, the standard CCA is implemented on the initially-reduced pairs for repairing and orthonormality. A more detailed discussion of the seeded CCA follows in the next subsections.

Development
Define S(M) as the subspace spanned by the columns of M ∈ R^{p×r}. Lee and Yoo (2014) show the following relation:

S(M_x) = S(Σ_x^{−1} Σ_xy) and S(M_y) = S(Σ_y^{−1} Σ_yx).   (2)

The relation in (2) directly indicates that M_x and M_y form basis matrices of S(Σ_x^{−1} Σ_xy) and S(Σ_y^{−1} Σ_yx) and that M_x and M_y can be restored from Σ_x^{−1} Σ_xy and Σ_y^{−1} Σ_yx.

Now, we define the following two matrices:

R_{x,u1} ∈ R^{p×(r u1)} = (Σ_xy, Σ_x Σ_xy, ..., Σ_x^{u1−1} Σ_xy) and R_{y,u2} ∈ R^{r×(p u2)} = (Σ_yx, Σ_y Σ_yx, ..., Σ_y^{u2−1} Σ_yx).   (3)

In R_{x,u1} and R_{y,u2}, the numbers u1 and u2 are called termination indexes, and they decide the numbers of projections of Σ_xy and Σ_yx onto Σ_x and Σ_y, respectively. Also define

M^0_{x,u1} ∈ R^{p×r} = R_{x,u1} (R_{x,u1}^T Σ_x R_{x,u1})^− R_{x,u1}^T Σ_xy and M^0_{y,u2} ∈ R^{r×p} = R_{y,u2} (R_{y,u2}^T Σ_y R_{y,u2})^− R_{y,u2}^T Σ_yx.   (4)

In Cook et al. (2007), it is shown that S(M^0_{x,u1}) = S(Σ_x^{−1} Σ_xy) and S(M^0_{y,u2}) = S(Σ_y^{−1} Σ_yx) in (4), and hence M^0_{x,u1} and M^0_{y,u2} can be used to infer M_x and M_y, respectively. One clear advantage of using M^0_{x,u1} and M^0_{y,u2} is that no inversion of Σ_x and Σ_y is needed. Practically, it is important to select proper values for the termination indexes u1 and u2. For this, define ∆_{x,u1} = M^0_{x,u1+1} − M^0_{x,u1} and ∆_{y,u2} = M^0_{y,u2+1} − M^0_{y,u2}. Finally, the following measures of the increments over u1 and u2 are defined: nF_{x,u1} = n trace(∆_{x,u1}^T Σ_x ∆_{x,u1}) and nF_{y,u2} = n trace(∆_{y,u2}^T Σ_y ∆_{y,u2}). Then a proper value of u is set where there is little change between nF_{x,u1} and nF_{x,u1+1} and between nF_{y,u2} and nF_{y,u2+1}. The selected u1 and u2 for M^0_{x,u1} and M^0_{y,u2} need not be common. Next, the original two sets of variables X and Y are replaced with M^{0T}_{x,u1} X ∈ R^r and M^{0T}_{y,u2} Y ∈ R^p. This reduction of X and Y does not cause any loss of information on the CCA, in the sense that S(M^0_{x,u1}) = S(M_x) and S(M^0_{y,u2}) = S(M_y), and it is called the initialized CCA. The initialized CCA has the following two cases.
case 1: Suppose that min(p, r) = r << n. Then the original X alone is replaced with M^{0T}_{x,u1} X, and the original Y is kept. case 2: If min(p, r) = r is not fairly smaller than n, Σ_xy and Σ_yx are replaced by their m largest eigenvectors in the construction of R_{x,u1}, R_{y,u2}, M^0_{x,u1}, and M^0_{y,u2}. Among many possibilities, the following two ways to determine a proper value of m are recommended. One is a graphical determination by a scree plot of the eigenvalues of Σ_xy. The other is the number of eigenvalues whose sum covers 90% or more of the total variation of Σ_xy.
The primary goal of the initialized step is the reduction of X and Y to dimensions less than n without loss of information on the CCA. In case 1, X and Y are reduced to r-dimensional variates, while they are replaced with m-dimensional sets of variables in case 2. After the initialized step, r and m are fairly smaller than n.
The next step is to conduct the standard CCA on M^{0T}_{x,u1} X and M^{0T}_{y,u2} Y for repairing and orthonormality. This CCA application is called the finalized CCA. Finally, this two-step procedure for CCA is called the seeded CCA.
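To make the initialized step concrete, here is a small numerical sketch (Python/NumPy, with function names of our own choosing, not the package's code) of the seed matrix R_{x,u}, the projected matrix M^0_{x,u} formed with a Moore-Penrose generalized inverse, and the nF increment used to pick u.

```python
import numpy as np

def R_seed(Sx, Sxy, u):
    """R_{x,u} = (Sxy, Sx Sxy, ..., Sx^{u-1} Sxy)."""
    blocks, B = [], Sxy
    for _ in range(u):
        blocks.append(B)
        B = Sx @ B                    # next projection of Sxy onto Sx
    return np.hstack(blocks)

def M0(Sx, Sxy, u):
    """M0_{x,u} = R (R^T Sx R)^- R^T Sxy, using a Moore-Penrose inverse."""
    R = R_seed(Sx, Sxy, u)
    return R @ np.linalg.pinv(R.T @ Sx @ R) @ R.T @ Sxy

def nF(Sx, Sxy, u, n=1):
    """n * trace(Delta^T Sx Delta) with Delta = M0_{x,u+1} - M0_{x,u}."""
    D = M0(Sx, Sxy, u + 1) - M0(Sx, Sxy, u)
    return n * np.trace(D.T @ Sx @ D)
```

Once nF(Sx, Sxy, u) is essentially flat in u, further projections add nothing and u can be fixed; note that Sx itself is never inverted.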

Partial least squares
The main goal of the two CCA methods is dimension reduction based on the joint relation of X and Y rather than the conditional relation of Y|X. For simplicity, in this subsection, Y with r = 1 is assumed as a response variable in a regression of Y|X.
Recall R_{x,u1} in (3) and M^0_{x,u1} in (4):

R_{x,u1} = (Σ_xy, Σ_x Σ_xy, Σ_x^2 Σ_xy, ..., Σ_x^{u1−1} Σ_xy) and M^0_{x,u1} = R_{x,u1} (R_{x,u1}^T Σ_x R_{x,u1})^− R_{x,u1}^T Σ_xy.

According to Helland (1990), the population partial least squares (PLS) estimator with u1 components for the regression of Y|X is as follows:

β_{u1,PLS} = M^0_{x,u1}.   (5)

It is noted that the PLS representation in (5) is equivalent to the canonical matrix for X obtained via the seeded CCA.
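The Krylov form of the PLS coefficient quoted above can be checked numerically. The sketch below (Python/NumPy, our own naming, univariate response) builds the representation and confirms the standard fact that one component points along Σ_xy, while with u1 = p the PLS coefficient coincides with the OLS slope Σ_x^{−1} Σ_xy.

```python
import numpy as np

def pls_beta(Sx, sxy, u):
    """beta_{u,PLS} = R_u (R_u^T Sx R_u)^- R_u^T sxy, R_u the Krylov matrix."""
    R = np.column_stack([np.linalg.matrix_power(Sx, j) @ sxy for j in range(u)])
    return R @ np.linalg.pinv(R.T @ Sx @ R) @ R.T @ sxy
```

In practice u is chosen far smaller than p; the u1 = p case is only the sanity check that the Krylov construction interpolates between a single-component fit and OLS.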

Illustration of seedCCA package

Outline of seedCCA package
The methods discussed in the previous section are implemented through the main function seedCCA. Its arguments are as follows.

seedCCA(X, Y, type="seed2", ux=NULL, uy=NULL, u=10, eps=0.01, cut=0.9, d=NULL, AS=TRUE, scale=FALSE)

The main function seedCCA returns objects of the "seedCCA" class and of three subclasses, depending on the value of type. The values of type and the resulting subclasses are as follows.

type="cca": standard CCA (max(p, r) < n and min(p, r) > 1) / "finalCCA" subclass
type="cca": ordinary least squares (max(p, r) < n and min(p, r) = 1) / "seedols" subclass
type="seed1": seeded CCA with case 1 (max(p, r) ≥ n) / "finalCCA" subclass
type="seed2": seeded CCA with case 2 (max(p, r) ≥ n) / "finalCCA" subclass
type="pls": partial least squares (p ≥ n and r < n) / "seedpls" subclass

The function seedCCA prints the estimated canonical coefficient matrices for all subclasses and, additionally, the canonical correlations for the "finalCCA" subclass, although it produces more outputs. For details, readers are recommended to run ?seedCCA after loading the seedCCA package. It should be noted that the seedCCA package must be loaded before using any functions in the package, and it requires the CCA and corpcor (Schafer et al. (2017)) packages.
For illustration purposes, three data sets will be considered. The pulp data, used for the standard CCA, is available from the author's webpage (http://home.ewha.ac.kr/~yjkstat/pulp.txt). For the seeded CCA, along with the comparison with the regularized CCA and partial least squares, the cookie and nutrimouse data in the seedCCA package will be illustrated.

Standard CCA: pulp data
Pulp data consists of measurements of properties of pulp fibers and the paper made from them. It contains two sets of variables measured on 62 samples. The first set, Y, describes the pulp fiber characteristics: arithmetic fiber length, long fiber fraction, fine fiber fraction, and zero span tensile. The second set, X, concerns the paper properties: breaking length, elastic modulus, stress at failure, and burst strength. To implement the standard CCA, the function seedCCA with type="cca" should be used. In this case, seedCCA returns the "finalCCA" subclass. The function requires two matrix-type arguments, and it returns the following five components: cor, xcoef, ycoef, Xscores, and Yscores. The first component, cor, contains the sample canonical correlations. The next two, xcoef and ycoef, are the estimated canonical matrices for X and Y. The last two components, Xscores and Yscores, are the estimated canonical variates for X and Y. The command plot(object) constructs a plot of the cumulative correlations against the number of canonical pairs. By default, plot(object) provides a 90% reference line, and users can change the reference line with plot(object, ref=percent).

## loading pulp data
> pulp <- read.table("http://home.ewha.ac.kr/~yjkstat/pulp.txt", header=TRUE)
> Y <- as.matrix(pulp[, 1:4])
> X <- as.matrix(pulp[, 5:8])
## standard CCA for X and Y
> fit.cca <- seedCCA(X, Y, type="cca")
NOTE: The standard CCA is fitted for the two sets.
> names(fit.cca)
[1] "cor" "xcoef" "ycoef" "Xscores" "Yscores"
## plotting cumulative canonical correlations
> par(mfrow=c(1, 2))
> plot(fit.cca, ref=80)
> plot(fit.cca)

Ordinary least squares: pulp data
If the dimension of either X or Y is equal to one, the estimated canonical coefficient matrix from the standard CCA is equivalent to that from ordinary least squares. In such a case, the command seedCCA(X, Y[,1], type="cca") results in the ordinary least squares estimate, returned as the "seedols" subclass. The output of "seedols" has three components: the estimated coefficients and the two sets of variables. For example, assume that a regression study of arithmetic fiber length, which is the first column of Y, given X is of specific interest. It should be noted that the argument order in seedCCA(X, Y[,1], type="cca") and seedCCA(Y[,1], X, type="cca") does not matter; either yields the same results. Also, the commands coef(object) and fitted(object) return the estimated coefficients and the fitted values from the ordinary least squares, respectively.
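The equivalence claimed here can be illustrated outside R as well. The following sketch (Python/NumPy; the names are ours, not the package's) checks that with a univariate response the leading CCA direction for X is proportional to the OLS coefficient vector Σ_x^{−1} Σ_xy.

```python
import numpy as np

def leading_cca_direction(Sx, sxy):
    """Top CCA direction a_1 for X when Y is univariate (r = 1)."""
    w, V = np.linalg.eigh(Sx)
    Sx_h = V @ np.diag(w ** -0.5) @ V.T       # Sx^{-1/2}
    # Rank-one eigenproblem; the scalar Sy^{-1} only rescales it.
    K = Sx_h @ np.outer(sxy, sxy) @ Sx_h
    psi = np.linalg.eigh(K)[1][:, -1]         # leading eigenvector
    return Sx_h @ psi
```

Since K is rank one with leading eigenvector proportional to Sx^{-1/2} sxy, the returned direction is Σ_x^{−1} Σ_xy up to scale, exactly the OLS slope.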

Seeded CCA (case 1): cookie data
The biscuit dough data set called cookie in seedCCA comes from an experiment analyzing the composition of biscuits by NIR spectroscopy. Two sets of variables were obtained from 72 biscuit samples. The first set of variables consists of wavelengths measured by spectroscopy. In the original data set, wavelengths at 700 different points from 1100 to 2798 nanometers (nm) in steps of 2nm were measured. However, since some of them seemed to contain little information, the wavelengths from 1380nm to 2400nm at intervals of 4nm were analyzed. The second set of variables consists of the percentages of four ingredients of the biscuits: fat, sucrose, dry flour, and water. Since the 23rd and 61st samples in the data set were believed to be outliers, they were deleted from the data set. The standard CCA is not applicable because p = 256 > n = 72, and case 1 of the seeded CCA should be fitted, considering that n = 72 >> r = 4. The basic command for this is seedCCA(X, Y, type="seed1"), which results in the "finalCCA" subclass. Regardless of the order of X and Y, the lower-dimensional set alone is reduced in the initial step. Therefore, seedCCA(X, Y, type="seed1") and seedCCA(Y, X, type="seed1") produce essentially the same seeded CCA results. For type="seed1", the values of the options ux, uy, u, eps, and AS affect the implementation; their defaults are NULL, NULL, 10, 0.01, and TRUE, respectively.
The option u controls the maximum number of projections unless both ux and uy are specified. The option ux=k works only when the dimension of the first set X is bigger than that of the second set Y; then the maximum number of projections becomes the value given in ux=k. The option uy works in the opposite way to ux. The options AS=TRUE and eps control the automatic termination of the projections before reaching the maximum given in u, ux, or uy. The projection is terminated if the increment becomes less than the value given in eps, and the first candidate value satisfying the stopping criterion is suggested as a proper number of projections. If any of ux, uy, and u is set too small to guarantee automatic stopping, a notice to increase it is provided.
After running seedCCA(X, Y, type="seed1"), a plot for the proper selection of u is automatically constructed, and a blue vertical bar in the plot marks the suggested value of u.

Seeded CCA (case 2) versus Regularized CCA: nutrimouse data
The nutrimouse data were collected from a nutrition study of 40 mice (n = 40). One of the two sets of variables consists of the expressions of 120 genes measured in liver cells by microarray technology. The other set consists of the concentrations of 21 hepatic fatty acids (FA) measured by gas chromatography. In addition, the forty mice are cross-classified by two factors, genotype and diet. There are two genotypes, wild-type (WT) and PPARα-deficient (PPARα) mice, and five diets: corn and colza oils (50/50, REF), hydrogenated coconut oil for a saturated FA diet (COC), sunflower oil for an ω6 FA-rich diet (SUN), linseed oil for an ω3-rich diet (LIN), and corn/colza/enriched fish oils (42.5/42.5/15, FISH). The nutrimouse data is contained in the seedCCA package.
In this data, case 2 of the seeded CCA should be used because min(120, 21) is relatively large compared to n = 40. Case 2 of the seeded CCA then requires choosing how many eigenvectors of Σ̂_xy are enough to replace it. This is another tuning parameter for case 2 of the seeded CCA, along with the number of projections. The option cut in seedCCA controls the automatic selection of the number of eigenvectors of Σ̂_xy. The option cut=α selects the set of eigenvectors whose corresponding eigenvalues have a cumulative proportion bigger than or equal to α. For the set of eigenvectors to be chosen conservatively, we set the default of cut at 0.9. Also, users can directly give the number of eigenvectors using d. Unless d is NULL, the option cut is discarded; that is, cut works only when d=NULL. If users want to use d, the function covplot should be run first. The function covplot has the option mind, which sets the number of eigenvalues whose cumulative percentages are shown. Its default is NULL, in which case it becomes min(p, r). The function returns the eigenvalues, their cumulative percentages, and the numbers of eigenvectors required to account for 60%, 70%, 80%, and 90% of the total variation, along with a scree plot of the eigenvalues (Figure 3: scree plots for the selection of sets of eigenvectors to replace cov(X, Y), generated from covplot(X, Y, mind=10)).
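The cut rule can be written in a few lines. The sketch below (Python/NumPy; a re-implementation of the rule as described, not the package source) returns the smallest number of leading eigenvalues of cov(X, Y) — singular values when p ≠ r — whose cumulative proportion reaches the cutoff, mirroring how cut=0.9 keeps three eigenvectors for the nutrimouse data below.

```python
import numpy as np

def n_directions(Sxy, cut=0.9):
    """Smallest d with (lambda_1 + ... + lambda_d) / sum(lambda) >= cut."""
    s = np.linalg.svd(Sxy, compute_uv=False)   # nonnegative, descending
    cum = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cum, cut) + 1)  # first index reaching cut
```

Passing d to seedCCA directly then bypasses this rule, as described above.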
The results by the seeded and regularized CCAs are now compared. Since the regularized CCA requires choosing proper values of its two tuning parameters, we compare the running times of the automatic searches for the regularized and seeded CCAs via the tictoc package. For seedCCA, we use the default value of cut.
The plot generated from covplot(X, Y, mind=10) is given in Figure 3. According to Figure 3, the first two, three, and four eigenvalues account for 79.6%, 91.8%, and 95.9% of the total variation of Σ̂_xy, respectively. Using the conservative 90% guideline, it is determined that the three largest eigenvectors replace Σ̂_xy well enough. The selection plot of ux and uy is given in Figure 4; the figure suggests that ux and uy be equal to 7 and 6, respectively. Now we compare the parameter selection times. For the regularized CCA, the selection can be done with estim.regul, and users must provide a small enough range for the parameters to reduce the computing time. The resulting optimal λ1 and λ2 are 0.168016 and 0.004, respectively. With an Intel(R) Core(TM) i7 2.9GHz and 12GB RAM, the seeded CCA took 0.32 seconds, while 819.58 seconds, around 13.5 minutes, elapsed for the regularized CCA. This difference is huge, and the time consumed in the selection of ux, uy, and d is trivially small compared to that of the regularized CCA. This is a clearly desirable aspect and an advantage of the seeded CCA over the regularized one.
Next, we compare the first two pairs of estimated canonical variates. The results shown in Figures 5-6 are equivalent to the analysis discussed in González et al. (2008).

According to Figures 5-6, the first pair of canonical variates from both CCAs distinguishes genotype very well, while the second pairs, marked with diet, are quite complex. To gain more insight into the results for the second pair on diet, a multivariate analysis of variance is fitted, and pairwise comparisons are done via lsmeans (Lenth (2016)) at the 5% level, with p-values adjusted by the false discovery rate (Benjamini and Hochberg (1995)).
> library(lsmeans)
> fit2r <- manova(cbind(rx2, ry2) ~ diet, data=nutrimouse)
> fit3sd <- manova(cbind(sx2, sy2) ~ diet, data=nutrimouse)
> test(contrast(lsmeans(fit2r, "diet"), "pairwise"), side = "=", adjust = "fdr")
contrast    estimate   SE df t.ratio p.value
coc - fish  -2.3842686 0.
Results are averaged over the levels of: rep.meas
P value adjustment: fdr method for 10 tests

For the regularized CCA, the "coc" diet is different from the others, and "fish" differs from "sun". However, the other pairwise comparisons are quite mixed. It is determined that there are no significant differences between "fish-lin", "lin-sun" and "ref-sun". On the contrary, reasonable pairwise comparison results come from the seeded CCA. As before, the "coc" diet is different from the others. In addition, "fish-lin" is not significantly different, and "ref-sun" is concluded to be similar. Fish oil is known to contain ω3, and linseed oil is designed for it; therefore, this conclusion is reasonable. Also, the reference oil diet consists of corn and colza oil, which are known to contain ω6. Since sunflower oil is, indeed, for an ω6-rich diet, this result is also reasonable. In this regard, the seeded CCA results would be preferable to those of the regularized CCA.
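The adjust = "fdr" option above applies the Benjamini-Hochberg procedure, which base R exposes directly as p.adjust(..., method = "BH"). The sketch below shows the adjustment on ten illustrative p-values; the values are made up for demonstration and are not taken from the nutrimouse analysis.

```r
## Benjamini-Hochberg FDR adjustment in base R, the same correction that
## lsmeans applies with adjust = "fdr" to the 10 pairwise diet tests.
## These p-values are illustrative only, NOT from the nutrimouse data.
p.raw <- c(0.0001, 0.0003, 0.004, 0.012, 0.021, 0.034, 0.048, 0.18, 0.43, 0.71)
p.fdr <- p.adjust(p.raw, method = "BH")

## declare pairwise differences at the 5% level after adjustment
sig <- p.fdr < 0.05
```

Because BH only inflates p-values, any comparison significant after adjustment is also significant before it, which is why the adjusted conclusions are the conservative ones to report.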

Partial least squares application with nutrimouse data
With the nutrimouse data, consider a regression of the first variable, "C14.0", among the concentrations of the 21 hepatic fatty acids, given the expressions of the 120 genes measured in liver cells. In this case, partial least squares is a front-runner choice. To obtain the partial least squares estimator in seedCCA, one needs to run seedCCA(X, Y, type="pls"). This results in the "seedpls" subclass. An important matter in partial least squares is that the first set of variables must be the predictors. The response variable can be either univariate or multivariate. The option u is recommended to be set reasonably small, because the estimated coefficients are reported up to the value given in u. If scale=TRUE, the predictors are standardized so that they have zero sample means and their sample covariance matrix equals the sample correlation matrix. The estimated coefficients and fitted values by partial least squares can be obtained via coef(object, u=NULL) and fitted(object, u=NULL). The default of u in both coef and fitted is NULL, and the usage of u is equivalent in both functions. If u=k is specified, only the estimated coefficients and fitted values computed from k projections are reported. If u=NULL, all of the coefficient estimates and fitted values up to the fitted u are reported.
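For intuition about what "u projections" computes, the population PLS estimator admits a Krylov-subspace representation (Helland, 1990): with s = cov(X, y) and M_u = (s, Σxx s, ..., Σxx^{u-1} s), the coefficient is B_u = M_u (M_u' Σxx M_u)^{-1} M_u' s. The base-R sketch below implements this representation directly on simulated data; it is a conceptual illustration, not seedCCA's implementation.

```r
## Conceptual sketch of the PLS estimator with u projections via the
## Krylov representation. This is NOT seedCCA's code; it only shows
## what increasing u does to the coefficient estimate.
pls_coef <- function(X, y, u) {
  Sxx <- cov(X)
  sxy <- cov(X, y)                     # p x 1 cross-covariance
  M <- sxy
  v <- sxy
  if (u > 1) for (k in 2:u) {          # build Krylov sequence s, Sxx s, ...
    v <- Sxx %*% v
    M <- cbind(M, v)
  }
  c(M %*% solve(t(M) %*% Sxx %*% M, t(M) %*% sxy))
}

set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p) %*% diag(c(1, 2, 4))  # well-separated scales
y <- X %*% c(1, -1, 0.5) + rnorm(n)

b2 <- pls_coef(X, y, u = 2)            # coefficients after 2 projections
b3 <- pls_coef(X, y, u = 3)            # u = p recovers ordinary least squares
```

When u reaches p, the Krylov matrix spans the full predictor space, so the PLS coefficient coincides with the least squares solution; smaller u gives the regularized, reduced-projection estimates that coef(object, u=k) reports.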

Discussion
When a study between two sets of variables, say (X ∈ Rp, Y ∈ Rr), is of primary interest, canonical correlation analysis (CCA; Hotelling (1936)) is still popularly used in explanatory studies. The CCA has been successfully applied in many scientific fields such as chemometrics, pattern recognition, genomic sequence analysis, and so on.
The recently developed seedCCA package implements a collection of CCA methodologies, including the standard CCA application, seeded CCA and partial least squares. The package enables us to fit CCA to large-p and small-n data. The paper provides a complete guide for the package to implement all the methods, along with three real data examples. Also, the seeded CCA application results are compared with the regularized CCA in the existing CCA package.

Figure: Selection plot of u generated from seedCCA(X, Y1, u=10, type="pls") (left) and seedCCA(X, Y1, u=3, type="pls", scale=TRUE) (right) in Section 3.6.

It is believed that the package, along with the paper, will contribute to high-dimensional data analysis for practitioners in various scientific fields and that the statistical methodologies in multivariate analysis will become more fruitful.