The ridge regression estimator, one of the commonly used alternatives to the conventional ordinary least squares estimator, mitigates the adverse effects that arise when a considerable degree of multicollinearity exists among the regressors. Many software packages can estimate ridge regression coefficients; however, most of them offer only a limited set of methods for estimating the ridge biasing parameter and provide no testing procedures. Our package, lmridge, can be used to estimate ridge coefficients for a range of existing biasing parameters, to test these coefficients with more than 25 ridge-related statistics, and to present different graphical displays of these statistics.
For data collected either from a designed experiment or from an observational study, the ordinary least squares (OLS) method does not provide precise estimates of the effect of any explanatory variable (regressor) when the regressors are interdependent (collinear with each other). Consider a multiple linear regression (MLR) model, \[\begin{aligned} \label{eqols} y=X\beta + \varepsilon, \end{aligned} \tag{1}\] where \(y\) is an \(n \times 1\) vector of observations on the dependent variable, \(X\) is a known design matrix of order \(n \times p\), \(\beta\) is a \(p \times 1\) vector of unknown parameters and \(\varepsilon\) is an \(n \times 1\) vector of random errors with mean zero and variance \(\sigma^2 I_n\), where \(I_n\) is an identity matrix of order \(n\).
The OLS estimator (OLSE) of \(\beta\) is given by \[\begin{aligned} \label{olse} \hat{\beta}= (X'X)^{-1}X' y, \end{aligned} \tag{2}\] which depends on the characteristics of the matrix \(X'X\). If \(X'X\) is ill-conditioned (near dependencies among the various columns (regressors) of \(X'X\) exist) or \(det(X'X)\approx0\), then the OLS estimates are sensitive to a number of errors, such as non-significant or imprecise regression coefficients (Kmenta 1980) with wrong signs and a non-uniform eigenvalue spectrum. Moreover, the OLS method can yield high variances of estimates, large standard errors, and wide confidence intervals. The quality and stability of the fitted model may be questionable due to the erratic behaviour of the OLSE when the regressors are collinear.
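To illustrate this sensitivity, a minimal simulated sketch (our own illustration, not taken from any package; all object names are ours) shows how two nearly collinear regressors produce an ill-conditioned \(X'X\) and inflated standard errors under OLS:

> set.seed(1)
> n <- 30
> x1 <- rnorm(n)
> x2 <- x1 + rnorm(n, sd = 0.01)             # x2 is almost a copy of x1
> y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
> kappa(crossprod(cbind(1, x1, x2)))         # very large condition number of X'X
> summary(lm(y ~ x1 + x2))$coefficients      # inflated standard errors for x1 and x2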
Researchers may be tempted to eliminate the regressor(s) causing the problem by consciously removing them from the model. However, this approach may destroy the usefulness of the model by removing relevant regressor(s). To control the variance and instability of the OLS estimates, one may instead regularize the coefficients with regularization methods such as ridge regression (RR), Liu regression, or Lasso regression, as alternatives to OLS. Computationally, RR suppresses the effects of collinearity and reduces the apparent magnitude of the correlation among regressors in order to obtain more stable estimates of the coefficients than the OLS estimates, and it also improves the accuracy of prediction (see Hoerl and Kennard 1970b; Montgomery and Peck 1982; Tripp 1983; Myers 1986; Rawlings et al. 1998; Seber and Lee 2003).
There are only a few software programs and R packages capable of estimating and/or testing ridge coefficients. The primary design goal of our lmridge package (Imdad and Aslam 2018a) is to provide functionality for all possible ridge-related computations. The output of our developed package (lmridge) is consistent with the output of existing software/R packages. The lmridge package also provides the most complete suite of tools for ordinary RR, comparable to those listed in Table 1. For package development and R documentation, we followed Hadley (2015), Leisch (2008) and R Core Team (2015). The ridge package by Moritz and Cule (2017) and lm.ridge() from MASS (Venables and Ripley 2002) also provided guidance in coding.
 | NCSS | SAS | Stata | StatGraphics | lrmest | ltsbase | penalized | glmnet | ridge | lmridge |
---|---|---|---|---|---|---|---|---|---|---|
Standardization of regressors | ||||||||||
\(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||
Estimation and testing of ridge coefficient | ||||||||||
Estimation | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
Testing | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||||
SE of coef | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||||
Ridge related statistics | ||||||||||
\(R^2\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||||
adj-\(R^2\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||||||
m-scale & ISRM | \(\checkmark\) | |||||||||
Variance | \(\checkmark\) | |||||||||
Bias\({}^2\) | \(\checkmark\) | |||||||||
MSE | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||||||
F-test | \(\checkmark\) | \(\checkmark\) | ||||||||
Shrinkage factor | \(\checkmark\) | |||||||||
CN | \(\checkmark\) | |||||||||
\(\sigma^2\) | \(\checkmark\) | |||||||||
C\({}_k\) | \(\checkmark\) | |||||||||
DF | \(\checkmark\) | |||||||||
EDF | \(\checkmark\) | |||||||||
Eft | \(\checkmark\) | |||||||||
Hat matrix | \(\checkmark\) | |||||||||
Var-Cov matrix | \(\checkmark\) | |||||||||
VIF | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||||
Residuals | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||
Ridge fitted | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||||||
Predict | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
Ridge model selection | ||||||||||
CV & GCV | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||||
AIC & BIC | \(\checkmark\) | |||||||||
PRESS | \(\checkmark\) | |||||||||
Ridge related graphs | ||||||||||
Ridge trace | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||||
VIF trace | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||||
Bias, var, MSE | \(\checkmark\) | |||||||||
CV, GCV | \(\checkmark\) | |||||||||
AIC & BIC | \(\checkmark\) | |||||||||
m-scale, ISRM | \(\checkmark\) | |||||||||
DF, RSS, PRESS | \(\checkmark\) |
All available software and R packages mentioned in Table 1 are compared with our lmridge package. For multicollinearity detection, the NCSS statistical software (NCSS 11 Statistical Software 2016) computes VIF/TOL, \(R^2\), eigenvalues, eigenvectors, incremental and cumulative percentages of eigenvalues, and the CN. For RR, an ANOVA table, the coefficient of variation, a plot of residuals vs predicted values, and a histogram and density trace of the residuals are also available in NCSS. In SAS (Inc. 2011), the collin option in the model statement is used to perform collinearity diagnostics, while for the remedy of multicollinearity, RR can be performed using the ridge option in the proc reg statement. The outVIF option returns VIF values. For RR, Stata (StataCorp 2014) has no built-in command; however, the ridgereg add-on is available, which performs the calculation for a scalar k.
The lrmest package (Dissanayake et al. 2016) computes estimators such as OLS, ordinary RR (ORR), the Liu estimator (LE), LE type-1, 2, 3, the adjusted Liu estimator (ALTE), and their type-1, 2, 3 variants, etc. Moreover, lrmest provides the scalar mean square error (MSE) and prediction residual error sum of squares (PRESS) values of some of the estimators. The testing of ridge coefficients is performed only for a scalar k; for a vector of k values, the rid() function of the lrmest package returns only the MSE along with the value of the biasing parameter used. The optimum() function of the lrmest package can be used to get the optimal scalar MSE and PRESS values (Arumairajan and Wijekoon 2015). Statgraphics standardizes the dependent variable and computes some statistics for the detection of collinearity, such as \(R^2\), adj-\(R^2\), and VIF. Statgraphics also facilitates performing RR and computes different RR-related statistics, such as VIF and the ridge trace for the different biasing parameters used, \(R^2\), adj-\(R^2\) and the standard error of estimates, etc. The ltsbase package (Kan-Kilinc and Alpu 2013, 2015) computes ridge and Liu estimates based on the least trimmed squares (LTS) method. The MSE values from four regression models can be compared graphically if the argument plot = TRUE is passed to the ltsbase() function. There are three main functions: (i) ltsbase() computes the minimum MSE values for six models (OLS, ridge, ridge based on LTS, LTS, Liu, and Liu based on LTS) for sequences of biasing parameters ranging from 0 to 1; if print = TRUE, ltsbase() prints all the MSEs (along with the minimum MSE) for ridge, Liu, and ridge & Liu based on the LTS method for the sequence of biasing parameters given by the user; (ii) the ltsbaseDefault() function returns the fitted values and residuals of the six models (OLS, ridge, Liu, LTS, and ridge & Liu based on LTS) having minimum MSE; and (iii) the ltsbaseSummary() function returns the coefficients and the biasing parameter for the best MSE among the four regression models.
The penalized package (Goeman et al. 2017) is designed for penalized estimation in generalized linear models. The supported models are linear regression, logistic regression, Poisson regression and the Cox proportional hazards model. The penalized package allows an L1 absolute value ("lasso") penalty, an L2 quadratic ("ridge") penalty, or a combination of the two. It is also possible to have a fused lasso penalty with an L1 absolute value penalty on the coefficients and their differences. The penalized package also includes facilities for likelihood cross-validation and for optimization of the tuning parameter. The glmnet package (Friedman et al. 2010) has some efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, Poisson regression and the Cox model. The glmnet package can also be used to fit the RR model by setting the alpha argument to zero. The ridge package fits linear and logistic RR models, including functions for fitting linear and logistic RR models for genome-wide SNP data supplied as file names when the data are too big to read into R. The RR biasing parameter is chosen automatically using the method proposed by Cule and De Iorio (2012); however, the value of the biasing parameter can also be specified for the estimation and testing of ridge coefficients. The lm.ridge() function from MASS fits only the linear RR model and returns the ridge biasing parameters given by Hoerl and Kennard (1970b) and Venables and Ripley (2002) and the vector of GCV criterion values given by Golub et al. (1979). There are other software packages and R packages that can be used to perform RR analysis, such as S-PLUS (S-PLUS 2008), Shazam (Shazam 2011) and the R packages RXshrink (Obenchain 2014), rrBLUP (Endelman 2011), RidgeFusion (Price 2014), bigRR (Shen et al. 2013), lpridge (Seifert 2013), genridge (Friendly 2017) and CoxRidge (Perperoglou 2015), etc.
This paper outlines the collinearity detection methods available in the existing literature and uses the mctest (Imdad and Aslam 2018b) package through an illustrative example. To overcome the issues of the collinearity effect on regressors, a thorough introduction to ridge regression, the properties of the ridge estimator, different methods for selecting values of \(k\), and the testing of the ridge coefficients are presented. Finally, the estimation of the ridge coefficients, methods of selecting the ridge biasing parameter, testing of the ridge coefficients, and different ridge-related statistics are implemented in R within the lmridge package.
Diagnosing collinearity is important to many researchers. It consists of two related but separate elements: (1) detecting the existence of collinear relationships among regressors and (2) assessing the extent to which these relationships have degraded the parameter estimates. There are many diagnostic measures used for the detection of collinearity in the existing literature, provided by various authors (Klein 1962; Farrar and Glauber 1967; Marquardt 1970; Theil 1971; Gunst and Mason 1977; Koutsoyiannis 1977; Belsley et al. 1980; Kovács et al. 2005; Curto and Pinto 2011; Fox and Weisberg 2011; Imdadullah et al. 2016). These diagnostic methods assist in determining whether and where some corrective action is necessary (Belsley et al. 1980). The most widely used and suggested diagnostics are the values of pairwise correlations, the variance inflation factor (VIF)/tolerance (TOL) (Marquardt 1970), eigenvalues and eigenvectors (Kendall 1957), CN & CI (Belsley et al. 1980; Maddala 1988; Chatterjee and Hadi 2006), Leamer's method (Greene 2002), Klein's rule (Klein 1962), the tests proposed by Farrar and Glauber (Farrar and Glauber 1967), the Red indicator (Kovács et al. 2005), the corrected VIF (Curto and Pinto 2011) and Theil's measure (Theil 1971); see also Imdadullah et al. (2016). All of these diagnostic measures are implemented in the R package mctest. Below, we use the Hald dataset (Hald 1952) to test for collinearity among the regressors. We then use the lmridge package to compute the ridge coefficients and different ridge-related statistics, and the methods of selection of the ridge biasing parameter are also illustrated. For the optimal choice of the ridge biasing parameter, graphical representations of the ridge coefficients, VIF values, cross-validation criteria (CV & GCV), ridge DF, RSS, PRESS, ISRM and m-scale versus the ridge biasing parameter used are considered. In addition, a graphical representation of the model selection criteria (AIC & BIC) of ridge regression versus the ridge DF is also produced. The Hald data, used by Hoerl et al. (1975), concern the heat generated during the setting of 13 cement mixtures of 4 basic ingredients; each ingredient percentage appears to be rounded down to a full integer. The dataset is bundled in both the mctest and lmridge packages.
> library("mctest")
> x <- Hald[, -1]
> y <- Hald[, 1]
> mctest (x, y)
Call:
omcdiag(x = x, y = y, Inter = TRUE, detr = detr, red = red, conf = conf,
    theil = theil, cn = cn)

Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.0011         1
Farrar Chi-Square:        59.8700         1
Red Indicator:             0.5414         1
Sum of Lambda Inverse:   622.3006         1
Theil's Method:            0.9981         1
Condition Number:        249.5783         1

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
The results from all the overall collinearity diagnostic measures indicate the existence of collinearity among the regressors. These results do not tell us which regressor(s) cause the collinearity. The individual collinearity diagnostic measures can be obtained through:
> imcdiag(x = x, y, all = TRUE)
Call:
imcdiag(x = x, y = y, method = method, corr = FALSE, vif = vif,
    tol = tol, conf = conf, cvif = cvif, leamer = leamer, all = all)

Individual Multicollinearity Diagnostics

   VIF TOL Wi Fi Leamer CVIF Klein
X1   1   1  1  1      0    0     0
X2   1   1  1  1      1    0     1
X3   1   1  1  1      1    0     0
X4   1   1  1  1      0    0     1

1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test

X1, X2, X3, X4, coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.9824

* use method argument to check which regressors may be the reason of collinearity
Results from most of the individual collinearity diagnostics suggest that all of the regressors are a cause of collinearity. The last line of the imcdiag() function's output suggests that the method argument should be used to check which regressors may be the reason for collinearity among the different regressors. For further information about the method argument, see the help file of the imcdiag() function.
In seminal work, Hoerl (1959, 1962, 1964) and Hoerl and Kennard (1970a,b) developed the ridge analysis technique, which addresses the departure of the data from orthogonality. Hoerl (1962) introduced RR, based on the James-Stein estimator, by stating that the existence of correlation among regressors can cause errors in estimating the regression coefficients when the OLS method is applied. RR is similar to the OLS method; however, it shrinks the coefficients towards zero by minimizing the MSE of the estimates, making the RR technique better than the OLSE with respect to MSE when the regressors are collinear with each other. A penalty (degree of bias) is imposed on the size of the coefficients in RR to reduce their variances. However, the expected values of these estimates are not equal to the true values and tend to underestimate the true parameters. Though the ridge estimators are biased, they have lower MSE (more precision) than the OLSE, are less sensitive to sampling fluctuations or model misspecification when the number of regressors exceeds the number of observations in a dataset (i.e., \(p>n\)), and are less affected by omitted-variable specification bias (Theil 1957). In summary, the RR procedure is intended to overcome ill-conditioned situations and is used to improve the estimation of regression coefficients when the regressors are correlated; it also improves the accuracy of prediction (Seber and Lee 2003). Obtaining the ridge model coefficients (\(\hat{\beta}_R\)) is relatively straightforward, because the ridge coefficients are obtained by solving a slightly modified form of the OLS method.
The design matrix \(X\) in Eq. (1) can be standardized, scaled or centered. Usually, standardization of the \(X\) matrix is done as described by Belsley et al. (1980) and Draper and Smith (1998), that is, \(X_j=\frac{x_{ij}-\overline{x}_j} {\sqrt{\sum (x_{ij}-\overline{x}_j)^2}}\), where \(j=1, 2, \cdots, p\), such that \(\overline{X}_j=0\) and \(X'_j X_j=1\), where \(X_j\) is the \(j\)th column of the matrix \(X\). In this way, the new design matrix (say \(\tilde{X}\)) contains the standardized \(p\) columns, and the matrix \(\tilde{X}' \tilde{X}\) is the correlation matrix of the regressors. To avoid the complexity of different notations and terms, the centered and scaled design matrix \(\tilde{X}\) will be represented by \(X\) and the centered response variable by \(y\).
The ridge model coefficients are estimated as, \[\begin{aligned} \label{ridgeeq} \hat{\beta}_{R_k}=(X' X+k I_p)^{-1} X' y, \end{aligned} \tag{3}\] where \(\hat{\beta}_{R_k}\) is the vector of standardized RR coefficients of order \(p\times 1\) and \(kI_p\) is a positive semi-definite matrix added to the \(X'X\) matrix. Note that for \(k=0\), \(\hat{\beta}_{R_k}=\hat{\beta}_{ols}\).
The addition of the constant term \(k\) to the diagonal elements of \(X' X\) (in other words, the addition of \(k I_p\) to \(X' X\)) in Eq. (3) is known as the penalty, and \(k\) is called the biasing or shrinkage parameter. The addition of this biasing parameter guarantees the invertibility of the \(X' X\) matrix, such that a unique solution \(\hat{\beta}_{R_k}\) always exists (Hoerl and Kennard 1970b; McCallum 1970; Draper and Smith 1998), and the condition number (CN) of \(X' X+kI\) (\(CN_k=\sqrt{\frac{\lambda_1+k}{\lambda_p+k}}\)) also becomes smaller as compared to that of \(X' X\), where \(\lambda_1\) is the largest and \(\lambda_p\) the smallest eigenvalue of the correlation matrix \(X' X\). Therefore, the ridge estimator (RE) is an improvement over the OLSE for collinear data.
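For illustration, Eq. (3) can be evaluated directly for the Hald data already used above and bundled with lmridge. The following is only a sketch of the computation, not the package's internal code; the object names (Xs, yc, betaR) are ours.

> library("lmridge")
> X <- as.matrix(Hald[, -1]); y <- Hald[, 1]
> Xs <- scale(X, center = TRUE, scale = FALSE)
> Xs <- sweep(Xs, 2, sqrt(colSums(Xs^2)), "/")    # correlation-form standardization
> yc <- y - mean(y)                               # centered response
> k <- 0.012
> betaR <- solve(crossprod(Xs) + k * diag(ncol(Xs)), crossprod(Xs, yc))   # Eq. (3)
> betaR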
It is desirable to select the smallest value of \(k\) for which the regression coefficients stabilize, and there always exists a particular value of \(k\) for which the total MSE of the RE is less than the MSE of the OLSE. However, the optimum value of \(k\) (the one that produces minimum MSE as compared to other values of \(k\)) varies from one application to another, and hence the optimal value of \(k\) is unknown. Any estimator that has a small amount of bias and less variance, and is substantially more precise than an unbiased estimator, may be preferred, since it will have a larger probability of being close to the true parameter being estimated. Therefore, the criterion of goodness of estimation considered in RR is the minimum total MSE.
Let \(X_j\) denotes the \(j\)th column of \(X\) (\(1,2,\cdots,p\)), where \(X_j=(x_{1j},x_{2j},\cdots,x_{nj})'\). As already discussed, assume that the regressors are centered such that \(\sum\limits_{i=1}^n x_{ij}=0\) and \(\sum\limits_{i=1}^n x_{ij}^2=1\) and the response variable \(y\) is centered.
RR is the most popular among the biased estimation methods because of its relationship to the OLS method, and the statistical properties of the RE are well defined. Most of the RR properties have been discussed, proved and extended by many researchers (Hoerl and Kennard 1970a,b; Marquardt 1970; Newhouse and Oman 1971; Allen 1974; Hemmerle 1975; McDonald and Galarneau 1975). Table 2 lists the RR properties.
sr.# | Property | Formula |
---|---|---|
1) | Mean | \(E(\hat{\beta}_R)=(X' X+kI_p)^{-1}X' X\beta\) |
2) | Shorter regression coeffs. | \(\hat{\beta}'_R\hat{\beta}_R\le \hat{\beta}'\hat{\beta}\) |
3) | Linear transformation | \(\hat{\beta}_R=Z\hat{\beta} \, \text{,where } Z=(X' X+kI)^{-1}X' X\) |
4) | Variance | \(Var(\hat{\beta}_R)=\sigma^2 \sum\limits_{j=1}^p \frac{\lambda_j}{(\lambda_j+k)^2}\) |
5) | Var-Cov matrix | \(\begin{aligned}Cov(\hat{\beta}_R)&=Cov(Z\hat{\beta})\\ &=\sigma^2(X' X+kI)^{-1}X' X(X' X+kI)^{-1}\\&=\sigma^2[VIF]\end{aligned}\) |
6) | Bias | \(\begin{aligned}Bias(\hat{\beta}_R)&=-k(X' X+kI)^{-1}\beta\\&=-k\, P\, diag\left(\frac{1}{\lambda_j+k}\right)P'\beta\end{aligned}\) |
7) | MSE | \(MSE=\sigma^2\sum\limits_{j=1}^p \frac{\lambda_j}{(\lambda_j+k)^2}+\sum\limits_{j=1}^p \frac{k^2 \alpha^2_j}{(\lambda_j+k)^2}\) |
8) | Distance between \(\hat{\beta}_R\) and \(\beta\) | \(\hat{\beta}_R\) and the true vector of \(\beta\) have minimum distance |
9) | Inflated RSS | \(\phi_0=k^2 \hat{\beta}'_R(X' X)^{-1}\hat{\beta}_R\) |
10) | \(R^2_R\) | \(R^2_R=\frac{\hat{\beta}'_RX' y-k\hat{\beta}'_R\hat{\beta}_R}{y' y}\) |
11) | Sampling fluctuations | The \(\hat{\beta}_R\) is less sensitive to the sampling fluctuation |
12) | Accurate prediction | \(\sigma^2_{f_R}=\sigma^2\left[1+x' \, P \, diag \left(\frac{\lambda_j}{(\lambda_j+k)^2}\right)P' x\right]+(Bias(\hat{\beta}_R))^2\) |
13) | Wide range of \(k\) | \(0<k<k_{max}\), have smaller set of MSE than OLSE |
14) | Optimal \(k\) | An optimal \(k\) always exists that gives minimum MSE |
15) | DF Ridge | \(df_{R_k}=EDF=\sum\limits_{j=1}^p \frac{\lambda_j}{\lambda_j+k}=trace\left[H_{R_k}\right]\), |
where \(H_{R_k}=X\left(X'X+kI\right)^{-1}X'\) | | |
16) | Effective no. of parameters | \(EP=trace[2H_{R_k}-H_{R_k}H'_{R_k}]\) |
17) | Residual EDF | \(REDF=n-trace[2H_{R_k}-H_{R_k}H'_{R_k}]=n-EP\) |
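For example, the ridge DF in Table 2 depends only on the eigenvalues of the correlation-form \(X' X\) matrix. A small sketch (our own illustration with the Hald data, not package code) computes it for \(k=0.012\):

> library("lmridge")
> lambda <- eigen(cor(Hald[, -1]))$values    # eigenvalues of X'X in correlation form
> k <- 0.012
> sum(lambda / (lambda + k))                 # df ridge = trace of the ridge hat matrix

This value can later be compared with the DF ridge column reported by rstats2().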
Theoretically and practically, RR is used to propose new methods for the choice of the biasing parameter \(k\) and to investigate the properties of the RE, since the biasing parameter plays a key role and the optimal choice of \(k\) is the main issue in this context. In the literature, there are many methods for estimating the biasing parameter \(k\) (see Hoerl and Kennard 1970a,b; Allen 1974; Guilkey and Murphy 1975; Hemmerle 1975; McDonald and Galarneau 1975; Hocking et al. 1976; Lawless and Wang 1976; Vinod 1976; Kasarda and Shih 1977; Obenchain 1977; Hemmerle and Brantle 1978; Wichern and Churchill 1978; Nordberg 1982; Saleh and Kibria 1993; Singh and Tracy 1999; Wencheko 2000; Kibria 2003; Khalaf and Shukur 2005; Alkhamisi et al. 2006; Alkhamisi and Shukur 2007; Khalaf 2013, among many more); however, there is no consensus about which method is preferable (Chatterjee and Hadi 2006). Similarly, no estimation method of the biasing parameter can guarantee a better \(k\) or even a smaller MSE as compared to that of the OLS.
The optimal value of \(k\) is the one which gives minimum MSE. There is one optimal \(k\) for any problem, while a wide range of \(k\) values (\(0<k<k_{opt}\)) give a smaller MSE as compared to that of the OLS. For collinear data, a small change in \(k\) varies the RR coefficients rapidly. At some value of \(k\), the ridge coefficients stabilize and the rate of change slows down gradually to almost zero. Therefore, a disciplined way of selecting the shrinkage parameter is required that minimizes the MSE. The biasing parameter \(k\) depends on the true regression coefficients (\(\beta\)) and the variance of the residuals \(\sigma^2\); unfortunately, these are unknown, but they can be estimated from the sample data.
We classify these estimation methods as either (i) subjective or (ii) objective.
In the subjective methods, the selection of \(k\) is of a judgmental nature; these methods provide graphical evidence of the effect of collinearity on the regression coefficient estimates and account for the variation of the RE as compared to the OLSE. In these methods, a reasonable choice of \(k\) is made using the ridge trace, df trace, VIF trace and the plotting of bias, variance, and MSE. The ridge trace is a graphical representation of the regression coefficients \(\hat{\beta}_R\) as a function of \(k\) over the interval \([0,1]\). The df trace and VIF trace are like the ridge trace plot in that the EDF and VIF values are plotted against \(k\). Similarly, plotting the bias, variance, and MSE of the RE may also be helpful in selecting an appropriate value of \(k\). All these graphs can be used to select an optimal (but judgmental) value of \(k\) from the horizontal axis and to assess the effect of collinearity on each of the coefficients. The effect of collinearity is suppressed as the value of \(k\) increases, and all the ridge coefficients, EDF and VIF values decrease and/or may stabilize after a certain value of \(k\). These graphical representations do not provide a unique solution; rather, they render a vaguely defined class of acceptable solutions. However, these traces are still useful graphical representations for checking for an optimal \(k\).
Suppose we have a set of observations \((x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\) and the RR model as given in Eq. (3). Objective methods, to some extent, are similar to the judgmental methods for the selection of the biasing parameter \(k\), but they require some calculation to obtain the biasing parameter. Table 3 lists widely used methods to estimate the biasing parameter \(k\) that are available in the existing literature, along with other statistics that can be used for the selection of \(k\).
method | formula | reference |
---|---|---|
\(C_k\) | \(\begin{aligned} C_k&=\frac{SSR_k}{s^2}-n+2+2\, trace(H_{R_k})\\ &=\frac{SSR_k}{s^2}+2(1+trace(H_{R_k}))-n\end{aligned}\) | (Kennard 1971; Mallows 1973) |
\(PRESS_k\) | \(\begin{aligned}PRESS_k&=\sum\limits_{i=1}^n (y_i-\hat{y}_{(i,-i)_k})^2\\&=\sum\limits_{i=1}^n e^2_{(i,-i)_k}\end{aligned}\) | (Allen 1971; Allen 1974) |
CV | \(CV_k=n^{-1}\sum\limits_{i=1}^n (y_i-X_j \hat{\beta}_{j_{R_k}})^2\) | (Delaney and Chatterjee 1986) |
GCV | \(GCV_k=\frac{SSR_k}{\left\{n-\left(1+trace(H_{R_k})\right)\right\}^2}\) | (Golub et al. 1979) |
ISRM | \(ISRM_k= \sum\limits_{j=1}^p \left(\frac{p\left(\frac{\lambda_j}{\lambda_j+k}\right)^2}{\sum_{j=1}^p \frac{\lambda_j}{(\lambda_j+k)^2} \, \lambda_j}-1\right)^2\) | (Vinod 1976) |
m-scale | \(m=p-\sum\limits_{j=1}^p \frac{\lambda_j}{\lambda_j+k}\) | (Vinod 1976) |
Information criteria | \(\begin{aligned}AIC&=n\cdot \log(RSS/n)+2\cdot df_R{_k}\\BIC&=n \cdot \log(RSS) + 2 \cdot df_R{_k}\end{aligned}\) | (Akaike 1973; Schwarz 1978) |
Effectiveness index (Eft) | \(EF=\frac{\sigma^2\, trace(X' X)^{-1}-\sigma^2\,trace(VIF)} {(Bias(\hat{\beta}_R))^2}\) | (Lee 1979) |
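As an illustration of how such criteria are used, the GCV column of Table 3 can be evaluated over a grid of \(k\) values and minimized. The sketch below is our own, assumes the Table 3 form of GCV and the correlation-form scaling used earlier, and is not the code used internally by lmridge, so the selected \(k\) may differ slightly from the value reported later by kest().

> library("lmridge")
> X <- as.matrix(Hald[, -1]); y <- Hald[, 1]
> Xs <- scale(X, center = TRUE, scale = FALSE)
> Xs <- sweep(Xs, 2, sqrt(colSums(Xs^2)), "/")
> yc <- y - mean(y)
> gcv <- function(k) {
+   H <- Xs %*% solve(crossprod(Xs) + k * diag(ncol(Xs)), t(Xs))   # ridge hat matrix
+   sum((yc - H %*% yc)^2) / (nrow(Xs) - (1 + sum(diag(H))))^2     # GCV of Table 3
+ }
> kgrid <- seq(0, 1, by = 0.001)
> kgrid[which.min(sapply(kgrid, gcv))]     # k minimizing GCV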
There are other methods to estimate biasing parameter \(k\). Table 4 lists various methods for the selection of biasing parameter \(k\), proposed by different researchers.
Sr. # | Formula | Reference |
---|---|---|
1) | \(K_{HKB}=\frac{p \hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}\) | Hoerl and Kennard (1970b) |
2) | \(K_{TH}=\frac{(p-2)\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}\) | Thisted (1976) |
3) | \(K_{LW}=\frac{p\hat{\sigma}^2}{\sum_{j=1}^{p}\lambda_j \hat{\alpha}_j^2}\) | Lawless and Wang (1976) |
4) | \(K_{DS}=\frac{\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}\) | Dwividi and Shrivastava (1978) |
5) | \(K_{LW}=\frac{(p-2)\hat{\sigma}^2 \times n}{\hat{\beta}' X' X\hat{\beta}}\) | Venables and Ripley (2002) |
6) | \(K_{AM}=\frac{1}{p}\sum\limits_{j=1}^p \frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}\) | Kibria (2003) |
7) | \(\hat{K}_{GM}=\frac{\hat{\sigma}^2}{\left(\prod\limits_{j=1}^p \hat{\alpha}_j^2\right)^ {\frac{1}{p}}}\) | Kibria (2003) |
8) | \(\hat{K}_{MED}=Median \{\frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}\}\) | Kibria (2003) |
9) | \(K_{KM2}=max \left(\frac{1}{\sqrt{\frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}}} \right)\) | Muniz and Kibria (2009) |
10) | \(K_{KM3}=max \left(\sqrt{\frac{\hat{\sigma}^2_j}{\hat{\alpha}^2_j }}\right)\) | Muniz and Kibria (2009) |
11) | \(K_{KM4}=\left(\prod\limits_{j=1}^p \frac{1}{\sqrt{\frac{\hat{\sigma}^2_j}{\hat{\alpha}_j^2}}} \right)^{\frac{1}{p}}\) | Muniz and Kibria (2009) |
12) | \(K_{KM5}=\left(\prod\limits_{j=1}^p \sqrt{\frac{\hat{\sigma}^2_j}{\hat{\alpha}^2_j}}\right)^{\frac{1}{p}}\) | Muniz and Kibria (2009) |
13) | \(K_{KM6}=Median \left(\frac{1}{\sqrt{\frac{\hat{\sigma}^2_j}{\hat{\alpha}^2_j}}}\right)\) | Muniz and Kibria (2009) |
14) | \(K_{KM8}=max\left(\frac{1}{\sqrt{\frac{\lambda_{max} \hat{\sigma}^2}{(n-p)\hat{\sigma}^2 +\lambda_{max}\hat{\alpha}_j^2}}}\right)\) | Muniz et al. (2012) |
15) | \(K_{KM9}=max \left(\sqrt{\frac{\lambda_{max} \hat{\sigma}^2}{(n-p)\hat{\sigma}^2+\lambda_{max} \hat{\alpha}_j^2}}\right)\) | Muniz et al. (2012) |
16) | \(K_{KM10}=\left(\prod\limits_{j=1}^p \frac{1}{\sqrt{\frac{\lambda_{max} \hat{\sigma}^2}{(n-p) \hat{\sigma^2}+\lambda_{max} \hat{\alpha}_j^2}}}\right)^{\frac{1}{p}}\) | Muniz et al. (2012) |
17) | \(K_{KM11}=\left(\prod\limits_{j=1}^p \sqrt{\frac{\lambda_{max} \hat{\sigma}^2}{(n-p)\hat{\sigma}^2+\lambda_{max}\hat{\alpha}_j^2}} \right)^{\frac{1}{p}}\) | Muniz et al. (2012) |
18) | \(K_{KM12}=Median\left(\frac{1}{\sqrt{\frac{\lambda_{max} \hat{\sigma}^2}{(n-p)\hat{\sigma}^2 +\lambda_{max}\hat{\alpha}_j^2}}} \right)\) | Muniz et al. (2012) |
19) | \(K_{KD}=max\left(0, \frac{p\hat{\sigma}^2}{\hat{\alpha}' \hat{\alpha}}-\frac{1}{n (VIF_j)_{max} }\right)\) | Dorugade and Kashid (2010) |
20) | \(\begin{aligned}K_{4(AD)}&=Harmonic\, Mean[K_i (AD)]\\ &=\frac{2p}{\lambda_{max}} \sum_{j=1}^p \frac{\hat{\sigma}^2}{\hat{\alpha}_j^2}\end{aligned}\) | Dorugade (2014) |
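To make these formulas concrete, the following sketch (our own, computed on the standardized Hald data) evaluates \(K_{HKB}\) and \(K_{DS}\) from Table 4. Because the residual-variance divisor is a convention, the values may differ slightly from those reported by kest() below.

> library("lmridge")
> X <- as.matrix(Hald[, -1]); y <- Hald[, 1]
> Xs <- scale(X, center = TRUE, scale = FALSE)
> Xs <- sweep(Xs, 2, sqrt(colSums(Xs^2)), "/")
> yc <- y - mean(y)
> b <- solve(crossprod(Xs), crossprod(Xs, yc))                # OLS on the scaled data
> s2 <- sum((yc - Xs %*% b)^2) / (nrow(Xs) - ncol(Xs) - 1)    # residual variance
> c(HKB = ncol(Xs) * s2 / sum(b^2), DS = s2 / sum(b^2))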
For investigating the individual coefficients in a linear but biased regression model such as RR, exact and non-exact \(t\)-type tests and the \(F\) test can be used. The exact \(t\) statistic derived by Obenchain (1977), based on the RR for a matrix \(G\) whose columns are the normalized eigenvectors of \(X' X\), is, \[\begin{aligned} \label{Texact} t^*=\frac{\hat{\beta}_{R_j}-\beta_j}{\sqrt{\hat{var}(\hat{\beta}_{R_j}-\beta_j)}}, \end{aligned} \tag{4}\] where \(j=1,2,\cdots,p\), \(\hat{var}(\hat{\beta}_{R_j}-\beta_j)\) is an unbiased estimator of the variance of the numerator in Eq. (4), and \[\beta_j=g'_i \Delta G'[I-(X' X)^{-1} e'_i (e_i(X' X)^{-1}e'_i)^{-1}]\hat{\beta}(0),\] where \(g'_i\) is the \(i\)th row of \(G\), \(\Delta\) is the (\(p\times p\)) diagonal matrix with \(i\)th diagonal element given by \(\delta_i=\frac{\lambda_i}{\lambda_i+k}\) and \(e_i\) is the \(i\)th row of the identity matrix.
It has been established that \(\hat{\beta}_R\sim N(ZX\beta,\, \phi=Z\Omega Z')\), where \(Z=(X' X+kI_p)^{-1}X'\). Therefore, for the \(j\)th ridge coefficient, \(\hat{\beta}_{R_j} \sim N(Z_j X\beta,\, \phi_{jj}=Z_j \Omega Z'_j)\) (Halawa and El-Bassiouni 2000; see also Aslam 2014). To tackle the problem of testing \(H_0:\beta_j=0\), Halawa and El-Bassiouni (2000) presented a non-exact \(t\)-type test of the form, \[t_{R_j}=\frac{\hat{\beta}_{R_j}}{\sqrt{S^2 (\hat{\beta}_{R_j})}},\] where \(\hat{\beta}_{R_j}\) is the \(j\)th element of the RE and \(S^2(\hat{\beta}_{R_j})\) is an estimate of the variance of \(\hat{\beta}_{R_j}\), given by the \(j\)th diagonal element of the matrix \(\sigma^2(X' X+kI_p)^{-1}X' X(X' X+kI_p)^{-1}\).
The statistic \(t_{R_j}\) is assumed to follow a Student's \(t\) distribution with (\(n-p\)) d.f. (Halawa and El-Bassiouni 2000). Hastie and Tibshirani (1990) and Cule and De Iorio (2012) suggested using [\(n-trace(H_{R_k})\)] d.f. For large sample sizes, the asymptotic distribution of this statistic is normal (Halawa and El-Bassiouni 2000). Thus, \(H_0\) is rejected when \(|T| > Z_{1-\frac{\alpha}{2}}\).
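A quick sketch (ours; output not shown) of how these non-exact \(t\) statistics can be reproduced from a fitted "lmridge" object, using the scaled coefficients and their Var-Cov matrix:

> library("lmridge")
> mod <- lmridge(y ~ ., data = as.data.frame(Hald), K = 0.012)
> bR <- drop(mod$coef)      # scaled ridge coefficients
> V  <- vcov(mod)[[1]]      # Var-Cov matrix for K = 0.012
> bR / sqrt(diag(V))        # non-exact t statistics (cf. the t-value (Sc) column of summary())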
Similarly, for testing the hypothesis \(H_0:\beta = \beta_0\), where \(\beta_0\) is a vector of fixed values, the \(F\) statistic for significance testing of the ORR estimator \(\hat{\beta}_R\), with \(E(\hat{\beta}_R)=ZX\beta\) and estimated \(Cov(\hat{\beta}_R)\), is distributed as \(F(DF_{ridge}, REDF)\) and is given by \[F=\frac{1}{p} (\hat{\beta}_R-ZX\beta)'\,{ \left(Cov(\hat{\beta}_R)\right)}^{-1}\,(\hat{\beta}_R-ZX\beta)\]
Our R package lmridge contains functions related to the fitting of the RR model and provides a simple way of obtaining the estimates of the RR coefficients, testing the ridge coefficients, and computing different ridge-related statistics, which prove helpful for the selection of the optimal biasing parameter \(k\). The package computes different ridge-related measures available for the selection of the biasing parameter \(k\), and it also computes the values of different biasing parameters proposed by various researchers in the literature.
The lmridge objects support a set of standard methods such as print(), summary(), plot() and predict(). Therefore, inferences can be made easily using the summary() method for assessing the estimates of the regression coefficients, their standard errors, \(t\) values and their respective \(p\) values. The default function lmridge() calls lmridgeEst() to perform the required computations and estimation for given values of the non-stochastic biasing parameter \(k\). The syntax of the default function is,
lmridge (formula, data, scaling = ("sc", "scaled", "centered"), K, ...)
The four arguments of lmridge() are described in Table 5:
Argument | Description |
---|---|
formula |
Symbolic representation for RR model of the form, response \(\sim\) predictors. |
data |
Contains the variables that have to be used in RR model. |
K |
The biasing parameter, may be a scalar or vector. If a \(K\) value is not provided, \(K=0\) will be used as the default value, i.e., the OLS results will be produced. |
scaling |
The methods for scaling the predictors. The sc option uses the default scaling of the predictors in correlation form as described in (Belsley 1991; Draper and Smith 1998); the scaled option standardizes the predictors having zero mean and unit variance; and the centered option centers the predictors. |
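For instance (an illustrative call only; the output is discussed in the examples below), the three scaling options can be requested as follows:

> library("lmridge")
> lmridge(y ~ ., data = as.data.frame(Hald), scaling = "sc", K = 0.1)
> lmridge(y ~ ., data = as.data.frame(Hald), scaling = "scaled", K = 0.1)
> lmridge(y ~ ., data = as.data.frame(Hald), scaling = "centered", K = 0.1)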
The lmridge() function returns an object of class "lmridge". The functions summary(), kest(), and rstats1() etc. are used to compute and print a summary of the RR results, the list of biasing parameters given in Table 4, and ridge-related statistics such as the estimated squared bias, \(R^2\) and variance etc., after the addition of \(k\) to the diagonal of the \(X' X\) matrix. An object of class "lmridge" is a list, the components of which are described in Table 6:
Object | Description |
---|---|
coef | A named vector of fitted ridge coefficients. |
xscale | The scales used to standardize the predictors. |
xs | The scaled matrix of predictors. |
y | The centered response variable. |
Inter | Whether an intercept is included in the model or not. |
K | The RR biasing parameter(s). |
xm | A vector of means of design matrix \(X\). |
rfit | Matrix of ridge fitted values for each biasing parameter \(k\). |
d | Singular values of the SVD of the scaled predictors. |
div | Eigenvalues of scaled regressors for each biasing parameter \(k\). |
scaling | The method of scaling used to standardized the predictors. |
call | The matched call. |
terms | The terms object used. |
Z | A matrix \((X' X+kI_p)^{-1}X'\) for each biasing parameter. |
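These components can be accessed directly from the fitted object, for example (a usage sketch; output not shown):

> library("lmridge")
> mod <- lmridge(y ~ ., data = as.data.frame(Hald), K = c(0.012, 0.1))
> mod$K          # biasing parameters used
> mod$scaling    # scaling method for the predictors
> head(mod$rfit) # ridge fitted values, one column per biasing parameter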
Table 7 lists the functions and methods available in lmridge package:
Functions | Description |
---|---|
Ridge coefficient estimation and testing | |
lmridgeEst() |
The main model fitting function for implementation of RR models in R. |
coef() |
Display de-scaled ridge coefficients. |
lmridge() |
Generic function and default method that calls lmridgeEst() and returns an object of S3 class "lmridge" with different set of methods to standard generics. It has a print method for display of ridge de-scaled coefficients. |
summary() |
Standard RR output (coefficient estimates, scaled coefficients estimates, standard errors, \(t\) values and \(p\) values); returns an object of class "summaryridge" containing the relative summary statistics and has a print method. |
Residuals, fitted values and prediction | |
predict() |
Produces predicted value(s) by evaluating lmridgeEst() in the frame newdata . |
fitted() |
Displays ridge fitted values for observed data. |
residuals() |
Displays ridge residuals values. |
press() |
Generic function that computes prediction residual error sum of squares (PRESS) for ridge coefficients. |
Methods to estimate \(k\) | |
kest() |
Displays various \(k\) (biasing parameter) values from different authors available in literature and have a print method. |
Ridge statistics | |
vcov() |
Displays associated Var-Cov matrix with matching ridge parameter \(k\) values |
hatr() |
Generic function that displays hat matrix from RR. |
infocr() |
Generic function that compute information criteria AIC and BIC. |
vif() |
Generic function that computes VIF values. |
rstats1() |
Generic function that displays different statistics of RR such as MSE, squared bias and \(R^2\) etc., and have print method. |
rstats2() |
Generic function that displays different statistics of RR such as df, m-scale and LSRM etc., and have print method. |
Ridge plots | |
plot() |
Ridge and VIF trace plot against biasing parameter \(k\). |
bias.plot() |
Bias-Variance tradeoff plot. Plot of ridge MSE, bias and variance against \(k\) |
cv.plot() |
Cross validation plots of CV and GCV against biasing parameter \(k\). |
info.plot() |
Plot of AIC and BIC against \(k\). |
isrm.plot() |
Plots ISRM and m-scale measure. |
rplots.plot() |
Miscellaneous ridge related plots such as df-trace, RSS and PRESS plots. |
The use of lmridge is explained through examples using the Hald dataset.
> library("lmridge")
> mod <- lmridge(y ~ X1 + X2 + X3 + X4, data = as.data.frame(Hald),
+ scaling = "sc", K = seq(0, 1, 0.001))
The output of the linear RR from the lmridge() function is assigned to an object mod. The first argument of the function is formula, which is used to specify the required linear RR model for the data provided as the second argument. The print method for mod, an object of class "lmridge", displays the de-scaled coefficients. The output (de-scaled coefficients) from the above command is shown only for a few selected biasing parameter values.
Call:
lmridge.default(formula = y ~ ., data = as.data.frame(Hald),
    K = seq(0, 1, 0.001))

        Intercept      X1      X2       X3       X4
K=0.01   82.67556 1.31521 0.30612 -0.12902 -0.34294
K=0.05   85.83062 1.19172 0.28850 -0.21796 -0.35423
K=0.5    89.19604 0.78822 0.27096 -0.36391 -0.28064
K=0.9    90.22732 0.65351 0.24208 -0.34769 -0.24152
K=1      90.42083 0.62855 0.23540 -0.34119 -0.23358
To get the scaled ridge coefficients, mod$coef can be used:
> mod$coef
       K=0.01    K=0.05      K=0.5      K=0.9        K=1
X1  26.800306  24.28399  16.061814  13.316802  12.808065
X2  16.500987  15.55166  14.606166  13.049400  12.689060
X3  -2.862655  -4.83610  -8.074509  -7.714626  -7.570415
X4 -19.884534 -20.53939 -16.272482 -14.004088 -13.543744
Objects of class "lmridge" contain components such as rfit
, K
and
coef
etc. For fitted ridge model, the generic method summary()
is
used to investigate the ridge coefficients. The parameter estimates of
ridge model are summarized using a matrix of 5 columns namely
estimates, estimates (Sc), StdErr (Sc), t values (Sc) and
P(>|t|) for ridge coefficients. The following results are shown only
for \(K=0.012\) which produces minimum MSE as compared to others values
specified in the argument.
> summary(mod)
Call:
lmridge.default(formula = y ~ ., data = as.data.frame(Hald), K = 0.012)

Coefficients: for Ridge parameter K= 0.012
          Estimate Estimate (Sc) StdErr (Sc) t-value (Sc) Pr(>|t|)
Intercept  83.1906     -246.5951    269.2195       -0.916   0.3837
X1          1.3046       26.5843      3.8162        6.966   0.0001 ***
X2          0.3017       16.2649      4.6337        3.510   0.0067 ***
X3         -0.1378       -3.0585      3.7655       -0.812   0.4377
X4         -0.3470      -20.1188      4.7023       -4.279   0.0021 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Ridge Summary
        R2    adj-R2  DF ridge         F       AIC       BIC
   0.96990   0.95980   3.04587 134.14893  23.24068  58.30578
Ridge minimum MSE= 390.5195 at K= 0.012
P-value for F-test ( 3.04587 , 9.779581 ) = 2.914733e-08
The summary() function also displays the ridge-related \(R^2\), adjusted \(R^2\), df, \(F\) statistic, AIC, BIC and the minimum MSE at the value of \(k\) given in lmridge().
The kest() function, which works on a fitted ridge model, computes different biasing parameters developed by researchers; see Table 4. The list of different \(k\) values (22 in number) may help in deciding the amount of bias that needs to be introduced in RR.
> kest(mod)
Ridge k from different Authors

                               k values
Thisted (1976):                 0.00581
Dwividi & Srivastava (1978):    0.00291
LW (lm.ridge)                   0.05183
LW (1976)                       0.00797
HKB (1975)                      0.01162
Kibria (2003) (AM)              0.28218
Minimum GCV at                  0.01320
Minimum CV at                   0.01320
Kibria 2003 (GM):               0.07733
Kibria 2003 (MED):              0.01718
Muniz et al. 2009 (KM2):       14.84574
Muniz et al. 2009 (KM3):        5.32606
Muniz et al. 2009 (KM4):        3.59606
Muniz et al. 2009 (KM5):        0.27808
Muniz et al. 2009 (KM6):        7.80532
Mansson et al. 2012 (KMN8):    14.98071
Mansson et al. 2012 (KMN9):     0.49624
Mansson et al. 2012 (KMN10):    6.63342
Mansson et al. 2012 (KMN11):    0.15075
Mansson et al. 2012 (KMN12):    8.06268
Dorugade et al. 2010:           0.00000
Dorugade et al. 2014:           0.00000
The rstats1() and rstats2() functions can be used to compute different statistics for the ridge biasing parameters specified in a call to lmridge(). The ridge statistics include the MSE, squared bias, \(F\) statistic, ridge variance, degrees of freedom by Hastie and Tibshirani (1990), condition number, PRESS, \(R^2\), and ISRM, etc. Following are the results from the rstats1() and rstats2() functions for some values of \(K\) (\(0,\, 0.012,\, 0.1,\, 0.2\)).
> rstats1(mod)
Ridge Regression Statistics 1:

         Variance    Bias^2       MSE rsigma2        F     R2 adj-R2        CN
K=0     3309.5049    0.0000 3309.5049  5.3182 125.4142 0.9824 0.9765 1376.8806
K=0.012   72.3245  318.1951  390.5195  4.9719 134.1489 0.9699 0.9598  164.9843
K=0.1     19.8579  428.4112  448.2692  5.8409 114.1900 0.8914 0.8552   22.9838
K=0.2     16.5720  476.8887  493.4606  7.6547  87.1322 0.8170 0.7560   12.0804

> rstats2(mod)

Ridge Regression Statistics 2:

             CK DF ridge     EP    REDF      EF   ISRM m scale    PRESS
K= 0     6.0000   4.0000 4.0000  9.0000  0.0000 3.9872  0.0000 110.3470
K= 0.012 4.8713   3.0459 3.2204  9.7796 10.1578 3.6181  0.9541  92.8977
K= 0.1   4.2246   2.5646 2.9046 10.0954  7.6829 2.8471  1.4354 121.2892
K= 0.2   3.8630   2.2960 2.7290 10.2710  6.9156 2.5742  1.7040 162.2832
The residuals, fitted values from the RR and the predicted values of the response variable \(y\) can be computed using the functions residuals(), fitted() and predict(), respectively. To obtain the Var-Cov matrix, VIF and hat matrix, the functions vcov(), vif() and hatr() can be used. The df are computed following Hastie and Tibshirani (1990). The results for the VIF, Var-Cov and the diagonal elements of the hat matrix from the vif(), vcov() and hatr() functions are given below for \(K=0.012\).
> hatr(mod)
> hatr(mod)[[2]]
> diag(hatr(mod)[[2]])
> diag(hatr(lmridge(y ~ ., as.data.frame(Hald), K = c(0, 0.012)))[[2]])
1 2 3 4 5 6 7 8 9 10 11
0.39680 0.21288 0.10286 0.16679 0.24914 0.04015 0.28424 0.30163 0.12502 0.58426 0.29625
12 13
0.12291 0.16294
> vif(mod)
              X1        X2       X3        X4
k=0     38.49621 254.42317 46.86839 282.51286
k=0.012  2.92917   4.31848  2.85177   4.44723
k=0.1    1.28390   0.51576  1.20410   0.39603
k=0.2    0.78682   0.34530  0.75196   0.28085
> vcov(mod)
$`K=0.012`
          X1        X2        X3        X4
X1 14.563539  1.668783 11.577483  4.130232
X2  1.668783 21.471027  3.066958 19.075274
X3 11.577483  3.066958 14.178720  4.598000
X4  4.130232 19.075274  4.598000 22.111196
Following are possible uses of some functions to compute different ridge-related statistics. For a detailed description of these functions/commands, see the lmridge package documentation.
> mod$rfit
> resid(mod)
> fitted(mod)
> infocr(mod)
> press(mod)
For given values of \(X\), such as the first five rows of the \(X\) matrix, the predicted values for some \(K\) (\(0,\, 0.012,\, 0.1\), and \(0.2\)) are computed by predict():
> predict(mod, newdata = as.data.frame(Hald[1 : 5, -1]))
        K=0   K=0.012     K=0.1     K=0.2
1  78.49535  78.52225  79.75110  80.73843
2  72.78893  73.13500  74.32678  75.38191
3 105.97107 106.39639 106.04958 105.62451
4  89.32720  89.48443  89.52343  89.65432
5  95.64939  95.73595  96.56710  96.99781
The model selection criteria AIC and BIC can be computed using the infocr() function for each value of \(K\) used in the argument of lmridge(). For some \(K\) (\(0, 0.012, 0.1\), and \(0.2\)), the AIC and BIC values are:
> infocr(mod)
             AIC      BIC
K=0     24.94429 60.54843
K=0.012 23.24068 58.30578
K=0.1   24.78545 59.57865
K=0.2   27.98813 62.62961
The effect of multicollinearity on the coefficient estimates can be identified by using different graphical displays, such as the ridge, VIF and df traces, plotting RSS against df, PRESS against \(k\), and plotting the bias, variance, and MSE against \(k\), etc. Therefore, for the selection of an optimal \(k\) using subjective (judgmental) methods, different plot functions are also available in the lmridge package. For example, the ridge trace (Figure 1) or VIF trace (Figure 2) can be plotted using the plot() function. The arguments to the plot function are abline = TRUE and type = c("ridge", "vif"). By default, the ridge trace will be plotted with a horizontal line parallel to the horizontal axis at \(y=0\) and a vertical line on the \(x\)-axis at the value of \(k\) having minimum GCV.
> mod <- lmridge(y ~ ., data = as.data.frame(Hald), K = seq(0, 0.5, 0.001))
> plot(mod)
> plot(mod, type = "vif", abline = FALSE)
> plot(mod, type = "ridge", abline = TRUE)
> bias.plot(mod, abline = TRUE)
> info.plot(mod, abline = TRUE)
> cv.plot(mod, abline = TRUE)
The vertical lines in the ridge trace and VIF trace suggest the optimal value of the biasing parameter \(k\), selected where the GCV is minimum. The horizontal line in the ridge trace is a reference line at \(y=0\) for the ridge coefficients.
The bias-variance tradeoff plot (Figure 3), produced by the bias.plot() function, may be used to select an optimal \(k\). The vertical line in the bias-variance tradeoff plot shows the value of the biasing parameter \(k\), and the horizontal line shows the minimum MSE for ridge.
To plot the model selection criteria AIC and BIC for choosing an optimal \(k\) (Figure 4), the info.plot() function may be used.
The function cv.plot() plots the CV and GCV cross-validation criteria against the biasing parameter \(k\) for the optimal selection of \(k\) (see Figure 5). The m-scale and ISRM (Figure 6) measures by Vinod (1976) can also be plotted using the isrm.plot() function and can be used to judge the optimal value of \(k\), that is,

> isrm.plot(mod)
The function rplots.plot() plots a panel of three plots, namely (i) the df trace, (ii) RSS vs \(k\) and (iii) PRESS vs \(k\), and may be used to judge the optimal value of \(k\) (see Figure 7).
> rplots.plot(mod)
Our developed lmridge package provides the most complete suite of tools for RR available in R, comparable to those listed in Table 1. We have implemented functions to compute the ridge coefficients, to test these coefficients, to compute different ridge-related statistics, and to compute the biasing parameter using different existing methods proposed by various authors (see Table 4).
We have also greatly extended the ridge-related statistics and the graphical methods for the selection of the biasing parameter \(k\) available through the lmridge package in R.
Up to now, a complete suite of tools for RR has not been available in either open-source or commercial statistical software packages, resulting in reduced awareness and use of the ridge-related statistics developed in the literature. The lmridge package provides a complete open-source suite of tools for the estimation and testing of ridge coefficients and for the computation of different ridge-related statistics. We believe the availability of these tools will lead to increased utilization and better ridge-related practices.