The welchADF Package for Robust Hypothesis Testing in Unbalanced Multivariate Mixed Models with Heteroscedastic and Non-normal Data

A new R package is presented for dealing with non-normality and variance heterogeneity of sample data when conducting hypothesis tests of main effects and interactions in mixed models. The proposal departs from an existing SAS program which implements Johansen’s general formulation of Welch-James’s statistic with approximate degrees of freedom, which makes it suitable for testing any linear hypothesis concerning cell means in univariate and multivariate mixed model designs when the data pose non-normality and non-homogeneous variance. Improved type I error rate control is obtained using bootstrapping for calculating an empirical critical value, whereas robustness against non-normality is achieved through trimmed means and Winsorized variances. A wrapper function eases the application of the test in common situations, such as performing omnibus tests on all effects and interactions, pairwise contrasts, and tetrad contrasts of two-way interactions. The package is demonstrated in several problems including unbalanced univariate and multivariate designs.


Introduction
The problem of testing for mean equality between several groups can be accomplished using classical techniques such as Student's t test, when only two groups are compared, or ANOVA when more than two groups are involved. Both of them have been widely applied in the past in a number of areas ranging from ecology and biology to psychology, social sciences and medicine (Levin, 1997;Coates and McKenzie-Mohr, 2010), although their use tends to decrease for the reasons mentioned next.
In order for these approaches to work well, the data must satisfy three conditions, namely independence, normality and homoscedasticity of the errors. While ANOVA is known to be robust to small departures from normality, homogeneity of population variances is crucial as concluded by simulation studies in which these methods have been found to exhibit type I error rates oscillating from too conservative to extremely liberal, specially in unbalanced designs leading to very heterogeneous cell variances. The behaviour depends on the relation between the variance of the cell with the fewest observations and the number of observations contained in it (Milligan et al., 1987). Indeed, data from a number of experiments conducted in the aforementioned research fields often exhibit non-homogeneous variances. This is not a problem for one-way designs since the built-in function oneway.test in package stats is able to account for different variances. Unfortunately, real-world studies usually require more complex designs, like the typical mixed between x within-subjects designs for clinical trials involving either animals or persons. For both reasons, the application of simple ANOVA in serious analyses of experimental data is nowadays not very common.
A distinction should be made here on what homogeneity of variances means depending on the design being considered (Lix and Keselman, 1995). In univariate settings with between-subjects factors only, all the cell variances should be equal, while in multivariate settings, it refers to the equality of the population covariance matrices across all cells. In mixed designs with at least one between-subjects factor, a set of orthonormalized contrasts on the repeated measures must have common covariance matrices (sphericity assumption), and all those matrices must be equal across all cells of the betweensubject factors. When both conditions are met, it is said that multisample sphericity holds (Huynh, 1978).
A number of alternatives have been proposed to overcome the parametric restrictions (Higgins, 2003). Traditionally the most widely used choices have been nonparametric tests such as Mann-Whitney-Wilcoxon rank sum test and Kruskal test for two groups, or Wilcoxon signed-rank test and Friedman test (for which paired versions exist) for more than two groups. However, they cannot handle multiple factors or interactions. Generalized Linear Models can deal with multiple factors and interactions with non-normal data, but require specifying the link function and are unable to handle repeated measures. Since our study focuses on the most general models, i.e. those with betweensubjects and within-subjects interactions, the aforementioned tests are not useful. The generalization of Mixed Models, namely Generalized Linear Mixed Models (Bolker et al., 2009), do constitute a valid alternative. They present some drawbacks, though, such as being complex to apply and interpret, not very widely available, and requiring a particular treatment for each problem since a suitable link function must be supplied in each case, which is not always possible.
Two surveys on nonparametric techniques in experimental design can be found in Sawilowsky (1990); Salazar-Alvarez et al. (2014). In these works, several rank transformation variants are emphasized, as they constitute the most widely used nonparametric approach for detecting interactions Conover (2012). Among them, the Aligned Rank Transform (Higgins and Tashtoush, 1994), for which an implementation in R has been made available in package ART (Villacorta, 2015), is one of the best performing for this task, keeping type I error rates close to the theoretical significance level while preserving good power. Although it has been applied to a split-plot design (one between-and one within-subjects factor) in Beasley (2002), showing good type I error rates and power, it lacks a general unified formulation for mixed models with any number of between-and within-subjects factors that also works in unbalanced and multivariate settings. Erce-Hurn and Mirosevich (2008); Ruscio and Roche (2012) constitute two more broad surveys (the latter dealing with variance heterogeneity in depth) covering both classical nonparametric methods and recent research efforts like the one implemented here, which is cited in both works.
Some other valid alternatives for nonparametric analysis of any mixed model are the Improved General Approximation (IGA), the generalization of Welch-James (WJ) test statistic, the Kenward-Roger correction with mixed models (KR), and the modified Brown and Forsythe (MBF) procedure. They all do a correction for the degrees of freedom to account for heterogeneous variances, hence the name ADF for approximate degrees of freedom. The IGA (Huynh, 1978) was specifically developed to account for multisample sphericity violations in repeated measures designs by adjusting the critical value, and was generalized to any mixed model by Algina (1997). Similarly, Welch's non-pooled statistic with approximate degrees of freedom (ADF) (Welch, 1951) was also conceived for this purpose, using the sample data to estimate the error degrees of freedom. Several non-pooled ADF statistics have been proposed later but all can be derived from the general matrix formulation of Welch's statistic given by Lix and Keselman (1995) based on Johansen (1980), which makes it applicable for univariate and multivariate mixed models with an arbitrary number of effects. Lix and Keselman (1995) shows how the same statistic can be employed in different models for both omnibus contrasts (testing whether a given effect is significant or not) and pairwise comparisons for a given effect (for every pair of categories of a given effect, testing whether the response is significantly different for one of the categories against one another). Generally, the IGA and WJ statistic perform similarly; however a slight advantage favorable to WJ has been reported by Algina (1997) in some contexts. The WJ ADF approach has been successfully tested in nonparametric analysis of a variety of mixed models; see Keselman et al. (2003) and references therein. The KR correction for the degrees of freedom can be implemented on top of a conventional mixed model and performs similarly to the MBF procedure with slight advantage for the latter (Vallejo and Livacic-Rojas, 2005). Both yield reasonably good results. With respect to the comparison between the MBF and the generalized WJ statistic, there is no consensus about the results. In most conditions they perform similarly, but some authors state that MBF is better at detecting interaction effects when the number of subjects is not high enough (Vallejo et al., 2001(Vallejo et al., , 2006. However, to the best of our knowledge, there is no general formulation of MBF for any mixed model with an arbitrary number of between-x within-subjects factors, although it has been tested in split-plot designs (Vallejo et al., 2006) (which are probably the most common design in medicine and particularly in psychological studies) and factorial designs (Vallejo et al., 2008).
Software available for robust testing of mixed models not meeting parametric assumptions Wilcox (2012) constitutes an important source dealing with robust estimation. The book is accompanied by an R package called WRS 1 that implements all the methods reviewed in the book, including the Welch-James test following Johansen's approach with robust mean estimators described in sections 7.2, 8.6 and 8.7 which our package welchADF also implements. The functions described in those sections, bwtrim,t1way,t1waybt,t2way,t2way,t3way, include the most common designs such as one, two and three-way and split-plot (one between x one within subjects designs) also with bootstrapping and trimming. Although not included in WRS2, the original WRS exposes functions bbw-,bww-for between x between x within, and between x within x within subjects designs, and their trimmed and bootstrap versions. While useful, they do not provide a uniform, easy-to-use interface in a single function valid for any mixed design. There exists a CRAN task force on robust statistical methods 2 . Unfortunately, none of the packages mentioned there implements the aforementioned approaches. Nevertheless, packages robustbase , robust (Wang et al., 2017) (by the authors of Maronna et al. (2006)) and function rlmer in package robustlmm (Koller, 2017) are some remarkable efforts. The latter implements the techniques proposed in Koller (2013) for linear mixed models on a basis of a parametric model having some contaminated data (Koller, 2016). However, it does not focus on testing inherently heterogeneous and non-normal data. Package nlme (Pinheiro et al., 2017) provides a way of capturing and modeling variance heterogeneity in mixed models through the argument weights of function lme, which can be set to different covariance matrix structures that are fitted from the data. The well-known package lme4 (Bates et al., 2016(Bates et al., , 2015 is also a good choice for dealing with non-normal models in presence of within-subjects effects (called generalized linear mixed models as an extension to generalized linear models where the user can specify the probabilistic model to be used). Package glmmADMB 3 (Skaug et al., 2016;Fournier et al., 2012) has a similar purpose.
SAS (SAS Institute, 2011) implementations of the WJ statistic, MBF statistic (split-plot and factorial designs only) and IGA do exist; see Keselman et al. (2003); Vallejo et al. (2006); Algina (1997) respectively. While SAS is still widely used mainly in social and biomedical sciences, it is proprietary software. The same applies to the ERP-PCA software (Dien, 2010) written in Matlab. Interestingly, ERP-PCA incorporates a Matlab translation of the SAS code described in Keselman et al. (2003) for the WJ statistic. For these reasons, together with the fast expansion of R and open-source statistical software in general -closely related to the growing interest in reproducible research-among researchers of many different disciplines, we consider the R package introduced here a useful effort.

Main contributions
The present work describes an R package called welchADF (Villacorta, 2017) that implements Johansen's formulation of the Welch-James test, with two additional improvements: first, the use of trimmed means and Winsorized variances to deal with non-normality, and second, the use of bootstrap for calculating an empirical critical value for achieving better type I error control. Both aspects are mentioned in Wilcox et al. (1998) and implemented in Keselman et al. (2003). The core of our code is an R translation of the SAS program 4 described in Keselman et al. (2003). A new wrapper function has been built on top of it that poses the following benefits: • It can be applied to univariate and multivariate mixed models with an arbitrary number of within-and between-subjects effects.
• It simplifies some common tasks such as performing omnibus tests on effects or interactions, multiple pairwise comparisons on the levels of one factor, and tetrad contrasts. All of these can be done without indicating the contrast matrices, which are automatically formed by the program depending on the kind of test required and the number of levels found in each factor.
• It provides a more natural and uniform data input mechanism through data frames that do not depend on the model specified. In the original SAS code, the input data had to be carefully arranged in matrices whose shape had to mirror the experimental design being analyzed in each problem, which can be error-prone.
• It integrates with other similar packages of the R ecosystem through a formula interface and also provides additional interfaces that accept model objects returned by some commonly used functions such as stats::lm, lme4::lmer and stats::aov.
• It enables selecting one among several built-in p-value correction methods when performing multiple pairwise comparisons.
There are several reasons that justify an R implementation of this particular test: • The generalized WJ test described in Lix and Keselman (1995); Keselman et al. (2003) has good theoretical properties and has proven successful in controlling type I error rate while preserving high power. Moreover, the use of trimmed means and Winsorized variances can protect both against skewness and outliers in the data, as noted by Keselman et al. (2008). The percentage of trimming can be adjusted to deal with higher ratios of outliers and skewness. The statistic is also able to cope with heterogeneous variances, which commonly (but not only) arise when having very different cell sample sizes. In this sense, one should notice the sample size requirement stated right after this list.
• These two works have received a number of citations, and the approach explained there is being used in current research in different fields such as medicine (Dien et al., 2008;Dien, 2010), psychology (Müller et al., 2011;Kayser et al., 2014;Huang and Jun, 2015) and behavioral research (Symes et al., 2010), just to cite a few.
• The function interface is simple and the test can be used in a straightforward way for the most common tasks. This may contribute positively towards its adoption by the research community, specially by researchers with little expertise in statistics, but with an understanding of the importance of applying suitable, robust techniques when parametric conditions are not met.
• Despite the existence of different alternatives explained before, some of them also implemented in R, these are either not applicable to multivariate mixed models with heterogeneous variance or non-normal data, or are generally complex and more difficult to use.
The WJ approach also has some disadvantages. The first one is the sample size needed to assure an effective control of type I error under some (somewhat extreme) circumstances, specially in repeatedmeasures designs, like when the cell with the fewest subjects presents the largest variance (i.e. cell size and cell variance are negatively paired). In general, the number of subjects of the smallest cell should be four or five times greater than the number of repeated measures minus one, and sometimes even more when testing an interaction. However, when combined with trimmed means, robustness is increased and some of these problems are mitigated as a much smaller number of subjects are required (Keselman et al., 2000). The second drawback is that welchADF is only applicable to categorical predictors. In case the design has numeric predictors, the reader may try the packages mentioned in the preceding section, as well as gamm4 (Wood and Scheipl, 2017) and mgcv (Wood, 2017) for fitting generalized additive mixed models.
Finally, the application of bootstrap to contaminated data has been extensively studied and even questioned by some authors in the past (Singh, 1998), specially regarding numerical instability: some bootstrap samples that intervene in the computation of the final bootstrapped estimate may contain a higher proportion of outliers than the general dataset, and therefore be too heavily influenced by them (Salibián-Barrera et al., 2008). Singh (1998) proposed using bootstrapping with Winsorized observations, which is what we do in this package when enabling trimming and bootstrapping at the same time, as it provides some additional benefits. Salibián-Barrera et al. (2008) introduce a fast and robust bootstrap (FRB) method that improves classical bootstrapping. Although it has not been incorporated to welchADF, it may be done in the future.
The remainder of this contribution is structured as follows. In first place, the mathematical background is briefly reviewed. In second place, we present the function exposed by the package and explain its arguments, together with some issues regarding the arrangement of the data. In third place, we address three case studies, namely a univariate one-way between-subjects design, a two-way factorial design, a mixed design, and a doubly multivariate design analyzed as a multivariate mixed design. Finally, we present some conclusions and further work.

The Welch-James ADF test statistic
Here we summarize the theoretical background given in Lix and Keselman (1995); Keselman et al. (2003). Following the General Linear Model, where Y is an N × p matrix of observations on p dependent variables or p repeated measures, N is the sample size, X is the design matrix (formed by zeros and ones, such that rank(X) = r with r being the number of different groups or cells 5 ), β is an r × p matrix of non random (unknown) parameters (population means) and ξ is an N × p matrix of random error components. If we denote by Y j (j = 1, ..., r) the submatrix of Y that contains the observations of the n j subjects of the j-th cell, the original parametric model assumes We proceed to explain how population means and variances are estimated. The matrix of population means can be estimated by the usual least-squares approach (Eq. 2) or by using robust estimation techniques such as trimming, discussed later. Now, let X j (j = 1, ..., r) be the j-th column of X, composed of zeros and ones, and let 1 p be a p × 1 vector of ones. Define Y j = Y • (X j 1 T p ) as the N × p matrix that results of the Hadamard (i.e. element-wise) product between matrices Y and X j 1 T p . Then, the following expressions are used to estimate population means and variances: The matrix formulation of the WJ statistic always tests the following general linear hypothesis: with R = C ⊗ U T . In this expression, ⊗ is the Kronecker product, C is a d f C × r matrix that indicates the contrasts on the between-subjects effects, so that rank(C) = d f C ≤ r, and U is a p × d f U matrix that indicates the contrasts on the within-subjects effects, so that rank(U) = d f U ≤ p. Therefore R has d f C d f U rows and rp columns. Note that µ = vec(β T ) = (β 1 , ...β r ) T , which is a column vector with rp elements obtained by stacking the columns of β T , one on top of another. Matrices C and U determine the type of contrast being performed, be it an omnibus contrast, a pairwise contrast, etc. We provide details about the structure of both matrices in the general case in Section 2.2.1.
The generalized Welch-James test statistic presented by Johansen in Johansen (1980) is whereμ is an estimate of µ (either with LS or any other technique), andΣ is a block diagonal matrix whose blocks areΣ j /n j . It is known that T WJ /c approximately follows an F(ν 1 ; ν 2 ) where where tr is the trace of an square matrix (sum of the elements on the main diagonal), and Q j is an rp × rp matrix associated with X j in which the (s, t)-th diagonal block of Q j = I p when s = t = j and is 0 otherwise.
In a between-subjects design (no within-subjects factors), U must be set to I p where p is the number of dependent variables (in univariate designs, it reduces to U = 1). In a within-subjects design (no between-subjects factors), C must be set to 1 both in the univariate and multivariate cases.

Structure of the contrast matrices
The preceding formulation of the model is valid for any type of contrasts. Most often the user may want to perform two types of contrasts, namely omnibus contrasts to check whether a given effect or interaction is statistically significant, and in case it is, post-hoc pairwise contrasts on one effect or interaction to check whether the response associated to some of the levels of that factor or interaction is statistically different than the responses associated to other levels of the factor.
Omnibus contrasts This test is aimed at checking whether the level adopted by a given variable of interest (effect or interaction) has an influence over the response variable. In the simplest case, consider a one-way design, either univariate or multivariate, whose single factor A has a different levels. Then C would be an (a − 1) × a matrix specifying linearly independent contrasts between the levels. We will call this matrix C A (capital A) because it is the matrix we use to conduct an omnibus test on effect A. If for instance, a = 3, we would have Algina and Olejnik (1984) provide a general formulation to compose matrix C for omnibus tests on factorial designs with several between-subjects factors. The same idea, slightly modified, is also valid to compose matrix U in designs with several within-subjects factors. These procedures have been implemented in our welchADF package. Let C a (non-capital a) be the (a − 1) × a matrix of linear contrasts associated to the a levels of effect A. Its rows define a − 1 linearly independent contrasts between the levels of A. When the design has no more factors than A, then C A = C a as in the case above. However, when more than one between-subjects factor exists, then C has to be properly set according to the factor to which we want to apply an omnibus test. In those cases, C is the Kronecker product of one matrix (or vector) per factor existing in the design, as follows.
Assume a between-subjects design with four effects A, B, C, D. Let 1 T a denote a (1 × a) vector of ones. If we want to perform an omnibus contrast on a main effect, say B, then In other words, if the factor matches the effect being tested, the corresponding contrast matrix appears in the product; otherwise, a vector of ones appears. The same applies to interactions. If we want to test the B × D interaction, for instance, we would have Since factors B and D are involved in the interaction BD being tested, their contrast matrices appear in the product, while a vector of ones appears in the positions of the remaining factors of the design.
For within-subjects designs a similar rule applies. Assume that now A is a within-subjects factor, and let U a = C T a , so that the columns of U a define a − 1 linearly independent contrasts between the levels of A. In a within-subjects design with several factors, the transposes of the U contrast matrices of those factors involved in the effect or interaction being tested appear in the Kronecker product; otherwise, a row vector of ones is used in that place as explained before. If the design is multivariate with p dependent response variables, an additional factor I p always appears in the last place of the product. At the end, the result of the Kronecker product must be transposed again to obtain U. For example, in a two-factor within-subjects multivariate design, the U matrices for conducting ominbus tests on effects A, B and the interaction AB would be, respectively, In case we want to test a main effect (either a between-or a within-subjects effect) in a design containing both between-and within-subjects factors, the same rules apply: C and U are composed separately, and one of them will (for sure) be the result of the Kronecker product of vectors of ones only (including I p as well when constructing U if the design is multivariate). Finally, if we want to test an interaction involving one or more between-subjects factors and one or more within-subjects, C must be formed as if we were testing only the between-subjects factors involved in the mixed interaction, and U as if testing the within-subjects factors, following the rules explained above.
Pairwise contrasts Now for a given effect, we are aimed at testing for every pair of categories of the effect whether the response is significantly different for one of the categories against one another. The procedure is similar to the omnibus contrasts. The only difference is that contrast matrices C a and U a associated to an effect A are replaced by contrast vectors, as follows. When testing for significant differences between factor levels j and j of an effect A, either C a is replaced by a row vector c jj if A is a between-subjects factor, or U a is replaced by a column vectors u jj if A is a within-subjects factor. In a pairwise contrast vector, all positions are set to 0 except for those corresponding to the factor levels j and j being tested, which are set to 1 and -1 respectively. These vectors are then used in the corresponding positions of the Kronecker products described before. For probing interactions (once the omnibus test on such interaction proved significant), tetrad contrasts have been implemented in our package. The null hypothesis being tested in a tetrad contrast involving two factors A and B, from which the interaction between levels j and j from A, and k and k from B is being tested, can be written as

Trimmed means and Winsorized variances
Trimmed means help mitigate the effects of non-normality. When least-squares means are substituted by trimmed means, the null hypotheses being tested are the equality of population trimmed means: j be the sorted observations of the j-th group, and g j = γn j with γ being the proportion of observations to be trimmed in each tail of the distribution. Therefore the sample size for the j-th group becomes h j = n j − 2g j , and its sample trimmed mean is computed by averaging the h j central observations of that group: Some authors suggest using 20 % trimming.
The sample Winsorized mean is a similar measure that is computed by replacing all observations smaller than Y (g j +1)j (i.e. the 20th percentile) by that value, and those larger than Y (n j −g j )j (i.e. the 80th percentile) by that value, and then averaging over all the (modified) observations. In the j-th group: This measure is required to compute the sample Winsorized variance: The R Journal Vol. 9/2, December 2017 ISSN 2073-4859 Therefore, in order to compute the trimmed version of the WJ statistic, T W J , trimmed means replace least-squares means, Winsorized variances replace least-squares variances, and the new sample sizes h j replace the original n j in all groups. Past studies have found the trimmed version of the WJ statistic to be more robust and provide better type I error control to non-normality, also in more complex designs; see Keselman et al. (2000) and references therein.

Bootstrapping to obtain an empirical critical value
After computing the cell trimmed means, let C ij = Y ij −μ W J /c computed on the original data. The null hypothesis Rµ (t) = 0 of trimmed means equality will be rejected if T The process explained before applies to omnibus contrasts. For focused contrasts such as pairwise tests on marginal means, the same idea with minor modifications is used. For more details as well as a generalization to other designs, see Keselman et al. (2003) and references therein.

Effect size and confidence intervals
As stated in Keselman et al. (2008); APA (2013), it is now widely recommended to report an estimate of the effect size when performing a hypothesis test. Many different measures exist for this purpose, although few of them are valid for non-homogeneous variances. The approach implemented in our package was proposed in Keselman et al. (2008) and has the following formulation for the case of two groups:δ Note this approach uses trimmed means and Winsorized standard deviation and for that reason, it is robust to non-normality. Factor η stands for a scaling factor of the Winsorized standard deviation. In case 20 % trimming is used (as recommended), η = .642 which is the Winsorized standard deviation for a 20% trimmed standard normal distribution. In our code, η is computed according to the percentage of trimming indicated by the user. For building a robust CI around this value, a percentile bootstrap method is run to determine the empirical bounds of the interval as recommended in Keselman et al. (2008).

An R implementation of the WJ statistic
The WJ test is implemented in our package as an S3 generic called welchADF.test. The name has been chosen to be compliant with other existing tests such as t.test,wilcox.test, etc. The function receives parameters to modulate its behaviour, such as the type of contrast to be performed (omnibus or pairwise), whether trimming should be employed or not, and if employed, the percentage of data to be trimmed at each side, and whether bootstrapping should be used or not. The default S3 method expects a data.frame in the formula parameter, but additional S3 methods are provided for classes formula, lm, lmer and aov, which allow our package to integrate well with other linear models functions, as described later in this section.
The prototype of the S3 default method is welchADF.test(formula, response, between.s, within.s = NULL, subject = NULL, contrast = c("omnibus", "all.pairwise"), effect = NULL, correction = c("hochberg", "holm"), trimming = FALSE, per = 0.2, bootstrap = FALSE, numsim_b = 999, effect.size = FALSE, numsim_es = 999, scaling = TRUE, standardize.effsz = TRUE, alpha = 0.05, seed = 0, ...) We summarize below the meaning of the arguments; the reader may refer to the package documentation for further detail. Note only the three first arguments are required. Table 1: An example dataset with two between-subjects factors A, B, with a and b different levels, respectively; two within-subjects factors X, W with x and w different levels, and a multivariate response with p = 2 correlated response columns Y 1 , Y 2 . The total number of subjects is N.
• formula is a data frame object containing the observations and the level combination to which they correspond. The next four arguments refer to the column names.
• response,between.s,within.s,subject are strings (or string vectors) indicating the column names for the response, the between-subjects effect(s), the within-subject(s) effects, and the subject column that stores which subject corresponds to each row (hence it cannot be a vector but a single string). In case the design is multivariate, the response will be a vector of columns, one for each response variable. A sample data frame is displayed in Table 1. Each cell A i B j has n A i B j subjects, and n A i B j · x· w rows in the data frame. Here, N = ∑ i,j n A i B j subjects. • contrast refers to the type of contrast to be performed. Both in "omnibus" and "all.pairwise" contrasts, the corresponding contrasts matrices are automatically computed as described in Section 2.2.1.
• effect is the effect (i.e. column name) involved in the selected contrast. If effect is a vector with length 2 or greater and contrast = "omnibus", then an omnibus contrast on an interaction effect will be tested involving simultaneously all the effects of the vector. If contrast = "all.pairwise", then effect must have length 1 or 2 to indicate a single effect or a two-way interaction to which tetrad contrasts will be applied; otherwise an error will be thrown. If left blank, the contrast will be applied separately to all of the existing effects and their interactions.
• The rest of arguments specify whether trimmed means and Winsorized variances will be used and the percentage of trimming (use.robust.estimators, per), whether bootstrapping should be used to compute an empirical critical value and how many iterations to do (use.bootstrap, numsim_b), and whether the effect size and a confidence interval should be computed (again via bootstrap). Effect size allows the choice of using scaling (scaling = TRUE) or not (if not, η = 1 in Eq. 10) and the number of effect-size bootstrap simulations (effect.size, scaling, numsim_es, loc1, loc2).
The aforementioned function is an R wrapper that configures the parameters needed for each type of problem by two private functions, which lie at the core of our package. Both are R translations of the SAS functions wjglm and bootcom and have been named almost the same. Function wjglm is used in all cases (including those in which bootstrapping is needed), except when bootstrap is applied to obtain an empirical critical value for a family of contrasts. A modification of function bootcom is only invoked in that case in order to control family-wise type I error rate (FWER) via percentile bootstrapping. This scenario arises when performing all pairwise contrasts or tetrad contrasts via bootstrap (i.e. contrast = "all.pairwise",bootstrap = TRUE).
The prototype of the S3 method for class formula is as usual: welchADF.test(formula,data,subset,...) where data is a data frame following the same rules as described above for the formula parameter of the default method, subset is an indexing vector to indicate which rows of data should be used The R Journal Vol. 9/2, December 2017 ISSN 2073-4859 (all by default), and ... stands for the rest of arguments accepted by welchADF.test.default and described above to configure the behavior of the test. As with other models, the terms in formula are first sought in data and then in the environment of the formula. Note, however, that only between-subjects and within-subjects effects and interactions can be specified together with a Subject column when conducting a WJ test, but no model is fit to the data. For this reason, formula should be understood only as a way to indicate the factors involved and their nature (between-or within-subjects) but not as a description of a particular model structure. The presence or absence of an interaction in the formula only affects which effects are tested when contrast = "omnibus",effect = NULL; otherwise it does not affect at all. The structure should mirror that of the lme4 package, e.g.
welchADF.test(cbind(visits,time,latency)~nurs*tunnel + (tunnel|Subject), miceData) means that there is a multivariate response composed of three correlated variables visits, time, latency, and the design has one between-subjects factor nurs (because it appears outside but not inside the parenthesis term) and one within-subjects factor tunnel because it appears inside the parenthesis. While a within-subjects effect may appear outside the parenthesis to indicate an interaction with a between-subjects effect, between-subjects must not appear inside the parenthesis.
The function returns an object of class welchADFt, which is actually a tagged list of lists, one sub-list per effect in an omnibus contrast, or per category involved a pairwise contrast of a given effect. The call is also stored as the last element of the upper-level list with the name call, no matter the S3 method employed (be it welchADF.test.default, or the ones for class formula,lm,aov or lmer). This allows to implement S3 method update for class welchADFt, no matter which S3 method was called to calculate the model object 6 . Each sub-list has elements named welch.T,numeratorDF,denominatorDF,contrast.matrix, mean.vector,sigma.matrix which store, respectively, the value of the T W J /c statistic (or T

(t)
W J /c if the trimmed version was used), the approximate degrees of freedom of the numerator and denominator, the contrast matrix R obtained as R = C ⊗ U T , and the estimatesμ andΣ. It also stores the user arguments when the function was called and, in case the user asked for the effect size, it provides the effect size along with a confidence interval. Refer to the package documentation for further detail.
The package implements S3 methods summary, format and print for objects of class welchADFt, as well as other methods widely used on model objects such as confint to get confidence intervals of the effect size (in case the user requested to compute it), model.frame to extract the input data frame, and formula to extract the formula (not available if the object was generated by welchADF.test.default).

Case studies
All the datasets analyzed in this section were mentioned in examples designed by the authors of the SAS implementation (Lix and Keselman, 1995), and have also been included in our R package. For that reason it is not necessary to explicitly read them from text files. They are described in detail in the package documentation.

Univariate one-way between-subject design
The dataset was artificially created by Lix et al. and can be downloaded from her personal website 7 . The data recreate those reported by a real study on perception and concentration, on which 42 students were given several puzzles to be solved. The students are divided into three balanced groups as they had previously been asked to imagine solving puzzles in the distant future, near future, or not to imagine anything at all (control group). The response variable represents the number of puzzles each student was able to solve, out of 12. The data are delivered in our package in a variable named perceptionData.
This design presents one between-subjects variable, namely the Group to which the student belongs, and no within-subjects variables as each student is measured on only one response variable and then the student is never measured again. The following R commands demonstrate the types of analyses that can be done with this dataset. The results can be checked on the PDF file of the footnote.
> str(perceptionData) data.frame : 42 obs. of 2 variables: \$ Group: Factor w/ 3 levels "control","distantFuture",..: 2 2 2 2 2 2 2 2 2 2 ... \$ y : int 7 5 8 9 8 8 7 7 6 2 ... > omnibus_LSM <-welchADF.test(perceptionData, response = "y", between.s = "Group") 6 The update should be done in accordance with the function that generated the welchADFt object, i.e. passing a formula is not allowed if the object was generated by the default method, and vice-versa. The output shows that the responses associated to the control group significantly differs from those associated to the distantFuture group. The confidence intervals on the effect size can be retrieved as > confint(pairwise_trimmed) 2.5 % 97.5 % control:nearFuture -0.8208 0.85290 distantFuture:nearFuture -0.3402 1.74365 control:distantFuture -2.2571 -0.09678 An important issue arises in this example that justifies again the use of trimmed estimators. As can be seen in the omnibus tests, the Group is not significant with Least-squares means but it is when we use trimmed means and Winsorized variances. This yields a significant result which could not be detected unless trimming is applied. As the result of the omnibus test with trimmed means is significant, we proceed to the pairwise comparisons using trimming as well. This yields that control and distantFuture have associated significantly different values of the number of puzzles solved by the students on average.

Two-way factorial (between-subjects) design
Once again, this dataset 8 was artificially created by Lix et al. Quoting from the PDF, the author used summary data presented by Wicherts et al. (2005). These authors examined the effects of stereotype threat on women's mathematics ability. Study participants were assigned to one of six groups defined by crossing the independent factors of test condition (control, nullified, stereotype threat) and sex (male, female). Originally there were four different tests administered to study participants (arithmetic, number series, word problems, and sums tests) the dataset contains only scores for the arithmetic test (out of 40) because these scores exhibited a greater magnitude of variance heterogeneity than scores for the other tests. It is an unbalanced design with cell sizes ranging from 45 to 50 participants, and a total sample size of 283. The data are delivered in our package in a variable named womenStereotypeData.
The output of the omnibus tests using robust estimators (trimmed means and Winsorized variances) with and without bootstrapping is shown in first place. Since the interaction between "condition" and "sex" is significant according to trimmed means, the post-hoc pairwise comparisons (tetrad contrasts) are shown using trimmed means with and without bootstrapping. The results match those presented in pages 5 and 6 of the PDF.
Since the omnibus test confirms the significance of all effects, pairwise comparisons should be done on all of them. Due to space constraints, only the two-way condition:sex interaction effect was probed (as effect = c("condition","sex") in the call to create the pairwise_LSM object that was subsequently updated to account for trimming and bootstrapping). Pairwise contrasts on two-way interactions are also known as tetrad contrasts.

Multivariate (mixed) between-× within-subjects design
The problem and the data are described in Keselman et al. (2003). The data represent the reaction times in milliseconds of children with attention-deficit hyperactivity (ADHD) and normal children when they are presented four kinds of inputs: a target alone or an arrow stimuli incongruent, congruent and neutral to the target. According to the authors, the dataset was artificially generated from the summary measures given in the original study by Jonkman et al. (1999), in groups of 20 and 10 children to create an unbalanced design. The data are delivered in our package in two variables named adhdData and adhdData2.
One-way multivariate vs univariate mixed model This problem can be approached in two different ways: (a) as a one-way multivariate design, which would be the non-parametric equivalent of MANOVA (multivariate ANOVA), or (b) as a univariate mixed model having one between-subjects factor (the student's group) and one within-subjects factor (with four levels, namely the four stimuli measured in every single student). In case we were analysing the data under parametric assumptions, the second option requires sphericity while MANOVA does not (although it needs more data). On the other hand, mixed models are able to capture the covariance structure of the dependent variables and can be generalized to any number of factors. An in-depth discussion on this topic can be found in chapter 9 of Maxwell and Delaney (2004). Keselman et al. (2003) (page 593) insist on using trimmed means and/or bootstrapping with this kind of models in order to overcome deviations from sphericity.
Our package admits both types of analysis. When dealing with a one-way multivariate design, the data must be formated as in Figure 1(a), while a mixed model requires the more systematic format of Figure 1(b) which is valid for an arbitrary amount of factors of both types. As done in their paper, we will analyze this dataset as a mixed model in which the stimuli are an explicit within-subjects factor.
Implicit within-subjects effect The package admits a third way to indicate the within-subjects effects that simplifies its use. It is common to have a dataset with a within-subjects effect expressed in the form of Figure 1(a). In this case, we may want to consider the within-subjects effect underlying the multivariate response. Reshaping this file to match the structure of Figure 1(b) would require some effort by the user. To avoid this, the function allows indicating that the multivariate response is actually an implicit within-subjects effects by including the word "multivariate" in the vector of within-subjects column names (if this argument was empty, then we just set the argument within.s = "multivariate"). This can be generalized as follows: if we have K within-subjects effects, we can have K − 1 columns in the data with their explicit names and levels, and leave one effect to be indicated in the multivariate response. In that case, we set within.s = c("within1", "within2", ..., "within-k-1", "multivariate") and pass a multivariate response vector argument because the data must have one response column per level of the K-th within-subjects effect. In the code below, the variables with the termination _multi show the equivalent calls. Unless we change the model itself (i.e. consider a mixed model or a multivariate one-way model with no within-subjects factor), the results obtained are the same in all types of analyses (omnibus, pairwise, etc), no matter the structure of the input data file. We demonstrate all the possibilities below.
> summary(omnibus_LSM_mixed_implicit) Call: welchADF.test(formula = adhdData, response = c("TargetAlone", "Congruent", "Neutral", "Incongruent"), between.s = "Group", within.s = "multivariate", contrast = "omnibus") WJ The results match those presented in Keselman et al. (2003), page 594. The only significant effect is the Stimulus, which acts as the within-subjects factor. When we consider a one-way multivariate design, with no within-subjects factor, then the Group (the between-subjects factor) is deemed not significant. In order perform all pairwise comparisons between the levels of Stimulus, we proceed as follows (only the bootstrap results are shown due to space constraints). The comparison is statistically significant when the value of the WJ statistic is greater than or equal to the bootstrap critical value. In this case, there were two significant differences, namely Incongruent vs TargetAlone, and Incongruent vs Congruent, as mentioned in page 595 of the aforementioned work. Note that, when using the implicit within-subjects effect format, we have to specify effect = "multivariate" to indicate that the effect to be tested is the within-subjects effect, even though there is no column with such name in our data. In case we are using the formula interface, this is not available because formula terms must strictly correspond to column names or variables in the environment of the formula.

Multivariate within-subjects design (doubly multivariate)
The three case studies addressed before are probably the most common in practice. Nevertheless, and with the aim of demonstrating the flexibility of the welchADF package, we present here an additional, not-so-common setting, namely a doubly multivariate design also proposed by Lix and Keselman (1995). In general, this design arises when several dependent variables are measured for each individual at several time points (every variable measured at each time point), or under different conditions; the latter is the case of the example addressed below.
We could not access the data employed by Lix, hence we have analyzed data from Wuensch (1992) which are freely available in the author's website 9 . In this experiment, wild strain house mice were, at birth, cross fostered onto house mouse (Mus), deer mouse (Peromyscus) or rat (Rattus) nursing mothers. Ten days after weaning, each subject was tested in an apparatus that allowed it to enter four different tunnels: one scented with clean pine shavings, and the other three tunnels with shavings bearing the scent of Mus, Peromyscus, or Rattus respectively. Three variables were measured for each tunnel: the number of visits to the tunnel during a twenty minute test, the time spent by each subject in each of the four tunnels and the latency to first visit of each tunnel.
In this design, the type of nursing mother is a between-subjects factor. The within-subjects factor is scent, with four levels (clean, Mus, Peromyscus, and Rattus). The multivariate response is composed of visits, time, and latency for each tunnel. With this approach, the multivariate response is not treated as another within-subjects factor. The data are delivered in our package in a variable named miceData. We first do an omnibus contrast. In the second call we demonstrate how the formula interface can be used in this design to obtain exactly the same result.

Conclusions and further work
This contribution has demonstrated the applicability of the new welchADF package in a variety of experimental designs, ranging from the most simple one, namely a univariate one-way betweensubjects design, to a more exotic one like a doubly-multivariate design. The unified approach of Johansen (1980) that has been implemented here leads to a great ease of use for any case study. We have shown in the example code that specifying the factors involved in the design and the type of analysis are done in a straightforward way, and then the code automatically generates the contrast matrices needed and runs the test, no matter how complex the user's design is. Therefore, researchers from other areas may find it more friendly and hence, our effort may contribute to the diffusion of the Welch-James ADF test in applied studies.
In the future, an enhancement may be added so that custom contrasts can be done in addition to the most common omnibus and pairwise contrasts. This requires designing a simple, yet powerful mechanism for the user to describe the desired test in the function arguments.