EMSaov: An R Package for the Analysis of Variance with the Expected Mean Squares and its Shiny Application

EMSaov is a new R package that we developed to provide users with an analysis of variance table including the expected mean squares (EMS) for various types of experimental design. It is not easy to find the appropriate test, particularly the denominator for the F statistic, which depends on the EMS, when some variables exhibit random effects or when we use a special experimental design such as a nested design, repeated measures design, or split-plot design. With EMSaov, a user can easily find the F statistic denominator and can determine how to analyze the data when using a special experimental design. We also developed a web application with a graphical user interface using the shiny package in R. We expect that our application can contribute to the efficient and easy analysis of experimental data.


Introduction
The analysis of variance (ANOVA) is a well-known method that can be used to analyze data obtained with different experimental designs. Its use mainly depends on the primary design of the experiment, and the main testing method for the analysis of variance is the F test. If all factors are fixed effects and there is no specific design of the experiment, that is, the experimental design is a factorial design, then the usual way to calculate the F statistic is to use the mean square of the corresponding source of variation as the numerator and the mean square error as the denominator. However, if some variables exhibit random effects or some variables are nested in other variables, it is not easy to find the appropriate F statistic. The appropriate statistic depends on the expected mean squares (EMS): the denominator of the F statistic is determined by the EMS of the corresponding source of variation. Therefore, we first have to calculate the expected mean squares for the ANOVA and then find the exact F statistic for the test using the EMS, especially when the data come from a special experimental design. Even though the EMS is essential for finding the exact F statistic in the ANOVA, few tools show the EMS, and most of the tools that have been developed provide only the result of the ANOVA without any further explanation. Users therefore see only the final result and cannot figure out how the test statistics were calculated.
Several packages can be used to handle models with various experimental designs. The lm function in R can handle a factorial design with fixed effects, but it does not take a special experimental design or random effects into account. The lme function in the nlme package (Pinheiro et al., 2016) handles the mixed effect model, and in this function, the user can specify the factors with a random effect. However, this function mainly focuses on grouped data and on estimating the variance components instead of testing the corresponding factor. Also, it does not provide the EMS of each source in the ANOVA table. Another R package that can be used in the analysis of factorial experiments is afex (Singmann et al., 2016). This package provides the function ems to calculate the EMS for factorial designs. Its authors adapted the Cornfield-Tukey algorithm (Cornfield and Tukey, 1956) to derive the expected values of the mean squares. The afex package also provides the mixed function to calculate p-values for various ANOVA tables, considering the corresponding EMS to find the exact F statistic. However, the ems function provides only general information on the factorial design, and its result shows only the coefficients of the variance components. The result is given as a p × p table instead of the EMS formula for each source in the ANOVA table, where p is the number of sources, and the elements of this table represent the coefficients of the variance components in the EMS formula. It is not easy to find the corresponding F statistic with this result. Moreover, this EMS does not directly match the analysis of variance for real data because the number of levels in each factor is represented by a character instead of the actual number, so users need to match each character with their own number of levels.
In this paper, we provide a tool that shows how to calculate the test statistics as well as the final result. We focus on the classical analysis of variance method, based on the F test using the EMS, exclusively for balanced designs. We developed a new R package, EMSaov, to provide users with the analysis of variance table with EMS for various types of experimental design. With the ANOVA table combined with the EMS, users can easily understand how to calculate the F statistics, especially the denominator of the F statistic, and then figure out the result of the analysis. We also provide an application for novice users based on Shiny (shiny). First, we explain the general concepts of the analysis of variance and the special types of experimental designs. Then we introduce EMSaov, our newly developed R package, with its implementation, and explain the usage of the functions in EMSaov in detail. We also introduce the web interface of EMSaov, followed by the conclusion.
The R Journal Vol. 9/1, June 2017, ISSN 2073-4859
Table 1: Expected mean squares for the three different types of models

Fixed, random, and mixed models
There are two ways to select the levels of factors for various factorial experimental designs. One is to select the appropriate levels as fixed values, and the other is to choose them at random from many possible levels. Bennett et al. (1954) discuss a case in which the chosen levels are obtained from a finite set of possible levels. When all levels are fixed, the statistical model for the experiment is referred to as a fixed model, and when all levels are chosen at random, the model is referred to as a random model. When two or more factors are involved and some factors have fixed levels while the others have random levels, the model is referred to as a mixed model. There is no difference between the fixed model and the random model during data analysis for a single-factor experiment. However, if there is more than one factor and some factors are random while the others are fixed, the EMS for each factor differs from that of the fixed model. We thus have to be careful when generating an F statistic to test the significance of each factor.
Consider the two-factor factorial experiment with factors A and B. The corresponding experimental model with a completely randomized design is
$$y_{ijk} = \mu + A_i + B_j + AB_{ij} + \varepsilon_{ijk},$$
where $\mu$ is a common effect, $A_i$ represents the effect of the ith level of factor A, $B_j$ represents the effect of the jth level of factor B, $AB_{ij}$ is the interaction effect of factors A and B, and $\varepsilon_{ijk}$ represents the random error in the kth observation on the ith level of A and jth level of B. In this model, we assume that $\mu$ is a fixed constant and that $\varepsilon_{ijk}$ is a random variable that follows $N(0, \sigma^2_\varepsilon)$. We can assume three cases: the fixed model, the random model, and the mixed model. In the mixed model, we treat factor A as a fixed effect and factor B as a random effect. The main test method for the analysis of variance is an F test. The usual way to calculate the F statistic is to use the mean square of the corresponding source as the numerator and the mean square error as the denominator. However, if some variables exhibit random effects, it is not easy to find the appropriate denominator for the F statistic. In fact, the denominator depends on the expected mean squares (EMS), so we have to calculate the expected mean squares for the analysis of variance of the data.
The expected mean squares are different among the three models, and they are represented in Table 1. For all three models, the mean square error ($MS_E$) is used as the denominator to test the interaction effect between A and B.
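For reference, the expected mean squares of the two-factor model can be written out as follows. This is a sketch of the standard textbook result, not a reproduction of Table 1. It assumes a balanced design, the restricted form of the mixed model (A fixed, B random), and the symbols $a$, $b$, and $n$ for the number of levels of A, the number of levels of B, and the number of replicates per cell, none of which are defined in the text above.

```latex
% Fixed model (A and B fixed)
E(MS_A)    = \sigma^2_\varepsilon + \frac{bn \sum_i A_i^2}{a-1}, \qquad
E(MS_B)    = \sigma^2_\varepsilon + \frac{an \sum_j B_j^2}{b-1}, \qquad
E(MS_{AB}) = \sigma^2_\varepsilon + \frac{n \sum_i \sum_j AB_{ij}^2}{(a-1)(b-1)}

% Random model (A and B random)
E(MS_A)    = \sigma^2_\varepsilon + n\sigma^2_{AB} + bn\sigma^2_A, \qquad
E(MS_B)    = \sigma^2_\varepsilon + n\sigma^2_{AB} + an\sigma^2_B, \qquad
E(MS_{AB}) = \sigma^2_\varepsilon + n\sigma^2_{AB}

% Mixed model (A fixed, B random, restricted form)
E(MS_A)    = \sigma^2_\varepsilon + n\sigma^2_{AB} + \frac{bn \sum_i A_i^2}{a-1}, \qquad
E(MS_B)    = \sigma^2_\varepsilon + an\sigma^2_B, \qquad
E(MS_{AB}) = \sigma^2_\varepsilon + n\sigma^2_{AB}

% All three models
E(MS_E) = \sigma^2_\varepsilon
```

In every case $E(MS_{AB})$ exceeds $E(MS_E)$ only by the interaction component, which is why $MS_E$ serves as the denominator for the interaction test in all three models.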

EMS rule
As we can see in the review on generating the F statistics for the three different models, the expected mean squares are very important. The previous examples consist of very simple factorial models with only two factors. For complex experimental designs, particularly when using models involving random or mixed effects with nested factors, it is frequently helpful to have a formal procedure to generate the expected mean squares, namely the EMS rules (Montgomery, 2008). The EMS rules are simple and convenient procedures that determine the expected mean squares, and they are also appropriate for manually calculating the expected mean squares for any nested, repeated-measures, or split-plot design. We follow the EMS rules in Montgomery (2008) to generate the expected mean squares in the ANOVA table with various experimental designs.
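To make the idea of an EMS rule concrete, the following base R sketch derives EMS strings for the simplest case only: a fully crossed, balanced design under the restricted mixed model. It is not the EMSaov implementation and does not handle nested factors or model levels. The function name ems_crossed and its arguments are hypothetical. The rule it applies is that the EMS of a term includes the component of every term that contains it, provided all the additional factors are random, with a coefficient equal to the number of replicates times the levels of the factors absent from that component.

```r
# EMS strings for a fully crossed, balanced design (restricted mixed model).
# levels: named vector with the number of levels of each factor
# type:   "F" (fixed) or "R" (random), in the same order as levels
# n:      number of replicates per cell
# All names are illustrative; this is not part of EMSaov.
ems_crossed <- function(levels, type, n) {
  fac <- names(levels)
  names(type) <- fac
  # Every main effect and interaction, stored as vectors of factor names
  terms <- unlist(lapply(seq_along(fac), function(k)
    combn(fac, k, simplify = FALSE)), recursive = FALSE)
  label <- function(tm) paste(tm, collapse = ":")
  out <- sapply(terms, function(tm) {
    parts <- "Error"
    # Visit higher-order terms first so that the term itself comes last
    for (u in rev(terms)) {
      extra <- setdiff(u, tm)
      # u contributes if it contains tm and all extra factors are random
      if (all(tm %in% u) && all(type[extra] == "R")) {
        coef <- n * prod(levels[setdiff(fac, u)])
        parts <- c(parts, paste0(coef, label(u)))
      }
    }
    paste(parts, collapse = " + ")
  })
  names(out) <- sapply(terms, label)
  out
}

# Layout of the film example: Gate fixed, Operator and Day random, 2 replicates
ems <- ems_crossed(levels = c(Gate = 3, Operator = 3, Day = 2),
                   type = c("F", "R", "R"), n = 2)
ems[["Gate"]]
# "Error + 2Gate:Operator:Day + 6Gate:Day + 4Gate:Operator + 12Gate"
```

For a fixed factor such as "Gate", the last component stands for the fixed-effect term rather than a variance component. An exact F test for "Gate" would need a denominator whose EMS equals this expression without the final term.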

Nested and nested-factorial design
In the case of experiments with two or more factors without any restriction in the randomization, most experimental designs can be categorized in one of three ways: crossed design, nested design, or nested-factorial design. In the EMSaov package, we do not consider unbalanced designs or fractional factorial designs. The crossed design considers every possible combination of the levels of factors in the model. However, when the levels of one factor are similar but not identical for different levels of another factor, the design is referred to as a nested design.
In the nested design, when the levels of factor B are nested under the levels of factor A, the levels of factor B belonging to the first level of factor A are not the same as the levels of factor B in the second level of factor A, as shown in Figure 1. One of the features of this model is the lack of an interaction effect between the two nested factors, so when the analysis of variance is carried out, the interaction term AB should be pooled into the nested factor B(A). The ANOVA table for this design is shown in Table 3. The nested design can be extended to a more complex nested design, for example, to a model with another nested factor C under the existing nested factor B. It can also be combined with a factorial design, for example, a model with another factor C that is nested in both A and B, while factors A and B are crossed.
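The pooling of AB into B(A) can be checked numerically. The sketch below uses base R's aov on simulated data rather than EMSaov, and the data frame d and the helper ss are hypothetical. It assumes a balanced layout and shows that the sum of squares of the nested term equals the sum of the B and A:B sums of squares from the crossed fit.

```r
set.seed(1)
# Balanced layout: 3 levels of A, 4 levels of B within each level of A,
# 2 replicates per cell (simulated data, for illustration only)
d <- expand.grid(rep = 1:2, B = factor(1:4), A = factor(1:3))
d$y <- rnorm(nrow(d))

crossed <- summary(aov(y ~ A * B, data = d))[[1]]  # rows: A, B, A:B, Residuals
nested  <- summary(aov(y ~ A / B, data = d))[[1]]  # rows: A, A:B (= B within A), Residuals

# Extract one column for a given row; summary() pads row names with spaces
ss <- function(tab, term, col = "Sum Sq") tab[trimws(rownames(tab)) == term, col]

pooled_ss <- ss(crossed, "B") + ss(crossed, "A:B")
pooled_df <- ss(crossed, "B", "Df") + ss(crossed, "A:B", "Df")

all.equal(pooled_ss, ss(nested, "A:B"))  # sums of squares agree
pooled_df                                # 3 * (4 - 1) = 9 degrees of freedom
```

In the formula y ~ A / B, R expands the nesting operator to A + A:B, so the A:B row of the nested fit plays the role of B(A).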

Split-plot design
The split-plot design is used when it may not be possible to completely randomize the order of experimentation. In this case, we treat one factor as a block. Since a factor treated as a block is restricted during randomization, the effects of the corresponding factor are confounded with the blocks, and it is thus difficult to determine the pure significance of this effect. On the other hand, there is no loss of information for the other factors that are not treated as a block because the randomization is complete within these factors. In this design, two levels of randomization are applied to assign the experimental units to the treatments. The first level of randomization is applied to the whole plots and is used to assign the experimental units to the levels of treatment factor A. Each whole plot is then split into subplots, and the second level of randomization is used to assign the experimental units of the subplots to the levels of treatment factor B. Since the split-plot design has two levels of experimental units, the whole plot and the subplot portions have separate experimental errors. Therefore, the F tests must be run only within the whole plot or within the split plot, and the mean squares in the whole plot should not be compared with the mean squares in the split plot, regardless of the EMS value. We handle this split-plot design as a hierarchical design with respect to the levels of the model. The first level of the model is the whole plot, and the second level of the model is the split plot. These levels can then be extended to 3, 4, or more levels.
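The separation of whole-plot and subplot errors can be illustrated with base R's aov and an Error() term. This is a different mechanism from the model levels used in EMSaov, shown here only to make the two error strata visible. The data are simulated, and the factor names Block, A, and B are assumptions for the illustration.

```r
set.seed(2)
# 4 blocks; whole-plot factor A (3 levels) is randomized within each block;
# subplot factor B (2 levels) is randomized within each whole plot
d <- expand.grid(B = factor(1:2), A = factor(1:3), Block = factor(1:4))
d$y <- rnorm(nrow(d))

# Error(Block / A) declares blocks and whole plots (Block:A) as error strata
fit <- aov(y ~ A * B + Error(Block / A), data = d)

names(fit)    # one fitted component per error stratum
summary(fit)  # A is tested in the Block:A stratum, B and A:B in the Within stratum
```

Each stratum has its own residual mean square, so the whole-plot factor and the subplot factor are never tested against the same error term.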

Approximate F test
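In factorial experiments with three or more factors involving a random or mixed model, sometimes there is no exact test statistic for certain effects in the model. We have to construct a new F statistic if there is no mean square whose expected value differs from the expected value of the numerator only by the specific component being tested. For this case, Satterthwaite (1946) proposed a test procedure that uses linear combinations of the original mean squares to form the F statistic, for example,
$$F = \frac{MS_{num}}{MS_{den}} = \frac{MS_{num,1} + \cdots + MS_{num,r_n}}{MS_{den,1} + \cdots + MS_{den,r_d}},$$
where $MS_{num,1}, \dots, MS_{num,r_n}$ and $MS_{den,1}, \dots, MS_{den,r_d}$ are selected from the MS values in the ANOVA table such that $E(MS_{num}) - E(MS_{den})$ is equal to the effect considered in the null hypothesis. Then, approximately,
$$F \sim F(df_{num}, df_{den}),$$
where
$$df_{num} = \frac{(MS_{num,1} + \cdots + MS_{num,r_n})^2}{MS_{num,1}^2/df_{num,1} + \cdots + MS_{num,r_n}^2/df_{num,r_n}}, \qquad df_{den} = \frac{(MS_{den,1} + \cdots + MS_{den,r_d})^2}{MS_{den,1}^2/df_{den,1} + \cdots + MS_{den,r_d}^2/df_{den,r_d}}.$$
In $df_{num}$ and $df_{den}$, $df_{num,i}$ and $df_{den,j}$ are the degrees of freedom associated with the mean squares $MS_{num,i}$ and $MS_{den,j}$, respectively, where $i = 1, \dots, r_n$ and $j = 1, \dots, r_d$.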
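The Satterthwaite procedure is short enough to sketch directly in base R. The function approx_f below is hypothetical and is not the ApproxF function of EMSaov. It takes the mean squares and degrees of freedom chosen for the numerator and the denominator and returns the approximate F value, both approximate degrees of freedom, and the p-value. The input numbers in the example are made up for illustration.

```r
# Satterthwaite approximate F test from sums of mean squares.
# ms_num, ms_den: mean squares forming the numerator and the denominator
# df_num, df_den: their degrees of freedom, in the same order
approx_f <- function(ms_num, df_num, ms_den, df_den) {
  # Approximate degrees of freedom of a sum of independent mean squares
  satt_df <- function(ms, df) sum(ms)^2 / sum(ms^2 / df)
  f   <- sum(ms_num) / sum(ms_den)
  dfn <- satt_df(ms_num, df_num)
  dfd <- satt_df(ms_den, df_den)
  list(F = f, df.num = dfn, df.den = dfd,
       p.value = pf(f, dfn, dfd, lower.tail = FALSE))
}

# Hypothetical mean squares: two in the numerator, two in the denominator
res <- approx_f(ms_num = c(10, 2), df_num = c(2, 4),
                ms_den = c(3, 1),  df_den = c(4, 2))
res$F       # (10 + 2) / (3 + 1) = 3
res$df.num  # 144 / 51
res$df.den  # 16 / 2.75
```

When the numerator or the denominator consists of a single mean square, the approximate degrees of freedom reduce to the ordinary degrees of freedom of that mean square.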

Implementation of EMSaov package
The EMSaov package includes EMSanova, PooledANOVA, and ApproxF as main functions and EMSaovApp as a function for the Shiny application. EMSanova generates the analysis of variance (ANOVA) table with the expected mean squares (EMS) and the corrected F tests that take the EMS into account. Several arguments are needed for this function (Table 4). We use the formula argument to specify the response variable and the factors in the ANOVA table, together with the data, the nested factors (nested), and the types of factors (type). Sometimes, we cannot find the appropriate denominator for the F statistic, and we have to use the approximate F test. The function ApproxF is developed to approximate the results of the F test. The ApproxF function takes SS.table and approx.name as arguments. SS.table is the result from EMSanova, and approx.name designates the source of variation in SS.table for which to calculate the approximate F value for the test. To show how to use these functions in the EMSaov package, we use the three sets of example data in Hicks (1982).

Example 1: Mixed effect model with approximate F test
The film data in EMSaov correspond to the mixed effect model (Example 10.3 in Hicks (1982)). There are three factors: Gate, Operator, and Day. The experiment consists of measuring the dry-film thickness of varnish in millimeters for three different gate settings (1, 2, and 3) twice with operators A, B, and C, for two days.
In film, "thickness" is the dependent variable, and "Gate", "Operator", and "Day" are the factors that we want to consider. We use thickness ∼ Gate + Operator + Day as the formula. In the EMSanova function, the formula is used only to specify the factors in the model. To specify the types of factors and whether the factors are nested or not, we need to use the other arguments. The type argument should follow the order of "Gate", "Operator", and "Day". In this example, "Gate" is treated as a fixed effect and "Operator" and "Day" are treated as random effects. Therefore, type = c("F", "R", "R").
> data(film)
> anova.result <- EMSanova(thickness ~ Gate + Operator + Day, data = film,
+                          type = c("F", "R", "R"))
> anova.
For the factor "Gate", the EMS of the denominator should be "Error + 2Gate:Operator:Day + 6Gate:Day + 4Gate:Operator", but no source of variation in this table has this EMS. Therefore, we cannot find the exact denominator for the F test and need to use the approximate F test. The factor "Gate" is in the first row in the result of EMSanova, and approx.name should be "Gate" for the ApproxF function.
> ApproxF(SS.table = anova.result, approx.name = "Gate")
$Appr.F
The approximate F value for the test of the factor "Gate" is 48.17076 with p-value 0.0002. Therefore, we can conclude that there are significant differences among the levels of the factor "Gate" at the significance level 0.05.
If we want to combine "Gate:Day" and "Residuals" and treat them as a combined residual for further analysis, we can define del.ID as c("Gate:Day", "Residuals") and use the PooledANOVA function.

EMSaovApp: Web interface for ANOVA with EMS
Even though we provide three functions to produce the appropriate analysis of variance for many different types of experimental designs, using them is not easy for a novice R user. For convenience, we provide a Shiny-based application with a graphical user interface (GUI) to obtain the ANOVA of data from various experimental designs. Figure 2 shows the main GUI for EMSaovApp with Example 2. The first part of the GUI can be used to read data in the csv format. The middle part has various input windows to select the dependent variable, the factors in the ANOVA table, the number of categories in each factor, the specification of the nested factor, and the level of the model. In this example, "Y" is selected as the dependent variable (Y variable), and the "Group", "Subject", and "Test" factors are selected. Among the selected factors, "Subject" is treated as the random effect and the others are treated as fixed effects. Therefore, "Subject" is checked in the "Random Effect" part. The "Subject" factor is nested in "Group", and the "Test" factor is in the second level of the model. This information should thus be specified in this part. The number of categories for each factor is automatically calculated from the original data, but the user can change them in this GUI.
The bottom part has five tabs: "EDA-main effect", "EDA-interaction", "ANOVA table", "ANOVA table with Approx. F", and "Pooled ANOVA". The "EDA-main effect" tab shows parallel box plots for each factor (Figure 2). "Subject" and "Test" show significant differences among the levels in each factor. The "EDA-interaction" tab shows the interaction plots to help see whether the interaction effect between two factors is significant or not. EMSaovApp automatically generates interaction plots for all pairs of selected factors. In Figure 3, the interaction effect between "Group" and "Test" is highly significant. Even though the interaction effect between "Group" and "Subject" and the interaction effect between "Subject" and "Test" are provided, "Subject" is nested in "Group" and

Figure 1: A two-stage nested design

Figure 4: Analysis of variance table with approximate F test result

Table 2: F statistics in ANOVA table for the three different models

Table 3: ANOVA table for the nested design

For factor A, $MS_E$ is used as the denominator of the F statistic in the fixed model, but $MS_{AB}$ is used in the other two models. For factor B, $MS_{AB}$ is used in the random model, but $MS_E$ is used in the other two models. The appropriate test statistics for each factor are summarized in Table 2.
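The choice of denominator described here can be written as a small lookup. The base R sketch below is illustrative only: the function f_tests is hypothetical, the mean squares and degrees of freedom are made-up numbers, and "mixed" means A fixed and B random as in the text.

```r
# F tests for a balanced two-factor design with the denominator chosen by model.
# ms, df: named vectors of mean squares and degrees of freedom for A, B, AB, E
f_tests <- function(ms, df, model = c("fixed", "random", "mixed")) {
  model <- match.arg(model)
  # Denominator used for each source, following the rules stated in the text
  den <- switch(model,
    fixed  = c(A = "E",  B = "E",  AB = "E"),
    random = c(A = "AB", B = "AB", AB = "E"),
    mixed  = c(A = "AB", B = "E",  AB = "E"))
  src <- names(den)
  f <- ms[src] / ms[den]
  p <- pf(f, df[src], df[den], lower.tail = FALSE)
  data.frame(source = src, denominator = unname(den),
             F = unname(f), p.value = unname(p))
}

# Hypothetical mean squares and degrees of freedom
ms <- c(A = 40, B = 30, AB = 10, E = 5)
df <- c(A = 2,  B = 3,  AB = 6,  E = 12)
f_tests(ms, df, "mixed")  # A is tested against AB; B and AB against E
```

With these numbers, the F value for A is 8 in the fixed model but 4 in the random and mixed models, which shows how much the conclusion can depend on the denominator.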
where $MS_{num,1}, \dots, MS_{num,r_n}$ and $MS_{den,1}, \dots, MS_{den,r_d}$ are selected from the MS values in the ANOVA table such that $E(MS_{num}) - E(MS_{den})$

Table 4: Arguments of EMSanova function

type         It designates whether each factor is random or not. Use "F" for the fixed effect and "R" for the random effect
nested       the list of the nested effects
level        the list of the model level
n.table      numbers of levels in each factor
approximate  calculate approximate F test when it is TRUE

specifies formula = thickness ∼ Gate + Operator + Day, type, nested, level, and n.table