eat is a new package for R that includes functions to estimate production frontiers and technical efficiency measures through nonparametric techniques based upon regression trees. The package specifically implements the main algorithms associated with a recently introduced methodology for estimating the efficiency of a set of decision-making units in Economics and Engineering through Machine Learning techniques, called Efficiency Analysis Trees (Esteve et al. 2020). The package includes code for estimating input- and output-oriented radial measures, input- and output-oriented Russell measures, the directional distance function and the weighted additive model, plotting graphical representations of the production frontier by tree structures, and determining rankings of importance of input variables in the analysis. Additionally, it includes code to perform an adaptation of Random Forest for estimating technical efficiency. This paper describes the methodology and implementation of the functions, and reports numerical results using a real database application.
Efficiency analysis refers to the discipline of estimating production frontiers while measuring the efficiency of a set of observations, named Decision Making Units (DMUs), which use several inputs to produce several outputs. In the literature of Economics, Engineering and Operations Research, the estimation of production frontiers is a current topic of interest (see, for example, Arnaboldi et al. 2014; Aparicio et al. 2017; O’Donnell et al. 2018). In this line, many models for estimating production frontiers have been developed, resorting to parametric and nonparametric approaches. In the nonparametric approach, a functional form does not need to be specified (e.g. a Cobb-Douglas production function) through the specification of a set of parameters to be estimated, since these models are usually data-driven. Additionally, nonparametric models innately cope with multiple-output scenarios. In contrast, the parametric approach aggregates the outputs into a single production index or attempts to model the technology using a dual cost function (Orea and Zofío 2019). These are some of the advantages that make the nonparametric approaches for measuring technical efficiency more appealing than their parametric counterparts.
Contextualizing the nonparametric measurement of efficiency analysis requires outlining the following works. Farrell (1957) was a renowned opponent of estimating efficiency by determining average performance and, indeed, he was the first author in the literature to introduce a method for constructing production frontiers as the maximum producible output from an input bundle. Inspired by Koopmans (1951) and Debreu (1951), Farrell introduced a piecewise linear upper enveloping surface of the data cloud as the specification of the production frontier, satisfying some microeconomic postulates: free disposability, convexity and minimal extrapolation. A DMU is considered technically inefficient if it is located below the frontier. Furthermore, Farrell’s measure of efficiency, inspired by Shephard (1953), is based on radial movements (equiproportional changes) from technically inefficient observations to their benchmarks located at the estimated production frontier. In the same context as Farrell, Afriat (1972) determined a production frontier under non-decreasing and concavity mathematical assumptions and, at the same time, as close as possible to the sample of observations. Finally, in the same line of research, Charnes et al. (1978) and Banker et al. (1984) proposed Data Envelopment Analysis (DEA), rooted in mathematical programming to provide a relative efficiency assessment of a set of DMUs by the construction of a piecewise linear frontier. Along with DEA, Free Disposal Hull (FDH) is another of the most recognized nonparametric models for estimating production frontiers. FDH is a deterministic model introduced by Deprins and Simar (1984), which is only based on the free disposability and minimal extrapolation principles, as opposed to DEA, which also assumes convexity. In fact, FDH can be considered the skeleton of DEA, since the convex hull of the former coincides with DEA’s frontier (see Daraio and Simar 2005).
In addition, other recent alternative nonparametric techniques for estimating production frontiers are: Banker and Maindiratta (1992) and Banker (1993), who showed that DEA can be interpreted as a Maximum Likelihood estimator; Simar and Wilson (1998; 2000a; 2000b), who introduced how to determine confidence intervals for the efficiency score of each DMU by adapting the bootstrapping methodology of Efron (1979) to the context of FDH and DEA; or Kuosmanen and Johnson (2010; 2017), who have recently shown that DEA may be interpreted as nonparametric least-squares regression, subject to shape constraints on the production frontier and sign constraints on residuals; to name a few.
However, few of the above methodologies are based upon Machine Learning techniques, despite this being a rising research field (see, for example, the recent papers by Khezrimotlagh et al. 2019 and Zhu et al. 2019, or the book by Charles et al. 2020). Recently, a bridge has been built between these two literatures, Machine Learning and production theory, through a new technique proposed in Esteve et al. (2020), called Efficiency Analysis Trees. This new method shares some similarities with the standard FDH technique. In contrast to FDH, Efficiency Analysis Trees overcomes the well-known problem of overfitting linked to FDH and DEA by using cross-validation to prune back the deep tree obtained in a first stage. Esteve et al. (2020) also showed, through Monte Carlo simulations, that Efficiency Analysis Trees clearly outperforms the FDH technique with respect to bias and mean squared error.
Many of the standard models for estimating technical efficiency are nowadays available as R packages, such as: Benchmarking (Bogetoft and Otto 2010), for estimating technologies and measuring efficiencies using Data Envelopment Analysis (DEA) and Stochastic Frontier Analysis (SFA); nonparaeff (Oh and Suh 2013), for measuring efficiency and productivity using DEA and its variations; npbr (Daouia et al. 2017), which covers data envelopment techniques based on piecewise polynomials, splines, local linear fitting, extreme values and kernel smoothing; snfa (McKenzie 2018), which fits both a smooth analogue of DEA and a nonparametric analogue of SFA; or semsfa (Ferrara and Vidoli 2018), which, in a first stage, estimates Stochastic Frontier Models by semiparametric or nonparametric regression techniques to relax parametric restrictions and, in a second stage, applies a technique based on pseudo-likelihood or the method of moments for estimating variance parameters. Additionally, there are other packages on efficiency measurement developed for alternative platforms. In MATLAB (The MathWorks Inc. 2021), we can find the Data Envelopment Analysis Toolbox (Álvarez et al. 2020), which implements the main DEA models and solves measures like the directional distance function (with desirable and undesirable outputs), the weighted additive model, and the Malmquist-Luenberger index; or the Total Factor Productivity Toolbox (Balk et al. 2018), which includes functions to calculate the main Total Factor Productivity indices and their decomposition by DEA models. In Stata (StataCorp 2021), it is possible to find a similar package in Ji and Lee (2010).
In this paper, we introduce a new package in R, called eat, for fitting regression trees to estimate production frontiers in microeconomics and engineering, by implementing the main features of Efficiency Analysis Trees (Esteve et al. 2020). In particular, eat includes a complete set of baseline functions, covering a wide range of efficiency models fitted by Efficiency Analysis Trees (Esteve et al. 2020) and Random Forest (Esteve et al. 2021), and reporting numerical and graphical results. eat is available as free software, under the GNU General Public License version 3, and can be downloaded from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=eat, including supplementary material such as datasets and vignettes to replicate all the results presented in this paper. In addition, eat is hosted on an open source repository on GitHub at https://github.com/MiriamEsteve/EAT. The main objective of this package is the estimation of a production frontier through regression trees satisfying the microeconomic principles of free disposability, convexity and deterministic data. Free disposability states that if a certain input and output bundle is producible, then any input and output bundle that presents a greater value for inputs and a lower value for outputs is also producible. In some sense, it means that doing worse is always feasible. Convexity means that if two input-output bundles are assumed producible, then any convex combination of them is also feasible. Finally, the deterministic quality means that the observations that belong to the data sample have been observed without noise. In other words, the technology always contains all these observations and, graphically, the production frontier envelops all the data cloud from above.
The efficiency measurement field has witnessed the introduction of many different technical efficiency measures throughout the last decades. Regarding the technical efficiency measures implemented in the new eat package, it is worth mentioning that numerical scores and barplots are provided for the output-oriented and input-oriented BCC radial models (Banker et al. 1984), the directional distance function (Chambers et al. 1998), the weighted additive model (Lovell and Pastor 1995; and Cooper et al. 1999) and the output-oriented and input-oriented Russell measures (Färe and Lovell 1978). Additionally, the adaptation of Random Forest (Breiman 2001) for dealing with ensembles of Efficiency Analysis Trees, recently introduced in Esteve et al. (2021) and denoted as RF+EAT, has also been incorporated into the new eat package. The frontier estimator based on Random Forest, which is associated with more robust results, also allows out-of-sample efficiency evaluation for the assessed DMUs. Another remarkable aspect of Efficiency Analysis Trees is the inherited ability to calculate feature importance, as performed by other tree-based models in Machine Learning. Specifically, this allows researchers to know which variables are the most relevant for obtaining efficiency, and thus to explain the level of technical efficiency identified for each assessed unit. This variable importance ranking has been implemented in the eat package. Finally, from a data visualization point of view, the frontier obtained from Efficiency Analysis Trees can be represented by means of a tree structure, ideal for high-dimensional scenarios where the patterns between the inputs and the efficient levels of outputs are very complex. In addition, the FDH and DEA standard models have been included in the new package in order to facilitate comparison with the efficiency scores determined by the Efficiency Analysis Trees technique.
Also, the convexification of the estimation of the technology provided by Efficiency Analysis Trees, named Convexified Efficiency Analysis Trees (CEAT) by Aparicio et al. (2021), is implemented in the eat package, with the objective of determining estimations under the axiom of convexity.
The functions included in the eat package are summarized in Table 1. This table comprises two columns divided into four subsections for Efficiency Analysis Trees, Random Forest for Efficiency Analysis Trees, Convexified Efficiency Analysis Trees and functions for data simulation. The first column is the name of the main functions and the second one is the description of the functions and the reference of the paper in which we can find the most detailed theoretical explanation of the corresponding function.
The paper is organized as follows. The following section summarizes the two methodologies implemented in the eat package in R: Efficiency Analysis Trees and Random Forest for Efficiency Analysis Trees. Section 3 describes the data structures that characterize the production possibility sets, the structure of the functions, the results, etc., and briefly explains which data are used to illustrate the package. Section 4 presents the basic methods. Section 5 deals with the measurement of economic efficiency through the FDH, DEA, Efficiency Analysis Trees, Random Forest for Efficiency Analysis Trees and Convexified Efficiency Analysis Trees models. Advanced options, including displaying and exporting results, can be found in Section 6. Section 7 concludes.
Table 1: Summary of the functions included in the eat package.

| Functions | Description |
|---|---|
| EAT | For EAT. It generates a pruned Efficiency Analysis Trees model and returns an EAT object. |
| bestEAT | For EAT. It computes the root mean squared error (RMSE) for a set of Efficiency Analysis Trees models built from a set of user-entered hyperparameters. These models are fitted with a training sample and evaluated with a test sample. |
| efficiencyEAT | For EAT. It computes the efficiency scores of a set of DMUs through an Efficiency Analysis Trees model and returns a data.frame. The FDH scores can also be computed. Alternative mathematical programming models for calculating the efficiency scores are: "BCC.OUT" (the output-oriented BCC radial model), "BCC.INP" (the input-oriented BCC radial model), "DDF" (the directional distance function), "RSL.OUT" (the output-oriented Russell model), "RSL.INP" (the input-oriented Russell model), "WAM.MIP" (the weighted additive model with Measure of Inefficiency Proportion) and "WAM.RAM" (the weighted additive model with Range Adjusted Measure of Inefficiency). |
| efficiencyJitter | For EAT. It returns a jitter plot (from ggplot2) that represents the dispersion of the efficiency scores of the set of DMUs in the leaf nodes of an Efficiency Analysis Trees model. The mean and standard deviation of the scores are shown. |
| efficiencyDensity | It returns a density plot (from ggplot2) to compare the distribution of efficiency scores between two given models ("EAT", "FDH", "CEAT", "DEA" and "RFEAT" are available). |
| plotEAT | For EAT. It returns a plot of the tree structure (from ggparty and partykit) of an Efficiency Analysis Trees model. |
| frontier | For EAT. It returns a plot (from ggplot2) of the estimated production function obtained by an Efficiency Analysis Trees model in a two-dimensional scenario (1 input and 1 output). Optionally, the FDH frontier can be plotted. |
| predict | For EAT. Generic function to predict the expected output by an EAT object. The result is a data.frame with the predicted values. |
| rankingEAT | For EAT. It returns a data.frame with the variable importance scores obtained by an Efficiency Analysis Trees model and, optionally, a barplot representing the variable importance. |
| RFEAT | For RFEAT. It generates a Random Forest for Efficiency Analysis Trees model and returns an RFEAT object. |
| bestRFEAT | For RFEAT. It computes the root mean squared error (RMSE) for a set of Random Forest for Efficiency Analysis Trees models built from a set of user-entered hyperparameters. These models are fitted with a training sample and evaluated with a test sample. |
| efficiencyRFEAT | For RFEAT. It computes the efficiency scores of a set of DMUs through a Random Forest for Efficiency Analysis Trees model and returns a data.frame. The FDH scores can also be computed. Only the output-oriented BCC radial model is available. |
| plotRFEAT | For RFEAT. It returns a line plot (from ggplot2) with the Out-of-Bag (OOB) error for a random forest consisting of k trees. |
| predict | For RFEAT. Generic function to predict the expected output by an RFEAT object. The result is a data.frame with the predicted values. |
| rankingRFEAT | For RFEAT. It returns a data.frame with the variable importance scores obtained by a Random Forest for Efficiency Analysis Trees model and, optionally, a barplot representing the variable importance. |
| efficiencyCEAT | For CEAT. It computes the efficiency scores of a set of DMUs through a Convexified Efficiency Analysis Trees model and returns a data.frame. The DEA scores can also be computed. Alternative mathematical programming models for calculating the efficiency scores are: "BCC.OUT" (the output-oriented BCC radial model), "BCC.INP" (the input-oriented BCC radial model), "DDF" (the directional distance function), "RSL.OUT" (the output-oriented Russell model), "RSL.INP" (the input-oriented Russell model), "WAM.MIP" (the weighted additive model with Measure of Inefficiency Proportion) and "WAM.RAM" (the weighted additive model with Range Adjusted Measure of Inefficiency). |
| Y1.sim | Simulation function. It returns a data.frame with simulated data in a single-output scenario (1, 3, 6, 9, 12 or 15 inputs can be generated). |
| X2Y2.sim | Simulation function. It returns a data.frame with simulated data in a scenario with 2 inputs and 2 outputs. |

In this section, we briefly introduce the main fundamentals of Efficiency Analysis Trees. Nevertheless, we first need to introduce some notation related to the standard Free Disposal Hull (FDH) and Classification and Regression Trees (CART) techniques.
We consider the observation of \(n\) Decision Making Units (DMUs), which consume \({\it \textbf{x}}_{i} =(x_{1i} ,...,x_{mi})\) \(\in R_{+}^{m}\) quantities of inputs for the production of \({\it \textbf{y}}_{i} =(y_{1i} ,...,y_{si} )\in R_{+}^{s}\) quantities of outputs. The dataset is denoted in a compact way as \(\aleph =\left\{\left({\it \textbf{x}}_{i} ,{\it \textbf{y}}_{i} \right)\right\}_{i=1,...,n}\). The so-called production possibility set or technology, which is the set of technically feasible combinations of \(({\it \textbf{x},\textbf{y}})\), is defined, in general terms, as:
\[\begin{equation} \label{(1)} \psi =\left\{({\it \textbf{x},\textbf{y}})\in R_{+}^{m+s} :{\it \textbf{x}\; }{\it can\; produce\; \textbf{y}}\right\}. \end{equation}\]On this set, certain assumptions are usually made (see Färe and Primont 1995), such as: monotonicity (free disposability) of inputs and outputs, which means that if \(({\it \textbf{x},\textbf{y}})\in \psi\), then \(({\it \textbf{x'},\textbf{y'}})\in \psi\), as long as \({\it \textbf{x'}\; }\ge {\it \textbf{x}}\) and \({\it \textbf{y'}\; }\le {\it \textbf{y}}\); and convexity, i.e., if \(\left({\it \textbf{x}},{\it \textbf{y}}\right)\in \psi\) and \(\left({\it \textbf{x'}},{\it \textbf{y'}}\right)\in \psi\), then \(\lambda \left({\it \textbf{x}},{\it \textbf{y}}\right)+\left(1-\lambda \right)\left({\it \textbf{x'}},{\it \textbf{y'}}\right)\in \psi\), \(\forall \lambda \in \left[0,1\right]\). In the case of the FDH estimator, only free disposability is assumed. Additionally, FDH is assumed to be deterministic. In other words, the production possibility set determined by FDH always contains all the observations that belong to the data sample and, graphically, the production frontier envelops the data cloud from above. Also, FDH satisfies the minimal extrapolation postulate, which is associated with the typical problem-solving principle of Occam’s razor. That is, additional requirements are needed to select the right estimator because there are many possible estimators that can meet free disposability and the deterministic quality. In this sense, according to Occam’s razor, the most conservative estimate of the production frontier would be that related to a surface that would envelop the data from above, satisfy free disposability and, at the same time, be as close as possible to the data cloud. In contrast, the DEA estimator requires stronger assumptions, such as convexity of the set \(\psi\).
With regard to the measurement of technical efficiency, a certain part of the set \(\psi\) is actually of interest. It is the efficient frontier or production frontier of \(\psi\), which is defined as \(\partial ({\it \boldsymbol{\psi}}):=\left\{({\it \textbf{x}},{\it \textbf{y}})\in {\it \boldsymbol{\psi}}:\hat{{\it \textbf{x}}}<{\it \textbf{x}},\hat{{\it \textbf{y}}}>{\it \textbf{y}}\Rightarrow (\hat{{\it \textbf{x}}},\hat{{\it \textbf{y}}})\notin {\it \boldsymbol{\psi}}\right\}\). Technical efficiency is understood as the distance from a point belonging to \(\psi\) to the production frontier \(\partial ({\it \boldsymbol{\psi}})\). In particular, Deprins and Simar (1984) proposed the FDH estimator of the set \(\psi\) from the dataset \(\aleph\) as:
\[\begin{equation} \label{(2)} \hat{{\it \boldsymbol{\psi}}}_{FDH} =\left\{\left({\it \textbf{x}},{\it \textbf{y}}\right)\in R_{+}^{m+s} :\exists i=1,...,n\, \, {\rm such\; that}\, {\it \textbf{y}}\le {\it \textbf{y}}_{i} ,{\it \textbf{x}}\ge {\it \textbf{x}}_{i} \right\}. \end{equation}\]The FDH technique is very attractive because it is based on very few suppositions, but it suffers from overfitting due to its construction. This problem is shared by other well-known data-based approaches. For example, Classification and Regression Trees (CART), a technique that belongs to the field of machine learning, suffers problems of overfitting when a deep tree is developed. However, this problem can be fixed using a cross-validation process to prune the deep tree. The principle behind CART is relatively simple: a certain criterion is chosen to recursively generate binary partitions of the data until a meaningful division is no longer possible or a stopping rule is met. The graphic result of this approach is a tree that starts at the root node, develops through the intermediate nodes and ends at the terminal nodes, also known as leaves. The binary nature of CART is represented by each parent node, except for the leaves, giving rise to two child nodes.
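The output side of the FDH estimator in Equation (2) can be sketched in a few lines of code. The snippet below is a minimal Python illustration (the package itself is written in R); the function name and the data layout are ours: each observed DMU contributes its output vector whenever its input bundle is dominated by the evaluated point, and the frontier value is the component-wise maximum over those contributions.

```python
def fdh_output(x, X_obs, Y_obs):
    """Free Disposal Hull estimate of the maximal producible output at
    input vector x: the component-wise maximum over the outputs of the
    observed DMUs whose input bundles are dominated by x (x_i <= x)."""
    s = len(Y_obs[0])
    best = [0.0] * s
    for xi, yi in zip(X_obs, Y_obs):
        if all(a <= b for a, b in zip(xi, x)):  # xi is dominated by x
            best = [max(b, y) for b, y in zip(best, yi)]
    return best
```

With two single-input DMUs, `X_obs = [[2], [4]]` and `Y_obs = [[3], [5]]`, the estimator produces the usual staircase frontier: `fdh_output([3], X_obs, Y_obs)` gives `[3]`, while `fdh_output([5], X_obs, Y_obs)` gives `[5]`.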
Next, we briefly introduce the recent technique named Efficiency Analysis Trees by Esteve et al. (2020). This technique allows the estimation of production frontiers, fulfilling the common axioms of microeconomics, through a data-based approach that is not founded on any particular distribution of the data noise and, in addition, creates a step function as an estimator. It shares these characteristics with the FDH technique, but the overfitting problem related to FDH can be solved through cross-validation based on pruning.
We now introduce the main steps of the algorithm linked to the Efficiency Analysis Trees technique. Let us assume that we have a node \(t\) in the tree structure to be split. This node contains a subset of the original sample \(\aleph\). The algorithm has to select an input variable \(j\), \(j=1,...,m\), and a threshold \(s_{j} \in S_{j}\), where \(S_{j}\) is the set of possible thresholds for variable \(j\), such that the sum of the mean squared error (MSE) calculated for the data that belong to the left child node \(t_{L}\) and the MSE corresponding to the data belonging to the right child node \(t_{R}\) is minimized. The data of the left child node \(t_{L}\) satisfy the condition \(x_{j} <s_{j}\), while the data of the right child node \(t_{R}\) satisfy the condition \(x_{j} \ge s_{j}\). Additionally, in the algorithm, the set \(S_{j}\) is defined from the observed values of the input \(j\) in the data sample \(\aleph\). Formally, the split consists in selecting the combination \(\left(x_{j} ,s_{j} \right)\) which minimizes \(R\left(t_{L} \right)+R\left(t_{R} \right)=\frac{1}{n} \sum _{\left({\it \textbf{x}}_{i} ,{\it \textbf{y}}_{i} \right)\in t_{L} }\sum _{r=1}^{s}\left(y_{ri} -y_{r} \left(t_{L} \right)\right)^{2} +\frac{1}{n} \sum _{\left({\it \textbf{x}}_{i} ,{\it \textbf{y}}_{i} \right)\in t_{R} }\sum _{r=1}^{s}\left(y_{ri} -y_{r} \left(t_{R} \right)\right)^{2}\), where \(y_{r} \left(t\right)\) denotes the estimation of the \(r\)th output of the node \(t\). One of the most important aspects in the production context is how to define \(y_{r} \left(t\right)\) in each node so that the free disposability property is fulfilled during the growing process of the tree. In this sense, the notion of Pareto-dominance between nodes introduced in Esteve et al. (2020) is highly relevant.
As described above, each node \(t\) is defined by a series of conditions in the input space of the form \(\left\{x_{j} <s_{j} \right\}\) or \(\left\{x_{j} \ge s_{j} \right\}\). In this sense, after executing the split, a region in the input space is created. This region is called the “support” of node \(t\) and is defined as \({\rm supp}\left(t\right)=\left\{{\it \textbf{x}}\in R_{+}^{m} :a_{j}^{t} \le x_{j} <b_{j}^{t} ,j=1,...,m\right\}\). The parameters \(a_{j}^{t}\) and \(b_{j}^{t}\) originate from the several thresholds selected during the splitting process. Given the notion of the support of a node, it is possible to establish the concept of Pareto-dominance. Let \(k=1,...,K\) be the total number of splits executed. Let \(T_{k} \left(\aleph \right)\) be the tree built after the \(k\)th split. Let \(\tilde{T}_{k} \left(\aleph \right)\) be the set of leaves in the tree \(T_{k} \left(\aleph \right)\). More notation: let \(t^{*} \in \tilde{T}_{k} \left(\aleph \right)\) be the node to be split in a certain step of the algorithm; then \(T\left(k|t^{*} \to t_{L} ,t_{R} \right)\) denotes the tree associated with this specific split. Let \(k=1,...,K\) and \(t\in \tilde{T}_{k} \left(\aleph \right)\); then the set of Pareto-dominant nodes of node \(t\) is defined as \(P_{T_{k} \left(\aleph \right)} \left(t\right)=\{t'\in \tilde{T}_{k} \left(\aleph \right)\setminus \left\{t\right\}:\exists {\it \textbf{x}}\in {\rm supp}\left(t\right),\exists {\it \textbf{x'}}\in {\rm supp}\left(t'\right)\, \, {\rm such\; that}\,\) \({\it \textbf{x'}} \le {\it \textbf{x}}\}\). \(P_{T_{k} \left(\aleph \right)} \left(t\right)\) contains all the nodes such that at least one input vector in the corresponding support, not necessarily observed, dominates at least one input vector belonging to the support of node \(t\) (in the Pareto sense). In practice, it is only necessary to compare the components of \({\it \textbf{a}}^{t'}\) and \({\it \textbf{b}}^{t}\).
Specifically, \({\it \textbf{a}}^{t'} <{\it \textbf{b}}^{t}\) if and only if \(t'\in P_{T_{k} \left(\aleph \right)} \left(t\right)\).
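The practical test \({\it \textbf{a}}^{t'} < {\it \textbf{b}}^{t}\) reduces to a component-wise comparison of the supports. A minimal Python sketch (supports represented as half-open boxes \((\textbf{a}, \textbf{b})\); all names are ours):

```python
def dominant_set(leaves, t):
    """P_T(t): indices of leaves t' != t whose support contains a point
    that Pareto-dominates some point of supp(t).  Each support is a
    pair (a, b) describing the half-open box a_j <= x_j < b_j, so the
    test is simply the component-wise comparison a^{t'} < b^{t}."""
    _, b_t = leaves[t]
    return [i for i, (a_tp, _) in enumerate(leaves)
            if i != t and all(x < y for x, y in zip(a_tp, b_t))]
```

For instance, with three leaves partitioning \([0,4)\times[0,4)\) as `leaves = [((0, 0), (2, 2)), ((2, 0), (4, 2)), ((0, 2), (2, 4))]`, the first leaf Pareto-dominates both of the others, while neither of those dominates it.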
Now, we return to how to estimate the outputs in each child node with the aim of guaranteeing the satisfaction of free disposability. For any node \(t^{*} \in \tilde{T}_{k} \left(\aleph \right)\), the way to estimate the value of the outputs for the right child node is through the estimation of the outputs of its parent node, i.e., \(y_{r} \left(t_{R} \right)=y_{r} (t^{*} ),\, \, r=1,...,s,\) while the estimation of outputs for the left child node is:
\[\begin{equation} \label{(3)} y_{r} \left(t_{L} \right)=\max \left\{\max \left\{y_{ri} :\left({\it \textbf{x}}_{i} ,{\it \textbf{y}}_{i} \right)\in t_{L} \right\},y_{r} \left(I_{T\left(k|t^{*} \to t_{L} ,t_{R} \right)} \left(t_{L} \right)\right)\right\},\, \, r=1,...,s, \end{equation}\]where \(y_{r} \left(I_{T\left(k|t^{*} \to t_{L} ,t_{R} \right)} \left(t_{L} \right)\right)= \max \left\{y_{r} \left(t'\right):t'\in I_{T\left(k|t^{*} \to t_{L} ,t_{R} \right)} \left(t_{L} \right)\right\}\) and \(y_{r} \left(t'\right)\) is the estimation of the output \(y_{r}\) at node \(t'\in \tilde{T}\left(k|t^{*} \to t_{L} ,t_{R} \right)\), \(r=1,...,s\). This way of estimating the output values guarantees free disposability.
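Equation (3) can be sketched as follows (Python; the names and the representation of the dominant-node predictions as a plain list are ours): the left child's estimate for each output is the larger of the maximal observed value inside the child and the predictions of the nodes that dominate it, which is precisely what enforces free disposability.

```python
def left_child_outputs(left_Y, dominant_preds, s):
    """Equation (3): for each output r, y_r(t_L) is the max of
    (i) the largest observed r-th output inside t_L and
    (ii) the estimates y_r(t') of the nodes t' dominating t_L.
    `dominant_preds` lists the output vectors of those dominant
    nodes and may be empty."""
    est = []
    for r in range(s):
        obs_max = max(y[r] for y in left_Y)
        dom_max = max((p[r] for p in dominant_preds), default=float("-inf"))
        est.append(max(obs_max, dom_max))
    return est
```

For instance, with observed outputs `[[1, 5], [2, 3]]` in the left child and one dominant node predicting `[3, 1]`, the estimate becomes `[3, 5]`: the first output is raised by the dominant node, the second comes from the observations.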
Accordingly, the algorithm selects the best pair \(\left(x_{j^{*} } ,s_{j^{*} } \right)\) such that the sum of the MSE of the left and right child nodes is minimized. Once the split of node \(t^{*}\) is executed, the tree \(T\left(k|t^{*} \to t_{L}^{*} ,t_{R}^{*} \right)\) is obtained. This process continues until bipartition is not possible because all the data in a node have the same input values, or until a certain stopping rule is satisfied. The usual stopping rule is \(n\left(t\right)\le n_{\min } = 5\), where \(n(t)\) is the sample size of node \(t\). The final tree built is denoted as \(T_{\max } \left(\aleph \right)\), which is usually a deep tree.
\(T_{\max} \left(\aleph \right)\) suffers from the same problem as FDH, i.e., overfitting. Esteve et al. (2020) proposed to prune the tree exploiting the same technique as Breiman et al. (1984). This pruning process resorts to the notion of the error-complexity measure \(R_{\alpha} \left(T\left(\aleph \right)\right)\), which combines a measure of the accuracy of the tree, defined as the sum of the MSE determined at each leaf node, and a measure of the number of leaf nodes. \(R_{\alpha } \left(T\left(\aleph \right)\right)\) depends on a parameter \(\alpha\), which balances the values of the two components of the error: \(R_{\alpha } \left(T\left(\aleph \right)\right)=R\left(T\left(\aleph \right)\right)+\alpha \left|\tilde{T}\left(\aleph \right)\right|\). The idea behind the pruning of \(T_{\max } \left(\aleph \right)\) is to minimize \(R_{\alpha } \left(T\left(\aleph \right)\right)\). The pruning process is also based on cross-validation (see Breiman et al. (1984) for more details). The tree resulting from the pruning process is \(T^{*} \left(\aleph \right)\). This tree does not suffer from the overfitting problem. For this reason, the use of \(T^{*} \left(\aleph \right)\) is recommended rather than \(T_{\max } \left(\aleph \right)\), unless a descriptive analysis of the sample is required.
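The error-complexity measure is straightforward to compute once the per-leaf errors are known. A minimal Python sketch (names are ours; the actual pruning additionally relies on cross-validation to tune \(\alpha\) and on Breiman et al.'s weakest-link search to generate the candidate subtrees):

```python
def error_complexity(leaf_errors, alpha):
    """R_alpha(T) = R(T) + alpha * |T~|: the sum of the per-leaf MSEs
    plus a penalty proportional to the number of leaves."""
    return sum(leaf_errors) + alpha * len(leaf_errors)

def best_pruned(subtrees, alpha):
    """Among candidate pruned subtrees, each summarised by its list of
    per-leaf errors, keep the index of the one minimising R_alpha."""
    return min(range(len(subtrees)),
               key=lambda i: error_complexity(subtrees[i], alpha))
```

The role of \(\alpha\) is visible immediately: with candidate subtrees `[[1.0, 1.0, 1.0], [2.0, 2.0], [5.0]]`, a small \(\alpha = 0.5\) favours the deepest (most accurate) subtree, while \(\alpha = 2.0\) favours the single-leaf tree.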
Finally, \({\it \textbf{d}}_{T^{*} \left(\aleph \right)} \left({\it \textbf{x}}\right)\) will denote hereinafter the multidimensional estimator defined from \(T^{*} \left(\aleph \right)\) and the sample \(\aleph\), i.e., \(d_{rT^{*} \left(\aleph \right)} \left({\it \textbf{x}}\right)=\sum _{t\in \tilde{T}^{*} \left(\aleph \right)}y_{r} \left(t\right)I\left({\it \textbf{x}}\in t\right)\), for all \(r=1,...,s\), with \(I\left(\cdot \right)\) being the indicator function. From this estimator, it is possible to define a production possibility set or technology estimated from the Efficiency Analysis Trees technique as:
\[\begin{equation} \label{(4)} \hat{\Psi}_{T^{*} (\aleph)} =\left\{\left({\it \textbf{x}},{\it \textbf{y}}\right)\in R_{+}^{m+s} :{\it \textbf{y}}\le {\it \textbf{d}}_{T^{*} \left(\aleph \right)} \left({\it \textbf{x}}\right)\right\}. \end{equation}\]\(\hat{\Psi }_{T^{*} \left(\aleph \right)}\) satisfies free disposability and the deterministic quality.
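Since the leaf supports partition the input space, evaluating the estimator \({\it \textbf{d}}_{T^{*} \left(\aleph \right)}\) reduces to locating the single leaf whose support contains the evaluated input vector and returning its output estimate. A minimal Python sketch (leaves represented as half-open boxes with their output estimates; names are ours):

```python
def eat_predict(x, leaves):
    """d_{T*}(x): return the output estimate of the unique leaf whose
    support contains x.  Each leaf is (a, b, y_hat), where the support
    is the half-open box a_j <= x_j < b_j; since the supports partition
    the input space, exactly one leaf matches."""
    for a, b, y_hat in leaves:
        if all(aj <= xj < bj for aj, xj, bj in zip(a, x, b)):
            return y_hat
    raise ValueError("x falls outside every leaf support")
```

With a single input split at 2, `leaves = [((0,), (2,), [5]), ((2,), (float("inf"),), [9])]`, a point with input 1 receives the estimate `[5]`, and a point with input 3 receives `[9]`, reproducing the step-function shape of the frontier.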
By analogy with the existing relationship between FDH and DEA, it is possible to derive an estimation of \(\Psi\) by the convexification of the set \(\hat{\Psi }_{T^{*}}\). In this sense, the convexification of the production possibility set derived from EAT would be as follows:
\[\begin{equation} \label{(5)} conv\left(\hat{\Psi}_{T^{*} } \right)=\left\{\left({\it \textbf{x}},{\it \textbf{y}}\right)\in R_{+}^{m+s} :{\it \textbf{x}}\ge \sum _{t\in \tilde{T}^{*} }\lambda _{t} {\it \textbf{a}}^{t} ,\; {\it \textbf{y}}\le \sum _{t\in \tilde{T}^{*} }\lambda _{t} {\it \textbf{d}}_{T^{*} } \left({\it \textbf{a}}^{t} \right),\; \sum _{t\in \tilde{T}^{*} }\lambda_{t} =1,\; {\it \boldsymbol{\lambda}}\ge 0_{\left|\tilde{T}^{*} \right|} \right\}. \end{equation}\]Under the convexity assumption, the EAT methodology is known as the Convexified Efficiency Analysis Trees technique (hereinafter referred to as CEAT) (see Aparicio et al. 2021).
In this section, we briefly describe the extension of the approach by Esteve et al. (2020) to the context of using ensembles of trees to provide estimates of production frontiers (see Esteve et al. 2021). Specifically, we briefly revise the way to adapt the standard Random Forest (Breiman 2001) for estimating production frontiers satisfying fundamental postulates of microeconomics, such as free disposability. The adaptation of Random Forest to the estimation of production frontiers by Esteve et al. (2021) is the first one that focuses on the introduction of a methodology for measuring technical efficiency that is robust to the resampling of data and, at the same time, to the specification of input variables.
Data robustness and resampling methods for input modeling are both topics of interest in the literature on technical efficiency measurement. Regarding robustness to data, Simar and Wilson (1998, 2000a, 2000b) were the first to adapt the bootstrapping methodology (Efron 1979) to the context of DEA and FDH. As regards the importance of the robustness of input and output variables in nonparametric efficiency analysis, since the beginning of DEA and FDH, researchers have been aware that the selection of the input and output variables to be considered is one of the crucial issues in the specification of the model. In practice, the researchers' previous experience may lead to the selection of some inputs and outputs considered essential to represent the underlying technology. However, there may be other variables whose inclusion in the model the analyst is not always sure of (Pastor et al. 2002). Some approaches focus on balancing the experience of researchers with the information provided by observations (see, for example, Banker 1993, 1996; Pastor et al. 2002). Another recent approach is based, in contrast, on determining efficiency scores that are robust to variable selection by considering all the possible combinations of inputs and outputs and their aggregation (Landete et al. 2017).
On the whole, Random Forest (Breiman 2001) is an ensemble learning method that works by constructing a multitude of decision trees by CART (Breiman et al. 1984) at training time and aggregating the information of the individual trees into a final prediction value. In particular, when Random Forest is applied to regression problems, the final estimator corresponds to the mean of the individual predictions (Breiman 2001). Random Forest modifies the growing process of each individual tree in two ways: (i) it applies bootstrapping on the training data for each individual tree, and (ii) it selects a random subset of the predictors in each iteration. In this way, given a learning sample \(\aleph\) of size \(n\), Random Forest repeatedly selects random samples of size \(n\) with replacement from the set \(\aleph\). Then, the method fits the trees to these samples but, to do this, it uses a modified tree learning algorithm that chooses, in each candidate division of the learning process, a random subset of predictors. The reason for doing this is the instability of the model: individual decision trees, such as CART, are known to be very unstable (Berk 2016), meaning that completely different tree structures are obtained when the training data are modified slightly. In this way, the result of applying Random Forest is an estimator that overcomes overfitting and instability problems in general, resulting in a substantial reduction in variance.
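The resampling step (i) can be illustrated in a few lines of base R. This is a sketch only, not the package implementation, and the object names (boot_samples, oob) are ours:

```r
# Draw p bootstrap samples of size n with replacement, as Random Forest
# does for each tree; observations not drawn in a given sample are
# "out-of-bag" (OOB) for that tree.
set.seed(1)
n <- 10  # size of the learning sample
p <- 3   # number of trees
boot_samples <- lapply(seq_len(p),
                       function(q) sample(seq_len(n), n, replace = TRUE))
oob <- lapply(boot_samples, function(idx) setdiff(seq_len(n), idx))
lengths(boot_samples)  # every bootstrap sample has size n
```

Each tree is then grown on its own bootstrap sample, and its OOB set is used later for error estimation.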
The algorithm associated with the adaptation of the Random Forest technique to the world of technical efficiency assessment, called RF+EAT, is introduced in Esteve et al. (2021). The steps that must be carried out in Random Forest for Efficiency Analysis Trees are shown in Algorithm 1. This algorithm is based on the typical Random Forest algorithm that can be found in Kuhn et al. (2013). In Algorithm 1, the first step consists of selecting the number of trees that will make up the forest, that is, the hyperparameter \(p\). Then, \(p\) (bootstrap) random samples from the original data sample are generated with replacement. Next, the Efficiency Analysis Trees algorithm by Esteve et al. (2020) is applied to each subsample applying the stopping rule \(n\left(t\right)\le n_{\min}\), but without pruning. Also, in this algorithm, \(n_{\min}\) is treated as an additional hyperparameter that could be tuned. During the execution of the Efficiency Analysis Trees algorithm, a subset of input variables (\(mtry\)) from the original set is randomly selected each time the splitting subroutine is applied. To do that, one of the following five rules of thumb from the literature is used:
Breiman’s Rule: \(mtry=\frac{m}{3}\),
Rule DEA1: \(mtry=\frac{n\left(t\right)}{2}-s\) (Golany and Roll 1989; Homburg 2001),
Rule DEA2: \(mtry=\frac{n\left(t\right)}{3}-s\) (Nunamaker 1985; Banker et al. 1989; Friedman and Sinuany-Stern 1998; Raab and Lichty 2002),
Rule DEA3: \(mtry=\frac{n\left(t\right)}{2s}\) (Dyson et al. 2001),
Rule DEA4: \(mtry=\min \left\{\frac{n\left(t\right)}{s}, \frac{n\left(t\right)}{3}-s\right\}\) (Cooper et al. 2007).
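As a sketch, the five rules above can be wrapped in a small base-R helper. The name mtry_rule is hypothetical, and the truncation of the result to an integer in \([1, m]\) is our assumption for illustration:

```r
# Compute mtry for a node with n_t observations, m inputs and s outputs,
# following one of the five rules of thumb listed above.
mtry_rule <- function(rule, m, s, n_t) {
  v <- switch(rule,
              BRM  = m / 3,
              DEA1 = n_t / 2 - s,
              DEA2 = n_t / 3 - s,
              DEA3 = n_t / (2 * s),
              DEA4 = min(n_t / s, n_t / 3 - s),
              stop("unknown rule"))
  max(1, min(m, floor(v)))  # keep the result a valid number of inputs
}
mtry_rule("BRM", m = 13, s = 3, n_t = 20)   # 4
mtry_rule("DEA1", m = 13, s = 3, n_t = 20)  # 7
```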
Once Algorithm 1 has been applied, \(p\) fitted trees are determined with the aim of obtaining an output estimation given an input vector \({\it \textbf{x}}\in R_{+}^{m}\). In this regard, we have \(T\left(\aleph _{1} \right),...,T\left(\aleph _{p} \right)\) tree structures derived from the application of the Efficiency Analysis Trees algorithm on the \(p\) bootstrap subsamples \(\aleph _{1} ,...,\aleph _{p}\). Given an input vector \({\it \textbf{x}}\in R_{+}^{m}\), an output estimator is determined by averaging the individual estimators corresponding to each tree:
\[\begin{equation} \label{(6)} {\it \textbf{y}}^{RF+EAT\left(\aleph \right)} \left({\it \textbf{x}}\right):=\frac{1}{p} \sum_{q=1}^{p}{\it \textbf{d}}_{T\left(\aleph _{q} \right)} \left({\it \textbf{x}}\right), \end{equation}\]where \({\it \textbf{d}}_{T\left(\aleph _{q} \right)} \left({\it \textbf{x}}\right)\) denotes the output estimator associated with each tree structure \(T\left(\aleph _{q} \right)\), given an input vector \({\it \textbf{x}}\in R_{+}^{m}\).
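In base R, the averaging in (6) reduces to a column-wise mean of the individual tree estimations. The numbers below are hypothetical:

```r
# One row per tree, one column per output: d_{T(N_q)}(x) for q = 1,...,p
tree_preds <- rbind(c(500, 490, 505),
                    c(520, 500, 515),
                    c(510, 495, 510))
y_rf_eat <- colMeans(tree_preds)  # RF+EAT output estimation for x
y_rf_eat  # 510 495 510
```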
In addition, this estimator allows the technology or production possibility set to be defined as:
\[\begin{equation} \label{(7)} \hat{\Psi }_{RF+EAT} =\left\{\left({\it \textbf{x}},{\it \textbf{y}}\right)\in R_{+}^{m+s} :{\it \textbf{y}}\le {\it \textbf{y}}^{RF+EAT\left(\aleph \right)} \left({\it \textbf{x}}\right)\right\}. \end{equation}\]As happens with the standard Random Forest, Random Forest for Efficiency Analysis Trees also exploits the Out-Of-Bag (OOB) concept. The OOB estimate at observation \(\left({\it \textbf{x}}_{i} ,{\it \textbf{y}}_{i} \right)\) consists in evaluating the prediction of the ensemble just using the individual models \(T\left(\aleph _{q} \right)\) whose corresponding bootstrap samples \(\aleph _{q}\) are such that \(\left({\it \textbf{x}}_{i},{\it \textbf{y}}_{i} \right)\notin \aleph _{q}\). From this definition, the generalization error is defined as the average of the OOB estimates calculated over all the observations in the learning sample \(\aleph\):
\[\begin{equation} \label{(8)} err^{RF+EAT\left(\aleph \right)} =\frac{1}{n} \sum_{\left(x_{i} ,y_{i} \right)\in \aleph }\sum_{r=1}^{s}\left(y_{ri} - y_{r}^{RF+EAT\left(\aleph \right)} \left({\it \textbf{x}}_{i} \right)\right)^{2}. \end{equation}\]The generalization error is useful for determining a measure of variable importance, which can be used for creating a sorted list of the inputs \(x_{1} ,...,x_{m}\). The way to calculate the importance of variable \(x_{j}\) is: firstly, generate a new database, \(\aleph ^{j}\), identical to the original one \(\aleph\), except that the values of variable \(x_{j}\) are randomly permuted; secondly, apply Algorithm 1 on the new 'virtual' learning sample \(\aleph ^{j}\); thirdly, determine the value of the generalization error, i.e., \(err^{RF+EAT\left(\aleph ^{j} \right)}\); and, finally, calculate the percentage increase of the generalization error when variable \(x_{j}\) is shuffled as:
\[\begin{equation} \label{(9)} \% Inc^{RF+EAT} \left(x_{j} \right)=100\cdot \left(\frac{err^{RF+EAT\left(\aleph ^{j} \right)} - err^{RF+EAT\left(\aleph \right)} }{err^{RF+EAT\left(\aleph \right)}} \right). \end{equation}\]

Data are managed as a regular R data.frame
in the eat functions (a matrix is often accepted, but it will be converted to a data.frame during the functions' preprocessing). The main functions of the eat package are EAT() and RFEAT(), which return structured objects named EAT and RFEAT, respectively. These objects contain fields with relevant information, such as the estimation results or the arguments introduced by the user in the function call.
The fields of the EAT object are the following:
- data: Contains the input and output variables.
  - df: Data introduced by the user in a data.frame structure after being preprocessed.
  - x: Input indexes in df.
  - y: Output indexes in df.
  - input_names: Names of the input variables in df.
  - output_names: Names of the output variables in df.
  - row_names: Names of the observations in df.
- control: Contains the hyperparameters selected by the user.
  - fold: Number of folds into which df is divided to apply cross-validation.
  - numStop: Minimum number of observations in a node.
  - max.leaves: Maximum number of leaf nodes.
  - max.depth: Maximum number of nodes between the root node (not included) and the furthest leaf node.
  - na.rm: A logical variable that indicates whether NA rows should be ignored.
- tree: A list containing the nodes of the fitted Efficiency Analysis Trees model. Each node is made up of the following elements:
  - id: Node index.
  - F: Father node index.
  - SL: Left child node index.
  - SR: Right child node index.
  - index: Set of indexes corresponding to the observations in a node.
  - R: Error at the node.
  - xi: Index of the variable that produces the split in a node.
  - s: Threshold of the variable xi.
  - a: The components of the vector \({\it \textbf{a}}^{t}\).
  - b: The components of the vector \({\it \textbf{b}}^{t}\).
- nodes_df: Contains the following information related to the nodes of the fitted Efficiency Analysis Trees model in a data.frame structure:
  - id: Node index.
  - N: Number of observations in a node.
  - Proportion: Proportion of observations in a node.
  - y: Fitted values.
  - R: Error at the node.
  - index: Indexes of the observations in a node.
- model: Contains the following information related to the fitted Efficiency Analysis Trees model:
  - nodes: Number of nodes in the tree.
  - leaf_nodes: Number of leaf nodes in the tree.
  - a: The components of the vector \({\it \textbf{a}}^{t}\).
  - y: Output estimation for each leaf node.

Regarding the RFEAT object, it contains the following fields:
- data: Same fields as in the EAT object.
- control: Contains the hyperparameters selected by the user.
  - numStop: Minimum number of observations in a node.
  - m: Number of trees that make up the random forest.
  - s_mtry: Number of inputs that can be randomly selected in each split.
  - na.rm: A logical variable that indicates whether NA rows should be ignored.
- forest: A list containing the individual Efficiency Analysis Trees that make up the random forest.
- Error: The Out-Of-Bag error of the random forest.
- OOB: A list containing the observations used for training each Efficiency Analysis Tree that makes up the random forest.
We illustrate all the models presented in this paper resorting to a single dataset (PISAindex) available in the eat package. Our dataset consists of 72 countries with 3 outputs and 13 inputs. The output data have been collected by the PISA (Programme for International Student Assessment) 2018 survey (OECD 2018) and refer to the average score in the mathematics, reading and science domains for schools in these countries. Regarding the input data, the variables have been collected from the Social Progress Index (2018) and are related to the socioeconomic environment of these countries. These inputs can be classified into four blocks as follows:
- Basic Human Needs: Nutrition and Basic Medical Care (NBMC), Water and Sanitation (WS), Shelter (S) and Personal Safety (PS).
- Foundations of Wellbeing: Access to Basic Knowledge (ABK), Access to Information and Communications (AIC), Health and Wellness (HW) and Environmental Quality (EQ).
- Opportunity: Personal Rights (PR), Personal Freedom and Choice (PFC), Inclusiveness (I) and Access to Advanced Education (AAE).
- A fourth block consisting of Gross Domestic Product per capita, based on purchasing power parity (GDP_PPP).

Finally, in order to simplify the examples and reduce computation time, a subset of variables is selected as follows:
# Inputs (5): PR, PFC, I, AAE, GDP_PPP
# Outputs (3): S_PISA, R_PISA, M_PISA
PISAindex <- PISAindex[, c(3, 4, 5, 14, 15, 16, 17, 18)]
head(PISAindex)
S_PISA R_PISA M_PISA PR PFC I AAE GDP_PPP
SGP 551 549 569 71.70 87.90 48.26 74.31 97.745
JPN 529 504 527 94.07 82.40 62.32 81.29 41.074
KOR 519 514 526 92.71 79.06 63.54 86.32 41.894
EST 530 523 523 95.67 84.10 55.58 73.16 35.308
NLD 503 485 519 96.34 89.04 75.82 82.99 56.455
POL 511 512 516 86.41 78.25 57.58 76.21 31.766
Table 2 reports the descriptive statistics for these variables (outputs and inputs).
| variable | type   | mean   | sd    | min    | median | max    |
|----------|--------|--------|-------|--------|--------|--------|
| S_PISA   | output | 455.06 | 48.32 | 336.00 | 466.00 | 551.00 |
| R_PISA   | output | 450.89 | 50.52 | 340.00 | 466.00 | 549.00 |
| M_PISA   | output | 454.81 | 52.17 | 325.00 | 463.50 | 569.00 |
| PR       | input  | 81.62  | 17.98 | 21.14  | 88.40  | 98.07  |
| PFC      | input  | 75.42  | 11.03 | 47.25  | 78.19  | 91.65  |
| I        | input  | 54.17  | 17.07 | 12.37  | 55.51  | 81.91  |
| AAE      | input  | 69.87  | 10.75 | 48.37  | 71.65  | 90.43  |
| GDP_PPP  | input  | 36.04  | 21.91 | 7.44   | 31.42  | 114.11 |
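A summary table like Table 2 can be reproduced with base R alone. The sketch below uses a two-column toy data.frame rather than the full PISAindex data:

```r
# Column-wise descriptive statistics (mean, sd, min, median, max)
df <- data.frame(S_PISA = c(551, 529, 519), PR = c(71.70, 94.07, 92.71))
stats <- t(sapply(df, function(v) c(mean = mean(v), sd = sd(v),
                                    min = min(v), median = median(v),
                                    max = max(v))))
round(stats, 2)  # one row per variable, one column per statistic
```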
In this section, we introduce the main functions of the library related to Efficiency Analysis Trees and Random Forest for Efficiency Analysis Trees. To execute the following examples, the package eat must be loaded and the seed 100 must be set for reproducibility.
# We load the package and set the seed
library(eat)
set.seed(100)
The basic model of Efficiency Analysis Trees that we explained in subsection Efficiency Analysis Trees can be implemented in R using the function EAT():
EAT(
data, x, y,
numStop = 5,
fold = 5,
max.depth = NULL,
max.leaves = NULL,
na.rm = TRUE
)
The EAT() function is the cornerstone of the eat library. The minimum arguments of this function are the data (data) containing the study variables, the indexes of the predictor variables or inputs (x) and the indexes of the predicted variables or outputs (y). Additionally, the numStop, fold, max.depth and max.leaves arguments are included for more experienced users in the fields of machine learning and tree-based models. Modifying these four hyperparameters allows obtaining different frontier estimates and therefore selecting the one that best suits the needs of the analysis. The description of these parameters is as follows:
- numStop refers to the minimum number of observations in a node for it to be split and is directly related to the size of the tree. The higher the value of numStop, the smaller the size of the tree.
- fold refers to the number of parts into which the data is divided to apply the cross-validation technique. Variations in the fold argument are not directly related to the size of the tree.
- max.depth limits the number of nodes between the root node (not included) and the furthest leaf node. When this argument is introduced, the typical growth-pruning process is not carried out. In this case, the tree is allowed to grow to the required depth.
- max.leaves determines the maximum number of leaf nodes. As with max.depth, the growth-pruning process is not performed. In this respect, the tree grows until the required number of leaf nodes is reached, and then the tree is returned.

Notice that including the arguments max.depth or max.leaves reduces the computation time by eliminating the pruning procedure. However, the pruning process is preferred if the objective of the study is inferential instead of simply descriptive. If both are included at the same time, a warning message is displayed and only max.depth is used.
As an example, using data from subsection Dataset and statistical sources, we next create a multi-response tree as follows. Results are returned as an EAT object, as explained in Section Data structure.
modelEAT <- EAT(data = PISAindex, x = 4:8, y = 1:3)
The basic model of Random Forest for Efficiency Analysis Trees that we explained in subsection Random Forest for Efficiency Analysis Trees can be implemented in R using the function RFEAT():
RFEAT(
data, x, y,
numStop = 5,
m = 50,
s_mtry = "BRM",
na.rm = TRUE
)
The RFEAT() function has also been developed with the aim of providing greater statistical robustness to the results obtained by the EAT() function. The RFEAT() function requires the data containing the variables for the analysis, x and y corresponding to the input and output indexes respectively, the minimum number of observations in a node for a split to be attempted (numStop) and na.rm to ignore observations with NA cells. All these arguments are used for the construction of the \(p\) (denoted by m in the RFEAT() function) individual Efficiency Analysis Trees that make up the random forest. Finally, the argument s_mtry indicates the number of inputs that can be randomly selected in each split. It can be set to any integer, although there are also certain predefined values. Let \(m\) be the number of inputs, let \(s\) be the number of outputs and let \(n(t)\) be the number of observations in a node. Then, the predefined values for s_mtry are:
- BRM = \(\frac{m}{3}\),
- DEA1 = \(\frac{n\left(t\right)}{2}-s\),
- DEA2 = \(\frac{n\left(t\right)}{3}-s\),
- DEA3 = \(\frac{n\left(t\right)}{2s}\),
- DEA4 = \(\min \left\{\frac{n\left(t\right)}{s}, \frac{n\left(t\right)}{3}-s \right\}\).
As an example, using data from subsection Dataset and statistical sources, we next create a forest with 30 trees. Results are returned as an RFEAT object, as explained in Section Data structure.
modelRFEAT <- RFEAT(data = PISAindex, x = 4:8, y = 1:3, m = 30)
The estimators of Efficiency Analysis Trees and Random Forest for Efficiency Analysis Trees can be computed in R using the function predict():
predict(
object,
newdata,
x, ...
)
Regarding the arguments of predict(), object can be an EAT or an RFEAT object, newdata refers to a data.frame and x to the set of inputs to be used. This function returns a data.frame with the expected output for a set of observations. For predictions using an EAT object, only one tree is used. However, for the RFEAT model, the output is predicted by each of the \(p\) (m in the RFEAT() function) individual trees trained, and subsequently the mean value of all predictions is obtained.
As an example, we next evaluate the last 3 DMUs from the data of subsection Dataset and statistical sources with the corresponding EAT and RFEAT models. Results are returned in a data.frame structure with the output predictions:
S_PISA_pred R_PISA_pred M_PISA_pred
1 428 424 437
2 377 359 368
3 377 359 368
S_PISA_pred R_PISA_pred M_PISA_pred
1 439.9667 435.1333 441.2000
2 402.0667 389.0667 403.9000
3 399.0333 389.3333 399.6333
In the same way, the user can also create a new data.frame and calculate predictions for it as follows:
new <- data.frame(AAE = c(61, 72), PR = c(76, 81), I = c(41, 55), GDP_PPP = c(19, 31),
                  PFC = c(67, 78))
predict(object = modelEAT, newdata = new, x = 1:5)
S_PISA_pred R_PISA_pred M_PISA_pred
1 428 424 421
2 481 479 488
The importance of the predictor variables in the Efficiency Analysis Trees methodology can be computed in R using the functions rankingEAT() or rankingRFEAT():
# Through Efficiency Analysis Trees
rankingEAT(
object,
barplot = TRUE,
threshold = 70,
digits = 2
)
# Through Random Forest for Efficiency Analysis Trees
rankingRFEAT(
object,
barplot = TRUE,
digits = 2
)
These functions allow a selection of variables by calculating a score of importance through Efficiency Analysis Trees or Random Forest for Efficiency Analysis Trees, respectively. These importance scores represent how influential each variable is in the model. Regarding Efficiency Analysis Trees [rankingEAT()], the notion of surrogate splits by Breiman et al. (1984) was implemented. In this regard, the measure of importance of a variable \(x_j\) is defined as the sum over all nodes of the decrease in mean squared error produced by the best surrogate split on \(x_j\) at each node (see Definition 5.9 in Breiman et al. (1984)). Since only the relative magnitudes of these measures are of interest to researchers, the actual measures of importance that we report are normalized. In this way, the most important variable always has a value of 100, and the others are in the range 0 to 100. As for Random Forest for Efficiency Analysis Trees [rankingRFEAT()], (9) was implemented for each input variable. Regarding the available arguments of the functions, the user can specify the number of decimal units (digits) and include a barplot (from ggplot2) with the scores of importance (barplot). Additionally, the rankingEAT() function allows displaying a horizontal line in the graph to facilitate the cut-off point between important and irrelevant variables (threshold).
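The permutation idea behind (9) can be illustrated generically with a linear model in base R. This is a toy sketch, not the package code: permuting an influential predictor and refitting sharply increases the error, while permuting an irrelevant one barely changes it.

```r
set.seed(100)
x1 <- runif(200); x2 <- runif(200)
y  <- 3 * x1 + rnorm(200, sd = 0.1)  # y depends on x1 only
mse <- function(a, b) mean(lm(y ~ a + b)$residuals^2)
base_err <- mse(x1, x2)
# Percentage increase in error after permuting each predictor, as in (9)
inc_x1 <- 100 * (mse(sample(x1), x2) - base_err) / base_err  # large
inc_x2 <- 100 * (mse(x1, sample(x2)) - base_err) / base_err  # near zero
```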
As an example, we next use the objects modelEAT (an EAT object from the EAT() function) and modelRFEAT (an RFEAT object from the RFEAT() function) created in the previous section to assess the predictors used. These functions return the names of the predictor variables, the scores of importance (in the range 0-100 for the rankingEAT() function) and a barplot (without a horizontal line for the rankingRFEAT() function).
rankingEAT(object = modelEAT)
$scores
Importance
AAE 100.00
GDP_PPP 82.13
I 72.58
PR 72.45
PFC 29.07
$barplot
rankingRFEAT(object = modelRFEAT)
$scores
Importance
PR 1.75
PFC 1.64
GDP_PPP 2.16
I 2.87
AAE 3.67
$barplot
Note that negative scores may appear when calculating the importance of variables using the rankingRFEAT() function. A negative score can be interpreted as follows: if that variable were removed from the model, ceteris paribus, the predictive capacity of the model would improve.
Efficiency scores are numerical values that indicate the degree of efficiency of a set of Decision Making Units (DMUs). In the eat package, these scores can be calculated through an Efficiency Analysis Trees model, a Random Forest for Efficiency Analysis Trees model or a Convexified Efficiency Analysis Trees model. The code is as follows:
# For Efficiency Analysis Trees
efficiencyEAT(
data, x, y, object, scores_model, digits = 3,
FDH = TRUE, print.table = FALSE, na.rm = TRUE
)
# For Random Forest for Efficiency Analysis Trees
efficiencyRFEAT(
data, x, y, object, digits = 3,
FDH = TRUE, print.table = FALSE, na.rm = TRUE
)
# For Convexified Efficiency Analysis Trees
efficiencyCEAT(
data, x, y, object, scores_model, digits = 3,
DEA = TRUE, print.table = FALSE, na.rm = TRUE
)
A dataset (data) and the corresponding indexes of the input(s) (x) and output(s) (y) must be entered. It is recommended that the data with the DMUs whose efficiency is to be calculated coincide with those used to estimate the frontier. However, it is also possible to calculate the efficiency scores for new data. The efficiency scores are calculated using the mathematical programming model included in the argument scores_model. The following models are available:
- BCC.OUT: The output-oriented radial model (Banker et al. 1984).
- BCC.INP: The input-oriented radial model (Banker et al. 1984).
- RSL.OUT: The output-oriented Russell model (Färe and Lovell 1978).
- RSL.INP: The input-oriented Russell model (Färe and Lovell 1978).
- DDF: The Directional Distance Function (Chambers et al. 1998).
- WAM.MIP: The Measure of Inefficiency Proportions as a type of Weighted Additive Model (Lovell and Pastor 1995).
- WAM.RAM: The Range-Adjusted Measure of Inefficiency as a type of Weighted Additive Model (Lovell and Pastor 1995; Cooper et al. 1999).
FDH or DEA scores can optionally be computed by setting FDH = TRUE or DEA = TRUE, respectively. Finally, a summary descriptive table of the efficiency scores can be displayed with the argument print.table = TRUE.
The output-oriented radial model determines the efficiency score for \(\left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)\in R_{+}^{m+s}\) by equiproportionally increasing all its outputs while maintaining inputs constant: \(\phi \left({\it \textbf{x}}_{k}, {\it \textbf{y}}_{k} \right)=\max \{\phi _{k} \in R:\) \(\left({\it \textbf{x}}_{k} ,\phi_{k} {\it \textbf{y}}_{k} \right)\in \Psi \}\).
The efficiency score \(\phi \left({\it \textbf{x}}_{k}, {\it \textbf{y}}_{k} \right)\) can be estimated through FDH by plugging \(\hat{\Psi }_{FDH}\) from (2) into \(\max \left\{\phi _{k} \in R:\left({\it \textbf{x}}_{k} ,\phi _{k} {\it \textbf{y}}_{k} \right)\in \Psi \right\}\) in place of \(\Psi\). In that case, the optimization problem can be rewritten as a mixed-integer linear optimization program, as follows:
\[\begin{equation} \label{(10)} \begin{array}{lllll} {\phi ^{FDH} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)=} & {\max } & {\phi,} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge \phi y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} = 1,} & {} & {} \\ {} & {} & {\lambda _{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \end{array} \end{equation}\]The Linear Programming model that should be solved under Data Envelopment Analysis would be:
\[\begin{equation} \label{(11)} \begin{array}{lllll} {\phi ^{DEA} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)=} & {\max } & {\phi, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge \phi y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} =1,} & {} & {} \\ {} & {} & {\lambda _{i} \ge 0,} & {i=1,...,n} & {} \end{array} \end{equation}\]The following Mixed-Integer Linear Program should be solved for Efficiency Analysis Trees:
\[\begin{equation} \label{(12)} \begin{array}{lllll} {\phi ^{EAT} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)=} & {\max } & {\phi, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}} \lambda_{t} {\it a}_{j}^{t} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}} \lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge \phi y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} =1,} & {} & {} \\ {} & {} & {\lambda _{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*} } & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "BCC.OUT"
in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
scores_model = "BCC.OUT", digits = 2,
print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 1.03 0.04 1 1 1.01 1.01 1.16
FDH 1.01 0.02 1 1 1.00 1.00 1.12
EAT_BCC_OUT FDH_BCC_OUT
CHL 1.09 1.05
SAU 1.05 1.00
CAN 1.01 1.01
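As an aside, the FDH program (10) does not actually require a solver: by a well-known FDH property, it suffices to enumerate the observed DMUs whose inputs do not exceed those of the evaluated unit and take the largest value of \(\min_r y_{ri}/y_{rk}\). A base-R sketch on toy data (phi_fdh is a hypothetical helper, not a package function):

```r
phi_fdh <- function(X, Y, xk, yk) {
  # DMUs whose inputs are all <= the evaluated input vector xk
  dominates <- apply(X, 1, function(xi) all(xi <= xk))
  # Largest feasible equiproportional output expansion
  max(apply(Y[dominates, , drop = FALSE], 1, function(yi) min(yi / yk)))
}
X <- matrix(c(2, 4, 3), ncol = 1)     # one input, three DMUs
Y <- matrix(c(10, 20, 12), ncol = 1)  # one output
phi_fdh(X, Y, xk = 3, yk = 12)  # 1: unit (3, 12) lies on the FDH frontier
phi_fdh(X, Y, xk = 4, yk = 12)  # 20/12: unit 2 dominates this point
```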
Finally, the optimization model that should be solved for Convexified Efficiency Analysis Trees is:
\[\begin{equation} \label{(13)} \begin{array}{lllll} {\phi ^{CEAT} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)=} & {\max } & {\phi, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge \phi y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t}=1,} & {} & {} \\ {} & {} & {\lambda _{t} \ge 0,} & {t\in \tilde{T}^{*}} & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "BCC.OUT"
in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
scores_model = "BCC.OUT", digits = 2,
print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 1.11 0.07 1 1.05 1.09 1.09 1.31
DEA 1.05 0.04 1 1.01 1.05 1.05 1.18
CEAT_BCC_OUT DEA_BCC_OUT
BLR 1.07 1.00
SVK 1.07 1.04
RUS 1.04 1.00
In the case of the outputoriented radial model, Esteve et al. (2021) showed how this measure can be computed through Random Forest where \(\phi \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)=\max \left\{\phi _{k} \in R:\left({\it \textbf{x}}_{k} ,\phi_{k} {\it \textbf{y}}_{k} \right)\in \Psi \right\}\) can be estimated by substituting the theoretical production possibility set \(\Psi\) by its estimation \(\hat{\Psi}_{RF+EAT}\), i.e., \(\phi ^{RF+EAT} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)= \max \left\{\phi _{k} \in R:\left({\it \textbf{x}}_{k} ,\phi _{k} {\it \textbf{y}}_{k} \right)\in \hat{\Psi }_{RF+EAT} \right\}\). In particular, \(\phi^{RF+EAT} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)\) may be calculated as:
\[\begin{equation} \label{(14)} \phi^{RF+EAT} \left({\it \textbf{x}}_{k} ,{\it \textbf{y}}_{k} \right)={\mathop{\min }\limits_{r=1,...,s}} \left\{\frac{y_{r}^{RF+EAT\left(\aleph \right)} \left({\it \textbf{x}}_{k} \right)}{y_{rk} } \right\}, \end{equation}\]where \(y_{r}^{RF+EAT\left(\aleph \right)} \left({\it \textbf{x}}_{k} \right)\) is the estimation of the r-th output given the input bundle \({\it \textbf{x}}_{k}\).
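Formula (14) itself is a one-liner once the forest's output estimations for \({\it \textbf{x}}_{k}\) are available; the numbers below are hypothetical:

```r
y_hat <- c(460, 455, 470)  # y_r^{RF+EAT}(x_k) for r = 1, 2, 3
y_k   <- c(450, 460, 440)  # observed outputs of the evaluated DMU
phi_rf_eat <- min(y_hat / y_k)
phi_rf_eat  # 455/460: unlike FDH, RF+EAT scores may fall below 1
```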
In R, this model can be computed using efficiencyRFEAT():
scores <- efficiencyRFEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelRFEAT,
digits = 2, print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
RFEAT 1.03 0.04 0.94 1 1.02 1.02 1.15
FDH 1.01 0.02 1.00 1 1.00 1.00 1.12
RFEAT_BCC_OUT FDH_BCC_OUT
SGP 0.94 1
HUN 0.99 1
MEX 1.00 1
By analogy with the previous section, where the output-oriented radial model was shown, it is possible to calculate the input-oriented radial technical efficiency of the input-output bundle \((\textbf{x}_k, \textbf{y}_k)\) by solving the following Mixed-Integer Linear Program, counterpart to (10):
\[\begin{equation} \label{(15)} \begin{array}{lllll} {} & {\min } & {\theta, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le \theta_{} x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} =1,} & {} & {} \\ {} & {} & {\lambda_{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \end{array} \end{equation}\]The same type of technical measure can be estimated through DEA by convexification of the production frontier generated by FDH. Next, we show the Linear Programming model that should be solved in that case:
\[\begin{equation} \label{(16)} \begin{array}{lllll} {} & {\min } & {\theta, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le \theta x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} =1,} & {} & {} \\ {} & {} & {\lambda _{i} \ge 0,} & {i=1,...,n} & {} \end{array} \end{equation}\]The input-oriented radial model in the case of the Efficiency Analysis Trees technique can be determined through the following Mixed-Integer Linear Program:
\[\begin{equation} \label{(17)} \begin{array}{lllll} {} & {\min } & {\theta,} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le \theta x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda _{t} =1,} & {} & {} \\ {} & {} & {\lambda _{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*} } & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "BCC.INP"
in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
scores_model = "BCC.INP", digits = 2,
print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 0.94 0.06 0.69 0.90 0.96 0.96 1
FDH 0.98 0.03 0.90 0.97 1.00 1.00 1
EAT_BCC_INP FDH_BCC_INP
DEU 0.88 0.92
KAZ 1.00 1.00
SRB 0.99 1.00
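As in the output-oriented case, the FDH program (15) admits an enumeration form: among the observed DMUs whose outputs weakly dominate \({\it \textbf{y}}_{k}\), take the smallest value of \(\max_j x_{ji}/x_{jk}\). A base-R sketch on toy data (theta_fdh is a hypothetical helper, not a package function):

```r
theta_fdh <- function(X, Y, xk, yk) {
  # DMUs whose outputs are all >= the evaluated output vector yk
  dominates <- apply(Y, 1, function(yi) all(yi >= yk))
  # Smallest feasible equiproportional input contraction
  min(apply(X[dominates, , drop = FALSE], 1, function(xi) max(xi / xk)))
}
X <- matrix(c(2, 4, 3), ncol = 1)
Y <- matrix(c(10, 20, 12), ncol = 1)
theta_fdh(X, Y, xk = 4, yk = 12)  # 0.75: inputs could shrink to 75%
```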
Additionally, under the Convexified Efficiency Analysis Trees technique, the optimization model corresponding to the convexified production possibility set \(conv(\hat{\Psi }_{T^{*}})\) from (5) should be solved in order to determine an estimation of the input-oriented radial measure as follows:
\[\begin{equation} \label{(18)} \begin{array}{lllll} {} & {\min } & {\theta,} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le \theta x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} =1,} & {} & {} \\ {} & {} & {\lambda _{t} \ge 0,} & {t\in \tilde{T}^{*} } & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "BCC.INP" in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "BCC.INP", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 0.82 0.08 0.69 0.76 0.81 0.81 1
DEA 0.92 0.07 0.72 0.87 0.91 0.91 1
CEAT_BCC_INP DEA_BCC_INP
QAT 0.93 1.00
CYP 0.69 0.78
CHE 0.73 0.85
The output-oriented Russell measure under FDH must be calculated through the following optimization model:
\[\begin{equation} \label{(19)} \begin{array}{lllll} {} & {\max } & {\frac{1}{s} \sum_{r=1}^{s}\phi_{r},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge \phi_{r} y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i}=1,} & {} & {} \\ {} & {} & {\lambda_{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \\ {} & {} & {{\it \boldsymbol{\phi}}\ge {\bf 1}_{s}.} \end{array} \end{equation}\]Under DEA, the corresponding model would be:
\[\begin{equation} \label{(20)} \begin{array}{lllll} {} & {\max } & {\frac{1}{s} \sum_{r=1}^{s}\phi_{r},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge \phi_{r} y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i}=1,} & {} & {} \\ {} & {} & {\lambda_{i} \ge 0,} & {i=1,...,n} & {} \\ {} & {} & {{\it \boldsymbol{\phi}}\ge {\bf 1}_{s}.} \end{array} \end{equation}\]If we resort to the Efficiency Analysis Trees technique, then the model to be solved should be the following:
\[\begin{equation} \label{(21)} \begin{array}{lllll} {} & {\max } & {\frac{1}{s} \sum_{r=1}^{s}\phi_{r},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}\lambda_{t} {\it d}_{rT^{*}} ({\it \textbf{a}}^{t}) \ge \phi_{r} y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t}=1,} & {} & {} \\ {} & {} & {\lambda_{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \boldsymbol{\phi}}\ge {\bf 1}_{s}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "RSL.OUT" in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                        scores_model = "RSL.OUT", digits = 2,
                        print.table = TRUE)
scores %>% sample_n(3)
Finally, under the Convexified Efficiency Analysis Trees technique, the model would be:
\[\begin{equation} \label{(22)} \begin{array}{lllll} {} & {\max } & {\frac{1}{s} \sum_{r=1}^{s}\phi_{r}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge \phi_{r} y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} =1,} & {} & {} \\ {} & {} & {\lambda_{t} \ge 0,} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \boldsymbol{\phi}}\ge {\bf 1}_{s}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "RSL.OUT" in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "RSL.OUT", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 1.13 0.08 1 1.07 1.10 1.10 1.34
DEA 1.06 0.05 1 1.02 1.06 1.06 1.22
CEAT_RSL_OUT DEA_RSL_OUT
LVA 1.07 1.05
CHL 1.18 1.13
MAR 1.14 1.00
By analogy with the output-oriented Russell measure, the input-oriented Russell measure should be calculated through the following optimization models, depending on the selected approach. Under FDH, the model is:
\[\begin{equation} \label{(23)} \begin{array}{lllll} {} & {\min } & {\frac{1}{m} \sum_{j=1}^{m}\theta_{j}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le \theta_{j} x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} =1,} & {} & {} \\ {} & {} & {\lambda_{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \\ {} & {} & {{\it \boldsymbol{\theta}}\le {\bf 1}_{m}.} \end{array} \end{equation}\]Under DEA, the corresponding model would be:
\[\begin{equation} \label{(24)} \begin{array}{lllll} {} & {\min } & {\frac{1}{m} \sum_{j=1}^{m}\theta_{j},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le \theta_{j} x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i}=1,} & {} & {} \\ {} & {} & {\lambda _{i} \ge 0,} & {i=1,...,n} & {} \\ {} & {} & {{\it \boldsymbol{\theta}}\le {\bf 1}_{m}.} \end{array} \end{equation}\]If we resort to the Efficiency Analysis Trees technique, then the model to be solved should be the following:
\[\begin{equation} \label{(25)} \begin{array}{lllll} {} & {\min } & {\frac{1}{m} \sum_{j=1}^{m}\theta_{j}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le \theta_{j} x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t}=1,} & {} & {} \\ {} & {} & {\lambda _{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \boldsymbol{\theta}}\le {\bf 1}_{m}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "RSL.INP" in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                        scores_model = "RSL.INP", digits = 2,
                        print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 0.58 0.09 0.43 0.52 0.56 0.56 0.81
FDH 0.87 0.10 0.59 0.81 0.86 0.86 1.00
EAT_RSL_INP FDH_RSL_INP
LBN 0.58 0.73
MAR 0.69 0.97
MKD 0.58 0.87
Finally, under the Convexified Efficiency Analysis Trees technique, the model would be:
\[\begin{equation} \label{(26)} \begin{array}{lllll} {} & {\min } & {\frac{1}{m} \sum_{j=1}^{m}\theta_{j}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le \theta_{j} x_{jk} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} =1,} & {} & {} \\ {} & {} & {\lambda _{t} \ge 0,} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \boldsymbol{\theta}}\le {\bf 1}_{m}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "RSL.INP" in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "RSL.INP", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 0.54 0.08 0.43 0.49 0.53 0.53 0.79
DEA 0.80 0.11 0.59 0.74 0.78 0.78 1.00
CEAT_RSL_INP DEA_RSL_INP
ARG 0.44 0.59
LUX 0.44 0.65
CHE 0.49 0.74
Chambers et al. (1998) introduced the directional distance function (DDF) as a technical efficiency measure that projects \((\textit{x}_k, \textit{y}_k)\) through a preassigned direction \(\textbf{g}=(-\textbf{g}^{-},+\textbf{g}^{+}) \ne \textbf{0}_{m+s}\), \(\textbf{g}^{-} \in R^{m}\), \(\textbf{g}^{+} \in R^{s}\), to the efficiency frontier of the corresponding technology. Under FDH, the DDF is calculated as follows:
\[\begin{equation} \label{(27)} \begin{array}{lllll} {} & {\max } & {\beta_{k}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} -\beta_{k} g_{j}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} +\beta_{k} g_{r}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda _{i} =1,} & {} & {} \\ {} & {} & {\lambda _{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \end{array} \end{equation}\]The corresponding linear program in DEA is as follows:
\[\begin{equation} \label{(28)} \begin{array}{lllll} {} & {\max } & {\beta_{k}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} -\beta_{k} g_{j}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} +\beta_{k} g_{r}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i}=1,} & {} & {} \\ {} & {} & {\lambda_{i} \ge 0,} & {i=1,...,n} & {} \end{array} \end{equation}\]In the context of Efficiency Analysis Trees, the DDF is calculated through the following Mixed-Integer Linear Program:
\[\begin{equation} \label{(29)} \begin{array}{lllll} {} & {\max } & {\beta_{k}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t}{\it a}_{j}^{t} \le x_{jk} -\beta_{k} g_{j}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t}{\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} +\beta_{k} g_{r}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t}=1,} & {} & {} \\ {} & {} & {\lambda _{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*}. } & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "DDF" in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                        scores_model = "DDF", digits = 2,
                        print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 0.02 0.02 0 0 0.01 0.01 0.13
FDH 0.01 0.01 0 0 0.00 0.00 0.05
EAT_DDF FDH_DDF
MDA 0.00 0.00
LVA 0.00 0.00
BRA 0.03 0.01
In the case of Convexified Efficiency Analysis Trees, the optimization model is as follows.
\[\begin{equation} \label{(30)} \begin{array}{lllll} {} & {\max } & {\beta _{k}, } & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} -\beta_{k} g_{j}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} +\beta_{k} g_{r}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*} }^{}\lambda_{t}=1,} & {} & {} \\ {} & {} & {\lambda _{t} \ge 0,} & {t\in \tilde{T}^{*}. } & {} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "DDF" in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "DDF", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 0.07 0.04 0 0.05 0.06 0.06 0.18
DEA 0.03 0.03 0 0.00 0.03 0.03 0.12
CEAT_DDF DEA_DDF
IRL 0.05 0.03
LBN 0.15 0.10
FRA 0.07 0.06
The additive model measures technical efficiency based on input excesses and output shortfalls. It characterizes efficiency in terms of the input and output slacks: \(\textbf{s}^{-} \in R^{m}\) and \(\textbf{s}^{+} \in R^{s}\), respectively. The eat package implements the weighted additive model formulation of Lovell and Pastor (1995), where \(({\it \textbf{w}}^{-} ,{\it \textbf{w}}^{+})\in R_{+}^{m} \times R_{+}^{s}\) are the input and output weights, whose elements can vary across DMUs.
In the case of the FDH, the optimization program to be solved would be:
\[\begin{equation} \label{(31)} \begin{array}{lllll} {} & {\max } & {\sum_{j=1}^{m}w_{j}^{-} {\it s}_{jk}^{-} + \sum_{r=1}^{s}w_{r}^{+} {\it s}_{rk}^{+},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} - {\it s}_{jk}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} + {\it s}_{rk}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} = 1,} & {} & {} \\ {} & {} & {\lambda _{i} \in \left\{0,1\right\},} & {i=1,...,n} & {} \\ {} & {} & {{\it \textbf{s}}_{k}^{-} \ge \textbf{0}_{m}, {\it \textbf{s}}_{k}^{+} \ge \textbf{0}_{s}.} \end{array} \end{equation}\]Under DEA, the model would be as follows:
\[\begin{equation} \label{(32)} \begin{array}{lllll} {} & {\max } & {\sum_{j=1}^{m}w_{j}^{-} {\it s}_{jk}^{-} + \sum_{r=1}^{s}w_{r}^{+} {\it s}_{rk}^{+},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} x_{ji} \le x_{jk} - {\it s}_{jk}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} y_{ri} \ge y_{rk} + {\it s}_{rk}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{i=1}^{n}\lambda_{i} = 1,} & {} & {} \\ {} & {} & {\lambda _{i} \ge 0,} & {i=1,...,n} & {} \\ {} & {} & {{\it \textbf{s}}_{k}^{-} \ge \textbf{0}_{m}, {\it \textbf{s}}_{k}^{+} \ge \textbf{0}_{s}.} \end{array} \end{equation}\]Within the framework of Efficiency Analysis Trees, the weighted additive model would be calculated as follows:
\[\begin{equation} \label{(33)} \begin{array}{lllll} {} & {\max } & {\sum_{j=1}^{m}w_{j}^{-} {\it s}_{jk}^{-} +\sum_{r=1}^{s}w_{r}^{+} {\it s}_{rk}^{+},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} - {\it s}_{jk}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} +{\it s}_{rk}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} = 1,} & {} & {} \\ {} & {} & {\lambda _{t} \in \left\{0,1\right\},} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \textbf{s}}_{k}^{-} \ge \textbf{0}_{m}, {\it \textbf{s}}_{k}^{+} \ge \textbf{0}_{s}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "WAM.MIP" for the Measure of Inefficiency Proportions or scores_model = "WAM.RAM" for the Range-Adjusted Measure of Inefficiency (Cooper et al. 1999) in efficiencyEAT():
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                        scores_model = "WAM.MIP", digits = 2,
                        print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 2.07 0.56 0.61 1.58 2.18 2.18 3.14
FDH 0.40 0.58 0.00 0.00 0.00 0.00 2.37
EAT_WAM_MIP FDH_WAM_MIP
MLT 2.54 0
SVN 1.86 0
HRV 2.30 0
scores <- efficiencyEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                        scores_model = "WAM.RAM", digits = 2,
                        print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
EAT 2704.65 1677.11 455.3 1467.78 2425.09 2425.09 8293.59
FDH 959.09 1388.56 0.0 0.00 0.00 0.00 5413.27
EAT_WAM_RAM FDH_WAM_RAM
ITA 2192.98 0.00
LVA 1890.38 0.00
CAN 1724.26 1182.66
And, finally, the Convexified Efficiency Analysis Trees weighted additive model would be:
\[\begin{equation} \label{(34)} \begin{array}{lllll} {} & {\max } & {\sum_{j=1}^{m}w_{j}^{-} {\it s}_{jk}^{-} +\sum_{r=1}^{s}w_{r}^{+} {\it s}_{rk}^{+},} & {} & {} \\ {} & {s.t.} & {} & {} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it a}_{j}^{t} \le x_{jk} - {\it s}_{jk}^{-} ,} & {j=1,...,m} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} {\it d}_{rT^{*}} \left({\it \textbf{a}}^{t} \right) \ge y_{rk} +{\it s}_{rk}^{+} ,} & {r=1,...,s} & {} \\ {} & {} & {\sum_{t\in \tilde{T}^{*}}^{}\lambda_{t} = 1,} & {} & {} \\ {} & {} & {\lambda _{t} \ge 0,} & {t\in \tilde{T}^{*}} & {} \\ {} & {} & {{\it \textbf{s}}_{k}^{-} \ge \textbf{0}_{m}, {\it \textbf{s}}_{k}^{+} \ge \textbf{0}_{s}.} \end{array} \end{equation}\]In R, this model can be computed by setting scores_model = "WAM.MIP" for the Measure of Inefficiency Proportions or scores_model = "WAM.RAM" for the Range-Adjusted Measure of Inefficiency in efficiencyCEAT():
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "WAM.MIP", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 2.39 0.45 0.96 2.24 2.50 2.50 3.14
DEA 0.90 0.62 0.00 0.23 1.07 1.07 2.37
CEAT_WAM_MIP DEA_WAM_MIP
ISR 2.74 1.13
ITA 2.73 1.31
SGP 1.91 0.00
scores <- efficiencyCEAT(data = PISAindex, x = 4:8, y = 1:3, object = modelEAT,
                         scores_model = "WAM.RAM", digits = 2,
                         print.table = TRUE)
Model Mean Std. Dev. Min Q1 Median Q3 Max
CEAT 5285.99 2573.13 612.54 3362.98 4836.79 4836.79 12088.64
DEA 2413.66 1821.33 0.00 746.22 2495.20 2495.20 7167.29
CEAT_WAM_RAM DEA_WAM_RAM
KOR 1538.61 0.00
ROU 7239.16 4727.17
EST 612.54 0.00
The bestEAT() and bestRFEAT() functions are aimed at finding the values of the hyperparameters that minimize the root mean squared error (RMSE), calculated on a test sample, of an Efficiency Analysis Trees or a Random Forest for Efficiency Analysis Trees model fitted on a training sample. The code of these functions is as follows:
# Hyperparameter tuning for Efficiency Analysis Trees
bestEAT(
training, test, x, y,
numStop = 5, fold = 5,
max.depth = NULL,
max.leaves = NULL,
na.rm = TRUE
)
# Hyperparameter tuning for Random Forest for Efficiency Analysis Trees
bestRFEAT(
training, test, x, y,
numStop = 5, m = 50,
s_mtry = c("5", "BRM"),
na.rm = TRUE
)
Here is an example of using the bestEAT() function. First, the PISAindex database explained in Section Data structure is divided into a training subset with 70% of the DMUs and a test subset with the remaining 30% (these values can be modified).
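The split itself is not reproduced here; a minimal sketch using base R (the seed value is arbitrary and included only for reproducibility):

```r
library(eat)  # provides the PISAindex data

set.seed(120)                                  # arbitrary seed, for reproducibility
n <- nrow(PISAindex)
t_index <- sample(1:n, size = round(0.7 * n))  # indexes of 70% of the DMUs
training <- PISAindex[t_index, ]               # 70% training subset
test     <- PISAindex[-t_index, ]              # remaining 30% test subset
```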
Then, we can apply the bestEAT() function. This function, and its equivalent bestRFEAT(), requires a training set (training) on which to fit an Efficiency Analysis Trees model (with cross-validation), a test set (test) on which to calculate the root mean squared error, and the input and output indexes (x and y, respectively). The rest of the arguments (numStop, fold, max.depth and max.leaves in the case of the bestEAT() function) are used to create a grid of combinations that determines the number of models to fit. Notice that it is not possible to enter NULL together with specific values in the max.depth or max.leaves arguments (e.g. max.depth = c(NULL, 5, 3), which R would silently reduce to c(5, 3), since c() drops NULL elements).
In the following example, the arguments numStop = c(3, 5, 7) and fold = c(5, 7) are entered and, consequently, six different models are constructed and fitted, one per combination: {numStop = 3, fold = 5}, {numStop = 3, fold = 7}, {numStop = 5, fold = 5}, {numStop = 5, fold = 7}, {numStop = 7, fold = 5} and {numStop = 7, fold = 7}. Let us show a numerical example:
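The call generating the results table is not reproduced above; presumably it takes the following form, where training and test are assumed to come from the 70/30 split described earlier:

```r
# Hypothetical call matching the grid described in the text
bestEAT(training = training, test = test, x = 4:8, y = 1:3,
        numStop = c(3, 5, 7), fold = c(5, 7))
```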
numStop fold RMSE leaves
1 5 5 74.74 15
2 7 7 83.61 13
3 3 5 84.17 13
4 3 7 84.17 13
5 7 5 86.39 8
6 5 7 104.93 8
The best model is given by the hyperparameters {numStop = 5, fold = 5}, with RMSE = 74.74 and 15 leaf nodes. Note that it may sometimes be preferable to select a model with a slightly higher RMSE but a lower number of leaf nodes. With this result, we fit the final Efficiency Analysis Trees model using all the original data:
bestEAT_model <- EAT(data = PISAindex, x = 4:8, y = 1:3, numStop = 5, fold = 5)
The simplest functions for exploring the results of an EAT object are print() and summary(). The function print() returns the tree structure of an Efficiency Analysis Trees model, while the function summary() returns general information about the fitted model. We show the results with an example:
modelEAT2 <- EAT(data = PISAindex, x = 7, y = 3)
print(modelEAT2) # [node] y: [prediction] || R: error n(t): nº of DMUs
[1] y: [ 569 ] || R: 15724.19 n(t): 72
|  [2] AAE < 70.12 --> y: [ 486 ] || R: 3094.43 n(t): 34
|  |  [4] AAE < 60.75 --> y: [ 472 ] <*> || R: 1553.93 n(t): 17
|  |  [5] AAE >= 60.75 --> y: [ 486 ] <*> || R: 1006.94 n(t): 17
|  [3] AAE >= 70.12 --> y: [ 569 ] <*> || R: 3753.38 n(t): 38
<*> is a leaf node
# Primary & surrogate splits: Node i --> {SL, SR} | var --> {R: error, s: threshold}
summary(modelEAT2)
Formula: M_PISA ~ AAE
# ========================== #
# Summary for leaf nodes #
# ========================== #
id n(t) % M_PISA R(t)
3 38 53 569 3753.38
4 17 24 472 1553.93
5 17 24 486 1006.94
# ========================== #
# Tree #
# ========================== #
Interior nodes: 2
Leaf nodes: 3
Total nodes: 5
R(T): 6314.25
numStop: 5
fold: 5
max.depth:
max.leaves:
# ========================== #
# Primary & surrogate splits #
# ========================== #
Node 1 --> {2,3} | AAE --> {R: 6847.81, s: 70.12}
Node 2 --> {4,5} | AAE --> {R: 2560.88, s: 60.75}
efficiencyJitter() returns a jitter plot built with ggplot2. This graphic shows how DMUs are grouped into the leaf nodes of a model built using the EAT() function, where each leaf node groups DMUs with the same level of resources. A black dot and a black line represent, respectively, the mean value and the standard deviation of the scores (df_scores, from the efficiencyEAT() or efficiencyCEAT() functions) of a given node. Additionally, efficient DMU labels are always displayed, based on the model entered in the scores_model argument. Finally, the user can specify an upper bound (upb) and a lower bound (lwb) in order to also show the labels of DMUs whose efficiency score lies between them. The code is as follows:
efficiencyJitter(
object,
df_scores,
scores_model,
upb = NULL,
lwb = NULL
)
As an example, using data from Section Data structure, we create a new Efficiency Analysis Trees model containing only the AAE and M_PISA variables. Next, we compute the Efficiency Analysis Trees efficiency scores corresponding to the output-oriented radial model and plot them through efficiencyJitter().
scores <- efficiencyEAT(data = PISAindex, x = 7, y = 3, object = modelEAT2,
                        scores_model = "BCC.OUT", digits = 2,
                        print.table = FALSE)
efficiencyDensity() returns a density plot built with ggplot2. This graphic allows the user to verify the similarity between the scores obtained by the different methodologies available in the eat package (EAT, FDH, CEAT, DEA and RFEAT).
efficiencyDensity(
df_scores,
model = c("EAT", "FDH")
)
In this case, a comparison between the scores of the EAT and FDH models is shown, where it can be clearly seen how FDH is less restrictive when determining a unit as efficient:
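A call along the following lines would produce that comparison; the column names are assumptions based on the naming used in the score tables shown earlier:

```r
# Hypothetical call comparing the EAT and FDH score columns
efficiencyDensity(df_scores = scores[, c("EAT_BCC_OUT", "FDH_BCC_OUT")],
                  model = c("EAT", "FDH"))
```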
In the limiting case of a single input and a single output, we can display the frontier estimated by the EAT() function (rendered with ggplot2) through the frontier() function:
frontier(
object, FDH = TRUE,
observed.data = TRUE,
observed.color = "black",
pch = 19, size = 1,
rwn = FALSE,
max.overlaps = 10
)
Optionally, the frontier estimated by FDH can also be plotted if FDH = TRUE. Observed DMUs can be shown in a scatter plot if observed.data = TRUE, and their color, shape and size can be modified with observed.color, pch and size, respectively. Finally, row names can be included with rwn = TRUE.
As an example, we use data simulated with the eat package to generate a data.frame with 50 rows (N = number of DMUs) and 1 input (nX):
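The simulation code is not reproduced here; a sketch, assuming the package's single-output data generator Y1.sim() (the function name should be checked against the installed version of eat):

```r
library(eat)

set.seed(1234)                       # arbitrary seed, for reproducibility
simulated <- Y1.sim(N = 50, nX = 1)  # assumed helper: 50 DMUs, 1 input, 1 output
```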
Then, we apply the frontier() function, where it can be observed how the Efficiency Analysis Trees model generalizes the results obtained by the FDH model:
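The plot would be generated along these lines; simulated denotes the simulated data.frame and modelSim an EAT model fitted on it (both names are illustrative assumptions):

```r
# 'simulated' and 'modelSim' are illustrative names, not from the original text
modelSim <- EAT(data = simulated, x = 1, y = 2)  # column 1: input, column 2: output
frontier(object = modelSim, FDH = TRUE, observed.data = TRUE,
         rwn = FALSE, max.overlaps = 10)
```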