PanJen: An R package for Ranking Transformations in a Linear Regression

PanJen is an R-package for ranking transformations in linear regressions. It provides users with the ability to explore the relationship between a dependent variable and its independent variables. The package offers an easy and data-driven way to choose a functional form in multiple linear regression models by comparing a range of parametric transformations. The parametric functional forms are benchmarked against each other and a non-parametric transformation. The package allows users to generate plots that show the relation between a covariate and the dependent variable. Furthermore, PanJen will enable users to specify specific functional transformations, driven by a priori and theory-based hypotheses. The package supplies both model fits and plots that allow users to make informed choices on the functional forms in their regression. We show that the ranking in PanJen outperforms the Box-Tidwell transformation, especially in the presence of inefficiency, heteroscedasticity or endogeneity.


Introduction
A model that fits data well but is unrelated to theory can only describe correlations. In contrast, a model with both a good fit and a theoretically sound foundation can give insights into hypotheses on causality. The functional form in a regression model describes the relationship between a dependent variable and its covariates. There are numerous examples of researchers who have neglected functional form relationships and applied the default linear relationship between variables in their regression models (Box, 1976; Breiman and others, 2001; Berk, 2004; Angrist and Pischke, 2010). From a superficial point of view, these models may provide efficient parameter estimates with narrow standard errors, high t-values, and significance. A strong non-linear relationship between a dependent variable and a covariate will often offer reasonable test statistics with a default linear functional form specification. However, positive test statistics are not the same as proof of a linear relationship. Even more important is the fact that a misspecified functional form can lead to an incorrect interpretation and prediction of the relationship between the dependent variable and a given covariate. The specification should be driven by theory with an a priori hypothesis of the relationship between the dependent variable and the covariates.
PanJen was developed over several years of applied research on property value models. Here, the sales price is estimated as a function of its characteristics, such as the size of the living space, the number of rooms, and access to shopping. However, the package is applicable to most cases in which the relationship between a continuous dependent variable and its covariates is explored. In the applied econometrics literature, this task has commonly been solved using power transformations in initial analyses (Palmquist, 2006). Two examples of power transformations are Box-Cox and Box-Tidwell (Box and Tidwell, 1962; Box, 1976). The power transformations are easy to use; the ability of these two power transformations to detect functional forms has been studied extensively in the academic literature, e.g., Kowalski and Colwell (1986); Brennan et al. (1984); Clark (1984), and they are still used in applied studies (Cohen et al., 2013; Farooq et al., 2010; Link, 2014; Joshi et al., 2017; Benson et al., 1998; Troy and Grove, 2008).
The popularity of these power transformations is surprising given that their shortcomings are well described in the literature (Levin et al., 1993; Wooldridge, 1992a). While power transformations perform well in many circumstances, they do not perform well in the presence of omitted variables, inefficiency, heteroscedasticity and endogeneity. Furthermore, a transformation can be challenging to interpret, does not necessarily relate back to a theory-driven hypothesis and does not detect whether the relationship changes across the distribution of the dependent variable.
Another approach to the functional form issue is to abandon the parametric model and approach the challenge from a non- or semi-parametric angle. In the academic literature, a number of non-parametric and semi-parametric alternatives have been proposed and used (Anglin and Gençay, 1996; Gençay, 1996; Clapp and Giaccotto, 2002; Bin, 2004; Geniaux and Napoléone, 2008). Non- or semi-parametric models provide data-driven approaches to establish the relationship between a dependent variable and covariates. A non-parametric model can be attractive, because the functional form is revealed by the data instead of being predefined by the researcher. However, the gained flexibility of non-parametric analysis comes at the cost of more difficult interpretation of the estimates, which is perhaps why parametric models are often used in applied work.
In non-parametric models, the relationship is fitted to the sample to the extent that the estimated relationship is at risk of being over-fitted. In other words, the estimated effect captures random error or noise in combination with the underlying relationship in the population (Wood, 2006). Additionally, a central critique concerning this approach is that the results from a non-parametric model are difficult to generalize or extend outside of the sample (McMillen and Redfearn, 2010). Even so, a non-parametric model holds great potential in exploratory analysis.
We are certainly not the first researchers to consider the possibility of utilizing the apparent advantages of non-parametric modelling to explore functional form relationships in parametric modelling. The literature on using a non-parametric model to test different parametric specifications is large (González-Manteiga and Crujeiras, 2013). The primary approach in the existing literature has been to test a parametric version of a model against a non-parametric version (Wooldridge, 1992b; Horowitz and Härdle, 1994; Zheng, 1996; Li et al., 2016). However, to the best of our knowledge, none of these tests have been widely adopted in the empirical literature. PanJen is developed to allow applied researchers to utilize non-parametric estimation to identify a better parametric functional form. The package offers a test based on well-established measures and is provided on a well-established software platform. We believe that very few empirical researchers know about the existing tests, and the few that do perceive them as too complicated due to their non-parametric basis. PanJen offers the user a transformation ranking based on the parametric transformation that captures most of the variance of the dependent variable. The ranking includes a non-parametric specification that can detect if the relationship between the dependent variable and covariate is non-stationary. In contrast to existing tests, semi-parametric transformations are only included as a benchmark rather than as an incremental part of the test. The engine in the PanJen ranking is the well-known Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) measures. These model-fit measures are already widely applied in the empirical literature and should not be a hindrance for the empirical researcher. With the PanJen ranking, we hope to introduce an approach that will make applied researchers explicitly consider functional form in their parametric models.
In the next section, we briefly describe the main idea behind the ranking in the package (the PanJen ranking). In the following section, we describe the workhorse behind it, the Generalized Additive Model (GAM). In section 4, we explain how the PanJen ranking works. In section 5, we illustrate how to use the package by using a real example from our research. Section 6 offers a comparison between the PanJen ranking and the Box-Tidwell transformation. We simulated 10,000 datasets and recovered the functional form of one variable in a model with different impediments to show the merits of PanJen relative to a conventional approach. In section 7, we conclude the paper with a short discussion of when and how the researcher should use the package with an emphasis on the risk of pre-test bias.

The main idea of the PanJen ranking
PanJen is built on the idea that the choice of a functional form can be guided by model fit. In the PanJen ranking, a given number of models that vary only in the transformation of one covariate are estimated. One of these transformations is a so-called smoothing function, which for now we will simply note makes this one model semi-parametric. All the estimated models are then ranked according to their BIC. The BIC provides a relative goodness-of-fit measure that accounts for the complexity of the model. More formally, the PanJen ranking estimates a model

$$Y = \beta_0 + X\beta_k + g(x)\beta_l + \varepsilon$$

where $Y$ is the dependent variable, $\varepsilon$ is an i.i.d. error term, $X$ is a vector of $k$ covariates, $\beta_k$ is the corresponding vector of parameter estimates, and $g(x)$ is one of a set of functional form transformations, with $\beta_l$ the corresponding parameter estimates for the parametric transformations. The last two transformations in the set are $f(x)$ and $0$. For these there is no parameter estimate for $g(x)$, because $f(x)$ is the non-parametric smoothing and $0$ leaves out the explanatory variable.
The ranked BIC values show how each transformation performs relative to the others. The semi-parametric transformation allows the user to assess how well parametric transformations perform relative to a flexible semi-parametric function. If the data generation process does not resemble any of the parametric transformations, the smoothing function will still capture the relationship. The BIC scores are supplemented by the closely related AIC. In practice, both the AIC and the BIC penalize model complexity, although the penalty term in BIC is larger than in AIC (Burnham and Anderson, 2004). The smoothing function is highly flexible, but the flexibility comes at a cost: because both AIC and BIC penalize model complexity, the smoothing is not necessarily ranked the highest. There is no objective and transparent way to choose between the measures. The right measure depends on the user's a priori theory of the data generation process. If the users assume that one of their models perfectly fits the underlying data generation process, then the BIC is the right measure. If they instead assume that the underlying data generation process is extremely complex and none of the possible models will be able to perfectly capture it, then AIC is the right measure (Aho et al., 2014).
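The ranking logic described above can be sketched in a few lines of base R. This is a minimal illustration on simulated data and not the package's implementation: the variable names, the candidate set and the data generating process are all assumptions made for the example, and the smoothing benchmark is left out so that the sketch has no dependencies.

```r
# Simulated data; every name here is illustrative and not part of PanJen.
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x = runif(n, 1, 10))
d$y <- 1 + 0.5 * d$x1 + 2 * log(d$x) + rnorm(n, sd = 0.3)

# Candidate models differ only in how x enters; "base" leaves x out.
forms <- list(
  base   = y ~ x1,
  linear = y ~ x1 + x,
  log    = y ~ x1 + log(x),
  sqrt   = y ~ x1 + sqrt(x),
  square = y ~ x1 + I(x^2)
)
fits <- lapply(forms, lm, data = d)

# Collect both criteria and sort by BIC (lower is better).
ranking <- data.frame(
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC)
)
ranking <- ranking[order(ranking$BIC), ]
print(ranking)
```

Because the data were generated with a logarithmic relationship, the log model should appear at the top of the table. With real data the gaps between neighbouring rows are often small, which is why the ranking is best read together with the plot.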
The PanJen ranking is supported by a plot function that graphically outlines the relationship between the dependent variable and covariate. The plot is created by predicting the dependent variable using the median for all independent variables other than the one in question. The variable in question varies across a scale from the 5th percentile to the 95th percentile of the actual distribution in the dataset. The plot shows the user how each transformation captures the relationship across the distribution of the dependent variable. If the smoothing far outperforms all parametric transformations, the reason may be that the relationship changes across the distribution and the proposed simple parametric transformation does not capture the relationship between the dependent and independent variable. The plot will reveal this.
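The construction of the plot data can be sketched as follows. This is a hedged base R illustration of the idea (other covariates held at their medians, the variable of interest running from its 5th to its 95th percentile), not the internals of the package's plot function. The data and all names are simulated and hypothetical.

```r
# Simulated data; names are illustrative only.
set.seed(2)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x = runif(n, 1, 10))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + 2 * log(d$x) + rnorm(n, sd = 0.3)
fit <- lm(y ~ x1 + x2 + log(x), data = d)

# Grid over x from its 5th to its 95th percentile.
x_range <- unname(quantile(d$x, c(0.05, 0.95)))
grid <- data.frame(
  x  = seq(x_range[1], x_range[2], length.out = 100),
  x1 = median(d$x1),   # all other covariates fixed at their median
  x2 = median(d$x2)
)
grid$pred <- predict(fit, newdata = grid)

plot(grid$x, grid$pred, type = "l", xlab = "x", ylab = "predicted y")
```

Repeating the prediction step for each candidate model and overlaying the lines gives a picture of where the transformations agree and where they diverge.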

A semi-parametric model for benchmark
We estimate the parametric transformations using the Generalized Linear Model (GLM) and the semi-parametric transformation using GAM. GAM is an extension of the GLM in which it is possible to include one or more so-called smoothing functions. A smoothing function is a non-parametric way to include a continuous covariate in a parametric model and make it semi-parametric.
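A minimal sketch of this benchmark idea, assuming the mgcv package is installed and using simulated data with hypothetical names: one model enters the covariate through a parametric transformation, the other through a smoothing function, and both are scored with BIC.

```r
library(mgcv)

# Simulated data; names are illustrative only.
set.seed(3)
n <- 400
d <- data.frame(x1 = rnorm(n), x = runif(n, 1, 10))
d$y <- 1 + 0.5 * d$x1 + 2 * log(d$x) + rnorm(n, sd = 0.3)

# Parametric transformation of x versus a smoothing function s(x).
fit_log    <- gam(y ~ x1 + log(x), data = d, method = "GCV.Cp")
fit_smooth <- gam(y ~ x1 + s(x),   data = d, method = "GCV.Cp")

bic <- c(log = BIC(fit_log), smooth = BIC(fit_smooth))
print(bic)
```

The smooth term uses more effective degrees of freedom than the single log coefficient, so it has to fit noticeably better before it earns a lower BIC.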
The GAM combines a parametric part with a smooth part. $Y_i$ is the dependent variable of observation $i$. It follows an exponential family distribution, e.g. the normal, the gamma or the chi-square distribution. $X_i$ is a matrix of covariates that are parametrically related to the dependent variable, $\beta$ is the corresponding vector of parameter estimates, and $f$ is a smoothing function of covariate $x_{1i}$.
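For orientation, a GAM of the kind described here is commonly written in the following form. The notation is an assumption: the link function $h(\cdot)$ and the mean $\mu_i$ are not named in the text, and $h$ is used rather than $g$ to avoid confusion with the transformation set $g(x)$ of the previous section.

```latex
h(\mu_i) = X_i\beta + f(x_{1i}), \qquad
\mu_i = \mathrm{E}(Y_i), \qquad
Y_i \sim \text{an exponential family distribution}
```

With a Gaussian family and an identity link this reduces to $Y_i = X_i\beta + f(x_{1i}) + \varepsilon_i$, which is the case that matches a linear regression with one smoothed covariate.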
The GAM provides a flexible specification of a covariate by only specifying it as a smoothed function. By entering a variable with a smoothing function, the researcher does not specify a functional form, but instead lets the data speak. The smoothing function comprises the sum of $k$ thin plate regression spline bases $b_h(\cdot)$ multiplied by their coefficients. The non-parametric component of the model is fitted with a penalty on wiggliness (how flexible the smoothing is). The penalty, $\theta$, is determined from the data using generalized cross-validation or related techniques. The penalty directly enters the objective function through an additional term capturing wiggliness in the smoothing function: the fit is measured against the fitted dependent variable $\hat{Y}$, and the second derivatives of the smoothing function describe its wiggliness. We estimate the GAM using the R package mgcv (Wood, 2017). For a thorough introduction to GAM, please see Wood (2006).
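The two ingredients described in this paragraph can be written out as a sketch. The coefficient symbol $\gamma_h$ is an assumption introduced for illustration, and the objective is shown for the Gaussian case with a single one-dimensional smooth, in the standard textbook form rather than as a statement of what the package computes internally.

```latex
% Smoothing function as a sum of k basis functions times coefficients
f(x) = \sum_{h=1}^{k} \gamma_h \, b_h(x)

% Penalized least squares objective: fit term plus wiggliness penalty
\min_{\beta,\,\gamma} \; \lVert Y - \hat{Y} \rVert^{2}
  + \theta \int f''(x)^{2} \, dx
```

A large $\theta$ pushes $f$ towards a straight line, since a straight line has zero second derivative, while $\theta$ close to zero lets the spline follow the data closely.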

Using the package
We illustrate the use of PanJen using a hedonic house pricing model. The central idea is that the sale price of a home is a function of its characteristics, understood as both the characteristics of the home itself and its surroundings. The latter poses a problem in the empirical application of the hedonic method because observations can be correlated through space. A very flexible solution to this problem is to use the GAM framework to smooth over the x-y coordinates, thus allowing one to non-parametrically control for spatial correlations. von Graevenitz and Panduro (2015) illustrated the relationship between smoothing over space and classic spatial econometrics with weight-matrices and fixed spatial effects. They also showed that smoothing is a better alternative when the researcher does not know the underlying spatial data generation process. For recent applications, please see Rajapaksa et al. (2017) or Schäfer et al. (2017).
In our example here, we solely focus on the structural characteristics. These characteristics are measured by a range of variables. The researcher does not a priori know how the characteristics of the house are related to its price. For example, we expect the price to increase with the size of the home, but we do not know if that relationship is linear. It could be that going from 2 to 3 bedrooms is different than going from 7 to 8 bedrooms, i.e., we want to know if we should take account of marginally increasing or decreasing price-relationships. PanJen was developed to answer this type of question by finding the functional form relationship between the home price and different home characteristics.

An example: the implicit price for living area
Names        Description
lprice       log transformed price in 1000 EUR
area         living area in square meters
age          build year
bathrooms    number of bathrooms
lake_SLD     distance to nearest lake in meters
highways     distance to nearest highway in meters
big_roads    distance to nearest large road in meters
railways     distance to nearest railway in meters
nature_SLD   distance to nearest nature area in meters

We have 9 continuous and 7 dummy variables for quality at our disposal (Lundhede et al., 2013). In addition, the dataset includes 3 year dummies to control for price trends. The variables are listed in Tables 1 and 2. We first load the package and the data:

> library(PanJen)
> data("hvidovre")

Then, we set up a formula-object. We log-transform the prices because this introduces flexibility and is the convention within the hedonic literature (Diewert, 2003). It is possible to test different transformations by simply transforming the variable, or to test different link-functions by leaving the variable empty in fform(). Ten of the variables are dummy variables where transformations are irrelevant. We include only these in the first regression:

> formBase <- formula(lprice ~ brick + roof_tile + roof_cemen +
+     rebuild70 + rebuild80 + rebuild90 + rebuild00 + y7 + y8 + y9)
> summary(gam(formBase, method="GCV.Cp", data=hvidovre))

Family: gaussian
Link function: identity

Formula:
lprice ~ brick + roof_tile + roof_cemen + rebuild70 + rebuild80 +
    rebuild90 + rebuild00 + y7 + y8 + y9

Parametric coefficients:

This initial model explains nearly 15% of the variation in price. The next characteristic we want to control for is the size of the home. In the dataset, the living area in square metres is stored under 'area'.
We start out by using the default transformations supplied by the PanJen function fform(). This function ranks the fit of nine predefined transformations and a smoothing. The mandatory inputs are the name of the dataset, the model formula and the new variable we wish to test using the PanJen ranking:

> PanJenArea <- fform(hvidovre, "area", formBase, distribution=Gamma(link=log))

See Wood (2006) for an elaboration. The results are ranked according to their BIC. Strictly according to this ranking, we should log-transform the area. This implies, approximately, that a percentage change in living area results in a percentage change in price. Given that the response variable was log-transformed prior to the model fitting, any interpretation on the original scale should be done with care; see e.g. Barrera-Gómez et al. (2015).
The differences in the scores of the four models with the lowest BIC are small, and they might be a matter of differences in the tails of the distribution. This can be checked by plotting the predicted price against the area. The function plotff() generates a plot with the predicted price against the area from the 5th to the 95th percentile with all other covariates at their median value:

> plotff(PanJenArea)
The black line is the smoothing function. The log-squared and the linear specification closely follow this line. In conclusion, the implicit price for the living area is positive with a slightly declining marginal price. You can specify your own transformations using choose.fform(). In the following, we test three transformations: 'area', 'log(area)' and 'area^2'. We start out by defining a list of transformations:
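One natural way to represent such a set of transformations in R is a named list of functions. The sketch below is an assumption for illustration only: the object name fx_list and the made-up area values are hypothetical, and the exact argument format expected by choose.fform() should be checked against the package documentation.

```r
# Hypothetical representation: a named list of transformation functions.
fx_list <- list(
  linear = function(x) x,
  log    = function(x) log(x),
  square = function(x) x^2
)

# Made-up living areas in square metres, used only to show the effect.
area <- c(60, 85, 110, 140, 200)

# Apply every transformation to the same variable; one column per function.
transformed <- sapply(fx_list, function(f) f(area))
print(transformed)
```

Keeping the transformations as functions rather than as pre-computed columns makes it easy to add a theory-driven candidate later without touching the data.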

A changing relationship
The first transformation was rather straightforward because the relation between area and price is stable across the price-distribution. The age of the home is a characteristic with which this is not the case. The reason is that the building changes over the years. As the home gets older, there is not a direct link between the age and state of the home. Where a newly built home is likely to be built with modern standards and tastes in mind, an old house can instead be charming and authentic. At the same time, in general, houses built during periods of building booms, which in Denmark were in the 1960s, are of lesser quality than those built in the 1950s or 1970s. We can show this by running fform() and plotting the results.
Figure 3: The relation between price and age

The plot shows the relationship between age and price from the 5th to 95th percentile of the age distribution. Given the plot, it is difficult to think of a parametric relationship that will capture this relationship. If the age of the home is somehow related to the research question, the best solution might be to use the smoothing function. If age is nothing more than a control variable, one could perhaps resort to interval dummies similar to the year dummies in the model. As a part of the final test of the model, it would be worthwhile to test to what degree the variable of interest is robust to the way age enters the pricing function. In our setting, what should be noted is that the complexity of the relation between age and price would have gone unnoticed if we had compared only the parametric transformations.
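The interval-dummy option mentioned above can be sketched in base R. The build years, the cut points and the labels below are made up for illustration and are not taken from the dataset. In practice the intervals would be chosen from the plot and from knowledge of local building history.

```r
# Made-up build years; not taken from the hvidovre data.
build_year <- c(1925, 1948, 1956, 1963, 1967, 1974, 1988, 2003)

# Hypothetical cut points: one interval per decade of interest.
breaks <- c(-Inf, 1949, 1959, 1969, 1979, Inf)
labels <- c("pre1950", "1950s", "1960s", "1970s", "post1979")
period <- cut(build_year, breaks = breaks, labels = labels)

# Dummy columns for all intervals except the reference category (pre1950).
dummies <- model.matrix(~ period)[, -1]
print(table(period))
```

In a regression formula the factor itself can be used directly, since R expands it into the same dummies.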

Interacting with other packages:
When you run choose.fform() or fform(), all generated models and datasets are stored in a new list-of-lists object. Within the list 'models', all estimated models are stored as 'gam', 'glm' and 'lm' objects. This means that all objects used to create the fform output are easily available. The plotting function in PanJen is simple, but perhaps not sufficient when the researcher needs to produce plots for a third party. Here, we show how to use these objects to make a plot using base R, but we could just as well have used the tlm package if we wanted a more detailed plot (Barrera-Gómez and Basagaña, 2017).

Monte Carlo simulations
We tested the performance of the PanJen ranking against the well-known Box-Tidwell transformation (Box and Tidwell, 1962). We chose this as a benchmark since it is the most structured choice applied in existing empirical work. Power transformations such as Box-Cox and Box-Tidwell were suggested in the 1960s by Box and Cox (1964) and Box and Tidwell (1962). The Box-Tidwell transformation identifies the transformation that minimizes non-normality in the error term and linearizes the relationship between the dependent variable and the covariate using a maximum likelihood function. Thus, the researcher can use the test to find the power transformation with the highest likelihood. This section presents the results from nine Monte Carlo simulations in which the performance of the Box-Tidwell and PanJen is tested.
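As a reminder of what a power transformation of a covariate looks like, the single-covariate case can be sketched as below. The symbols $\lambda$, $\beta_0$ and $\beta_1$ are notation assumed for this illustration.

```latex
Y = \beta_0 + \beta_1 x^{\lambda} + \varepsilon,
\qquad x^{\lambda} \equiv \log(x) \ \text{when } \lambda = 0
```

Familiar functional forms correspond to particular values of the power: $\lambda = 1$ is the linear form, $\lambda = 0.5$ the square root, $\lambda = 2$ the square and $\lambda = -1$ the reciprocal. The estimated $\lambda$ is then read as pointing to the nearest of these forms.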
The simulations are centred on a base model in which $x_3$ is the variable of interest and $x_1$ and $x_2$ are two other covariates. The functional relationship between $Y$ and $f(x_3)$ was then tested using the PanJen ranking and the Box-Tidwell transformation. Table 3 summarizes the results. The fourth and fifth columns show the share of times each method reported the true functional form. In the Box-Tidwell case, the transformation parameter was allowed to vary by up to 0.2 from the correct specification.
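The structure of such a simulation can be sketched in base R. This is a scaled-down illustration under stated assumptions and not a reproduction of the study: the coefficients, the error variance, the candidate set and the number of replications (200 rather than 10,000) are all made up, the data are "well-behaved", and only a BIC ranking of parametric forms is evaluated.

```r
set.seed(4)
n_sim <- 200   # far fewer replications than in the paper
n     <- 300   # observations per simulated dataset

# Candidate functional forms for x3; names are illustrative.
forms <- list(
  linear = y ~ x1 + x2 + x3,
  log    = y ~ x1 + x2 + log(x3),
  sqrt   = y ~ x1 + x2 + sqrt(x3),
  square = y ~ x1 + x2 + I(x3^2)
)

# Simulate one dataset with a logarithmic truth and return the BIC winner.
pick_best <- function() {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n, 1, 10))
  d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + 2 * log(d$x3) + rnorm(n, sd = 0.3)
  bic <- sapply(forms, function(f) BIC(lm(f, data = d)))
  names(which.min(bic))
}

picks <- replicate(n_sim, pick_best())
share_correct <- mean(picks == "log")
print(share_correct)
```

Impediments such as heteroscedastic errors or an omitted covariate can be introduced by changing the line that generates y, which is how the robustness of a ranking can be probed.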

Table 3: Simulation | Simulation description | PanJen | Box-Tidwell

Each of the nine simulations tested the robustness of the methods in relation to different well-known econometric problems. Overall, the PanJen ranking performed acceptably. The method pointed to the correct functional form in 97 to 100% of the cases. The Box-Tidwell transformation performed just as well when the dataset was "well-behaved." It was already well-established in the literature that the method is sensitive to omitted variables, inefficient model estimates, heteroscedasticity and endogeneity, and this is also what we find in our study. In conclusion, the PanJen ranking performs better than or just as well as Box-Tidwell.

Conclusion
In this paper, we present the PanJen package. We provide a simple and intuitive description of the PanJen ranking. Based on a house price dataset, we show how the functions in the package can be applied to determine the relationship between a dependent variable and its covariates. Furthermore, we compare the PanJen ranking method to the Box-Tidwell transformation and show that the PanJen ranking performs just as well as or better than the Box-Tidwell transformation. The PanJen ranking outperforms Box-Tidwell in situations where the model suffers from inefficiency, heteroscedasticity or endogeneity. In some circumstances, the theory provides little or no guidance on the functional relationship between the dependent variable and the covariates in multiple regression models. In such circumstances, PanJen can support users in their decision on the functional form of the covariates. If the functional form relationship is more complex than a simple parametric transformation, we suggest considering a semi- or non-parametric model. The package has deliberately been restricted to test one covariate at a time without a silent output option. We want to deter the user from looping over every explanatory variable in search of a fit using the PanJen ranking, because this increases the pre-test bias. However, we also recognize that exploratory analysis is part of any empirical application of statistical modelling. Learning is a sequential process, and in many circumstances, we have not properly thought out an a priori hypothesis on which to base our models (Wallace, 1977). People perform exploratory model estimations and in many cases under-report their approach. Even so, pre-test bias is not a problem caused or exacerbated by the PanJen ranking. Regardless of how a researcher performs multiple model estimations, the risk of pre-test bias can be reduced by adopting a sampling approach.
The sampling approach can be implemented by dividing data into training and test datasets, where the exploratory analysis is conducted on the first. It is our hope that people will use PanJen to improve their models by specifying relationships that more accurately fit their data. In doing so, users should consider the PanJen ranking as a guide and not as a substitute for an a priori hypothesis.
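The split into training and test data mentioned above can be sketched in base R. The data frame, the 50/50 split and all names are assumptions for illustration. The exploratory ranking would be run on the training part only, and the chosen specification then estimated on the test part.

```r
set.seed(5)

# Stand-in dataset; in practice this would be the researcher's own data.
n <- 1000
d <- data.frame(id = seq_len(n), x = runif(n, 1, 10))

# Randomly assign half of the rows to the training set.
train_idx <- sample(n, size = floor(0.5 * n))
train <- d[train_idx, ]    # explore functional forms here
test  <- d[-train_idx, ]   # estimate the final model here
```

Fixing the seed makes the split reproducible, which helps when the exploratory steps are reported alongside the final model.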