SARIMA Analysis and Automated Model Reports with BETS , an R Package by

This article aims to demonstrate how the powerful features of the R package BETS can be applied to SARIMA time series analysis. BETS provides not only thousands of Brazilian economic time series from different institutions, but also a range of analytical tools, and educational resources. In particular, BETS is capable of generating automated model reports for any given time series. These reports rely on a single function call and are able to build three types of models (SARIMA being one of them). The functions need few inputs and output rich content. The output varies according to the inputs and usually consists of a summary of the series properties, step-by-step explanations on how the model was developed, predictions made by the model, and a file containing these predictions. This work focuses on this feature and several other BETS functions that are designed to help in modeling time series. We present them in a thorough case study: the SARIMA approach to model and forecast the Brazilian production of intermediate goods index series.


Introduction
The BETS (Ferreira et al., 2017) package (an abbreviation for Brazilian Economic Time Series) for R (R Core Team, 2017) allows easy access to the most important Brazilian economic time series and a range of powerful tools for analyzing and visualizing not only these data, but also external series.It provides a much-needed single access point to the many Brazilian series through a simple, flexible and robust interface.The package now contains more than 18,000 Brazilian economic series and contains an integrated modelling and learning environment.In addition, some functions include the option of generating explanatory outputs that encourage and help users to expand their knowledge.
BETS relies on several well-established R packages.In order to work with time series, BETS functions build upon packages such as forecast (Hyndman, 2015), mFilter (Balcilar, 2016), urca (Pfaff et al., 2016) and seasonal (Sax, 2016).Access to external APIs to obtain data is handled by the web scraping and HTTP tools available through the httr (Wickham, 2017) and rvest (Wickham, 2016) packages, whereas BETS own databases are queried using the package RMySQL (Ooms et al., 2016).
This article seeks to demonstrate many of these features through a practical case.We create a SARIMA model (Box and Jenkins, 1970) for the Brazilian production of intermediate goods index (BPIGI) series, step by step, using mostly functions from BETS.Then, we show that BETS is capable of using rmarkdown (Allaire, 2017) to generate a fully automated report that includes a virtually identical SARIMA analysis.The report needs few inputs, runs for any given time series and has the advantage of outputting several informative comments, explaining to the user how and why it achieved the given results.The outputs, of course, vary according to the inputs.Reports provide a link to a file containing the series under analysis and the predictions calculated by the automated model.Automated reports can also produce GRNN1 and exponential smoothing models, but here we focus on the previously specified methodology2 .In the next section, we briefly present the SARIMA approach and show how to use BETS functions to conduct such an analysis in a typical case: predicting the future levels of the BPIGI.Then, in Section 3, we introduce the automated model reports and illustrate its capabilities by applying it to the series under study using a range of other inputs.Section 4 concludes this article.

SARIMA Analysis: A Case Study
In this section we develop a case study in which we use a handful of BETS functions to model and forecast values of the Brazilian production of intemediate goods index, following the SARIMA (Box & Jenkins) approach.Before starting the analysis we will briefly look at the methodology to be used.For a more comprehensive treatment of the topic, see Box and Jenkins (1970) and Ferreira et al. (2016) 3 .

Box & Jenkins Methodology
The Box & Jenkins methodology allows the prediction of future values of a series based only on past and present values of the same series.These univariate models are known as SARIMA, an abbreviation for Seasonal Autoregressive Integrated Moving Average, and have the following form: • B is the lag operator, i.e., for all t > 1, BZ t = Z t−1 • P, Q, p and q are the orders of the polynomials is the difference operator and d is the number of unit roots is the difference operator at the seasonal frequency s and D is the number of seasonal unit roots is the seasonal moving average polynomial • θ q (B) is the moving average polynomial • a t is the random error.
In its original form, which will be used here, the Box & Jenkins methodology consists of three iterative stages: (i) Identification and selection of the models: the series is checked for stationarity and the appropriate corrections are made for non-stationary cases, by differencing the series d times in its own frequency and D times in the seasonal frequency.The orders of the polynomials φ, Φ, θ, and Θ are identified using the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the series.
(ii) Estimation of the model's parameters using maximum likelihood or dynamic regression.
(iii) Confirmation of model compliance using statistical tests.
The ACF, used in stage (i), can be defined as the correlation of a series with a lagged copy of itself as a function of lags.Formally, we can write the ACF of a time series s with realizations s i for i ∈ [1, N] and mean s as Another tool for phase (i), the PACF also gives the correlation of a series with its own lagged values, but controlling for all lower lags.Let P t,l (x) denote the projection of x t onto the space spanned by x t+1 , . . ., x t+l−1 .We can define the PACF as The PACF is used to find p (the lag beyond which the PACF becomes zero is the indicated number of autoregressive terms), while the ACF is used to find q (the cut in the ACF points to the number of moving average terms).Seasonal polynomial orders, P and Q, are identified in a similar fashion, but by inspecting high values at fixed intervals.In this case, the intervals become the orders.
If the model does not pass phase (iii), the procedure starts again at step (i).Otherwise, the model is ready to be used to make forecasts.In the next section we will look at each of these stages in more detail and say more about the methodology as we go through the case study.

Some preliminary topics
The first step is to find the series in the BETS database.This can be done with the BETSsearch() function.It performs flexible searches on the metadata table that stores series characteristics.Since metadata is accessed through SQL queries, searchers are fast and the package performs well in any The R Journal Vol.10/2, December 2018 ISSN 2073-4859 environment.BETSsearch() is a crucial feature of BETS, so we provide a thorough description in this section.
Accepted parameters, expected data types, and definitions are the following: • description -A character.A search string to look for matching series descriptions.
• src -A character.The source of the data.
• periodicity -A character.The frequency with which the series is observed.
• unit -A character.The unit in which the data were measured.
• code -An integer.The unique code for the series in the BETS database.
• view -A boolean.By default, TRUE.If FALSE, the results will be shown directly on the R console.
• lang -A character.Search language.By default, 'pt', for Portuguese.A search can also be performed in English by changing the value to 'en'.
To refine the search, there are syntax rules for the parameter description: 1. To look for alternative words, separate them by blank spaces.Example: description = core ipca means that the description of the series should contain the words core and ipca.
2. To search for complete expressions, put them inside single quotes (' ').Example: description = "index and core ipca " means that the description of the series should contain the expression core ipca and the word index.
3. To exclude words from the search, insert a ∼ before each word.Example: description = "ipca ∼ core" means that the description of the series should contain the word ipca and should not contain the word core.
4. To exclude a whole expression from a search, place them inside single quotes (' ') and insert a ∼ before each of them.Example: description = "index ∼ core ipca " means that the description of the series should contain index and should not contain core ipca.
5. It is possible to search for or exclude certain words as long as these rules are obeyed.
6.No blank space is required after the exclusion sign (∼), but it is required after each expression or word.
Finally, the command to find information about the series, which we will use throughout this article, is shown below, along with the corresponding output.
# Searching for the production of intermediate goods index series # excluding the term imports from the search results <-BETSsearch(description = " intermediate goods ~imports", view = F, lang = "en") results #> # A tibble: 3 x 7 #> code description unit periodicity start #> <chr> <chr> <chr> <chr> <chr> #> 1 11068 Intermediate go?Index M 31/0?#> 2 1334 Production indi?Index M 31/0?#> 3 21864 Physical Produc?Index M 01/0?#> # ... with 2 more variables: last_value <chr>, #> source <chr> The series we are looking for is the third (code 21864).We now load it using the function BETSget() and store some values for later comparison with the model's forecasts.We will also produce a graph (Figure 1) to help formulate hypotheses about the behavior of the underlying stochastic process.The professional looking chart is created with chart(), which has the nice property of displaying useful information about the series, such as the last value and monthly/interannual variations.Four characteristics can be observed.First, the series is homoskedastic, with monthly seasonality.A further striking feature is the structural break in November 2008, when the international financial crisis occurred and investor confidence plummeted.This break had a direct impact on the next important characteristic of the series, the trend, which was initially clearly one of growth, albeit not at explosive rates.From November 2008 onward, however, production appeared to remain unchanged or even to fall.Since the main goal of this article is to present BETS' features, rather than performing an overcautious time series analysis, we decided to work with fewer observations, thus avoiding problems with the structural break.Only the values from January 2009 on were used.This did not imply poorer results.In fact, discarding older values yields better predictions, at least within the SARIMA framework.
data <-window(data, start = c(2009,1), end = c(2015,10), frequency = 12) In the following sections we create a model for the chosen series following the previously defined modelling steps.

Stationarity Tests
This subsection covers a crucial step in the Box & Jenkins approach: checking if there are unit roots in the seasonal and non-seasonal autoregressive model polynomials and determining how many of these roots there are.If there are no unit roots, the series is stationary.In case we find unit roots, it is possible to obtain stationarity by differencing the original series.Then, the order of the polynomials can be identified using the ACF and the PACF funcitons.
The ur_test() function performs unit root tests.The user can choose among Augmented Dickey Fuller (ADF, Dickey and Fuller (1979)), Phillips-Perron (PP, Phillips and Perron (1988)), and KPSS (Kwiatkowski et al., 1992) tests.The ur_test() function was built around functions from the urca package.Relative to those, the advantage of ur_test() is the output, which is designed to quickly see the test's results and all the needed information.The output is an object with two fields: (i) a table showing test statistics, critical values and whether the null hypothesis is rejected or not, and (ii) a vector containing the residuals of the test's equation.This equation is shown below.
The test statistics in the output table refer to the coefficients φ (drift), τ 1 (deterministic trend) and τ 2 (unit root).Inclusion of the constant (drift) and deterministic trend is optional.To control the The R Journal Vol.10/2, December 2018 ISSN 2073-4859 test parameters, ur_test() accepts the same parameters as urca's ur.df(), as well as the desired significance level.

yes
Therefore, for the series in levels, the null hypothesis that there is a unit root cannot be rejected at a 95% confidence level, as the test statistic tau2 is greater than the critical value.We will now apply the diff function to the series repeatedly, to determine whether the differenced series has a unit root.This way we find out how many unit roots there are.

#> [1] 1
The results indicate that there is a seasonal unit root and that seasonal differencing is therefore required: d.data <-diff(diff(data), lag = 12, differences = 1)

Autocorrelation Functions
The previous conclusions are corroborated by the autocorrelation function of the series in differences (from now on, ∇∇ 12 Z t ).It shows that statistically significant correlations, i.e., correlations outside the confidence interval, are not persistent for lags corresponding to multiples of 12 or values close to these multiples.This indicates the absence of a seasonal unit root.
The BETS function that we use to draw the correlograms is corrgram().Unlike its main alternative, the forecast package's Acf() function, corrgram() returns an attractive graph and provides the option of calculating the confidence intervals using the method proposed by Bartlett (Bartlett, 1946).The greatest advantage it offers, however, cannot be shown here as it depends on Javascript.If the style parameter is set to plotly , an interactive graph is produced, displaying all the values of interest (autocorrelations, lags and confidence intervals) on mouse hover, as well as providing zooming, panning, and the option to save the graph in png format.

II. Estimation
To estimate the coefficients of the SARIMA(1,1,0)(0,1,0)[12] model, we will use the Arima() function from the forecast package.The t tests will be performed using the t_test() function, which receives an object of type "arima" or "Arima", an integer representing the number of exogenous variables in the model and an integer for the desired significance level.It returns a data frame containing information related to the test (estimated coefficients, standard errors, test statistics, critical values and the results of the test).

III. Diagnostic Tests
The aim of the diagnostic tests is to determine whether the chosen model is suitable.Here, two well-known tools will be used: analysis of standardized residuals and the Llung-Box test (Ljung and Box, 1978).
The graph of the standardized residuals will be produced with the std_resid() function, which was implemented specifically for this purpose.We can see that there is a prominent and statistically significant outlier in April 2013.A second model including a dummy (a time series whose values can be either 0 or 1, to control for the structural break) is therefore proposed.The dummy is defined as follows: This dummy can be created using the dummy() function, as shown below.The start and end parameters indicate time period covered by the dummy.The from and to fields indicate the interval within which the dummy should take a value of 1. Estimation of this model by maximum likelihood resulted in coefficients that are statistically different from 0 at a 1% significance level.The graph of the standardized residuals of the new model (Figure 5) also shows that the decision to include D t was correct as there is no more evidence of a structural break.
# Estimation of the parameters of the model with a dummy model2 <-Arima(data, order = c(1,1,0), seasonal = list(order = c(0,1,0), period = 12), xreg = dummy1  We spotted some other possible outliers in the standard residuals graph.Hence we introduce two more dummies to improve the model and then estimate a third set of coefficients.One possible way of evaluating the quality of the models is to calculate the value of an information criterion.This is already contained in the object returned by Arima().# Show BIC for the two estimated models model1$bic Note also that the Bayesian Information Criterion (BIC, Schwarz (1978)) for the model with the dummies is smaller.Hence, based on this criterion as well, the model with the dummy should be preferred over the previous one.
The Ljung-Box test for the chosen model can be performed with the Box.test() function in the stats package.We will follow Rob Hyndman's suggestion4 and use min(2m, T/5) lags, where m is the period of seasonality (12 here) and T is the length of the time series (82), which yields 16 in our case.The p-value of 0.041 is low and indicates there is no statistical evidence to reject the null hypothesis of no autocorrelation in the residuals.We could have added one MA term to double the p-value and generate such evidence, but we decided not to, for two reasons: (i) the law of parsimony; (ii) adding an extra term would be an ad hoc solution -we have no indication from the ACF and PACF that this term should exist; and (iii) there is no reason to be overcautious and present an extensive analysis, because our intention is to exhibit BETS' features.

IV. Forecasts
BETS provides a convenient way of making forecasts with SARIMA, GRNN, and exponential smoothing models.The predict() function receives the parameters from the forecast() function (from the package of the same name) and returns not only the objects containing the information related to the forecasts, but also produces a graph of the series including the predicted values.Displaying the information in this way is important to get a better idea of how suitable the model is.The actual values for the forecasting period can also be shown if desired.
We now call predict() to generate the forecasts for the proposed model.The parameters object (object of type arima or Arima), h (forecast horizon), and xreg (the dummy for the forecasting period) are inherited from the forecast() function.The others are from predict() itself and control features of the graph, except for actual, which is the the series of effective values observed during the forecasting period.The areas in dark blue and light blue around the forecasts are the 85% and 95% confidence intervals, respectively.It appears that the forecasts were satisfactory, since in general the actual values are inside the confidence interval.
To make this statement more meaningful, we can check various measures of fit by consulting the accuracy field contained within the predictions object (preds).2. The graph of the series along with its trend, which is extracted with an HP Filter (Hodrick and Prescott, 1997).The graph is produced with the dygraphs package (Vanderkam et al., 2016) and enables the user to define windows and examine the series' values.
3. The steps involved in the identification of a possible model: unit root tests (at the moment, ACF (Dickey and Fuller, 1979), PP (Phillips and Perron, 1988) or KPSS (Kwiatkowski et al., 1992) for non-seasonal unit roots, and OCSB (Osborn et al., 1988) test for seasonal unit roots) and correlograms of the original series (if no unit root is found), or the differenced series (if one or more unit roots are found).
4. Estimation of the parameters and display of the result from the automatic model selection performed by the auto.arima()function (from the forecast package).
5. The ARCH test (Engle, 1982) for heteroskedasticity in the residuals.If residuals are found to be heteroskedastic, the report performs a log transformation on the original series and run steps 3 and 4 again for the transformed series.
6.The graph of the standard residuals.
7. The Ljung-Box (Ljung and Box, 1978) test for autocorrelation in the residuals.If the null hypothesis of no autocorrelation is rejected, the user will be advised to use a dummy or another type of model.
8. The n-steps-ahead forecasts and a graph of the original series with the forecasted values and confidence intervals, also produced with dygraphs (this output is an option of predict()).
9. A link to the file containing the original data and the forecasts.
One example of how the output could change in response to different inputs is obtained by removing the dummy from the last example.The call without a dummy renders a report that, after detecting autocorrelation in the residuals, suggests the usage of a dummy to solve the problem (Figure 8).It does indeed solve it, as we demonstrated.We now provide one more call and output snippets from the report() function, aiming at illustrating its capabilities.First, suppose we need to model the Brazilian production of capital goods (BPCG) series.Its unique identifier in the BETS database is 21863, so this will be the value of parameter ts.The next piece of code builds a SARIMA report using the same dummy we needed in the case study, but ARCH test and Box test parameters are different.Figures 9 to 11 show snippets of the report that was generated by the last call and demonstrate that it presents different results when compared to the one produced previously.OCSB test, for instance, did not find any seasonal unit root, thereby displaying a different comment (Figure 9).In addition, residuals were found to be heteroskedastic, which forced the model to be estimated a second time, using the log transformation of the original series (Figure 10).Note that running a report saves a lot of time and might be insightful.However, performing the analysis in detail, using individual BETS functions rather than the batch of functions included in the reports, may provide more flexibility.Since reports use auto.arima(),they might not find models that are in line with a more sofisticated analysis.For example, the SARIMA model calculated by auto.arima() in the first example (Figure 7) is not the one we proposed in the case study.This happened because the function did not find a non-seasonal unit root, since it does not test for more than 2 lags, and we tested for 6.For now, we cannot include the option of testing for more lags, the reason being that auto.arima()does not accept it.

Conclusion
We have shown in this article that the approach adopted in BETS is a novel one as it allows users not only to quickly and easily access thousands of Brazilian economic time series, but also to use and analyze these series in many different ways.Its automated reports are comprehensive, fully customizable and three methods can be readily applied: SARIMA models, GRNNs and exponential smoothing models.We showed case studies to examine the use of the first, but only mentioned the latter two.Nevertheless, the user can benefit from the extensive documentation shipped with the The R Journal Vol.10/2, December 2018 ISSN 2073-4859 package, where example code to generate GRNN and exponential smoothing reports are exhibited and discussed.
In the future we hope to expand these dynamic reports and provide new options for users.One such option has already been taken into account in the design of the package: a display mode that does not include explanations about the methodology, so that the information is provided in a more technical, succinct manner.In addition to developing new modeling methods, such as Multi-Layer Perceptrons, fuzzy logic, Box & Jenkins with transfer functions, and multivariate models, we plan to refine the analyses themselves.

Figure 1 :
Figure 1: Graph of the BPIGI, made with function chart().Indicators below the x coordinates represent index variation over a month and over 12 months, respectively.

Figure 2 :
Figure 2: Autocorrelation function of ∇∇ 12 Z t This correlogram, however, is not sufficient for us to propose a model for the series.We will therefore produce a graph of the partial autocorrelation function (PACF) of ∇∇ 12 Z t .corrgram() can also be used for this purpose.# Partial autocorrelation function of diff(data) corrgram(d.data,lag.max = 48, type = "partial", style="normal")

#Figure 4 :
Figure 4: Standard residuals from the first proposed model.

Figure 6 :
Figure 6: Graph of the proposed SARIMA model forecasts, an output of predict

Figure 7 :
Figure 7: The output file of report() function for the series under study.

Figure 9 :
Figure 9: First snippet of the example report with the BPCG series

Figure 10 :
Figure 10: Second snippet of the example report with the BPCG series

Figure 11 :
Figure 11: Third snippet of the example report with the BPCG series The other field in this object, predictions, contains the object returned by forecast() (or by another forecasting function, such as grnn.test() or predict(), depending on the model).