This article is a self-contained introduction to the R package ercv and to the methodology on which it is based, through the analysis of nine examples. The methodology, based on the residual coefficient of variation, is simple and trustworthy for the analysis of extreme values, and it relates the two main existing approaches. The package contains R functions for visualizing, fitting and validating the distribution of tails. It also provides multiple threshold tests for a generalized Pareto distribution, together with an automatic threshold selection algorithm.
Extreme value theory (EVT) is one of the most important statistical techniques for the applied sciences. A review of the available software on extreme value analysis appears in Gilleland et al. (2013). R software (R Core Team 2017) contains some useful packages for dealing with EVT. The R package evir (Pfaff and McNeil 2012) provides maximum likelihood estimation (MLE) for both the block maxima and the threshold model approaches. The R package ismev (Heffernan and Stephenson 2018) allows fitting parameters of a generalized Pareto distribution depending on covariates and offers diagnostics such as qqplots and return level plots with confidence bands. The R package poweRlaw (Gillespie 2015) enables power laws and other heavy tailed distributions to be fitted using the techniques proposed by Clauset et al. (2009). This approach has been used to describe sizes of cities and word frequency and is linked to the physics of phase transitions and to complex systems.
This paper shows that the R package ercv (del Castillo et al. 2017a), based on the coefficient of variation (CV), is a complement, and often an alternative, to the available software on EVT. The mathematical background is presented in Section 2, including threshold models and the relationship between the power law distribution and the generalized Pareto distributions (GPD), which links the two different approaches followed by the aforementioned packages: evir and ismev on the one hand, and poweRlaw on the other.
Section 3 introduces the tools for the empirical
residual coefficient of variation developed in the papers
del Castillo et al. (2014), del Castillo and Serra (2015) and
del Castillo and Padilla (2016). Section 3.2 also shows the
exploratory data analysis of nine examples, some of them from the R
packages evir and poweRlaw, with the cvplot
function, see Figure
1.
Section 4 explains the Tm function in the R package
ercv, which implements a multiple thresholds test that truly reduces the
multiple testing problem in threshold selection and yields clearly
defined p-values. The function also includes an estimation method for the
extreme value index. An automatic threshold selection algorithm provided
by the thrselect
function is explained in Section
5 to determine the point above which GPD can
be assumed for the tail distribution.
Section 6 shows how the methodology developed in
the previous sections can be extended with the tdata
function to all
GPD distributions, even with no finite moments. This technique is
applied to the MobyDick example and to the Danish fire insurance
dataset, a highly heavy-tailed, infinite-variance model. Finally,
Section 7 describes the functions of the R
package ercv that allow estimation of the parameters (fitpot) and
drawing of the adjustments (ccdfplot) for the peak-over-threshold
method.
Extreme value theory is widely used to model exceedances in many disciplines, such as hydrology, insurance, finance, internet traffic data and environmental science. The underlying mathematical basis is now thoroughly established in Leadbetter et al. (1983), Embrechts et al. (1997), de Haan and Ferreira (2007), Novak (2012) and Resnick (2013). Statistical tools and methods for use with a single time series of data, or with a few series, are well developed in Coles (2001), Beirlant et al. (2006) and Markovich (2007).
The first fundamental theorem on EVT by Fisher and Tippett (1928) and Gnedenko (1943) characterizes the asymptotic distribution of the maximum in observed data. Classical analyses now use the generalized extreme value family of distribution functions for fitting to block maxima data, provided the number of blocks is sufficiently large. Another point of view emerged in the 1970s with the fundamental theorem by Pickands (1975) and Balkema and de Haan (1974). The Pickands-Balkema-DeHaan (PBdH) theorem, see McNeil et al. (2005, chap 7), initiated a new way of studying extreme value theory via distributions above a threshold, which uses more information than the maxima grouped into blocks.
Let X be a continuous, non-negative random variable with distribution
function F(x). For a threshold t, the residual (excess) distribution is
that of X − t conditional on X > t; its mean, e(t) = E(X − t | X > t), is
the mean excess function, used by the mean excess plot (meplot
function in the evir R package).
The residual coefficient of variation is given by
cv(t) = sd(X − t | X > t) / E(X − t | X > t),
the ratio of the standard deviation to the mean of the excesses over t.
The PBdH theorem characterizes the asymptotic distributions of the
residual distribution over a high threshold under widely applicable
regularity conditions, see Coles (2001). The result essentially says that
GPD is the canonical distribution for modelling excesses over high
thresholds. The probability density function of a GPD with extreme value
index (shape) ξ and scale parameter ψ > 0 is
f(x) = (1/ψ)(1 + ξx/ψ)^(−1/ξ−1), reducing to f(x) = exp(−x/ψ)/ψ for ξ = 0.
Provided ξ < 1/2, the coefficient of variation of a GPD is
cv = 1/√(1 − 2ξ); the cvevi and evicv
functions of the R package ercv correspond to
this function and its inverse.
The residual distribution of a GPD over a threshold t is again a GPD with
the same extreme value index ξ and scale ψ + ξt; in particular, the
residual CV of a GPD does not depend on the threshold.
The probability density functions of the GPD family cover bounded
distributions (ξ < 0), the exponential distribution (ξ = 0) and
heavy-tailed Pareto-type distributions (ξ > 0).
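As a numerical illustration, the one-to-one map between the extreme value index and the residual CV of a GPD, cv = 1/√(1 − 2ξ) for ξ < 1/2, and its inverse can be sketched as follows (the function names here are ours, for illustration only; the cvevi and evicv functions of ercv implement this correspondence):

```r
# Residual CV of a GPD as a function of the extreme value index (evi < 1/2),
# and the inverse map from a CV value back to the extreme value index.
cv_from_evi <- function(evi) 1 / sqrt(1 - 2 * evi)
evi_from_cv <- function(cv) (1 - 1 / cv^2) / 2

cv_from_evi(0)        # exponential case: CV = 1
evi_from_cv(sqrt(2))  # CV = sqrt(2) corresponds to evi = 1/4
```

CV values above 1 thus correspond to heavy tails (evi > 0) and values below 1 to light tails (evi < 0).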
The power law distribution, introduced by Pareto, is the model ((6)),
with survival function S(x) = (x/x_min)^(−α) for x ≥ x_min > 0, where
α > 0 is the tail exponent and x_min is the minimum value parameter.
Note that if X follows ((6)), then log(X/x_min) is exponentially
distributed with mean 1/α; moreover, the excesses X − x_min follow a GPD
((3)) with ξ = 1/α and ψ = x_min/α.
The MLE for model ((6)) leads to the Hill estimator and the
Hill plot (hill function in the evir R package). The support of the
distributions in ((6)) depends on the minimum value
parameter x_min.
However, the support of the distributions in ((3)), with the threshold
adjusted to zero, always starts at the origin, whatever the values of the
parameters. The gpd function in the evir R package provides the MLE for
((3)).
Note that model ((3)) includes all the limit distributions
(heavy tailed or not) of the residual distribution over a high threshold and
comes from a mathematical result (the PBdH theorem), whereas
((6)) often comes from empirical evidence of the linear
relationship ((7)) in a log-log plot and from comparison with other models.
Moreover, the linear relationship ((7)) is also obtained
from the relationship between the parameters ((8)); see
the ccdfplot function in Section 7.
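The link between the two models can be verified numerically: the excesses of a power law over its minimum value x_min follow exactly a GPD with ξ = 1/α and ψ = x_min/α. A self-contained sketch (the function names are ours, for illustration only):

```r
# Survival function of a Pareto power law with exponent alpha and minimum xmin,
# and of a GPD evaluated at the excesses y = x - xmin.
surv_pareto <- function(x, alpha, xmin) (x / xmin)^(-alpha)
surv_gpd    <- function(y, evi, psi) (1 + evi * y / psi)^(-1 / evi)

alpha <- 2.5; xmin <- 1.3
x <- seq(2, 50, by = 0.5)
# The two parameterizations agree up to rounding error:
max(abs(surv_pareto(x, alpha, xmin) -
        surv_gpd(x - xmin, evi = 1 / alpha, psi = xmin / alpha)))
```

This is why a Hill-plot analysis of a power law and a GPD threshold analysis of its excesses describe the same tail.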
Gupta and Kirmani (2000) show that the residual CV characterizes the
distribution, provided that the second moment is finite.

The cvplot function

In this section the cvplot function of the R package ercv is
introduced as a graphical tool for use in exploratory data analysis,
through the nine examples described in Section 3.2. The cvplot
function is essentially the empirical residual CV, whose asymptotic
distribution as a stochastic process is obtained by
del Castillo et al. (2014) and del Castillo and Padilla (2016).
Assume that the raw data consist of a sequence of independent and
identically distributed measurements x1, …, xn. The cvplot function
provides the empirical residual CV over the threshold t,
cv(t) = sd{xi − t : xi > t} / mean{xi − t : xi > t},
usually drawn as a function of the number of upper order statistics used.
Point-wise error limits for the CV-plot under a GPD model follow from the
asymptotic distribution of the empirical residual CV.
By default, the cvplot function draws the line cv = 1, corresponding to
the exponential distribution (evi = 0); CV-plots clearly above this line
suggest heavy tails, which can be analysed after transforming the data
with tdata (Section 6).
Alternatively, finite moments can be checked by a confidence interval
for the MLE estimator of evi, or the methods in the R package
RobExtremes (Ruckdeschel et al. 2019) and the references cited
therein can be used.
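To make explicit what the CV-plot displays, the empirical residual CV at a threshold t is simply the standard deviation of the excesses over t divided by their mean. A bare-bones sketch (not the ercv implementation, which also computes the confidence bands):

```r
# Empirical residual CV at threshold t: sd over mean of the excesses.
residual_cv <- function(x, t) {
  exc <- x[x > t] - t
  sd(exc) / mean(exc)
}

set.seed(1)
x <- rexp(5000)  # exponential data: the residual CV should stay near 1
sapply(quantile(x, c(0.5, 0.8, 0.9)), residual_cv, x = x)
```

For an exponential sample, the values remain close to 1 at every threshold, which is the threshold-stability property the CV-plot exploits.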
The use of the cvplot function and its options is described using nine
examples. The first four (iFFT, FFT, BIFP and MA) correspond to
different types of execution time distributions observed for a set of
representative programs for the analysis of automotive applications.
Three others are in R packages: MobyDick (“moby” in the R package
poweRlaw), Danish and Nidd (“danish” and “nidd.thresh” in the R
package evir). The Bilbao waves dataset (bilbao) was originally
analysed by Castillo and Hadi (1997). EURUSD is the dataset of euro/dollar
daily exchange rates between 1999 and 2016.
For instance, the CV-plot of the iFFT dataset, with the threshold set at
the median, is obtained with:
library("ercv")
data(iFFT)
cvplot(iFFT,thr=median(iFFT))
The plots in Figure 1 are arranged from left to right and
top to bottom. For iFFT, the CV-plot remains below the line cv = 1,
suggesting a light-tailed distribution (evi < 0).
The following three plots are obtained with the default cvplot
options, only adding a title, for instance:
data("moby", package = "poweRlaw")
cvplot(moby,main="MobyDick")
The second row of Figure 1 shows three examples that
suggest heavy tailed distributions. In the centre is MobyDick and on the
right is the Danish fire insurance dataset, which is a highly
heavy-tailed infinite-variance example used to illustrate the basic
ideas of extreme value theory, see Embrechts et al. (1997),
McNeil et al. (2005, Example 7.23) and Novak (2012, Example
9.8). Section 6 shows how to analyse these
examples, with the tdata
function, using the methodology developed in
del Castillo and Padilla (2016).
Nidd is the dataset of high levels of the River Nidd above a threshold
value of 65. Its CV-plot is always lower than the upper point-wise
confidence limit of the exponential model.
By default, the cvplot function draws the point-wise confidence band for
the exponential model (evi = 0); the evi option draws instead the bands
of other GPD models, identified by their extreme value index, as for the
Bilbao dataset:
data(bilbao)
cvplot(bilbao,evi = c(0,-1),main="Bilbao")
EURUSD is the data frame of the euro/dollar daily exchange rates between 1999 and 2016, including the financial crisis of 2007-08, obtained from the R package quantmod (Ryan 2016). Various parts of the EURUSD series have been studied by several authors, see Gomes and Pestana (2007) and del Castillo and Padilla (2016). The last plot in Figure 1 shows the CV-plot of the positive log-returns of the euro/dollar daily prices, obtained from:
data("EURUSD")
prices<-ts(EURUSD$EUR.USD,frequency=365,start=1999)
#plot(prices,col="blue",main="euro/dollar daily prices(1999-2016)")
return <- 100*diff(log(prices))
pos.return <- subset(return, return > 0)
cvplot(pos.return,main="pos.returns EUR/USD 1999-2016")
The dynamics of the daily return can be described by a GARCH(1,1) model.
One might then hope that, for sufficiently high thresholds, the
exceedances behave approximately like an independent sample, so that the
threshold methods above still apply.
The Tm function

Following the exploratory analysis, we would like to confirm or reject
some of the previous observations. It is known that, in order to make
optimal decisions, it is necessary to quantify the uncertainty of the
information extracted from data. Statistics provides mechanisms to
ensure a controlled probability of error, but there is always the risk
of misuse in multiple testing, especially in EVT, where quite small
changes can be greatly magnified on extrapolation. The asymptotic
distribution of the residual coefficient of variation for the GPD as a
random process indexed by the threshold, obtained by del Castillo and Padilla (2016),
provides the point-wise error limits for the CV-plot used in the last section
and a multiple thresholds test that truly reduces the multiple testing
problem, hence providing clearly defined p-values.
Using the building blocks given by ((10)), the multiple
threshold test (Tm function of the R package ercv) combines, for a
(supplementary) number of thresholds m, the deviations of the empirical
residual CVs from their expected value into a single statistic, tms,
whose p-value is estimated by simulation (nsim option).
The Tm function makes it possible to check whether the exponential
distribution (evi = 0) can be assumed for the 75 upper extremes of the
Nidd dataset:
data("nidd.thresh",package = "evir")
Tm(nidd.thresh,evi=0, nextremes = 75)
nextremes cvopt evi tms pvalue
75 1.000 0.000 0.981 0.310
The Tm function returns the number of extremes used, the estimated
residual CV (cvopt) and the corresponding extreme value index (evi), the
value of the tms statistic and its p-value. If the evi argument is
omitted, the Tm function assumes that the parameter is unknown and
estimates it from the data.
The following code shows that the assumption of a constant CV (that is, of a GPD) is rejected for the complete sample.
Tm(nidd.thresh)
nextremes cvopt evi tms pvalue
154 1.225 0.167 1.214 0.030
The hypothesis that the Bilbao dataset is uniformly distributed (evi = −1) is rejected. However, a GPD cannot be rejected, as the following code shows:
Tm(bilbao,evi=-1,nsim=1000)
nextremes cvopt evi tms pvalue
179 0.577 -1.000 0.629 0.003
Tm(bilbao,nsim=1000)
nextremes cvopt evi tms pvalue
179 0.650 -0.685 0.254 0.172
The confidence interval for the estimated extreme value index is provided by the cievi function:
cievi(nextremes=length(bilbao),evi=-0.685)
5% 95%
-0.778 -0.549
Using a small threshold, the Tm function shows that the
positive and negative returns of the euro/dollar between 1999 and 2016
can be assumed to be exponentially distributed:
Tm(pos.return,m=50,evi=0,thr=0.1,nsim=1000)
nextremes cvopt evi tms pvalue
2207 1.000 0.000 0.392 0.780
neg.return <- -subset(return, return < 0)
Tm(neg.return,m=50,evi=0,thr=0.1,nsim=1000)
nextremes cvopt evi tms pvalue
2187 1.000 0.000 1.160 0.231
On an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz, the last statement takes an elapsed time of 4.73 seconds (measured with proc.time()).
Threshold selection (thrselect)

There are two different approaches to the question of threshold choice. The first approach is to regard the free choice of the threshold as an advantageous feature of the procedure. By varying the threshold, the data can be explored, and if a single estimate is needed it can be obtained by subjective choice. It may well be that such a subjective approach is in reality the most useful one.
The other, to some extent opposing, view is that there is a need for an automatic method whereby the threshold is chosen by the data. It is fairer to use the word automatic rather than objective for such a method, because arbitrary decisions are involved in the choice of the method itself. Nevertheless, it is of course the case that, conditional on the automatic method being used, the threshold is indeed objective. Automatic methods need not be used in an uncritical way; they can of course be used as a starting point for fine tuning.
The thrselect function in the R package ercv starts with the m thresholds
used by the Tm function and drops the lowest one, step by step, until the
multiple threshold test no longer rejects the GPD hypothesis.
If we apply the thrselect function to the Nidd dataset, the code shows:
DF <- thrselect(nidd.thresh,m=10, nsim=1000)
m nextremes threshold rcv cvopt evi tms pvalue
5 6 63 87.85 1.193 1.073 0.0656 0.408 0.102
This means that the algorithm needs five steps to select the threshold
87.85, leaving 63 extremes above it, with an estimated evi of 0.0656 and
a p-value of 0.102 for the multiple threshold test.
The output of thrselect is stored in the data frame DF: the printed
values are in DF$solution, and DF$options provides complementary
information that can be used for a more personalized analysis.
print(DF$options,digits=4)
m nextremes threshold rcv cvopt evi tms pvalue
1 10 154 65.08 1.2486 1.2249 0.166758 1.33553 0.023
2 9 123 74.38 1.4082 1.2183 0.163112 1.47158 0.012
3 8 99 77.80 1.3163 1.1634 0.130594 0.93927 0.034
4 7 79 81.40 1.2587 1.1175 0.099606 0.64548 0.064
5 6 63 87.85 1.1933 1.0728 0.065559 0.40795 0.102
6 5 50 92.82 1.1328 1.0320 0.030493 0.24415 0.217
7 4 40 99.14 1.0714 0.9945 -0.005584 0.12917 0.457
8 3 32 107.94 1.0054 0.9619 -0.040406 0.05888 0.609
9 2 26 115.93 0.9006 0.9396 -0.066323 0.04218 0.637
10 1 21 131.87 0.9473 0.9667 -0.034986 0.01755 0.597
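The stepwise logic of the selection can be sketched generically: scan the candidate thresholds from lowest to highest and stop at the first one whose test p-value clears a given level. Whether thrselect uses exactly this stopping level is an implementation detail; the sketch below (our own code, fed with the p-values printed above for Nidd) only illustrates the scan:

```r
# Generic automatic threshold selection: return the first threshold whose
# p-value exceeds the level alpha (NA if none does).
select_threshold <- function(thresholds, pvalues, alpha = 0.10) {
  ok <- which(pvalues > alpha)
  if (length(ok) == 0) NA else thresholds[ok[1]]
}

# Thresholds and p-values from the first rows of DF$options for Nidd:
thresholds <- c(65.08, 74.38, 77.80, 81.40, 87.85)
pvalues    <- c(0.023, 0.012, 0.034, 0.064, 0.102)
select_threshold(thresholds, pvalues)  # 87.85, matching DF$solution
```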
Transformation of data (tdata)

It is possible to extend the previous methodology, based on the CV, to all
distributions, even without finite moments. For CV-plots above the
straight line cv = 1, the data can be transformed with the tdata function
in the R package ercv.
This technique is founded on the following result: if X is a GPD random
variable with extreme value index ξ > 0 and scale ψ, then
Y = log(1 + ξX/ψ)/ξ is exponentially distributed, hence light tailed. The
parameter used by tdata is estimated by MLE with the
internal function egpd (see del Castillo and Serra (2015)) or may be provided
by the researcher as a preliminary estimate.
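The result that if X is GPD with extreme value index evi > 0 and scale psi then log(1 + evi·X/psi)/evi is unit exponential can be checked by simulation; this is a sketch of the underlying result, not of the package internals:

```r
# A heavy-tailed GPD sample, log-transformed with its own parameters,
# becomes a unit-exponential sample (mean 1, residual CV 1).
set.seed(2)
evi <- 0.7; psi <- 2                # evi > 1/2: the GPD has infinite variance
u <- runif(10000)
x <- psi * (u^(-evi) - 1) / evi     # GPD sample by inverse transform sampling
y <- log(1 + evi * x / psi) / evi   # transformed sample

c(mean(y), sd(y) / mean(y))         # both close to 1
```

This is why the transformed CV-plots in Figure 2 come back down to the cv = 1 line even though the original data have no finite variance.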
The CV-plots for Danish and MobyDick transformed by the tdata function
are obtained with:
data("danish",package = "evir")
tdanish<- tdata(danish)
cvplot(tdanish,main="transformed Danish")
tmoby<- tdata(moby)
cvplot(tmoby,main="transformed MobyDick")
The CV-plots in Figure 2 for the transformed datasets are more stable than the original CV-plots in Figure 1 and actually look light tailed. The CV-plot of the transformed MobyDick has a sawtooth profile because the original dataset only takes positive integer values and the smaller values have a high frequency (among the 18,855 values, 1 appears 9,161 times, 2 appears 3,085, ... ). In order to use a GPD approach for this example we assume that the data correspond to positive values rounded to the nearest integer.
The Tm function rejects GPD for the complete transformed MobyDick
dataset. The same result is obtained with the transformation of the
dataset over the thresholds 2 and 3. However, GPD is not rejected over
threshold 4; hence, the frequencies of words that appear four or more
times in the novel Moby Dick (4,980 observations) can be approximated by
a GPD distribution:
t4moby<-tdata(moby,thr=4)
Tm(t4moby,m=50,nsim=1000)
nextremes cvopt evi tms pvalue
4980 0.581 -0.982 0.198 0.293
The Danish example was studied by del Castillo and Padilla (2016). The results
obtained are validated by the Tm function after the transformation with
tdata:
Tm(tdanish,m=20,nextremes = 951,omit = 8, nsim = 1000)
nextremes cvopt evi tms pvalue
951 0.676 -0.595 0.256 0.253
Applying the thrselect function to Danish after the transformation by
tdata, we obtain:
DF<-thrselect(tdanish,m=30,nsim=1000)
m nextremes threshold rcv cvopt evi tms pvalue
19 12 116 1.283 0.589 0.6747 -0.598 0.265 0.11
The automatic algorithm chooses the threshold 1.283 on the transformed
scale, leaving 116 extremes, with an estimated evi of −0.598. The
corresponding confidence interval is obtained with cievi:
cievi(116,evi=-0.596)
5% 95%
-0.714 -0.440
In the next section we will discuss these results with new features of the R package ercv.
The PoT method (fitpot and ccdfplot)

The tools described in the previous sections provide an asymptotic model
for threshold exceedances over a high quantile, the so-called
peak-over-threshold (PoT) method, see McNeil et al. (2005). The
PoT method is based on determining a high enough threshold from which
the distribution of the observations above this value, adjusted to zero,
approaches a GPD distribution. Then, given a threshold, the tail of the
distribution is described by the four parameters in ((12)): the extreme
value index (evi), the scale (psi), the threshold itself and the
probability (prob) of exceeding it.
The ppot function is the cumulative distribution function for the PoT
method: given an estimate of the four parameters in
((12)), ppot(x) returns the probability that an observation does not
exceed x. The qpot function is the quantile function for the PoT method,
which assigns to each probability p the value x for which
ppot(x) = p. The qpot
function can be used in the estimation of high quantiles, which in
terms of risk is expressed as the value at risk (VaR): for a small
probability p, the VaR is the quantile of order 1 − p.
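Above the threshold, the PoT distribution and quantile functions can be sketched as follows (our own minimal versions under the parameterization evi, psi, threshold, prob; the package's ppot and qpot also handle the part of the sample below the threshold):

```r
# Above the threshold thr, the PoT model gives
#   P(X > x) = prob * (1 + evi * (x - thr) / psi)^(-1/evi),
# where prob is the probability of exceeding thr.
ppot_sketch <- function(x, evi, psi, thr, prob)
  1 - prob * (1 + evi * (x - thr) / psi)^(-1 / evi)

# Inverting ppot_sketch gives the quantile function.
qpot_sketch <- function(p, evi, psi, thr, prob)
  thr + psi * (((1 - p) / prob)^(-evi) - 1) / evi

# Illustrative parameters (not fitted values): a 99.5% quantile (VaR)
# and the round trip back through ppot_sketch.
q <- qpot_sketch(0.995, evi = 0.5, psi = 7, thr = 9.2, prob = 0.054)
ppot_sketch(q, evi = 0.5, psi = 7, thr = 9.2, prob = 0.054)  # 0.995
```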
The fitpot function of the R package ercv provides an estimate of
the four parameters in ((12)) that allows approximating the
empirical cumulative distribution function of a dataset. It is assumed
that the threshold is fixed by the user; by default, fitpot uses MLE.
However, since the parameter evi can also be estimated through the Tm
function, this value can be entered into the
function fitpot, which then uses MLE on the model restricted to a
single parameter. From now on, this method of estimation will be called
the CV method.
The two methods of estimation of fitpot applied to Danish explain the
differences between the results obtained by us and by other researchers,
discussed in the previous section, as we can see with the
code:
fit1<-fitpot(danish,nextremes =116);fit1 #MLE
evi psi threshold prob
0.446 7.462 9.200 0.054
fit2<-fitpot(danish,evi=0.598,nextremes =116);fit2 #CV
evi psi threshold prob
0.598 6.450 9.200 0.054
Naturally, different estimation methods provide different estimates, but
the question of identifying the best approach still remains. To clarify
this point, we can use the ccdfplot function, which draws the
empirical complementary cumulative distribution function together with the
approximations provided by the parameters estimated by fitpot. The
ccdfplot function allows several approximations to be drawn at several
scales. The approximation is linear in the log-log scale for datasets
with heavy tails, while it is linear in the log scale for datasets with
exponential tails (log = "y", the default). To draw the approximation on
the natural scale, the option log = "" has to be used.
The plots of Figure 3 have been obtained with the ccdfplot
function applied to the Danish data, with the estimates obtained by MLE
(orange) and by the CV method (green), on logarithmic and double
logarithmic scales:
ccdfplot(danish,pars=list(fit1,fit2),main="Danish (log scale)")
ccdfplot(danish,pars=list(fit1,fit2),log="xy",main="Danish (log-log scale)")
Figure 3 shows that both adjustments are reasonable. The CV method is not worse than MLE, perhaps less optimistic or more realistic. The previous PoT approach can be validated using the Clauset et al. (2009) point of view.
Based on the four parameters estimated by fitpot, the linear relationship
((7)) can be checked by shifting the origin of the exceedances to
σ = ψ/ξ, as in the following code:
fit1<-as.numeric(fit1$coeff);sg1<- fit1[2]/fit1[1];sg1
fit2<-as.numeric(fit2$coeff);sg2<- fit2[2]/fit2[1];sg2
exDanish <- danish[danish>fit1[3]]-fit1[3] #origin to zero
exDanish1 <- exDanish+sg1 #origin to sg1
exDanish2 <- exDanish+sg2 #origin to sg2
exfit1 <- c(fit1[1],fit1[2],sg1,1)
exfit2 <- c(fit2[1],fit2[2],sg2,1)
ccdfplot(exDanish1, pars=c(exfit1),log="xy",main="adjusted by MLE")
ccdfplot(exDanish2, pars=c(exfit2),log="xy",main="adjusted by CV")
Plot (a) of Figure 4 shows the linear relationship
((7)) for the 116 upper extremes of Danish adjusted by MLE.
Changing fit1 to fit2 in the previous code, the linear relationship
obtained by the CV method is shown in plot (b). Notice that the
linear relationship ((7)) begins at the threshold σ = ψ/ξ.
We can also calculate the corresponding threshold on the original scale.
This work was supported by the Spanish Ministry of Economy and Competitiveness under Grant: Statistical modelling of environmental, technological and health risks, MTM2015-69493-R. David Moriña acknowledges financial support from the Spanish Ministry of Economy and Competitiveness, through the María de Maeztu Programme for Units of Excellence in R&D (MDM-2014-0445) and from Fundación Santander Universidades.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Castillo, et al., "Fitting Tails by the Empirical Residual Coefficient of Variation: The ercv Package", The R Journal, 2019
BibTeX citation
@article{RJ-2019-044,
  author = {Castillo, Joan del and Serra, Isabel and Padilla, Maria and Moriña, David},
  title = {Fitting Tails by the Empirical Residual Coefficient of Variation: The ercv Package},
  journal = {The R Journal},
  year = {2019},
  note = {https://rjournal.github.io/},
  volume = {11},
  issue = {2},
  issn = {2073-4859},
  pages = {56-68}
}