robslopes: Efficient Computation of the (Repeated) Median Slope
Abstract:
Modern use of slope estimation often involves the (repeated) estimation of a large number of slopes on a large number of data points. Some of the most popular non-parametric and robust alternatives to the least squares estimator are the Theil-Sen and Siegel’s repeated median slope estimators. The robslopes package contains fast algorithms for these slope estimators. The implemented randomized algorithms run in and expected time respectively and use space. They achieve speedups up to a factor compared with existing implementations for common sample sizes, as illustrated in a benchmark study, and they allow for the possibility of estimating the slopes on samples of size and larger thanks to the limited space usage. Finally, the original algorithms are adjusted in order to properly handle duplicate values in the data set.
Several implementations of these estimators exist, most notably in the R
packages deming(Therneau 2018), zyp(Bronaugh and Pacific Climate Impacts Consortium 2019),
mblm(Komsta 2019), and
RobustLinearReg(Hurtado 2020), all publicly available on CRAN. Despite their
popularity and importance, all of these publicly available
implementations are based on a brute-force computation of the (repeated)
median slope. Given a sample of observations, this approach requires
a computational cost of as well as
space. This makes the currently available implementations unsuitable for
modern applications involving larger data sets as they are rather slow
and potentially use prohibitively large amounts of storage space.
Nevertheless, there exist several algorithms for the Theil-Sen and
repeated median estimators which run in quasilinear time and require
only space. For the Theil-Sen estimator,
Cole et al. (1989), Katz and Sharir (1993), and Brönnimann and Chazelle (1998) proposed
algorithms running in deterministic time. A
randomized algorithm requiring -expected time was
proposed in Matoušek (1991), Dillencourt et al. (1992) and
Shafer and Steiger (1993). For the repeated median slope,
Stein and Werman (1992), Matoušek et al. (1993) and Matoušek et al. (1998)
proposed algorithms running in and
(-expected) time. A potential explanation for
these algorithms not being available in R may be twofold. Firstly, they
are all more involved than brute-force computation of the median slopes
and sample implementations in other programming languages are scarce, if
available at all. Secondly, most of them make efficient use of features
not (readily) available in R, such as pointers and in-place assignment,
as well as complex data structures.
In this article we discuss the R package
robslopes(Raymaekers 2022) which contains implementations of the randomized
algorithms by Matoušek (1991) and Matoušek et al. (1998) for
the Theil-Sen and repeated median estimators respectively. The original
algorithms have been adjusted in order to properly deal with potential
duplicates in the data. We start by briefly reviewing the problem
setting and the estimators. Next, we present a rough outline of the key
ideas underlying the algorithms and a brief description of the usage of
the functions in the package. We end with a benchmarking study comparing
the implementation with existing implementations in publicly available R
packages and concluding remarks.
2 The Theil-Sen and repeated median estimators
The typical problem setting of slope estimation is that of the simple
linear model. Suppose we have observations which follow
the model
where and are the (unknown) true intercept and slope
parameters and represents the noise. The most popular estimator
for the regression coefficients is the ordinary least squares (OLS)
estimator, i.e. the minimizing
. The OLS
estimator possesses several attractive properties, including ease of
computation and interpretation and optimal performance when the errors
are i.i.d. and follow a normal distribution. Despite these properties,
it is well known that the OLS estimator is very sensitive to outliers
and can have a quickly deteriorating performance as the error term
deviates from normality. For these reasons, many alternatives have been
suggested, of which the Theil-Sen estimator and Siegel’s repeated median
slope are two popular and intuitively attractive examples.
The Theil-Sen (TS) estimator (Theil 1950; Sen 1968) of
is defined as the median of the slopes of all the lines determined by
two different observations and :
where the median avoids as the slope through one point is
undefined and where and
. In case there are duplicate values in
, the proposal of (Sen 1968) is to only include
the slopes which are constructed using two observations with different
-values and we adopt this approach here.
The TS estimator has been analyzed extensively from a theoretical point
of view. (Sen 1968) showed its asymptotic normality and
equivariance properties. In particular, we have
scale equivariance:
for all
regression equivariance
for all ,
but the estimator is not equivariant under affine transformations of
both predictor and response variables. (Wang and Yu 2005) give precise
conditions for the unbiasedness of the TS estimator and
(Wilcox 1998) studied its behavior in the case of heteroscedastic
errors. From a robust statistics perspective, it is known that the TS
estimator is much more robust against outliers than the OLS estimator
(see (Rousseeuw and Leroy 2005)). In particular, it has a bounded influence
function (Hampel et al. 1986) and its breakdown value is
, where the latter can be
roughly interpreted as the maximum percentage of contaminated samples
that the estimator can handle (see (Donoho and Huber 1983) for an exact
definition). Finally, its maximum bias properties under contamination
have been studied by (Adrover and Zamar 2004).
The search for a slope estimator with a breakdown value higher than 30%
prompted Siegel to propose the repeated median (RM) slope
(Siegel 1982). It is the first slope estimator attaining the
maximal breakdown value of 50%. The repeated median estimator of
is computed by first calculating the median slope per observation ,
yielding values, and then taking the median of these values:
where now the inner median avoids . We handle duplicate values in
by skipping them in the computation of the inner
median, in the same spirit as duplicate handling for the TS estimator.
The consistency, unbiasedness and efficiency of the RM estimator were
discussed by (Siegel 1982), whereas the asymptotic normality was
analyzed in (Hossjer et al. 1994). The estimator was designed to
have a 50 % asymptotic breakdown value, and like the TS estimator, its
influence function is also bounded
(Rousseeuw et al. 1993; Rousseeuw et al. 1995; Rousseeuw and Leroy 2005).
Finally, it possesses the same equivariance properties as the TS
estimator, i.e. it is both scale and regression equivariant but not
affine equivariant, and its maximum bias properties are also discussed
in (Adrover and Zamar 2004).
It is clear that both and can be
computed by calculating all pairwise slopes, and
selecting the median or repeated median of these slopes. Clearly, this
brute-force approach has a computational cost, and
the available implementations of this approach also require storing the
slopes. In the next section, we describe the more
efficient algorithms implemented in the robslopes package.
While not the main focus of this article, it is worth mentioning that
the intercept can be estimated in several ways. We opt for the most
straight-forward approach of taking the median of the residuals as the
estimator for :
where can be or .
3 Randomized algorithms for slope selection
The R-package robslopes contains an implementation of the randomized
algorithm by Matoušek (1991) for the Theil-Sen estimator and
the algorithm of Matoušek et al. (1993; Matoušek et al. 1998) for
the repeated median estimator. Unlike in their original proposals, we
haven taken into account the case of possible duplicate values in the
and adjusted the algorithms to work properly in that case as well.
We now briefly outline these algorithms without going too much into
detail. For specific details and a complete description we refer to the
original references and our code.
Suppose we are given data points with
and we are interested in the median (or more generally, any order
statistic) of the slopes formed by connecting two data points. The main
idea behind the algorithms is to consider each observed data point
in dual space by associating it with the line
We will denote the coordinates in dual space with in the
following. It can be verified that the -coordinate of the
intersection of two lines in dual space (also called “intersection
abscissa” (IA)) is equal to the slope of the line passing through the
two points in the original space. Therefore, the problem of finding
order statistics of slopes can be translated into the problem of finding
order statistics of intersection abscissas in dual space. To find these
order statistics in dual space, the algorithms use the following idea.
Given an interval in dual space, the number of IAs
in that interval can be computed by counting the number of inversions in
a certain permutation. For a permutation on , an
inversion is a pair of elements for which and
. Figure 1 shows three
lines in dual space and illustrates the connection between IAs and
inversions which works as follows. Suppose we have two lines associated
with the points and . The -coordinates of the
intersection of these lines with in dual space are given by
and respectively. Suppose without
loss of generality that . Now, if
lines and intersect in the interval , we
must have that the -coordinate of the intersection of these lines
with have switched order:
. Therefore, there is a bijection
between the IAs in and the permutation obtained by
going from the order of the intersections of the lines with
to the order of the intersections with .
Counting the number of inversions in a permutation can be done in
time using an adaptation of merge sort
(Knuth 1998), and constructing the permutation itself has the same
computational complexity.
Figure 1: Three lines in dual space, restricted to the interval
. If two lines intersect in this interval, the
ordering of their values of the -coordinate at and
must have turned around. Therefore, the number of IAs in the
interval is given by the number of inversions in
the permutation
.
For the Theil-Sen estimator, we obtain IAs in dual space, where
parallel lines meet at by convention. Suppose we want to find
the -th smallest IA (e.g., could be
for the lower median). The idea is to maintain a half-open interval
, which is initialized at and
always contains the IA we are looking for. We also keep a count and
of the total number of IAs to the left and within the interval
respectively. If the number of IAs in the interval is of order
, we can enumerate them in time
and select the desired element (we call this “brute-force computation”).
If there are more IAs left in the interval, we use a contraction
strategy which makes the interval progressively
smaller while still containing the target -th smallest slope, until
it is small enough for brute-force computation.
In order to contract the interval, we randomly sample a
number of IAs. These are then used to estimate a new interval
which contains the
target IA with high probability. To check whether the contracted
interval contains the target, we count the number of IAs in the
intervals and . If the
current count added to the number of IAs in the former interval
exceeds , we know that the target IA is in . If
not, we check whether additionally adding the number of IAs in
gives a total count exceeding , in which case
the target IA is in . If neither of those cases
hold, we know the target IA is in . After the
contraction, we update and and this process is repeated until
, the number of IAs in the current interval, is of the order
. To execute this strategy, we need a method to randomly
sample IAs. It turns out that this can be achieved in
time, again using an adaptation of merge sort.
For the Theil-Sen estimator there is an expected number
of iterations required for convergence, leading to a total expected
complexity of .
For the repeated median estimator, the algorithm is similar to the
previously described algorithm in that it also works with an
interval-contraction strategy. In contrast with the previous algorithm
however, we now keep track of the lines for which the median IA falls
within the interval, in addition to the number of abscissas on the left
and within the current interval for each individual line. In each
contraction step, we first sample a number of
lines for which we know that their median IA lies within the current
interval. For each of these lines, a number of
IAs are now sampled which allow for the estimation of a new interval
containing the target
slope with high probability. The interval can then
be contracted by counting the number of IAs for each line on the left
and within . As before, the algorithm switches to
brute-force computation once the number of IAs in
is of the order . For this algorithm, there is an
expected number of contraction steps required,
resulting in a total complexity of .
The original algorithms were only described for the case where the
values are all distinct. In case of duplicate values however, some of
the slopes are not defined and the natural way of handling this is by
ignoring the undefined slopes and computing the median (or any other
order statistic) on the remaining slopes. Fortunately, we can
incorporate this into our algorithms by realizing that these duplicate
values correspond with parallel lines in dual space. Using the
convention that parallel lines in dual space meet at , the
(repeated) median slope can be found by appropriately adjusting the
value of the order statistic of the slope we want to find. For the
Theil-Sen estimator, this means we need to find the order statistic
corresponding with
values, where denotes the number of times the value occurs
in the predictor variable. For the repeated median estimator, we only
have to adjust the inner median, but the adjustment is dependent on the
individual line . More specifically, the order statistic for the
inner median needs to be computed on values. Note that we can
count the number of duplicates easily in time and so
the overall complexity of the algorithm is not affected by this change.
There is one additional complication, namely that of duplicate pairs
. If such pairs are present, one should be careful with
computing the permutation of intersections in dual space given an
interval . Obtaining such a permutation involves
sorting (and ranking) a vector of IAs. However, in case of duplicate
pairs, this sorting needs to be done using stable sort. If not,
duplicate pairs may randomly produce inversions, yielding sampled IAs at
even if the current interval has . In the
original algorithm, this scenario was not considered and thus not
explicitly accounted for. These changes have been incorporated in the
implementation provided in the robslopes package.
4 Implementation and usage
We briefly describe the implementation and usage of the functions in the
robslopes package. The package revolves around 2 main functions, the
TheilSen function and the RepeatedMedian function, both returning a
list with the self-explanatory elements intercept and slope. Both
functions start with input checking and duplicate counting, which are
done in R. The duplicate counts are then used to set the correct target
order statistics for the main part of the algorithm. This information is
then passed on to the randomized algorithms which are implemented in
C++, making use of the
rcpparmadillo
package (Eddelbuettel and Sanderson 2014; Eddelbuettel and François).
The TheilSen function has the layout
TheilSen(x, y, alpha = NULL, verbose = TRUE), where the x and y
arguments are the input vectors for the predictor and response variable
respectively. The verbose option allows for switching off the printing
of the computation progress. Finally, the alpha argument is a value
between 0 and 1 which determines the order statistic corresponding with
the target slope. When alpha = NULL, the default, the upper median of
the slopes is computed, which corresponds with the
-th order statistic. For any other value of
alpha, the function computes the alpha-th
order statistic of the slopes, where is the rounding operator.
The RepeatedMedian function has the layout
RepeatedMedian(x, y, alpha = NULL, beta = NULL, verbose = TRUE), where
the x, y and verbose arguments play the same role as for the
TheilSen function. The arguments alpha and beta determine the
order statistics of the inner and outer “median”. When NULL, the
default, they again correspond to the upper median. If they contain
values between zero and one, the order statistics corresponding with the
inner and outer “median” are given by alpha and
beta respectively.
For convenience, the TheilSen and RepeatedMedian functions have been
wrapped in the user-friendly functions robslope and robslope.fit.
These mimic the structure of common regression functions (e.g., lm and
lm.fit). In particular, robslope takes the standard arguments
formula, data, subset, weights and na.action, in addition to
the type argument selecting which type of slope to compute, as well as
the optional alpha, beta and verbose arguments of the TheilSen
and RepeatedMedian functions.
As an example we analyze the flights data of the
nycflights13
package (Wickham 2019). It contains on-time data for all flights that
departed in New York City (i.e. JFK, LGA or EWR airports) in 2013. We
consider the distance traveled in miles as the predictor variable, and
take the air time in minutes as the response. After removing all NA
values, we end up with a data set of 327,346 observations. In contrast
to the previous example, this one is too large to compute the TS or RM
slope with a brute-force algorithm (on most computers). We now calculate
the TS and RM slopes three times each, where we change the inner order
statistic to the first, second and third quartiles. With the exception
of graphical parameters, the code executed is shown below.
library("robslopes")library("nycflights13")data("flights")ts.out.25<-robslope(formula = air_time~distance, data = data, alpha =0.25)ts.out.50<-robslope(formula = air_time~distance, data = data, alpha =0.50)ts.out.75<-robslope(formula = air_time~distance, data = data, alpha =0.75)plot(data$distance, data$air_time)abline(coef(ts.out.25), col ="red", lwd =3)abline(coef(ts.out.50), col ="green", lwd =3)abline(coef(ts.out.75), col ="blue", lwd =3)rm.out.25<-robslope(formula = air_time~distance, data = data, beta =0.25,type ="RepeatedMedian")rm.out.50<-robslope(formula = air_time~distance, data = data, beta =0.50,type ="RepeatedMedian")rm.out.75<-robslope(formula = air_time~distance, data = data, beta =0.75,type ="RepeatedMedian")plot(data$distance, data$air_time)abline(coef(rm.out.25), col ="red", lwd =3)abline(coef(rm.out.50), col ="green", lwd =3)abline(coef(rm.out.75), col ="blue", lwd =3)
The resulting figure is shown in Figure 2. In this
example, the slopes calculated by the TS and RM estimators are virtually
identical due to the very limited number of outliers and the large
number of data points. Note that there are many duplicates in this data
set, which are handled appropriately by the proposed implementation.
Figure 2: The TS (left) and RM (right) slopes fitted on the
flights data. The first, second and third quartiles as the (inner) order
statistic are shown in red, green and blue respectively. The slopes are
virtually identical for both estimators in this example, as there are
not many outliers in the data. The implemented algorithm appropriately
deals with the many duplicate x-values in this data
set.
5 Benchmarking study
We now benchmark the implemented algorithm against the existing
implementations in the deming, zyp, mblm, and RobustLinearReg
packages. Note that the deming and zyp packages only contain the TS
estimator, the others contain both the TS and RM estimators. All
simulations were done using the R-package
microbenchmark(Mersmann 2019) and were run on an Intel Core i7-10750H @2.60GHz
processor. The setup is as follows. For each value of
, we have generated 100 samples
from the bivariate standard normal distribution. Afterward,
each of the available implementations was used to estimate the
regression parameters and the execution time was measured (in
nanoseconds).
The following code snippets show the code used for benchmarking the TS
and RM estimator for a sample of size , the same code was used
for the other sample sizes, with exception of , in which case
the implementation of the mblm package was left out due to its
computational burden (it takes several hours to compute it once).
n <-10^3mbm <-microbenchmark("deming"= deming::theilsen(y~x, data = data),"zyp"= zyp::zyp.sen(y~x, dataframe = data),"mblm"= mblm::mblm(y~x, dataframe = data, repeated =FALSE),"RobustLinearReg"= RobustLinearReg::theil_sen_regression(y~x,data = data),"robslopes"= robslopes::TheilSen(x, y),setup = {x =rnorm(n); y =rnorm(n); data =as.data.frame(cbind(x, y))}, times =100)autoplot(mbm)
n <-10^3mbm <-microbenchmark("mblm"= mblm::mblm(y~x, dataframe = data, repeated =TRUE),"RobustLinearReg"= RobustLinearReg::siegel_regression(y~x,data = data),"robslopes"= robslopes::RepeatedMedian(x, y),setup = {x =rnorm(n); y =rnorm(n); data =as.data.frame(cbind(x, y))}, times =100)autoplot(mbm)
Figure 3 shows the results for the Theil-Sen estimator.
We see that for the smallest sample size, , the absolute
computation times are extremely low and while there are visible
differences, they are probably not of practical relevance. For ,
the robslopes package is roughly five times faster than the fastest
competitor, and the mblm implementation starts to run away from the
other competitors. For , the differences become much clearer.
The robslopes implementation is now between one and two orders of
magnitude faster than the best competitor. Finally, for the
gap widens further to a difference of over two orders of magnitude. Note
that for , we have left out the mblm package in the
comparison as a single run takes several hours and the resulting plot
would hinder a clear comparison.
Figure 3: Computation times of the different implementations
of the Theil-Sen slope estimator for the sample size n equal to 10, 102, 103 and 104 in the top left, top right,
bottom left and bottom right panels respectively. The robslopes
implementation is consistently faster than the competition by orders of
magnitude. The difference scales close to linearly in n, gaining almost one order of
magnitude advantage for a similar increase in the sample size, which is
what we expect based on the theoretical computational complexities. The
mblm estimator was left out for sample size n = 104 due to it
requiring several hours to compute once.
Figure 4 shows the results for the RM estimator. As for
the TS estimator, the difference in computation time is already visible
for small sample sizes of and , where the robslopes
implementation is roughly a factor 20 faster than the competition.
However, the absolute computation times for these sample sizes is by no
means prohibitive for any of the implementations. Starting at
however, we start to see a larger difference, especially between the
robslopes and mblm implementations. The former is now roughly 800
times faster than the latter, and 60 times faster than the
RobustLinearReg implementation. For sample size , we did not
include the mblm implementation as it now takes roughly half an hour
to compute, in contrast with 16 seconds for the RobustLinearReg
implementation and 0.034 seconds (on average) for the robslopes
implementation. Finally note that interestingly, the repeated median of
the mblm package is much faster than the Theil-Sen estimator of the
same package. This is somewhat counter-intuitive, but turns out to be
caused by a much higher number of calls to the c() function
(concatenate) in the implementation of the TS estimator.
Figure 4: Computation times of the different implementations
of the repeated median slope estimator for the sample size n equal to 10, 102, 103 and 104 in the top left, top right,
bottom left and bottom right panels respectively. The robslopes
implementation is consistently faster than the competition by orders of
magnitude. The difference scales close to linearly in n, which is what we expect based on
the theoretical computational complexities. Note that the mblm
estimator was left out for sample size n = 104 due to it
requiring more than 106
milliseconds (approx. 30 minutes) to compute.
As explained before, the brute-force implementations of the (repeated)
median slope require storage as they typically
compute all slopes and store them in a matrix. Therefore,
while a sample size of is by no means extremely large, we
cannot go much higher than that. A sample size of for example,
would require a RAM memory of about 75GB, which is rather uncommon on
most computers and laptops. The implementations in the robslopes
package however, require only storage space, and we can
thus easily compute the estimators on much larger samples. To illustrate
this, we have continued the benchmarking study with only the robslopes
implementation, but this time for the sample sizes and
. This was done using the following code snippet, where the saving
of the results in the for loop is omitted:
for (n in10^(5:7)) { mbm <-microbenchmark("robslopes"= robslopes::TheilSen(x, y),setup = {x =rnorm(n); y =rnorm(n); data =as.data.frame(cbind(x, y))}, times =100)} for (n in10^(5:7)) { mbm <-microbenchmark("robslopes"= robslopes::RepeatedMedian(x, y),setup = {x =rnorm(n); y =rnorm(n); data =as.data.frame(cbind(x, y))}, times =100)}
Figure 5 shows the mean computation times of the TS
estimator and the RM estimator for sample sizes up to . As an
example, for a sample size of , the TS estimator requires
roughly 4 seconds of computation time, whereas the RM estimator requires
roughly 6 seconds. The blue shade on the figure indicates the maximum
and minimum computation time over the 100 replications, showing that the
computation times have a fairly small variance around their mean. The
red line shows a (robustly) estimated fit of the theoretical computation
times to the observed ones (i.e. of the functions
and for a
), indicating that for , the observed
computational cost grows according to the theoretical complexity.
Figure 5: Mean computation times of the Theil-Sen (left) and
repeated median (right) estimators as implemented in robslopes
for increasing sample size n.
The red line is an estimate of the ∼ nlog (n) (for TS) and
∼ nlog2(n) (for
RM) expected computational cost. The blue shade indicates the minimum
and maximum computation times. Other than the deviations for very small
n which are due to
computational overhead, the computational cost scales precisely as the
theory predicts. Furthermore, the variance of the computation times
around their mean is negligible.
6 Summary
We have introduced the robslopes package which contains fast
implementations of the popular Theil-Sen and repeated median slope
estimators. The implemented algorithms are randomized algorithms running
in quasilinear expected time and use linear space. In contrast, the
currently available implementations in different R packages on CRAN
require quadratic time and space. A benchmark study comparing the common
implementations of the slope estimator with the newly introduced one
illustrates speedups up to a factor compared with the next best
alternative implementation for common sample sizes. Additionally, due to
the linear space requirements of the algorithms, the slope estimators
can be computed on much larger sample sizes than the current maximum.
Finally, the original algorithms were adjusted to properly deal with
potential duplicates in the predictor variable.
The fast implementation of these popular slope estimators unlocks new
possibilities for their use in modern applications where the slope has
to be estimated repeatedly and on a large number of data points.
Evidently, inferential procedures based on bootstrapping are also highly
facilitated by these fast algorithms. Finally, the underlying C++
implementation may serve as a useful reference for implementations of
these algorithms in other programming languages, which also seem to be
scarce.
This article is converted from a Legacy LaTeX article using the
texor package.
The pdf version is the official version. To report a problem with the html,
refer to CONTRIBUTE on the R Journal homepage.
Footnotes
References
J. Adrover and R. H. Zamar. Bias robustness of three median-based regression estimates. Journal of statistical planning and inference, 122(1-2): 203–227, 2004. URL https://doi.org/10.1016/j.jspi.2003.06.001.
D. Alabi, A. McMillan, J. Sarathy, A. Smith and S. Vadhan. Differentially private simple linear regression. Proceedings on Privacy Enhancing Technologies (PoPETs), 2022(2): 184–204, 2022. URL https://doi.org/10.2478/popets-2022-0041.
T. Bernholt, R. Fried, U. Gather and I. Wegener. Modified repeated median filters. Statistics and Computing, 16(2): 177–192, 2006. URL https://doi.org/10.1007/s11222-006-8449-1.
D. Bronaugh and A. W. for the Pacific Climate Impacts Consortium. Zyp: Zhang + yue-pilon trends package. 2019. URL https://CRAN.R-project.org/package=zyp. R package version 0.10-1.1.
R. Cole, J. S. Salowe, W. L. Steiger and E. Szemerédi. An optimal-time algorithm for slope selection. SIAM Journal on Computing, 18(4): 792–810, 1989. URL https://doi.org/10.1137/0218055.
P. L. Davies, R. Fried and U. Gather. Robust signal extraction for on-line monitoring data. Journal of Statistical Planning and Inference, 122(1-2): 65–78, 2004. URL https://doi.org/10.1016/j.jspi.2003.06.012.
M. B. Dillencourt, D. M. Mount and N. S. Netanyahu. A randomized algorithm for slope selection. International Journal of Computational Geometry & Applications, 2(01): 1–27, 1992. URL https://doi.org/10.1142/S0218195992000020.
D. L. Donoho and P. J. Huber. The notion of breakdown point. In A festschrift for erich lehmann, Eds P. Bickel, K. Doksum and J. L. Hodges pages. 157–184 1983. Belmont: Wadsworth.
D. Eddelbuettel and C. Sanderson. RcppArmadillo: Accelerating r with high-performance c++ linear algebra. Computational Statistics and Data Analysis, 71: 1054–1063, 2014. URL http://dx.doi.org/10.1016/j.csda.2013.02.005.
R. Fried, T. Bernholt and U. Gather. Repeated median and hybrid filters. Computational statistics & data analysis, 50(9): 2313–2338, 2006. URL https://doi.org/10.1016/j.csda.2004.12.013.
R. Fried, J. Einbeck and U. Gather. Weighted repeated median smoothing and filtering. Journal of the American Statistical Association, 102(480): 1300–1308, 2007. URL https://doi.org/10.1198/016214507000001166.
U. Gather, K. Schettlinger and R. Fried. Online signal extraction by robust linear regression. Computational Statistics, 21(1): 33–51, 2006. URL https://doi.org/10.1007/s00180-006-0249-8.
S. Gelper, R. Fried and C. Croux. Robust forecasting with exponential and holt–winters smoothing. Journal of forecasting, 29(3): 285–300, 2010. URL https://doi.org/10.1002/for.1125.
F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel. Robust statistics: The approach based on influence functions. New York: Wiley, 1986.
O. Hossjer, P. J. Rousseeuw and C. Croux. Asymptotics of the repeated median slope estimator. The Annals of Statistics, 1478–1501, 1994. URL https://doi.org/10.1214/aos/1176325638.
J. Matoušek, D. M. Mount and N. S. Netanyahu. Efficient randomized algorithms for the repeated median line estimator. Algorithmica, 20(2): 136–150, 1998. URL https://doi.org/10.1007/PL00009190.
J. Matoušek, D. M. Mount and N. S. Netanyahu. Efficient randomized algorithms for the repeated median line estimator. In Proceedings of the fourth ACM-SIAM annual symposium on discrete algorithms, pages. 74–82 1993.
P. Meer, D. Mintz, A. Rosenfeld and D. Y. Kim. Robust regression methods for computer vision: A review. International journal of computer vision, 6(1): 59–70, 1991. URL https://doi.org/10.1007/BF00127126.
J. Raymaekers. Robslopes: Fast algorithms for robust slopes. 2022. R package version 1.1.2.
P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. John wiley & sons, 2005.
P. J. Rousseeuw, N. S. Netanyahu and D. M. Mount. New statistical and computational results on the repeated median line. In New directions in statistical data analysis and robustnes, Eds S. Morgenthaler, E. Ronchetti and W. A. Stahel pages. 177–194 1993. Basel: Birkh¨auser-Verlag.
P. Rousseeuw, C. Croux and O. Hössjer. Sensitivity functions and numerical analysis of the repeated median slope. Computational Statistics, 10(1): 71–90, 1995.
P. K. Sen. Estimates of the regression coefficient based on kendall’s tau. Journal of the American statistical association, 63(324): 1379–1389, 1968. URL https://doi.org/10.1080/01621459.1968.10480934.
L. Shafer and W. Steiger. Randomizing optimal geometric algorithms. In CCCG, 1993.
A. F. Siegel. Robust regression using repeated medians. Biometrika, 69(1): 242–244, 1982. URL https://doi.org/10.2307/2335877.
A. Stein and M. Werman. Finding the repeated median regression line. In SODA ’92, 1992.
H. Theil. A rank-invariant method of linear and polynomial regression analysis. I, II, III. Nederl. Akad. Wetensch., Proc., 53: 386--392, 521–525, 1397–141, 1950.
T. Therneau. Deming: Deming, theil-sen, passing-bablock and total least squares regression. 2018. URL https://CRAN.R-project.org/package=deming. R package version 1.4.
P. Zhai, X. Zhang, H. Wan and X. Pan. Trends in total precipitation and frequency of daily precipitation extremes over china. Journal of Climate, 18(7): 1096–1108, 01 Apr. 2005. URL https://doi.org/10.1175/JCLI-3318.1.
X. Zhang, L. A. Vincent, W. Hogg and A. Niitsoo. Temperature and precipitation trends in canada during the 20th century. Atmosphere-ocean, 38(3): 395–429, 2000. URL https://doi.org/10.1080/07055900.2000.9649654.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
Citation
For attribution, please cite this work as
Raymaekers, "robslopes: Efficient Computation of the (Repeated) Median Slope", The R Journal, 2023
BibTeX citation
@article{RJ-2023-012,
author = {Raymaekers, Jakob},
title = {robslopes: Efficient Computation of the (Repeated) Median Slope},
journal = {The R Journal},
year = {2023},
note = {https://rjournal.github.io/},
volume = {14},
issue = {4},
issn = {2073-4859},
pages = {38-49}
}