Modern use of slope estimation often involves the (repeated) estimation of a large number of slopes on a large number of data points. Some of the most popular non-parametric and robust alternatives to the least squares estimator are the Theil-Sen and Siegel’s repeated median slope estimators. The robslopes package contains fast algorithms for these slope estimators. The implemented randomized algorithms run in \(\mathcal{O}(n\log(n))\) and \(\mathcal{O}(n\log^2(n))\) expected time respectively and use \(\mathcal{O}(n)\) space. They achieve speedups up to a factor \(10^3\) compared with existing implementations for common sample sizes, as illustrated in a benchmark study, and they allow for the possibility of estimating the slopes on samples of size \(10^5\) and larger thanks to the limited space usage. Finally, the original algorithms are adjusted in order to properly handle duplicate values in the data set.
The Theil-Sen estimator (Theil 1950; Sen 1968) is arguably the most popular non-parametric and robust alternative to the least squares estimator for estimating the slope in simple linear regression. A variation on this estimator, Siegel’s repeated median slope (Siegel 1982), was the first slope estimator to attain the maximal breakdown value of 50 %, which roughly means that it can withstand up to 50% of outliers in the data. Since their introduction, these estimators have seen widespread use in a variety of applications including signal extraction (Davies et al. 2004; Gather et al. 2006), filtering (Bernholt et al. 2006; Fried et al. 2006, 2007; Gelper et al. 2010), computer vision (Meer et al. 1991), climatology (Zhai et al. 01 Apr. 2005; Zhang et al. 2000; Kosaka and Xie 2013) and very recently differentially private estimation (Fu et al. 2019; Alabi et al. 2022).
Several implementations of these estimators exist, most notably in the R packages deming (Therneau 2018), zyp (Bronaugh and Pacific Climate Impacts Consortium 2019), mblm (Komsta 2019), and RobustLinearReg (Hurtado 2020), all publicly available on CRAN. Despite their popularity and importance, all of these publicly available implementations are based on a brute-force computation of the (repeated) median slope. Given a sample of \(n\) observations, this approach requires a computational cost of \(\mathcal{O}(n^2)\) as well as \(\mathcal{O}(n^2)\) space. This makes the currently available implementations unsuitable for modern applications involving larger data sets as they are rather slow and potentially use prohibitively large amounts of storage space.
Nevertheless, there exist several algorithms for the Theil-Sen and repeated median estimators which run in quasilinear time and require only \(\mathcal{O}(n)\) space. For the Theil-Sen estimator, Cole et al. (1989), Katz and Sharir (1993), and Brönnimann and Chazelle (1998) proposed algorithms running in \(\mathcal{O}(n\log(n))\) deterministic time. A randomized algorithm requiring \(\mathcal{O}(n\log(n))\)-expected time was proposed in Matoušek (1991), Dillencourt et al. (1992) and Shafer and Steiger (1993). For the repeated median slope, Stein and Werman (1992), Matoušek et al. (1993) and Matoušek et al. (1998) proposed algorithms running in \(\mathcal{O}(n\log(n))\) and \(\mathcal{O}(n\log(n)^2)\) (-expected) time. A potential explanation for these algorithms not being available in R may be twofold. Firstly, they are all more involved than brute-force computation of the median slopes and sample implementations in other programming languages are scarce, if available at all. Secondly, most of them make efficient use of features not (readily) available in R, such as pointers and in-place assignment, as well as complex data structures.
In this article we discuss the R package robslopes (Raymaekers 2022) which contains implementations of the randomized algorithms by Matoušek (1991) and Matoušek et al. (1998) for the Theil-Sen and repeated median estimators respectively. The original algorithms have been adjusted in order to properly deal with potential duplicates in the data. We start by briefly reviewing the problem setting and the estimators. Next, we present a rough outline of the key ideas underlying the algorithms and a brief description of the usage of the functions in the package. We end with a benchmarking study comparing the implementation with existing implementations in publicly available R packages and concluding remarks.
The typical problem setting of slope estimation is that of the simple linear model. Suppose we have \(n\) observations \((x_i, y_i)\) which follow the model \[y_i = \alpha + \beta x_i + e_i\] where \(\alpha\) and \(\beta\) are the (unknown) true intercept and slope parameters and \(e_i\) represents the noise. The most popular estimator for the regression coefficients is the ordinary least squares (OLS) estimator, i.e. the \((\hat{\alpha}, \hat{\beta})\) minimizing \(\sum_{i=1}^{n}{(y_i - \hat{\alpha}-\hat{\beta}x_i)^2}\). The OLS estimator possesses several attractive properties, including ease of computation and interpretation and optimal performance when the errors are i.i.d. and follow a normal distribution. Despite these properties, it is well known that the OLS estimator is very sensitive to outliers and can have a quickly deteriorating performance as the error term deviates from normality. For these reasons, many alternatives have been suggested, of which the Theil-Sen estimator and Siegel’s repeated median slope are two popular and intuitively attractive examples.
The Theil-Sen (TS) estimator (Theil 1950; Sen 1968) of \(\beta\) is defined as the median of the slopes of all the lines determined by two different observations \((x_i,y_i)\) and \((x_j, y_j)\): \[\hat{\beta}_{TS}(\mathbf{x},\mathbf{y})=\mathop{\mathrm{med}}_{i\neq j} \frac{y_i - y_j}{x_i - x_j},\] where the median avoids \(i=j\) as the slope through one point is undefined and where \(\mathbf{x}= x_1, \ldots, x_n\) and \(\mathbf{y}= y_1, \ldots, y_n\). In case there are duplicate values in \(x_1,\ldots,x_n\), the proposal of (Sen 1968) is to only include the slopes which are constructed using two observations with different \(x\)-values and we adopt this approach here.
The TS estimator has been analyzed extensively from a theoretical point of view. (Sen 1968) showed its asymptotic normality and equivariance properties. In particular, we have
scale equivariance: \(\hat{\beta}_{TS}(\mathbf{x},c\mathbf{y}) = c\hat{\beta}_{TS}(\mathbf{x},\mathbf{y})\) for all \(c \in \mathbb{R}\)
regression equivariance \(\hat{\beta}_{TS}(\mathbf{x},\mathbf{y}+ a\mathbf{x}) = \hat{\beta}_{TS}(\mathbf{x},\mathbf{y}) + a\) for all \(a \in \mathbb{R}\),
but the estimator is not equivariant under affine transformations of
both predictor and response variables. (Wang and Yu 2005) give precise
conditions for the unbiasedness of the TS estimator and
(Wilcox 1998) studied its behavior in the case of heteroscedastic
errors. From a robust statistics perspective, it is known that the TS
estimator is much more robust against outliers than the OLS estimator
(see (Rousseeuw and Leroy 2005)). In particular, it has a bounded influence
function (Hampel et al. 1986) and its breakdown value is
\(1 - \sqrt{\frac{1}{2}} \approx 0.293 \%\), where the latter can be
roughly interpreted as the maximum percentage of contaminated samples
that the estimator can handle (see (Donoho and Huber 1983) for an exact
definition). Finally, its maximum bias properties under contamination
have been studied by (Adrover and Zamar 2004).
The search for a slope estimator with a breakdown value higher than 30%
prompted Siegel to propose the repeated median (RM) slope
(Siegel 1982). It is the first slope estimator attaining the
maximal breakdown value of 50%. The repeated median estimator of \(\beta\)
is computed by first calculating the median slope per observation \(x_i\),
yielding \(n\) values, and then taking the median of these values:
\[\hat{\beta}_{RM}(\mathbf{x},\mathbf{y})=\mathop{\mathrm{med}}_i \mathop{\mathrm{med}}_{j\neq i} \frac{y_j - y_i}{x_j - x_i},\]
where now the inner median avoids \(i=j\). We handle duplicate values in
\(x_1,\ldots,x_n\) by skipping them in the computation of the inner
median, in the same spirit as duplicate handling for the TS estimator.
The consistency, unbiasedness and efficiency of the RM estimator were
discussed by (Siegel 1982), whereas the asymptotic normality was
analyzed in (Hossjer et al. 1994). The estimator was designed to
have a 50 % asymptotic breakdown value, and like the TS estimator, its
influence function is also bounded
(Rousseeuw et al. 1993; Rousseeuw et al. 1995; Rousseeuw and Leroy 2005).
Finally, it possesses the same equivariance properties as the TS
estimator, i.e. it is both scale and regression equivariant but not
affine equivariant, and its maximum bias properties are also discussed
in (Adrover and Zamar 2004).
It is clear that both \(\hat{\beta}_{TS}\) and \(\hat{\beta}_{RM}\) can be computed by calculating all \(\frac{n(n-1)}{2}\) pairwise slopes, and selecting the median or repeated median of these slopes. Clearly, this brute-force approach has a \(\mathcal{O}(n^2)\) computational cost, and the available implementations of this approach also require storing the \(\mathcal{O}(n^2)\) slopes. In the next section, we describe the more efficient algorithms implemented in the robslopes package.
While not the main focus of this article, it is worth mentioning that the intercept can be estimated in several ways. We opt for the most straight-forward approach of taking the median of the residuals as the estimator for \(\alpha\): \[\hat{\alpha}= \mathop{\mathrm{med}}_i \left(y_i - \hat{\beta}x_i\right),\] where \(\hat{\beta}\) can be \(\hat{\beta}_{TS}\) or \(\hat{\beta}_{RM}\).
The R-package robslopes contains an implementation of the randomized algorithm by Matoušek (1991) for the Theil-Sen estimator and the algorithm of Matoušek et al. (1993; Matoušek et al. 1998) for the repeated median estimator. Unlike in their original proposals, we haven taken into account the case of possible duplicate values in the \(x_i\) and adjusted the algorithms to work properly in that case as well. We now briefly outline these algorithms without going too much into detail. For specific details and a complete description we refer to the original references and our code.
Suppose we are given \(n\) data points \((x_i, y_i)\) with \(i = 1,\ldots,n\) and we are interested in the median (or more generally, any order statistic) of the slopes formed by connecting two data points. The main idea behind the algorithms is to consider each observed data point \((x_i, y_i)\) in dual space by associating it with the line \[v = x_i u - y_i.\] We will denote the coordinates in dual space with \((u,v)\) in the following. It can be verified that the \(u\)-coordinate of the intersection of two lines in dual space (also called “intersection abscissa” (IA)) is equal to the slope of the line passing through the two points in the original space. Therefore, the problem of finding order statistics of slopes can be translated into the problem of finding order statistics of intersection abscissas in dual space. To find these order statistics in dual space, the algorithms use the following idea. Given an interval \((u_{low},u_{high}]\) in dual space, the number of IAs in that interval can be computed by counting the number of inversions in a certain permutation. For a permutation \(\pi\) on \(1,\ldots,n\), an inversion is a pair of elements \((\pi(i),\pi(j))\) for which \(i<j\) and \(\pi(i) > \pi(j)\). Figure 1 shows three lines in dual space and illustrates the connection between IAs and inversions which works as follows. Suppose we have two lines associated with the points \((x_i,y_i)\) and \((x_j,y_j)\). The \(v\)-coordinates of the intersection of these lines with \(u_{low}\) in dual space are given by \(x_i u_{low}- y_i\) and \(x_ju_{low}- y_j\) respectively. Suppose without loss of generality that \(x_i u_{low}- y_i < x_ju_{low}- y_j\). Now, if lines \(i\) and \(j\) intersect in the interval \((u_{low},u_{high}]\), we must have that the \(v\)-coordinate of the intersection of these lines with \(u_{high}\) have switched order: \(x_i u_{high}- y_i > x_ju_{high}- y_j\). Therefore, there is a bijection between the IAs in \((u_{low},u_{high}]\) and the permutation obtained by going from the order of the intersections of the lines with \(u = u_{low}\) to the order of the intersections with \(u = u_{high}\). Counting the number of inversions in a permutation can be done in \(\mathcal{O}(n\log(n))\) time using an adaptation of merge sort (Knuth 1998), and constructing the permutation itself has the same computational complexity.
For the Theil-Sen estimator, we obtain \(n(n-1)\) IAs in dual space, where parallel lines meet at \(+\infty\) by convention. Suppose we want to find the \(k\)-th smallest IA (e.g., \(k\) could be \(\lfloor (n(n-1)+1)/2\rfloor\) for the lower median). The idea is to maintain a half-open interval \((u_{low},u_{high}]\), which is initialized at \((-\infty, \infty]\) and always contains the IA we are looking for. We also keep a count \(L\) and \(C\) of the total number of IAs to the left and within the interval respectively. If the number of IAs in the interval is of order \(\mathcal{O}(n)\), we can enumerate them in \(\mathcal{O}(n\log(n))\) time and select the desired element (we call this “brute-force computation”). If there are more IAs left in the interval, we use a contraction strategy which makes the interval \((u_{low},u_{high}]\) progressively smaller while still containing the target \(k\)-th smallest slope, until it is small enough for brute-force computation.
In order to contract the interval, we randomly sample a \(\mathcal{O}(n)\) number of IAs. These are then used to estimate a new interval \((u_{low}',u_{high}']\subset (u_{low},u_{high}]\) which contains the target IA with high probability. To check whether the contracted interval contains the target, we count the number of IAs in the intervals \((u_{low}, u_{low}']\) and \((u_{low}',u_{high}']\). If the current count \(L\) added to the number of IAs in the former interval exceeds \(k\), we know that the target IA is in \((u_{low}, u_{low}']\). If not, we check whether additionally adding the number of IAs in \((u_{low}',u_{high}']\) gives a total count exceeding \(k\), in which case the target IA is in \((u_{low}', u_{high}']\). If neither of those cases hold, we know the target IA is in \((u_{high}', u_{high}]\). After the contraction, we update \(L\) and \(C\) and this process is repeated until \(C\), the number of IAs in the current interval, is of the order \(\mathcal{O}(n)\). To execute this strategy, we need a method to randomly sample IAs. It turns out that this can be achieved in \(\mathcal{O}(n\log(n))\) time, again using an adaptation of merge sort. For the Theil-Sen estimator there is an expected \(\mathcal{O}(1)\) number of iterations required for convergence, leading to a total expected complexity of \(\mathcal{O}(n\log(n))\).
For the repeated median estimator, the algorithm is similar to the previously described algorithm in that it also works with an interval-contraction strategy. In contrast with the previous algorithm however, we now keep track of the lines for which the median IA falls within the interval, in addition to the number of abscissas on the left and within the current interval for each individual line. In each contraction step, we first sample a \(\mathcal{O}(\sqrt{n})\) number of lines for which we know that their median IA lies within the current interval. For each of these lines, a \(\mathcal{O}(\sqrt{n})\) number of IAs are now sampled which allow for the estimation of a new interval \((u_{low}',u_{high}']\subset (u_{low},u_{high}]\) containing the target slope with high probability. The interval \((u_{low},u_{high}]\) can then be contracted by counting the number of IAs for each line on the left and within \((u_{low}',u_{high}']\). As before, the algorithm switches to brute-force computation once the number of IAs in \((u_{low},u_{high}]\) is of the order \(\mathcal{O}(n)\). For this algorithm, there is an expected \(\mathcal{O}(\log(n))\) number of contraction steps required, resulting in a total complexity of \(\mathcal{O}(n\log^2(n))\).
The original algorithms were only described for the case where the \(x_i\) values are all distinct. In case of duplicate values however, some of the slopes are not defined and the natural way of handling this is by ignoring the undefined slopes and computing the median (or any other order statistic) on the remaining slopes. Fortunately, we can incorporate this into our algorithms by realizing that these duplicate \(x_i\) values correspond with parallel lines in dual space. Using the convention that parallel lines in dual space meet at \(+\infty\), the (repeated) median slope can be found by appropriately adjusting the value of the order statistic of the slope we want to find. For the Theil-Sen estimator, this means we need to find the order statistic corresponding with \(\frac{n(n-1)}{2}-\sum_{i=1}^n{\frac{d_i-1}{2}}\) values, where \(d_i\) denotes the number of times the value \(x_i\) occurs in the predictor variable. For the repeated median estimator, we only have to adjust the inner median, but the adjustment is dependent on the individual line \(i\). More specifically, the order statistic for the inner median needs to be computed on \(n-d_i\) values. Note that we can count the number of duplicates easily in \(\mathcal{O}(n)\) time and so the overall complexity of the algorithm is not affected by this change. There is one additional complication, namely that of duplicate pairs \((x_i, y_i)\). If such pairs are present, one should be careful with computing the permutation of intersections in dual space given an interval \((u_{low}, u_{high}]\). Obtaining such a permutation involves sorting (and ranking) a vector of IAs. However, in case of duplicate pairs, this sorting needs to be done using stable sort. If not, duplicate pairs may randomly produce inversions, yielding sampled IAs at \(\infty\) even if the current interval has \(u_{high}< \infty\). In the original algorithm, this scenario was not considered and thus not explicitly accounted for. These changes have been incorporated in the implementation provided in the robslopes package.
We briefly describe the implementation and usage of the functions in the
robslopes package. The package revolves around 2 main functions, the
TheilSen
function and the RepeatedMedian
function, both returning a
list with the self-explanatory elements intercept
and slope
. Both
functions start with input checking and duplicate counting, which are
done in R. The duplicate counts are then used to set the correct target
order statistics for the main part of the algorithm. This information is
then passed on to the randomized algorithms which are implemented in
C++, making use of the
rcpparmadillo
package (Eddelbuettel and Sanderson 2014; Eddelbuettel and François).
The TheilSen
function has the layout
TheilSen(x, y, alpha = NULL, verbose = TRUE)
, where the x
and y
arguments are the input vectors for the predictor and response variable
respectively. The verbose
option allows for switching off the printing
of the computation progress. Finally, the alpha
argument is a value
between 0 and 1 which determines the order statistic corresponding with
the target slope. When alpha = NULL
, the default, the upper median of
the \(m\) slopes is computed, which corresponds with the
\(\lfloor (m + 2) / 2\rfloor\)-th order statistic. For any other value of
\(0\leq\)alpha
\(\leq 1\), the function computes the \([\)alpha
\(\;m]\)-th
order statistic of the slopes, where \([\cdot]\) is the rounding operator.
The RepeatedMedian
function has the layout
RepeatedMedian(x, y, alpha = NULL, beta = NULL, verbose = TRUE)
, where
the x
, y
and verbose
arguments play the same role as for the
TheilSen
function. The arguments alpha
and beta
determine the
order statistics of the inner and outer “median”. When NULL
, the
default, they again correspond to the upper median. If they contain
values between zero and one, the order statistics corresponding with the
inner and outer “median” are given by \([\)alpha
\(\;m]\) and
\([\)beta
\(\;m]\) respectively.
For convenience, the TheilSen
and RepeatedMedian
functions have been
wrapped in the user-friendly functions robslope
and robslope.fit
.
These mimic the structure of common regression functions (e.g., lm
and
lm.fit
). In particular, robslope
takes the standard arguments
formula
, data
, subset
, weights
and na.action
, in addition to
the type
argument selecting which type of slope to compute, as well as
the optional alpha
, beta
and verbose
arguments of the TheilSen
and RepeatedMedian
functions.
As an example we analyze the flights data of the nycflights13 package (Wickham 2019). It contains on-time data for all flights that departed in New York City (i.e. JFK, LGA or EWR airports) in 2013. We consider the distance traveled in miles as the predictor variable, and take the air time in minutes as the response. After removing all NA values, we end up with a data set of 327,346 observations. In contrast to the previous example, this one is too large to compute the TS or RM slope with a brute-force algorithm (on most computers). We now calculate the TS and RM slopes three times each, where we change the inner order statistic to the first, second and third quartiles. With the exception of graphical parameters, the code executed is shown below.
library("robslopes")
library("nycflights13")
data("flights")
.25 <- robslope(formula = air_time~distance, data = data, alpha = 0.25)
ts.out.50 <- robslope(formula = air_time~distance, data = data, alpha = 0.50)
ts.out.75 <- robslope(formula = air_time~distance, data = data, alpha = 0.75)
ts.outplot(data$distance, data$air_time)
abline(coef(ts.out.25), col = "red", lwd = 3)
abline(coef(ts.out.50), col = "green", lwd = 3)
abline(coef(ts.out.75), col = "blue", lwd = 3)
.25 <- robslope(formula = air_time~distance, data = data, beta = 0.25,
rm.outtype = "RepeatedMedian")
.50 <- robslope(formula = air_time~distance, data = data, beta = 0.50,
rm.outtype = "RepeatedMedian")
.75 <- robslope(formula = air_time~distance, data = data, beta = 0.75,
rm.outtype = "RepeatedMedian")
plot(data$distance, data$air_time)
abline(coef(rm.out.25), col = "red", lwd = 3)
abline(coef(rm.out.50), col = "green", lwd = 3)
abline(coef(rm.out.75), col = "blue", lwd = 3)
The resulting figure is shown in Figure 2. In this example, the slopes calculated by the TS and RM estimators are virtually identical due to the very limited number of outliers and the large number of data points. Note that there are many duplicates in this data set, which are handled appropriately by the proposed implementation.
We now benchmark the implemented algorithm against the existing implementations in the deming, zyp, mblm, and RobustLinearReg packages. Note that the deming and zyp packages only contain the TS estimator, the others contain both the TS and RM estimators. All simulations were done using the R-package microbenchmark (Mersmann 2019) and were run on an Intel Core i7-10750H @2.60GHz processor. The setup is as follows. For each value of \(n = \{10, 10^2, 10^3, 10^4\}\), we have generated 100 samples \((x_i,y_i)\) from the bivariate standard normal distribution. Afterward, each of the available implementations was used to estimate the regression parameters and the execution time was measured (in nanoseconds).
The following code snippets show the code used for benchmarking the TS and RM estimator for a sample of size \(n = 10^3\), the same code was used for the other sample sizes, with exception of \(n = 10^4\), in which case the implementation of the mblm package was left out due to its computational burden (it takes several hours to compute it once).
<- 10^3
n <- microbenchmark("deming" = deming::theilsen(y~x, data = data),
mbm "zyp" = zyp::zyp.sen(y~x, dataframe = data),
"mblm" = mblm::mblm(y~x, dataframe = data, repeated = FALSE),
"RobustLinearReg" = RobustLinearReg::theil_sen_regression(y~x,
data = data),
"robslopes" = robslopes::TheilSen(x, y),
setup = {x = rnorm(n); y = rnorm(n);
= as.data.frame(cbind(x, y))}, times = 100)
data autoplot(mbm)
<- 10^3
n <- microbenchmark("mblm" = mblm::mblm(y~x, dataframe = data, repeated = TRUE),
mbm "RobustLinearReg" = RobustLinearReg::siegel_regression(y~x,
data = data),
"robslopes" = robslopes::RepeatedMedian(x, y),
setup = {x = rnorm(n); y = rnorm(n);
= as.data.frame(cbind(x, y))}, times = 100)
data autoplot(mbm)
Figure 3 shows the results for the Theil-Sen estimator. We see that for the smallest sample size, \(n = 10\), the absolute computation times are extremely low and while there are visible differences, they are probably not of practical relevance. For \(n=100\), the robslopes package is roughly five times faster than the fastest competitor, and the mblm implementation starts to run away from the other competitors. For \(n = 10^3\), the differences become much clearer. The robslopes implementation is now between one and two orders of magnitude faster than the best competitor. Finally, for \(n = 10^4\) the gap widens further to a difference of over two orders of magnitude. Note that for \(n = 10^4\), we have left out the mblm package in the comparison as a single run takes several hours and the resulting plot would hinder a clear comparison.