Distance Measures for Time Series in R: The TSdist Package

The definition of a distance measure between time series is crucial for many time series data mining tasks, such as clustering and classification. For this reason, a vast portfolio of time series distance measures has been published in the past few years. In this paper, the TSdist package is presented, a complete tool which provides a unified framework to calculate the largest variety of time series dissimilarity measures available in R at the moment, to the best of our knowledge. The package implements some popular distance measures which were not previously available in R, and moreover, it also provides wrappers for measures already included in other R packages. Additionally, the application of these distance measures to clustering and classification tasks is also supported in TSdist, directly enabling the evaluation and comparison of their performance within these two frameworks.


Introduction
In recent years, the increase in data collecting technologies has triggered the creation of time series databases, where each instance consists of an entire time series.The main features of this type of data are its high dimensionality, dynamism, auto-correlation and noisy nature, all which complicate the study and pattern extraction to a large extent.However, in the past few years, tasks such as regression, classification, clustering or segmentation have been extended and modified successfully for time series databases (Fu, 2011;Bagnall et al., 2016).In many cases, these tasks require the definition of a distance measure, which will indicate the level of similarity between time series.Because of this, understanding suitable measures for this specific type of data has become a crucial area of study.R is a popular programming language and a free software environment for statistical computing, data analysis and graphics (R Core Team, 2014), which can be extended by means of packages, contributed by the users themselves.A few of these R packages, such as dtw (Giorgino, 2009), pdc (Brandmaier, 2015), proxy (Meyer and Buchta, 2015), longitudinalData (Genolini, 2014) and TSclust (Montero and Vilar, 2014) provide implementations of some time series distance measures.However, many of the most popular distances reviewed by Esling and Agon (2012); Wang et al. (2012) and Bagnall et al. (2016) are not available in these R packages.
In this paper, the TSdist package (Mori et al., 2015) for the R statistical software is presented.In addition to providing wrapper functions to all the distance measures implemented in the previously mentioned packages, TSdist implements another 9 distance measures designed for univariate numerical time series.These distance measures have been selected based on their prevalence, and because they are mentioned in recent reviews on the topic (Liao, 2005;Esling and Agon, 2012;Wang et al., 2012).In this manner, and to the best of our knowledge, this package provides the most up-to-date coverage of the published time series distance measures in R.

Design and implementation of the package
As can be seen in Figure 1, the core of the TSdist package consists of three types of functions.To begin with, in the lowest level, the functions of the type MethodDistance conform the basis of the package, and can be used to calculate distances between pairs of numerical and univariate vectors.Of course, Method must be substituted by the name of a specific distance measure.Most of them are implemented exclusively in R language but, the internal routines of a few of them are implemented in C language, for reasons of computational efficiency.
In the next level, the wrapper function called TSDistances enables the calculation of distance measures between univariate time series objects of type ts, zoo and xts, the latter two defined in their respective packages: zoo (Zeileis and Grothendieck, 2005) and xts (Ryan and Ulrich, 2013).All these objects are specific for temporal data and the corresponding packages provide a complete set of methods to work with them.However, there are slight differences between them.Objects of type ts are the most basic and are exclusively addressed for regularly sampled time series.The zoo objects incorporate the possibility of dealing with irregularly sampled time series.Finally, the xts package further extends the zoo package to provide a uniform handling of all the time series data types in R. To calculate the distance measure between two objects of one of these types, the TSDistances function just takes care of the conversion of data types and then makes use of the desired MethodDistance The R Journal Vol.8/2, December 2016 ISSN 2073-4859 function.Note that, in addition to ts, xts and zoo objects, we can also introduce basic numeric vectors into the TSdistances function.In this sense, it generalizes and unifies the calculation of all the distance measures in one function.
Finally, on some occasions, it is necessary to calculate the distance between each pair of series in a given database of series (X = {X 1 , X 2 , ..., X N }).This will result in a distance matrix such as the following: The TSDatabaseDistances function is specifically designed to build distance matrices from time series databases saved in matrices, mts objects, zoo objects, xts objects or lists.Upon loading the TSdist package, the TSDistances function is automatically included in the pr_DB database, which is a list of similarity measures defined in the proxy package.This directly enables the use of the dist function, the baseline R function to calculate distance matrices, with the dissimilarity measures defined in the TSdist package.This is the general strategy followed by the TSDatabaseDistances function and, only for a few special measures, the distance matrix is calculated in other ad-hoc manners for efficiency purposes.
As an additional capability of the TSDatabaseDistances function, the distance matrices can not only be calculated for a single database, but also for two separate databases.In this second case, all the pairwise distances between the series in the first database and the second database are calculated: This last feature is especially useful for classification tasks where train/test validation frameworks are frequently used.

Summary of distance measures included in TSdist
In Table 1, a summary of the distance measures included in TSdist is presented.Since the package includes wrapper functions to distance measures hosted in other packages, the original package is also cited in the table.
Based on the literature, we have divided the distance measures into four groups.Shape-based distances compare the overall shape of the time series by measuring the closeness of the raw-values of the time series (Esling and Agon, 2012).Within this category, we separate the (i) lock-step measures, which compare the i-th point of one time series to the i-th point of another, and the (ii) elastic measures, which are more flexible and allow one-to-many points and one-to-none point matchings (Wang et al., 2012).Feature-based distances are based on comparing certain features extracted from the series, such as Fourier or wavelet coefficients, autocorrelation values, etc. Next, structure-based distances include (i) model-based approaches, where a model is fit to each series and the comparison is made between models, and (ii) complexity-based models, where the similarity between two series is measured based on the quantity of shared information.Finally, prediction-based distances analyze the similarity of the forecasts obtained for different time series.As can be seen in Table 1, the distance measures implemented specifically in TSdist complement the set of measures already included in other packages, contributing to a more thorough coverage of the existing time series distance measures.As the most notable example, edit based distances for numeric time series (EDR, ERP and LCSS) have been introduced, which were completely overlooked in previous R packages.
For more extensive explanations on each of the distance measures, the readers can access the documentation of the TSdist package, where more details or suitable references are provided.

User interface by example
The TSdist package is available from the CRAN repository, where the source files for Unix platforms and the binaries for Windows and some OS-X distributions can be downloaded.For more information on software pre-requisites and detailed instructions on the installation process of TSdist, please see the README file included in the inst/doc directory of the package.
Note that, in the following sections, we will use several time series and time series databases included in TSdist.These databases are all synthetic, and have been chosen and designed specifically because of their simplicity and because they allow us to provide straightforward examples which clearly illustrate the usage of the different functions included in the package, and can be easily analyzed, replicated and visualized by the reader.However, once the practitioner becomes familiar with the examples provided in the following sections, it is straightforward to download any real dataset, such as those included in the UCR archive (Keogh et al.), and work on it.

Examples of distance calculations between numeric vectors
The example.series1 and example.series2objects (see Figure 2) included in the TSdist package are two numeric vectors that represent two different synthetic series which were generated based on the shapes that define the Two Patterns synthetic database of series (Geurts, 2002).
Additionaly, example.series3and example.series4(see Figure 3   As mentioned previously, the basic calculation of the distance between two series, such as example.series1and example.series2, is done by using the MethodDistance functions and replacing Method with the reference name of the distance measure of choice (for a complete list of reference names, the user can access the help pages of TSdist): > CCorDistance(example.series1, example.series2) [1] 1.192903 > CorDistance(example.series1, example.series2) [1] 1.399347 Many of the distance measures require the definition of a parameter, which must be included in the call to the corresponding function: > EDRDistance(example.series1, example.series2,epsilon=0.1) [1] 80 > ERPDistance(example.series1, example.series2,g=0) [1] 98.29833 Additionally, each distance measure has some characteristics which can impose some constraints on the input time series.For example, some distance measures such as the Euclidean distance can not deal with time series of different lengths.As such, if the conditions are not fulfilled, the distance can not be computed and the function will return NA together with the corresponding error message: > EuclideanDistance(example.series3, example.series4)Error : Both series must have the same length.Finally, note that all these distance calculations can be carried out by using the TSdistances wrapper function as follows: > TSDistances(example.series1, example.series2,distance="ccor") [1] 1.192903 > TSDistances(example.series1, example.series2,distance="cor") [1] 1.399347 > TSDistances(example.series1, example.series2,distance="edr", epsilon=0.1) [1] 80 > TSDistances(example.series1, example.series2,distance="erp", g=0) [1] 98.29833 As can be seen, the distance of choice must be specified within the distance argument, followed by the necessary parameters.
We must emphasize that each distance measure is scaled differently and so, distance values obtained from different distance measures are not directly comparable, even when comparing the two same time series.As such, completely different values can be obtained from different distance measures, as can be seen in the previous example.

Examples of distance calculations between time series objects
The zoo.series1 and zoo.series2 time series included in the package are replicas of the example.series1 and example.series2objects introduced previously but saved as zoo objects with a specific time index.A basic distance calculation between two series like these is done using the TSDistances function exactly as shown in the previous section: > TSDistances(zoo.series1,zoo.series2, distance="cor") [1] 1.399347 > TSDistances(zoo.series1,zoo.series2, distance="dtw", sigma=10) [1] 123.8757 The distance calculation between ts or xts objects is done in the same manner.

Examples of distance matrix calculations
The example.database object included in the package is a matrix that represents a database with 6 ARMA(3,2) series of coefficients AR=(1, -0.24, 0.1) and MA=(1, 1.2), but generated with different random seeds.Each time series corresponds to a row of the matrix.Additionally, the zoo.database object included in the package is a multivariate zoo object that saves the series of example.databasewith a specific time index.
The dist function calculates the pairwise distance between all the rows in a matrix so, the calculation of the distance matrix can be done easily for the example.databaseobject in the following manner: > dist(example.database,method="TSDistances", distance="tquest", + tau=mean(example.database),diag=TRUE, upper=TRUE) series1 series2 series3 series4 series5 series6 series1 0.00000000 0.10310669 0.06460465 0.05345349 0.08355246 0.04768702 series2 0.10310669 0.00000000 0.05260503 0.07685220 0.12273356 0.03049604 series3 0.06460465 0.05260503 0.00000000 0.02003566 0.09874005 0.01984044 series4 0.05345349 0.07685220 0.02003566 0.00000000 0.04998743 0.02302477 series5 0.08355246 0.12273356 0.09874005 0.04998743 0.00000000 0.06191323 series6 0.04768702 0.03049604 0.01984044 0.02302477 0.06191323 0.00000000 The R Journal Vol.8/2, December 2016 ISSN 2073-4859 When using the dist function with the distances included in TSdist, the method argument will always be left as "TSDistances", and the selected distance measure must be introduced in the distance argument, followed by its parameters.The diag and upper options are used to specify if the diagonal and upper triangle of the matrix should be shown.In any case, this calculation can also be done more directly by using the TSDatabaseDistances function: > TSDatabaseDistances(example.database, distance="tquest", + tau=mean(example.database)) When the database is not saved as a matrix, such as with zoo.database, the distance matrix calculation can not be done by using the dist function directly.In this case, the calculation must necessarily be carried out by using TSDatabaseDistances: > TSDatabaseDistances(zoo.database, distance="tquest", + tau=mean(zoo.database))series1 series2 series3 series4 series5 series2 0.10310669 series3 0.06460465 0.05260503 series4 0.05345349 0.07685220 0.02003566 series5 0.08355246 0.12273356 0.09874005 0.04998743 series6 0.04768702 0.03049604 0.01984044 0.02302477 0.06191323 Note that, by default, the TSDatabaseDistances function does not show the diagonal and upper triangle of the computed distance matrix.If we want the whole matrix to appear, we must include the options diag=TRUE and upper=TRUE as with the dist function.

Time series classification and clustering with the TSdist package
The most common usage of time series distance measures is within clustering and classification tasks,1 and all the measures included in this package can be useful within these two frameworks.As a support for these two tasks, the TSdist package includes two well-known functions.
The first function (OneNN) implements the 1NN classifier.This classifier is commonly used to evaluate the performance of different distance measures, due to the influence the distance measure has on its performance together with its reduced number of parameters (Wang et al., 2012).Given a pair of train/test time series datasets and the class values of the series in the training set, the oneNN function outputs the predicted class values for the test series.Additionally, if the ground truth class values of the series in the testing set are provided by the user, the error obtained in the classification process is also calculated.
In this line, previous experiments show that there is no "best" distance measure which is suitable for all databases and all tasks, (Wang et al., 2012).In this context, a specific distance measure must be selected, in each case, in order to obtain satisfactory results (Mori et al., 2016).The large number of distance measures included in TSdist and the simple design of this package allows the user to try different distance measures directly, simplifying the distance measure selection process considerably.

Summary and conclusions
The TSdist package enables the calculation of distances between time series and time series databases, by using a large variety of measures available in the literature.By including wrapper functions for time series distances already available in R, and implementing other unavailable popular measures reviewed in the literature, this package provides the largest selection of time series distance measures available at R at the moment, to the best of our knowledge.Additionally, it also simplifies the evaluation of these measures and their application in classification and clustering contexts by providing several ad-hoc functions.
For more detailed information on the databases and functions included in the TSdist package, and a more complete set of examples, the reader can consult the help pages or the manual of the TSdist package and the vignette included within.

Figure 1 :
Figure 1: Structure and organization of the TSdist package.

Figure 2 :
Figure 2: The two example series of the same length included in the TSdist package.

Figure 3 :
Figure 3: The two example series of different length included in the TSdist package.

Table 1 :
Summary of distance measures for time series implemented in R.