A simple temporal point process (SPP) is an important class of time series, where the sample realization of the process is solely composed of the times at which events occur. Particular examples of point process data are neuronal spike patterns or spike trains, and a large number of distance and similarity metrics for those data have been proposed. A marked point process (MPP) is an extension of a simple temporal point process, in which a certain vector valued mark is associated with each of the temporal points in the SPP. Analyses of MPPs are of practical importance because instances of MPPs include recordings of natural disasters such as earthquakes and tornadoes. In this paper, we introduce the R package mmpp, which implements a number of distance and similarity metrics for SPPs, and also extends those metrics for dealing with MPPs.
A random point process is a mathematical model for describing a series
of discrete events (Snyder and M. I. Miller 1991). Let
Reflecting the importance of the analysis of point processes in a broad range of science and engineering problems, there are already some R packages for modeling and simulating point processes such as splancs (Rowlingson and P. Diggle 1993), spatstat (Baddeley and R. Turner 2005), PtProcess (Harte 2010), and stpp (Gabriel, B. S. Rowlingson, and P. J. Diggle 2013). These packages support various approaches for the analysis of both simple and marked spatial or spatio-temporal point processes, namely, estimating an intensity function for sample points, visualizing the observed sample process, and running simulations based on the specified models.
To complement the above mentioned packages, in
mmpp (Hino, K. Takano, Y. Yoshikawa, and N. Murata 2015), we focus
on the similarity or distance metrics between realizations of point
processes. Similarity and distance metrics are fundamental notions for
multivariate analysis, machine learning and pattern recognition. For
example, with an appropriate distance metric, a simple
As for the distance and similarity metric for point processes, vast amount of methods are developed in the field of neuroscience (Kandel, J. H. Schwartz, and T. M. Jessell 2000). In this field, neural activities are recorded as sequences of spikes (called spike trains), which is nothing but a realization of a simple point process (SPP). By its nature, the responses of neurons to the same stimulus can be different. To claim the repeatability and the reliability of experimental results, a number of different distance and similarity metrics between sequence of spikes are developed (Quian Quiroga, T. Kreuz, and P. Grassberger 1904; Rossum 2001; Schreiber, J. Fellous, D. Whitmer, P. Tiesinga, and T. Sejnowski 2003; Victor 2005; Kreuz, J. S. Haas, A. Morelli, H. D. Abarbanel, and A. Politi 2007).
The package mmpp categorizes commonly used metrics for spike trains and offers implementations for them. Since a spike train is a realization of a simple point process, the original metrics developed in the field of neuroscience do not consider marked point process (MPP) realizations. mmpp extends conventional metrics for SPPs to MPPs. We have two main aims in the development of mmpp:
to have a systematic and unified platform for calculating the similarities and distances between SPPs, and
to support MPPs to offer a platform for performing metric-based analysis of earthquakes, tornados, epidemics, or stock exchange data.
Since realizations of temporal SPPs are ordered sets of the events, the
commonly used Euclidean distance and inner product cannot be directly
defined between them. Most of the metrics for SPP realizations first
transform the realizations
We note that there are some attempts to directly define distances between SPP realizations. One of the most principled and widely used methods is based on the edit distance (Victor 2005), and this method is extended to deal with MPP realizations by Suzuki, Y. Hirata, and K. Aihara (2010). However, this approach is computationally expensive and prohibitive for computing the distance between spike trains with even a few dozen spikes. We exclude this class of metric from the current version of mmpp.
In the following, we often use kernel smoothers and step functions for
transforming SPP realizations. For notational convenience, we denote a
kernel smoother with parameter
Examples of smoothers include the Gaussian kernel smoother
The most commonly used and intensively studied metrics for spike trains are based on the mapping of an event sequence to a real valued continuous function as
Figure 1 illustrates the transformation of an event sequence to a continuous function by the smoothing method.
|
|
Then, the inner product of
and similarly the distance is defined by
When we use the Laplacian smoother
and
respectively. This distance eq. (4) is adopted by Rossum (2001) for measuring the distance between spike trains. On the other hand, Schreiber, J. Fellous, D. Whitmer, P. Tiesinga, and T. Sejnowski (2003) proposed to use the correlation defined by
to measure the similarity between spike trains. This class of measures is extended to take into account the effect of burst, i.e., the short interval in which events occur in high frequency, and refractory period, i.e., the short interval in which events tend to be suppressed immediately after the previous events (Houghton 2009; Lyttle and J.-M. Fellous 2011). These two effects, namely burst and refractory period, are commonly observed in neural activities. They are also observed in earthquake catalogues. After a large main shock, usually we observe high frequent aftershocks. On the other hand, suppression of events is sometimes happening, possibly because after a big event, the coda is so large that one cannot detect smaller events under the large ongoing signal from the big event (Kagan 2004; Iwata 2008).
The filtering-based metric is computed by using the function fmetric
in mmpp. The first two arguments S1
and S2
are the (marked) point
process realizations in the form of a ‘matrix
’ object. The first
column of S1
and S2
are the event timings and the rest are the
marks. The argument measure
can be either "sim"
or "dist"
,
indicating similarity or distance, respectively. By default, the
function assumes the Laplacian smoother. When the argument h
of
function fmetric
is set to a function with scaling parameter
> fmetric(S1, S2, measure = "sim", h = function(x, tau) exp(-x^2/tau), tau = 1)
the integrals in eq. (2) and eq. (3) are
numerically done using the R function integrate
. The function h
should be square integrable and non-negative.
For analysis of point processes, the intensity function plays a central
roll. Paiva, I. Park, and J. C. Prı́ncipe (2009) proposed to use the intensity
function for defining the inner product between SPP realizations. Let
We note that we can also consider the conditional intensity function
reflecting the history up to the current time
in non-parametric manner (Reiss 1993). Using the estimates of
intensity functions for processes behind
Particularly, when we use a Gaussian smoother
The distance metric is also defined as
and it is also simplified when we use the Gaussian smoother.
This class of measures is most in alignment with the statistical model of point processes. We estimated the intensity function in a versatile non-parametric approach, but it is reasonable to use other models such as the Hawkes model (Hawkes 1971) when we should include the self-exciting nature of the process.
The intensity inner product metric is computed by using the function
iipmetric
in mmpp. In the current version, the function assumes the
Gaussian smoother, and its scaling parameter is specified by the
argument tau
as
> iipmetric(S1, S2, measure = "sim", tau = 1)
For comparing two SPP realizations, it is natural to count the number of events which can be considered to be co-occurring. There are two metrics for SPP realizations based on the notion of co-occurrence.
The first one proposed by Quian Quiroga, T. Kreuz, and P. Grassberger (1904) directly counts near-by
events. The closeness of two events are defined by adaptively computed
thresholds, making the method free from tuning parameters. Suppose we
have two SPP realizations
We note that
Using this counting function, a similarity measure between
and a distance measure is obtained by the transformation (Lyttle and J.-M. Fellous 2011):
The second metric based on the counting co-occurrence is proposed
by Hunter and J. G. Milton (2003), which transforms
Then, a similarity metric between
and the distance is naturally defined by
The co-occurrence based metrics are computed by using the function
coocmetric
. By default, the function assumes the counting similarity
in eq. (9). The smoothed counting similarity is computed
by specifying the argument type = "smooth"
as
> coocmetric(S1, S2, measure = "sim", type = "smooth", tau = 1)
Assume an SPP realization
Then each event is mapped to a function
Two SPP realizations
This function takes value
and the distance is defined by
which is originally defined in Kreuz, J. S. Haas, A. Morelli, H. D. Abarbanel, and A. Politi (2007). A simple example of
transformation
The inter event interval metrics are computed by using the function
ieimetric
as
> ieimetric(S1, S2, measure = "sim")
Sometimes events considered in point processes entail certain vector
valued marks. For example, seismic events are characterized by the
time point the earthquake happens, and a set of attributes such as
magnitude, depth, longitude, and latitude of the hypo-center. To deal
with marked point processes, we extend the base space
There might be many possible ways of extension. The packages mmpp takes the simplest way to deal with marks in a unified and computationally efficient manner, namely, the density or weight of marks are included in the metrics for SPPs by Gaussian windowing as shown in eq. (16).
In the same manner as eq. (1), the marked point process
realization
where
where
and the distance by
By virtue of Gaussian windowing, the integral with respect to the mark is explicitly written as
Furthermore, when we use Laplacian smoother for transforming temporal SPPs, we obtain
The distance metric is also calculated in the same manner.
We note that the effect of marks depend on the units used for the
various marks. It is reasonable to estimate the variance of each mark,
and set the diagonal elements of
Extending kernel estimation eq. (5) to multivariate kernel estimation as
we obtain the estimate of the intensity function of the marked point
process. We note that kernel density estimation for multidimensional
variables is inaccurate in general, and we can instead estimate the
ground intensity function
When we use the Gaussian smoother
The distance metric is also defined by
For estimating the intensity function, a simple Poisson process is assumed. This assumption is relaxed with more flexible models such as the ETAS model (Ogata 1988, 1998), where the intensity function is estimated using the R packages SAPP (Statistical Mathematics 2014) and etasFLP (Chiodi and G. Adelfio 2015), for example. The extension of the intensity-based metric to support other forms of intensity estimation such as the Hawkes and ETAS models remains our important future work.
To extend the co-occurrence metric based on counting the synchronous events, eq. (8) is replaced with a weighted counter
To extend the co-occurrence metric based on the smoothed count of the synchronous events, eq. (10) is replaced with a weighted smoothed counter
To weight the inter event interval by using marks associated with two
MPP realizations
Then we define operators
The intermediate function eq.(13) is modified to take into account the dissimilarity of marks:
The distance and similarity are then calculated using eq. (15) and eq. (14).
The usage of the functions fmetric
, iipmetric
, coocmetric
, and
ieimetric
does not change for marked point process data, except for
one additional argument M
, which is the precision matrix S1
and S2
. We can also specify
the matrix
> fmetric(S1, S2, measure = "sim", M = diag(3))
where the number of marks is assumed to be three.
This section illustrates the use of the package with a simple
experiment. We use the Miyagi20030626
dataset contained within the
package.
> library(mmpp)
> data(Miyagi20030626)
The dataset is composed of 2305 aftershocks of the 26th July 2003
earthquake of M6.2 at the northern Miyagi-Ken Japan, which is a
reparameterization of the main2006JUL26
dataset from the SAPP
package. Each record has 5 dimensions, time
, longitude
, latitude
,
depth
, and magnitude
of its hypo-center. The time
is recorded in
seconds from the main shock.
To illustrate the use of the package, we consider a simple prediction
task. We first split the original dataset by a time-window of length
> sMiyagi <- splitMPP(Miyagi20030626, h = 60*60*3, scaleMarks = TRUE)$S
Then, the maximum magnitude in each partial point process realization is computed.
> ## target of prediction is the maximum magnitude in the window
> m <- NULL
> for (i in 1:length(sMiyagi)) {
+ m <- c(m, max(sMiyagi[[i]]$magnitude))
+ }
The task we consider is the prediction of the maximum magnitude in the
next three hours using the past one week of data. We formulate this
problem as a regression problem. Let the partial point process
realization in the
> m <- m[-1]
> sMiyagi[[length(sMiyagi)]] <- NULL
> ## number of whole partial MPPs split by a 3-hour time window
> N <- length(sMiyagi)
> ## training samples are past one week data
> Ntr <- 24*7/3
> ## number of different prediction methods
> Nd <- 10
For the purpose of illustrating the use of the package and to show the
effect of different similarity metrics, we adopt the nearest neighbor
regression. That is, given the current realization
> err <- matrix(0, N - Ntr, Nd)
> colnames(err) <- c("f SPP","iip SPP","cooc (s) SPP","cooc (c) SPP","iei SPP",
+ "f MPP","iip MPP","cooc (s) MPP","cooc (c) MPP","iei MPP")
The following code performs the above explained experiment.
> for (t in 1:(N - Ntr)) {
+ qid <- Ntr + t
+ q <- sMiyagi[[qid]]
+ ## simple PP
+ ## fmetric with tau = 1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, fmetric(q$time, sMiyagi[[qid - i]]$time))
+ }
+ err[t, 1] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## iipmetric with tau = 1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, iipmetric(q$time, sMiyagi[[qid - i]]$time))
+ }
+ err[t, 2] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## coocmetric (smooth) with tau = 1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, coocmetric(q$time, sMiyagi[[qid - i]]$time,
+ type = "smooth"))
+ }
+ err[t, 3] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## coocmetric (count)
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, coocmetric(q$time, sMiyagi[[qid - i]]$time,
+ type = "count"))
+ }
+ err[t, 4] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## iei metric
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, ieimetric(q$time, sMiyagi[[qid - i]]$time))
+ }
+ err[t, 5] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## marked PP with latitude, longitude, depth, and magnitude
+ ## fmetric with tau = 1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, fmetric(q, sMiyagi[[qid - i]]))
+ }
+ err[t, 6] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## iipmetric with tau = 1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, iipmetric(q, sMiyagi[[qid - i]]))
+ }
+ err[t, 7] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## coocmetric (smooth) with tau=1
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, coocmetric(q, sMiyagi[[qid - i]], type = "smooth"))
+ }
+ err[t, 8] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## coocmetric (count)
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query, coocmetric(q, sMiyagi[[qid - i]], type = "count"))
+ }
+ err[t, 9] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ ## iei metric
+ sim2query <- NULL
+ for (i in 1:Ntr) {
+ sim2query <- c(sim2query,ieimetric(q, sMiyagi[[qid - i]]))
+ }
+ err[t, 10] <- abs(m[qid] - m[t:(Ntr + t - 1)][which.max(sim2query)])
+ }
> colMeans(err)
f SPP iip SPP cooc (s) SPP cooc (c) SPP iei SPP
0.7002634 0.6839529 0.7263602 0.6632930 0.7905148
f MPP iip MPP cooc (s) MPP cooc (c) MPP iei MPP
0.6839529 0.6317594 0.6643804 0.6622056 0.7698548
From this simple example, we can see that the prediction accuracy is improved by taking the marks into account.
mmpp is the first R package dedicated to the calculation of the similarity and distance metrics for marked point process realizations. It provides the implementation of several similarity metrics for simple point processes, originally proposed in the literature of neuroscience, and also provides extensions of these metrics to those for marked point processes.
A simple example of a real dataset presented in this paper illustrates the importance of taking the marks into account in addition to the event timings, and it also illustrates the possibilities of the package mmpp with a user guide for practitioners.
The development of the mmpp package has only just begun. Currently, we
are considering supporting burst sensitive and refractory period
sensitive versions of fmetric
, since these properties are commonly
observed in both neural activities and seismic event recordings. In the
current version of mmpp, for treating MPPs, event timings and marks
are assumed to be separable, and all the marks are simultaneously
estimated by a kernel density estimator. This is a strong assumption and
other possibilities for modeling MPPs should be considered. For example,
it is popular to group spatio-temporal events, i.e., event timings and
locations, and treat marks such as magnitude in seismic events as purely
marks. Then, the separability between marks and spatio-temporal events
can be tested by using test statistics proposed in Schoenberg (2004) and
Chang and F. Schoenberg (2011). The separability assumption offers computational
advantages, though, it would miss the intrinsic structure and
relationship between event timings and marks. In principle, the
separability hypothesis should be tested before calculating the metrics.
Frameworks for flexible modeling of marked sample sequences with
statistical validation such as nonparametric tests would be implemented
in future version of mmpp. We are also considering to extend the
intensity inner product metric to support other form of intensity
estimation such as Hawkes and ETAS models.
Different similarity metrics capture different aspects of the point process realizations. Our final goal of the development of the package mmpp is to provide a systematic way to select or combine appropriate metrics for analysing given point process realizations and certain tasks such as prediction of magnitude or clustering similar seismic events.
The authors are grateful to T. Iwata for helpful discussions and suggestions. The authors would like to express their special thanks to the editor, the associate editor and three anonymous reviewers whose comments led to valuable improvements of the manuscript. Part of this work is supported by JSPS KAKENHI Grant Number 25870811, 26120504, and 25120009.
splancs, spatstat, PtProcess, stpp, mmpp, SAPP, etasFLP
Spatial, SpatioTemporal, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hino, et al., "mmpp: A Package for Calculating Similarity and Distance Metrics for Simple and Marked Temporal Point Processes", The R Journal, 2015
BibTeX citation
@article{RJ-2015-033, author = {Hino, Hideitsu and Takano, Ken and Murata, Noboru}, title = {mmpp: A Package for Calculating Similarity and Distance Metrics for Simple and Marked Temporal Point Processes}, journal = {The R Journal}, year = {2015}, note = {https://rjournal.github.io/}, volume = {7}, issue = {2}, issn = {2073-4859}, pages = {237-248} }