spNetwork: A Package for Network Kernel Density Estimation

This paper introduces the new package spNetwork that provides functions to perform Network Kernel Density Estimate analysis (NKDE). This method is an extension of the classical Kernel Density Estimate (KDE), a non parametric approach to estimate the intensity of a spatial process. More specifically, it adapts the KDE for cases when the study area is a network, constraining the location of events (such as accidents on roads, leaks in pipes, fish in rivers, etc.). We present and discuss in this paper the three main versions of NKDE: simple, discontinuous, and continuous that are implemented in spNetwork. We illustrate how to apply the three methods and map their results using a sample from a real dataset representing bike accidents in a central neighborhood of Montreal. We also describe the optimization techniques used to reduce calculation time and investigate their impacts when applying the three NKDE to a city-wide dataset.


Introduction
Data representing events in space are collected in many research fields. The analysis of these points clouds is crucial in identifying hot spots, e.g., complaints about environmental noise, cases of diseases, observation of animals or plant species, crimes, etc. The Kernel Density Estimate (KDE) is a nonparametric method often used to estimate the intensity function of a spatial process from a sample of events. More specifically, the KDE is part of the point pattern analysis (PPA) family of methods. It allows for the analysis of first-order properties (variation of density) of a spatial process (Baddeley et al., 2015). Recent works (Xie and Yan, 2008;Okabe et al., 2009;McSwiggan et al., 2017) have shown that the KDE method is unadapted when events are constrained on a network (road crashes, leaks in a network of pipes, crimes reported in streets, etc.). The classical KDE is based on the hypothesis of an infinite, homogenous, two-dimensional space, which is a very rough approximation if the study area is a network. Indeed, a network is a particular space between a simple line (1D) and a plan (2D). In a network, the movement is constrained on multiple one-dimensional lines, event if it is possible to change direction at intersections. Steenberghen et al. (2010) call it a 1.5D space. Moreover, the reticular distance between objects on a network is always superior or equal to the Euclidean distance. Thus, using the latter leads to underestimations of the real distance between the objects on a network. The need to extend the KDE method is part of a more general trend started in the 1990's aiming to generalize the spatial analysis to network spaces (Okabe and Satoh, 2006;Xie and Yan, 2008). At first, several adaptations were proposed but with a limited relevance because of strong limitations (applied to very simple networks or a simple spatial aggregation of a classical KDE) (Flahaut et al., 2003;Borruso, 2003;Porta et al., 2009). More recent works have developed methods overcoming these first limitations, forming the set of the NKDE methods.
Currently, these methods are not easily accessible. Xie and Yan (2008) developed an ArcMap (ESRI, 2017) plugin proposing only a biased density estimator; Hwang and Winslow (2012) developed GeodaNet (GeodaCenter, university of Chicago), but it is no longer accessible on their website;  proposed SANET, a toolbox for network analysis as an ArcMap plugin or a standalone application. Even though SANET is a free software, it is not open-source, and the actual license limits the use to research only. Moreover, a new user must send a request to the maintainer to obtain an activation key. Finally, the R package spatstat (Baddeley and Turner, 2005), dedicated to PPA (Baddeley et al., 2004), includes a function to perform NKDE analysis (density.lpp), but it returns only rasters and requires an extremely long calculation time for large and medium-sized datasets. Thus, accessible tools to perform NKDE analysis are missing. This encourages researchers and professionals to use classical KDE in cases where NKDE is much more adapted and limits the reproducibility of research.
We propose the spNetwork package to fill this gap. It implements three estimators of the intensity of a spatial process on a network, the possibility to use adaptive bandwidths, and several optimization methods to ensure reasonable calculation time. R provides a perfect environment to implement the NKDE method, considering its growing community, its capacity to read, write and manipulate geographical data (rgdal, sp, rgeos, maptools, Pebesma and Bivand, 2020;Bivand and Rundel, 2020;Bivand and Lewin-Koh, 2020)), the existence of libraries to manipulate networks (igraph (Csardi and Nepusz, 2020)), and the facilitated interface to C++ with Rcpp (Eddelbuettel et al., 2020a).
In the first section of this paper, we present the classical KDE and introduce the main concepts of the method. In the second section, we introduce the three NKDE estimators and briefly compare their respective advantages and limits. The three NKDE are then applied to a small dataset about road accidents involving a cyclist in a central neighborhood of Montreal. Finally, we investigate the calculation time of the three NKDE with different settings considering that calculation time is an important issue for the NKDE method.

General description
For a spatial process p, represented by a set of events e, its intensity function λ at location u (λ(u)) can be estimated in a non parametric way by kernel estimation (Silverman, 1986). Typically, in a two-dimensional space, a regular grid is defined on the study area, and the intensity is estimated at the centers of each quadra (pixels). At each location, u, the events within a specified bandwidth, bw, contributes to the local estimated intensity. The strength of this contribution depends on the distance between the events and u, the event's weight, and the kernel function. This function distributes the mass of the events within a circular area around each event. The radius of this area is called the bandwidth (or the standard deviation of the kernel). A larger bandwidth produces smoother results and higher bias but reduces variance. The Kernel Density Estimate can thus be obtained as follows: with n, the number of events that satisfy dist(u, e i ) < bw, w i the weight of the event e i and K the kernel function. K must be a probability density function and verifies the two following conditions: (2) Many kernel functions exist; Figure 1 shows the most commonly used (and implemented in spNetwork). library(ggplot2) library(tidyr) library(spNetwork) library(RColorBrewer) library(kableExtra) library(dplyr) x <-seq (-15.01, 15.01, by = 0.01) kernels_func <-list (gaussian_kernel, epanechnikov_kernel, quartic_kernel, triangle_kernel, tricube_kernel, triweight_kernel, cosine_kernel, uniform_kernel) cols <-lapply(kernels_func, function(f){f(x, 15)}) df <-data.frame(do.call(cbind, cols)) names(df) <-c("Gaussian", "Epanechnikov", "Quartic", "Triangle", "Tricube", "Triweight", "Cosine", "Uniform") df$x <-x pivot_cols <-names(df)[names(df)!="x"] df2 <-pivot_longer(df,cols = pivot_cols) names(df2) <-c("x","kernel","y") ggplot(df2) + geom_line(aes(x = x, y = y, color = kernel), size = 1)+ xlim(c(-15.01, 15.01))+ scale_color_brewer(palette = "Accent")+ scale_y_continuous(name = "density") The Many authors have stressed that the choice of the Kernel function has a smaller impact on final results in comparison with the choice of the bandwidth (O'Sullivan and Wong, 2007;Xie and Yan, 2008;Turlach, 1993). The Gaussian kernel is a special case (non-compact kernel) because it does not integrate to one on the domain [−bw; +bw] but on [−∞; +∞], leading to a loss of mass for each event.

Diggle's correction
Formula 1 describes the basic KDE value, which is unbiased only if the study area is infinite in every direction and if the spatial process is sampled everywhere in this space. In practice, such situations are rare, and the basic KDE is biased at the frontier of the study area. For example, let us imagine a case where abandoned syringes locations are systematically reported in a specific district of a city. The ones lying outside the district limits are not considered, leading to a situation where the border areas systematically have a lower estimated intensity. To reduce this effect, Diggle's correction (equation 3) is generally used (Diggle, 1985).
This correction increases the weight of the events located close to the study area border because a part of their mass is lost above the border. For an event e i , with the location u, e(u) is the fraction of its mass in the study area (W). The correction factor is thus inversely proportional to the fraction of the event's mass in the study area. For example, an event having only half its mass in the study area will end up with a doubled weight.

Adaptive bandwidth
In its basic form, the bandwidth of the KDE is fixed, i.e. the value of the bandwidth is the same everywhere in the study area. This is an important limitation if the intensity of the spatial process has a pronounced spatial variation. In that case, the risk is to obtain oversmoothed results in areas with high intensity and undersmoothed results in areas with low intensity. To overcome this limitation, it is possible to replace the fixed bandwidth with a vector of spatially varying bandwidths. The bandwidths should be smaller in subregions with higher intensity and wider elsewhere. The Abramson method (Abramson, 1982) is often used to define this vector of bandwidths. Its calculation requires two steps. First, a pilot estimate (λ) is calculated at the location of each event (u ei ) by using a global bandwidth h 0 . Second, the vector of bandwidths (h(u ei )) is calculated with equation 4: A trimming value can be defined to avoid obtaining large bandwidths in subregions with low intensity. Let us recall here that the adaptive bandwidth is still based on the choice of an arbitrary global bandwidth and is not a bandwidth selection method.

The three Network Kernel Density Estimate
We present the three NKDE methods implemented in the spNetwork package in this section .

The simple NKDE
A first NKDE method was proposed by Xie and Yan (2008) and constituted a geographical attempt to extend the planar KDE to the network case. It received some attention in geography (Xie and Yan, 2013) but has been criticized for its statistical incorrectness (detailed later). We refer to this method as the "simple NKDE", and it can be summarized by three points: • The intensity of the spatial process is only estimated on the network. Its edges are split into lixels (one-dimensional pixels), and the centers of the lixels are used as locations for estimating intensity. • The distances between events and sampling points are calculated as the shortest path distances on the network instead of Euclidean distances. • The intensity function is slightly modified (equation 5).
The method is appealing for three reasons. First, from a purely geographical point of view, the extension of the planar KDE to network KDE is intuitive. The modification applied is in line with other methods in PPA extended from planar to network cases like the K-function or the nearest neighbor analysis (Okabe and Sugihara, 2012). Second, the method does not rely on an expensive algorithm and thus achieves a short calculation time. Third, the adapted formula makes the interpretation straightforward: it "estimates the density over a linear unit" rather than an area unit (Xie and Yan, 2008, pp. 398). Note that the modification of the equation can also be applied to the next methods.
To get a visual representation of the method, we consider the situation depicted in Figure 2. The lines constitute the network, and a red dot is a single event. We can evaluate the density of the spatial process with the simple NKDE and visualize the result in 3D (Figure 3), using density as the height.
At intersections on the network, the mass of the event is multiplied in each direction. Consequently, the simple NKDE is not a real kernel density function because it does not integrate to 1 on its domain. The practical consequence is the systematic overestimation of the density, which could be problematic in subregions with many events. To overcome this limitation, Okabe et al. (2009) proposed two unbiased estimators: the discontinuous NKDE and the continuous NKDE.

The discontinuous NKDE
Considering the problem inherent to the simple NKDE, the discontinuous NKDE proposes a simple solution. The mass is divided at intersections according to the number of directions minus one. This is easily represented by   This estimator is unbiased. However, its discontinuous nature can be counter-intuitive in real application. For example, considering the analysis of crimes on a network, it would be counterintuitive that the "influence" of a crime suddenly drops from one street to another. The discontinuous NKDE can be summarized with Equation 6.
with k the kernel function, n ij the number of edges connected at the intersection j on the path originating at the event, e i , and ending at the location, u.

The continuous NKDE
Finally, the continuous NKDE attempts to combine the best of the two worlds: it adjusts the values of the NKDE at intersections to ensure that it integrates to 1 on its domain and applies a backward correction to force the density values to be continuous. A simple case is represented in Figure 6.  Because of its backward correction, the continuous NKDE is recursive in nature and can hardly be presented as an equation. Okabe and Sugihara (2012) describe this method with a recursive function, implemented in spNetwork. Indeed, for each node encountered, a correction factor must be applied to all the previous edges traveled within the remaining bandwidth, and this correction must also be split between all the nodes encountered in that direction. In comparison, for the two previous methods, the algorithm only has to flow in one direction along the edges from the event. As a consequence, much more iterations are required for the continuous NKDE. More specifically, Okabe et al. (2009) state that the complexity of the discontinuous (and by extension, of the simple) NKDE algorithm is O(nn L ) with n L the number of edges and n the number of events. For the continuous NKDE, the complexity of the algorithm is a function of the ratio of the bandwidth and the length of the edges. For this algorithm, Okabe et al. (2009) observed that the computation time has the following lower limit (equation 7): with d i the distance from an event to the ith-nearest node, and i max the last node verifying d i < bw.
In other words, having many short edges in a network or using a larger bandwidth will increase the calculation time exponentially for the continuous NKDE.
The pros and cons of the three methods are summarized in Table 1 The two KDE extensions presented in section 2 (adaptive bandwidth and edge correction) can be directly transposed to these three NKDE methods.

Implementation in spNetwork
The spNetwork package contains two "high-level functions" and three "helper functions". The "high level functions" nkde and nkde.mc can be used to calculate the three NKDE presented in the previous section. Both functions use exactly the same parameters, nkde.mc is a version of nkde allowing for multiprocessing via the package future (Bengtsson, 2020a). The three main inputs are: • lines, a "SpatialLinesDataFrame" (object defined in the sp package), representing the lines of the network. • events, a "SpatialPointsDataframe" (object defined in the sp package), representing the realizations of the spatial process. • samples, a "SpatialPointsDataframe" providing the locations where the density must be estimated.
These three inputs must only have simple and valid geometries and the same coordinates reference system. Otherwise, an error is raised by the function before starting any calculation. The parameters method, kernel_name and bw, indicate respectively which NKDE (simple, discontinuous, or continuous), and which kernel function (gaussian, quartic, triweight, etc.) to use, and the bandwidth of the kernel. adaptive and diggle_correction allow the user to specify if an adaptive bandwidth and Diggle's correction must be calculated. Finally, the remaining parameters control the geometric precision of the network and the optimization aspects.
The three "helper functions" are provided to facilitate the creation of the sampling points on the network: • The function lixelize_lines splits the lines of a "SpatialLinesDataFrame" according to a selected distance to generate lixels. • The function lines_center returns for each line of a "SpatialLinesDataFrame" its center points (the point located at mid-distance between the start and the end of the linestring). • The function lines_points_along returns points located along the lines of a "SpatialLines-DataFrame", evenly spaced according to a selected distance.
Considering the previous functions, the workflow of NKDE analysis with spNetwork follows five steps: 1. Loading the network dataset and the event dataset as sp objects, typically with the function readOGR from the package rgdal. Many standard formats can thus be used such as ESRI Shapefile, Geopackage, GeoJson, SpatialLite, etc. 2. Generating the sampling points on the network by using the helper functions. 3. Calculating the density estimations with the nkde or nkde.mc functions. 4. Adding the densities as a new column to the sampling "SpatialPointsDataFrame" or lixels "SpatialLinesDataFrame". 5. Exporting the results, typically with writeOGR from rgdal to map the results with a dedicated mapping software (like Qgis (QGIS Development Team, 2020)).

Optimization
The main problem with NKDE methods is the calculation time. Building the network and calculating the paths between its nodes can be a costly process (depending on the size of the network studied and the number of events). The implementation in SpNetwork uses many techniques to reduce the calculation time.
First, only the events are added as nodes in the network. The sampling points are snapped to their nearest edge. Each edge in the network is a straight line. Thus, the final distance between a sampling point and a node of the network is calculated with the Euclidean distance. Consequently, the complexity of the network is not affected by the density of the sampling points.
Second, the calculus of the density is event-oriented. In other words, the kernel density is sequentially calculated around each event. The density values of the samples are updated at each iteration. The advantage of this approach is that the calculation time is less affected by the density of the sampling points.
Third, the events can be aggregated, and their weights added within a threshold distance. In many applications, a small distance between events is not significant considering the accuracy of the location method (geocoding, smartphone GPS, location at intersections, etc.). Aggregating events can simplify networks and limit the number of iterations when calculating the NKDE. Fourth, following the idiom "divide and conquer", the user can specify the shape of a grid on the study area to split the calculation. In that case, for each quadra of the grid, the sampling points in the quadra are selected. The events and the lines of the network are selected if they intersect a buffer around the quadra (the size of the buffer being the bandwidth of the kernel). Thus, the separate calculations do not produce edge effects. All the spatial queries are optimized using the quadtree spatial index proposed by the package SearchTrees (Becker, 2012). This approach limits the risk of memory issues and increases calculation speed for big datasets.
Fifth, when the dataset is split, the calculation can be separated between several processes. Considering the fact that a lot of computers are now equipped with processors having more than four cores, dividing the work between them is an easy way to reduce the overall calculation time. The packages future and future.apply (Bengtsson, 2020b) are used in spNetwork to support multiprocessing. It ensures good compatibility between OS and even has features to dispatch calculations on multiple computers.
Finally, the main algorithms are implemented with Rcpp and RcppArmadillo (Eddelbuettel et al., 2020b) to overcome the R (scripted language) limitations when loops and recursions are involved (Ligges and Fox, 2008) and benefit from faster C++ compiled code.

Example with bike accidents in Montreal
In this section, we illustrate the use of the package with a built-in dataset. This dataset is an extract from the Montreal Open Data website representing bike accidents that occurred on the Montreal road network in 2017 (Figure 8). To identify hot spots of bike accidents, we calculated the three NKDE (simple, discontinuous, and continuous) with a quartic kernel (Figure 9). We selected a 200 meters bandwidth considering that the mean length of a road segment in this network is 108 meters. This bandwidth ensures that, most often, only events located closer than two street segments from a sampling point contribute to its density.

Adaptive bandwidth
An adaptive bandwidth based on the Abramson's inverse-square-root rule (Abramson, 1982) can be used to produce a new version of the three NKDE. We will only calculate here the results for the continuous NKDE. Figure 11 shows the new results and the size of some calculated bandwidths with circles (not all the bandwidths are shown; otherwise the map would not be readable). As we can see, the adaptive kernel produces more smoothed results in low-intensity subregions and more detailed results in high-density regions. We can see now that the primary hot spot (1) seems concentrated on a smaller area.

Calculation time
In this final section, we investigate the calculation time of the three NKDE. For the first test, we used a complete version of the dataset presented above (but not included in the package). The network used covers the full Island of Montreal for a total of 47488 features (5877 km). The 1722 events are road accidents involving a cyclist between 2017 and 2018. We ran the calculation of the three NKDE by using 1 to 8 cores and by splitting the study area with a grid ranging from 10 by 10 to 18 by 18. The calculations were executed on a computer equipped with an Intel (R) Xeon (R) Bronze 3104 CPU (12 cores 1.70 GHz) and 32 GO of RAM, with the Windows 10 operating system and the R version 4.0.2.  As one can see, the durations are very similar for the simple and the discontinuous methods. Because of the backward adjustment of the density, the continuous method is much more timeconsuming. Using multiprocessing is an efficient way to reduce calculation time. In our setting, the biggest gain is achieved with three cores. Beyond three, the gains are more negligible. Dividing the study area with a finer grid contributes to reducing calculation time to a smaller extent.
For the second test, we selected a single event in the middle of the complete study area. We then constructed several study areas by selecting all events and all lines within subsequent radiuses from 500 m to 7000 m. For each iteration, only one core was used and a grid with a 5 x 5 shape. The results are presented in Table 3. We can observe that the calculation time increases slowly for the simple and the discontinuous NKDE, but the increase for the continuous NKDE is sharper. These results show that the proposed package can calculate the three NKDE with reasonable calculation time, even for datasets covering a full city.

Concluding remarks
The Network Kernel Density Estimate (NKDE) is an extension of the Kernel Density Estimate (KDE). Numerous spatial phenomena are constrained on a network like road accidents, crimes reported on streets, leaks in pipes, breaks in a wiring network, bird nests along a river, etc. Analyzing these datasets with a classical KDE is unsatisfactory because densities are evaluated at places where the events cannot occur (outside the network). The Euclidean distance underestimates the distance between objects on the network, and a network is not a homogeneous space. Tang et al. (2013) argue that in practical applications, the differences between the classical KDE and NKDE are small, the latter providing only more accurate and detailed results. However, our brief comparison suggests that the simple NKDE (the closest to the classical KDE) systematically overestimates densities compared to the continuous and discontinuous NKDE. This is in line with results obtained when comparing the planar and network K-functions (Yamada and Thill, 2004). The NKDE method remains quite inaccessible because of the lack of open-source implementation. The spNetwork package introduced here proposes to fill this gap by implementing it with R, taking advantage of its rich geospatial ecosystem (Spatial). Three different NKDE are currently available in the package: the simple, the discontinuous, and the continuous NKDE. The first one is an early attempt to generalize the classical KDE to network spaces. It produces biased results and is not a true kernel, but it is intuitive from a purely geographical point of view. The discontinuous and continuous NKDE are true kernels and provide unbiased density estimates. We discussed in this paper their respective limitations and advantages. Two complementary features are proposed by spNetwork: Diggle's edge correction and adaptive bandwidths based on Abramson's rule. The spNetwork package uses several techniques to reduce computation time and memory use for the NKDE: dividing the study area into smaller chunks, multiprocessing, compiled C++ functions, and spatial indexes. However, for large datasets with large bandwidths the overall process remains long. Consequently, no methods are currently proposed for automatic bandwidth selection. In the future, we plan to add in the package other methods for PPA on networks like the K-function, nearest neighbor analysis and reticular distance weight matrices.