Will the Real Hopkins Statistic Please Stand Up?

Hopkins statistic (Hopkins and Skellam 1954) can be used to test for spatial randomness of data and for detecting clusters in data. Although the method is nearly 70 years old, there is persistent confusion regarding the definition and calculation of the statistic. We investigate the confusion and its possible origin. Using the most general definition of Hopkins statistic, we perform a small simulation to verify its distributional properties, provide a visualization of how the statistic is calculated, and provide a fast R function to correctly calculate the statistic. Finally, we propose a protocol of five questions to guide the use of Hopkins statistic.

Kevin Wright (Corteva Agriscience)
2022-12-20

1 Introduction

Hopkins and Skellam (1954) introduced a statistic to test for spatial randomness of data. If the null hypothesis of spatial randomness is rejected, then one possible interpretation is that the data may be clustered into distinct groups. Because clustering methods will always identify clusters (even if there are no meaningful clusters in the data), Hopkins statistic can be used to determine whether clusters are present before applying clustering methods. In the description below of how to calculate Hopkins statistic, we follow the terminology of earlier authors and refer to an “event” as one of the existing data values in a matrix \(X\), and a “point” as a new, randomly chosen location. For clarity in the discussions below, we make a distinction between \(D\), the dimension of the data, and \(d\), the exponent in the formula for Hopkins statistic.

Let \(X\) be a matrix of \(n\) events (in rows) and \(D\) variables (in columns). Let \(U\) be the space defined by \(X\).

Hopkins statistic is calculated with the following algorithm:

  1. Sample at random one of the existing events from the data \(X\). Let \(w_i\) be the Euclidean distance from this event to the nearest-neighbor event in \(X\).
  2. Generate one new point uniformly distributed in \(U\). Let \(u_i\) be the Euclidean distance from this point to the nearest-neighbor event in \(X\).
  3. Repeat steps (1) and (2) \(m\) times, where \(m\) is a small fraction of \(n\), such as 10%.
  4. Calculate \(H = \sum_{i=1}^m u_i^d \big/ \sum_{i=1}^m (u_i^d + w_i^d)\), where \(d=D\).
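To make the algorithm concrete, the following minimal R sketch implements these four steps. The function name hopkins_simple() and the use of the bounding box of \(X\) as the sampling space \(U\) are our illustrative choices; the hopkins package presented later provides a faster implementation.

hopkins_simple <- function(X, m, d = ncol(X)) {
  # X: matrix of n events in D columns; m: number of events/points sampled
  X <- as.matrix(X)
  n <- nrow(X)
  # Step 1: distance from each of m sampled events to its nearest neighbor in X
  idx <- sample(n, m)
  w <- sapply(idx, function(i)
    min(sqrt(rowSums(sweep(X[-i, , drop = FALSE], 2, X[i, ])^2))))
  # Step 2: distance from each of m new uniform points (drawn here in the
  # bounding box of X) to its nearest neighbor in X
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  u <- replicate(m, {
    p <- runif(ncol(X), lo, hi)
    min(sqrt(rowSums(sweep(X, 2, p)^2)))
  })
  # Step 4: Hopkins statistic with exponent d (default d = D)
  sum(u^d) / (sum(u^d) + sum(w^d))
}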

Because of sampling variability, it is common to calculate \(H\) multiple times and take the average. Under the null hypothesis of spatial randomness, this statistic has a Beta(\(m\),\(m\)) distribution and will always lie between 0 and 1. The interpretation of \(H\) follows these guidelines:

  1. Values of \(H\) near 0.5 are consistent with spatially-random data.
  2. Values of \(H\) near 1 indicate clustered data.
  3. Values of \(H\) near 0 indicate regularly-spaced (systematic) data.

2 A short history of Hopkins statistic

There exists considerable confusion about the definition of Hopkins statistic in scientific publications. In particular, when calculating Hopkins statistic, three different values of the exponent \(d\) (in step 4 above) have been used in the statistical literature: \(d=1\), \(d=2\), and the generalized \(d=D\). Here is a brief timeline of how this exponent has been presented.

3 Simulation study for the distribution of Hopkins statistic

Having identified the confusion in the statistical literature, we now ask the question, “Does it matter what value of \(d\) is used in the exponent?” In a word, “yes”.

According to Cross and Jain (1982), under the null hypothesis of no structure in the data, the distribution of Hopkins statistic is Beta(\(m\),\(m\)), where \(m\) is the number of rows sampled in \(X\). This distribution can be verified in a simple simulation study:

  1. Generate a matrix \(X\) with 100 rows (events) and \(D=3\) columns, filled with random uniform numbers. (This is the assumption of no spatial structure in a 3D hypercube.)
  2. Sample \(m=10\) events and also generate 10 new uniform points.
  3. Calculate Hopkins statistic with exponent \(d=1\) (incorrect value).
  4. Calculate Hopkins statistic with exponent \(d=3\) (correct value).
  5. Repeat 1000 times.
  6. Compare the empirical density curves of the two different methods to the Beta(\(m\),\(m\)) distribution.
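
The following R sketch carries out this simulation using the hopkins_simple() function defined earlier (the seed and the base-graphics overlay are our choices; each call to hopkins_simple() draws its own sample):

set.seed(42)
B <- 1000; n <- 100; D <- 3; m <- 10
h1 <- hD <- numeric(B)
for (b in seq_len(B)) {
  X <- matrix(runif(n * D), ncol = D)   # no spatial structure
  h1[b] <- hopkins_simple(X, m, d = 1)  # incorrect exponent
  hD[b] <- hopkins_simple(X, m, d = D)  # correct exponent
}
plot(density(h1), col = "red", xlim = c(0, 1), main = "")
lines(density(hD), col = "blue")
curve(dbeta(x, m, m), add = TRUE)       # theoretical Beta(m, m) density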

Figure 1: Results of a simulation study of the distribution of Hopkins statistic. The red and blue lines are the empirical density curves of 1000 Hopkins statistics calculated with exponents \(d=1\) (red) and \(d=3\) (blue). The black line is the theoretical distribution of the Hopkins statistic. The red line is very far away from the black line and shows that calculating Hopkins statistic with exponent \(d=1\) is incorrect.

In Figure 1, the empirical density of the blue curve is similar to the theoretical distribution shown by the black line, while the empirical density of the red curve is clearly dissimilar. The distribution of Hopkins statistic with \(d=1\) is clearly incorrect (except in the trivial case where \(X\) has only 1 column). Note also that the blue curve is slightly flatter than the theoretical distribution shown in black. This is not accidental: it is caused by edge effects of the sampling region and will be discussed in a later section.

4 Examples

The first three examples in this section are adapted from Gastner (2005). The datasets are available in the spatstat.data package (Baddeley et al. 2021). A modified version of the hopkins() function was written for this paper to show how the Hopkins statistic is calculated (inspired by Figure 1 of Lawson and Jurs (1990)). To minimize over-plotting, only \(m=3\) sampling points are used for these examples. In each figure, 3 of the existing events in \(X\) are chosen at random and a light-blue arrow is drawn to the nearest neighbor in \(X\). In addition, 3 points are drawn uniformly in the plotting region and a light-red arrow is drawn to the nearest neighbor in \(X\). The colored numbers are the lengths of the arrows.

Example 1: Systematically-spaced data


Figure 2: An example of how Hopkins statistic is calculated with systematically-spaced data. The black circles are the events of the cells data. Each blue W represents a randomly-chosen event. Each blue arrow points from a W to the nearest-neighboring event. Each red U is a new, randomly-generated point. Each red arrow points from a U to the nearest-neighboring event. The numbers are the length of the arrows. In systematically-spaced data, red arrows tend to be shorter than blue arrows.

The cells data represent the centers of mass of 42 cells from insect tissue. The scatterplot of the data in Figure 2 shows that events are systematically spaced, nearly as far apart as possible. Because the data are two-dimensional, Hopkins statistic is calculated as the sum of the squared distances \(u_i^2\) divided by the sum of the squared distances \(u_i^2 + w_i^2\):

(.046^2 + .081^2 + .021^2) / 
  ( (.046^2 + .081^2 + .021^2) + (.152^2 + .14^2 + .139^2) )
[1] 0.1281644

The hopkins() function returns the same value:

set.seed(17)
hopkins(cells, m=3)
[1] 0.1285197

The value of the Hopkins statistic in this calculation is based on only \(m=3\) events and will have sizable sampling error. To reduce the sampling error, a larger sample size can be used, up to approximately 10% of the number of events. To reduce sampling error further while maintaining the independence assumption of the sampling, repeated samples can be drawn. Here we follow the idea of Gastner (2005) and calculate Hopkins statistic 100 times, then compute the mean and standard deviation of the 100 values, which in this case are 0.21 and 0.06. The statistic is quite a bit lower than 0.5 and indicates the events are spaced more evenly than purely-random events (p-value 0.05).
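
For example, a sketch of this repeated-sampling approach using the hopkins() function (the seed and the choice \(m=4\), roughly 10% of the 42 events, are ours):

set.seed(17)
H <- replicate(100, hopkins(cells, m = 4))
mean(H)  # about 0.21 (see text)
sd(H)    # about 0.06 (see text)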

Example 2: Randomly-spaced data

The japanesepines data contains the locations of 65 Japanese black pine saplings in a square 5.7 meters on a side. The plot of the data in Figure 3 is an example of data in which the events are randomly spaced.


Figure 3: An example of how Hopkins statistic is calculated with randomly-spaced data. The black circles are the events of the japanesepines data. Each blue W represents a randomly-chosen event. Each blue arrow points from a W to the nearest-neighboring event. Each red U is a new, randomly-generated point. Each red arrow points from a U to the nearest-neighboring event. The numbers are the length of the arrows. In randomly-spaced data, red arrows tend to be similar in length to blue arrows.

The value of Hopkins statistic using 3 events and points is:

(.023^2+.076^2+.07^2) /
  ((.023^2+.076^2+.07^2) + (.104^2+.1^2+.058^2))
[1] 0.3166596

The mean and standard deviation of the 100 Hopkins statistics are 0.48 and 0.12. The value of the statistic is close to 0.5 and indicates no evidence against a random distribution of data (p-value 0.9).

Example 3: Clustered data


Figure 4: An example of how Hopkins statistic is calculated with clustered data. The black circles are the events of the redwood data. Each blue W represents a randomly-chosen event. Each blue arrow points from a W to the nearest-neighboring event. Each red U is a new, randomly-generated point. Each red arrow points from a U to the nearest-neighboring event. The numbers are the length of the arrows. In clustered data, red arrows tend to be longer in length than blue arrows.

The redwood data are the coordinates of 62 redwood seedlings in a square 23 meters on a side. The plot in Figure 4 shows events that exhibit clustering. The value of Hopkins statistic for the plot is:

(.085^2+.078^2+.158^2) /
  ((.085^2+.078^2+.158^2) + (.028^2+.028^2+.12^2))
[1] 0.7056101

The mean and standard deviation of the 100 Hopkins statistics are 0.79 and 0.13. The value of the statistic is much higher than 0.5, which indicates that the data are somewhat clustered (p-value 0.03).

Example 4

Adolfsson et al. (2017) provide a review of various methods for detecting clusterability. One of the methods they considered was Hopkins statistic, which they calculated using 10% sampling. They evaluated the clusterability of nine R datasets by calculating Hopkins statistic 100 times and reporting the proportion of times that Hopkins statistic exceeded the appropriate beta quantile. We can repeat their analysis and calculate Hopkins statistic with both the exponent \(d=1\) and the exponent \(d=D\), where \(D\) is the number of columns of each dataset.
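
A sketch of this analysis for a single dataset follows. The clusterability() wrapper and the use of the upper 0.95 quantile of the Beta(\(m\),\(m\)) distribution as the significance cutoff are our assumptions about their procedure.

library(hopkins)
clusterability <- function(X, reps = 100, alpha = 0.05) {
  X <- as.matrix(X)
  m <- ceiling(nrow(X) / 10)   # 10% sampling
  H <- replicate(reps, hopkins(X, m = m))
  # proportion of statistics exceeding the upper Beta(m, m) quantile
  mean(H > qbeta(1 - alpha, m, m))
}
clusterability(swiss)  # compare with the swiss row of Table 1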

Table 1: In this table, dataset is the R dataset name, n is the number of rows in the data, D is the number of columns in the data, Adolfsson is the proportion of 100 times that Hopkins statistic was significant as reported by Adolfsson, Ackerman, and Brownstein (2017), Hopkins1 is the proportion of 100 times that Hopkins statistic was significant when calculated with the exponent \(d=1\) (similar to the clustertend package), and HopkinsD is the proportion of 100 times that Hopkins statistic was significant when calculated with the exponent \(d=D\). Since the Adolfsson and Hopkins1 columns are similar (within sampling variation), it appears that Adolfsson, Ackerman, and Brownstein (2017) used the clustertend package to calculate Hopkins statistic.
dataset          n   D  Adolfsson  Hopkins1  HopkinsD
faithful       272   2       1.00      1.00      1.00
iris           150   5       1.00      1.00      1.00
rivers         141   1       0.92      0.89      0.90
swiss           47   6       0.41      0.25      0.94
attitude        30   7       0.00      0.00      0.59
cars            50   2       0.19      0.23      0.68
trees           31   3       0.18      0.22      0.71
USJudgeRatings  43  12       0.69      0.53      1.00
USArrests       50   4       0.01      0.00      0.56

In Table 1, the Adolfsson and Hopkins1 columns are similar (within sampling variability), so it appears that Adolfsson et al. (2017) used Hopkins statistic with \(d=1\) as the exponent. This would be expected if they had used the clustertend package (YiLan and RuTong 2015, version 1.4) to calculate Hopkins statistic.

For a few of the datasets, there is substantial disagreement between the last two columns. For example, the swiss data is significantly clusterable 41% of the time according to Adolfsson et al. (2017), but 94% of the time when using Hopkins statistic with exponent \(d=D\). A scatterplot of the swiss data in Figure 5 shows that the data are strongly non-random, which agrees with the 94%.


Figure 5: Pairwise scatterplots of the R dataset swiss. The meaning of the variables is not important here. Because some panels show a lack of spatial randomness of the data, we would expect Hopkins statistic to be significant.

Similarly, the trees data is significantly clusterable 18% of the time according to the Adolfsson column, but 71% of the time according to HopkinsD. The scatterplot in Figure 6 shows strong non-random patterns, which agrees with the 71%.


Figure 6: Pairwise scatterplots of the R dataset trees. The data are Girth, Height, and Volume of 31 black cherry trees. Because all panels show a lack of spatial randomness of the data, we would expect Hopkins statistic to be significant.

Scatterplot matrices of the swiss, attitude, cars, trees, and USArrests datasets can be found in Brownstein et al. (2019). Each scatterplot matrix shows at least one pair of the variables with notable correlation and therefore the data are not randomly-distributed, but rather are clustered. For each of these datasets, the proportion of times Hopkins1 is significant is less than 0.5, but the proportion of times HopkinsD is significant is greater than 0.5. The HopkinsD statistic is accurately detecting the presence of clusters in these datasets.

5 Correcting for edge effects

In the cells, japanesepines and redwood examples above, it is possible or even probable that there are additional events outside of the sampling frame that contains the data. The sampling frame thus has the effect of cutting off potential nearest neighbors from consideration. If the distribution of the data can be assumed to extend beyond the sampling frame, and if the shape of the sampling frame can be viewed as a hypercube, then edge effects due to the sampling frame can be corrected by using a torus geometry that wraps the edges of the sampling frame around to the opposite side (Li and Zhang 2007). To see an illustration of this, look again at the plot of the japanesepines data in Figure 3. The randomly-generated point \(U\) in the upper-left corner is a distance of \(0.076\) away from the nearest event. However, if the left edge of the plot is wrapped around an imaginary cylinder and connected to the right edge of the plot, then the nearest neighbor is the event in the upper-right corner at coordinates (0.97, 0.86).

To see what effect the torus geometry has on the distribution of the Hopkins statistic, consider the following simulation. We generate \(n=100\) events uniformly in a \(D=5\) dimensional unit cube, sample \(m=10\) events, and calculate the value of Hopkins statistic using both a simple geometry and a torus geometry. We repeat these steps \(B=1000\) times. The calculation of the nearest neighbor using a torus geometry is computationally more demanding than using a simple geometry, especially as the number of dimensions \(D\) increases, so parallel computing can reduce the computing time roughly linearly in the number of processors used. As a point of reference, this small simulation study was performed in less than 1 minute on a reasonably-powerful laptop with 8 cores using the doParallel package (Microsoft Corporation and Weston 2020). We found that \(B=1000\) provided results that were stable, regardless of the seed value for the random number generation in the simulations.
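
A serial sketch of this simulation follows (the seed is ours, and we assume the method argument of hopkins() selects between the simple and torus geometries, as described in the conclusion; the doParallel package could be used to parallelize the loop):

library(hopkins)
set.seed(1)
B <- 1000; n <- 100; D <- 5; m <- 10
h_simple <- h_torus <- numeric(B)
for (b in seq_len(B)) {
  X <- matrix(runif(n * D), ncol = D)
  h_simple[b] <- hopkins(X, m = m, method = "simple")
  h_torus[b]  <- hopkins(X, m = m, method = "torus")
}
plot(density(h_torus), col = "darkgreen", xlim = c(0, 1), main = "")
lines(density(h_simple), col = "blue")
curve(dbeta(x, m, m), add = TRUE)  # theoretical Beta(m, m) density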


Figure 7: Results of a simulation study considering how the spatial geometry affects Hopkins statistic. The thin black line is the theoretical distribution of Hopkins statistic. The blue and green lines are the empirical density curves of 1000 Hopkins statistics calculated with simple geometry (blue) and torus geometry (green). Calculating Hopkins statistic with a torus geometry aligns closely to the theoretical distribution.

In Figure 7, when a torus geometry is used to correct for edge effects, the empirical distribution of Hopkins statistic is remarkably close to its theoretical distribution. In contrast, when a simple geometry is used, the empirical distribution is somewhat flattened, with heavier tails. The practical consequence is that without edge correction, Hopkins statistic is more likely to deviate from 0.5 and therefore more likely to suggest the data are not uniformly distributed. The risk of this erroneous interpretation grows as the number of dimensions \(D\) increases and edge effects become more pronounced.

6 Sampling frame effects


Figure 8: The left figure shows 250 points simulated randomly in a unit square. As expected, the value of Hopkins statistic is close to 0.5. The right figure shows the same points, but only those inside a unit-diameter circle. The value of Hopkins statistic H is much larger than 0.5. Although both figures depict spatially-uniform points, the square shape of the sampling frame affects the value of Hopkins statistic.

Another practical problem affecting the correct use and interpretation of Hopkins statistic has to do with the shape of the sampling frame. Consider the example data in Figure 8. On the left side, 250 random events were simulated in a 2-dimensional unit square. On the right side, the same data are used, but have been subset to keep only the events inside a unit-diameter circle. For both figures, Hopkins statistic was calculated 100 times with 10 events sampled each time.

On the left side, both the bounding box and the actual sampling frame are the unit square. The median of 100 Hopkins statistics is 0.51, providing no evidence against a random distribution. On the right side, the actual sampling frame of the data is a unit-diameter circle, but the Hopkins statistic still uses the unit square (for generating new points in \(U\)), and the median Hopkins statistic is 0.75, indicating clustering of the data within the sampling frame, even though the data were generated uniformly. A few more examples of problems related to the sampling frame can be found in Smith and Jain (1984).
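
A sketch of this example follows (the seed is ours; we assume hopkins() generates new points in the bounding box of the data, which here approximates the unit square):

set.seed(2)
X <- matrix(runif(250 * 2), ncol = 2)                 # unit square
keep <- (X[, 1] - 0.5)^2 + (X[, 2] - 0.5)^2 < 0.25    # unit-diameter circle
Xcirc <- X[keep, , drop = FALSE]
median(replicate(100, hopkins(X, m = 10)))      # near 0.5
median(replicate(100, hopkins(Xcirc, m = 10)))  # well above 0.5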

To consider the problem with the sampling frame on real data, refer again to the trees data in Figure 6. Because trees usually grow both in height and girth at the same time, it would be unexpected to find tall trees with narrow girth or short trees with large girth. Also, since the volume is a function of the girth and height, it is correlated with those two variables. In the scatterplot of girth versus volume, it would be nearly impossible to find points in the upper left or lower right corner of the square. From a biological point of view, the sampling frame cannot be shaped like a square and the null hypothesis of uniform distribution of data is violated a priori, which means the distribution of Hopkins statistic does not follow a Beta(\(m\),\(m\)) distribution.

7 A protocol for using Hopkins statistic

Because Hopkins statistic is not hard to calculate and is easy to interpret, yet can be misused (as shown in the previous sections), we propose a protocol for using Hopkins statistic. The protocol simply asks the practitioner to consider the following five questions before calculating Hopkins statistic.

  1. Is the number of events \(\mathbf{n > 100}\) and the number of randomly-sampled events at most 10% of \(\mathbf{n}\)? This is recommended by Cross and Jain (1982).
  2. Is spatial randomness of the events even possible? If the events are known or suspected to be correlated, this violates the null hypothesis of spatial uniformity, and may also mean that the sampling frame is not shaped like a hypercube.
  3. Could nearest-neighbor events have occurred outside the boundary of the sampling frame? If yes, it may be appropriate to calculate nearest-neighbor distances using a torus geometry.
  4. Is the sampling frame non-rectangular? If yes, then be extremely careful with how points are sampled from \(U\) when calculating Hopkins statistic.
  5. Is the dimension of the data much greater than 2? Edge effects are more common in higher dimensions.

The important point of this protocol is to raise awareness of potential problems. We leave it to the practitioner to decide what to do with the answers to these questions.

8 Conclusion

The statistical literature regarding Hopkins statistic is filled with confusion about how to calculate the statistic. Some publications have erroneously used the exponent \(d=1\) in the formula for Hopkins statistic and this error has propagated into much statistical software and led to incorrect conclusions. To remedy this situation, the R package hopkins (Wright 2022) provides a function hopkins() that calculates Hopkins statistic using the general exponent \(d=D\) for D-dimensional data. The function can use simple geometry for fast calculations or torus geometry to correct for edge effects. Using this function, we show that the distribution of Hopkins statistic calculated with the general exponent \(d=D\) aligns closely with the theoretical distribution of the statistic. Because inference with Hopkins statistic can be trickier than expected, we introduce a protocol of five questions to consider when using Hopkins statistic.

Alternative versions of Hopkins statistic have been examined by Zeng and Dubes (1985b), Rotondi (1993), and Li and Zhang (2007). Other methods of examining multivariate uniformity of data have been considered by Smith and Jain (1984), Yang and Modarres (2017), and Petrie and Willemain (2013).

9 Acknowledgements

Thanks to Deanne Wright for bringing the confusion about Hopkins statistic to our attention. Thanks to Vanessa Windhausen and Deanne Wright for reading early drafts of this paper and to Dianne Cook for reviewing the final version. Thanks to Wong (2013) for the pdist package for fast computation of nearest neighbors and thanks to Northrop (2021) for the donut package for nearest neighbor search on a torus.

Supplementary materials

Supplementary materials are available in addition to this article and can be downloaded at RJ-2022-055.zip

CRAN packages used

clustertend, hopkins, doParallel, pdist, donut

References

A. Adolfsson, M. Ackerman and N. C. Brownstein. To cluster, or not to cluster: How to answer the question. Proceedings of Knowledge Discovery from Data, Halifax, Nova Scotia, Canada, August 13-17 (TKDD ’17), 2017. URL https://maya-ackerman.com/wp-content/uploads/2018/09/clusterability2017.pdf.
A. Baddeley, R. Turner and E. Rubak. spatstat.data: Datasets for ’spatstat’ family. 2021. URL https://CRAN.R-project.org/package=spatstat.data. R package version 2.1-0.
A. Banerjee and R. N. Dave. Validating clusters using the Hopkins statistic. In 2004 IEEE international conference on fuzzy systems (IEEE cat. No. 04CH37542), pages 149–153, 2004. IEEE. URL https://doi.org/10.1109/FUZZY.2004.1375706.
N. C. Brownstein, A. Adolfsson and M. Ackerman. Descriptive statistics and visualization of data from the R datasets package with implications for clusterability. Data in Brief, 25: 104004, 2019. URL https://doi.org/10.1016/j.dib.2019.104004.
G. R. Cross and A. K. Jain. Measurement of clustering tendency. In Theory and application of digital control, pages 315–320, 1982. URL https://doi.org/10.1016/B978-0-08-027618-2.50054-1.
P. J. Diggle, J. Besag and J. T. Gleaves. Statistical analysis of spatial point patterns by means of distance methods. Biometrics, 659–667, 1976. URL https://doi.org/10.2307/2529754.
R. C. Dubes and G. Zeng. A test for spatial homogeneity in cluster analysis. Journal of Classification, 4(1): 33–56, 1987. URL https://doi.org/10.1007/BF01890074.
M. T. Gastner. Spatial distributions: Density-equalizing map projections, facility location, and two-dimensional networks. 2005. URL https://hdl.handle.net/2027.42/125368.
B. Hopkins and J. G. Skellam. A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2): 213–227, 1954. URL https://doi.org/10.1093/oxfordjournals.aob.a083391.
P. C. Jurs and R. G. Lawson. Clustering tendency applied to chemical feature selection. Drug Information Journal, 24(4): 691–704, 1990. URL https://doi.org/10.1177/216847909002400405.
R. G. Lawson and P. C. Jurs. New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30(1): 36–41, 1990. URL https://doi.org/10.1021/ci00065a010.
F. Li and L. Zhang. Comparison of point pattern analysis methods for classifying the spatial distributions of spruce-fir stands in the north-east USA. Forestry, 80(3): 337–349, 2007. URL https://doi.org/10.1093/forestry/cpm010.
Microsoft Corporation and S. Weston. doParallel: Foreach parallel adaptor for the ’parallel’ package. 2020. URL https://CRAN.R-project.org/package=doParallel. R package version 1.0.16.
P. J. Northrop. donut: Nearest neighbour search with variables on a torus. 2021. URL https://CRAN.R-project.org/package=donut. R package version 1.0.2.
A. Petrie and T. R. Willemain. An empirical study of tests for uniformity in multidimensional data. Computational Statistics & Data Analysis, 64: 253–268, 2013. URL https://doi.org/10.1016/j.csda.2013.02.013.
R. Rotondi. Tests of randomness based on the k-NN distances for data from a bounded region. Probability in the Engineering and Informational Sciences, 7: 557–569, 1993. URL https://doi.org/10.1017/S0269964800003132.
S. P. Smith and A. K. Jain. Testing for uniformity in multidimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1): 73–81, 1984. URL https://doi.org/10.1109/TPAMI.1984.4767477.
J. Wong. pdist: Partitioned distance function. 2013. URL https://CRAN.R-project.org/package=pdist. R package version 1.2.
K. Wright. hopkins: Calculate hopkins statistic for clustering. 2022. URL https://CRAN.R-project.org/package=hopkins. R package version 1.0.
M. Yang and R. Modarres. Multivariate tests of uniformity. Statistical Papers, 58: 627–639, 2017. URL https://doi.org/10.1007/s00362-015-0715-x.
L. YiLan and Z. RuTong. clustertend: Check the clustering tendency. 2015. URL https://CRAN.R-project.org/package=clustertend. R package version 1.4.
G. Zeng and R. C. Dubes. A comparison of tests for randomness. Pattern Recognition, 18(2): 191–198, 1985a. URL https://doi.org/10.1016/0031-3203(85)90043-3.
G. Zeng and R. C. Dubes. A test for spatial randomness based on k-NN distances. Pattern Recognition Letters, 3(2): 85–91, 1985b. URL https://doi.org/10.1016/0167-8655(85)90013-3.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wright, "Will the Real Hopkins Statistic Please Stand Up?", The R Journal, 2022

BibTeX citation

@article{RJ-2022-055,
  author = {Wright, Kevin},
  title = {Will the Real Hopkins Statistic Please Stand Up?},
  journal = {The R Journal},
  year = {2022},
  note = {https://doi.org/10.32614/RJ-2022-055},
  doi = {10.32614/RJ-2022-055},
  volume = {14},
  issue = {3},
  issn = {2073-4859},
  pages = {282-292}
}