# Will the Real Hopkins Statistic Please Stand Up?

Hopkins statistic can be used to test for spatial randomness of data and for detecting clusters in data. Although the method is nearly 70 years old, there is persistent confusion regarding the definition and calculation of the statistic. We investigate the confusion and its possible origin. Using the most general definition of Hopkins statistic, we perform a small simulation to verify its distributional properties, provide a visualization of how the statistic is calculated, and provide a fast R function to correctly calculate the statistic. Finally, we propose a protocol of five questions to guide the use of Hopkins statistic.

Kevin Wright (Corteva Agriscience)
2022-12-20

# 1 Introduction

Hopkins and Skellam (1954) introduced a statistic to test for spatial randomness of data. If the null hypothesis of spatial randomness is rejected, then one possible interpretation is that the data may be clustered into distinct groups. Because clustering methods will always identify clusters, even when the data contain no meaningful clusters, Hopkins statistic can be used to assess whether the data show a clustering tendency before clustering methods are applied. In the description below of how to calculate Hopkins statistic, we follow the terminology of earlier authors and use “event” for one of the existing rows of the data matrix $$X$$, and “point” for a new, randomly chosen location. For clarity in the discussions below, we distinguish between $$D$$, the dimension of the data, and $$d$$, the exponent in the formula for Hopkins statistic.

Let $$X$$ be a matrix of $$n$$ events (in rows) and $$D$$ variables (in columns). Let $$U$$ be the space defined by $$X$$.

Hopkins statistic is calculated with the following algorithm:

1. Sample at random one of the existing events from the data $$X$$. Let $$w_i$$ be the Euclidean distance from this event to the nearest-neighbor event in $$X$$.
2. Generate one new point uniformly distributed in $$U$$. Let $$u_i$$ be the Euclidean distance from this point to the nearest-neighbor event in $$X$$.
3. Repeat steps (1) and (2) $$m$$ times, where $$m$$ is a small fraction of $$n$$, such as 10%.
4. Calculate $$H = \sum_{i=1}^m u_i^d \big/ \sum_{i=1}^m (u_i^d + w_i^d)$$, where $$d=D$$.
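The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any published package: the names `nearest_distance` and `hopkins_sketch` are hypothetical, the space $$U$$ is taken to be the bounding box of the data, and the $$m$$ events are sampled without replacement.

```python
import math
import random


def nearest_distance(p, events, skip=None):
    """Euclidean distance from p to the nearest row of events, ignoring row `skip`."""
    return min(math.dist(p, e) for j, e in enumerate(events) if j != skip)


def hopkins_sketch(X, m, d=None, rng=None):
    """One value of Hopkins statistic for the events in X (a list of equal-length rows)."""
    rng = rng or random.Random()
    D = len(X[0])
    d = D if d is None else d  # default exponent is the data dimension
    # Assumption: U is the bounding box of the data (one range per column).
    lo = [min(row[k] for row in X) for k in range(D)]
    hi = [max(row[k] for row in X) for k in range(D)]
    # Steps 1 and 3: m distinct events and their nearest-neighbor distances w_i.
    # An event is never counted as its own neighbor.
    idx = rng.sample(range(len(X)), m)
    w = [nearest_distance(X[i], X, skip=i) for i in idx]
    # Steps 2 and 3: m uniform points in U and their nearest-event distances u_i.
    points = [[rng.uniform(lo[k], hi[k]) for k in range(D)] for _ in range(m)]
    u = [nearest_distance(p, X) for p in points]
    # Step 4: ratio of summed d-th powers.
    su = sum(ui ** d for ui in u)
    sw = sum(wi ** d for wi in w)
    return su / (su + sw)


# Usage: 200 uniform events in the unit square, with m equal to 10% of n.
rng = random.Random(1)
X_uniform = [[rng.random(), rng.random()] for _ in range(200)]
H_uniform = hopkins_sketch(X_uniform, m=20, rng=rng)
print(H_uniform)
```

The brute-force nearest-neighbor search keeps the sketch short; it is not intended to be fast for large $$n$$.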

Because of sampling variability, it is common to calculate $$H$$ multiple times and take the average. Under the null hypothesis of spatial randomness, a single value of $$H$$ has a Beta($$m$$,$$m$$) distribution; by construction, $$H$$ always lies between 0 and 1. The interpretation of $$H$$ follows these guidelines:

• Low values of $$H$$ indicate repulsion of the events in $$X$$ away from each other.
• Values of $$H$$ near 0.5 indicate spatial randomness of the events in $$X$$.
• High values of $$H$$ indicate possible clustering of the events in $$X$$. Values of $$H > 0.75$$ indicate a clustering tendency at the 90% confidence level.
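Since the null distribution Beta($$m$$,$$m$$) depends on $$m$$, the upper-tail probability attached to any fixed cutoff also depends on $$m$$. The sketch below computes $$P(H > h)$$ under the null for integer $$m$$. The function name `beta_mm_upper_tail` is hypothetical, and the calculation relies on the standard identity linking the Beta distribution with integer parameters to a binomial tail.

```python
from math import comb


def beta_mm_upper_tail(h, m):
    """P(H > h) when H ~ Beta(m, m), for a positive integer m.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a),
    which holds for positive integers a and b.
    """
    n = 2 * m - 1
    cdf = sum(comb(n, j) * h ** j * (1 - h) ** (n - j) for j in range(m, n + 1))
    return 1.0 - cdf


# Tail probability of the cutoff 0.75 for several sample sizes m.
for m in (3, 5, 10, 20):
    print(m, round(beta_mm_upper_tail(0.75, m), 4))
```

In practice a statistical library's Beta distribution functions would serve the same purpose; the binomial identity is used here only to keep the example free of dependencies.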

# 2 A short history of Hopkins statistic

There is considerable confusion about the definition of Hopkins statistic in scientific publications. In particular, three different values of the exponent $$d$$ (in step 4 above) have been used in the statistical literature: $$d=1$$, $$d=2$$, and the generalized $$d=D$$. Here is a brief timeline of how this exponent has been presented.

• 1954: Hopkins and Skellam (1954) introduced Hopkins statistic in a two-dimensional setting. The formula they present is in a slightly different form, but is equivalent to $$\sum u_i^2 \big/ \sum (u_i^2 + w_i^2 )$$. The exponent here is $$d=2$$.

• 1976: Diggle et al. (1976) presented a formula for Hopkins statistic in a two-dimensional setting as $$\sum u_i \big/ \sum (u_i + w_i )$$. This formula has no exponents and therefore at first glance appears to use the exponent $$d=1$$ in the equation for Hopkins statistic. However, a careful reading of their text shows that their $$u_i$$ and $$w_i$$ values were actually squared Euclidean distances. If their $$u_i$$ and $$w_i$$ had represented ordinary (non-squared) Euclidean distances, then their formula would have been $$\sum u_i^2 \big/ \sum (u_i^2 + w_i^2 )$$. We suspect this paper is the likely source of confusion by later authors.

• 1982: Cross and Jain (1982) generalized Hopkins statistic for $$X$$ of any dimension $$d=D$$ as $$\sum u_i^d \big/ \sum (u_i^d + w_i^d )$$. This formula was also used by Zeng and Dubes (1985a), Dubes and Zeng (1987), and Banerjee and Dave (2004).

• 1990: Lawson and Jurs (1990) and Jurs and Lawson (1990) gave the formula for Hopkins statistic as $$\sum u_i \big/ \sum (u_i + w_i)$$, but used ordinary distances instead of squared distances. Perhaps this was the result of misunderstanding the formula in Diggle et al. (1976).

• 2015: The R function hopkins() in the clustertend package (YiLan and RuTong 2015 version 1.4) cited Lawson and Jurs (1990) and also used the exponent $$d=1$$.

• 2022: The new function hopkins() in the hopkins package (Wright 2022 version 1.0) uses the general exponent $$d=D$$ as found in Cross and Jain (1982).
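The notational point in the 1976 entry of the timeline can be written out explicitly. The symbols $$s_i$$ and $$t_i$$ are introduced here for illustration only and do not appear in the cited papers. Let $$s_i = u_i^2$$ and $$t_i = w_i^2$$ denote squared Euclidean distances. Then

$$\frac{\sum s_i}{\sum (s_i + t_i)} = \frac{\sum u_i^2}{\sum (u_i^2 + w_i^2)},$$

which is the $$d=2$$ form of Hopkins and Skellam (1954). Substituting ordinary distances into the same unexponentiated expression instead gives $$\sum u_i \big/ \sum (u_i + w_i)$$, which is the $$d=1$$ form.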

# 3 Simulation study for the distribution of Hopkins statistic

Having identified the confusion in the statistical literature, we now ask the question, “Does it matter what value of $$d$$ is used in the exponent?” In a word, “yes”.

According to Cross and Jain (1982), under the null hypothesis of no structure in the data, the distribution of Hopkins statistic is Beta($$m$$,$$m$$), where $$m$$ is the number of rows sampled in $$X$$. This distribution can be verified in a simple simulation study:

1. Generate a matrix $$X$$ with 100 rows (events) and $$D=3$$ columns, filled with random uniform numbers. (This is the assumption of no spatial structure in a 3D hypercube.)
2. Sample $$m=10$$ events and also generate 10 new uniform points.
3. Calculate Hopkins statistic with exponent $$d=1$$ (incorrect value).
4. Calculate Hopkins statistic with exponent $$d=3$$ (correct value).
5. Repeat 1000 times.
6. Compare the empirical density curves of the two different methods to the Beta($$m$$,$$m$$) distribution.
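The simulation steps above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used to produce the results reported here: the function names are hypothetical, new points are drawn in the unit cube rather than in the bounding box of $$X$$, and step 6 is replaced by a comparison of sample variances with the variance of Beta($$m$$,$$m$$), which is $$1/(4(2m+1))$$.

```python
import math
import random
import statistics


def null_distances(n, m, D, rng):
    """Nearest-neighbor distances (u, w) for one data set with no spatial structure."""
    X = [[rng.random() for _ in range(D)] for _ in range(n)]
    idx = rng.sample(range(n), m)  # m distinct events
    w = [min(math.dist(X[i], e) for j, e in enumerate(X) if j != i) for i in idx]
    pts = [[rng.random() for _ in range(D)] for _ in range(m)]  # m new points
    u = [min(math.dist(p, e) for e in X) for p in pts]
    return u, w


def hopkins_from_distances(u, w, d):
    """Hopkins statistic with exponent d from precomputed distances."""
    su = sum(x ** d for x in u)
    sw = sum(x ** d for x in w)
    return su / (su + sw)


def simulate(reps=1000, n=100, m=10, D=3, seed=1):
    """Return two lists of H values computed on the same samples: d=1 and d=D."""
    rng = random.Random(seed)
    h_d1, h_dD = [], []
    for _ in range(reps):
        u, w = null_distances(n, m, D, rng)
        h_d1.append(hopkins_from_distances(u, w, 1))
        h_dD.append(hopkins_from_distances(u, w, D))
    return h_d1, h_dD


h_d1, h_dD = simulate()
beta_var = 1 / (4 * (2 * 10 + 1))  # variance of Beta(10, 10)
print("variance with d=1:  ", statistics.variance(h_d1))
print("variance with d=3:  ", statistics.variance(h_dD))
print("variance Beta(10,10):", beta_var)
```

Computing both exponents from the same distances isolates the effect of $$d$$ from sampling noise. A density plot of `h_d1` and `h_dD` against the Beta(10,10) density would correspond to step 6.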

In Figure 1:

• The black curve is the density of Beta(10,10).
• The red curve is the density of Hopkins statistic when $$d=1$$ is used in the calculation (incorrect).
• The blue curve is the density of Hopkins statistic when $$d=3$$ (the number of columns in $$X$$) is used (correct).

The blue curve is close to the theoretical Beta(10,10) density shown in black, whereas the red curve is clearly different. Hopkins statistic calculated with $$d=1$$ therefore does not follow the stated null distribution (except in the trivial case where $$X$$ has only 1 column). Note also that the blue curve is slightly flatter than the theoretical density shown in black. This is not accidental; it is caused by edge effects of the sampling region and will be discussed in a later section.

# 4 Examples

The first three examples in this section are adapted from Gastner (2005). The datasets are available in the spatstat.data package. A modified version of the hopkins() function was written for this paper to show how Hopkins statistic is calculated (inspired by Figure 1 of Lawson and Jurs (1990)). To minimize over-plotting, only $$m=3$$ sampling points are used for these examples. In each figure, 3 of the existing events in $$X$$ are chosen at random and a light-blue arrow is drawn from each to its nearest neighbor in $$X$$. In addition, 3 points are drawn uniformly in the plotting region and a light-red arrow is drawn from each to its nearest neighbor in $$X$$. The colored numbers are the lengths of the arrows.