Interest in social network analysis has exploded in the past few years, partly thanks to the advancements in statistical methods and computing for network analysis. A wide range of the methods for network analysis is already covered by existent R packages. However, no comprehensive packages are available to calculate group centrality scores and to identify key players (i.e., those players who constitute the most central group) in a network. These functionalities are important because, for example, many social and health interventions rely on key players to facilitate the intervention. Identifying key players is challenging because players who are individually the most central are not necessarily the most central as a group due to redundancy in their connections. In this paper we develop methods and tools for computing group centrality scores and for identifying key players in social networks. We illustrate the methods using both simulated and empirical examples. The package keyplayer providing the presented methods is available from Comprehensive R Archive Network (CRAN).
Interest in social network analysis has grown rapidly in the past few years. This was due partly to the advancements in statistical methods and computing for network analysis and partly to the increasing availability of social network data (e.g., network data generated by social media). A wide range of the methods for network analysis is already covered by R packages such as network (Butts 2008a), sna (Butts 2008b), igraph (Csardi and T. Nepusz 2006), statnet (Handcock et al. 2008), RSiena (Ripley et al. 2013), etc. However, none of these packages provides a comprehensive toolbox to calculate group centrality measures and to identify key players, who constitute the most central group, in a network. Determining the key players in a network is very important because many social and health interventions rely on key players to facilitate the intervention. For example, (Kelly et al. 1991) and (Latkin 1998) trained peer leaders as educators to promote HIV prevention. (Campbell et al. 2008) and (An 2015) used peer leaders to facilitate smoking prevention. (Borgatti 2006) and (Ressler 2006) suggested removing key figures among terrorists to most widely disrupt terrorism. More examples of this sort can be found in (Valente and Pumpuang 2007), (Banerjee et al. 2013), etc. Identifying key players is challenging because players who are individually the most central are not necessarily the most central as a group due to redundancy in their connections. In a seminal paper, (Borgatti 2006) pointed out the problem and proposed methods for identifying key players in social networks.
To the best of our knowledge, the keyplayer
function in UCINET
(Borgatti et al. 2002) is the first implementation of the methods detailed in
(Borgatti 2006). It has evolved from a separate add-on to UCINET to a
built-in function UCINET. In this paper, we present the
keyplayer package
(An and Liu 2016) in R, which differs from the keyplayer
function in
UCINET in several aspects. (1) Unlike the keyplayer
function in
UCINET which is only applicable to binary networks, keyplayer in R
can be used for both binary and weighted networks. (2) The keyplayer
package includes more centrality measures for choosing key players than
what is currently available in the keyplayer
function in UCINET. (3)
keyplayer provides better integration with other open-source packages
in R. Overall, the keyplayer
function in UCINET is useful for
researchers who are more familiar with UCINET and would like to
utilize other functionalities provided by UCINET, whereas keyplayer
is designed for users who are more familiar with R and who plan to do
more computational work.
The influenceR package (Simon and Aditya 2015) aims to provide calculations of several node centrality measures that were previously unavailable in other packages, such as the constraint index (Burt 1992) and the bridging score (Valente and Fujimoto 2010). It can also be used to identify key players in a network. But in comparison to keyplayer, it utilizes only one centrality metric when selecting key players whereas keyplayer includes eight different metrics. Also, influenceR currently works only for undirected networks whereas keyplayer works for both undirected and directed networks. Both packages provide parallel computation. influenceR relies on OpenMP for parallel computation whereas keyplayer utilizes the base package parallel which is readily available in R. Last, influenceR focuses on computing centrality measures at the node level whereas keyplayer is more interested in providing centrality measures at the group level. Overall, keyplayer provides more comprehensive functionalities for calculating group centrality measures and for selecting key players.
The algorithm for identifying key players in package keyplayer
essentially consists of three steps. First, users choose a metric to
measure centrality in a network. Second, the algorithm (specifically the
kpcent
function) will randomly pick a group of players and measure
their group centrality. Third, the algorithm (specifically the kpset
function) will select the group of players with the highest group
centrality as the desired key players. In general, users only need to
employ the kpset
function by specifying a centrality metric and the
number of key players to be selected. The function will return a set of
players who are the most central as a group. We also make the auxiliary
function kpcent
available. If users specify a centrality metric and
the indices of a group of players, this function will return the
centrality score of the specified group. Thus the two functions can be
used for two purposes: selecting key players or measuring group
centrality.
The paper proceeds as follows. First, we review centrality measures at
the individual level. Then we present methods for measuring centrality
at the group level. After that, we present a greedy search algorithm for
selecting key players and outline the basic structure and the usage of
the main function kpset
in package keyplayer. To illustrate the
methods and the usage of the package, we use a simulated network as well
as an empirical example based on the friendship network among managers
in a company. Last, we summarize and point out directions for improving
the package in the future.
We first review the definitions of centrality measures at the individual level. For conciseness, we provide the definitions based on weighted networks, where the weight of a tie takes a continuous value and usually measures the strength of the connection between two nodes. The definitions naturally incorporate binary networks where the weight of a tie can only be one or zero, indicating the presence or absence of a connection (Freeman 1978; Wasserman and Faust 1994; Butts 2008b).
Figure 1 shows an example of a simulated network. On the left is the adjacency matrix of the network. On the right is the network graph. Thinking of it as a friendship network, we can see that the strength of friendship between node 1 and node 3 is conceived differently by node 1 and node 3. The former assigns it a weight of 3 while the latter assigns it a weight of 1. We will use this example to illustrate the centrality measures. Calculations of four centrality measures (i.e., degree, closeness, betweenness, and eigenvector centralities) at the individual level are done using the sna package (Butts 2008b). Calculations of four other individual level centralities and all group level centralities are done using our package keyplayer. We would like to clarify at this point that our package does not depend on sna. We use sna here just for the sake of the example.
Degree centrality is defined as follows (Freeman 1978; Butts 2008b):
> W <- matrix(c(0, 1, 3, 0, 0, 0, 0, 0, 4, 0, 1, 1, 0, 2, 0, 0, 0, 0, 0, 3,
+ 0, 2, 0, 0, 0), nrow = 5, ncol = 5, byrow = TRUE)
> library(sna)
> degree(W, ignore.eval = FALSE) # For binary networks, set ignore.eval = TRUE.
[1] 5 8 7 9 5
> degree(W, ignore.eval = FALSE, cmode = "indegree")
[1] 1 4 3 6 3
> degree(W, ignore.eval = FALSE, cmode = "outdegree")
[1] 4 4 4 3 2
One version of the closeness centrality is due to (Gil and Schmidt 1996):
> A <- W
> A[W != 0] <- 1 / W[W != 0] # Inverse the non-zero tie status
> closeness(A, ignore.eval = FALSE, cmode = "suminvdir")
[1] 1.5142857 1.4285714 1.3000000 1.0500000 0.8333333
## For undirected networks, set cmode = "suminvundir".
Betweenness centrality is defined as follows (Butts 2008b):
> betweenness(A, ignore.eval = FALSE, cmode = "directed")
[1] 0 1 2 3 1
Eigenvector centrality defines a node’s centrality as a weighted average
of the centrality of its neighbors (Bonacich 1972; Butts 2008b):
> evcent(A, gmode = "digraph", ignore.eval = FALSE, use.eigen = TRUE)
[1] 0.5000000+0i 0.0000000+0i 0.8660254+0i 0.0000000+0i 0.0000000+0i
where gmode = "digraph"
indicates that the input is a directed network
and use.eigen = TRUE
requests using the robust eigen
function to
calculate the eigenvectors. In this example, the eigenvector centrality
includes complex numbers, which are hard to interpret. Thus, to
facilitate interpretation of the results, it is often a good idea to
symmetrize the network first because symmetric matrices always have real
eigenvalues. In the following, the symmetrization process first converts
> B <- symmetrize (W)
> evcent(B)
[1] 0.3505418 0.5590326 0.4699593 0.4699593 0.3505418
M-reach degree centrality generalizes the degree centrality by
delimiting specific neighborhoods. Suppose the set of nodes that node
## Calculations of four other individual level centralities and all group level
## centralities are done by our package.
> library(keyplayer)
## M-reach centrality.
> mreach.degree(W, M = 1)
outdegree indegree total
[1,] 2 1 3
[2,] 1 3 4
[3,] 3 1 4
[4,] 1 2 3
[5,] 1 1 2
One way to refine the M-reach degree centrality is to use (the inverse
of) geodistance to measure the tie status between nodes, just like how
closeness centrality refines degree centrality. We define the M-reach
closeness centrality as below:
## As before, we first take the inverse of the tie status, making it correspond to
## distance.
> mreach.closeness(A)
outdegree indegree total
[1,] 0.3785714 0.0625000 0.4410714
[2,] 0.3571429 0.3250000 0.6821429
[3,] 0.3250000 0.1875000 0.5125000
[4,] 0.2625000 0.5333333 0.7958333
[5,] 0.2083333 0.4232143 0.6315476
Fragmentation centrality measures the extent to which a network is
fragmented after a node is removed from the network (Borgatti 2006):
> fragment(A)
fragment
[1,] 0.6365079
[2,] 0.7446429
[3,] 0.6733500
[4,] 0.8333333
[5,] 0.7250000
Banerjee et al. (2013) proposed the diffusion centrality defined by the row sum of
the following matrix:
## Create a new adjacency matrix.
> g <- W
> g[W != 0] <- 1
> g
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 1 0 0
[2,] 0 0 0 1 0
[3,] 1 1 0 1 0
[4,] 0 0 0 0 1
[5,] 0 1 0 0 0
## Create a matrix with the passing probabilities.
> q <- matrix(c(0, .2, .6, 0, 0, .1, 0, 0, .4, 0, .1, .1, 0, .4, 0, 0, .5, 0, 0, .3,
+ 0, .4, 0, 0, 0), nrow = 5, ncol = 5, byrow = TRUE)
> q
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0 0.2 0.6 0.0 0.0
[2,] 0.1 0.0 0.0 0.4 0.0
[3,] 0.1 0.1 0.0 0.4 0.0
[4,] 0.0 0.5 0.0 0.0 0.3
[5,] 0.0 0.4 0.0 0.0 0.0
## Get the probability matrix and calculate diffusion centrality.
> P <- q * g
> P
[,1] [,2] [,3] [,4] [,5]
[1,] 0.0 0.2 0.6 0.0 0.0
[2,] 0.0 0.0 0.0 0.4 0.0
[3,] 0.1 0.1 0.0 0.4 0.0
[4,] 0.0 0.0 0.0 0.0 0.3
[5,] 0.0 0.4 0.0 0.0 0.0
> diffusion(P, T = 5)
diffusion
[1,] 1.50832
[2,] 0.59296
[3,] 0.99968
[4,] 0.48816
[5,] 0.63488
(Everett and Borgatti 1999) provide one of the first studies that explored ways to measure group centralities (mainly degree, closeness, and betweenness centralities) in undirected networks. In this paper, we provide more group centrality measures (including the eight ones outlined above) and extend the methods to both undirected and directed networks. The basic idea is to treat a group of nodes as a large pseudo-node. The key problem, then, is how to measure the tie status between the group and other outside nodes. For that purpose, we provide several criteria.
Minimum. According to the minimum criterion, the tie status
between a group
Maximum. According to the maximum criterion, the tie status
between a group
Addition. According to the addition criterion, the tie status
between a group
Union. The union criterion is designed for probability matrices.
The tie status between a group
In the simulated network, suppose nodes 2 and 3 are grouped together.
The connection between this group and node 4 according to the maximum
criterion is contract
function
automates these calculations and returns a reduced network matrix in
which the node index will be re-ordered with the group as the last node.
## Group nodes 2 and 3 and measure the connections between the group and outside nodes
## using the maximum criterion.
> contract(W, c(2, 3), method = "max")
1 4 5 set
1 0 0 0 3
4 0 0 3 0
5 0 0 0 2
set 1 4 0 0
## Group nodes 2 and 3 in the probability matrix and measure the connections
## between the group and outside nodes using the union criterion.
> contract(P, c(2, 3), method = "union")
1 4 5 set
1 0.0 0.00 0.0 0.68
4 0.0 0.00 0.3 0.00
5 0.0 0.00 0.0 0.40
set 0.1 0.64 0.0 0.00
Once the tie status between the group and outside nodes is measured, we
can use the centrality measures outlined above to calculate group
centrality based on the reduced network. The kpcent
function
implements the calculations. Note that users do not need to explicitly
deploy the contract
function because kpcent
automatically uses it in
the background.
> kpcent(W, c(2, 3), type = "degree", cmode = "total", method = "max")
[1] 10
> kpcent(W, c(2, 3), type = "degree", cmode = "total", method = "min")
[1] 6
> kpcent(W, c(2, 3), type = "degree", cmode = "total", method = "min", binary = TRUE)
[1] 4
> kpcent(W, c(2, 3), type = "mreach.degree", cmode = "total", M = 1, binary = TRUE)
[1] 4
> kpcent(W, c(2, 3), type = "mreach.closeness", cmode = "total", M = 1, binary = TRUE)
[1] 1.333333
Recall that the ultimate goal is to select the most central group of
nodes from a network. This goal quickly becomes challenging as the
network size grows. For example, to choose five key players out of 100
nodes, there are
Select an initial candidate set
Update the candidate set
Start with the first node in
Repeat loop 1 for each node in
Stop if (1) the change in
Return the final set
The function kpset
implements the search algorithm. Its basic
structure is shown below.
kpset(adj.matrix, size, type = "degree", M = Inf, T = ncol(adj.matrix),
method = "min", binary = FALSE, cmode = "total", large = TRUE,
geodist.precomp = NULL, seed = "top", parallel = FALSE, cluster = 2,
round = 10, iteration = ncol(adj.matrix))
where the arguments are defined as follows.
adj.matrix
: Matrix indicating the adjacency matrix of the network
or in the case of diffusion centrality a probability matrix.
size
: Integer indicating the target size of players.
type
: String indicating the type of centrality measure to be used.
Should be one of "degree"
for degree centrality, "closeness"
for
closeness centrality, "betweenness"
for betweenness centrality,
"evcent"
for eigenvector centrality, "mreach.degree"
for M-reach
degree centrality, "mreach.closeness"
for M-reach closeness
centrality, "fragment"
for fragment centrality, and "diffusion"
for diffusion centrality.
M
: Positive number indicating the maximum geodistance between two
nodes, above which the two nodes are considered disconnected. The
default is Inf
. The option is applicable to M-reach degree,
M-reach closeness, and fragmentation centralities.
T
: Integer indicating the maximum number of iterations in the
communication process. For diffusion centrality only. By default,
T
is the network size.
method
: Indication of which grouping criterion should be used.
"min"
indicates the “minimum” criterion and is suggested for
betweenness, closeness, fragmentation, and M-reach centralities.
"max"
indicates the “maximum” criterion and is suggested for
degree and eigenvector centralities. "add"
indicates the
“addition” criterion and is suggested for degree and eigenvector
centralities as an alternative of "max"
. "union"
indicates the
“union” criterion and is suggested for diffusion centrality. The
default is "min"
.
binary
: If TRUE
, the input matrix is binarized. If FALSE
, the
edge values are considered. The default is FALSE
.
cmode
: String indicating the type of centrality being evaluated.
The option is applicable to degree and M-reach centralities.
"outdegree"
, "indegree"
, and "total"
refer to indegree,
outdegree, and total degree, respectively. "all"
reports all the
above measures. The default is to report the total degree.
large
: Logical scalar. If TRUE
(the default), the method
implemented in igraph is used for computing geodistance and
related centrality measures; otherwise the method in sna is used.
geodist.precomp
: Geodistance precomputed for the network to be
analyzed (optional).
seed
: String indicating the seeding method or a vector of the
seeds specified by user. If "top"
, players with the highest
individual centrality are used as the seeds. If "random"
, seeds
are randomly sampled. The default is "top"
for efficiency.
parallel
: Logical scalar. IF TRUE
, the parallel computation is
activated. The default is FALSE
.
cluster
: Integer indicating the number of CPU cores to be used for
parallel computation.
round
: Integer indicating the “length” of search, namely, the
number of loops over the nodes in the candidate set.
iteration
: Integer indicating the “width” of search in each round,
namely, the number of loops over the nodes in the residual set.
The greedy algorithm converges fast, but sometimes can be trapped in a
local optimum. To avoid this problem, it is recommended to run kpset
several times with different seeds. To facilitate the search in large
networks, users can employ parallel computation by specifying
parallel = TRUE
in kpset
. During parallel computation, for each
cluster and each iteration the algorithm randomly picks a node from the
candidate set and the residual set, respectively, and swaps the two if
it improves the centrality score of the candidate set. It repeats this
process until exhausting the specified iterations and rounds and then
combines the results from the clusters. The following code shows how to
find two players who are the most central as a group in the simulated
network.
## In terms of indegree.
> kpset(W, size = 2, type = "degree", cmode = "indegree", method = "max")
$keyplayers
[1] 3 4
$centrality
[1] 7
## In terms of indegree in the binarized network.
> kpset(W, size = 2, type = "degree", cmode = "indegree", binary = TRUE,
+ method = "max")
$keyplayers
[1] 2 4
$centrality
[1] 3
## In terms of mreach.degree.
> kpset(W, size = 2, type = "mreach.degree", cmode = "indegree", M = 1,
+ binary = TRUE)
$keyplayers
[1] 2 4
$centrality
[1] 3
## In terms of mreach.closeness.
> kpset(A, size = 2, type = "mreach.closeness", cmode = "indegree", M = 1)
$keyplayers
[1] 3 4
$centrality
[1] 0.6944444
## In terms of indegree via parallel computation using 2 CPU cores.
> kpset(W, size = 2, type = "degree", cmode = "indegree", parallel = TRUE,
+ cluster = 2)
$keyplayers
[1] 3 4
$centrality
[1] 7
Below we use the friendship network of 21 managers in a high-tech company (Krackhardt 1987) to illustrate the methods. The network graph is shown in Figure 2. Each node represents one manager. Each tie indicates a friendship nomination from one manager to the other. The nodes are colored according to the four departments the managers belong to. The size of each node is proportional to its degree. As it can be seen, friendships occur predominately within departments.
We first examine the individual centrality of the managers. To make a probability matrix for calculating the diffusion centrality, we multiply the original adjacency matrix by 0.1. The results are presented in Table 1. To facilitate reading the results, we marked the top centrality scores in red. Apparently, the most central manager identified varies with the centrality measure used. In terms of indegree, managers 5 and 19 each receive six friend nominations and are the most central. However, in terms of outdegree, managers 1, 9, 11, and 12 are the most central. In terms of closeness centrality, manager 11 is the most central. In terms of betweenness centrality and eigenvector centrality, manager 12 is the most central, etc. Which centrality measure is suitable for selecting the most central player depends on the objectives. If the objective is to find a manager whose opinion is respected by most peers, then indegree can be a suitable measure. But if the objective is to spread the information most widely, then outdegree or closeness may be a better option.
Now suppose we want to find the three managers in this company who are
the most central as a group. Table 2 lists the results
according to different centrality measures. If indegree is the preferred
centrality measure, then managers 2, 12, and 19 form the most central
group. Together these three managers can connect to 14 other managers.
Note that the three managers with the highest individual centrality do
not constitute the most central group. The group indegree of managers 5,
19, and 1 (or 2 or 4) is no more than 10. Table 2 also shows
that the most central group varies by centrality measure employed.
Researchers are required to thoroughly thinkabout which centrality
measure they should use in their specific context to select key players.
In addition, sometimes there may be multiple sets of players which are
equally central as a group. In such cases, which set is to be used may
not make big difference in practice. But if examining these different
sets is of interest, it is recommended to run kpset
multiple times.
In this paper, we developed a comprehensive set of methods and tools for locating key players in social networks. In the future, the algorithms used may be improved by choosing seeds and swaps more strategically and by utilizing alternative optimization schemes such as simulated annealing.
The two authors contributed equally. Weihua An designed the study. Both authors contributed to writing the manuscript and developing the package. The authors thank Professor Bettina Grün and two anonymous reviewers for their helpful comments.
network, sna, igraph, statnet, RSiena, keyplayer, influenceR
Bayesian, GraphicalModels, Optimization
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
An & Liu, "keyplayer: An R Package for Locating Key Players in Social Networks", The R Journal, 2016
BibTeX citation
@article{RJ-2016-018, author = {An, Weihua and Liu, Yu-Hsin}, title = {keyplayer: An R Package for Locating Key Players in Social Networks}, journal = {The R Journal}, year = {2016}, note = {https://rjournal.github.io/}, volume = {8}, issue = {1}, issn = {2073-4859}, pages = {257-268} }