In clinical practice, it is very useful to select an optimal cutpoint in the scale of a continuous biomarker or diagnostic test for classifying individuals as healthy or diseased. Several methods for choosing optimal cutpoints have been presented in the literature, depending on the ultimate goal. One of these methods, the generalized symmetry point, recently introduced, generalizes the symmetry point by incorporating the misclassification costs. Two statistical approaches have been proposed in the literature for estimating this optimal cutpoint and its associated sensitivity and specificity measures, a parametric method based on the generalized pivotal quantity and a nonparametric method based on empirical likelihood. In this paper, we introduce GsymPoint, an R package that implements these methods in a user-friendly environment, allowing the end-user to calculate the generalized symmetry point depending on the levels of certain categorical covariates. The practical use of this package is illustrated using three real biomedical datasets.
The classification of cases and controls is a common task in several
fields. For example, it is conducted in the atmospheric sciences
(rainy/non rainy day), finance (good/not good payer), sociology
(good/not good citizen), industry (product of good/poor quality),
computing science (spam/non-spam) or health sciences (healthy/diseased),
among others. In this paper we focus on the last example, where we
are interested in classifying individuals as healthy or diseased
using a continuous biomarker or diagnostic test, based on a cutoff
point or discrimination
value
Considering
A key question in clinical practice is to find a cutpoint that “best”
discriminates between patients with and without the disease. However,
one cannot talk in absolute terms of a “best choice”. This is the reason
why several criteria for selecting optimal cutpoints have been proposed
in the literature depending on the ultimate goal of such selection (see Youden 1950; Feinstein 1975; Metz 1978; Schäfer 1989; Vermont et al. 1991; Greiner 1995; Pepe 2003 for example). One of the best-known methods is based on selecting the
cutpoint that provides the same value for the sensitivity and
specificity. This point is known as the equivalence or symmetry point
(Greiner 1995; Defreitas et al. 2004; Adlhoch et al. 2011). Graphically, it
corresponds to the operating point where the ROC curve intersects
the perpendicular to the positive diagonal line, that is, the line $y = 1 - x$.
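As a compact restatement, under notation that is assumed here rather than quoted from the text ($Se(c)$ and $Sp(c)$ for the sensitivity and specificity at cutpoint $c$, $x = 1 - Sp$ and $y = Se$ for the ROC coordinates, and $\rho = CFP/CFN$ for the ratio of misclassification costs), the symmetry point $c_S$ and its cost-based generalization $c_{GS}$ can be sketched as follows. The second identity is consistent with the estimates reported in the examples of this paper, where the costs ratio appears as rho.

```latex
\begin{align*}
Se(c_S) &= Sp(c_S)
  &&\Longleftrightarrow\quad y = 1 - x, \\
\rho\,\bigl(1 - Sp(c_{GS})\bigr) &= 1 - Se(c_{GS})
  &&\Longleftrightarrow\quad y = 1 - \rho x, \qquad \rho = \frac{CFP}{CFN}.
\end{align*}
```

Setting $\rho = 1$ (equal misclassification costs) recovers the symmetry point.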
Two statistical approaches have been recently introduced in the literature (López-Ratón et al. 2016) to obtain point estimates and confidence intervals for the generalized symmetry point and its associated sensitivity and specificity measures, a parametric method based on the Generalized Pivotal Quantity (GPQ) under the assumption of normality (Weerahandi 1993, 1995; Lai et al. 2012), and a nonparametric method based on the Empirical Likelihood (EL) methodology without any parametric assumptions (Thomas and Grunkemeier 1975; Molanes-López and Letón 2011).
The availability of user-friendly software for estimating optimal cutpoints is important for facilitating, mainly for biomedical staff, the selection of optimal cutpoints in clinical practice. There are several packages in R to carry out this task, such as PresenceAbsence (Freeman and Moisen 2008), DiagnosisMed (Brasil 2010), pROC (Robin et al. 2011) and OptimalCutpoints (López-Ratón et al. 2014; López-Ratón and Rodríguez-Álvarez 2014). However, these packages only consider the classical nonparametric empirical method for estimating optimal cutpoints and accuracy measures, that is, none of them takes into account recent methodology introduced in ROC analysis, such as the aforementioned GPQ and EL approaches (Molanes-López and Letón 2011; Lai et al. 2012).
In this paper we present GsymPoint, a package written in R for estimating the generalized symmetry point (López-Ratón et al. 2017), which is freely available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=GsymPoint. Specifically, this package enables end-users to obtain point estimates and confidence intervals for the generalized symmetry point and its associated sensitivity and specificity measures.
The rest of the paper is organized as follows. In Section 2, we briefly review the two methods included in our GsymPoint package for obtaining point estimates and confidence intervals for the generalized symmetry point and its sensitivity and specificity measures. In Section 3, we describe the general use of this package and its most important functions. In Section 4, we illustrate the practical application of the package using three real biomedical datasets on melanoma, prostate cancer, and coronary artery disease. Finally, in Section 5 we conclude with a discussion and some possible future extensions.
In this section we briefly explain the two methods included in the
GsymPoint package for estimating and constructing confidence intervals
for the generalized symmetry point and its associated sensitivity and specificity measures.
Generalized confidence intervals refer to a parametric methodology based on the normality assumption, first introduced by Weerahandi (1993, 1995) and recently applied in the context of diagnostic studies to the Youden index by Lai et al. (2012) and Zhou and Qin (2013), and to the generalized symmetry point by López-Ratón et al. (2016).
Assuming that the diagnostic test in healthy and diseased populations
Therefore, under the normality assumption, using if necessary a monotone
transformation of Box-Cox type to achieve normality, it follows from
Equations (1)–(2) that the generalized
symmetry point
For computing the GPQ-based confidence intervals of
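The idea behind the GPQ-based intervals can be illustrated with a minimal base R sketch. It assumes normal markers in both groups, higher marker values indicating disease, and the standard generalized pivotal quantities for a normal mean and variance; the helper names `gsym_normal()` and `gpq_interval()` and the simulated data are hypothetical, and this is not the code used inside GsymPoint.

```r
# Generalized symmetry point under a binormal model (higher marker values
# are assumed to indicate disease). It solves
#   rho * (1 - Sp(cutoff)) = 1 - Se(cutoff),  with rho = CFP / CFN,
# where Sp(cutoff) = pnorm(cutoff, mu0, sd0) and
#       Se(cutoff) = 1 - pnorm(cutoff, mu1, sd1).
gsym_normal <- function(mu0, sd0, mu1, sd1, rho = 1) {
  f <- function(cutoff) {
    rho * (1 - pnorm(cutoff, mu0, sd0)) - pnorm(cutoff, mu1, sd1)
  }
  lower <- min(mu0 - 10 * sd0, mu1 - 10 * sd1)  # f(lower) > 0
  upper <- max(mu0 + 10 * sd0, mu1 + 10 * sd1)  # f(upper) < 0
  uniroot(f, lower = lower, upper = upper, tol = 1e-10)$root
}

# Percentile interval from I draws of the generalized pivotal quantities
# of the normal means and standard deviations (hypothetical helper).
gpq_interval <- function(x0, x1, rho = 1, I = 2500, level = 0.95) {
  n0 <- length(x0)
  n1 <- length(x1)
  draws <- replicate(I, {
    s0 <- sqrt((n0 - 1) * var(x0) / rchisq(1, df = n0 - 1))
    s1 <- sqrt((n1 - 1) * var(x1) / rchisq(1, df = n1 - 1))
    m0 <- mean(x0) - rnorm(1) * s0 / sqrt(n0)
    m1 <- mean(x1) - rnorm(1) * s1 / sqrt(n1)
    gsym_normal(m0, s0, m1, s1, rho)
  })
  alpha <- 1 - level
  quantile(draws, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(3)
healthy  <- rnorm(50, mean = 0, sd = 1)  # simulated, not real data
diseased <- rnorm(40, mean = 2, sd = 1)
point <- gsym_normal(mean(healthy), sd(healthy),
                     mean(diseased), sd(diseased), rho = 0.5)
ci <- gpq_interval(healthy, diseased, rho = 0.5, I = 500)
print(point)
print(ci)
```

The root search is well defined because the function being solved decreases from a positive value to a negative one over the search interval whenever `rho` is positive.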
The empirical likelihood method was first introduced by Thomas and Grunkemeier (1975), who proposed the construction of EL-based confidence intervals for the Kaplan-Meier estimator. Nowadays, this methodology is an active area of research in several fields due to the good properties of EL-based confidence intervals and regions (see, for example, Molanes-López et al. 2009 among others). Moreover, this methodology is easy to implement and does not require any particular parametric assumption. In the recent literature, Molanes-López and Letón (2011) proposed a bootstrap-based EL approach to make inference on the Youden index and its associated optimal cutpoint, and López-Ratón et al. (2016) applied these same bootstrap-based EL ideas to estimate and construct confidence intervals for the generalized symmetry point and its corresponding specificity and sensitivity measures. The key point in both works is that the optimal cutpoint of interest can be seen as a specific quantile of each of the two populations involved.
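The quantile interpretation can be written out explicitly. The notation below is an assumption made for illustration: $F_0$ and $F_1$ denote the continuous distribution functions of the marker in the healthy and diseased populations, higher values indicate disease, and $\rho = CFP/CFN$.

```latex
\begin{align*}
Sp(c) &= F_0(c), \qquad Se(c) = 1 - F_1(c), \\
c_{GS} &= F_0^{-1}\bigl(Sp(c_{GS})\bigr)
        = F_1^{-1}\bigl(1 - Se(c_{GS})\bigr)
        = F_1^{-1}\bigl(\rho\,[1 - Sp(c_{GS})]\bigr).
\end{align*}
```

In words, the same cutpoint is the $Sp(c_{GS})$-quantile of the healthy population and the $\rho\,[1 - Sp(c_{GS})]$-quantile of the diseased population.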
As López-Ratón et al. (2016) mention,
In this section we present GsymPoint, a package written in R for estimating the generalized symmetry point (López-Ratón et al. 2017), which is freely available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=GsymPoint. This package enables end-users to obtain point estimates and confidence intervals for the generalized symmetry point and its associated sensitivity and specificity measures.
Function | Description |
---|---|
gsym.point() | Main function that computes the generalized symmetry point and its sensitivity and specificity measures with their corresponding confidence intervals. |
control.gsym.point() | Used to set several parameters that control the estimation process of the optimal cutpoint. |
print() | Print method for "gsym.point" class objects. |
summary() | Summary method for "gsym.point" class objects. |
plot() | Plot method for "gsym.point" class objects that shows in the same graph the empirical ROC curve and the line $y = 1 - \rho x$. |
gsym.point() function
The main function of the package is the gsym.point() function, which uses the selected method(s) (GPQ, EL or auto, where this last option automatically selects the most appropriate method based on the normality assumption) to obtain (parametric and/or nonparametric) confidence intervals and point estimates for the generalized symmetry point and its associated sensitivity and specificity measures, returning an object of class "gsym.point". The function is called as follows:
gsym.point(methods, data, marker, status, tag.healthy,
categorical.cov = NULL, control = control.gsym.point(),
CFN = 1, CFP = 1, confidence.level = 0.95,
trace = FALSE, seed = FALSE, value.seed = 3, verbose = TRUE)
The methods argument is a character vector to select the estimation method(s) to be used. The possible options are: "GPQ", "EL", c("GPQ","EL"), c("EL","GPQ") or "auto".
The data argument is the data frame containing all the needed variables: the diagnostic marker, the true disease status and, when necessary, the categorical covariate. The marker and status arguments are character strings with the names of the diagnostic test and of the variable that distinguishes healthy from diseased individuals, respectively. The value coding healthy individuals in the status variable is indicated in the tag.healthy argument.
The categorical.cov argument is a character string with the name of the categorical covariate according to which the generalized symmetry point is to be calculated. By default it is NULL, that is, no categorical covariate is considered in the analysis.
The control argument takes the output of the control.gsym.point() function, which controls the whole calculation process. This function is explained in detail in the following subsection.
The CFN and CFP arguments are the misclassification costs of false negative and false positive classifications, respectively. The default value is 1 for both, that is, no misclassification costs are taken into account by default.
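The confidence.level argument is the numerical value of the confidence level for the confidence intervals of the generalized symmetry point and its associated sensitivity and specificity measures. By default it is 0.95.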
The trace argument is a logical value that shows information on the calculation progress if TRUE. By default it is FALSE.
The seed argument is a logical value, such that if TRUE, a seed is fixed for generating the trials when computing the confidence intervals, allowing the reproducibility of the results at any other time. The default value is FALSE.
The value.seed argument is the numerical value for the fixed seed if seed is TRUE. By default it is equal to 3.
The verbose argument is a logical value that controls whether extra information on the progress of the run is shown. By default it is TRUE.
Some of these arguments, namely methods, data, marker, status and tag.healthy, are essential and must be specified in the call to the gsym.point() function; otherwise, an error is produced. The other arguments, categorical.cov, control, CFN, CFP, confidence.level, trace, seed, value.seed and verbose, are optional and, if they are not specified explicitly in the call, their default values are used.
control.gsym.point() function
It should be noted that there are some extra arguments, specific to each estimation method. We have included all of them in the control argument, which is a list of control values created by a call to the control.gsym.point() function and designed to replace the default values yielded by that function. The arguments of the control.gsym.point() function are presented in Table 2.
Argument | Description |
---|---|
B | The number of simulations in the empirical likelihood ("EL") method. The default value is 499, based on Carpenter and Bithell (2000). |
c_sampling | The constant needed for resampling in the empirical likelihood ("EL") method. The default value is 0.25. |
c_F | The constant needed for estimating the distribution in the empirical likelihood ("EL") method. The default value is 0.25. |
c_ELq | The constant needed for estimating the empirical likelihood function in the empirical likelihood ("EL") method. The default value is 0.25. |
c_R | The constant needed for estimating the ROC curve in the empirical likelihood ("EL") method. The default value is 0.25. |
I | The number of replicates in the generalized pivotal quantity ("GPQ") method. The default value is 2500. |
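As a brief illustration of how these control values might be overridden, the sketch below changes the two replicate counts and leaves the remaining constants at their defaults. It only uses argument names listed in Table 2, and it is guarded so that it does nothing when the GsymPoint package is not installed; the object name `ctrl` is arbitrary.

```r
# Guarded so that the snippet is a no-op when GsymPoint is not installed.
ctrl <- NULL
if (requireNamespace("GsymPoint", quietly = TRUE)) {
  # Override the number of simulations used by the EL method (B) and the
  # number of replicates used by the GPQ method (I); the remaining
  # constants keep the defaults listed in Table 2.
  ctrl <- GsymPoint::control.gsym.point(B = 999, I = 1500)
  str(ctrl)
}
```

The resulting object would then be passed to the main function through its control argument.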
summary.gsym.point() function
Numerical results are printed on the screen, and the output yielded by the summary.gsym.point() function or the summary() method always includes:
The matched call to the main function gsym.point().
The estimated value of the area under the ROC curve (AUC) based on the Mann-Whitney statistic.
Information related to the generalized symmetry point: the method(s) (EL and/or GPQ) used for estimating it, and the point estimates and confidence intervals for the generalized symmetry point and its associated sensitivity and specificity measures.
Apart from the matched call, which is presented only once at the beginning, all the other information is shown for each level of the categorical covariate if this is specified in the call, that is, if the categorical.cov argument in the gsym.point() function is not NULL.
The call to this function is as follows:
summary(object, ...)
where the object argument is a "gsym.point" class object as produced by the gsym.point() function and the ellipsis ... indicates further arguments passed to or from other methods. None are used in this method.
plot.gsym.point() function
The graphical output of the GsymPoint package is yielded by the plot.gsym.point() function or by the plot() method. This function plots the empirical ROC curve and the line $y = 1 - \rho x$. The call to this function is as follows:
plot(x, legend = TRUE, ...)
where the x argument is a "gsym.point" class object as produced by the gsym.point() function, the legend argument is a logical value for including the AUC value in the legend when it is TRUE (the default), and the ellipsis ... indicates further arguments passed to the plot.default() method.
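To make the content of such a plot concrete, the following base R sketch draws an empirical ROC curve together with the line y = 1 - rho * x, whose intersection with the curve corresponds to the generalized symmetry point. It does not call the package: `empirical_roc()` is a hypothetical helper, the data are simulated, and higher marker values are assumed to indicate disease.

```r
# Empirical ROC coordinates: a subject is classified as diseased when the
# marker is greater than or equal to the cutoff (assumed convention).
empirical_roc <- function(marker, status, tag.healthy) {
  healthy  <- marker[status == tag.healthy]
  diseased <- marker[status != tag.healthy]
  cuts <- sort(unique(marker), decreasing = TRUE)
  fpr <- sapply(cuts, function(cutoff) mean(healthy >= cutoff))
  tpr <- sapply(cuts, function(cutoff) mean(diseased >= cutoff))
  # Prepend the (0, 0) corner, reached with an infinite cutoff.
  data.frame(cutoff = c(Inf, cuts), fpr = c(0, fpr), tpr = c(0, tpr))
}

set.seed(1)
marker <- c(rnorm(40, mean = 0), rnorm(30, mean = 1.5))  # simulated data
status <- rep(c(0, 1), times = c(40, 30))
roc <- empirical_roc(marker, status, tag.healthy = 0)

rho <- 0.5  # costs ratio CFP / CFN, e.g. CFN = 2 and CFP = 1
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "1 - Specificity", ylab = "Sensitivity",
     main = "Empirical ROC curve")
abline(a = 1, b = -rho, lty = 2)  # the line y = 1 - rho * x
```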
This section illustrates the use of the R-based GsymPoint package by means of three real biomedical datasets on melanoma, prostate cancer, and coronary artery disease.
Dermatologists use a clinical scoring scheme without dermoscope (CSS) or a dermoscopic scoring scheme (DSS) to clinically evaluate lesions on the skin on the basis of several visible features such as asymmetry, border irregularity, colouration and size. We have considered a dataset on 72 patients with lesions suspected of being melanoma (Venkatraman and Begg 1996). Taking into account that 21 melanomas were detected through biopsies, our objective here is to evaluate the capacity of the CSS as a potential non-invasive diagnostic marker for discriminating between melanomas and non-melanomas, using the generalized symmetry point as the binary classification threshold. In the following, we illustrate the use of the GsymPoint package for estimating that optimal threshold value. The first step consists of attaching the GsymPoint package and the melanoma dataset as follows:
> library("GsymPoint")
> data("melanoma")
With the following instruction, we can get some basic summary statistics of the variables included in the melanoma dataset:
> summary(melanoma)
X group
Min. :-5.88100 Min. :0.0000
1st Qu.:-3.22100 1st Qu.:0.0000
Median :-1.69550 Median :0.0000
Mean :-1.55642 Mean :0.2917
3rd Qu.: 0.00675 3rd Qu.:1.0000
Max. : 3.03200 Max. :1.0000
To estimate the generalized symmetry point of the CSS marker, we need to call the gsym.point() function, for instance as specified below:
> melanoma.cutpoint1 <- gsym.point(methods = "GPQ", data = melanoma,
+ marker = "X", status = "group", tag.healthy = 0,
+ categorical.cov = NULL, CFN = 2, CFP = 1,
+ control = control.gsym.point(), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
In this call, we have considered that a false negative decision is 2 times more serious than a false positive decision, and so we have set CFN = 2 and CFP = 1 for the misclassification costs. Besides, we have considered the GPQ method of estimation (methods = "GPQ") because, according to the Shapiro-Wilk normality test, the CSS values can be assumed normally distributed in both melanoma and non-melanoma groups, and under this assumption the GPQ method is more suitable than the EL method.
The melanoma.cutpoint1 object produced by this call is a list that consists of the following components: "GPQ", "methods", "call", and "data", as can be checked below with the names() command:
> names(melanoma.cutpoint1)
[1] "GPQ" "methods" "call" "data"
The last three components, "methods", "call" and "data", are, respectively, a character vector with the value of the argument methods used in the call, the matched call, and the data frame used in the analysis. The first component, "GPQ", contains the results provided by the GPQ method regarding the estimation of the generalized symmetry point. Below, we detail these results:
> names(melanoma.cutpoint1$GPQ)
[1] "Global"
> names(melanoma.cutpoint1$GPQ$Global)
[1] "optimal.result" "AUC" "rho"
[4] "pvalue.healthy" "pvalue.diseased"
> melanoma.cutpoint1$GPQ
$Global
$Global$optimal.result
$Global$optimal.result$cutoff
Value ll ul
1 -1.213237 -1.792908 -0.6283236
$Global$optimal.result$Specificity
Value ll ul
1 0.75465 0.6249716 0.8485824
$Global$optimal.result$Sensitivity
Value ll ul
1 0.877325 0.8124858 0.9242912
$Global$AUC
[1] 0.9056956
$Global$rho
[1] 0.5
$Global$pvalue.healthy
[1] 0.4719117
$Global$pvalue.diseased
[1] 0.9084176
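The "optimal.result" component is a list with the point estimates and confidence intervals of the generalized symmetry point and its associated specificity and sensitivity measures, "AUC" is the numerical value of the area under the ROC curve, "rho" is the numerical value of the costs ratio $\rho = CFP/CFN$, "pvalue.healthy" is the p-value of the Shapiro-Wilk normality test in the healthy group, and "pvalue.diseased" is the p-value of the Shapiro-Wilk normality test in the diseased group.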
The end-user can directly access each of these components. For example, the following instruction only yields the value of the AUC:
> melanoma.cutpoint1$GPQ$Global$AUC
[1] 0.9056956
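The components listed above are tied together by the costs ratio. As a quick sanity check, written in base R and using only the GPQ estimates printed above for the melanoma data, the reported specificity and sensitivity satisfy rho * (1 - Sp) = 1 - Se with rho = CFP / CFN. The variable names are chosen here for illustration.

```r
# Misclassification costs used in the melanoma call.
CFN <- 2
CFP <- 1
rho <- CFP / CFN   # matches the reported $Global$rho of 0.5

# GPQ point estimates copied from the output above.
sp <- 0.75465      # Specificity
se <- 0.877325     # Sensitivity

lhs <- rho * (1 - sp)  # cost-weighted false positive rate
rhs <- 1 - se          # false negative rate
relation_holds <- isTRUE(all.equal(lhs, rhs, tolerance = 1e-6))
print(c(lhs = lhs, rhs = rhs))
print(relation_holds)
```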
A numerical summary of the results can be obtained by means of the print.gsym.point() or summary.gsym.point() functions, which can be abbreviated by the print() and summary() methods, respectively, as can be seen below:
> summary(melanoma.cutpoint1)
*************************************************
OPTIMAL CUTOFF: GENERALIZED SYMMETRY POINT
*************************************************
Call:
gsym.point(methods = "GPQ", data = melanoma, marker = "X", status = "group",
tag.healthy = 0, categorical.cov = NULL, CFN = 2, CFP = 1,
control = control.gsym.point(), confidence.level = 0.95,
trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.4719 0.9084
Area under the ROC curve (AUC): 0.906
METHOD: GPQ
Estimate 95% CI lower limit 95% CI upper limit
cutoff -1.213237 -1.7929079 -0.6283236
Specificity 0.754650 0.6249716 0.8485824
Sensitivity 0.877325 0.8124858 0.9242912
As seed = TRUE in the previous call, the user can replicate the output by simply running the same call again.
In this case, as can be seen above, the output provided by the summary.gsym.point() function shows:
An informative message indicating that the marker can be considered normally distributed in both groups, according to the Shapiro-Wilk normality test.
The Shapiro-Wilk test p-values for the marker in both groups.
The AUC value and information related to the generalized symmetry point, that is, the point estimates and the GPQ-based confidence intervals, at the level set in the confidence.level argument of the main gsym.point() function.
So far, we have considered the GPQ method, which, under normality assumptions, turns out to be more appropriate than the EL method. As can be seen below, if we now replace the GPQ method by the EL method in the call, the program first shows an informative message on the normality assumption and the greater suitability of the GPQ method.
> melanoma.cutpoint2 <- gsym.point(methods = "EL", data = melanoma,
+ marker = "X", status = "group", tag.healthy = 0,
+ categorical.cov = NULL, CFN = 2, CFP = 1,
+ control = control.gsym.point(), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.4719 0.9084
By means of the summary() function, we show below the results obtained with the EL method. In this case, after reproducing the call used to create the melanoma.cutpoint2 object, the same informative message is shown regarding the greater suitability of the GPQ method for this dataset.
> summary(melanoma.cutpoint2)
*************************************************
OPTIMAL CUTOFF: GENERALIZED SYMMETRY POINT
*************************************************
Call:
gsym.point(methods = "EL", data = melanoma, marker = "X", status = "group",
tag.healthy = 0, categorical.cov = NULL, CFN = 2, CFP = 1,
control = control.gsym.point(), confidence.level = 0.95,
trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.4719 0.9084
Area under the ROC curve (AUC): 0.906
METHOD: EL
Estimate 95% CI lower limit 95% CI upper limit
cutoff -1.2382325 -1.8403497 -0.4565671
Specificity 0.7901833 0.6326184 0.8973174
Sensitivity 0.8950916 0.8163092 0.9486587
Since both estimation methods are adequate for this dataset, we could specify them simultaneously in the call to the gsym.point() function as follows.
> melanoma.cutpoint3 <- gsym.point(methods = c("EL","GPQ"),
+ data = melanoma, marker = "X", status = "group",
+ tag.healthy = 0, categorical.cov = NULL, CFN = 2, CFP = 1,
+ control = control.gsym.point(), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Therefore, although the results of both methods will be shown,
the GPQ method would be more suitable for this dataset.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.4719 0.9084
With the option methods = "auto", the program selects in this case the GPQ method, since the normality assumption is satisfied for this dataset.
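The kind of decision made by such an automatic option can be sketched in base R with the Shapiro-Wilk test. This is only an illustration under stated assumptions: the 0.05 significance level and the helper name `normality_based_method()` are hypothetical, the Box-Cox step that the package output also reports is omitted, and the rule actually implemented in GsymPoint may differ.

```r
# Pick "GPQ" when normality is not rejected in either group, "EL" otherwise.
# The significance level alpha = 0.05 is an assumption of this sketch.
normality_based_method <- function(marker, status, tag.healthy, alpha = 0.05) {
  p_healthy  <- shapiro.test(marker[status == tag.healthy])$p.value
  p_diseased <- shapiro.test(marker[status != tag.healthy])$p.value
  if (min(p_healthy, p_diseased) > alpha) "GPQ" else "EL"
}

# Deterministic toy markers: exact normal and exponential quantiles.
normal_like <- qnorm(ppoints(50))
skewed      <- qexp(ppoints(50))
status      <- rep(c(0, 1), each = 50)

choice_normal <- normality_based_method(c(normal_like, normal_like + 2),
                                        status, tag.healthy = 0)
choice_skewed <- normality_based_method(c(skewed, normal_like + 2),
                                        status, tag.healthy = 0)
print(c(choice_normal, choice_skewed))
```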
The graphical output of any of the three objects previously created can be obtained by means of the plot.gsym.point() function, which can be abbreviated by the plot() method. For instance, the call below
> plot(melanoma.cutpoint1)
shows the plot of the empirical Receiver Operating Characteristic (ROC)
curve, the line
We have considered here the dataset on prostate cancer analyzed by
Le (2006). In order to design an appropriate treatment strategy for a
patient with prostate cancer, it is important to know first whether
cancer has spread or not to the neighboring lymph nodes. Although a
laparoscopic surgery could confirm the true status of the patient, it is
interesting to find a non-invasive diagnostic method to predict whether
nodal involvement is present or not. In this dataset, 20 patients (of
the total of 55 patients) had nodal involvement, and the level of acid
phosphatase in blood serum (APBS) (
As shown below, after loading the GsymPoint package and the prostate cancer dataset, we use the gsym.point() function to estimate the generalized symmetry point of the APBS marker and its associated specificity and sensitivity indexes.
> library("GsymPoint")
> data("prostate")
> prostate.cutpoint1 <- gsym.point(methods = "GPQ", data = prostate,
+ marker = "marker", status = "status", tag.healthy = 0,
+ categorical.cov = NULL, CFN = 10, CFP = 1,
+ control = control.gsym.point(I = 1500), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
Since cancer is a very serious disease which can cause death, a false negative is regarded here as 10 times more serious than a false positive (CFN = 10, CFP = 1). Since the Shapiro-Wilk normality test indicated that both groups could be assumed normally distributed (after a Box-Cox transformation of the data), the GPQ method is more suitable than the EL method in this case. For the sake of illustration, we have changed the default value of the number of replicates associated with the GPQ method by setting control = control.gsym.point(I = 1500) in the call to the gsym.point() function. The default value for this parameter is I = 2500.
Below we show the numerical results obtained by means of the summary.gsym.point() function, which can be abbreviated by the summary() method:
> summary(prostate.cutpoint1)
*************************************************
OPTIMAL CUTOFF: GENERALIZED SYMMETRY POINT
*************************************************
Call:
gsym.point(methods = "GPQ", data = prostate, marker = "marker",
status = "status", tag.healthy = 0, categorical.cov = NULL,
CFN = 10, CFP = 1, control = control.gsym.point(I = 1500),
confidence.level = 0.95, trace = FALSE, seed = TRUE, value.seed = 3,
verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can not
be considered normally distributed in both groups.
However, after transforming the marker using the Box-Cox
transformation estimate, the Shapiro-Wilk normality test
indicates that the transformed marker can be considered
normally distributed in both groups.
Box-Cox lambda estimate = -1.2494
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.0000 0.0232
Box-Cox transformed marker 0.3641 0.2118
Area under the ROC curve (AUC): 0.725
METHOD: GPQ
Estimate 95% CI lower limit 95% CI upper limit
cutoff 51.9522523 46.8013315 57.3009307
Specificity 0.3233012 0.1420636 0.5191686
Sensitivity 0.9323301 0.9142064 0.9519169
In this case, the numerical output obtained with the summary.gsym.point() function shows:
An informative message indicating that the original data cannot be assumed normally distributed in both groups, but the Box-Cox transformed data can be considered normally distributed in both groups, according to the Shapiro-Wilk normality test.
The estimated value of the Box-Cox power lambda.
The estimated value of the area under the ROC curve (AUC).
Information corresponding to the generalized symmetry point: the point estimates and the GPQ-based confidence intervals, at the level set in the confidence.level argument of the main gsym.point() function.
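The role of the Box-Cox power can be illustrated with a small base R sketch of the standard Box-Cox transformation and its inverse. Because the transformation is monotone increasing for positive marker values, a cutpoint found on the transformed scale maps back to a unique cutpoint on the original scale. The helper names `box_cox()` and `inv_box_cox()` are hypothetical, the value of lambda is the estimate printed above, and whether the package follows exactly this parametrization is an assumption.

```r
# Standard Box-Cox transformation for positive data.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transformation, mapping values back to the original scale.
inv_box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

lambda <- -1.2494                       # estimate reported in the summary
x <- c(40, 46, 51.95225, 99, 136)       # marker values on the original scale
y <- box_cox(x, lambda)                 # transformed scale
x_back <- inv_box_cox(y, lambda)        # round trip to the original scale

print(y)
print(x_back)
```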
As can be checked with the command below, the prostate.cutpoint1 object yields a list with the following components: "GPQ", "methods", "call", and "data".
> names(prostate.cutpoint1)
[1] "GPQ" "methods" "call" "data"
The first component, "GPQ", contains the results associated with the GPQ method.
> names(prostate.cutpoint1$GPQ)
[1] "Global"
> names(prostate.cutpoint1$GPQ$Global)
[1] "optimal.result" "AUC"
[3] "rho" "lambda"
[5] "normality.transformed" "pvalue.healthy"
[7] "pvalue.diseased" "pvalue.healthy.transformed"
[9] "pvalue.diseased.transformed"
The "methods" component is a character vector with the value of the argument methods used in the call, "call" is the matched call, and "data" is the data frame used in the analysis:
> prostate.cutpoint1$methods
[1] "GPQ"
> prostate.cutpoint1$call
gsym.point(methods = "GPQ", data = prostate, marker = "marker",
status = "status", tag.healthy = 0, categorical.cov = NULL,
CFN = 10, CFP = 1, control = control.gsym.point(I = 1500),
confidence.level = 0.95, trace = FALSE,
seed = TRUE, value.seed = 3, verbose = TRUE)
> prostate.cutpoint1$data
marker status
1 40 0
2 40 0
3 46 0
[...]
51 99 1
52 126 1
53 136 1
We list below the elements of the first component, "GPQ", of the prostate.cutpoint1 object and show the corresponding R commands that allow end-users to directly access these elements:
"optimal.result" is a list with the generalized symmetry point and its sensitivity and specificity accuracy measures with the corresponding confidence intervals.
> prostate.cutpoint1$GPQ$Global$optimal.result
$cutoff
Value ll ul
1 51.95225 46.80133 57.30093
$Specificity
Value ll ul
1 0.3233012 0.1420636 0.5191686
$Sensitivity
Value ll ul
1 0.9323301 0.9142064 0.9519169
"AUC"
is the numerical value of the area under the ROC curve.
> prostate.cutpoint1$GPQ$Global$AUC
[1] 0.725
"rho"
is the numerical value of the costs ratio $\rho = CFP/CFN$.
> prostate.cutpoint1$GPQ$Global$rho
[1] 0.1
"lambda"
is the numerical value of the power used in the Box-Cox transformation of the GPQ method.
> prostate.cutpoint1$GPQ$Global$lambda
[1] -1.249428
"normality.transformed"
is a character string indicating whether the marker values transformed by the Box-Cox transformation are normally distributed ("yes") or not ("no").
> prostate.cutpoint1$GPQ$Global$normality.transformed
[1] "yes"
"pvalue.healthy"
is the numerical value of the p-value of the Shapiro-Wilk normality test for the original marker in the healthy group.
> prostate.cutpoint1$GPQ$Global$pvalue.healthy
[1] 3.276498e-07
"pvalue.diseased"
is the numerical value of the p-value of the Shapiro-Wilk normality test for the original marker in the diseased group.
> prostate.cutpoint1$GPQ$Global$pvalue.diseased
[1] 0.02323895
"pvalue.healthy.transformed"
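is the numerical value of the p-value of the Shapiro-Wilk normality test for the Box-Cox transformed marker in the healthy group.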
> prostate.cutpoint1$GPQ$Global$pvalue.healthy.transformed
[1] 0.3640662
"pvalue.diseased.transformed"
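is the numerical value of the p-value of the Shapiro-Wilk normality test for the Box-Cox transformed marker in the diseased group.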
> prostate.cutpoint1$GPQ$Global$pvalue.diseased.transformed
[1] 0.2118137
Similarly to the previous example, if we now consider the EL method instead of the GPQ method for estimating the generalized symmetry point and its accuracy measures, an informative message is shown by the package, advising the user that the GPQ method is more suitable for this dataset due to the fact that the Box-Cox transformed marker can be assumed normally distributed in both groups.
> prostate.cutpoint2 <- gsym.point(methods = "EL", data = prostate,
+ marker = "marker", status = "status", tag.healthy = 0,
+ categorical.cov = NULL, CFN = 10, CFP = 1,
+ control = control.gsym.point(B = 999), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can not
be considered normally distributed in both groups.
However, after transforming the marker using the Box-Cox
transformation estimate, the Shapiro-Wilk normality test
indicates that the transformed marker can be considered
normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Box-Cox lambda estimate = -1.2494
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.0000 0.0232
Box-Cox transformed marker 0.3641 0.2118
For the sake of illustration, we have set control = control.gsym.point(B = 999) in the previous call to change the default value of the number of bootstrap replicates B required in the empirical likelihood method. This parameter is B = 499 by default.
The results obtained with the EL method are the following.
> summary(prostate.cutpoint2)
*************************************************
OPTIMAL CUTOFF: GENERALIZED SYMMETRY POINT
*************************************************
Call:
gsym.point(methods = "EL", data = prostate, marker = "marker",
status = "status", tag.healthy = 0, categorical.cov = NULL,
CFN = 10, CFP = 1, control = control.gsym.point(B = 999),
confidence.level = 0.95, trace = FALSE, seed = TRUE, value.seed = 3,
verbose = TRUE)
According to the Shapiro-Wilk normality test, the marker can not
be considered normally distributed in both groups.
However, after transforming the marker using the Box-Cox
transformation estimate, the Shapiro-Wilk normality test
indicates that the transformed marker can be considered
normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Box-Cox lambda estimate = -1.2494
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.0000 0.0232
Box-Cox transformed marker 0.3641 0.2118
Area under the ROC curve (AUC): 0.725
METHOD: EL
Estimate 95% CI lower limit 95% CI upper limit
cutoff 49.2249839 45.39058266 58.7032623
Specificity 0.2451690 0.09891113 0.5269153
Sensitivity 0.9245169 0.90989111 0.9526915
Figure 2 shows the graphical output corresponding to either of the two "gsym.point" class objects previously created, prostate.cutpoint1 and prostate.cutpoint2, as generated by means of the plot() method. For instance, the code below produces Figure 2.
> plot(prostate.cutpoint1)
We now consider a study conducted on 141 patients (96 with coronary
lesions and 45 with non-stenotic coronaries) admitted to the cardiology
department of a teaching hospital in Galicia (Northwest Spain) for the
evaluation of chest pain or cardiovascular disease, in which leukocyte
elastase determination was investigated as a potential clinical marker
for the diagnosis of coronary artery disease (CAD) (Amaro et al. 1995).
Since the gender of each patient (female or male) is available in this
example, we illustrate the use of the GsymPoint package on these data
taking the covariate gender into account, that is, the generalized
symmetry point is computed for each gender in the scale of the elastase
concentration to diagnose CAD. From here on, we refer to this dataset as
elastase.
First of all, we need to load the GsymPoint package and the
corresponding elastase dataset:
> library("GsymPoint")
> data("elastase")
Now, to compute the generalized symmetry point in the elastase scale
taking into account the categorical covariate gender, we simply have to
include categorical.cov = "gender" in the call below.
> elastase.gender.cutpoint1 <- gsym.point(methods = "GPQ",
+ data = elastase, marker = "elas", status = "status",
+ tag.healthy = 0, categorical.cov = "gender", CFN = 10, CFP = 1,
+ control = control.gsym.point(), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
In this case we are interested in achieving a high sensitivity, that is,
a low number of false negatives. Therefore, the same misclassification
costs as in the previous prostate cancer dataset were considered,
CFN = 10 and CFP = 1, that is, a false negative result is regarded as 10
times more serious than a false positive one. Since, according to the
Shapiro-Wilk normality test, the elastase concentration in both females
and males follows the Box-Cox family in the CAD and non-CAD groups, the
GPQ method of estimation is more adequate than the EL method in this
case.
The elastase.gender.cutpoint1 object produced by the previous call is a
list with the following components:
> names(elastase.gender.cutpoint1)
[1] "GPQ" "methods" "levels.cat" "call"
[5] "data"
As with the previous datasets, the "methods" component is a character
vector with the value of the methods argument used in the call, "call"
is the matched call and "data" is the data frame used in the analysis.
However, there is now an extra component, "levels.cat", a character
vector indicating the levels of the categorical covariate ("Female" and
"Male" in this case). In addition, the first component, "GPQ", which
contains the results associated with the GPQ method, is itself a
two-component list with elements "Female" and "Male", as can be seen
below.
> names(elastase.gender.cutpoint1$GPQ)
[1] "Female" "Male"
For each level of the categorical covariate, you get a list with nine
components. For instance, for the subgroup of males,
elastase.gender.cutpoint1$GPQ$Male contains the following components:
> names(elastase.gender.cutpoint1$GPQ$Male)
[1] "optimal.result" "AUC"
[3] "rho" "lambda"
[5] "normality.transformed" "pvalue.healthy"
[7] "pvalue.diseased" "pvalue.healthy.transformed"
[9] "pvalue.diseased.transformed"
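The "optimal.result" component of elastase.gender.cutpoint1$GPQ$Male is
a list with the point estimates and the lower (ll) and upper (ul)
confidence limits of the generalized symmetry point (cutoff) and of its
associated specificity and sensitivity: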
> elastase.gender.cutpoint1$GPQ$Male$optimal.result
$cutoff
Value ll ul
1 20.72776 18.08961 23.49228
$Specificity
Value ll ul
1 0.2739826 0.1345484 0.4326794
$Sensitivity
Value ll ul
1 0.9273983 0.9134548 0.9432679
The "AUC" component of elastase.gender.cutpoint1$GPQ$Male is the
numerical value of the area under the ROC curve for the subgroup of
males.
> elastase.gender.cutpoint1$GPQ$Male$AUC
[1] 0.7216855
The "rho" component of elastase.gender.cutpoint1$GPQ$Male is the
numerical value of the ratio of misclassification costs, CFP/CFN (here
1/10):
> elastase.gender.cutpoint1$GPQ$Male$rho
[1] 0.1
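The value 0.1 agrees with CFP/CFN = 1/10 from the call. As a quick plausibility check, the sketch below recomputes the ratio and verifies a relation that the reported estimates appear to satisfy, namely rho * (1 - Sp) = 1 - Se. That relation is our reading of the printed numbers and of the cost-weighted definition of the generalized symmetry point, not code taken from the package; the estimates are copied by hand from the output for males.

```r
# Misclassification costs used in the gsym.point() call above.
CFN <- 10
CFP <- 1
rho <- CFP / CFN

# Point estimates for males, copied from the optimal.result output.
sp <- 0.2739826  # specificity
se <- 0.9273983  # sensitivity

# Assumed balancing relation: rho * (1 - Sp) = 1 - Se.
lhs <- rho * (1 - sp)
rhs <- 1 - se
gap <- abs(lhs - rhs)

print(c(rho = rho, lhs = lhs, rhs = rhs, gap = gap))
```

The remaining gap is of the order of the rounding in the printed estimates, which is what one would expect if the relation holds at the estimated cutoff.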
The "lambda" component of elastase.gender.cutpoint1$GPQ$Male is the
estimated numerical value of the power in the Box-Cox transformation
obtained when considering only the subgroup of males.
> elastase.gender.cutpoint1$GPQ$Male$lambda
[1] -0.04277911
The "normality.transformed" component of
elastase.gender.cutpoint1$GPQ$Male is a character string indicating
whether the Box-Cox transformed marker values in the subgroup of males
are normally distributed ("yes") or not ("no").
> elastase.gender.cutpoint1$GPQ$Male$normality.transformed
[1] "yes"
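The "pvalue.healthy" component of elastase.gender.cutpoint1$GPQ$Male is
the p-value of the Shapiro-Wilk normality test for the original marker
in the healthy group (Group 0) of males: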
> elastase.gender.cutpoint1$GPQ$Male$pvalue.healthy
[1] 0.5866506
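The "pvalue.diseased" component of elastase.gender.cutpoint1$GPQ$Male is
the p-value of the Shapiro-Wilk normality test for the original marker
in the diseased group (Group 1) of males: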
> elastase.gender.cutpoint1$GPQ$Male$pvalue.diseased
[1] 5.44323e-09
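Similarly, the "pvalue.healthy.transformed" and
"pvalue.diseased.transformed" components are the Shapiro-Wilk p-values
for the Box-Cox transformed marker in the healthy and diseased groups,
respectively: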
> elastase.gender.cutpoint1$GPQ$Male$pvalue.healthy.transformed
[1] 0.06656483
> elastase.gender.cutpoint1$GPQ$Male$pvalue.diseased.transformed
[1] 0.2147409
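To make the role of these four p-values concrete, here is a minimal base R sketch of the same kind of check on a single group: a Shapiro-Wilk test before and after a Box-Cox transformation. Everything in it is illustrative. The helper box_cox(), the toy marker, the fixed lambda = 0 and the 0.05 threshold are assumptions for the example, not the package's internal code (the package estimates lambda from the data and tests both groups).

```r
# Box-Cox power transformation; lambda = 0 corresponds to the logarithm.
# Hypothetical helper written for this example only.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Deterministic, right-skewed toy marker (not the elastase data):
# exponentiated standard normal quantiles, i.e. log-normal shaped values.
n <- 50
marker <- exp(qnorm(ppoints(n)))

# Shapiro-Wilk p-values before and after transforming with lambda = 0.
p_original    <- shapiro.test(marker)$p.value
p_transformed <- shapiro.test(box_cox(marker, lambda = 0))$p.value

# Assumed 0.05 threshold for the "yes"/"no" decision.
normality_transformed <- if (p_transformed > 0.05) "yes" else "no"

print(c(original = p_original, transformed = p_transformed))
print(normality_transformed)
```

In this toy case normality is rejected for the original values but not for the transformed ones, which mirrors the pattern reported above for the diseased group of males.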
A summary of the numerical results is shown below. In this case, the
results are reported for each level of the categorical covariate gender,
that is, for females and males. As in the previous examples, the output
reports the AUC value followed by the information related to the
generalized symmetry point (point estimates and confidence intervals).
The confidence level of these intervals is set through the
confidence.level argument of the gsym.point() function (0.95 in the call
above).
> summary(elastase.gender.cutpoint1)
*************************************************
OPTIMAL CUTOFF: GENERALIZED SYMMETRY POINT
*************************************************
Call:
gsym.point(methods = "GPQ", data = elastase, marker = "elas",
status = "status", tag.healthy = 0, categorical.cov = "gender",
CFN = 10, CFP = 1, control = control.gsym.point(), confidence.level = 0.95,
trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
*************************************************
Female
*************************************************
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.0837 0.9077
Area under the ROC curve (AUC): 0.818
METHOD: GPQ
Estimate 95% CI lower limit 95% CI upper limit
cutoff 25.0929510 12.3370641 34.0526540
Specificity 0.4246091 0.1460251 0.6618634
Sensitivity 0.9424609 0.9146025 0.9661863
*************************************************
Male
*************************************************
According to the Shapiro-Wilk normality test, the marker can not
be considered normally distributed in both groups.
However, after transforming the marker using the Box-Cox
transformation estimate, the Shapiro-Wilk normality test
indicates that the transformed marker can be considered
normally distributed in both groups.
Box-Cox lambda estimate = -0.0428
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.5867 0.0000
Box-Cox transformed marker 0.0666 0.2147
Area under the ROC curve (AUC): 0.722
METHOD: GPQ
Estimate 95% CI lower limit 95% CI upper limit
cutoff 20.7277556 18.0896126 23.4922813
Specificity 0.2739826 0.1345484 0.4326794
Sensitivity 0.9273983 0.9134548 0.9432679
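When results are available for several levels, it can be convenient to collect them in a single table. The sketch below shows one way to do this in base R. To keep it self-contained it works on a hand-built stand-in list, gpq, that mimics the nested structure described above (level, then optimal.result, then cutoff/Specificity/Sensitivity) with numbers copied from the summary; make_row() is a hypothetical helper. With the real object, gpq would presumably be replaced by elastase.gender.cutpoint1$GPQ.

```r
# Hand-built stand-in for the nested results; values copied from the
# summary output above, not recomputed.
make_row <- function(value, ll, ul) data.frame(Value = value, ll = ll, ul = ul)
gpq <- list(
  Female = list(optimal.result = list(
    cutoff      = make_row(25.0929510, 12.3370641, 34.0526540),
    Specificity = make_row(0.4246091, 0.1460251, 0.6618634),
    Sensitivity = make_row(0.9424609, 0.9146025, 0.9661863))),
  Male = list(optimal.result = list(
    cutoff      = make_row(20.7277556, 18.0896126, 23.4922813),
    Specificity = make_row(0.2739826, 0.1345484, 0.4326794),
    Sensitivity = make_row(0.9273983, 0.9134548, 0.9432679)))
)

# Stack every measure of every level into one long table.
rows <- lapply(names(gpq), function(level) {
  res <- gpq[[level]]$optimal.result
  out <- do.call(rbind, res)  # one row per measure
  data.frame(gender = level, measure = names(res), out, row.names = NULL)
})
results <- do.call(rbind, rows)
print(results)
```

A long table of this kind makes it easy to compare, for example, the estimated cutoffs for females and males side by side.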
If we consider now the EL method in the call to the main gsym.point()
function, the GsymPoint package will show an informative message
indicating that the GPQ method would be more adequate in this case
because for the two levels of the categorical covariate gender, either
the original marker or the Box-Cox transformed marker in both diseased
and healthy populations can be assumed normally distributed according to
the Shapiro-Wilk normality test:
> elastase.gender.cutpoint2 <- gsym.point (methods = "EL", data = elastase,
+ marker = "elas", status = "status", tag.healthy = 0,
+ categorical.cov = "gender", CFN = 10, CFP = 1,
+ control = control.gsym.point(), confidence.level = 0.95,
+ trace = FALSE, seed = TRUE, value.seed = 3, verbose = TRUE)
Female :
According to the Shapiro-Wilk normality test, the marker can be
considered normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.0837 0.9077
Male :
According to the Shapiro-Wilk normality test, the marker can not
be considered normally distributed in both groups.
However, after transforming the marker using the Box-Cox
transformation estimate, the Shapiro-Wilk normality test
indicates that the transformed marker can be considered
normally distributed in both groups.
Therefore the GPQ method would be more suitable for this dataset.
Box-Cox lambda estimate = -0.0428
Shapiro-Wilk test p-values
Group 0 Group 1
Original marker 0.5867 0.0000
Box-Cox transformed marker 0.0666 0.2147
The graphical results can be obtained by means of the plot.gsym.point()
function or simply the plot() method. For instance, the following
command
> plot(elastase.gender.cutpoint1)
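yields the graphical output shown in Figure 3, where the empirical ROC
curve of the elastase concentration is represented separately for
females and males, together with the line y = 1 - ρx, whose intersection
with the ROC curve gives the operating point associated with the
generalized symmetry point.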
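The geometry of such a plot can be sketched in base R on simulated data. The example below builds an empirical ROC curve and locates the empirical operating point closest to the line y = 1 - rho * x, with rho = CFP/CFN. Note the assumptions: the marker is simulated (not the elastase data), larger marker values are taken to indicate disease, and the cutoff is found with a plain empirical search rather than the GPQ or EL methods implemented in the package.

```r
set.seed(3)  # hypothetical simulated marker, not the elastase data
healthy  <- rnorm(200, mean = 0, sd = 1)
diseased <- rnorm(200, mean = 1.5, sd = 1)
rho <- 1 / 10  # CFP / CFN, as in the examples above

# Empirical ROC: classify as diseased when marker >= cutoff.
cutoffs <- sort(unique(c(healthy, diseased)))
fpr <- sapply(cutoffs, function(cc) mean(healthy >= cc))   # 1 - specificity
tpr <- sapply(cutoffs, function(cc) mean(diseased >= cc))  # sensitivity

# Empirical operating point closest (vertically) to y = 1 - rho * x.
best <- which.min(abs(tpr - (1 - rho * fpr)))
print(c(cutoff = cutoffs[best],
        specificity = 1 - fpr[best],
        sensitivity = tpr[best]))

# Draw the ROC curve, the line and the selected point when interactive.
if (interactive()) {
  plot(fpr, tpr, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
  abline(a = 1, b = -rho, lty = 2)
  points(fpr[best], tpr[best], pch = 19)
}
```

Because rho is small, the line stays close to the top of the unit square, so the selected point has high sensitivity at the expense of specificity, as in the numerical results above.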
The selection of optimal cutpoints in the scale of a quantitative diagnostic test or biomarker that can help in the diagnosis of a disease is an important issue in the clinical sciences. Depending on the main objective of such selection, several optimality criteria have been proposed in the literature. One of the most popular in clinical practice selects the cutoff that provides the same sensitivity and specificity, known in the literature as the equivalence or symmetry point (Greiner 1995; Defreitas et al. 2004; Adlhoch et al. 2011). However, this cutpoint is not appropriate in scenarios where misclassifying a diseased patient is not as severe as misclassifying a healthy patient, which is normally the case in practice. For instance, in cancer diagnosis, a false negative decision is in general more serious than a false positive one. Hence, when selecting an optimal cutpoint it is very important to take into account the costs of the different incorrect diagnostic decisions. For this reason, the generalized symmetry point, a generalization of the symmetry point that incorporates the misclassification costs, is well suited to practice (López-Ratón et al. 2016). Although there are several R packages that implement different criteria for selecting the optimal cutoff point, such as PresenceAbsence (Freeman and Moisen 2008), DiagnosisMed (Brasil 2010), pROC (Robin et al. 2011) and OptimalCutpoints (López-Ratón et al. 2014; López-Ratón and Rodríguez-Álvarez 2014), to our knowledge none of them includes the criterion based on the generalized symmetry point or recent estimation approaches such as the GPQ and EL methods, which provide more efficient estimates than the empirical ones (Molanes-López and Letón 2011; Lai et al. 2012).
To prevent the use of the generalized symmetry point in clinical practice from being limited by the lack of software implementing the computations needed to estimate it, we have developed GsymPoint, a user-friendly R package that fills this gap. As illustrated in this paper, the GsymPoint package allows the generalized symmetry point to be estimated straightforwardly for each level of a categorical covariate that represents an individual characteristic, such as gender, age or disease severity, and that may influence the discrimination capacity of the diagnostic test (Pepe 2003). Several extensions of the GsymPoint package could be considered: for instance, applying the same methodology to estimate other accuracy measures such as predictive values or diagnostic likelihood ratios; incorporating more efficient methods for estimating the generalized symmetry point and its accuracy measures; and extending the methodology to more complex scenarios where, for instance, the diagnostic test is subject to measurement error, there is partial disease verification (see Alonzo 2014 and references therein), the disease status evolves over time and the disease onset time is subject to censoring (Rota et al. 2015), or continuous covariates are available that may affect the diagnostic capacity of the biomarker and should be taken into consideration.
This research has been supported by several Grants from the Spanish Ministry of Science and Innovation. M. López-Ratón and C. Cadarso-Suárez acknowledge support to MTM2011-15849-E, MTM2011-28285-C02-00, MTM2014-52975-C2-1-R and MTM2015-69068-REDT. E.M. Molanes-López acknowledges support to MTM2011-28285-C02-02, ECO2011-25706, MTM2011-15849-E and MTM2015-69068-REDT. E. Letón acknowledges support to MTM2011-15849-E, MTM2011-28285-C02-02, PI13/02446 and MTM2015-69068-REDT.
GsymPoint, PresenceAbsence, DiagnosisMed, pROC, OptimalCutpoints
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
López-Ratón, et al., "GsymPoint: An R Package to Estimate the Generalized Symmetry Point, an Optimal Cut-off Point for Binary Classification in Continuous Diagnostic Tests", The R Journal, 2017
BibTeX citation
@article{RJ-2017-015, author = {López-Ratón, Mónica and Molanes-López, Elisa M. and Letón, Emilio and Cadarso-Suárez, Carmen}, title = {GsymPoint: An R Package to Estimate the Generalized Symmetry Point, an Optimal Cut-off Point for Binary Classification in Continuous Diagnostic Tests}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {262-283} }