Multivariate reference regions (MVR) represent the extension of the reference interval concept to the multivariate setting. A reference interval is defined by two threshold points between which a high percentage of healthy subjects’ results, usually 95%, are contained. Analogously, an MVR characterizes the values of several diagnostic tests most frequently found among non-diseased subjects by defining a convex hull containing 95% of the results. MVRs have great applicability when working with diseases that are diagnosed via more than one continuous test, e.g., diabetes or hypothyroidism. The present work introduces refreg, an R package for estimating conditional MVRs. The reference region is non-parametrically estimated using a multivariate kernel density estimator, and its shape allowed to change under the influence of covariates. The effects of covariates on the multivariate variable means, and on their variance-covariance matrix, are estimated by flexible additive predictors. Continuous covariate non-linear effects can be estimated by penalized spline smoothers. The package allows the user to propose, for instance, an age-specific diagnostic rule based on the joint distribution of two non-Gaussian, continuous test results. The usefulness of the refreg package in clinical practice is illustrated with a real case in diabetes research, with an age-specific reference region proposed for the joint interpretation of two glycemia markers (fasting plasma glucose and glycated hemoglobin). To show that the refreg package can also be used in other, and indeed very different fields, an example is provided for the joint prediction of two atmospheric pollutants (SO
In clinical practice, many medical decisions are based on continuous diagnostic tests (Hallworth 2011) – i.e., tests that provide results along a continuous, quantitative scale. The interpretation of such continuous tests requires the comparison of the obtained value with a pre-defined reference interval, so that a result could be classified as positive or negative (ie, disease present or absent) based on this comparator value. A reference interval is an interval containing most healthy subjects’ results. For a single test they are usually estimated from the 2.5 and 97.5 empirical percentiles of the distribution for the healthy population; thus, 95% of healthy patients are located within the interval limits (Wright and Royston 1999). Those patients falling outside the reference interval, are likely to have an undiagnosed disease. If the test results are influenced by some patient characteristics independent of the disease (e.g., age and gender), reference intervals for specific patient groups must be obtained. These covariate-dependent reference intervals, usually termed reference curves, are estimated using quantile regression (Koenker and Bassett Jr 1978) or location-scale models (Cole and Green 1992; Stasinopoulos et al. 2017). Several R packages for estimating reference intervals and reference curves already exist, including the R package referenceIntervals, which comprises a collection of tools, the R package gamlss (Stasinopoulos et al. 2007), which provides a general tool for deriving reference curves in clinical practice (WHO 2006), and software RefCurv (Winkler et al. 2019), recently proposed to facilitate clinicians’ use of gamlss. However, all these packages were produced to provide reference intervals for single tests; they cannot address diseases for which diagnosis and control are based on multiple tests.
When the results of several tests are available for the same patient, obtaining separate reference intervals for each one provides an incomplete picture of disease status, particularly when these results are strongly correlated (Boyd 2004). Although each reference interval would leave only 5% of healthy patients out, their combined use can result in a higher percentage of false positives. Moreover, a patient falling within each univariate reference interval might, in fact, show an abnormal multivariate result. Thus, a multivariate reference region (MVRs) would provide a better means of interpreting the results of multiple tests (Winkel and Lyngbye 1972). MVRs are a straightforward extension of the univariate reference interval to the multidimensional setting, i.e., a region that contains 95% of the healthy patients’ results. However, despite being proposed more than 40 years ago, MVRs are rarely used in clinical practice. This might be explained by the multivariate Gaussian assumption of MVRs, which is quite restrictive when interpreting diagnostic test results. Further, the multivariate distribution of test results is usually affected by patient characteristics, independent of their health status. For example, (Espasandı́n-Domı́nguez et al. 2019) showed that the correlation between two diagnostic tests for diabetes was influenced by patient age and red blood cell turnover, independent of glycemia status. A conditional MVR is therefore desirable, but the statistical literature is not rich in such proposals (see, e.g., (Wei 2008)).
Very few statistical software routines have been proposed for estimating
the region containing a specific percentage of multivariate data points.
The function mvtol.region
in the R package
tolerance
(Young 2010) obtains a region containing a high percentage of a
bivariate Gaussian distribution in the context of quality control
studies, and non-parametric probabilistic regions can be obtained using
the R packages r2d2
(Magnusson and Burgos 2014), hdrcde (Hyndman 2018)
and distfree.cr
(Hu and Yang 2018). However, these all suffer the major limitation of not being
able to estimate the effects of covariates on the region’s shape. The R
package modQR can estimate
conditional multivariate quantiles, but the quantile level
The present paper introduces refreg, an implementation in R of a new statistical methodology for estimating bivariate reference regions able to classify subjects as having normal or abnormal values based on the results of two continuous diagnostic tests. The main advantages of the presented method are; i) the absence of parametric restrictions for describing bivariate distributions for continuous tests, and ii) the possibility of estimating the effects of covariates on the shape of the reference region via flexible additive predictors. To illustrate this statistical methodology, and how to use the package, an age-specific reference region for two diabetes diagnostic tests was estimated. The estimated reference region offers new insights into the diagnosis and prognosis of diabetes, enabling physicians to identify different patients’ profiles. The proposed method can, however, be applied to any disease in which two continuous diagnostic tests are available, and can even be used in non-medical fields. Indeed, an application is discussed in which the conditional region is used in the joint forecasting of the concentrations of two air pollutants. Moreover, the current implementation can be easily extended to three dimensions.
The statistical model that enables conditional reference regions to be determined is presented in the next section. The main functions contained in the refreg package are then described, with a brief introduction to the main functions. The use of the package is shown in analyses of real medical and environmental data problems. The paper closes with some comments, and some notes on future research directions.
In this section the main features of our statistical method is presented. In a nutshell, our proposal is based on a bivariate location scale model, where the response means, and their variance-covariance matrix, are related to covariates using flexible additive predictors. The probabilistic region covering a specific percentage of the data is firstly estimated using the model standardized residuals. Then, it is generalized for each covariate value based on the aforementioned bivariate location-scale model fit. This statistical model was already presented and evaluated in (Roca-Pardiñas et al. 2020).
Given a bivariate continuous random variable of interest
where
We consider additive structures for the mean functions
Based on the model presented in equation ((1)), for a given
In this section, we present the estimation procedure of the conditioned
bivariate uncertainty region given in equation ((3)). Our
approach is based on the estimation of the covariate effects on the
response means using an additive model, and then on the
variance-covariance matrix using the squared residuals of the former
models. Finally, the bivariate region
The steps of the proposed estimation algorithm are the following:
Step 1: For
Finally, for a given
The continuous covariates smooth effects (
Step 1. From the sample data
Step 2. For
The limits for the
The estimation of the unconditionally probabilistic region for the
bivariate errors
The selection of
The refreg package contains a set of functions for estimating a conditional reference, or uncertainty, region. Its working framework was designed so that people without a strong statistical background can use it. Indeed, only two functions need to be taken into account by the user: 1) the effects of the predictor variables on responses need to be estimated using the bivRegr function, a step that requires the user choose which variables may influence the region; 2) bivRegion needs to be applied to a bivRegr object so that the reference region can be estimated.
The bivRegr()
function has the following structure:
bivRegr(f = formulas, data = data)
The f
argument contains a list of five R formulae corresponding to the
additive predictors for the means, variances and correlation models
shown in equation (1). Since bivRegr()
uses mgcv::gam()
internally,
the user can estimate covariate linear and non-linear effects using
s()
operator. For instance:
mu1 <- y1 ~ s(x1)
mu2 <- y2 ~ s(x1)
var1 <- ~ x2
var2 <- ~ x2
rho <- ~ s(x3)
formula = list(mu1,mu2,var1,var2,rho)
assumes a smooth effect of x1
on the response means, a parametric
effect of x2
on their variances, and a smooth effect of x3
on the
response correlation.
The bivRegion()
function is designed for non-parametrically estimating
a bivariate reference region:
bivRegion(object, tau = 0.95, bandwidth = "plug-in")
The object may be a set of bivariate data points, or a bivRegr
object,
while tau defines the desired coverage(s) for the reference region,
which might be a single value or a vector. Finally, “bandwidth”
specifies the kernel bandwidth selection method. The user can chose
between the plug-in, cross-validation, or the best coverage method (see
equation (13)).
Additionally, we defined S3 methods for these two main functions.
Specifically, associated with bivRegr
we have
predict.bivRegr
and plot.bivRegr
: to predict and depict additive
models results for responses’ means, variances, and their
correlation.summary_boot.bivRegr
: a function implementing the bootstrap
inference for flexible additive models (see (9)). This
function results can be depicted by applying plot.summary_boot
.On the other hand, we defined the following S3 methods associated to
bivRegion
:
summary.bivRegion
: this function evaluates the region performance
on the healthy patients’ sample.predict.bivRegion
and plot.bivRegion
: these functions offer a
prediction or a plot of conditional regions for a new dataset. If
the argument cond=FALSE
it evaluates the response values in the
standardized scale.In addition, we define the functions trivRegr
, trivRegion
and
plot.trivRegion
as an extension of the aforementioned method for a
trivariate response variable. Finally, our package also contains some
inner functions as ace
(for estimating variance, and correlation
models), Hcv
(it implements equation (11) method), and
refcurve
(it implements an univariate location-scale model).
This section outlines the implemented functions of the proposed package in detail, and illustrates their use with real datasets. The first illustration is related to diabetes research, in which a reference region for the joint interpretation of two glycemia tests is calculated. In the second illustration, refreg methodology is used to predict the concentrations of two air pollutants during a pollution episode. Finally, the extension of the method to higher dimensions is shown using real data.
Diabetes is a chronic disease, the diagnosis of which is based on two glycemia tests: the fasting plasma glucose (FPG) and glycated hemoglobin (HbA1c)(American Diabetes Association 2019) tests. The multivariate interpretation of FPG and HbA1c results is desirable for two reasons: i) the results of both tests are correlated in healthy patients (Aleyassine et al. 1980), ii) a miss-match between them may be indicative of a poorer prognosis (Kim et al. 2018). Finally, it is well known that both test results are influenced by patient age (Davidson 1979; Pani et al. 2008).
The age-dependent reference region for the FPG and HbA1c tests was estimated using a sample of healthy subjects derived from the A-Estrada Glycation and Inflammation Study (AEGIS) (see (Gude et al. 2017)). A subset of this dataset is available in the package under the name “AEGIS”.
This dataset comprised 1516 subjects and 7 variables:
Applying the summary()
routine to the aegis
dataset indicated 55% of
the subjects to be female, the mean age of all 1516 subjects to be 52
years (range 18-91), and that 187 subjects (12%) had been previously
diagnosed with diabetes.
R> summary(aegis)
id gender age dm fpg
Min. : 1.0 female:838 Min. :18.00 no :1329 Min. : 63.00
1st Qu.: 379.8 male :678 1st Qu.:39.00 yes: 187 1st Qu.: 82.00
Median : 758.5 Median :52.00 Median : 89.00
Mean : 758.5 Mean :52.58 Mean : 94.51
3rd Qu.:1137.2 3rd Qu.:67.00 3rd Qu.:100.00
Max. :1516.0 Max. :91.00 Max. :274.00
hba1c fru
Min. : 3.900 Min. :119.0
1st Qu.: 5.200 1st Qu.:225.0
Median : 5.400 Median :254.0
Mean : 5.608 Mean :262.2
3rd Qu.: 5.700 3rd Qu.:284.0
Max. :10.200 Max. :700.0
To estimate the reference region, a subset of the patients not
previously diagnosed with diabetes was define as dm_no
. This subset
sample was deemed the healthy patient sample.
R> dm_no = subset(aegis,aegis$dm == "no")
R> dm_yes = subset(aegis,aegis$dm == "yes")
To estimate the effect of age on the final region shape, the bivRegr()
function was used. This function implements the estimation process of
the bivariate location-scale:
R> mu1 = fpg ~ s(age)
R> mu2 = hba1c ~ s(age)
R> var1 = ~ s(age)
R> var2 = ~ s(age)
R> rho = ~ s(age)
R> formula = list(mu1,mu2,var1,var2,rho)
The first and second formulae define the additive models for the mean
values of both glycemia tests. The third and fourth define the additive
models for test result variability. The last formula represents the
additive model that comprises the effect of age on the correlation
between the results of both glycemia tests. In addition to the model
formulae list, a dataset including both the test results and subject’s
age must be supplied to bivRegr()
as:
R> fit = bivRegr(formula,data=dm_no)
summary_boot
, the shaded
area is the 95% pointwise confidence interval obtained by bootstrap
resampling. The parameters of our bivariate response change with
age.By applying the S3 method plot()
to a bivRegr
object, the estimated
effects of covariates can be shown for each submodel. The argument eq=
controls the model component to be represented (1 = FPG mean, 2 = HbA1c
mean, 3 = FPG variance, 4 = HbA1c variance, and 5 = [FPG – HbA1c]
correlation). Moreover, the function summary_boot()
may be applied to
a bivRegr
object to obtain the 95% pointwise confidence interval for
the estimated effects via bootstrapping:
R> fit_boot = summary_boot(fit, B=250, parallel = TRUE )
R> plot(fit_boot,eq=1)
R> plot(fit_boot,eq=2)
R> plot(fit_boot,eq=3)
R> plot(fit_boot,eq=4)
R> plot(fit_boot,eq=5)
Since bootstrap resampling (introduced in equation ((9))) is
time consuming, the user can fix parallel = TRUE
and run a
parallelized computation. The parallel backend is registered using
doParallel
(Microsoft and Weston 2020a), and the parallel computation is performed by
foreach (Microsoft and Weston 2020b).
Figure 2 shows that the mean values of both glycemia markers increase almost linearly with age. FPG variance increases from 20 to 40 years, while the HbA1c variance increases linearly with age. Finally, the correlation between the FPG and HbA1c concentration seems to be stronger for older patients.
Applying the function bivRegion()
to a bivRegr
object provides a
bivariate region containing 100H_choice
argument. Here, the 90%, 95% and 97.5% regions are obtained with the
best coverage bandwidth selector (see equation (10)):
R> region = bivRegion(fit,tau=c(0.90,0.95,0.975),H_choice = "Hcov")
This region facilitates a multivariate interpretation of the glycemia
test results. A patient whose results are “normal”, for his/her age
would see them fall inside this reference region, while a subject with
“abnormal” results for his/her age would not. This interpretation is
possible because the model residuals are centered around zero, show unit
variance, no correlation, and they are independent of age. The user can
check test results located outside the reference region using the
bivRegion
S3 method plot
:
R> par(mfrow = c(1, 2))
R> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, \%",cond=T, newdata =
data.frame (age = c(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")
R> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata =
data.frame(age = c(20,60)),tau=c(0.50,0.95), reg.lwd=2, pch="*", col="grey")
Figure 3 shows the unconditional reference region for
plot()
function
argument ‘newdata’ allows the glycemia test values to be observed in the
standardized residuals scale of the dataset for the patients with
diabetes. As is clearly seen, most of healthy patients’ results are
located inside the reference region, while those recorded for diabetic
patients are located outside.
A major advantage of this representation is that it allows clinicians
new insights into the subject’s glycemia status. Indeed, those patients
located outside the reference region may be classified into four groups:
(I) individuals with high values for both tests (first quadrant); (II)
those with discordant results, with high HbA1c concentrations and
low/medium FPG (second quadrant); (III) individuals with low values for
both tests (third quadrant); and (IV) individuals with low/medium HbA1c
concentrations and high FPG values (fourth quadrant). This distinction
might be useful for physicians. For instance, discordant results are
probably due to an altered bloodstream protein glycation rate, a
condition associated with a poorer prognosis. Patients located outside
the standardized region may be also checked applying summary()
to a
bivRegion
object:
R> summary(region,tau = 0.95)
This R output presents patients located outside the standardized
bivariate region for
plot.bivRegion
function setting the arguments cond=TRUE
and
add=FALSE
for several ages, and bottomrow ones with cond=TRUE
and
add=TRUE
in a pre-existing plot. The estimated regions change with age
and it describes the shape of the observed data
points.The use of this region in combination with the results of the bivariate
location-scale model allow the conditional reference regions to be
obtained. The user can visualize these regions using the bivRegion
S3
method plot()
, setting cond = TRUE
as follows:
R> plot(region, xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
c(20,30,40,50,60,70)),tau=0.95,reg.lwd=2, pch="*",col="grey")
R> plot(region,xlab = "FPG, mg/dL", ylab = "HbA1c, %",cond=T, newdata = data.frame(age =
c(20,60)),tau=c(0.50,0.95),reg.lwd=2, pch="*",col="grey")
In addition, the region may be represented in a pre-existing plot if the
plot
function argument add
is equal to TRUE
as in the following
code:
R> plot(dm_no[dm_no$age==40,"fpg"],dm_no[dm_no$age==40,"hba1c"],main="40 years",
xlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
R> plot(region,cond=T,newdata = data.frame(age = 40),add=T,legend=F, tau=c(0.50,0.95),
reg.lty=c(1,2))
R> plot(dm_no[dm_no$age==60,"fpg"],dm_no[dm_no$age==60,"hba1c"],main="60 years",
xlim=c(50,140), ylim=c(4.2,7.5),xlab = "FPG, mg/dL", ylab = "HbA1c, %", pch=20,cex=2)
R> plot(region,cond=T,newdata = data.frame(age = 60),add=T,legend=F, tau=c(0.50,0.95),
reg.lty=c(1,2))
In Figure 4, the bivariate reference region is shown for several ages. Note that the regions shift towards the upper right corner and expand as age increase. This agrees with the non-linear effect of age on the expected means and variability of both markers. The conditional region coverage and the performance of the methodology have already been assessed (Lado-Baleato et al. 2021).
This section provides an example of how the method proposed in equation
(3) might be extended to more than two dimensions. This section is
intended to provide a proof of concept rather than a formal statistical
proposal. For a trivariate variable (
Following equation ((12)), a trivariate reference region may
be estimated as:
Using this model, the application of the methodology for diabetes diagnosis can be extended by incorporating an additional glycemia test. This extension is justified since other glycated proteins are routinely monitored in diabetes control besides FPG and HbA1c. For instance, in conditions that determine alterations in hemoglobin metabolism (e.g., anemia or kidney disease), fructosamine (Fr) is frequently used as an additional marker. Nevertheless, the translation of Fr into average glucose levels is not as clear as for HbA1c, and discordances are often encountered between the Fr and HbA1c results. In addition, agreement among these glycemia markers can be affected by factors such as patient age.
Figure 5 show the FPG, HbA1c and Fr results for the AEGIS sample.
As can be seen, the values recorded for these tests correlate with one
another, showing a complex multivariate distribution. Moreover, it can
be appreciated how their multivariate distribution changes with age. To
estimate a trivariate reference region for these markers, taking into
account patient age, the trivRegr()
function was used. This function
is an extension of bivRegr()
to the trivariate setting. The usage of
both functions is similar, but additional additive predictors must be
defined for the means vector and variance-covariance matrix.
R> dm_no = subset(aegis,aegis$dm == "no")
R> mu1 = fpg ~ s(age)
R> mu2 = hba1c ~ s(age)
R> mu3 = fru ~ s(age)
R> var1 = ~ s(age)
R> var2 = ~ s(age)
R> var3 = ~ s(age)
R> rho12 = ~ s(age)
R> rho13 = ~ s(age)
R> rho23 = ~ s(age)
R> formula = list(mu1,mu2,mu3,var1,var2,var3,rho12,rho13,rho23)
R> fit = trivRegr(formula,data=dm\_no)
As in the bivariate estimation, the trivRegion
function is applied to
a trivRegr
object. Here, the method was implemented only for a single
plot()
method implemented for the trivRegion
object can be used
to interactively check the trivariate standardized and conditional
reference regions:
R> region = trivRegion(fit,tau=0.95)
R> plot(region,planes = F,size=5, col="red", incol = "grey",xlab="FPG, mg/dl",
ylab="HbA1c, \%", zlab="Fru, mg/dL")
R> plot(region,planes = T,size=5,col="red",incol = "grey",xlab="FPG, mg/dl",
ylab="HbA1c, \%", zlab="Fru, mg/dL")
R> plot(region,cond=T,newdata=data.frame(age=c(20,70)), xlab="FPG, mg/dl",
ylab="HbA1c, \%", zlab="Fru, mg/dL", legend=T)
In Figure 6 the trivariate standardized and conditional
reference regions are shown for different angles. The model residuals
are centered around zero, with variance zero, and zero linear
correlation. The region contains the 94.96% of the patients. In the
trivariate setting, a patient may be located outside the region for
different reasons. Indeed, if plot()
argument panels=T
, eight
reasons exist for why a patient is located outside the reference region
(explaining each situation goes beyond the scope of the present work).
The conditional reference region may be produced for different ages by
setting the cond = T
, and providing new age values in newdata
.
This section illustrates how
refreg methodology can be
used in fields other than laboratory medicine. It is here shown how an
uncertainty region useful for the joint forecasting of the concentration
of two air pollutants (bivRegr
and bivRegion
functions. The following example estimates
which joint
Given the historical concentrations of both pollutants (contained in the
pollution
dataset), we aimed to predict the pollution_episode
dataset) 30 minutes in advance. As can be seen in
the following R output, both datasets have a similar structure, SO2 and
Nox are the current concentrations of both pollutants, while the
remaining columns represent their concentrations in the previous 30, 45,
and 60 minutes.
R > head(pollution[,1:9])
Date So2 Nox So2_0 Nox_0 So2_1 Nox_1 So2_2 Nox_2
316 2003-02-07 16:15:00 38.38 2.38 73.50 3.21 76.79 3.38 81.92 3.54
1865 2003-05-07 03:10:00 3.00 3.67 3.00 3.33 3.00 3.33 3.00 3.33
1383 2003-04-04 01:35:00 256.50 11.17 293.71 8.96 294.54 8.33 285.29 7.71
3262 2003-07-09 14:35:00 225.29 17.04 104.67 7.67 84.33 6.29 64.67 4.67
1191 2003-03-30 13:55:00 42.33 4.12 80.21 4.96 84.38 5.00 88.33 5.00
3065 2003-07-04 11:50:00 145.83 8.58 99.12 5.71 83.83 5.25 70.92 4.83
R > head(pollution_episode[,1:9])
Date So2 Nox So2_0 Nox_0 So2_1 Nox_1 So2_2 Nox_2
1 2003-05-09 00:00:00 3.08 4.12 3.08 4.50 3.08 4.46 3.08 4.38
2 2003-05-09 00:05:00 3.08 4.12 3.08 4.46 3.08 4.50 3.08 4.46
3 2003-05-09 00:10:00 3.08 4.12 3.08 4.38 3.08 4.46 3.08 4.50
4 2003-05-09 00:15:00 3.08 4.12 3.08 4.29 3.08 4.38 3.08 4.46
5 2003-05-09 00:20:00 3.08 4.12 3.08 4.21 3.08 4.29 3.08 4.38
6 2003-05-09 00:25:00 3.08 4.12 3.08 4.12 3.08 4.21 3.08 4.29
Figure 7 shows the course of the pollution episode under
prediction. In the left plot the
R> mu1 = Nox~s(Nox_0)+s(Nox_2)
R> mu2 = So2~s(So2_0)+s(So2_2)
R> var1 = ~s(Nox_0)+s(Nox_2)
R> var2 = ~s(So2_0)+s(So2_2)
R> rho = ~s(Nox_0)+s(So2_0)
R> f = list(mu1,mu2,var1,var2,rho)
R> fit = bivRegr(f,data=pollution)
R> region = bivRegion(fit,tau=c(0.950,0.975),shape=10)
In the previous code example, the reference region was estimated for
pollution_episode
dataset. Note,
that one dataset is used to fit the model, and another when making
predictions. The observed
par(mfrow=c(3,3))
for(k in c(150, 160, 165, 175, 180, 185, 190, 200,210)){
plot(pollution_episode[,3:2],type="l",lty=2,ylim=c(0,600),xlim=c(0,45),
main=pollution_episode[k,1])
points(pollution_episode[k,3:2],col="black",pch=19)
plot(region, cond = T, newdata = pollution_episode[k,], add = T,
tau=c(0.95, 0.975),legend=F)
}
This paper discusses the R implementation of a newly developed method for estimating conditional reference regions. The method was originally designed for bivariate responses to provide a joint interpretation of two glycemia markers. However, as shown with real data, the proposed package can be used in other fields, and its extension to three dimensions is feasible. The proposed package is useful in the definition of conditional reference regions for continuous diagnostic tests. Few MVRs applications are used in practice, yet they have been shown clinically valuable in the treatment of patients with cancer (Mattsson et al. 2008), cardiovascular disease (Selmeryd et al. 2018) and endocrine problems (Hoermann et al. 2016). Given the simplicity with which refreg can be used, and its having no parametric restrictions, it is hoped it might enhance the use of MVRs.
The definition of a region characterizing the central part of a multivariate distribution may be of great interest in other fields. For instance, in quality control studies, multivariate control charts are commonly used when two or more attributes of a product, or process, require evaluation (Fuchs and Kenett 1998). Analogously to medical MVRs, this multivariate analysis is currently performed assuming a Gaussian distribution. Thus, the refreg R package may help provide for better quality surveillance of medical conditions, environmental problems and even industrial processes that are monitored by the measurement of more than one variable.
Future versions of our package will look forward to extending our model for continuous responses of dimension higher than three. Such reference region would require a kernel density estimator for high dimension applications (Nagler 2021). To the best of our knowledge, the estimated region is not easy to visualize for scales larger than three. Although, we might identify and visualize the multivariate values located inside or outside the region using parallel coordinate plots, or the interactive visualization methods as the implemented in the tourr package (Wickham et al. 2011).
Óscar Lado-Baleato is funded by a pre-doctoral grant (ED481A-2018) from the Galician Government (Plan I2C)-Xunta de Galicia. This study was also supported by grants from the Carlos III Institute of Health (Instituto de Salud Carlos III-ISCIII/PI20/01069/Co-funded by European Union), the Network for Research on Chronicity, Primary Care, and Health Promotion (Instituto de Salud Carlos III-ISCIII/RD21/0016/0022/Co-funded by European Union), and the Galician Innovation Agency-Competitive Benchmark Groups (GAIN-GRC/IN607A/2021/02/Xunta de Galicia). This article was developed under the project MTM2017-83513-R, cofinanced by the Ministry of Economy and Competitiveness (SPAIN) and by the European Regional Development Fund (FEDER). The work was also supported by the project ED431C 2020/20, approved within the framework of the Competitive Research Unit Consolidation Programme (2020), Xunta de Galicia. Work supported by the Grant PID2020-118101GB-I00, Ministerio de Ciencia e Innovación (MCIN/ AEI /10.13039/501100011033).
referenceIntervals, gamlss, tolerance, r2d2, hdrcde, distfree.cr, modQR, refreg, doParallel, foreach, mbend, tourr
Distributions, Econometrics, HighPerformanceComputing, MixedModels, NumericalMathematics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lado-Baleato, et al., "Refreg: An R Package for Estimating Conditional Reference Regions", The R Journal, 2022
BibTeX citation
@article{RJ-2022-029, author = {Lado-Baleato, Óscar and Roca-Pardiñas, Javier and Cadarso-Suárez, Carmen and Gude, Francisco}, title = {Refreg: An R Package for Estimating Conditional Reference Regions}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {2}, issn = {2073-4859}, pages = {135-151} }