Examining distributions of variables is the first step in the analysis of a clinical trial before more specific modelling can begin. Reporting these results to stakeholders of the trial is an essential part of a statistician’s work. The atable package facilitates these steps by offering easy-to-use but still flexible functions.
Reporting the results of clinical trials is such a frequent task that guidelines have been established that recommend certain properties of clinical trial reports; see Moher et al. (2010). In particular, Item 17a of CONSORT states that “Trial results are often more clearly displayed in a table rather than in the text”. Item 15 of CONSORT suggests “a table showing baseline demographic and clinical characteristics for each group”.
The atable package facilitates this recurring task of data analysis by providing a short approach from data to publishable tables. The atable package satisfies the requirements of CONSORT statements Item 15 and 17a by calculating and displaying the statistics proposed therein, i.e. mean, standard deviation, frequencies, p-values from hypothesis tests, test statistics, effect sizes and confidence intervals thereof. Only minimal post-processing of the table is needed, which supports reproducibility. The atable package is intended to be modifiable: it can apply arbitrary descriptive statistics and hypothesis tests to the data. For this purpose, atable builds on R’s S3-object system.
R already has many functions that perform single steps of the analysis process (and they perform these steps well). Some of these functions are wrapped by atable in a single function to narrow the possibilities for end users who are not highly skilled in statistics and programming. Additionally, users who are skilled in programming will appreciate atable because they can delegate this repetitive task to a single function and then concentrate their efforts on more specific analyses of the data at hand.
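The atable package supports the analysis and reporting of randomised parallel group clinical trials. Data from clinical trials can be stored in data frames with rows representing ‘patients’ and columns representing ‘measurements’ for these patients or characteristics of the trial design, such as location or time point of measurement. These data frames will generally have hundreds of rows and dozens of columns. The columns have different purposes:
- group columns denote the randomised intervention of the trial,
- split columns hold stratification variables, such as sex, age group or time point of measurement,
- target columns hold the measurements to be compared between the groups.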
The task is to compare the target columns between the groups, separately for every split column. This is often the first step of a clinical trial analysis to obtain an impression of the distribution of data. The atable package completes this task by applying descriptive statistics and hypothesis tests and arranges the results in a table that is ready for printing.
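As a rough illustration of this task, the following base R sketch computes one descriptive statistic of a target column per group, separately for each level of a split column. The data frame `trial`, its column names and its values are made up for illustration; atable automates this step and adds hypothesis tests and formatting.

```r
# Toy trial data: one row per patient and visit (all values are invented).
trial <- data.frame(
  trt   = factor(rep(c("placebo", "drug"), each = 4)),   # group column
  visit = factor(rep(c("Month 1", "Month 3"), times = 4)), # split column
  score = c(3, 4, 2, 3, 4, 5, 3, 5)                       # target column
)

# Mean of the target column per group, separately for every split level.
# Rows of the result are split levels, columns are groups.
means <- tapply(trial$score, list(trial$visit, trial$trt), mean)
print(means)
```

Doing this by hand for every target column, statistic and test quickly becomes repetitive, which is the motivation for a single wrapper function.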
Additionally, atable can produce tables of blank data.frames with arbitrary fill-ins (e.g. X.xx) as placeholders for proposals or report templates.
To exemplify the usage of atable, we use the dataset arthritis of multgee (Touloumis 2015). This dataset contains observations of the self-assessment score of arthritis, an ordered variable with five categories, collected at baseline and three follow-up times during a randomised comparative study of alternative treatments of 302 patients with rheumatoid arthritis.
library(atable)
library(multgee)
data(arthritis)
# All columns of arthritis are numeric. Set more appropriate classes:
arthritis = within(arthritis, {
  score = ordered(y)
  baselinescore = ordered(baseline)
  time = paste0("Month ", time)
  sex = factor(sex, levels = c(1,2), labels = c("female", "male"))
  trt = factor(trt, levels = c(1,2), labels = c("placebo", "drug"))})
First, create a table that contains demographic and clinical characteristics for each group. The target variables are sex, age and baselinescore; the variable trt acts as the grouping variable:
the_table <- atable::atable(subset(arthritis, time == "Month 1"),
                            target_cols = c("age", "sex", "baselinescore"),
                            group_col = "trt")
Now print the table. Several functions that create a LaTeX representation (Mittelbach et al. 2004) of the table exist: latex of Hmisc (Harrell Jr et al. 2018), kable of knitr (Xie 2018) or xtable of xtable (Dahl et al. 2018). latex is used for this document.
Table 1 reports the number of observations per group. The distribution of the numeric variable age is described by its mean and standard deviation, and the distributions of the categorical variable sex and the ordered variable baselinescore are presented as percentages and counts. Additionally, missing values are counted per variable. Descriptive statistics, hypothesis tests and effect sizes are automatically chosen according to the class of the target column; see table 3 for details. Because the data are from a randomised study, hypothesis tests comparing baseline variables between the treatment groups are omitted.
| Group | placebo | drug |
|---|---|---|
| Observations | | |
| | 149 | 153 |
| age | | |
| Mean (SD) | 51 (11) | 50 (11) |
| valid (missing) | 149 (0) | 153 (0) |
| sex | | |
| female | 29% (43) | 26% (40) |
| male | 71% (106) | 74% (113) |
| missing | 0% (0) | 0% (0) |
| baselinescore | | |
| | 7.4% (11) | 7.8% (12) |
| | 23% (35) | 25% (38) |
| | 47% (70) | 45% (69) |
| | 19% (28) | 18% (28) |
| | 3.4% (5) | 3.9% (6) |
| missing | 0% (0) | 0% (0) |
Now, present the trial results with atable. The target variable is score, variable trt acts as the grouping variable, and variable time splits the dataset before analysis:
the_table <- atable(score ~ trt | time, arthritis)
Table 2 reports the number of observations per group and time point. The distribution of the ordered variable score is presented as counts and percentages. Missing values are also counted per variable and group. The p-value and test statistic of the comparison of the two treatment groups are shown. The statistical tests are designed for two or more independent samples, which arise in parallel group trials. The statistical tests are all non-parametric. Parametric alternatives exist that have greater statistical power if their requirements are met by the data, but non-parametric tests are chosen for their broader range of application. The effect sizes with a 95% confidence interval are calculated; see table 3 for details.
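Table 3 lists Cliff's Δ as the effect size for ordered variables. The following base R sketch shows its definition: the proportion of pairs in which an observation of the first sample exceeds one of the second, minus the proportion in which it is smaller. The function name `cliffs_delta` and the toy scores are assumptions made for illustration; the implementation used by atable, including its sign convention and confidence interval, may differ.

```r
# Cliff's delta for two samples x and y (hypothetical helper, base R only).
cliffs_delta <- function(x, y) {
  # outer() builds the matrix of all pairwise differences x[i] - y[j]
  diffs <- outer(x, y, "-")
  (sum(diffs > 0) - sum(diffs < 0)) / (length(x) * length(y))
}

# Invented ordinal scores for two groups
x <- c(1, 2, 2, 3)
y <- c(2, 3, 4, 4)

delta <- cliffs_delta(x, y)
print(delta)
```

The value lies between -1 and 1; a value of 0 means that neither sample tends to have larger observations than the other.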
LaTeX is not the only supported output format. All possible formats are:
- latex of Hmisc, kable of knitr or xtable of xtable.
- knitr::kable of knitr.
- flextable of flextable.
The output format is declared by the argument format_to of atable, or globally via atable_options. The settings package (van der Loo 2015) allows global declaration of various options of atable.
| Group | placebo | drug | p | stat | Effect Size (CI) |
|---|---|---|---|---|---|
| Month 1 | | | | | |
| Observations | | | | | |
| | 149 | 153 | | | |
| score | | | | | |
| | 6% (9) | 1.3% (2) | 0.08 | 9.9e+03 | -0.12 (-0.24; 0.0017) |
| | 23% (35) | 10% (16) | | | |
| | 34% (50) | 50% (77) | | | |
| | 30% (45) | 33% (51) | | | |
| | 6% (9) | 3.3% (5) | | | |
| missing | 0.67% (1) | 1.3% (2) | | | |
| Month 3 | | | | | |
| Observations | | | | | |
| | 149 | 153 | | | |
| score | | | | | |
| | 6% (9) | 2% (3) | 0.0065 | 9e+03 | -0.2 (-0.32; -0.08) |
| | 21% (32) | 18% (27) | | | |
| | 42% (63) | 34% (52) | | | |
| | 24% (36) | 33% (50) | | | |
| | 5.4% (8) | 10% (16) | | | |
| missing | 0.67% (1) | 3.3% (5) | | | |
| Month 5 | | | | | |
| Observations | | | | | |
| | 149 | 153 | | | |
| score | | | | | |
| | 5.4% (8) | 1.3% (2) | 0.004 | 8.7e+03 | -0.22 (-0.34; -0.1) |
| | 19% (29) | 13% (20) | | | |
| | 35% (52) | 33% (51) | | | |
| | 32% (48) | 29% (45) | | | |
| | 6.7% (10) | 18% (28) | | | |
| missing | 1.3% (2) | 4.6% (7) | | | |

| R class | factor | ordered | numeric |
|---|---|---|---|
| scale of measurement | nominal | ordinal | interval |
| statistic | counts occurrences of every level | as factor | Mean and standard deviation |
| two-sample test | \(\chi^{2}\) test | Wilcoxon rank sum test | Kolmogorov-Smirnov test |
| effect size | two levels: odds ratio, else Cramér’s \(\phi\) | Cliff’s \(\Delta\) | Cohen’s d |
| multi-sample test | \(\chi^{2}\) test | Kruskal-Wallis test | Kruskal-Wallis test |
The current implementation of tests and statistics (see table 3) is not suitable for all possible datasets. For example, the parametric t-test or the robust estimator median may be more adequate for some datasets. Additionally, dates and times are currently not handled by atable.
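To illustrate why a robust estimator can be more adequate, the following base R sketch compares mean and standard deviation with median and MAD on a small sample containing one outlier. The values are invented for illustration and are not taken from the arthritis data.

```r
# Invented ages with a single implausible value (120)
age <- c(48, 50, 51, 52, 54, 120)

# Default-style summary: strongly affected by the outlier
classic <- c(Mean = mean(age), SD = sd(age))

# Robust summary: barely affected by the outlier
robust <- c(Median = median(age), MAD = mad(age))

print(round(classic, 1))
print(round(robust, 1))
```

The mean is pulled far above the bulk of the data, whereas the median stays close to it; this is the kind of dataset for which replacing the default statistics is worthwhile.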
It is intended that some parts of atable can be altered by the user. Such modifications are accomplished by replacing the underlying methods or adding new ones while preserving the structures of arguments and results of the old functions. The workflow of atable (and the corresponding function in parentheses) is as follows:
1. calculate statistics (statistics)
2. apply hypothesis tests (two_sample_htest or multi_sample_htest)
3. format statistics results (format_statistics)
4. format hypothesis test results (format_tests).
These five functions may be altered by the user by replacing existing or adding new methods to already existing S3-generics. Two examples are as follows:
The atable package offers three possibilities to replace existing methods:
- set the new function via atable_options. This affects all following calls of atable.
- pass the new function as an argument to atable. This affects only a single call of atable and takes precedence over atable_options.
- replace the method in atable’s namespace.

We now define three new functions to exemplify these three possibilities.
First, define a modification of two_sample_htest.numeric, which applies t.test and ks.test simultaneously. See the documentation of two_sample_htest: the function has two arguments called value and group and returns a named list.
new_two_sample_htest_numeric <- function(value, group, ...){

  d <- data.frame(value = value, group = group)

  group_levels <- levels(group)
  x <- subset(d, group %in% group_levels[1], select = "value", drop = TRUE)
  y <- subset(d, group %in% group_levels[2], select = "value", drop = TRUE)

  ks_test_out <- stats::ks.test(x, y)
  t_test_out <- stats::t.test(x, y)

  out <- list(p_ks = ks_test_out$p.value,
              p_t = t_test_out$p.value )
  return(out)
}
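The contract described above (arguments value and group, a named list as result) can be checked outside of atable before the function is plugged in. The sketch below is a compact variant that uses split() instead of subset(); the function name `htest_sketch` and the toy data are hypothetical and only serve to show the shape of input and output.

```r
# Hypothetical two-sample test function following the described contract.
htest_sketch <- function(value, group, ...) {
  # split() returns one vector of values per level of the grouping factor
  samples <- split(value, droplevels(group))
  stopifnot(length(samples) == 2)
  list(
    p_ks = stats::ks.test(samples[[1]], samples[[2]])$p.value,
    p_t  = stats::t.test(samples[[1]], samples[[2]])$p.value
  )
}

# Invented measurements without ties, five per group
value <- c(1.1, 2.3, 3.8, 4.2, 5.9, 2.0, 3.1, 4.7, 6.4, 7.5)
group <- factor(rep(c("placebo", "drug"), each = 5),
                levels = c("placebo", "drug"))

res <- htest_sketch(value, group)
str(res)
```

The names of the returned list (here p_ks and p_t) are what later appear as column headers of the table, so short and descriptive names are advisable.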
Second, define a modification of statistics.numeric that calculates the median, MAD, mean and SD. See the documentation of statistics: the function has one argument called x and the ellipsis (...). The function must return a named list.
new_statistics_numeric <- function(x, ...){

  statistics_out <- list(Median = median(x, na.rm = TRUE),
                         MAD = mad(x, na.rm = TRUE),
                         Mean = mean(x, na.rm = TRUE),
                         SD = sd(x, na.rm = TRUE))

  class(statistics_out) <- c("statistics_numeric", class(statistics_out))
  # We will need this new class later to specify the format
  return(statistics_out)
}
Third, define a modification of format_statistics: the median and MAD should be next to each other, separated by a semicolon; the mean and SD should go below them. See the documentation of format_statistics: the function has one argument called x and the ellipsis (...). The function must return a data.frame with names tag and value with class factor and character, respectively. Setting a new format is optional because there exists a default method for format_statistics that performs the rounding and arranges the statistics below each other.
new_format_statistics_numeric <- function(x, ...){

  Median_MAD <- paste(round(c(x$Median, x$MAD), digits = 1), collapse = "; ")
  Mean_SD <- paste(round(c(x$Mean, x$SD), digits = 1), collapse = "; ")

  out <- data.frame(
    tag = factor(c("Median; MAD", "Mean; SD"), levels = c("Median; MAD", "Mean; SD")),
    # the factor needs levels for the non-alphabetical order
    value = c(Median_MAD, Mean_SD),
    stringsAsFactors = FALSE)
  return(out)
}
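A formatting function of this kind can be tried on a hand-made list before it is passed to atable. The sketch below uses a stand-in called `format_sketch` and invented input values (both are assumptions, not part of the package) and checks that the result has the required structure: a factor column tag and a character column value.

```r
# Hypothetical stand-in for a formatting function with the required contract.
format_sketch <- function(x, ...) {
  tags <- c("Median; MAD", "Mean; SD")
  data.frame(
    # explicit levels keep the rows in the intended, non-alphabetical order
    tag = factor(tags, levels = tags),
    value = c(paste(round(c(x$Median, x$MAD), 1), collapse = "; "),
              paste(round(c(x$Mean, x$SD), 1), collapse = "; ")),
    stringsAsFactors = FALSE
  )
}

# Invented statistics, shaped like the output of a statistics function
stats_in <- list(Median = 55, MAD = 10.378, Mean = 50.72, SD = 11.16)

res <- format_sketch(stats_in)

# Structural checks mirroring the documented requirements
stopifnot(is.data.frame(res), identical(names(res), c("tag", "value")))
stopifnot(is.factor(res$tag), is.character(res$value))
print(res)
```

Checking the structure early is cheaper than debugging a malformed table after the full analysis has run.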
Now apply the three kinds of modification to atable. We start with atable’s namespace:
utils::assignInNamespace(x = "two_sample_htest.numeric",
                         value = new_two_sample_htest_numeric,
                         ns = "atable")
Here is why altering two_sample_htest.numeric in atable’s namespace works: R’s lexical scoping rules state that when atable is called, R first searches in the enclosing environment of atable to find two_sample_htest.numeric. The enclosing environment of atable is the environment where it was defined, namely, atable’s namespace. For more details about scoping rules and environments, see e.g. Wickham (2014), section ‘Environments’.
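The scoping argument can be reproduced in miniature without any package. In the following base R sketch, the environment `pkg_env` plays the role of a package namespace, and `helper` and `summarise` are hypothetical names; it is an analogy for the mechanism, not a description of atable's internals.

```r
# An environment standing in for a package namespace
pkg_env <- new.env()

# Functions created here have pkg_env as their enclosing environment
local({
  helper <- function(x) mean(x)
  summarise <- function(x) helper(x)
}, envir = pkg_env)

before <- pkg_env$summarise(c(1, 2, 9))   # uses helper from pkg_env (mean)

# A function of the same name in the workspace is NOT picked up,
# because the enclosing environment is searched first
helper <- function(x) stats::median(x)
ignored <- pkg_env$summarise(c(1, 2, 9))

# Replacing the helper inside the enclosing environment changes the result
assign("helper", function(x) stats::median(x), envir = pkg_env)
after <- pkg_env$summarise(c(1, 2, 9))    # now uses the median

print(c(before = before, ignored = ignored, after = after))
```

This also shows why an existing method has to be replaced where it lives, while a method for a class that the package does not know yet can simply be defined in the workspace.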
Then modify via atable_options:
atable_options('statistics.numeric' = new_statistics_numeric)
Then modify by passing new_format_statistics_numeric as an argument to atable, together with the actual analysis. See table 4 for the results.
the_table <- atable(age ~ trt, arthritis,
                    format_statistics.statistics_numeric = new_format_statistics_numeric)
The modifications in atable_options are reverted by calling atable_options_reset(); changes in the namespace are reverted by calling utils::assignInNamespace with suitable arguments.
| Group | placebo | drug | p_ks | p_t |
|---|---|---|---|---|
| Observations | | | | |
| | 447 | 459 | | |
| age | | | | |
| Median; MAD | 55; 10.4 | 53; 10.4 | 0.043 | 0.38 |
| Mean; SD | 50.7; 11.2 | 50.1; 11 | | |
Replacing methods allows us to create arbitrary tables, even tables independent of the supplied data. We will create a table of a blank data.frame with arbitrary fill-ins (here X.xx) as placeholders. This is useful for proposals or report templates:
# create empty data.frame with non-empty column names
E <- atable::test_data[FALSE, ]

stats_placeholder <- function(x, ...){

  return(list(Mean = "X.xx",
              SD = "X.xx"))
}

the_table <- atable::atable(E, target_cols = c("Numeric", "Factor"),
                            statistics.numeric = stats_placeholder)
See table 5 for the results. This table also shows that atable accepts empty data frames without errors.
| Group | value |
|---|---|
| Observations | |
| | 0 |
| Numeric | |
| Mean | X.xx |
| SD | X.xx |
| Factor | |
| G3 | NaN% (0) |
| G2 | NaN% (0) |
| G1 | NaN% (0) |
| G0 | NaN% (0) |
| missing | NaN% (0) |
In the current implementation of atable, the generics have no method for class Surv of survival (Therneau 2015). We define two new methods: the distribution of survival times is described by its mean survival time and corresponding standard error; the Mantel-Haenszel test compares two survival curves.
statistics.Surv <- function(x, ...){

  survfit_object <- survival::survfit(x ~ 1)

  # copy from survival:::print.survfit:
  out <- survival:::survmean(survfit_object, rmean = "common")

  return(list(mean_survival_time = out$matrix["*rmean"],
              SE = out$matrix["*se(rmean)"]))
}

two_sample_htest.Surv <- function(value, group, ...){

  survdiff_result <- survival::survdiff(value~group, rho=0)

  # copy from survival:::print.survdiff:
  etmp <- survdiff_result$exp
  df <- (sum(1 * (etmp > 0))) - 1
  p <- 1 - stats::pchisq(survdiff_result$chisq, df)

  return(list(p = p,
              stat = survdiff_result$chisq))
}
These two functions are defined in the user’s workspace, the global environment. It is sufficient to define them there, as R’s scoping rules will eventually find them after going through the search path; see Wickham (2014).
Now, we need data with class Surv to apply the methods. The dataset ovarian of survival contains the survival times of a randomised trial comparing two treatments for ovarian cancer. Variable futime is the survival time, fustat is the censoring status, and variable rx is the treatment group.
library(survival)
# set classes
ovarian <- within(survival::ovarian, {time_to_event = survival::Surv(futime, fustat)})
Then, call atable to apply the statistics and hypothesis tests. See table 6 for the results.
atable(ovarian, target_cols = c("time_to_event"), group_col = "rx")
| Group | 1 | 2 | p | stat |
|---|---|---|---|---|
| Observations | | | | |
| | 13 | 13 | | |
| time_to_event | | | | |
| mean_survival_time | 650 | 889 | 0.3 | 1.1 |
| SE | 120 | 115 | | |
A single function call does the job, and in conjunction with report-generating packages such as knitr, accelerates the analysis and reporting of clinical trials.
Other R packages exist to accomplish this task: Desc (only describes data.frames, no hypothesis tests) and PercTable (contingency tables only) of DescTools. furniture and tableone have high overlap with atable, and thus we compare their advantages relative to atable in greater detail.
Advantages of furniture::table1 are:
- interacts well with magrittr’s pipe %>% (Bache and Wickham 2014), as mentioned in the examples of ?table1. This facilitates reading the code.
- handles objects defined by dplyr’s group_by to define grouping variables (Wickham et al. 2019). atable has no methods defined for these objects.
- uses non-standard evaluation, which allows the user to create and modify variables from within the function itself, e.g.:
table1(df, x2 = ifelse(x > 0, 1, 0)).
This is not possible with atable.
Advantages of tableone::CreateTableOne are:
- atable demands syntactically valid names defined by make.names.
- tableone::CreateTableOne. In atable a redefinition of a function is needed.
- atable allows only multivariate tests.

Advantages of atable are:
- options can be changed locally via arguments of atable and globally via atable_options,
- the distinction between split_cols and group_col.

Changing options is exemplified in section 4: passing options to atable allows the user to modify a single atable call; changing atable_options will affect all subsequent calls and thus spares the user passing these options to every single call.
Descriptive statistics, hypothesis tests and effect sizes are automatically chosen according to the class of the target column. R’s S3-object system allows a straightforward implementation and extension of this feature, see section 4.
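The mechanism of choosing a statistic by class can be sketched with a toy S3 generic in base R. The generic `describe` and its methods are hypothetical and much simpler than the generics of atable; they only show how dispatch on the class of a column selects the summary.

```r
# Toy generic: the method is selected by the class of x
describe <- function(x, ...) UseMethod("describe")

# numeric columns: mean and standard deviation
describe.numeric <- function(x, ...) {
  c(Mean = mean(x, na.rm = TRUE), SD = stats::sd(x, na.rm = TRUE))
}

# factor columns: counts per level, including missing values
describe.factor <- function(x, ...) table(x, useNA = "always")

# ordered factors have class c("ordered", "factor"); fall through to factor
describe.ordered <- function(x, ...) NextMethod()

# Invented example columns
age   <- c(40, 50, 60)
sex   <- factor(c("female", "male", "male"))
score <- ordered(c(1, 3, 3))

print(describe(age))
print(describe(sex))
print(describe(score))
```

Supporting a further class then only requires one more method, without touching the generic or the existing methods.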
atable supports the following concise and self-explanatory formula syntax:
atable(target_cols ~ group_col | split_cols, ...)
R users are used to working with formulas, such as via the lm function for linear models. When fitting a linear model to randomised clinical trial data, one can use
lm(target_cols ~ group_col, ...)
to estimate the influence of the interventions group_col on the endpoint target_cols. atable mimics this syntax:
atable(target_cols ~ group_col, ...)
performs a hypothesis test of whether the interventions group_col have an influence on the endpoint target_cols.
Also, statisticians know the notion of conditional probability:
P(target_cols | split_cols).
This denotes the distribution of target_cols given split_cols. atable borrows the pipe | from conditional probability:
atable(target_cols ~ group_col | split_cols)
shows the distribution of the endpoint target_cols within the interventions group_col given the strata defined by split_cols.
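How such a formula separates into its three roles can be shown with base R alone. The helper `split_formula` below is a hypothetical illustration of the syntax, not the parser used by atable; it relies only on the fact that ~ binds more loosely than |, which in turn binds more loosely than +.

```r
# Hypothetical helper: extract target, group and split names from a formula
split_formula <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3)
  rhs <- f[[3]]
  # the right-hand side is a call to `|` when split columns are given
  has_split <- is.call(rhs) && identical(rhs[[1]], as.name("|"))
  list(
    target = all.vars(f[[2]]),
    group  = all.vars(if (has_split) rhs[[2]] else rhs),
    split  = if (has_split) all.vars(rhs[[3]]) else character(0)
  )
}

# Formula with two targets, one group column and two split columns
parts <- split_formula(score + age ~ trt | time + sex)
str(parts)

# Formula without split columns
simple <- split_formula(score ~ trt)
str(simple)
```

The variable names on each side map directly to the arguments target_cols, group_col and split_cols of the non-formula interface.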
atable distinguishes between split_cols and group_col: group_col denotes the randomised intervention of the trial. We want to test whether it has an influence on the target_cols; split_cols are variables that may have an influence on target_cols, but we are not interested in that influence in the first place. Such variables, for example, sex, age group, and time point of measurement, arise often in clinical trials. See table 2: the variable time is such a supplementary stratification variable: it has an effect on the arthritis score, but that is not the effect of interest; we are interested in the effect of the intervention on the arthritis score.
The package can be used in other research contexts as a preliminary unspecific analysis. Displaying the distributions of variables is a task that arises in every research discipline that collects quantitative data.
I thank the anonymous reviewer for his/her helpful and constructive comments.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ströbel, "atable: Create Tables for Clinical Trial Reports", The R Journal, 2019
BibTeX citation
@article{RJ-2019-001, author = {Ströbel, Armin}, title = {atable: Create Tables for Clinical Trial Reports}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {137-148} }