The bnclassify package provides state-of-the-art algorithms for learning Bayesian network classifiers from data. For structure learning it provides variants of the greedy hill-climbing search, a well-known adaptation of the Chow-Liu algorithm and averaged one-dependence estimators. It provides Bayesian and maximum likelihood parameter estimation, as well as three naive-Bayes-specific methods based on discriminative score optimization and Bayesian model averaging. The implementation is efficient enough to allow for time-consuming discriminative scores on medium-sized data sets. The bnclassify package provides utilities for model evaluation, such as cross-validated accuracy and penalized log-likelihood scores, and analysis of the underlying networks, including network plotting via the Rgraphviz package. It is extensively tested, with over 200 automated tests that give a code coverage of 94%. Here we present the main functionalities, illustrate them with a number of data sets, and comment on related software.
Bayesian network classifiers (Friedman et al. 1997; Bielza and Larrañaga 2014) are classifiers with competitive predictive performance (e.g., Zaidi et al. 2013) and the added benefit of interpretability. Their simplest member, the naive Bayes (NB) (Minsky 1961), is well-known (Hand and Yu 2001). More elaborate models exist, taking advantage of the Bayesian network (Pearl 1988; Koller and Friedman 2009) formalism for representing complex probability distributions. The tree augmented naive Bayes (Friedman et al. 1997) and the averaged one-dependence estimators (AODE) (Webb et al. 2005) are among the most prominent.
A Bayesian network classifier is simply a Bayesian network applied to classification, that is, to the prediction of the probability P(c | x) of a discrete class variable C given a vector of features x.
The bnclassify package implements state-of-the-art algorithms for learning structure and parameters. The implementation is efficient enough to allow for time-consuming discriminative scores on relatively large data sets. It provides utility functions for prediction and inference, model evaluation with network scores and cross-validated estimation of predictive performance, and model analysis, such as querying structure type or graph plotting via the Rgraphviz package (Hansen et al. 2017). It integrates with the caret (Kuhn 2008; Kuhn et al. 2017) and mlr (Bischl et al. 2017) packages for straightforward use in machine learning pipelines. Currently it supports only discrete variables. The functionalities are illustrated in an introductory vignette, while an additional vignette provides details on the implemented methods. It includes over 200 unit and integration tests that give a code coverage of 94 percent (see https://codecov.io/github/bmihaljevic/bnclassify?branch=master).
The rest of this paper is structured as follows. We begin by providing background on Bayesian network classifiers and describing the implemented functionalities. We then illustrate usage with a synthetic data set and compare the methods' running time, predictive performance and complexity over several data sets. Finally, we discuss implementation, briefly survey related software, and conclude by outlining future work.
A Bayesian network classifier is a Bayesian network used for predicting a discrete class variable C. It assigns an observation x of n features X = (X1, ..., Xn) to the most probable class,

c* = argmax_c P(c | x) = argmax_c P(c, x).

The classifier factorizes the joint distribution P(c, x) according to a Bayesian network B = (G, θ), where G is a directed acyclic graph with a node for each variable in (X, C) whose arcs encode conditional independencies among them, and θ are the parameters of the local distributions of each variable given its parents in G. Local distributions are multinomial, as bnclassify handles only discrete variables. We learn B from a data set of N observations of (X, C), typically learning the structure G first and estimating the parameters θ afterwards.
Common scores in structure learning are the penalized log-likelihood scores, such as the Akaike information criterion (AIC) (Akaike 1974) and the Bayesian information criterion (BIC) (Schwarz 1978). They measure the model's fit to the empirical distribution P(c, x), adding a penalty term that is a function of structure complexity. They are decomposable with respect to G, that is, they can be expressed as a sum of terms that each depend on a single node and its parents, which simplifies the search over structures.
For Bayesian network classifiers, a common (see Bielza and Larrañaga 2014) structure space is that of augmented naive Bayes (Friedman et al. 1997) models (see Figure 1), factorizing the joint distribution as

P(c, x) = P(c) ∏i P(xi | c, pa(xi)),

with the class C a parent of every feature and pa(xi) denoting the (possibly empty) set of feature parents of Xi.
The algorithms that produce the above structures are generally instances
of greedy hill-climbing (Sahami 1996; Keogh and Pazzani 2002), with arc inclusion
and removal as their search operators. Some (e.g., Pazzani 1996) add
node inclusion or removal, thus embedding feature selection (Guyon and Elisseeff 2003)
within structure learning. Alternatives include the adaptation
(Friedman et al. 1997) of the Chow-Liu (Chow and Liu 1968) algorithm to find the
optimal one-dependence estimator (ODE) with respect to decomposable
penalized log-likelihood scores in time quadratic in the number of features.
Given a structure G, we estimate the parameters θ from the data. For discrete variables, the Bayesian estimate of a local conditional probability under a Dirichlet prior with all hyperparameters equal to α is

θijk = (Nijk + α) / (Nij + ri α),    (2)

where Nijk is the number of instances in which the i-th variable takes its k-th value while its parents take their j-th configuration, Nij = Σk Nijk, and ri is the cardinality of the i-th variable. Maximum likelihood estimation corresponds to α = 0.
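As an illustration of Equation (2), the following R snippet computes the Bayesian estimate for a single variable and a single parent configuration, using hypothetical counts rather than any of the data sets discussed below:

# Hypothetical counts Nijk of the ri = 3 values of a variable Xi,
# for a single configuration j of its parents
n_ijk <- c(10, 3, 0)
alpha <- 1  # Dirichlet hyperparameter; alpha = 0 gives maximum likelihood
(n_ijk + alpha) / (sum(n_ijk) + length(n_ijk) * alpha)
[1] 0.6875 0.2500 0.0625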
While the NB can separate any two linearly separable classes given the appropriate parameters, learning by approximating P(C, x) cannot recover the optimal parameters in some cases (Jaeger 2003). Several methods (Hall 2007; Zaidi et al. 2013; Zaidi et al. 2017) instead learn a weight wi in [0, 1] for each feature and compute the class posterior as

P(c | x) ∝ P(c) ∏i P(xi | c)^wi.

A wi < 1 reduces the effect of feature Xi on the class posterior, and wi = 0 removes it altogether, rendering Xi independent of the class.
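The effect of the weights can be seen in a small example. The following sketch, with hypothetical probabilities unrelated to any model learned by bnclassify, computes the weighted posterior for a two-class problem with two features:

prior <- c(pos = 0.5, neg = 0.5)
p_x_given_c <- rbind(x1 = c(pos = 0.9, neg = 0.3),
                     x2 = c(pos = 0.4, neg = 0.6))
w <- c(x1 = 1, x2 = 0.5)  # feature weights in [0, 1]
joint <- prior * apply(p_x_given_c ^ w, 2, prod)
joint / sum(joint)  # class posterior P(c | x)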
Another parameter estimation method for the naive Bayes is Bayesian model averaging over the 2^n naive Bayes structures obtained by either including or excluding each class-to-feature arc; it is known as the model averaged naive Bayes (MANB) and its estimates can be computed without enumerating the structures.
Computing P(c | x) for a fully observed x means multiplying the corresponding local parameters for each class c and normalizing the result, whereas an x with missing entries requires exact inference in the network.
The package has four groups of functionalities:
Learning network structure and parameters
Analyzing the model
Evaluating the model
Predicting with the model
Learning is split into two separate steps: structure learning and then, optionally, parameter learning. The obtained models can be evaluated, used for prediction, or analyzed. The following provides a brief overview of this workflow. For details on some of the underlying methods, please see the "methods" vignette.
The learning algorithms produce the following network structures: the NB, the TAN and the FAN, the SNB, the k-DB, and the AODE (see Table 1).
Figure 1 shows some of these structures and their factorizations of P(c, x). We use k-DB in the sense of Pernkopf and Bilmes (2010) rather than that of Sahami (1996), as we impose no minimum on the number of augmenting arcs. The SNB is the only structure whose complexity is not a priori bounded: the feature subgraph might be complete in the extreme case.
Figure 1: Example structures and their factorizations of P(c, x): an NB, P(c,x) = P(c)P(x1|c)P(x2|c)P(x3|c)P(x4|c)P(x5|c)P(x6|c); a TAN, P(c,x) = P(c)P(x1|c,x2)P(x2|c,x3)P(x3|c,x4)P(x4|c)P(x5|c,x4)P(x6|c,x5); a FAN, P(c,x) = P(c)P(x1|c,x2)P(x2|c)P(x3|c)P(x4|c)P(x5|c,x4)P(x6|c,x5); and an SNB omitting x3, P(c,x) = P(c)P(x1|c,x2)P(x2|c)P(x4|c)P(x5|c,x4)P(x6|c,x4,x5).
Each structure learning algorithm is implemented by a single R function. Table 1 lists these algorithms along with the structures that they produce, the scores they can be combined with, and their R functions. Below we give their abbreviations and references, comment on them briefly, and illustrate the function calls.
We implement two fixed structures: the NB and the AODE. Both are fully determined given the number of variables, and thus no search is required to estimate them from data. For example, we can get an NB structure with
n <- nb('class', dataset = car)
where class is the name of the class variable and car is the data set containing observations of the class and the features.
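The AODE ensemble is obtained analogously; as a sketch, assuming that aode() (see Table 1) takes the class variable name and the data set as its first two arguments, just like nb():

a_ode <- aode('class', car)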
For one-dependence estimators we implement one algorithm: the CL-ODE, the adaptation of the Chow-Liu algorithm by Friedman et al. (1997).
Maximizing log-likelihood will always produce a TAN while maximizing
penalized log-likelihood may produce a FAN since including some arcs can
degrade such a score. With incomplete data our implementation does not
guarantee the optimal ODE as that would require computing maximum
likelihood parameters. The arguments of the tan_cl()
function are the
network score to use and, optionally, the root for features’ subgraph:
n <- tan_cl('class', car, score = 'aic', root = 'buying')
For structure learning with the wrapper approach, the bnclassify package implements five algorithms: TAN-HC, TAN-HCSP, FSSJ, BSEJ, and k-DB (see Table 1).
These algorithms use the cross-validated estimate of predictive accuracy
as a score. Only the FSSJ and BSEJ perform feature selection. The
arguments of the corresponding functions include the number of cross-validation folds, k, and the minimal absolute score improvement, epsilon, required for continuing the search:
fssj <- fssj('class', car, k = 5, epsilon = 0)
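The remaining wrapper functions take analogous arguments; for instance, a sketch of a TAN-HC call, assuming tan_hc() accepts the same k and epsilon arguments:

thc <- tan_hc('class', car, k = 5, epsilon = 0)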
Structure | Search algorithm | Score | Feature selection | Function |
---|---|---|---|---|
NB | - | - | - | nb |
TAN/FAN | CL-ODE | log-lik, AIC, BIC | - | tan_cl |
TAN | TAN-HC | accuracy | - | tan_hc |
TAN | TAN-HCSP | accuracy | - | tan_hcsp |
SNB | FSSJ | accuracy | forward | fssj |
SNB | BSEJ | accuracy | backward | bsej |
AODE | - | - | - | aode |
kDB | kDB | accuracy | - | kdb |
The bnclassify package only handles discrete features. With fully observed data, it estimates the parameters with maximum likelihood or Bayesian estimation, according to Equation (2), with a single Dirichlet hyperparameter, given by the smooth argument, shared by all local distributions.
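For example, the following sketch fits parameters for a previously learned structure n, assuming that smooth = 0 yields maximum likelihood estimates while a positive smooth gives the Bayesian estimate of Equation (2):

n_ml <- lp(n, car, smooth = 0)     # maximum likelihood
n_bayes <- lp(n, car, smooth = 1)  # Bayesian estimation with hyperparameter 1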
We implement two methods for weighted naive Bayes parameter estimation: the WANBIA and the AWNB.
We implement one method for estimation by means of Bayesian model averaging over all NB structures with up to n features: the MANB.
It makes little sense to apply WANBIA, MANB, and AWNB to structures
other than NB. WANBIA, for example, learns the weights by optimizing the
conditional log-likelihood of the NB. Parameter learning is done with
the lp()
function. For example,
a <- lp(n, car, smooth = 1, manb_prior = 0.5)
computes Bayesian parameter estimates, with the Dirichlet hyperparameter (the smooth argument) set to 1, for all local distributions, and updates them with the MANB estimation obtained with a 0.5 prior probability for each class-to-feature arc.
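AWNB weights can likewise be requested through lp(). The following is only a sketch: it assumes that the awnb_trees and awnb_bootstrap arguments control, respectively, the number of bootstrap trees and the size of the bootstrap subsamples used to learn the weights:

awnb <- lp(n, car, smooth = 1, awnb_trees = 10, awnb_bootstrap = 0.5)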
Single-structure-learning functions, as opposed to those that learn an ensemble of structures, return an S3 object of class "bnc_dag". The following functions can be invoked on such objects:
Plotting: plot()
Querying the structure type: is_tan(), is_ode(), is_nb(), is_aode(), and others
Querying structure properties: narcs(), families(), features(), and others
Converting to gRain format: as_grain()
Ensembles are of type "bnc_aode" and only print() and model type queries can be applied to such objects. Fitting the parameters (by calling lp()) of a "bnc_dag" produces a "bnc_bn" object. In addition to all "bnc_dag" functions, the following are meaningful:
predict()
compute_joint()
AIC(), BIC(), logLik(), clogLik()
cv()
nparams()
manb_arc_posterior(), weights()
The above functions for "bnc_bn" objects can also be applied to an ensemble with fitted parameters.
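For example, on the fitted model a obtained above we could query the number of free parameters and arcs (a sketch; the outputs depend on the learned model and are omitted):

nparams(a)
narcs(a)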
An introductory vignette provides an overview of the package and background on the
implemented methods. Calling ?bnclassify
gives a more concise overview
of the functionalities, with pointers to relevant functions and their
documentation. The “usage” vignette presents more detailed usage
examples and shows how to combine the functions. The “methods” vignette
provides details on the underlying methods and documents implementation
specifics, especially where they differ from or are undocumented in the
original paper.
We illustrate these functionalities with the synthetic car data set, which has six features. We begin with a simple example for each functionality group and then elaborate on the options in the following sections. We first load the package and the data set:
library(bnclassify)
data(car)
Then we learn a naive Bayes structure and its parameters:
nb <- nb('class', car)
nb <- lp(nb, car, smooth = 0.01)
Then we get the number of arcs in the network:
narcs(nb)
[1] 6
Then we get the 10-fold cross-validation estimate of accuracy:
cv(nb, car, k = 10)
[1] 0.8628258
Finally, we classify the entire data set:
p <- predict(nb, car)
head(p)
[1] unacc unacc unacc unacc unacc unacc
Levels: unacc acc good vgood
The functions for structure learning, shown in Table 1, correspond to the different algorithms. They all receive the name of the class variable and the data set as their first two arguments, which are then followed by optional arguments. The following runs the CL-ODE algorithm with the AIC score, followed by the FSSJ algorithm to learn another model:
ode_cl_aic <- tan_cl('class', car, score = 'aic')
set.seed(3)
fssj <- fssj('class', car, k = 5, epsilon = 0)
The bnc()
function is a shorthand for learning structure and
parameters in a single step,
ode_cl_aic <- bnc('tan_cl', 'class', car, smooth = 1, dag_args = list(score = 'aic'))
where the first argument is the name of the structure learning function and any optional arguments go in dag_args.
Printing the model, such as the above ode_cl_aic
object, provides
basic information about it.
ode_cl_aic
Bayesian network classifier
class variable: class
num. features: 6
num. arcs: 9
free parameters: 131
learning algorithm: tan_cl
While plotting the network is especially useful for small networks, printing the structure in the deal (Bottcher and Dethlefsen 2013) and bnlearn format may be more useful for larger ones:
ms <- modelstring(ode_cl_aic)
strwrap(ms, width = 60)
[1] "[class] [buying|class] [doors|class] [persons|class]"
[2] "[maint|buying:class] [safety|persons:class]"
[3] "[lug_boot|safety:class]"
We can query the type of the structure, while params() lets us access the conditional probability tables (CPTs) and features() lists the features:
is_ode(ode_cl_aic)
[1] TRUE
params(nb)$buying
class
buying unacc acc good vgood
low 0.2132243562 0.2317727320 0.6664252607 0.5997847478
med 0.2214885458 0.2994740131 0.3332850521 0.3999077491
high 0.2677680077 0.2812467451 0.0001448436 0.0001537515
vhigh 0.2975190903 0.1875065097 0.0001448436 0.0001537515
length(features(fssj))
[1] 5
For example, fssj()
has selected five out of six features.
The manb_arc_posterior()
function provides the MANB posterior
probabilities for arcs from the class to each of the features:
manb <- lp(nb, car, smooth = 0.01, manb_prior = 0.5)
round(manb_arc_posterior(manb))
buying maint doors persons lug_boot safety
1 1 0 1 1 1
With the posterior probability of 0% for the arc from class
to
doors
, and 100% for all others, MANB renders doors
independent of
the class while leaving the other features’ parameters unaltered. We can
see this by printing out the CPTs:
params(manb)$doors
class
doors unacc acc good vgood
2 0.25 0.25 0.25 0.25
3 0.25 0.25 0.25 0.25
4 0.25 0.25 0.25 0.25
5more 0.25 0.25 0.25 0.25
all.equal(params(manb)$buying, params(nb)$buying)
[1] TRUE
For more functions for querying a structure with parameters ("bnc_bn"
)
see ?inspect_bnc_bn
. For a structure without parameters ("bnc_dag"
),
see ?inspect_bnc_dag
.
Several scores can be computed:
logLik(ode_cl_aic, car)
'log Lik.' -13307.59 (df=131)
AIC(ode_cl_aic, car)
[1] -13438.59
The cv()
function estimates the predictive accuracy of one or more
models with a single run of stratified cross-validation. In the
following we assess the above models produced by NB and CL-ODE
algorithms:
set.seed(0)
cv(list(nb = nb, ode_cl_aic = ode_cl_aic), car, k = 5, dag = TRUE)
nb ode_cl_aic
0.8582303 0.9345913
Above, k
is the desired number of folds, and dag = TRUE
evaluates
structure and parameter learning, while dag = FALSE
keeps the
structure fixed and evaluates just the parameter learning. The output
gives 86% and 93% accuracy estimates for NB and CL-ODE, respectively.
The mlr and caret packages provide additional options for evaluating
predictive performance, such as different metrics, and bnclassify is
integrated with both (see the “usage” vignette).
As shown above, we can predict class labels with predict()
. We can
also get the class posterior probabilities:
pp <- predict(nb, car, prob = TRUE)
# Show class posterior distributions for the first six instances of car
head(pp)
unacc acc good vgood
[1,] 1.0000000 2.171346e-10 8.267214e-16 3.536615e-19
[2,] 0.9999937 6.306269e-06 5.203338e-12 5.706038e-19
[3,] 0.9999908 9.211090e-06 5.158884e-12 4.780777e-15
[4,] 1.0000000 3.204714e-10 1.084552e-15 1.015375e-15
[5,] 0.9999907 9.307467e-06 6.826088e-12 1.638219e-15
[6,] 0.9999864 1.359469e-05 6.767760e-12 1.372573e-11
We illustrate the algorithms’ running times, resulting structure complexity and predictive performance on the datasets listed in Table 2. We only used complete data sets as time-consuming inference with incomplete data makes cross-validated scores costly for medium-sized or large data sets. The structure and parameter learning methods are listed in the legends of Figure 2, Figure 3, and Figure 4.
Dataset | Instances | Variables | Classes
---|---|---|---
car | 1728 | 7 | 4
tic-tac-toe | 958 | 10 | 2
voting | 435 | 17 | 2
ionosphere | 351 | 35 | 2
soybean | 562 | 36 | 19
kr-vs-kp | 3196 | 37 | 2
splice | 3190 | 61 | 3
Figure 2 shows that the algorithms with cross-validated scores, followed by WANBIA, are the most time-consuming. Running time is still not prohibitive: TAN-HC ran for 139 seconds on kr-vs-kp and 282 seconds on splice, adding 27 augmenting arcs on the former and 7 on the latter (using k = 5 and epsilon = 0 for the wrappers). In Figure 2, CL-ODE-AIC denotes CL-ODE with the AIC rather than the log-likelihood score, and the lines have been horizontally and vertically jittered to avoid overlap where identical.
CL-ODE tended to produce the most complex structures (see
Figure 3), with FSSJ learning complex models on car,
soybean and splice, yet simple ones, due to feature selection, on voting
and tic-tac-toe. The NB models with alternative parameters, WANBIA and
MANB, have as many parameters as the NB, because we are not counting the
length-n vector of feature weights learned by WANBIA as additional parameters.
In terms of accuracy, NB and MANB performed comparatively poorly on car,
voting, tic-tac-toe, and kr-vs-kp, possibly because of many wrong
independence assumptions (see Figure 4). WANBIA may
have accounted for some of these violations on voting and kr-vs-kp, as
it outperformed NB and MANB on these datasets, showing that a simple
model can perform well on them when adequately parameterized. More
complex models, such as CL-ODE and AODE, performed better on car.
With complete data, bnclassify implements prediction for augmented naive Bayes models as well as for ensembles of such models. It multiplies the corresponding parameters in logarithmic space, applying the log-sum-exp trick before normalizing, to reduce the chance of underflow. On instances with missing entries, it uses the gRain package (Højsgaard 2012, 2016) to perform exact inference, which is noticeably slower. Network plotting is implemented by the Rgraphviz package. Some functions are implemented in C++ with Rcpp for efficiency. The package is extensively tested, with over 200 unit and integration tests that give a 94% code coverage.
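The following minimal R sketch, with hypothetical log-joint values and not the package's internal code, illustrates the log-sum-exp trick used to normalize class posteriors computed in logarithmic space:

log_joint <- c(unacc = -3.2, acc = -25.4, good = -40.1, vgood = -44.7)
m <- max(log_joint)
log_norm <- m + log(sum(exp(log_joint - m)))  # log of the normalizing constant
exp(log_joint - log_norm)                     # class posterior, computed without underflow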
NB, TAN, and AODE are available in general-purpose tools such as bnlearn and Weka. WANBIA (https://sourceforge.net/projects/rawnaivebayes) and MANB (http://www.dbmi.pitt.edu/content/manb) are only available in stand-alone software, published along with the original publications. We are not aware of available implementations of the remaining methods implemented in bnclassify.
The bnlearn package implements algorithms for learning general purpose Bayesian networks. Among them, algorithms for Markov blanket learning by testing for independencies, such as IAMB (Tsamardinos and Aliferis 2003) and GS (Margaritis and Thrun 2000), can be very useful for classification as they can look for the Markov blanket of the class variable. The bnlearn package combines the search algorithms, such as greedy hill-climbing and tabu search (Glover and Laguna 2013), only with generative scores such as penalized log-likelihood. Among classification models, it implements the discrete NB and CL-ODE. It does not handle incomplete data and provides cross-validation and prediction only for the NB and TAN models, but not for the unrestricted Bayesian networks.
Version 3.8 of Weka (Bouckaert 2004; Hall et al. 2009) provides variants of the AODE (Webb et al. 2005) as well as the CL-ODE and NB. It implements five additional search algorithms, such as K2 (Cooper and Herskovits 1992), tabu search, and simulated annealing (Kirkpatrick et al. 1983), combining them only with generative scores. Except for the NB, Weka only handles discrete data and uses simple imputation (replacing with the mode or mean) to handle incomplete data. It provides two constraint-based algorithms, but performs conditional independence tests in an ad-hoc way (Bouckaert 2004). Weka provides Bayesian model averaging for parameter estimation (Bouckaert 1995).
The Java library jBNC (http://jbnc.sourceforge.net/, version 1.2.2)
learns the ODE classifiers of Sacha et al. (2002) by optimizing penalized
log-likelihood or the cross-validated estimate of accuracy. The CGBayes
(version 7.14.14) package (McGeachie et al. 2014) for MATLAB implements
conditional Gaussian networks (Lauritzen and Wermuth 1989). It provides four structure
learning algorithms, including a variant of K2 and a greedy
hill-climber, all optimizing the marginal likelihood of the data given
the network.
The bnclassify package implements several state-of-the-art algorithms for learning Bayesian network classifiers. It also provides features such as model analysis and evaluation. It is reasonably efficient and can handle large data sets. We hope that bnclassify will be useful to practitioners as well as researchers wishing to compare their methods to existing ones.
Future work includes handling real-valued features via conditional Gaussian models. Straightforward extensions include adding flexibility to the hill-climbing algorithm, such as restarts to avoid local optima.
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 785907 (HBP SGA2), the Spanish Ministry of Economy and Competitiveness through the Cajal Blue Brain (C080020-09; the Spanish partner of the EPFL Blue Brain initiative) and TIN2016-79684-P projects, from the Regional Government of Madrid through the S2013/ICE-2845-CASI-CAM-CM project, and from Fundación BBVA grants to Scientific Research Teams in Big Data 2016.
S. Højsgaard. Graphical independence networks with the gRain package for R. Journal of Statistical Software, 46(10): 1–26, 2012.
M. Kuhn. Building predictive models in R using the caret package. Journal of Statistical Software, 28(5): 1–26, 2008.
M. Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3): 1–22, 2010.