Associative Classification in R: arc, arulesCBA, and rCBA

Several methods for creating classifiers based on rules discovered via association rule mining have been proposed in the literature. These classifiers are called associative classifiers and the best-known algorithm is Classification Based on Associations (CBA). Interestingly, only very few implementations are available and, until recently, no implementation was available for R. Now, three packages provide CBA. This paper introduces associative classification, the CBA algorithm, and how it can be used in R. A comparison of the three packages is provided to give the potential user an idea about the advantages of each of the implementations. We also show how the packages are related to the existing infrastructure for association rule mining already available in R.


Introduction
Association rule learning (Agrawal et al., 1993) was initially designed for data exploration to discover interesting patterns in very large and sparse datasets. Several years after its inception, association rule learning was also adapted to create rule-based classification models. The first algorithm called CBA (Classification Based on Associations) was introduced by Liu et al. (1998). While there were multiple follow-up algorithms providing some improvements in classification performance (e.g., CPAR (Yin and Han, 2003) and FARC-HD-OVO (Elkano et al., 2015)), these performance gains are offset by a deterioration of comprehensibility of the produced set of rules. For some practical applications, CBA still provides a very good balance between accuracy, speed, and model comprehensibility. Unlike many more recent approaches, CBA classifiers are easy to interpret and apply: the resulting ruleset is relatively small, rules are crisp (i.e., not fuzzy rules), and rules are sorted according to predictive strength. CBA uses a simple first-match strategy for classification, where the first matching rule determines the predicted class.
With the exception of fuzzy approaches such as FARC-HD, associative classification approaches require a dataset in the form of transactions, i.e., all attributes need to be binary indicators and thus numeric attributes in the input data need to be discretized. This puts additional demands on the user and may deteriorate model fit on datasets with numerical attributes. Another disadvantage relating to CBA and most other associative classification approaches is that these algorithms require the user to specify a minimum support and a minimum confidence threshold for association rule mining. The performance (accuracy and speed) is typically very sensitive to a proper selection of these threshold values. Setting these thresholds too high can result in the classifier underfitting the dataset or even an empty rule list. Too low values can lead to a combinatorial explosion with an excessive number of rules generated, leading to speed and memory issues. Another limitation that applies specifically to CBA is that even when the user specifies reasonable thresholds, CBA typically produces more rules than other related approaches (Alcala-Fdez et al., 2011). These limitations may be the reason why CBA implementations have not been available in many computational environments for machine learning and statistics. However, in the last several years, three packages with CBA implementations appeared on CRAN (listed by date of the first release): rCBA (Kuchar, 2018), arc (Kliegr, 2018) and arulesCBA (Johnson and Hahsler, 2019). Each of these packages offers some enhancements over the original CBA algorithm to address some shortcomings of association rule-based classification.
The goal of this paper is to introduce prospective users to the concepts used in associative classification and the CBA algorithm in particular. We provide detailed information on the three available R packages and the enhancements they provide, followed by hands-on examples.
The paper is organized as follows. We first introduce association rule mining, followed by a discussion of the CBA algorithm. We present existing CBA implementations and focus on the features and the use of the three new R implementations. We conclude with a short comparison of the features and a run-time comparison on a typical dataset.

Background: Association rule mining
Associative classifiers like CBA are based on association rules. Mining association rules was first introduced by Agrawal et al. (1993) and, following the notation used by Agrawal et al. (1993), Hahsler et al. (2005), and Tan et al. (2006), can formally be defined as follows. Let D = {t_1, t_2, ..., t_m} be a set of transactions called the database, and let I = {i_1, i_2, ..., i_n} be the set of all items considered in the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. To illustrate the concepts, we use a small example from the supermarket domain introduced by Hahsler et al. (2005). The set of items is I = {milk, bread, butter, beer} and a small database containing five transactions with these items is shown in Figure 1. An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers may also buy butter.

    transaction ID    items
    1                 milk, bread
    2                 bread, butter
    3                 beer
    4                 milk, bread, butter
    5                 bread, butter

Figure 1: Example database with five transactions.

The R Journal Vol. 11/2, December 2019 ISSN 2073-4859
A rule is defined as an expression X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule. Often rules are restricted to only a single item in the consequent. Association rules are rules which meet user-specified minimum support and minimum confidence thresholds. The support, supp(X), of an itemset X is a measure of importance defined as the proportion of transactions in the dataset which contain the itemset. The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y)/supp(X), measuring how likely it is to see Y in a transaction containing X.
An association rule X ⇒ Y needs to satisfy supp(X ∪ Y) ≥ σ and conf(X ⇒ Y) ≥ δ, where σ and δ are the minimum support and minimum confidence thresholds, respectively. For example, the rule {milk, bread} ⇒ {butter} has a support of 1/5 = 0.2 and a confidence of 0.2/0.4 = 0.5 in the database in Figure 1, which means that for 50% of the transactions containing milk and bread, the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS (see, e.g., Hipp et al., 2000).
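The support and confidence values quoted for the supermarket example can be checked with a few lines of base R. This is a minimal sketch: the helper names supp and conf and the threshold values chosen for sigma and delta are assumptions made for illustration, not part of any of the discussed packages.

```r
# The five transactions of the supermarket example, one item vector each.
db <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# supp(X): proportion of transactions that contain every item in X.
supp <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# conf(X => Y) = supp(X u Y) / supp(X).
conf <- function(lhs, rhs, transactions) {
  supp(union(lhs, rhs), transactions) / supp(lhs, transactions)
}

rule_supp <- supp(c("milk", "bread", "butter"), db)  # 1/5
rule_conf <- conf(c("milk", "bread"), "butter", db)  # 0.2/0.4

# Hypothetical thresholds (assumed values, not taken from the paper).
sigma <- 0.1
delta <- 0.4
is_association_rule <- rule_supp >= sigma && rule_conf >= delta
```

With these assumed thresholds the rule {milk, bread} ⇒ {butter} qualifies as an association rule; raising delta above 0.5 would exclude it.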
Another popular measure for the importance of a rule is lift (Brin et al., 1997). The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y)/(supp(X) supp(Y)), and can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. Lift values greater than one indicate positive associations between the rule's LHS and RHS.
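As a small worked example, lift can be computed for two rules in the supermarket database using base R. The helper names supp and lift are assumptions for this sketch.

```r
# The five transactions of the supermarket example.
db <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# supp(X): proportion of transactions that contain every item in X.
supp <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y)).
lift <- function(lhs, rhs, transactions) {
  supp(union(lhs, rhs), transactions) /
    (supp(lhs, transactions) * supp(rhs, transactions))
}

lift_milk_bread <- lift(c("milk", "bread"), "butter", db)  # 0.2 / (0.4 * 0.6)
lift_bread      <- lift("bread", "butter", db)             # 0.6 / (0.8 * 0.6)
```

The first rule has a lift of about 0.83, so in this toy database butter appears together with milk and bread slightly less often than expected under independence. The rule {bread} ⇒ {butter} has a lift of 1.25, indicating a positive association.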
Because associative classification is based on association rules, transaction data is required as the input. Here each object (or instance) needs to be converted into a transaction containing only binary items. Discrete variables can be converted into items using a set of 0-1 dummy variables, one for each possible value. Continuous variables need to be first discretized and then converted. Typically, discretization for associative classifiers is performed using a class-based (also called supervised) discretization strategy, which identifies ranges for several intervals using information from the class variable (Yin and Han, 2003). The most popular method for class-based discretization is Minimum Description Length Principle (MDLP) discretization (Fayyad and Irani, 1993), which uses a greedy procedure to find cut points based on the entropy of the induced partition of the data with respect to the class variable. MDLP was also used in the initial paper on CBA (Liu et al., 1998). One of the advantages of MDLP is that there are no external parameters to be set; the optimal number of cut points is determined automatically using a stopping rule.

The CBA algorithm
Liu et al. (1998) proposed the first approach to associative classification called CBA. In CBA, a special type of association rules called Class Association Rules (CARs) are used for classification. A CAR is an association rule that conforms to the additional constraint that the consequent (RHS) of the rule is a single item that is associated with a class label for the classification problem. CBA proposed the following steps to perform associative classification (Vanhoof and Depaire, 2010):
1. Mine a set of class association rules (CARs),
2. prune and sort the rules,
3. classify new objects using the RHS of the first matching rule.
Within the original paper, the first step is handled by a modification of the popular APRIORI algorithm (Agrawal and Srikant, 1994) for mining CARs. The modification includes an optional pruning step based on the rule's pessimistic error rate with the goal of reducing the size of the set of considered CARs. According to results reported in Liu et al. (1998), the absence of pessimistic pruning does not affect classifier accuracy and a regular implementation of APRIORI can be used. The output of association rule learning algorithms is determined by two parameters, the minimum support and the minimum confidence thresholds. In the context of classification, confidence gives the proportion of objects correctly classified by the rule in the training set. Therefore, it can be seen as an optimistic estimate of the accuracy of the rule.
The main obstacles for straightforward use of the discovered CARs as a classifier are the excessive number of rules discovered even on small datasets, the fact that contradicting rules are generated, and the absence of a default rule. To address these issues, CBA employs rule sorting and a data coverage pruning procedure to reduce the number of rules. Two variants were proposed in the original paper (Liu et al., 1998): the direct M1 version, and the M2 version which reduces data access. Accessing data fewer times is especially useful if the data is too large to be stored in main memory. The amount of available main memory has increased substantially since the original paper was published, making the improvements of M2 less relevant. For pruning, the rules are first ranked in the order of their strength. Given two rules A and B:
1. Rule A is ranked higher than rule B if the confidence of A is greater than that of B.
2. If the confidences are tied, rule A is ranked higher if the support of A is greater than that of B.
3. If both confidence and support are tied, rule A is ranked higher if A is produced before B in the mining process. Since APRIORI applies breadth-first search, rule A is ranked higher if it has fewer conditions (i.e., a smaller antecedent set) than rule B.
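The ranking criteria can be sketched with base R's order(). The rules and their quality measures below are made up for illustration, and antecedent length is used as a stand-in for the generation order of a breadth-first miner.

```r
# Hypothetical CARs; ids and values are invented for this sketch.
cars <- data.frame(
  id         = c("r1", "r2", "r3", "r4"),
  confidence = c(0.90, 0.95, 0.90, 0.90),
  support    = c(0.20, 0.10, 0.30, 0.30),
  lhs_length = c(2, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Sort by confidence (descending), then support (descending), then
# antecedent length (ascending) to break the remaining ties.
ranking <- order(-cars$confidence, -cars$support, cars$lhs_length)
ranked <- cars[ranking, ]
ranked$id
```

Here r2 comes first because of its higher confidence. Among the remaining rules, r3 and r4 beat r1 on support, and r4 precedes r3 because it has the shorter antecedent.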
Rules are processed in ranking order. After each rule is processed, the matching (covered) transactions are removed. If a rule does not correctly cover at least one instance, it is deleted (pruned). In CBA, data coverage pruning is combined with default rule pruning. A default rule is a rule added to the end of the rule set with the majority class in the uncovered transactions in the RHS and an empty LHS. This rule ensures that a query instance is always classified even if it is not matched by any other rule in the classifier. The algorithm prunes all rules below the current rule if a default rule inserted at that place reduces the total number of errors on the training set.
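The combination of data coverage pruning and default rule pruning can be sketched in base R as follows. This is a simplified illustration in the spirit of the M1 variant, not the code used by any of the packages; the toy data, the rule list, and the function names majority and prune_by_coverage are assumptions.

```r
# Toy training data: item sets and class labels (invented for this sketch).
train_items <- list(c("a", "b"), c("a", "b"), c("a"),
                    c("b", "c"), c("c"), c("c"))
train_class <- c("pos", "pos", "neg", "neg", "neg", "pos")

# Hypothetical CARs, assumed to be already sorted by rank.
rules <- list(
  list(lhs = c("a", "b"), rhs = "pos"),
  list(lhs = c("c"),      rhs = "neg"),
  list(lhs = c("a"),      rhs = "pos"),
  list(lhs = c("b"),      rhs = "pos")
)

majority <- function(labels) names(which.max(table(labels)))

prune_by_coverage <- function(rules, items, classes) {
  remaining <- rep(TRUE, length(items))
  kept <- list()
  defaults <- character(0)
  total_errors <- numeric(0)
  rule_errors <- 0
  default_class <- majority(classes)
  for (rule in rules) {
    covers <- remaining &
      vapply(items, function(tr) all(rule$lhs %in% tr), logical(1))
    correct <- covers & classes == rule$rhs
    if (!any(correct)) next           # covers nothing correctly: prune rule
    remaining <- remaining & !covers  # remove the covered transactions
    rule_errors <- rule_errors + sum(covers & !correct)
    # Default class: majority class among the uncovered transactions.
    if (any(remaining)) default_class <- majority(classes[remaining])
    default_errors <- sum(classes[remaining] != default_class)
    kept[[length(kept) + 1]] <- rule
    defaults <- c(defaults, default_class)
    total_errors <- c(total_errors, rule_errors + default_errors)
  }
  if (length(kept) == 0) return(list(rules = list(), default = default_class))
  # Cut the list at the first rule with the lowest total number of errors.
  cut <- which.min(total_errors)
  list(rules = kept[seq_len(cut)], default = defaults[cut])
}

pruned <- prune_by_coverage(rules, train_items, train_class)
```

On this toy input, the last two rules are removed by coverage pruning because they classify no remaining transaction correctly. The rule {c} ⇒ neg is then removed by default rule pruning, since a default rule predicting neg placed after the first rule already reaches the lowest error count (one error).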
Other algorithms. Since CBA was introduced, several competing associative classification approaches have been proposed to improve accuracy, training time, and ruleset size. Two popular extensions of CBA are CMAR (Li et al., 2001) and CPAR (Yin and Han, 2003). A multiclass-focused approach called Multiclass Associative Classification (MAC) (Abdelhamid et al., 2012) has been proposed for expanding CBA with the goal of more accurately addressing classification problems with many different class labels. A related approach is used by rule-induction classifiers, which generate large rulesets and then use greedy pruning strategies to reduce their size while maintaining classification accuracy. Common examples of this technique are RIPPER (Cohen, 1995) and SLIPPER (Cohen and Singer, 1999).
Recently, instead of relying on heuristics, several optimization approaches have been proposed for selecting the rules used by the classifier. Scalable Bayesian Rule Lists Model (Yang et al., 2017) tries to identify a small subset of mined CARs by optimizing the posterior of a Bayesian hierarchical model over rule lists. The method is implemented in the R package sbrl (Yang et al., 2019). Azmi et al. (2019) propose to learn optimal rule weights for associative classifiers that use the sum of the class weight of all matching rules instead of the first rule for classification. The authors use logistic regression with L1 regularization to learn rule weights while enforcing a small rule set. This approach is available in arulesCBA as function RCAR().
While several alternative approaches have been introduced, CBA still acts as a strong contender in associative classification and is typically used as the benchmark against which new methods are assessed (Alcala-Fdez et al., 2011). A comparison between CBA and selected successors is performed in Kliegr (2017).

Implementations
There are only a few implementations of CBA available. Table 1 shows them ordered by the first release date and summarizes the used licenses and programming languages.
In the following, we discuss the three currently available implementations of CBA in R. We will first present each package individually and then compare the packages by providing code for the same classification problem implemented with each of the packages. We will use as the example dataset the well-known iris dataset (Fisher, 1936) and split it into 80% for training and 20% for testing.

The classification problem we use for the examples is to predict a flower's species using the four measurements.
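One way to create the 80%/20% split with base R is sketched below. The random shuffle and the seed value are assumptions made for reproducibility of this sketch; the object names iris_train and iris_test match those used in the examples.

```r
data(iris)

set.seed(42)  # assumed seed, only to make this sketch reproducible
shuffled <- iris[sample(nrow(iris)), ]

# First 80% of the shuffled rows for training, the rest for testing.
n_train <- floor(0.8 * nrow(shuffled))
iris_train <- shuffled[seq_len(n_train), ]
iris_test  <- shuffled[-seq_len(n_train), ]
```

Species is the class variable, and the four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) are the explanatory variables.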
All three packages integrate with the infrastructure for association rule mining in R implemented in package arules (Hahsler et al., 2005) and the ecosystem of related packages (Hahsler et al., 2011). The presented packages can perform discretization, the conversion of a dataset with continuous variables to a set of transactions with binary items, and the mining of class association rules (CARs) internally and transparently to the user. Nevertheless, we give here a short example of how the packages arules and arulesCBA can be used to perform these tasks. First, we discretize the data using supervised discretization based on the minimum description length principle (MDLP), which is offered by packages like discretization (Kim, 2012). Here we use the discretizeDF.supervised function provided in arulesCBA.
    library("arules")
    library("arulesCBA")
    iris_train_disc <- discretizeDF.supervised(Species ~ ., data = iris_train, method = "mdlp")
    head(iris_train_disc)

The discretized data is then converted into transactions.

    trans_train <- as(iris_train_disc, "transactions")

Note that the class variable is translated into several items, all starting with Species=. From these transactions, CARs can be mined by restricting the items which can appear in the right-hand side of the rules. This can be done with the APRIORI implementation available in arules by specifying appearance restrictions.
    rules <- apriori(trans_train,
      parameter = list(support = 0.01, confidence = 0.8),
      appearance = list(rhs = grep("Species=", itemLabels(trans_train), value = TRUE),
        default = "lhs"))

arulesCBA contains a convenience function called mineCARs, which makes setting the appropriate appearance easier by using the standard formula interface.
Test data can be discretized consistently with the training data using discretizeDF, which applies the discretization used in the second argument to the data in the first argument. This is followed by a conversion to transactions.
    iris_test_disc <- discretizeDF(iris_test, iris_train_disc)
    trans_test <- as(iris_test_disc, "transactions")

While these steps are performed in most cases by the discussed packages internally, it is still helpful to understand the process. One of the advantages of associative classifiers is that the rule base can be inspected and, therefore, it is important to understand the transformations used to create items. Next, we will discuss the packages in alphabetical order.

Package arc
The R package arc (Kliegr, 2018) provides a pure R implementation of the rule pruning step of CBA. The association rule learning step is handled by the implementation of APRIORI in package arules. arc implements the M1 version of the CBA pruning step (Liu et al., 1998) and offers, in addition, automatic discretization and threshold tuning. A CBA model can be learned for the iris dataset as follows.
    library("arc")
    classifier <- arc::cba(iris_train, "Species")

The function cba() will create an instance of the S4 class CBARuleModel for the iris dataset using Species as the class variable. Note that discretization is performed and that the support and confidence thresholds are automatically found.
The resulting object holds a list of rules, a list of cut points (if discretization was automatically performed), the name of the class attribute, and a list of attribute types. The slot rules of the CBARuleModel object contains the rule base, which can be inspected with inspect(classifier@rules).
Predictions for new data can be obtained using predict(). The new data is discretized automatically to match the rules.
Automatic discretization. Since association rule classification is a supervised task, the discretization can take advantage of the class label. In the arc package, automatic discretization with MDLP is enabled by default and relies on the discretization package (Kim, 2012). Only numeric explanatory attributes containing at least a preset number of distinct values (three by default) are discretized. The arc package provides several convenience functions that make it possible to discretize all attributes at once and that address some of the shortcomings of the mdlp function from the discretization package, such as its inability to handle missing values or to skip non-numeric attributes. The package is also capable of discretizing the target attribute if necessary. For this purpose, unsupervised discretization (clustering) is used.
Automatic threshold tuning. Setting the minimum support and minimum confidence thresholds for association rule learning is notoriously difficult, and the necessity to set these thresholds also applies to CBA. The arc package contains an optional procedure for setting these thresholds automatically, which is detailed in Kliegr and Kuchar (2019). The package contains a wrapper for the apriori function from the arules package that iteratively changes the mining parameters (maximum antecedent length, minimum support threshold, and minimum confidence threshold) until a desired number of rules is obtained, all options are exhausted, or a preset time limit is reached. The desired number of rules can be specified by the target_rule_count parameter.
The arc package also supports manual specification of thresholds:

    classifier <- arc::cba(iris_train, "Species",
      rulelearning_options = list(minsupp = 0.05, minconf = 0.9,
        minlen = 1, maxlen = 5, maxtime = 1000, target_rule_count = 50000,
        trim = TRUE, find_conf_supp_thresholds = FALSE))
    classifier@rules

    set of 3 rules

Unlike other implementations of CBA, which also implement the M2 version of CBA described by Liu et al. (1998), the arc package relies solely on the M1 version. However, the implementation does not follow the originally proposed approach of iteratively processing the rules in sort order. Instead, the pruning steps in M1 are implemented using a more efficient multiplication of sparse matrices exposed by the arules package, which relies on the optimized C code from the Matrix package (Bates and Maechler, 2017).
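The general idea of determining rule coverage through matrix multiplication can be illustrated with small dense matrices in base R. This is only a sketch of the principle under assumed toy data; it is not taken from the arc source code, which operates on sparse matrices.

```r
items <- c("a", "b", "c")

# Transactions x items as a 0/1 matrix (toy data for this sketch).
trans <- matrix(c(1, 1, 0,
                  1, 0, 0,
                  0, 1, 1,
                  1, 1, 1),
                ncol = 3, byrow = TRUE, dimnames = list(NULL, items))

# Rule antecedents x items as a 0/1 matrix: r1 = {a, b}, r2 = {c}.
rule_lhs <- matrix(c(1, 1, 0,
                     0, 0, 1),
                   ncol = 3, byrow = TRUE,
                   dimnames = list(c("r1", "r2"), items))

# Entry [i, j] counts how many antecedent items of rule j occur in
# transaction i.
overlap <- trans %*% t(rule_lhs)

# A rule covers a transaction if all of its antecedent items are present,
# i.e., the overlap equals the antecedent size.
covers <- sweep(overlap, 2, rowSums(rule_lhs), FUN = "==")
```

A single matrix product yields the coverage of all rules on all transactions at once, which avoids looping over the rules one by one.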

Package arulesCBA
The arulesCBA package (Johnson and Hahsler, 2019) is an extension of the arules package and strives to integrate seamlessly with its association rule mining infrastructure. The package allows the user to set the time limit for rule mining that is exposed by the arules package. The core operations of arulesCBA are implemented in a mixture of R and C to speed up processing. arulesCBA implements both versions of the pruning step, M1 and the optimized M2 version. The code for the pruning algorithm is heavily optimized by using a rule-indexed sparse matrix representation, sparse matrix operations via package Matrix (Bates and Maechler, 2017), and prefix trees.
The arulesCBA interface. In arulesCBA, classifiers are created using the CBA() function. An advantage of this package for R users is that it consistently uses the well-known formula interface for building classifier models and for supervised discretization. Users can provide a number of options to the function to tune discretization, rule mining, and model building. The following is the list of available parameters to the CBA function.
• formula: A symbolic description of the model to be fitted using a standard formula object of the form class ~ explanatory variables. The class is the variable name (the part of the item label before =). Explanatory variables are separated using +, and the special dot symbol . for all variables is also allowed.
• data: A data.frame containing the training data. If necessary, discretization is automatically applied. Alternatively, a set of transactions can be supplied.
• support, confidence: Minimum support and confidence thresholds for mining CARs with APRIORI.
• parameter, control: Parameter and control lists passed on to the apriori() function from the arules package.
A classifier for the iris dataset can be learned as follows.
    library("arulesCBA")
    classifier <- arulesCBA::CBA(Species ~ ., data = iris_train, supp = 0.05, confidence = 0.9)
    classifier

    CBA Classifier Object
    Class: Species (labels: setosa, versicolor, virginica)
    Default Class: Species=setosa
    Number of rules: 2
    Classification method: first
    Description: CBA algorithm by Liu, et al. 1998 with support=0.05 and confidence=0.9

CBA() returns an object of class CBA which contains all needed information for classification. A print method shows the settings used for the classifier. Prediction follows the usual approach in R. Note that only two rules are shown, while arc above produced three rules. The reason is that arulesCBA stores the default class Species=setosa separate from the rule base while arc includes it.
Advanced use of arulesCBA. arulesCBA is implemented with flexibility and future extensions in mind. For example, to have optimal control over the discretization process, the user can discretize the data manually before learning the classifier. The discretization functions in arules and arulesCBA retain enough information so that predict() can later automatically discretize the new data.
Another extension, implemented in CBA_ruleset(), allows the user to create an associative classifier by providing a custom rule base in the form of a rules object. For example, a classifier can be created from a set of CARs that uses majority voting instead of CBA's first-match strategy for classification.
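The difference between first-match classification and majority voting can be sketched in base R. The rule base, the default class, and the function names below are hypothetical and independent of how CBA_ruleset() is implemented.

```r
# Hypothetical ranked rule base and default class.
rules <- list(
  list(lhs = c("a"),      rhs = "pos"),
  list(lhs = c("b"),      rhs = "neg"),
  list(lhs = c("b", "c"), rhs = "neg")
)
default_class <- "pos"

# Class labels predicted by all rules whose antecedent matches the instance,
# in rank order.
matching_rhs <- function(instance, rules) {
  hits <- vapply(rules, function(r) all(r$lhs %in% instance), logical(1))
  vapply(rules[hits], function(r) r$rhs, character(1))
}

# First match: the highest-ranked matching rule decides.
classify_first <- function(instance, rules, default_class) {
  rhs <- matching_rhs(instance, rules)
  if (length(rhs) == 0) default_class else rhs[1]
}

# Majority voting: the most frequent class among all matching rules decides.
classify_majority <- function(instance, rules, default_class) {
  rhs <- matching_rhs(instance, rules)
  if (length(rhs) == 0) default_class else names(which.max(table(rhs)))
}

instance <- c("a", "b", "c")
first_pred    <- classify_first(instance, rules, default_class)
majority_pred <- classify_majority(instance, rules, default_class)
```

For this instance all three rules match. The first-match strategy predicts pos because the top-ranked rule fires, while majority voting predicts neg because two of the three matching rules vote for it.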

Package rCBA
The rCBA package (Kuchar, 2018) was the first available implementation of the CBA algorithm on CRAN. The main algorithms are implemented in Java, and it is the only R implementation that supports the use of multiple CPU cores during pruning. The package provides wrapper functions for pruning, prediction, and the FP-Growth association rule mining algorithm (Han et al., 2004). rCBA includes both the M1 and the M2 versions of the CBA algorithm. It also includes data coverage pruning and automatic threshold tuning.
Model building with automatic tuning of parameters and APRIORI is done as follows.

Selection of algorithms for rule learning. The CBA algorithm can generally rely on any rule learning algorithm (Liu et al., 1998). By default, rCBA uses the APRIORI implementation in arules, but it can also use its own implementation of the FP-Growth algorithm (Han et al., 2004) for the association learning step.

    rulebase <- rCBA::fpgrowth(iris_train, support = 0.05, confidence = 0.9,
      consequent = "Species")
    rulebase <- rCBA::pruning(iris_train, rulebase, method = "m2cba")
    rCBA::classification(head(iris_test), rulebase)

    [1] versicolor versicolor versicolor versicolor setosa versicolor
    Levels: setosa versicolor

Automatic threshold tuning. Since pure random or grid search does not use any background knowledge of the algorithm, these approaches are unsuitable for optimizing the parameters of association rule learning. The implementation for the parameter optimization in rCBA is based on the simulated annealing (SA) algorithm, which addresses these problems. The objective that is optimized is the accuracy of the model. A detailed description of the approach can be found in Kliegr and Kuchar (2019).

Comparison of R implementations
To help the user decide which package best addresses a particular use case, Table 2 presents a comparison of the features and limitations of the packages. Since all three packages implement the same algorithm, we did not compare classification accuracy between the implementations, but performed a small run-time comparison instead.
We compare the different implementations on some standard classification problems. The datasets used are available in the packages mlbench, datasets, and arules; the Lymphography dataset (Lymph) (Mickalski et al., 1986) was obtained from the UCI repository. The most important dataset characteristics are summarized in Table 3. The number of transactions ranges from 101 to 48842 and the number of items (after discretization) from 15 to 147. For the comparison, we used a minimum confidence threshold of 0.5 and a maximal rule length of 10, and set the minimum support so that a reasonable number of class association rules (CARs) was produced. CBA pruned the CARs to between 6 and 143 rules and achieved an accuracy (in-sample testing) of typically around 90%. Only difficult datasets like Pima, Vehicle, and Adult have worse results.
To compare run time, we conducted experiments on a standard laptop with an Intel Core i5-8250U CPU @ 1.60GHz with 4 cores and 8GB of RAM running R version 3.6.1 on Ubuntu 19.10. The package versions used for the comparison are: arc: 1.2, rCBA: 0.4.3, arulesCBA: 1.1.5. We disabled automatic threshold tuning. To remove the effect of random system load, we executed each algorithm ten times on each dataset and report the average execution time. The results are summarized in Table 4. arc produces the longest run times due to its pure R implementation. For Adult, the largest dataset, arc ran out of memory. rCBA executes faster than arc. Both M2 and parallel execution using multi-core support in Java only improve the run time for the largest dataset. However, the improvement there is quite significant, reducing the run time to a third. arulesCBA's M1 implementation is on average the fastest, while the M2 implementation's performance deteriorates on larger datasets.
Since many datasets of interest are typically larger than the standard datasets, we perform additional experiments to assess run time sensitivity to the number of input rules and the dataset size. For the experiments, we use the Lymph dataset. For assessing sensitivity to ruleset size, we oversample the dataset to 500 transactions and mine CARs with a minimum support of 0.05, a minimum confidence of 0.5, and a maximal rule length of 10. This results in more than 100000 rules. We then evaluate run time for building classifiers from the first 100, 1000, 10000, and 100000 mined rules. The results are shown in Figure 2(a). We see that M2 is generally slower than the corresponding M1 implementations. This might be due to the fact that the tested implementations hold all data in main memory, while M2 was designed for situations where the data does not reside in main memory. However, parallel execution helps rCBA's M2 implementation. arulesCBA's M1 implementation is the fastest.
To assess the sensitivity to dataset size, we fixed the ruleset size to 500 and increased the dataset size by oversampling every round by a factor of 2. In Figure 2(b), we see a similar result to the sensitivity to the number of rules. Parallel execution in rCBA helps both algorithms and arulesCBA's M1 implementation is the fastest. All packages are integrated with the arules infrastructure, where arulesCBA has the most consistent integration. arc and rCBA offer automatic threshold tuning, which will help users with applying associative classification for practical applications.

Conclusion
In this paper, we reviewed associative classifiers based on the CBA algorithm. While the algorithm is cited in many papers about classifiers based on association rule mining, there are only very few implementations available. This paper discussed three recent implementations in R packages. Due to the differences in implementation language (R, C, and Java) and additional implemented features,
[Figure 2(b): Sensitivity to dataset size.]