Subgroup Discovery with Evolutionary Fuzzy Systems in R: the SDEFSR Package

Subgroup discovery is a data mining task halfway between descriptive and predictive data mining. It is currently very relevant for researchers because the knowledge it extracts is simple and interesting. Evolutionary fuzzy systems are well suited to this task because they can find a good trade-off between multiple objectives in large search spaces. This paper presents the SDEFSR package, which contains all the evolutionary fuzzy systems for subgroup discovery presented throughout the literature. The package has no dependencies on other software and provides functions with recommended default parameters. In addition, it includes a graphical user interface so that the user does not need to know all the parameters of the algorithms.

Subgroup discovery (SD) is a rule learning process within a complex search space. Therefore, the search strategy becomes a key factor in the efficiency of the method. Different strategies can be found in the literature, such as beam search in the algorithms CN2-SD [Lavrač et al.(2004b)] and Apriori-SD [Kavšek and Lavrač(2006)], exhaustive search in algorithms such as SDMap [Atzmueller and Puppe(2006)], or Evolutionary Algorithms (EAs).
EAs are stochastic algorithms for optimization and search tasks, based on the natural evolution process [John(1992)]. There are different paradigms within EAs: genetic algorithms [John(1992), Goldberg(1989)], evolution strategies [Schwefel(1995)], evolutionary programming [Fogel(2006)] and genetic programming [Koza(1992)]. When these methods use rules as the knowledge representation, they are known as evolutionary rule-based systems [Freitas(2003)], which have the advantage of allowing the inclusion of domain knowledge and of returning better rules. EAs are very well suited to SD because they perform a global search of the space, which favours obtaining simple, interesting and precise subgroups.
Several frameworks currently support data mining tasks, but only a few of them include implementations of SD algorithms. The best-known frameworks with SD algorithms are KEEL [Alcalá-Fdez et al.(2011)] and VIKAMINE [Atzmueller and Lemmerich(2012)], although ORANGE [Demšar et al.(2013)] and CORTANA [Meeng and Knobbe(2011)] provide some implementations too. VIKAMINE also provides an R package called rsubgroup [Atzmueller(2014)], which is an R interface to the VIKAMINE Java algorithms.
This contribution introduces the SDEFSR package. It provides the user with the most important evolutionary fuzzy rule-based methods for SD documented in the literature, which makes it a major contribution to the R community. It can also read datasets in the KEEL data format, a file format that R does not support. In addition, the package provides a Graphical User Interface (GUI) that makes these tasks easier for the user, especially the novice one.
This contribution is organized as follows. The first section presents the concepts, main properties and features of SD. The second section describes the structure of the SDEFSR package and its operations. The third section presents a complete example of use of the package. Finally, the GUI of SDEFSR is shown.

Subgroup Discovery
SD was defined by [Wrobel(2001)] as: "In subgroup discovery, we assume we are given a so-called population of individuals (objects, customers, ...) and a property of those individuals we are interested in. The task of subgroup discovery is then to discover the subgroups of the population that are statistically 'most interesting', i.e. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest." SD tries to find relations between different properties of a set with respect to one interesting or target variable. Such relations must be statistically interesting, so it is not necessary to find complete relations; partial relations can be interesting too.
Usually these relations are represented by means of rules, one per relation. These rules are defined as [Lavrač et al.(2004a), Gamberger and Lavrac(2002)]:
$$R: Cond \rightarrow Target_{value}$$
where Target_value is the value of the variable of interest (target variable) for the SD task and Cond is normally a conjunction of attribute-value pairs that describe the characteristics of the induced subgroup.
SD is halfway between description and classification, because it has a target variable but its objective is not to predict it but rather to describe it. The use of a target variable is not possible in description, because description simply finds relationships between unlabeled objects.
A key point to fully understanding the goal of SD is how it differs from the classification objective. Classification is a predictive task that tries to split the entire search space, usually in a complex way, aiming to find the correct value of the variable in new incoming instances. SD, on the other hand, aims to find interesting relations among these instances regarding the variable of interest.
For instance, assume there is a group of patients and the variable of interest is whether they have heart disease or not. The predictive data mining objective is to predict whether new patients will have heart disease. SD, however, tries to find which subgroups of patients are more likely to have heart disease according to certain characteristics. This is relevant, for instance, for developing treatments that target those characteristics.

Main elements of subgroup discovery algorithms
The most relevant aspects of SD algorithms are presented below [Atzmueller et al.(2004)]:
• Type of target variable: This is the kind of information the variable of interest can hold. The target variable can be binary (two possible values), categorical (n possible values) or numerical (a real value within a range). Nevertheless, the majority of SD algorithms can only deal with binary or categorical target variables.
• Description language: Knowledge representation is a key factor in SD due to its descriptive nature, so rules must be as simple as possible. Rules are usually represented by conjunctions of attribute-value pairs or in disjunctive normal form. Fuzzy logic can also be included in the rules in order to improve the interpretability of the knowledge [Zadeh(1975), Hüllermeier(2005)].
• Quality measures: This is the most important aspect in the design of SD algorithms. The quality measures must guide the learning process and must show the quality of the extracted knowledge. They are briefly described below.
• Search strategy: The search space grows exponentially with the number of variables. It is therefore very important to use a search strategy that can find a good solution, or the optimal one, by searching efficiently through the whole search space.

Quality measures for subgroup discovery
A quality measure tries to quantify the interestingness of a given rule or subgroup, but there is no formal definition of what interestingness is. However, interestingness can be defined as a concept that emphasizes conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability [Geng and Hamilton(2006)]. For SD, the criteria most used to measure the interestingness of a rule are conciseness, generality or coverage, reliability, and novelty [Carmona et al.(2014)]. The quality measures available in the SDEFSR package that address these criteria are:
• Measures for conciseness (or complexity). They measure the complexity of the induced rules. Rules with few attribute-value pairs are easy to remember and to add to the expert's knowledge. The quality measures associated with this criterion are:
- Nr: The number of rules generated. A rule set with a large number of rules is much more difficult to remember than one with fewer rules. Additionally, the lower the number of rules, the easier it is for the expert to filter those that are interesting.
- Nv: The number of variables in the generated rules. Rules with fewer variables are easier to understand and to remember, and they tend to be more general. Thus, rules with a low number of variables are interesting.
• Measures for generality (or coverage). They measure the capacity of a rule to match a large number of examples in the dataset. Such a rule also has a greater capacity to generalize to instances that are not in the training dataset. The quality measures associated with this criterion are:
- Support: It measures the frequency of correctly classified examples covered by the rule:
$$Sup(R) = \frac{n(Target_{value} \cdot Cond)}{n_s}$$
where n(Target_value · Cond) is the number of examples that satisfy the condition and belong to the target value, and n_s is the number of examples in the dataset.
• Measures for novelty. A rule is novel if the knowledge obtained from it is unknown to the user and cannot be inferred from other rules. For this criterion, the quality measures available in the package are:
- Significance: It reflects the novelty in the distribution of the examples covered by the rule with respect to the whole dataset:
$$Sig(R) = 2 \cdot \sum_{k=1}^{n_c} n(Target_{value_k} \cdot Cond) \cdot \log \frac{n(Target_{value_k} \cdot Cond)}{n(Target_{value_k}) \cdot p(Cond)}$$
where p(Cond) = n(Cond) / n_s, n_c is the number of possible values of the target variable and Target_value_k is the k-th value of the target variable.
• Hybrid measures. These measures try to maximize more than one criterion with a single expression that finds a good trade-off between the criteria used. The hybrid quality measures implemented are:
- Unusualness: It is defined as the weighted relative accuracy of a rule and tries to maximize generality and reliability:
$$Unus(R) = \frac{n(Cond)}{n_s} \left( \frac{n(Target_{value} \cdot Cond)}{n(Cond)} - \frac{n(Target_{value})}{n_s} \right)$$
In the definitions of these measures, $\overline{Target_{value}}$ denotes the negation of Target_value, i.e., the examples that do not belong to the target class. A more extended classification and definition of quality measures for SD is available in [?] and [Atzmueller(2015)].
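As an illustration of how such measures are obtained from simple counts, the following sketch computes coverage, support, confidence and unusualness for a single rule. It is not code from the SDEFSR package: the function `rule_quality()` and its two logical input vectors are hypothetical, and the formulas are the usual textbook definitions (confidence as n(Target_value · Cond) / n(Cond), unusualness as weighted relative accuracy).

```r
# Hypothetical helper (not part of SDEFSR): quality measures of one rule.
# covered[i]   is TRUE when example i satisfies the antecedent (Cond)
# is_target[i] is TRUE when example i belongs to the target value
rule_quality <- function(covered, is_target) {
  stopifnot(length(covered) == length(is_target))
  ns     <- length(covered)            # number of examples, n_s
  n_cond <- sum(covered)               # n(Cond)
  n_tc   <- sum(covered & is_target)   # n(Target_value . Cond)
  n_t    <- sum(is_target)             # n(Target_value)

  coverage    <- n_cond / ns
  support     <- n_tc / ns
  confidence  <- if (n_cond > 0) n_tc / n_cond else 0
  # Weighted relative accuracy: coverage times the gain in accuracy
  unusualness <- coverage * (confidence - n_t / ns)

  c(coverage = coverage, support = support,
    confidence = confidence, unusualness = unusualness)
}

# Ten examples: the rule covers four of them, three of which are in the target class
covered   <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
is_target <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
qm <- rule_quality(covered, is_target)
print(qm)
```

A positive unusualness indicates that the target value is more frequent inside the subgroup than in the whole dataset.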

Evolutionary fuzzy systems
Evolutionary fuzzy systems (EFS) are the union of two powerful tools for approximate reasoning: evolutionary algorithms and fuzzy logic.
On the one hand, evolutionary algorithms [Eiben and Smith(2003)] are stochastic algorithms based on natural evolution that solve complex optimization problems. They work on a population of representations of possible solutions, called chromosomes. A basic evolutionary algorithm scheme is:
1. Generate the initial population.
2. Evaluate the chromosomes of the population. This is the most important and expensive part of a genetic algorithm (GA). In the algorithms of this package, the quality measures described above are used as the evaluation function.
3. Select the chromosomes to which the genetic operators will be applied.
4. Apply the genetic operators. The most used are:
• Crossover operator, which takes two chromosomes and generates two descendants as a combination of the elements of the parents.
• Mutation operator, which takes a chromosome and randomly changes a gene (a value of the possible solution).
5. Replace the population with the newly generated chromosomes.
6. Go to step 2 until a stopping criterion is reached. Normally this criterion is a number of evaluations or generations.
These algorithms perform an efficient global stochastic search through a huge search space. They may not find an optimal solution (the global optimum), but they usually find a good one (a local optimum) that can solve the problem too. They are well suited for SD because the problem of finding subgroups can be formulated as an optimization problem in which rules are coded as a structure of parameters that optimizes some measures. Additionally, different kinds of rules exist in SD (with inequalities, with intervals, fuzzy rules, ...). This can change the size of the search space dramatically, and genetic algorithms can adapt to these structures easily without performance degradation. Likewise, the choice of genetic operators makes GAs great candidates for introducing expert knowledge [Carmona et al.(2014)].
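The six steps above can be sketched on a toy problem. The following example is only an illustration and is not one of the algorithms of the package: it maximizes the number of ones in a bit string, and the function name `run_toy_ga()`, its parameters, the binary tournament selection and the elitist replacement are all assumptions made for the sketch.

```r
# Toy genetic algorithm (illustrative only, not an SDEFSR algorithm):
# maximise the number of ones in a binary chromosome.
run_toy_ga <- function(n_genes = 20, pop_size = 30, generations = 40,
                       mut_prob = 0.02, seed = 1) {
  set.seed(seed)
  fitness <- function(chrom) sum(chrom)

  # 1. Generate the initial population (one chromosome per row)
  pop <- matrix(sample(0:1, pop_size * n_genes, replace = TRUE), nrow = pop_size)
  best_history <- numeric(generations)

  for (g in seq_len(generations)) {
    # 2. Evaluate the chromosomes
    fit <- apply(pop, 1, fitness)
    best_history[g] <- max(fit)

    # 3. Select parents by binary tournament
    pick <- function() {
      cand <- sample(pop_size, 2)
      cand[which.max(fit[cand])]
    }

    # 4a. Two-point crossover: swap the segment between two cut points
    offspring <- pop
    for (i in seq(1, pop_size - 1, by = 2)) {
      p1 <- pop[pick(), ]
      p2 <- pop[pick(), ]
      cuts <- sort(sample(n_genes, 2))
      idx <- cuts[1]:cuts[2]
      c1 <- p1
      c2 <- p2
      c1[idx] <- p2[idx]
      c2[idx] <- p1[idx]
      offspring[i, ] <- c1
      offspring[i + 1, ] <- c2
    }

    # 4b. Mutation: flip each gene with probability mut_prob
    flip <- matrix(runif(pop_size * n_genes) < mut_prob, nrow = pop_size)
    offspring[flip] <- 1 - offspring[flip]

    # 5. Replace the population, keeping the best parent (elitism, an assumption)
    offspring[1, ] <- pop[which.max(fit), ]
    pop <- offspring
    # 6. The loop repeats until the number of generations is reached
  }
  list(best = max(apply(pop, 1, fitness)), history = best_history)
}

result <- run_toy_ga()
print(result$best)
```

Because the best parent always survives, the best fitness recorded in `history` never decreases from one generation to the next.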
On the other hand, fuzzy logic [Zadeh(1975)] is an extension of traditional set theory. Its main objective is to model and deal with imprecise knowledge, which is closer to human reasoning. The main difference from traditional set theory is that the belonging degree is not zero or one, but a real value in [0,1]. This makes it possible to define fuzzy limits and allows fuzzy sets to overlap.
A fuzzy variable is a set of linguistic labels, e.g. low, medium and high, which are defined by overlapping fuzzy sets. This information is closer to human reasoning, and the value of each belonging degree can still be calculated precisely. This expressiveness makes it possible to obtain simpler rules, because continuous variables described by linguistic labels are more understandable for humans. A rule can be a set of fuzzy variables. To determine whether the rule covers an example, it is necessary to calculate the belonging degree of each variable in the rule with respect to the example. If all variables have a belonging degree greater than zero, the example is covered.
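A minimal sketch of these ideas follows. It assumes triangular fuzzy sets that partition the range of each variable uniformly, which is a common choice but an assumption here; the functions `triangular()` and `fuzzy_labels()` are hypothetical helpers and not part of the package.

```r
# Belonging degree of x to a triangular fuzzy set with support [a, c] and peak at b
triangular <- function(x, a, b, c) {
  pmax(0, pmin((x - a) / (b - a), (c - x) / (c - b)))
}

# Belonging degrees of x to n_labels overlapping triangular sets
# spread uniformly over [lower, upper] (assumed partition)
fuzzy_labels <- function(x, lower, upper, n_labels = 3) {
  width <- (upper - lower) / (n_labels - 1)
  peaks <- lower + width * (0:(n_labels - 1))
  sapply(peaks, function(p) triangular(x, p - width, p, p + width))
}

# Variable X1 in [0, 10] with labels low, medium, high; the example has X1 = 2
degrees_x1 <- fuzzy_labels(2, 0, 10)
names(degrees_x1) <- c("low", "medium", "high")
print(degrees_x1)

# Rule "X1 = low AND X2 = high" for an example with X1 = 2 and X2 = 9
degrees_x2 <- fuzzy_labels(9, 0, 10)
names(degrees_x2) <- c("low", "medium", "high")
rule_degrees <- c(degrees_x1["low"], degrees_x2["high"])
covered <- all(rule_degrees > 0)   # covered when every degree is above zero
print(covered)
```

Here the value 2 belongs mostly to low and partly to medium, which shows the overlap between neighbouring fuzzy sets.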
Evolutionary fuzzy systems are the union of both techniques and can work in three ways [Herrera(2008)]:
• Evolutionary algorithms that evolve the fuzzy rules (changing the number of variables and their values) and use fuzzy set definitions given by the user. This approach is used by all the algorithms implemented in the SDEFSR package.
• Evolutionary algorithms that evolve the fuzzy sets, changing the number of fuzzy sets for each variable, their shapes, etc.
• Evolutionary algorithms that evolve both rules and fuzzy sets.

The SDEFSR package
SDEFSR is a package written entirely in R from scratch. It includes the most important EFS for SD presented throughout the literature to date. In addition, SDEFSR can read data in different standard and well-known formats such as ARFF (Weka), KEEL and CSV, as well as data.frame objects. All functions included in the SDEFSR package have default parameters with values recommended by the authors. This allows final users to execute the algorithms easily, without needing to know the parameters.

Algorithms of the package
We now describe the algorithms that are available in our package. It contains four algorithms: SDIGA [?], MESDIF [Berlanga et al.(2006)], NMEEF-SD [Carmona et al.(2010a)] and FuGePSD [?]. In Section 4 we describe how to use the algorithms.

SDIGA (Subgroup Discovery Iterative Genetic Algorithm)
SDIGA is a subgroup discovery algorithm that extracts rules with one of two possible representations: conjunctions of attribute-value pairs (the canonical representation) or a disjunctive normal form (DNF) in the antecedent. It follows an iterative scheme with a genetic algorithm at its core. Each run of the genetic algorithm extracts a single rule, the best of the population, and after the extraction a local optimization can be performed to improve the generality of the rule. The genetic algorithm is executed repeatedly while the rule obtained after the local improvement phase covers at least one example not covered by the previous rules and reaches a minimum confidence (see Equation 4); otherwise the process stops. SDIGA can work with missing data (represented by the maximum value of the variable + 1), categorical variables and numerical variables, using fuzzy logic with the latter to improve the interpretability of the results.

Components of the genetic algorithm of SDIGA
As mentioned above, a genetic algorithm is the core of SDIGA. This genetic algorithm is responsible for extracting one rule per execution, namely the best rule of the population at the end of the evolutionary process.
3.1.1.1.1 Representation schema
Each chromosome in the population represents a possible rule, but it only encodes the antecedent part because the consequent is fixed in advance. As mentioned above, SDIGA can handle two types of representation, canonical and DNF. The canonical representation is a numerical vector of fixed length equal to the number of variables, with values in the range [0, max], where max is the number of possible values for categorical variables or the number of fuzzy sets defined for numerical variables. The value max itself indicates that the variable does not participate in the rule. The DNF representation is a binary vector, also of fixed length, equal to the sum of the numbers of values or fuzzy sets of all the variables. Here, a variable does not participate in the rule when all of its values are equal to zero or all are equal to one.
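To make both encodings concrete, the following sketch decodes a chromosome into a readable antecedent. The functions `decode_canonical()` and `decode_dnf()`, the variable names and the label names are hypothetical and only follow the description above; the package's internal representation may differ in its details.

```r
# labels: named list; labels[[i]] holds the values or fuzzy labels of variable i
labels <- list(age = c("low", "medium", "high"), sex = c("male", "female"))

# Canonical: one gene per variable, values 0..max, where max means "not in the rule"
decode_canonical <- function(chrom, labels) {
  parts <- character(0)
  for (i in seq_along(chrom)) {
    n_values <- length(labels[[i]])
    if (chrom[i] < n_values) {
      parts <- c(parts, paste0(names(labels)[i], " = ", labels[[i]][chrom[i] + 1]))
    }
  }
  paste(parts, collapse = " AND ")
}

# DNF: one binary gene per value; a variable with all zeros or all ones is ignored
decode_dnf <- function(chrom, labels) {
  ends <- cumsum(lengths(labels))
  starts <- ends - lengths(labels) + 1
  parts <- character(0)
  for (i in seq_along(labels)) {
    genes <- chrom[starts[i]:ends[i]]
    if (any(genes == 1) && any(genes == 0)) {
      active <- paste(labels[[i]][genes == 1], collapse = " OR ")
      parts <- c(parts, paste0(names(labels)[i], " = (", active, ")"))
    }
  }
  paste(parts, collapse = " AND ")
}

rule_a <- decode_canonical(c(2, 2), labels)        # sex has value max = 2: ignored
rule_b <- decode_canonical(c(0, 1), labels)
rule_c <- decode_dnf(c(1, 0, 1, 1, 1), labels)     # sex is all ones: ignored
print(c(rule_a, rule_b, rule_c))
```

The DNF encoding can express disjunctions inside a variable, such as "age = (low OR high)", which the canonical encoding cannot.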
3.1.1.1.2 Crossover operator
SDIGA follows a pure steady-state (stationary) scheme that crosses only the two best chromosomes in each iteration of the evolutionary process. The crossover is performed by a two-point crossover operator.

Mutation operator
The mutation operator is applied over the whole population of parents and the chromosomes generated by crossover. The mutation probability is applied to every gene. The operator can act in two ways, both with the same probability:
• Eliminate the variable: it puts the no-participation value in that variable.
• Put a random value: it puts a random value in that variable. The no-participation value is also a possible outcome.
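The two mutation modes can be sketched for the canonical representation as follows. This is an illustrative reading of the description above and not the package's code: the function `mutate_canonical()` and its arguments are hypothetical, and `n_values[i]` plays the role of the no-participation value of variable i.

```r
# chrom:    canonical chromosome, one gene per variable
# n_values: number of values (or fuzzy sets) of each variable;
#           n_values[i] is also the "does not participate" code of gene i
mutate_canonical <- function(chrom, n_values, mut_prob) {
  for (i in seq_along(chrom)) {
    if (runif(1) < mut_prob) {            # mutation probability applied gene by gene
      if (runif(1) < 0.5) {
        chrom[i] <- n_values[i]           # eliminate the variable
      } else {
        chrom[i] <- sample(0:n_values[i], 1)  # random value, elimination included
      }
    }
  }
  chrom
}

set.seed(42)
n_values <- c(3, 2, 4)
original <- c(0, 1, 2)
mutated <- mutate_canonical(original, n_values, mut_prob = 0.5)
untouched <- mutate_canonical(original, n_values, mut_prob = 0)
print(mutated)
```

Biasing half of the mutations towards elimination pushes the search towards rules with fewer variables, which tend to be more general.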

Fitness function
The function that defines the quality of a chromosome in SDIGA is a weighted average of three of the quality measures described in section 2.2. The function to maximize is:
$$f(x) = \frac{\sum_{i=1}^{3} w_i \cdot Obj_i(x)}{\sum_{i=1}^{3} w_i}$$
where:
• Sop(x) is the local support. This support is a modification of support (see Equation 2) and can be computed as a crisp support or as a fuzzy support. In these expressions, E_CC denotes the correctly covered examples and E_C the covered examples.
• Obj3(x) is another quality measure of section 2.2.
• w_i is the weight of objective i.
As these rules use fuzzy logic, an expression is needed to determine whether an example is covered by a rule and to determine the belonging degree of that example to the rule. This is done with the APC (Antecedent Part Compatibility) expression, which is calculated as:
$$APC(e, R) = T_{i}\left( TC_{n}\left( \mu_n^i(e_i) \right) \right)$$
where T aggregates over the variables i that participate in the rule, TC aggregates over the sets n of variable i used by the rule, and:
• e_i is the value of the example for the variable i.
• µ_n^i is the belonging degree to the set n of the variable i.
• TC is the fuzzy t-conorm, in this case the maximum t-conorm.
• T is the fuzzy t-norm, in this case the minimum t-norm.
If the variable is categorical, µ_n^i is one when its value matches the value in the rule and zero otherwise. If the variable is numerical, µ_n^i is computed with the triangular belonging degree function.
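The APC computation can be sketched in a few lines once the belonging degrees are known. The function `apc()` and the example degrees below are hypothetical; the sketch only assumes what the text states, namely the maximum t-conorm within each variable and the minimum t-norm across variables.

```r
# degrees: list with one numeric vector per variable that participates in the rule;
# each vector holds the belonging degrees of the example to the sets of that
# variable used by the rule (0 or 1 for categorical variables)
apc <- function(degrees) {
  per_variable <- vapply(degrees, max, numeric(1))  # maximum t-conorm (TC)
  min(per_variable)                                 # minimum t-norm (T)
}

# Rule "age = (low OR medium) AND income = high" evaluated on one example
example_degrees <- list(age = c(0.6, 0.4), income = c(0.8))
apc_value <- apc(example_degrees)
is_covered <- apc_value > 0
print(apc_value)

# A categorical mismatch gives degree 0 and the example is not covered
mismatch <- apc(list(age = c(0.6, 0.4), sex = c(0)))
print(mismatch)
```

With the minimum t-norm, a single variable with degree zero is enough to leave the example uncovered, which matches the coverage criterion for fuzzy rules.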
3.1.1.1.5 Replace operator
To obtain the population for the next iteration, a replacement operator is applied. This operator sorts the population by fitness value and keeps only the n best chromosomes, where n is the size of the population.

Local optimization
After the genetic algorithm returns a rule, the rule can be improved by a local optimization based on a first-best hill-climbing local search. The algorithm eliminates variables one at a time; if the resulting rule has support and confidence greater than or equal to those of the current rule, the process starts again from the rule without that variable.

MESDIF (Multiobjective Evolutionary Subgroup DIscovery Fuzzy rules)
MESDIF is a multi-objective genetic algorithm that extracts fuzzy rules. The representation used can be canonical or DNF (see Section 3.1.1.1.1). The algorithm follows the SPEA2 [Zitzler et al. (2002)] approach, in which an elite population is used throughout the whole evolutionary process. At the end of the process, the rules stored in the elite population are returned as the result.
The multi-objective approach is based on a niching scheme that relies on dominance in the Pareto front. This scheme makes it possible to find non-dominated individuals to fill the elite population, which has a fixed size. Figure 1 shows a basic scheme of the algorithm.

Initial Population generation
The initial population operator of MESDIF produces random chromosomes in two ways: a percentage of the chromosomes is generated completely at random, and the rest are also generated at random but with a limit on the maximum number of variables that can participate in the rule. This makes the generated rules more general.

Fitness function
To obtain the fitness value of each chromosome, MESDIF follows these steps:
• First, dominated and non-dominated individuals are identified in both populations. The strength of an individual is the number of individuals that it dominates.
• An initial fitness value A_i is calculated for each individual as the sum of the strengths of all the dominators of individual i. This value must be minimized, and it is zero for non-dominated individuals.
• Because this system could fail when there are many non-dominated individuals, additional information about density is included. This density is computed with the nearest neighbour approach and is calculated as
$$D(i) = \frac{1}{\sigma_k + 2}$$
where σ_k is the distance to the k-th nearest neighbour.
• The final adaptation value is the sum of the initial fitness and the density information.
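The steps above can be sketched as follows. The sketch follows the SPEA2-style description in the text and is not taken from the package: the function names, the objective matrix, the choice of Euclidean distance in objective space and the density term 1 / (σ_k + 2) are assumptions, and all objectives are assumed to be maximized.

```r
# a dominates b when it is no worse in every objective and better in at least one
dominates <- function(a, b) all(a >= b) && any(a > b)

# objs: matrix with one row per individual and one column per objective
spea2_fitness <- function(objs, k = 1) {
  n <- nrow(objs)
  dom <- matrix(FALSE, n, n)               # dom[i, j]: individual i dominates j
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j) dom[i, j] <- dominates(objs[i, ], objs[j, ])
    }
  }
  strength <- rowSums(dom)                 # number of individuals each one dominates
  # Initial fitness: sum of the strengths of the dominators of each individual
  raw <- vapply(seq_len(n), function(i) sum(strength[dom[, i]]), numeric(1))
  # Density from the distance to the k-th nearest neighbour (position 1 is itself)
  d <- as.matrix(dist(objs))
  sigma_k <- apply(d, 1, function(row) sort(row)[k + 1])
  density <- unname(1 / (sigma_k + 2))
  list(strength = strength, raw = raw, density = density, fitness = raw + density)
}

objs <- rbind(c(0.9, 0.2), c(0.5, 0.5), c(0.4, 0.4), c(0.1, 0.1))
res <- spea2_fitness(objs)
print(res$raw)
```

Since the density term is always below one, it only breaks ties between individuals with the same initial fitness, in particular among the non-dominated ones.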

Truncation operator
As the elite population has a fixed size, a truncation operator is needed when not all the non-dominated individuals fit in the elite population.
To perform this truncation, the operator takes all the non-dominated individuals and calculates the distances between every pair of them. Then the two closest individuals are taken, and the one whose distance to its k-th nearest neighbour is smaller is eliminated. This process is repeated until the non-dominated individuals fit in the elite population.

Fill operator
If the number of non-dominated individuals is smaller than the size of the elite population, the elite population must be filled with dominated individuals. The operator sorts the individuals by fitness value and copies the first n individuals to the elite population, where n is the size of the elite population.

Genetic operators
The genetic operators of MESDIF are:
• A two-point crossover operator (see Section 3.1.1.1.2).
• A biased mutation operator. It works like the mutation operator of SDIGA (see Section 3.1.1.1.3), but it is applied over a population of selected individuals.
• A selection operator based on binary tournament selection. This selection is applied only over the individuals of the elite population.
• A replacement of the selected population based on the direct replacement of the k worst individuals of the population, where k is the number of individuals returned after crossover and mutation.

NMEEF-SD (Non-dominated Multi-objective Evolutionary algorithm for Extracting Fuzzy rules in Subgroup Discovery)
NMEEF-SD is another multi-objective genetic algorithm, based on the NSGA-II [Deb et al.(2000)] approach. It uses a fast sorting algorithm and a reinitialisation based on coverage when the population does not evolve for a period. The algorithm only works with the canonical representation (see Section 3.1.1.1.1). A study presented in [?] shows the low quality of the rules obtained with a DNF representation.

Evolutionary process
The evolutionary process follows these steps:
• An initial biased population Pt is generated (see Section 3.1.2.1.1).
• The genetic operators are applied over Pt, obtaining Qt with the same size as Pt.
• Pt and Qt are joined to obtain Rt, and the fast sorting algorithm is applied over Rt. The individuals are sorted into different fronts in the following way: "The first front (F1) is composed of non-dominated individuals, the second front (F2) is composed by individuals dominated by one individual; the third front (F3) is composed by individuals dominated by two, and so on."
• After that, the algorithm generates the population of the next generation (Pt+1). First, the algorithm checks whether the Pareto front covers new examples, as shown in Figure 2. If this condition is not satisfied during a period of the evolutionary process, a reinitialisation based on coverage is performed. Otherwise, the algorithm builds the next population (Pt+1) by introducing, in order, the first complete fronts of Rt. If the last front does not fit completely in Pt+1, that front is sorted by crowding distance and its first individuals are copied until Pt+1 is filled.
• At the end of the evolutionary process, the individuals in the Pareto front are returned.
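The sorting into fronts can be illustrated with the following sketch. It assumes the usual NSGA-II procedure, in which the non-dominated individuals form the first front, are set aside, and the process is repeated on the remaining individuals; this is an assumption about how the fronts are built, and the function `front_rank()` and the example objective values are hypothetical. All objectives are assumed to be maximized.

```r
# a dominates b when it is no worse in every objective and better in at least one
dominates <- function(a, b) all(a >= b) && any(a > b)

# Returns the front number (1 = Pareto front) of each row of objs
front_rank <- function(objs) {
  n <- nrow(objs)
  rank <- rep(NA_integer_, n)
  remaining <- seq_len(n)
  current <- 1L
  while (length(remaining) > 0) {
    # An individual is in the current front if nobody left dominates it
    is_front <- vapply(remaining, function(i) {
      !any(vapply(remaining, function(j) {
        j != i && dominates(objs[j, ], objs[i, ])
      }, logical(1)))
    }, logical(1))
    rank[remaining[is_front]] <- current
    remaining <- remaining[!is_front]
    current <- current + 1L
  }
  rank
}

objs <- rbind(c(0.9, 0.2), c(0.5, 0.5), c(0.4, 0.4), c(0.1, 0.1))
ranks <- front_rank(objs)
print(ranks)
```

The next population would then be filled front by front in increasing rank order, using the crowding distance only to choose among the individuals of the last front that does not fit completely.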

Reinitialisation operator
This operator is applied if the Pareto front does not cover any new example during 5% of the total number of evaluations. In that case, the algorithm copies the non-duplicated individuals of the Pareto front to Pt+1, and the rest of the individuals are generated by trying to cover one example of the target class with a maximum number of variables.

FuGePSD
FuGePSD [?] is another genetic algorithm that finds interesting subgroups of a given class. It uses a genetic programming approach, where individuals are represented as trees of variable length instead of vectors. The consequent of the rule is also represented. This scheme has the advantage of obtaining rules for all the possible values of a given target variable in a single execution. Furthermore, FuGePSD has a variable population size, which can change during the evolutionary process based on a competitive-cooperative approach, the Token Competition.

Evolutionary process
The evolutionary process of FuGePSD works as follows:
1. Create a random population of a fixed length.
2. Create a new population, called the Offspring Population, which is generated via genetic operators.
3. Join the original population and the offspring, and execute the token competition procedure.
4. Compute the global fitness and replace the best population if necessary.
5. Return to step 2 until the number of generations is reached.
6. Apply a Screening function to the best population and return the resulting rules.
The key function in this scheme is the token competition procedure, which promotes diversity in the population, while some of the genetic operators bring specificity to the rules.

Genetic operators
The genetic operators used by FuGePSD are:
• Crossover: It takes two parents and two random subtrees, one from each parent. The subtrees are exchanged only if the result is correct according to the grammar.
• Mutation: It randomly changes a variable of an individual to a random value.
• Insertion: It inserts a new node into an individual. This node is a variable with a random value.
• Dropping: It removes a subtree from an individual.
These genetic operators are applied with a probability given by the user.

Token Competition procedure
The token competition is the key procedure of FuGePSD: it brings diversity to the population while keeping the best individuals. Let a token be an example of the dataset. Each individual can seize a token if its rule covers the example. When this occurs, a flag is set on that token and it cannot be seized by other rules. The token competition works as follows:
1. Sort the population in descending order of individual fitness value.
2. In that order, each individual takes as many tokens as it can. The result is stored for each individual in a value called penalized fitness:
$$PenalizedFitness(R_i) = fitness(R_i) \cdot \frac{count(R_i)}{ideal(R_i)}$$
where count(R_i) is the number of tokens the rule actually seized and ideal(R_i) is the maximum number of tokens the rule can seize.
3. Remove the individuals with PenalizedFitness = 0.
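A minimal sketch of this procedure is shown below. It is not the package's implementation: the function `token_competition()`, the coverage matrix and the fitness values are hypothetical, and the penalized fitness is assumed to be the individual fitness scaled by the ratio between seized and seizable tokens.

```r
# cover:   logical matrix, cover[i, j] is TRUE when rule i covers example (token) j
# fitness: individual fitness value of each rule
token_competition <- function(cover, fitness) {
  ord <- order(fitness, decreasing = TRUE)   # 1. best individuals choose first
  taken <- rep(FALSE, ncol(cover))           # flag of every token
  penalized <- numeric(nrow(cover))
  for (i in ord) {
    ideal <- sum(cover[i, ])                 # tokens the rule could seize
    seized <- cover[i, ] & !taken            # tokens that are still free
    taken <- taken | seized
    # 2. penalized fitness (assumed form: fitness * count / ideal)
    penalized[i] <- if (ideal > 0) fitness[i] * sum(seized) / ideal else 0
  }
  # 3. individuals with penalized fitness equal to zero are removed
  list(penalized = penalized, keep = which(penalized > 0))
}

cover <- rbind(
  c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),   # rule 1
  c(TRUE,  TRUE,  FALSE, FALSE, FALSE),   # rule 2: covers nothing new
  c(FALSE, FALSE, TRUE,  TRUE,  FALSE)    # rule 3: one new example
)
fitness <- c(0.9, 0.8, 0.5)
tc <- token_competition(cover, fitness)
print(tc$penalized)
print(tc$keep)
```

In this example the second rule is removed even though its own fitness is high, because every example it covers was already seized by a better rule.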

Screening Function
At the end of the evolutionary process, the screening function is applied so that the user receives only quality rules. These rules must reach a minimum level of confidence and sensitivity. The function has an external parameter (ALL_CLASS): if it is true, the algorithm returns at least one rule per value of the target variable; if it is false, it returns at least the best rule of the population.

Package architecture
The main advantage of the SDEFSR package is that it provides all the evolutionary fuzzy systems for SD that exist in the literature. The algorithms included are an important kind of SD method that is not otherwise available in R at the moment. Therefore, this package offers the R community a brand new possibility for data mining and data analysis.
The base of the package is defined by two S3 classes:
• SDEFSR_Dataset. This object defines a dataset and contains information about it. This information is stored in the following fields:
- relation. Defines the name of the relation that this dataset belongs to, e.g. "german credit".
- attributeNames. Stores the names of the attributes.
This class also exports the well-known S3 methods print() and summary(), which show the data structure without codification and a summary with basic information about the dataset, respectively.
• SDEFSR_Rules. This class is a list that contains the rules generated by an algorithm. The number of rules generated can be obtained with length(SDEFSR_RulesObject). Each rule has the following fields:
- rule. The string that represents the rule description.
- qualityMeasures. A list that contains the quality measures of the rule. These measures are the same as those described in the quality measures section.
This object is returned by all the SD algorithms of this package, and it is necessary to analyse the generated rules. It is required by the exported functions plot(), which plots an FPR vs TPR graph that allows the visualization of the rules, and orderRules(), which returns another SDEFSR_Rules object with the rules sorted by a given quality measure in descending order. Likewise, this object overloads the subset operator ('[') to allow easy filtering operations.
Additionally, the package has a general function called read.dataset() that reads datasets in ARFF, KEEL or CSV format, and the function SDEFSR_DatasetFromDataFrame() to transform a data.frame into a SDEFSR_Dataset.

Use of SDEFSR package
This section explains how to use the package. The package aims to make subgroup discovery algorithms really simple to use, without any dependencies.

Installing and loading the package
The package SDEFSR is available on the CRAN servers, so it can be installed like any other package by simply typing:

    install.packages("SDEFSR")

The development version is also available on GitHub at http://github.com/aklxao2/SDEFSR; feel free to clone it and contribute if you wish. To use the development version, install the devtools package and use its install_github() function:

    devtools::install_github("aklxao2/SDEFSR")

SDEFSR depends only on the package shiny [?]. This package is necessary to use the user interface locally; if it is not installed, the user is asked to install it when running the function SDEFSR_GUI(). Once installed, the package has to be loaded before it is used, through library() or require(). After loading the package, six datasets are available as SDEFSR_Dataset objects: carTra, carTst, germanTra, germanTst, habermanTra and habermanTst, which correspond to the training and test partitions of the car, german and haberman datasets. In addition, the rules generated by the SDIGA algorithm with default parameters over the haberman dataset are loaded as habermanRules, an SDEFSR_Rules object.

Loading a dataset
In order to use SD algorithms available in the SDEFSR package, a SDEFSR_Dataset object is neccessary. This object can be generated using the read.dataset() function. This function reads ".dat" files with the KEEL data mining tool format, ARFF files (".arff") from WEKA or even CSV files (".csv"). Assuming the files iris.dat, iris.arff and iris.csv corresponding to the classical iris dataset in KEEL, ARFF and CSV formats respectively in the working directory, the loading of these files will be as follows: irisFromKEEL <-read.dataset("iris.dat") irisFromARFF <-read.dataset("iris.arff") irisFromCSV <-read.dataset("iris.csv") Note that the function detects the type of data by the extension. To read csv files, the function has optional parameters that defines the separator used between fields, the decimal separator, the quote indicator and the NA identifier as parameters. By default, this options and values are sep = ",", quote = "\"", dec = "." and na.strings = "?" respectively. It is important to remark that sparse ARFF data is not supported.
If the dataset is not available in these formats, it is possible to obtain an SDEFSR_Dataset object from a data.frame. This data.frame could be loaded by read.table() or similar functions. Finally, the resulting data.frame has to be given to the SDEFSR_DatasetFromDataFrame() function. This function also allows the creation of datasets on the fly, as in this example:
library(SDEFSR)
df <- data.frame(matrix(data = runif(1000), ncol = 10))
# Add class attribute (it must be the last attribute and it must be categorical)
df[,11] <- c("Class 0", "Class 1", "Class 2", "Class 3")
SDEFSR_DatasetObject <- SDEFSR_DatasetFromDataFrame(df, relation = "random")
# Load from iris dataset
irisFromDataFrame <- SDEFSR_DatasetFromDataFrame(iris, "iris")
This assigns to SDEFSR_DatasetObject a new, randomly created dataset with 100 examples and 11 attributes. Note that the target variable must be categorical, because the SD algorithms can only deal with categorical target variables.
The SDEFSR_DatasetFromDataFrame() function has three additional parameters: names, types, and classNames. These allow the manual assignment of the attribute names, their types, and a vector with the values of the target variable, respectively. With the default values (NA), the function retrieves these values automatically from the information found in the dataset. However, if the information in the dataset is not accurate, this could cause unexpected results in the SD algorithms.
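Since the target variable has to be the last attribute and has to be categorical, a data.frame often needs a small amount of preparation first. The sketch below shows one possible way to do this in base R; prepare_for_sd() is a hypothetical helper written for this example, not a function of the package, and the final call is left commented out as an assumed follow-up step.

```r
# Hypothetical helper (not part of SDEFSR): move the target column to the
# last position and make sure it is stored as a categorical variable.
prepare_for_sd <- function(df, target) {
  stopifnot(target %in% names(df))
  others <- setdiff(names(df), target)
  out <- df[, c(others, target), drop = FALSE]
  out[[target]] <- as.factor(out[[target]])
  out
}

# Toy data frame whose class column comes first and is stored as text.
set.seed(1)
raw <- data.frame(
  label = rep(c("yes", "no"), length.out = 20),
  x1 = runif(20),
  x2 = runif(20),
  stringsAsFactors = FALSE
)

ready <- prepare_for_sd(raw, "label")
names(ready)            # the target is now the last column
is.factor(ready$label)  # and it is categorical

# Assumed follow-up, using the function described in the text (not run):
# ds <- SDEFSR_DatasetFromDataFrame(ready, relation = "toy")
```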

Executing Subgroup Discovery algorithms
Once our datasets are ready to be used, it is time to execute a subgroup discovery algorithm. As an example, we execute the MESDIF algorithm. For the rest of the algorithms the steps are the same and only a few parameters differ; if you need help with the parameters, refer to the help of the function using help(function) or ?function. An SD algorithm can be executed in two ways: through a parameters file, giving the path to that file as an argument, or by entering all the parameter names and values at the command line. The structure of the parameters file, among other useful information, can be found on the help pages of each function.
The output of the algorithm is shown in the console, divided into three sections:
• First, an echo of the parameters used is shown.
• After that, the rules obtained are shown one by one. These rules are numbered, starting at 1.
• Finally, the quality measures of each rule over the test dataset (or the training dataset if 'test = NULL') are shown. At the end of the results, the "Global" section summarizes the quality measures, providing the mean of each quality measure over all the rules.
As this output could be extremely large, the function also saves it to three files, one for each of the above sections. By default, these files are named optionsFile.txt, rulesFile.txt and testQM.txt and are saved in the working directory, overwriting existing files. The format of these files is identical to the output generated by the algorithm, but divided into the sections described above.
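The following base R sketch illustrates the two ideas above: looking at the saved files, and how a "Global" summary is obtained as the mean of each quality measure over the rules. The per-rule values and the column names are invented for illustration; they are not results produced by the package.

```r
# Show the first lines of each output file, if a previous run left them in
# the working directory (default file names taken from the text).
for (f in c("optionsFile.txt", "rulesFile.txt", "testQM.txt")) {
  if (file.exists(f)) {
    cat("==", f, "==\n")
    cat(head(readLines(f, warn = FALSE), 5), sep = "\n")
  }
}

# Made-up per-rule quality measures (illustrative names and values only).
qm <- data.frame(
  rule = 1:3,
  nVars = c(2, 1, 3),
  Coverage = c(0.30, 0.55, 0.10),
  Confidence = c(0.80, 0.60, 0.90)
)

# "Global" summary: the mean of every quality measure over all the rules.
global <- colMeans(qm[, c("nVars", "Coverage", "Confidence")])
global
```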

Analyzing the rules obtained
After its execution, an SD algorithm returns an SDEFSR_Rules object that contains the rules obtained. Following the example, with the ruleSet object obtained we can draw a TPR vs FPR plot to view the reliability and generality of each rule. Reliable rules have low FPR values and high TPR values, while overly general rules have high values of both TPR and FPR. To plot the rules, we can use the function plot():
library(ggplot2)
plot(ruleSet)
Additionally, we can directly order the rule set by a quality measure with the function orderRules(), which returns another SDEFSR_Rules object with the sorted rules. By default, the ordering is done by confidence.
rulesOrderedBySignificance <- orderRules(ruleSet, by = "Significance")
Filtering rules by their number of attribute-value pairs, or keeping only those rules whose quality measure is greater than a given threshold, is a useful way to retain only high-quality rules. Such filtering operations are quite simple to apply in SDEFSR. Using
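To make the reading of the TPR vs FPR plot more concrete, the sketch below computes both rates by hand for two simple rules over the iris dataset shipped with R. The helper rule_rates() and the two rules are hypothetical and only serve as an illustration; they are not produced by any SDEFSR algorithm.

```r
# Hypothetical helper: TPR and FPR of a rule for one target class.
# `covered` marks the examples matched by the antecedent of the rule,
# `is_target` marks the examples that belong to the target class.
rule_rates <- function(covered, is_target) {
  c(TPR = sum(covered & is_target) / sum(is_target),
    FPR = sum(covered & !is_target) / sum(!is_target))
}

is_setosa <- iris$Species == "setosa"

# A reliable rule: IF Petal.Length < 2.5 THEN setosa
reliable <- rule_rates(iris$Petal.Length < 2.5, is_setosa)

# An overly general rule: IF Sepal.Length > 0 THEN setosa (covers everything)
too_general <- rule_rates(iris$Sepal.Length > 0, is_setosa)

rbind(reliable, too_general)
```

The first rule falls in the top-left corner of the plot (high TPR, low FPR), whereas the second one falls in the top-right corner, since it covers every example regardless of its class.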

The user interface
As mentioned at the beginning of this paper, the SDEFSR package provides a user interface that makes the subgroup discovery task easier and also supports some basic exploratory analysis. The interface can be used in two ways. The first one is via the web at http://sdrinterface.shinyapps.io/shiny. This has the advantage of running the algorithms without having R installed in our system, and it also avoids spending processing time on our machine. The second way is to run the interface locally, on our own local server. This is done simply by calling:

SDEFSR_GUI()
This function launches the user interface in our default web browser. As we can see in Figure 3, the page is organized into an options panel on the left and a tabbed panel on the right, where the results are shown when we execute the algorithms. The first thing we have to do is to load a KEEL dataset. If we want to execute a subgroup discovery algorithm, we must load a training file and a test file that have the same @relation field in the KEEL dataset file.
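Before loading a training and a test file into the interface, the requirement on the @relation field can be checked by hand. The following base R sketch shows one way to do so; keel_relation() is a hypothetical helper written for this example, and the two small files are toy KEEL-style files created only for the demonstration.

```r
# Hypothetical helper: return the name that follows "@relation" in the
# header of a KEEL file, or NA if no such line is found.
keel_relation <- function(path) {
  lines <- readLines(path, warn = FALSE)
  pattern <- "^[[:space:]]*@relation[[:space:]]+"
  hit <- grep(pattern, lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pattern, "", hit[1], ignore.case = TRUE))
}

# Two toy KEEL-style files sharing the same header.
header <- c("@relation toy",
            "@attribute x real [0, 1]",
            "@attribute class {a, b}",
            "@inputs x",
            "@outputs class",
            "@data")
tra <- tempfile(fileext = ".dat")
tst <- tempfile(fileext = ".dat")
writeLines(c(header, "0.1, a", "0.9, b"), tra)
writeLines(c(header, "0.4, a"), tst)

# TRUE when both files declare the same relation name.
same_relation <- identical(keel_relation(tra), keel_relation(tst))
same_relation
```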
Once we select the dataset, a graph with information about the last variable defined in the dataset is shown automatically. The graph shows the distribution of the examples over the values of the variable. To the right of this graph there is a table with some basic information, which is more detailed if the variable is numerical.
Figure 3: Initial screen of the SDEFSR user interface.
Once a dataset is loaded, several options become available. Figure 4 shows all of them:
1. As mentioned above, we can load a second dataset as a test (or training) file.
2. This list has a double function. On the one hand, we can select the variable to visualize in the graph; on the other hand, we can select the target variable for executing a subgroup discovery algorithm. Below the variable selection, we can choose a target value of the variable to find subgroups for that value, or search for all possible values of the target variable.
3. Here we can choose how the information is visualized. For categorical variables, the information can be shown as a pie chart or as a histogram. For numerical variables, a histogram and a boxplot are available.
4. Here we can choose the subgroup discovery algorithm and its parameters, which are shown below it. Below the parameters is the button that executes the subgroup discovery algorithm.
5. For categorical variables, this section allows selecting which values of the variable are shown in the graph. It is important to remark that this only hides the values in the graph; it does not remove any example from the dataset.
6. This option allows visualizing information about either the training or the test file.
After the execution of a subgroup discovery algorithm, we are taken automatically to the tab Rules generated, which contains a table with all the subgroups generated. If we want, we can filter the rules, for example by variable, by typing into the Search field. The tab Execution Info shows an echo of the parameters used to launch the algorithm. This echo is the same as the one shown in the R console.
The tab Test Quality Measures shows a table with the quality measures of every rule applied to the test dataset. The last row is a summary of the results and shows the number of rules obtained and the average value of every quality measure.

Summary
In this paper the SDEFSR package has been introduced. This package provides four subgroup discovery algorithms without dependencies on other tools or packages. It can read and load datasets in the KEEL dataset format, and it can also read datasets in ARFF format or load a data.frame object. Finally, a web-based interface has been developed to make the work easier, even if we do not have R installed in our system. The development of the package will continue in the future, including more functionality to work with KEEL datasets, adding new subgroup discovery algorithms and improving the web interface. There is a great deal of work ahead, so we encourage other developers to participate by adding tools or algorithms to the package, as we will do in future releases.