dGAselID: An R Package for Selecting a Variable Number of Features in High Dimensional Data

The dGAselID package proposes an original approach to feature selection in high dimensional data. The method is built upon a diploid genetic algorithm. The genotype to phenotype mapping is modeled after the Incomplete Dominance Inheritance, overpassing the necessity to define a dominance scheme. The fitness evaluation is done by user selectable supervised classifiers, from a broad range of options. Cross validation options are also accessible. A new approach to crossover, inspired from the random assortment of chromosomes during meiosis is included. Several mutation operators, inspired from genetics, are also proposed. The package is fully compatible with the data formats used in Bioconductor and MLInterfaces package, readily applicable to microarray studies, but is flexible to other feature selection applications from high dimensional data. Several options for the visualization of evolution and outcomes are implemented to facilitate the interpretation of results. The package’s functionality is illustrated by examples.


Introduction
Recent advances in information technology provide tools for gathering an immense amount of data on various scopes.Investigators have increasingly improved tools to collect data describing different areas of research.The need to efficiently manage, analyze and extract the important features from high dimensional data is also growing with the different exploration areas benefiting from these technologies.
The DNA microarray technology is widely used for exploring the differential gene expression.The method is very established and offers a very good opportunity to develop new methods for selecting features in real high dimensional data.The vast amount of microarray data that is freely available along with the results obtained by employing other exploratory techniques, belonging to statistics or artificial intelligence, provide a unique opportunity to evaluate the performance of newly developed techniques.
The Genetic Algorithms (GAs) were extensively used to select features in various high dimensional data, for different research goals.The GA designs evolved and were adapted with particular exploration interests since they were introduced (Holland, 1975).Different GA designs were specifically adapted to address optimizations or diverse feature selection (Xue et al., 2016) assignments.
The literature on GAs is comprehensive and covers various aspects of interest.The fundamentals of GAs, including the schema theorem, are covered in very instructive introductory books (Mitchell, 1998) and (Goldberg, 1989).Other authors propose exhaustive investigations in the GAs' behavior (Berard and Bienvenue, 2003) and properties (Rudolph, 1994).The genetic operators and their impact on evolution, with emphasis on mathematical details (Doerr and Doerr, 2015) were extensively examined.The genetic algorithms model the naturally occurring evolution and are designed to solve particular problems.In consequence, the theoretical foundations are yet to catch up with the practically applied algorithms.Nevertheless, the theory of genetic algorithms is emerging (Droste et al., 2002).
The project R offers a great environment for developing methods for high dimensional data analysis.The variety of techniques already implemented by numerous contributors and the availability of the methods, source code, countless data, and results as well as the very forthcoming community make it the environment of choice for implementing our method.Moreover, the Bioconductor project (Huber et al., 2015) available in R, offers a wide range of methods and tools for analyzing microarray data, as well as real data sets to experiment with and compare the results.Our package dGAselID was developed to be cohesive with Bioconductor.The "ExpressionSet class" used in Bioconductor was adopted as standard for our package; any data formatted accordingly can be analyzed with our method.
Different GA implementations are available as contributors' packages in R. Implementations of GAs for both floating-point and binary chromosomes are included in the genalg (Willighagen and Ballings, 2015) package.The GA (Scrucca, 2016), nsga2R (Tsou, 2015), and gaoptim (Tenorio, 2013) packages are dedicated to optimizations using GAs.A GA designed for determining training populations (Akdemir et al., 2015) is offered in the STPGA package.The package kofnGA (Wolters, 2015) aims to select a fixed-size set of integers.Variable selection applications of GAs are proposed in the The R Journal Vol.9/2, December 2017 ISSN 2073-4859 mogavs (Pajala, 2016) and gaselect (Kepplinger, 2015) packages for regression and high-dimensional data respectively.

Algorithm
Haploid GAs were previously employed to address feature selection in microarray studies (Melita et al., 2008).In this type of data, the number of samples is significantly lower than the number of features and the utilization of cross validation techniques is necessary for reliable results.The diploid GAs offer better performance than the haploid implementations for selecting features in a cross validation scenario in general, and for microarray data (Melita and Holban, 2016b) in particular.
The GA implementation in the dGAselID package uses a diploid representation.All the features in the data form a genome, and each feature retains a specific locus in the genome, the position in the original data.Every individual in the population will consist of two such genomes.
The fitness evaluation function is a supervised classifier.Any implementation of supervised classifier available in MLInterfaces package (Carey et al., 2016) is a possible choice for fitness evaluation function in dGAselID.The fitness value is the accuracy of the given classifier in discerning between samples belonging to different classes.
Every feature in the data, inputted according to the format in the "ExpressionSet class", is represented by a gene in the genome.Every gene has two alleles, represented as 0 and 1.The allele 1 codes for the corresponding feature to be present in the classifier.The allele 0 cyphers for discarding the gene from the classifier.A genome with a limited number of alleles = 1, codes for the supervised classifier working on a subset of features from the data.The number of desired features is user selectable at the initialization of the algorithm.
Our implementation offers the possibility to divide the genomes into a variable number of chromosomes.The number of chromosomes to split the genomes in is user selectable.The value 1 for the number of chromosomes will result in the genome being treated as a single chromosome, like in the classical GA implementation.The default value in the dGAselID is 22.In this case, the genome is parted into 22 chromosomes, the total of human autosomes.The chromosomes will have variable length, with different number of genes, following the dispersal found in the human autosomes, as illustrated in the Table 1.The number of genes found on each chromosome will follow the spread found in the human autosomes with different values for the number of chromosomes.We chose the default value 22 to emphasize the foundation of our evolutionary approach.Different values will serve diverse practical applications.This parameter is particularly important when variables belong to several previously known categories, as with the custom microarray chips.When no such information is known, an appropriate value can be empirically determined.
The initial population is randomly generated from a discrete uniform distribution.The user can specify the number of genomes in the population, the number of activated genes in each genome and the number of chromosomes to split the genomes in.The population will encompass individuals, with each individual consisting of two sets of haploid chromosomes, randomly assigned.
In a diploid GA it is mandatory to determine how different alleles on heterozygous chromosomes influence the phenotype.The dominance schemes typically used in genetic algorithms are built upon the Complete Dominance model, described in biology by Gregor Mendel in 1865.In this model, one of the alleles, called dominant, produces effects into phenotype and masks the existence of the other allele in genotype.The alternative that does not affect the phenotype is called recessive allele.The Complete Dominance model describes only a few of the interactions between alleles in nature.Various models were later developed to describe different interactions between alleles.In Incomplete Dominance model, the phenotype of an individual is considered to be in between the phenotypes resulting from each of the inherited alleles.Both alleles influence the phenotype and the existence of none is masked in genotype.The main difference between the two models is illustrated in Figure 1.We can suppose that a gene that codes for the color of an organism has two alleles; one of them produces a red individual and the alternative shapes a blue entity.With the Complete Dominance model, one of the alleles is dominant and masks the presence of the other.In our example, the allele that codes for the red color is dominant and is noted with capital letter (R).Every diploid organism that inherits at least one allele R will be a red entity.Only if a diploid organism inherits two copies of the recessive allele (b), the phenotype will be influenced by it, resulting in a blue individual.The interaction described by the Incomplete Dominance model results in three different phenotypes.In this case, an individual can be red, blue or purple.Both alleles affect the phenotype and are noted with capital letters (R and B).
The Incomplete Dominance inheritance is an alternative to genotype to phenotype dominance schemes (Melita and Holban, 2016b) in genetic algorithms and is the approach adopted in our algorithm.Moreover, this method does not require an explicit scheme for genotype-phenotype mapping.The fitness of each individual is consequently evaluated as the mean accuracy of the two classifiers, The R Journal Vol.9/2, December  The R Journal Vol.9/2, December 2017 ISSN 2073-4859 each set of haploid chromosomes in the individual.With this approach it is possible to maintain a better variability in the late generations, as features from the poorly performing genotypes will be present in later generations when compared to the classical technique.Exploitation of the Incomplete Dominance model is optional.The value "ID2" for the parameter ID enables the Incomplete Dominance model, while the value "ID1" turns it off.In evaluating the fitness of an individual, any supervised classifier available in MLInterfaces package can be selected.The package offers a unified way to call supervised classifiers on data formatted according to "ExpressionSet class" specifications with several options (svmI, ldaI, rdaI, knnI, knn.cvI, randomForrestI, dldaI, nnetI, qdaI, naiveBayesI, etc).The cross-validation techniques implemented in MLInterfaces are also available for fitness evaluation in our package, and are accessible through the trainTest parameter.The default value "LOG" for the trainTest parameter enables the leave-out-group cross-validation, while specifying "LOO", the user can opt for the leave-one-out cross-validation.Moreover, is if possible to shape the data into training and testing sets.A value of the form x:y for the trainTest parameter identifies the samples with indexes from x to y as training examples while all the others are used as testing set.
In the next step, the individuals are ranked according to their fitness.Our implementation offers the option of applying an elitist selection over the ranked individuals.The default value for elitism is NA, but a user can decide on the desired value for elitism, keeping the chosen number of best performing genotypes in the population.
Crossovers are applied next, between the two haploid sets of chromosomes in each individual.Two-point crossovers were preferred to the single-point alternative which preferentially affects the string ends.The two-point crossover follows the classical implementation.One two-point crossover is applied between homologous chromosomes, in each individual.A parameter to explicitly specify the chance for a crossover to occur is not implemented in our algorithm.The number of chromosomes implicitly affects the number of crossovers.
Another approach to crossover, modeled after the Random Assortment of Chromosomes in meiosis is also available.The Random Assortment of Chromosomes Crossover (Melita and Holban, 2016a) takes advantage of splitting the genomes in a number of chromosomes with variable size.When selected, two-point crossovers are performed between homologous chromosomes.After the crossovers are applied, the chromosomes are randomly assorted and distributed to one of the chromosomes set in each individual.This process, models the events that occur during meiosis I in eukaryotes.The user can choose at the initialization of the algorithm if the chromosomes will be randomly assorted through the randomAssortment parameter.The default value is TRUE.This operator is especially important when selecting a small number of features from a very large poll.In this case, the two-point crossover frequently recombines strings containing only zeros, with no effect on the very long genotypes.The Random Assortment operator offers a significant advantage in these situations.The distinctions between these approaches to recombination are highlighted in Figure 2, a part of the R output using the code: > #Random Assortment on the recombined genotypes with 3 chromosomes > RandomAssortment(cr3, chrConf03) Subsequently, the haploid sets of chromosomes with a higher fitness are kept and the others are discarded from each individual and mutations are applied with a chance that is specified by the user when initializing the algorithm.These haploid sets and the genotypes obtained thru crossovers are The R Journal Vol.9/2, December 2017 ISSN 2073-4859 The Nonsense Mutation operator annuls all the genes on a chromosome following a randomly selected locus (treated as a stop codon).The other chromosomes in the genotype are not influenced by the nonsense mutation.The Frameshift Mutation operator randomly selects a locus on an arbitrarily selected chromosome.The gene at the selected locus is deleted and all the following chain is shifted with one position to the left.The last locus on the implicated chromosome is subsequently annulled to conserve its length.Other chromosomes in the genotype are not altered by the mutation.The Large Segment Deletion operator annuls all the genes in a randomly generated interval on an arbitrarily selected chromosome.The Whole Chromosome Deletion operator acts in a similar fashion, but on the whole chromosome rather than an interval.The Transposons operator randomly selects a chromosome, a gene on that chromosome and a distance for relocation.The elected gene is then transferred at a locus indicated by the generated distance.All the mutation operators occur with chances specified by designated parameters.The mutated genotypes are verified to have at least 4 active genes and the invalid mutations are not inherited, to prevent errors during the subsequent fitness evaluations.The following code demonstrates the mutation operators.The R output is partially presented in Figure 3.

Working with the dGAselID package
The dGAselID package is structured around the dGAselID() function.This function manages the initial parameters for the algorithm and, depending on the user selected options, sets the stage for the experiment.The dGAselID() function calls other functions for the different steps and options in the algorithm.The arguments accepted by dGAselID() along with short descriptions are presented in Table 2.The other functions, for different operators used during the genetic algorithm search, are summarized in Table 3.
Graphical representations of the evolution are available with the built-in functions in real-time.The maximum and average accuracy, accompanied by the most frequently selected genes can be displayed after each generation, offering a very intuitive image of the evolution.Evidence about the number of individuals in the current population, crossovers or the number of mutations are displayed for each generation.
The algorithm retains various data about the evolution for further analysis.For each gene, the frequency of selection across generations is recorded, along with other characteristics of the evolution.The output data format with a hypothetical result is shown below, together with a description of the recorded variables in the Table 4.

Example
We illustrate the functionality of the dGAselID package with the ALL dataset (Li, 2009), a very known set of real DNA microarray data, available in Bioconductor.We are searching for the differentially expressed genes that could characterize the patients suffering from acute lymphoblastic leukemia but have different BCR/ABL classification, negative or positive.For this example, we use a subset of the original ALL data, non-specifically and specifically filtered to 628 features and 79 samples, from the The R Journal Vol.9/2, December 2017 ISSN 2073-4859   The R Journal Vol.9/2, December 2017 ISSN 2073-4859 12625 features and 128 patients in the complete data set.The data was filtered using the capabilities offered in the genefilter package (Gentleman et al., 2016).The algorithm works as well on the original data set, but the filtered data is searched faster.However, our experiments with real data show that more reliable results are obtained with full featured data.
The code for constructing the dataset used in the following examples is presented below: > library(genefilter) > library(ALL) > data(ALL) > bALL = ALL[, substr(ALL$BT,1,1) == "B"] > smallALL = bALL[, bALL$mol.biol%in% c("BCR/ABL", "NEG")] > smallALL$mol.biol= factor(smallALL$mol.biol)> smallALL$BT = factor(smallALL$BT) > f1 <-pOverA(0.25,log2(100 An example of function call is: > set.seed(149)> resNoID<-dGAselID(smallALL, "mol.biol",trainTest = 1:79, startGenes = 12, + populationSize = 200, iterations = 300, noChr = 5, pMutationChance = 0.0075, + elitism = 4) The choice for evaluation function was knn.cvI from the MLInterfaces package, with the parameters k=3 and l=2.This is the default method in the package.The evaluation function used in this example, knn.cvI, is a kNN classifier with the leave-one-out cross validation embedded.For this reason, the requirement for the trainTest parameter is special, addresses all the instances in the data, 1:79.For any other supervised classifier, the trainTest parameter shapes the training and testing subsets in the same fashion as the trainInd parameter in MLInterfaces.When cross-validation is required, the trainTest parameter specifies the desired method as "LOO" or "LOG", and is equivalent to xvalSpec("LOO") or xvalSpec("LOG") respectively, in MLInterfaces package.An illustration of the evolution is presented in Figure 4.The figure pictures a juncture during the search, including the information provided by the verbose mode and graphical representations of the evolution, as they appear in real-time on the computer screen.The evolutions of the Maximum Accuracy, Average Accuracy and the most frequently selected genes are presented after each generation.After the desired number of generations, the evolution of the Maximum Accuracy and Average Accuracy in every generation and the most frequently selected genes can be displayed with the included functions.The Figure 5 represents the most frequently selected genes after the specified number of generations, 300 in our example.Their characteristics can be acquired with methods already implemented in Bioconductor.The 10 most selected genes are presented and could be obtained using hgu95av2.db(Carlson, 2016) with the code below.The selected genes can be studied and evaluated with all the methods available in Bioconductor.For reliable and interpretable results, an adequate number of replications are mandatory in such an experiment, due to the stochastic makeup of the genetic algorithms.The evolution of the maximum accuracy and the average accuracy can be plotted as illustrated in the Figure 6 and Figure 7, respectively.

Conclusions
The dGAselID package provides a creative approach to feature selection in high dimensional data.
The package utilizes the data format used in Bioconductor and is readily operational for microarray data analysis.The algorithm is flexible for high dimensional data other than microarray, when data is provided according to the "ExpressionSet class" specifications.In our experience, the diploid implementation offers advantages over the haploid GA, especially when cross-validation techniques are engaged.The Incomplete Dominance approach allows for a diploid framework, bypassing the requirement to specify a dominance scheme.Also, the Incomplete Dominance inheritance models an evolution process present in nature, tested by billions of years of evolution.Moreover, the diploid structure is a foundation for crossover and mutation operators that model the natural processes and provide improvements in performance over the classical versions.In our tests, the Random Assortment of Chromosomes provides an important performance advantage, in many situations.

Future development
The main disadvantages of the algorithm presented in dGAselID package are the necessity to replicate an experiment several times for reliable results, given the fortuity in generating the initial population, and the tendency of the GA to converge in local optima.Following the conduit in evolutionary computation, we will reconsider the principles of evolution and accurately model operators for crossover and mutation to offer a better tension between exploration and exploitation.

Figure 2 :
Figure 2: Recombination operators a) classical two-point crossover with 1 chromosome, b) two-point crossover with 3 chromosomes, c) two-point crossover with 3 chromosomes and Random Assortment

Figure 3 :
Figure 3: Partial R output illustrating the mutation operators

Figure 4 :
Figure 4: Screenshot of the algorithm execution after 18 generations.

Figure 5 :
Figure 5: The most frequently selected genes after 300 generations.

Figure 6 :
Figure 6: Evolution of the maximum accuracy after 300 generations.

Figure 7 :
Figure 7: Evolution of the average accuracy after 300 generations.

Figure 8 :
Figure 8: Evolution of the best individual after 300 generations with Point Mutation.

Figure 9 :
Figure 9: Evolution of the best individual after 300 generations with Transposon Mutation.

Figure 10 :
Figure 10: Comparative evolution of the maximum fitness over 300 generations.

Figure 11 :
Figure 11: Comparative evolution of the average fitness over 300 generations.

Figure 12 :
Figure 12: Comparatison of the most selected genes after 300 generations.

Table 1 :
Default distribution of features on chromosomes.
The R Journal Vol.9/2, December 2017 ISSN 2073-4859 Operator for the Whole Chromosome Deletion mutation

Table 3 :
Functions in the dGAselID package.