Allele Imputation and Haplotype Determination from Databases Composed of Nuclear Families

The alleHap package is designed for imputing missing genetic data and reconstructing non-recombinant haplotypes from pedigree databases in a deterministic way. When genotypes of related individuals are available in a number of linked genetic markers, the program starts by identifying haplotypes compatible with the observed genotypes in those markers without missing values. If haplotypes are identified in parents or offspring, missing alleles can be imputed in subjects containing missing values. Several scenarios are analyzed: family completely genotyped, children partially genotyped and parents completely genotyped, children fully genotyped and parents containing entirely or partially missing genotypes, and founders and their offspring both only partially genotyped. The alleHap package also has a function to simulate pedigrees including all these scenarios. This article describes in detail how our package works for the desired applications, including illustrated explanations and easily reproducible examples.


Introduction
The knowledge about human genetic variation has been growing exponentially over the last decade. Collaborative efforts of international projects such as HapMap (Consortium and others, 2005) and 1000 Genomes (Consortium and others, 2012) have contributed to improving the discovery about human genetic diversity.
Genotype imputation and haplotype reconstruction have achieved an important role in Genome-Wide Association Studies (GWAS) during recent years. Estimation methods are frequently used to infer missing genotypes as well as haplotypes from databases containing related and/or unrelated subjects. The majority of these analyses have been developed using several statistical methods (Browning and Browning, 2011a) which are able to impute genotypes as well as perform haplotype phasing (also known as haplotype estimation) of the corresponding genomic regions.
Most of the currently available computer programs, such as IMPUTE2 (Howie and Marchini, 2010), MINIMAC3 (Das, 2015), BEAGLE (Browning and Browning, 2011b), and others, or R packages such as haplo.ccs (French and Lumley, 2012), haplo.stats (JP and DJ, 2016), hsphase (Ferdosi et al., 2014), linkim (Xu and Wu, 2014), rrBLUP (Endelman, 2011), and synbreed (Wimmer et al., 2012), carry out genotype imputation or haplotype reconstruction using probabilistic methods when deterministic methods are insufficient to achieve these objectives without errors. These methods usually focus on population data and, in the case of pedigree data, families are normally composed of duos (parent-child) or trios (parents-child) (Browning and Browning, 2009). Studies focused on more than two offspring for each line of descent are uncommon. In these cases, the above programs do not take full advantage of the information contained in the global family structure to improve the process of imputation and construction of haplotypes. The program HAPLORE (Zhang et al., 2005), developed in C++, takes a similar approach to alleHap for haplotype reconstruction in pedigrees, but cannot be easily integrated into an environment where R packages are extensively used.
On the other hand, certain genomic regions are very stable against recombination but at the same time they may have a considerable number of mutations. For this reason, in some well-studied regions, such as the Human Leukocyte Antigen (HLA) loci (Mack et al., 2013), located in the extended Major Histocompatibility Complex (MHC) (de Bakker et al., 2006), an alphanumeric nomenclature is needed to facilitate later analysis. At this juncture, the available typing techniques are usually not able to determine the allele phase and therefore the constitution of the appropriate haplotypes is not possible. This paper presents new improvements and a detailed description of the R package alleHap (Medina-Rodríguez et al., 2014). Our program is capable of imputing missing alleles and identifying haplotypes from non-recombinant regions considering the mechanism of heredity and the genetic information present in parents and offspring. The algorithm is deterministic in the sense that haplotypes are identified from the existing genotypes guaranteeing compatibility between parents and children. When a haplotype cannot be identified (due to genotyping errors or recombination events in the genetic region), the procedure does not infer more haplotypes in the corresponding family members. The following sections describe the implemented methods as well as some functional examples.

Basics
The algorithms in alleHap are based on a preliminary analysis of all possible combinations that may exist in the genotype of a marker, considering that each member of the family should unequivocally have inherited two alleles, one from each parent. The analysis is based on the differentiation of seven cases, as described in (Berger-Wolf et al., 2007). Each case is characterized by the number of different alleles present in the family and the way these alleles are distributed among the parents, which determines the set of possible genotypes in children.
Table 1 shows these cases when there are no missing genotypes. For example, in case 1 both parents are a|a and so the only possible child is a|a; in case 2, if one parent is a|a and the other is b|b, the only possible child is a|b. Note that in both cases it is possible to determine from which parent each allele of a child comes. The rest of the cases can be easily understood in the same way. Note that in case 5, if a child is a|b, it is not possible to know from which parent each allele comes. The notation '|' indicates that the source of each allele can be assigned without error, and the notation '/' implies that the origin is unknown.
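The enumeration behind these cases can be illustrated with a few lines of base R. This is only a sketch: possibleChildren is a hypothetical helper introduced here for illustration (it is not an alleHap function), and it assumes that both parental genotypes are complete.

```r
# Enumerate the child genotypes compatible with a pair of parental genotypes
# at one marker. Each genotype is a length-2 character vector of alleles.
# possibleChildren() is a hypothetical helper, not an alleHap function.
possibleChildren <- function(father, mother) {
  # One allele from the father combined with one allele from the mother
  combos <- expand.grid(pat = father, mat = mother, stringsAsFactors = FALSE)
  # Paternal allele first, so "a|b" reads as: a from father, b from mother
  unique(paste(combos$pat, combos$mat, sep = "|"))
}

case1 <- possibleChildren(c("a", "a"), c("a", "a"))  # only "a|a"
case2 <- possibleChildren(c("a", "a"), c("b", "b"))  # only "a|b"
case5 <- possibleChildren(c("a", "b"), c("a", "b"))  # contains "a|b" and "b|a"
case7 <- possibleChildren(c("a", "b"), c("c", "d"))  # four distinct children
```

In case 5 both "a|b" and "b|a" are produced, which is why an observed a/b child cannot be assigned a parental origin from this marker alone.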
To determine the haplotypes, alleHap creates an IDentified/Sorted (IDS) matrix from the genotypes of each family. For example, in a child, the genotype a/b of a marker is phased if it can be unequivocally determined that the first allele comes from the father and the second from the mother. In this way, the sequence of first (second) alleles of phased markers is the haplotype inherited from the father (mother). So, when a marker in a child can be phased this way its IDS value is 1; otherwise its value is 0. In parents, genotypes can be phased if there exists at least one child with all its genotypes phased (the reference child). Then, for every marker, the alleles of a parent genotype are sorted in such a way that the first allele coincides with the corresponding allele inherited from that parent in the reference child. When this sorting is achieved, the IDS value in the parent is 1; otherwise its value is 0.
An example of the IDS matrix values (right) and the corresponding phased genotypes (left) is shown in Table 1. Note that when the genotype a/b is phased we denote it by a|b. The IDS values of the parents have been deduced considering the first child of each family as the reference child.

             Phased data                          IDS matrix
Marker       1    2    3    4    5    6    7      1  2  3  4  5  6  7
Parents      a|a  a|a  a|a  a|a  a|b  a|b  a|b    1  1  1  1  1  1  1
             a|a  b|b  a|b  b|c  a|b  a|c  c|d    1  1  1  1  1  1  1
Offspring    a|a  a|b  a|a  a|b  a|a  a|a  a|c    1  1  1  1  1  1  1
                       a|b  a|c  b|b  a|b  a|d          1  1  1  1  1
                                 a/b  a|c  b|c                0  1  1
                                      b|c  b|d                   1  1
Table 1: Phased genotypes and IDS matrix.
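The phasing rule for a single child genotype can be sketched as follows. The helper phaseChild is hypothetical (it is not the package's internal code) and assumes that the child and both parents are fully genotyped at the marker; it returns the genotype in paternal|maternal order together with its IDS value.

```r
# phaseChild() is a hypothetical helper: it tries to order the two alleles of
# a child as paternal|maternal, given both parental genotypes at one marker.
phaseChild <- function(child, father, mother) {
  direct  <- child[1] %in% father && child[2] %in% mother
  swapped <- child[2] %in% father && child[1] %in% mother
  homozygous <- child[1] == child[2]
  if (direct && (!swapped || homozygous)) {
    list(genotype = paste(child, collapse = "|"), IDS = 1L)
  } else if (swapped && !direct) {
    list(genotype = paste(rev(child), collapse = "|"), IDS = 1L)
  } else if (direct && swapped) {
    # Both orders are compatible: the parental origin cannot be determined
    list(genotype = paste(child, collapse = "/"), IDS = 0L)
  } else {
    stop("child genotype is incompatible with the parental genotypes")
  }
}

# Case 5: parents a|b and a|b, heterozygous child stays unphased
amb <- phaseChild(c("a", "b"), father = c("a", "b"), mother = c("a", "b"))
# Case 7: parents a|b and c|d, alleles given as c/a are sorted to a|c
res <- phaseChild(c("c", "a"), father = c("a", "b"), mother = c("c", "d"))
```

Here amb carries the genotype "a/b" with IDS 0, while res carries "a|c" with IDS 1.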
Sometimes, missing values may occur. These can be located either in parents or children. An example of this is depicted in Table 2, where missing values have been denoted as NA (Not Available).
An identification of the homozygous genotypes for each family is also necessary for the proper operation of alleHap. This identification is done in the Homozygosity matrix (HMZ). This matrix has as many rows as members in the family and as many columns as markers. The term HMZ i,j is 0 if the subject i is heterozygous in the marker j, and 1 if he/she is homozygous. An example of some unphased genotypes (left) and their corresponding HMZ values is shown in Table 3.
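A minimal base-R sketch of how such a matrix could be computed is given below. The helper hmzMatrix is hypothetical, and leaving NA where an allele is missing is an assumption made for this illustration, not a documented behavior of the package.

```r
# hmzMatrix() is a hypothetical helper, not the package's internal code.
# 'alleles' has one row per family member and two columns per marker.
hmzMatrix <- function(alleles) {
  first  <- alleles[, seq(1, ncol(alleles), by = 2), drop = FALSE]
  second <- alleles[, seq(2, ncol(alleles), by = 2), drop = FALSE]
  # 1 = homozygous, 0 = heterozygous; NA is kept where an allele is missing
  hmz <- (first == second) * 1L
  dimnames(hmz) <- list(rownames(alleles), paste0("Mkr", seq_len(ncol(hmz))))
  hmz
}

# Made-up family typed at two markers (columns 1-2 and 3-4)
fam <- rbind(father = c("a", "a", "b", "c"),
             mother = c("a", "b", "c", "c"),
             child1 = c("a", "a", NA,  NA))
hmz <- hmzMatrix(fam)
hmz
```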

Formats
The alleHap package only works with PED files, although data stored in formats with a similar structure can easily be adapted to this format and then loaded into the program.

PED files
A PED file (Purcell et al., 2007) is a white-space (space or tab) delimited file where the first six columns are mandatory and the rest of the columns are the genotype: 'Family ID' (identifier of each family), 'Individual ID' (identifier of each member of the family), 'Paternal ID' (identifier of the paternal ancestor), 'Maternal ID' (identifier of the maternal ancestor), 'Sex' (sex of each individual: 1=male, 2=female, other=unknown), 'Phenotype' (quantitative trait or affection status of each individual: −9=missing, 1=unaffected, 2=affected), and the 'genotype' of each individual (in biallelic or coded format).
The identifiers are alphanumeric: the combination of family and individual ID should uniquely identify a person. PED files must have one and only one phenotype in the sixth column. The phenotype can be either a quantitative trait or an affection status column. Genotypes (seventh column onwards) should also be white-space delimited; they can be any character (e.g. 1, 2, 3, 4 or A, C, G, T or anything else) except 0, −9, −99. All markers should be biallelic and must have two alleles specified (Purcell et al., 2007). Note that alleHap does not use the phenotypic information located in the sixth column.

NA values
The missing or NA values may be placed either in the first six columns or in the genotype columns. In the latter case, when some values are missing, both alleles should be 0, −9, −99 or NA. For example, a family composed of five individuals typed along three markers can be represented in the following way:
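As an illustrative sketch with made-up identifiers and alleles (not the authors' original data), such a family could be built in R and printed in PED layout as shown below. The column names are hypothetical, and the value 0 in the parental ID columns follows the usual PED convention for founders.

```r
# Hypothetical family of five individuals typed at three markers.
# Columns 1-6 are the mandatory PED fields; each marker uses two columns.
ped <- data.frame(
  famID = 1, indID = 1:5,
  patID = c(0, 0, 1, 1, 1), matID = c(0, 0, 2, 2, 2),
  sex   = c(1, 2, 1, 2, 1), phen  = c(1, 1, 2, 1, 2),
  Mk1_1 = c("1", "2", "1", NA,  "1"), Mk1_2 = c("3", "4", "2", NA,  "4"),
  Mk2_1 = c("A", NA,  "A", "A", "C"), Mk2_2 = c("C", NA,  "G", "C", "G"),
  Mk3_1 = c("5", "7", "5", "6", NA),  Mk3_2 = c("6", "8", "7", "8", NA),
  stringsAsFactors = FALSE
)

# Print the rows as white-space delimited PED text (missing alleles as NA)
write.table(ped, quote = FALSE, row.names = FALSE, col.names = FALSE, na = "NA")
```

Each missing genotype has both of its alleles set to NA, as required above.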

Workflow
The workflow of alleHap comprises three stages: data loading, data imputation, and data haplotyping.
Optionally, if simulated data are to be used, a "pre-stage" data simulation must be done. The next subsections describe each of them.

Data simulation
This "pre-stage" is implemented by an R function called alleSimulator that simulates genotypic data for parent-offspring pedigrees taking into account many different factors, such as: number of families to generate, number of markers (allele pairs), maximum number of different alleles per marker in the population, type of alleles (numeric or character), and number of unique haplotypes in the population.
The R Journal Vol.9/2, December 2017 ISSN 2073-4859
To perform the data simulation, this function goes through the following steps:
I. Internal functions: In this step all the necessary functions to simulate the data are loaded. These functions are: labelMrk (which creates the 'A', 'C', 'G', or 'T' character labels), simHapSelection (which selects h different haplotypes from the total number of possible haplotypes), simOffspring (which generates n offspring by randomly selecting one haplotype from each parent), simOneFamily (which simulates one family from a population containing the haplotypes popHaplos) and simRecombHap (which simulates the recombination of haplotypes).
VIII. Function output: The last step is the creation of a list containing two different data frames, for genotypes and haplotypes respectively. This may be useful to compare simulated haplotypes with later reconstructed haplotypes.
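The idea behind simOffspring (one randomly chosen haplotype from each parent, without recombination) can be sketched in base R. The function simulateOffspring below is a hypothetical re-creation for illustration only; the package's internal implementation and its argument names may differ.

```r
# Hypothetical re-creation of the idea behind simOffspring; the package's
# internal implementation may differ. Haplotypes are character vectors with
# one allele per marker; each parent carries two haplotypes (matrix rows).
simulateOffspring <- function(n, fatherHaps, motherHaps) {
  lapply(seq_len(n), function(i) {
    pat <- fatherHaps[sample(2, 1), ]  # haplotype transmitted by the father
    mat <- motherHaps[sample(2, 1), ]  # haplotype transmitted by the mother
    rbind(paternal = pat, maternal = mat)
  })
}

set.seed(1)  # only to make the illustration reproducible
father <- rbind(c("A", "C", "G"), c("T", "C", "A"))
mother <- rbind(c("A", "G", "G"), c("C", "C", "T"))
kids <- simulateOffspring(3, father, mother)
kids[[1]]
```

Because no recombination is simulated here, every child haplotype is an exact copy of one parental haplotype.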
The following examples show how alleSimulator works:
alleSimulator Example 1: Simulation of a family containing parental missing data.
alleSimulator Example 2: Simulation of a family containing offspring missing data.

Data loading
Before the data loading process, since alleHap can handle large amounts of missing data, users should check what kind of missing values will be loaded. If those values are different from "−9" or "−99", the parameter 'missingValues' of alleLoader has to be updated with the corresponding value. For example, if the file to be loaded has been coded with zeros as missing values, 'missingValues = 0' must be specified.
Data loading may be used with either simulated or actual genetic data. This stage has been implemented in the alleLoader function for '.ped' files, the default input format. This function tries to read family data from an R data frame or from an external file, to later pass it into the alleImputer and/or alleHaplotyper functions. For this purpose this function goes through these five steps:
I. Loading of the internal function recodeNA: This auxiliary function recodes pre-specified missing data as NA values.
II. Extension check and data read: In this step, the file extension is checked and, if it is '.ped', the dataset is loaded into R as a data frame. Otherwise, the message "The file must have a .ped extension" is returned and the data will not be loaded. If the file extension is appropriate, data are loaded and missing values (by default '-9' or '-99') are recoded as NAs (users may supply other coding values).

III. Data check: The third step counts the number of families, individuals, parents, children, males, females and markers of the dataset, and checks the ranges of 'Paternal IDs', 'Maternal IDs', 'genotypes' and 'phenotype' values.
IV. Missing data count: This step counts the missing/unknown data which may exist in either genetic data or subjects' identifiers.
V. Function output: In the final step, the dataset is returned as an R data frame, with the same structure as a PED file, with the variables renamed and the missing values correctly identified and coded. If 'dataSummary = TRUE' a summary of previous data counting, ranges, and missing values is printed to the screen.
The intended datasets must conform to the specifications of a PED file: in each row the first six variables correspond to 'family ID', 'subject ID', 'paternal ID', 'maternal ID', 'sex', and affection status ('phenotype'). The rest of the variables are the observed genotypes in each marker, where each marker comprises two other variables.
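The recoding of user-specified missing codes can be illustrated with a small base-R sketch. The function recodeMissing is a hypothetical stand-in (it is not the package's recodeNA), and for simplicity it only recodes the genotype columns, although missing values may also appear in the first six columns.

```r
# Hypothetical stand-in for the recoding step; not the package's recodeNA.
recodeMissing <- function(ped, missingValues = c(-9, -99)) {
  genoCols <- 7:ncol(ped)  # genotype columns start at the seventh column
  ped[genoCols] <- lapply(ped[genoCols], function(x) {
    x[as.character(x) %in% as.character(missingValues)] <- NA
    x
  })
  ped
}

# Made-up trio with one marker, where zeros encode missing alleles
raw <- data.frame(fam = 1, id = 1:3, pat = c(0, 0, 1), mat = c(0, 0, 2),
                  sex = c(1, 2, 1), phen = 1,
                  m1a = c(1, 0, 1), m1b = c(2, 0, 3))
loaded <- recodeMissing(raw, missingValues = 0)
loaded
```

The zeros in the parental ID columns are left untouched, since in this sketch only columns seven onwards are treated as genotypes.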
The following examples depict how alleLoader should be used:

Allele imputation marker by marker
At this stage, the imputation of missing alleles in previously loaded/simulated datasets is performed. For this purpose, first a simple quality control of data is conducted and second a "marker by marker" allele imputation is carried out in those cases where possible. Both procedures are implemented in the alleImputer function, whose operation can be reduced to the following steps:
1. Internal functions: In this step all the necessary functions to impute the data are loaded. The most important ones are:
• mkrImputer, which performs the imputation in one marker. This function first receives as input data the alleles of that marker in one family, and then applies the quality control and makes imputation when possible, attending to the family structure as shown in Table 1. In the simplest cases, missing alleles in children are imputed only if a parent is homozygous. Missing alleles in a parent are imputed when a child is homozygous, or when the other parent has no missing values and alleles not present in that parent are found in some children.
• famImputer, which applies mkrImputer sequentially to impute all the markers in a family.
• famsImputer, which applies famImputer to all the families of the given data frame, returning a dataset with the same format and dimensions as the input data (with imputed values in those alleles where imputation has been possible).
2. Data loading: The second step tries to read genotypic data and the families information into a fully compatible format employing the alleLoader function. If this process is successful, data are stored in an R data frame with the same structure as a PED file.
3. Imputation: This is the most important step of the alleImputer function.First, marker by marker and then family by family, the imputation of the corresponding missing alleles is performed by the mkrImputer function in two stages: children imputation first and then parent imputation, as has been described.

4. Data summary: Once the imputation is done, a summary of the imputed data is collected. This summary contains information about the imputation process, i.e., number of imputed alleles, detected incidences (number of canceled markers due to problems detected in the quality control process), imputation rate (quotient of the imputed alleles to the number of originally missing alleles) and time consumed in the process.
5. Data storing: In this step, the imputed data are stored in the same path where the PED file was located. The generated new file will have the same name and extension as the original, ending as 'imputed.ped'.
6. Function output: In this final step, if 'dataSummary = TRUE' the imputation summary is printed out. Imputed data are directly returned as an R data frame (with the same structure and dimensions as the input dataset). Incidence messages are shown if they are detected at the 'quality control' phase.
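The simplest imputation rule and the imputation rate can be sketched in base R as follows. The helper imputeChildren is hypothetical and deliberately reduced: it only handles children whose genotype is completely missing, only uses homozygous parents, and performs no quality control, so it should not be read as the logic of mkrImputer itself.

```r
# imputeChildren() is a hypothetical, reduced version of the simplest rule:
# a homozygous parent transmits its only allele to every child.
# 'parents' (row 1 = father, row 2 = mother) and 'children' are matrices
# with the two allele columns of a single marker.
imputeChildren <- function(parents, children) {
  missingKids <- which(is.na(children[, 1]) & is.na(children[, 2]))
  for (p in 1:2) {
    g <- parents[p, ]
    if (anyNA(g) || g[1] != g[2]) next   # only a homozygous parent is informative
    children[missingKids, p] <- g[1]     # allele transmitted by parent p
  }
  children
}

parents  <- rbind(father = c("a", "a"), mother = c("b", "c"))
children <- rbind(child1 = c("a", "b"), child2 = c(NA, NA))
imputed  <- imputeChildren(parents, children)

# Imputation rate: imputed alleles over originally missing alleles
rate <- sum(is.na(children) & !is.na(imputed)) / sum(is.na(children))
```

In this made-up family only the paternal allele of child2 can be imputed, so one of the two originally missing alleles is recovered and the rate is 0.5.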
The following examples show how alleImputer works:
alleImputer Example 1: Deterministic imputation for familial data containing parental missing values.
## Simulation of a family containing parental missing data
alleImputer Example 2: Deterministic imputation for familial data containing offspring missing values.
## Simulation of two families containing offspring missing data
It must be taken into account that the alleImputer function makes the imputation for each marker without "looking" at the rest of the markers in the subject/family. Imputation results obtained with alleImputer improve when the rest of the markers come into consideration assuming that there is no recombination. This task is addressed by the function alleHaplotyper.

Data haplotyping
At this stage, the corresponding haplotypes of the pedigree database are generated. To accomplish this, based on the user's knowledge of the genomic region to be analysed, it is necessary to slice the data into non-recombinant chunks in order to perform the haplotype reconstruction in each one of them.
Depending on the existence of missing alleles in parents and/or children, we have considered four haplotyping scenarios. In the first one, there are no completely missing markers in parents, and children may be complete without missing alleles or may have full or partially missing data. In the second one, all of the parental markers may be entirely missing, and there are at least three children in the family without missing alleles. The third scenario is a mixture of the previous two: some markers have completely missing alleles in parents but are complete (without missing alleles) in at least three children; some markers have non-missing alleles in parents, with some missing values in children; and some markers may have no missing values in parents nor children. In the fourth scenario, parents have completely missing markers, and non-missing markers are available in only two children; in this scenario, deterministic reconstruction of the haplotypes is possible only in a small number of cases under some specific conditions.
Several algorithms have been developed in alleHap for the reconstruction of haplotypes and the imputation of missing alleles in each one of these scenarios.
The function alleHaplotyper identifies the adequate scenario in each case and applies the corresponding algorithm for imputing and haplotyping. The user does not have to worry about deciding what scenario corresponds to each family in the database, since the function takes care of it.
Users may choose among several symbols in order to specify the non-identified and missing values in the haplotypes. It is also possible to define the character that will be used as a separator between the alleles of the corresponding haplotypes. By default, the non-identified/missing allele symbol is '?', and haplotypes will be joined without any separator symbol between their corresponding alleles.
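How the alleles of one subject could be joined into two haplotype strings with these defaults is sketched below. The helper joinHaplotypes and its argument names are illustrative assumptions made for this sketch and are not taken from the package documentation.

```r
# joinHaplotypes() is a hypothetical helper, not part of alleHap.
# 'alleles' holds the alleles of one subject: two entries per marker,
# first entry of each pair = paternal allele, second = maternal allele.
# 'ids' flags, per marker, whether the genotype is phased (1) or not (0).
joinHaplotypes <- function(alleles, ids, NAsymbol = "?", alleSep = "") {
  pat <- alleles[seq(1, length(alleles), by = 2)]
  mat <- alleles[seq(2, length(alleles), by = 2)]
  # Markers that are unphased or missing are written with the missing symbol
  unresolved <- ids == 0 | is.na(pat) | is.na(mat)
  pat[unresolved] <- NAsymbol
  mat[unresolved] <- NAsymbol
  c(hap1 = paste(pat, collapse = alleSep), hap2 = paste(mat, collapse = alleSep))
}

# Three markers; the second one could not be phased
haps <- joinHaplotypes(c("A", "C", "G", "T", "A", "G"), ids = c(1, 0, 1))
haps
```

With the default symbol and no separator, this subject is reported with the haplotypes "A?A" and "C?G".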
The alleHaplotyper function constructs the haplotypes "family by family" taking into account the initially known genotypes as well as the genotypes already imputed by the function alleImputer, along with the IDS matrix. In order to generate the haplotypes, this function performs six successive steps:
I. Loading of internal functions: In this step, several functions are loaded, the most important ones being:
• famHaplotyper, which carries out the haplotype reconstruction for each family as follows: 1) Receives as input data the matrix of imputed data returned by alleImputer for one family. 2) Applies the adequate algorithm depending on the specific scenario of each family (according to the amount of genotypic information available). 3) Returns: a) a matrix equal to the input matrix, but with the new imputed alleles; b) a matrix with the same dimensions as the previous one filled with 0's and 1's, where the value 0 indicates a non-phased allele and the value 1 represents a phased allele; and c) another matrix with two columns corresponding to the haplotypes found in each member of the family.
• famsHaplotyper, which applies famHaplotyper sequentially to all the families in the dataset.
• summarizeData, which generates a summary of the haplotyping process.
II. Allele imputation marker by marker in each family: This step calls the alleImputer function which performs the imputation marker by marker and then, family by family.

III. Haplotyping: This part is the most important of the alleHaplotyper function since it tries to solve the haplotypes when possible. The process is the following: once each family genotype has been imputed marker by marker, those markers containing two unique heterozygous alleles (both in parents and offspring) are excluded from the process. Then, a set of IDentified/Sorted (IDS) matrices is generated per family (one per subject), organized in an R array. Later, the internal function famHaplotyper tries to solve the haplotypes of each family, comparing the information between parents and children in an iterative and reciprocal way. When there are no genetic data in either parent, and there are two or more "unique" offspring (not twins or triplets), the internal functions makeHapsFromThreeChildren and makeHapsFromTwoChildren try to solve the remaining data. Finally, the Homozygosity (HMZ) matrix is updated, and the excluded markers are included again. Even if both parental alleles are missing in each marker, it is possible in some cases to reconstruct all the family haplotypes, identifying the corresponding children's haplotypes, although in certain cases their parental provenance will be unknown.
IV. Data summary: Once the data haplotyping is done, a data summary is collected, containing a re-imputation rate (after the haplotyping process), the proportions of phased and non-phased alleles, the proportion of fully, partially and non-reconstructed (empty) haplotypes, and the time employed in the process.
V. Data storing: In this step, when data have been read from an external file, the re-imputed data are stored in the same path where the PED file was located. Two new files will be generated with the same name and extension as the original, but ending as 're-imputed.ped' and 'haplotypes.txt', for the re-imputed genotypes and the reconstructed haplotypes, respectively.

VI. Function output: In this final step, a summary of the generated data may be printed out, if 'dataSummary=TRUE'. All the results can be directly returned if 'invisibleOutput=FALSE'. The list of results contains: imputedMkrs (with the preliminary imputed marker alleles), IDS (including the resulting IDentified/Sorted matrix), reImputedAlls (including the re-imputed alleles), haplotypes (storing the reconstructed haplotypes), and haplotypingSummary (showing a summary of the haplotyping process). Incidence messages can also be shown if they are detected. These may be caused by haplotype recombination (detected on children), genotyping errors, or inheritance from non-declared parents.
The following example depicts how alleHaplotyper works:
alleHaplotyper Example 1: Haplotype reconstruction of a simulated family with parental missing data.

Accuracy

Initial considerations
The alleHap package was originally designed to determine haplotypes from nuclear families, i.e. both parents and their offspring (several children). The program can reconstruct haplotypes even when the genotypes of both parents (or only one) are completely missing. However, it does not fit well with the typical situation of cattle breeding in which a single progenitor (male) has had offspring with many females, or vice versa, when a female has had offspring with several males.
Although in future versions of alleHap this functionality will be included, the following simulations have been developed using the standard context of nuclear families containing the same parents and a variable number of descendants.
To compare the accuracy and performance of alleHap with other software, we have selected the AlphaImpute (Hickey et al., 2012) and FImpute (Sargolzaei et al., 2014) programs, since both have similar characteristics to alleHap, namely both consider pedigree information for inferring haplotypes and imputing missing data. The following comparisons have been performed regarding the proportion of genotyping errors, missing values, and phasing (haplotyping) errors for the three programs.

Simulation parameters
Data for nuclear families ranging from one to 15 children and 50 alleles per haplotype have been simulated. In each case, data have been generated for two missing rates in the children's alleles: 0.10 (10% of the children's alleles are missing) and 0.25 (25% of genotypes in children are missing). Likewise, families with missing rates for parental alleles from 0% (fully genotyped) to 100% (completely ungenotyped) were simulated. In order not to slow down the simulations, only 100 families were simulated for each case.

Incorrectly identified alleles per haplotype
Figure 2 shows the proportion of incorrectly identified alleles in each haplotype. As can be seen, when using alleHap the ratio is always zero. In those cases when an allele cannot be unequivocally identified it is left as a missing value. However, both AlphaImpute and FImpute have algorithms that use HMM and the information of the rest of the families to impute probabilistically.
When using AlphaImpute, it can be seen that when there is a low proportion of children and parents with missing genotypes, the rate of imputation errors is low (practically null when the parents are completely genotyped). However, as the rate of missing parental alleles increases, the proportion of wrongly imputed alleles also increases. This effect is greater the higher the number of children per family, which indicates that with more data AlphaImpute tends to impute more, but with a higher number of imputation errors.
With FImpute, it is observed that even when there are no missing alleles in parents or children, the program tends to erroneously impute a proportion of alleles (it does so in cases of heterozygous SNPs in which it is not possible to deterministically decide which haplotype each allele belongs to). The imputation error rate decreases as larger families become available. In any case, the imputation error rates reach between 20% and 25% when parents have all alleles missing.

Missing alleles per haplotype
In Figure 3 it can be seen that the proportion of non-imputed alleles that remain after the application of the algorithm tends to be lower in alleHap than in AlphaImpute for all conditions. In addition, the greater the number of children available and the lower the rate of missing alleles in both parents and children, the lower the proportion of alleles remaining unallocated.
For FImpute, if the rate of missing alleles in parents is less than 75%, in the cases we have explored (missing allele rate in children up to 25%), there are practically no alleles left unallocated, that is, almost all the alleles are imputed. However, as mentioned previously, FImpute has a high imputation error rate, which increases precisely with the rate of missing alleles in parents/children, and decreases with the number of families.
So, regarding the rate of missing alleles per haplotype, it could be said that alleHap is advantageous with respect to AlphaImpute, since it produces a smaller proportion of alleles that are finally left not imputed, and does not generate imputation errors.On the other hand, FImpute has an allele imputation rate higher than alleHap, but at the cost of making many more imputation errors.

Completely reconstructed haplotypes after imputation and haplotyping
Figure 4 shows the proportion of haplotypes that were completely and correctly created with each method. It can be seen that for a large number of families with a low rate of missing alleles (under 10%), alleHap is able to completely and correctly reconstruct more than 90% of haplotypes.
Obviously, as the rate of missing alleles in parents and offspring increases, alleHap's haplotyping rate decreases (if 50% of the parental alleles are missing, alleHap is able to completely and correctly reconstruct 25% of haplotypes, provided that large families are available).
When the allele missing rate in parents or children exceeds 10%, complete haplotype reconstruction rates are (in comparison to alleHap) generally lower with AlphaImpute and somewhat higher with FImpute. However, it should be noted that the haplotypes completely reconstructed by alleHap do not have any incorrectly allocated alleles.
In the case of FImpute, for an entirely reconstructed haplotype, we cannot know whether the reconstruction has been good (no errors), or if there have been misidentified alleles (it could have a high proportion of misidentified alleles).

Performance

Initial considerations
Since our package was created mainly with accuracy in mind, i.e., the univocal treatment of family-based allelic databases, and not for processing large genomic data, it does not make much sense to perform a comparison on equal terms with programs that were implemented and compiled in other, faster languages.
In any case, we have carried out a benchmarking where the alleHap computing times were measured, evaluating the performance of the simulation, imputation, and haplotyping processes depending on two main factors: number of individuals and number of markers.

Computing times regarding the number of individuals
The simulations depending on the number of individuals had the following parameters: from 1 to 8000 individuals (two children per family), 3 markers (three allele pairs), two different alleles per marker, non-numerical alleles (A, C, G, or T), 25% of parental missing genotypes, 25% of children's missing genotypes, and 1200 different haplotypes in the population.
In Figure 5, it can be seen how haplotyping and imputation times grow linearly as the number of individuals increases, while simulation times hardly grow. Therefore, it can be said that alleHap consumes very little time when using a small number of markers, even for a considerable number of individuals.

Computing times regarding the number of markers
The simulations depending on the number of markers were developed using the following factors: from 1 to 5000 markers (from one allele pair to five thousand), one family (four individuals), two different alleles per marker, non-numerical alleles (A, C, G, or T), 25% of parental missing genotypes, 25% of children's missing genotypes, and 1200 different haplotypes in the population. Figure 5 shows how simulation times grow in a non-linear way as the number of markers increases, while imputation and reconstruction times remain linear (even considering a large number of markers). Note that for this analysis only a single family (containing four individuals) has been considered, although it can be presumed that for a larger number of individuals this growth will also remain linear.

Summary
We have developed an improved version of the R package alleHap for the imputation of alleles and the reconstruction of haplotypes in non-recombinant genomic regions using pedigree databases. The procedure is entirely deterministic and uses the information contained in the family structure to guide the process of imputation and reconstruction of haplotypes, without resorting to a reference panel. The package has two main functions: alleImputer and alleHaplotyper.
The first one, alleImputer, makes allele imputation marker by marker, taking into account the alleles present in parents and siblings and considering that each individual (due to meiosis) should unequivocally have two alleles per marker, one inherited from each parent.
The function alleHaplotyper first calls alleImputer for an initial imputation of missing alleles and then, assuming that there is no recombination, determines the haplotypes compatible with the family structure by comparing parental genotypes with those of the children. When an inconsistency is detected, alleHaplotyper reports an error message specifying whether the inconsistency may be due to a possible recombination or to a genotyping error. In addition, the haplotype construction procedure allows the imputation of those alleles that were not previously imputed by alleImputer. The function alleHaplotyper has been entirely rewritten with respect to previous versions of the package. The function alleImputer has also been modified to include a new quality control procedure and to improve the presentation of the summary of results.
Genotypic information can be read from an R data frame or from a PED file by the function alleLoader designed specifically with this aim.
The package also includes the function alleSimulator for the simulation of pedigrees. This function handles many arguments, such as number of families, markers, alleles per marker, probability and proportion of missing genotypes, recombination rate, etc., and it generates an R data frame with two lists, one with the structure of a .ped file, and the other with the haplotypes generated for each member of the simulated families. We have used alleSimulator for testing the performance of the other functions, with satisfactory results regarding imputation rate, haplotyping rate, and time of execution, even when handling large amounts of genetic data.
Future improvements of the package include the possibility of considering extended pedigrees (including grandparents, grandchildren, and other relatives) and of making inferences on haplotypes and missing alleles when these cannot be deterministically derived.

Bibliography
1. Imputation of children's genotypes: It is identified which allele has been inherited from the father and which from the mother. If a parent has a homozygous genotype, the corresponding allele is imputed in all children with missing alleles (which do not already have this allele). Moreover, if both parents are homozygous, all children with missing alleles are readily imputed.
2. Imputation of parental genotypes: Given a reference child, it is determined which allele has been transmitted to that child:
(a) If a child has a homozygous genotype, the allele is imputed in each parent that does not already have this allele.
(b) If one parent has missing alleles and the other does not, and there are heterozygous children, the alleles present in those children that are not found in the non-missing parent are imputed to the parent with missing alleles.
Assuming that genotypic markers are part of the same haplotype, i.e., there is no recombination between markers, we have considered that when missing data occur in a subject's marker, missingness affects both alleles (i.e., each marker is fully missing for the given subject). However, if the subject is a child and a parent is homozygous at the same marker (say G/G), only one allele will be imputed in that child by the alleImputer function. Thus, the child's genotype would be G/NA (where NA stands for a missing value). The same would occur if a fully missing marker is located in a parent and there is a homozygous child at that marker.
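The marker-by-marker rules above can be illustrated with a minimal Python sketch. It is not the alleImputer implementation: the function names, the tuple representation of a genotype, and the use of None for NA are assumptions made for illustration, and only the homozygous-parent rule and rule (a) for parents are covered.

```python
NA = None  # stands for a missing allele (NA in the text)

def is_homozygous(genotype):
    """True when both alleles are known and identical, e.g. ("G", "G")."""
    a, b = genotype
    return a is not NA and a == b

def pad(known):
    """Return a two-allele genotype, filling free slots with NA."""
    return tuple(known + [NA] * (2 - len(known)))

def impute_child(father, mother, child):
    """Rule 1: a homozygous parent transmits its only allele to the child."""
    if NA not in child:
        return child  # nothing to impute
    if is_homozygous(father) and is_homozygous(mother):
        return (father[0], mother[0])  # both parental alleles are determined
    known = [a for a in child if a is not NA]
    for parent in (father, mother):
        # Impute only if a slot is free and the child lacks this allele
        if is_homozygous(parent) and len(known) < 2 and parent[0] not in known:
            known.append(parent[0])
    return pad(known)

def impute_parent(parent, children):
    """Rule 2(a): a homozygous child received its allele from both parents."""
    known = [a for a in parent if a is not NA]
    for child in children:
        if is_homozygous(child) and len(known) < 2 and child[0] not in known:
            known.append(child[0])
    return pad(known)

# Father G/G, mother G/T, child fully missing: only one allele is imputed
print(impute_child(("G", "G"), ("G", "T"), (NA, NA)))
```

With a homozygous G/G father and a heterozygous mother, the fully missing child becomes G/NA, which matches the example described in the text.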

Data haplotyping
The haplotyping procedure begins by considering only the offspring, trying to sort the alleles in each marker in such a way that the allele in the first row of the matrix A_i is the one inherited from the father (see the Used notation section), and the allele in the second row is the one inherited from the mother. If all the markers are sorted this way, the first row of the matrix A_i is the haplotype inherited from the father and the second row is the haplotype inherited from the mother. Once these haplotypes have been found in children, they can be readily identified in parents. What complicates this idea and makes its direct application difficult is the fact that in some cases both parents and a child can share the same genotype (say G/T for the three subjects), and therefore it is not possible to know which allele has been inherited from which parent. Thus, we considered these four scenarios for the haplotyping procedure:
• Scenario 1: There are no fully missing markers in parents.
• Scenario 2: There are missing markers in parents.
• Scenario 3: Mixture of the previous scenarios.
• Scenario 4: There are missing markers in parents and only two fully genotyped children.
In the first scenario there are no fully missing markers in parents, and children may have no missing alleles or may have fully or partially missing data. In the second scenario, all of the parental markers are fully missing and there are at least three children in the family without missing alleles. The third scenario is a mixture of the previous two: some markers have parents with fully missing alleles and at least three complete children (without missing alleles), some markers have non-missing alleles in parents with some missing values in children, and some markers may have no missing values in either parents or children. Finally, the fourth scenario covers the conditions in which alleles can be deterministically phased with only two children when parents have completely missing markers.
Furthermore, there may be missing alleles in parents or children that prevent the determination of the provenance of some alleles in some markers. In particular, if both parents have all the alleles missing in a marker, it is impossible to determine the provenance of the alleles of that marker in the children, at least if there are fewer than three children in the family. In Scenario 2, where the family has three or more children, deterministic phasing can be carried out even when parents are fully missing, provided that at least three children have no missing alleles. In Scenario 4, for the particular case of having only two children, if parental alleles are available in some markers and are fully missing in the others, it is possible (under certain conditions) to determine the phase of the alleles in those missing markers.
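The basic idea for Scenario 1 (no missing markers in parents) can be sketched in Python as follows. This is a simplified illustration, not the alleHaplotyper algorithm: the names phase_marker and phase_child are hypothetical, genotypes are assumed to be complete two-allele tuples, and a marker whose parental origin is ambiguous is simply reported as unresolved, whereas the package uses the rest of the family to resolve such cases.

```python
def phase_marker(father, mother, child):
    """Return (paternal_allele, maternal_allele) for one marker, or None
    when the origin of the alleles is ambiguous or inconsistent."""
    a, b = child
    options = set()
    # Try both assignments of the child's alleles to father and mother
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    return options.pop() if len(options) == 1 else None

def phase_child(father, mother, child):
    """Sort every marker by parental origin and join the rows into the
    paternal and maternal haplotypes (no recombination assumed)."""
    phased = [phase_marker(f, m, c) for f, m, c in zip(father, mother, child)]
    if any(p is None for p in phased):
        return None  # at least one marker cannot be phased from this trio alone
    paternal = "".join(p[0] for p in phased)
    maternal = "".join(p[1] for p in phased)
    return paternal, maternal

father = [("A", "A"), ("G", "T")]
mother = [("A", "C"), ("T", "T")]
child = [("A", "C"), ("G", "T")]
print(phase_child(father, mother, child))

# Both parents and the child share G/T: the origin cannot be determined
print(phase_marker(("G", "T"), ("G", "T"), ("G", "T")))
```

The second call reproduces the complication mentioned in the text: when the three subjects share the same heterozygous genotype, the marker cannot be sorted from the trio alone.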

Figure 1: Graphical description of the package's workflow.

Figure 4: Proportion of fully reconstructed haplotypes after imputation and haplotyping.

Figure 5: Computing times for simulation, imputation, and haplotyping, depending on the number of individuals.

Figure 6: Computing times for simulation, imputation, and haplotyping, depending on the number of markers.

II. Alleles per marker: The second step is the simulation of the number of different alleles per marker for the entire population (if the user does not supply them). The user can specify whether the alleles are letters (coded as A, C, G, or T) or if they are coded numerically. When the alleles are letters, only two possible different values are assigned to each marker; otherwise, between two and nine different values are randomly allotted.

III. Haplotypes in population: If there are many markers or alleles per marker, the number of possible haplotypes can be very large. By default, the number of possible different haplotypes generated by the function is limited to 1200, although the user can modify this value with the argument nHaplos.
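The two simulation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the alleSimulator code: the function names are hypothetical, numeric alleles are assumed to be coded 1 to 9, and the population haplotypes are drawn uniformly at random, which the text does not specify. The parameter n_haplos mirrors the nHaplos argument mentioned above.

```python
import random
from math import prod

def simulate_marker_alleles(n_markers, numeric=False, rng=None):
    """Choose the set of different alleles available at each marker."""
    rng = rng or random.Random()
    markers = []
    for _ in range(n_markers):
        if numeric:
            k = rng.randint(2, 9)  # between two and nine different values
            markers.append(rng.sample(range(1, 10), k))  # codes 1..9 (assumed)
        else:
            markers.append(rng.sample("ACGT", 2))  # two different letters
    return markers

def population_haplotypes(markers, n_haplos=1200, rng=None):
    """Draw distinct haplotypes, capped at n_haplos (default 1200)."""
    rng = rng or random.Random()
    total = prod(len(m) for m in markers)  # number of possible haplotypes
    target = min(total, n_haplos)
    haplos = set()
    while len(haplos) < target:
        haplos.add(tuple(rng.choice(m) for m in markers))
    return sorted(haplos)

rng = random.Random(0)
markers = simulate_marker_alleles(12, rng=rng)   # 2**12 = 4096 possibilities
haplos = population_haplotypes(markers, rng=rng)
print(len(haplos))
```

With 12 biallelic markers there are 4096 possible haplotypes, so the cap applies and 1200 distinct haplotypes are kept; with 3 biallelic markers all 8 possible haplotypes would be generated.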