netgwas: An R Package for Network-Based Genome Wide Association Studies

Graphical models are a powerful tool for modelling and analysing complex biological associations in high-dimensional data. The R package netgwas implements recent methodological developments in copula graphical models to (i) construct linkage maps, (ii) infer linkage disequilibrium networks from genotype data, and (iii) detect high-dimensional genotype-phenotype networks. The netgwas package learns the structure of networks from ordinal data and mixed ordinal-and-continuous data. Here, we apply the functionality in netgwas to various multivariate example datasets taken from the literature to demonstrate the kind of insight that can be obtained from the package. We show that our package offers a more realistic association analysis than classical approaches, as it discriminates between direct and induced correlations by adjusting for the effect of all other variables while performing pairwise associations. This feature controls for spurious associations between variables that can arise with conventional approaches. The netgwas package uses a parallelization strategy on multi-core processors to speed up computations.

Pariya Behrouzi (Wageningen University and Research) , Danny Arends (Albrecht Daniel Thaer-Institut für Agrar und Gartenbauwissenschaften) , Ernst C. Wit (Universita della Svizzera Italiana (USI))
2023-02-10

1 Introduction

Graphical models are commonly used in statistics and machine learning to model complex dependency structures in multivariate data (Lauritzen 1996; Hartemink et al. 2000; Lauritzen and Sheehan 2003; Dobra et al. 2004; Friedman 2004; Jordan 2004; Edwards et al. 2010; Behrouzi et al. 2018; Vinciotti et al. 2022) where each node in the graph represents a random variable and edges represent conditional dependence relationships between pairs of variables. Therefore, the absence of an edge between two nodes indicates that the two variables are conditionally independent. The netgwas package contains an implementation of undirected graphical models to address the three key and interrelated goals in genetics association studies: (i) building linkage maps, (ii) reconstructing linkage disequilibrium networks, and (iii) detecting genotype-phenotype networks (see Fig 1). Below we provide a brief introduction for each section of netgwas.

A linkage map describes the linear order of genetic markers within linkage groups (chromosomes). It is the first requirement for estimating the genetic background of phenotypic traits in quantitative trait loci (QTL) studies and is commonly used to link phenotypic traits to the underlying genetics of the population. In practice, many software packages for performing QTL analysis require linkage maps (Lander et al. 1987; Broman et al. 2003, 2019; Yang et al. 2008; Taylor et al. 2011; Huang et al. 2012). Most organisms are categorized as diploid or polyploid according to the copy number of each chromosome. Diploids have two copies of each chromosome (like humans). Polyploid organisms have more than two copies of each chromosome (like most crops). Polyploidy is common in plants and in crops such as apple, potato, and wheat, which contain three (triploid), four (tetraploid), and six (hexaploid) copies of each of their chromosomes, respectively. Despite the importance of polyploids, statistical tools for their map construction are underdeveloped (Grandke et al. 2017; Bourke et al. 2018). Most software packages, such as R/qtl (Broman et al. 2003), OneMap (Margarido et al. 2007), Pheno2Geno (Zych et al. 2015), and MST MAP (Wu et al. 2008; Taylor and Butler 2017), contain functionality to construct linkage maps only for diploid species. Packages such as MAPMAKER (Lander et al. 1987), TetraploidSNPMap (Hackett et al. 2017), and polymapR (Bourke et al. 2018) have the functionality to construct polyploid linkage maps but focus mainly on a specific type of polyploid species (e.g. tetraploids). All the aforementioned packages contain methods that use pairwise estimation of recombination frequencies and LOD (logarithm of the odds ratio) scores (Wang et al. 2016) and often require manual interaction. This often leads to an underpowered approach and confounding of correlated genotypes by failing to correct for intermediate markers (Behrouzi and Wit 2019a). In contrast, the netgwas R package uses a multivariate approach to construct linkage maps for diploid and polyploid species in a unified way. This is achieved by utilising the pairwise Markov property between any two genetic markers and constructing the linkage map by simultaneously assessing the complete set of pairwise comparisons. This often leads to an improved marker order over more conventional methods.

The linkage map, which can be constructed using the first key function of netgwas in Fig 1, provides the genetic basis for the second key function, which detects the patterns of linkage disequilibrium and segregation distortion in a population. Segregation distortion (SD) refers to any deviation from the segregation ratios expected under Mendelian rules of inheritance, and linkage disequilibrium (LD) refers to non-random associations between loci (locations) on the same or different chromosomes.

Revealing the structure of LD is important for association mapping studies as well as for studying the genomic architecture of a genome. Various methods have been published in the literature for measuring statistical associations between alleles at different loci; see, for instance, Hedrick (1987), Clarke et al. (2011), Bush and Moore (2012), Mangin et al. (2012), and Kaler et al. (2020). Most of these measures are based on an exhaustive genome scan for pairs of loci and the \(r^2\) measure, the square of the loci correlation. The drawback of such approaches is that association testing at the genome scale is underpowered, so that weak long-range LD will go undetected. Furthermore, they do not simultaneously take the information of more than two loci into account to make full and efficient use of modern multi-loci data. The netgwas package contains functionality to estimate pairwise interactions between different loci in a genome while adjusting for the effect of other loci, to efficiently detect short- and long-range LD patterns in diploid and polyploid species (Behrouzi and Wit 2019b). Technically, this requires estimating a sparse adjacency matrix from multi-loci genotype data, which usually contain a large number of markers (loci), where the number of markers can far exceed the number of individuals. The non-zero patterns of the adjacency matrix created by the functions in netgwas show the structures of short- and long-range LD of the genome. The strength of associations between distant loci can be calculated using partial correlations. Furthermore, the methods implemented in netgwas account for the correlation between markers while associating them to each other, and thereby avoid the problem of population structure (that is, correlations between physically unlinked markers).

Figure 1: Configuration of the netgwas package. Depending on the data type different functions can be used. Each color represents one main function of the package. The three main functions are: i) netmap() (in green) constructs linkage maps for bi-parental species with any ploidy level, ii) netsnp() (in blue) detects the conditional dependent short- and long-range linkage disequilibrium structure of genomes and iii) netphenogeno() (in black) reconstructs association networks between phenotypes and genetic markers.

A major problem in genetics is the association between genetic markers and the status of a disease (trait or phenotype). Genome-wide association studies are among the most important approaches for further understanding the genetics underlying complex traits (Welter et al. 2013; Kruijer et al. 2020). However, in genome-wide association methods genetic markers are often tested individually for association with the phenotype. Since genome-wide scans analyze thousands or even millions of markers, the issue of multiple testing is usually addressed by using a stringent significance threshold (Panagiotou et al. 2011). Such methods work only if the associations are strong enough to pass the stringent threshold. However, even if that is the case, this type of analysis has several limitations, which have been discussed extensively in the literature (Hoggart et al. 2008; He and Lin 2010; Rakitsch et al. 2012; Buzdugan et al. 2016). In particular, the main issue with this type of analysis is that the phenotype is tested for association with each genetic marker individually, ignoring the effects of all other genetic markers. This leads to failures in the identification of causal loci. If we consider two correlated loci, of which only one is causal for the phenotype, both may show a marginal association, but only the causal locus will be detected by our method. The methods implemented in the netgwas package tackle this issue by using Gaussian copula graphical models (Klaassen and Wellner 1997; Hoff 2007), which account for the correlation between markers while associating them to the phenotypes. As shown in Klasen et al. (2016), this key feature avoids the need to correct for population structure or any genetic background, as the method implemented in the netgwas package simultaneously associates all markers to the phenotype. In contrast, most GWAS methods rely on population structure correction to avoid false genotype-phenotype associations due to their single-loci approach (Yu et al. 2006; Kang et al. 2008, 2010; Lippert et al. 2011).


Algorithm 1: Monte Carlo Gibbs sampling for estimating \(\bar{R}\) in (1)


Input: A data set containing the variables \(Y^{(i)}\) for \(i = 1 \ldots, n\).
Output: Mean of the conditional expectation, \(\bar{R} = \frac{1}{n} \sum\limits_{i=1}^{n} E( Z^{(i)} Z^{(i)^T} | y^{(i)}; \mathcal{D}, \Theta^{(m-1)})\) in (1).

  1. For each \(j = 1, \ldots, p\) generate the latent data from \(Y_j= F^{-1}_j(\Phi(Z_j))\), where \(F_j\) and \(\Phi\) define the empirical marginals and the CDF of the standard normal distribution, respectively, for \(i = 1, \ldots, n\) and \(j = 1, \ldots, p\);

  2. Estimate \(\mathcal{D}\), vectors of lower and upper truncation points, whose boundaries come from \(Y^{(i)}\) for \(i = 1, \ldots, n\);

  3. for \(i = 1, \ldots, n\) do

  4. Sample from a p-variate truncated normal distribution with the boundaries in Line 2 above;

  5. for \(N\) iterations do

  6. Estimate \(R^{(i), N} = E(Z_\star^{(i)} Z_\star^{(i)T} | y^{(i)}, \widehat{\Theta}^{(m)})\), where \(Z_\star^{(i)}= \left( {\begin{array}{c} Z_\star^{(i)1}\\ \vdots \\ Z_\star^{(i)N} \ \end{array} } \right) \in \mathbb{R}^{N \times p}\);

  7. end for

  8. Update \(\widehat{R}^{(i)} = \frac{1}{N} Z_\star^{(i)} Z_\star^{(i)T}\);

  9. end for

  10. Calculate \(\widehat{\bar{R}} = \frac{1}{n} \sum\limits_{i=1}^{n} \widehat{R}^{(i)}\).
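
The truncated-normal sampling step in Line 4 of Algorithm 1 can be sketched in R with the tmvtnorm package as below; the correlation matrix and the truncation points are hypothetical placeholders, and the draws simply illustrate how the Monte Carlo estimate of \(E(Z^{(i)} Z^{(i)^T} | y^{(i)})\) is formed for a single individual. This is a minimal sketch, not the netgwas implementation.

    ## A sketch of the Gibbs sampling step (Line 4) for one individual, assuming
    ## the tmvtnorm package; Sigma and the truncation points are hypothetical.
    library(tmvtnorm)

    p     <- 4
    Sigma <- diag(p)                   # current estimate of the latent correlation matrix
    lower <- c(-Inf, 0.2, -0.5, 1.0)   # lower truncation points from D for this individual
    upper <- c( 0.2, 1.3,  0.7, Inf)   # upper truncation points

    Z_star <- rtmvnorm(n = 1000, mean = rep(0, p), sigma = Sigma,
                       lower = lower, upper = upper, algorithm = "gibbs")
    R_i <- crossprod(Z_star) / nrow(Z_star)   # Monte Carlo estimate of E(Z Z^T | y)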


2 Technical details

Graphical models combine graph theory and probability theory to create networks that model complex probabilistic relationships. Undirected graphical models represent the joint probability distribution of a set of variables via a graph \(G=(V,E)\), where \(V = \{1, 2, \ldots, p\}\) specifies the set of random variables and \(E \subset V\times V\) represents undirected edges \((i,j) \in E \Leftrightarrow (j,i) \in E\). The pattern of edges in the graph represents the conditional dependencies between the variables; the absence of an edge between two nodes indicates that any statistical dependency between these two variables is mediated via some other variable or set of variables. The conditional dependencies between variables, which are represented by edges between nodes, are specified via parameterized conditional distributions. We refer to the pattern of edges as the structure of the graph. In this paper, the goal is to learn the graph structure from ordinal data and mixed ordinal-and-continuous data.

Sparse latent graphical model.

A p-dimensional copula \(\mathcal{C}\) is a multivariate distribution with uniform margins on \([0, 1]\). Any joint distribution function can be written in terms of its marginals and a copula which encodes the dependence structure. Here we consider a subclass of joint distributions encoded by the Gaussian copula

\(F(y_1, \ldots, y_p) = \Phi_p \Big( \Phi^{-1}(F_1(y_1)), \ldots, \Phi^{-1}(F_p(y_p)) \ | \ \Omega \Big)\)

where \(\Phi_p(. \ | \ \Omega)\) is the cumulative distribution function (CDF) of p-variate Gaussian distribution with correlation matrix \(\Omega\); \(\Phi\) is the univariate standard normal CDF; and \(F_j\) is the CDF of \(j\)-th variable, \(Y_j\) for \(j = 1, \ldots, p\). Note that \(y_j\) and \(y_{j'}\) are independent if and only if \(\Omega_{jj'} = 0\).

A Gaussian copula can be written in terms of latent variables \(Z\): Let \(F_j^{-1}(u) = \inf\{ x \in \mathbb{R}: F_j(x) \ge u \}\) be the pseudo-inverse of the marginals. Then a Gaussian copula is defined as \(Y_{ij} = F_j^{-1} (\Phi(Z_{ij}))\) where \(Z \sim \mathcal{N}_p (0, \Omega)\), and \(Y= (Y_1, \ldots, Y_p)\) and \(Z= (Z_1, \ldots, Z_p)\) represent the non-Gaussian observed variables and Gaussian latent variables, respectively. Without loss of generality, we assume that the \(Z_j\)’s have unit variances, \(\sigma_{jj}= 1\) for \(j = 1,\ldots,p\). Thus, the \(Z_j\)’s marginally follow a standard Gaussian distribution. Each observed variable \(Y_j\) is discretized from its latent counterpart \(Z_j\). For the \(j\)-th latent variable (\(j = 1, \ldots,p\)), we assume that the range \((-\infty, \infty)\) splits into \(K_j\) disjoint intervals by a set of thresholds \(-\infty = t^{(0)}_j < t^{(1)}_j < \ldots < t^{(K_j-1)}_j < t^{(K_j)}_j = \infty\) such that \(Y_j = k\) if and only if \(Z_j\) falls in the interval \((t^{(k-1)}_j , t^{(k)}_j)\). Let the parameter \(\mathcal{D} = \{ t^{(k)}_j: j =1,\ldots,p; k = 1, \ldots, K_j\}\) hold the boundaries for the truncation points, and let \(z^{(1:n)} = [z^{(1)}, \ldots, z^{(n)}]\) where \(z^{(i)} = (z_1^{(i)}, \ldots, z_p^{(i)})\). In order to learn the graph structure, our objective is to estimate the precision matrix \(\Theta = \Omega^{-1}\) from \(n\) independent observations \(y^{(1:n)} = [y^{(1)}, \ldots, y^{(n)}]\), where \(y^{(i)} = (y_1^{(i)}, \ldots, y_p^{(i)})\). The conditional independence between two variables given other variables is equivalent to the corresponding element in the precision matrix being zero, i.e. \(\theta_{ij} = 0\); or, put another way, a missing edge between two variables in a graph \(G\) represents conditional independence between the two variables given all other variables.
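
To make the latent-variable construction concrete, the toy R sketch below generates latent Gaussian data with a given correlation matrix and thresholds it into ordinal dosages; the correlation matrix, thresholds, and sample size are all hypothetical choices for illustration.

    ## A toy sketch of the latent Gaussian copula model: ordinal Y obtained by
    ## thresholding latent Z; Omega and the thresholds are hypothetical.
    library(MASS)

    set.seed(1)
    p     <- 3
    Omega <- matrix(0.5, p, p); diag(Omega) <- 1    # latent correlation matrix
    Z     <- mvrnorm(n = 200, mu = rep(0, p), Sigma = Omega)

    ## thresholds split (-Inf, Inf) into K_j = 3 intervals, giving dosages 0, 1, 2
    thresholds <- c(-Inf, -0.5, 0.5, Inf)
    Y <- apply(Z, 2, function(z) cut(z, breaks = thresholds, labels = FALSE) - 1)
    table(Y[, 1])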

In the classical low-dimensional setting of \(p\) smaller than \(n\), it is natural to implement a maximum likelihood approach to obtain the inverse of the sample covariance matrix. However, in modern applications the dimension \(p\) is routinely far larger than \(n\), meaning that the inverse sample covariance matrix does not exist. Motivated by the sparseness assumption of the graph we tackle the high-dimensional inference problem for discrete \(Y\)’s by a penalized expectation maximization (EM) algorithm as

\[\begin{aligned} \label{Estep} Q(\Theta \ | \ \widehat{\Theta}^{(m-1)} ) = \frac{n}{2} \Big[ \log \det(\Theta) - tr( \bar{R} \Theta ) \Big] \end{aligned} \tag{1}\] and \[\begin{aligned} \label{Mstep} \widehat{\Theta}^{(m)} = \arg \max_{\Theta} \tilde{Q}(\Theta | \hat{\Theta}^{(m-1)}) - \sum\limits_{j \ne j'}^{p} P_\rho(| \theta_{jj'}| ), \end{aligned} \tag{2}\] where \(m\) is the iteration number within the EM algorithm. The last term in Equation (2) represents different penalty functions. Here we impose sparsity by means of an \(L_1\) penalty on the \(jj'\)-th element of the precision matrix.

We compute the conditional expectation \(\bar{R} = \frac{1}{n} \sum\limits_{i=1}^{n} E\Big( Z^{(i)} Z^{(i)^T} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{\Theta}^{(m-1)}\Big)\) in Equation (1) using two different approaches: numerically, through a Monte Carlo (MC) sampling method as explained in Algorithm 1, and through a first-order approximation based on Algorithm 2. The most flexible and generally applicable approach for obtaining a sample in each iteration of an MCEM algorithm is through a Markov chain Monte Carlo (MCMC) routine such as the Gibbs and Metropolis-Hastings samplers (Metropolis et al. 1953; Hastings 1970; Geman and Geman 1984). Alternatively, the conditional expectation in Equation (1) can be computed using an efficient approximation approach which calculates the elements of the empirical covariance matrix from the first and second moments of a truncated normal distribution with mean \(\mu_{ij} = \widehat{\Omega}_{j, -j} \widehat{\Omega}^{-1}_{-j,-j} z^{(i)^T}_{-j}\) and variance \(\sigma_{i,j}^2 = 1 - \widehat{\Sigma}_{j,-j} \widehat{\Sigma}^{-1}_{-j,-j} \widehat{\Sigma}_{-j,j}\) (see Behrouzi and Wit (2019b) for details). The two proposed approaches remain practical when some observations are missing. For example, if the genotype information for marker \(j\) is missing, it is still possible to draw Gibbs samples for \(Z_j\) or approximate the empirical covariance matrix, as the corresponding conditional distribution is Gaussian.
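
As a minimal illustration of the moment computations behind the approximation approach, the following R sketch uses the truncnorm package to obtain the first and second moments of a univariate truncated normal; the conditional mean, conditional standard deviation, and truncation points are hypothetical placeholders rather than netgwas internals.

    ## A sketch of truncated-normal moments used by the approximation approach,
    ## assuming the truncnorm package; all numeric values are hypothetical.
    library(truncnorm)

    mu_ij    <- 0.3    # conditional mean of Z_j given the other latent variables
    sigma_ij <- 0.8    # conditional standard deviation
    a        <- -0.5   # lower truncation point implied by the observed category y_j
    b        <-  0.5   # upper truncation point

    m1 <- etruncnorm(a = a, b = b, mean = mu_ij, sd = sigma_ij)          # E(Z_j | y_j)
    m2 <- vtruncnorm(a = a, b = b, mean = mu_ij, sd = sigma_ij) + m1^2   # E(Z_j^2 | y_j)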

The optimization problem in (2) can be solved efficiently in various ways, for instance using glasso or CLIME approaches (Friedman et al. 2008; Hsieh et al. 2011). Convergence of the EM algorithm for penalized likelihood problems has been proved in Green (1990). Our experimental study shows that the algorithm usually converges after a few iterations \((< 10)\). Note that the sparsity of the estimated precision matrix in Equation (2) is controlled by a vector of penalty parameters \(\rho\). We follow Foygel and Drton (2010) in using the extended Bayesian information criterion (eBIC) to select a suitable regularization parameter \(\rho^*\) that produces a sparse graph with a sparsity pattern corresponding to \(\widehat{\Theta}_{\rho^*}\). Alternatively, instead of using the EM algorithm, a nonparanormal skeptic approach can be used to estimate the graph structure through Spearman’s rho and Kendall’s tau statistics; details can be found in Liu et al. (2012).
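
To make the penalized M-step and the eBIC selection concrete, the following R sketch combines the glasso package with the eBIC criterion over a small penalty grid; Rbar stands in for the E-step output, and the grid, sample size, and eBIC parameter gamma are illustrative assumptions rather than the netgwas defaults.

    ## A sketch of the penalized M-step (Equation (2)) with eBIC selection,
    ## assuming the glasso package; Rbar and rho.seq are hypothetical inputs.
    library(glasso)

    n       <- 200; p <- 50
    Rbar    <- diag(p)                    # placeholder for the E-step conditional expectation
    rho.seq <- c(0.5, 0.3, 0.2, 0.1)      # decreasing penalty grid
    gamma   <- 0.5                        # eBIC tuning parameter

    ebic <- sapply(rho.seq, function(rho) {
      fit    <- glasso(Rbar, rho = rho)
      Theta  <- fit$wi                    # penalized precision matrix estimate
      loglik <- (n / 2) * (as.numeric(determinant(Theta, logarithm = TRUE)$modulus) -
                             sum(diag(Rbar %*% Theta)))
      E      <- sum(Theta[upper.tri(Theta)] != 0)   # number of selected edges
      -2 * loglik + E * log(n) + 4 * gamma * E * log(p)
    })
    rho.star <- rho.seq[which.min(ebic)]  # selected regularization parameter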

Algorithm 2: Approximation of the conditional expectation in (1)


Input: An \((n \times p)\) data matrix \(Y\), where \(Y^{(i)}_j = F^{-1}_j(\Phi(Z^{(i)}_{j}))\); \(F_j\) and \(\Phi\) define the empirical marginals and the CDF of the standard normal distribution, respectively, for \(i = 1, \ldots, n\) and \(j = 1, \ldots, p\);
Output: The conditional expectation \(\bar{R} = \frac{1}{n} \sum\limits_{i=1}^{n} E( Z^{(i)} Z^{(i)^T} | y^{(i)}; \mathcal{\widehat{D}}, \mathbf{\Theta})\);

  1. Initialize \(E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E(z^{(i)}_{j} \ | \ y^{(i)}_{j}; \mathcal{\widehat{D}} )\), \(E( (z_j{^{(i)}})^2 \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E( (z_j{^{(i)}})^2 \ | \ y^{(i)}_j; \mathcal{\widehat{D}} )\), and \(E(z^{(i)}_{j} z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E(z^{(i)}_{j} \ | \ y^{(i)}_{j}; \mathcal{\widehat{D}}) E(z^{(i)}_{j'} \ | \ y^{(i)}_{j'}; \mathcal{\widehat{D}} )\) for \(i = 1,\ldots, n\) and \(j,j' = 1, \ldots, p\);

  2. Initialize \(r_{j,j'}\) for \(1 \leq j, j' \leq p\) using Line 1 above, then estimate \(\widehat{ \mathbf{\Theta}}\) by maximizing (2);

  3. for \(i = 1, \ldots, n\) do

  4. if \(j = j'\) then

  5. Calculate \(E((z_j{^{(i)}})^2 \ | \ y^{(i)}_j; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}} )\) for \(j = 1, \ldots, p\);

  6. else

  7. Calculate \(E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}})\) and then

    \(E(z^{(i)}_{j} z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) = E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) E(z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}} )\) for \(i = 1,\ldots, n\) and \(j = 1, \ldots, p\);

  8. end if

  9. end for

  10. Calculate \(r_{j,j'}= \frac{1}{n} \sum_{i=1}^{n} E(z_j^{(i)} z_{j'}^{(i)} \ | \ y^{(i)}; \mathbf{\widehat{\mathcal{D}}}, \widehat{ \mathbf{\Theta}} )\) for \(1 \le j, j' \le p\).


Extension to linkage map construction

Here we convert the estimated network to a one-dimensional map using two different approaches. Depending on the type of (experimental) population (i.e. inbred or outbred), we order markers based on dimensionality reduction or based on bandwidth reduction, both of which result in a one-dimensional map.

In inbred populations, loci in the genome of the progenies can be assigned to their parental homologues, resulting in a simpler conditional independence relationship between neighboring markers. Here, we use multidimensional scaling (MDS) to project markers in a \(p\)-dimensional space onto a one-dimensional map while attempting to maintain pairwise distances. Let \(G(V^{(d)},E^{(d)})\) be a sub-graph on the set of \(d\) unordered markers, where \(V^{(d)} = \{1, \ldots, d\}\), \(d \leq p\), and the edge set \(E^{(d)}\) represents all the links among the \(d\) markers. We calculate the distance matrix \(D\) as follows \[\begin{aligned} D_{ij} = \left\{ \begin{array}{ll} - \log (r_{ij}) & \text{if } i\ne j\\ 0 & \text{if } i = j, \end{array} \right. \end{aligned}\]

where \[\begin{aligned} r_{ij} = - \frac{\theta_{ij}}{\sqrt{\theta_{ii}\theta_{jj}}}, \end{aligned}\] and \(\theta_{ij}\) is the \(ij\)-th element of the precision matrix \(\widehat{\Theta}_{\rho^*}\). We aim to construct a configuration of \(d\) data points in a one-dimensional Euclidean space by using information about the distances between the \(d\) nodes. In this regard, we define a sequential ordering \(L\) of the \(d\) elements such that the distance \(\widehat{D}\) between them is similar to \(D\). We consider a metric multi-dimensional scaling \[\begin{aligned} \widehat{E}= \arg\min_L \sum\limits_{i=1}^{d}\sum\limits_{j=1}^{d}(D_{ij} - \widehat{D}_{ij})^2, \end{aligned}\] that minimizes the so-called mapping error \(\widehat{E}\) across all sequential orderings (Sammon 1969).
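
The MDS step can be sketched in a few lines of R: build the distance matrix \(D\) from an estimated precision matrix and project the markers onto one dimension with classical MDS. The precision matrix below is a hypothetical toy example, and the small guard against \(\log(0)\) is an assumption of this sketch, not part of the netgwas ordering code.

    ## A toy sketch of the MDS-based marker ordering for inbred populations;
    ## Theta is a hypothetical precision matrix, not a netgwas estimate.
    p     <- 10
    Theta <- diag(p)
    Theta[abs(row(Theta) - col(Theta)) == 1] <- -0.4   # banded toy precision matrix

    r <- -cov2cor(Theta)            # r_ij = -theta_ij / sqrt(theta_ii * theta_jj)
    diag(r) <- 1
    D <- -log(pmax(abs(r), 1e-6))   # D_ij = -log(r_ij), guarded against log(0)
    diag(D) <- 0

    coord1d      <- cmdscale(as.dist(D), k = 1)   # one-dimensional configuration
    marker.order <- order(coord1d)                # sequential ordering of markers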

We propose a different ordering algorithm for outbred populations. In these populations, mating of two non-homozygous parents results in markers in the genome of the progenies that cannot easily be mapped onto their parental homologues. To order markers in outbred populations, we use the reverse Cuthill-McKee (RCM) algorithm (Cuthill and McKee 1969) to permute the sparse matrix \(\widehat{\Theta}_\rho^{(d)}\), which has a symmetric sparsity pattern, into a band matrix form with a small bandwidth. The bandwidth of the associated adjacency matrix \(A\) is defined as \(\beta = \max_ {A_{ij} \ne 0} | i - j|\). The algorithm produces a permutation matrix \(P\) such that \(P A P^T\) has a smaller bandwidth than matrix \(A\) does. The bandwidth decreases by moving the non-zero elements of the matrix \(A\) closer to the main diagonal. The way the non-zero elements are moved is determined by relabeling the nodes of graph \(G(V_d, E_d)\) in consecutive order, so that all of the non-zero elements are clustered near the main diagonal.
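
The bandwidth quantity that the RCM permutation tries to shrink can be computed directly from the definition above; the following base-R helper and the small adjacency matrix are hypothetical illustrations and do not reproduce the RCM algorithm itself.

    ## A tiny helper for the bandwidth beta = max_{A_ij != 0} |i - j| of an
    ## adjacency matrix; the example matrix and permutation are hypothetical.
    bandwidth <- function(A) {
      idx <- which(A != 0, arr.ind = TRUE)
      max(abs(idx[, 1] - idx[, 2]))
    }

    A <- matrix(0, 5, 5)
    A[1, 4] <- A[4, 1] <- 1; A[2, 3] <- A[3, 2] <- 1
    perm <- c(1, 4, 2, 3, 5)                                     # a relabelling of the nodes
    c(before = bandwidth(A), after = bandwidth(A[perm, perm]))   # bandwidth drops from 3 to 1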

3 Package design and functionality

The netgwas R package implements the Gaussian copula graphical models (Behrouzi and Wit 2019b) for (i) constructing linkage maps in diploid and polyploid species and learning (ii) linkage disequilibrium networks and (iii) genotype-phenotype networks. Below, we illustrate the three main functions using a diploid A. thaliana population, a tetraploid potato population, and maize NAM populations. Given that computations for GWAS data of the usual size (\(> 10^5\) markers) are expensive, we use small data sets to explain the functionality of the package. All the results can be replicated using the functions in the netgwas package (see Supplementary Materials).

Linkage map construction

This module reconstructs linkage maps for diploid and polyploid organisms. Diploid organisms contain two copies of each chromosome, one from each parent, whereas polyploids contain more than two copies of each chromosome. In polyploids the number of chromosome sets reflects their level of ploidy: triploids have three copies, tetraploids have four, pentaploids have five, and so forth.

Typically, mating is between two parental lines that have recent common biological ancestors; this is called inbreeding. If they have no common ancestors up to roughly \(4\)-\(6\) generations, then this is called outcrossing. In both cases the genomes of the derived progenies are random mosaics of the genomes of the parents. However, whereas in the case of inbreeding the parental alleles are distinguishable in the genome of the progeny, in outcrossing this does not hold.

Some inbreeding designs, such as doubled haploid (DH), lead to a homozygous population where the derived genotype data include only the homozygous genotypes of the parents, namely AA and aa (conveniently coded as \(0\) and \(1\)) for diploid species. Other inbreeding designs, such as F2, lead to a heterozygous population where the derived genotype data contain heterozygous as well as homozygous genotypes, namely aa, Aa, and AA for diploid species, conveniently coded as \(0\), \(1\) and \(2\), which correspond to the dosage of the reference allele A. We remark that the Gaussian copula graphical models help us to keep heterozygous markers in the linkage map construction, rather than turning them into missing values as most other methods do in map construction for recombinant inbred line (RIL) populations.

Outcrossing or outbred experimental designs, such as full-sib families, derive from two non-homozygous parents. Thus, the genome of the progenies includes a mixed set of many different marker types containing fully informative markers and partially informative markers. Markers are called fully informative when all of the resulting gamete types can be phenotypically distinguished on the basis of their genotypes; markers are called partially informative when they have identical phenotypes (Wu et al. 2002).

netmap()

The netmap() function handles various inbred and outbred mapping populations, including recombinant inbred lines (RILs), F2, backcross, doubled haploid, and full-sib families, among others. Not all existing methods for linkage mapping support all inbreeding and outbreeding experimental designs. However, our proposed algorithm constructs a linkage map for any type of biparental inbreeding or outbreeding experimental design, and it covers a wide range of possible population types. The argument cross in the map function must be specified to choose an ordering method. In inbred populations, markers in the genome of the progenies can be assigned to their parental homologues, resulting in a simpler conditional independence pattern between neighboring markers. In the case of inbreeding, we use multidimensional scaling (MDS). A metric MDS is a classical approach that maps the original high-dimensional space to a lower-dimensional space, while attempting to maintain pairwise distances. An outbred population derived from mating two non-homozygous parents results in markers in the genome of the progenies that cannot easily be assigned to their parental homologues. Neighboring markers that vary only on different haploids will appear as independent, therefore requiring a different ordering algorithm. In that case, we use the reverse Cuthill-McKee (RCM) algorithm (Cuthill and McKee 1969) to order markers.

The function can be called with the following arguments

    netmap(data, method = "npn", cross = NULL, rho = NULL, n.rho = NULL, rho.ratio = NULL, 
           min.m = NULL, use.comu = FALSE, ncores = "all", verbose = TRUE)

The main task of this function is to construct a linkage map based on the conditional (in)dependence relationships between markers, which can be estimated using the methods "gibbs", "approx", and "npn". The estimation procedure relies on maximum penalized log-likelihood, where the argument rho, a decreasing sequence of non-negative numbers, controls the sparsity levels and corresponds to the last term in Equation (2). Leaving the input as rho = NULL, the program automatically computes a sequence of rho values based on n.rho and rho.ratio. The argument n.rho specifies the number of regularization parameters (the default is 6) and rho.ratio determines the ratio between consecutive elements of rho. Depending on the population type, inbred or outbred, different algorithms are applied to order markers in the genome. If it is known, the user can specify an expected minimum number of markers in a linkage group (LG) via the argument min.m. Furthermore, linkage groups can be identified either using the fast greedy community detection algorithm (Newman 2004) or simply by letting each disconnected sub-network form a linkage group. Setting ncores = "all" automatically detects the number of available cores and runs the computations in parallel on (the number of available cores - 1).

The netmap() function returns an object of the S3 class netgwasmap; plot.netgwasmap and print.netgwasmap are summary method functions for this object class. The netgwasmap object mainly holds the estimated linkage map (in the object map) and a list containing all output results of the regularization path rho (in the object res).
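
A minimal call might look as follows. The genotype matrix is simulated here purely to illustrate the interface (random dosages will not produce a meaningful map), and we assume that cross accepts the value "inbred" for inbred designs as described above.

    ## A sketch of a netmap() call on hypothetical genotype data; with random
    ## dosages no biologically meaningful map is expected.
    library(netgwas)

    set.seed(2)
    geno <- matrix(sample(0:2, 100 * 150, replace = TRUE), nrow = 100, ncol = 150)
    colnames(geno) <- paste0("M", 1:150)

    out <- netmap(geno, method = "npn", cross = "inbred", ncores = 1)
    out$map     # estimated linkage map (markers ordered within linkage groups)
    plot(out)   # summary plot via plot.netgwasmap()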

buildMap()

The function buildMap() allows users to interact with the map construction procedure and to build the linkage map for a manually selected penalty term, whereas the function netmap() selects the optimal penalty term \(\rho^\star\) using the eBIC method.

The function can be called via

    buildMap(res, opt.index, min.m = NULL, use.comu = FALSE).

The argument opt.index is chosen manually and is a number between one and the number of penalty parameters n.rho in netmap(). In the default setting, n.rho is \(6\), so opt.index can take a value between \(1\) and \(6\). As in netmap(), the argument min.m is optional in buildMap(); it keeps only clusters that contain at least min.m markers. The default value for this argument is 2. The use.comu argument provides an alternative approach to detect linkage groups.
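
Continuing the earlier sketch, re-building the map at a manually chosen penalty could look like the call below; out is the netgwasmap object from the netmap() sketch above, the particular opt.index value is an arbitrary illustration, and we assume the returned object can be printed like the netmap() output.

    ## A sketch of buildMap() on the netmap() output from the previous example;
    ## opt.index = 4 picks the fourth of the (default) six penalty values.
    map4 <- buildMap(out, opt.index = 4)
    print(map4)   # assumed to summarize the re-built map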

Figure 2: Linkage map construction in tetraploid potato. A total of 156 F1 plants were genotyped across 1972 genetic markers. (a) The estimated partial correlation matrix for the genotyping data. (b) The estimated partial correlation matrix after ordering the genetic markers, where all the 12 potato chromosomes are detected correctly.

Detecting linkage disequilibrium networks

The function netsnp() reconstructs a high-dimensional linkage disequilibrium interaction network for diploid and polyploid (GWAS) genotype data. Genetic viability can be considered as a phenotype. This function detects the conditionally dependent short- and long-range linkage disequilibrium structure of genomes and thus reveals aberrant marker-marker associations that are due to epistatic selection. In other words, this function detects intra- and inter-chromosomal conditional interaction networks and can be called via

    netsnp(data, method = "gibbs", rho = NULL, n.rho = NULL, rho.ratio = NULL,
           ncores = "all", verbose = TRUE)

for any bi-parental genotype data containing at least two genotype states and possibly missing values. The input data can be either an (n \(\times\) p) genotype data matrix, an object of class netgwasmap, which is an output of the functions netmap() and buildMap(), or simulated data from the function simgeno(). Depending on the dimension of the input data, a suitable method and its related arguments can be specified. The argument ncores determines the number of cores to use for the calculations. Using ncores = "all" automatically detects the number of available cores and runs the computations in parallel.

This function returns an \(S3\) object of class "netgwas", which mainly holds the following objects: (i) Theta, a list of estimated \((p \times p)\) precision (inverse of variance-covariance) matrices that encode the conditional independence relationships among genetic loci, (ii) path, a list of estimated \((p \times p)\) adjacency matrices, which is the graph path corresponding to Theta, (iii) rho, a vector of length n.rho containing the penalty values, and (iv) loglik, which contains the maximum log-likelihood values along the graph path. To select an optimal graph the function selectnet() can be used.
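
A short sketch of this workflow on the hypothetical genotype matrix from the netmap() example is shown below; the method and arguments follow the description above, and the chosen settings are illustrative only.

    ## A sketch of LD network detection and model selection on the hypothetical
    ## genotype matrix `geno` from the earlier example.
    ld  <- netsnp(geno, method = "gibbs", n.rho = 6, ncores = 1)
    sel <- selectnet(ld)   # select an optimal network along the regularization path
    plot(sel)              # visualize the selected LD network via plot.select()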

Reconstructing genotype-phenotype networks

Complex genetic traits are influenced by multiple interacting loci, each with a possibly small effect. Our approach reduces the number of candidate genes from hundreds to far fewer genes. It is of great interest to geneticists and biologists to discover a set of the most effective genes that directly affect a complex trait in GWAS. To overcome the limitations of traditional analyses, such as single-locus association analysis (looking for main effects of single marker loci), multiple testing, and QTL analysis, we use the proposed mixed graphical model to study the simultaneous associations between phenotypes and SNPs. Our method allows for a more accurate interpretation of findings, because it adjusts for the effects of the remaining variables (SNPs and phenotypes) while measuring pairwise associations, whereas traditional methods often use marginal associations and analyze SNPs and phenotypes one at a time.

Graphical modeling is a powerful tool for describing complex interaction patterns among variables in high-dimensional data and is used frequently in microarray analysis (Butte et al. 2000). In our modelling framework, a genotype-phenotype network is a complex network made up of interactions among: (i) genetic markers, (ii) phenotypes (e.g. disease), and (iii) genetic markers and phenotypes. The first problem in analyzing genotype-phenotype data is the mixed variable types, where markers are ordinal (counting the number of a major allele) and phenotypes (disease) can be measured on continuous or discrete scales. We deal with mixed discrete-and-continuous variables by means of copulas. A second issue relates to the high-dimensional setting of the data, where thousands of genetic markers are measured across a few samples; we deal with inferring potentially large networks from only a few biological samples. Fortunately, genotype-phenotype networks are intrinsically sparse, in the sense that only a few elements interact with each other. This sparsity assumption is incorporated into our algorithm based on penalized graphical models. The proposed method is implemented in the netphenogeno() function, where the input data can be an (\(n \times p\)) matrix or a data.frame where \(n\) is the sample size and \(p\) is the dimension that includes marker data and phenotype measurements. One may consider including more columns, such as environmental variables.

The function is defined by

    netphenogeno(data, method = "npn", rho = NULL, n.rho = NULL, rho.ratio = NULL, 
                   ncores = "all", em.iter = 5, em.tol = .001, verbose = TRUE)

and reconstructs the genotype-phenotype interaction network from an input genotype-phenotype data matrix or data.frame. Detecting interaction networks among genotypes, phenotypes, and environmental factors is also possible using this function. Depending on the size of the input data, the user may choose the "gibbs", "approx", or "npn" method for learning the networks. For a medium (\(\sim\)500) and a large number of variables we recommend choosing "gibbs" and "approx", respectively. Choosing "npn" for a very large number of variables (\(>\)2000) is computationally efficient. The default method is set to "npn". Like the function netsnp(), the netphenogeno() function returns an object of class netgwas.
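
A sketch of such an analysis on hypothetical mixed data follows: a continuous trait is combined with the ordinal marker dosages from the earlier example in one data.frame before calling netphenogeno(); the trait and the subset of markers are arbitrary.

    ## A sketch of a genotype-phenotype network analysis on hypothetical data;
    ## `geno` is the simulated dosage matrix from the earlier example.
    set.seed(3)
    trait <- rnorm(100)                               # a continuous phenotype
    dat   <- data.frame(trait = trait, geno[, 1:80])  # phenotype + marker columns

    gp  <- netphenogeno(dat, method = "npn", ncores = 1)
    sel <- selectnet(gp)   # select the optimal genotype-phenotype network
    plot(sel)              # visualize markers conditionally linked to the trait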

For objects of type netgwas there are plot, print and summary methods available. The plotting function plot.netgwas() provides a visualization plot to monitor the path of estimated networks for a range of penalty terms. The functions plot.netgwasmap(), plot.select(), and plot.simgeno() visualize the corresponding network, the optimal graph, and the results of model-based simulated data, respectively.

To speed up computations in all three key functions of the netgwas package, we use the parallel package, available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org, to support parallel computing on a multi-core machine and deal with large inference problems. To optimize memory usage, we use the Matrix package (Bates et al. 2022) for sparse matrix output when estimating and storing full regularization paths for large datasets. The use of these libraries significantly improves the computational speed of the functions within the package.