Graphical models are a powerful tool in modelling and analysing complex biological associations in high-dimensional data. The R-package netgwas implements the recent methodological development on copula graphical models to (i) construct linkage maps, (ii) infer linkage disequilibrium networks from genotype data, and (iii) detect high-dimensional genotype-phenotype networks. The netgwas learns the structure of networks from ordinal data and mixed ordinal-and-continuous data. Here, we apply the functionality in netgwas to various multivariate example datasets taken from the literature to demonstrate the kind of insight that can be obtained from the package. We show that our package offers a more realistic association analysis than the classical approaches, as it discriminates between direct and induced correlations by adjusting for the effect of all other variables while performing pairwise associations. This feature controls for spurious interactions between variables that can arise from conventional approaches in a biological sense. The netgwas package uses a parallelization strategy on multi-core processors to speed-up computations.
Graphical models are commonly used in statistics and machine learning to model complex dependency structures in multivariate data (Lauritzen 1996; Hartemink et al. 2000; Lauritzen and Sheehan 2003; Dobra et al. 2004; Friedman 2004; Jordan 2004; Edwards et al. 2010; Behrouzi et al. 2018; Vinciotti et al. 2022) where each node in the graph represents a random variable and edges represent conditional dependence relationships between pairs of variables. Therefore, the absence of an edge between two nodes indicates that the two variables are conditionally independent. The netgwas package contains an implementation of undirected graphical models to address the three key and interrelated goals in genetics association studies: (i) building linkage maps, (ii) reconstructing linkage disequilibrium networks, and (iii) detecting genotype-phenotype networks (see Fig 1). Below we provide a brief introduction for each section of netgwas.
A linkage map describes the linear order of genetic markers within
linkage groups (chromosomes). It is the first requirement for estimating
the genetic background of phenotypic traits in quantitative trait loci
(QTL) studies and are commonly used in QTL studies to link phenotypic
traits to the underlying genetics of the population. In practice, many
software packages for performing QTL analysis require linkage maps
(Lander et al. 1987; Broman et al. 2003, 2019; Yang et al. 2008; Taylor et al. 2011; Huang et al. 2012).
Most organisms are categorized as diploid or polyploid by comparing the
copy number of each chromosome. Diploids have two copies of each
chromosome (like humans). Polyploid organisms have more than two copies
of each chromosome (like most crops). Polyploidy is common in plants and
in different crops such as apple, potato, and wheat, which contain three
(triploid), four (tetraploid), and six (hexaploid) copies from each of
their chromosomes, respectively. Despite the importance of polyploids,
statistical tools for their map construction are underdeveloped
(Grandke et al. 2017; Bourke et al. 2018). Most software packages such
as R/qtl
(Broman et al. 2003), OneMap
(Margarido et al. 2007), Pheno2Geno
(Zych et al. 2015), and MST MAP
(Wu et al. 2008; Taylor and Butler 2017)
contain functionality to only construct linkage maps for diploid
species. Packages such as MAPMAKER
(Lander et al. 1987),
TetraploidSNPMap
(Hackett et al. 2017), and polymapR
(Bourke et al. 2018) have the functionality to construct polyploid
linkage maps but focus mainly on a specific type of polyploid species
(e.g. tetraploids). All the aforementioned packages contain methods that
use pairwise estimation of recombination frequencies and LOD (logarithm
of the odds ratio) scores (Wang et al. 2016) that often require manual
interaction. This often leads to an underpowered approach and
confounding of correlated genotypes by failing to correct for
intermediate markers (Behrouzi and Wit 2019a). In contrast, the netgwas R
package uses a multivariate approach to construct linkage maps for
diploid and polyploid species in a unified way. This is achieved by
utilising the pairwise Markov property between any two genetic markers
and constructing the linkage map by simultaneously assessing the
complete set of pairwise comparisons. This often leads to an improved
marker order over more conventional methods.
The linkage map, which can be constructed using the first key function of netgwas in Fig 1, provides the genetic basis for the second key function which detects the patterns of linkage disequilibrium and segregation distortion in a population. Segregation distortion (SD) refers to any deviation from expected segregation ratios based on Mendelian rules of inheritance. And linkage disequilibrium (LD) refers to non-random relations between loci (locations) on the same or different chromosomes.
Revealing the structures of LD is important for association mapping study as well as for studying the genomic architecture of a genome. Various methods have been published in the literature for measuring statistical associations between alleles at different loci, for instance see Hedrick (1987; Clarke et al. 2011; Bush and Moore 2012; Mangin et al. 2012; Kaler et al. 2020). Most of these measures are based on an exhaustive genome scan for pairs of loci and the \(r^2\) measure, the square of the loci correlation. The drawback of such approaches is that association testing in the genome–scale is underpowered, so that weak long–range LD will go undetected. Furthermore, they do not simultaneously take the information of more than two loci into account to make full and efficient use of modern multi–loci data. The netgwas package contains functionality to estimate pairwise interactions between different loci in a genome while adjusting for the effect of other loci to efficiently detect short- and long–range LD patterns in diploid and polyploid species (Behrouzi and Wit 2019b). Technically, this requires estimating a sparse adjacency matrix from multi-loci genotype data, which usually contains a large number of markers (loci), where the number of markers can far exceed the number of individuals. The non-zero patterns of the adjacency matrix created by the functions in netgwas shows the structures of short– and long–range LD of the genome. The strength of associations between distant loci can be calculated using partial correlations. Furthermore, the methods implemented in netgwas already account for the correlation between markers, while associating them to each other and thereby avoids the problem of population structure (that is physically unlinked markers are correlated).
A major problem in genetics is the association between genetic markers and the status of a disease (trait or phenotype). Genome-wide association studies are among the most important approaches for further understanding of genetics underlying complex traits (Welter et al. 2013; Kruijer et al. 2020). However, in genome-wide association methods genetic markers are often tested individually for association with the phenotype. Since genome-wide scans analyze thousands or even millions of markers, the issue of multiple testing is usually addressed by using a stringent significance threshold (Panagiotou et al. 2011). Such methods work only if the associations are strong enough to pass the stringent threshold. However, even if that is the case, this type of analysis has several limitations, which have been discussed extensively in the literature (Hoggart et al. 2008; He and Lin 2010; Rakitsch et al. 2012; Buzdugan et al. 2016). Particularly, the main issue of this type of analysis is when we test the association of the phenotype to each genetic marker individually, and ignore the effects of all other genetic markers. This leads to failures in the identification of causal loci. If we consider two correlated loci, of which only one is causal for the phenotype, both may show a marginal association, but only the causal locus will be detected by our method. The methods implemented in the netgwas package tackle this issue by using Gaussian copula graphical model (Klaassen and Wellner 1997; Hoff 2007), which accounts for the correlation between markers, while associating them to the phenotypes. As it is shown in Klasen et al. (2016), this key feature avoids the need to correct for population structure or any genetic background, as the method implemented in the netgwas package simultaneously associates all markers to the phenotype. In contrast, most GWAS methods rely on population structure correction to avoid false genotype-phenotype associations due to their single-loci approach (Yu et al. 2006; Kang et al. 2008, 2010; Lippert et al. 2011).
Input: A data set containing the variables \(Y^{(i)}\) for
\(i = 1 \ldots, n\).
Output: Mean of the conditional expectation,
\(\bar{R} = \frac{1}{n} \sum\limits_{i=1}^{n} E( Z^{(i)} Z^{(i)^T} | y^{(i)}; \mathcal{D}, \Theta^{(m-1)})\)
in ((1)).
For each \(j = \{1, \ldots,p\}\) generate the latent data from \(Y_j= F^{-1}_j(\Phi(Z_j))\), where \(F_j\) and \(\Phi\) define the empirical marginals and the CDF of standard normal distribution, respectively for \(i = 1 \ldots, n\) and \(j = 1, \ldots, p\);
Estimate \(\mathcal{D}\), vectors of lower and upper truncation points, whose boundaries come from \(Y^{(i)}\) for \(i = 1 \ldots, n\);
for \(i = 1, \ldots, n\) do
Sample from a p-variate truncated normal distribution with the boundaries in Line 2 above;
for \(N\) iterations do
Estimate \(R^{(i), N} = E(Z_\star^{(i)} Z_\star^{(i)T} | y^{(i)}, \widehat{\Theta}^{(m)})\), where \(Z_\star^{(i)}= \left( {\begin{array}{c} Z_\star^{(i)1}\\ \vdots \\ Z_\star^{(i)N} \ \end{array} } \right) \in \mathbb{R}^{N \times p}\);
end for
Update \(\widehat{R}^{(i)} = \frac{1}{N} Z_\star^{(i)} Z_\star^{(i)T}\);
end for
Calculate \(\widehat{\bar{R}} = \frac{1}{n} \sum\limits_{i=1}^{n} \widehat{R}^{(i)}\).
Graphical models combine graph theory and probability theory to create networks that model complex probabilistic relationships. Undirected graphical models represent the joint probability distribution of a set of variables via a graph \(G=(V,E)\), where \(V = \{1, 2, \ldots, p\}\) specifies the set of random variables and \(E \subset V\times V\) represents undirected edges \((i,j) \in E \Leftrightarrow (j,i) \in E\). The pattern of edges in the graph represents the conditional dependencies between the variables; the absence of an edge between two nodes indicates that any statistical dependency between these two variables is mediated via some other variable or set of variables. The conditional dependencies between variables, which are represented by edges between nodes, are specified via parameterized conditional distributions. We refer to the pattern of edges as the structure of the graph. In this paper, the goal is to learn the graph structure from ordinal data and mixed ordinal-and-continuous data.
A p-dimensional copula \(\mathcal{C}\) is a multivariate distribution with uniform margins on \([0, 1]\). Any joint distribution function can be written in terms of its marginals and a copula which encodes the dependence structure. Here we consider a subclass of joint distributions encoded by the Gaussian copula
\(F(y_1, \ldots, y_p) = \Phi_p \Big( \Phi^{-1}(F_1(y_1)), \ldots, \Phi^{-1}(F_p(y_p)) \ | \ \Omega \Big)\)
where \(\Phi_p(. \ | \ \Omega)\) is the cumulative distribution function (CDF) of p-variate Gaussian distribution with correlation matrix \(\Omega\); \(\Phi\) is the univariate standard normal CDF; and \(F_j\) is the CDF of \(j\)-th variable, \(Y_j\) for \(j = 1, \ldots, p\). Note that \(y_j\) and \(y_{j'}\) are independent if and only if \(\Omega_{jj'} = 0\).
A Gaussian copula can be written in terms of latent variables \(Z\): Let \(F_j^{-1}(y) = inf\{ y: F_j(x) \ge y, x \in \mathcal{R} \}\) be the pseudo-inverse of the marginals. Then a Gaussian copula is defined as \(Y_{ij} = F_j^{-1} (\Phi(Z_{ij}))\) where \(Z \sim \mathcal{N}_p (0, \Omega)\) and \(Y= (Y_1, \ldots, Y_p)\) and \(Z= (Z_1, \ldots, Z_p)\) represent the non-Gaussian observed variables and Gaussian latent variables, respectively. Without lose of generality, we assume that \(Z_j\)’s have unit variances of \(\sigma_{jj}= 1\) for \(j = 1,\ldots,p\). Thus, \(Z_j\)’s marginally follow standard Gaussian distribution. Each observed variable \(Y_j\) is discretized from its latent counterpart \(Z_j\). For the \(j\)-th latent variable (\(j = 1, \ldots,p\)), we assume that the range \((-\infty, \infty)\) splits into \(K_j\) disjointed intervals by a set of thresholds \(-\infty = t^{(0)}_j < t^{(1)}_j < \ldots < t^{(K_j-1)}_j < t^{(K_j)}_j = \infty\) such that \(Y_j = k\) if and only if \(Z_j\) falls in the interval \((t^{(K-1)}_j , t^{(K)}_j)\). Let the parameter \(\mathcal{D} = \{ t^{(k)}_j: j =1,\ldots,p; k = 1, \ldots, K_j\}\) holds the boundaries for the truncation points and \(z^{(1:i)} = [z^{(1)}, \ldots, z^{(n)}]\) where \(z^{(i)} = (z_1^{(i)}, \ldots, z_p^{(i)})\). In order to learn the graph structures, our objective is to estimate the precision matrix \(\Theta = \Omega^{-1}\) from \(n\) independent observations \(y^{(1:i)} = [y^{(1)}, \ldots, y^{(n)}]\), where \(y^{(i)} = (y_1^{(i)}, \ldots, y_p^{(i)})\). The conditional independence between two variables given other variables is equivalent to the corresponding element in the precision matrix being zero, i.e. \(\theta_{ij} = 0\); or put another way, a missing edge between two variables in a graph G represents conditional independence between the two variables given all other variables.
In the classical low-dimensional setting of \(p\) smaller than \(n\), it is natural to implement a maximum likelihood approach to obtain the inverse of the sample covariance matrix. However, in modern applications the dimension \(p\) is routinely far larger than \(n\), meaning that the inverse sample covariance matrix does not exist. Motivated by the sparseness assumption of the graph we tackle the high-dimensional inference problem for discrete \(Y\)’s by a penalized expectation maximization (EM) algorithm as
\[\begin{aligned} \label{Estep} Q(\Theta \ | \ \widehat{\Theta}^{(m-1)} ) = \frac{n}{2} \Big[ \log \det(\Theta) - tr( \bar{R} \Theta ) \Big] \end{aligned} \tag{1}\] and \[\begin{aligned} \label{Mstep} \widehat{\Theta}^{(m)} = \arg \max_{\Theta} \tilde{Q}(\Theta | \hat{\Theta}^{(m-1)}) - \sum\limits_{j \ne j'}^ pP_\rho(| \theta_{jj'}| ), \end{aligned} \tag{2}\] where \(m\) is the iteration number within the EM algorithm. The last term in equation ((2)) represents different penalty functions. Here we impose the sparsity by means of \(L_1\) penalty, on the \(jj'\)-th element of the precision matrix.
We compute the conditional expectation \(\bar{R} = \frac{1}{n} \sum\limits_{i=1}^{n} E\Big( Z^{(i)} Z^{(i)^T} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{\Theta}^{(m-1)}\Big)\) in equation ((1)) using two different approaches: numerically through a Monte Carlo (MC) sampling method as explained in algorithm 1, and through a first order approximation based on algorithm 2. The most flexible and generally applicable approach for obtaining a sample in each iteration of an MCEM algorithm is through a Markov chain Monte Carlo (MCMC) routine like Gibbs and Metropolis – Hastings samplers (Metropolis et al. 1953; Hastings 1970; Geman and Geman 1984). Alternatively, the conditional expectation in equation ((1)) can be computed by using an efficient approximation approach which calculates elements of the empirical covariance matrix using the first and second moments of a truncated normal distribution with mean \(\mu_{ij} = \widehat{\Omega}_{j, -j} \widehat{\Omega}^{(-1)}_{-j,-j} z^{(i)^T}_{-j}\) and variance \(\sigma_{i,j}^2 = 1 - \widehat{\Sigma}_{j,-j} \widehat{\Sigma}^{-1}_{-j,-j} \widehat{\Sigma}_{-j,-j}\) (see Behrouzi and Wit (2019b) for details). The two proposed approaches are practical when some observations are missing. For example, if genotype information for genotype \(j\) is missing, it is still possible to draw Gibbs samples for \(Z_j\) or approximate the empirical covariance matrix, as the corresponding conditional distribution is Gaussian.
The optimization problem in ((2)) can be solved efficiently in various ways by using glasso or CLIME approaches (Friedman et al. 2008; Hsieh et al. 2011). Convergence of the EM algorithm for penalized likelihood problems has been proved in Green (1990). Our experimental study shows that the algorithm usually converges after several iterations \((< 10)\). Note that the sparsity of the estimated precision matrix in Equation ((2)) is controlled by a vector of penalty parameter \(\rho\). We follow (Foygel and Drton 2010) in using the extended Bayesian information criterion (eBIC) to select a suitable regularization parameter \(\rho^*\) to produce a sparse graph with a sparsity pattern corresponding to \(\widehat{\Theta}_{\rho^*}\). Alternatively, instead of using the EM algorithm, a nonparanormal skeptic approach can be used to estimate graph structure through Spearman’s rho and Kendall’s tau statistics; details can be found in Liu et al. (2012).Input: A \((n \times p)\) data matrix \(Y\), where
Y\(^{(i)}_j = F^{-1}_j(\Phi(Z^{(i)}_{j}))\) the \(F_j\) and \(\Phi\) define
the empirical marginals and the CDF of standard normal, respectively for
\(i = 1 \ldots, n\) and \(j = 1, \ldots, p\);
Output: The conditional expectation
\(R = \frac{1}{n} \sum\limits_{i=1}^{n} E( Z^{(i)} Z^{(i)^T} | y^{(i)}; \mathcal{\widehat{D}}, \mathbf{\Theta})\);
Initialize \(E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E(z^{(i)}_{j} \ | \ y^{(i)}_{j}; \mathcal{\widehat{D}} )\), \(E( (z_j{^{(i)}})^2 \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E( (z_j{^{(i)}})^2 \ | \ y^{(i)}_j; \mathcal{\widehat{D}} )\), and \(E(z^{(i)}_{j} z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) \approx E(z^{(i)}_{j} \ | \ y^{(i)}_{j}; \mathcal{\widehat{D}}) E(z^{(i)}_{j'} \ | \ y^{(i)}_{j'}; \mathcal{\widehat{D}} )\) for \(i = 1,\ldots, n\) and \(j,j' = 1, \ldots, p\);
Initialize \(r_{j,j'}\) for \(1 \leq j, j' \leq p\) using the Line 1 above, then estimate \(\widehat{ \mathbf{\Theta}}\) by maximizing ((2));
for \(i = 1, \ldots, n\) do
if \(j = j'\) then
Calculate \(E((z_j{^{(i)}})^2 \ | \ y^{(i)}_j; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}} )\) for \(j = 1, \ldots, p\);
else
Calculate
\(E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}})\)
and then
\(E(z^{(i)}_{j} z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) = E(z^{(i)}_{j} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}}) E(z^{(i)}_{j'} \ | \ y^{(i)}; \mathcal{\widehat{D}}, \widehat{ \mathbf{\Theta}} )\)
for \(i = 1,\ldots, n\) and \(j = 1, \ldots, p\);
end if
end for
Calculate \(r_{j,j'}= \frac{1}{n} \sum_{i=1}^{n} E(z_j^{(i)} z_j^{(i)t} \ | \ y^{(i)}; \mathbf{\widehat{\mathcal{D}}}, \widehat{ \mathbf{\Theta}} )\) for \(1 \le j = j' \le p\).
Here we convert the estimated network to a one-dimensional map using two different approaches. Depending on the type of (experimental) population (i.e. inbred or outbred), we order markers based on dimensionality reduction or based on bandwidth reduction, which both result in an one-dimensional map.
In inbred populations, loci in the genome of the progenies can be assigned to their parental homologues, resulting in a simpler conditional independence relationship between neighboring markers. Here, we use multidimensional scaling (MDS) to project markers in a p-dimensional space onto a one-dimensional map while attempting to maintain pairwise distances. Let \(G(V^{(d)},E^{(d)})\) be a sub–graph on the set of unordered \(d\) markers, where \(V^{(d)} = \{1, \ldots, d\}\), \(d \leq p\) and the edge set \(E^{(d)}\) represents all the links among \(d\) markers. We calculate the distance matrix \(D\) as follows \[\begin{aligned} D_{ij} = \left\{ \begin{array}{ll} - \log (r_{ij}) & if i\ne j\\ 0 & if i = j, \end{array} \right. \end{aligned}\]
\[\begin{aligned} r_{ij} = - \frac{\theta_{ij}}{\sqrt{\theta_{ii}\theta_{jj}}}, \end{aligned}\] where \(\theta_{ij}\) is the \(ij\)-th element of the precision matrix \(\widehat{\Theta}_{\rho^*}\). We aim to construct a configuration of \(d\) data points in a one–dimensional Euclidean space by using information about the distances between the \(d\) nodes. In this regards, we define a sequential ordering \(L\) of \(d\) elements such that the distance \(\widehat{D}\) between them is similar to \(\mathcal{D}\). We consider a metric multi-dimensional scaling \[\begin{aligned} \widehat{E}= arg\min_L \sum\limits_{i=1}^{d}\sum\limits_{j=1}^{d}(D_{ij} - \widehat{D}_{ij})^2, \end{aligned}\] that minimizes the so called mapping error \(\widehat{E}\) across all sequential orderings (Sammon 1969).
We propose a different ordering algorithm for outbred populations. In these populations, mating of two non-homozygous parents result in markers in the genome of progenies that cannot easily be mapped into their parental homologues. To order markers in outbred populations, we use the Cuthill-McKee (RCM) algorithm (Cuthill and McKee 1969) to permute the sparse matrix \(\widehat{\Theta}_\rho^{(d)}\) that has a symmetric sparsity pattern into a band matrix form with a small bandwidth. The bandwidth of the associated adjacency matrix \(A\) is defined as \(\beta = \max_ {A_{ij} \ne 0} | i - j|\). The algorithm produces a permutation matrix \(P\) such that \(P A P^T\) has a smaller bandwidth than matrix \(A\) does. The bandwidth decreases by moving the non-zero elements of the matrix \(A\) closer to the main diagonal. The way to move the non-zero elements is determined by relabeling the nodes in graph \(G(V_d, E_d)\) in consecutive order. Moreover, all of the nonzero elements are clustered near the main diagonal.
The netgwas R package implements the Gaussian copula graphical models (Behrouzi and Wit 2019b) for (i) constructing linkage maps in diploid and polyploid species and learning (ii) linkage disequilibrium networks and (iii) genotype-phenotype networks. Below, we illustrate the three main functions using a diploid A.thaliana population, a tetraploid potato, and maize NAM populations. Given that the computational cost for the usual size of GWAS data \(( > 10^5)\) is expensive, we use small data sets to explain the functionality of the package. All the results can be replicated using the functions in the netgwas package (see Supplementary Materials).
This module reconstructs linkage maps for diploid and polyploid organisms. Diploid organisms contain two copies of each chromosome, one from each parent, whereas polyploids contain more than two copies of each chromosome. In polyploids the number of chromosome sets reflects their level of ploidy: triploids have three copies, tetraploids have four, pentaploids have five, and so forth.
Typically, mating is between two parental lines that have recent common biological ancestors; this is called inbreeding. If they have no common ancestors up to roughly \(4\)-\(6\) generations, then this is called outcrossing. In both cases the genomes of the derived progenies are random mosaics of the genome of the parents. However, in the case of inbreeding parental alleles are distinguishable in the genome of the progeny, in outcrossing this does not hold.
Some inbreeding designs, such as Doubled haploid (DH), lead to a homozygous population where the derived genotype data include only homozygous genotypes of the parents namely AA and aa (conveniently coded as \(0\) and \(1\)) for diploid species. Other inbreeding designs, such as F2, lead to a heterozygous population where the derived genotype data contain heterozygous genotypes as well as homozygous ones, namely aa, Aa, and AA for diploid species, conveniently coded as \(0\), \(1\) and \(2\) which correspond to the dosage of the reference allele A. We remark that the Gaussian copula graphical models help us to keep heterozygous markers in the linkage map construction, rather than turn them to missing values as most other methods do in map construction for recombinant inbred line (RIL) populations.
Outcrossing or outbred experimental designs, such as full-sib families, derive from two non-homozygous parents. Thus, the genome of the progenies includes a mixed set of many different marker types containing fully informative markers and partially informative markers . Markers are called fully informative when all of the resulting gamete types can be phenotypically distinguished on the basis of their genotypes; markers are called partially informative when they have identical phenotypes (Wu et al. 2002).
netmap()
The netmap()
function handles various inbred and outbred mapping
populations, including recombinant inbred lines (RILs), F2, backcross,
doubled haploid, and full-sib families, among others. Not all existing
methods for linkage mapping support all inbreeding and outbreeding
experimental designs. However, our proposed algorithm constructs a
linkage map for any type of experimental design of biparental inbreeding
and outbreeding. Also, it covers a wide range of possible population
types. Argument cross
in the map function must be specified to choose
an ordering method. In inbred populations, markers in the genome of the
progenies can be assigned to their parental homologous, resulting in a
simpler conditional independence pattern between neighboring markers. In
the case of inbreeding, we use multidimensional scaling (MDS). A metric
MDS is a classical approach that maps the original high-dimensional
space to a lower dimensional space, while attempting to maintain
pairwise distances. An outbred population derived from mating two
non-homozygous parents results in markers in the genome of progenies
that cannot be easily assigned to their parental homologues. Neighboring
markers that vary only on different haploids will appear as independent,
therefore requiring a different ordering algorithm. In that case, we use
the reverse Cuthill-McKee (RCM) algorithm (Cuthill and McKee 1969) to
order markers.
The function can be called with the following arguments
netmap(data, method = "npn", cross = NULL, rho = NULL, n.rho = NULL, rho.ratio = NULL,
min.m = NULL, use.comu = FALSE, ncores = "all", verbose = TRUE)
The main task of this function is to construct a linkage map based on
conditional (in)dependence relationships between markers, which can be
estimated using the methods, “gibbs"
, “approx"
, and “npn"
. The
estimation procedure relies on maximum penalized log-likelihood, where
the argument rho
, a decreasing sequence of non-negative numbers,
controls the sparsity levels, which corresponds to the last term in
Equation ((2)). Leaving the input as rho = NULL
, the
program automatically computes a sequence of rho
based on n.rho
and
rho.ratio
. The argument n.rho
specifies the number of regularization
parameters (the default is 6) and rho.ratio
determines the ratio
between the consecutive elements of rho
. Depending on the population
type, inbred
or outbred
, different algorithms are applied to order
markers in the genome. If it is known, the user can specify an expected
minimum number of markers in a linkage group (LG) via the argument
min.m
. Furthermore, linkage groups can be identified either using the
fast greedy community detection algorithm (Newman 2004) or simply
each disconnect sub-networks can form a linkage group. The
ncores = "all"
automatically detects number of available cores and
runs the computations in parallel on (available cores -1).
The netmap()
function returns an object of the S3
class type
netgwasmap
and plot.netgwasmap
and print.netgwasmap
are summary
method functions for this object class. The netgwasmap
mainly holds
the estimated linkage map (in object map
) and a list containing all
output results (in object res
) of the regularization path rho
.
buildMap()
The function buildMap()
allows users to interact with the map
construction procedure and to build the linkage map on the manually
selected penalty term. Whereas the function netmap()
selects the
optimal penalty term \(\rho^\star\) using the eBIC method.
The function can be called via
buildMap(res, opt.index, min.m = NULL, use.comu = FALSE).
The argument opt.index
can be chosen manually which is a number
between one and the number of penalty parameter n.rho
in netmap()
.
In the default setting, the n.rho
is \(6\). So, the opt.index
can get
a value between \(1\) and \(6\). Like function netmap()
, the argument
min.m
is an optional argument in buildMap()
function, where it keeps
the clusters of markers that at least have a size of min.m
member of
markers. The default value for this argument is 2. The use.comu
argument is an alternative approach to find linkage groups. The
use.comu
argument is an alternative approach to find linkage groups.
The function n
etsnp() reconstructs a high-dimensional linkage
disequilibrium interactions network for diploid and polyploid (GWAS)
genotype data. Genetic viability can be considered as a phenotype. This
function detects the conditional dependent short- and long-range linkage
disequilibrium structure of genomes and thus reveals aberrant
marker-marker associations that are due to epistatic selection. In other
words, this function detects intra– and inter–chromosomal conditional
interactions networks and can be called via
netsnp(data, method = "gibbs", rho = NULL, n.rho = NULL, rho.ratio = NULL,
ncores = "all", verbose = TRUE)
for any bi-parental genotype data containing at least two genotype
states and possibly missing values. The input data can be either an (n
\(\times\) p) genotype data matrix, an object of class netgwasmap
, which
is an output of functions netmap()
and buildMap()
, or a simulated
data from the function simgeno()
. Depending on the dimension of the
input data a suitable ‘method
’ and its related arguments can be
specified. The argument ncores
determines the number of cores to use
for the calculations. Using ncores = "all"
automatically detects
number of available cores and runs the computations in parallel.
This function returns an \(S3\) object of class "netgwas", which holds
mainly the following objects: (i) Theta
a list of estimated
\((p \times p)\) precision (inverse of variance-covariance) matrices that
infer the conditional independence relationships patterns among genetic
loci, (ii) path
which is a list of estimated \((p \times p)\) adjacency
matrices. This is the graph path corresponding to Theta
, (iii) rho
which is a vector with n.rho
dimension containing the penalties, and
(iv) loglik
contains the maximum log-likelihood values along the graph
path. To select an optimal graph the function selectnet()
can be used.
Complex genetic traits are influenced by multiple interacting loci, each with a possibly small effect. Our approach reduces the number of candidate genes from hundreds to much fewer genes. It is of great interest to geneticists and biologist to discover a set of most effective genes that directly affect a complex trait in GWAS. To overcome the limitations of traditional analysis, such as single-locus association analysis (looking for main effects of single marker loci), multiple testing and QTL analysis, we use the proposed mixed graphical model to study the simultaneous associations between phenotypes and SNPs. Our method allows for a more accurate interpretation of findings, because it adjusts for the effects of remaining variables –SNPs and phenotypes– while measuring the pairwise associations, whereas the traditional methods use marginal associations to often analyze SNPs and phenotypes one at a time .
Graphical modeling is a powerful tool for describing complex interaction
patterns among variables in high-dimensional data used frequently in
microarray analysis (Butte et al. 2000). In our modelling framework,
a genotype–phenotype network is a complex network made up of
interactions among: (i) genetic markers, (ii) phenotypes (e.g. disease),
and (iii) between genetic markers and phenotypes. The first problem in
analyzing genotype-phenotype data is the mixed variable-types, where
markers are ordinal (counting the number of a major allele), and
phenotypes (disease) can be measured in continuous or discrete scales.
We deal with mixed discrete-and-continuous variables by means of copula.
A second issue relates to the high-dimensional setting of the data,
where thousands of genetic markers are measured across a few samples; we
deal with inferring potentially large networks with only few biological
samples. Fortunately, genotype-phenotype networks are intrinsically
sparse, in the sense that only few elements interact with each other.
This sparsity assumption is incorporated into our algorithm based on
penalized graphical models. The proposed method is implemented in the
n
etphenogeno() function, where the input data can be an (\(n \times p\))
matrix
or a data.frame
where \(n\) is the sample size and \(p\) is the
dimension that includes marker data and phenotype measurements. One may
consider including more columns, like environmental variables.
The function is defined by
netphenogeno(data, method = "npn", rho = NULL, n.rho = NULL, rho.ratio = NULL,
ncores = "all", em.iter = 5, em.tol = .001, verbose = TRUE)
and reconstructs genotype-phenotype interactions network for an input
data of a genotype-phenotype data matrix or a data.frame. Detecting
interactions network among genotypes, phenotypes and environmental
factors is also possible using this function. Depending on the size of
the input data, the user may choose "gibbs", "approx", or "npn"
method for learning the networks. For a medium (
\(\sim\)500) and a large number of variables we recommend
to choose "gibbs" and "approx", respectively. Choosing "npn" for a
very large number of variables ( >2000) is computationally efficient.
The default method is set to "npn". Like the function netsnp()
, the
netphenogeno()
function returns an object of class netgwas.
For objects of type ‘netgwas’ there are plot, print and summary methods
available. The plotting function plot.netgwas()
provides a
visualization plot to monitor the path of estimated networks for a range
of penalty terms. The functions
plot.netgwasmap(), ``plot.select()`` and ``plot.simgeno()``
visualize
the corresponding network, the optimal graph and the results of
model-based simulated data, respectively.
To speed up computations in all the three key functions of the netgwas
package, we use the parallel package on the Comprehensive R
Archive
Network (CRAN) at http://CRAN.R-project.org to support parallel
computing on a multi-core machine to deal with large inference problems.
For the optimizing the memory usage, we use the Matrix package
(Bates et al. 2022) for sparse matrix output when estimating and storing full
regularization paths for large datasets. The use of these libraries
significantly improves the computational speed of the functions within
the package.