High-throughput genomic technologies bring to light a comprehensive
hallmark of molecular changes of a disease. It is increasingly evident
that genes are not isolated from each other and the identification of
a gene signature can only partially elucidate the de-regulated
biological functions in a disease. The comprehension of how groups of
genes (pathways) are related to each other (pathway-cross talk) could
explain biological mechanisms causing diseases. Biological pathways
are important tools to identify gene interactions and decrease the
large number of genes to be studied by partitioning them into smaller
groups. Furthermore, recent scientific studies have demonstrated that
an integration of pathways and networks, instead of a single component
of the pathway or a single network, could lead to a deeper
understanding of the pathology.
StarBioTrek is an R package for the integration of biological
pathways and networks which provides a series of functions to support
the user in their analyses. In particular, it implements algorithms to
identify pathways cross-talk networks and gene network drivers in
pathways. It is available as open source and open development software
in the Bioconductor platform.
In recent years new genomic technologies have made possible to define new marker gene signatures (Desmedt et al. 2009; Parker et al. 2009; Cava et al. 2014b). However, gene expression-based signatures present some constraints because they do not consider metabolic role of the genes and are affected by genetic heterogeneity across patient cohorts (Donato et al. 2013; Cava et al. 2015).
Pathway analysis can help the researchers in the identification of the biological roles of candidate genes exceeding these limitions (Folger et al. 2011). Indeed, considering the activity of entire biological pathways rather than the expression levels of individual genes can charactherize the whole tissue. In particular, there are several methods in computations and data used to perform the pathway analyses. They can be characterized in two different levels: gene-sets and pathway topology (García-Campos et al. 2015). Indeed, the existing tools integrating pathway data can be grouped into these groups based on the pathway definition.
In the first group we can include the tools that are based on gene sets definition as simple lists of biological molecules, in which the listed components share a common biological role. In this group, for example, we can include CoRegNet and Gene Set Enrichment Analysis (GSEA). CoRegNet reconstructs a co-regulatory network from gene expression profiles integrating, also, regulatory interactions, such as transcription factor binding site and ChIP data, presenting some analyses to identify master regulators of a given set of genes (Nicolle et al. 2015). One of the first and most popular methods is GSEA (Subramanian et al. 2005) that uses a list of ranked genes based on their differential gene expression between two labels. It then evaluates their distribution on a priori defined set of genes, thus generating an enrichment score (ES) for each set of genes.
In contrast, tools based on pathway topology do not only contain the components of a pathway but also describe the interactions between them. However, these methods still analyze the pathways as independent from each other and not considering the influence that a pathway can exert over another. In this second group we can include analysis methods that take into account the topological structure of a pathway, such as NetPathMiner, ToPASeq, and XGR. NetPathMiner (Mohamed et al. 2014) implements methods for the construction of various types of genome scale networks for network path mining. It generates a network representation from a pathway file supporting metabolic networks. Since the network is generated, the network edges can be weighted using gene expression data (e.g., Pearson correlation). Using machine learning methods and Markov mixture models, the pathways can be classified or clustered based on their association with a response label. The ToPASeq package implements seven different methods covering broad aspects for topology-based pathway analysis of RNA-seq data (Ihnatova and Budinska 2015). With respect to other tools, XGR (Fang et al. 2016) is designed for enhanced interpretation of genomic data generating also SNP-modulated gene networks and pathways. However, compared to our tool, the others are not focused on the pathway cross-talk analyses.
In line with this scenario given the few methods focused on the pathway cross-talk network, the development of new methodologies to measure pathway activity and cross-talk among pathways integrating also the information of networks and gene expression data (e.g., TCGA data) could lead to a deeper knowledge of the pathology.
Furthermore, functional pathway representation attributes the same functional significance to each gene without considering the impact of gene interactions in performing that function. What kinds of interactions are there among genes in functional pathways? Specifically, biological system interactions are composed of multiple layers of dynamic interaction networks (Cantini et al. 2015). These dynamic networks can be decomposed, for example, into: co-expression, physical, co-localization, genetic, pathway, and shared protein domains.
We developed a series of algorithms (see (Colaprico et al. 2015; Cava et al. 2016, 2018)), implemented in StarBioTrek package able to work on all levels of the pathway analysis.
Starting from the gene expression data of two groups of samples (e.g., normal vs. disease), such algorithms aim at building a pathway cross-talk model by attributing a score for each pairwise pathway. Several scores are implemented in the tool using the gene expression levels inside the pathways. The interacting pathways are filtered considering pathways that are able to classify better the two groups of samples. In addition, the genes inside the pathways can be weighted defining key network drivers in the pathways as those gene drivers that are highly connected in biological networks.
In summary, StarBioTrek package proposes an approach that integrates knowledge on the functional pathways and multiple gene-gene (protein-protein) interactions into gene selection algorithms. The challenge is to identify more stable biomarker signatures, which are also more easily interpretable from a biological perspective. The integration of biological networks and pathways can also give further hypotheses of the mechanisms of driver genes.
StarBioTrek makes accessible data of biological pathways and networks in order to perform analyses without having to navigate and access different web-based databases, without the need to download data, and by integrating and locally processing the full data sets in a short time. Specifically, it allows the users to: i) query and download biological pathways and networks from several repositories such as KEGG, Reactome and GeneMANIA(Zuberi et al. 2013; Cava et al. 2017; Franz et al. 2018) importing several functions from graphite (Sales et al. 2012), and harmonize annotations for genes and proteins (query/ download/ annotation harmonization); (ii) integrate pathways and biological networks with a series of implemented algorithms.
The functions of StarBioTrek import a large amount of data (e.g., biological pathways and networks).
Specifically, the function pathwayDatabases
can easily query some
features of interest of the user such as species or specific pathway
database from graphite (Sales et al. 2012). Then, the function GetData
imports the selected data.
> library(graphite)
> pathwayDatabases()
species database1 athaliana kegg
2 athaliana pathbank
3 athaliana reactome
4 btaurus kegg
5 btaurus reactome
6 celegans kegg
7 celegans reactome
8 cfamiliaris kegg
9 cfamiliaris reactome
10 dmelanogaster kegg
11 dmelanogaster reactome
12 drerio kegg
13 drerio reactome
14 ecoli kegg
15 ecoli pathbank
16 ggallus kegg
17 ggallus reactome
18 hsapiens biocarta
19 hsapiens humancyc
20 hsapiens kegg
21 hsapiens nci
22 hsapiens panther
23 hsapiens pathbank
24 hsapiens pharmgkb
25 hsapiens reactome
26 hsapiens smpdb
27 mmusculus kegg
28 mmusculus pathbank
29 mmusculus reactome
30 rnorvegicus kegg
31 rnorvegicus pathbank
32 rnorvegicus reactome
33 scerevisiae kegg
34 scerevisiae pathbank
35 scerevisiae reactome
36 sscrofa kegg
37 sscrofa reactome
38 xlaevis kegg
> path <- GetData(species="hsapiens", pathwaydb="kegg")
Since the user selected the data of interest, the function GetPathData
allows us to obtain a list of genes grouped by functional role:
> pathwayALLGENE <- GetPathData(path_ALL=path[1:3])
1] "Downloading............. Glycolysis / Gluconeogenesis 1 of 3 pathways"
[1] "Downloading............. Citrate cycle (TCA cycle) 2 of 3 pathways"
[1] "Downloading............. Pentose phosphate pathway 3 of 3 pathways" [
The function ConvertedIDgenes
will converter the gene nomenclature
(e.g., ENTREZ ID) to Gene Symbol.
> pathway <- ConvertedIDgenes(path_ALL=path[1:10])
The function getNETdata
of StarBioTrek imports biological networks
from GeneMANIA. The biological networks can be selected among physical
interactions, co-localization, genetic interactions, pathways, and
shared protein domain networks. Furthermore, it supports 9 species (
Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Drosophila
melanogaster, Escherichia coli, Homo sapiens, Mus musculus, Rattus
norvegicus, and Saccharomyces cerevisiae); for default it considers Homo
sapiens. Specifically, the function call
> netw <- getNETdata(network="SHpd")
1]"genemania.org/data/current/Homo_sapiens/Shared_protein_domains.INTERPRO.txt n.1 of 2"
[1]"genemania.org/data/current/Homo_sapiens/Shared_protein_domains.PFAM.txt n.2 of 2"
[1]"Preprocessing of the network n. 1 of 2"
[1]"Preprocessing of the network n. 2 of 2" [
imports biological networks (i.e., shared protein domains interactions
from INTERPRO and PFAM databases) for Homo sapiens. Otherwise, the
user can select one of the 9 species or using the following parameters
the user can select different network types: PHint
for Physical
interactions, COloc
for Co-localization, GENint
for Genetic
interactions, PATH
for Pathway, and SHpd
for Shared protein domains.
Finally, StarBioTrek provides the functions for the harmonization of
gene nomenclature in the pathways and biological networks. Biological
data are processed for downstream analyses mapping Ensembl Gene ID to
gene symbols. Figure 1 shows an overview of
network types supported by StarBioTrek with the function getNETdata
.
Starting from a gene expression matrix (DataMatrix), StarBioTrek groups the gene expression levels according to their biological roles in pathways for each sample.
> listpathgene <- GE_matrix(DataMatrix=tumo[,1:2], genes.by.pathway=pathway[1:10])
> str(listpathgene)
2
List of $ Cell_cycle :'data.frame': 114 obs. of 2 variables:
$ TCGA-E9-A1RC-01A: num [1:114] 4218 695 4231 7029 1211 ...
..$ TCGA-BH-A0B1-01A: num [1:114] 3273 692 6733 6468 1290 ...
..$ p53_signaling pathway:'data.frame': 64 obs. of 2 variables:
$ TCGA-E9-A1RC-01A: num [1:64] 989 1614 1592 3456 900 ...
..$ TCGA-BH-A0B1-01A: num [1:64] 816 1274 1770 3190 405 ... ..
This function allows the user to have in a short time the gene expression levels grouped by pathways.
As described in (Cava et al. 2014a) there are different measures to summarize the expression levels of each pathway, such as the mean:
> score_mean <- average(pathwayexpsubset=listpathgene)
or standard deviation:
> score_stdev <- stdv(gslist=listpathgene)
Dissimilarity distances have been proved useful in many application fields. Recent studies (Cava et al. 2013, 2014c) used with success dissimilarity representation among patients, considering the expression levels of individual genes. To our knowledge, dissimilarity representation is not used in pathway-based expression profiles. Our goal is to give a dissimilarity representation, which can express, through a function D(x,y), the dissimilarity between two pathways x and y, such as Euclidean distance between pairs of pathways:
> scoreeucdistat <- eucdistcrtlk(dataFilt=Data_CANCER_normUQ_fil,
pathway_exp=pathway[1:10])
or discriminating score (Colaprico et al. 2015):
> crosstalkstdv <- dsscorecrtlk(dataFilt=Data_CANCER_normUQ_fil,
pathway_exp=pathway[1:10])
Biological pathways can involve a large number of genes that are not equivocally relevant for a functional role in the cell. Therefore, the integration of network and pathway-based methods can boost the power to identify the key genes in the pathways.
The function takes as arguments: a list of pathways as obtained by the
function ConvertedIDgenes
and the networks as obtained by the function
getNETdata
.
> listanetwork <- pathnet(genes.by.pathway=pathway[1:10], data=netw)
It creates a network of interacting genes for each pathway. The output of the function is a selection of interacting genes according to the network \(N\) in a pathway \(P\), namely a list with two columns where on the same row there are the two interacting genes.
The function listpathnet
takes as inputs the output obtained by the
function pathnet
and the pathways as obtained by the function
ConvertedIDgenes
:
> listpath <- listpathnet(lista_net=listanetwork, pathway=pathway[1:10])
creating a list of vectors for each pathway containing only genes that have at least one interaction with other genes belonging to the pathway.
The function GetPathNet
allows us to obtain a list of interacting
genes (protein-protein interactions from
graphite
package) for each pathway:
> pathwaynet <- GetPathNet(path_ALL=path[1:3])
using as its argument the output obtained by GetData
.
The first algorithm implemented in StarBioTrek explores a pathway cross-talk network from gene expression data to better understand the underlying pathological mechanism. The algorithm generates a network of pathways that shows a different behavior between two groups of samples (i.e., normal vs. disease).
Specifically,
# get pathways from KEGG database
<- GetData(species="hsapiens", pathwaydb="kegg")
path <- ConvertedIDgenes(path_ALL=path)
pathway
# create a measure of pathway cross-talk (i.e., Euclidean distance) between pairwise
# of pathways starting from gene expression data (i.e.TCGA) with in the columns the
# samples and in the rows the genes
<- eucdistcrtlk(dataFilt=Data_CANCER_normUQ_fil, pathway=pathway)
scoreeucdistat
# split samples' TCGA ID into normal and tumor groups
<- SelectedSample(Dataset=Data_CANCER_normUQ_fil, typesample="tumour")
tumo <- SelectedSample(Dataset=Data_CANCER_normUQ_fil, typesample="normal")
norm
# divide the dataset in 60/100 for training and 40/100 for testing
<- 60
nf
# a support vector machine is applied
<- svm_classification(TCGA_matrix=scoreeucdistat[1:10,], nfs=nf,
res_class normal=colnames(norm), tumour=colnames(tumo))
Since the AUC values are obtained for each pair of pathways, they can be ranked in order to obtain the pathway cross-talk interactions able to classify the two classes (i.e. normal vs. tumor samples) with the best performance. Such selection can be done considering AUC values:
=0.80
cutoff<- res_class[res_class[,1]>cutoff, ] er
The outputs are the only pathway interactions that are obtained with AUC values \(>0.80\). The implemented algorithm was used in (Colaprico et al. 2015) and (Cava et al. 2016) to screen pathway cross-talk associated to breast cancer.
The pseudocode of the implemented algorithm is summarized below.
Here, we propose an algorithm for the integrative analysis of networks and pathways. Our method is inspired on a well-validated method (the GANPA/LEGO) (Fang et al. 2012; Dong et al. 2016), based on the hypothesis that if one gene is functionally connected in the pathway with more genes than those expected (according to the functional networks), has a key role in that pathway. The algorithm, an extension of the GANPA/LEGO method, defines driver genes in a pathway if they are highly connected in a biological network.
The function
IPPI(patha=pathway_matrix[,1:10], netwa=netw_IPPI)
is used to identify driver genes for each pathway. The inputs of the
function are pathways and network data. It calculates the degree
centrality values of genes inside the network and the degree centrality
of genes inside pathways.
The pseudocode of the implemented algorithm is summarized below.
In the first step, given the gene \(i\) within the network \(N\) with \(m\) genes, the function computes the degree centrality \(d\)\(iN\) as the number of neighbor genes belonging to \(N\) to which the gene \(i\) is directly connected.
In the second step, given gene \(i\) within the pathway \(P\) with \(k\) genes, the function then computes the degree centrality \(d\)\(iP\) considering only the relations among gene \(i\) and the other genes in the networks belonging to pathway \(P\). Overall, by combining the information of the network \(N\) within the pathway \(P\), is obtained a selection of interacting genes according to the network \(N\).
Then, the function computes the degree centrality expected \(d\)\(iE\) by supposing equal probability for the existence of edges between nodes (\(d\)\(iN\)\(/m = d\)\(iE\)/\(k\)). Thus, \(d\)\(iE\) = \(d\)\(iN\) \(x k/m\).
The function characterizes a gene as a ‘network driver’ in the pathway \(P\), when \(d\)\(iP\) of involving gene, normalized to the size of the pathway (\(k\)), is higher than \(d\)\(iE\), \(d\)\(iP\)/\(k > d\)\(iE\).
The speculation is that if one gene is functionally linked (according to the functional network) with more genes in the pathway than expected, its function is central in that pathway.
The function IPPI
was used in (Cava et al. 2018) to find driver genes in the
pathways that are also de-regulated in a pan-cancer analysis.
StarBioTrek presents several functions for the preparation to the
visualization of gene-gene interactions and pathway cross-talk using the
qgraph package
(Epskamp et al. 2012). The function plotcrosstalk
prepares the data:
> formatplot <- plotcrosstalk(pathway_plot=pathway[1:6],gs_expre=tumo)
It computes a Pearson correlation between the genes (according to a gene expression matrix, such as tumor) in which each gene is grouped in a gene set given by the user (e.g., pathway). Each gene is presented in a gene set if it is involved univocally in that gene set.
The functions of qgraph
> library(qgraph)
> qgraph(formatplot[[1]], minimum = 0.25, cut = 0.6, vsize = 5, groups = formatplot[[2]],
legend = TRUE, borders = FALSE, layoutScale=c(0.8,0.8))
and
> qgraph(formatplot[[1]], groups=formatplot[[2]], layout="spring", diag = FALSE,
cut = 0.6, legend.cex = 0.5, vsize = 6, layoutScale=c(0.8,0.8))
show the network with different layouts. The graphical output of the functions are presented in the Figure 2 and Figure 3. The color of interactions indicates the type of correlation: green edges are positive correlations and red edges are negative correlations. The thickness of the edge is proportional to the strength of correlation.
The outputs of the functions that compute the pairwise distance metrics can be easily used with heatmap plotting libraries such as heatmap or pheatmap as reported in the Figure 4.
Furthermore, the function circleplot
of StarBioTrek implemented
using the functions of
GOplot
(Walter et al. 2015) provides a visualization of driver genes (with a
score indicating the role of genes in that pathway), as reported in
Figure 5.
> formatplot <- plotcrosstalk(pathway_plot=pathway[1:6], gs_expre=tumo)
> score <- runif(length(formatplot[[2]]), min=-10, max=+10)
> circleplot(preplot=formatplot, scoregene=score)
In this section we will present two case studies for the usage of the StarBioTrek package. In particular, the first case study uses the first implemented algorithm reported above to identify pathway cross-talk network. The second case study uses the second implemented algorithm to identify gene drivers for each pathway.
Starting from gene expression data of breast cancer samples and normal samples we grouped 15243 genes in pathways according to their functional role in the cell. Pathway data were derived from the function call:
<- GetData(species="hsapiens", pathwaydb="kegg")
path <- ConvertedIDgenes(path_ALL=path) pathway
For each pair of pathways we calculated a discriminating score as a measure of cross-talk. This measure can be used considering e.g. the pathways enriched with differentially expressed genes.
<- dsscorecrtlk(dataFilt=Data_CANCER_normUQ_fil,
crosstalkscore pathway_exp=pathway[1:10])
Discriminating score is given by |M1-M2|/S1+S2 where M1 and M2 are means and S1 and S2 standard deviations of expression levels of genes in a pathway 1 and in a pathway 2. In order to identify the best pathways for breast cancer classification (breast cancer vs. normal) we implemented a Support Vector Machine. We divided the original dataset in training data set (60/100) and the rest of original data in the testing set (40/100). In order to validate the classifier, we used a \(k\)-fold cross-validation (\(k = 10\)) obtaining Area Under the Curve (AUC).
<- SelectedSample(Dataset=Data_CANCER_normUQ_fil, typesample="tumour")
tumo <- SelectedSample(Dataset=Data_CANCER_normUQ_fil, typesample="normal")
norm <- 60
nf <- svm_classification(TCGA_matrix=crosstalkscore, nfs=nf,
res_class normal=colnames(norm), tumour=colnames(tumo))
Ranking AUC values obtained we selected the pathway cross-talk network with the best AUC. The approach of the algorithm is shown in Figure 6.
In the second case study, we downloaded KEGG pathways
<- GetData(species="hsapiens",pathwaydb="kegg")
path <- ConvertedIDgenes(path_ALL=path) pathway
and network data for different network types from GeneMANIA
# for Physical interactions
<- getNETdata(network="PHint")
netw
# for Co-localization
<- getNETdata(network="COloc")
netw
# for Genetic interactions
<- getNETdata(network="GENint")
netw
# for Pathway interactions
<- getNETdata(network="PATH")
netw
# for Shared_protein_domains
<- <getNETdata(network="SHpd") netw
We processed the data obtained by the function getNETdata
in order to
obtain a data format supported by the function IPPI
. The function
IPPI
was applied for each of the 5 network types.
We obtained that genes with genetic interaction found the lowest number of potential gene network drivers. On the other hand, the network that includes proteins with shared protein domains found the highest number of potential driver genes. Finally, we defined a gene as a “network driver” in the pathway, when in at least two networks one gene is functionally connected in the pathway with more genes than those expected (according to the two networks).
The approach is shown in Figure 7.
We have described StarBioTrek, an R package for the integrative analysis of biological networks and pathways. The package supports the user during the import and data analysis of data. StarBioTrek implements two algorithms: i) the identification of gene network drivers in the pathways; ii) the building of pathway cross talk network.
GraphicalModels, Psychometrics
NetPathMiner, ToPASeq, StarBioTrek, graphite
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Cava & Castiglioni, "Integration of networks and pathways with StarBioTrek package", The R Journal, 2019
BibTeX citation
@article{RJ-2019-025, author = {Cava, Claudia and Castiglioni, Isabella}, title = {Integration of networks and pathways with StarBioTrek package}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {310-322} }