Biological pathways or modules represent sets of interactions or functional relationships occurring at the molecular level in living cells. A large body of knowledge on pathways is organized in public databases such as the KEGG, Reactome, or in more specialized repositories, the Atlas of Cancer Signaling Network (ACSN) being an example. All these open biological databases facilitate analyses, improving our understanding of cellular systems. We hereby describe ACSNMineR for calculation of enrichment or depletion of lists of genes of interest in biological pathways. ACSNMineR integrates ACSN molecular pathways gene sets, but can use any gene set encoded as a GMT file, for instance sets of genes available in the Molecular Signatures Database (MSigDB). We also present RNaviCell, that can be used in conjunction with ACSNMineR to visualize different data types on web-based, interactive ACSN maps. We illustrate the functionalities of the two packages with biological data taken from large-scale cancer datasets.
Biological pathways and networks comprise sets of interactions or functional relationships, occurring at the molecular level in living cells (Adriaens et al. 2008; Barillot et al. 2012). A large body of knowledge on cellular biochemistry is organized in publicly available repositories such as the KEGG database (Kanehisa et al. 2011), Reactome (Croft et al. 2014) and MINT (Zanzoni et al. 2002). All these biological databases facilitate a large spectrum of analyses, improving our understanding of cellular systems. For instance, it is a very common practice to cross the output of high-throughput experiments, such as mRNA or protein expression levels, with curated biological pathways in order to visualize the changes, analyze their impact on a network and formulate new hypotheses about biological processes. Many biologists and computational biologists establish list of genes of interest (e.g. a list of genes that are differentially expressed between two conditions, such as normal vs disease) and then evaluate if known biological pathways have significant overlap with this list of genes.
We have recently released the Atlas of Cancer Signaling Network (ACSN), a web-based database which describes signaling and regulatory molecular processes that occur in a healthy mammalian cell but that are frequently deregulated during cancerogenesis (Kuperstein et al. 2015). The ACSN atlas aims to be a comprehensive description of cancer-related mechanisms retrieved from the most recent literature. The web interface for ACSN is using the NaviCell technology, a software framework dedicated to web-based visualization and navigation for biological pathway maps (Kuperstein et al. 2013). This environment is providing an easy navigation of maps through the use of the Google Maps JavaScript library, a community interface with a web blog system, and a comprehensive module for visualization and analysis of high-throughput data (Bonnet et al. 2015).
In this article, we describe two packages related to ACSN analysis and data visualization. The package ACSNMineR is designed for the calculation of gene enrichment and depletion in ACSN maps (or any user-defined gene set via the import function), while RNaviCell is dedicated to data visualization on ACSN maps. Both packages are available on the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/ACSNMineR/ and https://cran.r-project.org/web/packages/RNaviCell/), and on the GitHub repository (https://github.com/sysbio-curie/ACSNMineR and https://github.com/sysbio-curie/RNaviCell). For the remainder of this article, we describe the organization of each package and illustrate their capacities with several concrete examples demonstrating their capabilities.
Currently, ACSN maps cover signaling pathways involved in DNA repair, cell cycle, cell survival, cell death, epithelial-to-mesenchymal transition (EMT) and cell motility. Each of these large-scale molecular maps is decomposed in a number of functional modules. The maps themselves are merged into a global ACSN map. Thus the information included in ACSN is organized in three hierarchical levels: a global map, five individual maps, and several functional modules. Each ACSN map covers hundreds of molecular players, biochemical reactions and causal relationships between the molecular players and cellular phenotypes. ACSN represents a large-scale biochemical reaction network of 4,826 reactions involving 2,371 proteins (as of today), and is continuously updated and expanded. We have included the three hierarchical levels in the ACSNMineR package, in order to be able to calculate enrichments at all three levels. The calculations are made by counting the number of occurences of gene symbols (HUGO gene names) from a given list of genes of interest in all ACSN maps and modules. Table 1 is detailling the number of gene symbols contained in all the ACSN maps.
Map | Total | Nb mod. | Min | Max | Mean |
---|---|---|---|---|---|
ACSN global | 2239 | 67 | 2 | 629 | 79 |
Survival | 1053 | 5 | 208 | 431 | 328 |
Apoptosis | 667 | 7 | 19 | 382 | 136 |
EMT & Cell motility | 634 | 9 | 18 | 629 | 137 |
DNA repair | 345 | 21 | 3 | 171 | 45 |
Cell cycle | 250 | 25 | 2 | 130 | 20 |
The statistical significance of the counts in the modules is assessed by using either the Fisher exact test (Fisher 1922, 1934) or the hypergeometric test, which are equivalent for this purpose (Rivals et al. 2007).
The current ACSN maps are included in the ACSNMineR package, as a list of character matrices.
> length(ACSN_maps)
1] 6
[> names(ACSN_maps)
1] "Apoptosis" "CellCycle" "DNA_repair" "EMT_motility" "Master"
[6] "Survival" [
For each matrix, rows represent a module, with the name of the module in the first column, followed by a description of the module (optional), and then followed by all the gene symbols of the module. The maps will be updated according to every ACSN major release.
The main function of the ACSNMineR package is the enrichment
function, which is calculating over-representation or depletion of genes
in the ACSN maps and modules. We have included a small list of 12 Cell
Cycle related genes in the package, named genes_test
that can be used
to test the main enrichment function and to get familiar with its
different options.
> genes_test
1] "ATM" "ATR" "CHEK2" "CREBBP" "TFDP1" "E2F1" "EP300"
[8] "HDAC1" "KAT2B" "GTF2H1" "GTF2H2" "GTF2H2B" [
The example shown below is the simplest command that can be done to test a gene list for over-representation on the six included ACSN maps. With the list of 12 genes mentionned above and a default p-value cutoff of \(0.05\), we have a set of 8 maps or modules that are significantly enriched. The results are structured as a data frame with nine columns displaying the module name, the module size, the number of genes from the list in the module, the names of the genes that are present in the module, the size of the reference universe, the number of genes from the list that are present in the universe, the raw p-value, the p-value corrected for multiple testing and the type of test performed. The module field in the results data frame indicate the map name and the module name separated by a column character. If a complete map is significantly enriched or depleted, then only the map name is shown, without any module or column character. For instance, the third line of the results object below concern the E2F1 module of the CellCycle map.
> library(ACSNMineR)
> results <- enrichment(genes_test)
> dim(results)
1] 8 9
[> results[3,]
module module_size nb_genes_in_module:E2F1 19 12
V161 CellCycle
genes_in_module
V161 ATM ATR CHEK2 CREBBP TFDP1 E2F1 EP300 HDAC1 KAT2B GTF2H1 GTF2H2 GTF2H2B
universe_size nb_genes_in_universe p.value p.value.corrected test2237 12 3.735018e-21 2.353061e-19 greater V161
The enrichment
function can take up to nine arguments: the gene list
(as a character vector), the list of maps that will be used to calculate
enrichment or depletion, the type of statistical test (either the Fisher
exact test or the hypergeometric test), the module minimal size for
which the calculations will be done, the universe, the p-value
threshold, the alternative hypothesis ("greater" for calculating
over-representation, "less" for depletion and "both" for both tests)
and a list of genes that should be removed from the universe (option
"Remove_from_universe"). This option may be useful for instance if we
know beforehand that a number of genes are not expressed in the samples
considered.
Only the gene list is mandatory to call the enrichment
function, all
the other arguments have default values. The maps
argument can either
be a dataframe imported from a GMT file with the format_from_gmt
function or a list of dataframes generated by the same procedure. The
GMT format corresponds to the Broad Institute’s Gene Matrix Transposed
file format, a convenient and easy way to encode named sets of genes of
interest in tab-delimited text files (it is not a graph or network
format). By default, the function enrichment
uses the ACSN maps
previously described. The correction for multiple testing is set by
default to use the method of Benjamini & Hochberg, but can be changed to
any of the usual correction methods (Bonferroni, Holm, Hochberg, Holm,
or Benjamini & Yekutieli (Reiner et al. 2003)), or even disabled . The
minimal module size represents the smallest size value of a module that
will be used to compute enrichment or depletion. This is meant to remove
results of low significance for module of small size. The universe in
which the computation is made by default is defined by all the gene
symbols contained in the maps. All the genes that were given as input
and that are not present on the maps will be discarded. To keep all
genes, the user can change the universe to HUGO
, and in that case, the
complete list of HUGO gene symbols will be used as the reference (\(>\)
39,000 genes). The threshold corresponds to the maximal value of the
corrected p-value (unless the user chose not to correct for multiple
testing) that will be displayed in the result table.
It may be of interest to compare enrichment of pathways in different
cohorts or experiments. For example, enrichment of highly expressed
pathways can reveal differences between two cancer types or two cell
lines. To facilitate such comparisons, ACSNMineR provides a
multisample_enrichment
function. It relies on the enrichment
function but takes a list of character vector genes. The name of each
element of the list will be assumed to be the name of the sample for
further analysis. Most of the arguments given to
multisample_enrichment
are the same as the ones passed to
enrichment
. However, the cohort_threshold
is designed to filter out
modules which would not pass the significance threshold in all samples.
Finally, to facilitate visualization of results, ACSNMineR integrates
a representation function based on
ggplot2 syntax
(Wickham 2009). It allows representation of results from enrichment
or multisample_enrichment
with a limited number of parameters. Two
types of display are available: heat-map tiles or bars. For multiple
samples using a barplot representation, the number of rows used can be
provided, otherwise all plots will be on the same row. For the heatmap,
the color of the non-significant modules, and boundaries of the gradient
for significant values can also be tuned.
We previously computed the p-value of the genes_test
list with default
parameters. The number of modules which have a p-value below 0.05 was 8,
that can be compared to the 16 obtained without correction with the
simple command shown below (some of the results are displayed in table
2).
enrichment(genes_test,correction_multitest = FALSE)
Module | Mod. size | Genes in module | p-value | Test |
---|---|---|---|---|
CellCycle | 242 | ATM ATR CHEK2 | \(5.4 \times 10^{-7}\) | greater |
CREBBP TFDP1 E2F1 | ||||
EP300 HDAC1 KAT2B | ||||
GTF2H1 GTF2H2 GTF2H2B | ||||
CellCycle:APOPTOSIS_ENTRY | 10 | ATM ATR CHEK2 E2F1 | \(3.5 \times 10^{-7}\) | greater |
CellCycle:CYCLINB | 7 | ATM | 0.04 | greater |
We can now plot the first six rows of the results obtained for corrected and uncorrected fisher test with heatmap format (Figure 1) or barplot (Figure 2) with the following commands:
# heatmap
represent_enrichment(enrichment = list(Corrected = results[1:6,],
Uncorrected = results_uncorrected[1:6,]),
plot = "heatmap", scale = "reverselog",
low = "steelblue" , high ="white", na.value = "grey")
# barplot
represent_enrichment(enrichment = list(Corrected = results[1:6,],
Uncorrected = results_uncorrected[1:6,]),
plot = "bar", scale = "reverselog",
nrow = 1)
The NaviCell Web Service provides a server mode, which allows automating visualization tasks and retrieving data from molecular maps via RESTful (standard http/https) calls. Bindings to different programming languages are provided in order to facilitate the development of data visualization workflows and third-party applications (Bonnet et al. 2015). RNaviCell is the R binding to the NaviCell Web Service. It is implemented as a standard R package, using the R object-oriented framework known as Reference Classes (Wickham 2015). Most of the work done by the user using graphical point-and-click operations on the NaviCell web interface are encoded as functions in the library encapsulating http calls to the server with appropriate parameters and data. Calls to the NaviCell server are performed using the library RCurl (Lang and CRAN team 2015), while data encoding/decoding in JSON format is performed with the RJSONIO library (Lang 2014).
Once the RNaviCell library is installed and loaded, the first step is to create a NaviCell object and launch the browser session. This will automatically create a unique session ID with the NaviCell server. Once the session is established, various functions can be called to send data to the web session, set graphical options, visualize data on a map or get data from the map. There are 125 functions available in the current version of RNaviCell. All of them are described with their different options in the RNaviCell documentation, and we provide a tutorial on the GitHub repository wiki (https://github.com/sysbio-curie/RNaviCell/wiki/Tutorial).
In the simple example detailed below, we create a NaviCell session, then load an expression data set from a local (tab-delimited) file. The data represent gene expression measured in a prostate cancer cell line resistant to hormonal treatment (agressive), and is taken from the Cell Line Encyclopedia project (Barretina et al. 2012). We visualize the data values on the Cell Cycle map (the default map), using heat maps. With this visualization mode, gene expression values are represented as a color gradient (green to red) in squares positioned next to the entities where the gene has been mapped (Figure 3). Note that the map is displayed in a browser and is interactive, i.e. users can zoom in to display more information and for example look in what reactions are involved the genes selected to be displayed, and lots of other informations (see (Bonnet et al. 2015) and (Kuperstein et al. 2015) for more details).
# a short RNaviCell script example
# load RNaviCell library
library(RNaviCell)
# create a NaviCell object and launch a server session
# this will automatically open a browser on the client
<- NaviCell()
navicell $launchBrowser()
navicell
# import a gene expression matrix and
# send the data to the NaviCell server
# NB: the data_matrix object is a regular R matrix
<- navicell$readDatatable('DU145_data.txt')
data_matrix $importDatatable("mRNA expression data", "DU145", data_matrix)
navicell
# set data set and sample for heat map representation
$heatmapEditorSelectSample('0','data')
navicell$heatmapEditorSelectDatatable('0','DU145')
navicell$heatmapEditorApply() navicell
In a study published in 2008, Schmidt and colleagues analyzed gene expression patterns of 200 breast cancer patients not treated by systemic therapy after surgery using discovery approach to reveal additional prognostic motifs (Schmidt et al. 2008). Estrogen receptor (ER) expression and proliferative activity of breast carcinomas are well-known and described prognostic markers. Patients with ER-positive carcinomas have a better prognosis than those with ER-negative carcinomas, and rapidly proliferating carcinomas have an adverse prognosis. Knowledge about the molecular mechanisms involved in the processes of estrogen-dependent tumor growth and proliferative activity has led to the successful development of therapeutic concepts, such as antiendocrine and cytotoxic chemotherapy.
The dataset corresponding to this study is available as a Bioconductor package. The code shown below is creating a list of differentially expressed genes between ER positive and ER negative samples, and calculates the enrichment in ACSN maps from this list of genes. As seen in Table 3, there is one map (DNA repair) and seven modules (belonging to the Cell Cycle, DNA repair and Apoptosis maps) enriched.
# load all necessary packages
library(breastCancerMAINZ)
library(Biobase)
library(limma)
library(ACSNMineR)
library(hgu133a.db)
library(RNaviCell)
# load data and extract expression and phenotype data
data(mainz)
<- exprs(mainz)
eset <- pData(mainz)
pdat
# Create list of genes differentially expressed between ER positive and
# ER negative samples using moderated t-test statistics
<- model.matrix(~factor(pdat$er == '1'))
design lmFit(eset, design) -> fit
eBayes(fit) -> ebayes
toptable(ebayes, coef=2,n=25000) -> tt
which(tt$adj < 0.05) -> selection
rownames(tt[selection,]) -> probe_list
mget(probe_list, env = hgu133aSYMBOL) -> symbol_list
<- as.character(symbol_list)
symbol_list
# calculate enrichment in ACSN maps
enrichment(symbol_list) -> results
dim(results)
1] 8 9 [
Module | Mod. size | Nb genes | pval | Cor. pval |
---|---|---|---|---|
Apoptosis:AKT_MTOR | 79 | 47 | 0.00043 | 0.0068 |
CellCycle:E2F2_TARGETS | 35 | 22 | 0.0055 | 0.043 |
CellCycle:E2F3_TARGETS | 51 | 31 | 0.0023 | 0.025 |
CellCycle:E2F4_TARGETS | 100 | 60 | \(5.8 \times 10^{-5}\) | 0.0037 |
DNA_repair | 346 | 172 | 0.00038 | 0.0068 |
DNA_repair:CELL_CYCLE | 82 | 49 | 0.00029 | 0.0068 |
DNA_repair:G1_CC_PHASE | 25 | 18 | 0.0013 | 0.016 |
DNA_repair:S_CC_PHASE | 46 | 28 | 0.0036 | 0.033 |
The Molecular Signatures Database (MSigDB) is one of the most widely
used repository of well-annotated gene sets representing the universe of
biological processes (Liberzon et al. 2011). We downloaded the
canonical pathways set, counting more than 1,300 gene sets representing
canonical pathways compiled by domain experts. The dataset is encoded
with the GMT format, and can be imported within ACSNMineR with the
format_from_gmt
function. We calculate the enrichment for the breast
cancer differentially expressed gene list, simply specifying the MSigDB
data we just imported as the maps
option. Table 4
is displaying the pathways having a corrected p-value \(< 0.05\). The
prefix is indicating the database source, so we see that we have
pathways from the KEGG, Reactome and PID databases. Consistent with our
previous results, most of the enriched pathways are related to the cell
cycle regulation.
# Import MSigDB canonical pathways and calculate enrichment on this database
<- format_from_gmt('c2.cp.v5.0.symbols.gmt')
mtsig enrichment(symbol_list, maps = mtsig)
Pathway | Mod. size | Nb genes | Cor. pval |
---|---|---|---|
KEGG_CELL_CYCLE | 128 | 76 | \(3.9 \times 10^{-8}\) |
REACTOME_CELL_CYCLE_MITOTIC | 325 | 159 | \(3.9 \times 10^{-8}\) |
REACTOME_DNA_REPLICATION | 192 | 98 | \(4.9 \times 10^{-6}\) |
PID_FOXM1PATHWAY | 40 | 29 | \(3.1 \times 10^{-5}\) |
REACTOME_MITOTIC_M_M_G1_PHASES | 172 | 87 | \(3.1 \times 10^{-5}\) |
REACTOME_CELL_CYCLE | 421 | 182 | \(5 \times 10^{-5}\) |
REACTOME_MITOTIC_G1_G1_S_PHASES | 137 | 71 | \(9 \times 10^{-5}\) |
PID_AURORA_B_PATHWAY | 39 | 27 | 0.0002 |
REACTOME_S_PHASE | 109 | 58 | 0.00024 |
PID_SYNDECAN_1_PATHWAY | 46 | 30 | 0.00026 |
At last, we visualize the mean expression values for ER negative samples
for all genes differentially expressed on the ACSN master (global) map
using RNaviCell
commands to create heatmaps.
# Select ER negative samples and calculate mean expression values
apply(eset[probe_list,pdat$er == 0],1,mean) -> er_minus_mean
names(er_minus_mean) <- symbol_list
<- as.matrix(er_minus_mean)
er_minus_mean colnames(er_minus_mean) <- c('exp')
# create a NaviCell session, import the expression matrix on the map and create
# heatmaps to represent the data points.
<- NaviCell()
navicell $proxy_url <- "https://acsn.curie.fr/cgi-bin/nv_proxy.php"
navicell$map_url <- "https://acsn.curie.fr/navicell/maps/acsn/master/index.php"
navicell
$launchBrowser()
navicell$importDatatable("mRNA expression data", "GBM_exp", er_minus_mean)
navicell$heatmapEditorSelectSample('0','exp')
navicell$heatmapEditorSelectDatatable('0','GBM_exp')
navicell$heatmapEditorApply() navicell
The Figure 4 is displaying the map for genes having a corrected p-value \(< 0.05\).
Recent years have witnessed a dramatic increase in new technologies for interrogating the activity levels of various cellular components on a genome-wide scale, including genomic, epigenomic, transcriptomic, and proteomic information (Hawkins et al. 2010). Integrating these heterogeneous datasets provides more biological insights than performing separate analyses. For instance, international consortia such as The Cancer Genome Atlas (TCGA) have launched large-scale initiatives to characterize multiple types of cancer at different levels on hundreds of samples. These integrative studies have already led to the identification of novel cancer genes (McLendon et al. 2008).
Malignant gliomas, the most common subtype of primary brain tumors, are aggressive, highly invasive, and neurologically destructive tumors considered to be among the deadliest of human cancers. In its most aggressive manifestation, glioblastoma (GBM), median survival ranges from 9 to 12 months, despite maximum treatment efforts (Maher et al. 2001). In this study we have analyzed whole-genome mutation data generated by the TCGA project on hundreds of patients. More specifically, we parsed the MAF (Mutation Annotation Format) GBM files produced by different sequencing centers to count and calculate gene mutation frequencies. We kept the mutations having a status likely to disturb the target protein’s function (i.e. Frame_Shift_Del, Nonstop_Mutation, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Splice_Site, Translation_Start_Site). In total, we collected mutations for more than 13,000 genes in a total of 379 mutated samples. In order to retain the most frequently mutated genes, we calculated frequencies across all mutated samples, and kept genes having a frequency greater than 1% (3,293 genes). We further labelled genes having a frequency greater than 1% and less than 5% as "1" and genes highly mutated (frequency higher than 5%) as "2".
We loaded the data as a matrix in R and calculated the enrichment in
ACSN maps with the ACSNMineR function enrichment
. The results are
displayed in table 5. There are 6 modules
significantly enriched in the DNA repair and EMT motility maps. Cell
matrix adhesions and ECM (extra cellular matrix), part of the EMT
motility map, are the modules with highest significance. The EMT
motility map is significantly enriched at the global map level (second
line in the table).
module | Mod. size | Nb genes | Cor. pval |
---|---|---|---|
DNA_repair:S_PHASE_CHECKPOINT | 45 | 19 | 0.008 |
EMT_motility | 635 | 181 | 0.0002 |
EMT_motility:CELL_MATRIX_ADHESIONS | 73 | 45 | 3.73e-12 |
EMT_motility:CYTOSKELETON_POLARITY | 154 | 47 | 0.022 |
EMT_motility:DESMOSOMES | 29 | 15 | 0.002 |
EMT_motility:ECM | 147 | 69 | 9.77e-11 |
EMT_motility:EMT_REGULATORS | 629 | 178 | 0.0002 |
Visualization of the list of glioblastoma mutated genes is shown on figure 5. This figure was generated with the ACSNMineR commands detailed below. Results of the enrichment test correlate well with the visualization on the map, with a high density of low and high frequency mutated genes in the EMT motility and DNA repair regions (maps) of the global ACSN map. Although they are not statistically significant, quite high densities can also be seen in other regions of the map.
library(RNaviCell)
# Create a NaviCell object, point it to the ACSN master map and launch
# a session.
<- NaviCell()
navicell $proxy_url <- "https://acsn.curie.fr/cgi-bin/nv_proxy.php"
navicell$map_url <- "https://acsn.curie.fr/navicell/maps/acsn/master/index.php"
navicell$launchBrowser()
navicell
# Read the GBM data file and import it into the session.
<- navicell$readDatatable('gbm.txt')
mat $importDatatable("Mutation data", "GBM", mat)
navicell
# set datatable and sample names for the glyph editor
$drawingConfigSelectGlyph(1, TRUE)
navicell$glyphEditorSelectSample(1, "categ")
navicell$glyphEditorSelectShapeDatatable(1, "GBM")
navicell$glyphEditorSelectColorDatatable(1, "GBM")
navicell$glyphEditorSelectSizeDatatable(1, "GBM")
navicell$glyphEditorApply(1)
navicell
# set color, shape and size parameters for glyphs
$unorderedConfigSetDiscreteShape("GBM", "sample", 0, 1)
navicell$unorderedConfigSetDiscreteShape("GBM", "sample", 1, 5)
navicell$unorderedConfigApply("GBM", "shape")
navicell
$unorderedConfigSetDiscreteColor("GBM", "sample", 0, "398BC3")
navicell$unorderedConfigSetDiscreteColor("GBM", "sample", 1, "CC5746")
navicell$unorderedConfigApply("GBM", "color")
navicell
$unorderedConfigSetDiscreteSize("GBM", "sample", 0, 4)
navicell$unorderedConfigSetDiscreteSize("GBM", "sample", 1, 14)
navicell
$unorderedConfigApply("GBM", "size") navicell
In this work, we presented the R package ACSNMineR, a novel package for the calculation of p-values for enrichment or depletion of genes in biological pathways. The package includes the six large-scale molecular maps and 67 functional modules of the Atlas of Cancer Signaling Network (ACSN) . Enrichment can be calculated for those maps and modules with several options to play with, but can also be calculated for other databases of molecular pathways, that can be imported from GMT formated files.
We also describe in this work the RNaviCell package, a R package convenient to use with ACSNMineR. This package is dedicated to create web-based and interactive data visualization on ACSN maps. Users can use this tools to represent genes of interest that have been shown to be related to the maps by calculating enrichment with ACSNMineR. Creating maps with the graphical user interface of the ACSN website can be a tedious task if the user has multiple samples or gene lists, and wants to compare their representations on ACSN maps. The RNaviCell package can be used to automate the process of creating the graphical representations automatically. The maps are displayed in a browser and are interactive, with the possibility for the user to zoom in and out, search for genes or molecular species, and see the details of the molecular reactions (what partners are involved, what is the state of a given species, etc.). For more details on how to use the interface and the different possibilities, see (Kuperstein et al. 2013), (Bonnet et al. 2015) and (Kuperstein et al. 2015). We have shown how the packages ACSNMineR and RNaviCell can be combined to analyze expression data from breast cancer samples, and also to analyze the frequency of mutated genes in glioblastoma cancer samples.
Of course, ACSNMineR is not the only R package for enrichment
calculations. For instance,
GOstats
(Falcon and Gentleman 2007) is probably one of the first packages that was
created to calculate enrichment for Gene Ontology categories. GOstats
can also be used to calculate enrichment for other biological pathways
categories, such as KEGG pathways (by using an instance of the class
KEGGHyperGParams
) or PFAM protein families (using PFAMHyperGParams
).
However, its usage might not be as straightforward as ACSNMineR, and
it does not seem possible to test user-defined biological pathways.
Furthermore, other authors have pointed out that the KEGG database used
by this package has not been updated since 2012.
clusterProfiler
is a recent R package released for enrichment analysis of Gene Ontology
and KEGG with either hypergeometric test or Gene Set Enrichment Analysis
(GSEA) (Yu et al. 2012). Via other packages, support for
analysis of Disease Ontology and Reactome Pathways is possible.
Interestingly, this package also offers the possibility to import
user-defined gene set, through tab-delimited pairwise definition files.
Other notable packages for enrichment calculations are ReactomePA for
Reactome molecular pathways (Yu and He 2016), miRNApath for microRNA
pathways (Cogswell et al. 2008) and gage (Luo et al. 2009). We believe that the
main advantages of ACSNMineR compared to other packages are a direct
access to the full set of ACSN maps (updated on a regular basis) and an
easy way to test MSigDB gene sets or any user-defined gene set formatted
appropriately.
In order to improve ACSNMineR, we may in the near future try to
improve the speed of calculation, which might be a problem if a very
large number of samples or experiments have to be analyzed rapidly. For
instance, we could use the foreach
and %dopar%
operator to
parallelize the most computationally demanding operations. It could also
be useful to implement more sensitive methods of gene set enrichment
measures, such as the Gene Set Enrichment Analysis (GSEA) method.
RNaviCell relies on standard HTTP calls to provide informations and calculations, and we have developped a number of bindings for popular programming languages such as R, Java and Python (Bonnet et al. 2015). This open architecture is designed to facilitate the development of utilities by other programmers and to facilitate the integration of ACSN maps in existing frameworks. The development of such services, sometimes called “microservices" (Fowler 2014) is in expansion. Furthermore, this kind of open architecture could clear the way for a more unified and general access to reaction networks database, including for example WikiPathways (Kelder et al. 2012), Reactome (Croft et al. 2014) and other databases. The PSICQUIC project is a successfull example of such an architecture (Aranda et al. 2011). It is an effort of the HUPO proteomics standard initiative to standardize the access to molecular interaction databases programmatically, based on the specification of web services (using REST and SOAP calls) and a common query language (MIQL).
This work was supported by a grant “Projet Incitatif et Collaboratif Computational Systems Biology Approach for Cancer" from Institut Curie. The authors would like to thank Pierre Gestraud for his comments on early versions of the ACSNMineR package and Eric Viara for guidance and assistance on the development of the RNaviCell package.
Phylogenetics, Spatial, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Deveau, et al., "Calculating Biological Module Enrichment or Depletion and Visualizing Data on Large-scale Molecular Maps with ACSNMineR and Packages", The R Journal, 2016
BibTeX citation
@article{RJ-2016-047, author = {Deveau, Paul and Barillot, Emmanuel and Boeva, Valentina and Zinovyev, Andrei and Bonnet, Eric}, title = {Calculating Biological Module Enrichment or Depletion and Visualizing Data on Large-scale Molecular Maps with ACSNMineR and Packages}, journal = {The R Journal}, year = {2016}, note = {https://rjournal.github.io/}, volume = {8}, issue = {2}, issn = {2073-4859}, pages = {293-306} }