During the building of a genome-scale metabolic model, there are several dead-end metabolites and substrates which cannot be imported, produced, nor used by any reaction incorporated in the network. The presence of these dead-end metabolites can block out the net flux of the objective function when it is evaluated through Flux Balance Analysis (FBA), and when it is not blocked, bias in the biological conclusions increase. In this aspect, the refinement to restore the connectivity of the network can be carried out manually or using computational algorithms. The g2f package was designed as a tool to find the gaps from dead-end metabolites and fill them from the stoichiometric reactions of a reference, filtering candidate reactions using a weighting function. Additionally, this algorithm allows downloading all the sets of gene-associated stoichiometric reactions for a specific organism from the KEGG database. Our package is compatible with both 4.0.0 and 3.6.0 R versions.
Genome-scale metabolic models (GEMs) are multi-compartment metabolic reconstructions that specify the set of chemical reactions catalyzed by an organism (usually hundreds to thousands) covering the metabolic biochemical molecular function of a complete genome (Szappanos et al. 2011). The main goal of these reconstructions is to relate the genome of a given organism with its physiology, incorporating every metabolic transformation that this organism can perform (Chen et al. 2012; Agren et al. 2013). The GEMs are converted into computational models for the simulation of a species-specific metabolism in order to gain insight into the complex interactions that give rise to the metabolic capabilities (Alper et al. 2005; Fong et al. 2005; Cook and Nielsen 2017). The predictive accuracy of a model depends on the comprehensiveness and biochemical fidelity of the reconstruction (Thiele et al. 2014).
The GEM construction process can be divided into two fundamental stages: (1) The generation of a draft of the reconstructed network. Here, the reactions associated with the enzymes that participate in the metabolism of a particular organism are downloaded from specialized databases such as KEGG, MetaCyc, or ModelSEED (Pham et al. 2019; Steijn et al. 2019). (2) A refinement of the network is done manually or through the use of computational algorithms (Pham et al. 2019; Steijn et al. 2019). Similar steps are performed during the construction of a tissue-specific metabolic reconstruction, defined as the subset of reactions included in a genome-scale metabolic reconstruction that are highly associated with the metabolism of a specific tissue (Palsson 2009; Schultz and Qutub 2016; Steijn et al. 2019). These are constructed from the measured gene expression or proteomic data allowing researchers to characterize and predict the metabolic behavior of tissue under any physiological conditions (Ataman et al. 2017). It is important to highlight that a drawback of this approach arises from the fact that only the reactions associated with specific enzymes or genes can be mapped from the measured data. Therefore, the spontaneous and non-facilitated-transport reactions are missing in the first stages (Schultz and Qutub 2016).
If all relevant exchange reactions are available, a high-quality model is expected to be able to carry flux in all its reactions (Agren et al. 2013); thus, a refinement stage in the reconstruction is required to restore the connectivity of the network. In this aspect, the gaps in the draft reconstruction are identified, and candidate reactions to fill the gaps are found using literature and metabolic databases (Satish Kumar et al. 2007; Thiele and Palsson 2010). The network gaps can be associated with dead-end metabolites, which cannot be imported nor produced by any of the reactions in the network, or metabolites that are not used as substrates or released by any of the reactions. The presence of this type of metabolites can be problematic when the metabolic network is transformed into a steady-state metabolic model; mainly because flux through the network is blocked due to the incomplete connectivity with the rest of the network. Therefore, it is not possible to accurately optimize the metabolic flux distribution under an objective function, increasing the bias in the biological conclusions obtained from the reconstruction (Satish Kumar et al. 2007).
A manual refinement can be performed as an iterative process to assemble a higher confidence compendium of organism-specific metabolic reactions on a draft metabolic reconstruction (Howe et al. 2008; Bateman 2010; Heavner and Price 2015). Since the network reconstructions typically involve thousands of metabolic reactions, the model refinement can be a very complex task, which not only requires plenty of time and intensive use of available literature, databases, and experimental data (Lakshmanan et al. 2014; Heavner and Price 2015) but also can lead to the introduction of new errors and to overlook old ones (Agren et al. 2013; Machado et al. 2018). These metabolic network gap refinement can also be performed using several algorithms developed for open.source environments, such as Python and GAMS, or in a closed-source environment such as MATLAB (Wang and Marci 2018). Commonly implemented algorithms are mainly based on optimization procedures to fill the gaps that allow the production of a specific metabolite or give flux for a single biological objective function. Other algorithms modify the directionality of reactions or add new reactions to the model without associated evidence (Table 1)
Algorithm | Implementation | (Open source) | ||
Package | Environment | Package | Environment | |
"SMILEY" | COBRApy | Python | Yes | Yes |
"gapFind" and "gapFill" | - | GAMS | - | Yes |
"growMatch" | COBRApy | Python | Yes | Yes |
"fastgapfill" | openCOBRA | MATLAB | Yes | No |
Table 1 listed the four most used algorithms for gap filling across three environments. SMILEY, developed by Reed et al. (2006), identifies the minimum number of reactions required to allow the model a specific metabolite production through an optimization function. Reactions to fill the gaps are identified from a universal database of stoichiometric reactions, and the process is carried out one metabolite per time (user-defined). Alternatively, "gapFind" and "gapFill" in GAMS were developed by Satish Kumar et al. (2007) and identified the metabolites (‘gapFind’) in the metabolic network reconstruction, which cannot be produced under any uptake conditions in both single and multicompartment. Subsequently, ‘gapFill’ identify the reactions from a customized multi-organism database that restores the connectivity of these metabolites to the original network using optimization-based procedures. In the process, the procedure makes several intra-model modifications such as: (1) modify the directionality of the reactions in the model, (2) add fake external transport mechanisms, and (3) add fake intracellular transport reactions in multicompartment models. "growMatch" was developed by Kumar and Maranas (2009), and it identifies the minimum number of reactions required to allow the model flux to a selected objective function through an optimization algorithm. Reactions to fill the gaps are identified from a universal database of stoichiometric reactions. The process is carried out with one objective function per time (user-defined). Finally, developed by Thiele et al. (2014), the ‘fastGapFill’ algorithm identifies the blocked reactions through an optimization procedure. It searches candidate reactions to fill the gaps in a universal database of stoichiometric reactions through the ‘fastCore’ algorithm. This second algorithm computes a compact flux consistent model and uses it to filter and determine the reactions to be added. In the filling process, fake transport reactions between compartments are added.
In this aspect, and with the aim of offering an open-source tool that improves the refinement of drafts network reconstructions and the depuration of metabolic models under the R environment, we introduce the g2f R package. This tool includes five functions to identify and fill gaps, calculate the additional cost of a reaction, and depurate metabolic networks of blocked reactions (no activated under any scenario). The implemented gapFill algorithm in g2f identifies the dead-end metabolites and traces them in a universal database of stoichiometric reactions used as a reference to select candidate reactions to be added. Selected reactions are then filtered by the function additionCost considering metabolites present in the original reconstruction to minimize the number of new metabolites to be added. The function calculates the cost of adding a reaction by dividing the amount of non-included metabolites in the reference metabolic network over the total number of metabolites involved in the reaction. The latter is done to minimize the number of false-positive metabolites that could increase the number of new gaps in the model. Also, blockedReactions search for blocked reactions, so gapFill can fill blocked paths in the network. Finally, getReactionsList extracts the reactions from the model in the form of a list of strings, so it can be easily compared with the list of reactions obtained from getReference, which downloads specific stoichiometric matrices from KEGG in order to reconstruct specific organism models.
Workflow |
Input: A sybil metabolic model. |
1. with getReference: Reference reactions list is retrieved from KEGG database. |
2. with blockedReactions: Check if there is any dead-end metabolite, the results serve as a guide to the user. |
3. with getReactionsList: List of reactions is extracted from input metabolic model. |
4. with additionCost: The addition cost for the reference reactions list can be calculated to do a manual check. |
5. with gapFill: Find dead-end metabolites and fill the gaps with reactions from the reference list, which are below the addition cost treshold defined. |
Loop user defined times (default = 5) |
5.1. Searches dead-end reactants and products. |
5.2. Calculates the additional cost of the reference reactions. |
5.3. Filters reference reactions with a cost above the threshold. |
5.4. Selects the filtered reactions that have any orphan reactant or product. |
5.5. Fills the gaps in the model with the selected reactions. |
Output: List of the added reactions with their additional costs |
The g2f package is available for download and installation from the Comprehensive R Archive Networks (CRAN, Hornik (2012)). This package is compatible with R 3.6.0 and 4.0.0 versions. To get the latest stable version of g2f, install it directly from GitHub:
# Install 'devtools' R Package
> install.packages('devtools')
R
# Install 'g2f' package
> setRepositories(ind=1:2)
R> devtools::install_github('gibbslab/g2f')
R> library('g2f') R
g2f includes 5 functions in order to identify gaps (metabolites not produced or not consumed in any reaction) and fill the gaps from a reference metabolic reconstruction. Briefly, the gap-filling reconstruction is based on the stoichiometric reaction matrix either from a specific model or by the complete set of gene-associated stoichiometric reactions for a specific organism from the KEGG database using a weighting function. Table 3 summarizes the functions contained in the g2f R package.
Function | Description |
---|---|
blockedReactions | Identifies blocked reactions in a metabolic network. |
additionCost | Calculates the cost of addition of a stoichiometric reaction. |
getReactionsList | Extract the reaction list from a model. |
getReference | Download all stoichiometric reactions from the KEGG database. |
gapFill | Find and fill gaps in a metabolic network. |
The KEGG database is a resource, widely used as a reference in genomics,
metagenomics, metabolomics, and other studies. Moreover, KEGG has been
used for modeling and simulation in systems biology, specifically in
GEMs (Kanehisa 2006; Kanehisa et al. 2016; Martín-Jiménez et al. 2017). Currently, the
database includes complete genomes, biological pathways, and the
associated stoichiometric reactions for 542 eukaryotes, 5979 bacteria,
and 334 archaea. The g2f’s getReference
function downloads all the
gene-associated KeggOrthology (KO) stoichiometric reactions from KEGG
and their correspondent E.C. numbers for a customized organism, through
the use of KEGG organism ID. Based on the KOs associated with the
reactions, their respective gene-protein-reaction is constructed as
follows: all genes associated with a given KO are linked by an AND
operator. After that, when a reaction has more than one associated KO,
previously linked genes are now joined by an OR operator. As an example,
to download all the stoichiometric reactions (1492) associated with
Escherichia coli, just type:
> e.coli <- getReference(organism = "eco") R
To identify the blocked reactions included in a metabolic model, the
blockedReactions
function sets each one of the reactions included in
the model (one at the time) as the objective function and optimizes the
system through Flux Balance Analysis (FBA). Reactions that are not
participating in any possible solution during all evaluations are
returned as a blocked reaction.
As an example, we identify the blocked reactions in the E. coli core metabolic model included in the sybil package (Gelius-Dietrich et al. 2013).
> data("Ec_core")
R> blockedReactions(Ec_core)
R
|==============================================================| 100%
1] "EX_fru(e)" "EX_fum(e)" "EX_mal_L(e)" "FUMt2_2" "MALt2_2" [
Adding new reactions in order to fill gaps can be an easy path to
increase the number of dead-end metabolites (Hosseini and Marashi 2017). Therefore,
as a strategy to reduce the possible addition of new dead-end
metabolites into the system, the additionCost
function calculates the
cost of adding new metabolites based on metabolites that constitute the
new reaction and those that compose the stoichiometric reactions already
present in the metabolic reconstruction (Equation (1)).
Values of the function represent a weight ranging between 0 and 1.
\[\label{equation_1} \text{additionCost}=\frac{n(\text{metabolites(newReaction)})\notin(\text{metabolites(reactionList)})}{n(\text{metabolites(newReaction)})} \tag{1}\]
As an example, we select a sample of reactions from the downloaded reference for E. coli and calculate the additional cost for the remaining reactions (6 first values are shown).
> reactionList <- sample(e.coli$reaction,10)
R> head(
R+ additionCost(reaction = e.coli$reaction,
+ reference = reactionList)
+ )
1] 1.0000000 1.0000000 1.0000000 0.8000000 0.8333333 1.0000000 [
To understand the results of the additionCost
, we present two examples
for the glutamine synthetase reaction in the glutamate metabolism of E.
coli core model.
: ATP + Glu-L + Nh4 --> ADP + Gln-L + h + pi [c]
The reaction takes as input Adenosine triphosphate (ATP), L-Glutamate
(Glu-L), and Ammonium (Nh4) and produces Adenosine diphosphate (ADP),
L-Glutamine (Gln-L), H+ (h), and inorganic Phosphate (pi) in the
cytoplasm. We are going to assume that this reaction is going to be
added to the model and that the number of metabolites to be added change
between two conditions. In the first case, the reaction would be
evaluated by additionCost
, but one of the seven metabolites is not
present on the list of reactions of the complete model. In the second
situation, four of the seven metabolites are not present in the
metabolite list of the model. By dividing the number of metabolites to
be added by the total number of metabolites in the reaction,
additionCost
produces 0.14 and 0.57 as resulting values for the two
conditions respectively. In this sense, if we pick a threshold of 0.2
for the gapfill
the first case would allow the reaction to be added
but not the second condition. By using a threshold of 0.2 is possible to
set a medium point for the reaction addition. Where higher values are
more permissive and lower values are more restrictive.
To identify network gaps in a metabolic model and fill them from a
reference network, the gapFill
function performs several steps: (1)
The dead-end metabolites are identified from the stoichiometric matrix,
(2) the candidate reactions are to be added by comparing the metabolites
against the metabolite list of the model, (3) the additional cost of
each candidate reaction is calculated, (4) the candidate reactions with
an additional cost lower or equal to the user-defined limit are added to
the reaction list. Finally, the process returns to step 1 until no more
original-gaps can be filled under the user-defined limit. The function
returns a set of candidate stoichiometric reactions to fill the
original-gaps included in the metabolic network.
As an example, we show how to fill dead-end metabolites included in the previously selected sample using all downloaded stoichiometric reactions from the KEGG database for E. coli as the reference.
> reactionsAdded <- gapFill(reactionList = reactionList,
R+ reference = e.coli$reaction,
+ limit = 1/4
+ )
48% gaps filled in the last iteration
26% gaps filled in the last iteration
13% gaps filled in the last iteration
13% gaps filled in the last iteration
4% gaps filled in the last iteration
> head(reactionsAdded)
R
addCost react1 0.00 L-Glutamine + D-Fructose 6-phosphate <=> L-Glutamate + D-Glucosamine
6-phosphate
2 0.25 ATP + Pyruvate <=> ADP + Phosphoenolpyruvate
3 0.00 ATP + AMP <=> 2 ADP
4 0.25 ATP + dTDP <=> ADP + dTTP
5 0.00 ATP + 5-Fluorouridine diphosphate <=> ADP + 5-Fluorouridine triphosphate
6 0.25 ATP + UDP <=> ADP + UTP
The output is a data frame with the reactions that were found to fill
the gaps in the model, with the corresponding additionCost
calculated
for each one.
In order to provide compatibility, g2f implements getReactionsList
a function that helps to extract the reactions of a sybil model as a
list of strings, each string being a reaction, which is the input format
of gapFill
accepts.
In the examples before, we used a reduced version for the reference
organism of E.coli from KEGG. Now we will use a converted model to SBML
using KEGG2SBML (Moutselos et al. 2009) from (Akiya Jouraku and Kitano 2008), which
will be converted into sybil with the help of the sybilSBLM
package, and then the reactions list will be extracted to use them with
the gapFill
function. Note that we have done this because the name of
the reaction metabolites in the model should be the same as the ones
used in KEGG, and the E.coli core metabolic model included in the
sybil package does not meet this requirement.
# Install and import sybilSBML package
> install.packages('sybilSBML')
R> library('sybilSBML')
R
# Read the SBML and convert it to sybil
> mod <- readSBMLmod("eco/eco00730.xml", bndCond = FALSE)
R
# Extract the model's reactions
> react <- getReactionsList(mod)
R
# Fill the gaps
> reactionsAdded <- gapFill(reactionList = react$react,
Rreference = e.coli$reaction,
limit = 1/4
)
20% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
addCost react1 0 ATP + ADP <=> ADP + ATP
2 0 ATP + H2O <=> ADP + Orthophosphate
We tested the performance of g2f against the most used platforms for
gap-filling in the metabolic networks using a computer with i7 8750h
2.2GHz processor and 12Gb DDR4 Ram. We compared the performance of R
package g2f, Python cobrapy gapfill
function, and Matlab
COBRA fastgapfilling
function (Table 4). The
benchmark was performed for each gap-filling algorithm by deleting 10
random reactions across the E. coli core model (Orth et al. 2010).
Platform | Limit | TicToc (sec) | Solution | |
R: g2f – "gapfill" | 0.1 | 2.83 | Feasible | |
0.15 | 2.76 | |||
0.2 | 2.73 | |||
0.25 | 6.91 | |||
Python: Cobrapy – "gapfill" | - | 1.369 | Unfeasible | |
Matlab: COBRA – "fastgapfill" [Cplex solver] | 0.1 | 7.858 | Feasible | |
0.15 | 8.836 | |||
0.2 | 9.001 | |||
0.25 | 5.695 |
Considering the computational performance and flux recovery across the network (FBA solution), g2f arises as a suitable method for Genome-scale metabolic network reconstructions gap filling using curated models as reference.
A wide variety of open-source, paid software, and webtools have been developed to fill the gaps in automated or manual metabolic reconstructions (Prigent et al. 2017; Machado et al. 2018; Karp et al. 2018). Performing a gap-filling accurately is a challenging task considering the possibility of overestimating reaction addition or excluding metabolites from the filling by inquorate thresholds (Pan and Reed 2018). g2f offers an R based open-source alternative capable of integrating with systems biology packages such as sybil (Gelius-Dietrich et al. 2013) or minVal (Osorio et al. 2017) as well as big projects such as Recon3D (Brunk et al. 2018) or the Human Metabolic Atlas (Pornputtapong et al. 2015). Finally, considering that the majority of metabolic models are derived from annotated genomes where not all the enzymes are known, g2f offers the possibility to optimize the topology of public available metabolic models or automated metabolic reconstructions.
We developed g2f, a novel R package to, find dead-end metabolites in
a genome-scale metabolic reconstruction and fill the reaction gaps with
metabolites available in a stoichiometric matrix from a reference model.
Additionally, g2f filters the candidate reactions using a weighting
function and a user-defined limit. We depicted the functions included in
the package using the E. coli reference model downloaded from the KEGG
database, and the core metabolic model included in the sybil
package. Finally, the performance of g2f was compared with other
gap-filling algorithms (Cobrapy – gapfill
and Matlab:COBRA –
fastgapfill
), showing an adequate feasibility and performance speed.
Dead-end metabolites are a major drawback in genome-scale metabolic reconstruction and analysis. Since there is a lack of available tools to solve this situation in the R environment, hereby, we introduce the g2f package to find and fill dead-end metabolites in a given reconstruction based on a reference template. Our method allows users to filter candidate reactions using a weighting function and a user-defined limit. We show step by step the functionality of each procedure included in the package using a reference model downloaded from the KEGG database for Escherichia coli and the core metabolic model included in the sybil package.
This work was supported by the Pontificia Universidad Javeriana, Bogotá, Colombia, and Minciencias IDs 7740, 8845, and 20304 to JG. We thank the anonymous reviewers and testers for their helpful comments and suggestions to improve the CRAN package.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Osorio, et al., "g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks", The R Journal, 2021
BibTeX citation
@article{RJ-2021-064, author = {Osorio, Daniel and Botero, Kelly and Velasco, Andrés Pinzón and Mendoza-Mejía, Nicolás and Rojas-Rodriguez, Felipe and Barreto, George E. and González, Janneth}, title = {g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {2}, issn = {2073-4859}, pages = {28-37} }