g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks

Daniel Osorio; Kelly Botero; Andrés Pinzón Velasco; Nicolás Mendoza-Mejía; Felipe Rojas-Rodriguez; George E. Barreto; Janneth González

1 Introduction

Genome-scale metabolic models (GEMs) are multi-compartment metabolic reconstructions that specify the set of chemical reactions catalyzed by an organism (usually hundreds to thousands) covering the metabolic biochemical molecular function of a complete genome (Szappanos et al. 2011). The main goal of these reconstructions is to relate the genome of a given organism with its physiology, incorporating every metabolic transformation that this organism can perform (Chen et al. 2012; Agren et al. 2013). The GEMs are converted into computational models for the simulation of a species-specific metabolism in order to gain insight into the complex interactions that give rise to the metabolic capabilities (Alper et al. 2005; Fong et al. 2005; Cook and Nielsen 2017). The predictive accuracy of a model depends on the comprehensiveness and biochemical fidelity of the reconstruction (Thiele et al. 2014).

The GEM construction process can be divided into two fundamental stages: (1) The generation of a draft of the reconstructed network. Here, the reactions associated with the enzymes that participate in the metabolism of a particular organism are downloaded from specialized databases such as KEGG, MetaCyc, or ModelSEED (Pham et al. 2019; Steijn et al. 2019). (2) A refinement of the network is done manually or through the use of computational algorithms (Pham et al. 2019; Steijn et al. 2019). Similar steps are performed during the construction of a tissue-specific metabolic reconstruction, defined as the subset of reactions included in a genome-scale metabolic reconstruction that are highly associated with the metabolism of a specific tissue (Palsson 2009; Schultz and Qutub 2016; Steijn et al. 2019). These are constructed from the measured gene expression or proteomic data allowing researchers to characterize and predict the metabolic behavior of tissue under any physiological conditions (Ataman et al. 2017). It is important to highlight that a drawback of this approach arises from the fact that only the reactions associated with specific enzymes or genes can be mapped from the measured data. Therefore, the spontaneous and non-facilitated-transport reactions are missing in the first stages (Schultz and Qutub 2016).

If all relevant exchange reactions are available, a high-quality model is expected to be able to carry flux in all its reactions (Agren et al. 2013); thus, a refinement stage in the reconstruction is required to restore the connectivity of the network. In this aspect, the gaps in the draft reconstruction are identified, and candidate reactions to fill the gaps are found using literature and metabolic databases (Satish Kumar et al. 2007; Thiele and Palsson 2010). The network gaps can be associated with dead-end metabolites, which cannot be imported nor produced by any of the reactions in the network, or metabolites that are not used as substrates or released by any of the reactions. The presence of this type of metabolites can be problematic when the metabolic network is transformed into a steady-state metabolic model; mainly because flux through the network is blocked due to the incomplete connectivity with the rest of the network. Therefore, it is not possible to accurately optimize the metabolic flux distribution under an objective function, increasing the bias in the biological conclusions obtained from the reconstruction (Satish Kumar et al. 2007).

A manual refinement can be performed as an iterative process to assemble a higher confidence compendium of organism-specific metabolic reactions on a draft metabolic reconstruction (Howe et al. 2008; Bateman 2010; Heavner and Price 2015). Since the network reconstructions typically involve thousands of metabolic reactions, the model refinement can be a very complex task, which not only requires plenty of time and intensive use of available literature, databases, and experimental data (Lakshmanan et al. 2014; Heavner and Price 2015) but also can lead to the introduction of new errors and to overlook old ones (Agren et al. 2013; Machado et al. 2018). These metabolic network gap refinement can also be performed using several algorithms developed for open.source environments, such as Python and GAMS, or in a closed-source environment such as MATLAB (Wang and Marci 2018). Commonly implemented algorithms are mainly based on optimization procedures to fill the gaps that allow the production of a specific metabolite or give flux for a single biological objective function. Other algorithms modify the directionality of reactions or add new reactions to the model without associated evidence (Table 1)

Table 1: Description and comparison of the methods used for gap find and filling. The available algorithms are presented under the different environments.
Algorithm	Implementation		(Open source)
	Package	Environment	Package	Environment
"SMILEY"	COBRApy	Python	Yes	Yes
"gapFind" and "gapFill"	-	GAMS	-	Yes
"growMatch"	COBRApy	Python	Yes	Yes
"fastgapfill"	openCOBRA	MATLAB	Yes	No

Table 1 listed the four most used algorithms for gap filling across three environments. SMILEY, developed by Reed et al. (2006), identifies the minimum number of reactions required to allow the model a specific metabolite production through an optimization function. Reactions to fill the gaps are identified from a universal database of stoichiometric reactions, and the process is carried out one metabolite per time (user-defined). Alternatively, "gapFind" and "gapFill" in GAMS were developed by Satish Kumar et al. (2007) and identified the metabolites (‘gapFind’) in the metabolic network reconstruction, which cannot be produced under any uptake conditions in both single and multicompartment. Subsequently, ‘gapFill’ identify the reactions from a customized multi-organism database that restores the connectivity of these metabolites to the original network using optimization-based procedures. In the process, the procedure makes several intra-model modifications such as: (1) modify the directionality of the reactions in the model, (2) add fake external transport mechanisms, and (3) add fake intracellular transport reactions in multicompartment models. "growMatch" was developed by Kumar and Maranas (2009), and it identifies the minimum number of reactions required to allow the model flux to a selected objective function through an optimization algorithm. Reactions to fill the gaps are identified from a universal database of stoichiometric reactions. The process is carried out with one objective function per time (user-defined). Finally, developed by Thiele et al. (2014), the ‘fastGapFill’ algorithm identifies the blocked reactions through an optimization procedure. It searches candidate reactions to fill the gaps in a universal database of stoichiometric reactions through the ‘fastCore’ algorithm. This second algorithm computes a compact flux consistent model and uses it to filter and determine the reactions to be added. In the filling process, fake transport reactions between compartments are added.

In this aspect, and with the aim of offering an open-source tool that improves the refinement of drafts network reconstructions and the depuration of metabolic models under the R environment, we introduce the g2f R package. This tool includes five functions to identify and fill gaps, calculate the additional cost of a reaction, and depurate metabolic networks of blocked reactions (no activated under any scenario). The implemented gapFill algorithm in g2f identifies the dead-end metabolites and traces them in a universal database of stoichiometric reactions used as a reference to select candidate reactions to be added. Selected reactions are then filtered by the function additionCost considering metabolites present in the original reconstruction to minimize the number of new metabolites to be added. The function calculates the cost of adding a reaction by dividing the amount of non-included metabolites in the reference metabolic network over the total number of metabolites involved in the reaction. The latter is done to minimize the number of false-positive metabolites that could increase the number of new gaps in the model. Also, blockedReactions search for blocked reactions, so gapFill can fill blocked paths in the network. Finally, getReactionsList extracts the reactions from the model in the form of a list of strings, so it can be easily compared with the list of reactions obtained from getReference, which downloads specific stoichiometric matrices from KEGG in order to reconstruct specific organism models.

Table 2: Workflow of *g2f* packet
Workflow
Input: A sybil metabolic model.
1. with getReference: Reference reactions list is retrieved from KEGG database.
2. with blockedReactions: Check if there is any dead-end metabolite, the results serve as a guide to the user.
3. with getReactionsList: List of reactions is extracted from input metabolic model.
4. with additionCost: The addition cost for the reference reactions list can be calculated to do a manual check.
5. with gapFill: Find dead-end metabolites and fill the gaps with reactions from the reference list, which are below the addition cost treshold defined.
Loop user defined times (default = 5)
5.1. Searches dead-end reactants and products.
5.2. Calculates the additional cost of the reference reactions.
5.3. Filters reference reactions with a cost above the threshold.
5.4. Selects the filtered reactions that have any orphan reactant or product.
5.5. Fills the gaps in the model with the selected reactions.
Output: List of the added reactions with their additional costs

2 Installation and Functions

The g2f package is available for download and installation from the Comprehensive R Archive Networks (CRAN, Hornik (2012)). This package is compatible with R 3.6.0 and 4.0.0 versions. To get the latest stable version of g2f, install it directly from GitHub:

# Install 'devtools' R Package
R> install.packages('devtools')

# Install 'g2f' package
R> setRepositories(ind=1:2)
R> devtools::install_github('gibbslab/g2f')
R> library('g2f')

g2f includes 5 functions in order to identify gaps (metabolites not produced or not consumed in any reaction) and fill the gaps from a reference metabolic reconstruction. Briefly, the gap-filling reconstruction is based on the stoichiometric reaction matrix either from a specific model or by the complete set of gene-associated stoichiometric reactions for a specific organism from the KEGG database using a weighting function. Table 3 summarizes the functions contained in the g2f R package.

Table 3: Descriptions of **g2f** available functions.
Function	Description
blockedReactions	Identifies blocked reactions in a metabolic network.
additionCost	Calculates the cost of addition of a stoichiometric reaction.
getReactionsList	Extract the reaction list from a model.
getReference	Download all stoichiometric reactions from the KEGG database.
gapFill	Find and fill gaps in a metabolic network.

3 Downloading reference data from KEGG database

The KEGG database is a resource, widely used as a reference in genomics, metagenomics, metabolomics, and other studies. Moreover, KEGG has been used for modeling and simulation in systems biology, specifically in GEMs (Kanehisa 2006; Kanehisa et al. 2016; Martín-Jiménez et al. 2017). Currently, the database includes complete genomes, biological pathways, and the associated stoichiometric reactions for 542 eukaryotes, 5979 bacteria, and 334 archaea. The g2f’s getReference function downloads all the gene-associated KeggOrthology (KO) stoichiometric reactions from KEGG and their correspondent E.C. numbers for a customized organism, through the use of KEGG organism ID. Based on the KOs associated with the reactions, their respective gene-protein-reaction is constructed as follows: all genes associated with a given KO are linked by an AND operator. After that, when a reaction has more than one associated KO, previously linked genes are now joined by an OR operator. As an example, to download all the stoichiometric reactions (1492) associated with Escherichia coli, just type:

R> e.coli <- getReference(organism = "eco")

4 Identify blocked reactions

To identify the blocked reactions included in a metabolic model, the blockedReactions function sets each one of the reactions included in the model (one at the time) as the objective function and optimizes the system through Flux Balance Analysis (FBA). Reactions that are not participating in any possible solution during all evaluations are returned as a blocked reaction.

As an example, we identify the blocked reactions in the E. coli core metabolic model included in the sybil package (Gelius-Dietrich et al. 2013).

R> data("Ec_core")
R> blockedReactions(Ec_core)

|==============================================================| 100%
[1]  "EX_fru(e)" "EX_fum(e)"  "EX_mal_L(e)"  "FUMt2_2"  "MALt2_2"

5 Calculating the additional cost

Adding new reactions in order to fill gaps can be an easy path to increase the number of dead-end metabolites (Hosseini and Marashi 2017). Therefore, as a strategy to reduce the possible addition of new dead-end metabolites into the system, the additionCost function calculates the cost of adding new metabolites based on metabolites that constitute the new reaction and those that compose the stoichiometric reactions already present in the metabolic reconstruction (Equation (1)). Values of the function represent a weight ranging between 0 and 1.

\[\label{equation_1} \text{additionCost}=\frac{n(\text{metabolites(newReaction)})\notin(\text{metabolites(reactionList)})}{n(\text{metabolites(newReaction)})} \tag{1}\]

As an example, we select a sample of reactions from the downloaded reference for E. coli and calculate the additional cost for the remaining reactions (6 first values are shown).

R> reactionList <- sample(e.coli$reaction,10)
R> head(
    +  additionCost(reaction = e.coli$reaction,
    +  reference = reactionList)
    +  )
[1] 1.0000000 1.0000000 1.0000000 0.8000000 0.8333333 1.0000000

To understand the results of the additionCost, we present two examples for the glutamine synthetase reaction in the glutamate metabolism of E. coli core model.

[c]: ATP + Glu-L + Nh4 --> ADP + Gln-L + h + pi

The reaction takes as input Adenosine triphosphate (ATP), L-Glutamate (Glu-L), and Ammonium (Nh4) and produces Adenosine diphosphate (ADP), L-Glutamine (Gln-L), H+ (h), and inorganic Phosphate (pi) in the cytoplasm. We are going to assume that this reaction is going to be added to the model and that the number of metabolites to be added change between two conditions. In the first case, the reaction would be evaluated by additionCost, but one of the seven metabolites is not present on the list of reactions of the complete model. In the second situation, four of the seven metabolites are not present in the metabolite list of the model. By dividing the number of metabolites to be added by the total number of metabolites in the reaction, additionCost produces 0.14 and 0.57 as resulting values for the two conditions respectively. In this sense, if we pick a threshold of 0.2 for the gapfill the first case would allow the reaction to be added but not the second condition. By using a threshold of 0.2 is possible to set a medium point for the reaction addition. Where higher values are more permissive and lower values are more restrictive.

6 "Gap find and fill" performing, input and syntaxis

To identify network gaps in a metabolic model and fill them from a reference network, the gapFill function performs several steps: (1) The dead-end metabolites are identified from the stoichiometric matrix, (2) the candidate reactions are to be added by comparing the metabolites against the metabolite list of the model, (3) the additional cost of each candidate reaction is calculated, (4) the candidate reactions with an additional cost lower or equal to the user-defined limit are added to the reaction list. Finally, the process returns to step 1 until no more original-gaps can be filled under the user-defined limit. The function returns a set of candidate stoichiometric reactions to fill the original-gaps included in the metabolic network.

As an example, we show how to fill dead-end metabolites included in the previously selected sample using all downloaded stoichiometric reactions from the KEGG database for E. coli as the reference.

R> reactionsAdded <- gapFill(reactionList = reactionList,
          +      reference = e.coli$reaction,
          +      limit = 1/4
          +      )
48% gaps filled in the last iteration
26% gaps filled in the last iteration
13% gaps filled in the last iteration
13% gaps filled in the last iteration
4% gaps filled in the last iteration

R> head(reactionsAdded)
addCost                                                                       react
1    0.00      L-Glutamine + D-Fructose 6-phosphate <=> L-Glutamate + D-Glucosamine 
          6-phosphate
2    0.25                              ATP + Pyruvate <=> ADP + Phosphoenolpyruvate
3    0.00                                                       ATP + AMP <=> 2 ADP
4    0.25                                                 ATP + dTDP <=> ADP + dTTP
5    0.00  ATP + 5-Fluorouridine diphosphate <=> ADP + 5-Fluorouridine triphosphate
6    0.25                                                   ATP + UDP <=> ADP + UTP

The output is a data frame with the reactions that were found to fill the gaps in the model, with the corresponding additionCost calculated for each one.

7 Compatibility

In order to provide compatibility, g2f implements getReactionsList a function that helps to extract the reactions of a sybil model as a list of strings, each string being a reaction, which is the input format of gapFill accepts.

In the examples before, we used a reduced version for the reference organism of E.coli from KEGG. Now we will use a converted model to SBML using KEGG2SBML (Moutselos et al. 2009) from (Akiya Jouraku and Kitano 2008), which will be converted into sybil with the help of the sybilSBLM package, and then the reactions list will be extracted to use them with the gapFill function. Note that we have done this because the name of the reaction metabolites in the model should be the same as the ones used in KEGG, and the E.coli core metabolic model included in the sybil package does not meet this requirement.

# Install and import sybilSBML package
R> install.packages('sybilSBML')
R> library('sybilSBML')

# Read the SBML and convert it to sybil
R> mod <- readSBMLmod("eco/eco00730.xml", bndCond = FALSE)

# Extract the model's reactions
R> react <- getReactionsList(mod)

# Fill the gaps
R> reactionsAdded <- gapFill(reactionList = react$react,
          reference = e.coli$reaction,
          limit = 1/4
)

20% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
0% gaps filled in the last iteration
  addCost                              react
1       0            ATP + ADP <=> ADP + ATP
2       0 ATP + H2O <=> ADP + Orthophosphate

8 g2f performance

We tested the performance of g2f against the most used platforms for gap-filling in the metabolic networks using a computer with i7 8750h 2.2GHz processor and 12Gb DDR4 Ram. We compared the performance of R package g2f, Python cobrapy gapfill function, and Matlab COBRA fastgapfilling function (Table 4). The benchmark was performed for each gap-filling algorithm by deleting 10 random reactions across the E. coli core model (Orth et al. 2010).

Table 4: Performance of **g2f** compared with other gap-filling algorithms. The limit is associated with the threshold for the limit of gap-filling. TicToc was the methodological approach used to measure the performance time. The solution is the capacity of the model to run a FBA after the gap fill function was run. A single iteration of the gap-filling algorithm Cobrapy-"gapfill" was unable to generate a suitable FBA.
Platform	Limit	TicToc (sec)	Solution
R: g2f – "gapfill"	0.1	2.83	Feasible
	0.15	2.76
	0.2	2.73
	0.25	6.91
Python: Cobrapy – "gapfill"	-	1.369	Unfeasible
Matlab: COBRA – "fastgapfill" [Cplex solver]	0.1	7.858	Feasible
	0.15	8.836
	0.2	9.001
	0.25	5.695

Considering the computational performance and flux recovery across the network (FBA solution), g2f arises as a suitable method for Genome-scale metabolic network reconstructions gap filling using curated models as reference.

9 Application

A wide variety of open-source, paid software, and webtools have been developed to fill the gaps in automated or manual metabolic reconstructions (Prigent et al. 2017; Machado et al. 2018; Karp et al. 2018). Performing a gap-filling accurately is a challenging task considering the possibility of overestimating reaction addition or excluding metabolites from the filling by inquorate thresholds (Pan and Reed 2018). g2f offers an R based open-source alternative capable of integrating with systems biology packages such as sybil (Gelius-Dietrich et al. 2013) or minVal (Osorio et al. 2017) as well as big projects such as Recon3D (Brunk et al. 2018) or the Human Metabolic Atlas (Pornputtapong et al. 2015). Finally, considering that the majority of metabolic models are derived from annotated genomes where not all the enzymes are known, g2f offers the possibility to optimize the topology of public available metabolic models or automated metabolic reconstructions.

10 Conclusions

We developed g2f, a novel R package to, find dead-end metabolites in a genome-scale metabolic reconstruction and fill the reaction gaps with metabolites available in a stoichiometric matrix from a reference model. Additionally, g2f filters the candidate reactions using a weighting function and a user-defined limit. We depicted the functions included in the package using the E. coli reference model downloaded from the KEGG database, and the core metabolic model included in the sybil package. Finally, the performance of g2f was compared with other gap-filling algorithms (Cobrapy – gapfill and Matlab:COBRA – fastgapfill), showing an adequate feasibility and performance speed.

11 Summary

Dead-end metabolites are a major drawback in genome-scale metabolic reconstruction and analysis. Since there is a lack of available tools to solve this situation in the R environment, hereby, we introduce the g2f package to find and fill dead-end metabolites in a given reconstruction based on a reference template. Our method allows users to filter candidate reactions using a weighting function and a user-defined limit. We show step by step the functionality of each procedure included in the package using a reference model downloaded from the KEGG database for Escherichia coli and the core metabolic model included in the sybil package.

12 Acknowledgements

This work was supported by the Pontificia Universidad Javeriana, Bogotá, Colombia, and Minciencias IDs 7740, 8845, and 20304 to JG. We thank the anonymous reviewers and testers for their helpful comments and suggestions to improve the CRAN package.

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks

1 Introduction

2 Installation and Functions

3 Downloading reference data from KEGG database

4 Identify blocked reactions

5 Calculating the additional cost

6 "Gap find and fill" performing, input and syntaxis

7 Compatibility

8 g2f performance

9 Application

10 Conclusions

11 Summary

12 Acknowledgements

CRAN packages used

CRAN Task Views implied by cited packages

Note

References

Reuse

Citation