Setting up PAMs using PAModelpy

This guide provides step-by-step instructions on how to use the provided Python scripts to build Protein Allocation Models (PAMs) from a genome-scale model and parameter datasets.

Overview

The pam_generation.py script allows users to set up a PAM based on a genome-scale metabolic model and an enzyme database. It integrates genome-protein-reaction (GPR) rules with enzyme kinetics to partition cellular proteins into functional sectors.

The accompanying test_pam_generation.py provides unit tests to validate the setup process.

Prerequisites

Required python libraries

Ensure you have PAModelpy installed on Python >=3.9 and <=3.11

Install these dependencies via pip:

pip install cobra PAModelpy

Note that the package has been tested with the Gurobi solver. In order for Gurobi to work properly, please install gurobipy with a version matching your license.

For example for the version used for the development of PAModelpy:

pip install gurobipy==9.5.2

Input files

Genome-scale model: the path to metabolic model in SBML format (e.g., iML1515.xml) or a cobra.Model instance (e.g., build from a json file like e_coli_core.json).
Parameter Excel file: Contains data on enzymatic properties. The file should have at least the following sheets:
- ActiveEnzymes: Enzyme-specific parameters such as reaction IDs, kcat values, and molar masses.
- Translational (optional): Parameters related to the translational protein fraction.
- UnusedEnzyme (optional): Parameters for the unused enzyme sector.

Dataset requirements

Excel file structure

ActiveEnzymes sheet

Column	Description
`rxn_id`	Reaction ID from the genome-scale model.
`enzyme_id`	Unique enzyme identifier*.
`gene`	List of genes associated with the enzyme.
`GPR`	Gene-protein-reaction association (logical expression).
`molMass`	Molar mass of the enzyme (kDa).
`kcat_values`	Turnover number (s⁻¹).
`direction`	Reaction direction (`f` for forward, `b` for backward).

A unique enzyme identifier is defined as a single identifier per catalytically active unit. This means that an enzyme complex has a single identifier.
Enzyme-complex identifiers can be parsed from peptide identifiers and gene-to-protein mapping using the merge_enzyme_complexes function.

Translational sheet

Parameter	Description
`id_list`	Identifier related to protein fraction associated with translational proteins
`tps_0`	Translational protein fraction at zero growth rate. [g_protein/g_CDW]
`tps_mu`	Change in translational protein fraction per unit change of the associated reaction.
`mol_mass`	Molar mass of the translational enzymes. [kDa]

UnusedEnzyme sheet

Parameter	Description
`id_list`	Identifier related to protein fraction associated with the unused enzyme sector
`ups_0`	Unused enzyme fraction at zero growth rate. [g_protein/g_CDW]
`ups_mu`	Change in unused enzyme fraction per unit change of the associated reaction.
`mol_mass`	Molar mass of unused enzymes. [kDa]

Building a PAM

Steps to create a PAM

Prepare the genome-scale model and parameter file.
Optional: change defaults in the Config object to match your model identifiers
Use the set_up_pam function to initialize the model.
Optimize the PAM

Example usage for E. coli

The model defaults are set to generate a PAM for the iML1515 model of Escherichia coli K-12.

from PAModelpy.utils.pam_generation import set_up_pam

#1. Define input paths
model_path = "Models/iML1515.xml"
param_file = "Data/proteinAllocationModel_iML1515_EnzymaticData_new.xlsx"
#2. Config is not required for iML1515
#3. Build the PAM
pam = set_up_pam(pam_info_file=param_file,
                 model=model_path,
                 total_protein=0.258,  # Optional: Total protein concentration (g_prot/g_cdw)
                 active_enzymes=True,
                 translational_enzymes=True,
                 unused_enzymes=True,
                 sensitivity=True,
                 adjust_reaction_ids=False)
#4. Optimize to find the max growth rate at a substrate uptake rate of -10 mmol/gcdw/h
pam.optimize()
print(f"Objective Value: {pam.objective.value}")

Example usage for other microorganisms

The model defaults have to be adapted to the identifiers for another microorganism. In case of a model in the BiGG namespace, only the biomass reaction has to be adapted (which is the case for e.g. iJN1463, the genome-scale model for Pseudomonas putida KT2440). For some other models, the entire namespace has to be adapted (e.g. for the Yeast9 model of Saccharomyces cerevisiae). This can be accomplished using a Config object.

Example for BiGG namespace: P. putida iJN1463

Please note that this is only an example, both the model, as the parameter file do not exist in this repository.

from PAModelpy.utils.pam_generation import set_up_pam
from PAModelpy import Config

#1. Define input paths
model_path = "Models/iJN1463.xml"
param_file = "Data/proteinAllocationModel_iJN1463_EnzymaticData.xlsx"

#2. Change biomass reaction id in config
config = Config()
config.BIOMASS_REACTION = 'BIOMASS_KT2440_WT3'

#3. Build the PAM
pam = set_up_pam(pam_info_file=param_file,
                 model=model_path,
                 config=config,
                 total_protein=0.258,  # Optional: Total protein concentration (g_prot/g_cdw)
                 active_enzymes=True,
                 translational_enzymes=True,
                 unused_enzymes=True,
                 sensitivity=True,
                 adjust_reaction_ids=False)
#4. Optimize to find the max growth rate at a substrate uptake rate of -10 mmol/gcdw/h
pam.optimize()
print(f"Objective Value: {pam.objective.value}")

Example for non-BiGG namespace: S. cerevisia Yeast9

Please note that this is only an example, both the model, as the parameter file do not exist in this repository.

from PAModelpy.utils.pam_generation import set_up_pam
from PAModelpy import Config

#1. Define input paths
model_path = "Models/Yeast9.xml"
param_file = "Data/proteinAllocationModel_yeast9_EnzymaticData.xlsx"

#2. Change all the reaction ids in config and the protein regex
config = Config()
config.TOTAL_PROTEIN_CONSTRAINT_ID = "TotalProteinConstraint"
config.P_TOT_DEFAULT = 0.388  # g_protein/g_cdw
config.CO2_EXHANGE_RXNID = "r_1672"
config.GLUCOSE_EXCHANGE_RXNID = "r_1714"
config.BIOMASS_REACTION = "r_2111"
config.OXYGEN_UPTAKE_RXNID = "r_1992"
config.ACETATE_EXCRETION_RXNID = "r_1634"
config.PHYS_RXN_IDS = [
config.BIOMASS_REACTION,
config.GLUCOSE_EXCHANGE_RXNID,
config.ACETATE_EXCRETION_RXNID,
config.CO2_EXHANGE_RXNID,
config.OXYGEN_UPTAKE_RXNID]
config.ENZYME_ID_REGEX = r'(Y[A-P][LR][0-9]{3}[CW])'

#3. Build the PAM
pam = set_up_pam(pam_info_file=param_file,
                 model=model_path,
                 config=config,
                 total_protein=0.258,  # Optional: Total protein concentration (g_prot/g_cdw)
                 active_enzymes=True,
                 translational_enzymes=True,
                 unused_enzymes=True,
                 sensitivity=True,
                 adjust_reaction_ids=False)
#4. Optimize to find the max growth rate at a substrate uptake rate of -10 mmol/gcdw/h
pam.optimize()
print(f"Objective Value: {pam.objective.value}")

A short note on enzyme identifiers

The Config object has an entry which enables the framework to find and extract enzyme identifiers from the PAModel:

Config.ENZYME_ID_REGEX = r'(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})'

This is the default regular expression to find UniProt identifiers, as derived from the UniProt database (obtained 2024-08-07). Two other default placeholder identifiers (E1, E10, Enzyme_GLC_D, etx) are always included in any regex search with the PAModel:

default_enzyme_regex = r'E[0-9][0-9]*|Enzyme_*'

In case you would like to use other placeholders or another form of protein identifiers, please adapt the Config.ENZYME_ID_REGEX attribute with the proper regular expression.

Testing the setup

To verify the correctness of the PAM generation process, run the tests provided in tests/unit_tests/test_utils/test_pam_generation.py:

python -m pytest tests/unit_tests/test_utils/test_pam_generation.py.py

Additional features

Increasing kcat values

The script includes a utility to scale up kcat values in the parameter file:

from PAModelpy.utils.pam_generation import increase_kcats_in_parameter_file

increase_kcats_in_parameter_file(
    kcat_increase_factor=2,
    pam_info_file_path_ori="Data/old_param_file.xlsx",
    pam_info_file_path_out="Data/new_param_file.xlsx"
)

Generate enzyme-complex identifiers from peptide ids (e.g. uniprot annotation)

This helps with setting up a novel parameter file using identifiers obtained from uniprot and a gene-to-protein mapping

from PAModelpy.utils.pam_generation import get_protein_gene_mapping, merge_enzyme_complexes
from cobra.io import read_sbml_model

model = read_sbml_model('Models/iML1515.xml')
enzyme_db = "Data/old_param_file.xlsx"

protein2gene, gene2protein = get_protein_gene_mapping(enzyme_db, model)
# Ensure the enzyme complexes are merged on one row
eco_enzymes_mapped = merge_enzyme_complexes(enzyme_db, gene2protein)

Saving the PAM

Saving the PAM is not often required, as all the parameters are stored in Excel format and can thus be easily modified. COBRApy’s tools for saving models DO NOT WORK for saving PAMs: in fact the generated models won’t be executable. In case you do want to store a PAM for later use, you can store it using pickle, provided the underlying metabolic model is picklable..

import pickle

pickle.dump(pam, "path/to/model.pickle")

Does pickling not work? Check if one of the reactions or metabolites has LP standard language (such as ‘St’) as id.

Flux Variability Analysis for Enzymes

Similar to the FVA from COBRApy, PAModelpy has implemented a FVA. You can provide a object type, which will be varied. In this manner, you can perform FVA on proteins. Please note that the enzyme variables in the model are corrected for the feasibility tolerance with a factor of 1e6 and that the computed units is in mmol/gCDW. To get interpretable units, unit corrections are required.

The result is a dataframe with the minimum and maximum value of the variable which was studied.

from PAModelpy import Enzyme
from PAModelpy.flux_analysis import flux_variability_analysis

other_variables_to_analyze = [pam.reactions.get_by_id(rid) for rid in ['TPI', 'ACONTa', 'EX_ac_e']]

fva_results = flux_variability_analysis(pam,
                                        variable_type=Enzyme,
                                        variable_list = other_variables_to_analyze #optional, provide a list with other variables to perform fva on which are not enzymes 
                                        )

Recombinant protein overexpression

Example for eGFP overexpression.

import os
from PAModelpy import Enzyme
from PAModelpy.utils.recombinant_protein_expression import (read_sequence_from_file,
                                                            add_recombinant_intracellular_protein_to_pam)

aa_seq = read_sequence_from_file(os.path.join('Data', 'eGFP_protein_sequence.txt'))
gfp_enzyme = Enzyme('eGFP', {},
                        molmass=2.8*1e4)
protein_production_reaction = add_recombinant_intracellular_protein_to_pam(pam, protein=gfp_enzyme, aa_seq=aa_seq)

You can also determine protein export. In this case, ust adding and exporting a protein is not enough, as the exported protein requires ribosomes. In the normal PAM, the translational sector accounts for all ribosomes to make proteins for enzymes inside the cell. If you want to generate more proteins you need to allocate ribosomes specifically to make the exported protein. To determine the amount of ribosomes needed to make one mmol/gCDW of exported recombinant protein, a reverence growth rate is required.

import os
from PAModelpy import Enzyme
from PAModelpy.utils.recombinant_protein_expression import (read_sequence_from_file,
                                                            add_recombinant_protein_production_and_export,)

protein_production_reaction = add_recombinant_protein_production_and_export(aa_txt_file=os.path.join('Data', 'eGFP_protein_sequence.txt'),
                                                                            pam=pam,
                                                                            protein_name='eGFP',
                                                                            molecular_weight=2.8*1e4,
                                                                            reference_growth_rate=0.5)

Troubleshooting

Issue: Missing reaction in enzyme database
Solution: Ensure all reactions in the model are represented in the parameter file or adjust reaction_ids using adjust_reaction_ids=True.
Issue: Objective value is zero after optimization
Solution: Check the input parameter file for consistency and ensure that all reactions are correctly annotated.
Issue: The PAM can be pickled, but not unpickled Solution: Most likely one of the reaction in the metabolic models has a name like ‘St’. Renaming this to something which is not part of the standard LP language will solve the issue.