Engineering Biology

Data Science Goal:

Establish functional prediction through biological engineering design at the biomolecular, cellular, and consortium scale.

Current State-of-the-Art

ROSETTA, MOE, and NAMD are representative software platforms for biomolecular structure-based design and for the simulation of systems ranging from small molecules and peptides to proteins and larger systems. Google DeepMind’s recent success at CASP13¹ demonstrated that machine-learning approaches are also increasingly effective for biomolecular structure prediction, and it is anticipated that design and simulation will increasingly integrate physics- and structure-based modeling with statistical comparative- and screening-based data. Existing software tools are largely sufficient to design protein libraries for experimentally exploring molecular space, predict protein domains and other structural boundaries, and leverage comparative (meta)genomics to build deep sets of sequence orthologs for important protein classes and suggest tolerable or efficacious mutation locations. Current limitations of these tools include dependence on imperfect force fields, a lack of full quantitative and allosteric modeling and of parallel computation, and insufficient design-of-experiments support and structural coverage for statistical analyses. High-throughput screening combined with machine learning may well provide a data-driven route to identifying function from sequence without resorting to first-principles or ground-up approaches, but measuring molecular activity at scale remains a key bottleneck.
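As a rough illustration of how an ortholog alignment can suggest tolerable mutation locations, the sketch below ranks alignment columns by Shannon entropy. This is one simple heuristic and is not the method of any tool named above; the sequences in `ALIGNMENT` and the function names are hypothetical, and positions are 0-based.

```python
import math
from collections import Counter

# Hypothetical toy alignment of four orthologs (equal length, '-' marks a gap).
ALIGNMENT = [
    "MKTAYIAK",
    "MKSAYLAK",
    "MKTGYVAK",
    "MRTAYIAK",
]


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # ignore gaps
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def rank_positions(alignment):
    """Return (position, entropy) pairs, most variable column first."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    entropies = [column_entropy(col) for col in zip(*alignment)]
    return sorted(enumerate(entropies), key=lambda pair: pair[1], reverse=True)


ranking = rank_positions(ALIGNMENT)
for position, entropy in ranking:
    print(f"position {position}: {entropy:.3f} bits")
```

High-entropy columns vary across orthologs and are plausible candidates for library diversification, while zero-entropy columns are fully conserved in this sample. A real analysis would need far deeper alignments and would weigh structural and functional context as well.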

The design of organisms with a targeted metabolic function (e.g., overexpression of a single biomolecular species) requires computational tools that: 1) identify sets of proteins that can convert readily available molecules to high-value products, with each protein performing one of a series of chemical modifications; and 2) identify the best sets of enzymes, and their stoichiometry, that can work together as parts of pathways in the context of cellular metabolism. On the pathway level, genome-scale metabolic models link genotype to phenotype through the reconstruction of the complete metabolic reaction network of an organism. This technique can be used to define theoretical production limits and to design and test new microbial strains in silico. This approach has been especially effective for predicting and improving metabolite production rates in heterologous biosynthetic pathways. Flux Balance Analysis (FBA), Flux Variability Analysis (FVA), and minimization of metabolic adjustment (MOMA) have been used successfully, in combination with genome-scale metabolic models, to predict cell growth, flux distribution, and product synthesis, and to guide host design. A MATLAB toolbox called COBRA² (“COnstraint-Based Reconstruction and Analysis”) provides a convenient framework to simulate and analyze the phenotypic behavior of a genome-scale stoichiometric model³, and retrobiosynthesis tools such as BNICE (“Biochemical Network Integrated Computational Explorer”) and RetroPath are used to design new or improved biochemical pathways.⁴ These design tools identify novel metabolites, reactions, and whole pathways by predicting enzyme promiscuity from a classification of enzymes according to their chemical action.

On the cellular level, a wide variety of host design tools have been developed to identify gene targets for knockout, overexpression, or downregulation; to introduce non-native enzymatic reactions; and to eliminate competing pathways in order to improve cellular phenotypes.⁵ The pathway and host improvements obtained from these design tools are often non-intuitive. While genome-scale metabolic models have been important for metabolic engineering efforts with organic compounds, further advances are still required to transform the bioeconomy.
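To make the constraint-based approach concrete, here is a minimal sketch of Flux Balance Analysis on a toy network, including an in silico reaction knockout. It uses SciPy's general-purpose `linprog` solver rather than the COBRA Toolbox, and the three-metabolite network, reaction names, and flux bounds are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (not taken from any published model).
# Columns follow REACTIONS; rows are the internal metabolites A, B, C.
REACTIONS = ["uptake", "A_to_B", "A_to_C", "export_product", "export_byproduct"]
S = np.array([
    [1.0, -1.0, -1.0,  0.0,  0.0],  # A: made by uptake, used by both branches
    [0.0,  1.0,  0.0, -1.0,  0.0],  # B: made from A, exported as product
    [0.0,  0.0,  1.0,  0.0, -1.0],  # C: made from A, exported as byproduct
])
# Assumed flux bounds (arbitrary units); uptake is the limiting resource.
DEFAULT_BOUNDS = {
    "uptake": (0.0, 10.0),
    "A_to_B": (0.0, 1000.0),
    "A_to_C": (0.0, 1000.0),
    "export_product": (0.0, 1000.0),
    "export_byproduct": (0.0, 1000.0),
}


def fba(objective, knockouts=()):
    """Maximize one reaction flux subject to steady state (S v = 0) and bounds."""
    bounds = dict(DEFAULT_BOUNDS)
    for reaction in knockouts:
        bounds[reaction] = (0.0, 0.0)  # a knockout forces zero flux
    target = REACTIONS.index(objective)
    c = np.zeros(len(REACTIONS))
    c[target] = -1.0  # linprog minimizes, so negate to maximize
    result = linprog(
        c,
        A_eq=S,
        b_eq=np.zeros(S.shape[0]),
        bounds=[bounds[r] for r in REACTIONS],
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.x[target]), dict(zip(REACTIONS, result.x))


wild_type, fluxes = fba("export_product")
knockout, _ = fba("export_product", knockouts=["A_to_B"])
print("max product flux (wild type):", wild_type)
print("max product flux (A_to_B knockout):", knockout)
```

In this toy network the uptake bound caps product export at 10 units, which plays the role of a theoretical production limit, and removing the only route from A to B drops the limit to zero. Genome-scale models apply the same linear program to thousands of reactions.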

When it comes to community and consortia design, we are primarily in a state of data gathering and of developing a baseline understanding of microbial communities across diverse locations and ecosystems; tools for multi-scale modeling at the multicellular, organismal, and population levels have therefore yet to be developed.

Breakthrough Capabilities & Milestones

Fully-automated molecular design from integrated, large-scale design data frameworks.

Structure- and comparative analysis-based libraries for automated directed evolution, with feedback of large-scale results to algorithms.

Automated designs for integrated manufacturing to enable more successful, iterated workflows.

Large-scale design data generation to inform next-generation algorithms for molecular design.

Use of large-scale design data in integrated frameworks.

Design and integration of thousands of critical catalytic activities into proteins for a set of model hosts and creation of standard tools for allosteric control of these activities.

Use of enzyme promiscuity prediction algorithms to design biosynthetic pathways for any molecule (natural or non-natural).

Retro-biosynthesis software that can identify any biological or biochemical route to any organic molecule.

Bottleneck/Challenge: There are a nearly infinite number of chemicals that we want to produce using engineered hosts; however, the routes (biological-only or a combination of biological and chemical) to these chemicals are not always known or easy to imagine.

Potential Solution: Develop retrobiosynthesis software for all known metabolic pathways in all life forms and integrate that software with retrosynthesis software of all chemical catalysis to develop pathways to nearly any organic chemical.
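The kind of search such software would perform can be sketched as a backward breadth-first search from the target molecule to a metabolite the host already makes. The rule table, molecule names, and catalyst labels below are hypothetical placeholders, and each step is simplified to a single substrate; real tools work with generalized reaction rules over chemical structures.

```python
from collections import deque

# Hypothetical rule table: product -> list of (catalyst label, substrate).
# "bio:" marks an enzymatic step, "chem:" a chemical-catalysis step.
RULES = {
    "target": [
        ("bio:oxidoreductase", "intermediate_2"),
        ("chem:hydrogenation", "intermediate_3"),
    ],
    "intermediate_2": [("bio:transaminase", "intermediate_1")],
    "intermediate_3": [("bio:acyltransferase", "intermediate_1")],
    "intermediate_1": [("bio:decarboxylase", "pyruvate")],
}
# Assumed set of metabolites natively available in the host.
HOST_METABOLITES = {"pyruvate", "acetyl-CoA"}


def find_route(target, natives, rules, max_steps=6):
    """Return the shortest route as (substrate, catalyst, product) steps, or None."""
    queue = deque([(target, [])])
    seen = {target}
    while queue:
        molecule, route = queue.popleft()
        if molecule in natives:
            return route  # already ordered from host metabolite to target
        if len(route) >= max_steps:
            continue
        for catalyst, substrate in rules.get(molecule, []):
            if substrate not in seen:
                seen.add(substrate)
                # Prepend so the finished route reads in the forward direction.
                queue.append((substrate, [(substrate, catalyst, molecule)] + route))
    return None


route = find_route("target", HOST_METABOLITES, RULES)
for substrate, catalyst, product in route:
    print(f"{substrate} --[{catalyst}]--> {product}")
```

Mixing `bio:` and `chem:` rules in one table is a simple way to represent the combined biological and chemical routes described above. A practical implementation would also need to score routes by yield, enzyme availability, and host compatibility rather than by step count alone.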

Data integration for certain classes of enzymes and pathways and predictable host-specific expression in model organisms.

Integrated data that allows on-demand characterization, standardization, insertion, and deployment of natural and non-natural pathways.

Scalable, data-driven host design for complex environments that enable high-level production of natural biomolecules.

Ability to make and screen multiple host mutations for epistasis mapping and synthetic interactions, making large-scale host optimization possible.

Better data on physiology and fitness in deployment environments suitable for informing designs in validated lab-scale simulations that meet activity, persistence, and ecological impact goals.

Thematic design rules for host system engineering inferred from data.

Tools to acquire and transfer data to a novel host to inform both genetic-domestication and prediction and determination of function.

Novel design tools to support host design for more complex, natural (non-laboratory) environments.

Data-driven domestication of any new host for new activities in any environment and scale.

Enabled design of functional, self-supporting ecosystems.

Data-driven tools for selecting organisms for synthetic assemblies to achieve resistant, resilient activity.

Direct data collection for the most important communities in human, agricultural, and complex bioreactor settings, sufficient for informing design.

Modeling tools to identify cross-organismal networks and ecological interactions.

Integration of molecular, pathway, and host design to create and build models of genetically-engineered communities that function predictably, in the context of deployment ecology.

Bottleneck/Challenge: Inability to infer or determine cellular- and sub-cellular-level mechanistic modes due to computational complexity.

Potential Solution: Develop more comprehensive modeling algorithms that specifically take advantage of domain-specific knowledge, algorithmic advances that leverage parallelization, and hardware advances such as specialized electronic circuits.

Ability to design and build functional, enclosed, self-supporting ecosystems of multiple engineered microbial species for efficient industrial production.

Ability to design, model, and engineer microbial consortia to simultaneously and efficiently produce multiple products of interest with minimal by-products and waste.

Footnotes

¹ AlQuraishi, M. (2019). AlphaFold at CASP13. Bioinformatics.
² Heirendt, L., Arreckx, S., Pfau, T., Mendoza, S. N., Richelle, A., Heinken, A., … Fleming, R. M. T. (2019). Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nature Protocols, 14(3), 639–702.
³ Schellenberger, J., Lewis, N. E., & Palsson, B. Ø. (2011). Elimination of thermodynamically infeasible loops in steady-state metabolic models. Biophysical Journal, 100(3), 544–553.
⁴ Medema, M. H., van Raaphorst, R., Takano, E., & Breitling, R. (2012). Computational tools for the synthetic design of biochemical pathways. Nature Reviews Microbiology, 10(3), 191–202.
⁵ Long, M. R., Ong, W. K., & Reed, J. L. (2015). Computational methods in metabolic engineering for strain design. Current Opinion in Biotechnology, 34, 135–141.