Introduction and Impact
At the molecular level, the functional richness, complexity, and diversity of biology can be localized predominantly to large “macro”-molecules (nucleic acids and proteins) and secondary metabolites. Indeed, evolution has produced and leveraged biomolecules and their assemblies to achieve extraordinarily sophisticated natural functions far surpassing our current engineering capabilities. If researchers are able to efficiently design, generate, synthesize, assemble, and regulate biomolecules in ways that rival the functional complexity of natural counterparts, but with user-defined functions, then all areas of bioengineering and synthetic biology should benefit.
The challenge of crafting biomolecules, pathways, and circuits that carry out user-defined functions has historically been an exercise in building out from what exists in nature to what doesn’t. Certainly, this mode of bioengineering will be important going forward and will see transformations as our knowledge of, and ability to harvest, what exists in nature increases. Likewise, this mode of bioengineering will advance as our ability to take natural components and bring them to new functions improves, both in the ambitiousness of the functions we can reach (that is, how different they are from natural functions) and the scale at which we can reach them. Under this framework, we outline a number of transformative tools, technologies, and goals centered on parts prospecting, high-throughput measurement, and computational and evolutionary design approaches, to both better understand how natural parts work and rapidly improve upon them to reach user-defined functions. We should also keep in mind that as synthetic biology advances, what exists in nature may no longer be the only framework from which we can extract starting points for building out. Indeed, fundamentally new biological components of our own creation, for example ones containing fully unnatural chemical building blocks, might introduce entirely new categories of “what exists” to biology, and so we must develop tools to use and design from those categories. Therefore, we also define a number of transformative tools, technologies, and goals that will allow us to exploit these truly new categories of biological matter.
The roadmap for Biomolecule, Pathway, and Circuit Engineering addresses the engineering of individual biomolecules to have expanded or new functions and the combination of biomolecular parts into macromolecular assemblies, pathways, and circuits that carry out a larger function, both in vivo, in cell culture systems, and in vitro, in cell-free and/or purified settings. The roadmap operates from the definition that 1) biomolecules are made by natural or engineered biological systems; 2) biomolecules are made from natural simple building blocks or engineered variants of those building blocks; and 3) the production of biomolecules can predominantly be genetically encoded. The roadmap uses the broad definition that macromolecular assemblies operate as complexes of physically-interacting individual biomolecules, that pathways are combinations of biomolecules that achieve a coordinated function, and that circuits are combinations of biomolecules that achieve regulatory control or dynamic information processing. Under these definitions, typical biomolecules include natural and engineered variants of existing macromolecules (e.g., DNA, RNA, proteins, lipids, and carbohydrates), as well as new biopolymers containing unnatural nucleotides and amino acids; typical macromolecular assemblies include self-assembling protein nanostructures or nucleoprotein complexes; typical pathways include collections of natural or engineered enzymes that produce desired secondary metabolites; and typical circuits include natural or engineered regulatory modules that control gene expression in a dynamical fashion. 
We note that the boundaries between biomolecular engineering and host engineering (see Host and Consortia Engineering) can easily blur, but offer the practical and subjective classification guideline that this section treats bioengineering problems where the key innovations can be localized to manipulating and understanding individual molecules and their assemblies in contrast to manipulating and understanding the dynamics of large networks of molecules.
Transformative Tools and Technologies
Computational macromolecular design
Computational design of biomolecules with specific functions is a major area of research in synthetic biology. Advances in this area should eventually result in the on-demand generation of any specific molecular function, including catalysis and intermolecular interactions at the heart of biomolecular, pathway, and circuit engineering. Within computational design, protein, DNA, and RNA engineering have advanced the furthest, so we discuss computational design challenges through the lens of these particular macromolecules with the understanding that similar advances can be made for all biomolecules.
Protein design
Computational protein design is a discipline aimed at identifying specific sequences that adopt desired three-dimensional shapes or functions, ideally exploiting the speed and low cost of in silico computation to do so. In contrast to experimental methods such as directed evolution (laboratory evolution), computational biomolecular design aspires to identify likely functional molecules, and eliminate likely non-functional ones, “virtually”, without producing and directly testing them. Computational biomolecular design has advanced to the point where defined structures and binding interactions can be constructed, but improvements are needed in expanding 1) the range and effectiveness of protein functions that can be designed, and 2) the success rate.
A critical aspect of designing functional proteins is the ability to accurately predict structure from sequence, which remains especially challenging for large proteins (>125 amino acids), beta-sheet topologies, long-range contacts, and membrane proteins. Closely homologous proteins in nature have a backbone RMSD (root-mean-square deviation) <3 angstroms, so an RMSD of <3 angstroms between a computationally predicted structure (whether folding an existing sequence or designing a new sequence) and its actual structure solved through X-ray crystallography is a biologically justified metric for success. There are already several cases where computationally predicted structures achieve atomic-level accuracy better than 2.5 angstroms, but regularly achieving such accuracy, especially for large proteins and across a diversity of structural features, remains a critical challenge. Furthermore, the most successful computational platforms still rely on homology to existing proteins at various levels of resolution. Even at single-residue resolution, there is still reliance on existing protein structures – for example, conformational rotamers in a leading protein design platform, Rosetta, are partly scored based on their frequency in the PDB. Therefore, the types of proteins that can currently be designed are still close to natural proteins. Moving farther and farther away from natural structures should result in both a better understanding of protein biophysics and new scaffolds specialized for new applications.
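To make the <3 angstrom success metric concrete, backbone RMSD is the root-mean-square of the distances between corresponding backbone atoms after the two structures have been optimally superposed. A minimal sketch in Python, using hypothetical, already-superposed Cα coordinates (the superposition step itself, e.g. the Kabsch algorithm, is omitted):

```python
import math

def backbone_rmsd(pred, ref):
    """RMSD between two pre-superposed lists of (x, y, z) coordinates."""
    assert len(pred) == len(ref)
    sq = sum((px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
             for (px, py, pz), (rx, ry, rz) in zip(pred, ref))
    return math.sqrt(sq / len(pred))

# Hypothetical pre-aligned C-alpha coordinates, in angstroms.
pred = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.5, 0.0)]
ref  = [(0.1, 0.0, 0.0), (3.7, 0.2, 0.0), (7.5, 0.4, 0.1)]
rmsd = backbone_rmsd(pred, ref)
print(round(rmsd, 3))  # → 0.173, well within the <3 angstrom success threshold
```

In practice the superposition and the choice of which atoms count as "backbone" matter; the point here is only the metric itself.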
In terms of the types of protein functions that can be effectively designed, enzyme activity presents a major current challenge. One reason for this is that enzymes may rely on intricate molecular dynamics for catalysis that are difficult to capture in current design platforms; currently only single-residue conformational dynamics have been engineered by computational design. Better addressing the challenge of enzyme design would enable broad advances in synthetic biology. For example, enzymes are at the heart of metabolic pathway engineering goals. Quantifiable metrics in enzyme design can be based on improvements in catalytic efficiency (kcat/KM). The diffusion limit for enzymes is ~10⁹ M⁻¹s⁻¹ and natural enzymes average a kcat/KM of approximately 10⁵ M⁻¹s⁻¹, but computationally designed enzymes have kcat/KM values that are usually around three orders of magnitude lower than those of natural enzymes. A major goal of computational enzyme design should be to routinely achieve the kcat/KM values of natural enzymes for artificial (user-defined) reactions.
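The size of this gap is easy to quantify from the figures quoted above (the ~10⁹ M⁻¹s⁻¹ diffusion limit, the ~10⁵ M⁻¹s⁻¹ natural average, and designed enzymes roughly three orders of magnitude below natural); a back-of-the-envelope sketch:

```python
import math

# Approximate catalytic efficiencies (kcat/KM, in M^-1 s^-1) from the text.
diffusion_limit = 1e9                # physical upper bound on kcat/KM
natural_avg     = 1e5                # average natural enzyme
designed        = natural_avg / 1e3  # designed enzymes: ~3 orders below natural

# Orders of magnitude a designed enzyme must gain to match natural enzymes,
# and the further headroom from a natural enzyme up to the diffusion limit.
gap_to_natural = math.log10(natural_avg / designed)
headroom       = math.log10(diffusion_limit / natural_avg)
print(gap_to_natural, headroom)  # → 3.0 4.0
```

That is, closing the design-to-natural gap alone means a thousand-fold improvement in kcat/KM, before even approaching the diffusion limit.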
The success rate of protein structure prediction and computational protein design is still low, which significantly limits broad adoption of protein design by the biomolecular engineering community. Because of inaccuracies in the molecular mechanics force-fields underlying protein design, highly trained experts are often needed to curate computational designs and select the ones that will be tested experimentally. The scant availability of such experts limits the broad deployment of protein design within the industrial and academic communities. Leveraging high-throughput experimental screening of large numbers of computational designs is one way to alleviate this limitation, but it results in very high costs for design projects, which in turn restricts the application of design to the most well-funded academic and industrial institutions. For example, the typical success rate for enzyme design is in the low percent range; a success rate greater than 50% would therefore have a tremendous impact.
Achieving these goals will require progress on multiple aspects of design, which can be categorized as physics-based or knowledge-based. Physics-based design approaches will advance through improvement of the molecular mechanics force-fields, and knowledge-based design approaches will advance through the curation of very large datasets of positive and negative design outcomes to enable the further development of machine learning techniques that extract design models from data. A combination of physics- and knowledge-based advances may be required to maximize the success rate of computational protein design. Physics-based molecular dynamics simulations can incorporate protein dynamics that may be at the heart of certain protein functions, but the amount of computational power required makes it infeasible to perform dynamics simulations on all design candidates at all scales. The ability to incorporate coarse-grained or full-atom dynamics at the design stage would critically enable the design of enzymes with high catalytic activities without the need for laboratory evolution post-design.
Computational nucleic acid design
There has been considerable interest in designing nucleic acids (DNA, RNA) and nucleic acid machines to carry out custom functions (e.g., binding, sensing, catalysis, regulation) because nucleic acids are arguably uniquely programmable due to their reliance on base pairing for secondary structure, while still allowing a wide range of sophisticated structural elements through tertiary and non-canonical structures. Several breakthrough technologies have emerged recently based on RNA-protein complexes that rely on base-pair-guided interactions, including RNA silencing, CRISPR genome editing and gene activation and repression, and therapeutics that target pre-mRNA splicing. Nevertheless, compared to computational protein design, nucleic acid design, especially the design of RNAs at the non-canonical and tertiary structural levels, is underdeveloped. In addition, the design of complexes that mix RNA and proteins, or ‘nucleoproteins’, remains particularly underdeveloped. A major goal of the field should be to resolve this gap.
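This base-pairing programmability is what makes nucleic acid structure computationally tractable. The classic illustration is the Nussinov dynamic program, which maximizes the number of nested canonical (and wobble) base pairs; a minimal sketch, assuming a simple pair-counting objective rather than the thermodynamic scoring that real prediction and design tools use, and ignoring pseudoknots:

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov dynamic program).
    Pair-counting only: ignores stacking energies and pseudoknots."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}  # canonical Watson-Crick + G:U wobble
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                 # base j left unpaired
            for k in range(i, j - min_loop):    # try pairing k with j
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

# A hairpin-forming toy sequence: a G:C stem of 3 closing an AAAU loop.
print(nussinov_max_pairs("GGGAAAUCCC"))  # → 3
```

Production design tools replace the pair count with nearest-neighbor free energies and invert the problem (sequence given structure), but the same recursion structure underlies both.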
A number of computational approaches have been developed to predict secondary and tertiary structures from primary sequences, and more recently, increased interest in nucleic acid (DNA/RNA) nanotechnology has emerged given the potential of developing computational rules that can lead to nucleic acids that assemble into complex shapes. Most of the recent successes in computational nucleic acid design leverage our knowledge of secondary structure thermodynamics to design at the level of RNA secondary structure and canonical Watson-Crick base pairing interactions. In order to expand computational nucleic acid design, several major areas of improvement are needed, including: 1) incorporating RNA folding kinetics into design algorithms, both from the standpoint of designing efficient folding pathways and of designing structures that can dynamically change in response to local or global environmental changes and specific interactions with other molecules; 2) designing at the level of three-dimensional structure; 3) incorporating non-canonical interactions (e.g., Hoogsteen base pairing, nucleotide-backbone pairing) into design approaches; 4) incorporating the growing number of synthetic nucleotide chemistries (e.g., Hachimoji codes) within nucleic acid design; and 5) integrating frameworks for RNA and protein design to predict and design structure-function relations for nucleic acids alone and in the context of riboprotein complexes and hybrid structures.
Together, this will unlock the ability to design new and powerful target functionalities, including: RNA-ligand binding pockets for new ligands relevant to biosensor design; catalytic sites for improved RNA catalysis in ribozymes and within the ribosome for new functions such as bespoke gene editing tools and templated unnatural polymer biosynthesis, respectively; higher-order dynamic RNAs that can change global folding patterns in response to stimuli, relevant to single-molecule molecular logic; improved RNA-protein complexes for RNA-guided gene editors (e.g., CRISPR systems); new classes of RNA-protein nanomachines that can perform cell-like functions such as cargo sorting and transport; and RNAs that post-transcriptionally control gene expression in a targeted way by directly regulating the stability of entire clusters of mRNAs via designed RNA-RNA interactions (we note that this last capability can be particularly relevant to the engineering and optimization of metabolic pathways and complex phenotypes in a variety of hosts).
Evolutionary macromolecular engineering
Evolution is a powerful bioengineer, but the natural evolutionary process is slow. New directed evolution platforms for rapid optimization of nucleic acids, proteins, pathways, and circuits towards desired functions are needed. Metrics for effective directed evolution include: 1) fold-improvement over starting point function, as well as absolute-level of function that can be evolved; 2) the types of functions that can be evolved; and 3) scale (how many experiments can be run simultaneously). Although there are many applications of directed evolution for all types of biomolecules, there are two particularly demanding testbeds for directed evolution technologies discussed below: protein enzyme evolution and the specific binding of nucleic acids to small molecules or proteins.
Enzyme evolution
Only a small fraction of directed enzyme evolution experiments give increases in kcat/KM that bring activities within range of natural enzymes. Empirically, the most extensive directed evolution efforts yielding orders-of-magnitude improvement in kcat/KM have typically required about ten mutations, but classical directed evolution methods rarely traverse adaptive mutational pathways with more than five mutations. Similar metrics aiming for ten mutations guide the evolution of binding proteins, where this scale of mutation is necessary for achieving extremely high-affinity binders (picomolar) from weak binders (micromolar). Therefore, platforms for directed protein evolution that can routinely yield variants with ten or more adaptive mutations are desirable. We note that the problem of evolving a new protein-based biosensor, such as a transcription factor that binds a new ligand, is related to the KM problem in enzyme evolution. Thus, improved technologies for evolving enzymes will have great impact in other areas of biomolecular engineering.
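To see why ten adaptive mutations is a demanding target, consider the size of the search space: the number of variants of an L-residue protein at exactly k substitutions is C(L, k)·19^k. A quick calculation (the 300-residue length is an arbitrary illustrative choice):

```python
from math import comb

L = 300               # hypothetical protein length, in residues
AA_ALTERNATIVES = 19  # possible substitutions per site

def n_variants(k):
    """Number of sequences exactly k substitutions away from the parent."""
    return comb(L, k) * AA_ALTERNATIVES ** k

print(f"{n_variants(1):.2e}")   # → 5.70e+03: single mutants are trivially screenable
print(f"{n_variants(10):.2e}")  # ten mutants: astronomically beyond any library
```

Since even the largest in vitro libraries hold far fewer than this many variants, ten-mutation variants must be reached through iterated rounds of mutation and selection rather than exhaustive search.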
DNA/RNA aptamer evolution
DNA or RNA sequences that can bind to specific proteins or small-molecule ligands are commonly referred to as aptamers. While there was an initial push to evolve new aptamers via techniques such as SELEX, the field still faces challenges in terms of the chemical diversity of small molecules that can be targeted with aptamers, as well as in incorporating aptamers into functional RNA molecules such as biosensors or gene regulators. With regards to the challenge of diversity, aptamers have been largely limited to targets that are already known to bind RNA well (such as nucleotide analogs and other co-factors for which natural aptamers exist), or to compounds that can be easily immobilized on solid supports. In terms of aptamers in functional RNA molecules, there has been some recent progress in incorporating new aptamers into fluorescent RNA biosensors as well as into new classes of RNA regulators called riboswitches, but progress is still hampered by a lack of understanding of the ligand-mediated allosteric effects that alter RNA structure. Both of these challenges could be fruitfully addressed by new evolution methods that select for RNA binding interactions in the context of functional molecules – for example, selection methods that use the binding of free ligand (i.e., not bound to a solid support) to trigger a regulatory event (e.g., activation of transcription) that can be selected for. In addition, evolutionary methods that can support the use of non-natural nucleic acids could further enhance the diversity of chemical interactions and structural motifs available to offer new ligand-binding properties. We note that progress towards this goal would also impact the recent and growing interest in developing small-molecule drugs for RNA targets.
Development of platform technologies for evolutionary macromolecular engineering
Platforms for directed evolution that can traverse long mutational pathways are critical for crossing fitness valleys. Given the length of functional biopolymers (e.g., proteins), evolution (and any design strategy) is, and always will be, a highly limited search through sequence/fitness space. Given the ruggedness of fitness landscapes, this results in fixation of suboptimal sequences that represent local fitness maxima. Crossing fitness valleys to reach more-fit maxima is therefore a critical challenge in directed evolution. To cross these valleys, multi-mutation pathways, which can be elicited by fluctuating or changing selection conditions or by spatial structure, are critical. In addition, when directed evolution is used to improve multi-gene metabolic pathways, the number of beneficial mutations available and needed to achieve an optimal function increases compared to the evolution of single enzymes. As a result, directed evolution systems capable of traversing long mutational pathways are needed here, too. Developments in continuous evolution systems and automation should allow directed evolution to address such demanding biomolecular engineering goals by accessing long mutational pathways at scale.
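The fitness-valley problem can be illustrated with a toy landscape: a greedy walk that fixes only single beneficial mutations stalls at a local optimum even when a far fitter genotype lies a few coordinated mutations away. A minimal sketch (the four-locus landscape below is entirely hypothetical):

```python
# Toy 4-bit fitness landscape with a valley: the global optimum "1111"
# is reachable from "0000" only through lower-fitness intermediates.
def fitness(g):
    if g == "1111":
        return 10.0               # global optimum
    if g == "0000":
        return 5.0                # starting local optimum
    return 5.0 - g.count("1")     # intermediates form a fitness valley

def greedy_walk(start):
    """Accept only single mutations that strictly increase fitness."""
    g = start
    while True:
        neighbors = [g[:i] + ("1" if g[i] == "0" else "0") + g[i + 1:]
                     for i in range(len(g))]
        best = max(neighbors, key=fitness)
        if fitness(best) <= fitness(g):
            return g              # stuck: no uphill single mutation exists
        g = best

print(greedy_walk("0000"))  # → 0000: stalls, despite "1111" being far fitter
```

Strategies such as fluctuating selection stringency or transient neutral drift effectively flatten the valley, letting the walk accumulate the coordinated mutations that strict hill-climbing forbids.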
Evolution platforms require selection. Therefore, the types of functions that one can evolve biomolecules to achieve depend on the availability of high-throughput screens and genetic selections for those properties. Potential systems for growth-based genetic selection are abundant – as the propagation of biomolecular variants with desired properties can be linked to cell survival through synthetic genetic circuits – but the reliability of these selections and the difficulty of designing new ones vary widely. In vitro, screening throughput is lower than for in vivo selection, but there is more control over what is screened for through the precision of FACS and assays involving droplet sorting systems or microtiter plates. Therefore, on the one hand, there is a need for more numerous and more general in vivo selection systems, for example, custom transcriptional biosensors for arbitrary small molecules to select biomolecules (enzymes, RNAs) that produce desired products, two-hybrid systems, nucleic acid-protein interaction selection systems, display systems (e.g., ribosome display), and other binding-based selection systems to find custom affinity reagents or interaction inhibitors; as well as diversification methods (e.g., PACE, EvolvR, and OrthoRep) that operate in vivo to match the possible throughput of selection. On the other hand, there is a need for highly streamlined, high-throughput in vitro screening technologies that will likely utilize new technologies in microfluidics and DNA barcoding and sequencing. We note that for certain types of nucleic acid-based selections, such as SELEX, one can carry out in vitro selections with extremely large libraries (>10¹⁴ variants), which greatly increases the likelihood of recovering functional molecules.
Improving the throughput of other in vitro or in vivo screening technologies to enable more than the current 10⁶–10⁸ variants to be screened could significantly improve the ability to evolve molecules with properties that are difficult to access using SELEX.
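The impact of library size on the chance of recovering any functional variant can be sketched with a simple probabilistic model, P = 1 - (1 - p)^N, for N independently drawn variants with functional fraction p (the value of p below is purely hypothetical):

```python
import math

def p_recover(library_size, functional_fraction):
    """Probability a library contains at least one functional variant,
    assuming variants are drawn independently: 1 - (1 - p)^N."""
    # log1p/expm1 keep the computation stable for tiny p and huge N.
    return -math.expm1(library_size * math.log1p(-functional_fraction))

p = 1e-10  # hypothetical fraction of functional sequences
print(p_recover(1e14, p))  # SELEX-scale library: recovery essentially certain
print(p_recover(1e7, p))   # 10^7-variant screen: recovery unlikely (~0.001)
```

Under this toy model, the seven-orders-of-magnitude difference in library size is the difference between near-certain recovery and near-certain failure, which is why raising the throughput of non-SELEX screens matters.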
In terms of scale, classical methods usually limit one researcher to carry out no more than ten independent biomolecular evolution experiments in parallel, especially if multiple rounds of mutation and selection are needed. Evolution platforms that can exceed this scale are therefore desired. Continuous in vivo evolution platforms can achieve scale by requiring only serial passaging for selection, as serial passaging of cells is highly scalable. Continuous in vitro nucleic acid evolution platforms similarly scale. Although automated methods for selecting macromolecules, including RNA aptamers, were first developed more than 15 years ago, the relatively limited capabilities and high costs of those robotic platforms hampered widespread adoption. Looking ahead, laboratory automation may begin to more easily increase the scale of evolutionary macromolecular engineering undertakings as many steps such as PCR diversification, transformations, and selection have become much more readily automated.
Finally, we note that evolutionary macromolecular engineering and computational design are highly complementary. Not only can computational design provide starting points for evolutionary engineering of both proteins and nucleic acids (especially powerful since gene synthesis is scalable so many starting points can be exactly made), evolutionary methods – in particular highly scalable ones – can generate large sets of successful (and unsuccessful) outcomes to train computational algorithms. Indeed, machine-learning in protein engineering is a rapidly expanding area of research that we believe holds great promise and is highly synergistic with the large datasets that can be provided through continuous evolution experiments and high-throughput sequence-function mapping experiments.
Collection and curation of more biomolecular parts
While we have emphasized how one may design and evolve new biomolecular parts, nature already offers a rich collection of biomolecules. Proper prospecting and curation of parts from the rapidly growing number of genome sequences is a valuable strategy to complement design and evolutionary approaches. Even if design and evolutionary approaches rapidly advance, there will still be a need for good starting points for design and evolution to modify, and these starting points come from parts collections.
As parts collections expand, including through the addition of more and more synthetic variants, characterization and curation become crucial. Standardized methods for measuring the performance of particular parts are therefore essential. This is especially important for parts controlling gene expression, which form the basis of biological circuit design. Host specificity, environmental effects, modularity, and tunability of parts are all critical aspects of biological circuit design.
Unnatural nucleotide and amino acid polymerization systems
The construction of macromolecules that contain unnatural building blocks would be broadly useful for new therapeutics, materials, and biocontainment strategies. Systems for PCR amplification and transcription of fully unnatural nucleotide-containing genes of up to 400 base pairs are an aspirational but reasonable metric for the field to aim for. At this length, unnatural aptamer and aptazyme polymers could be regularly evolved and engineered. Systems capable of handling even longer sequences (1,000 base pairs) would be useful as new information polymers capable of encoding unnatural proteins and sustaining genetic codes based on new genetic alphabets. Expanded genetic code systems for translation of fully unnatural amino acid-containing proteins with more than 200 amino acids and/or proteins with at least four distinct unnatural amino acid building blocks would also be an aspirational, quantifiable goal for the field. This goal would open up new categories of research in biomaterial production and evolution and further motivate the expansion of genetic codes, a key area of synthetic biology with a wide range of applications from biomolecular engineering to biocontainment.