Engineering Biology

Data Science

Data Integration, Modeling, and Automation focuses on robust, systematic use of the design-build-test-learn (DBTL) methodology to create complex systems. Progress requires a purpose-built computational infrastructure that supports DBTL biological processes, the ability to predict design outcomes, and the ability to optimize manufacturing processes at scale.

Introduction and Impact

Applications of engineering biology have grown beyond chemical production to include the generation of biosensor organisms for the lab, animal, and field, modification of agricultural organisms for nutrition and pest/environmental resilience, production of organisms for bioremediation, and live cell and gene/viral therapies. The rapid expansion of the field has resulted in new tools and new approaches; however, we are still challenged by the need for novel and more robust computational tools and models for engineering biology. For example, improved models of synthetic systems and of their interaction with their host organisms will facilitate more successful engineering and broader application.

The foundation of a viable design and manufacturing process for, or using, engineering biology is automation, which requires a complete description of a biological system’s components, data to describe the system’s function and interconnections, and computational models to predict the impact of environmental parameters on the system’s behavior. For each stage and interface of the design-build-test-learn framework, we need to specify the new data and algorithms that drive experimental design, clarify the assay frameworks that allow computational diagnosis of outcomes, assure that metrology is high quality and comparable across sites, integrate frameworks that allow algorithmic prediction of process and performance improvements, and build interfaces to drive both automated and human-in-the-loop design improvements.

This information infrastructure for biological design is in a nascent state compared to engineering disciplines such as mechanical and electrical engineering, due to the recent emergence of the biomanufacturing field. A critical bottleneck is a lack of established “design rules,” core aspects of biological and biomolecular function that apply to diverse systems and applications. Furthermore, technologies for the utilization, manufacture, and deployment of biological systems are still under development. These roadblocks have hampered the development of standard computational frameworks to represent and store information about biological components, predict system behavior, and diagnose failures. Therefore, widespread automation remains out of reach.

Data Integration, Modeling, and Automation proposes a roadmap towards efficiently scaling engineering biology applications from the design, build, test, and learn cycle to the efficient and reproducible creation of individual biological components, to intracellular systems, multicellular systems, and their operation in diverse environments. This includes access to a standard information and modeling ecology to support biological design, manufacture, and quality control/diagnosis parallel to those that exist in chemical and other engineering disciplines, but which respect the core differences inherent in the biological substrate; standard and accessible frameworks that support the effective development and use of information on biological system and component function that are a necessary foundation for widespread biological design; models and tools for simulating the behavior of biological components and their interconnected systems in their diverse deployment environments that are necessary to support predictive design of these systems and diagnosis of their failures; and manufacturing process design and optimization tools with similarly attached information systems that are needed to ensure cost- and time- effective and scalable production of designed systems with minimal errors. All these systems should ideally be connected through findable, accessible, interoperable, and reusable (FAIR) data and process modeling efforts so that the community can benefit from their combined experience and work-products. Together, the protocols, metrology, and computational elements of the design-build-test-and-learn process can be continually improved.

Data Science Goals

Transformative Tools and Technologies

Integrated biological designs and data models

The foundation for design is knowledge of the components with which a design can be built and the environmental constraints under which the designed system will operate. While data can often be sparse for biological systems, there has been significant work in representing data about biomolecular function for both basic biology and engineering, including genome organization, gene regulatory-network function, metabolic pathways, and other aspects of biological function and phenotype. However, the specializations necessary to enable effective design across scales, from submolecular to mixed communities of cells in complex environments, are lacking.

The design of proteins and nucleic acids for desired functions has been a long-standing biotechnological goal. There has been great progress in computational design for gene expression control, molecularly responsive nucleic acid structures, and protein structures; however, the reliability of these tools is still relatively low and the functional classes accessible for design are limited compared to those required. The current status calls for renewed scaling efforts in biomolecular characterization so that data-driven methods of design can properly expand, new data-driven design algorithms and designs-of-experiments to predict such molecules, and better physics-based biomolecular design algorithms.

While design tools for metabolic engineering and gene regulatory network engineering have improved greatly over the last decade, they are still relatively limited to a small number of model organisms, a limited set of regulator families, and relatively well-characterized metabolic pathways. Current tools also have relatively primitive methods for incorporating multi-omic and other biological data to constrain their predictions, and tools for informative designs-of-experiments are lacking. Further, only recently have models of coupling to host resources and toxicity, issues of relative fitness and evolutionary robustness, and cross-organism pathway design been considered. The operation and design of mixed communities is in a primitive state.

There are almost no standardized computational approaches to ensure that the biological systems produced are measured sufficiently to prove effective and reliable function, to diagnose failures, and to predict what parameters or components must change to make the design models better match the observations and meet design goals. Integrated biological data models will be required to understand, predict and control the effect of engineering these systems at all levels and time scales.

Integration of -omics and machine learning for the design-build-test-learn (DBTL) cycle

Rapid advances in fields that leverage supervised machine learning have owed their success to the existence of massive amounts of annotated data. Data that will inform integrated biological data models will include measurements of circuit behavior in a cellular context, continuous measurements of transcriptome, proteome, and metabolome at the single-cell level, measurements that inform bioprocessing at scale, and measurements of the effect of engineered organisms on ecological scales.

Beyond the accumulation of data, theoretical impediments also prevent machine learning from accelerating the DBTL cycle. Suppose X is a set of multi-omics measurements, and Y is the yield of the desired bioproduct. By training on many multi-omics datasets and yields, a machine learning algorithm should be able to take a new multi-omics dataset X’ and predict the corresponding yield Y’. However, the critical question in the DBTL cycle is how to use measurements made in the current design to improve the design of the next iteration. That is, measurements X’ of the design are not being asked to predict the yield Y’ associated with that design. They are instead being asked to predict the yield Y* of a proposed design for which no data X* yet exists. Because the current generation of machine learning methods is powerless to address counterfactuals, new machine learning algorithms are needed that incorporate causal inference to identify interventions that would yield answers to the fundamental questions that drive the DBTL cycle.¹
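The gap between predicting Y’ from X’ and predicting the yield of a new design can be illustrated with a toy simulation. The sketch below is not taken from the cited work: the structural model, the hidden "host state" variable, and all coefficients are assumptions chosen only to show that a slope fitted to observed strains can differ sharply from the effect of directly changing the design.

```python
import random

TRUE_EFFECT = 0.5  # assumed causal effect of expression level on yield


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n, rng, intervene):
    """Draw n (expression, yield) pairs from a hypothetical structural model."""
    xs, ys = [], []
    for _ in range(n):
        host_state = rng.gauss(0.0, 1.0)  # unmeasured confounder (assumption)
        if intervene:
            # The designer sets expression directly, independent of host state.
            expression = rng.gauss(0.0, 1.0)
        else:
            # Observed strains: expression tracks the host state.
            expression = host_state + rng.gauss(0.0, 0.5)
        product_yield = (TRUE_EFFECT * expression
                         + 2.0 * host_state
                         + rng.gauss(0.0, 0.1))
        xs.append(expression)
        ys.append(product_yield)
    return xs, ys


rng = random.Random(0)
observational_slope = ols_slope(*simulate(20000, rng, intervene=False))
interventional_slope = ols_slope(*simulate(20000, rng, intervene=True))
print(f"slope fitted to observed strains: {observational_slope:.2f}")
print(f"slope under direct intervention:  {interventional_slope:.2f}")
```

In this toy model the observational fit overstates the benefit of raising expression because the hidden host state drives both variables. That is the kind of error a purely predictive model can make when asked about a design that has not yet been built.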

Existing multi-omics measurements can provide many features, and can collect observations on those features at a throughput high enough to fully exploit the DBTL cycle; however, several major interlinked challenges remain in data visualization, integration, mining, and modeling. Design libraries that exercise the design space are also needed. Multi-omics methods are useful, but they are generally applied to one design at a time, so creating libraries and scaling -omics measurements across them is a prerequisite for machine learning techniques to work. Further, molecular and cellular functions would ideally be characterized well enough that the design-of-experiments can be chosen to cover the most informative parametric space with the fewest manufactured variants. The challenges are therefore: 1) having sufficiently characterized components for effective design of experiments; 2) having sufficient information about the cellular function and environmental factors to constrain the machine learning models; 3) having sufficient high-quality measurement bandwidth for the design-of-experiment to work; and 4) using machine-learning models to select the next parameter sets to try.

While the variety of available software is enabling more standardized circuit design, there are fewer tools available for multi-omics data analyses, data interrogation, data mining and machine learning. However, such approaches have recently been validated: combining proteomics and metabolomics data with machine learning allowed the prediction of pathway dynamics that outperformed well-established existing methods.² Furthermore, two recent groundbreaking studies identified design principles for optimizing translation in Escherichia coli and the principal regulatory sequences of 5′ untranslated regions in yeast using machine learning approaches and large-scale measurements.³

BioCAD tools and design-of-experiment (DoE) approaches

In many other industries, the maturation of computer-aided design (CAD) systems has dramatically increased the productivity of the designer, improved the quality of the design, improved communications through documentation, and created shareable databases for manufacturing. To achieve the level of sophistication of design automation employed in industries such as automotive, shipbuilding, or aerospace, significant progress must be made in laying the foundation for computer-aided design for biology (BioCAD) software tools and data standards to support the DBTL cycle. For example, the Synthetic Biology Open Language or SBOL allows in silico DNA models for synthetic biology to be represented.⁴ Other examples of integrated BioCAD tools are Diva BioCad (https://agilebiofoundry.org/diva-biocad/) and the TeselaGen BioCAD/CAM platform⁵ (https://teselagen.com/), an ‘aspect-oriented’ BioCAD design and modelling framework, and Cello⁶ for gene circuit design automation. Many of these software tools are also currently being integrated into biological foundry automation suites, such as the Agile Biofoundry (https://agilebiofoundry.org/), in order to accelerate these processes. In addition, there is an increasing use of Design-of-Experiment (DoE) approaches for determining the most efficient experimental testing and measurement strategies (such as JMP statistical software from SAS). Such tools are distinct from, but complement, BioCAD tools.
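As a rough illustration of what a DoE approach trades off, the sketch below enumerates a full two-level factorial design and a standard half-fraction of it. It does not use JMP or any BioCAD tool; the factor names and levels are hypothetical, and the half-fraction simply keeps the runs whose coded levels (-1/+1) multiply to +1.

```python
from itertools import product

# Hypothetical two-level factors for a small expression experiment.
FACTORS = {
    "promoter": ("weak", "strong"),
    "rbs": ("weak", "strong"),
    "temperature_c": (30, 37),
}


def full_factorial(factors):
    """Every combination of factor levels (2^k runs for k two-level factors)."""
    names = list(factors)
    levels = (factors[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


def half_fraction(factors):
    """Half of the full design: keep runs whose coded levels multiply to +1."""
    names = list(factors)
    runs = []
    for coded in product((-1, 1), repeat=len(names)):
        sign = 1
        for level in coded:
            sign *= level
        if sign == 1:
            # Map coded level -1 to the first listed level and +1 to the second.
            runs.append({n: factors[n][(c + 1) // 2] for n, c in zip(names, coded)})
    return runs


full_design = full_factorial(FACTORS)
reduced_design = half_fraction(FACTORS)
print(len(full_design), "runs in the full design")
print(len(reduced_design), "runs in the half-fraction")
for run in reduced_design:
    print(run)
```

The half-fraction halves the number of constructs to build while keeping every factor balanced across its two levels, at the cost of confounding some interaction effects with main effects.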

However, with the rapid growth and uptake of liquid handling automation and medium-throughput analytics in biofoundries, there is an increasing need to establish standardized protocols and reference materials to enable reproducibility and standardized measurements. There is also a need to develop a greater number and range of software tools to allow interoperability of hardware, building on platforms like ANTHA (http://www.synthace.com), as well as common data formats for measurements that can be used for machine learning, and standardized metadata and annotations to compare designs between laboratories and companies. The increasing use of large-scale libraries and high-throughput automation (such as microfluidic platforms) will inevitably lead to a data deluge, which will pose challenges in terms of data storage, data standards, data sharing and data visualisation.

A number of frameworks have recently been developed to aid engineers in turning designs of their biomolecules, pathways, and hosts into a set of formal automatable manufacturing operations. Further, these tools optimize for reliability and correctness of synthesis and efficiency in cost and time-of-production. Some of these link directly into the biomolecular and pathway/host design tools to choose optimal “DNA” parts to meet those design goals. However, there are not yet sophisticated tools supporting manufacture of high-complexity structured libraries for design-of-experiments.

Design tools are at their most powerful when the requirements, limitations, and desired outcome of a given design problem can be flexibly and completely specified in domain-specific languages (DSLs). These languages can and should support defining metrics against which designs can be optimized. Metrics could include, but are not limited to: yield, titer, efficiency, costs, environment, and longevity, among many others. Given the multiple scales at which design software will be asked to operate (such as for individual genetic networks, whole-cell models, cell-to-cell interactions, and up to entire ecosystems), scale-specific DSLs may be appropriate. These languages must be highly expressive but remain digitally interpretable, including support for simulation of designs against encoded requirements as a means for selection among competing design candidates. These languages may also allow for the storage of experimental results that could be formally compared to the specification to determine whether a given design satisfies the encoded requirements.
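A minimal sketch of the idea, written as a small specification embedded in Python rather than as a real DSL: requirements are encoded as metric thresholds, and stored experimental results are compared against them. The class names, metric names, units, and threshold values are all hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparators a requirement may use (a deliberately small, assumed set).
_COMPARATORS = {">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Requirement:
    metric: str        # e.g. "titer_g_per_l" (hypothetical metric name)
    comparator: str    # ">=" or "<="
    threshold: float

    def check(self, measured):
        """True if the measured value exists and meets the threshold."""
        value = measured.get(self.metric)
        if value is None:
            return False  # an unmeasured metric cannot satisfy a requirement
        return _COMPARATORS[self.comparator](value, self.threshold)


@dataclass(frozen=True)
class DesignSpec:
    name: str
    requirements: tuple

    def evaluate(self, measured):
        """Per-metric pass/fail for one set of experimental results."""
        return {r.metric: r.check(measured) for r in self.requirements}

    def satisfied_by(self, measured):
        return all(self.evaluate(measured).values())


spec = DesignSpec(
    name="example_pathway",
    requirements=(
        Requirement("titer_g_per_l", ">=", 1.0),
        Requirement("cost_per_g", "<=", 5.0),
    ),
)

candidate_a = {"titer_g_per_l": 1.2, "cost_per_g": 4.0}
candidate_b = {"titer_g_per_l": 2.0}  # cost was never measured
print(spec.evaluate(candidate_a), spec.satisfied_by(candidate_a))
print(spec.evaluate(candidate_b), spec.satisfied_by(candidate_b))
```

Because the same specification can be checked against simulated or measured values, it can serve both for selecting among design candidates and for deciding whether a built design meets its encoded requirements.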

Automation of ‘Build’ and ‘Test’

To increase throughput, capacity, and reproducibility, physical and informatic automation efforts have been applied to the Build and Test portions of the biological engineering DBTL cycle. The use of (traditional, acoustic, and microfluidic) liquid handling robotics to prepare molecular biology reactions (e.g., PCR, DNA assembly) is representative of Build physical automation. Test physical automation includes parallel arrays of bioreactors integrated with liquid-handlers for automated real-time control (e.g., pH, feeding) and periodic culture sampling (for offline analysis). Sample tracking (through laboratory information management systems – LIMS), automated protocol design/selection, and data analysis pipelines are characteristic of Build and Test informatic automation. The extent of process automation can range from semi-manual (i.e., stand-alone automated unit operations that interface through a human operator), to full automation (autonomous integrated unit operations). Semi-manual and full-automation each have advantages: with semi-manual automation, there is process flexibility and decreased operational complexity; fully-automated platforms allow high-throughput and “24/7” operations; neither process is always preferable to the other.

Sample-independent performance, unit operation de-coupling, and operational “good-enough” thresholds enable process automation. Sample-independent methods are more amenable to automation due to sample-to-sample performance robustness and the direct enablement of method scale-out/parallelization. Representative methods include sequence-independent DNA assembly methods (vs. traditional sequence-dependent cloning strategies), microbial landing-pad strategies that enable the same DNA construct-encoded gene cluster to be productively deployed across phylogeny (rather than a bespoke construct for each organism), next-generation DNA sequencing methods (vs. primer-directed Sanger sequencing), and methods for preparing a single sample for multiple -omics analyses (global or targeted metabolomics, proteomics, and/or lipidomics). Very few methods are completely sample-independent, however, and it is important to have alternative method(s) for samples that prove to be problematic for the preferred method.

Since technologies (including methods, software, and instrumentation) change very quickly, and significant effort is needed to adapt an existing, or create a new, automation method, unit operation de-coupling is crucial. The automation of any step in a process should ideally be unaffected by a technological change in an upstream or downstream step; otherwise, all coupled steps need to be re-developed if any one step changes. In practice, this is difficult to achieve. For example, in Build, it is not yet generally possible to Design any DNA sequence for fabrication without being sensitive to the limits of the technology and method of fabricating the DNA (e.g., how sequence-independent or not the DNA synthesis/assembly technology actually is).

An important automation-enabling approach is to set “good-enough” thresholds. Automated unit operations often process samples in batches, and a key operational decision or stage-gate is to determine what to do with the (anticipated) minority of samples that fail to be successfully processed. One approach is to set a threshold and, as long as that fraction of samples is successful, to proceed with the successful samples and drop the failed ones. It is, of course, possible, and in some cases desirable or necessary, to re-queue the failed samples (potentially with an alternative method), but at some point repeatedly failing samples must be abandoned or they will cumulatively drive the automated workflow to a halt.
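The stage-gate logic described above can be sketched as follows. This is an illustrative outline only: the function name, the `process` callback, the threshold of 0.8, and the retry limit are assumptions rather than values from any particular facility.

```python
def stage_gate(batch, process, threshold=0.8, max_attempts=2):
    """Process a batch, re-queue failures, and apply a "good-enough" threshold.

    batch: list of unique sample IDs.
    process: callable(sample_id, attempt_number) -> bool, True on success.
    Returns (passed, successes, abandoned).
    """
    if not batch:
        raise ValueError("batch must contain at least one sample")

    attempts = {sample: 0 for sample in batch}
    successes = []
    pending = list(batch)

    while pending:
        retry = []
        for sample in pending:
            attempts[sample] += 1
            if process(sample, attempts[sample]):
                successes.append(sample)
            elif attempts[sample] < max_attempts:
                retry.append(sample)  # re-queue, possibly for an alternative method
            # Otherwise the sample is abandoned so the workflow keeps moving.
        pending = retry

    done = set(successes)
    abandoned = [sample for sample in batch if sample not in done]
    passed = len(successes) / len(batch) >= threshold
    return passed, successes, abandoned


def demo_process(sample, attempt):
    """Hypothetical unit operation: s3 works on retry, s7 never works."""
    if sample == "s7":
        return False
    if sample == "s3":
        return attempt >= 2
    return True


samples = [f"s{i}" for i in range(10)]
passed, successes, abandoned = stage_gate(samples, demo_process)
print("proceed to next step:", passed)
print("abandoned samples:", abandoned)
```

Capping the number of attempts is what prevents repeatedly failing samples from stalling the workflow; the threshold then decides whether the surviving samples are enough to proceed.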

Considerable challenges, with associated opportunities and needs for improvement, remain before Build and Test automation delivers its desired impact of increasing efficiencies, rates, scope, reliability, and reproducibility. For example, technologies change rapidly, leading to process instability and the need to chronically re-develop automation, like the Red Queen telling Alice she must run to stand still. Additional challenges include: that instrumentation differences across facilities limit automation method transferability; that the use of and reliance upon automation can pose an operational robustness risk if an instrument fails (and if there is no instrument redundancy); and that a priori it can be difficult to predict which type of method might work effectively for a specific sample. Improvements are needed to better understand how transferable automated methods are across facilities and instruments, to develop methods that are more suitable and robust to automation (i.e., less sample dependent), to further de-couple unit operations, and to extend the application of automation approaches, for example, to the Build of transcription/translation systems, biomes, and tissues.

Future requirements of engineering biology databases

A mature computational infrastructure for biodesign requires powerful access to information about biological parts and systems, their environments, their manufacturing processes, and their operations in and beyond the laboratory in which they are created. This in turn requires findable, accessible, interoperable, and reusable data that enable effective aggregation of information on biological systems, their environments, and their processes of manufacture, and the establishment of standard models of data processing and analysis that allow open development and scalable execution.

One of the key enablers of any data-intensive field is the production of computational frameworks capable of supporting findable, accessible, interoperable, and reusable (FAIR) data and programmatic execution. Adherence to such principles means that informational products developed at one location can be found and used at another. Results can be checked, combined, and leveraged. While not all data can be public and open, frameworks that support this option enable and strengthen work both within and among organizations and individuals.

In order to (re)use the vast amount of measurements we expect to capture in future engineering biology experiments, new databases will need to adhere to these FAIR Principles:

Findable
- Data and metadata are assigned globally unique and persistent identifiers.
- Data are described with rich metadata.
- Metadata clearly and explicitly include the identifier of the data they describe.
- (Meta)data are registered or indexed in a searchable resource.
Accessible
- (Meta)data are retrievable by their identifier using a standardized communication protocol.
- Metadata should be accessible even when the data is no longer available.
Interoperable
- (Meta)data use a formal, accessible, shared and broadly applicable language for knowledge representation.
- (Meta)data use vocabularies that follow FAIR principles.
- (Meta)data include qualified references to other (meta)data.
Reusable
- (Meta)data are richly described with a plurality of accurate and relevant attributes.
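A few of the principles listed above can be checked mechanically on a metadata record. The sketch below is only a partial, assumed interpretation: the field names (`metadata_id`, `data_id`, `access_url`, `references`, `attributes`), the required attribute set, and the use of HTTPS as the standardized protocol are hypothetical choices, not part of the FAIR Principles themselves.

```python
# Hypothetical minimum set of descriptive attributes for a Test measurement.
REQUIRED_ATTRIBUTES = ("organism", "condition", "assay")


def fair_gaps(record):
    """Return a list of human-readable gaps found in one metadata record."""
    gaps = []

    # Findable: identifiers for the metadata and for the data it describes.
    for key in ("metadata_id", "data_id"):
        if not record.get(key):
            gaps.append(f"findable: missing {key}")

    # Accessible: retrievable over a standardized protocol (HTTPS assumed here).
    if not str(record.get("access_url", "")).startswith("https://"):
        gaps.append("accessible: no https access_url")

    # Interoperable: references to other (meta)data must be qualified.
    for ref in record.get("references", []):
        if not (ref.get("relation") and ref.get("target_id")):
            gaps.append("interoperable: unqualified reference")

    # Reusable: a plurality of relevant attributes.
    attributes = record.get("attributes", {})
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in attributes]
    if missing:
        gaps.append("reusable: missing attributes " + ", ".join(missing))

    return gaps


# Placeholder identifiers and URL, for illustration only.
complete_record = {
    "metadata_id": "example:metadata/0001",
    "data_id": "example:data/0001",
    "access_url": "https://repository.example/data/0001",
    "references": [{"relation": "derivedFrom", "target_id": "example:design/0042"}],
    "attributes": {"organism": "E. coli", "condition": "37 C", "assay": "proteomics"},
}
incomplete_record = {"metadata_id": "example:metadata/0002", "attributes": {"organism": "yeast"}}

print(fair_gaps(complete_record))
print(fair_gaps(incomplete_record))
```

A check like this can only confirm that the expected fields are present; whether an identifier is truly persistent or the metadata truly rich still depends on the repository and curation practices behind the record.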

For engineering biology, these principles apply across the DBTL cycle: Designs should be FAIR to enable characterization across many different organisms, conditions, and implementations for many different teams. Build protocols should be FAIR to ensure reproducibility, and multi-omics measurements across many different studies of the same organism must be FAIR in order to accumulate enough Test data for Learn activities. (For related reading, please see Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018; for a related graphic, please see McDermott, J., & Hardeman, M. (2018). Increasing Your Research’s Exposure on Figshare Using the FAIR Data Principles. Figshare.)


Footnotes & Citations

1. Pearl, J. (2018). Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining – WSDM ’18 (pp. 3–3). New York, New York, USA: ACM Press.
2. Costello, Z., & Martin, H. G. (2018). A machine learning approach to predict metabolic pathway dynamics from time-series multiomics data. Npj Systems Biology and Applications, 4, 19.
3. Cambray, G., Guimaraes, J. C., & Arkin, A. P. (2018). Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nature Biotechnology, 36(10), 1005–1015; Cuperus, J. T., Groves, B., Kuchina, A., Rosenberg, A. B., Jojic, N., Fields, S., & Seelig, G. (2017). Deep learning of the regulatory grammar of yeast 5’ untranslated regions from 500,000 random sequences. Genome Research, 27(12), 2015–2024.
4. Galdzicki, M., Clancy, K. P., Oberortner, E., Pocock, M., Quinn, J. Y., Rodriguez, C. A., … Sauro, H. M. (2014). The Synthetic Biology Open Language (SBOL) provides a community standard for communicating designs in synthetic biology. Nature Biotechnology, 32(6), 545–550.
5. Boeing, P., Leon, M., Nesbeth, D. N., Finkelstein, A., & Barnes, C. P. (2018). Towards an Aspect-Oriented Design and Modelling Framework for Synthetic Biology. Processes (Basel, Switzerland), 6(9), 167.
6. Nielsen, A. A. K., Der, B. S., Shin, J., Vaidyanathan, P., Paralanov, V., Strychalski, E. A., … Voigt, C. A. (2016). Genetic circuit design automation. Science, 352(6281), aac7341.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.

McDermott, J., & Hardeman, M. (2018). Increasing Your Research’s Exposure on Figshare Using the FAIR Data Principles. Figshare.

Last updated: June 19, 2019