Engineering Biology
Data Science Goal:

Establish a computational infrastructure where easy access to data supports the DBTL process for biology.

Current State-of-the-Art

The establishment of a computational infrastructure where easy access to data supports the DBTL process for biology is sometimes called a data ecology. This means easy access to data and validated models of biological systems, the processes by which they are modified and manufactured, and their reciprocal impact on the environment(s) in which they are deployed. At its core, such access requires both databases of this information and standards that ensure the right information is captured for design. These standards in turn allow common infrastructure, including application programming interfaces (APIs), for finding, transporting, and analyzing these data. Standards support interoperability of information; portability and reuse of data, tools, and materials; collaboration among teams, through common communication of data, tools, and results; and quality assurance, since data and tools in standard formats can be checked for errors in more automated ways. Biological design presents special challenges: the systems are far more diverse than electronic systems, with much less controlled information about them; their operations and interactions with their environment are exceptionally complex as a whole (even though the engineered aspects tend to be only a small part of the system); and the principles for design and manufacture are evolving rapidly and are highly application-specific. Engineering a microbe to produce a high-value chemical, engineering a T-cell to treat a specific cancer, and engineering a plant for growth and productivity in diverse field environments differ greatly and impose different requirements for information and analysis.

Despite the complexities of the data ecology landscape, engineering biologists are increasingly familiar with a large number of key biological information resources. These national repositories and workbenches range from those available from NCBI and EBI (RefSeq, PubMed, and Swiss-Prot), to established repositories of key biological measurement types (PDB, SRA, GEO, ArrayExpress, and IMG) and more volatile stores like MG-RAST or MicrobesOnline, to knowledge representation sites like MetaCyc, KEGG, and BRENDA, which together have been exceptionally important to the interpretation of biological data. These are backed by strong data standards groups and ontology development that ensure data are represented in a common language, with the appropriate organized characteristics to support automated statistical and semantic analysis. Further, there are attempts to unify the object ID space so that genes, genomes, taxa, chemicals, and other entities can be uniformly labeled, cross-referenced, and searched across data sets and systems.
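A unified ID space reduces cross-referencing to simple lookups across namespaces. The sketch below illustrates the idea with a tiny in-memory mapping; every identifier and namespace name in it is a hypothetical placeholder, not a real accession from any registry:

```python
# Illustrative sketch of ID cross-referencing: a tiny table mapping one
# internal gene record to identifiers in several public databases. All
# accession numbers below are hypothetical placeholders, not real IDs.

CROSS_REFS = {
    "gene:0001": {
        "refseq": "NM_000000.0",   # hypothetical RefSeq-style accession
        "uniprot": "P00000",       # hypothetical UniProt/Swiss-Prot-style ID
        "kegg": "eco:b0000",       # hypothetical KEGG-style gene ID
    },
}

def lookup(internal_id, namespace):
    """Return the external ID for internal_id in namespace, or None."""
    return CROSS_REFS.get(internal_id, {}).get(namespace)

def reverse_lookup(namespace, external_id):
    """Return the internal ID carrying external_id in namespace, or None."""
    for internal, refs in CROSS_REFS.items():
        if refs.get(namespace) == external_id:
            return internal
    return None
```

In practice this mapping would live in a shared registry service rather than a local dictionary, but the forward and reverse lookups are the operations that let a search span data sets and systems.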

Various individual analytical tools and more integrated data and analysis workbenches have begun to arise. General-purpose open systems like KBase and Galaxy serve different needs, but allow users to extend and share analytical capabilities and data across basic and applied biology and biotechnology. The Experiment Data Depot1 and the Joint BioEnergy Institute Inventory of Composable Elements (JBEI-ICE)2 serve as repositories and representations of data about bioengineered systems; numerous individual genetic device designers, like the RBS Calculator, and more integrated design systems, like Cello, are also available. Further, there has been some effort in the synthetic biology community to develop standards for interchangeable data, including the Synthetic Biology Open Language (SBOL), the Systems Biology Markup Language (SBML), and others.
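To make the idea of standardized, machine-readable part records concrete, the snippet below sketches a minimal JSON serialization of a genetic device description in the spirit of, but not conforming to, SBOL. All field names, part IDs, and values are illustrative assumptions; a real SBOL document is RDF-based and follows a defined schema:

```python
import json

# Hypothetical, SBOL-inspired record for a simple expression device.
# Every field name and identifier here is an illustrative assumption.
device = {
    "id": "example:device/gfp_expression",
    "type": "engineered_region",
    "components": [
        {"id": "example:part/promoter_1", "role": "promoter"},
        {"id": "example:part/rbs_1", "role": "ribosome_entry_site"},
        {"id": "example:part/gfp", "role": "cds"},
        {"id": "example:part/terminator_1", "role": "terminator"},
    ],
    "provenance": {"created_by": "example_lab", "created": "2019-06-19"},
}

# A standard serialization lets generic tooling store, index, and
# exchange the record without bespoke parsers.
serialized = json.dumps(device, indent=2)
roundtrip = json.loads(serialized)
```

The value of the standard is precisely that the round-trip is lossless and the roles are drawn from a shared vocabulary, so independently written tools can interpret the same record.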

Currently, there are very few widely used integrated computational DBTL-support systems, and those that exist rarely take advantage of the large number of diverse biological data and analysis resources. Despite some standards efforts, they remain rather siloed and use idiosyncratic technologies for data representation and analysis execution that hinder community use and development. Further, current focus has been, understandably, on the basic design and construction of pathways, and less on scalable production/formulation and on understanding post-deployment behaviors such as differences in operation outside the laboratory, failure modes in real environments, and tracking of designed biological objects in the environment to determine their sources and ecological impact (though there are examples of each of these). There is an opportunity for the engineering/synthetic biology community to take better advantage of the investments being made in other fields of quantitative and systems biology, medicine, chemical process engineering, and environmental science, and to establish its own best practices and standards for its unique aims.

There are three main activities associated with such an effort that also deeply involve the experimental practice of synthetic biology and biological manufacture: 1) establishing strong standards for the representation of synthetic biological objects, experimental design and process-control structures, and measurements of these objects and their outcomes in a series of increasingly complex environments, from initial laboratory creation to the sites of their application; these standards should adhere to FAIR (findable, accessible, interoperable, reusable) conventions, with computational representations parseable and analyzable within the frameworks built for general computational data science (i.e., utilizing standards for ontologies, ID spaces, data formats (e.g., RDF, JSON), and metadata for provenance); 2) demonstrating scalable computational libraries and infrastructure for depositing, searching, transporting, and aggregating/organizing these data types for analysis; and 3) establishing open, scalable software platforms that accelerate efficient, predictable design by enabling integrated access to the appropriate biological data, presented in design-oriented ways, and supported by a community-extensible set of tools whose results can be compared and contrasted to determine best practice over time. In each of these cases, the roadmap calls for starting with designs that operate in single organisms under laboratory conditions and scaling out to multicellular systems deployed in more open conditions.
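The first activity above can be illustrated with a measurement record that carries the pieces FAIR conventions call for: a resolvable identifier, an ontology term for the measured quantity, and provenance metadata, serialized in a generic format like JSON. Every field name, identifier, and term below is a hypothetical placeholder for illustration, not a published standard:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MeasurementRecord:
    """Illustrative FAIR-style record: findable (record_id), interoperable
    (ontology_term, standard serialization), reusable (provenance).
    All field names here are hypothetical, not from a real schema."""
    record_id: str       # globally unique, resolvable identifier
    subject_id: str      # ID of the engineered object that was measured
    ontology_term: str   # controlled-vocabulary term for the quantity
    value: float
    unit: str
    provenance: dict     # who, when, and by what protocol

record = MeasurementRecord(
    record_id="example:measurement/0001",
    subject_id="example:strain/0001",
    ontology_term="example_ontology:titer",   # placeholder term
    value=1.25,
    unit="g/L",
    provenance={"operator": "example_lab", "protocol": "example:protocol/7"},
)

# Serializing to JSON makes the record parseable by generic
# data-science tooling, per the frameworks mentioned above.
payload = json.dumps(asdict(record))
```

A real implementation would validate the ontology term against a registered vocabulary and resolve the IDs through a shared registry; the point of the sketch is only that each FAIR property corresponds to a concrete field in the record.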

Breakthrough Capabilities & Milestones

Established standard and accessible repositories for biomanufacturing data and analysis methods.

Common computational infrastructure for finding biological data and common APIs for search and analysis.

End-to-end, industry-normed design software platforms for engineered biological systems.

Footnotes

  1. Morrell, W. C., Birkel, G. W., Forrer, M., Lopez, T., Backman, T. W. H., Dussault, M., … Garcia Martin, H. (2017). The Experiment Data Depot: A Web-Based Software Tool for Biological Experimental Data Storage, Sharing, and Visualization. ACS Synthetic Biology, 6(12), 2248–2259.
  2. Ham, T. S., Dmytriv, Z., Plahar, H., Chen, J., Hillson, N. J., & Keasling, J. D. (2012). Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools. Nucleic Acids Research, 40(18), e141.
Last updated: June 19, 2019