The aim of this proposal is to implement a novel way of processing and accessing the vast, detailed knowledge contained within collections of scientific publications on the regulation of transcription initiation in bacterial models. In principle, this model for processing and reading information and new knowledge is applicable to other biological domains, potentially benefiting any area of biomedical knowledge. It is critical to generate new strategies to cope with the ever-increasing amount of knowledge generated in genomics and in biomedical research at large. Improving the efficiency of the traditional high-quality manual curation of scientific publications will also enable us to expand the types of biological knowledge we capture, beyond mechanisms and their elements in the genome, to start including their connections with larger regulated processes and eventually physiological properties of the cell. We will first implement the necessary technology to improve our curation by means of a computational system with text mining capabilities that preprocesses the papers before a human expert curator identifies which sentences contain the information to be added to the database. Premarked options selected by the curators will accelerate their decisions. The cumulative, precise mapping between sentences and curated knowledge will provide training sets for text mining technologies to improve their automatic extraction. Curation practices will become more efficient, enabling us to curate selected high-impact published reviews to place mechanisms into the rich context of their physiological processes and general biology. Another relevant component of our proposal is the improved modeling of regulated processes by means of new concepts in biology that capture larger collections of coregulated genes and their concatenated reactions.
Starting from all interactions of a local regulator, coregulated regulators and their domain of action will be incorporated to construct the biobricks of complex decisions, as they are encoded in the genome. These are conceptual containers that capture the organization of knowledge to describe the genetic programming of cellular capabilities. These concepts will be formalized and proposed within an international consortium focused on enriching standard models or ontologies of gene regulation for use by the scientific community. Finally, a portal will be implemented to navigate across all the sentences of a given corpus of a large number (more than 5,000) of related papers. The different avenues of navigation will essentially use two technologies: one that automatically generates simpler sentences from original sentences as input, and another that classifies papers based on their theme or ontology. Their combination will enable a novel navigation reading system. If we achieve our aims, this project will deliver a proof-of-principle prototype that integrates large amounts of knowledge at clearly innovative higher levels. Future directions may adapt these concepts and methods to the biology of higher organisms, including humans.

Public Health Relevance

Scientific publications contain a wealth of knowledge that we barely capture in genomics databases. Enhancing the effectiveness of the processing and representation of all this knowledge will change the way we encode our understanding of concatenated interactions that are organized into networks and processes governing cell behavior. Given the evolutionary conservation of the nature of biological complexity, a better encoding of our understanding of a bacterial cell should influence that of any other living organism.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Project #
Application #
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Ravichandran, Veerasamy
Center for Genomic Sciences
Zip Code
Salgado, Heladia; Martínez-Flores, Irma; Bustamante, Víctor H et al. (2018) Using RegulonDB, the Escherichia coli K-12 Gene Regulatory Transcriptional Network Database. Curr Protoc Bioinformatics 61:1.32.1-1.32.30
Pannier, Lucia; Merino, Enrique; Marchal, Kathleen et al. (2017) Effect of genomic distance on coexpression of coregulated genes in E. coli. PLoS One 12:e0174887
Méndez-Cruz, Carlos-Francisco; Gama-Castro, Socorro; Mejía-Almonte, Citlalli et al. (2017) First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes. Database (Oxford) 2017:
Ledezma-Tejeida, Daniela; Ishida, Cecilia; Collado-Vides, Julio (2017) Genome-Wide Mapping of Transcriptional Regulation and Metabolism Describes Information-Processing Units in Escherichia coli. Front Microbiol 8:1466
Keseler, Ingrid M; Mackie, Amanda; Santos-Zavaleta, Alberto et al. (2017) The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res 45:D543-D550
Balderas-Martínez, Yalbi Itzel; Rinaldi, Fabio; Contreras, Gabriela et al. (2017) Improving biocuration of microRNAs in diseases: a case study in idiopathic pulmonary fibrosis. Database (Oxford) 2017:
Rinaldi, Fabio; Lithgow, Oscar; Gama-Castro, Socorro et al. (2017) Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford) 2017:
Gama-Castro, Socorro; Salgado, Heladia; Santos-Zavaleta, Alberto et al. (2016) RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res 44:D133-43