Capitalizing on the transformative opportunities afforded by the extremely large and ever-growing volume, velocity, and variety of biomedical data being continuously produced is a major challenge. The development and increasingly widespread adoption of several new technologies, including next generation genetic sequencing, electronic health records and clinical trials systems, and research data warehouses means that we are in the midst of a veritable explosion in data production. This in turn results in the migration of the bottleneck in scientific productivity into data management and interpretation: tools are urgently needed to assist cancer researchers in the assembly, integration, transformation, and analysis of these Big Data sets. In this project, we propose to develop the Semantic Data Lake for Biomedical Research (SDL-BR) system, a cluster-computing software environment that enables rapid data ingestion, multifaceted data modeling, logical and semantic querying and data transformation, and intelligent resource discovery. SDL-BR is based on the idea of a data lake, a distributed store that does not make any assumptions about the structure of incoming data, and that delays modeling decisions until data is to be used. This project adds to the data lake paradigm methods for semantic data modeling, integration, and querying, and for resource discovery based on learned relationships between users and data resources.
The SDL-BR System is a distributed computing software solution that enables research institutions to manage, integrate, and make available large institutional data sets to researchers, and that permits users to generate data models specific to particular applications. It uses state of the art cluster computing, Semantic Web, and machine learning technologies to provide for rapid data ingestion, semantic modeling and querying, and search and discovery of data resources through a sophisticated, Web-based user interface.