This Small Business Technology Transfer (STTR) Phase I project proposes to demonstrate the feasibility of software for developing solutions to complex life science problems. Life science is an inherently interdisciplinary field in which collaboration can substantially advance the level of technology and product development. However, collaboration between industry and academic/government institutions is currently limited by (a) the need for industrial collaborators to analyze large amounts of complex and specialized information, (b) heterogeneous nondisclosure policies that protect the intellectual property rights of collaborators, and (c) legitimate concerns of businesses that disclosing their R&D problems to the public could weaken their business position. To address these fundamental limitations, software will be developed with the following innovative components: (1) a text and data mining model for finding nonobvious biological relationships, and (2) a secure workspace for storing and sharing data, implemented in a new programming language that enforces confidentiality and integrity policies on data resources. Phase I will produce preliminary software and documented results of feasibility studies. The results of these studies will be analyzed to evaluate the broader impacts and commercial potential of the proposed software.
The broader impact/commercial potential of this project will be to develop innovative, unique software that will drastically change the way problems in life science and other interdisciplinary fields are currently solved. The software is intended to advance the level of product development in life science and other interdisciplinary industries, and to aid in analyzing large volumes of text data. The targeted customers will be life science small businesses, which have a great need for external collaboration but limited resources and expertise. Other potential customers include university professors and independent experts with expertise in relevant subject areas. It is anticipated that, once the proposed tool has matured, it will also be used by large industrial customers for troubleshooting problems that require bold action and broad cooperation.
There is unprecedented growth in text data comprising research articles, Ph.D. theses, patents, test reports, technology reports, and web pages with product descriptions. R&D departments and organizations experience increasing difficulty in analyzing massive amounts of research text data to identify existing solutions to their problems. Similarly, the exploding volume of prior art text data impedes the evaluation of a technological concept seeking venture capital funding, the investigation of a specific scientific area for new product development, and the analysis of whether a patent application infringes or overlaps already patented technology. It can be expected that many organizations of different types and sizes will require massive increases in staffing and budget for activities involving the analysis of text data. Accordingly, there is an emerging need in the art for an intelligent framework that permits automatic retrieval of information relevant to prior knowledge, automatic extraction of concepts related to prior knowledge, and user-guided development of connections between the extracted concepts. Traditional information retrieval (IR) based on text searching can be used for a quick exploration of large collections of text data. However, this approach is incapable of finding specific facts in such collections and establishing connections between these facts. IR models also lack the ability to learn concepts and the relationships between them. In contrast, information extraction (IE) models are too specific and typically require customization for a domain of interest. An intelligent framework that integrates the IR and IE approaches was developed in this project to support a new use of text data for the development of connectable concepts. Text data in this framework are treated as a bag of dissimilar concepts, where each concept is associated with a unique label.
This label identifies the content of an individual document or a cluster of documents. The information flow of the process for developing connections between concepts includes the following steps, as shown in Figure 1: gathering prior knowledge by receiving a set of keywords from a user, collecting documents relevant to this knowledge, extracting concepts from the collected documents, and developing obvious and nonobvious connections between concepts. This process can be used to assist the discovery of new knowledge, to improve the reliability of a prior art search, to accelerate the finding of connectable intellectual property items, and to help rapidly screen technological concepts without prior knowledge.
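The four steps above can be illustrated with a minimal Python sketch. It is not the framework's actual implementation, and everything in it is an assumption made for illustration: the toy corpus, the hand-written concept vocabulary, the function names, and the two connection rules. Here an "obvious" connection is taken to mean two concept labels that co-occur in the same document, and a "nonobvious" connection is taken to mean two labels that never co-occur but share at least one intermediate concept.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus; real input would be articles, patents, reports, etc.
CORPUS = {
    "doc1": "fish oil reduces blood viscosity",
    "doc2": "high blood viscosity worsens raynaud syndrome",
    "doc3": "fish oil is rich in omega-3 fatty acids",
    "doc4": "solar panels convert sunlight into electricity",
}

# Hypothetical concept vocabulary: unique label -> phrase searched for in the text.
CONCEPTS = {
    "FISH_OIL": "fish oil",
    "BLOOD_VISCOSITY": "blood viscosity",
    "RAYNAUD": "raynaud syndrome",
    "OMEGA_3": "omega-3",
    "SOLAR": "solar panels",
}


def collect_documents(keywords, corpus):
    """Step 2 (IR): keep documents that mention at least one user keyword."""
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if any(keyword.lower() in text.lower() for keyword in keywords)
    }


def extract_concepts(documents, concepts):
    """Step 3 (IE): map each document to the set of concept labels it mentions."""
    return {
        doc_id: {label for label, phrase in concepts.items() if phrase in text.lower()}
        for doc_id, text in documents.items()
    }


def connect_concepts(doc_concepts):
    """Step 4: derive obvious (co-occurring) and nonobvious (bridged) pairs."""
    obvious = set()
    for labels in doc_concepts.values():
        obvious.update(combinations(sorted(labels), 2))

    # Build an undirected neighbor map from the obvious connections.
    neighbors = defaultdict(set)
    for a, b in obvious:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Two concepts are nonobviously connected if they never co-occur
    # but have at least one neighbor in common.
    nonobvious = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = neighbors[a] & neighbors[b]
        if (a, b) not in obvious and shared:
            nonobvious[(a, b)] = shared
    return obvious, nonobvious


def run_pipeline(keywords, corpus=CORPUS, concepts=CONCEPTS):
    """Step 1 is the `keywords` argument: prior knowledge supplied by the user."""
    documents = collect_documents(keywords, corpus)
    doc_concepts = extract_concepts(documents, concepts)
    return connect_concepts(doc_concepts)


if __name__ == "__main__":
    obvious, nonobvious = run_pipeline(["fish oil", "raynaud"])
    print("Obvious:", sorted(obvious))
    for pair, bridge in sorted(nonobvious.items()):
        print("Nonobvious:", pair, "via", sorted(bridge))
```

With the keywords "fish oil" and "raynaud", the unrelated document about solar panels is filtered out at the collection step, and FISH_OIL and RAYNAUD are reported as nonobviously connected through BLOOD_VISCOSITY. A practical system would replace the substring matching with a real retrieval engine and a learned or domain-specific extraction model; the sketch only shows how the steps feed one another.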