We propose to further develop, test, evaluate and support caTIES - an existing software system for developing networked repositories of sharable de-identified surgical pathology reports. The caTIES system creates a repository of de-identified, structured, and concept-coded clinical reports derived from large corpora of clinical free-text. Documents are automatically coded against a controlled terminology such as the Unified Medical Language System (UMLS), SNOMED-CT, or NCI Metathesaurus. Users construct queries to identify specific kinds of documents and tissue specimens based on the associated clinical report. For example, a researcher studying genetic variation in metastatic breast cancers can identify cases of invasive ductal carcinoma of the breast, followed by metastatic ductal cancer in bone at an interval of three years or greater from the original diagnosis. The caTIES system also supports acquisition and ordering of tissues, using an honest broker model. Through this mechanism, de-identified data and access to tissue can be shared among institutions, enabling multi-center collaborative research. The caTIES system has already been implemented at seven US Cancer Centers, and is being considered for adoption by numerous other institutions including cancer centers, university hospitals and private hospitals. Initial development of caTIES was funded by the Cancer Biomedical Informatics Grid (caBIG). However, interest in the application has far exceeded our expectations and the limitations of caBIG. This grant will allow us to further extend the capabilities of the system by (a) improving the portability of the system and extending the types of documents that can be processed, (b) evaluating the system's NLP performance and usability, (c) building a user community to support this open-source application, and (d) piloting interoperability of caTIES with other enterprise and research systems. 
This work will preserve and extend a highly novel platform for development of massive repositories of de-identified clinical data that can be used for research within and across institutions.

Narrative

This grant will fund the further development and evaluation of a system that takes identified clinical documents and converts them into de-identified, concept-coded, structured data. The system enables researchers to access remainder tissues and clinical report data for research purposes within and across institutions. This project is important because it will greatly increase researchers' access to important data and materials while maintaining patient privacy.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BST-Q (01))
Program Officer: Li, Jerry
University of Pittsburgh
Schools of Medicine
United States
Prabhu, Arpan V; Sturgis, Charles D; Lai, Chi et al. (2017) Improving margin revision: Characterization of tumor bed margins in early oral tongue cancer. Oral Oncol 75:184-188
Tseytlin, Eugene; Mitchell, Kevin; Legowski, Elizabeth et al. (2016) NOBLE - Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics 17:32
Jacobson, Rebecca S; Becich, Michael J; Bollag, Roni J et al. (2015) A Federated Network for Translational Cancer Research Using Clinical Data and Biospecimens. Cancer Res 75:5194-201
Crowley, Rebecca S; Castine, Melissa; Mitchell, Kevin et al. (2010) caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research. J Am Med Inform Assoc 17:253-64