New advances in biomedical research have made it possible to collect multiple data "views" (for example, genetic, metabolomic, and clinical data) for a single patient. Such multi-view data promises to offer deeper insights into a patient's health and disease than would be possible if just one data view were available. However, in order to achieve this promise, new statistical methods are needed. This proposal involves developing statistical methods for the analysis of multi-view data. These methods can be used to answer the following fundamental question: do the data views contain redundant information about the observations, or does each data view contain a different set of information? The answer to this question will provide insight into the data views, as well as insight into the observations. If two data views contain redundant information about the observations, then those two data views are related to each other. Furthermore, if each data view tells the same "story" about the observations, then we can be quite confident that the story is true. The investigators will develop a unified framework for modeling multi-view data, which will then be applied in a number of settings.
In Aim 1, this framework will be applied to multi-view multivariate data (e.g. a single set of patients, with both clinical and genetic measurements), in order to determine whether a single clustering can adequately describe the patients across all data views, or whether the patients cluster separately in each data view.
In Aim 2, the framework will be applied to multi-view network data (e.g. a single set of proteins, with both binary and co-complex interactions measured), in order to determine whether the nodes belong to a single set of communities across the data views, or a separate set of communities in each data view.
In Aim 3, the framework will be applied to multi-view multivariate data in order to determine whether the observations can be embedded in a single latent space across all data views, or whether they belong to a separate latent space in each data view.
In Aims 1-3, the methods developed will be applied to the Pioneer 100 study, and to the protein interactome.
In Aim 4 (a), the availability of multiple data views will be used in order to develop a method for tuning parameter selection in unsupervised learning.
In Aim 4 (b), protein communities that were identified in Aim 2 will be validated experimentally. High-quality open source software will be developed in Aim 5. The methods developed in this proposal will be used to determine whether the findings from multiple data views are the same or different. The application of these methods to multi-view data sets, including the Pioneer 100 study and the protein interactome, will improve our understanding of human health and disease, as well as fundamental biology.
Biomedical researchers often collect multiple "types" of data (e.g. clinical data and genetic data) for a single patient, in order to get a fuller picture of that patient's health or disease status than would be possible using any single data type. This proposal involves developing new statistical methods that can be used in order to analyze data sets that consist of multiple data types. Applying these methods will lead to new insights and better understanding of human health and disease.