With the increasing availability of large-scale datasets, such as those from intensive care units (ICUs), researchers face a flood of data that does not lead immediately to knowledge. Given the volume of these data and the frequency of collection (ICU patients are monitored every 5 seconds), many important events will be rare occurrences. Unlike the traditional approach of prospectively measuring a small set of variables hypothesized to be important, these observational datasets contain a large, unselected, and incomplete set of features. They can allow insight into cases where experiments are infeasible, but using them for decision-making requires new methods for finding the impact of rare events and hidden variables in complex time series, along with realistic simulated data for evaluation. This proposal addresses two main challenges of large-scale observational data: 1) evaluating the causal impact of rare events, and 2) identifying latent causes. First, we leverage the volume of data and the connection between type (general) and token (singular) causality to infer a model of how a system normally functions, and then determine whether rare events explain a deviation from usual behavior. This approach of comparing a model with observed instances then forms the basis for finding latent variables, where we aim to find how much of a variable's value (or how many of its occurrences) is due to influences outside the dataset and to find shared causes for sets of variables. This is motivated by applications to neurological ICU (NICU) data streams, where the volume of continuous recordings of patients' brain activity and physiological signs surpasses clinicians' ability to find complex patterns in real time and use them for treatment. Further, clinicians need to know not just that a patient is having a seizure (a low-probability event with a potentially significant impact on outcomes), but whether it is causing harm, before they can determine how to treat it.
To enable rigorous validation of the algorithms, we develop a new computational platform for generating simulated NICU time series data. The methods will improve understanding of seizures in stroke patients and will be broadly applicable to large-scale, high-resolution time series data, enabling discoveries in areas such as computational social science.
The methods developed will improve the translation of data to knowledge to policy by identifying actionable information on causes, enabling better and more rapid decision-making by clinicians. Creating and disseminating realistic simulated data will allow for comparison and validation of methods, facilitating computational advances by researchers in computer science and medicine.