With my graduate student Tao Jiang, I have been working on an extension of least absolute shrinkage and selection operator (Lasso) regression to address variable selection and modeling when sample sizes are small relative to the data dimension, a common situation in genome-wide association studies (GWAS). We developed a new upper bound for the regularization parameter in sparse group Lasso, based on an estimated lower bound of the proportion of false null hypotheses with confidence (1 - α). The lower bound is estimated from the empirical distribution of dependent or independent p-values from single-marker/variable analysis, using a second-level significance test, the higher criticism statistic. The upper bound of the Lasso tuning parameter, λ, is then set to correspond to this lower bound of the proportion of false null hypotheses. Because the upper bound of λ is lower, the tuning range is narrower. The final set of non-zero estimates (e.g., significant loci in GWAS) will contain more variables, so the power of the modified analysis is higher than or equal to that of the original sparse group Lasso. We also study different levels of correlation among variables in the true regression model. We demonstrate the performance of our method using both simulation experiments and a real data application in lipid trait genetics from the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial.
Another project with Tao Jiang focuses on a machine learning approach for detecting same-species contamination in next-generation sequencing analysis. In this project, we have developed a machine learning method based on support vector machines, with corresponding software, to detect same-species contamination that can arise from laboratory quality control issues or from mixed samples in forensic applications. This is the first set of tools that can work directly on .VCF files, and the approach recognizes mixtures of tumor and normal cells to prevent false positives.
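For the sparse group Lasso work described above, the higher criticism step can be illustrated with a minimal sketch. It assumes the standard Donoho-Jin form of the statistic computed from sorted p-values. The function name `higher_criticism` and the `alpha0` truncation fraction are illustrative choices, not taken from the project's software, and the sketch does not reproduce the project's mapping from this statistic to a lower bound on the proportion of false nulls or to an upper bound on λ.

```python
import numpy as np


def higher_criticism(pvalues, alpha0=0.5):
    """Donoho-Jin higher criticism statistic (illustrative sketch).

    pvalues : p-values from single-marker tests.
    alpha0  : fraction of the smallest p-values to scan (an assumption here).
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Clip to avoid division by zero when a p-value is exactly 0 or 1.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    i = np.arange(1, n + 1)
    # Standardized gap between the empirical CDF (i/n) and the uniform CDF (p).
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(np.floor(alpha0 * n)))
    return float(hc[:k].max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    null_p = rng.uniform(size=5000)      # all null hypotheses true
    mixed_p = null_p.copy()
    mixed_p[:100] = rng.uniform(0, 1e-4, size=100)  # a few strong signals
    print("HC under the global null:", higher_criticism(null_p))
    print("HC with sparse signals:  ", higher_criticism(mixed_p))
```

A large value of the statistic indicates that more small p-values are present than a uniform null distribution would produce, which is the kind of evidence a lower bound on the proportion of false nulls draws on.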
Current extensions of the contamination detection method are being developed to estimate the percentage of tumor vs. normal samples and the number of individuals in forensic applications.
In collaboration with Dr. Denis Fourches, his student Jeremy Ash, his postdoc Melaine Kuenemann, and my former postdoc Dr. Daniel Rotroff, I have been working on methods to integrate chemical structure information into metabolomics analyses. Developing predictive and transparent approaches to the analysis of metabolite profiles across patient cohorts is of critical importance for understanding the events that trigger or modulate traits of interest (e.g., disease progression, drug metabolism, chemical risk assessment). However, metabolites' chemical structures are still rarely used in the statistical modeling workflows that establish these trait-metabolite relationships. Herein, we present a novel cheminformatics-based approach capable of identifying predictive, interpretable, and reproducible trait-metabolite relationships. As a proof of concept, we use a previously published case study consisting of metabolite profiles from non-small-cell lung cancer (NSCLC) adenocarcinoma patients and healthy controls. By characterizing each structurally annotated metabolite using both computed molecular descriptors and patient metabolite concentration profiles, we show that these complementary features enhance the identification and understanding of key metabolites associated with cancer. Ultimately, we built multi-metabolite classification models for assessing patients' cancer status using specific groups of metabolites identified on the basis of high structural similarity through chemical clustering.
Additionally, with Dr. Jung-Ying Tzeng and several other collaborators at NCSU and UNC Chapel Hill, we have developed an approach for testing associations of single rare variants. Rare variants are of increasing interest in genetic association studies because of their etiological contributions to complex human diseases.
Due to the rarity of the mutant events, rare variants are routinely analyzed at an aggregate level. While aggregation analyses improve the detection of global-level signal, they cannot pinpoint causal variants within a variant set. To perform inference at a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Beyond providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization.
Projects building on methods for detecting gene-gene interactions are ongoing, using variance QTLs to prioritize single nucleotide polymorphisms for detecting gene-gene interactions.
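The idea of borrowing information from neighboring variants in 3-dimensional protein space, described above for POINT, can be illustrated with a small sketch. This is not the POINT test itself: the Gaussian distance weighting, the `scale` parameter, and the function names are assumptions made for illustration, and the per-variant scores stand in for whatever single-variant statistics an analysis would supply.

```python
import numpy as np


def gaussian_proximity_weights(coords, scale):
    """Pairwise Gaussian weights from 3-D coordinates (one row per variant).

    Hypothetical helper: weight = exp(-d^2 / (2 * scale^2)), where d is the
    Euclidean distance between two variants' positions on the protein.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * scale ** 2))


def locally_aggregated_scores(scores, coords, scale):
    """For each focal variant, sum all variants' scores weighted by proximity."""
    weights = gaussian_proximity_weights(coords, scale)
    return weights @ np.asarray(scores, dtype=float)


if __name__ == "__main__":
    # Made-up coordinates: two variants close together, one far away.
    coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
    scores = [1.0, 1.0, 1.0]
    print(locally_aggregated_scores(scores, coords, scale=1.0))
```

With a small `scale`, each variant is informed almost only by itself; with a larger `scale`, the aggregation approaches a set-level sum. Choosing this trade-off from the data is what a data-adaptive procedure would address.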

Support Year
1
Fiscal Year
2019