A Comprehensive Approach to Pre-processing of Affymetrix GeneChip Data

McGee, Monnie

Abstract

Microarray experiments allow the simultaneous analysis of differences in the expression of thousands of genes in different biological samples. Such experiments have been instrumental in detecting subtle gene expression changes in different stages and types of cancer, and in enabling researchers to determine molecular responses to chemotherapy and other external stimuli. Affymetrix microarrays are widely used in biological and medical research because of their production reproducibility, which facilitates the comparison of results between experimental runs. To obtain high-level classification and clustering analyses that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. The quality of the final results depends on the validity of the algorithm used for pre-processing the microarray data. Therefore, improving the quality of the analysis of microarray data can have important, wide-ranging effects on basic research and the resulting medical applications. In previous analyses of Affymetrix GeneChip® data, several important patterns that affect high-level results have been uncovered. However, none of these patterns is currently considered by any of the popular algorithms for array pre-processing. For example, on the human genome platforms, thirty percent of mismatch (MM) probes have intensity levels greater than those of their perfect match (PM) counterparts, indicating the presence of cross-hybridization. Further, the intensity levels of PM and MM probes are highly correlated, indicating that MM probes may be non-specifically hybridizing to the target gene. Thus, subtracting MM intensities from PM intensities reduces the true signal, making differentially expressed genes harder to detect.
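The two probe-level patterns mentioned above (the share of probe pairs in which MM exceeds PM, and the correlation between PM and MM intensities) can be checked with a few lines of code. The following is a minimal sketch only: the function names and the five probe-pair intensities are hypothetical and are not taken from any data set in this proposal.

```python
from math import sqrt


def fraction_mm_exceeds_pm(pm, mm):
    """Return the fraction of probe pairs whose MM intensity exceeds the PM intensity."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(m > p for p, m in zip(pm, mm)) / len(pm)


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical intensities for five probe pairs (illustration only, not real data).
pm = [120.0, 340.0, 95.0, 800.0, 210.0]
mm = [100.0, 360.0, 90.0, 650.0, 230.0]

print("fraction of pairs with MM > PM:", fraction_mm_exceeds_pm(pm, mm))
print("PM/MM correlation:", round(pearson(pm, mm), 3))
```

On real arrays the same two summaries would be computed over all probe pairs on a chip, typically after reading the probe-level intensities with a dedicated package.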
This grant outlines a proposal for a data-driven model that takes into account cross-hybridization and non-specific hybridization in the analysis of Affymetrix GeneChip® arrays. Specifically, the model will treat observed PM intensities as a combination of autofluorescence, non-specific hybridization, cross-hybridization, and true signal. MM intensities will include only the first three components, since it is assumed that, once these background components are properly estimated, only the PM probes will carry true signal. Modeling these components separately will make it possible to determine the contribution of each and to account for them during background correction. The performance of this new model-driven approach to the processing of Affymetrix microarray data will be evaluated against commonly used algorithms such as MAS5.0, dChip, and RMA, using well-characterized data sets to validate the improved accuracy of the final model. Implementation of this model should lead to better high-level data analysis and, correspondingly, a better understanding of gene expression differences in response to disease states or environmental changes. Gene expression microarrays allow the expression levels of thousands of genes to be determined simultaneously, and have given insights into many areas of basic research, from the genes that determine tumor stage to the genes expressed during the formation of vital organs in development. This project seeks to improve the reliability, reproducibility, and applicability of experiments using microarray data by creating better analysis approaches for extracting true expression values from these data.
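The decomposition of PM and MM intensities described above can be illustrated with a small sketch. Everything beyond the four named components is an assumption made for illustration: the additive form, the shared autofluorescence term, the separate non-specific and cross-hybridization terms for each probe, the class and function names, and the distributions used to simulate values. It is not the model that will ultimately be fitted.

```python
import random
from dataclasses import dataclass


@dataclass
class ProbePair:
    """Hypothetical additive components of one PM/MM probe pair."""
    autofluorescence: float   # assumed shared by both probes in the pair
    nonspecific_pm: float     # non-specific hybridization on the PM probe
    crosshyb_pm: float        # cross-hybridization on the PM probe
    signal: float             # true signal, carried by the PM probe only
    nonspecific_mm: float     # non-specific hybridization on the MM probe
    crosshyb_mm: float        # cross-hybridization on the MM probe

    @property
    def pm(self):
        # PM = autofluorescence + non-specific + cross-hybridization + true signal
        return (self.autofluorescence + self.nonspecific_pm
                + self.crosshyb_pm + self.signal)

    @property
    def mm(self):
        # MM = the three background components only, with no true signal
        return self.autofluorescence + self.nonspecific_mm + self.crosshyb_mm


def simulate_pair(rng):
    """Draw one probe pair; all distributions here are arbitrary assumptions."""
    return ProbePair(
        autofluorescence=rng.uniform(20.0, 40.0),
        nonspecific_pm=rng.expovariate(1 / 50.0),
        crosshyb_pm=rng.expovariate(1 / 30.0),
        signal=rng.expovariate(1 / 200.0),
        nonspecific_mm=rng.expovariate(1 / 50.0),
        crosshyb_mm=rng.expovariate(1 / 30.0),
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = [simulate_pair(rng) for _ in range(1000)]
share = sum(p.mm > p.pm for p in pairs) / len(pairs)
print("share of simulated pairs with MM > PM:", share)
```

Under these assumptions, PM minus MM equals the true signal plus the differences between the two probes' background terms, so the simple difference recovers the signal only when the backgrounds coincide. This is one way to see why estimating each component separately could matter for background correction.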