Gene expression measurement using cDNA and oligonucleotide arrays is exploding in popularity, yet users continue to face many technical problems. One of the more fascinating problems results from the large, sometimes overwhelming volume of data generated in these experiments. Image capture, processing, interpretation and quantification remain important fundamental issues. Quality control and statistical design of experiments must be adequately addressed if fruitful results are to be obtained. Numerous statistical, image processing and bioinformatics problems confront users of these technologies. Because arrays can be constructed to contain thousands of spots, automated analysis of the resulting images is required. The technology itself must also be improved so that it can be coupled with new tissue sampling technologies such as laser capture microdissection (LCM). Accordingly, this project seeks to address problems in this area at the statistical, numerical, computational, and informatics levels.
Progress in FY2000: Working with laboratories in NCI, NICHD, NIA, NIDDK, NINDS and NIDCR, we have developed application software for the analysis of array images from major commercial sources as well as from custom arrays. The program PSCAN was developed to facilitate the image-processing steps of the analysis and produces optimal estimates of spot intensities. The program is written in MATLAB; the code is being made publicly available, and a Web distribution site has been established. Numerous improvements to the image processing steps have been achieved, including improved spot detection, location and quantification algorithms, improved user interfaces, linkage with Web-based information, and improved data storage formats. Our analysis method relies on a number of data visualization tools and allows users to identify significantly over- or under-expressed genes in a comparative study.
Importantly, these techniques also allow users to identify experimental artifacts, outliers and other data anomalies that are present in a large percentage of hybridization studies, such as non-constant background hybridization, image defects, dropouts, printing artifacts, and spot bleeds. We have generalized the program into the program F-SCAN for analysis of two-label arrays using labels such as the fluors Cy3 and Cy5. New algorithms for spot detection, shape determination, and robust methods for signal and background estimation have been developed and extensively tested. These algorithms compare favorably with algorithms used in leading commercial software, and are being trained to reject common artifacts in fluorescently labeled images. In one collaboration with NIA, our methods were applied to early screening studies using commercial arrays; clones containing interesting genes were selected, and custom arrays were manufactured and then used in a second series of studies. A manuscript describing this work is in preparation. We have also developed a method for mapping over- and under-expressed genes onto the location of each gene within the human genome. We are now investigating commercially available data mining and visualization software applicable to gene expression studies. Despite the current high cost of most such products, they may become suitable for use at NIH under an enterprise-wide cost-sharing mechanism, and may speed discovery of gene function using large-scale gene expression studies coupled with newly available human genome sequence data.
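As an illustration only, the general idea of robust signal and background estimation, followed by flagging over- or under-expressed genes on a two-label (Cy3/Cy5) array, can be sketched in Python. This is not the PSCAN or F-SCAN code, whose algorithms are not described in this report. The function names, the median-based estimators, the pixel lists and the cutoff of 3 robust standard deviations are all assumptions made for the example.

```python
from math import log2
from statistics import median


def robust_spot_intensity(spot_pixels, background_pixels):
    """Median spot signal minus median local background, floored at zero.

    Medians are used (an assumption of this sketch) so that a few extreme
    pixels, such as dust or a spot bleed, do not dominate the estimate.
    """
    signal = median(spot_pixels)
    background = median(background_pixels)
    return max(signal - background, 0.0)


def flag_expression_changes(intensities, cutoff=3.0):
    """Flag genes whose log2(Cy5/Cy3) ratio is far from the array median.

    `intensities` maps a gene name to a (cy3, cy5) pair of
    background-corrected intensities. Genes with a non-positive intensity
    in either channel are skipped. The spread of the log ratios is measured
    with the median absolute deviation (MAD), scaled by 1.4826 so that it
    is comparable to a standard deviation for normally distributed data.
    Returns a dict mapping each flagged gene to "over" or "under".
    """
    log_ratios = {
        gene: log2(cy5 / cy3)
        for gene, (cy3, cy5) in intensities.items()
        if cy3 > 0 and cy5 > 0
    }
    if not log_ratios:
        return {}

    center = median(log_ratios.values())
    mad = median([abs(value - center) for value in log_ratios.values()])
    scale = 1.4826 * mad
    if scale == 0:
        # No measurable spread, so no gene can be called an outlier.
        return {}

    flagged = {}
    for gene, value in log_ratios.items():
        z = (value - center) / scale
        if z > cutoff:
            flagged[gene] = "over"
        elif z < -cutoff:
            flagged[gene] = "under"
    return flagged


if __name__ == "__main__":
    # One bright artifact pixel (5000) barely affects the median estimate.
    print(robust_spot_intensity([100, 102, 98, 101, 5000], [10, 12, 11, 9, 10]))

    example = {
        "g1": (100, 100), "g2": (100, 110), "g3": (100, 90),
        "g4": (100, 105), "g5": (100, 95),
        "g6": (100, 1600),   # strongly higher in Cy5
        "g7": (100, 6.25),   # strongly lower in Cy5
    }
    print(flag_expression_changes(example))
```

A median-based approach is only one of several possible robust choices; trimmed means or model-based fits would serve the same purpose, and the cutoff would need to be tuned to the arrays in question.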