SCIENCE AT THE EDGE SEMINAR

Friday, 10 September 2010 at 11:30am
Room 1400 Biomedical and Physical Sciences Bldg.
Refreshments at 11:15

Speaker: Jason Mezey
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY
Department of Genetic Medicine, Weill Cornell Medical College, New York, NY

Title: Scalable Algorithms for Genome-wide Disease Locus Mapping and Gene Expression Network Analysis

Abstract:

High throughput genomics and next-generation sequencing technology are changing how we study the genetics of disease and complex aspects of physiology. In this talk, I will describe our research on the computational algorithms used to analyze these data, with the objectives of answering the following questions:

1. Which genes contribute to increased risk for disease?
2. Can we discover gene networks important for complex phenotypes?
3. Are there novel genome-wide factors contributing to the etiology of disease?

For the first question, we have developed scalable penalized regression algorithms for mapping the location of disease genes when simultaneously analyzing millions of genetic markers in genome-wide association studies (GWAS). By applying these algorithms, we have been able to confirm associations that have not been replicated when analyzing each genetic marker individually.

For the second question, we employ probabilistic graphical models, which are applied extensively in machine learning, to identify novel network and pathway connections from genome-wide gene expression profiles of specific cell and tissue types. Graphical models provide a balance between model fitting and the simplest representation of a network connection that is relevant to biology, i.e. two genes interact or they do not. We have developed scalable algorithms for both undirected and directed graphical network models, where the directed approaches take advantage of genetic perturbations for novel network discovery.

For the third question, we have developed specialized algorithms for analyzing next-generation sequence data to discover new biological factors that may contribute to the initiation and production of complex disease. As an example of the application of our methods, we have analyzed RNA-Seq profiles of human lung tissue and we have identified genome-wide splicing alterations that are produced by smoking.