SCIENCE AT THE EDGE SEMINAR SERIES
Quantum Biology / Gene Expression in Development and Disease Seminar

Friday, 21 September 2012 at 11:30am
Room 1400 Biomedical and Physical Sciences Bldg.
Refreshments at 11:15

Speaker: Richard Scheuermann
J. Craig Venter Institute, San Diego, California

Title: Comparative Genomics Analysis to Determine the Origin of Pandemic Influenza Viruses: Sequence Feature Variant Type and Evolutionary Trajectory Analysis using the Influenza Research Database

Abstract: The recent emergence of the 2009 Pandemic H1N1 virus has highlighted the value of free and open access to influenza genome sequence data integrated with information about viral characteristics related to antiviral drug resistance and virulence. The Influenza Research Database (IRD) (www.fludb.org) is a free, publicly accessible resource funded by the U.S. National Institute of Allergy and Infectious Diseases (NIAID) through the Bioinformatics Resource Centers (BRC) program. It serves as a comprehensive integrated database and analysis resource for influenza sequence, surveillance and research data. The IRD gives researchers user-friendly interfaces and tools for data retrieval, analysis, and visualization. IRD integrates genomic, proteomic, immune epitope, and surveillance data with other information useful in understanding virulence and host-pathogen interactions.

IRD also contains novel data related to influenza virus not found in other database resources. Curators at IRD have assembled a set of >4500 sequence features (SF) in the 11 influenza proteins. A sequence feature can be a specific functional region (e.g. an enzyme active site), a structural feature (e.g. a particular alpha helix), an immune epitope, or any other region of special interest. For each SF, IRD then determines the unique amino acid sequences found across all sequence records in the database; these unique sequences form the set of variant types (VT) for that SF.
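
The variant type idea described above can be illustrated with a minimal Python sketch. This is not IRD's implementation: the function name, the toy sequences, the feature coordinates, and the choice to number variant types by descending frequency are all assumptions made for illustration. It assumes the protein sequences are already aligned, so that the same positions delimit the SF in every record.

```python
from collections import Counter

def variant_types(sequences, start, end):
    """Assign a variant type (VT) label to each strain for one sequence feature (SF).

    sequences: dict mapping strain name -> aligned amino acid sequence
    start, end: 1-based inclusive positions of the SF (hypothetical coordinates)
    """
    # Extract the SF region from every sequence record.
    regions = {strain: seq[start - 1:end] for strain, seq in sequences.items()}
    # Count how often each unique amino acid sequence occurs for this SF.
    counts = Counter(regions.values())
    # Number the unique sequences VT-1, VT-2, ... by descending frequency,
    # breaking ties alphabetically. This ordering is an assumption.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    vt_of = {s: f"VT-{i}" for i, s in enumerate(ordered, start=1)}
    labels = {strain: vt_of[region] for strain, region in regions.items()}
    return labels, counts

# Toy data, invented for illustration only.
toy = {
    "strainA": "MDPNTVSSFQ",
    "strainB": "MDPNTVSSFQ",
    "strainC": "MDSNTVSSFQ",
}
labels, counts = variant_types(toy, 2, 4)
```

Here the hypothetical SF spans positions 2 to 4, so two strains share one variant type and the third carries a second one. The resulting labels could then be tabulated against metadata such as host species to look for associations.
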
An analysis of associations between sequence feature variant types (SFVTs) and virus host range has revealed that genetic determinants in the influenza NS1 protein restrict the host range of influenza viruses.

In April and May 2009, the first cases of a new influenza-like illness were reported in California, Texas and Mexico. A novel influenza virus strain, Pandemic H1N1 2009 (aka "swine flu"), had emerged as the first official pandemic strain of the 21st century. Early analysis pointed to a swine origin for the virus overall, but the origin of each of the eight genomic segments was debated, since segments from two or more influenza strains are known to reassort. Traditionally, investigators address this question by performing a BLAST analysis to identify closely related sequences as candidate ancestral genomes. However, while evaluating BLAST results, we identified interesting patterns when we plotted nucleotide differences versus isolation year differences between the pandemic strain and the top 1000 BLAST hits. These plots showed three distinct clusters for segment 5, which encodes the nucleoprotein (NP). Similar patterns were observed for other genomic segments.

Manual segregation of these sequence records, followed by phylogenetic and sequence alignment analysis, showed that sequences in Cluster 1 exhibit a gentle trajectory of sequence similarity consistent with the gradual accumulation of mutations over time, whereas sequences in Cluster 2 show a more haphazard pattern of sequence similarity. These observations and others suggested that sequences in Cluster 1 represent the true evolutionary trajectory of the pandemic strain. We are now developing new computational approaches to illuminate the true evolutionary trajectory of virus sequences based on these observations.
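
The two quantities plotted in the analysis above, nucleotide differences and isolation year differences relative to the pandemic strain, can be computed with a short Python sketch. It is only an illustration under stated assumptions: the function names and toy sequences are hypothetical, the hits are assumed to be already aligned to the query at equal length, and a gap opposite a nucleotide is counted as a difference. It does not reproduce the clustering step, which the abstract describes as manual.

```python
def nucleotide_differences(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def trajectory_points(query_seq, query_year, hits):
    """Return (name, year difference, nucleotide differences) for each hit.

    hits: iterable of (name, aligned_sequence, isolation_year) tuples
    """
    return [
        (name, query_year - year, nucleotide_differences(query_seq, seq))
        for name, seq, year in hits
    ]

# Toy data, invented for illustration only.
query = "ATGGCGTCTCAA"
hits = [
    ("hit1", "ATGGCGTCTCAA", 2008),
    ("hit2", "ATGGCGTCCCAA", 2005),
    ("hit3", "ATGACGTCCCAG", 1999),
]
points = trajectory_points(query, 2009, hits)
```

Plotting the third element of each tuple against the second gives the kind of scatter plot described above. Hits on a gradual trajectory would be expected to show differences that grow steadily with the year gap.
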
These results suggest that this novel evolutionary trajectory (ET) analysis can pinpoint the host and geographic lineages of viral genome segments more effectively than BLAST-based phylogenetic analysis alone.

Supported by NIH N01AI40041.