2006 Bioinformatics preliminary exam, day 1.

Answer 8 of the following 10 questions. Each question counts equally, so do the easiest ones first, and spend an average of 1 hour per question. This is a time-limited exam, so be aware of the clock. Partial results are better than none, so show that you know relevant information even if you cannot answer a question completely. Be sure to read the entire question before starting to answer.

This is an open book, open internet exam; however, you must cite all resources that you used to answer a question. Use of any material without attribution is academic dishonesty, and possible consequences include failing the exam.

If you want clarification about a question, you may email me. Any clarifications I make will be emailed to everyone simultaneously. You must email your answers to me by 6pm at Larry.Hunter@uchsc.edu. Good luck!

1. You have been assigned to develop a simulation of the evolution of a population of proteins in order to understand how misfolding might affect the molecular evolutionary process. Describe the algorithm/protocol you would use, and describe the modeling trade-offs you might face with regard to realism in thermodynamic, structural, and population genetics considerations.

2. One proposal for why so many biological networks are scale-free is Barabasi's "rich get richer" model, in which the probability of adding a connection to a node is proportional to the number of existing connections. How would you test this model for coexpression networks using microarrays from several species (e.g. human, mouse, and rat)? Be specific about what calculations you would make and why.

3. The Expectation Maximization (EM) algorithm is used to estimate model parameters when there are missing or unobserved data. Describe a problem in analyzing molecular sequences (protein or DNA) where the EM algorithm is used. What are the parameters and the missing data in this problem? Sketch the E and M steps of the algorithm.

4. Suppose you obtain the raw image files from a cDNA gene expression experiment from your collaborator, who has asked you to find the genes that are differentially expressed across 8 arrays of biological replicates. Outline the steps in your analysis. Discuss the purpose of each step. Describe an example method for each step.

5. The Novartis SymAtlas (http://symatlas.gnf.org/) contains several tissue-specific expression datasets. Using these data, find a gene whose expression appears to be specific to human prefrontal cortex. How did you do that? Provide supporting evidence from at least one other database that the gene you identified is in fact specific to human prefrontal cortex. Explain why you selected the particular database and why the evidence you present supports your claim.

6. Affymetrix technology assays mRNA abundance by using oligonucleotides. Increased probe density has made genome tiling chips and exon arrays practical. Define these two types of chips, and contrast them with earlier generations. Now, imagine that you have data on the expression levels of all of the exons from a single gene taken from several hundred different conditions (i.e. chips). Using those data, how could you estimate how many different alternative splice forms were present in at least one condition, and what the expression levels of each form were in each condition? What assumptions did you have to make? Can you assess the probability that your estimates are accurate? How?

7. A colleague has come to you with a list of about 75 yeast genes that she believes are co-regulated. She has asked you to analyze the upstream regions of the genes to find conserved regulatory signals. Describe in detail what you will do to provide relevant information to her. Also, explain what kinds of false positives and false negatives might still remain after your analysis.

8. A molecular network is a model that uses nodes to represent genes or proteins, and links (or edges) between nodes to represent a binary relationship between genes/proteins (e.g., physical protein interactions). Molecular networks are increasingly used in systems biology to predict the function of uncharacterized proteins, to predict components of protein complexes, and to better understand evolution and the robustness of molecular processes, among other purposes. However, assessing the significance of the results of network analyses can be a challenge. For example, is it meaningful that 30% of interacting proteins share a GO function in a particular network? Such questions are commonly answered using randomization techniques. Sometimes the network is held constant and the identities of the proteins are randomized. For other problems, random networks are generated with the same number of nodes and edges as the actual molecular network. Give a reference for a paper that uses each randomization method. Describe the types of questions that can be meaningfully assessed with each technique. Besides the number of nodes and edges, what other properties of the network might we keep constant during randomization? Why might this be appropriate?

9. You have been assigned to keep track of all developments regarding GW501516, a pharmaceutical in development that is a subtype-selective agonist of peroxisome proliferator-activated receptor delta. This is an active area of research, with dozens of articles published in the last few years. First, put together a short on-line report of the gene, the drug, and recent research results about them. Describe the tools, databases, and strategies you used to create the description. Now, describe a strategy and a set of tools/technologies you would use to keep your report up to date about related new developments.

10. The gold standard of sequence alignment algorithms is an alignment based on structure. Cite three structural alignment methods and give a flow chart of the main algorithms used for each method. List the main differences between these three methods. Discuss how these differences lead to ambiguity in the resulting structural alignments.