1. (2pts) How can structural information be used to improve multiple sequence alignments?

  2. (3pts) Use ClustalW to align pairs of these three protein fragments (align A to B, B to C, A to C):
    >A Protein
    MFAQAFPGDFD
    >B Protein
    MFFFPGDYD
    >C Protein
    MFFFQAFFPGDYD

    Now align all three (A to B to C).
    Now reverse the order in the entry box and align all three again (C to B to A).

    1. What do the results tell you about the alignment method of ClustalW (is it progressive or iterative)?
    2. Make a single manual adjustment to improve the final alignment of all three proteins.

  3. (4pts) There’s a typo in a subscript of the “Viterbi Picture” slide from the HMM lecture. Find it by tracing through the example.

  4. (8pts) Briefly describe in words the two main steps of the EM algorithm as applied to HMM parameter fitting.

  5. (7pts) The source of variation in molecular evolution can be single nucleotide mutations or insertions/deletions of one or more nucleotides.
    1. Write a regular expression that matches these three sequences:
      ACDACC
      ACDCCC
      ACDDDACC
    2. Build a profile of the following alignment (treat the gap just like any other character):
      ACD--ACC
      ACD--CCC
      ACDDDACC
    3. Build a probabilistic Markov model of the alignment.
    4. Given your model, compute the probability of the sequence ACDDDACC.
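
(Hint: The profile and probability steps above can be sketched in code. The following is a minimal illustration, not a solution: it assumes each alignment column is treated independently, counts the gap character like any other symbol, and uses no pseudocounts. The function names `column_profile` and `sequence_probability` and the toy alignment are hypothetical and deliberately differ from the alignment in the question.)

```python
from collections import Counter


def column_profile(alignment):
    """Return one {symbol: relative frequency} dict per alignment column.

    Assumption: '-' is counted like any other character.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


def sequence_probability(profile, aligned_seq):
    """Multiply per-column frequencies for an already aligned sequence.

    Assumption: columns are independent and unseen symbols get
    probability 0 (no pseudocounts).
    """
    if len(aligned_seq) != len(profile):
        raise ValueError("sequence must have one symbol per profile column")
    prob = 1.0
    for freqs, sym in zip(profile, aligned_seq):
        prob *= freqs.get(sym, 0.0)
    return prob


# Hypothetical toy alignment, unrelated to the sequences in the question.
toy_alignment = ["GA-T", "GACT", "GG-T"]
toy_profile = column_profile(toy_alignment)
print(toy_profile)
print(sequence_probability(toy_profile, "GACT"))
```

A Markov model of the alignment, as asked for in part 3, would additionally need explicit states and transition probabilities (for example, for the optional insert region); the sketch above covers only the per-column emission frequencies.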

  6. (4pts) What assumption does the Viterbi HMM training algorithm make? When might this not be a good assumption?

  7. (14pts) Briefly describe the main conceptual similarities and differences between the clone-by-clone genome sequencing method used by the original Human Genome Project and the whole-genome shotgun approach described by Gene Myers and used by Celera Genomics.

  8. (3pts) Celera pooled DNA samples from a few individuals to assemble its draft human genome sequence. What potential problem could this have introduced, and how was it handled? (Hint: This is a sequence assembly problem, and the solution is related to the repeat problem.)

  9. (3pts) Computational sequence assembly methods require reads from randomly sampled fragments. Explain one potential source of bias in this sampling. What can be done to minimize this bias or to compensate for it?

  10. (3pts) What is a gene? Be as specific as possible.

  11. (2pts) First-order HMMs are often used to model proteins. What order of HMM is appropriate for DNA models, and why?

  12. (2pts) What is “semi” about a semi-Markov model, and how does it apply to exon finding?

  13. (4pts) Describe the differences in possible decision boundaries resulting from linear discriminant analysis and a neural network. Describe a potential problem with either one.

  14. (3pts) Why might ESTs not show up for some real exons in expression experiments? Provide at least 3 reasons.

  15. (2pts) Is gene finding generally easier in prokaryotes or eukaryotes? Why?

  16. (3pts) What biases exist between coding and noncoding regions of the genome?

  17. (6pts) What is the shortest human exon? How did you find it?

  18. (5pts) What proportion of known human genes contain no introns?

   
         