Introduction to Software for the Analysis of Sequence Similarity.

This tutorial was prepared by Mauricio La Rota for the Cornell University course PL BR 607, taught in fall 2000 by Dave Schneider.

The handouts from Dave Schneider's lectures are available here (together with the animation automaton). The three homework assignments associated with the course are available as well (H1 + sequence, H2 + sequence, and the final homework).
  

The purpose of this page is to introduce the PB607 student to a few concepts used by sequence comparison programs, and to the mechanics of what are probably the most widely used programs for comparing biological sequences: the BLAST and FASTA families. It then introduces the multiple sequence alignment program ClustalW/ClustalX. The tutorials are meant to be VERY simple, targeted at the student who has never looked at similarity between biological sequences.

Goals:

At the end of these readings, the student should be familiar with the common formats used to represent biological sequences, and should feel comfortable with the use of the programs described.  Additionally, the student should be able to extract any sequence from any of the public database servers.

Please complete all readings before the start of the PB607 course (mid-October 2000).

Generic Table of Contents

(Click on the image next to a title to go to that page.)

Book sources

Introductory readings

(with on-line synopses of pertinent sections of books)
  1. Internet resources (optional) 
  2. Sequence databases (biological content)
    1. GenBank. Nucleic acid databases (dbEST, nr/nt, genomes, etc.)
    2. GenBank. Protein databases (SwissProt, PDB, PIR, etc.)
    3. Other protein databases (not included in GenBank)
    4. Gene indices (TIGR, NCBI, SANBI, others)
    5. Metabolic Pathways: The KEGG database
  3. How to get sequences: Entrez (from point of view of Web search engines)
    1. A generalization of "find" in web browsers and search engines
    2. Basic searching & Refined searching (boolean expressions, limits, etc.)
  4. Homology vs. similarity 
  5. Sequence comparisons and scores
    1. Pairwise and Multiple
    2. Database Scanning
  6. Reminder of Basic Statistics (e.g., "What is a mean?")
  7. Repetitive and low complexity regions
    1. What are they?
    2. Why do they matter?
    3. Masking to improve biological fidelity of results
  8. References

Tutorials

  1. Entrez
    1. Basic searches
    2. Refined searches
    3. Downloading in bulk
  2. Database Scanning with heuristics (what happens behind the common heuristics)
    1. Using BLAST
      1. Survey of functions (blastn, blastp, blastx and tblastx) and interpretation of results
      2. Modes of Operation (local vs. network and web clients)
      3. Book Readings & Online BLAST tutorial (from the NCBI educational pages)
      4. Coffee break at NCBI: some biological examples
    2. Using FASTA
  3. Using ClustalW

Appendices (under construction)

  1. Glossary
  2. Data file formats (most of them are also covered in the Baxevanis book, chapter 2)
    1. PDB
    2. GenBank
    3. EMBL
    4. FASTA
    5. ASN.1
  3. Advanced BLAST options
    1. PSI-BLAST
  4. Web resources
    1. NCBI Web pages
    2. FASTA servers




 

Book sources.

The book chapters that we will ask you to read come from the following list (available on reserve in Mann Library):
 

1
    Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
    Andreas D. Baxevanis and B. F. Francis Ouellette (editors)
    June 1998 (volume 39 of Methods of Biochemical Analysis)
    ISBN 0-471-19196-5
    John Wiley & Sons, Inc., New York, NY 10158-0012. Call number: QD271 M59 v.39

    From this book, these readings are optional but recommended:
        Chapters 1 and 2, Appendix 2
    And these are required readings:
        Pages 98, 101-120
        Pages 145-188

2
    Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
    Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison
    May 1, 1998
    ISBN 0-521-62041-4 or 0-521-62971-3 (paperback)
    Cambridge University Press

    From this book, this reading is optional but recommended: Introductory chapter.

3
    Jefferys, W.H. and Berger, J.O. 1992. Ockham's Razor and Bayesian Analysis.
    American Scientist 80:64-72 (also available on reserve in Mann Library)


For Your Information:
    A list of important terms is available in the local glossary (a few concepts were added to those already explained in NCBI's glossary).

For more books on bioinformatics and computational biology in general, see the Local Book Table and a second list at http://www.iscb.org/books.html
 



 
This set of pages was designed by Mauricio La Rota. August 2000.