Automata, Computability and Complexity: Theory & Applications - Appendix C

Introduction to Molecular Biology and Genetics

Sequence Matching

Techniques for determining protein and DNA sequences
1. An introduction to protein sequencing
The human genome project
Amino acid distance measures
1. A survey of distance measures
2. Which scoring method should I use?
Databases of known protein sequences
1. Prosite “is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor.”
2. Blocks “are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and Block Maker are aids to detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks (current version), retrieve blocks, and create new blocks, respectively.”
3. Prints “is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours.”
4. Pfam “is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.”
BLAST
1. The original paper describing BLAST, which “provides a method for rapid searching of nucleotide and protein databases. Since the BLAST algorithm detects local as well as global alignments, regions of similarity embedded in otherwise unrelated proteins can be detected.” BLAST can be used to search many different databases.
2. NCBI BLAST
3. WU-BLAST
HMMs for Sequence Matching
Using regular expressions to specify protein motifs
1. How to use regular expressions in Python, a nice real example of specifying a motif, and a short note on the different syntax used to search Prosite
2. ToxoDB: An example of a specific sequence database. Shows how to use Perl regular expressions to search it. Exploits numeric codes to represent classes of amino acids. For example, 6 will match any hydrophobic amino acid.
3. Prosite database of significant motifs and patterns
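As a rough illustration of the syntax difference noted in item 1, the sketch below translates a Prosite-style pattern into a Python regular expression and searches a sequence with it. It assumes the usual Prosite conventions: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repetition, and "<" or ">" to anchor at the start or end of the sequence. The function names prosite_to_regex and find_motif and the test sequence are hypothetical, chosen only for this example, and the numeric amino acid class codes used by ToxoDB are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regex string (sketch)."""
    pieces = []
    # Drop the trailing period, then handle one "-"-separated element at a time.
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.lstrip("<").rstrip(">")
        # Separate the residue part from an optional count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(pieces)


def find_motif(pattern, sequence):
    """Return (start index, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern, applied to a made-up test sequence.
glyco = "N-{P}-[ST]-{P}."
print(prosite_to_regex(glyco))
print(find_motif(glyco, "MNGSALNPTQ"))
```

Note that re.finditer reports only non-overlapping matches, so overlapping occurrences of a motif would need a lookahead-based search instead.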

Describing Protein Folding with CFGs

The Protein Folding Problem
1. Introduction to protein folding