Programs for biosequence analysis

This page describes example implementations in Java of some algorithms for pairwise alignment from the book Durbin et al: Biological Sequence Analysis, Cambridge University Press 1998.

Chapter 2: pairwise alignment
Chapter 3: hidden Markov models
Chapter 7: phylogenetic trees
Chapter 9: transformational grammars

The Dina Research School workshop pages: alignment algorithms and phylogeny using Phylip.

Chapter 2: Pairwise alignment

A Perl implementation of algorithms for pairwise alignment with graphic display of the alignment (local/global and linear/affine gap costs).
The algorithms for pairwise alignment as an applet that you can run in your local browser.
The source code for the applet alignment algoritms (in Java).
The additional applet source code for the alignment applet
Some seminar slides presenting the four basic algorithms for pairwise alignment

What algorithms are implemented

The Java implementation comprises the following algorithms for pairwise alignment:

NWSimple: Global alignment with simple linear gap costs, using the Needleman-Wunsch algorithm.
SWSimple: Local alignment with simple linear gap costs, using the Smith-Waterman algorithm.
RMSimple: Repeated matches with simple linear gap costs.
OMSimple: Overlap matches with simple linear gap costs.
NWAffine: Global alignment with affine gap costs, using the Needleman-Wunsch algorithm.
NWSmart: Global alignment with simple linear gap costs. Where all the above implementations require space proportional to the product of the sequence lengths, this one requires only space proportional to the sum of the sequence lengths. It reconstructs the full alignment using a recursive divide-and-conquer algorithm, calling NWSimple to solve the base cases (when one of the sequences have length 1).
SWSmart: Local alignment with simple linear gap costs, using the smart linear space algorithm (as in NWSmart).
AlignSmartAffine: Global alignment with affine gap costs, using the smart linear space algorithm (as in NWAffine and NWSmart).
SWSmartAffine: Local alignment with affine gap costs, using the smart linear space algorithm (as in AlignSmartAffine and SWSmart).

Structure of the implementation

The implementation is modular so that new scoring matrices can be used (currently it uses the BLOSUM50 scoring matrix) and new alignment algorithms can be developed with little effort.

Unless you are familiar with object-oriented programming languages, the modularity is likely to make the details of the program hard to understand. The use of a higher-order functional language, such as Standard ML, might have given a clearer and more concise program. Such languages provide better procedural abstraction than do object-oriented languages.

The `simple' and `affine' algorithms painstakingly build an array of Traceback objects while computing the optimal score. This provides for algorithmic clarity and simplifies the reconstruction of the optimal match. It may seem rather wasteful in terms of space, but eventually the `simple' and `affine' algorithms will be used only to handle the base cases when computing the traceback in a `smart' algorithm. This means that the total size of the traceback arrays allocated will be only linear in the sum of the sizes of the two sequences.

Chapter 3: Hidden Markov models

The source code for the hidden Markov model (HMM) algorithms

What algorithms are implemented

The implementation comprises the following classes and algorithms on hidden Markov models (HMMs):

HMM: The class of hidden Markov models. An HMM object includes state names, a matrix of transition probabilities, emission names, and a matrix of emission probabilities
Viterbi: Uses Viterbi decoding to compute the most probable path through an HMM for a for a given observed sequence.
Forward: Uses the so-called Forward algorithm on an HMM to compute the likelihood of a given observed sequence, and to build the forward table.
Backward: Uses the so-called Backward algorithm on an HMM to compute the likelihood of a given observed sequence, and to build the backward table.
PosteriorProb: Computes the posterior probability of states in a HMM from given Forward and Backward tables. As in part 3 of the casino example we computed the posterior probability of the die being fair: Click to see the graph
baumwelch: Uses the Baum-Welch algorithm to estimate the parameters of an HMM when given its structure (number of states and emissions) and given an observed sequence.

Chapter 7: Phylogenetic trees

The algorithms for phylogenetic trees as an applet that you can run in your local browser
The source code for generating and drawing phylogenetic trees
The additional applet source code for a graphical user interface to the above program

What algorithms are implemented

The implementation comprises the following classes and algorithms for phylogenetic trees:

UPCluster: The class of rooted trees as constructed by the UPGMA algorithm
UPGMA: The UPGMA algorithm for constructing rooted phylogenetic trees. The algorithm takes as input the lower triangle of a symmetric distance matrix and contructs a rooted tree in the form of a UPCluster object. The algorithm does not check that the distance matrix satisfies the triangle inequality (i.e., defines a metric), nor that it is an ultrametric (which is a requirement for proper application of the UPGMA algorithm) or additive.
NJCluster: The class of rooted trees as constructed by the neighbour joining algorithm.
NJ: The neighbour joining algorithm for constructing phylogenetic trees. The algorithm takes as input the lower triangle of a symmetric distance matrix and contructs a rooted tree in the form of a NJCluster object. Really, the algorithm should construct an unrooted tree, but this version arbitrarily adds a root node with appropriate edges to the last two active leaves in the construction. The algorithm does not check that the distance matrix satisfies the triangle inequality (i.e., defines a metric), nor that it is an ultrametric or additive (which is a requirement for proper application of the neighbour joining algorithm).
The demo applet: This applet permits interactive experimentation with distance matrices and the resulting UPGMA and neighbour joining trees. The applet starts up with the example in Figure 7.7 on page 170 of Durbin et al. (for which the tree reconstructed by UPGMA is wrong, whereas that reconstructed by neighbour joining is correct). The applet permits interactive editing of the distance matrix, which may have arbitrary dimension. Pressing a button may fill the distance matrix with a random data set. The data set satisfies the triangle inequality -- in fact, is it a two-dimensional Euclidean distance -- but it may not define an ultrametric nor be additive.

Chapter 9: Transformational grammars

Some seminar slides presenting parts of the chapter on transformational grammars

Peter Sestoft (sestoft@itu.dk) 1999-10-01, 1999-12-19