Programs for biosequence analysis

This page describes example implementations in Java of some algorithms for pairwise alignment from the book Durbin et al: Biological Sequence Analysis, Cambridge University Press 1998.

The Dina Research School workshop pages: alignment algorithms and phylogeny using Phylip.

Chapter 2: Pairwise alignment

What algorithms are implemented

The Java implementation comprises the following algorithms for pairwise alignment:
NWSimple
Global alignment with simple linear gap costs, using the Needleman-Wunsch algorithm.
SWSimple
Local alignment with simple linear gap costs, using the Smith-Waterman algorithm.
RMSimple
Repeated matches with simple linear gap costs.
OMSimple
Overlap matches with simple linear gap costs.
NWAffine
Global alignment with affine gap costs, using the Needleman-Wunsch algorithm.
NWSmart
Global alignment with simple linear gap costs. Where all the above implementations require space proportional to the product of the sequence lengths, this one requires only space proportional to the sum of the sequence lengths. It reconstructs the full alignment using a recursive divide-and-conquer algorithm, calling NWSimple to solve the base cases (when one of the sequences have length 1).
SWSmart
Local alignment with simple linear gap costs, using the smart linear space algorithm (as in NWSmart).
AlignSmartAffine
Global alignment with affine gap costs, using the smart linear space algorithm (as in NWAffine and NWSmart).
SWSmartAffine
Local alignment with affine gap costs, using the smart linear space algorithm (as in AlignSmartAffine and SWSmart).

Structure of the implementation

The implementation is modular so that new scoring matrices can be used (currently it uses the BLOSUM50 scoring matrix) and new alignment algorithms can be developed with little effort.

Unless you are familiar with object-oriented programming languages, the modularity is likely to make the details of the program hard to understand. The use of a higher-order functional language, such as Standard ML, might have given a clearer and more concise program. Such languages provide better procedural abstraction than do object-oriented languages.

The `simple' and `affine' algorithms painstakingly build an array of Traceback objects while computing the optimal score. This provides for algorithmic clarity and simplifies the reconstruction of the optimal match. It may seem rather wasteful in terms of space, but eventually the `simple' and `affine' algorithms will be used only to handle the base cases when computing the traceback in a `smart' algorithm. This means that the total size of the traceback arrays allocated will be only linear in the sum of the sizes of the two sequences.

Chapter 3: Hidden Markov models

What algorithms are implemented

The implementation comprises the following classes and algorithms on hidden Markov models (HMMs):
HMM
The class of hidden Markov models. An HMM object includes state names, a matrix of transition probabilities, emission names, and a matrix of emission probabilities
Viterbi
Uses Viterbi decoding to compute the most probable path through an HMM for a for a given observed sequence.
Forward
Uses the so-called Forward algorithm on an HMM to compute the likelihood of a given observed sequence, and to build the forward table.
Backward
Uses the so-called Backward algorithm on an HMM to compute the likelihood of a given observed sequence, and to build the backward table.
PosteriorProb
Computes the posterior probability of states in a HMM from given Forward and Backward tables. As in part 3 of the casino example we computed the posterior probability of the die being fair: Click to see the graph
baumwelch
Uses the Baum-Welch algorithm to estimate the parameters of an HMM when given its structure (number of states and emissions) and given an observed sequence.

Chapter 7: Phylogenetic trees

What algorithms are implemented

The implementation comprises the following classes and algorithms for phylogenetic trees:
UPCluster
The class of rooted trees as constructed by the UPGMA algorithm
UPGMA
The UPGMA algorithm for constructing rooted phylogenetic trees. The algorithm takes as input the lower triangle of a symmetric distance matrix and contructs a rooted tree in the form of a UPCluster object. The algorithm does not check that the distance matrix satisfies the triangle inequality (i.e., defines a metric), nor that it is an ultrametric (which is a requirement for proper application of the UPGMA algorithm) or additive.
NJCluster
The class of rooted trees as constructed by the neighbour joining algorithm.
NJ
The neighbour joining algorithm for constructing phylogenetic trees. The algorithm takes as input the lower triangle of a symmetric distance matrix and contructs a rooted tree in the form of a NJCluster object. Really, the algorithm should construct an unrooted tree, but this version arbitrarily adds a root node with appropriate edges to the last two active leaves in the construction. The algorithm does not check that the distance matrix satisfies the triangle inequality (i.e., defines a metric), nor that it is an ultrametric or additive (which is a requirement for proper application of the neighbour joining algorithm).
The demo applet
This applet permits interactive experimentation with distance matrices and the resulting UPGMA and neighbour joining trees. The applet starts up with the example in Figure 7.7 on page 170 of Durbin et al. (for which the tree reconstructed by UPGMA is wrong, whereas that reconstructed by neighbour joining is correct). The applet permits interactive editing of the distance matrix, which may have arbitrary dimension. Pressing a button may fill the distance matrix with a random data set. The data set satisfies the triangle inequality -- in fact, is it a two-dimensional Euclidean distance -- but it may not define an ultrametric nor be additive.

Chapter 9: Transformational grammars


Peter Sestoft (sestoft@itu.dk) 1999-10-01, 1999-12-19