Programs for biosequence analysis
This page describes example implementations in Java of some algorithms
for pairwise alignment from the book Durbin et al: Biological Sequence
Analysis, Cambridge University Press 1998.
The Dina Research School workshop pages: alignment algorithms and phylogeny using Phylip.
Chapter 2: Pairwise alignment
What algorithms are implemented
The Java implementation comprises the following algorithms for
pairwise alignment:
- NWSimple
- Global alignment with simple linear gap costs, using the
Needleman-Wunsch algorithm.
- SWSimple
- Local alignment with simple linear gap costs, using the
Smith-Waterman algorithm.
- RMSimple
- Repeated matches with simple linear gap costs.
- OMSimple
- Overlap matches with simple linear gap costs.
- NWAffine
- Global alignment with affine gap costs, using the
Needleman-Wunsch algorithm.
- NWSmart
- Global alignment with simple linear gap costs. Where all the
above implementations require space proportional to the
product of the sequence lengths, this one requires only space
proportional to the sum of the sequence lengths. It
reconstructs the full alignment using a recursive divide-and-conquer
algorithm, calling NWSimple to solve the base cases (when one of the
sequences have length 1).
- SWSmart
- Local alignment with simple linear gap costs, using the smart
linear space algorithm (as in NWSmart).
- AlignSmartAffine
- Global alignment with affine gap costs, using the smart linear
space algorithm (as in NWAffine and NWSmart).
- SWSmartAffine
- Local alignment with affine gap costs, using the smart linear
space algorithm (as in AlignSmartAffine and SWSmart).
Structure of the implementation
The implementation is modular so that new scoring matrices can be
used (currently it uses the BLOSUM50 scoring matrix) and new alignment
algorithms can be developed with little effort.
Unless you are familiar with object-oriented programming languages,
the modularity is likely to make the details of the program hard to
understand. The use of a higher-order functional language, such as
Standard ML, might have given a clearer and more concise program.
Such languages provide better procedural abstraction than do
object-oriented languages.
The `simple' and `affine' algorithms painstakingly build an array
of Traceback objects while computing the optimal score. This provides
for algorithmic clarity and simplifies the reconstruction of the
optimal match. It may seem rather wasteful in terms of space, but
eventually the `simple' and `affine' algorithms will be used only to
handle the base cases when computing the traceback in a `smart'
algorithm. This means that the total size of the traceback arrays
allocated will be only linear in the sum of the sizes of the two
sequences.
Chapter 3: Hidden Markov models
- The
source code for the hidden Markov model (HMM) algorithms
What algorithms are implemented
The implementation comprises the following classes and algorithms on
hidden Markov models (HMMs):
- HMM
- The class of hidden Markov models. An HMM object includes state
names, a matrix of transition probabilities, emission names, and a
matrix of emission probabilities
- Viterbi
- Uses Viterbi decoding to compute the most probable
path through an HMM for a for a given observed sequence.
- Forward
- Uses the so-called Forward algorithm on an HMM to
compute the likelihood of a given observed sequence, and to build the
forward table.
- Backward
- Uses the so-called Backward algorithm on an HMM to
compute the likelihood of a given observed sequence, and to build the
backward table.
- PosteriorProb
- Computes the posterior probability of states in
a HMM from given Forward and Backward tables. As in part 3 of the
casino example we computed the posterior probability of the die being
fair: Click to see the graph
- baumwelch
- Uses the Baum-Welch algorithm to estimate the
parameters of an HMM when given its structure (number of states and
emissions) and given an observed sequence.
Chapter 7: Phylogenetic trees
- The algorithms for phylogenetic trees
as an applet that you can run in your local browser
- The
source code for generating and drawing phylogenetic trees
- The additional
applet source code for a graphical user interface to the above
program
What algorithms are implemented
The implementation comprises the following classes and algorithms for
phylogenetic trees:
- UPCluster
- The class of rooted trees as constructed by the UPGMA algorithm
- UPGMA
- The UPGMA algorithm for constructing rooted phylogenetic
trees. The algorithm takes as input the lower triangle of a symmetric
distance matrix and contructs a rooted tree in the form of a UPCluster
object. The algorithm does not check that the distance matrix
satisfies the triangle inequality (i.e., defines a metric), nor that
it is an ultrametric (which is a requirement for proper application of
the UPGMA algorithm) or additive.
- NJCluster
-
- The class of rooted trees as constructed by the neighbour
joining algorithm.
- NJ
- The neighbour joining algorithm for constructing
phylogenetic trees. The algorithm takes as input the lower triangle
of a symmetric distance matrix and contructs a rooted tree in the form
of a NJCluster object. Really, the algorithm should construct an
unrooted tree, but this version arbitrarily adds a root node with
appropriate edges to the last two active leaves in the construction.
The algorithm does not check that the distance matrix satisfies the
triangle inequality (i.e., defines a metric), nor that it is an
ultrametric or additive (which is a requirement for proper application
of the neighbour joining algorithm).
- The demo applet
- This
applet permits interactive experimentation with distance matrices and
the resulting UPGMA and neighbour joining trees. The applet starts up
with the example in Figure 7.7 on page 170 of Durbin et al. (for which
the tree reconstructed by UPGMA is wrong, whereas that reconstructed
by neighbour joining is correct). The applet permits interactive
editing of the distance matrix, which may have arbitrary dimension.
Pressing a button may fill the distance matrix with a random data set.
The data set satisfies the triangle inequality -- in fact, is it a
two-dimensional Euclidean distance -- but it may not define an
ultrametric nor be additive.
Chapter 9: Transformational grammars
- Some seminar slides presenting
parts of the chapter on transformational grammars
Peter Sestoft (sestoft@itu.dk) 1999-10-01,
1999-12-19