BIOSCI359 Assignment 2: Sequence Alignment
22 August 2006, Alexei Drummond
Due: 4pm Friday 15th September
at the Biology Student Resource Centre
Hand in: For exercises 1-4 hand
in written answers to the questions. For exercise 5 hand in a printout
of your code. In addition email a copy of your program to alexei AT
cs.auckland.ac.nz so that I can run it.
Introduction
A common use of computers in bioinformatics is pairwise or multiple
alignment of DNA and protein sequences. Alignment of two (or more)
sequences is a simple process when there is no length polymorphism or
where the sequences are very similar. Problems arise when recognizable
homology is marginal and there have been insertions and/or deletions in
the sequences.
Standard alignment algorithms are designed to find the alignment
with the greatest score, with character matches contributing positively
to the score and mismatches contributing negatively. As such, the
program is searching on the basis of sequence similarity. Similarity is
not the same as homology. Homology assumes that characters are derived
from a common ancestor. As a consequence, computer generated alignments
may not reflect truly homologous sequences and the best character
alignment for some scoring system may not be the correct alignment.
Many sequence comparison programs are now available for a whole
range of different computer types. These programs often form part of
'packages', some of which are used on a fee-paying basis. The School of
Biological Sciences is a development partner with a local software
company (Biomatters), who produce a bioinformatics software package
called Geneious.
The following exercises have been designed to introduce you to some
of the bioinformatics algorithms that are fundamental to molecular
genetics.
This assignment will use a number of different functions within the
Geneious software to demonstrate sequence alignment. The lab is
intended to simply act as an introduction to some of the algorithms
rather than being a set exercise. We have included a few exercises but
more will be learnt by self-experimentation with the packages than it
would by following a set of instructions.
Summary of the bioinformatics analyses you will perform.
- Needleman and Wunsch (globally optimized) alignment
- Smith and Waterman (locally optimized) alignment
- BLAST
- Produce a dot plot comparison of two sequences
- Find open reading frames
- Translate DNA sequence to peptide sequence
- Perform a multiple sequence alignment
You will find the latest version (2.0.4) of Geneious is installed on
the lab computers. The sequences that you will use can be downloaded as
a Geneious file here.
Download and save this file on your hard drive.
The programs that you will be running are simply accessed through the
menus and tool bar of the Geneious application. This application is
user friendly and comes with a detailed user manual.
Before starting the exercises, make sure that you have first gone
through the Geneious Tutorial (available under the Help menu).
Exercise 1: Pairwise alignment
Before you start you should import the assignment2.geneious
file into Geneious using the File -> Import -> From File... menu
option. Make sure you select the "Geneious" file type in the drop-down
menu. After importing this file a new Folder called "Assignment 2"
should become visible within Geneious in the local document hierarchy
in the left-hand panel.
STEP 1
Compare the sequences named align1 and align2 using the
Needleman-Wunsch (Global) and Smith-Waterman (Local) algorithms using
the default scoring matrix and gap open and gap extension penalties.
Both of these programs use essentially the same algorithm but
Smith-Waterman looks for a local optimum rather than a global optimum.
To align the two sequences simple select both of them in the document
table and click on the "Alignment" button in the tool bar. A drop-down
option will let you choose between global and local alignment options.
STEP 2
Repeat the process, but first reverse-complement the align2 sequence.
This can be achieved by selecting the align2 sequence, clicking the
mouse in the sequence, pressing Ctrl+A (or Command+A on Mac OS X) to
select the whole sequence and then clicking on the "Reverse Complement"
button in the sequence viewer.
QUESTION 1: How did the programs perform and what
regions of sequence similarity did you find with the different
algorithms? Report the regions of similarity and explain the different
results in terms of what you know about how the different algorithms
work. Do different gap open and extension penalties result in
substantially different alignments?
STEP 3
Now view the regions of similarity as a dotplot. This can be done by
selecting the two sequences in the document table and then clicking on
the "Dotplot Viewer" tab in the viewer panel.
QUESTION 2: What features do you
detect in the sequences based on the dotplot? Did you detect any more
elements than you had found with the Needleman-Wunsch and
Smith-Waterman algorithms? Draw a "genetic map" showing the extent and
orientation of the regions of homology between the two sequences,
including the locations of any repeats that you have found.
As a last exercise, you have been given two 16S rRNA genes from
distantly related bacteria. Compare the two with Needleman-Wunsch and
with Dotplot Viewer, and compare the E.
coli sequence to its own reverse sequence using the Dotplot
Viewer. You will have to play around with the window size and the
stringency - particularly on the last plot where it is better to use a
small window size.
QUESTION 3: What did you detect? What
is it that you are seeing in the dotplot when you compare a 16S rRNA
gene to itself in
reverse?
Exercise 2: Translating sequences
This exercise will demonstrate the difficulty of aligning protein
coding sequences. Alignments of closely related sequences are easy to
perform - the alignment is unequivocal. When you work with more
distantly related sequences, the correct alignment may not be quite so
obvious and the computer may get it wrong - particularly if you feed it
inappropriate parameters.
We will use two cytochrome b gene sequences which are very different
(human and wheat).
We only want to compare the coding regions of the two genes and so must
exclude the flanking DNA. Hence, we must find the start and stop
codons. Use the ORF finding function in Geneious to do this.
Tip: Although the genetic code is
said to be universal, some organisms have slight variations, and the
mitochondria of most organisms have differences in their codons for
tryptophan (W in single letter code) and stop codons. In mammalian
mitochondria, the normal stop codon UGA, encodes tryptophan instead of
STOP. A translation can be incorrect if you forget to specify the
correct translation table (genetic code).
Select the wheat_cytb sequence and choose the "Find ORFs" option
in the Tool menu. The default minimum ORF length will be fine. Make
sure that you choose the appropriate mitochondrial translation table to
do the translation.
Do this for both wheat and human sequences (using appropriate
translation tables) and then look at the resulting annotations.
Translation
Performing the translation itself is pretty much self-explanatory. Just
select the ORF you want to translate and click the the translate button.
Translate both the human and wheat sequences into peptide sequences. As
a last exercise, use the Needleman-Wunsch alignment to compare both the
DNA sequences and the peptide sequences.
QUESTION 4: How does the level of
identity change when you use peptide sequences and what does it mean?
What clues do you detect on the DNA alignment to show that some regions
have aligned correctly and others incorrectly?
Exercise 3: Multiple alignment
You are provided with a set of 16S rRNA sequences. These sequences are
not the whole length of the gene but are of the V1-V2 region of a
number of Clostridium sp.
They have been especially chosen because they demonstrate that
alignment is not always an easy task.
Geneious Progressive Pairwise Alignment
Select all off the 16S sequences and click on the "Alignment" button in
the toolbar. Because you have selected more than two sequences you will
have a few more options. Try aligning the sequences provided with a
simple progressive pairwise alignment algorithm. To do this, set the
number of refinements to zero. Have a close look at the alignment the
program produces. Try a few different gap creation and extension
penalties and compare the results.
CLUSTALW
An alternative progressive pairwise alignment program is CLUSTALW.
Geneious does not come bundled with CLUSTALW, but if you install
CLUSTALW on your computer then Geneious can be used to locate and run
it. In many ways CLUSTALW is a more sophisticated alignment algorithm
for amino-acid alignment, however for DNA alignments it uses algorithms
which are almost identical to those built into Geneious. The best thing
to do is to play around with these two alignment algorithms and see how
they compare. We are not questioning you on the use of ClustalW but you
should become familiar and comfortable with different multiple
alignment algorithms as they are the cornerstone of many Bioinformatics
processes.
Things you should explore:
- Changing the multiple alignment parameters
- See what happens when you use extreme values for gap creation
- Split your sequences up into two files. Align one set of
sequences first and then the other. Select the two alignments and click
the "Alignment" button to align them by profile alignment.
- Save the results in different formats. Common formats for
phylogeneticists are NEXUS and PHYLIP. Look at the resulting files in a
text editor to familiarize yourself with them.
- Try hand-editing the alignment (click the edit button in the
sequence viewer). Do you think you could improve the alignment beyond
any of the tricks used with the automated alignment methods?
QUESTION 5: What strategies for
automated multiple sequence alignment did you find worked best?
Exercise 4: Aligning sequences by hand
Use the Needleman-Wunsch algorithm and the BLOSUM 50 substitution
matrix to manually align the two following peptide sequences.
GSLPH
ALLR
Use a gap weighting of -8.
Use a spreadsheet to create a table like the one below. Then using the
Needleman-Wunsch algorithm and the BLOSUM50 matrix above, fill out the
table by hand:
Now, find the traceback path through the matrix, and infer the
alignment.
QUESTION 6: Present the F matrix and
your inferred alignment.
Exercise 5: Programming task
The final part of this assignment will involve completing a small Java
program that implements a simple sequence comparison task. The task is
to takes two strings as input and report the
number of "hits" between the two sequences. A hit is defined as a
ungapped match of length W with a minimum score of T. This programming
task will be
marked for correctness, code style and speed. The Test class provided
will be
used to test your code. Your solution should should complete the
MyHitCounter implementation the
HitCounter interface. As a guideline, my (very bad) default
implementation of MyHitCounter
takes about 10 seconds for both the protein and nucleotide tests on my
laptop. I
would expect good implementations to be 2-20 times faster.
This task is deceptively easy! My default implementation is only 10
lines long. The craft is in getting it working *fast* without
compromising code quality and maintainability. I will tolerate small
changes to any of the other classes (refactoring) if there are good
reasons for it. But please note any such changes in the comments of
your MyHitCounter.java file. I expect this part of the assignment to be
handed in by email to myself in a zip file containing all the java
files with any instructions and implementation notes in the javadoc
comments of MyHitCounter.
MyHitCounter.java
HitCounter.java
Test.java
The following classes are also needed to run the Test
and have been adapted from the JEBL sourceforge project:
AminoAcidScores.java
Blosum62.java
Nucleotides.java
NucleotideScores.java
NucleotideState.java
ScoreMatrix.java
Scores.java
State.java
Evaluation
The evalution of you program will be broken up in the following way:
- Correctness, 50%
- Code style and documentation 25%
- Speed 25%
Ref: Chapter 2, "Biological Sequence
Analysis" Durbin, Eddy. Krough, Mitchison.