Assignment 2

BIOSCI359 Assignment 2: Sequence Alignment

22 August 2006, Alexei Drummond

Due: 4pm Friday 15th September at the Biology Student Resource Centre

Hand in: For exercises 1-4 hand in written answers to the questions. For exercise 5 hand in a printout of your code. In addition email a copy of your program to alexei AT cs.auckland.ac.nz so that I can run it.

Introduction

A common use of computers in bioinformatics is pairwise or multiple alignment of DNA and protein sequences. Alignment of two (or more) sequences is a simple process when there is no length polymorphism or where the sequences are very similar. Problems arise when recognizable homology is marginal and there have been insertions and/or deletions in the sequences.

Standard alignment algorithms are designed to find the alignment with the greatest score, with character matches contributing positively to the score and mismatches contributing negatively. As such, the program is searching on the basis of sequence similarity. Similarity is not the same as homology. Homology assumes that characters are derived from a common ancestor. As a consequence, computer generated alignments may not reflect truly homologous sequences and the best character alignment for some scoring system may not be the correct alignment.

Many sequence comparison programs are now available for a whole range of different computer types. These programs often form part of 'packages', some of which are used on a fee-paying basis. The School of Biological Sciences is a development partner with a local software company (Biomatters), who produce a bioinformatics software package called Geneious.

The following exercises have been designed to introduce you to some of the bioinformatics algorithms that are fundamental to molecular genetics.

This assignment will use a number of different functions within the Geneious software to demonstrate sequence alignment. The lab is intended to simply act as an introduction to some of the algorithms rather than being a set exercise. We have included a few exercises but more will be learnt by self-experimentation with the packages than it would by following a set of instructions.

Summary of the bioinformatics analyses you will perform.

Needleman and Wunsch (globally optimized) alignment
Smith and Waterman (locally optimized) alignment
BLAST
Produce a dot plot comparison of two sequences
Find open reading frames
Translate DNA sequence to peptide sequence
Perform a multiple sequence alignment

You will find the latest version (2.0.4) of Geneious is installed on the lab computers. The sequences that you will use can be downloaded as a Geneious file here. Download and save this file on your hard drive.

The programs that you will be running are simply accessed through the menus and tool bar of the Geneious application. This application is user friendly and comes with a detailed user manual.

Before starting the exercises, make sure that you have first gone through the Geneious Tutorial (available under the Help menu).

Exercise 1: Pairwise alignment

Before you start you should import the assignment2.geneious file into Geneious using the File -> Import -> From File... menu option. Make sure you select the "Geneious" file type in the drop-down menu. After importing this file a new Folder called "Assignment 2" should become visible within Geneious in the local document hierarchy in the left-hand panel.

STEP 1

Compare the sequences named align1 and align2 using the Needleman-Wunsch (Global) and Smith-Waterman (Local) algorithms using the default scoring matrix and gap open and gap extension penalties. Both of these programs use essentially the same algorithm but Smith-Waterman looks for a local optimum rather than a global optimum. To align the two sequences simple select both of them in the document table and click on the "Alignment" button in the tool bar. A drop-down option will let you choose between global and local alignment options.

STEP 2

Repeat the process, but first reverse-complement the align2 sequence. This can be achieved by selecting the align2 sequence, clicking the mouse in the sequence, pressing Ctrl+A (or Command+A on Mac OS X) to select the whole sequence and then clicking on the "Reverse Complement" button in the sequence viewer.

QUESTION 1: How did the programs perform and what regions of sequence similarity did you find with the different algorithms? Report the regions of similarity and explain the different results in terms of what you know about how the different algorithms work. Do different gap open and extension penalties result in substantially different alignments?

STEP 3

Now view the regions of similarity as a dotplot. This can be done by selecting the two sequences in the document table and then clicking on the "Dotplot Viewer" tab in the viewer panel.

QUESTION 2: What features do you detect in the sequences based on the dotplot? Did you detect any more elements than you had found with the Needleman-Wunsch and Smith-Waterman algorithms? Draw a "genetic map" showing the extent and orientation of the regions of homology between the two sequences, including the locations of any repeats that you have found.

As a last exercise, you have been given two 16S rRNA genes from distantly related bacteria. Compare the two with Needleman-Wunsch and with Dotplot Viewer, and compare the E. coli sequence to its own reverse sequence using the Dotplot Viewer. You will have to play around with the window size and the stringency - particularly on the last plot where it is better to use a small window size.

QUESTION 3: What did you detect? What is it that you are seeing in the dotplot when you compare a 16S rRNA gene to itself in reverse?

Exercise 2: Translating sequences

This exercise will demonstrate the difficulty of aligning protein coding sequences. Alignments of closely related sequences are easy to perform - the alignment is unequivocal. When you work with more distantly related sequences, the correct alignment may not be quite so obvious and the computer may get it wrong - particularly if you feed it inappropriate parameters.
We will use two cytochrome b gene sequences which are very different (human and wheat).
We only want to compare the coding regions of the two genes and so must exclude the flanking DNA. Hence, we must find the start and stop codons. Use the ORF finding function in Geneious to do this.

Tip: Although the genetic code is said to be universal, some organisms have slight variations, and the mitochondria of most organisms have differences in their codons for tryptophan (W in single letter code) and stop codons. In mammalian mitochondria, the normal stop codon UGA, encodes tryptophan instead of STOP. A translation can be incorrect if you forget to specify the correct translation table (genetic code).

Select the wheat_cytb sequence and choose the "Find ORFs" option in the Tool menu. The default minimum ORF length will be fine. Make sure that you choose the appropriate mitochondrial translation table to do the translation.

Do this for both wheat and human sequences (using appropriate translation tables) and then look at the resulting annotations.

Translation

Performing the translation itself is pretty much self-explanatory. Just select the ORF you want to translate and click the the translate button.
Translate both the human and wheat sequences into peptide sequences. As a last exercise, use the Needleman-Wunsch alignment to compare both the DNA sequences and the peptide sequences.

QUESTION 4: How does the level of identity change when you use peptide sequences and what does it mean? What clues do you detect on the DNA alignment to show that some regions have aligned correctly and others incorrectly?

Exercise 3: Multiple alignment

You are provided with a set of 16S rRNA sequences. These sequences are not the whole length of the gene but are of the V1-V2 region of a number of Clostridium sp. They have been especially chosen because they demonstrate that alignment is not always an easy task.

Geneious Progressive Pairwise Alignment

Select all off the 16S sequences and click on the "Alignment" button in the toolbar. Because you have selected more than two sequences you will have a few more options. Try aligning the sequences provided with a simple progressive pairwise alignment algorithm. To do this, set the number of refinements to zero. Have a close look at the alignment the program produces. Try a few different gap creation and extension penalties and compare the results.

CLUSTALW

An alternative progressive pairwise alignment program is CLUSTALW. Geneious does not come bundled with CLUSTALW, but if you install CLUSTALW on your computer then Geneious can be used to locate and run it. In many ways CLUSTALW is a more sophisticated alignment algorithm for amino-acid alignment, however for DNA alignments it uses algorithms which are almost identical to those built into Geneious. The best thing to do is to play around with these two alignment algorithms and see how they compare. We are not questioning you on the use of ClustalW but you should become familiar and comfortable with different multiple alignment algorithms as they are the cornerstone of many Bioinformatics processes.

Things you should explore:

Changing the multiple alignment parameters
See what happens when you use extreme values for gap creation
Split your sequences up into two files. Align one set of sequences first and then the other. Select the two alignments and click the "Alignment" button to align them by profile alignment.
Save the results in different formats. Common formats for phylogeneticists are NEXUS and PHYLIP. Look at the resulting files in a text editor to familiarize yourself with them.
Try hand-editing the alignment (click the edit button in the sequence viewer). Do you think you could improve the alignment beyond any of the tricks used with the automated alignment methods?

QUESTION 5: What strategies for automated multiple sequence alignment did you find worked best?

Exercise 4: Aligning sequences by hand

Use the Needleman-Wunsch algorithm and the BLOSUM 50 substitution matrix to manually align the two following peptide sequences.

GSLPH
 
ALLR

Use a gap weighting of -8.

Use a spreadsheet to create a table like the one below. Then using the Needleman-Wunsch algorithm and the BLOSUM50 matrix above, fill out the table by hand:

Now, find the traceback path through the matrix, and infer the alignment.

QUESTION 6: Present the F matrix and your inferred alignment.

Exercise 5: Programming task

The final part of this assignment will involve completing a small Java program that implements a simple sequence comparison task. The task is to takes two strings as input and report the number of "hits" between the two sequences. A hit is defined as a ungapped match of length W with a minimum score of T. This programming task will be marked for correctness, code style and speed. The Test class provided will be used to test your code. Your solution should should complete the MyHitCounter implementation the HitCounter interface. As a guideline, my (very bad) default implementation of MyHitCounter takes about 10 seconds for both the protein and nucleotide tests on my laptop. I would expect good implementations to be 2-20 times faster.

This task is deceptively easy! My default implementation is only 10 lines long. The craft is in getting it working *fast* without compromising code quality and maintainability. I will tolerate small changes to any of the other classes (refactoring) if there are good reasons for it. But please note any such changes in the comments of your MyHitCounter.java file. I expect this part of the assignment to be handed in by email to myself in a zip file containing all the java files with any instructions and implementation notes in the javadoc comments of MyHitCounter.

MyHitCounter.java
HitCounter.java
Test.java

The following classes are also needed to run the Test and have been adapted from the JEBL sourceforge project:

AminoAcidScores.java
Blosum62.java
Nucleotides.java
NucleotideScores.java
NucleotideState.java
ScoreMatrix.java
Scores.java
State.java

Evaluation

The evalution of you program will be broken up in the following way:

Correctness, 50%
Code style and documentation 25%
Speed 25%

Ref: Chapter 2, "Biological Sequence Analysis" Durbin, Eddy. Krough, Mitchison.