Needleman-Wunsch and Smith-Waterman algorithm
From DrugPedia: A Wikipedia for Drug discovery
[edit] Sequence Alignment
sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
There are two types of Sequence alignment Global Alignment Local Alignment
[edit] Global Alignment
In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.
[edit] Local Alignment
In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.
global alignment:
local alignment – – – – – – – T G K G – – – – – – – –
| | |
– – – – – – – A G K G – – – – – – – –
The global alignment is stretched over the entire sequence length to include as many matching amino acids as possible up to and including the sequence ends. Vertical bars between the sequences indicate the presence of identical amino acids.
In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favors finding conserved nucleotide patterns, DNA sequences, or amino acid patterns in protein sequences.
The Needleman-Wunsch algorithm for sequence alignment
It is used for global alignment The best alignment over the entire length of two sequences Suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Example:
SIMILARITY PI- LLAR- - -
A mathematical framework
Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved.
A gap, which indicates a residue-to-nothing match, may be introduced in either sequence. A gap-to-gap match is meaningless and is not allowed.
The scoring scheme
Give two sequences we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment).
The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences.
1.The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps. 2.The alignment score is the sum of substitution scores and gap penalties.
A simple scoring scheme
Use +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps
The best alignment ”by eye” from before: ATGGCGT ATG- AGT score: +1 + 1 + 1 + 0 − 1 + 1 + 1 = 4
An alternative alignment:
ATGGCGT A- TGAGT score: +1 + 0 − 1 + 1 − 1 + 1 + 1 = 2
The substitution matrix
A concise way to express the residue substitution costs can be achieved with a N × N matrix (N is 4 for DNA and 20 for proteins). The substitution matrix for the simple scoring scheme:
C T A G C 1 -1 -1 -1 T -1 1 -1 -1 A -1 -1 1 -1 G -1 -1 -1 1
The Needleman-Wunsch algorithm
A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.
The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!
How does dynamic programming work?
A divide-and-conquer strategy: 1.Break the problem into smaller subproblems. 2. Solve the smaller problems optimally. 3. Use the sub-problem solutions to construct an optimal solution for the original problem. Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems. Examples include 1. Trevelling salesman problem 2. Finding the best chess move
The mathematics
A matrix D(i, j) indexed by residues of each sequence is built recursively, such that
{D(i − 1, j − 1) + s(xi , yj ) } D(i, j) = max [ {D(i − 1, j) + g } ] {D(i, j − 1) + g } subject to a boundary conditions. s(i, j) is the substitution score for residues i and j, and g is the gap penalty.
A walk-through: an overview
We consider all possible pairs of residue from two sequences (this gives rise to a 2D matrix representation).
We will have two matrices: the score matrix and traceback matrix. The Needleman-Wunsch algorithm consists of three steps: 1. Initialisation of the score matrix 2. Calculation of scores and filling the traceback matrix 3. Deducing the alignment from the traceback matrix
The score and traceback matrices
The cells of score matrix are labeled C(i,j) where i=1,2,.....,N and j=1,2,.....,M SCORE MATRIX
S E N D
C(1,1) C(1,2) C(1,3) C(1,4) C(1,5) C(2,1)
C(3,1)
C(4,1)
TRACEBACK MATRIX
S E N D
Initialization
The first row and the first column of the score and traceback matrices are filled during the initialization.
S E N D
0 -10 -20 -30 -40 A -10
N
-20
D
-30
S E N D
done left left left left A up
N
up
D
up
SCORING
The score matrix cells are filled by row starting from the cell C(2, 2) The score of any cell C(i, j) is the maximum of:
qdiag = C(i − 1, j − 1) + S(i, j) qup = C(i − 1, j) + g qleft = C(i, j − 1) + g where S(i, j) is the substitution score for letters i and j, and g is the gap penalty.
SCORING- A pictorial representation
The value of cell c(i,j) depends only on the values of the immediately adjacent northwest diagonal, up and left cells
C(i-1,j-1) C(i-1,j) C(i,j-1) C(i,j)
The value of C(2, 2)
The calculation for the cell C(2, 2):
qdiag = C(1, 1) + S(S, A) = 0 + 1 = 1 qup = C(1, 2) + g = −10 + (−10) = −20 qlef t = C(2, 1) + g = −10 + (−10) = −20
Where C(1, 1), C(1, 2), and C(2, 1) are read from the score matrix, and S(S, A) is the score for the S ↔ A taken from the BLOSUM62 matrix.
Filling the score and traceback matrices
For the score matrix C(2, 2) = qdiag which is 1. The corresponding cell of the traceback matrix is ”diag”:
S
E
N
D
0 -10 -20 -30 -40 A -10 1
N -20
D
-30
S E N D
done left left left left A up diag
N up
D
up
The progression is recursive
The calculation for the cell C(2, 3)
qdiag = C(1, 2) + S(E, A) = −10 + −1 = −11 qup = C(1, 3) + g = −20 + (−10) = −30 qlef t = C(2, 2) + g = 1 + (−10) = −9
Thus C(2, 3) = −9 and the corresponding cell of the traceback matrix is ”left”.
The final score and traceback matrices
After all cells are filled, the score and traceback matrices are:
S
E
N
D
0 -10 -20 -30 -40 A -10 1 -9 -19 -29 N -20 -9 -1 -3 -13 D -30 -19 -11 2 3
S
E
N
D
done left left left left A up diag left left left N up diag diag diag left D up up diag diag diag
The traceback Traceback = the process of deduction of the best alignment from the traceback matrix.
The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell. One moves according to the traceback value written in the cell. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. The traceback is completed when the first, top-left cell of the matrix is reached (”done” cell).
The traceback path
The traceback performed on the completed traceback matrix:
S
E
N
D
done left left left left A up diag left left left N up diag diag diag left D up up diag diag diag
The best alignment
The alignment is deduced from the values of cells along the traceback path, by taking into account the values of the cell in the traceback matrix:
diag – the letters from two sequences are aligned left – a gap is introduced in the left sequence up – a gap is introduced in the top sequence
Sequences are aligned backwards. The fourth cell from the traceback path is also ”diag” implying that the corresponding letters are aligned. We consider the letter A again, this time it is aligned with S:
SEND A-ND
A few observations
It was much easier to align SEND and AND by the exhaustive search!
As we consider longer sequences the situation quickly turns against the exhaustive search:
1.Two 12 residue sequences would require considering ∼ 1 million alignments. 2. Two 150 residue sequences would require considering ∼ 1088 alignments (∼ 1078 is the estimated number of atoms in the Universe). For two 150 residue sequences the Needleman-Wunsch algorithm requires filling a 150 × 150 matrix.
Smith-Waterman algorithm
Local sequence alignment
Involving stretches that are shorter than the entire sequences, possibly more than one.
Suitable when comparing substantially different sequences, which possibly differ significantly in length, and have only a short patches of similarity. For example, the local alignment of SIMILARITY and PILLAR: MILAR ILLAR
Algorithm: similar to global alignment with modified boundary conditions and recurrence rules.
– The top row and left column are now filled with 0. – If the (sub-)alignment score becomes negative, restart the search: {0, {F (i − 1, j − 1) + s(xi, yj ), F (i, j) = max { F (i − 1, j) − d, } {F (i, j − 1) − d. – Traceback: from the maximum of F (i, j) in the whole matrix to the first 0.
Example: the optimal local alignment between HEAGAWGHEE and PAWHEAE is
AWGHE::AW-HE.
Issue: In gapped alignments, the expected score for a random match may be positive.
Local sequence alignment Smith-Waterman algorithm Finding the optimal alignment A T C A G A G T C A C G T 0 0 0 0 0 0 0 0 0 0 A 1 -1 -1 -1 G 0 0 0 0 0 1 0 1 0 0 C -1 1 -1 -1 T 0 0 1 0 0 0 0 0 2 0 G -1 -1 1 -1 C 0 0 0 2 0 0 0 0 0 3 T -1 -1 -1 1 A 0 1 0 0 3 1 1 1 1 1 G 0 0 0 0 1 4 2 2 2 2 T 0 0 1 0 1 2 3 1 3 1 C 0 0 0 2 1 2 1 2 1 44 4 A 0 1 0 0 3 2 3 1 1 2 Gap penalty -2
The optimal local alignment is: A TCAGAGTC
G TCAG - - TCA score : 1+1+1+1-2+1+1=4
References:
1. Bioinformatics: Genome and Sequence analysis(David W.Mount)
2.“www.google.com” 3.wikipedia