Needleman-Wunsch and Smith-Waterman algorithm

From DrugPedia: A Wikipedia for Drug discovery

Jump to: navigation, search

[edit] Sequence Alignment

sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.

There are two types of Sequence alignment Global Alignment Local Alignment


[edit] Global Alignment

In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment.


[edit] Local Alignment

In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.


global alignment:

Image:image1.png

local alignment – – – – – – – T G K G – – – – – – – –

               | | |

– – – – – – – A G K G – – – – – – – –


        The global alignment is stretched over the entire sequence length to include as many matching amino acids as possible up to and including the sequence ends. Vertical bars between the sequences indicate the presence of identical amino acids.
        In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a much higher priority is given to finding these local regions  than to extending the alignment to include more neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of alignment favors finding conserved nucleotide patterns, DNA sequences, or amino acid patterns in protein sequences. 


        The Needleman-Wunsch algorithm for sequence alignment 
    

It is used for global alignment The best alignment over the entire length of two sequences Suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Example:

        SIMILARITY 
        PI-  LLAR- - - 
            
       
                 A mathematical framework 

Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved.

A gap, which indicates a residue-to-nothing match, may be introduced in either sequence. 
 A gap-to-gap match is meaningless and is not allowed. 


                 The scoring scheme 
  

Give two sequences we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment).

  The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences. 

1.The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps. 2.The alignment score is the sum of substitution scores and gap penalties.

               A simple scoring scheme 

Use +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps

 The best alignment ”by eye” from before: 
              ATGGCGT 
ATG- AGT    score: +1 + 1 + 1 + 0 − 1 + 1 + 1 = 4 

An alternative alignment:

                    ATGGCGT 
               A- TGAGT  score: +1 + 0 − 1 + 1 − 1 + 1 + 1 = 2 



              The substitution matrix 

A concise way to express the residue substitution costs can be achieved with a N × N matrix (N is 4 for DNA and 20 for proteins). The substitution matrix for the simple scoring scheme:

                  C   T    A    G 
               C 1   -1   -1   -1 
               T -1  1    -1   -1 
               A -1  -1    1   -1 
               G -1  -1   -1    1 
   The Needleman-Wunsch algorithm 
 

A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.

 The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953! 
                                 


How does dynamic programming work?

A divide-and-conquer strategy: 1.Break the problem into smaller subproblems. 2. Solve the smaller problems optimally. 3. Use the sub-problem solutions to construct an optimal solution for the original problem. Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems. Examples include 1. Trevelling salesman problem 2. Finding the best chess move


                           The mathematics 

A matrix D(i, j) indexed by residues of each sequence is built recursively, such that

                                                        {D(i − 1, j − 1) + s(xi , yj ) }
                        D(i, j) = max     [         {D(i − 1, j) + g }                  ]
                                                          {D(i, j − 1) + g }
   subject to a boundary conditions. s(i, j) is the substitution score for residues i and j, and g is the gap penalty. 
                  A walk-through: an overview 

We consider all possible pairs of residue from two sequences (this gives rise to a 2D matrix representation).

   We will have two matrices: the score matrix and traceback matrix. 
 The Needleman-Wunsch algorithm consists of three steps: 
                     1. Initialisation of the score matrix 
                     2. Calculation of scores and filling the traceback matrix 
                     3. Deducing the alignment from the traceback matrix 


                   The score and traceback matrices

The cells of score matrix are labeled C(i,j) where i=1,2,.....,N and j=1,2,.....,M SCORE MATRIX

                             S                          E                    N                      D

C(1,1) C(1,2) C(1,3) C(1,4) C(1,5) C(2,1)



C(3,1)



C(4,1)





TRACEBACK MATRIX

                                   S                        E                      N                       D











         Initialization  

The first row and the first column of the score and traceback matrices are filled during the initialization.


S E N D

0 -10 -20 -30 -40 A -10



N -20



D -30




S E N D

done left left left left A up



N up



D up







                      SCORING

The score matrix cells are filled by row starting from the cell C(2, 2) The score of any cell C(i, j) is the maximum of:

           qdiag =     C(i − 1, j − 1) + S(i, j) 
             qup =     C(i − 1, j) + g 
               qleft =   C(i, j − 1) + g 
     where S(i, j) is the substitution score for letters i and j, 
     and g is the gap penalty. 

SCORING- A pictorial representation

The value of cell c(i,j) depends only on the values of the immediately adjacent northwest diagonal, up and left cells

C(i-1,j-1) C(i-1,j) C(i,j-1) C(i,j)


The value of C(2, 2)

The calculation for the cell C(2, 2):

                       qdiag =     C(1, 1) + S(S, A) = 0 + 1 = 1 
                       qup =     C(1, 2) + g = −10 + (−10) = −20 
                       qlef t =   C(2, 1) + g = −10 + (−10) = −20 

Where C(1, 1), C(1, 2), and C(2, 1) are read from the score matrix, and S(S, A) is the score for the S ↔ A taken from the BLOSUM62 matrix.




                 Filling the score and traceback matrices 
  

For the score matrix C(2, 2) = qdiag which is 1. The corresponding cell of the traceback matrix is ”diag”:



S E N D

0 -10 -20 -30 -40 A -10 1


N -20



D -30





S E N D

done left left left left A up diag


N up



D up




The progression is recursive

The calculation for the cell C(2, 3)

       qdiag =    C(1, 2) + S(E, A) = −10 + −1 = −11 
       qup =    C(1, 3) + g = −20 + (−10) = −30 
       qlef t =   C(2, 2) + g = 1 + (−10) = −9 

Thus C(2, 3) = −9 and the corresponding cell of the traceback matrix is ”left”.

        The final score and traceback matrices 
 

After all cells are filled, the score and traceback matrices are:



S E N D

0 -10 -20 -30 -40 A -10 1 -9 -19 -29 N -20 -9 -1 -3 -13 D -30 -19 -11 2 3



S E N D

done left left left left A up diag left left left N up diag diag diag left D up up diag diag diag

The traceback Traceback = the process of deduction of the best alignment from the traceback matrix.

   The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell. 
 One moves according to the traceback value written in the cell. 
 There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. 
 The traceback is completed when the first, top-left cell of the matrix is reached (”done” cell). 




 The traceback path 
 

The traceback performed on the completed traceback matrix:



S E N D

done left left left left A up diag left left left N up diag diag diag left D up up diag diag diag


The best alignment

The alignment is deduced from the values of cells along the traceback path, by taking into account the values of the cell in the traceback matrix:

                   diag – the letters from two sequences are aligned 
               left – a gap is introduced in the left sequence 
               up – a gap is introduced in the top sequence 

Sequences are aligned backwards. The fourth cell from the traceback path is also ”diag” implying that the corresponding letters are aligned. We consider the letter A again, this time it is aligned with S:

     SEND 
     A-ND 

A few observations

It was much easier to align SEND and AND by the exhaustive search!

As we consider longer sequences the situation quickly turns against the exhaustive search: 

1.Two 12 residue sequences would require considering ∼ 1 million alignments. 2. Two 150 residue sequences would require considering ∼ 1088 alignments (∼ 1078 is the estimated number of atoms in the Universe). For two 150 residue sequences the Needleman-Wunsch algorithm requires filling a 150 × 150 matrix.

Smith-Waterman algorithm 
Local sequence alignment 
 

Involving stretches that are shorter than the entire sequences, possibly more than one.

    Suitable when comparing substantially different sequences, which possibly differ significantly in length, and have only  a short patches of similarity. 
 For example, the local alignment of SIMILARITY and PILLAR: 
                 MILAR 
                 ILLAR 

Algorithm: similar to global alignment with modified boundary conditions and recurrence rules.

             – The top row and left column are now filled with 0. 
              – If the (sub-)alignment score becomes negative, restart the search: 
                                         
                                          {0, 
                                          {F (i − 1, j − 1) + s(xi, yj ), 
                   F (i, j) = max { F (i − 1, j) − d,                     }
                                          {F (i, j − 1) − d. 
            – Traceback: from the maximum of F (i, j) in the whole matrix to the first 0. 

Example: the optimal local alignment between HEAGAWGHEE and PAWHEAE is

 AWGHE::AW-HE. 

Issue: In gapped alignments, the expected score for a random match may be positive.




                           Local sequence alignment 
                        Smith-Waterman algorithm 
                       Finding the optimal alignment 
                                                
                                    A   T  C  A  G  A   G T C 
    A C G T             0   0   0   0   0  0   0   0  0  0 
A  1 -1 -1 -1 
                           G 0   0   0   0   0  1   0   1  0  0 
C  -1 1 -1 -1 
                           T 0   0   1   0   0  0  0   0  2  0 
G  -1 -1 1 -1 
                           C 0   0  0   2    0  0  0   0  0  3 
T  -1 -1 -1 1 
                           A  0  1  0  0     3  1  1   1  1  1 
                           G  0  0  0  0     1  4  2   2  2  2 
                           T   0  0  1  0    1  2  3   1  3  1 
                           C   0  0  0  2    1  2  1   2  1  44 4
                           A   0  1  0  0    3  2  3   1  1  2 
        
             Gap penalty -2
                           

The optimal local alignment is: A TCAGAGTC

                            G TCAG - -  TCA 
                              score : 1+1+1+1-2+1+1=4






References:

    1. Bioinformatics: Genome and Sequence analysis(David W.Mount)

2.“www.google.com” 3.wikipedia