Needleman-Wunsch Algorithm
From DrugPedia: A Wikipedia for Drug discovery
The Needleman-Wunch (NW) algorithm Needleman and Wunsch (1970)# is a nonlinear global optimization method that was developed for amino acid sequence alignment in proteins. This was the first of many important alignment techniques which now find application in the Human Genome Project.
Human DNA consists of some 30,000 genes. The total genome is composed of about 3 billion chemical base pairs, or about 100,000 per gene. Finding where a particular string of amino acids fits is an optimization problem that aims to find the optimal alignment of the two strings with respect to a defined set of rules and parameter values for comparing different alignments.
The algorithm is an iterative method in which all possible pairs of amino acids (one from each string) are set up in a 2D matrix and alignments are represented as pathways through this array. The optimum alignment is the path (or paths) connecting maximum scoring values. This approach is an example of dynamic programming, which has also been applied to seismic modeling Darby and Neidell (1966) and travel time computation Schneider et al. (1992).
It is a global optimization process which yields a solution to the problem of pairwise alignment, meaning that we are interested in finding the best fit between only two strings. If alignment of more than two strings is of interest, the problem can be broken down into a cascade of pairwise alignments and thus solved.
In its simplest form, the Needleman-Wunsch algorithm can be summarized by Figure 1. A matrix is formed by placing the two strings, possibly of different length, along the left column and top row. In this step a one is allocated to a cell in the matrix if the letter in each list at this location is the same. Otherwise no entry is made (which is a defacto zero). It is at this stage that the letter-alignment problem becomes purely numerical. In fact, the original string could just as easily consist of integers as letters. The result of this process is the similarity matrix in Figure 1a.
Figure 1 The Needleman-Wunsch algorithm yields the globally optimum alignment between two strings, one along the left of the matrix and the other across the top. (a) Similarity matrix. (b) and (c) partially complete score matrix. (d) Complete score matrix. (e) Traceback route giving globally optimum alignment. (f) Alternate alignment illustrating non-uniqueness.
From the similarity matrix a scoring matrix is formed beginning in the lower right corner. The procedure is to add the score value to the maximum value in a row-column pair whose upper left corner is down and to the right of the current working position. Thus in Figure 1b the similarity value 1 is added to the maximum value in the blackened cells (also 1) to give a score of 2. Figure 1c is a later stage of the computation, which continues up and to the left until every cell has been visited and the scoring matrix is complete, Figure 1d. In this simple form, a final score corresponds to how many character matches exist in the optimum alignment.
The final step (traceback) operates by starting at the highest score value (8 in this case) and determining the maximum score path by moving to the right, down, or diagonally down and to the right, Figure 1e. The fact that more than one 8 score alignment exists (Figure 1f) is an expression of non-uniqueness. An important aspect of the solution is that in the process of finding the best global alignment, we also find the best alignments of any sublength.
The idea is to use dynamic programming to efficiently implement a recursion. Given two input strings x and y, we build a matrix F such that the entry F [i, j] is the score of the optimal alignment of x[1..i] and y[1..j]. We initialize F [0, 0] = 0. We then proceed to fill the matrix from top left to bottom right. Suppose we have filled in F [i − 1, j − 1], F [i − 1, j], and F [i, j − 1], the three entries above, to the left, and diagonally above and left of F [i, j], and we have the (or an) optimal alignment for each of those three pairs. Then there are three corresponding ways to complete these alignments to an alignment of x[1..i] and y[1..j]. We can either align x[i] with y[j] (in the first case), or align x[i] with a new gap (in the second case), or align y[j] with a new gap (in the third case). Therefore we have the recursion:
F [i − 1, j − 1] + s(x[i], y[i]), F [i, j] = max F [i − 1, j] − d, F [i, j − 1] − d
At the top row and the left side, we must specify F [i, 0] and F [0, j]. The value F [i, 0] represents assigning a prefix of x to a gap, so we should define F [i, 0] = id. Similarly down the left side F [0, j] = jd, corresponding to assigning a prefix of y to a gap.
This recursion is a classic example of what dynamic programming is good for. If it is implemented naively, many values will be recomputed exponentially many times. Instead, we use a matrix to store the values, as indicated, and compute them bottom-up in a double for-loop.
Since we want to obtain not only the optimal score, but the (or an) optimal alignment, we store in each cell of the matrix a “pointer” (not in the programming sense)–the pointer value is left, up, or diagonal. In an actual program we could keep these values in a separate matrix, the same size as F . These pointer values tell us which of the three choices was used to compute F [i, j]. In case two of the choices yield the same value of F [i, j], we just pick one arbitrarily. (If we want to return all optimal alignments, we have to allow for retaining more than one pointer value.)
When the matrix has been computed, we read the optimal score as the lower right corner entry F [n, m]. Note that F [n, m]does not have to be the maximum entry along the right side or the bottom of the matrix F, but starting there would mean omitting one or more characters from the end of one of the strings. We are only allowed to insert gaps, not to omit characters, so we must start in the corner at F[n, m].
How do we recover the optimal alignment itself? That will be given by two strings u and v, such that u is formed from x by inserting some gaps, and v is formed from y by inserting some gaps. We initialize u and v as follows: u = x[n] and v = y[m].
We now use a “trace-back” process, following the pointers back starting from the cell with the maximum score. A diagonal pointer adds a character (at the left) to each string. An up pointer adds a gap to u and a character at the left of v. A left pointer adds a gap to v and a character at the left of u.