Statistical analysis of significance of sequence alignment
From DrugPedia: A Wikipedia for Drug discovery
STATISTICAL SIGNIFICANCE OF ALIGNMENTS
One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. For sequences that are quite similar, such as two proteins that are clearly in the same family, such an analysis is not necessary. The question of significance arises when comparing two sequences that are not so clearly similar but that align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or would be just as likely to be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search, by the BLAST and FASTA programs, for sequences that are similar to a query sequence. The test is applied to every sequence matched so that the most significant matches are reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function. Our present purpose is to examine the significance of sequence alignment scores obtained by the dynamic programming method.
Originally, the significance of sequence alignment scores was evaluated on the assumption that alignment scores followed a normal
statistical distribution. If sequences are randomly generated in a computer by a Monte Carlo or sequence shuffling method, as in generating a sequence
by picking marbles representing four bases or 20 amino acids out of a bag (the number of each type is proportional to the frequency found in sequences),
the distribution of their alignment scores may look normal at first glance. However, further analysis of the alignment scores of random sequences reveals that the scores
follow not the normal distribution but the Gumbel extreme value distribution. The statistical analysis of alignment scores
is much better understood for local alignments than for global alignments.
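The Monte Carlo experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular program: the scoring values (match +1, mismatch -1, gap -2), the uniform base frequencies, the sequence length, and the function names are all assumptions chosen for the example. It generates pairs of random DNA sequences, scores each pair by local alignment, and reports the skewness of the resulting scores. A normal distribution is symmetric (skewness 0), whereas an extreme value distribution has a longer right-hand tail.

```python
import random
import statistics


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def random_sequence(length, rng, alphabet="ACGT"):
    """Draw bases 'out of a bag' with equal frequencies (an assumption)."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def simulate_scores(trials=300, length=60, seed=0):
    """Local alignment scores for pairs of unrelated random sequences."""
    rng = random.Random(seed)
    return [
        local_alignment_score(
            random_sequence(length, rng), random_sequence(length, rng)
        )
        for _ in range(trials)
    ]


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


scores = simulate_scores()
print("mean score:", statistics.fmean(scores))
print("skewness:", skewness(scores))
```

With so few trials and integer scores the estimate is rough; a larger number of trials and realistic residue frequencies would be needed before drawing conclusions about the shape of the distribution.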
Significance
In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.
The significance level of a test is a traditional frequentist statistical hypothesis testing concept. In simple cases, it is defined as the probability
of making a decision to reject the null hypothesis when the null hypothesis is actually true (a decision known as a Type I error, or
"false positive determination"). The decision is often made using the p-value: if the p-value is less than the significance level, then the null
hypothesis is rejected. The smaller the p-value, the more significant the result is said to be.
In more complicated but practically important cases, the significance level of a test is an upper bound: the probability of deciding to reject the null hypothesis when the null hypothesis is actually true is no more than the stated probability. This allows for those applications where the probability of deciding to reject may be much smaller than the significance level for some sets of assumptions encompassed within the null hypothesis.
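The decision rule described above (reject the null hypothesis when the p-value is below the significance level) can be illustrated with a short sketch. Here the p-value is estimated empirically as the fraction of null scores, for example scores of shuffled sequences, that reach the observed score. The function names, the +1 correction that keeps the estimate above zero, and the toy null scores are assumptions made for this example only.

```python
def empirical_p_value(observed_score, null_scores):
    """Estimate P(score >= observed) under the null hypothesis.

    Uses the (exceed + 1) / (N + 1) estimator, an assumed convention that
    avoids reporting a p-value of exactly zero from a finite sample.
    """
    exceed = sum(1 for s in null_scores if s >= observed_score)
    return (exceed + 1) / (len(null_scores) + 1)


def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


# Toy null distribution: 99 made-up scores, 1 through 99.
null_scores = list(range(1, 100))

p_high = empirical_p_value(98, null_scores)  # only 98 and 99 reach it
p_low = empirical_p_value(50, null_scores)   # half of the null scores reach it

print(p_high, is_significant(p_high))
print(p_low, is_significant(p_low))
```

The significance level `alpha` is fixed before looking at the data; the p-value is computed from the data and compared against it.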
Determination of the Significance of an Alignment Score
Scoring matrices are most useful for statistical work if they are scaled as logarithms to the base 2, in units called bits. Scaling the matrices in this fashion does not alter their ability to score sequence similarities, and thereby to distinguish good matches from poor ones, but does allow a simple estimation of the significance of an alignment. The actual alignment score may then be calculated by summing the matrix values, in bit units, for each of the aligned pairs. If the actual alignment score in bits is greater than expected for an alignment of random sequences, the alignment is significant. For a typical amino acid scoring matrix and protein sequences, $K \approx 0.1$ and $\lambda$ depends on the values of the scoring matrix. If the log odds matrix is in units of bits as described above, then $\lambda = \log_e 2 = 0.693$. Starting from the equation

$$P(S \ge x) = Kmn\,e^{-\lambda x}$$

where $m$ and $n$ are the lengths of the two sequences, a simplified form may be derived (Altschul 1991) by taking logarithms to the base 2 and setting $P$ as the probability of the scores of random or unrelated alignments
reaching a score of $S$ or greater:

$$
\begin{aligned}
\log_2 P &= \log_2\left(Kmn\,e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \log_2\left(e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \frac{\log_e\left(e^{-\lambda S}\right)}{\log_e 2} \\
&= \log_2(Kmn) - \frac{\lambda S}{\log_e 2} \\
&= \log_2(Kmn) - S \qquad (1)
\end{aligned}
$$
Then $S$, the score corresponding to probability $P$, may be obtained by rearranging the terms of
Equation (1) as follows:

$$
\begin{aligned}
S &= \log_2(Kmn) - \log_2 P \\
&= \log_2(K/P) + \log_2(nm) \qquad (2)
\end{aligned}
$$
Since $K \approx 0.1$ for most scoring matrices, choosing $P = 0.05$ makes the first term $\log_2(0.1/0.05) = 1$ bit, so the second term in Equation (2) becomes the most important one for calculating the score (Altschul 1991), thus giving

$$S \approx \log_2(nm)$$
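As a rough numerical check of Equations (1) and (2), the sketch below computes the score in bits needed to reach a chosen probability and, conversely, the probability associated with a given bit score. It assumes $\lambda = \log_e 2$ (scores in bits) and $K = 0.1$ as in the text; the function names and the example sequence lengths are hypothetical.

```python
import math


def chance_probability(score_bits, m, n, K=0.1):
    """P = K*m*n*2^(-S), i.e. K*m*n*e^(-lambda*S) with lambda = ln 2.

    This simplified form is only meaningful when the result is small;
    for low scores it can exceed 1.
    """
    return K * m * n * 2.0 ** (-score_bits)


def score_threshold(m, n, P=0.05, K=0.1):
    """Equation (2): S = log2(K/P) + log2(n*m), in bits."""
    return math.log2(K / P) + math.log2(n * m)


# Hypothetical example: two sequences of 256 residues each.
m = n = 256
threshold = score_threshold(m, n)

print("bits needed for P = 0.05:", threshold)
print("probability at that score:", chance_probability(threshold, m, n))
print("dominant term log2(nm):", math.log2(n * m))
```

For these lengths the term $\log_2(nm)$ is 16 bits and the $\log_2(K/P)$ term adds only 1 bit, which shows why the product of the sequence lengths dominates the required score.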