Statistical analysis of significance of sequence alignment
From DrugPedia: A Wikipedia for Drug discovery
STATISTICAL SIGNIFICANCE OF ALIGNMENTS
One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. For sequences that are quite similar, such as two proteins that are clearly in the same family, such an analysis is not necessary. The question of significance arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or would just as likely be found if the sequences were not related. A significance test is also needed to evaluate the results of a database search, by the BLAST and FASTA programs, for sequences that are similar to a query sequence. The test is applied to every sequence matched so that the most significant matches are reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function. Our present purpose is to examine the significance of sequence alignment scores obtained by the dynamic programming method.
Originally, the significance of sequence alignment scores was evaluated on the assumption that alignment scores follow a normal statistical distribution. If sequences are randomly generated in a computer by a Monte Carlo or sequence shuffling method, as in generating a sequence by picking marbles representing the four bases or 20 amino acids out of a bag (the number of each type being proportional to the frequency found in sequences), the distribution of their alignment scores may look normal at first glance. However, further analysis of the alignment scores of random sequences reveals that the scores do not follow the normal distribution but a different one, the Gumbel extreme value distribution. The statistical analysis of alignment scores is much better understood for local alignments than for global alignments.
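The two ways of producing random sequences mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sample_like` and `shuffle_seq` are hypothetical, and the residue frequencies are taken from a single template sequence rather than from a sequence database.

```python
import random
from collections import Counter


def sample_like(template, length, rng):
    """Draw residues with replacement, with probabilities proportional to
    their frequency in `template` (the "marbles out of a bag" model)."""
    counts = Counter(template)
    residues = sorted(counts)
    weights = [counts[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def shuffle_seq(seq, rng):
    """Return a random permutation of `seq`; length and composition
    are preserved exactly."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    dna = "ACGTACGGTTAACC"
    print(sample_like(dna, 20, rng))
    print(shuffle_seq(dna, rng))
```

Shuffling keeps the exact composition of the original sequence, whereas sampling only matches it on average; which of the two is the better null model depends on the question being asked.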
Significance
In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.
The significance level of a test is a traditional frequentist statistical hypothesis testing concept. In simple cases, it is defined as the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true (a decision known as a Type I error, or "false positive determination"). The decision is often made using the p-value: if the p-value is less than the significance level, then the null hypothesis is rejected. The smaller the p-value, the more significant the result is said to be.
In more complicated, but practically important cases, the significance level of a test is a probability such that the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true is no more than the stated probability. This allows for those applications where the probability of deciding to reject may be much smaller than the significance level for some sets of assumptions encompassed within the null hypothesis.
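As a small illustration of the decision rule just described, the sketch below estimates a p-value from a set of scores obtained under the null hypothesis (for example, scores of alignments of random sequences) and compares it with a significance level. The function names are hypothetical, and the "add one" estimator is one common choice, assumed here because it never returns a p-value of exactly zero.

```python
def empirical_p_value(observed, null_scores):
    """Estimate P(score >= observed) under the null hypothesis.

    Uses (r + 1) / (n + 1), where r is the number of null scores that
    are at least as large as the observed score and n is the number of
    null scores.
    """
    if not null_scores:
        raise ValueError("need at least one null score")
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


def is_significant(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


if __name__ == "__main__":
    null = [12, 15, 11, 14, 13, 16, 12, 13, 15, 14]
    p = empirical_p_value(25, null)
    print(p, is_significant(p))
```

With only ten null scores the smallest attainable p-value is 1/11, so a larger number of random scores is needed before a threshold such as 0.05 can be reached at all.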
Determination of the Significance of an Alignment Score
Scoring matrices are most useful for statistical work if they are scaled in logarithms to the base 2, called bits. Scaling the matrices in this fashion does not alter their ability to score sequence similarities, and thereby to distinguish good matches from poor ones, but does allow a simple estimation of the significance of an alignment. The actual alignment score may then be calculated by summing the matrix values for each of the aligned pairs, using matrix values in bit units. If the actual alignment score in bits is greater than expected for an alignment of random sequences, the alignment is significant.
For a typical amino acid scoring matrix and protein sequences, $K = 0.1$, and $\lambda$ depends on the values of the scoring matrix; $m$ and $n$ are the lengths of the two sequences. For the extreme value distribution, the probability that a score $S$ reaches $x$ or greater is, in standardized form,
$$P(S \ge x) = 1 - \exp(-e^{-x})$$
and, for a distribution with location $\mu$ and scale parameter $\lambda$,
$$P(S \ge x) = 1 - \exp(-e^{-\lambda(x-\mu)})$$
With $\mu = (\ln Kmn)/\lambda$ this equals $1 - \exp(-Kmn\,e^{-\lambda x})$, which for small probabilities becomes
$$P(S \ge x) \approx Kmn\,e^{-\lambda x}$$
If the log odds matrix is in units of bits as described above, then $\lambda = \log_e 2 = 0.693$, and a simplified form of this equation may be derived (Altschul 1991) by taking logarithms to the base 2 and setting $P$ as the probability of the scores of random or unrelated alignments reaching a score of $S$ or greater (the threshold $x$ is written as $S$ from here on):
$$\begin{aligned}
\log_2 P(S \ge x) &= \log_2\left(Kmn\,e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \log_2\left(e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \frac{\log_e\left(e^{-\lambda S}\right)}{\log_e 2} \\
&= \log_2(Kmn) - \frac{\lambda S}{\log_e 2} \\
&= \log_2(Kmn) - S \qquad (1)
\end{aligned}$$
Then $S$, the score corresponding to probability $P$, may be obtained by rearranging the terms of Equation (1) as follows:
$$\begin{aligned}
S &= \log_2(Kmn) - \log_2 P(S \ge x) \\
&= \log_2\frac{K}{P(S \ge x)} + \log_2(nm) \qquad (2)
\end{aligned}$$
Since for most scoring matrices $K \approx 0.1$, choosing $P = 0.05$ makes the first term $\log_2(0.1/0.05) = 1$, and the second term in Equation (2) becomes the most important one for calculating the score (Altschul 1991), thus giving
$$S \approx \log_2(nm)$$
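The formulas of this section can be checked numerically with a short Python sketch. It assumes scores in bits (so that λ = ln 2) and K = 0.1 as in the text; the function names and the example sequence lengths of 250 residues are hypothetical.

```python
import math

LAMBDA_BITS = math.log(2)  # lambda for a log odds matrix scaled in bits


def p_extreme_value(score, m, n, k=0.1, lam=LAMBDA_BITS):
    """P(S >= score) = 1 - exp(-K m n e^(-lambda * score))."""
    return -math.expm1(-k * m * n * math.exp(-lam * score))


def p_approx(score, m, n, k=0.1, lam=LAMBDA_BITS):
    """Small-probability approximation: P(S >= score) ~ K m n e^(-lambda * score)."""
    return k * m * n * math.exp(-lam * score)


def threshold_score(m, n, p=0.05, k=0.1):
    """Equation (2): S = log2(K / P) + log2(n m), with S in bits."""
    return math.log2(k / p) + math.log2(n * m)


if __name__ == "__main__":
    m = n = 250  # hypothetical lengths of the two sequences
    s = threshold_score(m, n)
    print(f"threshold score: {s:.2f} bits")
    print(f"approximate P at threshold: {p_approx(s, m, n):.4f}")
    print(f"extreme value P at threshold: {p_extreme_value(s, m, n):.4f}")
```

For these inputs the first term of Equation (2) contributes exactly 1 bit and the second, log2(nm), almost 16 bits, which is why the length term dominates the threshold.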
Analysis of Z-Score
Clearly, if the randomized sequences score as well as the original one, the alignment is unlikely to be significant. We can measure the mean and standard deviation of the scores of alignments of randomized sequences, and ask whether the score of the original alignment is unusually high. The Z-score reflects the extent to which the original result is an outlier from the population:
$$Z = \frac{S - \text{mean}}{\text{standard deviation}}$$
A Z-score of 0 means that the observed similarity is no better than the average of random permutations of the sequence, and might well have arisen by chance. The problem with using a Z-score to judge whether a result has occurred by chance is that the Z-score assumes a normal distribution, whereas alignment scores do not follow a normal distribution. As a result, a higher Z-score should be taken as the threshold of significance.
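The Z-score procedure can be sketched in Python as follows. This is a minimal illustration, not the method of any particular program: the local alignment uses an assumed toy scoring scheme (match +1, mismatch -1, linear gap penalty -2) instead of a log odds matrix, the randomized population is produced by shuffling the second sequence, and all names and example sequences are hypothetical.

```python
import random
import statistics


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            val = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(val)
            best = max(best, val)
        prev = cur
    return best


def z_score(a, b, n_shuffles=200, seed=0):
    """Z = (observed score - mean) / standard deviation, where the mean
    and standard deviation come from alignments of `a` against shuffled
    copies of `b`."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(local_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        raise ValueError("all shuffled scores are identical")
    return (observed - mean) / sd


if __name__ == "__main__":
    a = "ACGTTGCAAGCTTAGCCGATAGCTAGGCTA"
    b = "TTGCAAGCTTAGCGGATAGCTAGG"
    print(f"Z-score: {z_score(a, b):.1f}")
```

Because the shuffled scores follow an extreme value rather than a normal distribution, the resulting Z-score should not be converted to a probability with normal tables.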
References
[1] Arthur M. Lesk, Introduction to Bioinformatics
[2] David Mount, Bioinformatics
[3] http://www.ncbi.nlm.nih.gov/