Statistical analysis of significance of sequence alignment

From DrugPedia: A Wikipedia for Drug discovery

STATISTICAL SIGNIFICANCE OF ALIGNMENTS

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of an alignment between DNA or protein sequences. For sequences that are quite similar, such as two proteins that are clearly in the same family, such an analysis is not necessary. The question of significance arises when two sequences are not so clearly similar but nevertheless align in a promising way. In such a case, a significance test can help the biologist decide whether an alignment found by the computer program is one that would be expected between related sequences or would be just as likely to be found if the sequences were unrelated. A significance test is also needed to evaluate the results of a database search for sequences similar to a query sequence with the [BLAST][1] and [FASTA][2] programs. The test is applied to every matched sequence so that the most significant matches are reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function. Our present purpose is to examine the significance of sequence alignment scores obtained by the dynamic programming method.

Originally, the significance of sequence alignment scores was evaluated on the assumption that alignment scores follow a normal statistical distribution. If sequences are randomly generated in a computer by a [Monte Carlo or sequence shuffling][3] method, as in generating a sequence by picking marbles representing the four bases or 20 amino acids out of a bag (with the number of each type proportional to the frequency found in sequences), the distribution of their alignment scores may look normal at first glance. However, further analysis of the alignment scores of random sequences reveals that the scores follow not the normal distribution but the Gumbel extreme value distribution. The statistical analysis of alignment scores is much better understood for local alignments than for global alignments.
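The two ways of producing random sequences mentioned above can be sketched in Python. This is only an illustration: the function names and the base frequencies are assumptions chosen for the example, not values from any particular program or data set.

```python
import random
from collections import Counter

def random_sequence(length, frequencies, rng):
    """Monte Carlo sampling: draw each residue independently, with
    probability proportional to its weight (the marbles in the bag)."""
    letters = list(frequencies)
    weights = [frequencies[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))

def shuffled_sequence(seq, rng):
    """Sequence shuffling: a random permutation of seq, so the
    composition of the original sequence is preserved exactly."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical base frequencies, for illustration only.
dna_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
sampled = random_sequence(50, dna_freqs, rng)

original = "ACGTTGCAAGCTTAGC"  # made-up example sequence
shuffled = shuffled_sequence(original, rng)

print(sampled)
print(shuffled)
# Shuffling changes the order of residues but not their counts.
print(Counter(shuffled) == Counter(original))
```

Sampling by frequency gives sequences whose composition matches the chosen frequencies only on average, whereas shuffling keeps the exact composition of one particular sequence.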

Significance

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.

The significance level of a test is a traditional frequentist statistical hypothesis testing concept. In simple cases, it is defined as the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true (a decision known as a Type I error, or "false positive determination"). The decision is often made using the p-value: if the p-value is less than the significance level, then the null hypothesis is rejected. The smaller the p-value, the more significant the result is said to be.

In more complicated, but practically important, cases, the significance level of a test is a probability such that the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true is no more than the stated probability. This allows for those applications where the probability of deciding to reject may be much smaller than the significance level for some sets of assumptions encompassed within the null hypothesis.

Determination of the Significance of an Alignment Score

Scoring matrices are most useful for statistical work if they are scaled in logarithms to the base 2, called bits. Scaling the matrices in this fashion does not alter their ability to score sequence similarities, and thereby to distinguish good matches from poor ones, but it does allow a simple estimation of the significance of an alignment. The actual alignment score may then be calculated by summing the matrix values, in bit units, for each of the aligned pairs. If the actual alignment score in bits is greater than expected for an alignment of random sequences, the alignment is significant.

For a typical amino acid scoring matrix and protein sequences, K = 0.1 and λ depends on the values of the scoring matrix. If the log odds matrix is in units of bits as described above, then λ = log_e 2 = 0.693. The extreme value distribution gives the probability that a score S is x or greater, first in standard form and then with scale λ and location µ:

$$P(S \ge x) = 1 - \exp\left(-e^{-x}\right)$$

$$P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$

For µ = (ln Kmn) / λ, where m and n are the lengths of the two sequences, and for small probabilities, this becomes

$$P(S \ge x) \approx Kmn\, e^{-\lambda x}$$

A simplified form of this equation may be derived (Altschul 1991) by taking logarithms to the base 2 and setting P as the probability of the scores of random or unrelated alignments reaching a score of S or greater:

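$$\begin{aligned}
\log_2 P &= \log_2\left(Kmn\, e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \log_2\left(e^{-\lambda S}\right) \\
&= \log_2(Kmn) + \frac{\log_e\left(e^{-\lambda S}\right)}{\log_e 2} \\
&= \log_2(Kmn) - \frac{\lambda S}{\log_e 2} \\
&= \log_2(Kmn) - S \qquad (1)
\end{aligned}$$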

Then S, the score corresponding to probability P, may be obtained by rearranging the terms of Equation (1) as follows:

$$\begin{aligned}
S &= \log_2(Kmn) - \log_2 P \\
&= \log_2(K/P) + \log_2(nm) \qquad (2)
\end{aligned}$$

Since K ≈ 0.1 for most scoring matrices, choosing P = 0.05 makes the first term log₂(0.1/0.05) = 1, and the second term in Equation (2) becomes the most important one for calculating the score (Altschul 1991), thus giving

$$S \approx \log_2(nm)$$
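The relationships above can be checked numerically with a short Python sketch. It assumes a scoring matrix in bits, so that λ = ln 2, and K = 0.1 as stated in the text. The function names and the sequence lengths are hypothetical, chosen only for illustration.

```python
import math

LAMBDA_BITS = math.log(2)  # lambda for a log odds matrix scaled in bits

def extreme_value_tail(x, m, n, K=0.1, lam=LAMBDA_BITS):
    """P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x)), which is the extreme
    value form with mu = ln(K*m*n) / lam."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for small y.
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def approximate_tail(x, m, n, K=0.1, lam=LAMBDA_BITS):
    """Small-probability approximation: P(S >= x) ~ K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * x)

def threshold_bits(m, n, P=0.05, K=0.1):
    """Equation (2): the bit score S at which the approximate
    probability of a random alignment scoring S or more equals P."""
    return math.log2(K / P) + math.log2(m * n)

m, n = 250, 250  # hypothetical sequence lengths
S = threshold_bits(m, n)

print("threshold score in bits:", S)
print("log2(nm) term alone:    ", math.log2(m * n))
print("approximate P at S:     ", approximate_tail(S, m, n))
print("extreme value P at S:   ", extreme_value_tail(S, m, n))
```

By construction, the approximate probability evaluated at the threshold returns the chosen P, and the threshold exceeds log₂(nm) by the first term of Equation (2). The extreme value form gives a slightly smaller probability, because 1 − exp(−y) is less than y for any positive y.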

Analysis of Z-Score

Clearly, if the randomized sequences score as well as the original one, the alignment is unlikely to be significant. We can measure the mean and standard deviation of the scores of alignments of randomized sequences and ask whether the score of the original alignment is unusually high. The Z-score reflects the extent to which the original result is an outlier from the population:

$$Z = \frac{S - \text{mean}}{\text{standard deviation}}$$

where S is the score of the original alignment, and the mean and standard deviation are those of the scores of the randomized sequences.

A Z-score of 0 means that the observed similarity is no better than the average of random permutations of the sequence and might well have arisen by chance. The problem with using a Z-score to judge whether a result has occurred by chance is that it assumes a normal distribution, whereas alignment scores do not follow a normal distribution. As a result, a higher Z-score should be taken as the threshold of significance.
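The randomization procedure can be sketched in Python as follows. This is a minimal illustration, not the method of any particular program. The scoring scheme (match +2, mismatch −1, linear gap −2) is an assumption standing in for a real scoring matrix, and the function names, sequences, and number of shuffles are hypothetical.

```python
import random
import statistics

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score by dynamic programming
    (Smith-Waterman recurrence with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def z_score(a, b, n_shuffles=100, seed=0):
    """Z-score of the score of a against b, relative to the scores
    obtained when b is replaced by random permutations of itself."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    residues = list(b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        random_scores.append(local_score(a, "".join(residues)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("randomized scores have zero spread")
    return (observed - mean) / sd

# Made-up sequences: the second is a close variant of the first.
seq_a = "ACGTACGTTAGCTAGGCTAACGT"
seq_b = "ACGTACGATAGCTAGGCTTACGT"
print(z_score(seq_a, seq_b))
```

Shuffling only one of the two sequences keeps its length and composition fixed, so the randomized scores reflect what that composition alone can achieve against the other sequence.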

References:

[1] Introduction to Bioinformatics: Arthur M. Lesk

[2] Bioinformatics: David Mount

[3] http://www.ncbi.nlm.nih.gov/
