PAM Scoring Matrices

From DrugPedia: A Wikipedia for Drug discovery

Current revision (16:42, 16 September 2008)

THE GENERAL MATHEMATICAL SETUP OF PAM SCORING MATRICES

PAM stands for Point Accepted Mutation.

PAM scoring matrices are used mainly to score alignments of protein sequences. They are based on global alignments of closely related proteins.

A protein sequence is a sequence of amino acids, and how readily one amino acid can be replaced by another has a strong influence on how proteins evolve. A point accepted mutation is the substitution of one amino acid of a protein by another that is accepted biologically and spreads through essentially an entire species over evolutionary time. PAM matrices are therefore closely tied to the study of homology between protein sequences that trace back to a common ancestor. A PAM1 probability transition matrix is the Markov chain (http://en.wikipedia.org/wiki/Markov_chain) transition matrix for a time period over which we expect 1% divergence, i.e. 1% of the amino acids undergoing accepted point mutations within the species of interest.

PAM matrices were derived from 71 groups of closely related protein sequences whose members share at least 85% sequence identity. We concentrate on PAM1, the basic 20 × 20 substitution (transition) matrix from which the matrices for higher PAM distances, e.g. PAM20 or PAM250, are extrapolated.


The Requirements:

a) a list of accepted mutations, or a hypothetical phylogenetic tree

b) all 20 amino acids, labelling the rows and the columns of the matrix

c) the probability of occurrence P(a) for each amino acid 'a', normalized so that


        $$\sum_a P(a) = 1$$

Let f(ab) be the number of times the mutation a ↔ b was observed to occur.

The counts are not directional, so f(ab) = f(ba).

Then,

        the total number of mutations in which 'a' was involved is

        $$f(a) = \sum_{b \neq a} f(ab)$$

        and the total number of amino acid occurrences involved in mutations is

        $$f = \sum_a f(a)$$

        Here f is also twice the total number of mutations, because each mutation a ↔ b involves two amino acids.
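
The bookkeeping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter alphabet and the counts in `pair_counts` are made up for the example (they are not Dayhoff's data), and the function names are hypothetical.

```python
# Toy, made-up mutation counts over a hypothetical 3-letter alphabet.
# Each unordered pair {a, b} is listed once.
pair_counts = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}


def symmetric_counts(pairs):
    """Expand unordered pair counts into f(ab) with f(ab) == f(ba)."""
    f_ab = {}
    for (a, b), n in pairs.items():
        f_ab[(a, b)] = n
        f_ab[(b, a)] = n
    return f_ab


def per_residue_totals(f_ab):
    """f(a) = sum over b != a of f(ab)."""
    f_a = {}
    for (a, b), n in f_ab.items():
        if a != b:
            f_a[a] = f_a.get(a, 0) + n
    return f_a


f_ab = symmetric_counts(pair_counts)
f_a = per_residue_totals(f_ab)
f_total = sum(f_a.values())  # f = sum over a of f(a)

print(f_a, f_total)
```

With these toy counts there are 6 observed mutations in total, and f comes out as 12, twice that number, as stated above.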


M is the transition matrix: the element M(ab) is the probability of amino acid 'a' changing into amino acid 'b', and M(aa) is the probability that amino acid 'a' remains unchanged during the evolutionary interval.


The relative mutability of amino acid 'a' is defined as

     $$m(a) = \frac{f(a)}{100 \, f \, P(a)}$$

                     

Mutabilities are scaled to the number of replacements per occurrence of the given amino acid per 100 residues in each alignment. With this scaling, the relative mutability m(a) is the probability that amino acid 'a' will change in the evolutionary period of interest. Hence, the probability of 'a' remaining unchanged is the complementary probability


     M(aa) = 1 − m(a) 
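
As a quick check on the factor of 100 in the definition of m(a), one can average the mutabilities over the amino acid frequencies. Using only the definitions of m(a) and f given above, the expected fraction of residues that change in one step works out to 1%, which is what a 1-PAM interval is meant to represent:

```latex
\sum_a P(a)\, m(a)
  = \sum_a P(a)\, \frac{f(a)}{100 \, f \, P(a)}
  = \frac{1}{100 \, f} \sum_a f(a)
  = \frac{1}{100}
```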


On the other hand, the probability of 'a' changing into 'b' can be computed as the product of the conditional probability that 'a' will change into 'b', given that 'a' changed, and the probability of 'a' changing. For a ≠ b we then have


    $$M(ab) = P(a \to b) = P(a \to b \mid a \text{ changed}) \, P(a \text{ changed}) = \frac{f(ab)}{f(a)} \, m(a)$$
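
A minimal sketch of how the whole 1-PAM transition matrix could be assembled from these formulas is given below. The alphabet, the frequencies `P` and the counts `F` are invented toy values, and `pam1_matrix` is a hypothetical helper name; the sketch also assumes every amino acid takes part in at least one mutation, so that f(a) is never zero.

```python
ALPHABET = ["A", "B", "C"]                            # toy alphabet (assumption)
P = {"A": 0.5, "B": 0.3, "C": 0.2}                    # toy frequencies, sum to 1
F = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2}     # toy unordered counts


def count(a, b):
    """Symmetric lookup: f(ab) == f(ba)."""
    return F.get((a, b), F.get((b, a), 0))


def pam1_matrix(alphabet, p):
    """Return (M, m): transition probabilities M(ab) and mutabilities m(a)."""
    f_a = {a: sum(count(a, b) for b in alphabet if b != a) for a in alphabet}
    f = sum(f_a.values())
    # m(a) = f(a) / (100 * f * P(a))
    m = {a: f_a[a] / (100 * f * p[a]) for a in alphabet}
    M = {}
    for a in alphabet:
        for b in alphabet:
            if a == b:
                M[(a, b)] = 1 - m[a]                        # M(aa) = 1 - m(a)
            else:
                M[(a, b)] = count(a, b) / f_a[a] * m[a]     # f(ab)/f(a) * m(a)
    return M, m


M, m = pam1_matrix(ALPHABET, P)
for a in ALPHABET:
    print(a, [round(M[(a, b)], 5) for b in ALPHABET])
```

Storing M as a dictionary keyed by (a, b) keeps the code close to the M(ab) notation of the text; a 20 × 20 array would be the more usual choice for real data.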


These equations follow a Markov-type model of evolution, which has good mathematical properties. The matrix M has the following properties:

  1) Each row sums to one:

     $$\sum_b M(ab) = M(aa) + \sum_{b \neq a} \frac{f(ab)}{f(a)} \, m(a) = 1 - m(a) + \frac{m(a)}{f(a)} \sum_{b \neq a} f(ab) = 1 - m(a) + m(a) = 1$$

  2) The n-PAM model has n mutation steps, and its transition matrix is the 1-PAM matrix multiplied by itself n times, i.e. the n-th power M^n.
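
Both properties can be checked numerically. The sketch below uses a made-up 3 × 3 matrix `PAM1` (built from toy counts, not from real protein data) and plain Python lists; `mat_mul` and `mat_power` are hypothetical helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy 1-PAM matrix over a hypothetical 3-letter alphabet; each row sums to 1.
PAM1 = [
    [1 - 4 / 600, 3 / 600, 1 / 600],
    [3 / 360, 1 - 5 / 360, 2 / 360],
    [1 / 240, 2 / 240, 1 - 3 / 240],
]

pam250 = mat_power(PAM1, 250)   # n-PAM transition matrix for n = 250
row_sums = [sum(row) for row in pam250]
print(row_sums)
```

Because the product of two matrices whose rows sum to one again has rows summing to one, the rows of `pam250` still sum to one up to floating-point rounding, while its diagonal entries are smaller than those of `PAM1`.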


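Let us continue with the 1-PAM matrix and define the scoring matrix S. The entries of this matrix are related to a ratio between two probabilities, the odds ratio M(ab)/P(b): the probability that 'a' is replaced by 'b' under the evolutionary model, relative to the probability of 'b' occurring by chance.

Each entry of the matrix S is calculated as the log of this odds ratio:

   $$S(ab) = \log \frac{M(ab)}{P(b)}$$

where the base of the logarithm can be chosen freely.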
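
As an illustration, the log-odds step might look as follows in Python. The transition probabilities `M` and frequencies `P` are the same kind of invented toy values as before, the base 10 default is just one possible choice, and the sketch assumes all entries of M are positive so that the logarithm is defined.

```python
import math

P = {"A": 0.5, "B": 0.3, "C": 0.2}   # toy frequencies (assumption)
M = {                                  # toy 1-PAM transition probabilities
    ("A", "A"): 1 - 4 / 600, ("A", "B"): 3 / 600, ("A", "C"): 1 / 600,
    ("B", "A"): 3 / 360, ("B", "B"): 1 - 5 / 360, ("B", "C"): 2 / 360,
    ("C", "A"): 1 / 240, ("C", "B"): 2 / 240, ("C", "C"): 1 - 3 / 240,
}


def log_odds_scores(transition, freq, base=10.0):
    """S(ab) = log_base( M(ab) / P(b) ) for every pair (a, b)."""
    scores = {}
    for (a, b), prob in transition.items():
        scores[(a, b)] = math.log(prob / freq[b], base)
    return scores


S = log_odds_scores(M, P)
print(round(S[("A", "A")], 3), round(S[("A", "B")], 3))
```

Identities score positive here because M(aa) is much larger than P(a), while substitutions score negative. Since M(ab)/P(b) = f(ab) / (100 f P(a) P(b)) is symmetric in a and b, the resulting score matrix is symmetric as well.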

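The score of an alignment is then the sum of the scores of its aligned pairs. For an ungapped alignment in which residue a_i is aligned with residue b_i at position i:

   $$S = \sum_i S(a_i b_i)$$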
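
A minimal sketch of this sum, assuming an ungapped alignment of two equal-length sequences, is shown below. The score table `TOY_SCORES` holds invented integer values for a two-letter alphabet, and `score_alignment` is a hypothetical helper name.

```python
# Invented integer scores for a hypothetical 2-letter alphabet.
TOY_SCORES = {
    ("A", "A"): 2, ("A", "B"): -1,
    ("B", "A"): -1, ("B", "B"): 3,
}


def score_alignment(seq_a, seq_b, scores):
    """Sum S(a_i b_i) over the aligned positions of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(scores[(a, b)] for a, b in zip(seq_a, seq_b))


total = score_alignment("AAB", "ABB", TOY_SCORES)
print(total)
```

Gaps are deliberately left out of the sketch: they need a separate gap penalty, which the log-odds matrix itself does not define.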
            
                              
           
  

ON PAM UNITS

PAM matrices are among the scoring matrices used with the BLAST search algorithm, an extremely fast, robust and popular heuristic. There is a whole family of matrices: PAM-10, ..., PAM-250, ... These matrices are extrapolated from the PAM-1 matrix by repeated matrix multiplication. A PAM is also a relative measure of evolutionary distance, e.g.:

   a) 1 PAM = 1 accepted mutation per 100 amino acids 
   b) 250 PAM = 2.5 accepted mutations per amino acid 


EXTRA: BLOSUM MATRICES

The other commonly used scoring matrices are the BLOSUM matrices. In contrast to the PAM matrices, which were developed from global alignments of closely related proteins, the BLOSUM (BLOcks SUbstitution Matrix) matrices are based on local multiple alignments of more distantly related sequences. For instance, BLOSUM 62, the default matrix in BLAST, is calculated from aligned blocks in which sequences sharing 62% identity or more are clustered and counted as a single sequence. Unlike PAM matrices, new BLOSUM matrices are never extrapolated from existing BLOSUM matrices, but are always computed directly from local multiple alignments. So the BLOSUM 80 matrix is derived in the same way using an 80% identity threshold.


REFERENCES :

[1] Joao Setubal and Joao Meidanis, Introduction to computational molecular biology, University of Campinas, Brazil, December 1997.

[2] Warren J. Ewens and Gregory R. Grant, Statistical methods in bioinformatics: an introduction, Springer-Verlag New York, 2001.

[3] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89(22):10915-9, 1992.

[4] Dayhoff MO, Schwartz RM. Atlas of Protein Sequence and Structure, vol. 5, suppl. 3:353-358. Nat. Biomed. Res. Found., Washington D.C., 1978.

[5] Polanski and Kimmel, Bioinformatics.

http://en.wikipedia.org/wiki/Markov_chain