Blast Algorithm

From DrugPedia: A Wikipedia for Drug discovery

(Difference between revisions)
Jump to: navigation, search
(New page: Blasts stands for ''Basic Local Alignment Search Tool''.Its Based on ''Karlin-Altschul'' statistics. It compares novel sequence with that present in nucleotide & protein databases which ...)
Current revision (04:59, 19 September 2008) (edit) (undo)
 
(6 intermediate revisions not shown.)
Line 12: Line 12:
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.  
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.  
-
1 Seeding step
+
#Seeding step
-
 
+
#Extension step
-
2 Extension step
+
#Evaluation step
-
 
+
-
3 Evaluation step
+
==The Seeding step==
==The Seeding step==
Line 24: Line 22:
[[Image:blast1.png]]
[[Image:blast1.png]]
-
listing of words in a sequence
+
===Listing of words in a sequence===
Line 37: Line 35:
                                             PRG       14      T=13            PRG   14  
                                             PRG       14      T=13            PRG   14  
                                             PSG       13                        PSG   13
                                             PSG       13                        PSG   13
-
                                             PQA       12                        PQA   12
+
                                             PQA       12                         
 +
 
 +
 
 +
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for
 +
 
 +
50 x 250 = 12,500 !!
 +
 
 +
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
 +
 
 +
==The Extension Step==
 +
 
 +
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
 +
…..???? PQL ??? ….
 +
 
 +
…. $$$ PQL $$$ ….
 +
 
 +
 
 +
===When do we stop the extension?===
 +
 
 +
Let’s do an example:
 +
 
 +
Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right...
 +
Use a drop off score variable, X = 5
 +
 
 +
                    The quick  brown fox  jumps over the lazy dog
 +
                    The quiet  brown cat  purrs when she sees him
 +
                    123 45654  56789 876 5654…   score
 +
                    000 00012  10000 123 4345…     drop off score
 +
 
 +
After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.
 +
[[Image:blast2.png]]
 +
 
 +
===Trimming back for getting hsp===
 +
 
 +
Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.
 +
[[Image:blast3.png|thumb|300px|non overlapping hsp]]
 +
[[Image:blast4.gif|thumb|300px|extension of seed]]
 +
[[Image:blast5.gif|thumb|300px|extension until cumulative score drops]]
 +
 
 +
 
 +
==Evaluation Step==
 +
We know that Blast is based on  Karlin-Altschul statistics .
 +
The  Karlin-Altschul equation is given by:
 +
[[Image:blast6.png|100px|e value]] 
 +
                   
 +
E = expected no. of alignments
 +
 
 +
k= minor constant
 +
 
 +
m= length of query
 +
 
 +
n= length of database
 +
 
 +
λ= Scaling factor
 +
 
 +
S= raw score
 +
 
 +
λS= Normalized score
 +
 
 +
mn= search space
 +
 
 +
Raw score (S):  Sum of scores for each aligned position and scores for gaps
 +
S = sum(matches) - sum(mismatches) - sum(gap penalties)
 +
note: this score varies with the scoring matrix used and thus may not be meaningfully
 +
compared for different searches
 +
 
 +
Bit score (S’):  Version of the raw score that is normalized by the scale of the  scoring matrix ( ) and the scale of the search space size (K)
 +
S’ = (sum(S) – ln(K)) / ln(2)
 +
note: because it is normalized the bit score can be meaningfully compared across searches
 +
 
 +
E value:  Number of alignments with score S’ or better that one would expect  to find by chance in a search of a database of the same size
 +
E = mn2-S’ 
 +
m = effective length of database
 +
n = effective length of query sequence
 +
note: E values may change if databases of different sizes are searched
 +
 
 +
 
 +
Once we have all the HSP’s, we use a different cutoff “S” to rank the HSP’s.
 +
The raw scores of HSPs are sorted base on “S”
 +
The “S” alignment threshold can remove many random and low-scoring alignments. But if it's
 +
set too high, it may also remove real alignments.
 +
Once the HSPs are organized, they can be evaluated with a Final threshold based on the entire search space and corresponding to the value E set for the search.
 +
BLAST reports any alignment or group of alignments that meets the E requirement.
 +
 
 +
Sometimes, two or more HSP regions that can be made into a longer alignment will be found.
 +
 
 +
For example, HSP score #1: 65 and 40 and HSP score #2: 52 and 45
 +
 
 +
Poisson method: probability of multiple score is higher when the lower score of each set is higher. (45 is higher than 40)
 +
Sum-of-scores method: 65+40=105 is higher than 52+45=97.
 +
 
 +
original BLAST uses the Poisson method whereas gapped BLAST and the WU-BLAST use the sum-of scores method.
 +
 
 +
Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database.
 +
Those matches whose E value is lower then a threshold value are reported.
 +
 
 +
 
 +
==References==
 +
* D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis."
 +
 
 +
* Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool"

Current revision

Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.

It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:

local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.

Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".

Contents

[edit] How does blast works?

First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.

  1. Seeding step
  2. Extension step
  3. Evaluation step

[edit] The Seeding step

BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.

Image:blast1.png

[edit] Listing of words in a sequence

BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.



                                           Seq       Score                      Seq     Score       
                                            …..       ….                       .......  ........
     PQG        aligns with                PEG	       15                        PEG	   15
                                           PRG	       14       T=13             PRG	   14 
                                           PSG	       13                        PSG	   13
                                           PQA	       12                        


If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for

50 x 250 = 12,500 !!

As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!

[edit] The Extension Step

After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment. …..???? PQL ??? ….

…. $$$ PQL $$$ ….


[edit] When do we stop the extension?

Let’s do an example:

Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right... Use a drop off score variable, X = 5

                    The quick  brown fox  jumps over the lazy dog
                    The quiet  brown cat  purrs when she sees him
                    123 45654  56789 876 5654…    score
                    000 00012  10000 123 4345…     drop off score

After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair. Image:blast2.png

[edit] Trimming back for getting hsp

Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.

Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast3.png': No such file or directory @ blob.c/OpenBlob/2480.
convert: unable to open file `/usr1/webserver/cgidocs/drugpedia/images/Blast3.png' @ png.c/ReadPNGImage/2889.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast3.png/300px-Blast3.png' @ convert.c/ConvertImageCommand/2800.
non overlapping hsp
Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast4.gif': No such file or directory @ blob.c/OpenBlob/2480.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast4.gif/300px-Blast4.gif' @ convert.c/ConvertImageCommand/2800.
extension of seed
Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast5.gif': No such file or directory @ blob.c/OpenBlob/2480.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast5.gif/300px-Blast5.gif' @ convert.c/ConvertImageCommand/2800.
extension until cumulative score drops


[edit] Evaluation Step

We know that Blast is based on Karlin-Altschul statistics . The Karlin-Altschul equation is given by:

Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast6.png': No such file or directory @ blob.c/OpenBlob/2480.
convert: unable to open file `/usr1/webserver/cgidocs/drugpedia/images/Blast6.png' @ png.c/ReadPNGImage/2889.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast6.png/100px-Blast6.png' @ convert.c/ConvertImageCommand/2800.

E = expected no. of alignments

k= minor constant

m= length of query

n= length of database

λ= Scaling factor

S= raw score

λS= Normalized score

mn= search space

Raw score (S): Sum of scores for each aligned position and scores for gaps S = sum(matches) - sum(mismatches) - sum(gap penalties) note: this score varies with the scoring matrix used and thus may not be meaningfully compared for different searches

Bit score (S’): Version of the raw score that is normalized by the scale of the scoring matrix ( ) and the scale of the search space size (K) S’ = (sum(S) – ln(K)) / ln(2) note: because it is normalized the bit score can be meaningfully compared across searches

E value: Number of alignments with score S’ or better that one would expect to find by chance in a search of a database of the same size E = mn2-S’ m = effective length of database n = effective length of query sequence note: E values may change if databases of different sizes are searched


Once we have all the HSP’s, we use a different cutoff “S” to rank the HSP’s. The raw scores of HSPs are sorted base on “S” The “S” alignment threshold can remove many random and low-scoring alignments. But if it's set too high, it may also remove real alignments. Once the HSPs are organized, they can be evaluated with a Final threshold based on the entire search space and corresponding to the value E set for the search. BLAST reports any alignment or group of alignments that meets the E requirement.

Sometimes, two or more HSP regions that can be made into a longer alignment will be found.

For example, HSP score #1: 65 and 40 and HSP score #2: 52 and 45

Poisson method: probability of multiple score is higher when the lower score of each set is higher. (45 is higher than 40) Sum-of-scores method: 65+40=105 is higher than 52+45=97.

original BLAST uses the Poisson method whereas gapped BLAST and the WU-BLAST use the sum-of scores method.

Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database. Those matches whose E value is lower then a threshold value are reported.


[edit] References

  • D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis."
  • Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool"