Position-specific scoring matrix

From DrugPedia: A Wikipedia for Drug discovery

Current revision (10:27, 18 September 2008)

A position weight matrix (PWM), also called a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

A PWM is a matrix of score values that gives a weighted match to any given substring of fixed length. It has one row for each symbol of the alphabet, and one column for each position in the pattern. The score assigned by a PWM to a substring <math>s=(s_j)_{j=1}^N</math> is defined as <math>\textstyle \sum_{j=1}^{N}{m_{s_j,j}}</math>, where <math>j</math> represents position in the substring, <math>s_j</math> is the symbol at position <math>j</math> in the substring, and <math>m_{\alpha,j}</math> is the score in row <math>\alpha</math>, column <math>j</math> of the matrix. In other words, a PWM score is the sum of position-specific scores for each symbol in the substring.
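The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the matrix values are invented, and the names `PWM`, `pwm_score` and `scan` are hypothetical.

```python
# Hypothetical example PWM over the DNA alphabet: one row per symbol,
# one column per motif position (N = 3). The score values are made up.
PWM = {
    "A": [1.0, -2.0, 0.5],
    "C": [-1.0, 0.5, -0.5],
    "G": [-0.5, 1.5, -1.0],
    "T": [0.0, -1.0, 1.0],
}


def motif_length(pwm):
    """Number of columns (positions) in the matrix."""
    return len(next(iter(pwm.values())))


def pwm_score(pwm, s):
    """Sum m[s_j][j] over all positions j of a substring of motif length."""
    if len(s) != motif_length(pwm):
        raise ValueError("substring length must equal the number of columns")
    return sum(pwm[symbol][j] for j, symbol in enumerate(s))


def scan(pwm, sequence):
    """Score every window of motif length in a longer sequence."""
    n = motif_length(pwm)
    return [(i, pwm_score(pwm, sequence[i:i + n]))
            for i in range(len(sequence) - n + 1)]
```

For instance, `pwm_score(PWM, "AGT")` adds the entries for A at position 1, G at position 2 and T at position 3, and `scan(PWM, "AGTC")` applies the same rule to each of the two windows of length 3.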

Basic PWM with log-likelihoods

A PWM assumes independence between positions in the pattern, as it calculates the score at each position independently of the symbols at other positions. The score of a substring aligned with a PWM can be interpreted as the log-likelihood of the substring under a product multinomial distribution. Each column holds the log-likelihoods of the different symbols, and the likelihoods in a column sum to one, so each column corresponds to a multinomial distribution. A PWM score is a sum of log-likelihoods, which corresponds to a product of likelihoods, so the score is the log-likelihood of the substring under a product-multinomial distribution. PWM scores can also be interpreted in a physical framework as the sum of binding energies for all nucleotides (symbols of the substring) aligned with the PWM.
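The correspondence between a sum of log-likelihoods and a product of likelihoods can be checked numerically. The sketch below assumes a made-up probability matrix `PROBS` in which every entry is positive (a zero entry would make the logarithm undefined); all function names are hypothetical.

```python
import math

# Hypothetical per-position symbol probabilities; each column sums to one.
PROBS = {
    "A": [0.7, 0.1, 0.25],
    "C": [0.1, 0.1, 0.25],
    "G": [0.1, 0.7, 0.25],
    "T": [0.1, 0.1, 0.25],
}


def log_likelihood_pwm(probs):
    """Turn each probability p[i][j] into the matrix entry log(p[i][j])."""
    return {sym: [math.log(p) for p in row] for sym, row in probs.items()}


def score(pwm, s):
    """PWM score: sum of the position-specific entries for substring s."""
    return sum(pwm[sym][j] for j, sym in enumerate(s))


def likelihood(probs, s):
    """Product of per-position probabilities (positions are independent)."""
    return math.prod(probs[sym][j] for j, sym in enumerate(s))


pwm = log_likelihood_pwm(PROBS)
# The PWM score equals the log of the product-multinomial likelihood.
same = math.isclose(score(pwm, "AGT"), math.log(likelihood(PROBS, "AGT")))
```

Here `same` is `True` up to floating-point tolerance, which is the independence assumption expressed in code: the joint likelihood factorises over positions.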

Incorporating background distribution

Instead of using log-likelihood values in the PWM, as described in the previous section, several methods use log-odds scores. An element of the PWM is then calculated as <math>m_{i,j}=\log(p_{i,j} / b_i)</math>, where <math>p_{i,j}</math> is the probability of observing symbol <math>i</math> at position <math>j</math> of the motif, and <math>b_i</math> is the probability of observing symbol <math>i</math> in a background model. The PWM score then corresponds to the log-odds of the substring being generated by the motif versus being generated by the background, in a generative model of the sequence.
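A log-odds PWM can be sketched as follows. The probabilities and the uniform background are invented for illustration, the function names are hypothetical, and the base-2 logarithm is an assumption (the text does not fix a base). All probabilities are assumed to be positive.

```python
import math

# Hypothetical motif probabilities (columns sum to one) and background model.
PROBS = {
    "A": [0.7, 0.1, 0.25],
    "C": [0.1, 0.1, 0.25],
    "G": [0.1, 0.7, 0.25],
    "T": [0.1, 0.1, 0.25],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_pwm(probs, background):
    """Build m[i][j] = log2(p[i][j] / b[i]); base 2 is an assumption."""
    return {sym: [math.log2(p / background[sym]) for p in row]
            for sym, row in probs.items()}


def log_odds_score(pwm, s):
    """Log-odds of substring s under the motif versus the background."""
    return sum(pwm[sym][j] for j, sym in enumerate(s))


M = log_odds_pwm(PROBS, BACKGROUND)
```

A positive `log_odds_score(M, s)` means the substring is more probable under the motif than under the background, and a negative score means the reverse. A column that matches the background exactly, such as the third column here, contributes zero for every symbol.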

Information content of a PWM

The information content (IC) of a PWM is sometimes of interest, as it indicates how much a given PWM differs from a uniform distribution.

The self-information of observing a particular symbol at a particular position of the motif is:

<math>-\log(p_{i,j})</math>

The expected (average) self-information of a particular element in the PWM is then:

<math>-p_{i,j} \cdot \log(p_{i,j})</math>

Finally, the IC of the PWM is then the sum of the expected self-information of every element (that is, the sum of the Shannon entropies of the columns):

<math>\textstyle -\sum_{i,j} p_{i,j}\cdot \log(p_{i,j})</math>
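The sum above can be computed directly from a probability matrix. In this sketch the function name is hypothetical, the base-2 logarithm (giving a result in bits) is an assumption, and entries with zero probability are taken to contribute nothing, following the usual convention that 0 · log 0 = 0.

```python
import math


def expected_self_information(probs):
    """Return -sum over i, j of p[i][j] * log2(p[i][j]), in bits."""
    total = 0.0
    for row in probs.values():
        for p in row:
            if p > 0.0:  # convention: a zero-probability entry adds nothing
                total -= p * math.log2(p)
    return total


# Hypothetical two-column matrix: the first column is uniform,
# the second column always shows G.
EXAMPLE = {
    "A": [0.25, 0.0],
    "C": [0.25, 0.0],
    "G": [0.25, 1.0],
    "T": [0.25, 0.0],
}
bits = expected_self_information(EXAMPLE)
```

The uniform column contributes 2 bits and the fully conserved column contributes 0, so `bits` is 2 for this matrix.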

Often, it is more useful to calculate the information content relative to the background letter frequencies of the sequences under study rather than assuming equal probabilities for each letter. For example, the GC-content of the DNA of thermophilic bacteria of the genus Thermus ranges from 65.3% to 70.8%, so in such a genome a motif of ATAT would contain much more information than a motif of CCGG. The equation for information content thus becomes

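<math>\textstyle \sum_{i,j} p_{i,j}\cdot \log(p_{i,j}/b_i)</math>

where <math>b_i</math> is the background frequency of symbol <math>i</math>, as in the log-odds scores above. With a uniform background over an alphabet of <math>K</math> symbols and a motif of <math>N</math> positions, this equals <math>N \log K</math> minus the sum of expected self-information given above.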
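The ATAT versus CCGG comparison can be reproduced with a small sketch. The background frequencies below are an assumption chosen to give a GC-content of 70%; the two motifs are modelled as fully conserved matrices, the function names are hypothetical, and the base-2 logarithm is again an assumption.

```python
import math


def relative_information(probs, background):
    """Sum over i, j of p[i][j] * log2(p[i][j] / b[i]), in bits."""
    total = 0.0
    for sym, row in probs.items():
        for p in row:
            if p > 0.0:  # zero-probability entries contribute nothing
                total += p * math.log2(p / background[sym])
    return total


def consensus_matrix(motif, alphabet="ACGT"):
    """Probability matrix for a fully conserved motif (one symbol per column)."""
    return {sym: [1.0 if sym == c else 0.0 for c in motif] for sym in alphabet}


# Assumed GC-rich background: GC-content of 70%.
GC_RICH = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

ic_atat = relative_information(consensus_matrix("ATAT"), GC_RICH)
ic_ccgg = relative_information(consensus_matrix("CCGG"), GC_RICH)
```

Under this background `ic_atat` is larger than `ic_ccgg`, because each conserved A or T is less expected by chance than a conserved C or G. A matrix whose columns all equal the background gives zero.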

External links

JASPAR: http://jaspar.genereg.net/