TBpred
From DrugPedia: A Wikipedia for Drug discovery
Revision as of 07:05, 20 August 2008
TBpred
TBpred is a prediction server that assigns mycobacterial proteins to one of four subcellular localizations: cytoplasmic, integral membrane, secreted, or attached to the membrane by a lipid anchor. It is an SVM-based method that exploits different protein features: amino acid composition, dipeptide composition, and the position-specific scoring matrix (PSSM). The overall prediction accuracies of these SVM modules are 82.51%, 80.39%, and 86.62%, respectively. In addition to SVM, profile HMM and MEME/MAST motif-based approaches were also applied, and a hybrid approach combining the PSSM-based SVM model with the MEME/MAST model has been incorporated.
Importance of this webserver:
* The location of a protein inside a cell gives insight into its function, so this server may serve as a tool for functional annotation of mycobacterial proteins.
* An organism-specific classifier performs better than a generalised one, so the server is expected to assign a protein's subcellular localization more accurately.
* The un-annotated portion of mycobacterial genomes can be annotated, which may help identify new potential drug and vaccine targets.
Availability of TBpred Webserver:
This server is available at TBpred (http://www.imtech.res.in/raghava/tbpred).
Algorithm behind TBpred
About Dataset:
The dataset of mycobacterial proteins and their subcellular localizations was compiled from SWISS-PROT. Of 1365 proteins, those whose localization carried the non-experimental qualifier "by similarity" were excluded, leaving 882 proteins. Of the 13 subcellular compartments represented, the 4 major sites containing a reasonable number of samples were selected. These four classes together contain 852 proteins:
Subcellular Localization | Sample Number |
1. Cytoplasmic | 340 |
2. Integral Membrane | 402 |
3. Secreted | 50 |
4. Attached to the membrane by lipid anchor | 60 |
Support Vector Machine (SVM):
SVMlight was used in classification mode. Several parameters can be tuned to obtain optimum results. Of the built-in kernels, three were used: linear, polynomial, and RBF. Subcellular localization prediction is a multi-class problem. For a given protein feature, four SVM modules were developed, one for each subcellular localization. The nth SVM model is trained with the samples of the nth class labelled positive and all remaining samples labelled negative. An unknown sample is assigned to the class whose model gives the highest of the four scores.
Evaluation of prediction performance of TBpred:
The performance of the method was evaluated by 5-fold cross-validation. The whole dataset is partitioned into 5 sets such that no two proteins from different sets share more than 36% sequence similarity. Training is done on four sets and the remaining set is used for testing. So that every protein is tested, this process is carried out 5 times, each time with a different set held out for testing. The performance of the SVM modules was measured by accuracy and the Matthews correlation coefficient (MCC), using the following equations:
Accuracy(x) = p(x) / exp(x)
MCC(x) = (p(x) n(x) - u(x) o(x)) / sqrt((p(x) + u(x)) (p(x) + o(x)) (n(x) + u(x)) (n(x) + o(x)))
where x can be any subcellular location (cytoplasmic, integral membrane, secreted, or attached to the membrane by a lipid anchor), exp(x) is the number of sequences observed in location x, p(x) is the number of correctly predicted sequences of location x, n(x) is the number of correctly predicted sequences not of location x, u(x) is the number of under-predicted sequences, and o(x) is the number of over-predicted sequences.
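The maximum-score decision rule and the two measures can be sketched together in Python. This is only an illustration, not the TBpred code: it assumes Accuracy(x) = p(x) / exp(x) and the usual MCC form built from p(x), n(x), u(x) and o(x) as defined above, and the class names, scores and labels are made-up toy values.

```python
import math

def predict(scores):
    """Return the class whose one-vs-rest model gave the highest score."""
    return max(scores, key=scores.get)

def per_class_metrics(true_labels, predicted_labels, x):
    """Return (accuracy, MCC) for a single location x."""
    pairs = list(zip(true_labels, predicted_labels))
    p = sum(1 for t, y in pairs if t == x and y == x)  # correctly predicted as x
    n = sum(1 for t, y in pairs if t != x and y != x)  # correctly predicted as not x
    u = sum(1 for t, y in pairs if t == x and y != x)  # under-predicted (missed)
    o = sum(1 for t, y in pairs if t != x and y == x)  # over-predicted (false hits)
    observed = p + u                                   # exp(x): sequences observed in x
    accuracy = p / observed if observed else 0.0
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    mcc = (p * n - u * o) / denom if denom else 0.0
    return accuracy, mcc

# Hypothetical scores from four one-vs-rest models for six toy proteins.
toy_scores = [
    {"cytoplasmic": 1.2, "integral_membrane": -0.4, "secreted": -1.0, "lipid_anchor": -0.9},
    {"cytoplasmic": 0.7, "integral_membrane": 0.1, "secreted": -0.8, "lipid_anchor": -1.1},
    {"cytoplasmic": -0.2, "integral_membrane": 0.3, "secreted": -0.6, "lipid_anchor": -0.7},
    {"cytoplasmic": -0.9, "integral_membrane": 1.5, "secreted": -0.3, "lipid_anchor": -0.5},
    {"cytoplasmic": 0.4, "integral_membrane": 0.2, "secreted": -0.9, "lipid_anchor": -1.3},
    {"cytoplasmic": -1.0, "integral_membrane": -0.6, "secreted": 0.8, "lipid_anchor": 0.1},
]
toy_true = ["cytoplasmic", "cytoplasmic", "cytoplasmic",
            "integral_membrane", "integral_membrane", "secreted"]

toy_predicted = [predict(s) for s in toy_scores]
toy_accuracy, toy_mcc = per_class_metrics(toy_true, toy_predicted, "cytoplasmic")
```

In the toy data two of the three cytoplasmic proteins are recovered and one membrane protein is over-predicted as cytoplasmic, so both terms of the MCC numerator are exercised.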
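Various Prediction Approaches:
In this study three main approaches were examined, each based on a different protein feature.
Amino Acid Composition is the fraction of each amino acid present in a protein. The SVM is trained with a 20-dimensional input vector for each protein. The overall prediction accuracy of this SVM module (RBF kernel, g = 0.1, c = 600, j = 5) was 82.51%.
Subcellular Localization | Accuracy(%) | MCC |
Cytoplasmic | 88.82 | 0.77 |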
Integral Membrane | 86.07 | 0.71 |
Secreted | 44.00 | 0.57 |
Attached to membrane by a lipid anchor | 55.00 | 0.58 |
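The amino acid composition feature, the fraction of each of the 20 standard amino acids in a protein, can be sketched as follows. The function name, the handling of non-standard residue codes and the example sequence are assumptions made for illustration; the page does not say how TBpred treats ambiguous residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue fractions."""
    # Assumption: residues outside the 20 standard codes (e.g. X, B, Z) are ignored.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]

# Hypothetical 8-residue peptide, not taken from the TBpred dataset.
aac_vector = amino_acid_composition("MKTAYIAK")
```

The vector always has 20 entries that sum to 1, whatever the protein length, which is what lets proteins of different sizes share one fixed-length SVM input.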
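Dipeptide Composition is the second feature. The overall prediction accuracy of this SVM module was 80.39%.
Subcellular Localization | Accuracy(%) | MCC |
Cytoplasmic | 89.41 | 0.72 |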
Integral Membrane | 81.09 | 0.67 |
Secreted | 50.00 | 0.60 |
Attached to membrane by a lipid anchor | 50.00 | 0.57 |
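Dipeptide composition is named in the introduction but not defined on this page. The sketch below assumes the conventional definition, the fraction of each of the 400 ordered pairs of adjacent residues, and is not taken from the TBpred code. The function name and example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so every protein maps to the same layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair fractions."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Assumption: pairs containing a non-standard residue are skipped.
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    total = len(pairs)
    return [counts.get(dp, 0) / total for dp in DIPEPTIDES]

# Hypothetical 8-residue peptide: 7 overlapping pairs, each occurring once.
dpc_vector = dipeptide_composition("MKTAYIAK")
```

Unlike the 20-dimensional composition, this vector keeps some local order information, since "MK" and "KM" are counted separately.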
Position-Specific Scoring Matrix (PSSM): from the PSSM obtained for each protein sequence, an SVM input pattern is built. The input vector has 400 dimensions. The overall accuracy achieved by this SVM module (RBF kernel, g = 2, c = 50, j = 1) was 86.62%.
Subcellular Localization | Accuracy(%) | MCC |
Cytoplasmic | 94.71 | 0.85 |
Integral Membrane | 87.81 | 0.80 |
Secreted | 44.00 | 0.48 |
Attached to membrane by a lipid anchor | 68.33 | 0.69 |
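The per-class figures can be checked against the reported overall accuracies. The sketch below assumes that the three result tables correspond, in order, to the amino acid composition, dipeptide composition and PSSM modules, and that overall accuracy means correctly predicted proteins divided by all proteins in the four classes. Neither assumption is stated explicitly on the page.

```python
# Class sizes from the dataset table.
SAMPLES = {"cytoplasmic": 340, "integral_membrane": 402, "secreted": 50, "lipid_anchor": 60}

# Per-class accuracy (%) from the three result tables, in page order.
PER_CLASS_ACCURACY = {
    "amino_acid_composition": {"cytoplasmic": 88.82, "integral_membrane": 86.07,
                               "secreted": 44.00, "lipid_anchor": 55.00},
    "dipeptide_composition": {"cytoplasmic": 89.41, "integral_membrane": 81.09,
                              "secreted": 50.00, "lipid_anchor": 50.00},
    "pssm": {"cytoplasmic": 94.71, "integral_membrane": 87.81,
             "secreted": 44.00, "lipid_anchor": 68.33},
}

# Overall accuracies (%) quoted in the text.
REPORTED_OVERALL = {"amino_acid_composition": 82.51,
                    "dipeptide_composition": 80.39,
                    "pssm": 86.62}

def overall_accuracy(per_class):
    """Sample-weighted mean of the per-class accuracies, in percent."""
    total = sum(SAMPLES.values())
    correct = sum(SAMPLES[c] * per_class[c] / 100 for c in SAMPLES)
    return 100 * correct / total

recomputed = {name: overall_accuracy(acc) for name, acc in PER_CLASS_ACCURACY.items()}
differences = {name: abs(recomputed[name] - REPORTED_OVERALL[name]) for name in recomputed}
```

Under these assumptions each recomputed value lands within about 0.01 percentage points of the reported figure. The overall numbers are dominated by the two large classes, which is why they stay above 80% even though the secreted and lipid-anchored classes score far lower.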
--Mamoon 04:57, 20 August 2008 (UTC)