PSLpred

From DrugPedia: A Wikipedia for Drug discovery

Revision as of 07:06, 27 August 2008 by Aarti (Talk | contribs)

A Method to predict Bacterial subcellular Localizations


Need for a method

Prokaryotes are the causative agents of many deadly diseases and widespread epidemics, so biologists pay considerable attention to the functional annotation of prokaryotic proteins. Such annotation may guide the identification of virulence factors and of new patterns of antibiotic resistance in pathogenic bacteria. Predicting the subcellular localization of proteins from Gram-negative bacteria (an alternative route to functional annotation) would therefore be very useful in molecular biology, cell biology, pharmacology, and medical science.

Prokaryotes (Gram-negative bacteria) have five major subcellular localizations (outer membrane, inner membrane, periplasm, cytoplasm, and extracellular), each specialized for distinct biochemical processes. PSLpred, an SVM-based method, has been developed to predict the subcellular localization of prokaryotic proteins using input features such as amino acid composition, dipeptide composition, and physico-chemical properties, along with similarity-search results.


Strategies used to develop the PSLpred algorithm

Dataset

The data set used in the present work is the same as that used by Yu et al. (2004) to develop the method CELLO. It was generated from SWISS-PROT release 40.29 and consists of 1443 proteins in total: 1302 localized to a single subcellular site and 141 resident at multiple locations. For developing PSLpred, the 141 proteins residing in more than one subcellular location were excluded, and the 1302 proteins with a single subcellular localization (248 cytoplasmic, 268 inner membrane, 244 periplasmic, 352 outer membrane, and 190 extracellular) were used for training and evaluation.


Performance of Modules of PSLpred

i) Amino acid composition

An SVM module based on the amino acid composition of a protein achieved its best results with the RBF kernel (g=100, c=2, j=1). Computing the amino acid composition yields a 20-dimensional input vector for each protein sequence; these vectors were used to train five SVM models, one for each of the five subcellular localizations. The composition-based SVM module predicted localization with an overall accuracy of 86%.
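As an illustration of the 20-dimensional input described above, the following minimal sketch computes an amino acid composition vector. The residue ordering, the use of fractions rather than percentages, and the choice to ignore non-standard residue codes are assumptions made here; the source does not specify them.

```python
# The 20 standard amino acids in a fixed (alphabetical one-letter) order.
# The ordering is an assumption; any fixed order works if used consistently.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 residue fractions of a protein sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Assumption: ambiguous or non-standard codes (X, B, Z, ...) are skipped.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Each protein, whatever its length, is thereby mapped to a vector of the same size, which is what allows a fixed-input classifier such as an SVM to be trained on it.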

ii) Dipeptide composition

The dipeptide composition based SVM module captures information about amino acid composition along with the local order of amino acids. It uses a fixed-length input vector of 400 dimensions. With the RBF kernel (g=300, C=2), this module predicted localization with an overall accuracy of 86%.
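The 400 dimensions correspond to the 20 x 20 possible ordered pairs of residues. A minimal sketch of such a vector is given below; it counts overlapping adjacent pairs and normalizes by the number of pairs counted, which is an assumption, as are the pair ordering and the handling of non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY (ordering is an assumption).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide fractions of a protein sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width 2 over the sequence (overlapping pairs).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        # Assumption: pairs containing non-standard residues are skipped.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]
```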

iii) Composition of physico-chemical properties

Computing the composition of physico-chemical properties of a protein sequence yields an input vector of 33 dimensions per sequence. The overall accuracy of the properties-based SVM module is 83%, about 3% lower than that of the amino acid composition based SVM module.

iv) Similarity-search based module

The performance of the PSI-BLAST based module was also evaluated through 5-fold cross-validation. This module performed worse than the other modules developed in the present study: the SVM module based on this approach predicted the subcellular localization of proteins with an overall accuracy of 68%.
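For readers unfamiliar with 5-fold cross-validation, the sketch below partitions a labelled data set into five folds and yields the train/test index lists for each round. Stratifying the folds by class, the random seed, and the function names are assumptions for illustration; the source only states that 5-fold cross-validation was used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into n_folds lists, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(labels, n_folds=5, seed=0):
    """Yield (train_indices, test_indices), holding out one fold per round."""
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        test = sorted(folds[held_out])
        train = sorted(
            i for k, fold in enumerate(folds) if k != held_out for i in fold
        )
        yield train, test
```

Every protein is used for testing exactly once and for training in the remaining four rounds, so the reported accuracy is an average over predictions made on sequences the model did not see during training.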


Hybrid Module

The hybrid module combines the information from the composition-based and similarity-search based modules and achieved an overall accuracy of 91.2% (g=25, C=4), which is 5-8% higher than that of the individual composition-based modules. This indicates that the hybrid module encapsulates more information, which improves the reliability of the predictions. These results suggest that detecting the subcellular localization of a protein requires a wide range of information about that protein.
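One simple way to picture a hybrid input is as the concatenation of the feature blocks produced by the individual modules, followed by a winner-takes-all decision over the five per-location SVMs. The sketch below shows only that idea. How PSLpred actually encodes the similarity-search result and combines the blocks is not stated in the source; the function names, the 5-element similarity block in the usage note, and the score dictionary are hypothetical.

```python
LOCATIONS = [
    "cytoplasmic",
    "extracellular",
    "inner membrane",
    "outer membrane",
    "periplasmic",
]


def hybrid_vector(*feature_blocks):
    """Concatenate per-module feature lists into a single input vector."""
    vector = []
    for block in feature_blocks:
        vector.extend(float(value) for value in block)
    return vector


def predict_location(scores):
    """Pick the location whose one-vs-rest SVM gave the highest score.

    `scores` maps each location name to the decision value of its SVM
    (hypothetical interface; the SVMs themselves are not modelled here).
    """
    missing = [loc for loc in LOCATIONS if loc not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return max(LOCATIONS, key=lambda loc: scores[loc])
```

For example, joining a 20-value composition block, a 400-value dipeptide block, and an assumed 5-value similarity block gives a 425-dimensional vector; the choice of blocks here is illustrative only.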

Subcellular localization    Accuracy (%)    MCC
Cytoplasmic                 90.7            0.86
Extracellular               86.8            0.88
Inner-membrane              90.3            0.90
Outer-membrane              95.2            0.95
Periplasmic                 90.6            0.89
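The two columns of the table can be computed per location from true and predicted labels, as sketched below. It is assumed here that per-location accuracy means the percentage of proteins truly at that location that were predicted correctly, and that MCC is the standard Matthews correlation coefficient computed one-vs-rest; the source does not spell out either definition.

```python
import math


def per_class_metrics(true_labels, predicted_labels, location):
    """Return (accuracy in %, MCC) for one location, treated one-vs-rest."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == location and guess == location:
            tp += 1
        elif truth != location and guess == location:
            fp += 1
        elif truth == location:
            fn += 1
        else:
            tn += 1
    # Assumption: accuracy = correctly predicted / observed at this location.
    accuracy = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc
```

MCC ranges from -1 to 1 and, unlike the accuracy column, also penalizes proteins from other locations that are wrongly assigned to the location in question.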


Reliability Index (RI)

To assess the reliability of individual predictions, an RI was assigned to each prediction of the hybrid module. Accuracies of 90% and 98.1% were obtained for predictions with RI=4 and RI=5, respectively, and ~74% of the sequences have RI=5. Hence, for the majority of sequences, the present method predicts the subcellular localization of prokaryotic proteins with high reliability.
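The source does not give the formula used to assign the RI. A common scheme derives it from the gap between the highest and second-highest SVM outputs, so that a clear winner receives a high index. The sketch below follows that idea; the 1 to 5 scale, the `max_margin` cut-off of 4.0, and the linear binning are illustrative assumptions, not PSLpred's documented definition.

```python
def reliability_index(scores, max_margin=4.0, levels=5):
    """Map the gap between the two best scores to an index from 1 to levels.

    `scores` is the list of decision values from the per-location SVMs.
    `max_margin` and `levels` are hypothetical parameters for illustration.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two scores to compute a margin")
    margin = ranked[0] - ranked[1]
    if margin >= max_margin:
        return levels
    # Split [0, max_margin) into `levels` equal-width bins numbered from 1.
    return int(margin * levels / max_margin) + 1
```

With such an index, accuracy can be reported separately for each RI value, which is how figures like those quoted above for RI=4 and RI=5 are typically obtained.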


Comparison with other methods

The performance of the hybrid module developed in the present study was compared with that of methods such as CELLO and PSORT-B, which were also developed from the same data set. The overall performance of the hybrid module is nearly 2% higher than that of CELLO and 16% higher than that of PSORT-B. On this data set, the present method is therefore more accurate at predicting the subcellular localization of prokaryotic proteins.