HSLpred

From DrugPedia: A Wikipedia for Drug discovery

Subcellular Localization Prediction method for Human Proteins


In the 21st century, the functional annotation of sequence data generated by the Human Genome Project is a major challenge for the scientific community. Because a protein's function is closely related to its subcellular localization, determining where a protein resides can help elucidate its function.

HSLpred is an SVM-based method for predicting four major subcellular localizations (cytoplasm, mitochondria, nucleus and plasma membrane) of human proteins.


Dataset

The dataset of human proteins used to develop HSLpred was extracted from a special release of the SWISS-PROT database. The final non-redundant dataset consisted of 3532 human proteins (840 cytoplasmic, 315 mitochondrial, 858 nuclear and 1519 plasma membrane). The dataset is available at www.imtech.res.in/raghava/hslpred


Input Features

i) Amino acid composition is the fraction of each amino acid in a protein. Calculating it yields a 20-dimensional input vector, which was used to train four SVM models, one for each of the four subcellular localizations. The composition-based SVM module (RBF kernel, g=300, C=2, j=1) achieved an overall accuracy of 76.6%.

ii) Dipeptide composition was used to encapsulate global information about each protein sequence as a fixed-length vector of 400 (20 x 20) values. This representation captures the amino acid composition along with the local order of amino acids. For the 1-2 dipeptide SVM module (adjacent residues i and i + 1), the best overall accuracy of 77.8% was achieved with the RBF kernel (g=50, C=6, j=1). In addition, to observe the interaction of the ith residue with the 3rd, 4th and 5th residues in the sequence, higher-order dipeptides (i + 2, i + 3 and i + 4, respectively) were also calculated.

iii) PSI-BLAST: a similarity-search-based module was designed in which a query sequence is searched against a local non-redundant dataset of human proteins. PSI-BLAST was used instead of standard BLAST because it can detect remote homologies. Three iterations of PSI-BLAST were carried out at a cut-off E-value of 0.001. Depending on the similarity of the query protein to the proteins in the dataset, this module could predict any of the four localizations, with an accuracy of 73.3%. The module returns "unknown subcellular localization" if no significant similarity is found.
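
The three input features above can be sketched in a few lines of Python. This is a minimal illustration, not the HSLpred implementation: the function names, the choice to skip non-standard residues, and the rule of taking the localization of the lowest E-value hit are all assumptions made here. The PSI-BLAST search itself is not run; the hypothetical `hits` argument stands in for its parsed output as (protein ID, E-value) pairs.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Map each of the 400 ordered residue pairs to a fixed vector position.
DIPEPTIDE_INDEX = {a + b: i
                   for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}
UNKNOWN = "unknown subcellular localization"


def amino_acid_composition(sequence):
    """Return 20 fractions, one per standard amino acid."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence, gap=1):
    """Return 400 fractions of residue pairs (i, i + gap).

    gap=1 pairs adjacent residues; gap=2, 3, 4 give the higher-order
    dipeptides i + 2, i + 3 and i + 4.
    """
    if gap < 1:
        raise ValueError("gap must be at least 1")
    seq = sequence.upper()
    counts = [0] * 400
    for i in range(len(seq) - gap):
        idx = DIPEPTIDE_INDEX.get(seq[i] + seq[i + gap])
        if idx is not None:  # assumption: ignore pairs with non-standard residues
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence too short for this gap")
    return [c / total for c in counts]


def localization_from_hits(hits, known_localization, evalue_cutoff=0.001):
    """Pick a localization from parsed similarity-search hits.

    hits: iterable of (protein_id, evalue) pairs (hypothetical parsed output).
    known_localization: dict mapping protein_id to its annotated localization.
    Assumption: the significant hit with the lowest E-value decides.
    """
    significant = [(evalue, pid) for pid, evalue in hits
                   if evalue <= evalue_cutoff and pid in known_localization]
    if not significant:
        return UNKNOWN
    _, best_id = min(significant)
    return known_localization[best_id]
```

For example, `dipeptide_composition(seq, gap=3)` would describe the i + 3 pairs of a sequence `seq`, while `amino_acid_composition(seq)` always has 20 entries that sum to 1.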


Hybrid module

This module combines three kinds of information about the protein: amino acid composition, dipeptide composition and evolutionary information from the PSI-BLAST output. The SVM was provided with an input vector of 425 dimensions: 20 for amino acid composition, 400 for dipeptide composition and five for the PSI-BLAST output. This module performed better than any of the individual feature-based modules. With the RBF kernel (g=50, C=2, j=1), the hybrid module achieved an overall accuracy of 84.9%.
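
As a rough sketch of how such a 425-dimensional vector could be assembled, the Python below concatenates the three parts. The text does not say how the five PSI-BLAST values are encoded; the one-hot encoding over the four localizations plus "unknown" used here is an assumption, as are the names `encode_psiblast` and `hybrid_vector` and the order of the classes.

```python
# Assumed class order for the five PSI-BLAST positions (hypothetical).
PSIBLAST_CLASSES = ("cytoplasm", "mitochondrial", "nuclear",
                    "plasma membrane", "unknown")


def encode_psiblast(label):
    """One-hot encode a PSI-BLAST prediction into five values."""
    if label not in PSIBLAST_CLASSES:
        label = "unknown"  # anything unrecognised falls into the last slot
    return [1.0 if label == cls else 0.0 for cls in PSIBLAST_CLASSES]


def hybrid_vector(aa_composition, dipeptide_composition, psiblast_label):
    """Concatenate 20 + 400 + 5 features into one 425-dimensional vector."""
    if len(aa_composition) != 20:
        raise ValueError("amino acid composition must have 20 values")
    if len(dipeptide_composition) != 400:
        raise ValueError("dipeptide composition must have 400 values")
    return (list(aa_composition) + list(dipeptide_composition)
            + encode_psiblast(psiblast_label))
```

The resulting list can be passed as one input row to an SVM trainer; positions 0-19 hold the amino acid composition, 20-419 the dipeptide composition and 420-424 the PSI-BLAST part.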


Availability of HSLpred server

The SVM modules described above have been implemented on a web server (HSLpred) using CGI/Perl scripts. The HSLpred server is available on the World Wide Web at www.imtech.res.in/raghava/hslpred/ or bioinformatics.uams.edu/raghava/hslpred/. Users can enter a protein sequence in one of the standard formats, such as FASTA, GenBank, EMBL, GCG or plain format. The server offers a choice of approaches for predicting the subcellular localization of a query sequence. By default, it uses the hybrid module.