Datasets in Bioinformatics
From DrugPedia: A Wikipedia for Drug discovery
A number of datasets are created and used in the field of bioinformatics. A dataset contains the vital information on which a prediction server depends for its function. Some of the datasets created or used by the Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh are as follows:
Datasets for evaluation of beta turn prediction method
The dataset has 426 non-homologous protein chains. In this dataset, no two protein chains have more than 25% sequence identity. The structures of these proteins were determined by X-ray crystallography at 2.0 Å resolution or better. Each chain contains at least one beta turn.
Complete Dataset
- Amino acid sequences of the 426 protein chains in FASTA format
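For readers working with the sequence file, a minimal sketch of reading FASTA-formatted records in Python is given below. The chain names and sequences are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header text without the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical two-chain example (not real entries from the dataset).
example = """>chainA hypothetical
MKTAYIAK
QRQISFVK
>chainB hypothetical
GSHMLEDP
""".splitlines()

chains = parse_fasta(example)
for name, seq in chains.items():
    print(name, len(seq))
```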
ProPred-I
The Promiscuous MHC Class-I Binding Peptide Prediction Server
ProPred-I is an online service for identifying MHC Class-I binding regions in antigens. It implements matrices for 47 MHC Class-I alleles, together with proteasomal and immunoproteasomal models. The main aim of this server is to help users identify promiscuous regions.
Dataset
The two datasets used in developing this server are:
Matrix Optimization Technique for Predicting MHC binding Core
The X-ray crystal structure of the MHC class II molecule has revealed an open peptide-binding groove. A peptide bound in this groove may extend beyond it at either end. Knowing which residues are actually involved in binding is very useful for understanding MHC-peptide interactions. Here the Matrix Optimization Technique (MOT) is used to predict the MHC binding core. Using binders from MHCPEP and non-binder data, MOT achieved a classification accuracy of 97 to 99% for the HLA-DR1, HLA-DR2 and HLA-DR5 alleles. The authors describe this as the highest accuracy reported by any method. The prediction method used in this server is based on MOT and relies on the idea that binders have unique patterns that can be easily distinguished from non-binders.
Dataset
The binders used in this study:
The non-binders used in this study:
Bcepred: Prediction of linear B-cell epitopes, using physico-chemical properties
We evaluated the performance of existing linear B-cell epitope prediction methods based on physico-chemical properties on a non-redundant dataset. The dataset consists of 1029 B-cell epitopes obtained from the Bcipep database and an equal number of non-epitopes obtained randomly from the Swiss-Prot database.
Data set
B-cell epitopes were obtained from the B-cell epitope database BCIPEP, which contains 2479 continuous epitopes, including 654 immunodominant and 1617 immunogenic epitopes. All identical epitopes and non-immunogenic peptides were removed, leaving 1029 unique, experimentally proven continuous B-cell epitopes. The dataset covers a wide range of pathogenic groups such as viruses, bacteria, protozoa and fungi. The final dataset consists of 1029 B-cell epitopes and 1029 non-epitopes or random peptides (of equal length and the same frequency, generated from SWISS-PROT).
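The construction of a balanced dataset described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the epitopes and the background sequence are hypothetical, and the sampling rule (one random segment per epitope, of the same length) is a guess at what "equal length" means, not the authors' documented procedure.

```python
import random
from typing import List


def unique_epitopes(epitopes: List[str]) -> List[str]:
    """Drop identical epitopes while keeping first-seen order."""
    return list(dict.fromkeys(epitopes))


def random_peptides(lengths: List[int], background: str,
                    rng: random.Random) -> List[str]:
    """Cut one random segment of each requested length from a background sequence."""
    peptides = []
    for n in lengths:
        start = rng.randrange(len(background) - n + 1)
        peptides.append(background[start:start + n])
    return peptides


# Hypothetical inputs, not taken from BCIPEP or Swiss-Prot.
epitopes = ["PEPTIDEK", "PEPTIDEK", "ANTIGENSEQ", "EPITLPE"]
background = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

positives = unique_epitopes(epitopes)
negatives = random_peptides([len(e) for e in positives], background,
                            random.Random(0))
print(len(positives), len(negatives))
```

A real pipeline would also have to make sure that a random peptide does not coincide with a known epitope; that check is left out here.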
HLApred: Identification and prediction of HLA class I & class II binder
The method can identify and predict HLA binding regions from an antigen sequence. It allows identification and prediction for 87 alleles, of which 51 belong to Class I and 36 belong to Class II. The output format (HTML MAPPING) assists users in locating promiscuous HLA binders, which are putative vaccine candidates.
Data for Identification of Experimentally proven Binders
The server allows searching of an antigen sequence against MHCBN Database version 3.1 (13). MHCBN is a comprehensive database of Major Histocompatibility Complex (MHC) binding and non-binding peptides compiled from published literature and existing databases. The database contains more than 23,000 entries. For the selected HLA alleles, the HLApred server searches the query antigen sequence for all the peptides obtained from MHCBN.
Here is the link for MHCBN version 4.0.
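The identification step, locating known binding peptides inside a query antigen, amounts to an exact substring search. A minimal sketch follows, assuming the peptides for the selected alleles are already available as a plain list. The antigen and peptides are invented for illustration, and this is not the HLApred implementation.

```python
from typing import List, Tuple


def find_binders(antigen: str, peptides: List[str]) -> List[Tuple[int, str]]:
    """Return (1-based start position, peptide) for every exact occurrence."""
    hits = []
    for pep in peptides:
        start = antigen.find(pep)
        while start != -1:
            hits.append((start + 1, pep))
            # continue one residue further so overlapping matches are found
            start = antigen.find(pep, start + 1)
    return sorted(hits)


# Hypothetical query antigen and peptide list.
antigen = "MKTLLLTLVVVTIVCLDLGYT"
known_peptides = ["LLTLVVVTI", "VVVTIVCLD", "AAAAAAAAA"]

hits = find_binders(antigen, known_peptides)
for position, peptide in hits:
    print(position, peptide)
```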
HLA_Affi
The preliminary requirement for the stimulation of a cytotoxic T cell response, a mechanism against viruses and certain tumors, is the processing and presentation of endogenous antigenic peptides by MHC-I molecules on the surface of the cell. Methods have been developed to classify and predict the binders and non-binders of MHC. Here we develop an SVM-based method to predict the binding affinity of peptides to MHC-I. The method takes into consideration the amino acid sequence and the physico-chemical properties of proteins.
Dataset
The dataset used in this study to train the SVM was collected from MHCBN and AntiJen. It contains peptides whose affinity values (IC50) have already been determined experimentally. The size of the dataset was reduced by keeping only peptides that are 9 amino acids long (as nonamers are ideal binders of MHC-I molecules). The redundancy of the dataset was further reduced so that no two peptides have >90% sequence identity. The resulting dataset consists of 402 binders and 222 non-binders.
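The two filtering steps described above (keep nonamers, then reduce redundancy) can be sketched as follows. The identity measure is an assumption: the text does not say how sequence identity was computed, so the sketch uses the fraction of matching positions between two equal-length peptides. The peptides are examples only.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length peptides."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def reduce_redundancy(peptides: List[str], max_identity: float = 0.9) -> List[str]:
    """Greedily keep a peptide only if it is not too similar to any kept one."""
    kept: List[str] = []
    for pep in peptides:
        if all(identity(pep, k) <= max_identity for k in kept):
            kept.append(pep)
    return kept


# Illustrative peptides; the last one is an octamer and is filtered out.
raw = ["GILGFVFTL", "GILGFVFTL", "GILGFVFTV", "NLVPMVATV", "SIINFEKL"]

nonamers = [p for p in raw if len(p) == 9]
non_redundant = reduce_redundancy(nonamers)
print(non_redundant)
```

Under this ungapped measure, two nonamers that differ at a single position share 8/9 (about 89%) identity, so a >90% cutoff removes only exact duplicates. A different identity definition would behave differently.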
GeneBench: Evaluation of Gene finder and Dataset creation server
GeneBench is an interface developed for evaluating gene-finding algorithms. Users can compare the performance of old and new prediction methods on a set of defined accuracy parameters or measures. The server also offers a collection of established datasets that are used for training and testing gene-finding algorithms. Users can download these sets and evaluate their own algorithms.
Brief description of datasets available in GENEBENCH server
DNA sequences were extracted from GenBank release 111.0 (April 1999) to the date of the study (Rogic et al., 2001). The source organisms were H. sapiens, M. musculus and R. norvegicus. The ratio of human:mouse:rat sequences is 103:82:10. The mean length is 7096 bp. There are 43 single-exon genes and 152 multi-exon genes, with an average of 4.86 exons per gene, a mean exon length of 208 bp and a mean intron length of 1015 bp. Coding sequence makes up 14% of the data, against 46% non-coding intron sequence and 40% intergenic region.
The DNA sequences were extracted from the vertebrate divisions of GenBank release 85.0 (October 15, 1994). The source organisms were all vertebrates. A total of 570 sequences, totalling 2,892,149 bp, were obtained after the clean-up procedure (Burset and Guigo, 1996). There were 2649 coding exons, corresponding to 444,498 coding bp (~15%). All the sequences contain multi-exon genes.
GENIE gene finding dataset, containing a total of 793 unrelated human genes. This dataset was used to train the GENIE gene finding system (WWW-access) developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Please see the documentation file for further information, including links to previous versions of this data collection.
The GENIE set of unrelated D. melanogaster genes. This dataset was used to train the GENIE gene finding system developed at LBNL and UCSC. The last update was done in October 1998 using GenBank v.109. Please see the documentation file for further information, including links to previous versions of this data collection. This dataset was developed by Martin Reese (LBNL) with help from Uwe Ohler (University of Erlangen), David Kulp (UCSC) and Andrew Gentles (Stanford). It has 416 gene sequences, including 275 multi-exon and 141 single-exon genes.
Two sets of sequences were developed. The first is a typical benchmark set made of sequences from EMBL database release 50 (1997) that includes 178 human genomic sequences (h178), each coding for a single complete gene for which both the mRNA and the coding exons are known. The second is a semi-artificial set of 42 genomic sequences in which accurate gene annotation is guaranteed. The h178 set has a G+C content of 50% and an average length of 7169 bp, with 1 gene and 5.1 exons per sequence. The semi-artificial sequences have an average length of 177160 bp, with 4.1 genes and 21 exons per sequence on average, and a G+C content of 40%.
All data were taken from the GenBank collection of human nucleotide sequence data on May 30, 1992. All E. coli sequences were extracted on June 28, 1992. For the primary benchmark, successive non-overlapping windows of 54 bases were taken from all human genomic sequences. Windows of length 108 and 162 were also obtained. Each set of windows is split in two: one part is used for training and the other for testing accuracy.
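The windowing scheme in the last dataset above can be illustrated with a short sketch. The window sizes (54, 108, 162) come from the text. The toy sequence, the decision to discard a trailing partial window, and the random half/half split are assumptions made for the example.

```python
import random
from typing import List, Tuple


def non_overlapping_windows(seq: str, size: int) -> List[str]:
    """Cut successive non-overlapping windows; a trailing partial window is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def split_half(items: List[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly and split into two halves (training, testing)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Toy 240-base sequence standing in for a genomic sequence.
sequence = "ACGT" * 60

wins = non_overlapping_windows(sequence, 54)
train_set, test_set = split_half(wins)
print(len(wins), len(train_set), len(test_set))
```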
Used for the Genome Annotation Assessment Project (GASP) 2000 in Drosophila (Reese et al., 2000). The Adh region is 2.9 Mb in total and is presently estimated to contain over 200 genes.
CDpred
Dicer, an RNase III enzyme, cleaves pre-miRNA and dsRNA into short dsRNA (~21 nucleotides) with a 2-nucleotide overhang at the 3' end. Identification of cleavage sites is therefore crucial for understanding RNA interference (miRNA and siRNA) in organisms. In this study, a systematic attempt has been made for the first time to develop a method for predicting Dicer cleavage sites in miRNA. We generated fixed-length cleavage patterns (nucleotides with the cleavage site at the center) and non-cleavage patterns (nucleotides with no cleavage site) from 719 experimentally validated miRNAs obtained from miRBase version 9.0. An SVM model was developed to discriminate cleavage from non-cleavage patterns using binary patterns, and achieved a maximum MCC of 0.66. Organism-specific SVM models were also developed for predicting cleavage sites in human, mouse, rat, D. melanogaster, C. elegans and Danio rerio, in order to examine the similarity or dissimilarity in cleavage specificity of Dicers belonging to different organisms. It was observed that mouse and rat Dicers have similar cleavage specificity, whereas the cleavage specificity of C. elegans Dicer differs from that of the other organisms.
The main aim of this server is to help users predict Dicer processing sites in pre-miRNA.
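To make the terms "fixed-length pattern", "binary pattern" and "MCC" concrete, here is a small sketch. The flank size and the example sequence are hypothetical (the pattern length is not given in the text), and the one-hot scheme shown is the usual way to build a binary pattern, which may differ in detail from what CDpred uses.

```python
import math
from typing import List, Optional

# One-hot ("binary") code for each RNA nucleotide.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}


def centered_window(seq: str, center: int, flank: int) -> Optional[str]:
    """Fixed-length pattern around a 0-based center; None if it runs off the ends."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def binary_pattern(window: str) -> List[int]:
    """Concatenate the one-hot codes of all nucleotides (T is read as U)."""
    vector: List[int] = []
    for nt in window.upper().replace("T", "U"):
        vector.extend(ONE_HOT[nt])
    return vector


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


pattern = centered_window("ACGUACGUA", center=4, flank=2)  # hypothetical hairpin
features = binary_pattern(pattern)
print(pattern, len(features))
```

Each pattern of n nucleotides becomes a vector of 4n zeros and ones, which is the kind of input an SVM can be trained on.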
Dataset information
The dataset of hairpin miRNAs was obtained from miRBase version 9.0 for six model organisms and consists of 1610 sequences in total. In this study we removed miRNA sequences that are not experimentally validated, and selected only those miRNAs that are present on the 5p arm of the pre-miRNA. Our final dataset consists of 719 hairpin miRNA sequences: 218 human, 184 mouse, 97 rat, 32 D. melanogaster, 39 C. elegans and 149 D. rerio.
PPRint
PPRint (Prediction of Protein-RNA Interaction) is a web server for predicting RNA-binding residues of a protein. The prediction is made by an SVM model trained on PSSM profiles generated by a PSI-BLAST search of the 'nr' protein database. The SVM model is trained and tested on a set of 86 non-homologous protein chains with 5-fold cross-validation.
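As an illustration of 5-fold cross-validation over 86 items, the sketch below generates train/test index sets. Splitting at the level of whole chains and assigning chains to folds round-robin are assumptions made for the example. The text does not say how the folds were actually built.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; items are assigned to folds round-robin."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 86 chains, as stated for the PPRint dataset.
splits = list(k_fold_indices(86, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every chain is used for testing exactly once and for training in the other four rounds.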
Dataset
Click here to download the dataset: http://www.imtech.res.in/raghava/pprint/download.html/
ECGPred
Prediction of Gene Expression from its Nucleotide Composition
This server allows users to analyse expression data (microarray data) by calculating the correlation coefficient between the level of gene expression and the nucleotide composition of genes. This helps users understand which nucleotides are preferred, and which are not, in an organism under a given condition. The server can also learn from known microarray gene expression data and predict the expression level of other genes of the same organism under that condition from their DNA sequence. The method uses SVM for learning and prediction.
The dataset used in this server is available at http://www.imtech.res.in/raghava/ecgpred/.
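The correlation analysis described for ECGPred can be sketched as follows. The Pearson coefficient is assumed here because the text only says "correlation coefficient". The gene sequences and expression values are invented for the example.

```python
import math
from typing import List


def composition(seq: str, nucleotide: str) -> float:
    """Fraction of the sequence made up of the given nucleotide."""
    return seq.upper().count(nucleotide) / len(seq)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


# Hypothetical (sequence, expression level) pairs.
genes = [("GGGCGCAT", 9.0), ("ATGCATGC", 5.0), ("ATATATGA", 2.0)]

g_fraction = [composition(seq, "G") for seq, _ in genes]
expression = [level for _, level in genes]
r_g = pearson(g_fraction, expression)
print(round(r_g, 3))
```

Repeating the calculation for A, C and T gives one coefficient per nucleotide. A positive value suggests the nucleotide is over-represented in highly expressed genes under that condition.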
NRpred: SVM based method for prediction of Nuclear Receptors
Nuclear receptors are key transcription factors that regulate crucial gene networks for cell growth, differentiation and homeostasis. They form a superfamily of phylogenetically related proteins. Despite their diverse functions, nuclear receptors share a common structural organization. They consist of six distinct regions: the N- and C-terminal regions (A/B and F domains), a central, well-conserved DNA-binding domain (C), a non-conserved hinge region (D) and a moderately conserved ligand-binding domain (E). The C region of nuclear receptors consists of two zinc fingers, which are the signature of this superfamily.
Dataset for development of Method
The data for four subfamilies of nuclear receptors were obtained from the NucleaRDB database available at http://www.receptors.org/NR/. All entries not marked as fragments were extracted from the database by text parsing. The initial dataset had 577 sequences belonging to four subfamilies of nuclear receptors. Redundancy was reduced using the PROSET software so that no sequence had >=90% sequence identity with any other sequence in the dataset.
GPCRpred: Prediction of families and superfamilies of G-protein coupled receptor.
G-protein-coupled receptors (GPCRs) are a pharmacologically important protein family with approximately 450 genes identified to date. Pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants and antihypertensives. GPCRs consist of seven transmembrane domains connected through loops. The N-terminus of these proteins is located extracellularly and the C-terminus extends into the cytoplasmic space. Owing to this topology they are able to transduce an external signal into the cell. This signal transduction takes place with the help of the G-protein.
Datasets used in this server are available at http://www.imtech.res.in/raghava/gpcrpred/datasets/.
PSLpred: predicts subcellular localization of prokaryotic proteins
PSLpred is an SVM (support vector machine) based method that predicts five major subcellular localizations (cytoplasm, inner membrane, outer membrane, extracellular and periplasm) of proteins from Gram-negative bacteria.
The datasets used in this server are available at http://www.imtech.res.in/raghava/pslpred/data/.
BTXpred Server: Prediction of Bacterial Toxins
The aim of the BTXpred server is to predict bacterial toxins and their function from the primary amino acid sequence using SVM, HMM and PSI-BLAST. Bacterial toxins play a vital role in causing disease and are responsible for the majority of symptoms and lesions during infection.
Datasets: the datasets used in the server are available at http://www.imtech.res.in/raghava/btxpred/supplementary.html/.