Datasets in Bioinformatics
From DrugPedia: A Wikipedia for Drug discovery
Revision as of 04:40, 4 September 2008
A number of datasets are created and used in the field of bioinformatics. A dataset contains the vital information on which a prediction server depends for its function. Some of the datasets created or used by the Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh are as follows:
Datasets for evaluation of beta turn prediction method
The dataset has 426 non-homologous protein chains. In this dataset, no two protein chains have more than 25% sequence identity. The structures of these proteins were determined by X-ray crystallography at 2.0 Å resolution or better. Each chain contains at least one beta turn.
Complete Dataset
- Amino acid sequences of the 426 protein chains in FASTA format
ProPred-I
The Promiscuous MHC Class-I Binding Peptide Prediction Server
ProPred-I is an online service for identifying MHC Class-I binding regions in antigens. It implements matrices for 47 MHC Class-I alleles, along with proteasomal and immunoproteasomal models. The main aim of this server is to help users identify promiscuous binding regions.
Dataset
Two datasets were used in developing this server:
Matrix Optimization Technique for Predicting MHC binding Core
The X-ray crystal structure of the MHC class II molecule has revealed an open peptide-binding groove. A peptide bound in this groove may extend beyond it on one side or the other. Knowing which residues are actually involved in binding is very useful for understanding MHC-peptide interactions. Here the Matrix Optimization Technique (MOT) is used to predict the MHC binding core. Using binders from MHCPEP and non-binder data, MOT achieved a classification accuracy of 97 to 99% for the HLA-DR1, HLA-DR2 and HLA-DR5 alleles. This is the highest accuracy reported by any method. The prediction method used in this server is based on MOT and relies on the idea that binders have unique patterns that can be easily distinguished from those of non-binders.
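The idea of locating a binding core with a matrix can be sketched as follows. This is only an illustration, not the MOT procedure itself: it assumes a 9-residue core and an additive position-specific matrix, and the matrix values and the names `toy_matrix`, `score_core` and `predict_core` are hypothetical. In MOT the matrix values would be optimised on real binders and non-binders.

```python
# Hypothetical sketch: slide a 9-residue window along a peptide and pick the
# window that scores highest under an additive position-specific matrix.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CORE_LEN = 9  # assumed core length


def toy_matrix():
    """Return a made-up 9 x 20 matrix (one score per core position and residue)."""
    matrix = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(CORE_LEN)]
    matrix[0].update({"F": 2.0, "Y": 1.5, "W": 1.5})  # invented preferences
    matrix[3].update({"L": 1.0, "M": 1.0})
    matrix[5].update({"A": 1.0, "S": 0.5})
    matrix[8].update({"L": 1.0, "A": 0.5})
    return matrix


def score_core(core, matrix):
    """Sum the matrix entries for each residue of a candidate core."""
    return sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(core))


def predict_core(peptide, matrix):
    """Return (score, 0-based start, core) of the best-scoring window."""
    if len(peptide) < CORE_LEN:
        raise ValueError("peptide is shorter than the core length")
    best = None
    for start in range(len(peptide) - CORE_LEN + 1):
        core = peptide[start:start + CORE_LEN]
        score = score_core(core, matrix)
        if best is None or score > best[0]:
            best = (score, start, core)
    return best
```

A peptide would then be called a binder when its best core score exceeds a threshold chosen on training data; that threshold is not given in the text above.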
Dataset
The binders used in this study:
The non-binders used in this study:
Bcepred: Prediction of linear B-cell epitopes, using physico-chemical properties
We evaluated the performance of existing linear B-cell epitope prediction methods based on physico-chemical properties on a non-redundant dataset. The dataset consists of 1029 B-cell epitopes obtained from the Bcipep database and an equal number of non-epitopes obtained randomly from the Swiss-Prot database.
Data set
B-cell epitopes were obtained from the B-cell epitope database BCIPEP, which contains 2479 continuous epitopes, including 654 immunodominant and 1617 immunogenic epitopes. All identical epitopes and non-immunogenic peptides were removed, leaving 1029 unique, experimentally proven continuous B-cell epitopes. The dataset covers a wide range of pathogen groups such as viruses, bacteria, protozoa and fungi. The final dataset consists of 1029 B-cell epitopes and 1029 non-epitopes or random peptides (of equal length and the same frequency, generated from Swiss-Prot).
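The construction described above (removing identical epitopes, then drawing one random peptide of matching length per epitope) could look roughly like this. It is a minimal sketch under stated assumptions: the function names are hypothetical, `proteins` stands in for sequences taken from Swiss-Prot, and the exact sampling scheme used for Bcepred is not described in the text.

```python
import random


def unique_epitopes(epitopes):
    """Drop exact duplicates (case-insensitive), keeping first-seen order."""
    cleaned = (e.strip().upper() for e in epitopes if e.strip())
    return list(dict.fromkeys(cleaned))


def random_nonepitopes(epitopes, proteins, seed=0, max_tries=1000):
    """Draw one random protein fragment per epitope, matching its length.

    Assumption: a fragment is rejected only if it is identical to a known
    epitope; the original study may have applied other filters.
    """
    rng = random.Random(seed)
    known = set(epitopes)
    negatives = []
    for epitope in epitopes:
        candidates = [p for p in proteins if len(p) >= len(epitope)]
        if not candidates:
            raise ValueError("no protein is long enough for this epitope")
        for _ in range(max_tries):
            protein = rng.choice(candidates)
            start = rng.randrange(len(protein) - len(epitope) + 1)
            fragment = protein[start:start + len(epitope)]
            if fragment not in known:
                negatives.append(fragment)
                break
        else:
            raise RuntimeError("could not sample a non-epitope fragment")
    return negatives
```

Matching the length distribution of the negatives to that of the epitopes keeps peptide length from becoming a trivial way to tell the two classes apart.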
HLApred : Identification and prediction of HLA class I & class II binders
The method can identify and predict HLA binding regions in an antigen sequence. It allows identification and prediction for 87 alleles, of which 51 belong to Class I and 36 belong to Class II. The output format (HTML MAPPING) assists users in locating promiscuous HLA binders, which can be putative vaccine candidates.
Data for Identification of Experimentally proven Binders
The server allows an antigen sequence to be searched against MHCBN Database version 3.1 (13). MHCBN is a comprehensive database of Major Histocompatibility Complex (MHC) binding and non-binding peptides compiled from published literature and existing databases. The database contains more than 23,000 entries. The HLApred server searches the query antigen sequence for all the peptides obtained from MHCBN for the selected HLA alleles.
Here is the link for MHCBN version 4.0.
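The identification step described above amounts to an exact substring search of known binders in the query antigen. The sketch below assumes the MHCBN peptides have already been loaded into a dictionary keyed by allele name; the function name `find_known_binders` and the data layout are hypothetical and do not reflect the server's actual implementation.

```python
def find_known_binders(antigen, binders_by_allele, alleles):
    """Return sorted (1-based start, peptide, allele) hits in the antigen.

    binders_by_allele: dict mapping an allele name to a list of known binder
    peptides (assumed to have been loaded from the database beforehand).
    """
    antigen = antigen.upper()
    hits = []
    for allele in alleles:
        for peptide in binders_by_allele.get(allele, ()):
            if not peptide:
                continue
            pos = antigen.find(peptide)
            while pos != -1:
                hits.append((pos + 1, peptide, allele))
                pos = antigen.find(peptide, pos + 1)  # allow overlapping hits
    return sorted(hits)
```

Regions where hits for several alleles coincide are the candidates for promiscuous binders.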
HLA_Affi
The preliminary requirement for the stimulation of a cytotoxic T cell response, a mechanism against viruses and certain tumors, is the processing and presentation of endogenous antigenic peptides by MHC-I molecules on the surface of the cell. Methods have been developed to classify and predict the binders and non-binders of MHC. Here we develop an SVM-based method to predict the binding affinity of peptides to MHC-I. The method takes into consideration the amino acid sequence and the physico-chemical properties of proteins.
Dataset
The dataset used in this study to train the SVM was collected from MHCBN (http://www.imtech.res.in/raghava/mhcbn/) and AntiJen. The dataset contained peptides whose affinity values (IC50) had already been determined experimentally. The size of the dataset was reduced by keeping only peptides that are 9 amino acids long (as nonamers are ideal binders of MHC-I molecules). The redundancy of the dataset was further reduced so that no two peptides have >90% sequence identity. This dataset consists of 402 binders and 222 non-binders.
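The two filtering steps above (keeping nonamers, then removing redundancy at 90% identity) can be sketched as follows. This is an illustration only: the text does not say how identity was computed, so the sketch assumes ungapped, position-by-position identity and a greedy filter, and all function names are hypothetical.

```python
def percent_identity(a, b):
    """Ungapped, position-by-position identity of two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def select_nonamers(records, length=9):
    """Keep only (peptide, ic50) records whose peptide has the given length."""
    return [(pep, ic50) for pep, ic50 in records if len(pep) == length]


def reduce_redundancy(records, max_identity=90.0):
    """Greedy filter: keep a peptide only if its identity to every peptide
    kept so far does not exceed max_identity (an assumed procedure)."""
    kept = []
    for pep, ic50 in records:
        if all(percent_identity(pep, k) <= max_identity for k, _ in kept):
            kept.append((pep, ic50))
    return kept
```

Under this ungapped measure, two nonamers that differ at a single position share 8/9 (about 88.9%) identity, so a >90% cut-off removes only exact duplicates. A different identity measure would behave differently.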
GeneBench : Evaluation of Gene finder and Dataset creation server
GeneBench is an interface developed for evaluating gene-finding algorithms. Users can compare the performance of old and new prediction methods on a set of defined accuracy parameters or measures. The server also offers a collection of established datasets that are used for training and testing gene-finding algorithms. Users can download these sets and evaluate their own algorithms.
Brief description of datasets available on the GeneBench server (http://www.imtech.res.in/raghava/genebench/datasets.html/)
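The text does not list the accuracy measures GeneBench reports. As an illustration, the sketch below assumes nucleotide-level sensitivity and specificity as they are commonly defined in gene-finding evaluations, where specificity is TP / (TP + FP). The function names and the coordinate convention (1-based, inclusive exon ends) are assumptions.

```python
def coding_positions(exons):
    """Expand 1-based inclusive (start, end) exon coordinates into a set."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / (TP + FN): fraction of true coding bases predicted.
    specificity = TP / (TP + FP): fraction of predicted coding bases that
    are truly coding (the convention assumed here).
    """
    actual = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(actual & predicted)
    fn = len(actual - predicted)
    fp = len(predicted - actual)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

For example, an annotation of exons (1, 100) and (201, 300) compared with a prediction of (51, 100) and (201, 350) shares 150 coding bases, misses 50 and over-predicts 50, giving a sensitivity and specificity of 0.75 each.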
- HMR195 Dataset
DNA sequences were extracted from GenBank release 111.0 (April 1999) to the date of the study (Rogic et al., 2001). The source organisms were H. sapiens, M. musculus and R. norvegicus. The ratio of human:mouse:rat sequences is 103:82:10. The mean sequence length is 7096 bp. There are 43 single-exon genes and 152 multi-exon genes, with an average of 4.86 exons per gene, a mean exon length of 208 bp and a mean intron length of 1015 bp. Coding sequence makes up 14% of the dataset, against 46% non-coding intron sequence and 40% intergenic region.
- Dataset
The DNA sequences were extracted from the vertebrate divisions of GenBank release 85.0 (October 15, 1994). The source organisms were all vertebrates. A total of 570 sequences, totalling 2,892,149 bp, were obtained after the clean-up procedure (Burset and Guigo, 1996). There were 2649 coding exons, corresponding to 444,498 coding bp (~15%). All the sequences contain multi-exon genes.
- Reese/Kulp Human Dataset
The GENIE gene-finding dataset contains a total of 793 unrelated human genes. This dataset was used to train the GENIE gene-finding system developed at LBNL and UC Santa Cruz. The last update was done in March 1998 using GenBank v.105. Please see the documentation file for further information, including links to previous versions of this data collection.
- Reese/Kulp Dataset of Drosophila melanogaster
The GENIE set of unrelated D. melanogaster genes. This dataset was used to train the GENIE gene-finding system developed at LBNL and UCSC. The last update was done in October 1998 using GenBank v.109. Please see the documentation file for further information, including links to previous versions of this data collection. The dataset was developed by Martin Reese (LBNL) with help from Uwe Ohler (University of Erlangen), David Kulp (UCSC) and Andrew Gentles (Stanford). It has 416 gene sequences, including 275 multi-exon and 141 single-exon genes.
- Guigo2000 Dataset
Two sets of sequences were developed. The first is a typical benchmark set made of sequences from EMBL database release 50 (1997), comprising 178 human genomic sequences (h178) that code for single complete genes for which both the mRNA and the coding exons are known. The second is a semi-artificial set of 42 genomic sequences in which accurate gene annotation is guaranteed. The h178 set has a G+C content of 50% and an average length of 7169 bp, with 1 gene and 5.1 exons per sequence. The semi-artificial sequences have an average length of 177160 bp, with 4.1 genes and 21 exons per sequence on average, and a G+C content of 40%.
- Fickett-Tung92 Dataset
All data were taken from the GenBank collection of human nucleotide sequence data on May 30, 1992. All E. coli sequences were extracted on June 28, 1992. For the primary benchmark, successive non-overlapping windows of length 54 bases were taken from all human genomic sequences. Windows of length 108 and 162 were also obtained. Each set of windows is split in two: one part is used for training and the other for testing accuracy.
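The windowing described above can be sketched as follows. The handling of a trailing partial window and the way each window set is split are not stated in the text, so both are assumptions here: partial windows are dropped and windows are assigned alternately to training and testing. The function names are hypothetical.

```python
def non_overlapping_windows(sequence, size=54):
    """Cut a sequence into successive non-overlapping windows of `size` bases.

    Assumption: a trailing fragment shorter than `size` is discarded.
    """
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, size)]


def split_alternating(windows):
    """Assumed split: even-indexed windows for training, odd for testing."""
    return windows[0::2], windows[1::2]
```

The same function covers the longer windows by passing `size=108` or `size=162`.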
- Drosophila AdH Region Dataset
This dataset was used for the Genome Annotation Assessment Project (GASP) 2000 in Drosophila (Reese et al., 2000). The total size of the Adh region is 2.9 Mb, and it is presently estimated to contain over 200 genes.