Datasets in Bioinformatics

From DrugPedia: A Wikipedia for Drug discovery

Revision as of 11:55, 3 September 2008 by Ravi (Talk | contribs)
A number of datasets are created and used in the field of bioinformatics. A dataset contains the vital information on which a prediction server depends for its function. Some of the datasets created or used by the Bioinformatics Centre, Institute of Microbiology, Chandigarh, are as follows:

Datasets for evaluation of beta turn prediction method

The dataset has 426 non-homologous protein chains. In this dataset, no two protein chains have more than 25% sequence identity. The structures of these proteins were determined by X-ray crystallography at 2.0 Å resolution or better. Each chain contains at least one beta turn.

Complete Dataset

  • Amino acid sequence of 426 protein chains in fasta format
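
Since the chains are distributed in FASTA format, a minimal sketch of reading such a file may be helpful. The function name `read_fasta` and the two example records are hypothetical and are not taken from the 426-chain dataset; the sketch only assumes the usual FASTA convention of a `>` header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; keep the header text without the ">".
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Hypothetical two-chain example; identifiers and sequences are made up.
example = """>chain_A
MKTAYIAKQR
QISFVKSHFS
>chain_B
GSHMLEDPVD
""".splitlines()

chains = read_fasta(example)
for name, sequence in chains.items():
    print(name, len(sequence))
```

For a real file, the same function can be called on an open file handle, for example `read_fasta(open(path))`, where `path` points to the downloaded FASTA file.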


ProPred-I

The Promiscuous MHC Class-I Binding Peptide Prediction Server

The ProPred-I is an on-line service for identifying the MHC Class-I binding regions in antigens. It implements matrices for 47 MHC Class-I alleles, proteasomal and immunoproteasomal models. The main aim of this server is to help users in identifying the promiscuous regions.

Dataset

The two datasets used in developing this server are:

HLA-A*0201

H2-kb


Matrix Optimization Technique for Predicting MHC binding Core

The X-ray crystal structure of the MHC class II molecule has revealed an open peptide-binding groove. A peptide bound in this groove may extend beyond it on either side. Knowing which residues are actually involved in binding is very useful for understanding MHC-peptide interactions. Here the Matrix Optimization Technique (MOT) is used to predict the MHC binding core. Using binders from MHCPEP and non-binder data with MOT, a classification accuracy of 97 to 99% was obtained for the HLA-DR1, HLA-DR2 and HLA-DR5 alleles. This is the highest accuracy reported by any method. The prediction method used in this server is based on MOT and relies on the idea that binders have unique patterns which can be easily distinguished from non-binders.
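
To illustrate the general idea of locating a binding core with a position-specific matrix, here is a minimal sketch. It is not the MOT procedure itself: the matrix values are a toy example invented for illustration, the core length of nine residues is an assumption, and the names `score_window` and `best_core` are hypothetical. The sketch simply slides a window along the peptide and reports the highest-scoring window.

```python
from typing import Dict, List, Tuple

CORE_LEN = 9  # assumed core length; not stated in the text above

# Toy matrix (one dict per core position), invented for illustration only.
# Residues that are not listed contribute a score of 0.
toy_matrix: List[Dict[str, float]] = [{} for _ in range(CORE_LEN)]
toy_matrix[0] = {"F": 2.0, "Y": 2.0}  # hypothetical anchor preference
toy_matrix[8] = {"L": 1.0}            # hypothetical anchor preference


def score_window(window: str, matrix: List[Dict[str, float]]) -> float:
    """Sum the per-position scores of one candidate core."""
    return sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(window))


def best_core(peptide: str, matrix: List[Dict[str, float]]) -> Tuple[int, str, float]:
    """Return (start index, core, score) of the highest-scoring window."""
    width = len(matrix)
    if len(peptide) < width:
        raise ValueError("peptide is shorter than the core length")
    best = None
    for start in range(len(peptide) - width + 1):
        window = peptide[start:start + width]
        score = score_window(window, matrix)
        # Keep the first window on ties.
        if best is None or score > best[2]:
            best = (start, window, score)
    return best


result = best_core("AAYKLMNPQRLGG", toy_matrix)
print(result)
```

In a matrix-based classifier of this kind, a peptide would be called a binder when its best window score exceeds a chosen threshold; the matrices and thresholds used by the server are not reproduced here.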

Dataset

The "Binder" datasets used in this study are:

HLA-DR1

HLA-DR2

HLA-DR5

The "Non-binder" datasets used in this study are:

HLA-DR1

HLA-DR2

HLA-DR


Bcepred: Prediction of linear B-cell epitopes, using physico-chemical properties

We evaluated the performance of existing linear B-cell epitope prediction methods based on physico-chemical properties on a non-redundant dataset. The dataset consists of 1029 B-cell epitopes obtained from the Bcipep database and an equal number of non-epitopes obtained randomly from the Swiss-Prot database.

Dataset

B-cell epitopes were obtained from the B-cell epitope database BCIPEP, which contains 2479 continuous epitopes, including 654 immunodominant and 1617 immunogenic epitopes. All identical epitopes and non-immunogenic peptides were removed, leaving 1029 unique, experimentally proven continuous B-cell epitopes. The dataset covers a wide range of pathogenic groups such as viruses, bacteria, protozoa and fungi. The final dataset consists of 1029 B-cell epitopes and 1029 non-epitopes or random peptides (of equal length and the same frequency, generated from SWISS-PROT).
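
One way to build a length-matched set of random peptides is sketched below. This is only an assumed reading of "equal length": for every epitope length, one random fragment of that length is cut from a randomly chosen protein. The function name `sample_random_peptides`, the example sequences and the seed are hypothetical, and the actual procedure used for the Bcepred dataset may differ.

```python
import random
from typing import List, Sequence


def sample_random_peptides(
    epitope_lengths: Sequence[int],
    proteins: Sequence[str],
    rng: random.Random,
) -> List[str]:
    """Draw one random protein fragment for each requested length."""
    peptides = []
    for length in epitope_lengths:
        # Only proteins that are long enough can supply a fragment.
        candidates = [p for p in proteins if len(p) >= length]
        if not candidates:
            raise ValueError(f"no protein of length >= {length}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - length + 1)
        peptides.append(protein[start:start + length])
    return peptides


# Made-up sequences standing in for Swiss-Prot entries.
proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "GSHMLEDPVDAFKLNNWTRE",
]
epitope_lengths = [8, 12, 20]  # lengths of hypothetical epitopes

rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_peptides = sample_random_peptides(epitope_lengths, proteins, rng)
print([len(p) for p in random_peptides])
```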


HLApred: Identification and prediction of HLA class I & class II binders

The method can identify and predict HLA binding regions from an antigen sequence. It allows identification and prediction for 87 alleles, of which 51 belong to Class I and 36 belong to Class II. The output format (HTML MAPPING) assists users in locating promiscuous HLA binders, which may be the most promising putative vaccine candidates.

Data for Identification of Experimentally proven Binders

The server allows searching of an antigen sequence against MHCBN Database version 3.1 (13). MHCBN is a comprehensive database of Major Histocompatibility Complex (MHC) binding and non-binding peptides compiled from published literature and existing databases. The database consists of more than 23000 entries. The HLApred server searches the query antigen sequence for all the peptides obtained from MHCBN for the selected HLA alleles.

Here is the link for MHCBN Version 4.0.
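
The identification step described above amounts to exact matching of known peptides against the query sequence. A minimal sketch of that idea follows; the function names, the antigen sequence and the peptide list are all hypothetical and are not taken from MHCBN or from the HLApred implementation.

```python
from typing import Dict, Iterable, List


def find_occurrences(antigen: str, peptide: str) -> List[int]:
    """Return all 0-based start positions of peptide, including overlaps."""
    positions = []
    start = antigen.find(peptide)
    while start != -1:
        positions.append(start)
        start = antigen.find(peptide, start + 1)
    return positions


def map_known_binders(antigen: str, peptides: Iterable[str]) -> Dict[str, List[int]]:
    """Map each peptide found in the antigen to its start positions."""
    hits: Dict[str, List[int]] = {}
    for peptide in peptides:
        if not peptide:
            continue  # ignore empty entries
        positions = find_occurrences(antigen, peptide)
        if positions:
            hits[peptide] = positions
    return hits


# Made-up antigen and peptide list for illustration.
antigen = "MKLLVLGLPGAGKGLLVLG"
known_peptides = ["LLVLG", "GAGKG", "WWWWW"]

hits = map_known_binders(antigen, known_peptides)
print(hits)
```

Positions are reported as 0-based offsets; adding 1 gives the residue numbering usually shown to users.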


HLA_Affi

The preliminary requirement for the stimulation of a cytotoxic T cell response, a defence mechanism against viruses and certain tumors, is the processing and presentation of endogenous antigenic peptides by MHC-I molecules on the surface of the cell. Methods have been developed to classify and predict the binders and non-binders of MHC. Here we develop an SVM-based method to predict the binding affinity of peptides to MHC-I. The method takes into consideration the amino acid sequence and the physico-chemical properties of proteins.

Dataset

The dataset used in this study to train the SVM was collected from MHCBN and AntiJen. The dataset contained peptides whose affinity values (IC50) have already been determined experimentally. The size of the dataset was reduced by keeping only peptides that are 9 amino acids long (as nonamers are ideal binders of MHC-I molecules). The redundancy of the dataset was further reduced so that no two peptides have >90% sequence identity. This dataset consists of 402 binders and 222 non-binders.
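
The two filtering steps above (length selection and redundancy reduction) can be sketched as follows. This is a minimal illustration under stated assumptions: identity is computed position by position without gaps, peptides are kept greedily in input order, and the names `identity` and `filter_nonamers` as well as the example peptides are hypothetical. How identity was actually computed for this dataset is not stated in the text.

```python
from typing import Iterable, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_nonamers(peptides: Iterable[str], max_identity: float = 0.9) -> List[str]:
    """Keep 9-residue peptides, dropping any with identity above the cutoff
    to a peptide that was already kept (greedy, in input order)."""
    kept: List[str] = []
    for peptide in peptides:
        if len(peptide) != 9:
            continue  # length filter: nonamers only
        if all(identity(peptide, k) <= max_identity for k in kept):
            kept.append(peptide)
    return kept


# Illustrative peptides chosen only to exercise the filter.
candidates = ["LLFGYPVYV", "LLFGYPVYV", "LLFGYPVYA", "GILGFVFTL", "SIINFEKL"]
non_redundant = filter_nonamers(candidates)
print(non_redundant)
```

Under this ungapped definition, two nonamers that differ at a single position share 8/9 (about 89%) identity, so a 90% cutoff removes only exact duplicates.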