Biological Sequence Databases

From DrugPedia: A Wikipedia for Drug discovery

Biological databases are computer sites that organize, store and disseminate files containing literature references, nucleic acid sequences, protein sequences and protein structures. Databases are effectively electronic filing cabinets: a convenient and efficient way of storing vast amounts of information. The different types of databases are:

a) Primary databases

b) Composite databases

c) Secondary databases

a) Primary databases:

A primary database stores biomolecular sequences (protein or nucleic acid) and associated annotation information (organism, species, function, mutations linked to particular diseases, functional/structural patterns, bibliographic references, etc.).

b) Composite databases:

A composite database amalgamates a number of primary sources, using a set of defined criteria that determine the priority of inclusion of the different sources and the level of redundancy retained.
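
The idea of merging sources by priority can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `build_composite`, the toy entries and the rule "drop exact duplicate sequences" are hypothetical, and real composite databases apply their own, more elaborate redundancy criteria.

```python
from typing import Dict, List, Tuple


def build_composite(
    sources: List[Tuple[str, Dict[str, str]]]
) -> Dict[str, Tuple[str, str]]:
    """Merge primary sources listed in priority order (highest first).

    Each source is (source_name, {entry_id: sequence}).
    Returns {sequence: (source_name, entry_id)}. When the same sequence
    occurs in several sources, only the copy from the highest-priority
    source is kept (exact-match redundancy removal, an assumption here).
    """
    composite: Dict[str, Tuple[str, str]] = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key not in composite:  # first (highest-priority) copy wins
                composite[key] = (source_name, entry_id)
    return composite


# Hypothetical toy data: two sources sharing one identical sequence.
sources = [
    ("SWISS-PROT", {"P1": "MKTAYIAK", "P2": "MSLLTEVE"}),
    ("PIR", {"A1": "MKTAYIAK", "A2": "MGDVEKGK"}),
]
composite = build_composite(sources)
for sequence, (source_name, entry_id) in composite.items():
    print(sequence, source_name, entry_id)
```

Changing the order of `sources` changes which copy of a duplicated sequence survives, which is the role the priority criteria play in the description above.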

c) Secondary databases:

Secondary databases contain information derived from primary sequence data, typically in the form of regular expressions (patterns), fingerprints, blocks, profiles or hidden Markov models (HMMs).
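
To make the "regular expression" form concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The helper names `prosite_to_regex` and `find_motifs` and the test sequence are hypothetical, and only the basic pattern syntax is handled (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors outside brackets); it is not a full parser.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    pattern = pattern.rstrip(".")              # patterns end with a period
    parts = []
    for element in pattern.split("-"):         # elements are '-' separated
        element = element.replace("{", "[^").replace("}", "]")  # {P} = not P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = repeat
        element = element.replace("x", ".")    # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motifs(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (start, matched_text) for non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern; the sequence is made up for illustration.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_motifs("N-{P}-[ST]-{P}.", "MNGSAANPSQ"))
```

A single pattern like this is compact but rigid, which is one reason the other secondary forms (fingerprints, blocks, profiles, HMMs) exist.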

Classification of different types of databases

Primary sequence databases

Nucleic acid sequence databases

GenBank (USA)

EMBL (Europe)

DDBJ (Japan)

Protein sequence databases

PIR (Protein Information Resource)

MIPS

SWISS-PROT

TrEMBL (translated EMBL)

NRL-3D

Composite sequence databases (of proteins)

Non-redundant database (NRDB)

PDB

PIR

SWISS-PROT update

GenPept update

SWISS-PROT

GenPept

Non-redundant protein sequence database (OWL)

SWISS-PROT

GenBank

PIR

NRL-3D

MIPSX (Max Planck Institute in Martinsried)

PIR1-4

MIPS Own

MIPSTrn

PIRMOD

NRL-3D

SWISS-PROT

EM Trans

GB Trans

Kabat

Pseq IP

Secondary or pattern databases

PROSITE (primary source: SWISS-PROT; stored information: regular expressions (patterns))

PRINTS (primary source: OWL; stored information: fingerprints)

Pfam (primary source: SWISS-PROT; stored information: HMMs)

BLOCKS (primary source: PROSITE/PRINTS; stored information: aligned motifs (blocks))

Identify (primary source: BLOCKS/PRINTS; stored information: fuzzy regular expressions (patterns))

Profiles (primary source: SWISS-PROT; stored information: weighted matrices (profiles))
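
As a rough illustration of how aligned motifs (blocks) relate to weighted matrices (profiles), the sketch below builds a position-specific log-odds matrix from a small aligned block and uses it to score every window of a sequence. Everything here is an assumption made for illustration: the block, the query sequence, the uniform background of 1/20 and the pseudocount of 1 are hypothetical, and the secondary databases listed above use their own weighting schemes.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build one log-odds score column per position of an ungapped block."""
    background = 1 / len(AMINO_ACIDS)           # uniform background (assumption)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(block[0])):
        counts = Counter(seq[col] for seq in block)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Score every window of the sequence; return (best_score, start)."""
    width = len(profile)
    scored = [
        (sum(profile[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical four-residue block and query sequence.
profile = build_profile(["GKST", "GKSS", "GKTT", "GRST"])
score, start = best_window(profile, "MAAGKSTLLA")
print(f"best window starts at {start} with score {score:.2f}")
```

Unlike a regular expression, the matrix gives every residue at every position a score, so near-matches are ranked rather than rejected outright.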

Primary sequence databases for nucleic acids

GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of February 2008, there were approximately 85,759,586,764 bases in 82,853,685 sequence records in the traditional GenBank divisions and 108,635,736,141 bases in 27,439,206 sequence records in the WGS (whole genome shotgun) division.
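
A quick calculation puts these figures in perspective. The snippet below only reuses the February 2008 numbers quoted above; the variable names are arbitrary. It gives an average of roughly 1,035 bases per record in the traditional divisions and roughly 3,959 bases per record in the WGS division.

```python
# February 2008 figures quoted in the text above.
traditional_bases = 85_759_586_764
traditional_records = 82_853_685
wgs_bases = 108_635_736_141
wgs_records = 27_439_206

# Average record length in each part of GenBank.
avg_traditional = traditional_bases / traditional_records
avg_wgs = wgs_bases / wgs_records

# Combined size of both parts.
total_bases = traditional_bases + wgs_bases
total_records = traditional_records + wgs_records

print(f"traditional: {avg_traditional:.1f} bases per record")
print(f"WGS:         {avg_wgs:.1f} bases per record")
print(f"total:       {total_bases:,} bases in {total_records:,} records")
```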

EMBL

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 20 European countries, with Australia as an associate member state. EMBL was created in 1974 and is a non-profit organisation funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. The cornerstones of EMBL's mission are: to perform basic research in molecular biology and molecular medicine, to train scientists, students and visitors at all levels, to offer vital services to scientists in the member states, to develop new instruments and methods in the life sciences, and to actively engage in technology transfer.

DDBJ

The DNA Data Bank of Japan (DDBJ) is a DNA data bank located at the National Institute of Genetics of Japan. It shares its data with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information. DDBJ has made an effort to collect as much data as possible, mainly from Japanese researchers. Over the past year, the data collected, annotated and released to the public grew by 43% in the number of entries and by 52% in the number of bases. Growth has accelerated even after the human genome was sequenced, because sequencing technology has advanced remarkably and become simpler, and life science research has shifted from the gene scale to the genome scale.

Other nucleotide databases

UniGene

UniGene is an NCBI database of the transcriptome and thus, despite the name, not primarily a database of genes. Each entry is a set of transcripts that appear to stem from the same transcription locus (i.e. a gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clones and genomic location.

SGD

The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, commonly known as baker's or budding yeast.

EBI Genomes

EBI Genomes provides access to information on completed genomes. The first completed genomes, from viruses, phages and organelles, were deposited in the EMBL Database in the early 1980s. Since then, molecular biology's shift towards obtaining the complete sequences of as many genomes as possible, combined with major developments in sequencing technology, has resulted in hundreds of complete genome sequences being added to the database, including Archaea, Bacteria and Eukaryota.

Genome Biology

NCBI provides several genomic biology tools and resources, including organism-specific pages that include links to many web sites and databases relevant to that species.

Ensembl

Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute (WTSI) to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes. The project consists of a database schema and associated API to store genomic information for about 40 genomes, and many extension databases to represent functional, comparative and variational genomics.

Primary sequence databases for proteins

PIR

The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies. PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.

MIPS

The MIPS group (Munich Information Center for Protein Sequences of the German National Center for Environment and Health, GSF) at the Max Planck Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities. These include a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes, and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database.

SWISS-PROT

Swiss-Prot is a manually curated biological database of protein sequences. It was created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Swiss-Prot strives to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. In 1996, a computer-annotated supplement to Swiss-Prot, termed TrEMBL, was created.

TrEMBL

TrEMBL was created in 1996 as a computer-annotated supplement to SWISS-PROT. The database uses the SWISS-PROT format and contains translations of all coding sequences (CDS) in EMBL.

It has two main sections:

SP-TrEMBL (SWISS-PROT TrEMBL): contains the entries that will eventually be incorporated into SWISS-PROT but have not yet been manually annotated.

REM-TrEMBL: contains sequences that are not destined to be included in SWISS-PROT. These include immunoglobulins, T-cell receptors, fragments of fewer than eight amino acids, synthetic sequences, patented sequences and codon translations.
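
To illustrate what "translation of a coding sequence" means in practice, the sketch below translates a CDS with the standard genetic code and applies the eight-residue length check mentioned above. It is a minimal sketch, not TrEMBL's actual pipeline: the function names, the example sequences and the single length rule are hypothetical, and the other REM-TrEMBL categories are not modelled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), RESIDUES)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")          # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def is_short_fragment(protein: str, minimum: int = 8) -> bool:
    """True for fragments of fewer than `minimum` amino acids."""
    return len(protein) < minimum


# Hypothetical coding sequences for illustration.
for cds in ("ATGGCCAAGTAA", "ATGAAAACCGCGTATATTGCGAAACAGTAA"):
    protein = translate_cds(cds)
    print(protein, "short fragment" if is_short_fragment(protein) else "long enough")
```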

Other databases

Ensembl

Ensembl is a collaborative project of EMBL, EBI and the Sanger Centre to automatically track sequence fragments of the human genome and assemble them into longer structures. Automated analysis methods, such as gene-finding tools, feature-finding tools and sequence comparison tools, are then applied to the assembled sequences. The results are made available to users through a web interface.

References

http://en.wikipedia.org/wiki/GenBank

http://en.wikipedia.org/wiki/European_Molecular_Biology_Laboratory

http://en.wikipedia.org/wiki/UniGene

http://en.wikipedia.org/wiki/Protein_Information_Resource