Database
From DrugPedia: A Wikipedia for Drug discovery
(New page: ==Database== A Database is a structured collection of data which is managed to meet the needs of a community of users. The structure is achieved by organizing the data according to a data...) |
|||
Line 1: | Line 1: | ||
- | |||
A Database is a structured collection of data which is managed to meet the needs of a community of users. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model. Other models such as the hierarchical model and the network model use a more explicit representation of relationships (see below for explanation of the various database models). | A Database is a structured collection of data which is managed to meet the needs of a community of users. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model. Other models such as the hierarchical model and the network model use a more explicit representation of relationships (see below for explanation of the various database models). |
Revision as of 05:04, 1 September 2008
A Database is a structured collection of data which is managed to meet the needs of a community of users. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model. Other models such as the hierarchical model and the network model use a more explicit representation of relationships (see below for explanation of the various database models).
A computer database relies upon software to organize the storage of data. This software is known as a database management system (DBMS). Databases management systems are categorized according to the database model that they support. The model tends to determine the query languages that are available to access the database. A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from hardware failures. In these areas there are large differences between products.
History
The earliest known use of the term data base was in November 1963, when the System Development Corporation sponsored a symposium under the title Development and Management of a Computer-centered Data Base[1]. Database as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers. (The abbreviation DB, however, survives.)
The first database management systems were developed in the 1960s. A pioneer in the field was Charles Bachman. Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on punched cards and magnetic tape, so that serial processing was the dominant activity. Two key data models arose at this time: CODASYL developed the network model based on Bachman's ideas, and (apparently independently) the hierarchical model was used in a system developed by North American Rockwell later adopted by IBM as the cornerstone of their IMS product. While IMS along with the CODASYL IDMS were the big, high visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today. Two worthy of mention are the PICK and MUMPS databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems.
The relational model was proposed by E. F. Codd in 1970. He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. For a long while, however, the relational model remained of academic interest only. While CODASYL products (IDMS) and network model products (IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time. Among the first implementations were Michael Stonebraker's Ingres at Berkeley, and the System R project at IBM. Both of these were research prototypes, announced during 1976. The first commercial products, Oracle and DB2, did not appear until around 1980. The first successful database product for microcomputers was dBASE for the CP/M and PC-DOS/MS-DOS operating systems.
During the 1980s, research activity focused on distributed database systems and database machines. Another important theoretical idea was the Functional Data Model, but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice.
In the 1990s, attention shifted to object-oriented databases. These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as spatial databases, engineering data (including software repositories), and multimedia data. Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result. The 1990s also saw the spread of Open Source databases, such as PostgreSQL and MySQL.
In the 2000s, the fashionable area for innovation is the XML database. As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products. XML databases aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.
Database models
Various techniques are used to model data structure. Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model. For any one logical model various physical implementations may be possible, and most products will offer the user some level of control in tuning the physical implementation, since the choices that are made have a significant effect on performance. Here are three examples:
Hierarchical model
In a hierarchical model, data is organized into an inverted tree-like structure, implying a multiple downward link in each node to describe the nesting, and a sort field to keep the records in a particular order in each same-level list. This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files. Each unit in the model is a record which is also known as a node. In such a model, each record on one level can be related to multiple records on the next lower level. A record that has subsidiary records is called a parent and the subsidiary records are called children. Data elements in this model are well suited for one-to-many relationships with other data elements in the database.
This model is advantageous when the data elements are inherently hierarchical. The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated. Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization.
Network model
In the network model, records can participate in any number of named relationships. Each relationship associates a record of one type (called the owner) with multiple records of another type (called the member). These relationships (somewhat confusingly) are called sets. For example a student might be a member of one set whose owner is the course they are studying, and a member of another set whose owner is the college they belong to. At the same time the student might be the owner of a set of email addresses, and owner of another set containing phone numbers. The main difference between the network model and hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent. The hierarchical model is therefore a subset of the network model.
Programmatic access to network databases is traditionally by means of a navigational data manipulation language, in which programmers navigate from a current record to other related records using verbs such as find owner, find next, and find prior. The most common example of such an interface is the COBOL-based Data Manipulation Language defined by CODASYL.
Network databases are traditionally implemented by using chains of pointers between related records. These pointers can be node numbers or disk addresses.
The network model became popular because it provided considerable flexibility in modelling complex data relationships, and also offered high performance by virtue of the fact that the access verbs used by programmers mapped directly to pointer-following in the implementation.
However, the model had several disadvantages. Navigational programming proved error-prone as data models became more complex, and small changes to the data structure could require changes to many programs. Also, because of the use of physical pointers, operations such as database loading and restructuring could be very time-consuming.
Relational model
The basic data structure of the relational model is a table where information about a particular entity (say, an employee) is represented in columns and rows. The columns enumerate the various attributes of an entity (e.g. employee_name, address, phone_number). Rows (also called records) represent instances of an entity (e.g. specific employees).
The "relation" in "relational database" comes from the mathematical notion of relations from the field of set theory. A relation is a set of tuples, so rows are sometimes called tuples. All tables in a relational database adhere to three basic rules.
- The ordering of columns is immaterial
- Identical rows are not allowed in a table
- Each row has a single (separate) value for each of its columns (each tuple has an atomic value).
If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records. Relationships between records are often categorized by their cardinality (1:1, (0), 1:M, M:M).
Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values. A "primary key" is a key that has a unique value for each row in the table. Keys are commonly used to join or combine data from two or more tables. For example, an employee table may contain a column named address which contains a value that matches the key of an address table. Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one.