Indatabase theory,relational algebra is a theory that usesalgebraic structures for modeling data and defining queries on it with well foundedsemantics. The theory was introduced byEdgar F. Codd.[1]
The main application of relational algebra is to provide a theoretical foundation forrelational databases, particularlyquery languages for such databases, chief among which isSQL. Relational databases store tabular data represented asrelations. Queries over relational databases often likewise return tabular data represented as relations.
The main purpose of relational algebra is to defineoperators that transform one or more input relations to an output relation. Given that these operators accept relations as input and produce relations as output, they can be combined and used to express complex queries that transform multiple input relations (whose data are stored in the database) into a single output relation (the query results).
Unary operators accept a single relation as input. Examples include operators to filter certain attributes (columns) ortuples (rows) from an input relation.Binary operators accept two relations as input and combine them into a single output relation. For example, taking all tuples found in either relation (union), removing tuples from the first relation found in the second relation (difference), extending the tuples of the first relation with tuples in the second relation matching certain conditions, and so forth.
Relational algebra received little attention outside of pure mathematics until the publication ofE.F. Codd'srelational model of data in 1970.[2] Codd proposed such an algebra as a basis for database query languages.
Relational algebra operates on homogeneous sets of tuples where we commonly interpretm to be the number of rows of tuples in a table andn to be the number of columns. All entries in each column have the sametype.
A relation also has a unique tuple called theheader which gives each column a unique name orattribute inside the relation. Attributes are used in projections and selections.
The relational algebra usesset union,set difference, andCartesian product from set theory, and adds additional constraints to these operators to create new ones.[3]
For set union and set difference, the tworelations involved must beunion-compatible—that is, the two relations must have the same set of attributes. Becauseset intersection is defined in terms of set union and set difference, the two relations involved in set intersection must also be union-compatible.
For the Cartesian product to be defined, the two relations involved must have disjoint headers—that is, they must not have a common attribute name.
In addition, the Cartesian product is defined differently from the one inset theory in the sense that tuples are considered to be "shallow" for the purposes of the operation. That is, the Cartesian product of a set ofn-tuples with a set ofm-tuples yields a set of "flattened"(n + m)-tuples (whereas basic set theory would have prescribed a set of 2-tuples, each containing ann-tuple and anm-tuple). More formally,R ×S is defined as follows:
The cardinality of the Cartesian product is the product of the cardinalities of its factors, that is, |R ×S| = |R| × |S|.
Aprojection (Π) is aunary operation written as where is a set of attribute names. The result of such projection is defined as theset that is obtained when alltuples inR are restricted to the set.
Note: when implemented inSQL standard the "default projection" returns amultiset instead of a set, and theΠ projection to eliminate duplicate data is obtained by the addition of theDISTINCT keyword.
Ageneralized selection (σ) is aunary operation written as whereφ is apropositional formula that consists ofatoms as allowed in thenormal selection and the logical operators (and), (or) and (negation). This selection selects all thosetuples inR for whichφ holds.
To obtain a listing of all friends or business associates in an address book, the selection might be written as. The result would be a relation containing every attribute of every unique record whereisFriend is true or whereisBusinessContact is true.
Arename (ρ) is aunary operation written as where the result is identical toR except that theb attribute in all tuples is renamed to ana attribute. This is commonly used to rename the attribute of arelation for the purpose of a join.
To rename the "isFriend" attribute to "isBusinessContact" in a relation, might be used.
There is also the notation, whereR is renamed tox and the attributes are renamed to.[4]
In practice the classical relational algebra described above is extended with various operations such as outer joins, aggregate functions and even transitive closure.[5]
Whereas the result of a join (or inner join) consists of tuples formed by combining matching tuples in the two operands, an outer join contains those tuples and additionally some tuples formed by extending an unmatched tuple in one of the operands by "fill" values for each of the attributes of the other operand. Outer joins are not considered part of the classical relational algebra discussed so far.[6]
The operators defined in this section assume the existence of anull value,ω, which we do not define, to be used for the fill values; in practice this corresponds to theNULL in SQL. In order to make subsequent selection operations on the resulting table meaningful, a semantic meaning needs to be assigned to nulls; in Codd's approach the propositional logic used by the selection isextended to a three-valued logic, although we elide those details in this article.
Three outer join operators are defined: left outer join, right outer join, and full outer join. (The word "outer" is sometimes omitted.)
The left outer join (⟕) is written asR ⟕S whereR andS arerelations.[a] The result of the left outer join is the set of all combinations of tuples inR andS that are equal on their common attribute names, in addition (loosely speaking) to tuples inR that have no matching tuples inS.[citation needed]
For an example consider the tablesEmployee andDept and their left outer join:
|
|
|
In the resulting relation, tuples inS which have no common values in common attribute names with tuples inR take anull value,ω.
Since there are no tuples inDept with aDeptName ofFinance orExecutive,ωs occur in the resulting relation where tuples inEmployee have aDeptName ofFinance orExecutive.
Letr1,r2, ...,rn be the attributes of the relationR and let {(ω, ...,ω)} be the singletonrelation on the attributes that areunique to the relationS (those that are not attributes ofR). Then the left outer join can be described in terms of the natural join (and hence using basic operators) as follows:
The right outer join (⟖) behaves almost identically to the left outer join, but the roles of the tables are switched.
The right outer join ofrelationsR andS is written asR ⟖S.[b] The result of the right outer join is the set of all combinations of tuples inR andS that are equal on their common attribute names, in addition to tuples inS that have no matching tuples inR.[citation needed]
For example, consider the tablesEmployee andDept and their right outer join:
|
|
|
In the resulting relation, tuples inR which have no common values in common attribute names with tuples inS take anull value,ω.
Since there are no tuples inEmployee with aDeptName ofProduction,ωs occur in the Name and EmpId attributes of the resulting relation where tuples inDept hadDeptName ofProduction.
Lets1,s2, ...,sn be the attributes of the relationS and let {(ω, ...,ω)} be the singletonrelation on the attributes that areunique to the relationR (those that are not attributes ofS). Then, as with the left outer join, the right outer join can be simulated using the natural join as follows:
Theouter join (⟗) orfull outer join in effect combines the results of the left and right outer joins.
The full outer join is written asR ⟗S whereR andS arerelations.[c] The result of the full outer join is the set of all combinations of tuples inR andS that are equal on their common attribute names, in addition to tuples inS that have no matching tuples inR and tuples inR that have no matching tuples inS in their common attribute names.[citation needed]
For an example consider the tablesEmployee andDept and their full outer join:
|
|
|
In the resulting relation, tuples inR which have no common values in common attribute names with tuples inS take anull value,ω. Tuples inS which have no common values in common attribute names with tuples inR also take anull value,ω.
The full outer join can be simulated using the left and right outer joins (and hence the natural join and set union) as follows:
There is nothing in relational algebra introduced so far that would allow computations on the data domains (other than evaluation of propositional expressions involving equality). For example, it is not possible using only the algebra introduced so far to write an expression that would multiply the numbers from two columns, e.g. a unit price with a quantity to obtain a total price. Practical query languages have such facilities, e.g. the SQL SELECT allows arithmetic operations to define new columns in the resultSELECTunit_price*quantityAStotal_priceFROMt, and a similar facility is provided more explicitly by Tutorial D'sEXTEND keyword.[7] In database theory, this is calledextended projection.[8]: 213
Furthermore, computing various functions on a column, like the summing up of its elements, is also not possible using the relational algebra introduced so far. There are fiveaggregate functions that are included with most relational database systems. These operations are Sum, Count, Average, Maximum and Minimum. In relational algebra the aggregation operation over a schema (A1,A2, ...An) is written as follows:
where eachAj', 1 ≤j ≤k, is one of the original attributesAi, 1 ≤i ≤n.
The attributes preceding theg are grouping attributes, which function like a "group by" clause in SQL. Then there are an arbitrary number of aggregation functions applied to individual attributes. The operation is applied to an arbitrary relationr. The grouping attributes are optional, and if they are not supplied, the aggregation functions are applied across the entire relation to which the operation is applied.
Let's assume that we have a table namedAccount with three columns, namelyAccount_Number, Branch_Name andBalance. We wish to find the maximum balance of each branch. This is accomplished byBranch_NameGMax(Balance)(Account). To find the highest balance of all accounts regardless of branch, we could simply writeGMax(Balance)(Account).
Grouping is often written asBranch_NameɣMax(Balance)(Account) instead.[8]
Although relational algebra seems powerful enough for most practical purposes, there are some simple and natural operators onrelations that cannot be expressed by relational algebra. One of them is thetransitive closure of a binary relation. Given a domainD, let binary relationR be a subset ofD×D. The transitive closureR+ ofR is the smallest subset ofD×D that containsR and satisfies the following condition:
It can be proved using the fact that there is no relational algebra expressionE(R) takingR as a variable argument that producesR+.[9]
SQL however officially supports suchfixpoint queries since 1999, and it had vendor-specific extensions in this direction well before that.
This sectiondoes notcite anysources. Please helpimprove this section byadding citations to reliable sources. Unsourced material may be challenged andremoved.(June 2013) (Learn how and when to remove this message) |
Relational database management systems often include aquery optimizer which attempts to determine the most efficient way to execute a given query. Query optimizers enumerate possiblequery plans, estimate their cost, and pick the plan with the lowest estimated cost. If queries are represented by operators from relational algebra, the query optimizer can enumerate possible query plans by rewriting the initial query using the algebraic properties of these operators.
Queries can be represented as atree, where
The primary goal of the query optimizer is to transformexpression trees into equivalent expression trees, where the average size of the relations yielded by subexpressions in the tree is smaller than it was before theoptimization. The secondary goal is to try to form common subexpressions within a single query, or if there is more than one query being evaluated at the same time, in all of those queries. The rationale behind the second goal is that it is enough to compute common subexpressions once, and the results can be used in all queries that contain that subexpression.
Here are a set of rules that can be used in such transformations.
Rules about selection operators play the most important role in query optimization. Selection is an operator that very effectively decreases the number of rows in its operand, so if the selections in an expression tree are moved towards the leaves, the internalrelations (yielded by subexpressions) will likely shrink.
Selection isidempotent (multiple applications of the same selection have no additional effect beyond the first one), andcommutative (the order selections are applied in has no effect on the eventual result).
A selection whose condition is aconjunction of simpler conditions is equivalent to a sequence of selections with those same individual conditions, and selection whose condition is adisjunction is equivalent to a union of selections. These identities can be used to merge selections so that fewer selections need to be evaluated, or to split them so that the component selections may be moved or optimized separately.
Cross product is the costliest operator to evaluate. If the inputrelations haveN andM rows, the result will contain rows. Therefore, it is important to decrease the size of both operands before applying the cross product operator.
This can be effectively done if the cross product is followed by a selection operator, e.g.. Considering the definition of join, this is the most likely case. If the cross product is not followed by a selection operator, we can try to push down a selection from higher levels of the expression tree using the other selection rules.
In the above case the conditionA is broken up in to conditionsB,C andD using the split rules about complex selection conditions, so that andB contains attributes only fromR,C contains attributes only fromP, andD contains the part ofA that contains attributes from bothR andP. Note, thatB,C orD are possibly empty. Then the following holds:
Selection isdistributive over the set difference, intersection, and union operators. The following three rules are used to push selection below set operations in the expression tree. For the set difference and the intersection operators, it is possible to apply the selection operator to just one of the operands following the transformation. This can be beneficial where one of the operands is small, and the overhead of evaluating the selection operator outweighs the benefits of using a smallerrelation as an operand.
Selection commutes with projection if and only if the fields referenced in the selection condition are a subset of the fields in the projection. Performing selection before projection may be useful if the operand is a cross product or join. In other cases, if the selection condition is relatively expensive to compute, moving selection outside the projection may reduce the number of tuples which must be tested (since projection may produce fewer tuples due to the elimination of duplicates resulting from omitted fields).
Projection is idempotent, so that a series of (valid) projections is equivalent to the outermost projection.
Projection isdistributive over set union.
Projection does not distribute over intersection and set difference. Counterexamples are given by:
and
whereb is assumed to be distinct fromb'.
Successive renames of a variable can be collapsed into a single rename. Rename operations which have no variables in common can be arbitrarily reordered with respect to one another, which can be exploited to make successive renames adjacent so that they can be collapsed.
Rename is distributive over set difference, union, and intersection.
Cartesian product is distributive over union.
The first query language to be based on Codd's algebra was Alpha, developed by Dr. Codd himself. Subsequently,ISBL was created, and this pioneering work has been acclaimed by many authorities[10] as having shown the way to make Codd's idea into a useful language.Business System 12 was a short-lived industry-strength relational DBMS that followed the ISBL example.
In 1998Chris Date andHugh Darwen proposed a language calledTutorial D intended for use in teaching relational database theory, and its query language also draws on ISBL's ideas.[11] Rel is an implementation ofTutorial D. Bmg is an implementation of relational algebra in Ruby which closely follows the principles ofTutorial D andThe Third Manifesto.[12]
Even the query language ofSQL is loosely based on a relational algebra, though the operands in SQL (tables) are not exactlyrelations and several useful theorems about the relational algebra do not hold in the SQL counterpart (arguably to the detriment of optimisers and/or users). The SQL table model is a bag (multiset), rather than a set. For example, the expression is a theorem for relational algebra on sets, but not for relational algebra on bags.[8]
{{cite book}}: CS1 maint: location missing publisher (link)