This is theprint version ofStructured Query Language You won't see this message or any elements not part of the book's content when you print orpreview this page. |
The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Structured_Query_Language
This Wikibook introduces the programming language SQL as defined by ISO/IEC. The standard is — similar to most standard publications — fairly technical and neither easy to read nor understandable. Therefore there is a demand for a text document explaining the key features of the language. That is what this wikibook strives to do: we want to present an easily readable and understandable introduction for everyone interested in the topic.
Manuals and white papers from database vendors are mainly focused on the technical aspects of their product. As they want to set themselves apart from each other, they tend to emphasize those aspects which go beyond the SQL standard and the products of other vendors. This is contrary to the Wikibook's approach: we want to emphasize the common aspects.
The main audience of this Wikibook are people who want to learn the language, either as beginners or as persons with existing knowledge and some degree of experience looking for a recapitulation.
First of all, the Wikibook is not a reference manual for the syntax of standard SQL or any of its implementations. Reference manuals usually consist of definitions and explanations for those definitions. By contrast, the Wikibook tries to present concepts and basic commands through textual descriptions and examples. Of course, some syntax will be demonstrated. On some pages, there are additional hints about small differences between the standard and particular implementations.
The Wikibook is also not a complete tutorial. First, its focus is the standard and not any concrete implementation. When learning a computer language it is necessary to work with it and experience it personally. Hence, a concrete implementation is needed. And most of them differ more or less from the standard. Second, the Wikibook is far away from reflecting thecomplete standard, e.g. the central part of the standard consists of about 18 MB text in more than 1,400 pages. But you can use the Wikibook as a companion for learning SQL.
For everyone new to SQL, it will be necessary to study the chapters and pages from beginning to end. For persons who have some experience with SQL or are interested in a specific aspect, it is possible to navigate directly to any page.
Knowledge about any other computer language is not necessary, but it will be helpful.
The Wikibook consists of descriptions, definitions, and examples. It should be read with care. Furthermore, it is absolutely necessary to do some experiments with data and data structures personally. Hence,access to a concrete database system, where you can do read-only and read-write tests, is necessary. For those tests, you can use our example database or individually defined tables and data.
The elements of the language SQL are case-insensitive, e.g., it makes no difference whether you writeSELECT ...,Select ...,select ... or any combination of upper and lower case characters likeSeLecT. For readability reasons, the Wikibook uses the convention that all language keywords are written in upper case letters and all names of user objects e.g., table and column names, are written in lower case letters.
We will write short SQL commands within one row.
SELECTstreetFROMaddressWHEREcity='Duckburg';
For longer commands spawning multiple lines we use atabular format.
SELECTstreetFROMaddressWHEREcityIN('Duckburg','Gotham City','Hobbs Lane');
One of the original scopes of computer applications was storing large amounts of data on mass storage devices and retrieving them at a later point in time. Over time user requirements increased to include not only sequential access but also random access to data records, concurrent access by parallel (writing) processes, recovery after hardware and software failures, high performance, scalability, etc. In the 1970s and 1980s, the science and computer industries developed techniques to fulfill those requests.
Basic bricks for efficient data storage - and for this reason for all Database Management Systems (DBMS) - are implementations of fast read and write access algorithms to data located in central memory and mass storage devices like routines forB-trees,Index Sequential Access Method (ISAM), other indexing techniques as well as buffering of dirty and non-dirty blocks. These algorithms are not unique to DBMS. They also apply to file systems, some programming languages, operating systems, application servers, and much more.
In addition to the appropriation of these routines, a DBMS guarantees compliance with theACID paradigm. This compliance means that in a multi-user environment all changes to data within one transaction are:
A distinction between the following generations of DBMS design and implementation can be made:
... WHERE employee.department_id = department.id .... The consequence is that - except for explicit foreign keys - there is no meaning of a parent/child or owner/member denotation. Relationships in this model do not have any direction.A relational DBMS is an implementation of data stores according to the design rules of the relational model. This approach allows operations on the data according to therelational algebra like projections, selections, joins, set operations (union, difference, intersection, ...), and more. Together withBoolean algebra (and, or, not, exists, ...) and other mathematical concepts, relational algebra builds up a complete mathematical system with basic operations, complex operations, and transformation rules between the operations. Neither a DBA nor an application programmer needs to know the relational algebra. But it is helpful to know that your rDBMS is based on this mathematical foundation - and that it has the freedom to transform queries into several forms.
The relational model designs data structures as relations (tables) with attributes (columns) and the relationship between those relations. The information about one entity of the real world is stored within one row of a table. However, the termone entity of the real world must be used with care. It may be that our intellect identifies a machine like a single airplane in this vein. Depending on the information requirements, it may be sufficient to put all of the information into one row of a tableairplane. But in many cases, it is necessary to break the entity into its pieces and model the pieces as discrete entities, including the relationship to the whole thing. If, for example, information about every single seat within the airplane is needed, a second tableseat and some way of joining seats to airplanes will be required.
This way of breaking up information about real entities into a complex data model depends highly on the information requirements of the business concept. Additionally, there are some formal requirements independent of any application: the resulting data model should conform to a so-callednormal form. Normally these data models consist of a great number of tables and relationships between them. Such models will not predetermine their use by applications; they are strictly descriptive and will not restrict access to the data in any way.
Operations within databases must have the ability to act not only on single rows, but also on sets of rows. Relational algebra offers this possibility. Therefore languages based on relational algebra, e.g.: SQL, offer a powerful syntax to manipulate lots of data within one single command.
As operations within relational algebra may be replaced by different but logically equivalent operations, a language based on relational algebra should not predetermine how its syntax is mapped to operations (the execution plan). The language should describewhat should be done and nothow to do it. Note: This choice of operations does not concern the use or neglect of indices.
As described before the relational model tends to break up objects into sub-objects. In this and in other cases it is often necessary to collect associated information from a bunch of tables into one information unit. How is this possible without links between participating tables and rows? The answer is: All joining is done based on thevalues which are actually stored in the attributes. The rDBMS must make its own decisions about how to reach all concerned rows: whether to read all potentially affected rows and ignore those which are irrelevant (full table scan) or, to use some kind of index and read-only those which match the criteria. This value-based approach allows even the use of operators other than the equal-operator, e.g.:
SELECT*FROMgiftJOINboxONgift.extent<box.extent;
This command will join all "gift" records to all "box" records with a larger "extent" (whatever "extent" means).
As outlined above, rDBMS acts on the data with operations ofrelational algebra like projections, selections, joins, set operations (union, except and intersect) and more. The operations of relational algebra are denoted in a mathematical language that is highly formal and hard to understand for end-users and - possibly also - for many software engineers. Therefore, rDBMS offers a layer above relational algebra that is easy to understand but can be mapped to the underlying relational operations. Since the 1970s, we have seen some languages doing this job; one of them was SQL - another example wasQUEL. In the early 1980s (after a rename from its original nameSEQUEL due to trademark problems), SQL achieved market dominance. And in 1986, SQL was standardized for the first time. The current version isSQL 2023.
The tokens and syntax of SQL are modeled onEnglish common speech to keep the access barrier as small as possible. An SQL command likeUPDATE employee SET salary = 2000 WHERE id = 511; is not far away from the sentence "Change employee's salary to 2000 for the employee with id 511."
The keywords of SQL can be expressed in any combination of upper and lower case characters, i.e. the keywords arecase insensitive. It makes no difference whetherUPDATE, update, Update, UpDate, or any other combination of upper and lower case characters is written in SQL code.
Next, SQL is adescriptive language, not a procedural one. It does not proscribe all aspects of the relational operations (which operation, their order, ...), which are generated from the given SQL statement. The rDBMS has the freedom to generate more than one execution plan from a statement. It may compare several generated execution plans with each other and run the one it thinks is best in a given situation. Additionally, the programmer is freed from considering all the details of data access, e.g.: Which one of a set of WHERE criteria should be evaluated first if they are combined with AND?
Despite the above simplifications, SQL is very powerful. It allows the manipulation of aset of data records with a single statement.UPDATE employee SET salary = salary * 1.1 WHERE salary < 2000; will affect all employee records with an actual salary smaller than 2000. Potentially, there may be thousands of those records, only a few or even zero. The operation may also depend on data already present in the database; the statementSET salary = salary * 1.1 leads to an increase of the salaries by 10%, which may be 120 for one employee and 500 for another one.
The designer of SQL tried to define the language elementsorthogonally to each other. Among other things, this refers to the fact that any language element may be used in all positions of a statement where the result of that element may be used directly. E.g.: If you have a function power(), which takes two numbers and returns another number, you can use this function in all positions where numbers are allowed. The following statements are syntactically correct (if you have defined the function power() ) - and lead to the same resulting rows.
SELECTsalaryFROMemployeeWHEREsalary<2048;SELECTsalaryFROMemployeeWHEREsalary<power(2,11);SELECTpower(salary,1)FROMemployeeWHEREsalary<2048;
Another example of orthogonality is the use of subqueries within UPDATE, INSERT, DELETE, or inside another SELECT statement.
However, SQL is not free ofredundancy. Often there are several possible formulations to express the same situation.
SELECTsalaryFROMemployeeWHEREsalary<2048;SELECTsalaryFROMemployeeWHERENOTsalary>=2048;SELECTsalaryFROMemployeeWHEREsalarybetween0AND2048;-- 'BETWEEN' includes edges
This is a very simple example. In complex statements, there may be the choice between joins, subqueries, and theexists predicate.
Core SQL consists of statements. Statements consist of keywords, operators, values, names of system- and user-objects or functions. Statements are concluded by a semicolon. In the statementSELECT salary FROM employee WHERE id < 100; the tokens SELECT, FROM and WHERE are keywords. salary, employee, and id are object names, the "<" sign is an operator, and "100" is a value.
The SQL standard arranges statements into nine groups:
This detailed grouping is unusual in everyday speech. A typical alternative is to organize SQL statements into the following groups:
Core SQL, as described above, is not Turing complete. It misses conditional branches, variables, subroutines. But the standard, as well as most implementations, offers an extension to fulfill the demand forTuring completeness. In 'Part 4: Persistent Stored Modules (SQL/PSM)' of the standard, there are definitions for IF-, CASE-, LOOP-, assignment- and other statements. The existing implementations of this part have different names, different syntax, and also a different scope of operation: PL/SQL in Oracle, SQL/PL in DB2, Transact-SQL, or T-SQL in SQL Server and Sybase, PL/pgSQL in Postgres and simply 'stored procedures' in MySQL.
Like most other standards, the primary purpose of SQL isportability. Usually, software designers and application developers structure and solve problems in layers. Every abstraction level is realized in its own component or sub-component: presentation to end-user, business logic, data access, data storage, net, and operation system demands are typical representatives of such components. They are organized as a stack and every layer offers an interface to the upper layers to use its functionality. If one of those components is realized by two different providers and both offer the same interface (as an API, Web-Service, language specification, ...), it is possible to exchange them without changing the layers which are based on them. In essence, the software industry needsstable interfaces at the top of essential layers to avoid dependence on a single provider. SQL acts as such an interface to relational database systems.
If an application uses only those SQL commands which are defined within standard SQL, it should be possible to exchange the underlying rDBMS with a different one without changing the source code of the application. In practice, this is a hard job, because concrete implementations offer numerous additional features and software engineers love to use them.
A second aspect is theconservation of know-how. If a student learns SQL, he is in a position to develop applications that are based on an arbitrary database system. The situation is comparable with any other popular programming language. If one learns Java or C-Sharp, he can develop applications of any kind running on a lot of different hardware systems and even different hardware architectures.
Database systems consist of many components. Access to the data is an essential element but not the only component. Other components include: throughput optimization, physical design, backup, distributed databases, replication, 7x24 availability, ... . Standard SQL is focused mainly on data access and ignores typical DBA tasks. Even theCREATE INDEX statement as a widely used optimization strategy is not part of the standard. Nevertheless, the standard fills thousands of pages. But most of the DBA's daily work is highly specialized to every concrete implementation and must be done differently when switching to a different rDBMS. Mainly application developers benefit from SQL.
The standardization process is organized in two levels. The first level acts in a national context. Interested companies, universities and persons of one country work within their national standardization organization likeANSI,Deutsches Institut für Normung (DIN) orBritish Standards Institution (BSI), where every member has one vote. The second level is the international stage. The national organizations are members of ISO, respectively IEC. In case of SQL there is a common committee of ISO and IEC namedJoint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 32, Data management and interchange, where every national body has one vote. This committee approves the standard under the nameISO/IEC 9075-n:yyyy, wheren is the part number andyyyy is the year of publication. The ten parts of the standard are described in shorthere.
If the committee releases a new version, this may concern only some of the ten parts. So it is possible that theyyyy denomination differs from part to part.Core SQL is defined mainly by the second part:ISO/IEC 9075-2:yyyy Part 2: Foundation (SQL/Foundation) - but it also contains some features of other parts.
Note: The APIJDBC is part of Java SE and Java EE but not part of the SQL standard.
A second, closely related standard complements the standard:ISO/IEC 13249-n:yyyy SQL Multimedia and Application Packages, which is developed by the same organizations and committee. This publication defines interfaces and packages based on SQL. They focus on particular kinds of applications: text, pictures, data mining, and spatial data applications.
Until 1996 theNational Institute of Standards and Technology (NIST) certified the compliance of the SQL implementation of rDBMS with the SQL standard. As NIST abandon this work, nowadays, vendors self-certify the compliance of their product. They must declare the degree of conformance in a special appendix of their documentation. This documentation may be voluminous as the standard defines not only a set of base features - calledCore SQL:yyyy - but also a lot of additional features an implementation may conform to or not.
To fulfill their clients' demands, all major vendors of rDBMS offers - among other data access ways - the language SQL within their product. The implementations coverCore SQL, a bunch of additional standardized features, and a huge number of additional, not standardized features. The access to standardized features may use the regular syntax or an implementation-specific syntax. In essence, SQL is the clamp holding everything together, but typically there are a lot of detours around the official language.
SQL consists of statements that start with a keyword like SELECT, DELETE or CREATE and terminate with a semicolon. Their elements are case-insensitive except for fixed character string values like 'Mr. Brown'.

When learning SQL (or any other programming language), it is not sufficient to read books or listen to lectures. It's absolutely necessary that one does exercises - prescribed exercises as well as own made-up tests. In the case of SQL, one needs access to a DBMS installation, where he can create tables, store, retrieve and delete data, and so on.
This page offers hints and links to some popular DBMS. In most cases, one can download the system for test purposes or use a free community edition. Some of them offer an online version so that there is no need for any local installation. Instead, such systems can be used in the cloud.
Often, but not always, a DBMS consists of more than the pure database engine. To be able to formulate SQL commands easily, we additionally need an interactive access to the database engine. Different client programs and IDEs provide this. They offer interactive access, and in many cases, they are part of the downloads. (In some cases, there are several different clients from the same producer.) At the same time, there are client programs and IDEs from other companies or organizations which offer only an interactive access but no DBMS. Such clients often support a lot of different DBMS.
http://www-01.ibm.com/software/data/db2/linux-unix-windows/
http://www-01.ibm.com/software/data/informix/
http://www.microsoft.com/en/server-cloud/products/sql-server/default.aspx
DBMS:http://dev.mysql.com/downloads/
IDE for administration and SQL-tests:http://dev.mysql.com/downloads/workbench/
The Oracle database engine is available in 4 editions: Enterprise Edition (EE), Standard Edition (SE), Standard Edition One (SE One), and Express Edition (XE). The last-mentioned is the community edition and is sufficient for this course.http://www.oracle.com/technetwork/database/enterprise-edition/downloads/index.html.
SQL-Developer is an IDE with an Eclipse-like look-and-feel and offers access to the database engine.http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/
In the context of Oracles application builder APEX (APplication EXpress), there is a cloud solution consisting of a database engine plus APEX.https://apex.oracle.com/. Among a lot of other things, it offers an SQL workshop where everybody can execute his own SQL commands for testing purposes. On the other hand, APEX can be downloaded separately and installed into any of the above editions except for the Express Edition.
SQL Fiddle offers an online access to following implementations:
MySQL,PostgreSQL,MS SQL Server,Oracle, andSQLite.
This page offers SQL examples concerning different topics. You can copy/past the examples according to your needs.
---- Frequently used data types and simple constraintsCREATETABLEt_standard(-- column name data type default nullable/constraintidDECIMALPRIMARYKEY,-- some prefer the name: 'sid'col_1VARCHAR(50)DEFAULT'n/a'NOTNULL,-- string with variable length. Oracle: 'VARCHAR2'col_2CHAR(10),-- string with fixed lengthcol_3DECIMAL(10,2)DEFAULT0.0,-- 8 digits before and 2 after the decimal. Signed.col_4NUMERIC(10,2)DEFAULT0.0,-- same as col_3col_5INTEGER,col_6BIGINT-- Oracle: use 'NUMBER(n)', n up to 38);-- Data types with temporal aspectsCREATETABLEt_temporal(-- column name data type default nullable/constraintidDECIMALPRIMARYKEY,col_1DATE,-- Oracle: contains day and time, seconds without decimalcol_2TIME,-- Oracle: use 'DATE' and pick time-partcol_3TIMESTAMP,-- Including decimal for secondscol_4TIMESTAMPWITHTIMEZONE,-- MySql: no time zonecol_5INTERVALYEARTOMONTH,col_6INTERVALDAYTOSECOND);CREATETABLEt_misc(-- column name data type default nullable/constraintidDECIMALPRIMARYKEY,col_1CLOB,-- very long string (MySql: LONGTEXT)col_2BLOB,-- binary, eg: Word document or mp3-streamcol_3FLOAT(6),-- example: two-thirds (2/3).col_4REAL,col_5DOUBLEPRECISION,col_6BOOLEAN,-- Oracle: Not supportedcol_7XML-- Oracle: 'XMLType');
---- Denominate all constraints with an expressive name, eg.: abbreviations for-- table name (unique across all tables in your schema), column name, constraint type, running number.--CREATETABLEmyExampleTable(idDECIMAL,col_1DECIMAL(1),-- only 1 (signed) digitcol_2VARCHAR(50),col_3VARCHAR(90),CONSTRAINTexample_pkPRIMARYKEY(id),CONSTRAINTexample_uniqUNIQUE(col_2),CONSTRAINTexample_fkFOREIGNKEY(col_1)REFERENCESperson(id),CONSTRAINTexample_col_1_nnCHECK(col_1ISNOTNULL),CONSTRAINTexample_col_1_checkCHECK(col_1>=0ANDcol_1<6),CONSTRAINTexample_col_2_nnCHECK(col_2ISNOTNULL),CONSTRAINTexample_check_1CHECK(LENGTH(col_2)>3),CONSTRAINTexample_check_2CHECK(LENGTH(col_2)<LENGTH(col_3)));
---- Reference to a different (or the same) table. This creates 1:m or n:m relationships.CREATETABLEt_hierarchie(idDECIMAL,part_nameVARCHAR(50),super_part_idDECIMAL,-- ID of the part which contains this partCONSTRAINThier_pkPRIMARYKEY(id),-- In this special case the foreign key refers to the same tableCONSTRAINThier_fkFOREIGNKEY(super_part_id)REFERENCESt_hierarchie(id));-- ------------------------------------------------- n:m relationships-- -----------------------------------------------CREATETABLEt1(idDECIMAL,nameVARCHAR(50),-- ...CONSTRAINTt1_pkPRIMARYKEY(id));CREATETABLEt2(idDECIMAL,nameVARCHAR(50),-- ...CONSTRAINTt2_pkPRIMARYKEY(id));CREATETABLEt1_t2(idDECIMAL,t1_idDECIMAL,t2_idDECIMAL,CONSTRAINTt1_t2_pkPRIMARYKEY(id),-- also this table should have its own Primary KeyCONSTRAINTt1_t2_uniqueUNIQUE(t1_id,t2_id),-- every link should occur only onceCONSTRAINTt1_t2_fk_1FOREIGNKEY(t1_id)REFERENCESt1(id),CONSTRAINTt1_t2_fk_2FOREIGNKEY(t2_id)REFERENCESt2(id));-- ------------------------------------------------------------------------------------- ON DELETE / ON UPDATE / DEFFERABLE-- ------------------------------------------------------------------------------------- DELETE and UPDATE behaviour for child tables (see first example)-- Oracle: Only DELETE [CASCADE | SET NULL] is possible. Default is NO ACTION, but this cannot be-- specified explicit - just omit the phrase.CONSTRAINThier_fkFOREIGNKEY(super_part_id)REFERENCESt_hierarchie(id)ONDELETECASCADE-- or: NO ACTION (the default), RESTRICT, SET NULL, SET DEFAULTONUPDATECASCADE-- or: NO ACTION (the default), RESTRICT, SET NULL, SET DEFAULT-- Initial stage: immediate vs. deferred, [not] deferrable-- MySQL: DEFERABLE is not supportedCONSTRAINTt1_t2_fk_1FOREIGNKEY(t1_id)REFERENCESt1(id)INITIALLYIMMEDIATEDEFERRABLE-- Change constraint characteristics at a later stageSETCONSTRAINThier_fkDEFERRED;-- or: IMMEDIATE
Concerning columns.
-- Add a column (plus some column constraints). Oracle: The key word 'COLUMN' is not allowed.ALTERTABLEt1ADDCOLUMNcol_1VARCHAR(100)CHECK(LENGTH(col_1)>5);-- Change a columns characteristic. (Some implementations use different key words like 'MODIFY'.)ALTERTABLEt1ALTERCOLUMNcol_1SETDATATYPENUMERIC;ALTERTABLEt1ALTERCOLUMNcol_1SETSETDEFAULT-1;ALTERTABLEt1ALTERCOLUMNcol_1SETNOTNULL;ALTERTABLEt1ALTERCOLUMNcol_1DROPNOTNULL;-- Drop a column. Oracle: The key word 'COLUMN' is mandatory.ALTERTABLEt1DROPCOLUMNcol_2;
Concerning complete table.
--ALTERTABLEt1ADDCONSTRAINTt1_col_1_uniqUNIQUE(col_1);ALTERTABLEt1ADDCONSTRAINTt1_col_2_fkFOREIGNKEY(col_2)REFERENCESperson(id);-- Change definitions. Some implementations use different key words like 'MODIFY'.ALTERTABLEt1ALTERCONSTRAINTt1_col_1_uniqueUNIQUE(col_1);-- Drop a constraint. You need to know its name. Not supported by MySQL, there is only a 'DROP FOREIGN KEY'.ALTERTABLEt1DROPCONSTRAINTt1_col_1_unique;-- As an extension to the SQL standard, some implementations offer an ENABLE / DISABLE command for constraints.
---- All data and complete table structure inclusive indices are thrown away.-- No column name. No WHERE clause. No trigger is fired. Considers Foreign Keys. Very fast.DROPTABLEt1;
---- Overall structure: SELECT / FROM / WHERE / GROUP BY / HAVING / ORDER BY-- constants, column values, operators, functionsSELECT'ID: ',id,col_1+col_2,sqrt(col_2)FROMt1-- precedence within WHERE: functions, comparisions, NOT, AND, ORWHEREcol_1>100ANDNOTMOD(col_2,10)=0ORcol_3<col_1ORDERBYcol_4DESC,col_5;-- sort ascending (the default) or descending-- number of rows, number of not-null-valuesSELECTCOUNT(*),COUNT(col_1)FROMt1;-- predefined functionsSELECTCOUNT(col_1),MAX(col_1),MIN(col_1),AVG(col_1),SUM(col_1)FROMt1;-- UNIQUE values onlySELECTDISTINCTcol_1FROMt1;-- In the next example col_1 many have duplicates. Only the combination of col_1 plus col_2 is unique.SELECTDISTINCTcol_1,col_2FROMt1;
---- CASE expression with conditions on exactly ONE columnSELECTid,CASEcontact_type-- ONE column nameWHEN'fixed line'THEN'Phone'WHEN'mobile'THEN'Phone'ELSE'Not a telephone number'END,contact_valueFROMcontact;-- CASE expression with conditions on ANY columnSELECTid,CASE-- NO column nameWHENcontact_typeIN('fixed line','mobile')THEN'Phone'WHENid=4THEN'ICQ'ELSE'Something else'END,contact_valueFROMcontact;
--SELECTproduct_group,count(*)AScntFROMsalesWHEREregion='west'-- additional restrictions are possible but not necessaryGROUPBYproduct_group-- 'product_group' is the criterion which creates groupsHAVINGCOUNT(*)>1000-- restriction to groups with more than 1000 sales per groupORDERBYcnt;-- Attention: in the next example, col_2 is not part of the GROUP BY criterion. Therefore it cannot be displayed.SELECTcol_1,col_2FROMt1GROUPBYcol_1;-- We must accumulate all col_2-values of each group to ONE value, eg:SELECTcol_1,sum(col_2),min(col_2)FROMt1GROUPBYcol_1;
---- Inner join: Only persons together with their contacts.-- Ignores all persons without contacts and all contacts without personsSELECT*FROMpersonpJOINcontactcONp.id=c.person_id;-- Left outer join: ALL persons. Ignores contacts without personsSELECT*FROMpersonpLEFTJOINcontactcONp.id=c.person_id;-- Right outer join: ALL contacts. Ignores persons without contactsSELECT*FROMpersonpRIGHTJOINcontactcONp.id=c.person_id;-- Full outer join: ALL persons. ALL contacts.SELECT*FROMpersonpFULLJOINcontactcONp.id=c.person_id;-- Carthesian product (missing ON keyword): be carefull!SELECTCOUNT(*)FROMpersonpJOINcontactc;
---- Subquery within SELECT clauseSELECTid,lastname,weight,(SELECTavg(weight)FROMperson)-- the subqueryFROMperson;-- Subquery within WHERE clauseSELECTid,lastname,weightFROMpersonWHEREweight<(SELECTavg(weight)FROMperson)-- the subquery;-- CORRELATED subquery within SELECT clauseSELECTid,(SELECTstatus_nameFROMstatusstWHEREst.id=sa.state)FROMsalessa;-- CORRELATED subquery retrieving the highest version within each booking_numberSELECT*FROMbookingbWHEREversion=(SELECTMAX(version)FROMbookingsqWHEREsq.booking_number=b.booking_number);
---- UNIONSELECTfirstname-- first SELECT commandFROMpersonUNION-- push both intermediate results together to one resultSELECTlastname-- second SELECT commandFROMperson;-- Default behaviour is: 'UNION DISTINCT'. 'UNION ALL' must be explicitly specified, if duplicate values shall be removed.-- INTERSECT: resulting values must be in BOTH intermediate resultsSELECTfirstnameFROMpersonINTERSECTSELECTlastnameFROMperson;-- EXCEPT: resulting values must be in the first but not in the second intermediate resultSELECTfirstnameFROMpersonEXCEPT-- Oracle uses 'MINUS'. MySQL does not support EXCEPT.SELECTlastnameFROMperson;
-- Additional sum per group and sub-groupSELECTSUM(col_x),...FROM...GROUPBYROLLUP(producer,model);-- the MySQL syntax is: GROUP BY producer, model WITH ROLLUP-- Additional sum per EVERY combination of the grouping columnsSELECTSUM(col_x),...FROM...GROUPBYCUBE(producer,model);-- not supported by MySQL
-- The frames boundariesSELECTid,emp_name,dep_name,FIRST_VALUE(id)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_first_row,LAST_VALUE(id)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_last_row,COUNT(*)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_count,LAG(id)OVER(PARTITIONBYdep_nameORDERBYid)ASprev_row,LEAD(id)OVER(PARTITIONBYdep_nameORDERBYid)ASnext_rowFROMemployee;-- The moving averageSELECTid,dep_name,salary,AVG(salary)OVER(PARTITIONBYdep_nameORDERBYsalaryROWSBETWEEN2PRECEDINGANDCURRENTROW)ASsum_over_1or2or3_rowsFROMemployee;
-- The 'with clause' consists of three parts:-- First: arbitrary name of an intermediate table and its columnsWITHintermediate_table(id,firstname,lastname)AS(-- Second: starting row (or rows)SELECTid,firstname,lastnameFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALL-- Third: Definition of the rule for querying the next level. In most cases this is done with a join operation.SELECTf.id,f.firstname,f.lastnameFROMintermediate_tableiJOINfamily_treefONf.father_id=i.id)-- After the 'with clause': depth first / breadth first-- SEARCH BREADTH FIRST BY firstname SET sequence_number (default behaviour)-- SEARCH DEPTH FIRST BY firstname SET sequence_number-- The final SELECTSELECT*FROMintermediate_table;-- Hints: Oracle supports the syntax of the SQL standard since version 11.2. .-- MySQL does not support recursions at all and recommend procedural workarounds.
---- fix list of values/rowsINSERTINTOt1(id,col_1,col_2)VALUES(6,46,'abc');INSERTINTOt1(id,col_1,col_2)VALUES(7,47,'abc7'),(8,48,'abc8'),(9,49,'abc9');COMMIT;-- subselect: leads to 0, 1 or more new rowsINSERTINTOt1(id,col_1,col_2)SELECTid,col_x,col_yFROMt2WHEREcol_y>100;COMMIT;-- dynamic valuesINSERTINTOt1(id,col_1,col_2)VALUES(16,CURRENT_DATE,'abc');COMMIT;INSERTINTOt1(id,col_1,col_2)SELECTid,CASEWHENcol_x<40THENcol_x+10ELSEcol_x+5END,col_yFROMt2WHEREcol_y>100;COMMIT;
---- basic syntaxUPDATEt1SETcol_1='Jimmy Walker',col_2=4711WHEREid=5;-- raise value of col_2 by factor 2; no WHERE ==> all rows!UPDATEt1SETcol_2=col_2*2;-- non-correlated subquery leads to one single evaluation of the subqueryUPDATEt1SETcol_2=(SELECTmax(id)FROMt1);-- correlated subquery leads to one evaluation of subquery for EVERY affected row of outer queryUPDATEt1SETcol_2=(SELECTcol_2FROMt2wheret1.id=t2.id);-- Subquery in WHERE clauseUPDATEarticleSETcol_1='topseller'WHEREidIN(SELECTarticle_idFROMsalesGROUPBYarticle_idHAVINGCOUNT(*)>1000);
---- INSERT / UPDATE depending on any criterion, in this case: the two columns 'id'MERGEINTOhobby_shadowt-- the target tableUSING(SELECTid,hobbyname,remarkFROMhobbyWHEREid<8)s-- the sourceON(t.id=s.id)-- the 'match criterion'WHENMATCHEDTHENUPDATESETremark=concat(s.remark,' Merge / Update')WHENNOTMATCHEDTHENINSERT(id,hobbyname,remark)VALUES(s.id,s.hobbyname,concat(s.remark,' Merge / Insert'));-- Independent from the number of affected rows there is only ONE round trip between client and DBMS
---- Basic syntaxDELETEFROMt1WHEREid=5;-- no column name behind 'DELETE' key word because the complete row will be deleted-- no hit is OKDELETEFROMt1WHEREid!=id;-- subqueryDELETEFROMperson_hobbyWHEREperson_idIN(SELECTidFROMpersonWHERElastname='Goldstein');
---- TRUNCATE deletes ALL rows (WHERE clause is not possible). The table structure remains.-- No trigger actions will be fired. Foreign Keys are considered. Much faster than DELETE.TRUNCATETABLEt1;
Let's start with a simple example. Suppose we want to collect information about people - their name, place of birth and some more items. In the beginning we might consider to collect this data in a simple spreadsheet. But what if we grow to a successful company and have to handle millions of those data items? Could a spreadsheet deal with this huge amount of information? Could several employees or programs simultaneously insert new data, delete or change it? Of course not. And this is one of the noteworthy advantages of a Database Management System (DBMS) over a spreadsheet program: we can imagine the structure of a table as a simple spreadsheet - but the access to it is internally organized in a way thathuge amounts of data can be accessed by alot of users at thesame time.
In summary, it can be said that one can imagine a table as a spreadsheet optimized for bulk data and concurrent access.
To keep control and to ensure good performance, tables are subject to a few strict rules. Every table column has a fixed name, and the values of each column must be of the same data type. Furthermore, it is highly recommended - though not compulsory - that each row can be identified by a unique value. The column in which this identifying value resides is called the Primary Key. In this Wikibook, we always name itid. But everybody is free to choose a different name. Furthermore, we may use the concatenation of more than one column as the Primary Key.
At the beginning we have to decide the following questions:
DATE.In our example, we decide to save the first name, last name, date, and place of birth, social security number, and the person's weight. Obviously date of birth is of data typeDATE, the weight is a number, and all others are some kind of strings. For strings, there is a distinction between those that have a fixed length and those in which the length usually varies greatly from row to row. The former is named CHAR(<n>), where <n> is thefixed length, and the others VARCHAR(<n>), where <n> is themaximum length.
The decisions previously taken must be expressed in a machine-understandable language. This language is SQL, which acts as the interface between end-users - or special programs - and the DBMS.
-- comment lines start with two consecutive minus signs '--'CREATETABLEperson(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,firstnameVARCHAR(50)NOTNULL,lastnameVARCHAR(50)NOTNULL,date_of_birthDATE,place_of_birthVARCHAR(50),ssnCHAR(11),weightDECIMALDEFAULT0NOTNULL,-- select one of the defined columns as the Primary Key and-- guess a meaningful name for the Primary Key constraint: 'person_pk' may be a good choiceCONSTRAINTperson_pkPRIMARYKEY(id));
We chooseperson as the name of the table, which consists of seven columns. Theid column is assigned the role of the Primary Key. We can store exclusively digits in the columnsid andweight, strings of a length up to 50 characters infirstname,lastname andplace_of_birth, dates indate_of_birth and a string of exactly eleven characters inssn. The phrase NOT NULL is part of the definition ofid,firstname,lastname andweight. This means that in every row, there must be a value for those four columns. Storing no value in any of those columns is not possible - but the 8-character-string 'no value' or the digit '0' are allowed because they are values. Or to say it the other way round: it is possible to omit the values ofdate_of_birth,place_of_birth andssn.
The definition of a Primary Key is called a 'constraint' (later on, we will get to know more kinds of constraints). Every constraint should have a name - it'sperson_pk in this example.
After execution of the above 'CREATE TABLE' command, the DBMS has created an object that one can imagine similar to the following Wiki-table:
| id | firstname | lastname | date_of_birth | place_of_birth | ssn | weight |
|---|---|---|---|---|---|---|
This Wiki-table shows 4 lines. The first line represents the names of the columns - not values! The following 3 lines are for demonstration purposes only. But in the database table, there is currently no single row! It is completely empty, no rows at all, no values at all! The only thing that exists in the database is thestructure of the table.
Maybe we want to delete the table one day. To do so, we can use theDROP command. It removes the table totally: all data and the complete structure are thrown away.
DROPTABLEperson;
Don't confuse the DROP command with the DELETE command, which we present on the next page. The DELETE command removes only rows - possibly all of them. However, the table itself, which holds the definition of the structure, is retained.
As shown in the previous page, we now have an empty table namedperson. What can we do with such a table? Just use it like a bag! Store things in it, look into it to check the existence of things, modify things in it or throw things out of it. These are the four essential operations, which concerns data in tables:
For each of these four operations, there is a SQL command. It starts with a keyword and runs up to a terminating semicolon. This rule applies to all SQL commands: They are introduced by a keyword and terminated by a semicolon. In the middle, there may be more keywords as well as object names and values.
When storing new data in rows of a table, we must name all affected objects and values: the table name (there may be a lot of tables within the database), the column names and the values. All this is embedded within some keywords so that the SQL compiler can recognize the tokens and their meaning. In general, the syntax for a simple INSERT is
INSERTINTO<tablename>(<list_of_columnnames>)VALUES(<list_of_values>);
Here is an example
-- put one rowINSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(1,'Larry','Goldstein',date'1970-11-20','Dallas','078-05-1120',95);-- confirm the INSERT commandCOMMIT;
When the DBMS recognizes the keywords INSERT INTO and VALUES, it knows what to do: it creates a new row in the table and puts the given values into the named columns. In the above example, the command is followed by a second one: COMMIT confirms the INSERT operation as well as the other writing operations UPDATE and DELETE. (We will learn much more about COMMIT and its counterpart ROLLBACK in a later chapter.)
Now we will put some more rows into our table. To do so, we use a variation of the above syntax. It is possible to omit the list of column names if the list of values correlates precisely with the number, order, and data type of the columns used in the original CREATE TABLE statement.
-- put four rowsINSERTINTOpersonVALUES(2,'Tom','Burton',date'1980-01-22','Birmingham','078-05-1121',75);INSERTINTOpersonVALUES(3,'Lisa','Hamilton',date'1975-12-30','Mumbai','078-05-1122',56);INSERTINTOpersonVALUES(4,'Debora','Patterson',date'2011-06-01','Shanghai','078-05-1123',11);INSERTINTOpersonVALUES(5,'James','de Winter',date'1975-12-23','San Francisco','078-05-1124',75);COMMIT;
Now our table should contain five rows. Can we be sure about that? How can we check whether everything worked well and the rows and values exist really? To do so, we need a command which shows us the actual content of the table. It is the SELECT command with the following general syntax
SELECT<list_of_columnnames>FROM<tablename>WHERE<search_condition>ORDERBY<order_by_clause>;
As with the INSERT command, you may omit some parts. The simplest example is
SELECT*FROMperson;
The asterisk character '*' indicates 'all columns'. In the result, the DBMS should deliver all five rows, each with the seven values we used previously with the INSERT command.
In the following examples, we add the currently missing clauses of the general syntax - one after the other.
Add a list of some or all columnnames
SELECTfirstname,lastnameFROMperson;
The DBMS should deliver the two columnsfirstname andlastname of all five rows.
Add a search condition
SELECTid,firstname,lastnameFROMpersonWHEREid>2;
The DBMS should deliver the three columnsid,firstname andlastname of three rows.
Add a sort instruction
SELECTid,firstname,lastname,date_of_birthFROMpersonWHEREid>2ORDERBYdate_of_birth;
The DBMS should deliver the four columnsid,firstname,lastname anddate_of_birth of three rows in the ascending order ofdate_of_birth.
If we want to change the values of some columns in some rows we can do so by using the UPDATE command. The general syntax for a simple UPDATE is:
UPDATE<tablename>SET<columnname>=<value>,<columnname>=<value>,...WHERE<search_condition>;
Values are assigned to the named columns. Unmentioned columns keep unchanged. The search_condition acts in the same way as in the SELECT command. It restricts the coverage of the command to rows, which satisfy the criteria. If the WHERE keyword and the search_condition are omitted,all rows of the table are affected. It is possible to specify search_conditions, which hit no rows. In this case, no rows are updated - and no error or exception occurs.
Change one column of one row
UPDATEpersonSETfirstname='James Walker'WHEREid=5;COMMIT;
The first name of Mr. de Winter changes to James Walker, whereas all his other values keep unchanged. Also, all other rows keep unchanged. Please verify this with a SELECT command.
Change one column of multiple rows
UPDATEpersonSETfirstname='Unknown'WHEREdate_of_birth<date'2000-01-01';COMMIT;
The <search_condition> isn't restricted to the Primary Key column. We can specify any other column. And the comparison operator isn't restricted to the equal sign. We can use different operators - they solely have to match the data type of the column.
In this example, we change thefirstname of four rows with a single command. If there is a table with millions of rows we can change all of them using one single command.
Change two columns of one row
-- Please note the additional commaUPDATEpersonSETfirstname='Jimmy Walker',lastname='de la Crux'WHEREid=5;COMMIT;
The two values are changed with one single command.
The DELETE command removes complete rows from the table. As the rows are removed as a whole, there is no need to specify any columnname. The semantics of the <search_condition> is the same as with SELECT and UPDATE.
DELETEFROM<tablename>WHERE<search_condition>;
Delete one row
DELETEFROMpersonWHEREid=5;COMMIT;
The row of James de Winter is removed from the table.
Delete many rows
DELETEFROMperson;COMMIT;
All remained rows are deleted as we have omitted the <search_condition>. The table is empty, but it still exists.
No rows affected
DELETEFROMpersonWHEREid=99;COMMIT;
This command will remove no row as there is no row withid equals to 99. But the syntax and the execution within the DBMS are still perfect. No exception is thrown. The command terminates without any error message or error code.
The INSERT and DELETE commands affect rows in their entirety. INSERT puts a complete new row into a table (unmentioned columns remain empty), and DELETE removes entire rows. In contrast, SELECT and UPDATE affect only those columns that are mentioned in the command; unmentioned columns are unaffected.
The INSERT command (in the simple version of this page) has no <search_condition> and therefore handles exactly one row. The three other commands may affect zero, one, or more rows depending on the evaluation of their <search_condition>.
First of all a database is a collection of data. These data are organized in tables as shown in the exampleperson. In addition, there are many other kinds of objects in the DBMS: views, functions, procedures, indices, rights and many others. Initially we focus on tables and present four of them. They serve as the foundation for our Wikibook. Other kind of objects will be given later.
We try to keep everything as simple as possible. Nevertheless, this minimalistic set of four tables demonstrates a 1:n as well as a n:m relationship.
Theperson table holds information about fictitious persons; see:Create a simple Table.
-- comment lines start with two consecutive minus signs '--'CREATETABLEperson(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,firstnameVARCHAR(50)NOTNULL,lastnameVARCHAR(50)NOTNULL,date_of_birthDATE,place_of_birthVARCHAR(50),ssnCHAR(11),weightDECIMALDEFAULT0NOTNULL,-- select one of the defined columns as the Primary Key and-- guess a meaningfull name for the Primary Key constraint: 'person_pk' may be a good choiceCONSTRAINTperson_pkPRIMARYKEY(id));
Thecontact table holds information about the contact data of some persons. One could consider to store this contact information in additional columns of theperson table: one column for email, one for icq, and so on. We decided against it for some serious reasons.
We can deal with all these situations in an uncomplicated way, when the contact data goes to its own table. The only special thing is bringing persons together with their contact data. This task will be managed by the columnperson_id of tablecontact. It holds the same value as the Primary Key of the allocated person.
The general statement is that we do haveone information unit (person) to whichpotentially multiple information units of the same type (contact) belongs to. We call this togetherness arelationship - in this case a1:m relationship (also known as a one to many relationship). Whenever we encounter such a situation, we store the values, which may occur more than once, in a separate table together with the id of the first table.
CREATETABLEcontact(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,person_idDECIMALNOTNULL,-- use a default value, if contact_type is omittedcontact_typeVARCHAR(25)DEFAULT'email'NOTNULL,contact_valueVARCHAR(50)NOTNULL,-- select one of the defined columns as the Primary KeyCONSTRAINTcontact_pkPRIMARYKEY(id),-- define Foreign Key relation between column person_id and column id of table personCONSTRAINTcontact_fkFOREIGNKEY(person_id)REFERENCESperson(id),-- more constraint(s)CONSTRAINTcontact_checkCHECK(contact_typeIN('fixed line','mobile','email','icq','skype')));
People usually pursue one or more hobbies. Concerning multiplicity, we have the same problems as before withcontact. So we need a separate table for hobbies.
CREATETABLEhobby(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,hobbynameVARCHAR(100)NOTNULL,remarkVARCHAR(1000),-- select one of the defined columns as the Primary KeyCONSTRAINThobby_pkPRIMARYKEY(id),-- forbid duplicate recording of a hobbyCONSTRAINThobby_uniqueUNIQUE(hobbyname));
You may have noticed that there is no column for the corresponding person. Why this? With hobbies, we have an additional problem: It's not just that one person pursues multiple hobbies. At the same time, multiple persons pursue the same hobby.
We call this kind of togetherness an:m relationship. It can be designed by creating a third table between the two original tables. The third table holds the ids of the first and second table. So one can decide which person pursues which hobby. In our example, this 'table-in-the-middle' isperson_hobby and will be defined next.
CREATETABLEperson_hobby(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,person_idDECIMALNOTNULL,hobby_idDECIMALNOTNULL,-- Also this table has its own Primary Key!CONSTRAINTperson_hobby_pkPRIMARYKEY(id),-- define Foreign Key relation between column person_id and column id of table personCONSTRAINTperson_hobby_fk_1FOREIGNKEY(person_id)REFERENCESperson(id),-- define Foreign Key relation between column hobby_id and column id of table hobbyCONSTRAINTperson_hobby_fk_2FOREIGNKEY(hobby_id)REFERENCEShobby(id));
Every row of the table holds one id fromperson and one fromhobby. This is the technique of how the information of persons and hobbies are joined together.
After execution of the above commands, your database should contain four tables (without any data). The tables and their relationship to each other may be visualized in a so-calledEntity Relationship Diagram. On the left side there is the 1:n relationship betweenperson andcontact and on the right side the n:m relationship betweenperson andhobby with its 'table-in-the-middle'person_hobby.

rDBMS offers different ways to put data into their storage: from CSV files, Excel files, product-specific binary files, via several API's or special gateways to other databases respectively database systems and some more technics. So there is a wide range of - non standardized - possibilities to bring data into our system. Because we are speaking about SQL, we use the standardized INSERT command to do the job. It is available on all systems.
We use only a small amount of data because we want to keep things simple. Sometimes one needs a high number of rows to do performance tests. For this purpose, we show a special INSERT command at the end of this page, which exponentially inflates your table.
---- After we have done a lot of tests we may want to reset the data to its original version.-- To do so, use the DELETE command. But be aware of Foreign Keys: you may be forced to delete-- persons at the very end - with DELETE it's just the opposite sequence of tables in comparison to INSERTs.-- Be careful and don't confuse DELETE with DROP !!---- DELETE FROM person_hobby;-- DELETE FROM hobby;-- DELETE FROM contact;-- DELETE FROM person;-- COMMIT;INSERTINTOpersonVALUES(1,'Larry','Goldstein',DATE'1970-11-20','Dallas','078-05-1120',95);INSERTINTOpersonVALUES(2,'Tom','Burton',DATE'1977-01-22','Birmingham','078-05-1121',75);INSERTINTOpersonVALUES(3,'Lisa','Hamilton',DATE'1975-12-23','Richland','078-05-1122',56);INSERTINTOpersonVALUES(4,'Kim','Goldstein',DATE'2011-06-01','Shanghai','078-05-1123',11);INSERTINTOpersonVALUES(5,'James','de Winter',DATE'1975-12-23','San Francisco','078-05-1124',75);INSERTINTOpersonVALUES(6,'Elias','Baker',DATE'1939-10-03','San Francisco','078-05-1125',55);INSERTINTOpersonVALUES(7,'Yorgos','Stefanos',DATE'1975-12-23','Athens','078-05-1126',64);INSERTINTOpersonVALUES(8,'John','de Winter',DATE'1977-01-22','San Francisco','078-05-1127',77);INSERTINTOpersonVALUES(9,'Richie','Rich',DATE'1975-12-23','Richland','078-05-1128',90);INSERTINTOpersonVALUES(10,'Victor','de Winter',DATE'1979-02-28','San Francisco','078-05-1129',78);COMMIT;
Please note that the format of DATEs may depend on your local environment. Furthermore, SQLite uses a different syntax for the implicit conversion from string to DATE.
-- SQLite syntaxINSERTINTOpersonVALUES(1,'Larry','Goldstein',DATE('1970-11-20'),'Dallas','078-05-1120',95);...
-- DELETE FROM contact;-- COMMIT;INSERTINTOcontactVALUES(1,1,'fixed line','555-0100');INSERTINTOcontactVALUES(2,1,'email','larry.goldstein@acme.xx');INSERTINTOcontactVALUES(3,1,'email','lg@my_company.xx');INSERTINTOcontactVALUES(4,1,'icq','12111');INSERTINTOcontactVALUES(5,4,'fixed line','5550101');INSERTINTOcontactVALUES(6,4,'mobile','10123444444');INSERTINTOcontactVALUES(7,5,'email','james.dewinter@acme.xx');INSERTINTOcontactVALUES(8,7,'fixed line','+30000000000000');INSERTINTOcontactVALUES(9,7,'mobile','+30695100000000');COMMIT;
-- DELETE FROM hobby;-- COMMIT;INSERTINTOhobbyVALUES(1,'Painting','Applying paint, pigment, color or other medium to a surface.');INSERTINTOhobbyVALUES(2,'Fishing','Catching fishes.');INSERTINTOhobbyVALUES(3,'Underwater Diving','Going underwater with or without breathing apparatus (scuba diving / breath-holding).');INSERTINTOhobbyVALUES(4,'Chess','Two players have 16 figures each. They move them on an eight-by-eight grid according to special rules.');INSERTINTOhobbyVALUES(5,'Literature','Reading books.');INSERTINTOhobbyVALUES(6,'Yoga','A physical, mental, and spiritual practices which originated in ancient India.');INSERTINTOhobbyVALUES(7,'Stamp collecting','Collecting of post stamps and related objects.');INSERTINTOhobbyVALUES(8,'Astronomy','Observing astronomical objects such as moons, planets, stars, nebulae, and galaxies.');INSERTINTOhobbyVALUES(9,'Microscopy','Observing very small objects using a microscope.');COMMIT;
-- DELETE FROM person_hobby;-- COMMIT;INSERTINTOperson_hobbyVALUES(1,1,1);INSERTINTOperson_hobbyVALUES(2,1,4);INSERTINTOperson_hobbyVALUES(3,1,5);INSERTINTOperson_hobbyVALUES(4,5,2);INSERTINTOperson_hobbyVALUES(5,5,3);INSERTINTOperson_hobbyVALUES(6,7,8);INSERTINTOperson_hobbyVALUES(7,4,4);INSERTINTOperson_hobbyVALUES(8,9,8);INSERTINTOperson_hobbyVALUES(9,9,9);COMMIT;
For realistic performance tests, we need a vast amount of data. The few number of rows in our example database does not meet this criteria. How can we generate test data and store it in a table? There are different possibilities: FOR loops in a procedure, (pseudo-) recursive calls, importing external data in a system-specific fashion, and some more.
Because we are dealing with SQL, we introduce an INSERT command, which is portable across all rDBMS. Although it has a simple syntax, it is very powerful. With every execution, it will double the number of rows. Suppose there is 1 row in a table. After the first execution, there will be a second row in the table. At first glance, this sounds boring. But after 10 executions there are more than a thousand rows, after 20 executions there are more than a million, and we suspect that only a few installations can execute it more than 30 times.
INSERTINTOperson(id,firstname,lastname,weight)SELECTid+(selectmax(id)fromperson),firstname,lastname,weightFROMperson;COMMIT;
The command is an INSERT in combination with a (Sub-)SELECT. The SELECT retrieves all rows of the table because there is no WHERE clause. This is the reason for the doubling. The mandatory columnsfirstname andlastname keeps unchanged. We ignore optional columns. Only the primary keyid is computed. The new value is the sum of the old value plus the highest availableid when starting the command.
Some more remarks:
The SELECT command retrieves data from one or more tables or views. It generally consists of the following language elements:
SELECT<things_to_be_displayed>-- the so called 'Projection' - mostly a list of columnnamesFROM<tablename>-- table or view names and their aliasesWHERE<where_clause>-- the so called 'Restriction' or 'search condition'GROUPBY<group_by_clause>HAVING<having_clause>ORDERBY<order_by_clause>OFFSET<offset_clause>FETCH<fetch_first_or_next_clause>;
With the exception of the first two elements all others are optional. The sequence of language elements is mandatory. At certain places within the command there may start new SELECT commands - in a recursive manner.
In the projection part of the SELECT command, you specify a list of columns, operations working on columns, functions, fixed values, or new SELECT commands.
-- C/Java style comments are possible within SQL commandsSELECTid,/* the name of a column */concat(firstname,lastname),/* the concat() function */weight+5,/* the add operation */'kg'/* a value */FROMperson;
The DBMS will retrieve ten rows, each of which consists of four columns.
We can mix the sequence of columns in any order or retrieve them several times.
SELECTid,lastname,lastname,'weighs',weight,'kg'FROMperson;
The asterisk '*' is an abbreviation for the list of all columns.
SELECT*FROMperson;
For numeric columns, we can apply the usual numeric operators +, -, * and /. There are also many predefined functions depending on the data type: power, sqrt, modulo, string functions, date functions.
It is possible to compact the result in the sense of unique values by using the keyword DISTINCT. In this case, all resultingrows, which would be identical, will be compressed to onerow. In other words: duplicates are eliminated - just like in set theory.
-- retrieves ten rowsSELECTlastnameFROMperson;-- retrieves only seven rows. Duplicate values are thrown away.SELECTDISTINCTlastnameFROMperson;-- Hint:-- The keyword 'DISTINCT' refers to the entirety of the resulting rows, which you can imagine as-- the concatenation of all columns. It follows directly behind the SELECT keyword.-- The following query leads to ten rows, although three persons have the same lastname.SELECTDISTINCTlastname,firstnameFROMperson;-- again only seven rowsSELECTDISTINCTlastname,lastnameFROMperson;
Sometimes we want to give resulting columns more descriptive names. We can do so by choosing an alias within the projection. This alias is the new name within the result set. GUIs show the alias as the column label.
-- The keyword 'AS' is optionalSELECTlastnameASfamily_name,weightASweight_in_kgFROMperson;
There are predefined functions for use in projections (and at some other positions). The most frequently used are:
Standard SQL and every DBMS offers many more functions.
We must differentiate between those functions which return one value per row like concat() and those which return only one row per complete resultset like max(). The former one may be mixed in any combination with column names, as shown in the very first example of this page. With the later ones, there exists a problem: If we mix them with a regular column name, the DBMS recognises a contradiction in the query. On the one hand it should retrieve precisely one value (in one row), and on the other hand, it should retrieve a lot of values (in a lot of rows). The reaction of DBMS differs from vendor to vendor. Some throw an error message at runtime - in accordance with the SQL standard -, others deliver suspicious results.
-- works fineSELECTlastname,concat(weight,' kg')FROMperson;-- check the reaction of your DBMS. It should throw an error message.SELECTlastname,avg(weight)FROMperson;
-- a legal mixture of functions resulting in one row with 4 columnsSELECTmin(weight),max(weight),avg(weight)asaverage_1,sum(weight)/count(*)asaverage_2FROMperson;
If wereally want to see the result of a result-set-oriented-function in combination with columns of more than one row, we can start a very new SELECT on a location where - in simple cases - a column name occurs. This second SELECT is an absolutely independent command. Be careful: It will be executed forevery resulting row of the first SELECT!
-- retrieves 10 rows; notice the additional parenthesis to delimit the two SELECTs from each other.SELECTlastname,(SELECTavg(weight)FROMperson)FROMperson;-- Compute the percentage of each persons weight in relation to the average weight of all personsSELECTlastname,weight,weight*100/(SELECTavg(weight)FROMperson)ASpercentage_of_averageFROMperson;
The Keyword FROM is used to specify the table on which the command will work. This table name can be used as an identifier. In the first simple examples prefixing column names with the table name identifier can be used but isn't required. In the later more complex command, the table name identifier is a needed feature.
SELECTperson.firstname,person.lastnameFROMperson;-- Define an alias for the table name (analog to column names). To retain overview we usually-- abbreviate tables by the first character of their name.SELECTp.firstname,p.lastnameFROMpersonASp;-- Hint: not all systems accept keyword 'AS' with table aliases. Omit it in these cases!-- The keyword 'AS' is optional again.SELECTp.firstname,p.lastnameFROMpersonp;
In the WHERE clause, we specify some 'search conditions' which are among the named table(s) or view(s). The evaluation of this criteria is - mostly - one of the first things during the execution of a SELECT command. Before any row can be sorted or displayed, it must meet the conditions in the clause.
If we omit the clause, all rows of the table are retrieved. Else the number of rows will be reduced according to the specified criteria. If we specify 'weight < 70', for example, only those rows are retrieved where the weight column stores a value less than 70. The restrictions act onrows oftables by evaluatingcolumn values (sometimes they act on other things like the existence of rows, but for the moment we focus on basic principles). As a result, we can imagine that the evaluation of the 'where clause' produces a list of rows. This list of rows will be processed in further steps like sorting, grouping, or displaying certain columns (projection).
We compare variables, constant values, and results of function calls with each other in the same way as we would do in different programming languages. The only difference is that we use column names instead of variables. The comparison operators must match the given data types they have to operate on. The result of the comparison is a boolean value. If it is 'true', the according row will be processed furthermore. Some examples:
Often we want to specify more than a single search criterion, e.g., are there people born in San Francisco with lastname Baker? To do this, we specify every necessary comparison independent from the next one and join them together with the boolean operators AND respectively OR.
SELECT*FROMpersonWHEREplace_of_birth='San Francisco'ANDlastname='Baker';
The result of a comparison is a boolean. It may be toggled between 'true' and 'false' by the unary operator NOT.
SELECT*FROMpersonWHEREplace_of_birth='San Francisco'ANDNOTlastname='Baker';-- all except 'Baker'-- for clarification: The NOT in the previous example is a 'unary operation' on the result of the-- comparison. It's not an addition to the AND.SELECT*FROMpersonWHEREplace_of_birth='San Francisco'AND(NOT(lastname='Baker'));-- same as before, but explicit notated with parenthesis
Theprecedence of comparisons and boolean logic is as follows:
-- AND (born in SF and lastname Baker; 1 hit as an intermediate result) will be processed before-- OR (person Yorgos; 1 hit)-- 1 + 1 ==> 2 rowsSELECT*FROMpersonWHEREplace_of_birth='San Francisco'-- 4 hits SFANDlastname='Baker'-- 1 hit BakerORfirstname='Yorgos'-- 1 hit Yorgos;-- Same example with parentheses added to make the precedence explicit.-- AND gets processed before OR.-- results ==> same 2 rows as aboveSELECT*FROMpersonWHERE(place_of_birth='San Francisco'-- 4 hits SFANDlastname='Baker')-- 1 hit BakerORfirstname='Yorgos'-- 1 hit Yorgos;-- AND (person Yorgos Baker; no hit as an intermediate result) will be processed before-- OR (born in SF; 4 hits)-- 0 + 4 ==> 4 rowsSELECT*FROMpersonWHEREplace_of_birth='San Francisco'-- 4 hits SFORfirstname='Yorgos'-- 1 hit YorgosANDlastname='Baker'-- 1 hit Baker;-- Same example with parentheses added to make the precedence explicit.-- AND gets processed before OR.-- results ==> same 4 rows as aboveSELECT*FROMpersonWHEREplace_of_birth='San Francisco'-- 4 hits SFOR(firstname='Yorgos'-- 1 hit YorgosANDlastname='Baker')-- 1 hit Baker;-- We can modify the sequence of evaluations by specifying parentheses.-- Same as the first example, adding parentheses, one row.SELECT*FROMpersonWHEREplace_of_birth='San Francisco'-- 4 hits SFAND(lastname='Baker'-- 1 hit BakerORfirstname='Yorgos')-- 1 hit Yorgos;
Two abbreviations
Sometimes we shorten the syntax by using the BETWEEN keyword. It defines a lower and upper limit and is primarily used for numeric and date values, but also applicable to strings.
SELECT*FROMpersonWHEREweight>=70ANDweight<=90;-- An equivalent shorter and more expressive wordingSELECT*FROMpersonWHEREweightBETWEEN70AND90;-- BETWEEN includes the two cutting edges
For the comparison of a column or function with several values, we can use the short IN expression.
SELECT*FROMpersonWHERElastname='de Winter'ORlastname='Baker';-- An equivalent shorter and more expressive wordingSELECT*FROMpersonWHERElastnameIN('de Winter','Baker');
Sometimes we are not interested in all resulting rows, e.g.: we may want to see only the first 3 or 10 rows. This can be achieved with the OFFSET and FETCH clauses. OFFSET specifies the number of rows to be skipped (counting from the beginning of the result set), and FETCH specifies the number of rows, after which the delivery of rows shall stop.
SELECT*FROMpersonWHEREplace_of_birth='San Francisco'ORDERBYfirstnameFETCHFIRST2ROWSONLY-- only the first 2 rows;SELECT*FROMpersonORDERBYid-- the WHERE clause (and the ORDER BY clause) are optionalOFFSET5ROWSFETCHFIRST2ROWSONLY-- only the 6th and 7th row (according to the ORDER BY);
Please notice that the OFFSET and FETCH clauses are separate parts of the SELECT command. Some implementations handle this functionality as part of the WHERE clause or with different keywords (ROWNUM, START, SKIP, LIMIT).
The functionality of OFFSET and FETCH can be achieved likewise bywindow functions with their more general syntax.
We will offer the GROUP BY clause in combination with the HAVING clause in alater chapter.
The DBMS is free to deliver the resulting rows in an arbitrary order. Rows may be returned in the order of the Primary Key, in the chronological order they are stored into the database, in the order of a B-tree organized internal key, or even in random order. Concerning the sequence of delivered rows, the DBMS may do what it wants to do. Don't expect anything.
If we expect a particular order of rows, we must express our wishes explicitly. We can do this in the ORDER BY clause. There we specify a list of column names in combination with an option for ascending or descending sorting.
-- all persons in ascending (which is the default) order of their weightSELECT*FROMpersonORDERBYweight;-- all persons in descending order of their weightSELECT*FROMpersonORDERBYweightdesc;
In the above result, there are two rows with identical values in the columnweight. As this situation leads to random results, we have the possibility to specify more columns. These following columns are processed only for those rows with identical values in all previous columns.
-- All persons in descending order of their weight. In ambiguous cases order the-- additional column place_of_birth ascending: Birmingham before San Francisco.SELECT*FROMpersonORDERBYweightdesc,place_of_birth;
In the ORDER BY clause, we can specify any column of the processed table. We are not limited to the ones which are returned by the projection.
-- same ordering as aboveSELECTfirstname,lastnameFROMpersonORDERBYweightdesc,place_of_birth;
Only the first two elements of the SELECT command are mandatory: the part up to the first table (or view) name. All others are optional. If we also specify the optional ones, their predetermined sequence must be kept in mind. But they are combinable according to our needs.
-- We have seen on this page: SELECT / FROM / WHERE / ORDER BYSELECTp.lastname,p.weight,p.weight*100/(SELECTavg(p2.weight)FROMpersonp2)ASpercentage_of_averageFROMpersonpWHEREp.weightBETWEEN70AND90ORDERBYp.weightdesc,p.place_of_birth;
There is more information about the additional options for the SELECT command.
Show hobbyname and remark from the hobby table.
SELECThobbyname,remarkFROMhobby;
Show hobbyname and remark from the hobby table. Order the result by hobbyname.
SELECThobbyname,remarkFROMhobbyORDERBYhobbyname;
Show hobbyname and remark from the hobby table. Choose 'Hobby' as first columnname and 'Short_Description_of_Hobby' as second columnname.
SELECThobbynameasHobby,remarkasShort_Description_of_HobbyFROMhobby;-- columnname without underscore: Use quotesSELECThobbynameas"Hobby",remarkas"Short Description of Hobby"FROMhobby;
Show firstname and lastname of persons born in San Francisco.
SELECTfirstname,lastnameFROMpersonWHEREplace_of_birth='San Francisco';
Show all information items of persons with lastname 'de Winter'.
SELECT*FROMpersonWHERElastname='de Winter';
How many rows are stored in the contact table?
SELECTcount(*)FROMcontact;9
How many E-Mails are stored in the contact table?
SELECTcount(*)FROMcontactWHEREcontact_type='email';3
What is the mean weight of persons born in San Francisco?
SELECTavg(weight)FROMpersonWHEREplace_of_birth='San Francisco';71.25
Find persons born after 1979-12-31, which weigh more than / less than 50 kg.
SELECT*FROMpersonWHEREdate_of_birth>DATE'1979-12-31'ANDweight>50;--SELECT*FROMpersonWHEREdate_of_birth>DATE'1979-12-31'ANDweight<50;
Find persons born in Birmingham, Mumbai, Shanghai or Athens in the order of their firstname.
SELECT*FROMpersonWHEREplace_of_birth='Birmingham'ORplace_of_birth='Mumbai'ORplace_of_birth='Shanghai'ORplace_of_birth='Athens'ORDERBYfirstname;-- equivalent:SELECT*FROMpersonWHEREplace_of_birthIN('Birmingham','Mumbai','Shanghai','Athens')ORDERBYfirstname;
Find persons born in Birmingham, Mumbai, Shanghai or Athens within the 21. century.
SELECT*FROMpersonWHERE(place_of_birth='Birmingham'ORplace_of_birth='Mumbai'ORplace_of_birth='Shanghai'ORplace_of_birth='Athens')ANDdate_of_birth>=DATE'2000-01-01';-- equivalent:SELECT*FROMpersonWHEREplace_of_birthIN('Birmingham','Mumbai','Shanghai','Athens')ANDdate_of_birth>=DATE'2000-01-01';
Find persons born between Dallas and Richland ('between' not in the sense of a geographic area but of the lexical order of citynames)
-- strings have a lexical order. So we can use some operators known-- from numeric data types.SELECT*FROMpersonWHEREplace_of_birth>='Dallas'ANDplace_of_birth<='Richland'ORDERBYplace_of_birth;-- equivalent:SELECT*FROMpersonWHEREplace_of_birthBETWEEN'Dallas'AND'Richland'ORDERBYplace_of_birth;
Which kind of contacts are stored in the contact table? (Only one row per value.)
SELECTDISTINCTcontact_typeFROMcontact;fixedlineemailicqmobile
How many different kind of contacts are stored in the contact table? (Hint: Count the rows of above query.)
SELECTcount(DISTINCTcontact_type)FROMcontact;4
Show contact_type, contact_value and a string of the form 'total number of contacts: <x>', where <x> is the quantity of all existing contacts.
SELECTcontact_type,contact_value,(SELECTconcat('total number of contacts: ',count(*))FROMcontact)FROMcontact;-- Some systems need explicit type casting from numeric to stringSELECTcontact_type,contact_value,(SELECTconcat('total number of contacts: ',cast(count(*)aschar))FROMcontact)FROMcontact;-- The '||' operator is some kind of 'syntactical sugar'. It's an abbreviation for the concat() function.-- The operator is part of the SQL standard, but not implemented by all vendors.SELECTcontact_type,contact_value,(SELECT'total number of contacts: '||count(*)FROMcontact)FROMcontact;
Data should be stored in such a way that no redundant information exists in the database. For example, if our database includes groups of people who, in each case, all pursue the same hobby, then we would rather avoid repeatedly storing the same static details about a given hobby; namely in every record about one of the hobby's enthusiasts. Likewise, we would rather avoid repeatedly storing the same detailed information about an individual hobbyist, namely in every record about one of that person's hobbies. Instead, we create independentperson andhobby tables and point from one to the other. This technique for grouping data in separate, redundancy-free tables is calleddatabase normalization. Such separation also tends to simplify the logic and enhance the flexibility of assembling precisely the items needed for a given purpose. This assembly is accomplished through the 'JOIN' operation.
In our example database, there are two tables:person andcontact. Thecontact table contains the columnperson_id, which correlates with the Primary-Key columnid of theperson table. By evaluating the column values, we can join contacts and persons together.
person table P
| contact table C
|
Joined (virtual) table, created out ofperson andcontact
| P.ID | P.LASTNAME | P.FIRSTNAME | ... | C.ID | C.PERSON_ID | C.CONTACT_TYPE | C.CONTACT_VALUE |
|---|---|---|---|---|---|---|---|
| 1 | Goldstein | Larry | ... | 1 | 1 | fixed line | 555-0100 |
| 1 | Goldstein | Larry | ... | 2 | 1 | larry.goldstein@acme.xx | |
| 1 | Goldstein | Larry | ... | 3 | 1 | lg@my_company.xx | |
| 1 | Goldstein | Larry | ... | 4 | 1 | icq | 12111 |
| 2 | Burton | Tom | ... | ? | ? | ? | ? |
| 3 | Hamilton | Lisa | ... | ? | ? | ? | ? |
| 4 | Goldstein | Kim | ... | 5 | 4 | fixed line | 5550101 |
| 4 | Goldstein | Kim | ... | 6 | 4 | mobile | 10123444444 |
| ... | ... | ... | ... | ... | ... | ... | ... |
So, Larry Goldstein, that exists only once in the storedperson table, is now listed four times in the joined, virtual table – each time, in combination with one of his four contact items. The same applies to Kim Goldstein and his two contact items.
But what is going on with Tom Burton and Lisa Hamilton, whose contact information is not available? We may have some trouble attempting to join theirperson data with their non-existentcontact information. For the moment, we have flagged the situation with question marks. A detailed explanation of how to transform the problem into a solution appears later on this page.
Obviously it's necessary to specify two things with the JOIN operation
The basic syntax extends the SELECT command with these two elements
SELECT<things_to_be_displayed>-- as usualFROM<tablename_1><table_1_alias>-- a table aliasJOIN<tablename_2><table_2_alias>ON<joincondition>-- the join criterion...-- optionally all the other elements of SELECT command;
Let's make a first attempt.
SELECT*FROMpersonpJOINcontactcONp.id=c.person_id;
One of the table names is referenced after the FROM keyword (as previously), and the other one after the new keyword, JOIN, which (no surprise here) instructs the DBMS to perform a join operation. Next, the ON keyword introduces the column names together with a comparison operator (or a general condition, as you will see later). The column names are prefixed with the respective aliases of the table names,p andc. This is necessary because columns with identical names (likeid) may exist in multiple tables.
When the DBMS executes the command, it delivers 'something' that contains all the columns from both tables, including the twoid columns from their respective (person andcontact) tables. The result contains nine rows, one perexisting combination of person and contact; viz., due to the 'ON' expression, person records without any corresponding contact records will not appear in the result.
The delivered 'something' looks like a new table; in fact, it has the same structure, behavior, and data as a table. If it is created from a view or as the result of a subselection, we can even perform new SELECTs on it. But there is an important difference between this and a table: Its assembled data isnot stored in the DBMS as such; rather, the data iscomputed at run time from the values of real tables, and only held in temporary memory while the DBMS is running your program.
This key feature – assembling complex information from simple tables – is made possible by means of the two simple keywords, JOIN and ON. As you will see also, the syntax can be extended to build very complex queries, such that you can add many additional refinements to the specification of your join criteria.
It can sometimes be confusing when results don't match your intentions. If this happens, try to simplify your query, as shown here. Confusion often results from the fact that the JOIN syntax itself may become quite complicated. Moreover, joining can be combined with all of the other syntactic elements of the SELECT command, which also may lead to a lack of clarity.
The combination of the join syntax with other language elements is shown in the following examples.
---- show only important columnsSELECTp.firstname,p.lastname,c.contact_typeas"Kind of Contact",c.contact_valueas"Call Number"FROMpersonpJOINcontactcONp.id=c.person_id;-- show only desired rowsSELECTp.firstname,p.lastname,c.contact_typeas"Kind of Contact",c.contact_valueas"Call Number"FROMpersonpJOINcontactcONp.id=c.person_idWHEREc.contact_typeIN('fixed line','mobile');-- apply any sort orderSELECTp.firstname,p.lastname,c.contact_typeas"Kind of Contact",c.contact_valueas"Call Number"FROMpersonpJOINcontactcONp.id=c.person_idWHEREc.contact_typeIN('fixed line','mobile')ORDERBYp.lastname,p.firstname,c.contact_typeDESC;-- use functions: min() / max() / count()SELECTcount(*)FROMpersonpJOINcontactcONp.id=c.person_idWHEREc.contact_typeIN('fixed line','mobile');-- JOIN a table with itself. Example: Search different persons with the same lastnameSELECTp1.id,p1.firstname,p1.lastname,p2.id,p2.firstname,p2.lastnameFROMpersonp1JOINpersonp2ONp1.lastname=p2.lastname-- for the second incarnation of person we must use a different aliasWHEREp1.id!=p2.id-- sorting of p2.lastname is not necessary as it is identical to the already sorted p1.lastnameORDERBYp1.lastname,p1.firstname,p2.firstname;-- JOIN more than two tables. Example: contact information of different persons with same lastnameSELECTp1.id,p1.firstname,p1.lastname,p2.id,p2.firstname,p2.lastname,c.contact_type,c.contact_valueFROMpersonp1JOINpersonp2ONp1.lastname=p2.lastnameJOINcontactcONp2.id=c.person_id-- contact info from person2. p1.id would lead to person1WHEREp1.id!=p2.idORDERBYp1.lastname,p1.firstname,p2.lastname;
Earlier on this page, we saw an example of a join result wherein some rows contained person names, but no contact information – instead showing a question mark in that latter column. If the basic syntax of the JOIN operation had been used, those (question-mark) rows would have been filtered out. That (basic syntax with exclusive result) is known as an INNER join. There are also three different kinds of OUTER joins. The results of an OUTER join will contain not only all the full-data rows that an INNER join's results would, but also partial-data rows, i.e., those where no data was found in one or both of the two stored tables; thus, they're called LEFT OUTER, RIGHT OUTER and FULL OUTER joins.
So we can widen the basic JOIN syntax to the four options:
Keywords surrounded by [ ] are optional. The parser infers OUTER from LEFT, RIGHT or FULL, and a plain (i.e., basic-syntax) JOIN defaults to INNER.
The inner join is probably the most commonly used of the four types. As we have seen, it results in precisely those rows that exactly match the criterion following the ON. Below is an example showing how to create a list of persons and their contacts.
-- A list of persons and their contactsSELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpJOINcontactcONp.id=c.person_id-- identical meaning: INNER JOIN ...ORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;
What is most significant is that records for persons without any contact information arenot part of the result.
Sometimes we need a little more; for example, we might want a list of all person records, to include any contact-information records that may also be available for that person. Note how this differs from the example above: this time, the results will containall person records, even those for persons who haveno contact-information record(s).
-- A list of ALL persons plus their contactsSELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpLEFTJOINcontactcONp.id=c.person_id-- identical meaning: LEFT OUTER JOIN ...ORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;
In those cases where the contact information is unavailable, the DBMS will supplant it with the 'null value' or with the 'null special marker' (not to be confused with thestring (-type) 'null value' or 'null' nor with binary 0. Nonetheless, implementation details aren't important here. The null special marker will be discussed in alater chapter).
In summary, the left (outer) join is an inner join, plus one row for each left-side match without a counterpart on the right side.
Consider the word 'left'. It refers to the left side of the formula, "FROM <table_1> LEFT JOIN <table_2>", or more specifically, the table denoted on the left side (here:table_1); indicating that every row of that table will be represented at least once in the result, whether a corresponding record is found in the right-side table (here:table_2) or not.
Another example:
SELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMcontactcLEFTJOINpersonpONp.id=c.person_id-- identical meaning: LEFT OUTER JOIN ...ORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;
What's the difference? We've changed the order of the table names. Note that we're still using a LEFT join, but becausecontact is now the "left" referent (the object in the FROM clause),contact data will now be considered as being of primary importance; therefore, all the contact rows will appear in the result - along with any corresponding information that may exist in the person table. As it happens, in the database we're using, every contact record corresponds to a person record so, in this case, it works out that the results are equivalent to what they'd have been if we'd used an inner join. Yet they're different from those of the previous left-join example.
The right join obeys the same rules as the left join, but in reverse. Now, every record from the table referenced in the join clause will appear in the result, including those that have no corresponding record in the other table. Again, the DBMS supplies each empty right-column cell with the null special marker. The only difference is that the evaluation sequence of tables is carried out in reverse or, in other words, with the roles of the two tables swapped.
-- A list of ALL contact records with any corresponding person data, even if sSELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpRIGHTJOINcontactcONp.id=c.person_id-- same as RIGHT OUTER JOIN ...ORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;
A full join retrieves every row of both the left table and the right table, regardless of whether a corresponding record exists in the respective opposite table.
SELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpFULLJOINcontactcONp.id=c.person_id-- identical meaning: FULL OUTER JOIN ...ORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;
Giventable_1 andtable_2 below,
table_1
| table_2
|
the full join:
SELECT*FROMtable_1t1FULLJOINtable_2t2ONt1.id=t2.table_1_id;
will yield:
| T1.ID | T1.X | T2.ID | T2.TABLE_1_ID | T2.Y |
|---|---|---|---|---|
| 1 | 11 | 1 | 1 | 21 |
| 2 | 12 | null | null | null |
| 3 | 13 | null | null | null |
| null | null | 2 | 5 | 22 |
These results contain the (single) matching row, plus a row each for all the other records of both of the original tables. As each of these other rows represent data found in only one of the tables, they are each missing some data, so the cells representative of that missing data contain the null special marker.
Note: The full join is not supported by all DBMS. Nevertheless, because it isn't an atomic operation, it is always possible to create the desired result by a combination of multiple SELECTs with SET operations.
With inner joins it is possible to omit the ON. SQL interprets this as a - syntactically correct - request to combine every record of the left table with every record of the right table. It will return a large number of rows: the product of the row counts of the two tables.
This particular kind of an inner join is called aCartesian product or CROSS JOIN. The Cartesian product is an elementary operation ofrelational algebra, which is the foundation for all rDBMS implementations.
-- all persons combined with all contacts (some implementations replace the-- keyword 'JOIN' with a comma)SELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpJOINcontactc-- missing ON keyword: p X c will be createdORDERBYp.lastname,p.firstname,c.contact_typeDESC,c.contact_value;-- count the resulting rowsSELECTcount(*)FROMpersonpJOINcontactc;
Be careful then; if you unintentionally omit the ON term, the result will be much larger than expected. If, for example, the first table contains 10,000 records, and the second one 20,000 records, the output will contain 200 million rows.
How can we create a list of persons and their hobbies? Remember: one person may run many hobbies and several persons may run the same hobby. So there is no direct connection from persons to hobbies. Between the two tables, we have created a third oneperson_hobby. It holds the id of persons as well as the id of hobbies.
We have to 'walk' fromperson toperson_hobby and from there tohobby.
-- persons combined with their hobbiesSELECTp.idp_id,p.firstname,p.lastname,h.hobbyname,h.idh_idFROMpersonpJOINperson_hobbyphONp.id=ph.person_idJOINhobbyhONph.hobby_id=h.idORDERBYp.lastname,p.firstname,h.hobbyname;
Please note that no column of the tableperson_hobby goes to the result. This table acts only during intermediate execution steps. Even its columnid is not of interest.
Some people do not perform a hobby. As we performed an INNER JOIN they are not part of the above list. If we want to see in the list also persons without hobbies, we must do what we have done before: use LEFT OUTER JOINs instead of INNER JOINs.
-- ALL persons plus their hobbies (if present)SELECTp.idp_id,p.firstname,p.lastname,h.hobbyname,h.idh_idFROMpersonpLEFTJOINperson_hobbyphONp.id=ph.person_idLEFTJOINhobbyhONph.hobby_id=h.idORDERBYp.lastname,p.firstname,h.hobbyname;
Hint: If necessary, we can combine every kind of join with every other kind of join in every desired sequence, e.g.: LEFT OUTER with FULL OUTER with INNER ... .
Criteria for join operations are not restricted to the usual formulation:
SELECT...FROMtable_1t1JOINtable_2t2ONt1.id=t2.fk...
First, we can useany column, not only primary key and foreign key columns. In one of the above examples, we used the lastname for a join. Lastname is of type character and has no meaning of any key. To avoid poor performance, some DBMS restrict the use of columns to those having an index.
Second, the comparator is not restricted to theequal sign. We can use any meaningful operator, for example, the 'greater than' for numeric values.
-- Which person has the greater body weight - restricted to 'de Winter' for claritySELECTp1.id,p1.firstnameas"is heavier",p1.weight,p2.id,p2.firstnameas"than",p2.weightFROMpersonp1JOINpersonp2ONp1.weight>p2.weightWHEREp1.lastname='de Winter'ANDp2.lastname='de Winter'ORDERBYp1.weightdesc,p2.weightdesc;
Third, we can use anarbitrary function.
-- short lastnames vs. long lastnamesSELECTp1.firstname,p1.lastnameas"shorter lastname",p2.firstname,p2.lastnameFROMpersonp1JOINpersonp2ONLENGTH(p1.lastname)<LENGTH(p2.lastname)-- likewise ORDER BY can use functionsORDERBYlength(p1.lastname),length(p2.lastname);
Show first- and lastname plus icq number for persons having an icq number
SELECTp.id,p.firstname,p.lastname,c.contact_valueFROMpersonpJOINcontactcONp.id=c.person_idWHEREc.contact_type='icq';
Show first- and lastname plus ICQ number plus fixed line number for persons having an ICQ number AND a fixed line. You need to join thecontact table twice.
SELECTp.id,p.firstname,p.lastname,c1.contact_valueasicq,c2.contact_valueas"fixed line"-- looks like previous, but is differentFROMpersonpJOINcontactc1ONp.id=c1.person_idJOINcontactc2ONp.id=c2.person_id-- it's a second (virtual) incarnation of contact tableWHEREc1.contact_type='icq'-- from first incarnationANDc2.contact_type='fixed line';-- from second incarnation-- In this example of an INNER JOIN we can convert the WHERE part to an additional JOIN criterion.-- This may clarify the intention of the command. But be careful: This shifting in combination with-- one of the OUTER JOINs may lead to different results.SELECTp.id,p.firstname,p.lastname,c1.contact_valueasicq,c2.contact_valueas"fixed line"FROMpersonpJOINcontactc1ONp.id=c1.person_idANDc1.contact_type='icq'JOINcontactc2ONp.id=c2.person_idANDc2.contact_type='fixed line';
Show first- and lastname plus (if present) the ICQ number for ALL persons
-- To retrieve ALL persons, it's necessary to use a LEFT join.-- But the first approach is not what we expect! In this example, the LEFT JOIN is evaluated first-- and creates an intermediate table with null-values in contact_type (eliminate the-- WHERE clause to see this intermediate result). These rows and all other except the-- one with 'ICQ' are then thrown away by evaluating the WHERE clause.SELECTp.id,p.firstname,p.lastname,c.contact_valueFROMpersonpLEFTJOINcontactcONp.id=c.person_idWHEREc.contact_type='icq';-- It's necessary to formulate the search criterion as part of the JOIN. Unlike with-- the INNER JOIN in the previous example with (LEFT/FULL/RIGHT) OUTER JOINs it is not possible-- to shift it to the WHERE clause.SELECTp.id,p.firstname,p.lastname,c.contact_valueFROMpersonpLEFTJOINcontactcONp.id=c.person_idANDc.contact_type='icq';
Create a list which contains ALL hobbies plus according persons (if present)
SELECTp.idp_id,p.firstname,p.lastname,h.hobbyname,h.idh_idFROMpersonpRIGHTJOINperson_hobbyphONp.id=ph.person_idRIGHTJOINhobbyhONph.hobby_id=h.idORDERBYh.hobbyname,p.lastname,p.firstname;
Is it possible that one of the three outer joins contains fewer rows than the corresponding inner join?
No.
All four join types contain the same rows with column-matching-values. In addition
outer joins contain rows where column values do not match - if such a situation exists.
In this chapter we will leave the level of individual rows and describe statements that refer togroups of rows. In the context of SQL such 'row-groups' (or sets of rows) are built by the GROUP BY clause and further processed by the HAVING clause.
First we must establish criteria for filtering rows by groups. To do so we use the content of one or more columns of the involved table(s). If the values are identical, the rows belong to the same group. Consider thelastname in tableperson. In our small example we can insinuate that persons with same lastname form a family. So to see information about families we should use this column as the grouping criterion. This grouping allows us to ask questions concerning whole families, such as 'Which families are there?', 'How many families exists?', 'How many persons are in each family?'. All of these are questions about the whole group (which means the family), not about single rows (which means the person).
In the SQL syntax, the criterion is specified after the keyword GROUP BY and consists of one or more column names.
SELECT...-- as usualFROM...-- as usual (optionally plus JOINs)GROUPBY<column-name>-- optionally more column names...-- optionally other elements of SELECT command;
Our concrete example about families looks like this:
SELECTlastnameFROMpersonGROUPBYlastname;
The query retrieves seven 'family names' out of the ten rows. There are several persons with lastname 'Goldstein' or 'de Winter'.
We can retrieve the same seven 'family names' by applying the keyword DISTINCT in a SELECT without GROUP BY.
SELECTDISTINCTlastnameFROMperson;-- no GROUP BY clause
What makes the difference? The DISTINCT keyword is limited to remove duplicate values. It can not initiate computations on other rows and columns of the result set. In contrast, the GROUP BY additionally arranges the intermediate received rows as several groups and offers the possibility to get information about each of these groups. It is even the case that within these groupsall columns are available, not only the 'criterion'-column. To confirm this statement about 'all' columns, we useweight, which is not the 'criterion'-column.
SELECTlastname,avg(weight)-- avg() is a function to compute the arithmetic mean of numerical valuesFROMpersonGROUPBYlastname;
The result shows the seven family names - as seen before - plus the average weight of every family. The weight of individual persons is not shown. (In groups with precisely one person the average weight of the group is, of course, identical to the single person's weight.)
If necessary, we can define the grouping over more than one column. In this case, we can imagine the concatenation of the columns as the grouping rule.
-- Group over one column: place_of_birth leads to 6 resulting rowsSELECTplace_of_birth,count(*)FROMpersonGROUPBYplace_of_birth;-- Group over two columns: place_of_birth plus lastname leads to 8 resulting rows with Richland and SF shown twiceSELECTplace_of_birth,lastname,count(*)FROMpersonGROUPBYplace_of_birth,lastname;
After we have defined groups with the GROUP BY keyword, we can select more information about each of them, e.g.: how many persons (rows) exist within each family (group of rows)?
SELECTlastname,count(*)-- count() is a function which counts values or rowsFROMpersonGROUPBYlastname;
We see that in our small example database there is one family with three members, another with two members and all others consist of exactly 1 member.
What is going on behind the scene during the execution of the command?
In step 4exactly one resulting row is generated per group. Because the SELECT creates only one resulting row per group, it is not possible to show values of such columns which may differ from row to row, e.g., the firstname. The SELECT can only show such values of which it is ensured that they are identical within all rows of the group: the 'criterion'-column.
-- It is not possible to show the 'firstname' of a group! 'firstname' is an attribute of single person.-- Within a group 'firstname' varies from row to row.-- The DBMS should recognize this problem and should issue an error message.SELECTlastname,firstnameFROMpersonGROUPBYlastname;-- A hint to users of MySQL:-- To receive correct results (the error message) you must deactivate a special performance feature by issuing the command-- set sql_mode = 'ONLY_FULL_GROUP_BY'; or set it in the workbench or in the ini-file.
Nevertheless, we can get information about the non-criterion-columns. But this information is more generalized. The DBMS offers a special group of functions which build one value out of a set of rows. Consider the avg() function, which computes the arithmetic mean of numerical values. This function receives a column name and operates on a set of rows. If our command in question contains a GROUP BY clause, the avg() function does compute one value per group - not one value per all rows as usual. So it is possible to show the result of such functions together with the values of the 'criterion'-column.
Here is an - incomplete - list of suchfunctions: count(), max(), min(), sum(), avg(). Not all functions are of that kind, e.g. the function concat(), which concatenates two strings, operates on single rows and creates one value per row.
-- compute avg() by your own formulaSELECTlastname,sum(weight)/count(weight)as"Mean weight 1",avg(weight)as"Mean weight 2"FROMpersonGROUPBYlastname;
You know the WHERE clause. It defines which rows of a table will be part of the result set. The HAVING clause has the same meaning at the group-level. It determines which groups will be part of the result set.
-- The HAVING complements the GROUP BYSELECT...FROM...GROUPBY<columnname>HAVING<havingclause>;-- specify a criterion which can be applied to groups
We retrieve exclusively families with more than 1 members:
SELECTlastnameFROMpersonGROUPBYlastname-- grouping over lastnameHAVINGcount(*)>1;-- more than one person within the group
All families with one member are no longer part of the result.
In a second example we focus on such groups which satisfies a criterion on columnfirstname. Consider thatfirstname is not the grouping-column.
-- Groups containing a person whose firstname has more than 4 characters: 5 resulting rowsSELECTlastnameFROMpersonGROUPBYlastnameHAVINGmax(length(firstname))>4;-- max() returns ONE value (the highest one) for all rows of each 'lastname'-group
The result shows the five families Baker, de Winter, Goldstein, Rich, and Stefanos (but not the row(s) with the longfirstname).
Please note that this result is very different from the similar question to persons whose firstname has more than four characters:
-- Persons whose firstname has more than four characters: six resulting rows!!SELECTlastname,firstnameFROMpersonWHERElength(firstname)>4;-- no GROUP BY and no HAVING. The WHERE isn't an equivalent replacement for the HAVING!!
Where is the additional row coming from? In the family de Winter, there are two persons with a firstname longer than four characters: James and Victor. Because in the command without GROUP BY we select for persons and not for families, both rows are displayed individually.
In summary, we can say that the HAVING clause decides which groups are part of the result set and which are not.
The GROUP BY and HAVING clauses are part of the SELECT command, and we can combine them with any other clauses of the SELECT as desired. Only the order of the clauses is obligatory.
-- This is the obligatory order of clausesSELECT...FROM...WHERE...GROUPBY...HAVING...ORDERBY...;
As mentioned, the WHERE clause works on the row-level, whereas the HAVING clause works on the group-level. First, the WHERE is evaluated, next the GROUP BY, next the HAVING, next the ORDER BY, and at the end the SELECT. Every step is based on the results of the previous one.
Finally we offer two additional examples:
-- Are there persons born on the same day?SELECTdate_of_birth-- In a later chapter, you will learn how to select the name of these persons.FROMpersonGROUPBYdate_of_birthHAVINGcount(date_of_birth)>1-- more than one on the same day?ORDERBYdate_of_birth;-- Families with long first- and lastname. Comment out some lines to see differences to the original query.SELECTlastname,count(*)ascntFROMpersonWHERElength(firstname)>4GROUPBYlastnameHAVINGlength(lastname)>4ORDERBYcntdesc,lastname;
Are there persons born on the same day in the same city? Hint: group over both criteria
SELECTdate_of_birth,place_of_birthFROMpersonGROUPBYdate_of_birth,place_of_birthHAVINGcount(*)>1;
Categorise persons according to the formula: 'round (weight / 10)': 10 to 19 kg --> 1, 20 to 29 kg --> 2, ...
How much persons exist in each category?
SELECTround(weight/10),count(*)FROMpersonGROUPBYround(weight/10)-- ORDER BY round (weight / 10) -- order by categoryORDERBYcount(*)-- order by frequency;
Which contact type is used in which frequency in table contact?
SELECTcontact_type,count(*)FROMcontactGROUPBYcontact_type-- ORDER BY contact_type -- order by contact_typeORDERBYcount(*)-- order by frequency;
Restrict previous result to contact types which occurs more than once.
SELECTcontact_type,count(*)FROMcontactGROUPBYcontact_typeHAVINGcount(*)>1-- order by contact_type -- order by contact_typeORDERBYcount(*)-- order by frequency;
Are there persons performing more than 2 hobbies? Hint: check table person_hobby.
SELECTperson_id,count(*)FROMperson_hobbyGROUPBYperson_idHAVINGcount(*)>2;
Are there persons performing only one hobby?
SELECTperson_id,count(*)FROMperson_hobbyGROUPBYperson_idHAVINGcount(*)=1;
Are there persons performing no hobby?
There are persons, who do not perform a hobby. But the nearby formulation 'count(*) = 0'
will not lead to the expected result because for such persons there are no rows
in table person_hobby, so the DBMS cannot create any group and hence cannot display anything.
Looking for something that does NOT exist is often more difficult than looking for the
existence of something. In such cases, you usually have to use one of: NOT EXISTS, NOT IN,
a combination of OUTER JOIN and IS NULL, a combination of OUTER JOIN and MINUS together
with INNER JOIN.
When creating new rows, it may occur that we don't know the value of one or more columns.
Let's assume that we want to store information about banking accounts, and for one of those accounts we don't know the balance. What can we do? There are several possibilities:
Relational DBMS uses the last mentioned technique, and the sense of the flag is 'there is no value stored'. Sometimes people say 'TheNULL value is stored' or 'TheNULL special marker is stored'.
Assume there is a table for banking accounts, and some of its rows hold the NULL special marker in the columnbalance. Do those rows fulfill at least one of the two WHERE conditions 'balance >= 0' or 'balance <= 0'? No. It is not possible to decide whether these conditions are true or false! Honestly, we must admit that we don't know an answer in our usual true/false logic because we don't know a value forbalance. We are forced to extend the range of boolean values with a third one, which we callunknown. The two conditions above evaluate neither true nor false; both evaluate to 'unknown' for rows wherebalance holds the NULL special marker.
In a later stage, we need definitions for the boolean operators NOT, AND, OR, and EQUAL when true/false interact with unknown. You find the definitionshere.
Within every SELECT command, such rows become part of the resulting rows, in which the WHERE condition evaluates to true. If it evaluates to false or unknown, the row will be rejected. Asall WHERE conditions like the above 'balance >= 0' - and also their negation - evaluates to unknown for missingbalance values, there is preliminary no way to retrieve them.
To overcome this lack, SQL contains the particular phrase 'IS NULL'. The wording 'balance IS NULL' evaluates to true for precisely the rows with a missing value inbalance.
SELECT...FROM...WHERE<columnname>ISNULL...;
We must use exactly this wording. The use of any arithmetic operator like >, <=, !=, ... will not retrieve rows with the NULL special marker. The same holds true even for the condition '(balance = 0) OR NOT (balance = 0)', which is atautology in conventional true/false logic. Besides this IS NULL predicate, there is no other way to retrieve the NULL special marker - without one simple but not helpful exception: if you omit the WHERE condition, all rows of the table are retrieved, with and without NULL special marker in any column.
That's all! Dealing with NULL special marker and the 3-value-logic might sound strange if you first met this topic. But as the IS NULL predicate always evaluates to true or false, everything works as usual afterward. We can use all other elements of the SELECT command (boolean logic, join, having, order by, ...) in the same way we have done so far.
Our test database does not contain the NULL special marker. Nevertheless, we have met the situation during the explanation of OUTER joins. OUTER joins create resulting rows where some columns contain the NULL special marker. We must consider this possibility if we deal with the results of such subselects.
There are two other ways to generate the NULL special marker.
To demonstrate this and to create some examples for the following exercises, we put one row into theperson table with some columns left empty.
-- Insert a new row for testing purposeINSERTINTOperson(id,firstname,lastname)VALUES(51,'Half man','Uncomplete');COMMIT;-- Retrieve the row. As defined in CREATE TABLE statement the weight has a default value of integer 0.-- Date_of_birth and place_of_birth contain the NULL special marker.SELECT*FROMpersonWHEREid=51;-- use the IS NULL predicate within WHERE clause. The result contains 1 row.SELECT*FROMpersonWHEREssnISNULL;-- weight has a value!! We expect to retrieve no rows when we use the IS NULL predicate.SELECT*FROMpersonWHEREweightISNULL;-- or, to say it the other way round, the number of rows is 0SELECTcount(*)FROMpersonWHEREweightISNULL;-- but in the next statement the number of rows is 1SELECTcount(*)FROMpersonWHEREweight=0;-- Negate the IS NULL predicateSELECTcount(*)FROMpersonWHEREssnISNULL;-- IS NULLSELECTcount(*)FROMpersonWHEREssnISNOTNULL;-- Negation of IS NULLSELECTcount(*)FROMpersonWHEREssnISNULLORssnISNOTNULL;-- A tautology, which always retrieves ALL rows of a table-- Same as aboveSELECTcount(*)FROMpersonWHEREssnISNULLORNOTssnISNULL;-- A tautology, which always retrieves ALL rows of a table
Next we show the use of the UPDATE command in combination with the key word NULL
---- Insert a new row for testing purpose with all columns filled with a useful valueINSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(52,'Lyn','Mutable',DATE'1951-05-13','Anchorage','078-05-1152',69);COMMIT;SELECT*FROMpersonWHEREid=52;-- Delete a single column value (not the complete row)UPDATEpersonSETssn=nullWHEREid=52;COMMIT;SELECT*FROMpersonWHEREid=52;-- one rowSELECT*FROMpersonWHEREssnISNULL;-- two rows: 51 + 52
Restore the original state of the example database.
DELETEFROMpersonWHEREid>50;COMMIT;
In the context of the NULL special marker it is often the case that we have to retrieve rows with no value (the NULL special marker)or a default value such as 0 or blank. In such cases, the WHERE condition looks something like this: "... WHERE (col IS NULL OR col = 0) ...". To keep source code simpler, the SQL standard defines a functioncoalesce(<expression_1>, <expression_2>). If the first argument, which normally is the name of a column, is not NULL, the function evaluates to this argument - else to the second argument.
Example:
-- Retrieve rows without ssn or with ssn equal to blank.SELECT*FROMpersonWHEREcoalesce(ssn,' ')=' ';-- equivalent:-- WHERE (ssn IS NULL-- OR ssn = ' ');
The function namecoalesce results from the fact that the function accepts an arbitrary number of parameters and recursively evaluates them. If parametern results in a real value, it evaluates to this parameter, else the function calls itself without the n-th parameter. coalesce(expression_1, expression_2, expression_3) evaluates to expression_1, if expression_1 is not NULL, else to expression_2, if expression_2 is not NULL, else to expression_3.
The SQL standard defines another functionnullif(<expression_1>, <expression_2>). It evaluates to NULL, if the two expressions are equal - and it evaluates to the first expression, if they differ from each other.
Different vendors offers some more functions likeisnull(),ifnull() ornvl() to support handling of NULL values. The meanings of these functions are vendor specific.
Insert a new hobby 'Snowshoeing' without a remark.
INSERTINTOhobby(id,hobbyname,remark)VALUES(10,'Snowshoeing',null);COMMIT;
Find a second solution for the above question without using the key word 'null'. (First delete row 10.)
DELETEFROMhobbyWHEREid=10;INSERTINTOhobby(id,hobbyname)VALUES(10,'Snowshoeing');COMMIT;
Retrieve all hobbies without a remark.
-- 1 rowSELECT*FROMhobbyWHEREremarkISNULL;
How many hobbies are exemplified with a remark?
-- 9 rowsSELECTcount(*)FROMhobbyWHEREremarkISNOTNULL;
Change row 10 of hobby so that hobbyname contains the string 'NULL' and remark contains 'Name of hobby not known'.
-- Consider the two apostrophes surrounding the string 'NULL', which consists of the 4 characters N, U, L and L !!UPDATEhobbySEThobbyname='NULL',remark='Name of hobby not known'WHEREid=10;COMMIT;
a) Retrieve the row where hobbyname is 'NULL'.
b) Retrieve the row where remark is 'Name of hobby not known'.
-- This may be a pitfall question. There is no relation to the IS NULL predicateSELECT*FROMhobbyWHEREhobbyname='NULL';SELECT*FROMhobbyWHEREremark='Name of hobby not known';
How many hobbies have a hobby name?
-- All 10 rows contains a hobby name, even the row with the hobbyname 'NULL'SELECTcount(*)FROMhobbyWHEREhobbynameISNOTNULL;
There are two groups of predefined functions:
They work on a set of rows and return one single value like the number of rows, the highest or lowest value, the standard deviation, etc. The most important aggregate functions are:
| Signatur | Semantic |
|---|---|
COUNT(*) | The number of rows |
COUNT(<column name>) | The number of rows where <column name> contains a value (IS NOT NULL). The elimination of rows with the NULL special marker in the considered column applies to all aggregate functions. |
MIN(<column name>) | Lowest value. In the case of strings according to the sequence of characters. |
MAX(<column name>) | Highest value. In the case of strings according to the sequence of characters. |
SUM(<column name>) | Sum of all values |
AVG(<column name>) | Arithmetic mean |
As an example we retrieve the maximum weight of all persons:
SELECTMAX(weight)FROMperson;
A Word of Caution
Aggregate functions result in one value for a set of rows. Therefore it is not possible to use them together with 'normal' columns in the projection (the part behind SELECT keyword). If we specify, for example,
SELECTlastname,SUM(weight)FROMperson;
we try to instruct the DBMS to show alot of rows containing thelastname simultaneously withone value. This is a contradiction and the system will throw an exception. We can use a lot of aggregate functions within one projection but we are not allowed to use them together with 'normal' columns.
-- Multiple aggregate functions. No 'normal' columns.SELECTSUM(weight)/COUNT(weight)asaverage_1,AVG(weight)asaverage_2FROMperson;
Grouping
If we use aggregate functions in the context of commands containing a GROUP BY, the aggregate functions are called once per group.
-- Not only one resulting row, but one resulting row per lastname together with the average weight of all rows with this lastname.SELECTAVG(weight)FROMpersonGROUPBYlastname;
In such cases the GROUP BY column(s) may be displayed as it is impossible that they change within the group.
-- The lastname may be shown as it is the GROUP BY criteriaSELECTlastname,AVG(weight)FROMpersonGROUPBYlastname;
If a row contains no value (it holds the NULL special marker) in the named column, the row is not part of the computation.
-- If ssn is NULL, this row will not count.SELECTCOUNT(ssn)FROMperson;
The complete signatures of the functions are a little more detailed. We can prepend the column name with one of the two key words ALL or DISTINCT. If we specify ALL, which is the default, every value is part of the computation, else only those, which are distinct from each other.
function_name ([ALL|DISTINCT]<column name>)
COUNT (DISTINCT weight) -- as an example
The standard defines some more aggregate functions to compute statistical measures. Also the keywords ANY, EVERY and SOME formally are defined as aggregate functions. We will discuss them on a separate page.
Scalar functions act on a 'per row basis'. They are called once per row and they return one value per call. Often they are grouped according to the data types they act on:
There is another wikibook where those functions are shownin detail. The data type of the return value is not always identical to the type of the input, e.g. 'character_length()' receives a string and returns a number.
Here is an example with some scalar functions:
SELECTLOWER(firstname),UPPER(lastname),CONCAT('today is: ',CURRENT_DATE)FROMperson;
What is the hightest id used so far in the hobby table?
SELECTmax(id)FROMhobby;
Which lastname will occur first in an ordered list?
SELECTmin(lastname)FROMperson;
Are there aggregate functions where it makes no difference to use the ALL or the DISTINCT key word?
Yes. min(ALL <column name>) leads to the same result as min(DISTINCT <column name>) as
it makes no difference whether the smallest value occurs one or more times. The same is true for max().
Show persons with a short firstname (up to 4 characters).
-- We can use functions as part of the WHERE clause.SELECT*FROMpersonWHEREcharacter_length(firstname)<=4;-- Hint: Some implementations use a different function name: length() or len().
Show firstname, lastname and the number of characters for the concatenated string. Find two different solutions. You may use the character_length() function to compute the length of strings and the concat() function to concatenate strings.
-- Addition of the computed length. Hint: Some implementations use a different function name: length() or len().SELECTfirstname,lastname,character_length(firstname)+character_length(lastname)FROMperson;-- length of the concatenated stringSELECTfirstname,lastname,character_length(concat(firstname,lastname))FROMperson;-- show both solutions togetherSELECTfirstname,lastname,character_length(firstname)+character_length(lastname)asL1,character_length(concat(firstname,lastname))asL2FROMperson;
Tables, views, and results of SELECT commands are in somewhat similar to sets of set theory. In this comparison the elements of sets correspond to rows of tables, views, and SELECT results. The differences between set theory and the itemized SQL constructs are:
The comparison between set theory and SQL goes even further. In SQL, we have operations which act onmultisets in the sense of set theory: The SQL operations UNION, INTERSECT, and EXCEPT (some name it MINUS) process intermediatemultisets generated by differents SELECT commands. The operations expect themultisets are of the same type. This means mainly that theymust have the same number of columns. Also, their data type should correlate, but this is not mandatory. If they do not, the DBMS will cast them to a common data type - if possible.
The UNION operation pushes the result of several SELECT commands together. The result of the UNION contains those values, which are in the first or the second intermediate result.
-- Please consider that this is only one command (only ONE semicolon at the very end)SELECTfirstname-- first SELECT commandFROMpersonUNION-- push both intermediate results together to one resultSELECTlastname-- second SELECT commandFROMperson;

This is a single SQL command. It consists of two SELECTs and one UNION operation. The SELECTs are evaluated first. Afterward, their results are pushed together to one single result. In our example, the result contains all lastnames and firstnames in a single column (our example may be of limited help in praxis, it's only a demonstration for the UNION).
DISTINCT / ALL
If we examine the result carefully, we will notice that it consists only of 17 values. The tableperson contains ten rows so that we probably expect twenty values in the result. If we perform the 'SELECT firstname ...' and 'SELECT lastname ...' as separate commands without the UNION, we receive for both commands ten values. The explanation for the three missing values is the UNION command. By default, UNION removes duplicates. Therefore some of the intermediate values are skipped. If we want to obtain the duplicate values we have to modify the UNION operation. Its behavior can be changed with one of the two keywords DISTINCT or ALL. DISTINCT is the default, and it removes duplicate values as we have seen before. ALL will retain all values, including duplicates.
-- remove (that's the default) or keep duplicatesSELECT...UNION[DISTINCT|ALL]SELECT...[-- it is possible to 'UNION' more than 2 intermediate resultsUNION[DISTINCT|ALL]SELECT...];
A hint for Oracle users: The use of the keyword DISTINCT, which is the default, is not accepted by Oracle. Omit it.
General hint
In most cases, the UNION combines SELECT commands on different tables or different columns of the same table. SELECT commands on the same column of a single table usually use the WHERE clause in combination with boolean logic.
-- A very unusual example. People apply such queries on the same table only in combination with very complex WHERE conditions.-- This example would normally be expressed with a phrasing similar to: WHERE lastname IN ('de Winter', 'Goldstein');SELECT*FROMpersonWHERElastname='de Winter'UNIONALLSELECT*FROMpersonWHERElastname='Goldstein';

The INTERSECT operation evaluates to those values, which are in both intermediate results, in the first as well as in the second.
-- As in our example database, there is no example for the INTERSECT we insert a new person.-- This person has the same last name 'Victor' as the first name of another person.INSERTINTOpersonVALUES(21,'Paul','Victor',DATE'1966-04-02','Washington','078-05-1121',66);COMMIT;-- All firstnames which are used as lastname.SELECTfirstname-- first SELECT commandFROMpersonINTERSECT-- looking for common valuesSELECTlastname-- second SELECT commandFROMperson;
A hint to MySQL users: MySQL (5.5) does not support INTERSECT operation. But as it is not an elementary operation, there are workarounds.

The EXCEPT operation evaluates to those values, which are in the first intermediate result but not in the second.
-- All firstname except for 'Victor', because there is a lastname with this value.SELECTfirstname-- first SELECT commandFROMpersonEXCEPT-- are there values in the result of the first SELECT but not of second?SELECTlastname-- second SELECT commandFROMperson;
A hint to MySQL users: MySQL (5.5) does not support the EXCEPT operation. But as it is not an elementary operation, there are workarounds.
A hint to Oracle users: Oracle uses the keyword MINUS instead of EXCEPT.
-- Clean up the example databaseDELETEFROMpersonWHEREid>10;COMMIT;
We can combine set operations with all other elements of SELECT command, in particular with ORDER BY and GROUP BY. But this may lead to some uncertainties. Therefore, we would like to explain some of the details below.
SELECTfirstname-- first SELECT commandFROMpersonUNION-- push both intermediate results together to one resultSELECTlastname-- second SELECT commandFROMpersonORDERBYfirstname;
To which part of the command belongs the ORDER BY? To the first SELECT, the second SELECT, or the result of the UNION? The SQL rules determine that set operations are evaluated before ORDER BY clauses (as always parenthesis can change the order of evaluation). Therefore the ORDER BY sorts the final result and not any of the intermediate results.
We rearrange the example in the hope that things get clear.
-- Equivalent semanticSELECT*FROM(SELECTfirstname-- first SELECT commandFROMpersonUNION-- push both intermediate (unnamed) results together to the next intermediate result 't'SELECTlastname-- second SELECT commandFROMperson)t-- 't' is the name for the intermediate result generated by UNIONORDERBYt.firstname;
First, the two SELECTS are evaluated, afterward the UNION. This intermediate result gets the name 't'. 't' is ordered.
Often one would like the rows from the first SELECT to be ordered independent from the rows of the second SELECT. We can do this by adding a virtual column to the result of each SELECT statement and using the virtual columns in the ORDER BY.
SELECT'1'asdummy,firstnameFROMpersonUNIONSELECT'2',lastnameFROMpersonORDERBYdummy,firstname;
With the GROUP BY clause, things are a little more complicated than with ORDER BY. The GROUP BY refers to the last SELECT or - to say it the other way round - to the SELECT of its direct level.
-- Will not work because the GROUP BY belongs to the second SELECT and not to the UNION!SELECTfirstnameFROMpersonUNIONSELECTlastnameFROMpersonGROUPBYfirstname;---- Works, but possibly not what you want to do.-- The alias name for the (only) column of the UNION is 'firstname'.SELECTfirstnameFROMpersonUNION-- We group over the (only) column of the second SELECT, which is 'lastname' and results in 7 valuesSELECTlastnameFROMpersonGROUPBYlastname;---- Make things clear: rearrange the query to group over the final resultSELECT*FROM(SELECTfirstname-- columnnames of the first SELECT determins the columnnames of the UNIONFROMpersonUNIONSELECTlastnameFROMperson)tGROUPBYt.firstname;-- now we can group over the complete result
Show the lowest, highest and mean weight as a) 3 values of 1 row and b) 1 value in 3 rows.
-- 1 rowSELECTmin(weight),max(weight),avg(weight)FROMperson;-- 3 rowsSELECTmin(weight)FROMpersonUNIONSELECTmax(weight)FROMpersonUNIONSELECTavg(weight)FROMperson;
Extend the previous 3-rows-solution to meet two additional criteria: a) consider only persons born in San Francisco and
b) add a virtual column to show 'Min', 'Max' and 'Avg' according to the correlating numeric values.
SELECT'Min',min(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT'Max',max(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT'Avg',avg(weight)FROMpersonWHEREplace_of_birth='San Francisco';
Extend the previous solution to order the result: the minimum value first, followed by the average and then the highest value.
-- 'ugly' solutionSELECT'1 Min'ASnote,min(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT'3 Max'ASnote,max(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT'2 Avg'ASnote,avg(weight)FROMpersonWHEREplace_of_birth='San Francisco'ORDERBYnote;-- 'clean' solutionSELECT1ASnote,'Min',min(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT3ASnote,'Max',max(weight)FROMpersonWHEREplace_of_birth='San Francisco'UNIONSELECT2ASnote,'Avg',avg(weight)FROMpersonWHEREplace_of_birth='San Francisco'ORDERBYnote;
Create a list of lastnames for persons with a weight greater than 70 kg together with
all e-mail values (one value per row). There is no concordance between lastnames and e-mails.
(This example is not very helpfull for praxis, but instructive.)
SELECTlastnameFROMpersonWHEREweight>70UNIONSELECTcontact_valueFROMcontactWHEREcontact_type='email';
In the previous example the lastname 'de Winter' is shown only once. But there are more than one persons of the family with a weight greater than 70 kg.
Why?
Extend the previous solution to show as much resulting rows as hits to the criteria.
-- Extend 'UNION' to 'UNION ALL'. The default is 'UNION DISTINCT'SELECTlastnameFROMpersonWHEREweight>70UNIONALLSELECTcontact_valueFROMcontactWHEREcontact_type='email';
Sometimes it's necessary to translate stored values (or values to be stored) from one representation to another. Suppose there is a columnstatus with legal values from 0 to 9, but the end-users should receive strings that explain the meaning of the numeric values in short, eg.: 'ordered', 'delivered', 'back delivery', 'out of stock', ... . The recommended way to do this is a separate table where the numeric values maps to the explanatory strings. Notwithstanding this, application developers may favor a solution within an application server.
The CASE expression, which is shown on this page, is a technique for solving the described situation as part of a SELECT, INSERT or UPDATE command as well as solving additional problems. As part of the language, it's a powerful term that can be applied at plenty of places within SQL commands. On this page, we focus on its use together with the SELECT command. The strategy and syntax for CASE withinINSERT andUPDATE are equivalent and are presented over there. In comparison with the recommended technique of a separate table for the translation, the CASE expression is much more flexible (which is not an advantage in all cases).
-- Technical term: "simple case"-- Select id, contact_type in a translated version and contact_valueSELECTid,CASEcontact_typeWHEN'fixed line'THEN'Phone'WHEN'mobile'THEN'Phone'ELSE'Not a telephone number'ENDAS'contact_type',contact_valueFROMcontact;
The CASE expression is introduced with its keyword CASE and runs up to the END keyword. In this first example, it specifies a column name and a series of WHEN/THEN clauses with an optional ELSE clause. The WHEN/THEN clauses are compared and evaluated against the values of the named column, one after the other. If none of them hits, the ELSE clause applies. If there is no ELSE clause and none of the WHEN/THEN clauses hit, the NULL special marker will be applied.
The comparison between the values of the column and the fixed values within the WHEN/THEN clause is done solely by "=" (equals). This is a good starting point, but real applications need more than that. Therefore there is a variant of the CASE.
-- Technical term: "searched case"-- Select persons name, weight and a denomination of the weightSELECTfirstname,lastname,weight,CASEWHEN(weightISNULLORweight=0)THEN'weight is unknown'WHENweight<40THEN'lightweight'WHENweightBETWEEN40AND85THEN'medium'ELSE'heavyweight'ENDAS'weight'FROMperson;
The crucial point is the direct succession of the two keywords CASE and WHEN. There isno column name between them. In this variant, there must be a complete expression, which evaluates to one of the 3-value-logic termstrue,false orunknown, between each WHEN and THEN. Now it is possible to use all the comparisons and boolean operators as they are known by theWHERE clause. It is even possible to compare different columns or function calls with each other.
There are the two variantssimple case andsearched case.
-- "simple case" performs successive comparisons using the equal operator: <column_name> = <expression_x>CASE<column_name>WHEN<expression_1>THEN<result_1>WHEN<expression_2>THEN<result_2>...ELSE<default_result>-- optionalEND-- "searched case" is recognised by 'nothing' between CASE and first WHENCASEWHEN<condition_1>THEN<result_1>WHEN<condition_2>THEN<result_2>...ELSE<default_result>-- optionalEND
Thesimple case is limited to one column and the use of the equal operator, whereas thesearched case may evaluate arbitrary columns of the (intermediate) result with arbitrary operators, functions or predicates.
The use of CASE expressions is not limited to projections (the column list between SELECT and FROM). As the clause evaluates to a value, it can be applied as a substitution for values at several places within SQL commands. In the following, we offer some examples.
ORDER BY clause
Sort contact values in the order: all fixed lines, all mobile phones, all emails, all icq's. Within each group sort over the contact values.
SELECT*FROMcontactORDERBY-- a "simple case" construct as substitution for a column nameCASEcontact_typeWHEN'fixed line'THEN0WHEN'mobile'THEN1WHEN'email'THEN2WHEN'icq'THEN3ELSE4END,contact_value;
In the next example, persons are ordered by weight classes, within the classes by their name.
-- order by weight classesSELECTfirstname,lastname,weight,CASEWHEN(weightISNULLORweight=0)THEN'weight is unknown'WHENweight<40THEN'lightweight'WHENweightBETWEEN40AND85THEN'medium'ELSE'heavyweight'ENDFROMpersonORDERBY-- a "searched case" construct with IS NULL, BETWEEN and 'less than'.CASEWHEN(weightISNULLORweight=0)THEN0WHENweight<40THEN1WHENweightBETWEEN40AND85THEN2ELSE3END,lastname,firstname;
WHERE clause
Within the WHERE clauses, there may occur fixed values or column names. CASE expressions can be used as a substitution for them. In the example, persons receive a discount on their weight depending on their place of birth (consider it as a theoretical example). Thus Mr. Goldstein, with its 95 kg, counts only with 76 kg and is not part of the result set.
SELECT*FROMpersonWHERECASE-- Modify weight depending on place of birth.WHENplace_of_birth='Dallas'THENweight*0.8WHENplace_of_birth='Richland'THENweight*0.9ELSEweightEND>80ORweight<20;-- any other condition
Show firstname, lastname and the gender of all persons. Consider Larry, Tom, James, John, Elias, Yorgos, Victor as 'male',
Lisa as 'female' and all others as 'unknown gender'. Use asimple case expression.
SELECTfirstname,lastname,CASEfirstnameWHEN'Larry'THEN'male'WHEN'Tom'THEN'male'WHEN'James'THEN'male'WHEN'John'THEN'male'WHEN'Elias'THEN'male'WHEN'Yorgos'THEN'male'WHEN'Victor'THEN'male'WHEN'Lisa'THEN'female'ELSE'unknown gender'ENDFROMperson;
Use asearched case expression to solve the previous question.
SELECTfirstname,lastname,CASEWHENfirstnamein('Larry','Tom','James','John','Elias','Yorgos','Victor')THEN'male'WHENfirstname='Lisa'THEN'female'ELSE'unknown gender'ENDFROMperson;
Show firstname, lastname and a classification of all persons. Classify persons according to the length of their firstname. Call the class 'short name' if character_length(firstname) < 4, 'medium length' if < 6, 'long name' else.
-- Hint: Some implementations use a different function name: length() or len().SELECTfirstname,lastname,CASEWHENCHARACTER_LENGTH(firstname)<4THEN'short name'WHENCHARACTER_LENGTH(firstname)<6THEN'medium length'ELSE'long name'ENDFROMperson;
-- Hint: Some implementations use a different function name: length() or len().SELECTSUM(CASEWHENCHARACTER_LENGTH(firstname)<4THEN1ELSE0END)asshort_names,SUM(CASEWHENCHARACTER_LENGTH(firstname)between4and5THEN1ELSE0END)asmedium,SUM(CASEWHENCHARACTER_LENGTH(firstname)>5THEN1ELSE0END)aslong_namesFROMperson;
A subquery is a complete SELECT command which is used within another SELECT, UPDATE, INSERT or DELETE command. The only difference to a simple SELECT is, that it is enclosed in parenthesis.
Depending on the type of the created result there are three classes of subqueries:
(SELECT max(weight) FROM person).(SELECT min(weight), max(weight) FROM person).(SELECT lastname, weight FROM person). For the classification, it makes no difference whether the resulting list contains zero, one, or more rows. The demarcation between a table subquery and a row subquery is thatpotentially more than one row may occur.Every type can be used on all positions where the type it stands for may occur: the scalar value subquery where a single value may occur, the row subquery where a single row may occur and the table subquery where a table may occur. Additionally, table subqueries may occur as an argument of an EXISTS, IN, SOME, ANY or ALL predicate.
Independent from this classification subqueries may becorrelated subqueries ornon-correlated subqueries. Correlated subqueries have a correlation to the surrounding query by the fact that they use values from the surrounding query within the subquery. Non-correlated subqueries are independent of the surrounding query. This distinction is shown in detail inthe next chapter but applies also to the other two subquery classes.
Because correlated subqueries use values, which are determined by the surrounding query and may change from row to row, the subquery is executed - conceptually - as often as resulting rows of the surrounding query exist. This might lead to performance problems. Nevertheless correlated subqueries are an often used construct. In many cases, there are equivalent constructs which use a JOIN. Which one shows the better performance depends highly on the DBMS, and the number of involved rows, the existence of indices, and a lot more variables.
The first example creates a list of lastnames, weights and the average weight of all persons.
SELECTid,lastname,weight,(SELECTavg(weight)FROMperson)AS'avg_weight'-- this is the subqueryFROMpersonORDERBYlastname;
Because the subquery uses theavg() function, the SQL compiler knows that it will return exactly one single value. Therefore it's type isScalar Value Subquery and can be used on positions where scalar values may occur, e.g. in the list between SELECT and FROM.
In the next example, the subquery is used as a deputy for a value within the WHERE clause.
-- Persons who weigh more than the average of all personsSELECTid,lastname,weightFROMpersonWHEREweight>=(SELECTavg(weight)FROMperson)-- another position for the subqueryORDERBYlastname;
Both examples use the tableperson twice. One can also use different tables. There is no dependency between the table name in the subquery and in the surrounding query. This applies to all classes of correlated and non-correlated subqueries. The subqueries may retrieve any value from any other table, e.g. the number of contacts.
These first two examples show non-correlated subqueries, which means, that the subqueries are independent from the queries in which they are embedded. They are executed only once.
But often an application faces a situation, where the subquery must use values from the outside query (similar to subroutines which uses parameters). This kind of subquery is called a correlated subquery. As an example, the next query lists persons together with the average weight of their family.
SELECTid,firstname,lastname,weight,(SELECTavg(weight)FROMpersonsq-- 'sq' is an arbitrary alias name for the table in the subqueryWHEREsq.lastname=p.lastname-- identify the inner and outer table by its alias names)family_average-- an arbitrary alias name for the computed family averageFROMpersonp-- 'p' is an arbitrary alias name for the table in the surrounding queryORDERBYlastname,weight;
The subselect gets one row of the surrounding SELECT after the next as an parameter with the name 'p'. Within the subselect all columns of the row 'p' are known and may be used. Here the family name from the outside row is used in the subquery to find all persons within the family and the average weight of the family members.
Be careful: Correlated subqueries are executed once per row of the surrounding query. Therefore they are much more costly than non-correlated subqueries. There might exist an equivalent solution using JOIN or GROUP BY which works with better performance. The query optimizer of the DBMS internally might rearrange the given subquery into one of the equivalent forms. But this does not work in all cases.
The distinction between correlated and non-correlated subqueries is universal. It applies also to the other subquery classes.
This example retrieves one or more persons, whose firstname is the lowest (in the sense of the lexical order) of all firstnames and whose lastname is the lowest of all lastnames. Because of the AND condition it might be the case that no person is found.
-- One resulting row: Elias BakerSELECT*FROMpersonWHERE(firstname,lastname)=(SELECTMIN(firstname),MIN(lastname)FROMperson);
Within the subquery, the lowest first- and lastnames are retrieved. The use of themin() function guarantees that not more than one row with two columns will arise - therefore it is a row subquery. In the surrounding query, this intermediate result is compared with each row of the complete tableperson or - if present - an index is used.
It is fortunate that the command retrieves a row. In most cases, the lowest first- and lastname results from different persons. But also in those cases, the command is syntactically correct and will not throw any exception.
In the next example, persons with the lowest first- and lastnames within every family are retrieved. To do so, it is necessary to use a correlated row subquery.
-- 7 rows, one per familySELECT*FROMpersonpWHERE(firstname,lastname)=(SELECTMIN(firstname),MIN(lastname)FROMpersonsqwherep.lastname=sq.lastname);
Again, there are the two incarnations of tableperson, one with the alias name 'p' in the surrounding query and one with the alias name 'sq' in the subquery. The subquery is called once per resulting row of the surrounding query, because the 'p.lastname' may change with every row of 'p'.
Within every family there is at least one person which achieves the condition - it is also conceivable that several persons achieve the condition.
The next example retrieves persons who have a contact. The class of the subquery is: non-correlated table subquery (used as a condition in the IN predicate).
SELECT*FROMpersonWHEREidIN(SELECTperson_idFROMcontact);-- the subquery
The subquery creates multiple rows with one column for each of them. This constitutes a new, intermediate table. Therefore this example is a table subquery.
The IN operator is able to act on this intermediate table. In contrast, it is not possible to use operators like '=' or '>' on this kind of intermediate result. In this case, the SQL compiler will recognize a syntax error.
The next example is an extension of the first one. It adds a correlation criterion between the query and the subquery by requesting the lastname within an email-address.
-- A correlated table subquery, looking for lastnames within e-mail-addressesSELECT*FROMpersonpWHEREidIN(SELECTperson_idFROMcontactcWHEREc.contact_type='email'ANDUPPER(c.contact_value)LIKECONCAT(CONCAT('%',UPPER(p.lastname)),'%'));
The last comparison after the AND is a little bit complex. It uses the functions CONCAT() and UPPER() as well as the predicate LIKE, but this is not of interest for the actual topic 'subquery'. The important part is that the subquery refers to 'p.lastname' of the surrounding query. Only Mr. Goldstein meets the criterion that his e-mail address contains his lastname when the two columns are compared case-insensitive.
Remark: CONCAT() concatenates two strings. UPPER() converts a string to the upper-case. LIKE in combination with the '%' sign looks for one string within another.
Next, there is an example where a non-correlated table subquery is object to a join operation.
-- Persons plus maximum weight of their familySELECT*FROMpersonpJOIN(SELECTlastname,max(weight)max_fam_weightFROMpersonGROUPBYlastname)ASsqONp.lastname=sq.lastname-- join criterion between subquery table 'sq' and table 'p';
The example shows a solution to a common problem. Sometimes there are rows describing an outdated stage of entities. Those rows - for one logical entity - differ from each other in some columns and there is an additional columnversion to track the time flow.
Here is the example tablebooking and its data.
-- The table holds actual and historical valuesCREATETABLEbooking(-- identifying columnsidDECIMALNOTNULL,booking_numberDECIMALNOTNULL,versionDECIMALNOTNULL,-- describing columnsstateCHAR(10)NOTNULL,enter_tsTIMESTAMPNOTNULL,enter_byCHAR(20)NOTNULL,-- ...-- select one of the defined columns as the Primary KeyCONSTRAINTbooking_pkPRIMARYKEY(id),-- forbid duplicate recordingsCONSTRAINTbooking_uniqueUNIQUE(booking_number,version));-- Add dataINSERTINTObookingVALUES(1,4711,1,'created',TIMESTAMP'2014-02-02 10:01:01','Emily');INSERTINTObookingVALUES(2,4711,2,'modified',TIMESTAMP'2014-02-03 11:10:01','Emily');INSERTINTObookingVALUES(3,4711,3,'canceled',TIMESTAMP'2014-02-10 09:01:01','John');--INSERTINTObookingVALUES(4,4712,1,'created',TIMESTAMP'2014-03-10 12:12:12','Emily');INSERTINTObookingVALUES(5,4712,2,'delivered',TIMESTAMP'2014-03-12 06:01:00','Charles');--INSERTINTObookingVALUES(6,4713,1,'created',TIMESTAMP'2014-03-11 08:50:02','Emily');INSERTINTObookingVALUES(7,4713,2,'canceled',TIMESTAMP'2014-03-12 08:40:12','Emily');INSERTINTObookingVALUES(8,4713,3,'reopend',TIMESTAMP'2014-03-13 10:04:32','Jack');INSERTINTObookingVALUES(9,4713,4,'delivered',TIMESTAMP'2014-03-15 06:40:12','Jack');--COMMIT;
The problem is to retrieve allactual rows, which are those with the highest version number within each booking. Bookings are considered to be the same if they have the same booking_number.
The first solution uses a non-correlated table subquery.
SELECT*FROMbookingbWHERE(booking_number,version)IN(SELECTbooking_number,MAX(version)FROMbookingsqGROUPBYbooking_number)-- the subqueryORDERBYbooking_number;
The subquery creates a list of booking numbers together with their highest version. This list is used by the surrounding query to retrieve the required rows with all its columns.
The second solution uses a correlated scalar value subquery.
SELECT*FROMbookingbWHEREversion=(SELECTmax(version)FROMbookingsqWHEREsq.booking_number=b.booking_number)ORDERBYbooking_number;
The surrounding query retrieves all rows of the table. For each of them, it calls the subquery, which retrieves the highest version within this booking_number. In most cases, this highest version differs from the version of the actual row and because of the '=' operator those rows are not part of the result. Only those, whose version is equal to the value determined in the subquery (and whose booking_number is the same as those used in the subquery) are part of the final result.
A variation of the introducing question may be to retrieve only historical rows (all versions except the highest one) for one special booking.
SELECT*FROMbookingbWHEREversion!=(SELECTmax(version)FROMbookingsqWHEREsq.booking_number=b.booking_number)ANDbooking_number=4711ORDERBYversion;
The surrounding query restricts the rows to those of one special booking. The subquery is called only for those rows.
It's easy to run into pitfalls:
-- Unexpected result!SELECT*FROMbookingbWHEREversion!=(SELECTmax(version)FROMbooking)ANDbooking_number=4711ORDERBYversion;
The above query returns all versions of booking 4711 including the actual one! To get the expected result, it's necessary to 'link' the surrounding query and the subquery together.
Find the booking with the most versions.
-- The subselect return exactly ONE single value. Therefor it's a (non-correlated) single value subquery.-- But this is only a intermediate result. The final result may contain several rows, which is not the case in our example database!SELECT*FROMbookingWHEREversion=(SELECTMAX(version)FROMbooking);
Find all bookings with are canceled (in the latest version).
-- It's necessary to link the subquery with the surrounding query.SELECT*FROMbookingbWHEREversion=(SELECTMAX(version)FROMbookingsqWHEREsq.booking_number=b.booking_number)ANDstate='canceled';-- Additionally within the resulting rows there must be a correlation between the version and the state.-- This is accomplished with the AND keyword at the level of the surrounding query. If the AND works within-- the subquery, the result does not meet the expectations.SELECT*FROMbookingbWHEREversion=(SELECTMAX(version)FROMbookingsqWHEREsq.booking_number=b.booking_numberANDstate='canceled');
Create a list of all persons. For each person include the number of persons born in the same city as the person.
-- The subselect uses the place_of_birth of the outside row. Therefore it's a correlated subquery.SELECTfirstname,lastname,place_of_birth,(SELECTCOUNT(*)FROMpersonsqWHEREp.place_of_birth=sq.place_of_birth)cnt-- an arbitrary name for the additional columnFROMpersonp;
Create a list of all persons together with the number of their contact information.
-- The subselect uses the ID of the outside row. Therefore it's a correlated subquery.SELECTfirstname,lastname,(SELECTCOUNT(*)FROMcontactcWHEREp.id=c.person_id)cnt-- an arbitrary name for the additional columnFROMpersonp;
Create a list of all persons together with the number of their e-mail-addresses.
SELECTfirstname,lastname,(SELECTCOUNT(*)FROMcontactcWHEREp.id=c.person_idANDcontact_type='email'-- The subselect is a complete SELECT. Therefor all elements of-- a 'regular' SELECT may be used: Join, functions, ... and: SUBSELECT)cnt-- an arbitrary name of the additional columnFROMpersonp;
Create a list of all persons together with the number of their contact information. (Same question as above.)
Replace the subquery by a JOIN construct.
-- Step 1 (for demonstration purpose only): To retrieve ALL persons, it's necessary to use an OUTER JOINSELECTfirstname,lastname,c.contact_typeFROMpersonpLEFTOUTERJOINcontactcONp.id=c.person_id;---- Step 2 (complete solution): Add the counter. To do so, the result must be grouped.SELECTfirstname,lastname,count(c.contact_type)FROMpersonpLEFTOUTERJOINcontactcONp.id=c.person_idGROUPBYfirstname,lastname;
For which persons there are NO contact information?
-- The subquery returns more than one row. Therefore it's a table subquery.SELECTfirstname,lastnameFROMpersonWHEREidNOTIN(SELECTperson_idFROMcontact);-- the subquery
Often users and applications request information in a form which differs from the structure of existing tables. To achieve those requests the SELECT command offers plenty of possibilities: projections, joins, group by clause, and so on. If there are always the same requests, which is the case particularly for applications, or if the table structure intentionally should be hidden from the application-level, views can be defined. Furthermore, the access rights to views may be different from those to tables.
Views look like a table. They have columns of a certain data type, which can be retrieved in the same way as columns of a table. But views are only definitions, they don't have data of its own! Their data is always the data of a table or is based on another view. A view is adifferent sight to the stored data or somewhat like apredefined SELECT.
One creates a view by specifying its name, optionally column names, and especially the SELECT command on which the view is based. Within this SELECT all elements are allowed in the same way as in a standalone SELECT command. If no column names are specified the column names of the SELECT are used.
CREATEVIEW<view_name>[(column_name,...)]ASSELECT...-- as usual;
As a first example here is the viewperson_view_1 which contains all butid andssn columns of tableperson. Users who have the right to read from this view but not from the tableperson don't have access toid andssn.
CREATEVIEWperson_view_1ASSELECTfirstname,lastname,date_of_birth,place_of_birth,weightFROMperson;-- SELECTs on views have identical syntax as SELECTs on tablesSELECT*FROMperson_view_1ORDERBYlastname;-- The column 'id' is not part of the view. Therefore it is not seen and cannot be used-- anywhere in SELECTs to person_view_1.-- This SELECT will generate an error message because of missing 'id' column:SELECT*FROMperson_view_1WHEREid=5;
As indicated in the above 'order by' example it is possible to use all columns of the view (but not all of the table!) within any part of SELECTs to the view: in the projection, the WHERE, ORDER BY, GROUP BY and HAVING clauses, in function calls and so on.
-- SELECTs on views have identical syntax as SELECTs on tablesSELECTcount(lastname),lastnameFROMperson_view_1GROUPBYlastnameORDERBYlastname;
Next, there is a renaming of a column. The column namelastname of the table will befamilyname in the view.
-- first technique: list the desired column names within parenthesis after the view nameCREATEVIEWperson_view_2a(firstname,familyname,date_of_birth,place_of_birth,weight)ASSELECTfirstname,lastname,date_of_birth,place_of_birth,weightFROMperson;-- second technique: rename the column in the SELECT partCREATEVIEWperson_view_2bASSELECTfirstname,lastnameASfamilyname,date_of_birth,place_of_birth,weightFROMperson;-- Hint: technique 1 overwrites technique 2-- Access to person.lastname is possible via person_view_2a.familyname or person_view_2b.familyname.-- The objects person.familyname, person_view_2a.lastname or person_view_2b.lastname does not exist!
Not only columns can be hidden in a view. It's also possible to hide complete rows, because the view definition may contain a WHERE clause.
-- Restrict access to few rowsCREATEVIEWperson_view_3ASSELECTfirstname,lastname,date_of_birth,place_of_birth,weightFROMpersonWHEREplace_of_birthin('San Francisco','Richland');-- Verify result:SELECT*FROMperson_view_3;
This view contains only persons born in San Francisco or Richland. All other persons are hidden. Therefore the following SELECT retrieves nothing although there are persons in the table which fulfill the condition.
-- No hitSELECT*FROMperson_view_3WHEREplace_of_birth='Dallas';-- One hitSELECT*FROMpersonWHEREplace_of_birth='Dallas';
This example uses the sum() function.
--CREATEVIEWperson_view_4AS-- General hint: Please consider that not all columns are available in a SELECT containing a GROUP BY clauseSELECTlastname,count(lastname)AScount_of_membersFROMpersonGROUPBYlastnameHAVINGcount(*)>1;-- Verify result: 2 rowsSELECT*FROMperson_view_4;-- The computed column 'count_of_members' may be part of a WHERE condition.-- This SELECT results in 1 rowSELECT*FROMperson_view_4WHEREcount_of_members>2;
In this example, the elaborated construct 'GROUP BY / HAVING' is hidden from users and applications.
Views can contain columns from several tables by using JOIN commands. The following example view contains the name of persons in combination with the available contact information. As anINNER JOIN is used, some persons occur multiple, others not at all.
-- Persons and contactsCREATEVIEWperson_view_5ASSELECTp.firstname,p.lastname,c.contact_type,c.contact_valueFROMpersonpJOINcontactcONp.id=c.person_id;-- Verify resultSELECT*FROMperson_view_5;SELECT*FROMperson_view_5WHERElastname='Goldstein';
The columnsperson.id ancontact.person_id are used during the definition of the view. But they are not part of the projection and hence not available for SELECTs to the view.
Hint: The syntax and semantic of join operations is explained on a separatepage.
Within a CREATE VIEW statement, one may use more elements of the regular SELECT statement than it is shown on this page, especially: SET operations, recursive definitions, CASE expressions, ORDER BY, and so on.
If there is an ORDER BY clause within the CREATE VIEW and another one in a SELECT to this view, the later one overwrites the former.
In some cases, but not in general, it should be possible to change data (UPDATE, INSERT or DELETE command) in a table by accessing it via a view. Assume, as a counterexample, that one wants to change the columncount_of_members ofperson_view_4 to a different value. What shall the DBMS do? The column is subject to an aggregate function which counts the number of existing rows in the underlying table. Shall it add some more random values into new rows respectively shall it delete random rows to satisfy the new value ofcount_of_members? Of course not!
On the other hand, a very simple view like 'CREATE VIEW person_0 AS SELECT * from person;', which is a 1:1 copy of the original table, should be manageable by the DBMS. Where is the borderline between updateable and non-updateable views? The SQL standard does not define it. But the concrete SQL implementations offer limited write-access to views based on their own rules. Sometimes these rules are very fixed, in other cases they consist of flexible techniques like 'INSTEAD OF' triggers to give programmers the chance to implement their own rules.
Here are some general rules whichmay be part of the implementors fixed rules to define, which views are update-able in this sense:
If it is possible to use the UPDATE, INSERT or DELETE command to a view, the syntax is the same as with tables.
The DROP VIEW statement deletes a view definition. In doing so the data of the underlying table(s) is not affected.
Don't confuse the DROP command (definitions) with the DELETE command (data)!
DROPVIEWperson_view_1;DROPVIEWperson_view_2a;DROPVIEWperson_view_2b;DROPVIEWperson_view_3;DROPVIEWperson_view_4;DROPVIEWperson_view_5;
Create a view 'hobby_view_1' which contains all columns of table 'hobby' except 'id'.
Rename column 'remark' to 'explanation'. Create two different solutions.
CREATEVIEWhobby_view_1aASSELECThobbyname,remarkASexplanationFROMhobby;-- VerificationSELECT*FROMhobby_view_1a;CREATEVIEWhobby_view_1b(hobbyname,explanation)ASSELECThobbyname,remarkFROMhobby;-- VerificationSELECT*FROMhobby_view_1b;
Create a view 'hobby_view_2' with the same criteria as in the previous example. The only difference
is that the length of the explanation column is limited to 30 character. Hint: use the function
substr(<column name>, 1, 30) to determine the first 30 characters - this is not part of the SQL standard but works in many implementations.
CREATEVIEWhobby_view_2ASSELECThobbyname,substr(remark,1,30)ASexplanationFROMhobby;-- VerificationSELECT*FROMhobby_view_2;
Create a view 'contact_view_3' which contains all rows of table contact with the exception of the 'icq' rows. Count the number of the view rows and compare it with the number of rows in the table 'contact'.
CREATEVIEWcontact_view_3ASSELECT*FROMcontactWHEREcontact_type!='icq';-- an alternate operator with the same semantic as '!=' is '<>'-- VerificationSELECT'view',count(*)FROMcontact_view_3UNIONSELECT'table',count(*)FROMcontact;
Create a view 'contact_view_4' which contains one row per contact type with its notation and the number of occurrences. Afterwards select those which occur more than once.
CREATEVIEWcontact_view_4ASSELECTcontact_type,count(*)AScntFROMcontactGROUPBYcontact_type;-- VerificationSELECT*FROMcontact_view_4;-- Use columns of a view with the same syntax as a column of a table.SELECT*FROMcontact_view_4WHEREcnt>1;
Create a view 'person_view_6' which contains first- and lastname of persons plus the number of persons with the same name as the person itself (family name). Hint: the solution uses acorrelated subquery.
CREATEVIEWperson_view_6ASSELECTfirstname,lastname,(SELECTcount(*)FROMpersonsqWHEREsq.lastname=p.lastname)AScnt_familyFROMpersonp;-- VerificationSELECT*FROMperson_view_6;
Clean up the example database.
DROPVIEWhobby_view_1a;DROPVIEWhobby_view_1b;DROPVIEWhobby_view_2;DROPVIEWcontact_view_3;DROPVIEWcontact_view_4;DROPVIEWperson_view_6;
Hint: Be carefull and deactivateAUTOCOMMIT.
The INSERT command stores one or more new rows into one table. The content of the new rows consists of either fixed values or results from a SELECT, which is evaluated at runtime. So there are two different syntaxes to do the job.
-- The static version of the INSERT commandINSERTINTO<tablename>(<list_of_columnnames>)VALUES(<list_of_values>),(<list_of_values>),(<list_of_values>),...;
Following the tablename, we can list the affected columns and after the keyword 'VALUES' state one or more lists of values to insert. Each list of values represents one new row. The lists of columns and values have to be in accordance such that the quantity of list entries is identical and their data types correlate.
-- One value list results in one new row.INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(91,'Larry, no. 91','Goldstein',DATE'1970-11-20','Dallas','078-05-1120',95);COMMIT;-- The SQL standard - but not all implementations, in particular Oracle - supports a 'row value constructor' by-- enumerate values inside a pair of parenthesis as shown in the above green box.-- Three lists of values (= row value constructors) result in three new rows. Please note the comma after all-- but the last one.INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(92,'Larry, no. 92','Goldstein',DATE'1970-11-20','Dallas','078-05-1120',95),(93,'Larry, no. 93','Goldstein',DATE'1970-11-20','Dallas','078-05-1120',95),(94,'Larry, no. 94','Goldstein',DATE'1970-11-20','Dallas','078-05-1120',95);COMMIT;
We can choose any order of columns, but column names and values should be in concordance.
-- Sometimes things are scrambled. Maybe confusing, but works fine. See weight and id.INSERTINTOperson(date_of_birth,firstname,ssn,lastname,place_of_birth,weight,id)VALUES(DATE'1970-11-20','Larry, no. 95','078-05-1120','Goldstein','Dallas',95,95);COMMIT;
We can omit unnecessary columns.
-- Depending on CREATE TABLE statement the missing columns will get the 'null special marker' or a default value.INSERTINTOperson(id,firstname,lastname,weight)VALUES(96,'Larry, no. 96','Goldstein',95);COMMIT;
Clean up your table.
DELETEFROMpersonWHEREidBETWEEN91AND96;COMMIT;
Unlike in the above paragraph, we can insert values that are not fixed but dynamic such that they are evaluated at runtime from any table, a function, or a computation. Even the number of new rows can be dynamic. All this is done in a subselect which replaces the VALUE clause. We explain this technique at pageAdvanced Insert. The other rules concerning number and sequence of columns or omitting values retain their validity.
Insert a new row for Mr. Peter Hufington with its body weight of 67 kg. He is born in LA.
-- Choose any free idINSERTINTOperson(id,firstname,lastname,weight,place_of_birth)VALUES(81,'Peter, no. 81','Hufington',67,'Los Angeles');COMMIT;-- Check your resultSELECT*FROMperson;
Hint: Be carefull and deactivateAUTOCOMMIT.
The basic syntax and semantic of the INSERT command is described on the pageINSERT. There are examples of how to insert single rows with fixed values into a table. The present page describes how to dynamize the command by the use of subqueries.
First, the values to be inserted may be evaluated in a relatively strict way by reading the system time or other (quasi) constants.
-- Use the key word CURRENT_DATE to determine the actual day.INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(101,'Larry, no. 101','Goldstein',CURRENT_DATE,'Dallas','078-05-1120',95);COMMIT;
Next, the values to be inserted may be evaluated by ascalar value subquery. This means, that single values may be computed at runtime based on the rows of the same or another table.
-- Count the number of rows to determine the next ID. Caution: This handling of IDs is absolutely NOT recommended for real applications!INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES((SELECTCOUNT(*)+1000FROMperson),-- This scalar-value subquery computes a single value, the new ID.-- VALUES ((Select * FROM (SELECT COUNT(*) + 1000 FROM person) tmp), -- MySQL insists in using an intermediate table'Larry, no. ?','Goldstein',CURRENT_DATE,'Dallas','078-05-1120',95);COMMIT;
Similar to the previous example of a single scalar value from a scalar value subquery, one can use atable subquery to get several rows and insert them into the specified table within one INSERT command. This version is able to insert thousands of rows with one single statement. In addition to its dynamic nature, it saves all but one round-trip between the application and the DBMS and therefore is much faster than a lot of single row-based INSERTs.
-- The statement doubles the number of rows within the table. It omits in the table subquery the WHERE clause and therefore-- it reads all existing rows. Caution: This handling of IDs is absolutely NOT recommended for real applications!INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)SELECTid+1100,firstname,lastname,date_of_birth,place_of_birth,ssn,weightFROMperson;COMMIT;
The syntax is changed such that a complete subquery replaces the keyword 'VALUES' with its list of values (often named 'subselect'), which starts with the keyword 'SELECT'. Of course, the number and type of the selected columns must correlate with the number and type of the columns in the specified column list that follows the 'INSERT INTO' keyword. Within the subquery, the complete power of the SELECT statement may be used: JOIN, WHERE, GROUP BY, ORDER BY, and especially other subqueries in a recursive manner. There is a wide range of use cases: create rows with increased version numbers, with percentage increased salary, with the actual timestamp, fixed values from rows of the same or another table, ... .
-- The next two statements compute different weights depending on the old weightINSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)-- the subquery starts hereSELECTid+1200,firstname,lastname,date_of_birth,place_of_birth,ssn,CASEWHENweight<40THENweight+10ELSEweight+5ENDFROMpersonWHEREid<=10;-- only the original 10 rows from the example databaseCOMMIT;-- The same semantic with a more complex syntax (to demonstrate the power of subselect)INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)-- the first subquery starts hereSELECTid+1300,firstname,lastname,date_of_birth,place_of_birth,ssn,-- here starts a subquery of the first subquery. The CASE construct evaluates different-- weights depending on the old weight.(SELECTCASEWHENweight<40THENweight+10ELSEweight+5ENDFROMpersonssq-- alias for the table name in sub-subqueryWHEREsq.id=ssq.id-- link the rows together)FROMpersonsq-- alias for the table name in subqueryWHEREid<=10;-- only the original 10 rows from the example databaseCOMMIT;
The technique shown atStructured Query Language/Example_Database_Data#Grow_up, which multiplies existing data, e.g., for testing purpose, is based on such table subqueries.
DELETEFROMpersonWHEREid>100;COMMIT;
Insert a new person with id 1301, firstname 'Mr. Mean', lastname is the lowest lastname (in the sense of the character encoding, use min() function). Its weight is the average weight of all persons (use avg() function).
-- Two columns are computed during runtimeINSERTINTOperson(id,firstname,lastname,weight)VALUES(1301,'Mr. Mean',(SELECTMIN(lastname)FROMperson),(SELECTAVG(weight)FROMperson)-- the MySQL version with its intermediate tables-- (SELECT * FROM (SELECT MIN(lastname) FROM person) tmp1),-- (SELECT * FROM (SELECT AVG(weight) FROM person) tmp2));COMMIT;-- Check your resultSELECT*FROMpersonWHEREid=1301;
Insert one additional person per family (=lastname) with firstname 'An extraordinary family member', lastname is the family name. Incorporate only the rows from the original example database with id <= 10.
-- Two columns are computed during runtime. The number of involved rows is delimited by the WHERE clause.INSERTINTOperson(id,firstname,lastname)-- here starts the subquerySELECTMAX(id)+1310,-- in this case the max() function works per group'An extraordinary family member',lastnameFROMpersonWHEREid<=10GROUPBYlastname;COMMIT;-- Check your resultSELECT*FROMpersonWHEREid>1310;
Clean up your database.
DELETEFROMpersonWHEREid>1300;COMMIT;
Hint: Be careful and deactivateAUTOCOMMIT.
The UPDATE command modifies one or more existing rows.
UPDATE<tablename>SET<columnname>=<value>,<columnname>=<value>,...WHERE<search_condition>;
Values are assigned to the named columns. Unmentioned columns remain unchanged. The search_condition acts in the same way as in the SELECT command. It restricts the coverage of the command to those rows that satisfy the specified criteria. If the WHERE keyword and the search_condition are omitted,all rows of the table are affected. It is possible to specify search_conditions which hit no rows. In this case, no rows are updated - and no error or exception occurs.
As shown previously in theintroductory page, the UPDATE command is very flexible. Within one command , we can choose an arbitrary combination of the two features:
Additionally, values can have a dynamic nature like system date, timestamp, or timezone - depending on your implementation.
For a recapitulation here is an example:
UPDATEpersonSETfirstname='Jimmy Walker',lastname='de la Crux'WHEREid=5;-- revoke the changesROLLBACK;
We present some more information about additional opportunities of the UPDATE commandhere.
Change columnremark of rowFishing in tablehobby to: 'Catching fishes with or without tools.'.
-- Change one column of one rowUPDATEhobbySETremark='Catching fishes with or without tools.'WHEREhobbyname='Fishing';-- or: WHERE id = 2;COMMIT;-- Check the resultSELECT*FROMhobby;
Change columnremark of all rows in tablehobby to: 'A creative leisure activity.'. Check the success of the command without COMMIT-ting. Revert the changes.
-- no WHERE clause ==> all rows affectedUPDATEhobbySETremark='A creative leisure activity.';-- Check the resultSELECT*FROMhobby;-- We don't want to save these values. So we revert the changes.ROLLBACK;
-- A WHERE clauses, which hits all rowsUPDATEhobbySETremark='A creative leisure activity.'WHEREid>=1;-- Check the resultSELECT*FROMhobby;-- We don't want to save these values. So we revert the changes.ROLLBACK;
Hint: Be careful and deactivateAUTOCOMMIT.
The page in hand offers two additional techniques as an extension to theUPDATE command shown on one of the previous pages:
The values which are assigned to a column may be computed by a correlated or non-correlatedscalar value subquery on the involved table or another one. There are many use cases where this technique is utilized: Increase values linear or in percentage, use values from the same or another table, ... . The situation is similar to that described on the page about theINSERT command.
-- The average weight of all persons is stored in column 'weight' of the first four persons.UPDATEpersonSET-- dynamic computation of a valueweight=(SELECTAVG(weight)FROMperson)-- weight = (SELECT * FROM (SELECT AVG(weight) FROM person) tmp) -- MySQL insists on using an intermediate tableWHEREid<5;-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
The subquery may use the values of the row, which is currently updated. In the next example, persons receive the mean weight of their family. To compute this mean weight, it is necessary to use the column 'lastname' of the actual processed row.
-- The subquery is a 'correlated' scalar value subquery.UPDATEpersonpSET-- 'p.lastname' refers to the lastname of the actual row. The subquery bears all rows in mind, not only those with 'id >= 5'.weight=(SELECTAVG(weight)FROMpersonsqWHEREsq.lastname=p.lastname)-- A hint to MySQL users: MySQL does not support UPDATE in combination with a correlated subquery-- to the same table. Different tables work. MySQL has a different, non-standard concept: multi-table update.WHEREid>=5;-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
The WHERE clause determines which rows of a table are involved by the UPDATE command. This WHERE clause has the same syntax and semantic as the WHERE clause of the SELECT or DELETE command. It may contain complex combinations of boolean operators, predicates like ANY, ALL or EXISTS, and recursively subqueries as described inSELECT: Subquery.
-- UPDATE rows in the 'person' table based on the results of a subquery to the 'contact' table.-- In the example, persons with more than 2 contact information are affected.UPDATEpersonSETfirstname='Has many buddies'WHEREidIN(SELECTperson_idFROMcontactGROUPBYperson_idHAVINGcount(*)>2);-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
The command performs an UPDATE in the tableperson, but the affected rows are identified by a subquery in tablecontact. This technique of grabbing information from other tables offers very flexible strategies to modify the data.
It is no error to select 0 rows in the subquery. In this case, the DBMS executes the UPDATE command as usual and throws no exception. (The same holds true for subqueries in SELECT or DELETE statements.)
Assign the firstname 'Short firstname' to all persons which have a firstname with less than 5 characters.
-- Hint: Some implementations use a different function name: length() or len().UPDATEpersonSETfirstname='Short firstname'WHEREcharacter_length(firstname)<5;-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
Assign the firstname 'No hobby' to all persons which have no hobby.
UPDATEpersonSETfirstname='No hobby'WHEREidNOTIN(SELECTperson_idFROMperson_hobby);-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
Assign the firstname 'Sportsman' to all persons performing one of the hobbies 'Underwater Diving' or 'Yoga'.
UPDATEpersonSETfirstname='Sportsman'WHEREidIN-- The subquery must join to the table 'hobby' to see their column 'hobbyname'.(SELECTph.person_idFROMperson_hobbyphJOINhobbyhONph.hobby_id=h.idANDh.hobbynameIN('Underwater Diving','Yoga'));-- Check the resultSELECT*FROMperson;-- revoke the changesROLLBACK;
Hint: Be careful and deactivateAUTOCOMMIT.
The DELETE command removes rows from a table.
DELETEFROM<tablename>WHERE<search_condition>;
The syntax is straightforward as we do not need to specify any column name - rows are deleted as a whole and not partly. As usual, the search condition specifies the criterion which identifies the affected rows. It can involve zero, one, or more rows. If we omit the WHERE keyword and the search conditionall rows are affected.
-- Delete one rowDELETEFROMpersonWHERElastname='Burton';-- It's only a test. Restore the row.ROLLBACK;
The information about Mr. Burton was deleted and restored again.
We present some more information about the DELETE commandhere. There are also some comments to the interconnection with theTRUNCATE command.
Delete the hobby 'Yoga'.
-- Delete one rowDELETEFROMhobbyWHEREhobbyname='Yoga';-- or: WHERE id = 6;ROLLBACK;-- if we want to restore the rowCOMMIT;-- if we want to commit our work-- Check the resultSELECT*FROMhobby;
Delete all relations between persons and hobbies. Check result. Restore all rows.
-- compact syntax - great impactDELETEFROMperson_hobby;-- Check the resultSELECT*FROMperson_hobby;-- restore everythingROLLBACK;
Hint: Be carefull and deactivateAUTOCOMMIT.
Because the DELETE command deletes rows as a whole and not partly, the syntax is straightforward. Its structure was shown on aprevious page. The page on hand offers only one addition: The WHERE clause isn't limited to simple conditions like 'id = 10' but may contain a subquery. This gives the command much more flexibility.
The use of subqueries as part of a DELETE command is identical to its use within anUPDATE orSELECT command.
There is another command for the deletion of rows. TheTRUNCATE command is very similar to DELETE. TRUNCATE deletesall rows of a table and shows better performance. But it has no mechanism to choose individual rows.
The example command deletes contact information from persons who are born in San Francisco.
-- Delete rows depending on a criteria which resides in a different table.DELETEFROMcontactWHEREperson_idIN(SELECTidFROMpersonWHEREplace_of_birth='San Francisco');-- It's only a test. Restore the rows.ROLLBACK;
Correlated subqueries in combination with DELETE commands, are not supported by all implementations.
It often happens that the DBMS rejects DELETE commands because Foreign Key constraints will be violated during its execution. E.g.: if the command tries to delete a person to whom a contact or hobby information is known, the command fails (as a whole). To overcome such situations, there are different strategies:
Delete hobby information for family Goldstein.
DELETEFROMperson_hobbyWHEREperson_idIN(SELECTidFROMpersonWHERElastname='Goldstein');-- Refrain from deleting the hobby itself - because:-- a) The hobby may be allocated to a different person.-- b) After the information in person_hobby is deleted, there is no longer the possibility to get-- to old assignment between person and hobby.-- It's only a test. Restore the rows.ROLLBACK;
Hint: Be carefull and deactivateAUTOCOMMIT.
In many cases, applications want to store rows in the database without knowing whether these rows previously exist in the database or not. If the rows exist, they must use the UPDATE command, if not, the INSERT command. To do so, the following construct is often used:
-- pseudocodeIF(SELECTCOUNT(*)=0...)THENINSERT...ELSEUPDATE...;
This situation is unpleasant in many ways:
To overcome the disadvantages, the SQL standard defines a MERGE command, which contains the complete code shown above in one single statement. The MERGE performs an INSERT or an UPDATE depending on the existence of individual rows at the target table.
-- Define target, source, match criterion, INSERT and UPDATE within one single commandMERGEINTO<target_table><target_table_alias>-- denote the target tableUSING<syntaxhighlight_table><syntaxhighlight_table_alias>-- denote the source tableON(<match_criterion>)-- define the 'match criterion' which compares the source and-- target rows with the same syntax as in any WHERE clauseWHENMATCHEDTHENUPDATESETcolumn1=value1[,column2=value2...]-- a variant of the regular UPDATE commandWHENNOTMATCHEDTHENINSERT(column1[,column2...])VALUES(value1[,value2...])-- a variant of the regular INSERT command;
The target table is named after the MERGE INTO keyword, the source table after the USING keyword.
The comparison between target rows and source rows, which is necessary to decide between INSERT and UPDATE, is specified after the ON keyword with a syntax, which is identical to the syntax of a WHERE clause. If this comparison matches, the UPDATE will be performed, else the INSERT. In simple cases, the comparison compares Primary Key or Foreign Key columns. But it is also possible to use very sophisticated conditions on any column.
In the 'MATCHED' case, a variant of the UPDATE follows. It differs from the regular UPDATE command in that it has no table name (the table name is already denoted after the MERGE INTO) and no WHERE clause (it uses the match criterion after the ON keyword).
In the 'NOT MATCHED' case, a variant of the INSERT follows. For the same reason as before, the target table is not named within the INSERT.
Create a table 'hobby_shadow' to store some of the 'hobby' rows. The subsequent MERGE command shall perform an INSERT or an UPDATE depending on the existence of correlating rows.
-- store every second row in a new table 'hobby_shadow'CREATETABLEhobby_shadowASSELECT*FROMhobbyWHEREMOD(id,2)=0;SELECT*FROMhobby_shadow;-- INSERT / UPDATE depending on the column 'id'.MERGEINTOhobby_shadowt-- the targetUSING(SELECTid,hobbyname,remarkFROMhobby)s-- the sourceON(t.id=s.id)-- the 'match criterion'WHENMATCHEDTHENUPDATESETremark=concat(s.remark,' Merge / Update')WHENNOTMATCHEDTHENINSERT(id,hobbyname,remark)VALUES(s.id,s.hobbyname,concat(s.remark,' Merge / Insert'));COMMIT;-- Check the resultSELECT*FROMhobby_shadow;
The MERGE command handlesall rows, but there is only 1 round-trip between the application and the DBMS. Some of the rows are handled by the INSERT part of MERGE, others by its UPDATE part. This distinction may be observed by the last part of the column 'remark'.
Typical use cases for the MERGE command areETL processes. Often those processes have to aggregate some values for a grouping criterion (eg: a product line) over a time period. The first access per product line and period has to insert new rows with given values, subsequent accesses have to update them by increasing values.
The SQL standard defines some more features within the MERGE command.
WHEN clause
The WHEN MATCHED and WHEN NOT MATCHED clauses may be extended by an optional query expression likeAND (place_of_birth = 'Dallas'). As a consequence, it's possible to use a series of WHEN MATCHED / WHEN NOT MATCHED clauses.
...WHENMATCHEDAND(t.hobby_nameIN('Fishing','Underwater Diving'))THENUPDATESETremark=concat('Water sports: ',t.remark)WHENMATCHEDAND(t.hobby_nameIN('Astronomy','Microscopy','Literature'))THENUPDATESETremark=concat('Semi-professional leisure activity: ',t.remark)WHENMATCHEDTHENUPDATESETremark=concat('Leisure activity: ',t.remark)...-- The same is possible with WHEN NOT MATCHED in combination with INSERT
DELETE
Within a WHEN MATCHED clause, it is possible to use a DELETE command instead of an UPDATE to remove the matched row. This feature may be combined with the previously presented extension by an optional query expression. In the SQL standard, the DELETE command is not applicable to the WHEN NOT MATCHED clause.
-- Update 'Fishing' and 'Underwater Diving'. Delete all others which have a match between source and target....WHENMATCHEDAND(t.hobby_nameIN('Fishing','Underwater Diving'))THENUPDATESETremark=concat('Water sports: ',t.remark)WHENMATCHEDTHENDELETE...
The MERGE command is clearly defined by standard SQL. The command itself, as well as the extensions described before, are implemented by a lot of DBMS. Deviating from the standard most implementations unfortunately use different and/or additional keywords and - sometimes - different concepts. Even the introductive keywords MERGE INTO may differ from the standard.
A) Create a new table 'contact_merge' with the same structure as 'contact'.
B) Copy row number 3 from 'contact' to 'contact_merge'.
C) Use the MERGE command to insert/update all E-Mail-adresses from 'contact' to 'contact_merge' and add the e-mail-protocol name to the contact values (prepend column contact_value by the string 'mailto:').
-- Create table and copy one rowCREATETABLEcontact_mergeASSELECT*FROMcontactWHEREid=3;SELECT*FROMcontact_merge;-- INSERT / UPDATE depending on the column 'id'.MERGEINTOcontact_merget-- the targetUSING(SELECTid,person_id,contact_type,contact_valueFROMcontactWHEREcontact_type='email')s-- the sourceON(t.id=s.id)-- the 'match criterion'WHENMATCHEDTHENUPDATESETcontact_value=concat('mailto:',t.contact_value)WHENNOTMATCHEDTHENINSERT(id,person_id,contact_type,contact_value)VALUES(s.id,s.person_id,s.contact_type,concat('mailto:',s.contact_value));COMMIT;-- Check the resultSELECT*FROMcontact_merge;
The TRUNCATE TABLE command deletesall rows of a table without causing any triggered action. Unlike the DELETE command, it contains no WHERE clause to specify individual rows.
With respect to the TRUNCATE TABLE command, most DBMS show significantly better performance than with DELETE command. This results from the fact that the DBMS can empty the table (and its indexes) as a whole. It's not necessary to access individual rows.
'TRUNCATE TABLE t2' will fail. This holds true independent from the question whether any row oft1 refers actually to one of the rows oft2 or not. The DBMS checks only the existence of the Forgeign-Key-constraint definition.
The syntax of the TRUNCATE TABLE command is very simple.
TRUNCATETABLE<tablename>;
-- Delete ALL rows of the table 'myTable'TRUNCATETABLEmyTable;-- In most DBMS ROLLBACK is not possible - in opposite to situations with a DELETE command.
To illustrate the difference between the TRUNCATE TABLE command and the DELETE command (without a WHERE clause), one can imagine a trucker, who wants to empty a trailer full of sand at a construction site. To do so, he has two possibilities; either empty the trailer by tilting it - this corresponds to the TRUNCATE TABLE command -, or climb onto the trailer and throw down one grain of sand after the next - this corresponds to the DELETE command.
Delete all rows of table 'person_hobby' using the DELETE command.
Verify that there are no rows left in 'person_hobby'.
Delete all rows of table 'hobby' using the TRUNCATE TABLE command.
What will happen? (Consider that there is an FK constraint from the table empty 'person_hobby' to 'hobby'.)
-- Delete all rows of 'person_hobby' with a DELETE commandDELETEFROMperson_hobby;COMMIT;-- Are there any rows?SELECTcount(*)FROMperson_hobby;-- Try TRUNCATE TABLE command:TRUNCATETABLEhobby;-- An exception will be thrown. Although there is no row in 'person_hobby' referring a row in 'hobby',-- the definition of the Foreign Key constraint exists. This is the reason for the exception.
What will happen in the above example, if the TRUNCATE TABLE command is replaced by a DELETE command?
-- As there is no row in 'person_hobby' referring to 'hobby', the DELETE command deletes all rows in 'hobby'.DELETEFROMhobby;COMMIT;
The original data of the example database can be reconstructed as shown on theexample database data page.
DBMS offers a special service. We canundo a single or even multiple consecutive write and delete operations. To do so, we use the command ROLLBACK. When modifying data, the DBMS writes in a first step all new, changed, or deleted data to a temporary space. During this stage, the modified data is not part of the 'regular' database. If we are sure the modifications shall apply, we use the COMMIT command. If we want to revert our changes, we use the ROLLBACK command. All changes up to the finally COMMIT or ROLLBACK are considered to be part of a so-calledtransaction.
The syntax of COMMIT and ROLLBACK is very simple.
COMMITWORK;-- commits all previous INSERT, UPDATE and DELETE commands, which-- occurred since last COMMIT or ROLLBACKROLLBACKWORK;-- reverts all previous INSERT, UPDATE and DELETE commands, which-- occurred since last COMMIT or ROLLBACK
The keyword 'WORK' is optional.
The feature AUTOCOMMIT automatically performs a COMMIT after every write operation (INSERT, UPDATE, or DELETE). This feature is not part of the SQL standard, but is implemented and activated by default in some implementations. If we want to use the ROLLBACK command, we must deactivate the AUTOCOMMIT. (After an - automatic or explicit - COMMIT command, a ROLLBACK command is syntactically okay, but it does nothing as everything is already committed.) Often we can deactivate the AUTOCOMMIT with a separate command like 'SET autocommit = 0;' or 'SET autocommit off;' or by clicking an icon on a GUI.
To test the following statements, it's necessary to work without AUTOCOMMIT.
Let us insert a new person into the database and test the COMMIT.
-- Store a new person with id 99.INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn,weight)VALUES(99,'Harriet','Flint',DATE'1970-10-19','Dallas','078-05-1120',65);-- Is the new person really in the database? The process which executes the write operation will see its results,-- even if they are actually not committed. (One hit expected.)SELECT*FROMpersonWHEREid=99;-- Try COMMIT commandCOMMIT;-- Is she still in the database? (One hit expected.)SELECT*FROMpersonWHEREid=99;
Now we remove the person from the database.
-- Remove the new personDELETEFROMpersonWHEREid=99;-- Is the person really gone? Again, the process which performs the write operation will see the changes, even-- if they are actually not committed. (No hit expected.)SELECT*FROMpersonWHEREid=99;-- Try COMMIT commandCOMMIT;-- Is the person still in the database? (No hit expected.)SELECT*FROMpersonWHEREid=99;
So far, so boring.
The exciting command is the ROLLBACK. It restores changes of previous INSERT, UPDATE or DELETE commands.
We delete and restore Mrs. Hamilton from our example database.
DELETEFROMpersonWHEREid=3;-- Lisa Hamilton-- no hit expectedSELECT*FROMpersonWHEREid=3;-- ROLLBACK restores the deletionROLLBACK;-- ONE hit expected !!! Else: check AUTOCOMMITSELECT*FROMpersonWHEREid=3;
The ROLLBACK is not restricted to one single row. It may affect several rows, several commands, different kinds of commands, and even several tables.
-- same as aboveDELETEFROMpersonWHEREid=3;-- destroy all e-mail addressesUPDATEcontactSETcontact_value='unknown'WHEREcontact_type='email';-- verify modificationsSELECT*FROMperson;SELECT*FROMcontact;-- A single ROLLBACK command restores the deletion in one table and the modifications in another tableROLLBACK;-- verify ROLLBACKSELECT*FROMperson;SELECT*FROMcontact;
Supose thehobby table contains 9 rows and theperson table 10 rows. We execute the following operations:
add 3 hobbies
add 4 persons
commit
add 5 hobbies
add 6 persons
rollback
How many rows are in the hobby table?
12
How many rows are in the person table?
14
One of the basic steps during database development cycles is the fixing of decisions about the table structure. To do so, there is the CREATE TABLE statement with which developers define tables together with their columns and constraints.
Because a lot of features may be activated by the command, its syntax is a little bit complex. This page shows the most important parts. The syntax is not straight forward. At some points it is possible to use alternative formulations to express the same purpose, e.g. the Primary Key may be defined within the column definition as a column constraint, at the end of the command as a table constraint or as a separate stand-alone command 'ALTER TABLE ADD CONSTRAINT ...;' .
CREATETABLE<tablename>(<column_name><data_type><default_value><identity_specification><column_constraint>,<column_name><data_type><default_value><column_constraint>,...,<table_constraint>,<table_constraint>,...);
After the introductory key words CREATE TABLE, the tablename is specified. Within a pair of parentheses, a list of column definitions follows. Each column is defined by its name, data type, an optional default value, and optional constraints for this individual column.
After the list of column definitions, developers can specify table constraints like Primary and Foreign Keys, Unique conditions, and general column conditions.
An first example was shown at the pageCreate a simple Table and a second one here:
CREATETABLEtest_table(-- define columns (name / type / default value / column constraintidDECIMALPRIMARYKEY,part_numberCHAR(10)DEFAULT'n/a'NOTNULL,part_nameVARCHAR(500),stateDECIMALDEFAULT-1,-- define table constraints (eg: 'n/a' shall correlate with NULL)CONSTRAINTtest_checkCHECK((part_number='n/a'ANDpart_nameISNULL)OR(part_number!='n/a'ANDpart_nameISNOTNULL)));
The table consists of 4 columns. All of them have a data type and some a default value. The columnid acts as the Primary Key. The table constrainttest_check guarantees thatpart_name is mandatory ifpart_number is recorded.
The standard defines a lot of predefined data types: character strings of fixed and variable size, character large objects (CLOB), binary strings of fixed and variable size, binary large objects (BLOB), numeric, boolean, datetime, interval, xml. Beyond, there are complex types like: ROW, REF(erence), ARRAY, MULTISET and user-definded types (UDT). The predefined data types are explained on thenext page. To keep things simple, we use on this page only CHAR, VARCHAR, and DECIMAL.
A column can have a default value. Its data type corresponds to the type of the column. It may be a constant value like the number -1 or the string 'n/a', or it is a system variable or a function call to determine dynamic values like the username or the actual timestamp.
The default clause affects those INSERT and MERGE commands, which do not specify the column. In our example database theperson table has the columnweight with the default value 0. If we omit this column in an INSERT command, the DBMS will store the value 0.
-- This INSERT command omits the 'weight' column. Therefore the value '0' (which is different from-- the NULL value) is stored in the weight column.INSERTINTOperson(id,firstname,lastname,date_of_birth,place_of_birth,ssn)VALUES(11,'Larry','Goldstein',date'1970-11-20','Dallas','078-05-1120');COMMIT;-- This SELECT retrieves the row ...SELECT*FROMpersonWHEREid=11ANDweight=0;-- ... but not this one:SELECT*FROMpersonWHEREid=11ANDweightISNULL;
Theidentity specification serves for the generation of a series of unique values that act as the Primary Key to the table's rows. The standard defines the syntax as: "GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY". Unfortunately, most DBMS vendors do not support this formulation. Instead, they offer different syntaxes and even different concepts to generate primary key values. Some use a combination of generators/sequences and triggers, others a special data type, or different keywords.
An overview about the wide spread of implementations is available in the wikibookSQL Dialects Reference: Auto-increment_column.
The column constraint clause specifies conditions which all values must meet. There are different column constraint types:
TheNOT NULL phrase defines, that it is not allowed to store the NULL value in the column.
-- The column col_1 is per definition not allowed to hold the NULL valueCREATETABLEt1(col_1DECIMALNOTNULL);-- This INSERT command will failINSERTINTOt1(col_1)values(NULL);-- The same applies to the following UPDATE commandINSERTINTOt1(col_1)values(5);UPDATEt1SETcol_1=NULL;
ThePRIMARY KEY phrase defines that the column acts as the Primary Key of the table. This implies that the column is not allowed to store a NULL value and that the values of all rows are distinct from each other.
CREATETABLEt2(col_1DECIMALPRIMARYKEY);-- This INSERT will fail because a primary key column is not allowed to store the NULL value.INSERTINTOt2(col_1)VALUES(NULL);-- This INSERT worksINSERTINTOt2(col_1)VALUES(5);-- But the next INSERT will fail, because only one row with the value '5' is allowed.INSERTINTOt2(col_1)VALUES(5);
TheUNIQUE constraint has a similar meaning as the PRIMARY KEY phrase. But there are two slight differences.
First, the values of different rows of a UNIQUE column are not allowed to be equal, which is the same as with PK. But they are allowed to hold the NULL value, which is different from Primary Key. The existence of NULL values has an implication. As the termnull = null never evaluates totrue (it evaluates tounknown) there may exist multiple rows with the NULL value in a column which is defined to be UNIQUE.
Second, only one Primary Key definition per table is allowed. In contrast, there may be many UNIQUE constraints (on different columns).
CREATETABLEt3(col_1DECIMALUNIQUE);-- works wellINSERTINTOt3(col_1)VALUES(5);-- fails because there is another row with value 5INSERTINTOt3(col_1)VALUES(5);-- works wellINSERTINTOt3(col_1)VALUES(null);-- works alsoINSERTINTOt3(col_1)VALUES(null);-- check the resultsSELECT*FROMt3;
TheFOREIGN KEY condition defines that the column can hold only those values, which are also stored in a different column of (the same or) another table. This different column has to be UNIQUE or a Primary Key, whereas the values of the foreign key column itself may hold identical values for multiple rows. The consequence is that one cannot create a row with a certain value in this column before there is a row with exactly this certain value in the referred table. In our example database, we have acontact table whose columnperson_id refers to the id of persons. It makes sense that one cannot store contact values before storing the appropriate person.
Foreign Keys are the technique to realize one-to-many (1:m) relationships.
-- A table with a column which refers to the 'id' column of table 'person'CREATETABLEt4(col_1DECIMALREFERENCESperson(id));-- This INSERT works as in table 'person' of our example database there is a row with id = 3.INSERTINTOt4(col_1)VALUES(3);-- This statement will fail because in 'person' there is no row with id = 99.INSERTINTOt4(col_1)VALUES(99);
Column checks inspect the values of the column to see whether they meet the defined criterion. Within such column checks, only the actual column is visible. If a condition covers two or more columns (e.g., col_1 > col_2) a table check must be used.
-- 'col_1' shall contain only values from 1 to 10.-- A hint to MySQL users: MySQL before 8.0.16 accepts the syntax of column checks - but ignores the definitions silently. The same applies to MariaDB before 10.2.1.CREATETABLEt5(col_1DECIMALCHECK(col_1BETWEEN1AND10));-- This INSERT works:INSERTINTOt5(col_1)VALUES(3);-- This statement will fail:INSERTINTOt5(col_1)VALUES(99);
Table constraints define rules which are mandatory for the table as a whole. Their semantic and syntax overlaps partially with the previous showncolumn constraints.
Table constraints are defined after the definition of all columns. The syntax starts with the keyword CONSTRAINT, followed by an optional name. The following example includes the optional namest6_pk,t6_ik andt6_fk. It is a good practice to include names. In the case of an error exception, most DBMS will include this name as part of related error messages - if a name isn't defined, the DBMS may use its internal naming convention, which can be cryptic.
In the same manner as shown in thecolumn constraints part Primary Key, UNIQUE and Foreign Key conditions can be expressed as table constraints. The syntax differs slightly from the column constraint syntax; the semantic is identical.
-- A table with a PK column, one UNIQUE column and a FK column.CREATETABLEt6(col_1DECIMAL,col_2CHAR(10),col_3DECIMAL,CONSTRAINTt6_pkPRIMARYKEY(col_1),-- 't6_pk' is the name of the constraintCONSTRAINTt6_ukUNIQUE(col_2),CONSTRAINTt6_fkFOREIGNKEY(col_3)REFERENCESperson(id));
Similar tocolumn constraints part NOT NULL conditions and simple column checks can be expressed as table expressions.
CREATETABLEt7(col_1DECIMAL,col_2DECIMAL,CONSTRAINTt7_col_1_nnCHECK(col_1ISNOTNULL),CONSTRAINTt7_col_2_checkCHECK(col_2BETWEEN1and10));
If a condition affects more than one column, it must be expressed as a table constraint.
CREATETABLEt8(col_1DECIMAL,col_2DECIMAL,col_3DECIMAL,col_4DECIMAL,-- col_1 can hold only those values which are greater than col_2CONSTRAINTt8_check_1CHECK(col_1>col_2),-- If col_3 is NULL, col_4 must be NULL alsoCONSTRAINTt8_check_2CHECK((col_3ISNULLANDcol_4ISNULL)OR(col_3ISNOTNULLANDcol_4ISNOTNULL)));-- These two INSERTs work as they meet all conditionsINSERTINTOt8VALUES(1,0,null,null);INSERTINTOt8VALUES(2,0,5,5);-- Again: MySQL ignores check conditions silently-- This INSERT fails because col_1 is not greater than col_2INSERTINTOt8VALUES(3,6,null,null);-- This INSERT fails because col_3 is not null and col_4 is nullINSERTINTOt8VALUES(4,0,5,null);
As you have seen, some constraints may be defined as part of the column definition, which is called acolumn constraint, or as a separatetable constraint. Table constraints have two advantages. First, they are a little bit more powerful.
Second, they do have their own name! This helps to understand system messages. Furthermore, it opens the possibility to manage constraints after the table exists and contains data. The ALTER TABLE statement can deactivate, activate, or delete constraints. To do so, you have to know their name.
DROPTABLEt1;DROPTABLEt2;DROPTABLEt3;DROPTABLEt4;DROPTABLEt5;DROPTABLEt6;DROPTABLEt7;DROPTABLEt8;
Create a table 'company' with columns 'id' (numeric, primary key), 'name' (strings of variable size up to 200), 'isin' (strings of length 12, not nullable, unique values).
Create a solution with column constraints only and another one with table constraints only.
-- column constraints onlyCREATETABLEcompany_1(idDECIMALPRIMARYKEY,nameVARCHAR(200),isinCHAR(12)NOTNULLUNIQUE);-- table constraints onlyCREATETABLEcompany_2(idDECIMAL,nameVARCHAR(200),isinCHAR(5),CONSTRAINTcompany_2_pkPRIMARYKEY(id),CONSTRAINTcompany_2_ukUNIQUE(isin),CONSTRAINTcompany_2_check_isinCHECK(isinISNOTNULL));
Create a table 'accessory' with columns 'id' (numeric, primary key), 'name' (strings of variable size up to 200, unique), 'hobby_id' (decimal, not nullable, foreign key to column 'id' of table 'hobby').
Create a solution with column constraints only and another one with table constraints only.
-- column constraints onlyCREATETABLEaccessory_1(idDECIMALPRIMARYKEY,nameVARCHAR(200)UNIQUE,hobby_idDECIMALNOTNULLREFERENCEShobby(id));-- table constraints onlyCREATETABLEaccessory_2(idDECIMAL,nameVARCHAR(200),hobby_idDECIMAL,CONSTRAINTaccessory_2_pkPRIMARYKEY(id),CONSTRAINTaccessory_2_ukUNIQUE(name),CONSTRAINTaccessory_2_check_1CHECK(hobby_idISNOTNULL),CONSTRAINTaccessory_2_fkFOREIGNKEY(hobby_id)REFERENCEShobby(id));-- Test some legal and illegal valuesINSERTINTOaccessory_1VALUES(1,'Fishing-rod',2);COMMIT;-- ...
The SQL standard knows three kinds of data types
This page presents only thepredefined data types.Constructed types are one of ARRAY, MULTISET, REF(erence) orROW.User-defined types are comparable to classes in object-oriented language with their own constructors, observers, mutators, methods, inheritance, overloading, overwriting, interfaces, and so on.
The standard groups predefined data types into types with similar characteristics.
Character types hold printable characters, binary types any binary data. Both may have a fixed or variable size with an upper limit. If the upper limit exceeds a certain value, the type is a 'large object' with special methods and functions.
Exact numeric types hold numeric values without digits after the decimal or with a firm number of digits after the decimal. Please note that the standard does not define a separatedata type 'auto-increment' for generating primary keys. Instead it defines the phrase 'GENERATED ALWAYS AS IDENTITY' as part of the CREATE TABLE statement, seeCREATE TABLE statement orauto-increment-columns.
Approximate numeric types hold numeric values with an implementation-defined precision (after the decimal).
Temporal types hold values for INTERVAL (a certain range on the time bar), DATE (year, month, day), TIME with and without TIMEZONE (name of timezone, hour, minute, second including fraction) and TIMESTAMP with and without TIMEZONE (name of timezone, year to second including fraction).
The boolean data type holds the two valuestrue andfalse.
Part 14 of the SQL standard extends the list of predefined data types by introducing the data type XML (Oracle calls it XMLType) together with a bunch of particular functions. Columns of this type hold XML instances.
In the outdated SQL-2 standard, there was a data type 'BIT'. This data type is no longer part of the standard.
Most DBMS implement the majority of predefined data types, but there are some exceptions. Also, the naming differs slightly. An overview of the major implementations is available in the WikibookSQL_Dialects_Reference.
Data types are used within the CREATE TABLE statement as part of column definitions - or during CAST operations.
CREATETABLE<tablename>(<column_name><data_type>...,<column_name><data_type>...,...);
A series of printable characters - which is a string - can be stored withincharacter string types. If all rows of a table use the same fixed size for the strings, the data type isCHAR(<n>) where <n> is the size of the strings. If the size varies from row to row, the data typeVARCHAR(<n>) defines thatup to<n> characters can be stored in the column. So<n> defines the upper limit for this column. The maximum value for<n> depends on the used DBMS implementation. If applications need to store longer strings than it is allowed by this upper system limit, the data typeCLOB must be used. AlsoCLOB has its own upper limit, but this is significantly greater than the upper limit ofVARCHAR.
-- A table with columns of fixed and variable size strings and a CLOB stringCREATETABLEdatatypes_1(idDECIMALPRIMARYKEY,col_1CHAR(10),-- exactly 10 characterscol_2VARCHAR(150),-- up to 150 characterscol_3CLOB-- very large strings (MySQL denotes this data type: 'LONGTEXT'));
Hint: Unlike other programming languages, SQL does not distinguish between acharacter data type and astring data type. It knows only thecharacter string data types CHAR, VARCHAR and CLOB.
Binary data types are similar to character data types. They differ in that they accept a different range of bytes. Binary data types accept all values.
-- A table with columns of fixed and variable size binary data and a BLOBCREATETABLEdatatypes_2(idDECIMALPRIMARYKEY,col_1BINARY(10),-- exactly 10 bytecol_2VARBINARY(150),-- up to 150 bytecol_3BLOB-- very large data: jpeg, mp3, ...);
A hint to Oracle users: The data typeBINARY is not supported, the data typeVARBINARY is denoted asRAW and is deprecated. Oracle recommends the use ofBLOB.
Exact numeric types hold numeric values without digits after the decimal or with a firm number of digits after the decimal. All exact numeric types are signed.
NUMERIC(<p>,<s>) andDECIMAL(<p>,<s>) denotes two types which are nearly the same. <p> (precision) defines a fixed number of all digits within the type and <s> (scale) defines how many of those digits follow the decimal place. Numeric values with more than (p - s) digits before the decimal place cannot be stored and numeric values with more than s digits after the decimal place are truncated to s digits after the decimal place. p and s are optional. It must always be: p ≥ s ≥ 0 and p > 0.
SMALLINT, INTEGER and BIGINT denote data types without a decimal place. The SQL standard did not define their size, but the size of SMALLINT shall be smaller than the size of INTEGER and the size of INTEGER shall be smaller than the size of BIGINT.
-- A table using five exact numeric data typesCREATETABLEdatatypes_3(idDECIMALPRIMARYKEY,col_1DECIMAL(5,2),-- three digits before the decimal and two behindcol_2SMALLINT,-- no decimal pointcol_3INTEGER,-- no decimal pointcol_4BIGINT-- no decimal point. (Not supported by Oracle.));
Approximate numeric types hold numeric values with an implementation-defined precision (after the decimal). All approximate numeric types are signed. Their primary use cases are scientific computations.
There are three types:FLOAT (<p>),REAL andDOUBLE PRECISION, where p denotes the guaranteed precision of theFLOAT data type. The precision ofREAL andDOUBLE PRECISION is implementation defined.
-- A table using the approximate numeric data typesCREATETABLEdatatypes_4(idDECIMALPRIMARYKEY,col_1FLOAT(2),-- two or more digits after the decimal placecol_2REAL,col_3DOUBLEPRECISION);
Data types with respect to temporal aspects are:DATE, TIME, TIMESTAMP andINTERVAL.
DATE stores year, month and day.TIME stores hour, minute and second.TIMESTAMP stores year, month, day, hour, minute and second. Seconds can contain digits after the decimal.TIME andTIMESTAMP can contain the name of a TIME ZONE.
The SQL standard defines two kinds of INTERVALs. The first one is an interval with year and month, the second one is an interval with day, hour, minute, and second.
-- A table using temporal data typesCREATETABLEdatatypes_5(idDECIMALPRIMARYKEY,col_1DATE,-- store year, month and day (Oracle: plus hour, minute and seconds)col_2TIME,col_3TIMESTAMP(9),-- a timestamp with 9 digits after the decimal of secondscol_4TIMESTAMPWITHTIMEZONE,-- a timestamp including the name of a timezonecol_5INTERVALYEARTOMONTH,col_6INTERVALDAYTOSECOND(6)-- an interval with 6 digits after the decimal of seconds);
A hint to Oracle users: The data typeTIME is not supported. UseDATE instead.
A hint to MySQL users: The use of TIME ZONE as part of data types is not supported. MySQL implements a different concept to handle time zones. Fractions of seconds are not supported. The data typeINTERVAL is not supported, but there is a data valueINTERVAL.
SQL has a 3-value-logic. It knows the boolean valuestrue,false andunknown. Columns of the boolean data type can store one of the two valuestrue orfalse.unknown is represented by storing no value, which is the NULL indicator.
-- A table with one column of booleanCREATETABLEdatatypes_6(idDECIMALPRIMARYKEY,col_1BOOLEAN-- not supported by Oracle);
Part 14 of the SQL standard extends the list of predefined data types by introducing the data type XML. The standard also defines a wide range of particular functions for this data type.
-- A table with one column of data type XMLCREATETABLEdatatypes_7(idDECIMALPRIMARYKEY,col_1XML);
A hint to Oracle users: The data typeXML is denoted asXMLType.
A hint to MySQL users: The data typeXML is not supported.
In the context of data types, the standard knowsdomains. The purpose of domains is to constrain the set of valid values that can be stored in a column. The domain-concept is a very early predecessor of user-defined types and may be outdated.
DROPTABLEdatatypes_1;DROPTABLEdatatypes_2;DROPTABLEdatatypes_3;DROPTABLEdatatypes_4;DROPTABLEdatatypes_5;DROPTABLEdatatypes_6;DROPTABLEdatatypes_7;
Create a table 'company' with columns 'id' (numeric, primary key), 'name' (strings of variable size up to 200), 'isin' (strings of length 12), 'stock_price' (numeric with 2 digits before and 2 after the decimal), 'description_text' (a very large string) and description_doc (any binary format).
CREATETABLEcompany(idDECIMALPRIMARYKEY,nameVARCHAR(200),isinCHAR(12),stock_priceDECIMAL(4,2),description_textCLOB,description_docBLOB);
Foreign Keys (FK) define a directed reference from one table (the child) to another table (the parent). This reference acts as long as the involved columns of the two tables contain identical values. It couples one row of the child table to a single row of the parent table - a row of the parent table may be coupled by many rows of the child table.
E.g.: You may have the tabledepartment with columnid and the tableemployee with columndept_id. If you want to assign an employee to a distinct department, you store the department-id in its columndept_id. This can be done in every case-independent from any Foreign Key definition. But in such cases people often have two additional requirements: First, employees shall only be assigned to departments which really exist. Second, as long as employees are assigned to a distinct department, it shall be impossible to delete this department. The main purpose of Foreign Keys is to guarantee these two requirements.
In other words: Foreign Keys guarantee thatno orphans will arise.
Within RDBMs, identical values are used to link rows of different - and sometimes of the same - table together. Because this linking works on the basis of values and not of any link or special reference, it has no direction. In general, we call this technique aJOIN.Foreign Keys have a very similar concept because they also link rows with identical values together. But there are important differences:
-- As part of CREATE TABLE commandCREATETABLE<table_name>(...CONSTRAINT<constraint_name>FOREIGNKEY(<column_name>)REFERENCES<parent_table_name>(<other_column_name>));-- As part of ALTER TABLE commandALTERTABLE<table_name>ADDCONSTRAINT<constraint_name>...;-- same as aboveALTERTABLE<table_name>DROPCONSTRAINT<constraint_name>;-- throw the definition away
Rules:
The example defines the tablesdepartment andemployee. The Foreign Key definition ofemployee declaresdepartment as the parent table ofemployee.
---- The parent table: DEPARTMENTCREATETABLEdepartment(idDECIMAL,dept_noCHAR(10),dept_nameVARCHAR(100),CONSTRAINTdept_pkPRIMARYKEY(id));-- The child table: EMPLOYEECREATETABLEemployee(idDECIMAL,emp_nameVARCHAR(100),dept_idDECIMAL,CONSTRAINTemp_pkPRIMARYKEY(id),CONSTRAINTemp_dept_fkFOREIGNKEY(dept_id)REFERENCESdepartment(id));-- This INSERT will fail because currently there is no department with id 10.INSERTINTOemployee(id,emp_name,dept_id)VALUES(1,'Mike Baker',10);COMMIT;-- It's necessary to store the department first.INSERTINTOdepartment(id,dept_no,dept_name)VALUES(10,'D10','E-Bike Development');INSERTINTOemployee(id,emp_name,dept_id)VALUES(1,'Mike Baker',10);COMMIT;-- The department may have a lot of employeesINSERTINTOemployee(id,emp_name,dept_id)VALUES(2,'Elenore McNeal',10);INSERTINTOemployee(id,emp_name,dept_id)VALUES(3,'Ted Walker',10);COMMIT;-- This DELETE will fail because currently there are employees within the department.DELETEFROMdepartmentWHEREdept_name='E-Bike Development';COMMIT;
This kind of modeling allows the representation of hierarchical tree structures. One or many child nodes (rows) belong to a single parent node (row). In the context of DBMS, this kind of association is called a 1:m relationship.
In the real world, there are more association types than 1:m relationships. Often there are so-called n:m relationships where objects (rows) belong to more than 1 other object (row). Thereby the meaning of parent/child tables gets lost. In ourexample database there is a tablehobby and another tableperson. One person may pursue multiple hobbies. At the same time, multiple persons may pursue the same hobby. This can be designed by creating a third table between the two original tables. The third table holds the IDs of the first and second table. So one can decide which person pursues which hobby.
The technique to realize this n:m situation is the same as shown in the previous chapter with its 1:m association - it is only used twice. We define two Foreign Keys, which start from the 'table-in-the-middle' and refers to the two other tables. In a technical sense, we can say, that the 'table-in-the-middle' is the child table for the two parent tablesperson andhobby.person andhobby are at the same logical level.
--CREATETABLEt1(idDECIMAL,nameVARCHAR(50),-- ...CONSTRAINTt1_pkPRIMARYKEY(id));CREATETABLEt2(idDECIMAL,nameVARCHAR(50),-- ...CONSTRAINTt2_pkPRIMARYKEY(id));CREATETABLEt1_t2(idDECIMAL,t1_idDECIMAL,t2_idDECIMAL,CONSTRAINTt1_t2_pkPRIMARYKEY(id),-- also this table should have its own Primary KeyCONSTRAINTt1_t2_uniqueUNIQUE(t1_id,t2_id),-- every link should occur only onceCONSTRAINTt1_t2_fk_1FOREIGNKEY(t1_id)REFERENCESt1(id),CONSTRAINTt1_t2_fk_2FOREIGNKEY(t2_id)REFERENCESt2(id));
So far, we have assumed that rows of the parent table cannot be deleted if a row in the child table exists, which refers to this parent row. This is the default, but all in all, the SQL standard defines five options to handle this parent/child situation in various ways. The options extend the constraint definition. They are:
Analog to the ON DELETE option, there is an ON UPDATE option. It defines the same five options for the case of changing a column in the parent table, which is referred by the column of a child table.
If ON DELETE or ON UPDATE are not specified, the default action NO ACTION will occur. In some systems, the NO ACTION is implemented in the sense of the RESTRICT option.
An Example:
--CREATETABLEt1_t2(...CONSTRAINTt1_t2_fk_1FOREIGNKEY(t1_id)REFERENCESt1(id)ONUPDATECASCADEONDELETERESTRICT,...);
Hint 1: The concept of updating Primary Keys is controversial.
Hint 2: Not all DBMS support all options.
There is an additional option to decide at what point in time the evaluation of the Foreign Key definition shall occur. The default behavior is to check it with each UPDATE and DELETE command. The second possibility is deferring the check until the end of the transaction, which is the COMMIT command. The purpose of this deferring is to put applications in the position to modify parent tablesbefore child tables (which may be helpful if they utilizeHibernate).
To define this option, the constraint definition must be extended by the keywords [NOT] DEFERRABLE, which are pre- or postfixed by INITIALLY IMMEDIATE (the default) or INITIALLY DEFERRED to specify the initial state after the CREATE TABLE point in time.
--CREATETABLEt1_t2(...CONSTRAINTt1_t2_fk_1FOREIGNKEY(t1_id)REFERENCESt1(id)ONUPDATECASCADEDEFERRABLEINITIALLYIMMEDIATEONDELETERESTRICTDEFERRABLEINITIALLYDEFERRED,...);
Hint: MySQL does not support the DEFERRABLE option, but the Foreign Key checking can be activated and deactivated dynamically by 'SET foreign_key_checks = 0/1;'
Sometimes applications run into cyclic dependencies: Table A contains a reference to table B and vice versa, e.g.: A tableteam contains the columnsid,team_name andteam_leader (which is an id to a player) and the tableplayer contains the columnsid,player_name andteam_id.
--CREATETABLEteam(idDECIMAL,team_nameVARCHAR(50),team_leaderDECIMAL,-- ID of a playerCONSTRAINTteam_pkPRIMARYKEY(id));CREATETABLEplayer(idDECIMAL,player_nameVARCHAR(50),team_idDECIMAL,CONSTRAINTplayer_pkPRIMARYKEY(id));ALTERTABLEteamADDCONSTRAINTteam_fkFOREIGNKEY(team_leader)REFERENCESplayer(id);ALTERTABLEplayerADDCONSTRAINTplayer_fkFOREIGNKEY(team_id)REFERENCESteam(id);
So far, so bad. When the first team-row shall be inserted, the player-row is missed. When the player-row is inserted first, the team-row is missed.
As we have seen above, there is a DEFER option. Using this option, the FK-constraints must be defined such that they are not evaluated immediately with the INSERT commands. They shall be evaluated after all INSERTs at the COMMIT point in time.
-- Throw the above definitions away ...ALTERTABLEteamDROPCONSTRAINTteam_fk;ALTERTABLEplayerDROPCONSTRAINTplayer_fk;-- ... and use DEFERRABLEALTERTABLEteamADDCONSTRAINTteam_fkFOREIGNKEY(team_leader)REFERENCESplayer(id)DEFERRABLEINITIALLYDEFERRED;ALTERTABLEplayerADDCONSTRAINTplayer_fkFOREIGNKEY(team_id)REFERENCESteam(id)DEFERRABLEINITIALLYDEFERRED;
Now we can insert data in any sequence (don't miss to deactivateAUTOCOMMIT).
--INSERTINTOteam(id,team_name,team_leader)VALUES(1,'Wild Tigers',1);INSERTINTOplayer(id,player_name,team_id)VALUES(1,'Johnny Crash',1);-- No checking of Foreign Keys up to hereCOMMIT;-- Commit includes the check of Foreign Keys
Foreign Keys have implications toDROP TABLE andTRUNCATE TABLE commands. As long as a Foreign Key refers to a parent table, this table cannot be dropped (remove structure and data) or truncated (remove data only). This holds true even if there is no actual row referring any row in the parent table - the existence of the Foreign Key is sufficient to refuse DROP and TRUNCATE.
To use DROP or TRUNCATE, it is necessary to drop the constraint first.
Hint: Some implementations offer a DISABLE/ENABLE command to deactivate constraints temporarily.
Is it possible that the parent table of a FK-constraint contains 1 row and the child table is empty?
Yes. Parents without children are absolutely normal.
Is it possible that the child table of a FK-constraint contains 1 row and the parent table is empty?
Yes. Although the main purpose of FK-constraints is the prevention of children without parents (orphans), this situation may occur.
If the column of the child row contains the NULL value, this row relates to no parent row
because 'null = <any value>' evaluates always to UNKNOWN and never to TRUE, even if that <any value> is the NULL value.
Create a tablegenealogy which stores information about people and their ancestors. The columns are: id, first_name, last_name, birth_name, father_id, mother_id.
CREATETABLEgenealogy(idDECIMALPRIMARYKEY,first_nameVARCHAR(100),last_nameVARCHAR(100),birth_nameVARCHAR(100),father_idDECIMAL,mother_idDECIMAL);
Extend the tablegenealogy by two FK-contraints such that the columns 'father_id' and 'mother_id' refer to other rows of this table.
ALTERTABLEgenealogyADDCONSTRAINTgen_fk_1FOREIGNKEY(father_id)REFERENCESgenealogy(id);ALTERTABLEgenealogyADDCONSTRAINTgen_fk_2FOREIGNKEY(mother_id)REFERENCESgenealogy(id);
Insert some data into 'genealogy', e.g.: data from your personal family.
-- For the first rows store NULL in 'father_id' and 'mother_id'!INSERTINTOgenealogy(id,first_name,last_name,birth_name,father_id,mother_id)VALUES(1,'Mike','Miller','Miller',null,null);INSERTINTOgenealogy(id,first_name,last_name,birth_name,father_id,mother_id)VALUES(2,'Eve','Miller','Summer',null,null);INSERTINTOgenealogy(id,first_name,last_name,birth_name,father_id,mother_id)VALUES(3,'Marry','Dylan','Miller',1,2);INSERTINTOgenealogy(id,first_name,last_name,birth_name,father_id,mother_id)VALUES(4,'Henry','Dylan','Dylan',null,3);COMMIT;
The ALTER TABLE command modifies column definitions and table constraints 'on the fly'. This means existing definitions are extended, changed, or deleted, or existing data is cast to a different type, or existing data is evaluated against the new definitions.
-- change column definitionsALTERTABLE<table_name>{ADD|ALTER}[COLUMN]<column_name><column_definition>;ALTERTABLE<table_name>{DROP}[COLUMN]<column_name>;-- change table constraintsALTERTABLE<table_name>{ADD|ALTER}CONSTRAINT<constraint_name><constraint_definition>;ALTERTABLE<table_name>{DROP}CONSTRAINT<constraint_name>;
The following examples are based on the test tablet1.
CREATETABLEt1(idNUMERICPRIMARYKEY,col_1CHAR(4));
The syntax of the ADD COLUMN and ALTER COLUMN phrases are similar to the one shown in thecreate table page.
Existing tables can be extended by additional columns with the ADD COLUMN phrase. Within this phrase, all options of the original Create Table statement are available: data type, default value, NOT NULL, Primary Key, Unique, Foreign Key, Check.
-- add a new column with any characteristicALTERTABLEt1ADDCOLUMNcol_2VARCHAR(100)CHECK(length(col_2)>5);-- Oracle: The key word 'COLUMN' is not allowed.
With the ALTER COLUMN phrase some characteristics of an existing column can be changed
The new definitions must be compatible with the old existing data. If you change, for example, the data type from VARCHAR to NUMERIC, this action can only be successful if it is possible to castall existing VARCHAR data to NUMERIC - the casting of 'xyz' will fail. Casting in the direction from NUMERIC to VARCHAR will be successful as long as the width of the VARCHAR is large enough to store the result.
Hint: Concerning the change of the characteristics of columns, some implementations ignore the syntax of the SQL standard and use other keywords like 'MODIFY'.
ALTERTABLEt1ALTERCOLUMNcol_1SETDATATYPENUMERIC;
ALTERTABLEt1ALTERCOLUMNcol_1SETDEFAULT'n/a';
ALTERTABLEt1ALTERCOLUMNcol_1SETNOTNULL;ALTERTABLEt1ALTERCOLUMNcol_1DROPNOTNULL;
Columns can be dropped from existing tables.
ALTERTABLEt1DROPCOLUMNcol_2;-- Oracle: The key word 'COLUMN' is mandatory.
Hint: As an extension to the SQL standard, some implementations offer a RENAME or SET INVISIBLE command.
Table constraints can be added, modified, or dropped. The syntax is similar to that shown on thecreate table page.
ALTERTABLEt1ADDCONSTRAINTt1_col_1_uniqueUNIQUE(col_1);
ALTERTABLEt1ALTERCONSTRAINTt1_col_1_uniqueUNIQUE(col_1);
Hint: Concerning the change of table constraints, some implementations ignore the syntax of the SQL standard and use other keywords like 'MODIFY'.
ALTERTABLEt1DROPCONSTRAINTt1_col_1_unique;-- MySQL: Not supported. Use 'DROP FOREIGN KEY' or 'DROP INDEX' instead.
Hint: As an extension to the SQL standard, some implementations offer an ENABLE / DISABLE command for constraints.
Add a column 'col_3' to the table 't1': numeric, not null.
ALTERTABLEt1ADDCOLUMNcol_3NUMERICNOTNULL;
Add a Foreign Key from table 't1' column 'col_3' to table 'person' column 'id'.
ALTERTABLEt1ADDCONSTRAINTt1_col_3_fkFOREIGNKEY(col_3)REFERENCESperson(id);
The DROP TABLE command removes the definition and all data of the named table from the database.
DROPTABLE<t1>;
The command handles the table as a whole. It will not fire any trigger. But it considersForeign Key definitions. If any other table refers to the table to be dropped, the DROP TABLE command will fail. The Foreign Key definition must bedropped first.
As the DROP TABLE command handles the table as a whole, it is very fast.
Regular tables are containers to store data for a shorter or longer time periods and to make the data available to various processes. In contrast, sometimes, there is the requirement to handle data for a short time and only for local purposes. The provision of TEMPORARY TABLES accomplishes this. They are subject to the SQL syntax in the same way as regular tables.
The common characteristic of all temporary tables is, that every session (connection) gets itsown incarnation of the temporary table without any side effect to other sessions. This leads to the situation that every session sees only those data, which it has inserted previously. The data isnot shared between different sessions, even if they use the same table name at the same time. Every session works on a different incarnation. A second common characteristic is that with the termination of the session all data of the temporary table is thrown away automatically. An explicit DELETE or DROP TABLE is not necessary.
The concept of temporary tables is similar to the concept of arrays of records within programming languages. The advantage over arrays is the availability of all DML statements known in SQL, eg.: if you need some intermediate data, you can retrieve it from a regular table and store it in a temporary table with one single Insert+Subselect command. The advantage over regular tables is that the overhead of logging and locking might be saved.
There are three slightly different types of temporary tables:
CREATEGLOBALTEMPORARYTABLE<table_name>(...)[ONCOMMIT{PRESERVE|DELETE}ROWS];CREATELOCALTEMPORARYTABLE<table_name>(...)[ONCOMMIT{PRESERVE|DELETE}ROWS];DECLARELOCALTEMPORARYTABLE<table_name>(...)[ONCOMMIT{PRESERVE|DELETE}ROWS];
If the phrase 'ON COMMIT DELETE ROWS' is used, the data is automatically thrown away with every COMMIT command, else at the end of the session (or with a DELETE command).
If a GTT is created, itsdefinition becomes part of the database beyond the end of the defining session. Other sessions using the database will see this table definition as part of the schema. GTTs can be defined simultaneously with regular tables. Applications can use the existing GTTs or create their own. Up to this point, GTTs don't differ from regular tables. The distinction relates to thedata. As with all temporary tables, every session gets its own incarnation of the table and cannot access data from any other session. If the session terminates, all data from the table is thrown away automatically.
A typical use case is an application that needs a temporary protocol about its own activities like successful actions, exceptions, ... to perform recovery activities later on. This information is not of interest to other sessions. Moreover, it may be deleted at the end of a transaction or at the end of the session.
Another use case is an application that stores an intermediate result set and iterates the set's rows to perform actions depending on the column values.
-- The table may be defined by a different session long time before.CREATEGLOBALTEMPORARYTABLEtemp1(tsTIMESTAMP,actionCHAR(100),stateCHAR(50))ONCOMMITPRESERVEROWS;---- Insert some dataINSERTINTOtemp1VALUES(current_timestamp,'node-1-request sended.','OK');INSERTINTOtemp1VALUES(current_timestamp,'node-2-request sended.','OK');INSERTINTOtemp1VALUES(current_timestamp,'node-1-answer received.','Failed');INSERTINTOtemp1VALUES(current_timestamp,'node-2-answer received.','OK');SELECTcount(*)FROMtemp1WHEREstate='OK';...COMMIT;SELECTcount(*)FROMtemp1;-- In this example, all rows should have survived the COMMIT command-- After a disconnect from the database and establishing a new session the table exists and is empty.
Thedefinition of a LTT will never survive the duration of a session. The same applies to itsdata, which accords to the behavior of all temporary tables. In consequence, every session must define its own LTT before it can store anything into it. Multiple sessions can use the same table name simultaneously without affecting each other, which - again - accords to the behavior of all temporary tables.
-- The table must be defined by the same session (connection), which stores data into it.CREATELOCALTEMPORARYTABLEtemp2(tsTIMESTAMP,actionCHAR(100),stateCHAR(50))ONCOMMITPRESERVEROWS;-- After a disconnect from the database and establishing a new session, the table will not exist.
The SQL-standard distinguishes between SQL-sessions and modules within SQL-sessions. It postulates that LTTs are visible only within the module, which has actually created the table. The tables are not shared between different modules of the same SQL-session. But the LTTs definition occurs in the information schema of the DBMS.
The main concept of DLTT is very similar to that of LTT. The difference is that as opposed to the definition of an LTT the definition of a DLTT will not occur in the information schema of the DBMS. It is known only by the module where it is defined. You can imagine a DLTT as some kind of a module-local variable.
-- The declaration must be defined by the same module which stores data into the table.DECLARELOCALTEMPORARYTABLEtemp3(tsTIMESTAMP,actionCHAR(100),stateCHAR(50))ONCOMMITPRESERVEROWS;-- After a disconnect from the module and entering the module again, the declaration will not exist.
MySQL:
Oracle:
Indexes are a key feature of all SQL databases. They provide quick access to the data. Therefore almost all implementations support a CREATE INDEX statement.
Nevertheless, the CREATE INDEX statement isnot part of the SQL standard! The reason for this is unknown. Possibly it is a deliberate decision against all implementation issues. Or it results from the wide range of different syntaxes realized by vendors and the lack of finding a compromise.
On this page, we offer some basic ideas concerning indexes and the syntax which is common to a great number of implementations.
CREATE[UNIQUE]INDEX<index_name>ON<table_name>(<column_name>[,<column_name>]);
DBMSs offer quick access to data stored in their tables. One might think that such high-speed access is due to the fast hardware of modern computers: millions of CPU cycles per second, I/O rates in the range of milliseconds, access to RAM within micro- or nanoseconds, etc. That is true, but only partly so. Instead, the use of intelligent software algorithms, especially in the case of handling large amounts of data, is the dominant factor.
Consider a request to the DBMS to determine, whether or not a person with a certain name can be found in a table with 1 million entries. With a primitive, linear algorithm, the system has to read 500,000 rows (on average) to decide the question. Thebinary search algorithm implements a more sophisticated strategy that answers the question after reading 20 rows or less. In this case, this choice of algorithm leads to a factor of 25,000 in performance. In order to really grasp the magnitude of this improvement, you may want to multiply your salary by 25,000.
Admittedly this comparison between the linear access and the binary search algorithm is a little bit simple. First, DBMS usually read blocks containing multiple rows and not single rows. But this didn't change the situation. If a block contains 100 rows, modify the above example from 1 million to 100 million rows. Second, the binary search algorithm assumes that the data is ordered. This means that during data entry, there is an additional step for sorting the actual input into the existing data. This applies only once and is independent of the number of read accesses. In summary, there is additional work during data entry and much less work during data access. It depends on the typical use of the data whether the additional work is worthwhile.
The index is an additional storage holding data that is copied or deducted from the original data in the table. It consists only ofredundant data. What parts make up the index? In the common case of a binary search strategy, the index holds the original values of the tables column plus a back-reference to the original row. In most cases, the index is organized as abalanced tree with the column's value as the tree's key and the back-reference as additional information for each key.
The binary search algorithm is one of many options for building indexes. The common characteristics of indexes are that they: consist only of redundant information; use additional resources in the sense of CPU cycles, RAM or disc space; and offer better performance for queries on large data sets. In the cases of small tables or tables with many indexes, it is possible that the disadvantages (in performance or resource usage) outweigh the benefits of using an index.
If an application retrieves data by a certain criterion - e.g., a person name for a phone book application - and this criterion consists of a tables column, this column should have an index.
CREATEINDEXperson_lastname_idxONperson(lastname);
The index has its own freely selectable name -person_lastname_idx in this example - and is built on a certain column of a certain table. The index may be defined and created directly after the CREATE TABLE statement (when there is no data in the table) or after some or a huge number of INSERT commands. After it is created, the DBMS should be in the state to answer questions like the following quicker than before.
SELECTcount(*)FROMpersonWHERElastname='Miller';
The index may be used during the evaluation of the WHERE clause. The DBMS has the choice between - on the one hand - reading allperson rows and counting such where the lastname is 'Miller' or - on the other hand - reading the index (possibly with binary search) and counting all nodes with value 'Miller'. Which strategy is used depends on a lot of decisions. If, for example, the DBMS knows that about 30% of all rows contain 'Miller', it may choose a different strategy than if it knows that only 0.3% contains 'Miller'.
A table may have more than one index.
CREATEINDEXperson_firstname_idxONperson(firstname);
What will happen in such a situation to a query like the following one?
SELECTcount(*)FROMpersonWHERElastname='Miller'ANDfirstname='Henry';
Again, the DBMS has more than one choice to retrieve the expected result. It may use only one of the two indexes, read the resulting rows and look for the missing other value. Or it reads both indexes and counts the common back-references. Or it ignores both indexes, reads the data and counts such rows where both criteria apply. As mentioned it depends on a lot of decisions.
If an application typically searches in two columns withinone query, e.g. for first- and lastname, it can be useful to build one index for both columns. This strategy is very different from the above example, where we build two independent indexes, one per column.
CREATEINDEXperson_fullname_idxONperson(lastname,firstname);
In this case the key of the balanced tree is the concatenation of last- and firstname. The DBMS can use this index for queries which ask for last- and firstname. It can also use the index for queries for lastname only. But it cannot use the index in queries for firstname only. The firstname can occur at different places within the balanced tree. Therefore it is worthless for such queries.
In some cases an existing index cannot be used for queries on the underlying column. Suppose the query to person names should be case-insensitive. To do so the application converts all user-input to upper-case and use the UPPER() function to the column in scope.
-- Original user input was: 'miller'SELECTcount(*)FROMpersonWHEREUPPER(lastname)='MILLER';
As the criterion in the WHERE clause looks only for uppercase characters and the index is built in a case-sensitive way, the key in the balanced tree is worthless: 'miller' is sorted at a very different place than 'Miller'. To overcome the problem, one can define an index, which uses exactly the same strategy as the WHERE criterion.
CREATEINDEXperson_uppername_idxONperson(UPPER(lastname));-- not supported by MySQL
Now the 'UPPER()' query can use this so-called functional index.
The Primary Key of every table is unique, which means that no two rows can contain the same value. Sometimes one column or the concatenation of some columns is also unique. To ensure this criterion you can define aUNIQUE CONSTRAINT, or you can define an index with the additional UNIQUE criterion. (Often UNIQUE CONSTRAINTS silently use UNIQUE INDEX in the background.)
CREATEUNIQUEINDEXperson_lastname_unique_idxONperson(lastname);
Unique indexes can only be created on existing data, if the column in scope really has nothing but unique values (which is not the case in our database example).
Indexes can be dropped by the command:
DROPINDEX<index_name>;
For multiuser systems like DBMSs, it is necessary to grant and revoke rights for manipulating its objects. The GRANT command defines which user can manipulate (create, read, change, drop, ...) which object (tables, views, indexes, sequences, triggers, ...).
GRANT<privilege_name>ON<object_name>TO[<user_name>|<role_name>|PUBLIC][WITHGRANTOPTION];
The REVOKE statement deprives the granted rights.
REVOKE<privilege_name>ON<object_name>FROM[<user_name>|<role_name>|PUBLIC];
The example statement grants SELECT and INSERT on tableperson to the userhibernate. The second statement removes the granted rights.
GRANTSELECT,INSERTONpersonTOhibernate;REVOKESELECT,INSERTONpersonFROMhibernate;
Privileges are actions that users can perform. The SQL standard supports only a limited list of privileges, whereas real implementations offer a great bunch of different privileges. The list consists of: SELECT, INSERT, UPDATE, DELETE, CREATE <object_type>, DROP <object_type>, EXECUTE, ... .
The list of object types, to which privileges may be granted, is short in the SQL standard and long for real implementations. It consists of tables, views, indexes, sequences, triggers, procedures, ... .
If there is a great number of users connecting to the DBMS, it is helpful to group users with identical rights to a role and grant privileges not to the individual users but the role. To do so, the role must be created by a CREATE ROLE statement. Afterward, users are joined with this role.
-- Create a role-- (MySQL supports only predefined roles with special semantics).CREATEROLEdepartment_human_resources;-- Enrich the role with rightsGRANTSELECT,INSERT,UPDATE,DELETEONpersonTOdepartment_human_resources;GRANTSELECT,INSERTONhobbyTOdepartment_human_resources;GRANTSELECT,INSERT,UPDATE,DELETEONperson_hobbyTOdepartment_human_resources;-- Join users with the roleGRANTdepartment_human_resourcesTOuser_1;GRANTdepartment_human_resourcesTOuser_2;
Instead of individual usernames, the keyword PUBLIC denotesall known users.
-- Everybody shall be allowed to read the rows of the 'person' table.GRANTSELECTONpersonTOPUBLIC;
If a DBA wants to delegate the managing of rights to special users, he can grant privileges to them and extend the statement with the phrase 'WITH GRANT OPTION'. This enables the users to grant the received privileges to any other user.
-- User 'hibernate' gets the right to pass the SELECT privilege on table 'person' to any other user.GRANTSELECTONpersonTOhibernateWITHGRANTOPTION;
Since its first days, SQL includes a basic form ofpattern matching on strings. The feature is part of the WHERE clause and is triggered by the keyword LIKE. It knows two meta-characters: '_' (underscore) and '%' (percent).
SELECT*FROM<table_name>WHERE<column_name>LIKE<like_criterion>;-- The like_criterion can contain '_' and/or '%'
Over time, additional functionality has been incorporated to the standard. This concerns regular expressions, which are similar to those defined byPOSIX, and regular expressions as defined by the XQuery 1.0 / XPath 2.0 functionfn:matches(). Finally the SQL application packageISO/IEC 13249-2:2003 Part 2: Full-Text defines methods and functions for information retrieval on fulltext.
The LIKE predicate compares a column of type CHAR or VARCHAR (string) with a pattern. This pattern is also a string, but it may contain two characters with a special meaning. The '_' (underscore) represents exactly one arbitrary character, and '%' (percent) represents a string of zero, one or more characters. All other characters represent themselves.
The first example retrieves all rows from tableperson with a first name starting with the two characters 'Jo', eg: 'John', 'Johanna' or 'Jonny W.'. As the meta-character '%' represents also zero characters, rows with the first name 'Jo' are also retrieved.
SELECT*FROMpersonWHEREfirstnameLIKE'Jo%';
The next example retrieves all rows from tableperson with a last name similar to 'Meier', eg: 'Meyer' or 'Maier'. The two underscores represents exactly two (arbitrary) characters. Consider, that there may be unexpected results like 'Miler' (but not 'Miller').
SELECT*FROMpersonWHERElastnameLIKE'M__er';
The definition of the two meta-characters '_' and '%' implies a problem: What if one wants to search for these characters itself? The solution is the same as in other programming languages: one defines an escape mechanism to mask the meta-characters and prefix the meta-character in the pattern by this escape character. In SQL the syntax is:
...WHERE<column_name>LIKE<like_criterion>ESCAPE<escape_character>;
-- Retrieve one of the meta-characters: 'Electric_Motor' but not 'Electric Motor'SELECT*FROMpartsWHEREpartnameLIKE'Electric\_Motor'ESCAPE'\';
-- Retrieve the escape character itself: 'Electric\Motor' but not 'Electric Motor'SELECT*FROMpartsWHEREpartnameLIKE'Electric\\Motor'ESCAPE'\';
The escape character can be any character; it is not limited to the backslash, eg.: ... ESCAPE '!'
In an outdated version of the SQL standard, there was the keyword SIMILAR, which introduced a pattern match in the sense ofPOSIX. Nowadays, it is deprecated.
Some implementations offer pattern matching in this sense, but use arbitrary keywords like REGEXP, REGEXP_MATCHES, REGEXP_LIKE or operators like '~' to activate the feature.
The SQL standard defines the keyword LIKE_REGEX to activate pattern matching as defined by the XQuery 1.0 / XPath 2.0 functionfn:matches().
The SQL standard is complemented byISO/IEC 13249 SQL Multimedia and Application Packages. Part 2Full-Text defines routines and methods for information retrieval in text documents. It deals with objects like: word, phrase, sentence, and paragraph. There are special constructs; to search for words or phrases, search words within a definable distance, thesaurus features like broader terms of a word, soundex-similarity, ranking of results, and much more. The central method of these features is CONTAINS.
-- Retrieve rows with 'word_1' and 'word_2' (in this order) within a sequence of 10 words.SELECT*FROMt1WHERECONTAINS(' "word_1" NEAR "word_2" WITHIN 10 WORDS IN ORDER ')=1;
Such features are far beyond pattern matching - but it is the natural next step. SQL implementations offer them mostly in combination with special text indexes.
There are use cases in which an application wants to compare rows or columns not with a fixed value - e.g.: 'WHERE status = 5' - but with a result of a query which is evaluated at runtime. A first example of such dynamic queries aresubqueries which return exactlyone value: '... WHERE version = (SELECT MAX(version) ...)'. Additionally, sometimes there is the need to compare against a set, which containsmultiple values: '... WHERE version <comparison> (SELECT version FROM t1 WHERE status > 2 ...)'.
To do so, SQL offers some special comparison methods between the table to be queried and the result of the subquery: IN, ALL, ANY/SOME, and EXISTS. They belong to the group of so-calledpredicates.
The IN predicate - as described in a previous chapter - accepts a set of values or rows.
SELECT*FROMpersonWHEREidIN(SELECTperson_idFROMcontact);-- Subquery with potentially a lot of rows.
The subquery selects a lot of values. Therefore it is not possible to use operators like '=' or '>'. They would merely compare single values. But the IN predicate handles the situation and comparesperson.id with every value ofcontact.person_id regardless of the number ofcontact.person_id values. This comparisons are mutually linked in the sense ofboolean OR operations.
The IN predicate can be negated by adding the keyword NOT.
...WHEREidNOTIN...
The ALL predicate compares every row in the sense of aboolean AND with every value of the subquery. It needs - in contrast to the IN predicate - an additional operator, which is one of: <, <=, =, >=, > or <>.
SELECT*FROMpersonWHEREweight>ALL(SELECTweightFROMpersonwherelastname='de Winter');
Common hint: If there is no NULL special marker in the subquery it is possible to replace the ALL predicate by equivalent (and more intuitive) operations:
| <op> ALL | Substitution |
|---|---|
| < ALL | < (SELECT MIN() ...) |
| <= ALL | <= (SELECT MIN() ...) |
| = ALL | '=' or 'IN', if subselect retrieves 1 value. |
| Else: A single value cannot be equal to different values at the same time. (x = a AND x = b) evaluates to 'false' in all cases. | |
| >= ALL | >= (SELECT MAX() ...) |
| > ALL | > (SELECT MAX() ...) |
| <> ALL | '<>' or 'NOT IN', if subselect retrieves 1 value. |
| Else: 'NOT IN'. (x <> a AND x <> b). |
MySQL hint: Because of query rewrite issues the ONLY_FULL_GROUP_BY mode shall be disabled, e.g. by the command: set sql_mode='ANSI'.
The key words ANY and SOME are synonyms, their meaning is the same. Within this Wikibook, we prefer the use of ANY.
The ANY predicate compares every row in the sense of aboolean OR with every value of the subquery. It needs - in contrast to the IN predicate - an additional operator, which is one of: <, <=, =, >=, > or <>.
SELECT*FROMpersonWHEREweight>ANY(SELECTweightFROMpersonwherelastname='de Winter');
Common hint: If there is no NULL special marker in the subquery it is possible to replace the ANY predicate by equivalent (and more intuitive) operations:
| <op> ANY | Substitution |
|---|---|
| < ANY | < (SELECT MAX() ...) |
| <= ANY | <= (SELECT MAX() ...) |
| = ANY | '=' or 'IN', if subselect retrieves 1 value. |
| Else: 'IN'. (x = a OR x = b). | |
| >= ANY | >= (SELECT MIN() ...) |
| > ANY | > (SELECT MIN() ...) |
| <> ANY | '<>' or 'NOT IN', if subselect retrieves 1 value. |
| Else: A single value is always different from two or more different values under an OR conjunction. (x <> a OR x <> b) evaluates to 'true' in all cases. |
MySQL hint: Because of query rewrite issues the ONLY_FULL_GROUP_BY mode shall be disabled, e.g. by the command: set sql_mode='ANSI'.
The EXISTS predicate retrieves rows, if the subquery retrieves one or more rows. Meaningful examples typically use acorrelated subquery.
SELECT*FROMcontactc1WHEREEXISTS(SELECT*FROMcontactc2WHEREc2.person_id=c1.person_id-- correlation criterion between query and subqueryANDc2.contact_type='icq');
The example retrieves all contacts for such persons, which have an ICQ-contact.
The EXISTS predicate can be negated by adding the keyword NOT.
...WHERENOTEXISTS...
In the chapterGrouping we have seen that the key word GROUP BY creates groups of rows within a result set. Additionallyaggregat functions likeSUM() computes condensed values for each of those groups.
Because GROUP BY can summarize by multiple columns, there is often the requirement to compute summary values for 'super-groups', which arise by successively omitting one column after the next from the GROUP BY specification.
To illustrate the situation, we offer an example table and typical questions to such kind of tables.
CREATETABLEcar_pool(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,producerVARCHAR(50)NOTNULL,modelVARCHAR(50)NOTNULL,yyyyDECIMALNOTNULLCHECK(yyyyBETWEEN1970AND2020),counterDECIMALNOTNULLCHECK(counter>=0),CONSTRAINTcar_pool_pkPRIMARYKEY(id));--INSERTINTOcar_poolVALUES(1,'VW','Golf',2005,5);INSERTINTOcar_poolVALUES(2,'VW','Golf',2006,2);INSERTINTOcar_poolVALUES(3,'VW','Golf',2007,3);INSERTINTOcar_poolVALUES(4,'VW','Golf',2008,3);INSERTINTOcar_poolVALUES(5,'VW','Passat',2005,5);INSERTINTOcar_poolVALUES(6,'VW','Passat',2006,1);INSERTINTOcar_poolVALUES(7,'VW','Beetle',2005,1);INSERTINTOcar_poolVALUES(8,'VW','Beetle',2006,2);INSERTINTOcar_poolVALUES(9,'VW','Beetle',2008,4);INSERTINTOcar_poolVALUES(10,'Toyota','Corolla',2005,4);INSERTINTOcar_poolVALUES(11,'Toyota','Corolla',2006,3);INSERTINTOcar_poolVALUES(12,'Toyota','Corolla',2007,2);INSERTINTOcar_poolVALUES(13,'Toyota','Corolla',2008,4);INSERTINTOcar_poolVALUES(14,'Toyota','Prius',2005,1);INSERTINTOcar_poolVALUES(15,'Toyota','Prius',2006,1);INSERTINTOcar_poolVALUES(16,'Toyota','Hilux',2005,1);INSERTINTOcar_poolVALUES(17,'Toyota','Hilux',2006,1);INSERTINTOcar_poolVALUES(18,'Toyota','Hilux',2008,1);--COMMIT;
In the table, there are two different car producers, 6 models and 4 years. Typical questions to such tables are:
As we haveseen, the keyword GROUP BY offers condensed data for exactly one grouping level,producer plusmodel in this case.
SELECTproducer,model,sum(counter)ascntFROMcar_poolGROUPBYproducer,modelORDERBYproducer,cntdesc;--ToyotaCorolla13ToyotaHilux3ToyotaPrius2VWGolf13VWBeetle7VWPassat6
In such situations, one would like to know also the corresponding values for upper groups: perproducer or for the whole table. This can be achieved by submitting a slightly different SELECT.
SELECTproducer,sum(counter)ascntFROMcar_poolGROUPBYproducerORDERBYproducer,cntdesc;--Toyota18VW26----SELECTsum(counter)ascntFROMcar_pool;--44
In principle, it is possible, to combine such SELECTs via UNION or to submit them sequentially. But because this is a standard requirement, SQL offers a more elegant solution, namely the extension of the GROUP BY with the ROLLUP keyword. Based on the results of the GROUP BY, it offers additional rows for every superordinate group, which arises by omitting the grouping criteria one after the other.
SELECTproducer,model,sum(counter)ascntFROMcar_poolGROUPBYROLLUP(producer,model);-- the MySQL syntax is: GROUP BY producer, model WITH ROLLUP--ToyotaCorolla13ToyotaHilux3ToyotaPrius2Toyota18<-- the additional row per first producerVWBeetle7VWGolf13VWPassat6VW26<-- the additional row per next producer44<-- the additional row per all producers
The simple GROUP BY clause creates rows at the level ofproducer plusmodel. The ROLLUP keyword leads to additional rows where first themodel and thenmodel andproducer are omitted.
The ROLLUP keyword offers solutions where a hierarchical point of view is adequate. But in data warehouse applications, one likes to navigate freely through the aggregated data, not only from top to bottom. To support this requirement, the SQL standard offers the keyword CUBE. It is an extension of ROLLUP and offers additional rows forall possible combinations of the GROUP BY columns.
In the case of our above example with the two columnsproducer andmodel, the ROLLUP has created rows for 'producer-only' and 'no criteria' (= complete table). Additional to that, CUBE creates rows for 'model-only'. (If differentproducer would use the samemodel-name, such rows will lead to only 1 additional row.)
SELECTproducer,model,SUM(counter)AScntFROMcar_poolGROUPBYCUBE(producer,model);-- not supported by MySQL--ToyotaCorolla13ToyotaHilux3ToyotaPrius2Toyota-18VWBeetle7VWGolf13VWPassat6VW-26-Beetle7<---Corolla13<---Golf13<-- additional rows for 'model-only'-Hilux3<---Passat6<---Prius2<----44
If there are tree grouping columns c1, c2, and c3, the keywords lead to the following grouping.
| GROUP BY: | (c1, c2, c3) |
| GROUP BY ROLLUP: | (c1, c2, c3), (c1, c2), (c1) and () |
| GROUP BY CUBE: | (c1, c2, c3), (c1, c2), (c1, c3), (c2, c3), (c1), (c2), (c3) and () |
Thewindow functions discussed on this page are a special and very powerful extension to 'traditional' functions. They compute their result not on a single row but on a set of rows (similar toaggregate functions acting in correlation with aGROUP BY clause). This set of rows - and this is the crucial point - 'moves' or 'slides' over all rows, which are determined by theWHERE clause. This 'sliding window' is called aframe or - in terms of the official SQL standard - the 'window frame'.
Here are some examples:
GROUP BY clauses.In contrast toGROUP BY clauses, where only one output row per group exists, withwindow functions all rows of the result set retain their identity and are shown.
Window functions are listed between the two keywordsSELECT andFROM at the same place where usual functions and columns are listed. They contain the keyword OVER.
-- Window functions appear between the key words SELECT and FROMSELECT...,<window_function>,...FROM<tablename>...;-- They consist of three main parts:-- 1. function type (which is the name of the function)-- 2. key word 'OVER'-- 3. specification, which rows constitute the 'sliding window' (partition, order and frame)<window_function>:=<window_function_type>OVER<window_specification><window_function_type>:=ROW_NUMBER()|RANK()|LEAD(<column>)|LAG(<column>)|FIRST_VALUE(<column>)|LAST_VALUE(<column>)|NTH_VALUE(<column>,<n>)|SUM(<column>)|MIN(<column>)|MAX(<column>)|AVG(<column>|COUNT(<column>)<window_specification>:=[<window_partition>][<window_order>][<window_frame>]<window_partition>:=PARTITIONBY<column><window_order>:=ORDERBY<column><window_frame>:=seebelow

Concerning window functions, there are some similar concepts. To distinguish the concepts from each other, it is necessary to use an exact terminology. This terminology is introduced in the next eight paragraphs, which also - roughly - reflect the order of execution. The goal of the first seven steps is the determination of the actual frame, and the eighth step acts on it.
WHERE clause returns a certain number of rows. They constitutes theresult set.ORDER BY clause (syntactically behind theWHERE clause) re-orders theresult set into a certain sequence.SELECT clause. The row, which is actually given to theSELECT clause, is called thecurrent row.WINDOW PARTITION clause divides theresult set intowindow partitions (We will use the shorter termpartition as in the context of our site there is no danger of confusion). If there is noWINDOW PARTITION clause, all rows of theresult set constitutes onepartition. (Thesepartitions are equivalent to groups created by theGROUP BY clause.)Partitions are distinct from each other: there is no overlapping as every row of theresult set belongs to one and only onepartition.WINDOW ORDER clause orders the rows of eachpartition (which may differ from theORDER BY clause).WINDOW FRAME clause defines which rows of the actualpartition belong to the actualwindow frame (We will use the shorter termframe). The clause defines oneframe for every row of theresult set. This is done by determining the lower and upper boundary of affected rows. In consequence, there are as many (mostly different) frames as number of rows in the result set. The upper and lower boundaries are newly determined with every row of the result set! Single rows may be part of more than oneframe.The actualframe is the instantiation of the 'sliding window'. Its rows are ordered according to theWINDOW ORDER clause.WINDOW FRAME clause, the rows of the actualpartition constituteframes with the following default boundaries: The first row of the actualpartition is their lower boundary and thecurrent row is their upper boundary. If there is noWINDOW FRAME clause and noWINDOW ORDER clause, the upper boundary switches to the last row of the actualpartition. Below we will explain how to change this default behavior.We use the following table to demonstrate window functions.
CREATETABLEemployee(-- define columns (name / type / default value / column constraint)idDECIMALPRIMARYKEY,emp_nameVARCHAR(20)NOTNULL,dep_nameVARCHAR(20)NOTNULL,salaryDECIMAL(7,2)NOTNULL,ageDECIMAL(3,0)NOTNULL,-- define table constraints (it's merely an example table)CONSTRAINTempoyee_ukUNIQUE(emp_name,dep_name));INSERTINTOemployeeVALUES(1,'Matthew','Management',4500,55);INSERTINTOemployeeVALUES(2,'Olivia','Management',4400,61);INSERTINTOemployeeVALUES(3,'Grace','Management',4000,42);INSERTINTOemployeeVALUES(4,'Jim','Production',3700,35);INSERTINTOemployeeVALUES(5,'Alice','Production',3500,24);INSERTINTOemployeeVALUES(6,'Michael','Production',3600,28);INSERTINTOemployeeVALUES(7,'Tom','Production',3800,35);INSERTINTOemployeeVALUES(8,'Kevin','Production',4000,52);INSERTINTOemployeeVALUES(9,'Elvis','Service',4100,40);INSERTINTOemployeeVALUES(10,'Sophia','Sales',4300,36);INSERTINTOemployeeVALUES(11,'Samantha','Sales',4100,38);COMMIT;
The example demonstrates how the boundaries 'slides' over the result set. Doing so, they create one frame after the next,one per row of the result set. These frames are part of partitions, the partitions are part of the result set, and the result set is part of the table.
SELECTid,emp_name,dep_name,-- The functions FIRST_VALUE() and LAST_VALUE() explain themselves by their name. They act within the actual frame.FIRST_VALUE(id)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_first_row,LAST_VALUE(id)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_last_row,COUNT(*)OVER(PARTITIONBYdep_nameORDERBYid)ASframe_count,-- The functions LAG() and LEAD() explain themselves by their name. They act within the actual partition.LAG(id)OVER(PARTITIONBYdep_nameORDERBYid)ASprev_row,LEAD(id)OVER(PARTITIONBYdep_nameORDERBYid)ASnext_rowFROMemployee;-- For simplification, we use the same PARTITION and ORDER definitions for all window functions.-- This is not necessary, you can use divergent definitions.
Please notice how the lower boundary (FRAME_FIRST_ROW) and the upper boundary (FRAME_LAST_ROW) changes from row to row.
| ID | EMP_NAME | DEP_NAME | FRAME_FIRST_ROW | FRAME_LAST_ROW | FRAME_COUNT | PREV_ROW | NEXT_ROW |
|---|---|---|---|---|---|---|---|
| 1 | Matthew | Management | 1 | 1 | 1 | - | 2 |
| 2 | Olivia | Management | 1 | 2 | 2 | 1 | 3 |
| 3 | Grace | Management | 1 | 3 | 3 | 2 | - |
| 4 | Jim | Production | 4 | 4 | 1 | - | 5 |
| 5 | Alice | Production | 4 | 5 | 2 | 4 | 6 |
| 6 | Michael | Production | 4 | 6 | 3 | 5 | 7 |
| 7 | Tom | Production | 4 | 7 | 4 | 6 | 8 |
| 8 | Kevin | Production | 4 | 8 | 5 | 7 | - |
| 10 | Sophia | Sales | 10 | 10 | 1 | - | 11 |
| 11 | Samantha | Sales | 10 | 11 | 2 | 10 | - |
| 9 | Elvis | Service | 9 | 9 | 1 | - | - |
The query has noWHERE clause. Therefore all rows of the table are part of the result set. According to theWINDOW PARTITION clause, which is 'PARTITION BY dep_name', the result set is divided into the 4 partitions: 'Management', 'Production', 'Sales' and 'Service'. The frames run within these partitions. As there is noWINDOW FRAME clause the frames start at the first row of the actual partition and runs up to the current row.
You can see that the actual number of rows within a frame (column FRAME_COUNT) grows from 1 up to the sum of all rows within the partition. When the partition switches to the next one, the number starts again with 1.
The columns PREV_ROW and NEXT_ROW show the ids of the previous and next row within the actual partition. As the first row has no predecessor, theNULL indicator is shown. This applies correspondingly to the last row and its successor.
We present some of the<window_function_type> functions and their meaning. The standard as well as most implementations include additional functions and overloaded variants.
| Signature | Scope | Meaning / Return Value |
|---|---|---|
| FIRST_VALUE(<column>) | Actual Frame | The column value of the first row within the frame. |
| LAST_VALUE(<column>) | Actual Frame | The column value of the last row within the frame. |
| LAG(<column>) | Actual Partition | The column value of the predecessor row (the row which is before the current row). |
| LAG(<column>, <n>) | Actual Partition | The column value of the n.-th row before the current row. |
| LEAD(<column>) | Actual Partition | The column value of the successor row (the row which is after the current row). |
| LEAD(<column>, <n>) | Actual Partition | The column value of the n.-th row after the current row. |
| ROW_NUMBER() | Actual Frame | A numeric sequence of the row within the frame. |
| RANK() | Actual Frame | A numeric sequence of the row within the frame. Identical values in the specified order evaluate to the same number. |
| NTH_VALUE(<column>, <n>) | Actual Frame | The column value of the n.-th row within the frame. |
| SUM(<column>) MIN(<column>) MAX(<column>) AVG(<column>) COUNT(<column>) | Actual Frame | As usual. |
Here are some examples:
SELECTid,emp_name,dep_name,ROW_NUMBER()OVER(PARTITIONBYdep_nameORDERBYid)ASrow_number_in_frame,NTH_VALUE(emp_name,2)OVER(PARTITIONBYdep_nameORDERBYid)ASsecond_row_in_frame,LEAD(emp_name,2)OVER(PARTITIONBYdep_nameORDERBYid)AStwo_rows_aheadFROMemployee;
| ID | EMP_NAME | DEP_NAME | ROW_NUMBER_IN_FRAME | SECOND_ROW_IN_FRAME | TWO_ROWS_AHEAD |
|---|---|---|---|---|---|
| 1 | Matthew | Management | 1 | - | Grace |
| 2 | Olivia | Management | 2 | Olivia | - |
| 3 | Grace | Management | 3 | Olivia | - |
| 4 | Jim | Production | 1 | - | Michael |
| 5 | Alice | Production | 2 | Alice | Tom |
| 6 | Michael | Production | 3 | Alice | Kevin |
| 7 | Tom | Production | 4 | Alice | - |
| 8 | Kevin | Production | 5 | Alice | - |
| 10 | Sophia | Sales | 1 | - | - |
| 11 | Samantha | Sales | 2 | Samantha | - |
| 9 | Elvis | Service | 1 | - | - |
The three example shows:
As shown in the above examples, theWINDOW PARTITION clause defines the partitions by using the keywords PARTITION BY and theWINDOW ORDER clause defines the sequence of rows within the partition by using the key words ORDER BY.
The frames are defined by theWINDOW FRAME clause, which optionally follows theWINDOW PARTITION clause and theWINDOW ORDER clause.
With the exception of the lead() and lag() functions, whose scope is the actual partition, all other window functions act on the actual frame. Therefore it is an elementary decision, which rows shall constitute the frame. This is done by establishing the lower and upper boundary (in the sense of theWINDOW ORDER clause). All rows within these two bounds constitute the actual frame. Therefore theWINDOW FRAME clause consists mainly of the definition of the two boundaries - in one of four ways:
SELECT DISTINCT ... orGROUP BY. The resulting frame covers all rows, whose values fall into one of the groups. As every group may be built out of multiple rows (with the same value), the number of rows per frame is not constant.In accordance with these different strategies, there are three keywords 'ROWS', 'GROUPS' and 'RANGE' which leads to the different behavior.
TheWINDOW FRAME clause uses some keywords that modify or specify where the ordered rows of a partition are visualized.
Rows in a partition and the related keywords - <-- UNBOUNDED PRECEDING (first row) ... - <-- 2 PRECEDING - <-- 1 PRECEDING - <-- CURRENT ROW - <-- 1 FOLLOWING - <-- 2 FOLLOWING ... - <-- UNBOUNDED FOLLOWING (last row)
The term UNBOUNDED PRECEDING denotes the first row in a partition and UNBOUNDED FOLLOWING the last row. Counting from the CURRENT ROW there are <n> PRECEDING and <n> FOLLOWING rows. Obviously this PRECEDING/FOLLOWING terminology works only, if there is aWINDOW ORDER clause, which creates an unambiguous sequence.
The (simplified) syntax of theWINDOW FRAME clause is:
<window_frame>:=[ROWS|GROUPS|RANGE]BETWEEN[UNBOUNDEDPRECEDING|<n>PRECEDING|CURRENTROW]AND[UNBOUNDEDFOLLOWING|<n>FOLLOWING|CURRENTROW]
An example of a complete window function with itsWINDOW FRAME clause is:
...SUM(salary)OVER(PARTITIONBYdep_nameORDERBYsalaryROWSBETWEENUNBOUNDEDPRECEDINGANDCURRENTROW)asgrowing_sum,...
In this case theWINDOW FRAME clause starts with the keyword 'ROWS'. It defines the lower boundary to the very first row of the partition and the upper boundary to the actual row. This means that the series of frames grows from frame to frame by one additional row until all rows of the partition are handled. Afterward, the next partition starts with a 1-row-frame and repeats the growing.
The ROWS syntax defines a certain number of rows to process.
SELECTid,dep_name,salary,SUM(salary)OVER(PARTITIONBYdep_nameORDERBYsalaryROWSBETWEEN2PRECEDINGANDCURRENTROW)ASsum_over_1or2or3_rowsFROMemployee;
The example acts on a certain number of rows, namely the two rows before the current row (if existing within the partition) and the current row. There is no situation where more than three rows exists in one of the frames. The window function computes the sum of the salary over these maximal three rows.
The sum is reset to zero with every new partition, which is the department in this case. This holds true also for the GROUPS and RANGE syntax.
The ROWS syntax is often used when one is interested in the average about a certain number of rows or in the distance between two rows.
The GROUPS syntax has a similar semantic as the ROWS syntax - with one exception: rows with equal values within the column of theWINDOW ORDER clause count as 1 row. The GROUPS syntax counts the number of distinct values, not the number of rows.
-- Hint: The syntax 'GROUPS' (Feature T620) is not supported by Oracle 11SELECTid,dep_name,salary,SUM(salary)OVER(PARTITIONBYdep_nameORDERBYsalaryGROUPSBETWEEN1PRECEDINGAND1FOLLOWING)ASsum_over_groupsFROMemployee;
The example starts with the keyword GROUPS and defines that it wants to work on three distinct values of the column 'salary'. Possibly more than three rows are satisfying these criteria - in opposite to the equivalent ROWS strategy.
The GROUPS syntax is the appropriate strategy, if one has a varying number of rows within the time period under review, eg.: one has a varying number of measurement values per day and is interested in the average of the variance over a week or month.
At a first glance, the RANGE syntax is similar to the ROWS and GROUPS syntax. But the semantic is very different! Numbers <n> given in this syntax did not specify any counter. They specify thedistance from the value in the current row to the lower or upper boundary. Therefore the ORDER BY column shall be of type NUMERIC, DATE, or INTERVAL.
SELECTid,dep_name,salary,SUM(salary)OVER(PARTITIONBYdep_nameORDERBYsalaryRANGEBETWEEN100PRECEDINGAND50FOLLOWING)ASsum_over_rangeFROMemployee;
This definition leads to the sum over all rows which have a salary from 100 below and 50 over the actual row. In our example table, this criteria applies in some rare cases to more than 1 row.
Typical use cases for the RANGE strategy are situations where someone analyzes a wide numeric range and expects to meet only a few rows within this range, e.g.: a sparse matrix.
If theWINDOW FRAME clause is omitted, its default value is: 'RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'. This leads to a range from the very first row of the partition up the current row plus all rows with the same value as the current row - because the RANGE syntax applies.
If theWINDOW ORDER clause is omitted, theWINDOW FRAME clause is not allowed and all rows of the partition constitute the frame.
If thePARTITION BY clause is omitted, all rows of the result set constitutes the one and only partition.
Although the SQL standard 2003 and his successors define very clear rules concerning window functions, several implementations did not follow them. Some vendors implement only parts of the standard - which is their own responsibility -, but others seem to interpret the standard in a fanciful fashion.
As far as we know, the ROWS syntax conforms to the standard when it is implemented. But it seems that the RANGE syntax sometimes implements what the GROUPS syntax of the SQL standard requires. (Perhaps this is a misrepresentation, and only the public available descriptions of various implementations do not reflect the details.) So: be careful, test your system, and give us feedback on the discussion page.
Show id, emp_name, dep_name, salary and the average salary within the department.
---- To get the average of the department, every frame must be build by ALL rows of the department.--SELECTid,emp_name,dep_name,salary,AVG(salary)OVER(PARTITIONBYdep_nameORDERBYdep_name-- all rows of partition (=department)ROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)asavg_salaryFROMemployee;---- It's possible to omit the 'window order' clause. Thereby the frames include ALL rows of the actual partition.-- See: 'Defaults' above.--SELECTid,emp_name,dep_name,salary,AVG(salary)OVER(PARTITIONBYdep_name)asavg_salaryFROMemployee;---- The following statements lead to different results as the frames are composed of a growing number of rows.--SELECTid,emp_name,dep_name,salary,AVG(salary)OVER(PARTITIONBYdep_nameORDERBYsalary)asavg_salaryFROMemployee;---- It's possible to sort the result set by arbitrary rows (test the emp_name, it's interesting)--SELECTid,emp_name,dep_name,salary,AVG(salary)OVER(PARTITIONBYdep_name)asavg_salaryFROMemployeeORDERBYdep_name,salary;
Does older persons earn more money than younger?
To give an answer show id, emp_name, salary, age and the average salary of 3 (or 5) persons, which are in a similar age.
SELECTid,emp_name,salary,age,AVG(salary)OVER(ORDERBYageROWSBETWEEN1PRECEDINGAND1FOLLOWING)ASmean_over_3,AVG(salary)OVER(ORDERBYageROWSBETWEEN2PRECEDINGAND2FOLLOWING)ASmean_over_5FROMemployee;-- As there is no restriction to any other criterion than the age (department or something else), there is-- no need for any PARTITION definition. Averages are computed without any interruption.
Extend the above question and its solution to show the results within the four departments.
SELECTid,emp_name,salary,age,dep_name,AVG(salary)OVER(PARTITIONBYdep_nameORDERBYageROWSBETWEEN1PRECEDINGAND1FOLLOWING)ASmean_over_3,AVG(salary)OVER(PARTITIONBYdep_nameORDERBYageROWSBETWEEN2PRECEDINGAND2FOLLOWING)ASmean_over_5FROMemployee;-- Averages are computed WITHIN departments.
Show id, emp_name, salary and the difference to the salary of the previous person (in ID-order).
-- For mathematician: This is a very first approximation to first derivate.SELECTid,emp_name,salary,salary-LAG(salary)OVER(ORDERBYid)asdiff_salaryFROMemployee;-- And the difference of differences:SELECTid,emp_name,salary,(LAG(salary)OVER(ORDERBYid)-salary)ASdiff_salary_1,(LAG(salary)OVER(ORDERBYid)-salary)-(LAG(salary,2)OVER(ORDERBYid)-LAG(salary)OVER(ORDERBYid))ASdiff_salary_2FROMemployee;
Show the 'surrounding' of a value: id and emp_name of all persons ordered by emp_name. Supplement each row with the two emp_names before and the two after the actual emp_name (in the usual alphabetical order).
SELECTid,LAG(emp_name,2)OVER(ORDERBYemp_name)ASbefore_prev,LAG(emp_name)OVER(ORDERBYemp_name)ASprev,emp_nameASact,LEAD(emp_name)OVER(ORDERBYemp_name)ASfollower,LEAD(emp_name,2)OVER(ORDERBYemp_name)ASbehind_followerFROMemployeeORDERBYemp_name;
TheWITH clause extends SELECT, UPDATE, INSERT or DELETE statements with the definition of one or more 'inline views'. The scope of such 'inline views' is limited to the actual statement. Hence they have a temporary nature. Their purpose is the rearrangement of complex statements in such a way that the intention of the complete statement gets more clearly represented. Primarily, it is a syntax element to support the maintenance of database applications. Secondarily, if complex statements contain identical phrases at different places, and they are rearranged by a singleWITH clause, the DBMS has a better chance to find an optimal execution strategy.
In other words: TheWITH clause does not offer any new feature - with the exception of recursive queries, which will be explained in the next chapter. It offers only a syntax element to express complex queries in a clearly arranged way.
Hint:WITH clause is the terminology of the SQL standard. In everyday speech, this language construct is mostly referred to asCommon Table Expression (CTE) and sometimes asInline View. Oracle calls itSubquery Factoring Clause. Even though it is not the official term, on this page of the Wikibook, we prefer the termCTE as it is an expressive abbreviation.
WITHcte_1[(temp_column_name[,...])]AS(SELECT...)SELECT...FROMcte_1;-- consider the semicolon: the SELECT command (containing a WITH clause) runs from the term 'WITH' up to here.
The keyword WITH introduces the definition of a CTE. First, the CTE gets an arbitrary name - 'cte_1' in our case. This is followed by a SELECT statement, which defines how to retrieve data from one or more tables into the CTE (please consider: it's only a definition - it is not necessary, that the data really is materialized in the CTE). Afterward, a normal SELECT, UPDATE, INSERT, or DELETE statement follows, where the previous defined name of the CTE is used directly in the FROM or at any other place, where view or table names are allowed.
The first example is limited to report only the content of aCTE. In this case it contains information about days where more than one person is born and consists of rows with the date plus a counter.
-- define the CTE 'ambiguous_date'WITHambiguous_dateAS-- this is similar to: CREATE VIEW ambiguous_date AS ...(SELECTcount(*)AScnt_per_date,date_of_birthFROMpersonGROUPBYdate_of_birthHAVINGcount(*)>1)-- use the CTE 'ambiguous_date'SELECT*FROMambiguous_date;-- After the semicolon, the CTE is no longer known by the DBMS. It is gone out of scope.
The similarity betweenCTEs and views is obvious. The main difference is that view definitions keep alive after using the view name in a statement, whereas the scope of aCTE is limited to the statement, where it is defined. This is the reason why some people denoteCTEs asInline Views.
The second example uses theCTE within a SUBSELECT of the main SELECT. The CTE is further modified by a BETWEEN criterion.
-- define the CTEWITHambiguous_dateAS(SELECTcount(*)AScnt_per_date,date_of_birthFROMpersonGROUPBYdate_of_birthHAVINGcount(*)>1)-- the main SELECTSELECT*FROMpersonpWHEREp.date_of_birthIN-- use the CTE(SELECTdate_of_birthFROMambiguous_date-- one can manipulate the CTE in the same manner as every other view or tableWHEREcnt_per_dateBETWEEN3AND10);
The third example uses theCTE at different places within the statement.
-- define the CTEWITHambiguous_dateAS(SELECTcount(*)AScnt_per_date,date_of_birthFROMpersonGROUPBYdate_of_birthHAVINGcount(*)>1)-- use of the CTE at different places and for different purposesSELECTp.*,(SELECTcount(*)ASsum_of_ambiguity_datesFROMambiguous_date)ASnumber_of_ambiguous_datesFROMpersonpWHEREp.date_of_birthIN(SELECTdate_of_birthFROMambiguous_date);
The main SELECT retrieves all persons (more than two rows), which have the same birthdays as such identified by theCTE (two rows). And it amends every person with the number of ambiguous birthdays (which is '2').
TheWITH clause is the basis forrecursive queries, which will be explained in the next chapter.
Sometimes the rows ofone table are structured in such a way that they represent a hierarchy or a networkwithin this table. Typical use cases are management structures, bill of materials (a machine consists of several smaller machines, …), or network structures (e.g.: flight plans).
To retrieve particular rows and all rows that correlate to them, one can use set operations in combination with subqueries to merge them together to one result set. But this technique is limited as one must exactly know the number of levels. Apart from the fact that the number of levels changes from case to case, the subselect syntax differs from level to level. To overcome these restrictions, SQL offers a syntax to express queries in arecursive manner. They retrieve the rows of all affected levels, independent from their number.
The SQL standard uses a special form of itsWITH clause, which is explained on theprevious page, to define recursive queries. The clause occurs before a SELECT, INSERT, UPDATE, or DELETE keyword and is part of the appropriate command.
Hint: TheWITH clause (with or without the 'RECURSIVE' key word) is often referred to as a 'common table expression (CTE)'.
Hint: Oracle supports the syntax of the SQL standard since version 11.2. MySQL 8.0 supports the RECURSIVE keyword. Earlier MySQL versions do not support recursions at all and recommend procedural workarounds.
-- The command starts with a 'with clause', which contains the optional 'RECURSIVE' key word.WITH[RECURSIVE]intermediate_table(temp_column_name[,...])AS(SELECT...FROMreal_table-- initial query to a real table (1)UNIONALL(3)SELECT...FROMintermediate_table-- repetitive query using the intermediate table (2))-- The 'with clause' is part of a regular SELECT.-- This SELECT refers to the final result of the 'with clause'. (4)SELECT...FROMintermediate_table;-- consider the semicolon: the command runs from the 'WITH' up to here.
The evaluation sequence is as follows:
To demonstrate recursive queries, we define an example table. It holds information about persons and their ancestors. Because ancestors are always persons, everything is stored in the same table.father_id andmother_id acts as references to the rows where father's and mother's information is stored. The combination offather_id,mother_id andfirstname acts as a criterion, which uniquely identifies rows according to those three values (we suppose that parents give their children different names).
CREATETABLEfamily_tree(-- define columns (name / type / default value / nullable)idDECIMALNOTNULL,firstnameVARCHAR(50)NOTNULL,lastnameVARCHAR(50)NOTNULL,year_of_birthDECIMALNOTNULL,year_of_deathDECIMAL,father_idDECIMAL,mother_idDECIMAL,-- the primary keyCONSTRAINTfamily_tree_pkPRIMARYKEY(id),-- an additional criterion to uniquely distinguish rows from each otherCONSTRAINTfamily_tree_uniqUNIQUE(father_id,mother_id,firstname),-- two foreign keys (to the same table in this special case) to ensure that no broken links ariseCONSTRAINTfamily_tree_fk1FOREIGNKEY(father_id)REFERENCESfamily_tree(id),CONSTRAINTfamily_tree_fk2FOREIGNKEY(mother_id)REFERENCESfamily_tree(id),-- plausibility checksCONSTRAINTfamily_tree_check1CHECK(year_of_birth>=1800ANDyear_of_birth<2100),CONSTRAINTfamily_tree_check2CHECK((year_of_death>=1800ANDyear_of_death<2100)ORyear_of_deathISNULL));-- a fictional coupleINSERTINTOfamily_treeVALUES(1,'Karl','Miller',1855,1905,null,null);INSERTINTOfamily_treeVALUES(2,'Lisa','Miller',1851,1912,null,null);-- their childrenINSERTINTOfamily_treeVALUES(3,'Ruth','Miller',1878,1888,1,2);INSERTINTOfamily_treeVALUES(4,'Helen','Miller',1880,1884,1,2);INSERTINTOfamily_treeVALUES(5,'Carl','Miller',1882,1935,1,2);INSERTINTOfamily_treeVALUES(6,'John','Miller',1883,1900,1,2);-- some more people; some of them are descendants of the MillersINSERTINTOfamily_treeVALUES(7,'Emily','Newton',1880,1940,null,null);INSERTINTOfamily_treeVALUES(8,'Charly','Miller',1908,1978,5,7);INSERTINTOfamily_treeVALUES(9,'Deborah','Brown',1910,1980,null,null);INSERTINTOfamily_treeVALUES(10,'Chess','Miller',1948,null,8,9);COMMIT;
As afirst example, we retrieve Mr. Karl Miller and all his descendants. To do so, we must retrieve his own row and define a rule, how to 'navigate' from level to level within the family tree.
-- Choose a name for the intermediate table and its columns. The column names may differ from the names in the real table.WITHintermediate_table(id,firstname,lastname)AS(-- Retrieve the starting row (or rows)SELECTid,firstname,lastnameFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALL-- Define the rule for querying the next level. In most cases this is done with a join operation.SELECTf.id,f.firstname,f.lastname-- the alias 'f' refers to the real tableFROMintermediate_tablei-- the alias 'i' refers to the intermediate tableJOINfamily_treefONf.father_id=i.idORf.mother_id=i.id-- the join operation defines, how to reach the next level)-- The final SELECTSELECT*FROMintermediate_table;
You can use all language features of SQL to process the intermediate table further. (It isn't a real table, it is only an intermediate result with the structure of a table). For example, to count the number of descendants.
WITH...-- The 'with clause' as above-- The final SELECTSELECTcount(*)FROMintermediate_table;
To demonstrate the problems in situations where no recursive SELECT is available, we show a syntax with subqueries.
-- This query retrieves only Mr. Karl Miller ...SELECT*FROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALL-- ... and his childrenSELECT*FROMfamily_treeWHEREfather_idIN(SELECTidFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller');
Every level has its own syntax, e.g., to retrieve grandchildren, we need a subquery within a subquery.
As asecond example, we traverse the hierarchy in the opposite direction: from a person to their patrilineal (male-line) ancestors. In comparison to the first example, two things change. The start point of the query is no longer Mr. Karl Miller, as he has no ancestor in our example table. And we have to change the join condition by swapping id and father_id.
-- Retrieve ancestorsWITHintermediate_table(id,father_id,firstname,lastname)AS(-- Retrieve the starting row (or rows)SELECTid,father_id,firstname,lastname-- now we need the 'father_id'FROMfamily_treeWHEREfirstname='Chess'ANDlastname='Miller'UNIONALL-- Define the rule for querying the next level.SELECTf.id,f.father_id,f.firstname,f.lastnameFROMintermediate_tableiJOINfamily_treefONf.id=i.father_id-- compared with the first example this join operation defines the opposite direction)-- The final SELECTSELECT*FROMintermediate_table;
Sometimes we need to know to which level within the hierarchy or network a row belongs to. To display this level, we include a pseudo-column with an arbitrary name into the query. We choose the namehier_level (aslevel is a reserved word in the context of savepoints).
-- We extend the above example to show the hierarchy levelWITHintermediate_table(id,firstname,lastname,hier_level)AS(SELECTid,firstname,lastname,0ashier_level-- set the level of the start point to a fix numberFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALLSELECTf.id,f.firstname,f.lastname,i.hier_level+1-- increment the levelFROMintermediate_tableiJOINfamily_treefONf.father_id=i.idORf.mother_id=i.id)SELECT*FROMintermediate_table;
The level is now available, and we can use it as an additional condition, eg. for a restriction to the first two levels.
-- The with clause remains unchanged...SELECT*FROMintermediate_tableWHEREhier_level<2;-- restrict the result to the first two levels-- or, as with the above solution the intermediate result set is computed over ALL levels and later restricted to the first two:WITHintermediate_table(id,firstname,lastname,hier_level)AS(SELECTid,firstname,lastname,0ashier_levelFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALLSELECTf.id,f.firstname,f.lastname,i.hier_level+1FROMintermediate_tableiJOINfamily_treefONf.father_id=i.idORf.mother_id=i.idWHEREhier_level<1-- restrict the join to the expected result)SELECT*FROMintermediate_table;
Sometimes we want to build a path from the starting point of the hierarchy or network to the actual row, eg. for afaceted classification like1.5.3 or for a simple numbering of the visited nodes. This can be achieved in a similar way as the computing of the level. We need a pseudo-column with an arbitrary name and append actual values to those that have already been formed.
-- Save the path from person to person in an additional column. We choose the name 'hier_path' as its name.WITHintermediate_table(id,firstname,lastname,hier_level,hier_path)AS(SELECTid,firstname,lastname,0ashier_level,firstnameashier_path-- we collect the given namesFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALL-- The SQL standard knows only a two-parameter function concat(). We us it twice.SELECTf.id,f.firstname,f.lastname,i.hier_level+1,concat(concat(i.hier_path,' / '),f.firstname)FROMintermediate_tableiJOINfamily_treefONf.father_id=i.idORf.mother_id=i.id)SELECT*FROMintermediate_table;
There are two ways to traverse hierarchies and networks. You must decide which kind of nodes you want to process first: child nodes (nodes of the next level) or sibling nodes (nodes of the same level). The two methods are calleddepth first andbreadth first. With the keywordsDEPTH FIRST andBREADTH FIRST (the default) you can decide between the two variants.
<with_clause>SEARCH[DEPTHFIRST|BREADTHFIRST]BY<column_name>SET<sequence_number><select_clause>
The key words occur between theWITH clause and theSELECT clause. Since - as opposed to a tree in a programming language like JAVA or C++ or like an XML instance - rows of a table have no implicit order, you must define an order for the nodes within their level. This is done behind theBY key word. After theSET key word, define the name of an additional pseudo-column, where a numbering over all rows is stored automatically.
WITHintermediate_table(id,firstname,lastname,hier_level)AS(SELECTid,firstname,lastname,0AShier_levelFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALLSELECTf.id,f.firstname,f.lastname,i.hier_level+1FROMintermediate_tableiJOINfamily_treefONf.father_id=i.idORf.mother_id=i.id)-- SEARCH BREADTH FIRST BY firstname SET sequence_numberSEARCHDEPTHFIRSTBYfirstnameSETsequence_numberSELECT*FROMintermediate_table;
There are some notable remarks to the above query:
BREADTH FIRST), the family tree is traversed in such a way that after every row its 'child' rows are processed. This is significant at level 1.BY definition:firstname in this case.Retrieve Chess Miller and all Chess'sfemale ancestors.
WITHintermediate_table(id,mother_id,firstname,lastname)AS(SELECTid,mother_id,firstname,lastnameFROMfamily_treeWHEREfirstname='Chess'ANDlastname='Miller'UNIONALLSELECTf.id,f.mother_id,f.firstname,f.lastnameFROMintermediate_tableiJOINfamily_treefONf.id=i.mother_id)SELECT*FROMintermediate_table;
Retrieve Chess Miller and all Chess's ancestors: male and female.
WITHintermediate_table(id,father_id,mother_id,firstname,lastname)AS(SELECTid,father_id,mother_id,firstname,lastnameFROMfamily_treeWHEREfirstname='Chess'ANDlastname='Miller'UNIONALLSELECTf.id,f.father_id,f.mother_id,f.firstname,f.lastnameFROMintermediate_tablei-- extend the JOIN condition!JOINfamily_treefON(f.id=i.mother_idORf.id=i.father_id))SELECT*FROMintermediate_table;
To make the situation a little bit more transparent add a number to the previous query which shows the actual level.
WITHintermediate_table(id,father_id,mother_id,firstname,lastname,hier_level)AS(SELECTid,father_id,mother_id,firstname,lastname,0-- we start with '0'FROMfamily_treeWHEREfirstname='Chess'ANDlastname='Miller'UNIONALLSELECTf.id,f.father_id,f.mother_id,f.firstname,f.lastname,i.hier_level+1FROMintermediate_tableiJOINfamily_treefON(f.id=i.mother_idORf.id=i.father_id))SELECT*FROMintermediate_table;
To make the situation absolutely transparent replace the level by some kind of path (child / parent / grandparent / ...).
WITHintermediate_table(id,father_id,mother_id,firstname,lastname,ancestry)AS(SELECTid,father_id,mother_id,firstname,lastname,firstnameFROMfamily_treeWHEREfirstname='Chess'ANDlastname='Miller'UNIONALLSELECTf.id,f.father_id,f.mother_id,f.firstname,f.lastname,concat(concat(i.ancestry,' / '),f.firstname)FROMintermediate_tableiJOINfamily_treefON(f.id=i.mother_idORf.id=i.father_id))SELECT*FROMintermediate_table;
Retrieve all grandchildren of Karl Miller.
WITHintermediate_table(id,father_id,mother_id,firstname,lastname,hier_level)AS(SELECTid,father_id,mother_id,firstname,lastname,0-- we start with '0'FROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALLSELECTf.id,f.father_id,f.mother_id,f.firstname,f.lastname,i.hier_level+1FROMintermediate_tableiJOINfamily_treefON(f.father_id=i.idANDhier_level<2)-- performance: abort joining after the second level)SELECT*FROMintermediate_tableWHEREhier_level=2;-- this is the restriction to the grandchildren
Retrieve every person in the tablefamily_tree and show its firstname and the firstname of its very first known ancestor in the male line.
WITHintermediate_table(id,father_id,firstname,lastname,initial_row,hier_level)AS(-- The starting points are persons (more than one in our example table) for which no father is known.SELECTid,father_id,firstname,lastname,firstname,0FROMfamily_treeWHEREfather_idISNULLUNIONALL-- The start name is preserved from level to levelSELECTf.id,f.father_id,f.firstname,f.lastname,i.initial_row,i.hier_level+1FROMintermediate_tableiJOINfamily_treefONf.father_id=i.id)SELECT*FROMintermediate_table;-- or:...unchanged'with clause'SELECTid,firstname,'-->',initial_row,'in ',hier_level,'generation(s)'FROMintermediate_table;
a) How many descendants of Carl Miller are stored in the example table?
b) Same question as before, but differentiated per level.
-- a)WITHintermediate_table(id,firstname,lastname,hier_level)AS(SELECTid,firstname,lastname,0AShier_levelFROMfamily_treeWHEREfirstname='Karl'ANDlastname='Miller'UNIONALLSELECTf.id,f.firstname,f.lastname,i.hier_level+1FROMintermediate_tableiJOINfamily_treefONf.father_id=i.id)SELECTcount(*)FROMintermediate_tablewherehier_level>0;-- b) Use the same WITH clause. Only the final SELECT changes....SELECThier_level,count(hier_level)FROMintermediate_tableWHEREhier_level>0GROUPBYhier_level;
As mentioned in aprevious chapter of this wikibook and inwikipedia sometimes there is no value in a column of a row, or - to say it the other way round - the column stores theNULL marker (a flag to indicate the absence of any data), or - to use the notion of the SQL standard - the column stores theNULL value. This NULL marker is very different from the numeric value zero or a string with a length of zero characters! Typically it occurs when an application yet hasn't stored anything in the column of this row.
(A hint to Oracle users: For Oracle, the NULL marker is identical to a string of zero characters.)
The existence of the NULL marker introduces anew fundamental problem. In the usual boolean logic, there are the two logical values TRUE and FALSE. Every comparison evaluates to one of the two - and the comparisons negation evaluates to the opposite one. If a comparison evaluates to TRUE, its negation evaluates to FALSE and vice versa. As an example, in the usual boolean logic, one of the following two comparisons is TRUE, and the other one is FALSE: 'x < 5', 'x >= 5'.
Imagine now the new situation that x holds the NULL marker. It is not feasible that 'NULL < 5' is true (1). But can we say 'NULL < 5' is false (2) and its negation 'NULL >= 5' is true (3)? Is (3) more feasible than (1)? Of course not. (1) and (3) have the same 'degree of truth', so they shall evaluate to the same value. And this value must be different from TRUE and FALSE.
Therefore the usual boolean logic is extended by a third logic value. It is namedUNKNOWN. All comparisons to the NULL marker results per definition in this new value. And the well-known statement 'if a statement is true, its negation is false' gets lost because there is a third option.
SQL's logic is an implementation of this so-calledtrivalent, ternary or three-valued logic (3VL). The existence of the NULL marker in SQL is not withoutcontroversy. But if NULLs are accepted, the 3VL is a necessity.
This page proceeds in two stages: First, it explains the handling of NULLs concerning comparisons, grouping, etc. . Second, it explains the boolean logic for the cases where the new value UNKNOWN interacts with any other boolean value - including itself.
To demonstrate NULL behaviors, we define an example tables: t1, and t2.
CREATETABLEt1(idDECIMALPRIMARYKEY,col_1DECIMAL,col_2VARCHAR(20),col_3DECIMAL);INSERTINTOt1VALUES(1,1,'Hello World',1);INSERTINTOt1VALUES(2,2,NULL,2);INSERTINTOt1VALUES(3,3,'Hello World',NULL);INSERTINTOt1VALUES(4,4,'Hello World',NULL);INSERTINTOt1VALUES(5,5,'Hello Their',NULL);INSERTINTOt1VALUES(6,NULL,'Hello World',NULL);INSERTINTOt1VALUES(7,NULL,'Hello World',NULL);INSERTINTOt1VALUES(8,8,'Hello World',NULL);INSERTINTOt1VALUES(18,18,'Hello World',NULL);CREATETABLEt2(idDECIMALPRIMARYKEY,col_xDECIMAL);INSERTINTOt2VALUES(1,1);INSERTINTOt2VALUES(2,NULL);INSERTINTOt2VALUES(3,3);INSERTINTOt2VALUES(4,4);INSERTINTOt2VALUES(5,5);INSERTINTOt2VALUES(18,18);COMMIT;
SQL knows the six comparison predicates <, <=, =, >=, > and <> (unequal). Their main purpose is the arithmetic comparison of numeric values. Each of them needs two variables or constants (infix notation). This implies that it is possible that one or even both operands hold the NULL marker. As stated before, the common and very simple rule is: "All comparisons to the NULL marker results per definition in this new value (unknown).". Here are some examples:
The WHERE clause returns such rows where it evaluates to TRUE. It does not return rows where it evaluates to FALSE or to UNKNOWN. In consequence, it is not guaranteed that the following SELECT will return the complete tablet1:
-- This SELECT will not return such rows where col_1 holds the NULL marker.SELECT*FROMt1WHEREcol_1>5ORcol_1=5ORcol_1<5;
Of course, there are use cases where rows with the NULL marker must be retrieved. Because the arithmetic comparisons are not able to do so, another language construct must do the job. It is theIS NULL predicate.
-- This SELECT will return exactly these rows where col_1 holds the NULL marker.SELECT*FROMt1WHEREcol_1ISNULL;
For the other predicates, there is no simple rule of thumb. They must be explained one after the other.
TheIN predicate is a shortcut for a sequence of OR operations:
Only the two comparisons 'col_1 = 3' and 'col_1 = 18' are able to retrieve rows (possibly many rows). The comparison 'col_1 = NULL' will never evaluate to TRUE. It's always UNKNOWN, even if col_1 holds the NULL marker. To retrieve those rows, it's necessary - as shown above - to use the 'IS NULL' predicate.
-- Shortcut for: col_1 = 3 OR col_1 = 18 OR col_1 = NULLSELECT*FROMt1WHEREcol_1IN(3,18,NULL);-- the NULL case will never hit with the IN predicate!
This is a little more complex. This will only return 1, 3, and 4, the items that don't have NULL in t2.col_x or t1.col_1.
SELECT*FROMt1WHEREcol_1IN(SELECTcol_xFROMt2WHEREid<10);
The subselect of anEXISTS predicate evaluates to TRUE if the cardinality of the retrieved rows is greater than 0, and to FALSE if the cardinality is 0. It is not possible that the UNKNOWN value occurs.
-- The subselect to t2 can hit some rows - or not. If there are hits in the subselect, ALL rows of t1-- are returned, else no rows of t1 are returned.SELECT*-- The select to table t1FROMt1WHEREEXISTS(SELECT*FROMt2WHEREid<10);-- The subselect to table t2
TheLIKE predicate compares a column with a regular expression. If the column contains the NULL marker, the LIKE predicate returns the UNKNOWN value, which means that the row is not retrieved.
-- The LIKE retrieves only rows where col_2 matches the WHERE statement and col_2 is not NULL.SELECT*FROMt1WHEREcol_2LIKE'Hello %';
Theaggregate functionsCOUNT(<column_name>), MIN(<column_name>), MAX(<column_name>), SUM(<column_name>) andAVG(<column_name>) ignores such rows where <column_name> contains the NULL marker. On the other handCOUNT(*) includes all rows.
If a parameter of one of thescalar functions likeUPPER(), TRIM(), CONCAT(), ABS(), SQRT(), ... contains the NULL marker the resulting value is - in the most cases - the NULL marker.
There are some situations where column values are compared to each other to answer the question, whether they are distinct. For usual numbers and strings, the result of such decisions is obvious. But how shall the DBMS handle NULL markers? Are they distinct from each other, are they equal to each other or is there no answer to this question at all? To get results, which are expected by (nearly) every end-user, the standard defines "Two null values are not distinct.", they build a single group.
SELECT DISTINCT col_1 FROM t1; retrieves one and only row for all rows wherecol_1 holds the NULL marker.
... GROUP BY col_1 ...; builds one and only one group for all rows wherecol_1 holds the NULL marker.
After we have seen how various comparisons and predicates on the NULL marker produce TRUE, FALSE, and UNKNOWN, it's necessary to explain the rules for the new logic value UNKNOWN.
A first elementary operation is the inspection of a truth value: is it TRUE, FALSE or UNKNOWN? Analogous to theIS NULL predicate there are three additional predicates:
-- Check for 'UNKNOWN'SELECT*FROMt1WHERE(col_1=col_3)ISUNKNOWN;-- parenthesis are not necessary-- ... is semantically equivalent toSELECT*FROMt1WHEREcol_1ISNULLORcol_3ISNULL;
In the abstract syntax of logical systemsp shall represent any of its truth values. Here is the three-valued logic truth table:
| p | IS TRUE | IS FALSE | IS UNKNOWN | IS NOT TRUE | IS NOT FALSE | IS NOT UNKNOWN | |
|---|---|---|---|---|---|---|---|
| TRUE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | |
| FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | TRUE | |
| UNKNOWN | FALSE | FALSE | TRUE | TRUE | TRUE | FALSE |
All predicates lead to TRUE or FALSE and never to UNKNOWN.
The next operation is the negation of the new value. Which values evaluate to 'NOT UNKNOWN'? The UNKNOWN value represents the impossibility to decide between TRUE and FALSE. It is not feasible that the negation of this impossibility leads to TRUE or FALSE. Likewise, it is UNKNOWN.
-- Which rows will match? (1)SELECT*FROMt1WHERENOTcol_2=NULL;-- 'col_2 = NULL' evaluates to UNKNOWN in all cases, see above.-- Is this SELECT equivalent to the first one? (2)SELECT*FROMt1EXCEPTSELECT*FROMt1WHEREcol_2=NULL;-- No, it's different!! Independent from NULL markers in col_2, (1) retrieves-- absolutely NO row and (2) retrieves ALL rows.
The above SELECT (1) will retrieve no rows as 'NOT col_2 = NULL' evaluates to the same as 'col_2 = NULL', namely UNKNOWN. And the SELECT (2) will retrieve all rows, as the part after EXCEPT will retrieve no rows, hence only the part before EXCEPT is relevant.
In the abstract syntax of logical systemsp shall represent any of its truth values and NOTp its negation. Herein the following table applies:
| p | NOTp |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
| UNKNOWN | UNKNOWN |
There are the two binary operations AND and OR. They evaluate as follows:
| p | q | p ANDq | p ORq |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE | TRUE |
| TRUE | UNKNOWN | UNKNOWN | TRUE |
| FALSE | TRUE | FALSE | TRUE |
| FALSE | FALSE | FALSE | FALSE |
| FALSE | UNKNOWN | FALSE | UNKNOWN |
| UNKNOWN | TRUE | UNKNOWN | TRUE |
| UNKNOWN | FALSE | FALSE | UNKNOWN |
| UNKNOWN | UNKNOWN | UNKNOWN | UNKNOWN |
The precedence of the operations is defined as usual: IS predicate, NOT, AND, OR.
---- Add a new row to the test data baseINSERTINTOperson(id,firstname,lastname)-- Omit some columns to generate NULL markersVALUES(99,'Tommy','Test');COMMIT;SELECT*FROMperson-- focus all tests to the new rowWHEREid=99-- (1): TRUEAND-- (3): (1) AND (2) ==> TRUE AND UNKNOWN ==> UNKNOWN(date_of_birth=NULL-- (2): UNKNOWN);-- no hitSELECT*FROMpersonWHEREid=99-- (1): TRUEAND-- (3): (1) AND (2) ==> TRUE AND TRUE ==> TRUE(date_of_birthISNULL-- (2): TRUE);-- hitSELECT*FROMpersonWHEREid=99-- (1): TRUEOR-- (3): (1) OR (2) ==> TRUE OR UNKNOWN ==> TRUE(date_of_birth=NULL-- (2): UNKNOWN);-- hitSELECT*FROMpersonWHEREid=99-- (1): TRUEAND-- (4): (1) AND (3) ==> TRUE AND FALSE ==> FALSE(NOT-- (3): NOT (2) ==> NOT TRUE ==> FALSEdate_of_birthISNULL-- (2): TRUE);-- no hit (same as AND date_of_birth IS NOT NULL)-- Clean up the test databaseDELETEFROMpersonWHEREid=99;DROPTABLEIFEXISTSt1,t2;COMMIT;
Atransaction is a grouping ofone or more SQL statements - that contains statements that write to the database, such as INSERT, UPDATE or DELETE, and also the SELECT command can be part of a transaction. All writing statementsmust be part of a transaction. The purpose of transactions is to guarantee that the database changes only from one consistent state to another consistent state. It combines all intermediate states into one change-event. This holds true also in critical situations such as parallel processing, disc crash, power failure, ... . Transactions ensure thedatabase integrity.
To do so they support four basic properties, which all in all are called theACID paradigm.
| Atomic | All SQL statements of the transaction take place or none. |
| Consistent | The sum of all data changes of a transaction transforms the database from one consistent state to another consistent state. |
| Isolated | The isolation level defines, which parts of uncommited transactions are visible to other sessions. |
| Durable | The database retains committed changes even if the system crashes afterward. |
As every SQL statement which writes to the database is part of a transaction, the DBMS silently starts a transaction for every of them if there is no in-complete transaction started by an earlier statement. Alternatively, the application/session can start a transaction explicitly by the commandSTART TRANSACTION.
All subsequent SQL commands are part of this transaction. The transaction remains until it is confirmed or rejected. The confirmation takes place with the commandCOMMIT, the rejection with the commandROLLBACK. Before the COMMIT or ROLLBACK command is submitted, the DBMS stores the results of every writing statement into an intermediate area where it is not visible to other sessions (see:Isolation Levels). Simultaneously with the COMMIT command, all changes of this transaction ends up in the common database, are visible to every other session, and the transaction terminates. If the COMMIT fails for any reason, it happens the same as when the session submits a ROLLBACK command: all changes of this transaction are discarded, and the transaction terminates. Optionally, a session can revert its complete writing actions, which are part of the active transaction, by submitting the single command ROLLBACK.
An Example:
-- Begin the transaction with an explicit command (In general not necessary. Not supported by Oracle.)STARTTRANSACTION;-- Insert some rowsINSERT...;-- Modify those rows or some other rowsUPDATE...;-- Delete some rowsDELETE...;-- If the COMMIT succeeds, the results of the above 3 commands have been transferred to the 'common'-- database and thus 'published' to all other sessions.COMMIT;----STARTTRANSACTION;INSERT...;UPDATE...;DELETE...;-- Discard INSERT, UPDATE and DELETEROLLBACK;
As transactions can include many statements, it is likely that runtime errors or logical errors arise. In some of such cases, applications want to rollback only parts of the actual transaction and commit the rest or resume the processing a second time. To do so, it is possible to define internal transaction boundaries, which reflects all processing from the start of the transaction up to this point in time. Such intermediate boundaries are calledsavepoints. COMMIT and ROLLBACK statements terminate the complete transaction, including its savepoints.
-- Begin the transaction with an explicit commandSTARTTRANSACTION;--INSERT...;-- Define a savepointSAVEPOINTstep_1;--UPDATE...;-- Discard only the UPDATE. The INSERT remains.ROLLBACKTOSAVEPOINTstep_1;-- try again (or do any other action)UPDATE...;-- confirm INSERT and the second UPDATECOMMIT;
During the lifetime of a transaction, a savepoint can be released if it's no longer needed. (It gets implicitly released at the end of the transaction.)
-- ...-- ...RELEASESAVEPOINT<savepoint_name>;-- This has no effect on the results of the previous INSERT, UPDATE or DELETE commands. It only eliminates the-- possibility to ROLLBACK TO SAVEPOINT <savepoint_name>.
A transaction guarantees that the results ofall of its statements are handled on a logical level asone single operation. All writing statements have a temporary nature until the COMMIT command terminates successfully.
This behavior helps to ensure the logical integrity of business logic. E.g.: If one wants to transfer some amount of money from one account to another, at least two rows of the database must be modified. The first modification decreases the amount in one row, and the second one increases it on a different row. If there is a disc crash or power failure between these two write-operations, the application has a problem. But theatomicity property of transactions guarantees that none of the write-operations reaches the database (in the case of any failure or a ROLLBACK) or all of them reach the database (in the case of a successful COMMIT).
There is more detailed information about theatomicity property at Wikipedia.
Transactions guarantee that the database is in a consistent state after they terminate. This consistency occurs at different levels:
In most situations, there are a lot of sessions working simultaneously on the DBMS. They compete for their resources, especially for the data. As long as the data is not modified, this is no problem. The DBMS can deliver the data to all of them.
But if multiple sessions try to modify data at the same point in time, conflicts are inescapable. Here is the timeline of an example with two sessions working on a flight reservation system. Session S1 reads the number of free seats for a flight: 1 free seat. S2 reads the number of free seats for the same flight: 1 free seat. S1 reserves the last seat. S2 reserves the last seat.
The central result of the analysis of such conflicts is that all of them are avoidable, if all transactions (concerning the same data) run sequentially: one after the other. But it's obvious that such a behavior is less efficient. The overall performance is increased if the DBMS does as much work as possible in parallel. The SQL standard offers a systematic of such conflicts and the commandSET TRANSACTION ... to resolve them with the aim to allow parallel operations as much as possible.
The standard identifies three problematic situations:
Depending on the requirements and access strategy of an application, some of the above problems may be tolerable - others not. The standard offers theSET TRANSACTION ... command to define, which are allowed to occur within a transaction and which not. TheSET TRANSACTION ... command must be the first statement within a transaction.
-- define (un)tolerable conflict situations (Oracle does not support all of them)SETTRANSACTIONISOLATIONLEVEL[READUNCOMMITTED|READCOMMITTED|REPEATABLEREAD|SERIALIZABLE];
The following table shows which problems may occur within each level.
| Isolation level | Dirty reads | Non-repeatable reads | Phantoms |
|---|---|---|---|
| Read Uncommitted | may occur | may occur | may occur |
| Read Committed | - | may occur | may occur |
| Repeatable Read | - | - | may occur |
| Serializable | - | - | - |
At Wikipedia there are more detailed informatiton and examples aboutisolation levels andconcurrency control.
Transactions guarantee that every confirmed write-operation will survive (almost) every subsequent disaster. To do so, in most cases, the DBMS writes the changes not only to the database but additionally to logfiles, which shall reside on different devices. So it is possible - after a disc crash - to restore all changes from a database backup plus these logfiles.
There is more detailed information about thedurability property at Wikipedia.
Some DBMS offers - outside of the standard - anAUTOCOMMIT feature. If it is activated, the feature submits automatically a COMMIT command after every writing statement with the consequence that you cannot ROLLBACK a logical unit-of-work consisting of a lot of SQL statements. Furthermore, the use of the SAVEPOINT feature is not possible.
In much cases, the feature is activated by default.
"Technology evolves from the primitive over the complex to the simple." (Antoine de Saint-Exupery)
Often there is the requirement to access the first or top n rows for every unique value of a given column: the cheapest product (= first row) within a product group (= unique values of the given column), the rows with the highest version number per entity within a historic table, the newest 10 log entries per user, ... . In the SQL world, this is a three-step-job: a) group the table over the given column b) order the rows within the created groups according to a criteria and c) access the first or the top n rows within the created, ordered groups.
In complex cases like this one SQL does not offer only a single solution. There are multiple formulations to get the expected result. At the logical level, they are equivalent, but it is likely that their performance differs strongly from each other. And it's likely that the performance of the same formulation differs strongly on different database systems. The deviation in performance results from the fact that SQL in general defines only WHAT the system shall do and not HOW it shall be done. It is the responsibility of the database system to find an optimal execution plan.
We offer some of the possible solutions - from primitive over complex to simple ones. They include subselects, joins, theFETCH FIRST clause, the use of a predicate, and finallywindow functions as the method of choice.
We use the example tableproduct with a small number of data rows to discuss diverse strategies.
CREATETABLEproduct(idINTEGERNOTNULL,nameVARCHAR(50)NOTNULL,product_groupVARCHAR(20)NOTNULL,prizeDECIMAL(5,2),CONSTRAINTproduct_pkPRIMARYKEY(id));INSERTINTOproductVALUES(1,'Big Fat One','Desktop',545);INSERTINTOproductVALUES(2,'SmartAndElegant','Laptop',675);INSERTINTOproductVALUES(3,'Angle','Laptop',398);INSERTINTOproductVALUES(4,'Wizzard 7','Smartphone',380);INSERTINTOproductVALUES(5,'Solid','Desktop',565);INSERTINTOproductVALUES(6,'AllRounder','Smartphone',535);INSERTINTOproductVALUES(7,'WhiteHorse','Laptop',675);INSERTINTOproductVALUES(8,'Workstation ONE','Desktop',499);INSERTINTOproductVALUES(9,'Air','Laptop',450);INSERTINTOproductVALUES(10,'Rusty','Laptop',390);INSERTINTOproductVALUES(11,'Tripple-A','Desktop',580);INSERTINTOproductVALUES(12,'Oxygen 8','Smartphone',450);INSERTINTOproductVALUES(13,'AllDay Basic','Smartphone',75);COMMIT;
With this structure and data, we will try to access the rows with the highest prize per product group.
Thefirst solution uses only theGROUP BY clause and reduces the problem in two ways: a) it offers only the very first row per group (ignoring the second best, third best, etc. rows) by using the functions max() or min() and b) the solution has only access to the grouping criteria and the result of max() / min(). However, due to the nature of theGROUP BY clause all remaining columns are not accessible - seehere.
SELECTproduct_group,MAX(prize)FROMproductGROUPBYproduct_group;product_group|max--------------+-----Smartphone|535Desktop|580Laptop|675-- access to other columns is not possibleSELECT*FROMproductGROUPBYproduct_group;
We can extend this first solution to show more columns by combining it with acorrelated or non-correlated subquery. Thissecond solution offers access to all columns. Nevertheless, the result is not what we expect as the number of accessed rows is 4. TheMAX(prize) criteria is not necessarily unique. Thus we receive 4 rows for the 3 groups out of our small example table. And - as mentioned above - we don't have access to the row with the second-highest prize.
-- SELECT with a non-correlated subquery. The subquery is executed only once.SELECT*FROMproductWHEREprizeIN-- prize is a very weak criterion(SELECTMAX(prize)FROMproductGROUPBYproduct_group);id|name|product_group|prize---+-----------------+---------------+-------11|Tripple-A|Desktop|5802|SmartAndElegant|Laptop|6757|WhiteHorse|Laptop|6756|AllRound|Smartphone|535-- SELECT with a correlated subquery. Observe the performance! The subquery is executed-- once per row of p1 !!!SELECT*FROMproductp1WHEREprizeIN-- prize is a very weak criterion(SELECTMAX(prize)FROMproductp2WHEREp1.product_group=p2.product_group);id|name|product_group|prize---+-----------------+---------------+-------11|Tripple-A|Desktop|5802|SmartAndElegant|Laptop|6757|WhiteHorse|Laptop|6756|AllRound|Smartphone|535
There are problems with these methods. If one uses nothing but theGROUP BY clause, the complete set columns and rows won't be displayed. If theGROUP BY is put into a subquery, all columns are displayed, but multiple rows for the same column will be displayed if more than one row meets the criteria.
The same holds true for thethird solution. One can create aJOIN over the product_group and reduce the resulting rows to those with the highest prize within the group by using theHAVING clause. The result is the same as with the second solution.
SELECTp1.*FROMproductp1JOINproductp2ON(p1.product_group=p2.product_group)GROUPBYp1.id,p1.name,p1.product_group,p1.prizeHAVINGp1.prize=MAX(p2.prize);id|name|product_group|prize---+-----------------+---------------+-------7|WhiteHorse|Laptop|6752|SmartAndElegant|Laptop|67511|Tripple-A|Desktop|5806|AllRound|Smartphone|535
As thefourth solution we offer a last example how to express the same issue - with the same imperfect result. It uses aNOT EXISTS predicate to search those rows, to which there is no higher prize within their group.
SELECT*FROMproductp1WHERENOTEXISTS(SELECT*FROMproductp2WHEREp1.product_group=p2.product_groupANDp1.prize<p2.prize);
To overcome the above shortcomings we make 2 adjustments. First, the link between the twoSELECTs (via join or subselect) must be changed to a column, with unique values. We will use the ID column. Second, we must use theFETCH FIRST clause in combination with anORDER BY to count rows.
First, we show a modification of theupper second solution. TheORDER BY clause sorts the rows into the desired sequence. TheFETCH FIRST clause restricts the number of resulting rows to any desired quantity per group. The result of the subquery is a list of ids. Because the ids are a unique criterion within our example table, the outerSELECT retrieves exactly the expected rows - with all their columns.
-- modification of the second solution (correlated subquery)SELECT*FROMproductp1WHEREidIN(SELECTidFROMproductp2WHEREp1.product_group=p2.product_groupORDERBYprizeDESCFETCHFIRST1ROWONLY-- replace "ONLY" with "WITH TIES" to include rows with identical prize at the cutting edge);id|name|product_group|prize---+-----------------+---------------+-------11|Tripple-A|Desktop|5802|SmartAndElegant|Laptop|6756|AllRound|Smartphone|535
Next, we use aJOIN LATERAL clause, which allows - similar to a correlated subquery - the use of previously named tables and their columns as a link to later named tables and their columns. In the example, every row of p1 is joined with the first row (FETCH FIRST) of p2 within the same group (p1.product_group = p2.product_group). The resulting columns of p2 are propagated to the outside parts of the query with the name p3. Finally, the join takes place over the id (ON p1.id = p3.id). The p2/p3 aliases retrieve only the rows with the highest prize per group, so they become the result.
SELECTp3.*FROMproductp1JOINLATERAL(SELECT*FROMproductp2WHEREp1.product_group=p2.product_groupORDERBYp2.prizeDESCFETCHFIRST1ROWONLY)p3ONp1.id=p3.id;
Window functions offer a very flexible and rich set of features. They work on multiple rows of the (intermediate) result set by 'sliding' over them like a 'window' and produce their results from the rows actually seen in the window.
They are addressed by two parts: the name of the desired function plus a definition of the 'sliding window', eg:SELECT row_number() OVER () as rownum .... In this case, the name of the function is 'row_number()' and the window definition 'OVER ()' remains empty, which leads to a window where all rows are seen. As its name suggests, the function counts the rows within the window.
In our case of 'n-rows-per-group' we must define windows which act on groups of rows (in the sense of theGROUP BY clause). To do so we expand the window definition toOVER (PARTITION BY product_group ... ) and get a counter per group:
SELECTproduct.*,row_number()OVER(PARTITIONBYproduct_groupORDERBYid)asrow_numberFROMproduct;id|name|product_group|prize|row_number----+-----------------+---------------+--------+------------1|BigFatOne|Desktop|545|15|Solid|Desktop|565|28|WorkstationONE|Desktop|499|311|Tripple-A|Desktop|580|42|SmartAndElegant|Laptop|675|13|Angle|Laptop|398|27|WhiteHorse|Laptop|675|39|Air|Laptop|450|410|Rusty|Laptop|390|54|Wizzard7|Smartphone|380|16|AllRounder|Smartphone|535|212|Oxygen8|Smartphone|450|313|AllDayBasic|Smartphone|75|4
Now the row_number starts with the value '1' within each group respectively partition. We can take advantage of this behaviour by sorting the rows as desired and limit the resulting rows to any desired quantity by querying this row_number in an outerWHERE clause.
As a window function cannot be used in theWHERE clause, we must use it in a SELECT which is nested in another SELECT. The outer SELECT reduces the number of lastly retrieved rows to one per group, which is the one with the highest prize as theOVER () clause contains aORDER BY.
SELECTtmp.*FROM(SELECTproduct.*,row_number()OVER(PARTITIONBYproduct_groupORDERBYprizeDESC)ASrownumber_per_groupFROMproduct)tmpWHERErownumber_per_group<2;id|name|product_group|prize|rownumber_per_group----+------------+---------------+--------+---------------------11|Tripple-A|Desktop|580|17|WhiteHorse|Laptop|675|16|AllRounder|Smartphone|535|1
You can easily modify this solution to enlarge the number of retrieved rows or to integrate additional window functions - eg. if you use rank() instead of row_number(), you get the additional row with id=2 and prize=675.
Lastly, we show a more complex query that retrieves additional statistical values per group. For details, please refer to the pageWindow functions.
SELECT*FROM(SELECTproduct.*,row_number()OVER(PARTITIONBYproduct_groupORDERBYprizeDESC)ASrownumber_per_group,min(prize)OVER(PARTITIONBYproduct_groupORDERBYprizeDESCROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ASmin,avg(prize)OVER(PARTITIONBYproduct_groupORDERBYprizeDESCROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ASavg,max(prize)OVER(PARTITIONBYproduct_groupORDERBYprizeDESCROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ASmaxFROMproduct)tmpWHERErownumber_per_group<2;id|name|product_group|prize|rownumber_per_group|min|avg|max----+------------+---------------+--------+---------------------+--------+--------+------11|Tripple-A|Desktop|580|1|499|547.25|5807|WhiteHorse|Laptop|675|1|390|517.60|6756|AllRounder|Smartphone|535|1|75|360.00|535
Sometimes people detect that corrupt or unwanted data exist in their database, e.g.: they forgot to create a Primary Key when using CREATE TABLE allowing time and identical values to be inserted in the key column or they recognize that values in one or the combination of multiple columns are not unique - as a violation of the business rules. The situation is commonly detected when they issue anALTER TABLE ... ADD PRIMARY KEY (...) orCREATE INDEX index_1 ON table_1 (col_1, col_2) command.
In such cases, data must be corrected, or some rows must be deleted. As the first case depends strongly on the individual situation, we focus on the latter action. In general, the SQL command will consist of two parts: The DELETE command and a second part, where the pending rows are identified. In complex situations, it may be necessary to use more than one SQL command (which always is declarative by definition) - maybe a CURSOR with a loop over the affected rows and additional actions depending on the values in different columns.
DELETE-- the DELETE command needs no additional specificationFROMmytableWHERE...-- identify the unwanted rows;
The solutions we discuss here are closely related to the explanations inStructured Query Language/Retrieve Top N Rows per Group. Over there, we locate certain rows within groups. The same must be done here as we want to delete only dedicated rows. At least one row must be kept in each affected group.
We use the same tableproduct on this page. We will eliminate all but one row where the product prize is identical to any other prize within the same product group. The goal is that each row will have a unique combination of product_group and prize.
A first approach to the situation may be a 'sniffing' in the data with aGROUP BY clause for a listing of possibly affected rows.
SELECTproduct_group,prize,COUNT(*)FROMproductGROUPBYproduct_group,prize-- create groupsHAVINGCOUNT(*)>1;-- count the number of rows within each groupproduct_group|prize|count---------------+--------+-------Laptop|675|2-- Count the number of groups where such a problem existsSELECTCOUNT(*)FROM(SELECTproduct_group,prize,COUNT(*)FROMproductGROUPBYproduct_group,prizeHAVINGCOUNT(*)>1)tmp;count-------1
But theGROUP BY clause is not very helpful as it is not possible to show columns other than the grouping columns and the result of some system functions likeCOUNT() (in rare cases a sort over a timestamp together withMAX(id) does help). The question is: how can we identify the 'right' and the 'wrong' rows? We need access to other columns of the rows to identify them. In the best case, we get access to the row's IDs.
To see such details we replace theGROUP BY clause by a window function (this is not the only possible solution). The following SQL command uses the same grouping over the two columns product_group and prize. And it uses a similar way to count affected rows. The main difference is that we see and have access to all columns of all rows.
SELECTproduct.*,COUNT(*)OVER(PARTITIONBYproduct_group,prizeROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ascntFROMproduct;id|name|product_group|prize|cnt----+-----------------+---------------+--------+-----8|WorkstationONE|Desktop|499|11|BigFatOne|Desktop|545|15|Solid|Desktop|565|111|Tripple-A|Desktop|580|110|Rusty|Laptop|390|13|Angle|Laptop|398|19|Air|Laptop|450|17|WhiteHorse|Laptop|675|22|SmartAndElegant|Laptop|675|213|AllDayBasic|Smartphone|75|14|Wizzard7|Smartphone|380|112|Oxygen8|Smartphone|450|16|AllRounder|Smartphone|535|1
ThisSELECT offers everything we need: The last columncnt counts the number of uniqueproduct_group/prize combinations. And the columnid gives us access to every single row.
In the next step, we expand this query and shift it into a subselect (window functions cannot be used in aWHERE clause, only their results). The rows with a counter of '1' are not of interest, we eliminate them from the further processing, order the remaining rows in a deterministic way, and compute an additional column for the position within each group.
SELECTtmp.*FROM(SELECTproduct.*,COUNT(*)OVER(PARTITIONBYproduct_group,prizeORDERBYidROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ascnt,ROW_NUMBER()OVER(PARTITIONBYproduct_group,prizeORDERBYid)asposition_within_groupFROMproduct)tmpWHEREtmp.cnt>1;id|name|product_group|prize|cnt|position_within_group----+-----------------+---------------+--------+-----+-----------------------2|SmartAndElegant|Laptop|675|2|17|WhiteHorse|Laptop|675|2|2
Up to this point our algorithm to identify problematic rows is easy, clear and the same for all use cases: create groups over the columns of interest with thePARTITION BY clause, count the number of rows within the groups, and eliminate groups with a counter of '1'. But now we have to decide which of the rows shall survive and which ones shall be deleted (or modified)? The answer depends strongly on the business logic, the manner in which the data was added into the table, the expectations of your customers, and much more. So you have to make your own decision.
On this page, we choose a simple solution: The row with the smallest ID shall survive; all others will be deleted. For testing purposes, we retrieve the rows we intend to delete, namely those with a position greater 1.
SELECTtmp.*FROM(SELECTproduct.*,COUNT(*)OVER(PARTITIONBYproduct_group,prizeORDERBYidROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ascnt,ROW_NUMBER()OVER(PARTITIONBYproduct_group,prizeORDERBYid)asposition_within_groupFROMproduct)tmpWHEREtmp.cnt>1ANDtmp.position_within_group>1;id|name|product_group|prize|cnt|position_within_group----+------------+---------------+--------+-----+-----------------------7|WhiteHorse|Laptop|675|2|2-- or retrieve the rows which will survive:...ANDtmp.position_within_group=1;
If this is what you expect, you can delete the rows in the final step. Reduce the above command to retrieve only the IDs, shift it into a subselect, and use its result as the input for aDELETE command.
BEGINTRANSACTION;DELETEFROMPRODUCTWHEREidIN(SELECTtmp.idFROM(SELECTproduct.*,COUNT(*)OVER(PARTITIONBYproduct_group,prizeORDERBYidROWSBETWEENUNBOUNDEDPRECEDINGANDUNBOUNDEDFOLLOWING)ascnt,ROW_NUMBER()OVER(PARTITIONBYproduct_group,prizeORDERBYid)asposition_within_groupFROMproduct)tmpWHEREtmp.cnt>1ANDtmp.position_within_group>1);COMMIT;-- or: ROLLBACK;
Programs using a database API that conforms to the SQL standard receive an indication about the success or failure of the call. This return code - which is called SQLSTATE - consists of 5 bytes. They are divided into two parts: the first and second byte contains aclass and the following three asubclass. Each class belongs to one of fourcategories: "S" denotes "Success" (class 00), "W" denotes "Warning" (class 01), "N" denotes "No data" (class 02) and "X" denotes "Exception" (all other classes).
The values ofSQLSTATE are published and explained in Wikipedia.
| ACID | An acronym for the four propertiesatomicity,consistency,isolation anddurability. Anytransaction must conform to them.Atomicity means that either all or no data modification will take place.Consistency ensures that transactions transform the database from one valid state to another valid state.Isolation requires that transactions will not affect each other, even if they run at the same time.Durability means that the modifications will keep into the database even if the system crashes immediately after the transaction. q.v.:ACID |
| Attribute | A set of properties (name, datatype, size, ...) used to characterize the data items ofentities. A group of attributes constructs anentity-type (ortable), i.e.: allvalues of a certaincolumn must conform to the same attributes. Attributes are optionally complemented byconstraints. |
| Block | Aggregation of one or more physical blocks of a mass device. Usually, a block contains numerousrows of one or more tables. Sometimes onerow is distributed across several blocks. q.v.:dirty block |
| Clause | A certain language element as part of astatement. E.g.: theWHERE clause defines search criteria. |
| Column | A set ofvalues of a singletable which resides on the same position within itsrows. |
| Constraint | Similar toattributes, constraints define rules at a higher level, data items must conform to. E.g.: nullability, primary and foreign key, uniqueness, default value, user-defined criteria likeSTATUS < 10. |
| Cursor | A cursor is a mechanism by which therows of atable may be acted on (e.g., returned to a host programming language) one at a time. |
| Database | A set oftables. Those tables contain user data and theData Dictionary. |
| Database Management System (DBMS) | A set of computer programs that controls the creation, maintenance and usage of thedatabase. q.v.:DBMS |
| Data Dictionary (DD) | A set of predefinedtables where theDBMS stores information about all user defined objects (tables, views, constraints, ...). |
| Data Control Language (DCL) | A class ofstatements which defines the access rights to data, e.g:GRANT ..., REVOKE, ... . |
| Data Definition Language (DDL) | A class ofstatements which defines logical and physical design of adatabase, e.g.:CREATE TABLE ... . |
| Data Manipulation Language (DML) | A class ofstatements which retrieves and manipulates data, e.g.:SELECT ..., INSERT ..., UPDATE ..., DELETE ..., COMMIT, ROLLBACK. |
| Dirty Block | Ablock whose content has changed in memory, but is still not written to disc. |
| Entity | An identifiable object like anemployee or adepartment. An entity is an instance of anentity-type. Usually there are many instances of a certainentity-type. Every entity is stored in onerow. Entities of sameentity-type are stored inrows of the sametable. So entities are a logical construct androws a physical implementation. |
| Entity-type | A group ofattributes describing the structure ofentities. Asentities of same entity-type are stored inrows of the sametable it can be said, that an entity-type describes atable. (Many people tend to use the termentity as a synonym for entity-type.) |
| Expression | A certain language element as part of astatement. It can produce either scalarvalues or atable. |
| Foreign key | Avalue used to reference aprimary key. It can point to any primary key in the database, whether in its owntable (e.g.: bill of materials) or another table. It can point to its own row. |
| Index | An index is a construct containing copies of original values and backreferences to their original rows. Its purpose is the provision of a fast access to the original data. To achieve this, an index contains some kind of collocation. Remark: Indexes are not part of the SQL standard. Nevertheless, they are part of nearly every DBMS. |
| Junction table | If more than one row of table T1 refers to more than one row of table T2 (many-to-many relationship) you need an intermediatetable to store this relationship. Therows of the intermediate table contain theprimary keys of T1 and T2 asvalues. q.v.:Junction_table |
| Normalization | Tables should conform to special rules - namelyFirst-,Second- andThird-Normal Form. The process of rearranging columns over tables is callednormalization. |
| NULL | Ifnovalue is stored in thecolumn of arow, the standard says, that thenull value is stored. As thisnull value is a flag and not a real value, we use the termnull marker within this wikibook. Thenull marker is used to indicate the absence of any data. For example, it makes a difference whether a temperature is measured and stored as 0 degrees or whether the temperature is not measured and hence not stored. One consequence of the existence of thenull marker is that SQL must know not only the boolean values TRUE and FALSE but also a third one: UNKNOWN. |
| Predicate | A language element which specifies a non-arithmetic condition. E.g: [NOT] IN, [NOT] LIKE, IS [NOT] NULL, [NOT] EXISTS, ANY, ... . |
| Primary key | Avalue or a set ofvalues used to identify a single row uniquely. |
| Query | An often usedstatement, which retrieves data from the database. It is introduced by the keyword SELECT and usually contains apredicate. |
| Relationship | A reference between two different or the sameentity. References are not implemented as links. They base upon thevalues of the entities. |
| Relational Model | A method (and a mathematical theory) to model data as tables (relations), the relationships among each other and all operations on the data. |
| Row | One record in atable containing information about one singleentity. A row has exactly onevalue for each of its columns - in accordance withFirst Normal Form. Thisvalue may beNULL. |
| Statement | A single command which is executed by theDBMS. There are 3 main classes of statements:DML,DDL andDCL. |
| Table (=Relation) | A set ofrows of a certainentity-type, i.e. allrows of a certain table have the same structure. |
| Three Valued Logic (3VL) | SQL knows three boolean values: TRUE, FALSE and UNKNOWN. See:NULL. q.v.:trivalent, ternary or three-valued logic (3VL). |
| Transaction | A logical unit of work consisting of one or more modifications to the database. TheACID criterium must be achieved. A transaction is either saved by theCOMMIT statement or completely canceled by theROLLBACK statement. |
| Value | Implementation of a single data item within a certaincolumn of a certainrow. (You can think of a cell within a spreadsheet.) |
| View | A virtualtable containing only its definition and no real data. The definition consists of aquery to one or more realtables or views.Queries to the view are processed asqueries to the underlying realtables. |
| Logical Design | Implementation |
| entity-type | table |
| entity | row |
| ? | column |
| data item | value |