22.2. Collation Support
The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that theLC_COLLATE
andLC_CTYPE
settings of a database cannot be changed after its creation.
22.2.1. Concepts
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types aretext
,varchar
, andchar
. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the“default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, withORDER BY
clauses and function or operator calls such as<
. The collation to apply for anORDER BY
clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such aslower
,upper
, andinitcap
; by pattern matching operators; and byto_char
and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
Thecollation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when aCOLLATE
clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
For example, consider this table definition:
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES", ...);
Then in
SELECT a < 'foo' FROM test1;
the<
comparison is performed according tode_DE
rules, because the expression combines an implicitly derived collation with the default collation. But in
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
the comparison is performed usingfr_FR
rules, because the explicit collation derivation overrides the implicit one. Furthermore, given
SELECT a < b FROM test1;
the parser cannot determine which collation to apply, since thea
andb
columns have conflicting implicit collations. Since the<
operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:
SELECT a < b COLLATE "de_DE" FROM test1;
or equivalently
SELECT a COLLATE "de_DE" < b FROM test1;
On the other hand, the structurally similar case
SELECT a || b FROM test1;
does not result in an error, because the||
operator does not care about collations: its result is the same regardless of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in
SELECT * FROM test1 ORDER BY a || 'foo';
the ordering will be done according tode_DE
rules. But this query:
SELECT * FROM test1 ORDER BY a || b;
results in an error, because even though the||
operator doesn't need to know a collation, theORDER BY
clause does. As before, the conflict can be resolved with an explicit collation specifier:
SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
22.2.2. Managing Collations
A collation is an SQL schema object that maps an SQL name to operating system locales. In particular, it maps to a combination ofLC_COLLATE
andLC_CTYPE
. (As the name would suggest, the main purpose of a collation is to setLC_COLLATE
, which controls the sort order. But it is rarely necessary in practice to have anLC_CTYPE
setting that is different fromLC_COLLATE
, so it is more convenient to collect these under one concept than to create another infrastructure for settingLC_CTYPE
per expression.) Also, a collation is tied to a character set encoding (seeSection 22.3). The same collation name may exist for different encodings.
On all platforms, the collations nameddefault
,C
, andPOSIX
are available. Additional collations may be available depending on operating system support. Thedefault
collation selects theLC_COLLATE
andLC_CTYPE
values specified at database creation time. TheC
andPOSIX
collations both specify“traditional C” behavior, in which only the ASCII letters“A
” through“Z
” are treated as letters, and sorting is done strictly by character code byte values.
If the operating system provides support for using multiple locales within a single program (newlocale
and related functions), then when a database cluster is initialized,initdb
populates the system catalogpg_collation
with collations based on all the locales it finds on the operating system at the time. For example, the operating system might provide a locale namedde_DE.utf8
.initdb
would then create a collation namedde_DE.utf8
for encodingUTF8
that has bothLC_COLLATE
andLC_CTYPE
set tode_DE.utf8
. It will also create a collation with the.utf8
tag stripped off the name. So you could also use the collation under the namede_DE
, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.
In case a collation is needed that has different values forLC_COLLATE
andLC_CTYPE
, a new collation may be created using theCREATE COLLATION command. That command can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications.
Within any particular database, only collations that use that database's encoding are of interest. Other entries inpg_collation
are ignored. Thus, a stripped collation name such asde_DE
can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one fewer thing you need to change if you decide to change to another database encoding. Note however that thedefault
,C
, andPOSIX
collations can be used regardless of the database encoding.
Postgres Pro considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,
SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
will draw an error even though theC
andPOSIX
collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.