1- <!-- $PostgreSQL: pgsql/doc/src/sgml/syntax.sgml,v 1.123 2008/06/26 22:24:42 momjian Exp $ -->
1+ <!-- $PostgreSQL: pgsql/doc/src/sgml/syntax.sgml,v 1.124 2008/10/29 08:04:52 petere Exp $ -->
22
33<chapter id="sql-syntax">
44 <title>SQL Syntax</title>
@@ -189,6 +189,57 @@ UPDATE "my_table" SET "a" = 5;
189189 ampersands. The length limitation still applies.
190190 </para>
191191
192+ <para>
193+ <indexterm><primary>Unicode escape</primary><secondary>in
194+ identifiers</secondary></indexterm> A variant of quoted
195+ identifiers allows including escaped Unicode characters identified
196+ by their code points. This variant starts
197+ with <literal>U&</literal> (upper or lower case U followed by
198+ ampersand) immediately before the opening double quote, without
199+ any spaces in between, for example <literal>U&"foo"</literal>.
200+ (Note that this creates an ambiguity with the
201+ operator <literal>&</literal>. Use spaces around the operator to
202+ avoid this problem.) Inside the quotes, Unicode characters can be
203+ specified in escaped form by writing a backslash followed by the
204+ four-digit hexadecimal code point number or alternatively a
205+ backslash followed by a plus sign followed by a six-digit
206+ hexadecimal code point number. For example, the
207+ identifier <literal>"data"</literal> could be written as
208+ <programlisting>
209+ U&"d\0061t\+000061"
210+ </programlisting>
211+ The following less trivial example writes the Russian
212+ word <quote>slon</quote> (elephant) in Cyrillic letters:
213+ <programlisting>
214+ U&"\0441\043B\043E\043D"
215+ </programlisting>
216+ </para>
217+
218+ <para>
219+ If an escape character other than backslash is desired, it can
220+ be specified using
221+ the <literal>UESCAPE</literal><indexterm><primary>UESCAPE</primary></indexterm>
222+ clause after the string, for example:
223+ <programlisting>
224+ U&"d!0061t!+000061" UESCAPE '!'
225+ </programlisting>
226+ The escape character can be any single character other than a
227+ hexadecimal digit, the plus sign, a single quote, a double quote,
228+ or a whitespace character. Note that the escape character is
229+ written in single quotes, not double quotes.
230+ </para>
231+
232+ <para>
233+ To include the escape character in the identifier literally, write
234+ it twice.
235+ </para>
236+
237+ <para>
238+ The Unicode escape syntax is fully available only when the server encoding
239+ is UTF8. With any other server encoding, only code points in
240+ the ASCII range (up to <literal>\007F</literal>) can be specified.
241+ </para>
242+
192243 <para>
193244 Quoting an identifier also makes it case-sensitive, whereas
194245 unquoted names are always folded to lower case. For example, the
@@ -245,7 +296,7 @@ UPDATE "my_table" SET "a" = 5;
245296 write two adjacent single quotes, e.g.
246297 <literal>'Dianne''s horse'</literal>.
247298 Note that this is <emphasis>not</> the same as a double-quote
248- character (<literal>"</>).
299+ character (<literal>"</>). <!-- font-lock sanity: " -->
249300 </para>
250301
251302 <para>
@@ -269,14 +320,19 @@ SELECT 'foo' 'bar';
269320 by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is
270321 following the standard.)
271322 </para>
323+ </sect3>
272324
273- <para>
274- <indexterm>
325+ <sect3 id="sql-syntax-strings-escape">
326+ <title>String Constants with C-Style Escapes</title>
327+
328+ <indexterm zone="sql-syntax-strings-escape">
275329 <primary>escape string syntax</primary>
276330 </indexterm>
277- <indexterm>
331+ <indexterm zone="sql-syntax-strings-escape">
278332 <primary>backslash escapes</primary>
279333 </indexterm>
334+
335+ <para>
280336 <productname>PostgreSQL</productname> also accepts <quote>escape</>
281337 string constants, which are an extension to the SQL standard.
282338 An escape string constant is specified by writing the letter
@@ -287,7 +343,8 @@ SELECT 'foo' 'bar';
287343 Within an escape string, a backslash character (<literal>\</>) begins a
288344 C-like <firstterm>backslash escape</> sequence, in which the combination
289345 of backslash and following character(s) represent a special byte
290- value:
346+ value, as shown in <xref linkend="sql-backslash-table">.
347+ </para>
291348
292349 <table id="sql-backslash-table">
293350 <title>Backslash Escape Sequences</title>
@@ -341,14 +398,24 @@ SELECT 'foo' 'bar';
341398 </tgroup>
342399 </table>
343400
344- It is your responsibility that the byte sequences you create are
345- valid characters in the server character set encoding. Any other
401+ <para>
402+ Any other
346403 character following a backslash is taken literally. Thus, to
347404 include a backslash character, write two backslashes (<literal>\\</>).
348405 Also, a single quote can be included in an escape string by writing
349406 <literal>\'</literal>, in addition to the normal way of <literal>''</>.
350407 </para>
351408
409+ <para>
410+ It is your responsibility to ensure that the byte sequences you create are
411+ valid characters in the server character set encoding. When the
412+ server encoding is UTF-8, the alternative Unicode escape
413+ syntax, explained in <xref linkend="sql-syntax-strings-uescape">,
414+ should be used instead. (The alternative would be doing the
415+ UTF-8 encoding by hand and writing out the bytes, which would be
416+ very cumbersome.)
417+ </para>
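
To see why encoding UTF-8 by hand is cumbersome, the following sketch computes the byte-level escapes one would have to write for a given character. It assumes the backslash-octal byte form (`\ooo`) of escape strings; the helper name `utf8_octal_escapes` is hypothetical and exists only for this illustration.

```python
def utf8_octal_escapes(text: str) -> str:
    """Return each UTF-8 byte of `text` as a three-digit backslash-octal escape."""
    return "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))


# U+0441 (Cyrillic small letter es) occupies two bytes in UTF-8,
# so a single code point turns into two hand-written byte escapes.
cyrillic_es = "\u0441"
byte_escapes = utf8_octal_escapes(cyrillic_es)
```

Here `byte_escapes` is `\321\201`, whereas the Unicode escape syntax expresses the same character as the single code point `\0441`.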
418+
352419 <caution>
353420 <para>
354421 If the configuration parameter
@@ -379,6 +446,65 @@ SELECT 'foo' 'bar';
379446 </para>
380447 </sect3>
381448
449+ <sect3 id="sql-syntax-strings-uescape">
450+ <title>String Constants with Unicode Escapes</title>
451+
452+ <indexterm zone="sql-syntax-strings-uescape">
453+ <primary>Unicode escape</primary>
454+ <secondary>in string constants</secondary>
455+ </indexterm>
456+
457+ <para>
458+ <productname>PostgreSQL</productname> also supports another type
459+ of escape syntax for strings that allows specifying arbitrary
460+ Unicode characters by code point. A Unicode escape string
461+ constant starts with <literal>U&</literal> (upper or lower case
462+ letter U followed by ampersand) immediately before the opening
463+ quote, without any spaces in between, for
464+ example <literal>U&'foo'</literal>. (Note that this creates an
465+ ambiguity with the operator <literal>&</literal>. Use spaces
466+ around the operator to avoid this problem.) Inside the quotes,
467+ Unicode characters can be specified in escaped form by writing a
468+ backslash followed by the four-digit hexadecimal code point
469+ number or alternatively a backslash followed by a plus sign
470+ followed by a six-digit hexadecimal code point number. For
471+ example, the string <literal>'data'</literal> could be written as
472+ <programlisting>
473+ U&'d\0061t\+000061'
474+ </programlisting>
475+ The following less trivial example writes the Russian
476+ word <quote>slon</quote> (elephant) in Cyrillic letters:
477+ <programlisting>
478+ U&'\0441\043B\043E\043D'
479+ </programlisting>
480+ </para>
481+
482+ <para>
483+ If an escape character other than backslash is desired, it can
484+ be specified using
485+ the <literal>UESCAPE</literal><indexterm><primary>UESCAPE</primary></indexterm>
486+ clause after the string, for example:
487+ <programlisting>
488+ U&'d!0061t!+000061' UESCAPE '!'
489+ </programlisting>
490+ The escape character can be any single character other than a
491+ hexadecimal digit, the plus sign, a single quote, a double quote,
492+ or a whitespace character.
493+ </para>
494+
495+ <para>
496+ The Unicode escape syntax is fully available only when the server encoding
497+ is UTF8. With any other server encoding, only code points in
498+ the ASCII range (up to <literal>\007F</literal>) can be
499+ specified.
500+ </para>
501+
502+ <para>
503+ To include the escape character in the string literally, write it
504+ twice.
505+ </para>
506+ </sect3>
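
Going the other way, a client might want to emit an ASCII-only Unicode escape string constant for arbitrary text. The sketch below follows the rules of this section (four-digit form, plus-sign six-digit form, doubled escape character, `UESCAPE` clause) together with the standard doubling of single quotes. The function name `to_unicode_escape_literal` is hypothetical, and the escape character is not validated here.

```python
def to_unicode_escape_literal(text: str, escape: str = "\\") -> str:
    """Build a U&'...' constant for `text` using only ASCII characters."""
    parts = []
    for ch in text:
        code = ord(ch)
        if ch == escape:
            parts.append(escape * 2)              # literal escape char: write it twice
        elif ch == "'":
            parts.append("''")                    # literal single quote: double it
        elif 0x20 <= code < 0x7F:
            parts.append(ch)                      # printable ASCII passes through
        elif code <= 0xFFFF:
            parts.append(f"{escape}{code:04X}")   # four-digit form
        else:
            parts.append(f"{escape}+{code:06X}")  # six-digit form after a plus sign
    literal = "U&'" + "".join(parts) + "'"
    if escape != "\\":
        # A non-default escape character must be declared with UESCAPE.
        literal += f" UESCAPE '{escape}'"
    return literal
```

Applied to the Cyrillic word from the example above, this produces `U&'\0441\043B\043E\043D'`. In real applications, passing values as query parameters is usually preferable to building literals by hand.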
507+
382508 <sect3 id="sql-syntax-dollar-quoting">
383509 <title>Dollar-Quoted String Constants</title>
384510