22.3. Character Set Support
The character set support inPostgres Pro allows you to store text in a variety of character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such asEUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding). The default character set is selected while initializing yourPostgres Pro database cluster using An important restriction, however, is that each database's character set must be compatible with the database's Table 22.1 shows the character sets available for use inPostgres Pro. Table 22.1. Postgres Pro Character Sets sets the default character set to You can specify a non-default encoding at database creation time, provided that the encoding is compatible with the selected locale: This will create a database named Notice that the above commands specify copying the The encoding for a database is stored in the system catalog On most modern operating systems,Postgres Pro can determine which character set is implied by the Postgres Pro will allow superusers to create databases with Postgres Pro supports automatic character set conversion between server and client for certain character set combinations. The conversion information is stored in the Table 22.2. Client/Server Character Set Conversions To enable automatic character set conversion, you have to tellPostgres Pro the character set (encoding) you would like to use in the client. There are several ways to accomplish this: Using the libpq (Section 31.10) has functions to control the client encoding. Using Also you can use the standard SQL syntax To query the current client encoding: To return to the default encoding: Using Using the configuration variableclient_encoding. If the If the conversion of a particular character is not possible — suppose you chose If the client character set is defined as These are good sources to start learning about various kinds of encoding systems. Contains detailed explanations of The web site of the Unicode Consortium. UTF-8 (8-bit UCS/Unicode Transformation Format) is defined here.initdb. It can be overridden when you create a database, so you can have multiple databases each with a different character set.LC_CTYPE (character classification) andLC_COLLATE (string sort order) locale settings. ForC orPOSIX locale, any character set is allowed, but for other libc-provided locales there is only one character set that will work correctly. (On Windows, however, UTF-8 encoding can be used with any locale.) If you have ICU support configured, ICU-provided locales can be used with most but not all server-side encodings.22.3.1. Supported Character Sets
Name Description Language Server? ICU? Bytes/Char Aliases BIG5Big Five Traditional Chinese No No 1-2 WIN950,Windows950EUC_CNExtended UNIX Code-CN Simplified Chinese Yes Yes 1-3 EUC_JPExtended UNIX Code-JP Japanese Yes Yes 1-3 EUC_JIS_2004Extended UNIX Code-JP, JIS X 0213 Japanese Yes No 1-3 EUC_KRExtended UNIX Code-KR Korean Yes Yes 1-3 EUC_TWExtended UNIX Code-TW Traditional Chinese, Taiwanese Yes Yes 1-3 GB18030National Standard Chinese No No 1-4 GBKExtended National Standard Simplified Chinese No No 1-2 WIN936,Windows936ISO_8859_5ISO 8859-5,ECMA 113 Latin/Cyrillic Yes Yes 1 ISO_8859_6ISO 8859-6,ECMA 114 Latin/Arabic Yes Yes 1 ISO_8859_7ISO 8859-7,ECMA 118 Latin/Greek Yes Yes 1 ISO_8859_8ISO 8859-8,ECMA 121 Latin/Hebrew Yes Yes 1 JOHABJOHAB Korean (Hangul) No No 1-3 KOI8RKOI8-R Cyrillic (Russian) Yes Yes 1 KOI8KOI8UKOI8-U Cyrillic (Ukrainian) Yes Yes 1 LATIN1ISO 8859-1,ECMA 94 Western European Yes Yes 1 ISO88591LATIN2ISO 8859-2,ECMA 94 Central European Yes Yes 1 ISO88592LATIN3ISO 8859-3,ECMA 94 South European Yes Yes 1 ISO88593LATIN4ISO 8859-4,ECMA 94 North European Yes Yes 1 ISO88594LATIN5ISO 8859-9,ECMA 128 Turkish Yes Yes 1 ISO88599LATIN6ISO 8859-10,ECMA 144 Nordic Yes Yes 1 ISO885910LATIN7ISO 8859-13 Baltic Yes Yes 1 ISO885913LATIN8ISO 8859-14 Celtic Yes Yes 1 ISO885914LATIN9ISO 8859-15 LATIN1 with Euro and accents Yes Yes 1 ISO885915LATIN10ISO 8859-16,ASRO SR 14111 Romanian Yes No 1 ISO885916MULE_INTERNALMule internal code Multilingual Emacs Yes No 1-4 SJISShift JIS Japanese No No 1-2 Mskanji,ShiftJIS,WIN932,Windows932SHIFT_JIS_2004Shift JIS, JIS X 0213 Japanese No No 1-2 SQL_ASCIIunspecified (see text) any Yes No 1 UHCUnified Hangul Code Korean No No 1-2 WIN949,Windows949UTF8Unicode, 8-bit all Yes Yes 1-4 UnicodeWIN866Windows CP866 Cyrillic Yes Yes 1 ALTWIN874Windows CP874 Thai Yes No 1 WIN1250Windows CP1250 Central European Yes Yes 1 WIN1251Windows CP1251 Cyrillic Yes Yes 1 WINWIN1252Windows CP1252 Western European Yes Yes 1 WIN1253Windows CP1253 Greek Yes Yes 1 WIN1254Windows CP1254 Turkish Yes Yes 1 WIN1255Windows CP1255 Hebrew Yes Yes 1 WIN1256Windows CP1256 Arabic Yes Yes 1 WIN1257Windows CP1257 Baltic Yes Yes 1 WIN1258Windows CP1258 Vietnamese Yes Yes 1 ABC,TCVN,TCVN5712,VSCII22.3.2. Setting the Character Set
initdb defines the default character set (encoding) for aPostgres Pro cluster. For example,initdb -E EUC_JP
EUC_JP (Extended Unix Code for Japanese). You can use--encoding instead of-E if you prefer longer option strings. If no-E or--encoding option is given,initdb attempts to determine the appropriate encoding to use based on the specified or default locale.createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
korean that uses the character setEUC_KR, and localeko_KR. Another way to accomplish this is to use this SQL command:CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
template0 database. When copying any other database, the encoding and locale settings cannot be changed from those of the source database, because that might result in corrupt data. For more information seeSection 21.3.pg_database. You can see it by using thepsql-l option or the\l command.$
psql -l List of databases Name | Owner | Encoding | Collation | Ctype | Access Privileges -----------+----------+-----------+-------------+-------------+------------------------------------- clocaledb | hlinnaka | SQL_ASCII | C | C | englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 | japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 | korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr | postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka} template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}(7 rows)Important
LC_CTYPE setting, and it will enforce that only the matching database encoding is used. On older systems it is your responsibility to ensure that you use the encoding expected by the locale you have selected. A mistake in this area is likely to lead to strange behavior of locale-dependent operations such as sorting.SQL_ASCII encoding even whenLC_CTYPE is notC orPOSIX. As noted above,SQL_ASCII does not enforce that the data stored in the database has any particular encoding, and so this choice poses risks of locale-dependent misbehavior. Using this combination of settings is deprecated and may someday be forbidden altogether.22.3.3. Automatic Character Set Conversion Between Server and Client
pg_conversion system catalog.Postgres Pro comes with some predefined conversions, as shown inTable 22.2. You can create a new conversion using the SQL commandCREATE CONVERSION.Server Character Set Available Client Character Sets BIG5not supported as a server encoding EUC_CNEUC_CN, MULE_INTERNAL,UTF8EUC_JPEUC_JP, MULE_INTERNAL,SJIS,UTF8EUC_JIS_2004EUC_JIS_2004, SHIFT_JIS_2004,UTF8EUC_KREUC_KR, MULE_INTERNAL,UTF8EUC_TWEUC_TW, BIG5,MULE_INTERNAL,UTF8GB18030not supported as a server encoding GBKnot supported as a server encoding ISO_8859_5ISO_8859_5, KOI8R,MULE_INTERNAL,UTF8,WIN866,WIN1251ISO_8859_6ISO_8859_6, UTF8ISO_8859_7ISO_8859_7, UTF8ISO_8859_8ISO_8859_8, UTF8JOHABnot supported as a server encoding KOI8RKOI8R, ISO_8859_5,MULE_INTERNAL,UTF8,WIN866,WIN1251KOI8UKOI8U, UTF8LATIN1LATIN1, MULE_INTERNAL,UTF8LATIN2LATIN2, MULE_INTERNAL,UTF8,WIN1250LATIN3LATIN3, MULE_INTERNAL,UTF8LATIN4LATIN4, MULE_INTERNAL,UTF8LATIN5LATIN5, UTF8LATIN6LATIN6, UTF8LATIN7LATIN7, UTF8LATIN8LATIN8, UTF8LATIN9LATIN9, UTF8LATIN10LATIN10, UTF8MULE_INTERNALMULE_INTERNAL, BIG5,EUC_CN,EUC_JP,EUC_KR,EUC_TW,ISO_8859_5,KOI8R,LATIN1 toLATIN4,SJIS,WIN866,WIN1250,WIN1251SJISnot supported as a server encoding SHIFT_JIS_2004not supported as a server encoding SQL_ASCIIany (no conversion will be performed) UHCnot supported as a server encoding UTF8all supported encodings WIN866WIN866, ISO_8859_5,KOI8R,MULE_INTERNAL,UTF8,WIN1251WIN874WIN874, UTF8WIN1250WIN1250, LATIN2,MULE_INTERNAL,UTF8WIN1251WIN1251, ISO_8859_5,KOI8R,MULE_INTERNAL,UTF8,WIN866WIN1252WIN1252, UTF8WIN1253WIN1253, UTF8WIN1254WIN1254, UTF8WIN1255WIN1255, UTF8WIN1256WIN1256, UTF8WIN1257WIN1257, UTF8WIN1258WIN1258, UTF8\encoding command inpsql.\encoding allows you to change client encoding on the fly. For example, to change the encoding toSJIS, type:\encoding SJIS
SET client_encoding TO. Setting the client encoding can be done with this SQL command:SET CLIENT_ENCODING TO '
value';SET NAMES for this purpose:SET NAMES '
value';SHOW client_encoding;
RESET client_encoding;
PGCLIENTENCODING. If the environment variablePGCLIENTENCODING is defined in the client's environment, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.)client_encoding variable is set, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.)EUC_JP for the server andLATIN1 for the client, and some Japanese characters are returned that do not have a representation inLATIN1 — an error is reported.SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. Just as for the server, use ofSQL_ASCII is unwise unless you are working with all-ASCII data.22.3.4. Further Reading
EUC_JP,EUC_CN,EUC_KR,EUC_TW.