1- postgresql 6.5.1 multi-byte (MB) support README    July 11 1999
1+ PostgreSQL 7.0 multi-byte (MB) support README    Mar 22 2000
22
33Tatsuo Ishii
4- t-ishii@sra.co.jp
4+ ishii@postgresql.org
55 http://www.sra.co.jp/people/t-ishii/PostgreSQL/
66
770. Introduction
88
99The MB support is intended for allowing PostgreSQL to handle
1010multi-byte character sets such as EUC(Extended Unix Code), Unicode and
1111Mule internal code. With the MB enabled you can use multi-byte
12- character sets in regexp ,LIKE and some functions. The default
12+ character sets in regexp ,LIKE and some other functions. The default
1313encoding system chosen is determined while initializing your
1414PostgreSQL installation using initdb(1). Note that this can be
15- overridden when you create a database using createdb(1) or create
16- database SQL command. So you could have multiple databases with
17- different encoding systems.
15+ overridden when you create a database using createdb(1) or by using a
16+ create database SQL command. So you could have multiple databases with
17+ each different encoding system.
1818
1919MB also fixes some problems concerning with 8-bit single byte
2020character sets including ISO8859. (I would not say all of problems
@@ -24,11 +24,11 @@ me know if you find any problem while using 8-bit characters)
2424
25251. How to use
2626
27- run configure with the mb option:
27+ run configure with a multibyte option:
2828
29- % configure --with-mb=encoding_system
29+ % ./configure --enable-multibyte[=encoding_system]
3030
31- where encoding_system is one of:
31+ where the encoding_system is one of:
3232
3333SQL_ASCII       ASCII
3434EUC_JP          Japanese EUC
@@ -48,21 +48,21 @@ where encoding_system is one of:
4848
4949Example:
5050
51- % configure --with-mb=EUC_JP
51+ % ./configure --enable-multibyte=EUC_JP
5252
53- If MB is disabled, nothing is changed except better supporting for
54- 8-bit single byte character sets.
53+ If the encoding system is omitted (./configure --enable-multibyte),
54+ SQL_ASCII is assumed.
5555
56- 2. How to set encoding
56+ 2. How to set the encoding
5757
5858initdb command defines the default encoding for a PostgreSQL
5959installation. For example:
6060
61- % initdb -e EUC_JP
61+ % initdb -E EUC_JP
6262
6363sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
64- Note that you can use "-pgencoding" instead of "-e" if you like longer
65- option string:-) If no -e or -pgencoding option is given, the encoding
64+ Note that you can use "--encoding" instead of "-E" if you like longer
65+ option string:-) If no -E or --encoding option is given, the encoding
6666specified at the compile time is used.
6767
6868You can create a database with a different encoding.
@@ -75,78 +75,85 @@ another way to accomplish this is to use a SQL command:
7575CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
7676
7777The encoding for a database is represented as "encoding" column in the
78- pg_database system catalog.
78+ pg_database system catalog. You can see that by using -l or \l of psql
79+ command.
7980
80- datname |datdba|encoding|datpath
81- -------------+------+--------+-------------
82- template1 | 1739| 1|template1
83- postgres | 1739| 0|postgres
84- euc_jp | 1739| 1|euc_jp
85- euc_kr | 1739| 3|euc_kr
86- euc_cn | 1739| 2|euc_cn
87- unicode | 1739| 5|unicode
88- mule_internal| 1739| 6|mule_internal
81+ $ psql -l
82+ List of databases
83+ Database | Owner | Encoding
84+ ---------------+---------+---------------
85+ euc_cn | t-ishii | EUC_CN
86+ euc_jp | t-ishii | EUC_JP
87+ euc_kr | t-ishii | EUC_KR
88+ euc_tw | t-ishii | EUC_TW
89+ mule_internal | t-ishii | MULE_INTERNAL
90+ regression | t-ishii | SQL_ASCII
91+ template1 | t-ishii | EUC_JP
92+ test | t-ishii | EUC_JP
93+ unicode | t-ishii | UNICODE
94+ (9 rows)
8995
90- A number in the encoding column is "encoding id" and can be translated
91- to the encoding name using pg_encoding command.
96+ 3. Automatic encoding translation between backend and frontend
9297
93- $ pg_encoding 1
94- EUC_JP
98+ PostgreSQL supports an automatic encoding translation between backend
99+ and frontend for some encodings.
95100
96- If an argument to pg_encoding is not a number, then it is regarded as
97- an encoding name and pg_encoding will return the encoding id.
101+ encoding of backend      available encoding of frontend
102+ --------------------------------------------------------------------
103+ EUC_JP                   EUC_JP, SJIS
104+
105+ EUC_TW                   EUC_TW, BIG5
106+
107+ LATIN2                   LATIN2, WIN1250
108+
109+ LATIN5                   LATIN5, WIN, ALT
110+
111+ MULE_INTERNAL            EUC_JP, SJIS, EUC_KR, EUC_CN,
112+                          EUC_TW, BIG5, LATIN1 to LATIN5,
113+                          WIN, ALT, WIN1250
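
As a rough illustration of what translation between a backend encoding and
a frontend encoding means for one row of this table (EUC_JP backend, SJIS
frontend), the following Python sketch re-encodes the same text both ways.
It is not PostgreSQL code: Python's standard codecs stand in for the
backend's conversion routines, the helper name translate is made up for
this example, and the conversion here goes through Unicode, which is an
assumption of the sketch and not a statement about how the backend works.

```python
# Illustration only: Python codecs stand in for the backend's translation.
# The function name "translate" is hypothetical, not a PostgreSQL API.

def translate(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    return data.decode(source).encode(target)

# Three Japanese characters, written as escapes to keep this file ASCII.
text = "\u65e5\u672c\u8a9e"

sjis = text.encode("shift_jis")                 # bytes an SJIS frontend would send
euc = translate(sjis, "shift_jis", "euc_jp")    # bytes an EUC_JP backend would see
back = translate(euc, "euc_jp", "shift_jis")    # bytes returned to the frontend

print("SJIS  :", sjis.hex())
print("EUC_JP:", euc.hex())
print("round trip ok:", back == sjis)
```

The same characters have different byte sequences in the two encodings,
which is why the frontend has to tell the backend which one it uses.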
98114
99- $ pg_encoding EUC_JP
100- 1
115+ To enable the automatic encoding translation, you have to tell
116+ PostgreSQL the encoding you would like to use in frontend. There are
117+ several ways to accomplish this.
101118
102- 3. PGCLIENTENCODING
119+ o using \encoding command in psql
103120
104- If an environment variable PGCLIENTENCODING is defined on the
105- frontend, automatic encoding translation is done by the backend. For
106- example, if the backend has been compiled with MB=EUC_JP and
107- PGCLIENTENCODING=SJIS(Shift JIS: yet another Japanese encoding
108- system), then any SJIS strings coming from the frontend would be
109- translated to EUC_JP before going into the parser. Outputs from the
110- backend would be translated to SJIS of course.
121+ \encoding allows you to change frontend encoding on the fly. For
122+ example, to change the encoding to SJIS, type:
111123
112- Supported encodings for PGCLIENTENCODING are:
124+ \encoding SJIS
113125
114- SQL_ASCII       ASCII
115- EUC_JP          Japanese EUC
116- SJIS            Yet another Japanese encoding
117- EUC_CN          Chinese EUC
118- EUC_KR          Korean EUC
119- EUC_TW          Taiwan EUC
120- BIG5            Traditional Chinese
121- MULE_INTERNAL   Mule internal
122- LATIN1          ISO 8859-1 English and some European languages
123- LATIN2          ISO 8859-2 English and some European languages
124- LATIN3          ISO 8859-3 English and some European languages
125- LATIN4          ISO 8859-4 English and some European languages
126- LATIN5          ISO 8859-5 English and some European languages
127- KOI8            KOI8-R
128- WIN             Windows CP1251
129- ALT             Windows CP866
130- WIN1250         Windows CP1250 (Czech)
126+ o using libpq functions
131127
132- Note that UNICODE is not supported(yet). Also note that the
133- translation is not always possible. Suppose you choose EUC_JP for the
134- backend, LATIN1 for the frontend, then some Japanese characters cannot
135- be translated into latin. In this case, a letter cannot be represented
136- in the Latin character set, would be transformed as:
128+ \encoding actually calls PQsetClientEncoding() for its purpose.
137129
138- (HEXA DECIMAL)
130+ int PQsetClientEncoding(PGconn *conn, const char *encoding)
131+
132+ conn is a connection to the backend, and encoding is an encoding you
133+ want to use. If it successfully sets the encoding, it returns 0,
134+ otherwise -1. The current encoding for this connection can be shown by
135+ using:
136+
137+ int PQclientEncoding(const PGconn *conn)
138+
139+ Note that it returns the "encoding id," not the encoding symbol string
140+ such as "EUC_JP." To convert an encoding id to an encoding symbol, you
141+ can use:
142+
143+ char *pg_encoding_to_char(int encoding_id)
144+
145+ o using PGCLIENTENCODING
146+
147+ If an environment variable PGCLIENTENCODING is defined in the
148+ frontend, an automatic encoding translation is done by the backend.
139149
140- 3. SET CLIENT_ENCODING TO command
150+ o using SET CLIENT_ENCODING TO command
141151
142- Actually setting the frontend side encoding information is done by a
143- new command:
152+ Setting the frontend side encoding can be done a SQL command:
144153
145154SET CLIENT_ENCODING TO 'encoding';
146155
147- where encoding is one of the encodings those can be set to
148- PGCLIENTENCODING. Also you can use SQL92 syntax "SET NAMES" for this
149- purpose:
156+ Also you can use SQL92 syntax "SET NAMES" for this purpose:
150157
151158SET NAMES 'encoding';
152159
@@ -158,10 +165,21 @@ To return to the default encoding:
158165
159166RESET CLIENT_ENCODING;
160167
161- This would reset the frontend encoding to same as the backend
162- encoding, thus no encoding translation would be performed.
168+ 4. About Unicode
163169
164- 4. References
170+ An automatic encoding translation between Unicode and any other
171+ encodings is not supported (yet).
172+
173+ 5. What happens if the translation is not possible?
174+
175+ Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
176+ then some Japanese characters could not be translated into LATIN1. In
177+ this case, a letter cannot be represented in the LATIN1 character set,
178+ would be transformed as:
179+
180+ (HEXA DECIMAL)
181+
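
To make the EUC_JP to LATIN1 case concrete, here is a small Python sketch
of replacing an untranslatable character with a parenthesized hexadecimal
form. It is only an illustration: the function name to_latin1_with_hex is
hypothetical, and the choice to print the character's EUC_JP bytes in
lowercase hex is an assumption of this sketch, since the text above only
says the result looks like "(HEXA DECIMAL)".

```python
# Illustration only, not PostgreSQL code. The exact hex form is assumed.

def to_latin1_with_hex(euc_jp_bytes: bytes) -> bytes:
    """Convert EUC_JP bytes to LATIN1, writing each character that has no
    LATIN1 representation as its EUC_JP bytes in hex inside parentheses."""
    out = bytearray()
    for ch in euc_jp_bytes.decode("euc_jp"):
        try:
            out += ch.encode("latin-1")
        except UnicodeEncodeError:
            hex_form = ch.encode("euc_jp").hex().encode("ascii")
            out += b"(" + hex_form + b")"
    return bytes(out)

# "abc" followed by one Japanese character (written as an escape).
sample = "abc\u65e5".encode("euc_jp")
print(to_latin1_with_hex(sample))
```

Characters that exist in both character sets pass through unchanged, so
plain ASCII text is not affected.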
182+ 6. References
165183
166184These are good sources to start learning various kind of encoding
167185systems.
@@ -178,6 +196,16 @@ Unicode: http://www.unicode.org/
178196
1791975. History
180198
199+ Mar 22, 2000
200+ * Add new libpq functions PQsetClientEncoding, PQclientEncoding
201+ * ./configure --with-mb=EUC_JP
202+ now deprecated. use
203+ ./configure --enable-multibyte=EUC_JP
204+ instead
205+ * Add SQL_ASCII regression test case
206+ * Add SJIS User Defined Character (UDC) support
207+ * All of above will appear in 7.0
208+
181209July 11, 1999
182210* Add support for WIN1250 (Windows Czech) as a client encoding
183211 (contributed by Pavel Behal)