- postgresql 6.5.1 multi-byte (MB) support README        July 11 1999
+ PostgreSQL 7.0 multi-byte (MB) support README          Mar 22 2000

Tatsuo Ishii
- t-ishii@sra.co.jp
+ ishii@postgresql.org
http://www.sra.co.jp/people/t-ishii/PostgreSQL/

0. Introduction

The MB support is intended for allowing PostgreSQL to handle
multi-byte character sets such as EUC(Extended Unix Code), Unicode and
Mule internal code. With the MB enabled you can use multi-byte
- character sets in regexp ,LIKE and some functions. The default
+ character sets in regexp ,LIKE and some other functions. The default
encoding system chosen is determined while initializing your
PostgreSQL installation using initdb(1). Note that this can be
- overridden when you create a database using createdb(1) or create
- database SQL command. So you could have multiple databases with
- different encoding systems.
+ overridden when you create a database using createdb(1) or by using a
+ create database SQL command. So you could have multiple databases with
+ each different encoding system.

MB also fixes some problems concerning with 8-bit single byte
character sets including ISO8859. (I would not say all of problems
@@ -24,11 +24,11 @@ me know if you find any problem while using 8-bit characters)

1. How to use

- run configure with the mb option:
+ run configure with a multibyte option:

-     % configure --with-mb=encoding_system
+     % ./configure --enable-multibyte[=encoding_system]

- where encoding_system is one of:
+ where the encoding_system is one of:

    SQL_ASCII        ASCII
    EUC_JP           Japanese EUC
@@ -48,21 +48,21 @@ where encoding_system is one of:

Example:

-     % configure --with-mb=EUC_JP
+     % ./configure --enable-multibyte=EUC_JP

- If MB is disabled, nothing is changed except better supporting for
- 8-bit single byte character sets.
+ If the encoding system is omitted (./configure --enable-multibyte),
+ SQL_ASCII is assumed.

- 2. How to set encoding
+ 2. How to set the encoding

initdb command defines the default encoding for a PostgreSQL
installation. For example:

-     % initdb -e EUC_JP
+     % initdb -E EUC_JP

sets the default encoding to EUC_JP(Extended Unix Code for Japanese).
- Note that you can use "-pgencoding" instead of "-e" if you like longer
- option string:-) If no -e or -pgencoding option is given, the encoding
+ Note that you can use "--encoding" instead of "-E" if you like longer
+ option string:-) If no -E or --encoding option is given, the encoding
specified at the compile time is used.

You can create a database with a different encoding.
@@ -75,78 +75,85 @@ another way to accomplish this is to use a SQL command:
    CREATE DATABASE korean WITH ENCODING = 'EUC_KR';

The encoding for a database is represented as "encoding" column in the
- pg_database system catalog.
+ pg_database system catalog. You can see that by using -l or \l of psql
+ command.

- datname      |datdba|encoding|datpath
- -------------+------+--------+-------------
- template1    |  1739|       1|template1
- postgres     |  1739|       0|postgres
- euc_jp       |  1739|       1|euc_jp
- euc_kr       |  1739|       3|euc_kr
- euc_cn       |  1739|       2|euc_cn
- unicode      |  1739|       5|unicode
- mule_internal|  1739|       6|mule_internal
+ $ psql -l
+             List of databases
+    Database    |  Owner  |   Encoding
+ ---------------+---------+---------------
+  euc_cn        | t-ishii | EUC_CN
+  euc_jp        | t-ishii | EUC_JP
+  euc_kr        | t-ishii | EUC_KR
+  euc_tw        | t-ishii | EUC_TW
+  mule_internal | t-ishii | MULE_INTERNAL
+  regression    | t-ishii | SQL_ASCII
+  template1     | t-ishii | EUC_JP
+  test          | t-ishii | EUC_JP
+  unicode       | t-ishii | UNICODE
+ (9 rows)

- A number in the encoding column is "encoding id" and can be translated
- to the encoding name using pg_encoding command.
+ 3. Automatic encoding translation between backend and frontend

-     $ pg_encoding 1
-     EUC_JP
+ PostgreSQL supports an automatic encoding translation between backend
+ and frontend for some encodings.

- If an argument to pg_encoding is not a number, then it is regarded as
- an encoding name and pg_encoding will return the encoding id.
+ encoding of backend         available encoding of frontend
+ --------------------------------------------------------------------
+ EUC_JP                      EUC_JP, SJIS
+
+ EUC_TW                      EUC_TW, BIG5
+
+ LATIN2                      LATIN2, WIN1250
+
+ LATIN5                      LATIN5, WIN, ALT
+
+ MULE_INTERNAL               EUC_JP, SJIS, EUC_KR, EUC_CN,
+                             EUC_TW, BIG5, LATIN1 to LATIN5,
+                             WIN, ALT, WIN1250

-     $ pg_encoding EUC_JP
-     1
+ To enable the automatic encoding translation, you have to tell
+ PostgreSQL the encoding you would like to use in frontend. There are
+ several ways to accomplish this.

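The backend/frontend translation table in section 3 can be read as a simple lookup. The following C sketch is illustrative only and is not code from PostgreSQL: the names mb_translation, mb_table and mb_translation_listed are hypothetical, the match is case sensitive, and "LATIN1 to LATIN5" is assumed to mean LATIN1, LATIN2, LATIN3, LATIN4 and LATIN5.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the README's translation table (hypothetical type name). */
struct mb_translation {
    const char *backend;
    const char *frontends[16];  /* NULL-terminated list of frontend encodings */
};

/* Rows copied from the table in section 3; "LATIN1 to LATIN5" is expanded
 * under the assumption that it covers LATIN1 through LATIN5 inclusive. */
static const struct mb_translation mb_table[] = {
    {"EUC_JP", {"EUC_JP", "SJIS", NULL}},
    {"EUC_TW", {"EUC_TW", "BIG5", NULL}},
    {"LATIN2", {"LATIN2", "WIN1250", NULL}},
    {"LATIN5", {"LATIN5", "WIN", "ALT", NULL}},
    {"MULE_INTERNAL", {"EUC_JP", "SJIS", "EUC_KR", "EUC_CN", "EUC_TW", "BIG5",
                       "LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5",
                       "WIN", "ALT", "WIN1250", NULL}},
};

/* Returns 1 if the table lists `frontend` for `backend`, otherwise 0. */
int mb_translation_listed(const char *backend, const char *frontend)
{
    size_t i, j;

    assert(backend != NULL && frontend != NULL);

    for (i = 0; i < sizeof(mb_table) / sizeof(mb_table[0]); i++) {
        if (strcmp(mb_table[i].backend, backend) != 0)
            continue;
        for (j = 0; mb_table[i].frontends[j] != NULL; j++) {
            if (strcmp(mb_table[i].frontends[j], frontend) == 0)
                return 1;
        }
    }
    return 0;
}
```

For example, mb_translation_listed("EUC_JP", "SJIS") returns 1, while mb_translation_listed("EUC_JP", "LATIN1") returns 0 because that pair does not appear in the table.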
- 3. PGCLIENTENCODING
+ o using \encoding command in psql

- If an environment variable PGCLIENTENCODING is defined on the
- frontend, automatic encoding translation is done by the backend. For
- example, if the backend has been compiled with MB=EUC_JP and
- PGCLIENTENCODING=SJIS(Shift JIS: yet another Japanese encoding
- system), then any SJIS strings coming from the frontend would be
- translated to EUC_JP before going into the parser. Outputs from the
- backend would be translated to SJIS of course.
+ \encoding allows you to change frontend encoding on the fly. For
+ example, to change the encoding to SJIS, type:

- Supported encodings for PGCLIENTENCODING are:
+     \encoding SJIS

-     SQL_ASCII        ASCII
-     EUC_JP           Japanese EUC
-     SJIS             Yet another Japanese encoding
-     EUC_CN           Chinese EUC
-     EUC_KR           Korean EUC
-     EUC_TW           Taiwan EUC
-     BIG5             Traditional Chinese
-     MULE_INTERNAL    Mule internal
-     LATIN1           ISO 8859-1 English and some European languages
-     LATIN2           ISO 8859-2 English and some European languages
-     LATIN3           ISO 8859-3 English and some European languages
-     LATIN4           ISO 8859-4 English and some European languages
-     LATIN5           ISO 8859-5 English and some European languages
-     KOI8             KOI8-R
-     WIN              Windows CP1251
-     ALT              Windows CP866
-     WIN1250          Windows CP1250 (Czech)
+ o using libpq functions

- Note that UNICODE is not supported(yet). Also note that the
- translation is not always possible. Suppose you choose EUC_JP for the
- backend, LATIN1 for the frontend, then some Japanese characters cannot
- be translated into latin. In this case, a letter cannot be represented
- in the Latin character set, would be transformed as:
+ \encoding actually calls PQsetClientEncoding() for its purpose.

-     (HEXA DECIMAL)
+     int PQsetClientEncoding(PGconn *conn, const char *encoding)
+
+ conn is a connection to the backend, and encoding is an encoding you
+ want to use. If it successfully sets the encoding, it returns 0,
+ otherwise -1. The current encoding for this connection can be shown by
+ using:
+
+     int PQclientEncoding(const PGconn *conn)
+
+ Note that it returns the "encoding id," not the encoding symbol string
+ such as "EUC_JP." To convert an encoding id to an encoding symbol, you
+ can use:
+
+     char *pg_encoding_to_char(int encoding_id)
+
+ o using PGCLIENTENCODING
+
+ If an environment variable PGCLIENTENCODING is defined in the
+ frontend, an automatic encoding translation is done by the backend.

- 3. SET CLIENT_ENCODING TO command
+ o using SET CLIENT_ENCODING TO command

- Actually setting the frontend side encoding information is done by a
- new command:
+ Setting the frontend side encoding can be done a SQL command:

    SET CLIENT_ENCODING TO 'encoding';

- where encoding is one of the encodings those can be set to
- PGCLIENTENCODING. Also you can use SQL92 syntax "SET NAMES" for this
- purpose:
+ Also you can use SQL92 syntax "SET NAMES" for this purpose:

    SET NAMES 'encoding';

@@ -158,10 +165,21 @@ To return to the default encoding:

    RESET CLIENT_ENCODING;

- This would reset the frontend encoding to same as the backend
- encoding, thus no encoding translation would be performed.
+ 4. About Unicode

- 4. References
+ An automatic encoding translation between Unicode and any other
+ encodings is not supported (yet).
+
+ 5. What happens if the translation is not possible?
+
+ Suppose you choose EUC_JP for the backend, LATIN1 for the frontend,
+ then some Japanese characters could not be translated into LATIN1. In
+ this case, a letter cannot be represented in the LATIN1 character set,
+ would be transformed as:
+
+     (HEXA DECIMAL)
+
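The "(HEXA DECIMAL)" placeholder described in the new section 5 can be illustrated with a small C sketch. The README does not spell out the exact format, so this assumes the bytes of the untranslatable character are written as two hex digits each, in lowercase, between parentheses. The function name mb_format_bogus_char is hypothetical and this is not the backend's own code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "(hex digits)" for one untranslatable character into dst.
 * src/len are the raw bytes of that character. dst must hold at least
 * 2 * len + 3 bytes: two digits per byte, two parentheses and a NUL.
 * Returns the number of characters written, not counting the NUL.
 * Lowercase digits are an assumption made for this sketch.
 */
size_t mb_format_bogus_char(const unsigned char *src, size_t len,
                            char *dst, size_t dstsize)
{
    size_t i, n = 0;

    assert(src != NULL && dst != NULL);
    assert(dstsize >= 2 * len + 3);

    dst[n++] = '(';
    for (i = 0; i < len; i++) {
        /* snprintf writes exactly two hex digits plus a terminating NUL */
        snprintf(dst + n, 3, "%02x", (unsigned int) src[i]);
        n += 2;
    }
    dst[n++] = ')';
    dst[n] = '\0';
    return n;
}
```

Under these assumptions, a two-byte EUC_JP character with the bytes 0xa4 0xa2 that has no LATIN1 equivalent would be shown to the frontend as "(a4a2)".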
+ 6. References

These are good sources to start learning various kind of encoding
systems.
@@ -178,6 +196,16 @@ Unicode: http://www.unicode.org/

5. History

+ Mar 22, 2000
+     * Add new libpq functions PQsetClientEncoding, PQclientEncoding
+     * ./configure --with-mb=EUC_JP
+       now deprecated. use
+       ./configure --enable-multibyte=EUC_JP
+       instead
+     * Add SQL_ASCII regression test case
+     * Add SJIS User Defined Character (UDC) support
+     * All of above will appear in 7.0
+
July 11, 1999
    * Add support for WIN1250 (Windows Czech) as a client encoding
      (contributed by Pavel Behal)