Commit df4cba6

Commit Patrice's patches except:

> - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded on
>   four bytes).

Also, update mb/expected/unicode.out. This is necessary since the
patches affect the result of queries using UTF-8.

---------------------------------------------------------------

Hi,

I should have sent the patch earlier, but got delayed by other stuff.
Anyway, here is the patch:

- most of the functionality is only activated when MULTIBYTE is defined,

- checks for valid UTF-8 characters, client-side only for now, and only on
  output; you can still send invalid UTF-8 to the server (so it's only
  partly compliant with Unicode 3.1, but that's better than nothing).

- formats with the correct number of columns (that's why I made it in the
  first place after all), but only for UNICODE. However, the code allows
  plugging in routines for other encodings, as Tatsuo did for the other
  multibyte functions.

- corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
  characters (characters with values >= 0x10000, which are encoded on
  four bytes).

- doesn't depend on the locale capabilities of the glibc (useful for
  remote telnet).

I would like somebody to check it closely, as it is my first patch to
pgsql. Also, I created dummy .orig files, so that the two files I
created are included; I hope that's the right way.

Now, a lot of functionality is NOT included here, but I will keep that
for 7.3 :) That includes all string checking on the server side (which
will have to be a bit more optimised ;) ), and the input checking on
the client side for UTF-8, though that should not be difficult. It's
just a matter of sending the strings through mbvalidate() before sending
them to the server. Strong checking on UTF-8 strings is mandatory to be
compliant with Unicode 3.1+.

Do I have time to look for a patch to include iso-8859-15 for 7.2?
The euro is coming 1 January 2002 (before 7.3!) and over 280 million
people in Europe will need the euro sign, and only iso-8859-15
and iso-8859-16 have it (and unfortunately, I don't think all Unices
will switch to Unicode in the meantime).

...err... yes, I know that not every single person in Europe
uses PostgreSQL, so it's not exactly 280m, but it's just a matter of
time! ;)

I'll come back (on pgsql-hackers) later to ask a few questions
regarding the full Unicode support (normalisation, collation,
regexes, ...) on the server side :)

Here is the patch!

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
 -- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe ?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
1 parent d07bacd, commit df4cba6
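
The message mentions checking UTF-8 validity and accepting four-byte sequences for code points >= 0x10000. As a rough illustration of what such a check involves (this is not the code from the patch), the sketch below decodes one sequence and then range-checks the result. The name `utf8_seq_len` and its explicit length parameter `n` are assumptions made for this example; the patch itself works on NUL-terminated strings and tests bit patterns directly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return the length in bytes
 * (1..4) of the UTF-8 sequence starting at s, or -1 if it is not acceptable.
 * n is the number of bytes available at s.
 */
static int
utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned long cp = 0;   /* decoded code point */
    unsigned long min = 0;  /* smallest code point allowed for this length */
    int len = 0;
    int i;

    assert(s != NULL);
    if (n == 0)
        return -1;

    if (s[0] < 0x80)
        return 1;                       /* plain ASCII */
    else if ((s[0] & 0xe0) == 0xc0) { len = 2; cp = s[0] & 0x1f; min = 0x80; }
    else if ((s[0] & 0xf0) == 0xe0) { len = 3; cp = s[0] & 0x0f; min = 0x800; }
    else if ((s[0] & 0xf8) == 0xf0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return -1;                      /* stray continuation byte or 0xf8..0xff */

    if (n < (size_t) len)
        return -1;                      /* truncated sequence */

    for (i = 1; i < len; i++)
    {
        if ((s[i] & 0xc0) != 0x80)
            return -1;                  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3f);
    }

    if (cp < min)
        return -1;                      /* overlong encoding */
    if (cp > 0x10ffff)
        return -1;                      /* beyond the Unicode range */
    if (cp >= 0xd800 && cp <= 0xdfff)
        return -1;                      /* surrogates */
    if (cp >= 0xfdd0 && cp <= 0xfdef)
        return -1;                      /* noncharacters */
    if ((cp & 0xfffe) == 0xfffe)
        return -1;                      /* U+xxFFFE and U+xxFFFF */
    return len;
}
```

With this sketch, the three bytes E2 82 AC (the euro sign, U+20AC) give 3, the four bytes F0 90 80 80 (U+10000) give 4, and the overlong pair C0 80 gives -1.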

4 files changed, +529 -70 lines changed

‎src/bin/psql/Makefile

Lines changed: 2 additions & 2 deletions

@@ -5,7 +5,7 @@
 # Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 #
-# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.30 2001/02/27 08:13:27 ishii Exp $
+# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.31 2001/10/15 01:25:10 ishii Exp $
 #
 #-------------------------------------------------------------------------

@@ -19,7 +19,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)

 OBJS=command.o common.o help.o input.o stringutils.o mainloop.o \
     copy.o startup.o prompt.o variables.o large_obj.o print.o describe.o \
-    tab-complete.o
+    tab-complete.o mbprint.o

 all: submake psql

‎src/bin/psql/mbprint.c

Lines changed: 334 additions & 0 deletions

@@ -0,0 +1,334 @@
+/*
+ * psql - the PostgreSQL interactive terminal
+ *
+ * Copyright 2000 by PostgreSQL Global Development Group
+ *
+ * $Header: /cvsroot/pgsql/src/bin/psql/mbprint.c,v 1.1 2001/10/15 01:25:10 ishii Exp $
+ */
+
+#include "postgres_fe.h"
+#include "mbprint.h"
+
+#ifdef MULTIBYTE
+
+#include "mb/pg_wchar.h"
+#include "settings.h"
+
+/*
+ * This is an implementation of wcwidth() and wcswidth() as defined in
+ * "The Single UNIX Specification, Version 2, The Open Group, 1997"
+ * <http://www.UNIX-systems.org/online.html>
+ *
+ * Markus Kuhn -- 2001-09-08 -- public domain
+ *
+ * customised for PostgreSQL
+ *
+ * original available at : http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+struct mbinterval {
+    unsigned short first;
+    unsigned short last;
+};
+
+/* auxiliary function for binary search in interval table */
+static int
+mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
+{
+    int min = 0;
+    int mid;
+
+    if (ucs < table[0].first || ucs > table[max].last)
+        return 0;
+    while (max >= min) {
+        mid = (min + max) / 2;
+        if (ucs > table[mid].last)
+            min = mid + 1;
+        else if (ucs < table[mid].first)
+            max = mid - 1;
+        else
+            return 1;
+    }
+
+    return 0;
+}
+
+
+/* The following functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ *    - The null character (U+0000) has a column width of 0.
+ *
+ *    - Other C0/C1 control characters and DEL will lead to a return
+ *      value of -1.
+ *
+ *    - Non-spacing and enclosing combining characters (general
+ *      category code Mn or Me in the Unicode database) have a
+ *      column width of 0.
+ *
+ *    - Other format characters (general category code Cf in the Unicode
+ *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ *      have a column width of 0.
+ *
+ *    - Spacing characters in the East Asian Wide (W) or East Asian
+ *      FullWidth (F) category as defined in Unicode Technical
+ *      Report #11 have a column width of 2.
+ *
+ *    - All remaining characters (including all printable
+ *      ISO 8859-1 and WGL4 characters, Unicode control characters,
+ *      etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+static int
+ucs_wcwidth(pg_wchar ucs)
+{
+    /* sorted list of non-overlapping intervals of non-spacing characters */
+    static const struct mbinterval combining[] = {
+        { 0x0300, 0x034E }, { 0x0360, 0x0362 }, { 0x0483, 0x0486 },
+        { 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+        { 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+        { 0x05C4, 0x05C4 }, { 0x064B, 0x0655 }, { 0x0670, 0x0670 },
+        { 0x06D6, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
+        { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
+        { 0x07A6, 0x07B0 }, { 0x0901, 0x0902 }, { 0x093C, 0x093C },
+        { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0954 },
+        { 0x0962, 0x0963 }, { 0x0981, 0x0981 }, { 0x09BC, 0x09BC },
+        { 0x09C1, 0x09C4 }, { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 },
+        { 0x0A02, 0x0A02 }, { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 },
+        { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D }, { 0x0A70, 0x0A71 },
+        { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC }, { 0x0AC1, 0x0AC5 },
+        { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD }, { 0x0B01, 0x0B01 },
+        { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 },
+        { 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 },
+        { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
+        { 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
+        { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD },
+        { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D }, { 0x0DCA, 0x0DCA },
+        { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 },
+        { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E }, { 0x0EB1, 0x0EB1 },
+        { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
+        { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 },
+        { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 },
+        { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 }, { 0x0F99, 0x0FBC },
+        { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 }, { 0x1032, 0x1032 },
+        { 0x1036, 0x1037 }, { 0x1039, 0x1039 }, { 0x1058, 0x1059 },
+        { 0x1160, 0x11FF }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+        { 0x17C9, 0x17D3 }, { 0x180B, 0x180E }, { 0x18A9, 0x18A9 },
+        { 0x200B, 0x200F }, { 0x202A, 0x202E }, { 0x206A, 0x206F },
+        { 0x20D0, 0x20E3 }, { 0x302A, 0x302F }, { 0x3099, 0x309A },
+        { 0xFB1E, 0xFB1E }, { 0xFE20, 0xFE23 }, { 0xFEFF, 0xFEFF },
+        { 0xFFF9, 0xFFFB }
+    };
+
+    /* test for 8-bit control characters */
+    if (ucs == 0) {
+        return 0;
+    }
+
+    if (ucs < 32 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff) {
+        return -1;
+    }
+
+    /* binary search in table of non-spacing characters */
+    if (mbbisearch(ucs, combining,
+                   sizeof(combining) / sizeof(struct mbinterval) - 1)) {
+        return 0;
+    }
+
+    /* if we arrive here, ucs is not a combining or C0/C1 control character */
+
+    return 1 +
+        (ucs >= 0x1100 &&
+         (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
+          (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
+           ucs != 0x303f) ||                  /* CJK ... Yi */
+          (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
+          (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
+          (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
+          (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
+          (ucs >= 0xffe0 && ucs <= 0xffe6) ||
+          (ucs >= 0x20000 && ucs <= 0x2ffff)));
+}
+
+pg_wchar
+utf2ucs(const unsigned char *c)
+{
+    /* one char version of pg_utf2wchar_with_len.
+     * no control here, c must point to a large enough string
+     */
+    if ((*c & 0x80) == 0) {
+        return (pg_wchar) c[0];
+    }
+    else if ((*c & 0xe0) == 0xc0) {
+        return (pg_wchar) (((c[0] & 0x1f) << 6) |
+                           (c[1] & 0x3f));
+    }
+    else if ((*c & 0xf0) == 0xe0) {
+        return (pg_wchar) (((c[0] & 0x0f) << 12) |
+                           ((c[1] & 0x3f) << 6) |
+                           (c[2] & 0x3f));
+    }
+    else if ((*c & 0xf8) == 0xf0) {
+        return (pg_wchar) (((c[0] & 0x07) << 18) |
+                           ((c[1] & 0x3f) << 12) |
+                           ((c[2] & 0x3f) << 6) |
+                           (c[3] & 0x3f));
+    }
+    else {
+        /* that is an invalid code on purpose */
+        return 0xffffffff;
+    }
+}
+
+/* mb_utf_wcswidth : calculate column length for the utf8 string pwcs
+ */
+static int
+mb_utf_wcswidth(unsigned char *pwcs, int len)
+{
+    int w, l = 0;
+    int width = 0;
+
+    for (; *pwcs && len > 0; pwcs += l) {
+        l = pg_utf_mblen(pwcs);
+        if ((len < l) || ((w = ucs_wcwidth(utf2ucs(pwcs))) < 0)) {
+            return width;
+        }
+        len -= l;
+        width += w;
+    }
+    return width;
+}
+
+static int
+utf_charcheck(const unsigned char *c)
+{
+    /* Unicode 3.1 compliant validation :
+     * for each category, it checks the combination of each byte to make sure
+     * it maps to a valid range. It also returns -1 for the following UCS values:
+     *   ucs > 0x10ffff
+     *   ucs & 0xfffe = 0xfffe
+     *   0xfdd0 <= ucs <= 0xfdef
+     *   ucs & 0xf800 = 0xd800 (surrogates)
+     */
+    if ((*c & 0x80) == 0) {
+        return 1;
+    }
+    else if ((*c & 0xe0) == 0xc0) {
+        /* two-byte char */
+        if (((c[1] & 0xc0) == 0x80) && ((c[0] & 0x1f) > 0x01)) {
+            return 2;
+        }
+        return -1;
+    }
+    else if ((*c & 0xf0) == 0xe0) {
+        /* three-byte char */
+        if (((c[1] & 0xc0) == 0x80) &&
+            (((c[0] & 0x0f) != 0x00) || ((c[1] & 0x20) == 0x20)) &&
+            ((c[2] & 0xc0) == 0x80)) {
+            int z = c[0] & 0x0f;
+            int yx = ((c[1] & 0x3f) << 6) | (c[2] & 0x3f);
+            int lx = yx & 0x7f;
+
+            /* check 0xfffe/0xffff, 0xfdd0..0xfdef range, surrogates */
+            if (((z == 0x0f) &&
+                 (((yx & 0xffe) == 0xffe) ||
+                  (((yx & 0xf80) == 0xd80) && (lx >= 0x50) && (lx <= 0x6f)))) ||
+                ((z == 0x0d) && ((yx & 0x800) == 0x800))) {
+                return -1;
+            }
+            return 3;
+        }
+        return -1;
+    }
+    else if ((*c & 0xf8) == 0xf0) {
+        int u = ((c[0] & 0x07) << 2) | ((c[1] & 0x30) >> 4);
+
+        /* four-byte char */
+        if (((c[1] & 0xc0) == 0x80) &&
+            (u > 0x00) && (u <= 0x10) &&
+            ((c[2] & 0xc0) == 0x80) && ((c[3] & 0xc0) == 0x80)) {
+            /* test for 0xzzzzfffe/0xzzzzffff */
+            if (((c[1] & 0x0f) == 0x0f) && ((c[2] & 0x3f) == 0x3f) &&
+                ((c[3] & 0x3e) == 0x3e)) {
+                return -1;
+            }
+            return 4;
+        }
+        return -1;
+    }
+    return -1;
+}
+
+static unsigned char *
+mb_utf_validate(unsigned char *pwcs)
+{
+    int l = 0;
+    unsigned char *p = pwcs;
+    unsigned char *p0 = pwcs;
+
+    while (*pwcs) {
+        if ((l = utf_charcheck(pwcs)) > 0) {
+            if (p != pwcs) {
+                int i;
+                for (i = 0; i < l; i++) {
+                    *p++ = *pwcs++;
+                }
+            }
+            else {
+                pwcs += l;
+                p += l;
+            }
+        }
+        else {
+            /* we skip the char */
+            pwcs++;
+        }
+    }
+    if (p != pwcs) {
+        *p = '\0';
+    }
+    return p0;
+}
+
+/*
+ * public functions : wcswidth and mbvalidate
+ */
+
+int
+pg_wcswidth(unsigned char *pwcs, int len) {
+    if (pset.encoding == PG_UTF8) {
+        return mb_utf_wcswidth(pwcs, len);
+    }
+    else {
+        /* obviously, other encodings may want to fix this, but I don't know them
+         * myself, unfortunately.
+         */
+        return len;
+    }
+}
+
+unsigned char *
+mbvalidate(unsigned char *pwcs) {
+    if (pset.encoding == PG_UTF8) {
+        return mb_utf_validate(pwcs);
+    }
+    else {
+        /* other encodings needing validation should add their own routines here
+         */
+        return pwcs;
+    }
+}
+#else /* !MULTIBYTE */
+
+/* in single-byte environment, all cells take 1 column */
+int pg_wcswidth(unsigned char *pwcs, int len) {
+    return len;
+}
+#endif
+
+
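
The width lookup in mbprint.c comes down to a binary search over sorted, non-overlapping code point intervals. The standalone sketch below shows the idea with deliberately abridged tables, so its results differ from the patch for many characters. The names `struct interval`, `in_table` and `cell_width`, and the use of a separate table for wide ranges, are assumptions for this example only.

```c
#include <assert.h>
#include <stddef.h>

struct interval
{
    unsigned long first;
    unsigned long last;
};

/* Return 1 if cp lies in one of the sorted, non-overlapping intervals. */
static int
in_table(unsigned long cp, const struct interval *table, size_t count)
{
    size_t lo = 0;
    size_t hi = count;          /* search the half-open range [lo, hi) */

    assert(table != NULL);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp > table[mid].last)
            lo = mid + 1;
        else if (cp < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Column width of one code point: -1, 0, 1 or 2 (abridged tables). */
static int
cell_width(unsigned long cp)
{
    /* a few zero-width ranges only, for illustration */
    static const struct interval zero_width[] = {
        {0x0300, 0x034E}, {0x1160, 0x11FF}, {0x200B, 0x200F}, {0xFEFF, 0xFEFF}
    };
    /* a few double-width ranges only, without the exceptions the patch makes */
    static const struct interval wide[] = {
        {0x1100, 0x115F}, {0x2E80, 0xA4CF}, {0xAC00, 0xD7A3}, {0xFF00, 0xFF5F}
    };

    if (cp == 0)
        return 0;
    if (cp < 0x20 || (cp >= 0x7f && cp < 0xa0) || cp > 0x10ffff)
        return -1;              /* control characters and out-of-range values */
    if (in_table(cp, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (in_table(cp, wide, sizeof wide / sizeof wide[0]))
        return 2;
    return 1;
}
```

Keeping the wide ranges in a second table rather than one long boolean expression is a choice made here for readability; the patch keeps Markus Kuhn's inline expression.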

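
mb_utf_validate() cleans a string in place: a read pointer walks the input and a write pointer receives only the bytes of accepted sequences. A minimal sketch of that two-pointer technique follows. To stay short it uses a simplified structural check (lead byte plus continuation bytes) rather than the Unicode 3.1 rules of utf_charcheck(); `seq_len` and `strip_invalid` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified check, assumed for this sketch: length of the sequence at s
 * judged by its lead byte and continuation bytes only, or -1 if malformed.
 * s must be NUL-terminated; the loop stops at the NUL and never reads past it.
 */
static int
seq_len(const unsigned char *s)
{
    int len;
    int i;

    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xe0) == 0xc0)
        len = 2;
    else if ((s[0] & 0xf0) == 0xe0)
        len = 3;
    else if ((s[0] & 0xf8) == 0xf0)
        len = 4;
    else
        return -1;

    for (i = 1; i < len; i++)
    {
        if ((s[i] & 0xc0) != 0x80)
            return -1;
    }
    return len;
}

/* Remove malformed bytes from str in place and return str. */
static unsigned char *
strip_invalid(unsigned char *str)
{
    unsigned char *src = str;   /* read position */
    unsigned char *dst = str;   /* write position, never ahead of src */

    assert(str != NULL);
    while (*src)
    {
        int len = seq_len(src);

        if (len > 0)
        {
            memmove(dst, src, (size_t) len);
            src += len;
            dst += len;
        }
        else
            src++;              /* drop one offending byte and resynchronise */
    }
    *dst = '\0';
    return str;
}
```

Because the write pointer never overtakes the read pointer, no second buffer is needed, and the result is never longer than the input.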