Commit df4cba6

Commit Patrice's patches except:

> - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
> characters (characters with values >= 0x10000, which are encoded on
> four bytes).

Also, update mb/expected/unicode.out. This is necessary since the
patches affect the result of queries using UTF-8.

---------------------------------------------------------------

Hi,

I should have sent the patch earlier, but got delayed by other stuff.
Anyway, here is the patch:

- most of the functionality is only activated when MULTIBYTE is defined,

- check valid UTF-8 characters, client-side only yet, and only on output, you still can send invalid UTF-8 to the server (so, it's only partly compliant to Unicode 3.1, but that's better than nothing).

- formats with the correct number of columns (that's why I made it in the first place after all), but only for UNICODE. However, the code allows to plug-in routines for other encodings, as Tatsuo did for the other multibyte functions.

- corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1 characters (characters with values >= 0x10000, which are encoded on four bytes).

- doesn't depend on the locale capabilities of the glibc (useful for remote telnet).

I would like somebody to check it closely, as it is my first patch to
pgsql. Also, I created dummy .orig files, so that the two files I
created are included, I hope that's the right way.

Now, a lot of functionality is NOT included here, but I will keep that
for 7.3 :) That includes all string checking on the server side (which
will have to be a bit more optimised ;) ), and the input checking on
the client side for UTF-8, though that should not be difficult. It's
just to send the strings through mbvalidate() before sending them to
the server. Strong checking on UTF-8 strings is mandatory to be
compliant with Unicode 3.1+ .

Do I have time to look for a patch to include iso-8859-15 for 7.2 ?
The euro is coming 1. january 2002 (before 7.3 !) and over 280
million people in Europe will need the euro sign and only iso-8859-15
and iso-8859-16 have it (and unfortunately, I don't think all Unices
will switch to Unicode in the meantime)...

err... yes, I know that this is not every single person in Europe that
uses PostgreSql, so it's not exactly 280m, but it's just a matter of
time ! ;)

I'll come back (on pgsql-hackers) later to ask a few questions
regarding the full unicode support (normalisation, collation,
regexes,...) on the server side :)

Here is the patch !

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
 -- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
 -- What would _you_ call the creation of the universe ?
 -- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
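
The message's remark that code points at or above 0x10000 take four bytes in UTF-8 can be checked with a small standalone sketch. `utf8_encode` is a hypothetical helper written for this note; it is not part of the patch and does not use any PostgreSQL API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: encode one code point as
 * UTF-8 into out, which must have room for 4 bytes. Returns the number
 * of bytes written, or 0 for surrogates and values above 0x10ffff.
 */
static int
utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp <= 0x7f) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp <= 0x7ff) {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp <= 0xffff) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;           /* surrogates have no UTF-8 form */
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        /* this is the range the patch adds support for */
        out[0] = (unsigned char) (0xf0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;                   /* beyond the Unicode code space */
}
```

Calling `utf8_encode(0xffff, buf)` writes three bytes, while `utf8_encode(0x10000, buf)` writes four, which is the boundary the message refers to.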

1 parent d07bacd, commit df4cba6

File tree

4 files changed: +529 -70 lines changed

src
- bin/psql
- test/mb/expected
  - unicode.out

`src/bin/psql/Makefile`

Lines changed: 2 additions & 2 deletions

Original file line number	Diff line number	Diff line change
`@@ -5,7 +5,7 @@`
`5`	`5`	`# Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group`
`6`	`6`	`# Portions Copyright (c) 1994, Regents of the University of California`
`7`	`7`	`#`
`8`		`-# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.30 2001/02/27 08:13:27 ishii Exp $`
	`8`	`+# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.31 2001/10/15 01:25:10 ishii Exp $`
`9`	`9`	`#`
`10`	`10`	`#-------------------------------------------------------------------------`
`11`	`11`
`@@ -19,7 +19,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)`
`19`	`19`
`20`	`20`	`OBJS=command.o common.o help.o input.o stringutils.o mainloop.o\`
`21`	`21`	`copy.o startup.o prompt.o variables.o large_obj.o print.o describe.o\`
`22`		`-tab-complete.o`
	`22`	`+tab-complete.o mbprint.o`
`23`	`23`
`24`	`24`	`all: submake psql`
`25`	`25`

`src/bin/psql/mbprint.c`

Lines changed: 334 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,334 @@`
	`1`	`+/*`
	`2`	`+ * psql - the PostgreSQL interactive terminal`
	`3`	`+ *`
	`4`	`+ * Copyright 2000 by PostgreSQL Global Development Group`
	`5`	`+ *`
	`6`	`+ * $Header: /cvsroot/pgsql/src/bin/psql/mbprint.c,v 1.1 2001/10/15 01:25:10 ishii Exp $`
	`7`	`+ */`
	`8`	`+`
	`9`	`+#include "postgres_fe.h"`
	`10`	`+#include "mbprint.h"`
	`11`	`+`
	`12`	`+#ifdef MULTIBYTE`
	`13`	`+`
	`14`	`+#include "mb/pg_wchar.h"`
	`15`	`+#include "settings.h"`
	`16`	`+`
	`17`	`+/*`
	`18`	`+ * This is an implementation of wcwidth() and wcswidth() as defined in`
	`19`	`+ * "The Single UNIX Specification, Version 2, The Open Group, 1997"`
	`20`	`+ * <http://www.UNIX-systems.org/online.html>`
	`21`	`+ *`
	`22`	`+ * Markus Kuhn -- 2001-09-08 -- public domain`
	`23`	`+ *`
	`24`	`+ * customised for PostgreSQL`
	`25`	`+ *`
	`26`	`+ * original available at : http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c`
	`27`	`+ */`
	`28`	`+`
	`29`	`+struct mbinterval {`
	`30`	`+  unsigned short first;`
	`31`	`+  unsigned short last;`
	`32`	`+};`
	`33`	`+`
	`34`	`+/* auxiliary function for binary search in interval table */`
	`35`	`+static int`
	`36`	`+mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)`
	`37`	`+{`
	`38`	`+  int min = 0;`
	`39`	`+  int mid;`
	`40`	`+`
	`41`	`+  if (ucs < table[0].first \|\| ucs > table[max].last)`
	`42`	`+    return 0;`
	`43`	`+  while (max >= min) {`
	`44`	`+    mid = (min + max) / 2;`
	`45`	`+    if (ucs > table[mid].last)`
	`46`	`+      min = mid + 1;`
	`47`	`+    else if (ucs < table[mid].first)`
	`48`	`+      max = mid - 1;`
	`49`	`+    else`
	`50`	`+      return 1;`
	`51`	`+  }`
	`52`	`+`
	`53`	`+  return 0;`
	`54`	`+}`
	`55`	`+`
	`56`	`+`
	`57`	`+/* The following functions define the column width of an ISO 10646`
	`58`	`+ * character as follows:`
	`59`	`+ *`
	`60`	`+ * - The null character (U+0000) has a column width of 0.`
	`61`	`+ *`
	`62`	`+ * - Other C0/C1 control characters and DEL will lead to a return`
	`63`	`+ * value of -1.`
	`64`	`+ *`
	`65`	`+ * - Non-spacing and enclosing combining characters (general`
	`66`	`+ * category code Mn or Me in the Unicode database) have a`
	`67`	`+ * column width of 0.`
	`68`	`+ *`
	`69`	`+ * - Other format characters (general category code Cf in the Unicode`
	`70`	`+ * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.`
	`71`	`+ *`
	`72`	`+ * - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)`
	`73`	`+ * have a column width of 0.`
	`74`	`+ *`
	`75`	`+ * - Spacing characters in the East Asian Wide (W) or East Asian`
	`76`	`+ * FullWidth (F) category as defined in Unicode Technical`
	`77`	`+ * Report #11 have a column width of 2.`
	`78`	`+ *`
	`79`	`+ * - All remaining characters (including all printable`
	`80`	`+ * ISO 8859-1 and WGL4 characters, Unicode control characters,`
	`81`	`+ * etc.) have a column width of 1.`
	`82`	`+ *`
	`83`	`+ * This implementation assumes that wchar_t characters are encoded`
	`84`	`+ * in ISO 10646.`
	`85`	`+ */`
	`86`	`+`
	`87`	`+static int`
	`88`	`+ucs_wcwidth(pg_wchar ucs)`
	`89`	`+{`
	`90`	`+  /* sorted list of non-overlapping intervals of non-spacing characters */`
	`91`	`+  static const struct mbinterval combining[] = {`
	`92`	`+{0x0300,0x034E }, {0x0360,0x0362 }, {0x0483,0x0486 },`
	`93`	`+{0x0488,0x0489 }, {0x0591,0x05A1 }, {0x05A3,0x05B9 },`
	`94`	`+{0x05BB,0x05BD }, {0x05BF,0x05BF }, {0x05C1,0x05C2 },`
	`95`	`+{0x05C4,0x05C4 }, {0x064B,0x0655 }, {0x0670,0x0670 },`
	`96`	`+{0x06D6,0x06E4 }, {0x06E7,0x06E8 }, {0x06EA,0x06ED },`
	`97`	`+{0x070F,0x070F }, {0x0711,0x0711 }, {0x0730,0x074A },`
	`98`	`+{0x07A6,0x07B0 }, {0x0901,0x0902 }, {0x093C,0x093C },`
	`99`	`+{0x0941,0x0948 }, {0x094D,0x094D }, {0x0951,0x0954 },`
	`100`	`+{0x0962,0x0963 }, {0x0981,0x0981 }, {0x09BC,0x09BC },`
	`101`	`+{0x09C1,0x09C4 }, {0x09CD,0x09CD }, {0x09E2,0x09E3 },`
	`102`	`+{0x0A02,0x0A02 }, {0x0A3C,0x0A3C }, {0x0A41,0x0A42 },`
	`103`	`+{0x0A47,0x0A48 }, {0x0A4B,0x0A4D }, {0x0A70,0x0A71 },`
	`104`	`+{0x0A81,0x0A82 }, {0x0ABC,0x0ABC }, {0x0AC1,0x0AC5 },`
	`105`	`+{0x0AC7,0x0AC8 }, {0x0ACD,0x0ACD }, {0x0B01,0x0B01 },`
	`106`	`+{0x0B3C,0x0B3C }, {0x0B3F,0x0B3F }, {0x0B41,0x0B43 },`
	`107`	`+{0x0B4D,0x0B4D }, {0x0B56,0x0B56 }, {0x0B82,0x0B82 },`
	`108`	`+{0x0BC0,0x0BC0 }, {0x0BCD,0x0BCD }, {0x0C3E,0x0C40 },`
	`109`	`+{0x0C46,0x0C48 }, {0x0C4A,0x0C4D }, {0x0C55,0x0C56 },`
	`110`	`+{0x0CBF,0x0CBF }, {0x0CC6,0x0CC6 }, {0x0CCC,0x0CCD },`
	`111`	`+{0x0D41,0x0D43 }, {0x0D4D,0x0D4D }, {0x0DCA,0x0DCA },`
	`112`	`+{0x0DD2,0x0DD4 }, {0x0DD6,0x0DD6 }, {0x0E31,0x0E31 },`
	`113`	`+{0x0E34,0x0E3A }, {0x0E47,0x0E4E }, {0x0EB1,0x0EB1 },`
	`114`	`+{0x0EB4,0x0EB9 }, {0x0EBB,0x0EBC }, {0x0EC8,0x0ECD },`
	`115`	`+{0x0F18,0x0F19 }, {0x0F35,0x0F35 }, {0x0F37,0x0F37 },`
	`116`	`+{0x0F39,0x0F39 }, {0x0F71,0x0F7E }, {0x0F80,0x0F84 },`
	`117`	`+{0x0F86,0x0F87 }, {0x0F90,0x0F97 }, {0x0F99,0x0FBC },`
	`118`	`+{0x0FC6,0x0FC6 }, {0x102D,0x1030 }, {0x1032,0x1032 },`
	`119`	`+{0x1036,0x1037 }, {0x1039,0x1039 }, {0x1058,0x1059 },`
	`120`	`+{0x1160,0x11FF }, {0x17B7,0x17BD }, {0x17C6,0x17C6 },`
	`121`	`+{0x17C9,0x17D3 }, {0x180B,0x180E }, {0x18A9,0x18A9 },`
	`122`	`+{0x200B,0x200F }, {0x202A,0x202E }, {0x206A,0x206F },`
	`123`	`+{0x20D0,0x20E3 }, {0x302A,0x302F }, {0x3099,0x309A },`
	`124`	`+{0xFB1E,0xFB1E }, {0xFE20,0xFE23 }, {0xFEFF,0xFEFF },`
	`125`	`+{0xFFF9,0xFFFB }`
	`126`	`+  };`
	`127`	`+`
	`128`	`+  /* test for 8-bit control characters */`
	`129`	`+  if (ucs == 0) {`
	`130`	`+    return 0;`
	`131`	`+  }`
	`132`	`+`
	`133`	`+  if (ucs < 32 \|\| (ucs >= 0x7f && ucs < 0xa0) \|\| ucs > 0x0010ffff) {`
	`134`	`+    return -1;`
	`135`	`+  }`
	`136`	`+`
	`137`	`+  /* binary search in table of non-spacing characters */`
	`138`	`+  if (mbbisearch(ucs, combining,`
	`139`	`+                 sizeof(combining) / sizeof(struct mbinterval) - 1)) {`
	`140`	`+    return 0;`
	`141`	`+  }`
	`142`	`+`
	`143`	`+  /* if we arrive here, ucs is not a combining or C0/C1 control character */`
	`144`	`+`
	`145`	`+  return 1 +`
	`146`	`+    (ucs >= 0x1100 &&`
	`147`	`+     (ucs <= 0x115f \|\| /* Hangul Jamo init. consonants */`
	`148`	`+      (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&`
	`149`	`+       ucs != 0x303f) \|\| /* CJK ... Yi */`
	`150`	`+      (ucs >= 0xac00 && ucs <= 0xd7a3) \|\| /* Hangul Syllables */`
	`151`	`+      (ucs >= 0xf900 && ucs <= 0xfaff) \|\| /* CJK Compatibility Ideographs */`
	`152`	`+      (ucs >= 0xfe30 && ucs <= 0xfe6f) \|\| /* CJK Compatibility Forms */`
	`153`	`+      (ucs >= 0xff00 && ucs <= 0xff5f) \|\| /* Fullwidth Forms */`
	`154`	`+      (ucs >= 0xffe0 && ucs <= 0xffe6) \|\|`
	`155`	`+      (ucs >= 0x20000 && ucs <= 0x2ffff)));`
	`156`	`+}`
	`157`	`+`
	`158`	`+pg_wchar`
	`159`	`+utf2ucs(const unsigned char *c)`
	`160`	`+{`
	`161`	`+  /* one char version of pg_utf2wchar_with_len.`
	`162`	`+   * no control here, c must point to a large enough string`
	`163`	`+   */`
	`164`	`+  if ((*c & 0x80) == 0) {`
	`165`	`+    return (pg_wchar)c[0];`
	`166`	`+  }`
	`167`	`+  else if ((*c & 0xe0) == 0xc0) {`
	`168`	`+    return (pg_wchar)(((c[0] & 0x1f) << 6) \|`
	`169`	`+                      (c[1] & 0x3f));`
	`170`	`+  }`
	`171`	`+  else if ((*c & 0xf0) == 0xe0) {`
	`172`	`+    return (pg_wchar)(((c[0] & 0x0f) << 12) \|`
	`173`	`+                      ((c[1] & 0x3f) << 6) \|`
	`174`	`+                      (c[2] & 0x3f));`
	`175`	`+  }`
	`176`	`+  else if ((*c & 0xf0) == 0xf0) {`
	`177`	`+    return (pg_wchar)(((c[0] & 0x07) << 18) \|`
	`178`	`+                      ((c[1] & 0x3f) << 12) \|`
	`179`	`+                      ((c[2] & 0x3f) << 6) \|`
	`180`	`+                      (c[3] & 0x3f));`
	`181`	`+  }`
	`182`	`+  else {`
	`183`	`+    /* that is an invalid code on purpose */`
	`184`	`+    return 0xffffffff;`
	`185`	`+  }`
	`186`	`+}`
	`187`	`+`
	`188`	`+/* mb_utf_wcswidth : calculate column length for the utf8 string pwcs`
	`189`	`+ */`
	`190`	`+static int`
	`191`	`+mb_utf_wcswidth(unsigned char *pwcs, int len)`
	`192`	`+{`
	`193`	`+  int w, l = 0;`
	`194`	`+  int width = 0;`
	`195`	`+`
	`196`	`+  for (; *pwcs && len > 0; pwcs += l) {`
	`197`	`+    l = pg_utf_mblen(pwcs);`
	`198`	`+    if ((len < l) \|\| ((w = ucs_wcwidth(utf2ucs(pwcs))) < 0)) {`
	`199`	`+      return width;`
	`200`	`+    }`
	`201`	`+    len -= l;`
	`202`	`+    width += w;`
	`203`	`+  }`
	`204`	`+  return width;`
	`205`	`+}`
	`206`	`+`
	`207`	`+static int`
	`208`	`+utf_charcheck(const unsigned char *c)`
	`209`	`+{`
	`210`	`+  /* Unicode 3.1 compliant validation :`
	`211`	`+   * for each category, it checks the combination of each byte to make sure`
	`212`	`+   * it maps to a valid range. It also returns -1 for the following UCS values:`
	`213`	`+   * ucs > 0x10ffff`
	`214`	`+   * ucs & 0xfffe = 0xfffe`
	`215`	`+   * 0xfdd0 < ucs < 0xfdef`
	`216`	`+   * ucs & 0xdb00 = 0xd800 (surrogates)`
	`217`	`+   */`
	`218`	`+  if ((*c & 0x80) == 0) {`
	`219`	`+    return 1;`
	`220`	`+  }`
	`221`	`+  else if ((*c & 0xe0) == 0xc0) {`
	`222`	`+    /* two-byte char */`
	`223`	`+    if (((c[1] & 0xc0) == 0x80) && ((c[0] & 0x1f) > 0x01)) {`
	`224`	`+      return 2;`
	`225`	`+    }`
	`226`	`+    return -1;`
	`227`	`+  }`
	`228`	`+  else if ((*c & 0xf0) == 0xe0) {`
	`229`	`+    /* three-byte char */`
	`230`	`+    if (((c[1] & 0xc0) == 0x80) &&`
	`231`	`+        (((c[0] & 0x0f) != 0x00) \|\| ((c[1] & 0x20) == 0x20)) &&`
	`232`	`+        ((c[2] & 0xc0) == 0x80)) {`
	`233`	`+      int z = c[0] & 0x0f;`
	`234`	`+      int yx = ((c[1] & 0x3f) << 6) \| (c[0] & 0x3f);`
	`235`	`+      int lx = yx & 0x7f;`
	`236`	`+`
	`237`	`+      /* check 0xfffe/0xffff, 0xfdd0..0xfedf range, surrogates */`
	`238`	`+      if (((z == 0x0f) &&`
	`239`	`+           (((yx & 0xffe) == 0xffe) \|\|`
	`240`	`+            (((yx & 0xf80) == 0xd80) && (lx >= 0x30) && (lx <= 0x4f)))) \|\|`
	`241`	`+          ((z == 0x0d) && ((yx & 0xb00) == 0x800))) {`
	`242`	`+        return -1;`
	`243`	`+      }`
	`244`	`+      return 3;`
	`245`	`+    }`
	`246`	`+    return -1;`
	`247`	`+  }`
	`248`	`+  else if ((*c & 0xf8) == 0xf0) {`
	`249`	`+    int u = ((c[0] & 0x07) << 2) \| ((c[1] & 0x30) >> 4);`
	`250`	`+`
	`251`	`+    /* four-byte char */`
	`252`	`+    if (((c[1] & 0xc0) == 0x80) &&`
	`253`	`+        (u > 0x00) && (u <= 0x10) &&`
	`254`	`+        ((c[2] & 0xc0) == 0x80) && ((c[3] & 0xc0) == 0x80)) {`
	`255`	`+      /* test for 0xzzzzfffe/0xzzzzfffff */`
	`256`	`+      if (((c[1] & 0x0f) == 0x0f) && ((c[2] & 0x3f) == 0x3f) &&`
	`257`	`+          ((c[3] & 0x3e) == 0x3e)) {`
	`258`	`+        return -1;`
	`259`	`+      }`
	`260`	`+      return 4;`
	`261`	`+    }`
	`262`	`+    return -1;`
	`263`	`+  }`
	`264`	`+  return -1;`
	`265`	`+}`
	`266`	`+`
	`267`	`+static unsigned char *`
	`268`	`+mb_utf_validate(unsigned char *pwcs)`
	`269`	`+{`
	`270`	`+  int l = 0;`
	`271`	`+  unsigned char *p = pwcs;`
	`272`	`+  unsigned char *p0 = pwcs;`
	`273`	`+`
	`274`	`+  while (*pwcs) {`
	`275`	`+    if ((l = utf_charcheck(pwcs)) > 0) {`
	`276`	`+      if (p != pwcs) {`
	`277`	`+        int i;`
	`278`	`+        for (i = 0; i < l; i++) {`
	`279`	`+          *p++ = *pwcs++;`
	`280`	`+        }`
	`281`	`+      }`
	`282`	`+      else {`
	`283`	`+        pwcs += l;`
	`284`	`+        p += l;`
	`285`	`+      }`
	`286`	`+    }`
	`287`	`+    else {`
	`288`	`+      /* we skip the char */`
	`289`	`+      pwcs++;`
	`290`	`+    }`
	`291`	`+  }`
	`292`	`+  if (p != pwcs) {`
	`293`	`+    *p = '\0';`
	`294`	`+  }`
	`295`	`+  return p0;`
	`296`	`+}`
	`297`	`+`
	`298`	`+/*`
	`299`	`+ * public functions : wcswidth and mbvalidate`
	`300`	`+ */`
	`301`	`+`
	`302`	`+int`
	`303`	`+pg_wcswidth(unsigned char *pwcs, int len) {`
	`304`	`+  if (pset.encoding == PG_UTF8) {`
	`305`	`+    return mb_utf_wcswidth(pwcs, len);`
	`306`	`+  }`
	`307`	`+  else {`
	`308`	`+    /* obviously, other encodings may want to fix this, but I don't know them`
	`309`	`+     * myself, unfortunately.`
	`310`	`+     */`
	`311`	`+    return len;`
	`312`	`+  }`
	`313`	`+}`
	`314`	`+`
	`315`	`+unsigned char *`
	`316`	`+mbvalidate(unsigned char *pwcs) {`
	`317`	`+  if (pset.encoding == PG_UTF8) {`
	`318`	`+    return mb_utf_validate(pwcs);`
	`319`	`+  }`
	`320`	`+  else {`
	`321`	`+    /* other encodings needing validation should add their own routines here`
	`322`	`+     */`
	`323`	`+    return pwcs;`
	`324`	`+  }`
	`325`	`+}`
	`326`	`+#else /* !MULTIBYTE */`
	`327`	`+`
	`328`	`+/* in single-byte environment, all cells take 1 column */`
	`329`	`+int pg_wcswidth(unsigned char *pwcs, int len) {`
	`330`	`+  return len;`
	`331`	`+}`
	`332`	`+#endif`
	`333`	`+`
	`334`	`+`
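
The column-width logic in the diff above (binary search over an interval table, then a wide-range test, summed over a UTF-8 string) can be tried outside psql with a rough standalone sketch. Everything here is an assumption made for illustration: the names `in_intervals`, `sketch_wcwidth`, `sketch_decode` and `sketch_strwidth` are hypothetical, both tables are tiny excerpts of the ranges in the diff, and the decoder assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct interval { uint32_t first, last; };

/* Binary search: 1 if cp lies in one of the sorted, non-overlapping ranges. */
static int
in_intervals(uint32_t cp, const struct interval *tab, size_t n)
{
    size_t lo = 0, hi = n;      /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cp > tab[mid].last)
            lo = mid + 1;
        else if (cp < tab[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Width of one code point: -1 control, 0 combining, 2 wide, otherwise 1. */
static int
sketch_wcwidth(uint32_t cp)
{
    /* short excerpt of the combining table, for illustration only */
    static const struct interval combining[] = {
        {0x0300, 0x034E}, {0x200B, 0x200F}, {0x20D0, 0x20E3}
    };
    /* short excerpt of the wide ranges, for illustration only */
    static const struct interval wide[] = {
        {0x1100, 0x115F}, {0xAC00, 0xD7A3}, {0xFF00, 0xFF5F}
    };

    if (cp == 0)
        return 0;
    if (cp < 32 || (cp >= 0x7f && cp < 0xa0) || cp > 0x10ffff)
        return -1;
    if (in_intervals(cp, combining, sizeof combining / sizeof combining[0]))
        return 0;
    if (in_intervals(cp, wide, sizeof wide / sizeof wide[0]))
        return 2;
    return 1;
}

/* Decode one sequence; assumes s is valid UTF-8. Stores its length in *len. */
static uint32_t
sketch_decode(const unsigned char *s, int *len)
{
    assert(s != NULL && len != NULL);

    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xe0) == 0xc0) {
        *len = 2;
        return ((uint32_t) (s[0] & 0x1f) << 6) | (s[1] & 0x3f);
    }
    if ((s[0] & 0xf0) == 0xe0) {
        *len = 3;
        return ((uint32_t) (s[0] & 0x0f) << 12) |
               ((uint32_t) (s[1] & 0x3f) << 6) | (s[2] & 0x3f);
    }
    *len = 4;
    return ((uint32_t) (s[0] & 0x07) << 18) |
           ((uint32_t) (s[1] & 0x3f) << 12) |
           ((uint32_t) (s[2] & 0x3f) << 6) | (s[3] & 0x3f);
}

/* Sum the widths; stop at the first character with a negative width. */
static int
sketch_strwidth(const char *str)
{
    const unsigned char *s = (const unsigned char *) str;
    int total = 0;

    while (*s) {
        int len;
        int w = sketch_wcwidth(sketch_decode(s, &len));

        if (w < 0)
            break;
        total += w;
        s += len;
    }
    return total;
}
```

With these excerpts, `sketch_strwidth("e\xcc\x81")` (an `e` followed by U+0301) counts one column, and a Hangul syllable such as `"\xed\x95\x9c"` counts two. A real table would need the full ranges from the diff.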

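
The validate-and-compact idea behind `mb_utf_validate` (keep well-formed sequences, drop offending bytes, terminate the shortened string) might be sketched as below. This is a simplified illustration under stated assumptions, not the patch's code: `sketch_charcheck` and `sketch_validate` are hypothetical names, the check uses the commonly documented well-formed byte ranges (overlong forms, surrogates and values above 0x10ffff are rejected), and it does not reject the noncharacters such as 0xfffe that the diff's comment lists.

```c
#include <assert.h>
#include <string.h>

/* Length (1..4) of the well-formed sequence at c, or -1. Simplified check. */
static int
sketch_charcheck(const unsigned char *c)
{
    if (c[0] < 0x80)
        return 1;
    if ((c[0] & 0xe0) == 0xc0)
        /* 0xc0 and 0xc1 would be overlong two-byte forms */
        return (c[0] >= 0xc2 && (c[1] & 0xc0) == 0x80) ? 2 : -1;
    if ((c[0] & 0xf0) == 0xe0) {
        /* short-circuit keeps us from reading past a terminating NUL */
        if ((c[1] & 0xc0) != 0x80 || (c[2] & 0xc0) != 0x80)
            return -1;
        if (c[0] == 0xe0 && c[1] < 0xa0)
            return -1;          /* overlong three-byte form */
        if (c[0] == 0xed && c[1] >= 0xa0)
            return -1;          /* surrogates 0xd800..0xdfff */
        return 3;
    }
    if ((c[0] & 0xf8) == 0xf0) {
        if ((c[1] & 0xc0) != 0x80 || (c[2] & 0xc0) != 0x80 ||
            (c[3] & 0xc0) != 0x80)
            return -1;
        if (c[0] == 0xf0 && c[1] < 0x90)
            return -1;          /* overlong four-byte form */
        if (c[0] > 0xf4 || (c[0] == 0xf4 && c[1] > 0x8f))
            return -1;          /* above 0x10ffff */
        return 4;
    }
    return -1;                  /* lone continuation byte or 0xf8..0xff */
}

/* Remove every byte that does not start a well-formed sequence, in place. */
static char *
sketch_validate(char *str)
{
    unsigned char *src = (unsigned char *) str;
    unsigned char *dst = src;

    assert(str != NULL);

    while (*src) {
        int len = sketch_charcheck(src);

        if (len > 0) {
            memmove(dst, src, (size_t) len);    /* ranges may overlap */
            dst += len;
            src += len;
        } else {
            src++;              /* skip one offending byte */
        }
    }
    *dst = '\0';
    return str;
}
```

Skipping one byte at a time after a failed check means a truncated multi-byte sequence is dropped piece by piece, while any valid character that follows it is still kept.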