
Commit 7f380c5

Reduce size of backend scanner's tables.

Previously, the core scanner's yy_transition[] array had 37045 elements. Since that number is larger than INT16_MAX, Flex generated the array to contain 32-bit integers. By reimplementing some of the bulkier scanner rules, this patch reduces the array to 20495 elements. The much smaller total length, combined with the consequent use of 16-bit integers for the array elements, reduces the binary size by over 200kB. This was accomplished in two ways:

1. Consolidate handling of quote continuations into a new start condition, rather than duplicating that logic for five different string types.

2. Treat Unicode strings and identifiers followed by a UESCAPE sequence as three separate tokens, rather than one. The logic to de-escape Unicode strings is moved to the filter code in parser.c, which already had the ability to provide special processing for token sequences. While we could have implemented the conversion in the grammar, that approach was rejected for performance and maintainability reasons.

Performance in microbenchmarks of raw parsing seems equal or slightly faster in most cases, and it's reasonable to expect that in real-world usage (with more competition for the CPU cache) there will be a larger win. The exception is UESCAPE sequences; lexing those is about 10% slower, primarily because the scanner now has to be called three times rather than one. This seems acceptable since that feature is very rarely used.

The psql and ecpg lexers are likewise modified, primarily because we want to keep them all in sync. Since those lexers don't use the space-hogging -CF option, the space savings is much less, but it's still good for perhaps 10kB apiece.

While at it, merge the ecpg lexer's handling of C-style comments used in SQL and in C. Those have different rules regarding nested comments, but since we already have the ability to keep track of the previous start condition, we can use that to handle both cases within a single start condition. This matches the core scanner more closely.

John Naylor

Discussion: https://postgr.es/m/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw@mail.gmail.com
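The size arithmetic in the message can be checked with a rough sketch. It assumes that each yy_transition[] element is a pair of integer fields whose width Flex picks from the table length. That is a reading of Flex's full-table (-CF) layout and is not stated in the commit message. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed element layouts: two integer fields per table entry. */
struct trans32 { int32_t verify; int32_t next; };
struct trans16 { int16_t verify; int16_t next; };

/* Bytes occupied by a transition table with nelems entries. */
static size_t
table_bytes(size_t nelems)
{
    /* 16-bit fields are only usable while every index fits in INT16_MAX. */
    if (nelems <= (size_t) INT16_MAX)
        return nelems * sizeof(struct trans16);
    return nelems * sizeof(struct trans32);
}

/* Bytes saved by shrinking a table from `before` to `after` entries. */
static size_t
bytes_saved(size_t before, size_t after)
{
    assert(after <= before);
    return table_bytes(before) - table_bytes(after);
}
```

Under those assumptions, and on an ABI where the two structs occupy 8 and 4 bytes, `bytes_saved(37045, 20495)` is 296360 - 81980 = 214380 bytes, which is in line with the "over 200kB" figure.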

1 parent 259bbe1, commit 7f380c5

File tree

19 files changed, 671 additions and 619 deletions

src
- backend/parser
- fe_utils
  - psqlscan.l
- include
  - fe_utils
    - psqlscan_int.h
  - mb
    - pg_wchar.h
  - parser
    - kwlist.h
    - scanner.h
- interfaces/ecpg
  - preproc
  - test/expected
    - preproc-strings.c
    - preproc-strings.stderr
- pl/plpgsql/src
  - pl_gram.y
- test/regress
  - expected
    - strings.out
  - sql
    - strings.sql

`src/backend/parser/gram.y`

Lines changed: 7 additions & 3 deletions

Original file line number	Diff line number	Diff line change
`@@ -598,10 +598,13 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);`
`598`	`598`	`* the set of keywords. PL/pgSQL depends on this so that it can share the`
`599`	`599`	`* same lexer. If you add/change tokens here, fix PL/pgSQL to match!`
`600`	`600`	`*`
	`601`	`+ * UIDENT and USCONST are reduced to IDENT and SCONST in parser.c, so that`
	`602`	`+ * they need no productions here; but we must assign token codes to them.`
	`603`	`+ *`
`601`	`604`	`* DOT_DOT is unused in the core SQL grammar, and so will always provoke`
`602`	`605`	`* parse errors. It is needed by PL/pgSQL.`
`603`	`606`	`*/`
`604`		`-%token <str> IDENT FCONST SCONST BCONST XCONST Op`
	`607`	`+%token <str> IDENT UIDENT FCONST SCONST USCONST BCONST XCONST Op`
`605`	`608`	`%token <ival> ICONST PARAM`
`606`	`609`	`%token TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER`
`607`	`610`	`%token LESS_EQUALS GREATER_EQUALS NOT_EQUALS`
`@@ -691,8 +694,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);`
`691`	`694`	`TREAT TRIGGER TRIM TRUE_P`
`692`	`695`	`TRUNCATE TRUSTED TYPE_P TYPES_P`
`693`	`696`
`694`		`-UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED`
`695`		`-UNTIL UPDATE USER USING`
	`697`	`+UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN`
	`698`	`+UNLISTEN UNLOGGED UNTIL UPDATE USER USING`
`696`	`699`
`697`	`700`	`VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING`
`698`	`701`	`VERBOSE VERSION_P VIEW VIEWS VOLATILE`
`@@ -15374,6 +15377,7 @@ unreserved_keyword:`
`15374`	`15377`	`\| TRUSTED`
`15375`	`15378`	`\| TYPE_P`
`15376`	`15379`	`\| TYPES_P`
	`15380`	`+\| UESCAPE`
`15377`	`15381`	`\| UNBOUNDED`
`15378`	`15382`	`\| UNCOMMITTED`
`15379`	`15383`	`\| UNENCRYPTED`

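The comment added to gram.y says that UIDENT and USCONST need token codes but no productions, because a filter between the scanner and the grammar rewrites them. A minimal, self-contained sketch of that idea follows. The token codes, the `struct token` array, and `filter_next()` are hypothetical stand-ins for illustration and are not PostgreSQL's API. `isspace()` stands in for `scanner_isspace()`, which accepts a slightly narrower set of characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the real ones are assigned by Bison. */
enum tok { T_EOF, T_IDENT, T_SCONST, T_UIDENT, T_USCONST, T_UESCAPE, T_ERROR };

struct token { enum tok code; const char *text; };

/* Same rule as check_uescapechar() in parser.c (isspace is an approximation). */
static bool
valid_escape_char(unsigned char c)
{
    return !(isxdigit(c) || c == '+' || c == '\'' || c == '"' || isspace(c));
}

/*
 * Return the next grammar-level token from a T_EOF-terminated array,
 * advancing *pos.  For UIDENT/USCONST, *escape receives the escape
 * character to apply: backslash unless a UESCAPE clause overrides it.
 */
static enum tok
filter_next(const struct token *in, size_t *pos, char *escape)
{
    const struct token *cur;

    assert(in != NULL && pos != NULL && escape != NULL);
    cur = &in[*pos];
    if (cur->code == T_EOF)
        return T_EOF;
    (*pos)++;
    if (cur->code != T_UIDENT && cur->code != T_USCONST)
        return cur->code;       /* no lookahead needed */

    *escape = '\\';
    if (in[*pos].code == T_UESCAPE)
    {
        /* The third token must be a one-character string literal. */
        const struct token *esc = &in[*pos + 1];

        if (esc->code != T_SCONST || strlen(esc->text) != 1 ||
            !valid_escape_char((unsigned char) esc->text[0]))
            return T_ERROR;
        *escape = esc->text[0];
        *pos += 2;              /* consume UESCAPE and its literal */
    }
    /* The grammar only ever sees the plain token codes. */
    return cur->code == T_UIDENT ? T_IDENT : T_SCONST;
}
```

Fed the sequence USCONST, UESCAPE, SCONST `!`, the sketch returns a single T_SCONST with `*escape` set to `!`, so a grammar built on it would not need rules that mention UESCAPE after a Unicode literal.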
`src/backend/parser/parser.c`

Lines changed: 281 additions & 1 deletion

Original file line number	Diff line number	Diff line change
`@@ -21,8 +21,14 @@`
`21`	`21`
`22`	`22`	`#include "postgres.h"`
`23`	`23`
	`24`	`+#include "mb/pg_wchar.h"`
`24`	`25`	`#include "parser/gramparse.h"`
`25`	`26`	`#include "parser/parser.h"`
	`27`	`+#include "parser/scansup.h"`
	`28`	`+`
	`29`	`+static bool check_uescapechar(unsigned char escape);`
	`30`	`+static char *str_udeescape(const char *str, char escape,`
	`31`	`+int position, core_yyscan_t yyscanner);`
`26`	`32`
`27`	`33`
`28`	`34`	`/*`
`@@ -75,6 +81,10 @@ raw_parser(const char *str)`
`75`	`81`	`* scanner backtrack, which would cost more performance than this filter`
`76`	`82`	`* layer does.`
`77`	`83`	`*`
	`84`	`+ * We also use this filter to convert UIDENT and USCONST sequences into`
	`85`	`+ * plain IDENT and SCONST tokens. While that could be handled by additional`
	`86`	`+ * productions in the main grammar, it's more efficient to do it like this.`
	`87`	`+ *`
`78`	`88`	`* The filter also provides a convenient place to translate between`
`79`	`89`	`* the core_YYSTYPE and YYSTYPE representations (which are really the`
`80`	`90`	`* same thing anyway, but notationally they're different).`
`@@ -104,7 +114,7 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)`
`104`	`114`	`* If this token isn't one that requires lookahead, just return it. If it`
`105`	`115`	`* does, determine the token length. (We could get that via strlen(), but`
`106`	`116`	`* since we have such a small set of possibilities, hardwiring seems`
`107`		`- * feasible and more efficient.)`
	`117`	`+ * feasible and more efficient --- at least for the fixed-length cases.)`
`108`	`118`	`*/`
`109`	`119`	`switch (cur_token)`
`110`	`120`	`{`
`@@ -117,6 +127,10 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)`
`117`	`127`	`case WITH:`
`118`	`128`	`cur_token_length = 4;`
`119`	`129`	`break;`
	`130`	`+case UIDENT:`
	`131`	`+case USCONST:`
	`132`	`+cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);`
	`133`	`+break;`
`120`	`134`	`default:`
`121`	`135`	`return cur_token;`
`122`	`136`	`}`
`@@ -190,7 +204,273 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)`
`190`	`204`	`break;`
`191`	`205`	`}`
`192`	`206`	`break;`
	`207`	`+`
	`208`	`+case UIDENT:`
	`209`	`+case USCONST:`
	`210`	`+/* Look ahead for UESCAPE */`
	`211`	`+if (next_token == UESCAPE)`
	`212`	`+{`
	`213`	`+/* Yup, so get third token, which had better be SCONST */`
	`214`	`+const char *escstr;`
	`215`	`+`
	`216`	`+/* Again save and restore *llocp */`
	`217`	`+cur_yylloc = *llocp;`
	`218`	`+`
	`219`	`+/* Un-truncate current token so errors point to third token */`
	`220`	`+*(yyextra->lookahead_end) = yyextra->lookahead_hold_char;`
	`221`	`+`
	`222`	`+/* Get third token */`
	`223`	`+next_token = core_yylex(&(yyextra->lookahead_yylval),`
	`224`	`+llocp, yyscanner);`
	`225`	`+`
	`226`	`+/* If we throw error here, it will point to third token */`
	`227`	`+if (next_token != SCONST)`
	`228`	`+scanner_yyerror("UESCAPE must be followed by a simple string literal",`
	`229`	`+yyscanner);`
	`230`	`+`
	`231`	`+escstr = yyextra->lookahead_yylval.str;`
	`232`	`+if (strlen(escstr) != 1 \|\| !check_uescapechar(escstr[0]))`
	`233`	`+scanner_yyerror("invalid Unicode escape character",`
	`234`	`+yyscanner);`
	`235`	`+`
	`236`	`+/* Now restore *llocp; errors will point to first token */`
	`237`	`+*llocp = cur_yylloc;`
	`238`	`+`
	`239`	`+/* Apply Unicode conversion */`
	`240`	`+lvalp->core_yystype.str =`
	`241`	`+str_udeescape(lvalp->core_yystype.str,`
	`242`	`+escstr[0],`
	`243`	`+*llocp,`
	`244`	`+yyscanner);`
	`245`	`+`
	`246`	`+/*`
	`247`	`+ * We don't need to revert the un-truncation of UESCAPE. What`
	`248`	`+ * we do want to do is clear have_lookahead, thereby consuming`
	`249`	`+ * all three tokens.`
	`250`	`+ */`
	`251`	`+yyextra->have_lookahead = false;`
	`252`	`+}`
	`253`	`+else`
	`254`	`+{`
	`255`	`+/* No UESCAPE, so convert using default escape character */`
	`256`	`+lvalp->core_yystype.str =`
	`257`	`+str_udeescape(lvalp->core_yystype.str,`
	`258`	`+'\\',`
	`259`	`+*llocp,`
	`260`	`+yyscanner);`
	`261`	`+}`
	`262`	`+`
	`263`	`+if (cur_token == UIDENT)`
	`264`	`+{`
	`265`	`+/* It's an identifier, so truncate as appropriate */`
	`266`	`+truncate_identifier(lvalp->core_yystype.str,`
	`267`	`+strlen(lvalp->core_yystype.str),`
	`268`	`+true);`
	`269`	`+cur_token = IDENT;`
	`270`	`+}`
	`271`	`+else if (cur_token == USCONST)`
	`272`	`+{`
	`273`	`+cur_token = SCONST;`
	`274`	`+}`
	`275`	`+break;`
`193`	`276`	`}`
`194`	`277`
`195`	`278`	`return cur_token;`
`196`	`279`	`}`
	`280`	`+`
	`281`	`+/* convert hex digit (caller should have verified that) to value */`
	`282`	`+static unsigned int`
	`283`	`+hexval(unsigned char c)`
	`284`	`+{`
	`285`	`+if (c >= '0' && c <= '9')`
	`286`	`+return c - '0';`
	`287`	`+if (c >= 'a' && c <= 'f')`
	`288`	`+return c - 'a' + 0xA;`
	`289`	`+if (c >= 'A' && c <= 'F')`
	`290`	`+return c - 'A' + 0xA;`
	`291`	`+elog(ERROR, "invalid hexadecimal digit");`
	`292`	`+return 0; /* not reached */`
	`293`	`+}`
	`294`	`+`
	`295`	`+/* is Unicode code point acceptable in database's encoding? */`
	`296`	`+static void`
	`297`	`+check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)`
	`298`	`+{`
	`299`	`+/* See also addunicode() in scan.l */`
	`300`	`+if (c == 0 \|\| c > 0x10FFFF)`
	`301`	`+ereport(ERROR,`
	`302`	`+(errcode(ERRCODE_SYNTAX_ERROR),`
	`303`	`+errmsg("invalid Unicode escape value"),`
	`304`	`+scanner_errposition(pos, yyscanner)));`
	`305`	`+`
	`306`	`+if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)`
	`307`	`+ereport(ERROR,`
	`308`	`+(errcode(ERRCODE_SYNTAX_ERROR),`
	`309`	`+errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),`
	`310`	`+scanner_errposition(pos, yyscanner)));`
	`311`	`+}`
	`312`	`+`
	`313`	`+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */`
	`314`	`+static bool`
	`315`	`+check_uescapechar(unsigned char escape)`
	`316`	`+{`
	`317`	`+if (isxdigit(escape)`
	`318`	`+\|\| escape == '+'`
	`319`	`+\|\| escape == '\''`
	`320`	`+\|\| escape == '"'`
	`321`	`+\|\| scanner_isspace(escape))`
	`322`	`+return false;`
	`323`	`+else`
	`324`	`+return true;`
	`325`	`+}`
	`326`	`+`
	`327`	`+/*`
	`328`	`+ * Process Unicode escapes in "str", producing a palloc'd plain string`
	`329`	`+ *`
	`330`	`+ * escape: the escape character to use`
	`331`	`+ * position: start position of U&'' or U&"" string token`
	`332`	`+ * yyscanner: context information needed for error reports`
	`333`	`+ */`
	`334`	`+static char *`
	`335`	`+str_udeescape(const char *str, char escape,`
	`336`	`+int position, core_yyscan_t yyscanner)`
	`337`	`+{`
	`338`	`+const char *in;`
	`339`	`+char *new,`
	`340`	`+*out;`
	`341`	`+pg_wchar pair_first = 0;`
	`342`	`+`
	`343`	`+/*`
	`344`	`+ * This relies on the subtle assumption that a UTF-8 expansion cannot be`
	`345`	`+ * longer than its escaped representation.`
	`346`	`+ */`
	`347`	`+new = palloc(strlen(str) + 1);`
	`348`	`+`
	`349`	`+in = str;`
	`350`	`+out = new;`
	`351`	`+while (*in)`
	`352`	`+{`
	`353`	`+if (in[0] == escape)`
	`354`	`+{`
	`355`	`+if (in[1] == escape)`
	`356`	`+{`
	`357`	`+if (pair_first)`
	`358`	`+goto invalid_pair;`
	`359`	`+*out++ = escape;`
	`360`	`+in += 2;`
	`361`	`+}`
	`362`	`+else if (isxdigit((unsigned char) in[1]) &&`
	`363`	`+isxdigit((unsigned char) in[2]) &&`
	`364`	`+isxdigit((unsigned char) in[3]) &&`
	`365`	`+isxdigit((unsigned char) in[4]))`
	`366`	`+{`
	`367`	`+pg_wchar unicode;`
	`368`	`+`
	`369`	`+unicode = (hexval(in[1]) << 12) +`
	`370`	`+(hexval(in[2]) << 8) +`
	`371`	`+(hexval(in[3]) << 4) +`
	`372`	`+hexval(in[4]);`
	`373`	`+check_unicode_value(unicode,`
	`374`	`+in - str + position + 3, /* 3 for U&" */`
	`375`	`+yyscanner);`
	`376`	`+if (pair_first)`
	`377`	`+{`
	`378`	`+if (is_utf16_surrogate_second(unicode))`
	`379`	`+{`
	`380`	`+unicode = surrogate_pair_to_codepoint(pair_first, unicode);`
	`381`	`+pair_first = 0;`
	`382`	`+}`
	`383`	`+else`
	`384`	`+goto invalid_pair;`
	`385`	`+}`
	`386`	`+else if (is_utf16_surrogate_second(unicode))`
	`387`	`+goto invalid_pair;`
	`388`	`+`
	`389`	`+if (is_utf16_surrogate_first(unicode))`
	`390`	`+pair_first = unicode;`
	`391`	`+else`
	`392`	`+{`
	`393`	`+unicode_to_utf8(unicode, (unsigned char *) out);`
	`394`	`+out += pg_mblen(out);`
	`395`	`+}`
	`396`	`+in += 5;`
	`397`	`+}`
	`398`	`+else if (in[1] == '+' &&`
	`399`	`+isxdigit((unsigned char) in[2]) &&`
	`400`	`+isxdigit((unsigned char) in[3]) &&`
	`401`	`+isxdigit((unsigned char) in[4]) &&`
	`402`	`+isxdigit((unsigned char) in[5]) &&`
	`403`	`+isxdigit((unsigned char) in[6]) &&`
	`404`	`+isxdigit((unsigned char) in[7]))`
	`405`	`+{`
	`406`	`+pg_wchar unicode;`
	`407`	`+`
	`408`	`+unicode = (hexval(in[2]) << 20) +`
	`409`	`+(hexval(in[3]) << 16) +`
	`410`	`+(hexval(in[4]) << 12) +`
	`411`	`+(hexval(in[5]) << 8) +`
	`412`	`+(hexval(in[6]) << 4) +`
	`413`	`+hexval(in[7]);`
	`414`	`+check_unicode_value(unicode,`
	`415`	`+in - str + position + 3, /* 3 for U&" */`
	`416`	`+yyscanner);`
	`417`	`+if (pair_first)`
	`418`	`+{`
	`419`	`+if (is_utf16_surrogate_second(unicode))`
	`420`	`+{`
	`421`	`+unicode = surrogate_pair_to_codepoint(pair_first, unicode);`
	`422`	`+pair_first = 0;`
	`423`	`+}`
	`424`	`+else`
	`425`	`+goto invalid_pair;`
	`426`	`+}`
	`427`	`+else if (is_utf16_surrogate_second(unicode))`
	`428`	`+goto invalid_pair;`
	`429`	`+`
	`430`	`+if (is_utf16_surrogate_first(unicode))`
	`431`	`+pair_first = unicode;`
	`432`	`+else`
	`433`	`+{`
	`434`	`+unicode_to_utf8(unicode, (unsigned char *) out);`
	`435`	`+out += pg_mblen(out);`
	`436`	`+}`
	`437`	`+in += 8;`
	`438`	`+}`
	`439`	`+else`
	`440`	`+ereport(ERROR,`
	`441`	`+(errcode(ERRCODE_SYNTAX_ERROR),`
	`442`	`+errmsg("invalid Unicode escape value"),`
	`443`	`+scanner_errposition(in - str + position + 3, /* 3 for U&" */`
	`444`	`+yyscanner)));`
	`445`	`+}`
	`446`	`+else`
	`447`	`+{`
	`448`	`+if (pair_first)`
	`449`	`+goto invalid_pair;`
	`450`	`+`
	`451`	`+*out++ = *in++;`
	`452`	`+}`
	`453`	`+}`
	`454`	`+`
	`455`	`+/* unfinished surrogate pair? */`
	`456`	`+if (pair_first)`
	`457`	`+goto invalid_pair;`
	`458`	`+`
	`459`	`+*out = '\0';`
	`460`	`+`
	`461`	`+/*`
	`462`	`+ * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII`
	`463`	`+ * codes; but it's probably not worth the trouble, since this isn't likely`
	`464`	`+ * to be a performance-critical path.`
	`465`	`+ */`
	`466`	`+pg_verifymbstr(new, out - new, false);`
	`467`	`+return new;`
	`468`	`+`
	`469`	`+invalid_pair:`
	`470`	`+ereport(ERROR,`
	`471`	`+(errcode(ERRCODE_SYNTAX_ERROR),`
	`472`	`+errmsg("invalid Unicode surrogate pair"),`
	`473`	`+scanner_errposition(in - str + position + 3, /* 3 for U&" */`
	`474`	`+yyscanner)));`
	`475`	`+return NULL; /* keep compiler quiet */`
	`476`	`+}`

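For readers who want to try the de-escaping logic outside the backend, here is a minimal standalone sketch of the algorithm that str_udeescape() follows. It is not the committed code. `parse_hex()`, `put_utf8()` and `udeescape_sketch()` are hypothetical names, errors are reported by returning -1 rather than through ereport(), the output is always UTF-8, and the database-encoding check and pg_verifymbstr() call are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Parse exactly n hex digits at s; -1 if any character is not a hex digit. */
static long
parse_hex(const char *s, int n)
{
    long v = 0;

    for (int i = 0; i < n; i++)
    {
        unsigned char c = (unsigned char) s[i];

        if (!isxdigit(c))
            return -1;          /* this also stops at the terminating NUL */
        v = v * 16 + (isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    return v;
}

/* Write code point cp at out as UTF-8; return the number of bytes written. */
static size_t
put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Decode escXXXX and esc+XXXXXX sequences in str into out, which must hold
 * strlen(str) + 1 bytes.  Returns the output length, or -1 on a malformed
 * escape, an out-of-range value, or a broken surrogate pair.
 */
static long
udeescape_sketch(const char *str, char esc, char *out)
{
    const char *in = str;
    char       *o = out;
    uint32_t    hi = 0;         /* pending high surrogate, 0 if none */

    assert(str != NULL && out != NULL);
    while (*in)
    {
        if (in[0] != esc || in[1] == esc)
        {
            /* Ordinary character, or a doubled escape character. */
            if (hi)
                return -1;
            *o++ = *in;
            in += (in[0] == esc) ? 2 : 1;
            continue;
        }

        int         ndigits = (in[1] == '+') ? 6 : 4;
        const char *digits = in + (ndigits == 6 ? 2 : 1);
        long        cp = parse_hex(digits, ndigits);

        if (cp <= 0 || cp > 0x10FFFF)
            return -1;
        in = digits + ndigits;

        if (cp >= 0xDC00 && cp <= 0xDFFF)       /* low surrogate */
        {
            if (!hi)
                return -1;
            cp = 0x10000 + ((long) (hi - 0xD800) << 10) + (cp - 0xDC00);
            hi = 0;
        }
        else if (hi)
            return -1;

        if (cp >= 0xD800 && cp <= 0xDBFF)       /* high surrogate: wait */
            hi = (uint32_t) cp;
        else
            o += put_utf8((uint32_t) cp, (unsigned char *) o);
    }
    if (hi)
        return -1;              /* unfinished surrogate pair */
    *o = '\0';
    return (long) (o - out);
}
```

Called on the C string `"d\\0061t\\+000061"` with a backslash as the escape character, the sketch writes `data` and returns 4. As in the committed function, the output buffer can be sized from the input length because no escape expands to more bytes than it occupies.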