NotificationsYou must be signed in to change notification settings
Fork6
Star31

Commit911588a

committed

Add fast path for validating UTF-8 text

Our previous validator used a traditional algorithm that performedcomparison and branching one byte at a time. It's useful in thatwe always know exactly how many bytes we have validated, but thatprecision comes at a cost. Input validation can show up prominentlyin profiles of COPY FROM, and future improvements to COPY FROM suchas parallelism or faster line parsing will put more pressure on inputvalidation. Hence, add fast paths for both ASCII and multibyte UTF-8:Use bitwise operations to check 16 bytes at a time for ASCII. Ifthat fails, use a "shift-based" DFA on those bytes to handle thegeneral case, including multibyte. These paths are relatively freeof branches and thus robust against all kinds of byte patterns. Withthese algorithms, UTF-8 validation is several times faster, dependingon platform and the input byte distribution.The previous coding in pg_utf8_verifystr() is retained for shortstrings and for when the fast path returns an error.Review, performance testing, and additional hacking by: HeikkiLinakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, andGreg StarkDiscussion:https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com

1 parente2c52be commit911588aCopy full SHA for 911588a

File tree

4 files changed

+570

-0

lines changed

src
- common
  - wchar.c
- include/mb
  - pg_wchar.h
- test/regress
  - expected
    - conversion.out
  - sql
    - conversion.sql

4 files changed

+570

-0

lines changed

`‎src/common/wchar.c‎`

Lines changed: 215 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -1750,11 +1750,226 @@ pg_utf8_verifychar(const unsigned char *s, int len)`
`1750`	`1750`	`returnl;`
`1751`	`1751`	`}`
`1752`	`1752`
	`1753`	`+/*`
	`1754`	`+ * The fast path of the UTF-8 verifier uses a deterministic finite automaton`
	`1755`	`+ * (DFA) for multibyte characters. In a traditional table-driven DFA, the`
	`1756`	`+ * input byte and current state are used to compute an index into an array of`
	`1757`	`+ * state transitions. Since the address of the next transition is dependent`
	`1758`	`+ * on this computation, there is latency in executing the load instruction,`
	`1759`	`+ * and the CPU is not kept busy.`
	`1760`	`+ *`
	`1761`	`+ * Instead, we use a "shift-based" DFA as described by Per Vognsen:`
	`1762`	`+ *`
	`1763`	`+ * https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725`
	`1764`	`+ *`
	`1765`	`+ * In a shift-based DFA, the input byte is an index into array of integers`
	`1766`	`+ * whose bit pattern encodes the state transitions. To compute the next`
	`1767`	`+ * state, we simply right-shift the integer by the current state and apply a`
	`1768`	`+ * mask. In this scheme, the address of the transition only depends on the`
	`1769`	`+ * input byte, so there is better pipelining.`
	`1770`	`+ *`
	`1771`	`+ * The naming convention for states and transitions was adopted from a UTF-8`
	`1772`	`+ * to UTF-16/32 transcoder, whose table is reproduced below:`
	`1773`	`+ *`
	`1774`	`+ * https://github.com/BobSteagall/utf_utils/blob/6b7a465265de2f5fa6133d653df0c9bdd73bbcf8/src/utf_utils.cpp`
	`1775`	`+ *`
	`1776`	`+ * ILL ASC CR1 CR2 CR3 L2A L3A L3B L3C L4A L4B L4C CLASS / STATE`
	`1777`	`+ * ==========================================================================`
	`1778`	`+ * err, END, err, err, err, CS1, P3A, CS2, P3B, P4A, CS3, P4B, \| BGN/END`
	`1779`	`+ * err, err, err, err, err, err, err, err, err, err, err, err, \| ERR`
	`1780`	`+ * \|`
	`1781`	`+ * err, err, END, END, END, err, err, err, err, err, err, err, \| CS1`
	`1782`	`+ * err, err, CS1, CS1, CS1, err, err, err, err, err, err, err, \| CS2`
	`1783`	`+ * err, err, CS2, CS2, CS2, err, err, err, err, err, err, err, \| CS3`
	`1784`	`+ * \|`
	`1785`	`+ * err, err, err, err, CS1, err, err, err, err, err, err, err, \| P3A`
	`1786`	`+ * err, err, CS1, CS1, err, err, err, err, err, err, err, err, \| P3B`
	`1787`	`+ * \|`
	`1788`	`+ * err, err, err, CS2, CS2, err, err, err, err, err, err, err, \| P4A`
	`1789`	`+ * err, err, CS2, err, err, err, err, err, err, err, err, err, \| P4B`
	`1790`	`+ *`
	`1791`	`+ * In the most straightforward implementation, a shift-based DFA for UTF-8`
	`1792`	`+ * requires 64-bit integers to encode the transitions, but with an SMT solver`
	`1793`	`+ * it's possible to find state numbers such that the transitions fit within`
	`1794`	`+ * 32-bit integers, as Dougall Johnson demonstrated:`
	`1795`	`+ *`
	`1796`	`+ * https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f`
	`1797`	`+ *`
	`1798`	`+ * This packed representation is the reason for the seemingly odd choice of`
	`1799`	`+ * state values below.`
	`1800`	`+ */`
	`1801`	`+`
	`1802`	`+/* Error */`
	`1803`	`+#defineERR 0`
	`1804`	`+/* Begin */`
	`1805`	`+#defineBGN 11`
	`1806`	`+/* Continuation states, expect 1/2/3 continuation bytes */`
	`1807`	`+#defineCS1 16`
	`1808`	`+#defineCS2 1`
	`1809`	`+#defineCS3 5`
	`1810`	`+/* Leading byte was E0/ED, expect 1 more continuation byte */`
	`1811`	`+#defineP3A 6`
	`1812`	`+#defineP3B 20`
	`1813`	`+/* Leading byte was F0/F4, expect 2 more continuation bytes */`
	`1814`	`+#defineP4A 25`
	`1815`	`+#defineP4B 30`
	`1816`	`+/* Begin and End are the same state */`
	`1817`	`+#defineEND BGN`
	`1818`	`+`
	`1819`	`+/* the encoded state transitions for the lookup table */`
	`1820`	`+`
	`1821`	`+/* ASCII */`
	`1822`	`+#defineASC (END << BGN)`
	`1823`	`+/* 2-byte lead */`
	`1824`	`+#defineL2A (CS1 << BGN)`
	`1825`	`+/* 3-byte lead */`
	`1826`	`+#defineL3A (P3A << BGN)`
	`1827`	`+#defineL3B (CS2 << BGN)`
	`1828`	`+#defineL3C (P3B << BGN)`
	`1829`	`+/* 4-byte lead */`
	`1830`	`+#defineL4A (P4A << BGN)`
	`1831`	`+#defineL4B (CS3 << BGN)`
	`1832`	`+#defineL4C (P4B << BGN)`
	`1833`	`+/* continuation byte */`
	`1834`	`+#defineCR1 (END << CS1) \| (CS1 << CS2) \| (CS2 << CS3) \| (CS1 << P3B) \| (CS2 << P4B)`
	`1835`	`+#defineCR2 (END << CS1) \| (CS1 << CS2) \| (CS2 << CS3) \| (CS1 << P3B) \| (CS2 << P4A)`
	`1836`	`+#defineCR3 (END << CS1) \| (CS1 << CS2) \| (CS2 << CS3) \| (CS1 << P3A) \| (CS2 << P4A)`
	`1837`	`+/* invalid byte */`
	`1838`	`+#defineILL ERR`
	`1839`	`+`
	`1840`	`+staticconstuint32Utf8Transition[256]=`
	`1841`	`+{`
	`1842`	`+/* ASCII */`
	`1843`	`+`
	`1844`	`+ILL,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1845`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1846`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1847`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1848`	`+`
	`1849`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1850`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1851`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1852`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1853`	`+`
	`1854`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1855`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1856`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1857`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1858`	`+`
	`1859`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1860`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1861`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1862`	`+ASC,ASC,ASC,ASC,ASC,ASC,ASC,ASC,`
	`1863`	`+`
	`1864`	`+/* continuation bytes */`
	`1865`	`+`
	`1866`	`+/* 80..8F */`
	`1867`	`+CR1,CR1,CR1,CR1,CR1,CR1,CR1,CR1,`
	`1868`	`+CR1,CR1,CR1,CR1,CR1,CR1,CR1,CR1,`
	`1869`	`+`
	`1870`	`+/* 90..9F */`
	`1871`	`+CR2,CR2,CR2,CR2,CR2,CR2,CR2,CR2,`
	`1872`	`+CR2,CR2,CR2,CR2,CR2,CR2,CR2,CR2,`
	`1873`	`+`
	`1874`	`+/* A0..BF */`
	`1875`	`+CR3,CR3,CR3,CR3,CR3,CR3,CR3,CR3,`
	`1876`	`+CR3,CR3,CR3,CR3,CR3,CR3,CR3,CR3,`
	`1877`	`+CR3,CR3,CR3,CR3,CR3,CR3,CR3,CR3,`
	`1878`	`+CR3,CR3,CR3,CR3,CR3,CR3,CR3,CR3,`
	`1879`	`+`
	`1880`	`+/* leading bytes */`
	`1881`	`+`
	`1882`	`+/* C0..DF */`
	`1883`	`+ILL,ILL,L2A,L2A,L2A,L2A,L2A,L2A,`
	`1884`	`+L2A,L2A,L2A,L2A,L2A,L2A,L2A,L2A,`
	`1885`	`+L2A,L2A,L2A,L2A,L2A,L2A,L2A,L2A,`
	`1886`	`+L2A,L2A,L2A,L2A,L2A,L2A,L2A,L2A,`
	`1887`	`+`
	`1888`	`+/* E0..EF */`
	`1889`	`+L3A,L3B,L3B,L3B,L3B,L3B,L3B,L3B,`
	`1890`	`+L3B,L3B,L3B,L3B,L3B,L3C,L3B,L3B,`
	`1891`	`+`
	`1892`	`+/* F0..FF */`
	`1893`	`+L4A,L4B,L4B,L4B,L4C,ILL,ILL,ILL,`
	`1894`	`+ILL,ILL,ILL,ILL,ILL,ILL,ILL,ILL`
	`1895`	`+};`
	`1896`	`+`
	`1897`	`+staticvoid`
	`1898`	`+utf8_advance(constunsignedchars,uint32state,intlen)`
	`1899`	`+{`
	`1900`	`+/* Note: We deliberately don't check the state's value here. */`
	`1901`	`+while (len>0)`
	`1902`	`+{`
	`1903`	`+/*`
	`1904`	`+ * It's important that the mask value is 31: In most instruction sets,`
	`1905`	`+ * a shift by a 32-bit operand is understood to be a shift by its mod`
	`1906`	`+ * 32, so the compiler should elide the mask operation.`
	`1907`	`+ */`
	`1908`	`+state=Utf8Transition[s++] >> (*state&31);`
	`1909`	`+len--;`
	`1910`	`+}`
	`1911`	`+`
	`1912`	`+*state &=31;`
	`1913`	`+}`
	`1914`	`+`
`1753`	`1915`	`staticint`
`1754`	`1916`	`pg_utf8_verifystr(constunsignedchar*s,intlen)`
`1755`	`1917`	`{`
`1756`	`1918`	`constunsignedchar*start=s;`
	`1919`	`+constintorig_len=len;`
	`1920`	`+uint32state=BGN;`
	`1921`	`+`
	`1922`	`+/*`
	`1923`	`+ * Sixteen seems to give the best balance of performance across different`
	`1924`	`+ * byte distributions.`
	`1925`	`+ */`
	`1926`	`+#defineSTRIDE_LENGTH 16`
	`1927`	`+`
	`1928`	`+if (len >=STRIDE_LENGTH)`
	`1929`	`+{`
	`1930`	`+while (len >=STRIDE_LENGTH)`
	`1931`	`+{`
	`1932`	`+/*`
	`1933`	`+ * If the chunk is all ASCII, we can skip the full UTF-8 check,`
	`1934`	`+ * but we must first check for a non-END state, which means the`
	`1935`	`+ * previous chunk ended in the middle of a multibyte sequence.`
	`1936`	`+ */`
	`1937`	`+if (state!=END\|\| !is_valid_ascii(s,STRIDE_LENGTH))`
	`1938`	`+utf8_advance(s,&state,STRIDE_LENGTH);`
	`1939`	`+`
	`1940`	`+s+=STRIDE_LENGTH;`
	`1941`	`+len-=STRIDE_LENGTH;`
	`1942`	`+}`
	`1943`	`+`
	`1944`	`+/*`
	`1945`	`+ * The error state persists, so we only need to check for it here. In`
	`1946`	`+ * case of error we start over from the beginning with the slow path`
	`1947`	`+ * so we can count the valid bytes.`
	`1948`	`+ */`
	`1949`	`+if (state==ERR)`
	`1950`	`+{`
	`1951`	`+len=orig_len;`
	`1952`	`+s=start;`
	`1953`	`+}`
	`1954`	`+`
	`1955`	`+/*`
	`1956`	`+ * We treat all other states as success, but it's possible the fast`
	`1957`	`+ * path exited in the middle of a multibyte sequence, since that`
	`1958`	`+ * wouldn't have caused an error. Before checking the remaining bytes,`
	`1959`	`+ * walk backwards to find the last byte that could have been the start`
	`1960`	`+ * of a valid sequence.`
	`1961`	`+ */`
	`1962`	`+while (s>start)`
	`1963`	`+{`
	`1964`	`+s--;`
	`1965`	`+len++;`
	`1966`	`+`
	`1967`	`+if (!IS_HIGHBIT_SET(*s)\|\|pg_utf_mblen(s)>1)`
	`1968`	`+break;`
	`1969`	`+}`
	`1970`	`+}`
`1757`	`1971`
	`1972`	`+/* check remaining bytes */`
`1758`	`1973`	`while (len>0)`
`1759`	`1974`	`{`
`1760`	`1975`	`intl;`

`‎src/include/mb/pg_wchar.h‎`

Lines changed: 53 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -699,4 +699,57 @@ extern intmic2latin_with_table(const unsigned char mic, unsigned char p,`
`699`	`699`	`externWCHARpgwin32_message_to_UTF16(constcharstr,intlen,int*utf16len);`
`700`	`700`	`#endif`
`701`	`701`
	`702`	`+`
	`703`	`+/*`
	`704`	`+ * Verify a chunk of bytes for valid ASCII.`
	`705`	`+ *`
	`706`	`+ * Returns false if the input contains any zero bytes or bytes with the`
	`707`	`+ * high-bit set. Input len must be a multiple of 8.`
	`708`	`+ */`
	`709`	`+staticinlinebool`
	`710`	`+is_valid_ascii(constunsignedchar*s,intlen)`
	`711`	`+{`
	`712`	`+uint64chunk,`
	`713`	`+highbit_cum=UINT64CONST(0),`
	`714`	`+zero_cum=UINT64CONST(0x8080808080808080);`
	`715`	`+`
	`716`	`+Assert(len %sizeof(chunk)==0);`
	`717`	`+`
	`718`	`+while (len>0)`
	`719`	`+{`
	`720`	`+memcpy(&chunk,s,sizeof(chunk));`
	`721`	`+`
	`722`	`+/*`
	`723`	`+ * Capture any zero bytes in this chunk.`
	`724`	`+ *`
	`725`	`+ * First, add 0x7f to each byte. This sets the high bit in each byte,`
	`726`	`+ * unless it was a zero. If any resulting high bits are zero, the`
	`727`	`+ * corresponding high bits in the zero accumulator will be cleared.`
	`728`	`+ *`
	`729`	`+ * If none of the bytes in the chunk had the high bit set, the max`
	`730`	`+ * value each byte can have after the addition is 0x7f + 0x7f = 0xfe,`
	`731`	`+ * and we don't need to worry about carrying over to the next byte. If`
	`732`	`+ * any input bytes did have the high bit set, it doesn't matter`
	`733`	`+ * because we check for those separately.`
	`734`	`+ */`
	`735`	`+zero_cum &= (chunk+UINT64CONST(0x7f7f7f7f7f7f7f7f));`
	`736`	`+`
	`737`	`+/* Capture any set bits in this chunk. */`
	`738`	`+highbit_cum \|=chunk;`
	`739`	`+`
	`740`	`+s+=sizeof(chunk);`
	`741`	`+len-=sizeof(chunk);`
	`742`	`+}`
	`743`	`+`
	`744`	`+/* Check if any high bits in the high bit accumulator got set. */`
	`745`	`+if (highbit_cum&UINT64CONST(0x8080808080808080))`
	`746`	`+return false;`
	`747`	`+`
	`748`	`+/* Check if any high bits in the zero accumulator got cleared. */`
	`749`	`+if (zero_cum!=UINT64CONST(0x8080808080808080))`
	`750`	`+return false;`
	`751`	`+`
	`752`	`+return true;`
	`753`	`+}`
	`754`	`+`
`702`	`755`	`#endif/* PG_WCHAR_H */`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commit911588a

File tree

4 files changed

4 files changed

`‎src/common/wchar.c‎`

`‎src/include/mb/pg_wchar.h‎`

0 commit comments