
Commit 8a51027
Further optimize nbtree search scan key comparisons.
Postgres 17 commit e0b1ee1 added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only.

The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys).

The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization.

Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page.

Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages").

Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant.

Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value.

However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance.

Follow-up to commit 92fe23d, which added nbtree skip scan.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
1 parent 92fe23d commit 8a51027

File tree

5 files changed: +483, -153 lines changed


src/backend/access/nbtree/nbtpreprocesskeys.c

Lines changed: 1 addition & 0 deletions

@@ -1390,6 +1390,7 @@ _bt_preprocess_array_keys(IndexScanDesc scan, int *new_numberOfKeys)
 	arrayKeyData = (ScanKey) palloc(numArrayKeyData * sizeof(ScanKeyData));
 
 	/* Allocate space for per-array data in the workspace context */
+	so->skipScan = (numSkipArrayKeys > 0);
 	so->arrayKeys = (BTArrayKeyInfo *) palloc(numArrayKeys * sizeof(BTArrayKeyInfo));
 
 	/* Allocate space for ORDER procs used to help _bt_checkkeys */

src/backend/access/nbtree/nbtree.c

Lines changed: 1 addition & 0 deletions

@@ -349,6 +349,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	else
 		so->keyData = NULL;
 
+	so->skipScan = false;
 	so->needPrimScan = false;
 	so->scanBehind = false;
 	so->oppositeDirCheck = false;

src/backend/access/nbtree/nbtsearch.c

Lines changed: 39 additions & 38 deletions

@@ -1648,47 +1648,14 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 	pstate.finaltup = NULL;
 	pstate.page = page;
 	pstate.firstpage = firstpage;
+	pstate.forcenonrequired = false;
+	pstate.startikey = 0;
 	pstate.offnum = InvalidOffsetNumber;
 	pstate.skip = InvalidOffsetNumber;
 	pstate.continuescan = true;	/* default assumption */
-	pstate.prechecked = false;
-	pstate.firstmatch = false;
 	pstate.rechecks = 0;
 	pstate.targetdistance = 0;
 
-	/*
-	 * Prechecking the value of the continuescan flag for the last item on the
-	 * page (for backwards scan it will be the first item on a page). If we
-	 * observe it to be true, then it should be true for all other items. This
-	 * allows us to do significant optimizations in the _bt_checkkeys()
-	 * function for all the items on the page.
-	 *
-	 * With the forward scan, we do this check for the last item on the page
-	 * instead of the high key. It's relatively likely that the most
-	 * significant column in the high key will be different from the
-	 * corresponding value from the last item on the page. So checking with
-	 * the last item on the page would give a more precise answer.
-	 *
-	 * We skip this for the first page read by each (primitive) scan, to avoid
-	 * slowing down point queries. They typically don't stand to gain much
-	 * when the optimization can be applied, and are more likely to notice the
-	 * overhead of the precheck. Also avoid it during scans with array keys,
-	 * which might be using skip scan (XXX fixed in next commit).
-	 */
-	if (!pstate.firstpage && !arrayKeys && minoff < maxoff)
-	{
-		ItemId		iid;
-		IndexTuple	itup;
-
-		iid = PageGetItemId(page, ScanDirectionIsForward(dir) ? maxoff : minoff);
-		itup = (IndexTuple) PageGetItem(page, iid);
-
-		/* Call with arrayKeys=false to avoid undesirable side-effects */
-		_bt_checkkeys(scan, &pstate, false, itup, indnatts);
-		pstate.prechecked = pstate.continuescan;
-		pstate.continuescan = true;	/* reset */
-	}
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* SK_SEARCHARRAY forward scans must provide high key up front */
@@ -1716,6 +1683,13 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			so->scanBehind = so->oppositeDirCheck = false;	/* reset */
 		}
 
+		/*
+		 * Consider pstate.startikey optimization once the ongoing primitive
+		 * index scan has already read at least one page
+		 */
+		if (!pstate.firstpage && minoff < maxoff)
+			_bt_set_startikey(scan, &pstate);
+
 		/* load items[] in ascending order */
 		itemIndex = 0;
 
@@ -1752,6 +1726,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			{
 				Assert(!passes_quals && pstate.continuescan);
 				Assert(offnum < pstate.skip);
+				Assert(!pstate.forcenonrequired);
 
 				offnum = pstate.skip;
 				pstate.skip = InvalidOffsetNumber;
@@ -1761,7 +1736,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			if (passes_quals)
 			{
 				/* tuple passes all scan key conditions */
-				pstate.firstmatch = true;
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
@@ -1816,7 +1790,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 				int			truncatt;
 
 				truncatt = BTreeTupleGetNAtts(itup, rel);
-				pstate.prechecked = false;	/* precheck didn't cover HIKEY */
+				pstate.forcenonrequired = false;
+				pstate.startikey = 0;	/* _bt_set_startikey ignores HIKEY */
 				_bt_checkkeys(scan, &pstate, arrayKeys, itup, truncatt);
 			}
 
@@ -1855,6 +1830,13 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			so->scanBehind = so->oppositeDirCheck = false;	/* reset */
 		}
 
+		/*
+		 * Consider pstate.startikey optimization once the ongoing primitive
+		 * index scan has already read at least one page
+		 */
+		if (!pstate.firstpage && minoff < maxoff)
+			_bt_set_startikey(scan, &pstate);
+
 		/* load items[] in descending order */
 		itemIndex = MaxTIDsPerBTreePage;
 
@@ -1894,6 +1876,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			Assert(!BTreeTupleIsPivot(itup));
 
 			pstate.offnum = offnum;
+			if (arrayKeys && offnum == minoff && pstate.forcenonrequired)
+			{
+				pstate.forcenonrequired = false;
+				pstate.startikey = 0;
+			}
 			passes_quals = _bt_checkkeys(scan, &pstate, arrayKeys,
 										 itup, indnatts);
 
@@ -1905,6 +1892,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			{
 				Assert(!passes_quals && pstate.continuescan);
 				Assert(offnum > pstate.skip);
+				Assert(!pstate.forcenonrequired);
 
 				offnum = pstate.skip;
 				pstate.skip = InvalidOffsetNumber;
@@ -1914,7 +1902,6 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions */
-				pstate.firstmatch = true;
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
@@ -1970,6 +1957,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
 		so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
 	}
 
+	/*
+	 * If _bt_set_startikey told us to temporarily treat the scan's keys as
+	 * nonrequired (possible only during scans with array keys), there must be
+	 * no lasting consequences for the scan's array keys.  The scan's arrays
+	 * should now have exactly the same elements as they would have had if the
+	 * nonrequired behavior had never been used.  (In general, a scan's arrays
+	 * are expected to track its progress through the index's key space.)
+	 *
+	 * We are required (by _bt_set_startikey) to call _bt_checkkeys against
+	 * pstate.finaltup with pstate.forcenonrequired=false to allow the scan's
+	 * arrays to recover.  Assert that that step hasn't been missed.
+	 */
+	Assert(!pstate.forcenonrequired);
+
 	return (so->currPos.firstItem <= so->currPos.lastItem);
 }

0 commit comments
