Commit d168b66

Enhance nbtree index tuple deletion.
Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions.  This is "bottom-up
deletion".  Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page.  This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718b).

The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates.  It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_.  Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index.  The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.

The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs.  That is left as work for a
later commit.

Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing.  This increases the number of index tuples deleted
significantly in many cases.  The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples.  We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).

Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages.  That is, it almost never fails
to delete at least a few extra index tuples.  It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples.  It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).

Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples.  It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly).  Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.

Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.

No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes.  Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade.  The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
1 parent 9dc718b · commit d168b66

File tree

19 files changed: +2112, −442 lines changed

doc/src/sgml/btree.sgml

Lines changed: 124 additions & 28 deletions
@@ -629,6 +629,109 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
   </para>
  </sect2>
 
+ <sect2 id="btree-deletion">
+  <title>Bottom-up index deletion</title>
+  <para>
+   B-Tree indexes are not directly aware that under MVCC, there might
+   be multiple extant versions of the same logical table row; to an
+   index, each tuple is an independent object that needs its own index
+   entry.  <quote>Version churn</quote> tuples may sometimes
+   accumulate and adversely affect query latency and throughput.  This
+   typically occurs with <command>UPDATE</command>-heavy workloads
+   where most individual updates cannot apply the
+   <acronym>HOT</acronym> optimization.  Changing the value of only
+   one column covered by one index during an <command>UPDATE</command>
+   <emphasis>always</emphasis> necessitates a new set of index tuples
+   &mdash; one for <emphasis>each and every</emphasis> index on the
+   table.  Note in particular that this includes indexes that were not
+   <quote>logically modified</quote> by the <command>UPDATE</command>.
+   All indexes will need a successor physical index tuple that points
+   to the latest version in the table.  Each new tuple within each
+   index will generally need to coexist with the original
+   <quote>updated</quote> tuple for a short period of time (typically
+   until shortly after the <command>UPDATE</command> transaction
+   commits).
+  </para>
+  <para>
+   B-Tree indexes incrementally delete version churn index tuples by
+   performing <firstterm>bottom-up index deletion</firstterm> passes.
+   Each deletion pass is triggered in reaction to an anticipated
+   <quote>version churn page split</quote>.  This only happens with
+   indexes that are not logically modified by
+   <command>UPDATE</command> statements, where concentrated build up
+   of obsolete versions in particular pages would occur otherwise.  A
+   page split will usually be avoided, though it's possible that
+   certain implementation-level heuristics will fail to identify and
+   delete even one garbage index tuple (in which case a page split or
+   deduplication pass resolves the issue of an incoming new tuple not
+   fitting on a leaf page).  The worst case number of versions that
+   any index scan must traverse (for any single logical row) is an
+   important contributor to overall system responsiveness and
+   throughput.  A bottom-up index deletion pass targets suspected
+   garbage tuples in a single leaf page based on
+   <emphasis>qualitative</emphasis> distinctions involving logical
+   rows and versions.  This contrasts with the <quote>top-down</quote>
+   index cleanup performed by autovacuum workers, which is triggered
+   when certain <emphasis>quantitative</emphasis> table-level
+   thresholds are exceeded (see <xref linkend="autovacuum"/>).
+  </para>
+  <note>
+   <para>
+    Not all deletion operations that are performed within B-Tree
+    indexes are bottom-up deletion operations.  There is a distinct
+    category of index tuple deletion: <firstterm>simple index tuple
+    deletion</firstterm>.  This is a deferred maintenance operation
+    that deletes index tuples that are known to be safe to delete
+    (those whose item identifier's <literal>LP_DEAD</literal> bit is
+    already set).  Like bottom-up index deletion, simple index
+    deletion takes place at the point that a page split is anticipated
+    as a way of avoiding the split.
+   </para>
+   <para>
+    Simple deletion is opportunistic in the sense that it can only
+    take place when recent index scans set the
+    <literal>LP_DEAD</literal> bits of affected items in passing.
+    Prior to <productname>PostgreSQL</productname> 14, the only
+    category of B-Tree deletion was simple deletion.  The main
+    differences between it and bottom-up deletion are that only the
+    former is opportunistically driven by the activity of passing
+    index scans, while only the latter specifically targets version
+    churn from <command>UPDATE</command>s that do not logically modify
+    indexed columns.
+   </para>
+  </note>
+  <para>
+   Bottom-up index deletion performs the vast majority of all garbage
+   index tuple cleanup for particular indexes with certain workloads.
+   This is expected with any B-Tree index that is subject to
+   significant version churn from <command>UPDATE</command>s that
+   rarely or never logically modify the columns that the index covers.
+   The average and worst case number of versions per logical row can
+   be kept low purely through targeted incremental deletion passes.
+   It's quite possible that the on-disk size of certain indexes will
+   never increase by even one single page/block despite
+   <emphasis>constant</emphasis> version churn from
+   <command>UPDATE</command>s.  Even then, an exhaustive <quote>clean
+   sweep</quote> by a <command>VACUUM</command> operation (typically
+   run in an autovacuum worker process) will eventually be required as
+   a part of <emphasis>collective</emphasis> cleanup of the table and
+   each of its indexes.
+  </para>
+  <para>
+   Unlike <command>VACUUM</command>, bottom-up index deletion does not
+   provide any strong guarantees about how old the oldest garbage
+   index tuple may be.  No index can be permitted to retain
+   <quote>floating garbage</quote> index tuples that became dead prior
+   to a conservative cutoff point shared by the table and all of its
+   indexes collectively.  This fundamental table-level invariant makes
+   it safe to recycle table <acronym>TID</acronym>s.  This is how it
+   is possible for distinct logical rows to reuse the same table
+   <acronym>TID</acronym> over time (though this can never happen with
+   two logical rows whose lifetimes span the same
+   <command>VACUUM</command> cycle).
+  </para>
+ </sect2>
+
  <sect2 id="btree-deduplication">
   <title>Deduplication</title>
   <para>
@@ -666,15 +769,17 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
   </note>
   <para>
    The deduplication process occurs lazily, when a new item is
-   inserted that cannot fit on an existing leaf page.  This prevents
-   (or at least delays) leaf page splits.  Unlike GIN posting list
-   tuples, B-Tree posting list tuples do not need to expand every time
-   a new duplicate is inserted; they are merely an alternative
-   physical representation of the original logical contents of the
-   leaf page.  This design prioritizes consistent performance with
-   mixed read-write workloads.  Most client applications will at least
-   see a moderate performance benefit from using deduplication.
-   Deduplication is enabled by default.
+   inserted that cannot fit on an existing leaf page, though only when
+   index tuple deletion could not free sufficient space for the new
+   item (typically deletion is briefly considered and then skipped
+   over).  Unlike GIN posting list tuples, B-Tree posting list tuples
+   do not need to expand every time a new duplicate is inserted; they
+   are merely an alternative physical representation of the original
+   logical contents of the leaf page.  This design prioritizes
+   consistent performance with mixed read-write workloads.  Most
+   client applications will at least see a moderate performance
+   benefit from using deduplication.  Deduplication is enabled by
+   default.
   </para>
   <para>
    <command>CREATE INDEX</command> and <command>REINDEX</command>
@@ -702,25 +807,16 @@ options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns
    deduplication isn't usually helpful.
   </para>
   <para>
-   B-Tree indexes are not directly aware that under MVCC, there might
-   be multiple extant versions of the same logical table row; to an
-   index, each tuple is an independent object that needs its own index
-   entry.  <quote>Version duplicates</quote> may sometimes accumulate
-   and adversely affect query latency and throughput.  This typically
-   occurs with <command>UPDATE</command>-heavy workloads where most
-   individual updates cannot apply the <acronym>HOT</acronym>
-   optimization (often because at least one indexed column gets
-   modified, necessitating a new set of index tuple versions &mdash;
-   one new tuple for <emphasis>each and every</emphasis> index).  In
-   effect, B-Tree deduplication ameliorates index bloat caused by
-   version churn.  Note that even the tuples from a unique index are
-   not necessarily <emphasis>physically</emphasis> unique when stored
-   on disk due to version churn.  The deduplication optimization is
-   selectively applied within unique indexes.  It targets those pages
-   that appear to have version duplicates.  The high level goal is to
-   give <command>VACUUM</command> more time to run before an
-   <quote>unnecessary</quote> page split caused by version churn can
-   take place.
+   It is sometimes possible for unique indexes (as well as unique
+   constraints) to use deduplication.  This allows leaf pages to
+   temporarily <quote>absorb</quote> extra version churn duplicates.
+   Deduplication in unique indexes augments bottom-up index deletion,
+   especially in cases where a long-running transaction holds a
+   snapshot that blocks garbage collection.  The goal is to buy time
+   for the bottom-up index deletion strategy to become effective
+   again.  Delaying page splits until a single long-running
+   transaction naturally goes away can allow a bottom-up deletion pass
+   to succeed where an earlier deletion pass failed.
   </para>
   <tip>
    <para>
doc/src/sgml/ref/create_index.sgml

Lines changed: 29 additions & 7 deletions
@@ -386,17 +386,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     <para>
      The fillfactor for an index is a percentage that determines how full
      the index method will try to pack index pages.  For B-trees, leaf pages
-     are filled to this percentage during initial index build, and also
+     are filled to this percentage during initial index builds, and also
      when extending the index at the right (adding new largest key values).
      If pages
      subsequently become completely full, they will be split, leading to
-     gradual degradation in the index's efficiency.  B-trees use a default
+     fragmentation of the on-disk index structure.  B-trees use a default
      fillfactor of 90, but any integer value from 10 to 100 can be selected.
-     If the table is static then fillfactor 100 is best to minimize the
-     index's physical size, but for heavily updated tables a smaller
-     fillfactor is better to minimize the need for page splits.  The
-     other index methods use fillfactor in different but roughly analogous
-     ways; the default fillfactor varies between methods.
+    </para>
+    <para>
+     B-tree indexes on tables where many inserts and/or updates are
+     anticipated can benefit from lower fillfactor settings at
+     <command>CREATE INDEX</command> time (following bulk loading into the
+     table).  Values in the range of 50 - 90 can usefully <quote>smooth
+     out</quote> the <emphasis>rate</emphasis> of page splits during the
+     early life of the B-tree index (lowering fillfactor like this may even
+     lower the absolute number of page splits, though this effect is highly
+     workload dependent).  The B-tree bottom-up index deletion technique
+     described in <xref linkend="btree-deletion"/> is dependent on having
+     some <quote>extra</quote> space on pages to store <quote>extra</quote>
+     tuple versions, and so can be affected by fillfactor (though the effect
+     is usually not significant).
+    </para>
+    <para>
+     In other specific cases it might be useful to increase fillfactor to
+     100 at <command>CREATE INDEX</command> time as a way of maximizing
+     space utilization.  You should only consider this when you are
+     completely sure that the table is static (i.e. that it will never be
+     affected by either inserts or updates).  A fillfactor setting of 100
+     otherwise risks <emphasis>harming</emphasis> performance: even a few
+     updates or inserts will cause a sudden flood of page splits.
+    </para>
+    <para>
+     The other index methods use fillfactor in different but roughly
+     analogous ways; the default fillfactor varies between methods.
     </para>
    </listitem>
   </varlistentry>
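The fillfactor guidance in this hunk can be made concrete with rough arithmetic (a sketch only: the 8192-byte default page size is real, but the flat 20-byte per-tuple footprint is an assumed round number that ignores page headers, special space, and alignment):

```python
# Rough fillfactor arithmetic for a B-tree leaf page (illustrative;
# not exact nbtree space accounting).
PAGE_SIZE = 8192          # default PostgreSQL block size in bytes
TUPLE_FOOTPRINT = 20      # assumed bytes per index tuple + line pointer

def free_space_after_build(fillfactor):
    """Bytes intentionally left unused on each leaf page at CREATE INDEX
    time; this slack is what absorbs version churn (and gives bottom-up
    deletion room to work) before a page split becomes necessary."""
    return int(PAGE_SIZE * (100 - fillfactor) / 100)

for ff in (100, 90, 70):
    slack = free_space_after_build(ff)
    print(f"fillfactor {ff}: {slack} free bytes "
          f"~ {slack // TUPLE_FOOTPRINT} extra tuple versions per page")
```

Under these assumptions the default fillfactor of 90 leaves roughly 40 extra tuple versions' worth of slack per leaf page, while fillfactor 100 leaves none, which is why even a few post-build updates can cause the sudden flood of splits the text warns about.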
