Commit a1b4f28

Consider BufFiles when adjusting hashjoin parameters
Until now ExecChooseHashTableSize() considered only the size of the in-memory hash table, and ignored the memory needed for the batch files, which can be a significant amount because each batch needs two BufFiles (each with a BLCKSZ buffer). The same issue applies to increasing the number of batches during execution.

It's also possible to trigger a "batch explosion", e.g. due to duplicate values or skew. We've seen reports of joins with hundreds of thousands (or even millions) of batches, consuming gigabytes of memory and triggering OOM errors. These cases may be fairly rare, but it's clearly possible to hit them.

These issues can't be prevented during planning. Even if we improve that, it does not help with execution-time batch explosion. We can however reduce the impact and use as little memory as possible.

This patch improves the behavior by adjusting how the memory is divided between the hash table and batch files. It may be better to use fewer batch files, even if it means the hash table will exceed the limit.

The capacity of the hash node may be increased either by doubling the number of batches, or by doubling the size of the in-memory hash table. The outcome is the same, but the memory usage may be very different. For low nbatch values it's better to add batches, for high nbatch values it's better to allow a larger hash table.

The patch considers both options, both during the initial sizing and then during execution, to minimize how much the limit gets exceeded.

It might seem this patch is relaxing the memory limit - allowing it to be exceeded. But that's not really the case. It has always been like that, except the memory used by batches was ignored.

Allowing the hash table to grow may also prevent the batch explosion. If there's a large batch that can't be split (due to hash collisions or duplicate values), at some point the memory limit will increase enough for the batch to fit into the hash table.

This patch was in the works for a long time. The early versions were posted in 2019, and revived every year or two when we happened to get the next report of OOM due to a hashjoin batch explosion. Each of those patch versions was reviewed by a couple of people. I'm mentioning only Melanie Plageman and Robert Haas, because they reviewed the last version; the older patches are very different.

Reviewed-by: Melanie Plageman, Robert Haas
Discussion: https://postgr.es/m/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me
Discussion: https://postgr.es/m/20190504003414.bulcbnge3rhwhcsh%40development
Discussion: https://postgr.es/m/20190428141901.5dsbge2ka3rxmpk6%40development
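To make the trade-off concrete, here is a worked example with illustrative numbers (the default 8kB BLCKSZ and a hypothetical 4MB memory limit; these figures are not from the commit). At nbatch = 64 the batch files need 2 * 64 * 8kB = 1MB of buffers, so doubling nbatch costs only another 1MB while doubling the hash table would cost 4MB - adding batches is the cheaper way to grow capacity. At nbatch = 1024 the batch files already need 16MB, so doubling nbatch would add another 16MB while doubling the hash table still costs only 4MB - at that point letting the hash table grow is clearly the better option.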
1 parent 8b886a4 · commit a1b4f28

File tree

1 file changed: +129 -0 lines changed


src/backend/executor/nodeHash.c

Lines changed: 129 additions & 0 deletions
@@ -848,6 +848,90 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = pg_nextpower2_32(Max(2, minbatch));
 	}
 
+	/*
+	 * Optimize the total amount of memory consumed by the hash node.
+	 *
+	 * The nbatch calculation above focuses on the size of the in-memory hash
+	 * table, assuming no per-batch overhead. Now adjust the number of batches
+	 * and the size of the hash table to minimize total memory consumed by the
+	 * hash node.
+	 *
+	 * Each batch file has a BLCKSZ buffer, and we may need two files per
+	 * batch (inner and outer side). So with enough batches this can be
+	 * significantly more memory than the hashtable itself.
+	 *
+	 * The total memory usage may be expressed by this formula:
+	 *
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 *
+	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
+	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
+	 * buffers. But for sufficiently large values of inner_rel_bytes there
+	 * may not be an nbatch value that would make both parts fit into
+	 * hash_table_bytes.
+	 *
+	 * In this case we can't enforce the memory limit - we're going to exceed
+	 * it. We can however minimize the impact and use as little memory as
+	 * possible. (We haven't really enforced it before either, as we simply
+	 * ignored the batch files.)
+	 *
+	 * The formula for total memory usage says that given an inner relation of
+	 * size inner_rel_bytes, we may divide it into an arbitrary number of
+	 * batches. This determines both the size of the in-memory hash table and
+	 * the amount of memory needed for batch files. These two terms work in
+	 * opposite ways - when one decreases, the other increases.
+	 *
+	 * For low nbatch values, the hash table takes most of the memory, but at
+	 * some point the batch files start to dominate. If you combine these two
+	 * terms, the memory consumption (for a fixed size of the inner relation)
+	 * has a u-shape, with a minimum at some nbatch value.
+	 *
+	 * Our goal is to find this nbatch value, minimizing the memory usage. We
+	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
+	 * if it's lower than the current memory usage we know it's better to use
+	 * fewer batches. We repeat this until reducing the number of batches does
+	 * not reduce the memory usage - we found the optimum. We know the optimum
+	 * exists, thanks to the u-shape.
+	 *
+	 * We only want to do this when exceeding the memory limit, not every
+	 * time. The goal is not to minimize memory usage in every case, but to
+	 * minimize the memory usage when we can't stay within the memory limit.
+	 *
+	 * For this reason we only consider reducing the number of batches. We
+	 * could try the opposite direction too, but that would save memory only
+	 * when most of the memory is used by the hash table. And the hash table
+	 * was used for the initial sizing, so we shouldn't be exceeding the
+	 * memory limit too much. We might save memory by using more batches, but
+	 * it would result in spilling more batch files, which does not seem like
+	 * a great trade off.
+	 *
+	 * While growing the hashtable, we also adjust the number of buckets, to
+	 * not have more than one tuple per bucket (load factor 1). We can only do
+	 * this during the initial sizing - once we start building the hash,
+	 * nbucket is fixed.
+	 */
+	while (nbatch > 0)
+	{
+		/* how much memory are we using with current nbatch value */
+		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+
+		/* how much memory would we use with half the batches */
+		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+
+		/* If the memory usage would not decrease, we found the optimum. */
+		if (current_space < new_space)
+			break;
+
+		/*
+		 * It's better to use half the batches, so do that and adjust the
+		 * nbucket in the opposite direction, and double the allowance.
+		 */
+		nbatch /= 2;
+		nbuckets *= 2;
+
+		*space_allowed = (*space_allowed) * 2;
+	}
+
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);

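The u-shape described in the new comment is easy to see numerically. The following standalone C sketch is not part of the patch: it simply tabulates the two terms of the formula for power-of-two nbatch values, assuming a hypothetical 1GB inner relation and the default 8kB BLCKSZ (both assumptions chosen only for illustration).

#include <stdio.h>

#define BLCKSZ 8192			/* default PostgreSQL block size */

int
main(void)
{
	/* assumed inner relation size, chosen only for illustration */
	size_t		inner_rel_bytes = (size_t) 1024 * 1024 * 1024;

	for (size_t nbatch = 16; nbatch <= 4096; nbatch *= 2)
	{
		/* size of the in-memory hash table for one batch */
		size_t		hashtable = inner_rel_bytes / nbatch;

		/* memory used by the per-batch BufFile buffers (inner + outer) */
		size_t		buffers = 2 * nbatch * BLCKSZ;

		printf("nbatch = %5zu  hashtable = %7zu kB  buffers = %7zu kB  total = %7zu kB\n",
			   nbatch, hashtable / 1024, buffers / 1024,
			   (hashtable + buffers) / 1024);
	}

	return 0;
}

Running it shows the total falling from about 64MB at nbatch = 16 to 8MB at nbatch = 256 (where the hash table and the batch-file buffers each take roughly 4MB) and rising again toward 64MB at nbatch = 4096, which illustrates the u-shape the sizing loop above relies on.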
@@ -890,6 +974,47 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Consider adjusting the allowed hash table size, depending on the number
+ * of batches, to minimize the overall memory usage (for both the hashtable
+ * and batch files).
+ *
+ * We're adjusting the size of the hash table, not the (optimal) number of
+ * buckets. We can't change that once we start building the hash, due to how
+ * ExecHashGetBucketAndBatch calculates batchno/bucketno from the hash. This
+ * means the load factor may not be optimal, but we're in damage control so
+ * we accept slower lookups. It's still much better than batch explosion.
+ *
+ * Returns true if we chose to increase the batch size (and thus we don't
+ * need to add batches), and false if we should increase nbatch.
+ */
+static bool
+ExecHashIncreaseBatchSize(HashJoinTable hashtable)
+{
+	/*
+	 * How much additional memory would doubling nbatch use? Each batch may
+	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
+	 */
+	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+
+	/*
+	 * Compare the new space needed for doubling nbatch and for enlarging the
+	 * in-memory hash table. If doubling the hash table needs less memory,
+	 * just do that. Otherwise, continue with doubling the nbatch.
+	 *
+	 * We're either doubling spaceAllowed or batchSpace, so which of those
+	 * increases the memory usage the least is the same as comparing the
+	 * values directly.
+	 */
+	if (hashtable->spaceAllowed <= batchSpace)
+	{
+		hashtable->spaceAllowed *= 2;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
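As a hypothetical runtime example (the numbers are illustrative, not from the commit): suppose spaceAllowed is 4MB and the join has already grown to 512 batches. Doubling nbatch would add batchSpace = 512 * 2 * 8kB = 8MB of BufFile buffers, while doubling spaceAllowed adds only 4MB. Since spaceAllowed <= batchSpace, ExecHashIncreaseBatchSize doubles the allowance and returns true, and the caller in the next hunk returns without splitting batches.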
@@ -913,6 +1038,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
 		return;
 
+	/* consider increasing size of the in-memory hash table instead */
+	if (ExecHashIncreaseBatchSize(hashtable))
+		return;
+
 	nbatch = oldnbatch * 2;
 	Assert(nbatch > 1);

0 commit comments

