Commit a1b4f28

Consider BufFiles when adjusting hashjoin parameters
Until now ExecChooseHashTableSize() considered only the size of the in-memory hash table, and ignored the memory needed for the batch files, which can be a significant amount because each batch needs two BufFiles (each with a BLCKSZ buffer). The same issue applies to increasing the number of batches during execution.

It's also possible to trigger a "batch explosion", e.g. due to duplicate values or skew. We've seen reports of joins with hundreds of thousands (or even millions) of batches, consuming gigabytes of memory and triggering OOM errors. These cases may be fairly rare, but it's clearly possible to hit them.

These issues can't be prevented during planning. Even if we improve that, it does not help with execution-time batch explosion. We can however reduce the impact and use as little memory as possible.

This patch improves the behavior by adjusting how the memory is divided between the hash table and batch files. It may be better to use fewer batch files, even if it means the hash table will exceed the limit.

The capacity of the hash node may be increased either by doubling the number of batches, or by doubling the size of the in-memory hash table. The outcome is the same, but the memory usage may be very different. For low nbatch values it's better to add batches, for high nbatch values it's better to allow a larger hash table.

The patch considers both options, both during the initial sizing and then during execution, to minimize how much the limit gets exceeded.

It might seem this patch is relaxing the memory limit - allowing it to be exceeded. But that's not really the case. It has always been like that, except the memory used by batches was ignored.

Allowing the hash table to grow may also prevent the batch explosion. If there's a large batch that can't be split (due to hash collisions or duplicate values), at some point the memory limit will increase enough for the batch to fit into the hash table.

This patch was in the works for a long time. The early versions were posted in 2019, and revived every year or two when we happened to get the next report of OOM due to a hashjoin batch explosion. Each of those patch versions was reviewed by a couple of people. I'm mentioning only Melanie Plageman and Robert Haas, because they reviewed the last version; the older patches are very different.

Reviewed-by: Melanie Plageman, Robert Haas
Discussion: https://postgr.es/m/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me
Discussion: https://postgr.es/m/20190504003414.bulcbnge3rhwhcsh%40development
Discussion: https://postgr.es/m/20190428141901.5dsbge2ka3rxmpk6%40development
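To make the trade-off concrete, here is a worked example with illustrative numbers (the default 8kB BLCKSZ and a hypothetical 4MB memory limit; these figures are not from the commit). At nbatch = 64 the batch files need 2 * 64 * 8kB = 1MB of buffers, so doubling nbatch costs only another 1MB while doubling the hash table would cost 4MB - adding batches is the cheaper way to grow capacity. At nbatch = 1024 the batch files already need 16MB, so doubling nbatch would add another 16MB while doubling the hash table still costs only 4MB - at that point letting the hash table grow is clearly the better option.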
1 parent 8b886a4 · commit a1b4f28

File tree

1 file changed: +129 -0 lines changed


src/backend/executor/nodeHash.c

Lines changed: 129 additions & 0 deletions
@@ -848,6 +848,90 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = pg_nextpower2_32(Max(2, minbatch));
 	}
 
+	/*
+	 * Optimize the total amount of memory consumed by the hash node.
+	 *
+	 * The nbatch calculation above focuses on the size of the in-memory hash
+	 * table, assuming no per-batch overhead. Now adjust the number of batches
+	 * and the size of the hash table to minimize total memory consumed by the
+	 * hash node.
+	 *
+	 * Each batch file has a BLCKSZ buffer, and we may need two files per
+	 * batch (inner and outer side). So with enough batches this can be
+	 * significantly more memory than the hashtable itself.
+	 *
+	 * The total memory usage may be expressed by this formula:
+	 *
+	 * (inner_rel_bytes / nbatch) + (2 * nbatch * BLCKSZ) <= hash_table_bytes
+	 *
+	 * where (inner_rel_bytes / nbatch) is the size of the in-memory hash
+	 * table and (2 * nbatch * BLCKSZ) is the amount of memory used by file
+	 * buffers. But for sufficiently large values of inner_rel_bytes there
+	 * may not be an nbatch value that would make both parts fit into
+	 * hash_table_bytes.
+	 *
+	 * In this case we can't enforce the memory limit - we're going to exceed
+	 * it. We can however minimize the impact and use as little memory as
+	 * possible. (We haven't really enforced it before either, as we simply
+	 * ignored the batch files.)
+	 *
+	 * The formula for total memory usage says that given an inner relation of
+	 * size inner_rel_bytes, we may divide it into an arbitrary number of
+	 * batches. This determines both the size of the in-memory hash table and
+	 * the amount of memory needed for batch files. These two terms work in
+	 * opposite ways - when one decreases, the other increases.
+	 *
+	 * For low nbatch values, the hash table takes most of the memory, but at
+	 * some point the batch files start to dominate. If you combine these two
+	 * terms, the memory consumption (for a fixed size of the inner relation)
+	 * has a u-shape, with a minimum at some nbatch value.
+	 *
+	 * Our goal is to find this nbatch value, minimizing the memory usage. We
+	 * calculate the memory usage with half the batches (i.e. nbatch/2), and
+	 * if it's lower than the current memory usage we know it's better to use
+	 * fewer batches. We repeat this until reducing the number of batches does
+	 * not reduce the memory usage - we found the optimum. We know the optimum
+	 * exists, thanks to the u-shape.
+	 *
+	 * We only want to do this when exceeding the memory limit, not every
+	 * time. The goal is not to minimize memory usage in every case, but to
+	 * minimize the memory usage when we can't stay within the memory limit.
+	 *
+	 * For this reason we only consider reducing the number of batches. We
+	 * could try the opposite direction too, but that would save memory only
+	 * when most of the memory is used by the hash table. And the hash table
+	 * was used for the initial sizing, so we shouldn't be exceeding the
+	 * memory limit too much. We might save memory by using more batches, but
+	 * it would result in spilling more batch files, which does not seem like
+	 * a great trade off.
+	 *
+	 * While growing the hashtable, we also adjust the number of buckets, to
+	 * not have more than one tuple per bucket (load factor 1). We can only do
+	 * this during the initial sizing - once we start building the hash,
+	 * nbucket is fixed.
+	 */
+	while (nbatch > 0)
+	{
+		/* how much memory are we using with current nbatch value */
+		size_t		current_space = hash_table_bytes + (2 * nbatch * BLCKSZ);
+
+		/* how much memory would we use with half the batches */
+		size_t		new_space = hash_table_bytes * 2 + (nbatch * BLCKSZ);
+
+		/* If the memory usage would not decrease, we found the optimum. */
+		if (current_space < new_space)
+			break;
+
+		/*
+		 * It's better to use half the batches, so do that and adjust the
+		 * nbucket in the opposite direction, and double the allowance.
+		 */
+		nbatch /= 2;
+		nbuckets *= 2;
+
+		*space_allowed = (*space_allowed) * 2;
+	}
+
 	Assert(nbuckets > 0);
 	Assert(nbatch > 0);

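The u-shape described in the new comment is easy to see numerically. The following standalone C sketch is not part of the patch: it simply tabulates the two terms of the formula for power-of-two nbatch values, assuming a hypothetical 1GB inner relation and the default 8kB BLCKSZ (both assumptions chosen only for illustration).

#include <stdio.h>

#define BLCKSZ 8192			/* default PostgreSQL block size */

int
main(void)
{
	/* assumed inner relation size, chosen only for illustration */
	size_t		inner_rel_bytes = (size_t) 1024 * 1024 * 1024;

	for (size_t nbatch = 16; nbatch <= 4096; nbatch *= 2)
	{
		/* size of the in-memory hash table for one batch */
		size_t		hashtable = inner_rel_bytes / nbatch;

		/* memory used by the per-batch BufFile buffers (inner + outer) */
		size_t		buffers = 2 * nbatch * BLCKSZ;

		printf("nbatch = %5zu  hashtable = %7zu kB  buffers = %7zu kB  total = %7zu kB\n",
			   nbatch, hashtable / 1024, buffers / 1024,
			   (hashtable + buffers) / 1024);
	}

	return 0;
}

Running it shows the total falling from about 64MB at nbatch = 16 to 8MB at nbatch = 256 (where the hash table and the batch-file buffers each take roughly 4MB) and rising again toward 64MB at nbatch = 4096, which illustrates the u-shape the sizing loop above relies on.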
@@ -890,6 +974,47 @@ ExecHashTableDestroy(HashJoinTable hashtable)
 	pfree(hashtable);
 }
 
+/*
+ * Consider adjusting the allowed hash table size, depending on the number
+ * of batches, to minimize the overall memory usage (for both the hashtable
+ * and batch files).
+ *
+ * We're adjusting the size of the hash table, not the (optimal) number of
+ * buckets. We can't change that once we start building the hash, due to how
+ * ExecHashGetBucketAndBatch calculates batchno/bucketno from the hash. This
+ * means the load factor may not be optimal, but we're in damage control so
+ * we accept slower lookups. It's still much better than batch explosion.
+ *
+ * Returns true if we chose to increase the batch size (and thus we don't
+ * need to add batches), and false if we should increase nbatch.
+ */
+static bool
+ExecHashIncreaseBatchSize(HashJoinTable hashtable)
+{
+	/*
+	 * How much additional memory would doubling nbatch use? Each batch may
+	 * require two buffered files (inner/outer), with a BLCKSZ buffer.
+	 */
+	size_t		batchSpace = (hashtable->nbatch * 2 * BLCKSZ);
+
+	/*
+	 * Compare the new space needed for doubling nbatch and for enlarging the
+	 * in-memory hash table. If doubling the hash table needs less memory,
+	 * just do that. Otherwise, continue with doubling the nbatch.
+	 *
+	 * We're either doubling spaceAllowed or batchSpace, so which of those
+	 * increases the memory usage the least is the same as comparing the
+	 * values directly.
+	 */
+	if (hashtable->spaceAllowed <= batchSpace)
+	{
+		hashtable->spaceAllowed *= 2;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ExecHashIncreaseNumBatches
  *		increase the original number of batches in order to reduce
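As a hypothetical runtime example (the numbers are illustrative, not from the commit): suppose spaceAllowed is 4MB and the join has already grown to 512 batches. Doubling nbatch would add batchSpace = 512 * 2 * 8kB = 8MB of BufFile buffers, while doubling spaceAllowed adds only 4MB. Since spaceAllowed <= batchSpace, ExecHashIncreaseBatchSize doubles the allowance and returns true, and the caller in the next hunk returns without splitting batches.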
@@ -913,6 +1038,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
 		return;
 
+	/* consider increasing size of the in-memory hash table instead */
+	if (ExecHashIncreaseBatchSize(hashtable))
+		return;
+
 	nbatch = oldnbatch * 2;
 	Assert(nbatch > 1);

0 commit comments

