NotificationsYou must be signed in to change notification settings
Fork5
Star26

Commitbc0f080

committed

Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().

We must filter out hashtable entries with frequencies less than thosespecified by the algorithm, else we risk emitting junk entries whoseactual frequency is much less than other lexemes that did not gettabulated. This is bad enough by itself, but even worse is thattsquerysel() believes that the minimum frequency seen in pg_statistic is ahard upper bound for lexemes not included, and was thus underestimatingthe frequency of non-MCEs.Also, set the threshold frequency to something with a little bit of theorybehind it, to wit assume that the input distribution is approximatelyZipfian. This might need adjustment in future, but some preliminaryexperiments suggest that it's not too unreasonable.Back-patch to 8.4, where this code was introduced.Jan Urbanski, with some editorialization by Tom

1 parentb12b7a9 commitbc0f080Copy full SHA for bc0f080

File tree

1 file changed

+97

-45

lines changed

src/backend/tsearch
- ts_typanalyze.c

1 file changed

+97

-45

lines changed

`‎src/backend/tsearch/ts_typanalyze.c`

Lines changed: 97 additions & 45 deletions

Original file line number	Diff line number	Diff line change
`@@ -7,7 +7,7 @@`
`7`	`7`	`*`
`8`	`8`	`*`
`9`	`9`	`* IDENTIFICATION`
`10`		`- * $PostgreSQL: pgsql/src/backend/tsearch/ts_typanalyze.c,v 1.8 2010/01/02 16:57:53 momjian Exp $`
	`10`	`+ * $PostgreSQL: pgsql/src/backend/tsearch/ts_typanalyze.c,v 1.9 2010/05/30 21:59:02 tgl Exp $`
`11`	`11`	`*`
`12`	`12`	`*-------------------------------------------------------------------------`
`13`	`13`	`*/`
`@@ -92,21 +92,49 @@ ts_typanalyze(PG_FUNCTION_ARGS)`
`92`	`92`	`*http://www.vldb.org/conf/2002/S10P03.pdf`
`93`	`93`	`*`
`94`	`94`	`*The Lossy Counting (aka LC) algorithm goes like this:`
`95`		`- *Let D be a set of triples (e, f, d), where e is an element value, f is`
`96`		`- *that element's frequency (occurrence count) and d is the maximum error in`
`97`		`- *f.We start with D empty and process the elements in batches of size`
`98`		`- *w. (The batch size is also known as "bucket size".) Let the current batch`
`99`		`- *number be b_current, starting with 1. For each element e we either`
`100`		`- *increment its f count, if it's already in D, or insert a new triple into D`
`101`		`- *with values (e, 1, b_current - 1). After processing each batch we prune D,`
`102`		`- *by removing from it all elements with f + d <= b_current. Finally, we`
`103`		`- *gather elements with largest f. The LC paper proves error bounds on f`
`104`		`- *dependent on the batch size w, and shows that the required table size`
`105`		`- *is no more than a few times w.`
	`95`	`+ *Let s be the threshold frequency for an item (the minimum frequency we`
	`96`	`+ *are interested in) and epsilon the error margin for the frequency. Let D`
	`97`	`+ *be a set of triples (e, f, delta), where e is an element value, f is that`
	`98`	`+ *element's frequency (actually, its current occurrence count) and delta is`
	`99`	`+ *the maximum error in f. We start with D empty and process the elements in`
	`100`	`+ *batches of size w. (The batch size is also known as "bucket size" and is`
	`101`	`+ *equal to 1/epsilon.) Let the current batch number be b_current, starting`
	`102`	`+ *with 1. For each element e we either increment its f count, if it's`
	`103`	`+ *already in D, or insert a new triple into D with values (e, 1, b_current`
	`104`	`+ *- 1). After processing each batch we prune D, by removing from it all`
	`105`	`+ *elements with f + delta <= b_current. After the algorithm finishes we`
	`106`	`+ suppress all elements from D that do not satisfy f >= (s - epsilon) N,`
	`107`	`+ *where N is the total number of elements in the input. We emit the`
	`108`	`+ *remaining elements with estimated frequency f/N. The LC paper proves`
	`109`	`+ *that this algorithm finds all elements with true frequency at least s,`
	`110`	`+ *and that no frequency is overestimated or is underestimated by more than`
	`111`	`+ *epsilon. Furthermore, given reasonable assumptions about the input`
	`112`	`+ *distribution, the required table size is no more than about 7 times w.`
`106`	`113`	`*`
`107`		`- *We use a hashtable for the D structure and a bucket width of`
`108`		`- statistics_target 10, where 10 is an arbitrarily chosen constant,`
`109`		`- *meant to approximate the number of lexemes in a single tsvector.`
	`114`	`+ *We set s to be the estimated frequency of the K'th word in a natural`
	`115`	`+ *language's frequency table, where K is the target number of entries in`
	`116`	`+ *the MCELEM array plus an arbitrary constant, meant to reflect the fact`
	`117`	`+ *that the most common words in any language would usually be stopwords`
	`118`	`+ *so we will not actually see them in the input. We assume that the`
	`119`	`+ *distribution of word frequencies (including the stopwords) follows Zipf's`
	`120`	`+ *law with an exponent of 1.`
	`121`	`+ *`
	`122`	`+ *Assuming Zipfian distribution, the frequency of the K'th word is equal`
	`123`	`+ to 1/(K H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of`
	`124`	`+ *words in the language. Putting W as one million, we get roughly 0.07/K.`
	`125`	`+ *Assuming top 10 words are stopwords gives s = 0.07/(K + 10). We set`
	`126`	`+ *epsilon = s/10, which gives bucket width w = (K + 10)/0.007 and`
	`127`	`+ maximum expected hashtable size of about 1000 (K + 10).`
	`128`	`+ *`
	`129`	`+ *Note: in the above discussion, s, epsilon, and f/N are in terms of a`
	`130`	`+ *lexeme's frequency as a fraction of all lexemes seen in the input.`
	`131`	`+ *However, what we actually want to store in the finished pg_statistic`
	`132`	`+ *entry is each lexeme's frequency as a fraction of all rows that it occurs`
	`133`	`+ *in. Assuming that the input tsvectors are correctly constructed, no`
	`134`	`+ *lexeme occurs more than once per tsvector, so the final count f is a`
	`135`	`+ *correct estimate of the number of input tsvectors it occurs in, and we`
	`136`	`+ *need only change the divisor from N to nonnull_cnt to get the number we`
	`137`	`+ *want.`
`110`	`138`	`*/`
`111`	`139`	`staticvoid`
`112`	`140`	`compute_tsvector_stats(VacAttrStats*stats,`
`@@ -133,19 +161,23 @@ compute_tsvector_stats(VacAttrStats *stats,`
`133`	`161`	`LexemeHashKeyhash_key;`
`134`	`162`	`TrackItem*item;`
`135`	`163`
`136`		`-/* We want statistics_target * 10 lexemes in the MCELEM array */`
	`164`	`+/*`
	`165`	`+ * We want statistics_target * 10 lexemes in the MCELEM array. This`
	`166`	`+ * multiplier is pretty arbitrary, but is meant to reflect the fact that`
	`167`	`+ * the number of individual lexeme values tracked in pg_statistic ought`
	`168`	`+ * to be more than the number of values for a simple scalar column.`
	`169`	`+ */`
`137`	`170`	`num_mcelem=stats->attr->attstattarget*10;`
`138`	`171`
`139`	`172`	`/*`
`140`		`- * We set bucket width equal to the target number of result lexemes. This`
`141`		`- * is probably about right but perhaps might need to be scaled up or down`
`142`		`- * a bit?`
	`173`	`+ * We set bucket width equal to (num_mcelem + 10) / 0.007 as per the`
	`174`	`+ * comment above.`
`143`	`175`	`*/`
`144`		`-bucket_width=num_mcelem;`
	`176`	`+bucket_width=(num_mcelem+10)*1000 /7;`
`145`	`177`
`146`	`178`	`/*`
`147`	`179`	`* Create the hashtable. It will be in local memory, so we don't need to`
`148`		`- * worry aboutinitial size too much. Also we don't need to pay any`
	`180`	`+ * worry aboutoverflowing the initial size. Also we don't need to pay any`
`149`	`181`	`* attention to locking and memory management.`
`150`	`182`	`*/`
`151`	`183`	`MemSet(&hash_ctl,0,sizeof(hash_ctl));`
`@@ -155,13 +187,13 @@ compute_tsvector_stats(VacAttrStats *stats,`
`155`	`187`	`hash_ctl.match=lexeme_match;`
`156`	`188`	`hash_ctl.hcxt=CurrentMemoryContext;`
`157`	`189`	`lexemes_tab=hash_create("Analyzed lexemes table",`
`158`		`-bucket_width*4,`
	`190`	`+bucket_width*7,`
`159`	`191`	`&hash_ctl,`
`160`	`192`	`HASH_ELEM \|HASH_FUNCTION \|HASH_COMPARE \|HASH_CONTEXT);`
`161`	`193`
`162`	`194`	`/* Initialize counters. */`
`163`	`195`	`b_current=1;`
`164`		`-lexeme_no=1;`
	`196`	`+lexeme_no=0;`
`165`	`197`
`166`	`198`	`/* Loop over the tsvectors. */`
`167`	`199`	`for (vector_no=0;vector_no<samplerows;vector_no++)`
`@@ -232,6 +264,9 @@ compute_tsvector_stats(VacAttrStats *stats,`
`232`	`264`	`item->delta=b_current-1;`
`233`	`265`	`}`
`234`	`266`
	`267`	`+/* lexeme_no is the number of elements processed (ie N) */`
	`268`	`+lexeme_no++;`
	`269`	`+`
`235`	`270`	`/* We prune the D structure after processing each bucket */`
`236`	`271`	`if (lexeme_no %bucket_width==0)`
`237`	`272`	`{`
`@@ -240,7 +275,6 @@ compute_tsvector_stats(VacAttrStats *stats,`
`240`	`275`	`}`
`241`	`276`
`242`	`277`	`/* Advance to the next WordEntry in the tsvector */`
`243`		`-lexeme_no++;`
`244`	`278`	`curentryptr++;`
`245`	`279`	`}`
`246`	`280`	`}`
`@@ -252,6 +286,7 @@ compute_tsvector_stats(VacAttrStats *stats,`
`252`	`286`	`inti;`
`253`	`287`	`TrackItem**sort_table;`
`254`	`288`	`inttrack_len;`
	`289`	`+intcutoff_freq;`
`255`	`290`	`intminfreq,`
`256`	`291`	`maxfreq;`
`257`	`292`
`@@ -264,34 +299,51 @@ compute_tsvector_stats(VacAttrStats *stats,`
`264`	`299`	`stats->stadistinct=-1.0;`
`265`	`300`
`266`	`301`	`/*`
`267`		`- * Determine the top-N lexemes by simply copying pointers from the`
`268`		`- * hashtable into an array and applying qsort()`
	`302`	`+ * Construct an array of the interesting hashtable items, that is,`
	`303`	`+ * those meeting the cutoff frequency (s - epsilon)*N. Also identify`
	`304`	`+ * the minimum and maximum frequencies among these items.`
	`305`	`+ *`
	`306`	`+ * Since epsilon = s/10 and bucket_width = 1/epsilon, the cutoff`
	`307`	`+ * frequency is 9*N / bucket_width.`
`269`	`308`	`*/`
`270`		`-track_len=hash_get_num_entries(lexemes_tab);`
	`309`	`+cutoff_freq=9*lexeme_no /bucket_width;`
`271`	`310`
`272`		`-sort_table= (TrackItem*)palloc(sizeof(TrackItem)*track_len);`
	`311`	`+i=hash_get_num_entries(lexemes_tab);/* surely enough space */`
	`312`	`+sort_table= (TrackItem*)palloc(sizeof(TrackItem)*i);`
`273`	`313`
`274`	`314`	`hash_seq_init(&scan_status,lexemes_tab);`
`275`		`-i=0;`
	`315`	`+track_len=0;`
	`316`	`+minfreq=lexeme_no;`
	`317`	`+maxfreq=0;`
`276`	`318`	`while ((item= (TrackItem*)hash_seq_search(&scan_status))!=NULL)`
`277`	`319`	`{`
`278`		`-sort_table[i++]=item;`
	`320`	`+if (item->frequency>cutoff_freq)`
	`321`	`+{`
	`322`	`+sort_table[track_len++]=item;`
	`323`	`+minfreq=Min(minfreq,item->frequency);`
	`324`	`+maxfreq=Max(maxfreq,item->frequency);`
	`325`	`+}`
`279`	`326`	`}`
`280`		`-Assert(i==track_len);`
	`327`	`+Assert(track_len <=i);`
`281`	`328`
`282`		`-qsort(sort_table,track_len,sizeof(TrackItem*),`
`283`		`-trackitem_compare_frequencies_desc);`
	`329`	`+/* emit some statistics for debug purposes */`
	`330`	`+elog(DEBUG3,"tsvector_stats: target # mces = %d, bucket width = %d, "`
	`331`	`+"# lexemes = %d, hashtable size = %d, usable entries = %d",`
	`332`	`+num_mcelem,bucket_width,lexeme_no,i,track_len);`
`284`	`333`
`285`		`-/* Suppress any single-occurrence items */`
`286`		`-while (track_len>0)`
	`334`	`+/*`
	`335`	`+ * If we obtained more lexemes than we really want, get rid of`
	`336`	`+ * those with least frequencies. The easiest way is to qsort the`
	`337`	`+ * array into descending frequency order and truncate the array.`
	`338`	`+ */`
	`339`	`+if (num_mcelem<track_len)`
`287`	`340`	`{`
`288`		`-if (sort_table[track_len-1]->frequency>1)`
`289`		`-break;`
`290`		`-track_len--;`
	`341`	`+qsort(sort_table,track_len,sizeof(TrackItem*),`
	`342`	`+trackitem_compare_frequencies_desc);`
	`343`	`+/* reset minfreq to the smallest frequency we're keeping */`
	`344`	`+minfreq=sort_table[num_mcelem-1]->frequency;`
`291`	`345`	`}`
`292`		`-`
`293`		`-/* Determine the number of most common lexemes to be stored */`
`294`		`-if (num_mcelem>track_len)`
	`346`	`+else`
`295`	`347`	`num_mcelem=track_len;`
`296`	`348`
`297`	`349`	`/* Generate MCELEM slot entry */`
`@@ -301,10 +353,6 @@ compute_tsvector_stats(VacAttrStats *stats,`
`301`	`353`	`Datum*mcelem_values;`
`302`	`354`	`float4*mcelem_freqs;`
`303`	`355`
`304`		`-/* Grab the minimal and maximal frequencies that will get stored */`
`305`		`-minfreq=sort_table[num_mcelem-1]->frequency;`
`306`		`-maxfreq=sort_table[0]->frequency;`
`307`		`-`
`308`	`356`	`/*`
`309`	`357`	`* We want to store statistics sorted on the lexeme value using`
`310`	`358`	`* first length, then byte-for-byte comparison. The reason for`
`@@ -334,6 +382,10 @@ compute_tsvector_stats(VacAttrStats *stats,`
`334`	`382`	`mcelem_values= (Datum)palloc(num_mcelemsizeof(Datum));`
`335`	`383`	`mcelem_freqs= (float4)palloc((num_mcelem+2)sizeof(float4));`
`336`	`384`
	`385`	`+/*`
	`386`	`+ * See comments above about use of nonnull_cnt as the divisor`
	`387`	`+ * for the final frequency estimates.`
	`388`	`+ */`
`337`	`389`	`for (i=0;i<num_mcelem;i++)`
`338`	`390`	`{`
`339`	`391`	`TrackItem*item=sort_table[i];`

0 commit comments

Comments

(0)

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Commitbc0f080

File tree

1 file changed

1 file changed

`‎src/backend/tsearch/ts_typanalyze.c`

0 commit comments